Hi. I am Travis, and I live and work in a place long ago abandoned by the gods: New York City. By day, I am a web developer at Blue Apron. By night, I am a normal person.

Scraping Websites: An Exercise in Futility

I was working on a project. That project is now dead. So it goes.

My girlfriend and I thought, "Wouldn't it be nice if we could see all the NYC concerts and other thingies we might be interested in (art installations, museum exhibits) in one streamlined interface!?" In other words, I don't want to look at the calendars on a million different websites. I want it all on one page NOW.

I know what you're thinking: "There are already products that basically do that." Get outta town...I build shit, and I build it hard, and I don't stop to wonder what the point is.

In any case, there are two options for obtaining all of this event information, and they're not pretty:

  1. Partner with the owners of the website and have them provide an API (as if these people have the resources to accomplish that or care even slightly)
  2. Just take (scrape) it

The latter is more my style, so that's how we proceeded.

Unsurprisingly, every website has a different DOM. Unsurprisingly, many of those websites are terrible. Some look like macaroni sculptures made by a 4-year-old. Others look like a paranoid schizophrenic trying to explain the weird dream he had last night.

As it turns out, it's quite difficult to write pretty generic code that can reliably scrape all these different forms of chaos. I came up with the following reasonably OK solution. The gist of it is one configuration per website, like this:

def bowery_config
  {
    # Where to fetch the calendar page
    base_url: 'www.boweryballroom.com',
    path: '/calendar',
    # Selectors that locate the event list and its items on the page
    event_list: '.tfly-calendar',
    event_list_item: '.vevent',
    # Procs that extract each field from a single event node
    name: proc { |e| e.css('.one-event').css('.url').map(&:text).join("\n") },
    description: proc do |e|
      e.css('.one-event').css('.supports').map { |s| s.css('a').first.text }.join("\n")
    end,
    url: proc { |e| e.css('.one-event').css('.url').first['href'] },
    date: proc { |e| e.css('.date').css('span').first['title'] }
  }
end
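
The win is that everything site-specific lives in that one hash. Adding another venue just means writing another config; the generic scraping code doesn't change. The domain and selectors below are made up purely for illustration, not lifted from any real site:

# Hypothetical config for some other venue; the URL and selectors are invented
# to show that only this hash changes from site to site.
def some_other_venue_config
  {
    base_url: 'www.example-venue.com',
    path: '/events',
    event_list: '#upcoming-shows',
    event_list_item: '.show',
    name: proc { |e| e.css('.headliner').text },
    description: proc { |e| e.css('.openers').text },
    url: proc { |e| e.css('a.details').first['href'] },
    date: proc { |e| e.css('time').first['datetime'] }
  }
end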

The first four values get us iterating over the site's event list, while the last four parse values out of each event. That process looks a bit like this:

def parse_events!
  fail Exceptions::EmptyCalendarError if event_list.blank?

  # Normalize each event into a hash and persist it, skipping anything nameless.
  event_list.each do |data|
    event = parse(data)
    next unless event[:name].present?
    Event.where(event).first_or_create!
  end
end

def event_list
  # Fetch the calendar page and narrow it down to the individual event nodes.
  source = Net::HTTP.get(config[:base_url], config[:path])
  html = Nokogiri::HTML(source)

  calendar_html = html.css(config[:event_list])
  calendar_html.css(config[:event_list_item])
end

def parse(event)
  {
    name: parse_attribute(:name, event),
    url: parse_attribute(:url, event),
    description: parse_attribute(:description, event),
    date: parse_date(:date, event),
    venue_id: venue.id
  }
end
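
The parse_attribute and parse_date helpers aren't shown above; all they really need to do is look up the right proc in the config and hand it the Nokogiri node. Here's a rough sketch of what they might look like, with the caveat that the rescue-and-shrug error handling is my assumption, not a transcript of the real code:

# Sketch of the helpers referenced in parse. Each one grabs the extractor proc
# from the config and calls it with the event node. The error handling is an
# assumption: selectors miss constantly on these sites, so a miss becomes "no value".
def parse_attribute(attribute, event)
  config[attribute].call(event)
rescue NoMethodError
  nil
end

def parse_date(attribute, event)
  value = parse_attribute(attribute, event)
  value && Time.zone.parse(value) # assumes Rails' ActiveSupport, like the rest of this code
end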

Now we have a normalized version of their data to do with as we please! And we chose to pretty much just give up at that point, 'cause I already saw Animal Collective like 5 years ago.

Check out the code on GitHub if you so desire. You can see, firsthand, all of the unusual, dirty things that go into website scraping.
