
How to write a web scraper

Since I get asked this a lot, here’s a basic tutorial on how to write a web scraper. I recommend that you install either ruby or python (both are installed by default on Macs). I’ll be going through this with ruby, but it’s simple enough to do it in python as well using urllib2.

Here are the basic steps:

  1. Open the webpage you want to scrape information from
  2. Select the information you want from that page using CSS selectors
  3. Write the information you want to keep to a file so you can analyze it later

Here’s an example scraper I posted on GitHub that finds all of the words longer than 5 characters used in course titles in the Stanford course catalog.

If looking at the source code is too intimidating for you, here are the highlights. First open the Terminal. By default on Macs this is in Applications/Utilities. I prefer using iTerm2, but it doesn’t make a big difference here. Type the following at the command prompt:

sudo gem install nokogiri

If you have rvm installed then you can leave off the sudo. If you don’t know what rvm is, then you probably don’t have it installed.
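
To double-check that the install worked, you can ask ruby to load the library and print its version. If this one-liner runs without complaining, you’re good to go:

ruby -e "require 'nokogiri'; puts Nokogiri::VERSION"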

Next, load up a text editor (I recommend TextMate or Sublime, but you can also use emacs, or vim if you really have to) and start a file called <<YOUR APP NAME>>.rb. If you don’t know what I’m talking about, try reading section 4 of this cheat sheet.

First, tell your computer which libraries you’re going to use to talk to the internet by typing:

require 'open-uri'
require 'nokogiri'

Then if you say,

doc = Nokogiri::HTML(open(<<YOUR DESIRED URL>>))

doc will be the HTML document that you want to extract information from. If you say,

doc.css(<<NAME OF THE CSS SELECTOR>>)

you’ll have easy access to all of the HTML elements that fit that CSS selector.
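
For example, here’s a minimal sketch that counts the links on a page. The URL is just a stand-in (swap in the page you actually care about), and note that on ruby 3 and newer you have to write URI.open, because plain open no longer fetches URLs:

require 'open-uri'
require 'nokogiri'

# Fetch the page and parse it into a searchable document
doc = Nokogiri::HTML(URI.open('https://example.com'))

# Every element matching the 'a' selector, i.e. all of the links
links = doc.css('a')
puts links.length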

If you don’t know what a CSS selector is, try using this bookmarklet; it lets you click on what you want to scrape and then tells you the CSS selector you’ll need to use to grab that information.

Now I’m assuming you have the appropriate CSS selector to grab the elements you want. So if you want all of the <p> tags, you would type doc.css('p'), which returns an array-like collection of Nokogiri objects (a NodeSet). You probably want the text in those elements; to get it, just call .text on each object. To see more in-depth documentation, go to the Nokogiri homepage.
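
For instance, reusing the doc from the sketch above, this pulls the text out of every paragraph on the page:

# .map(&:text) extracts the text from each matched element
paragraphs = doc.css('p').map(&:text)
paragraphs.each { |text| puts text }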

Once you have the text you want, you can write it to a file to analyze later with a program like Excel, though honestly it’d probably be quicker to just write the rest of the analysis in your ruby script. To do this, type:

output_file = File.new('output.csv','w')

'output.csv' will be your file’s name, and 'w' opens it in write mode. This file will appear in the same directory that you run your ruby script from. Next, to write something to that file, you say,

output_file.write(<<STRING YOU WANT TO WRITE>>)

It’s that easy. Separate the values you want in each cell of a row with commas, and write \n (the newline character) to start a new row.
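
Putting those pieces together, a sketch that reuses the paragraphs array from above might look like this. One caveat: if your text can itself contain commas, this naive approach will garble your columns, and you’ll want ruby’s built-in CSV library to handle the quoting for you.

output_file = File.new('output.csv', 'w')

# One row per scraped string: the text and its length, separated by a comma
# (paragraphs is the array of strings we built with doc.css above)
paragraphs.each do |text|
  output_file.write("#{text},#{text.length}\n")
end

# Close the file so everything gets flushed to disk
output_file.close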

You should be all set. Good luck writing scrapers. It’s totally doable. Hopefully this will get you started with enough to know what to look up when you start writing these scripts yourself. There’s no need to hire a cheap hacker just to write a scraper for you.
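
To recap, here’s what a complete toy scraper might look like from top to bottom. The URL and selector are placeholders, so adapt them to whatever page you’re scraping:

require 'open-uri'
require 'nokogiri'

# 1. Open the webpage you want to scrape
doc = Nokogiri::HTML(URI.open('https://example.com'))

# 2. Select the information you want with a CSS selector
headings = doc.css('h1').map(&:text)

# 3. Write it to a file so you can analyze it later
# (the block form of File.open closes the file for you)
File.open('output.csv', 'w') do |file|
  headings.each { |h| file.write("#{h}\n") }
end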

If you want to create a form to collect data from people, first check to see how far you can get with Google Forms. They’re super easy to make and dump the data right into a Google Spreadsheet, making them easy to share and analyze. If that’s not good enough for what you need, check out sites like Wufoo. These should cover most of your needs and save you from having to hire a cheap hacker.

Let me know if this is helpful and if there’s anything else you’d like me to clarify.
