You’ve always got to pick the right tool for the right job, and many would say that Ruby is the right tool for most jobs. It is pretty good, that’s for sure.

But with hpricot, it is a no-brainer when it comes to web scraping. hpricot is a library for extracting content from web pages, to do with what you will. Chief amongst the features you'll want from such a library are simple, fast ways to parse the tree of the page you are scraping, and hpricot has them in abundance. I haven't found anything simpler.

And just now I found out about the Firebug extension for Firefox. One of the tricky things with scraping is manually figuring out the path through the tree that you need to traverse to get to the bit of the page you are looking for. This blog post shows how much simpler it is with Firebug:

Ruby Screen-Scraper in 60 Seconds
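
To give a feel for what "a path through the tree" means in practice, here is a minimal sketch. It rests on several assumptions. It uses REXML, which ships with a standard Ruby install, rather than hpricot, purely so it runs without installing a gem. The page markup, the `scrape_links` helper and the path itself are all made up for illustration. REXML also insists on well-formed markup, which real pages often are not, so treat this as a picture of the idea rather than a working scraper.

```ruby
require "rexml/document"

# Made-up page for illustration. REXML needs well-formed markup,
# which real-world HTML frequently is not.
PAGE = <<~HTML
  <html>
    <body>
      <div id="header"><h1>My Blog</h1></div>
      <div id="content">
        <h2><a href="/posts/1">First post</a></h2>
        <h2><a href="/posts/2">Second post</a></h2>
      </div>
    </body>
  </html>
HTML

# An absolute path through the tree: the kind of thing that is
# tedious to work out by hand. Hypothetical, matches PAGE only.
LINK_PATH = "/html/body/div[2]/h2/a"

# Hypothetical helper: returns [text, href] pairs for every
# element that the path matches.
def scrape_links(html, path)
  doc = REXML::Document.new(html)
  REXML::XPath.match(doc, path).map do |a|
    [a.text, a.attributes["href"]]
  end
end

scrape_links(PAGE, LINK_PATH).each do |title, href|
  puts "#{title} -> #{href}"
end
```

The fiddly part is `LINK_PATH`: counting which `div` is the second one is exactly the manual work a tree inspector saves you.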



Monday, April 20, 2009
