FastCrawl is a PHP web scraping framework that uses cURL, MySQL, and XPath for lightning-fast, accurate web scraping and crawling.

When it comes to web crawling, roughly 90% of processing time is spent waiting on network I/O. That is, the web request is significantly slower than any other part of the system and becomes the bottleneck. In other languages, one would use multithreading to speed things up. Unfortunately, PHP doesn't support multithreading out of the box.

PHP does support non-blocking I/O. This technique is ideal, since we can simulate multiple threads and service requests with the socket select() call. But reimplementing the HTTP protocol on top of raw sockets is a poor use of time. A more robust technique is to use cURL.
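To make the select() idea concrete, here is a minimal sketch that services several non-blocking streams from a single PHP process. It makes no HTTP requests: local socket pairs created with stream_socket_pair() stand in for slow network peers, and the names `$readers`, `$writers`, and `$responses` are illustrative only, not part of FastCrawl.

```php
<?php
// Three connected socket pairs stand in for three remote servers (an assumption
// made so the sketch runs without a network).
$readers = [];
$writers = [];
$domain = DIRECTORY_SEPARATOR === '\\' ? STREAM_PF_INET : STREAM_PF_UNIX;
foreach (['a', 'b', 'c'] as $name) {
    [$r, $w] = stream_socket_pair($domain, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);
    stream_set_blocking($r, false); // reads return immediately instead of waiting
    $readers[$name] = $r;
    $writers[$name] = $w;
}

// The "servers" answer in arbitrary order and then hang up.
foreach (['c', 'a', 'b'] as $name) {
    fwrite($writers[$name], "response $name");
    fclose($writers[$name]);
}

$responses = array_fill_keys(array_keys($readers), '');
$pending = $readers;

while ($pending) {
    $read = array_values($pending);
    $write = null;
    $except = null;
    // Block until at least one stream is readable, or one second passes.
    $ready = stream_select($read, $write, $except, 1);
    if ($ready === false || $ready === 0) {
        break; // error or timeout: give up in this sketch
    }
    foreach ($read as $stream) {
        $name = array_search($stream, $readers, true);
        $chunk = fread($stream, 8192);
        if ($chunk !== false && $chunk !== '') {
            $responses[$name] .= $chunk;
        }
        if (feof($stream)) {
            fclose($stream);
            unset($pending[$name]); // this "request" is finished
        }
    }
}

print_r($responses);
```

One loop handles every stream that is ready, which is how a single-threaded script can keep many requests in flight at once.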

cURL supports non-blocking I/O natively. The problem is that you have to build all the code around it yourself for it to be useful.
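As a rough idea of the code that has to be built around cURL, the sketch below fetches a batch of URLs concurrently with PHP's curl_multi functions. It is not FastCrawl's own code; the function name `fetch_all` and the shape of its return value are assumptions made for illustration.

```php
<?php
/**
 * Fetch several URLs concurrently (hypothetical helper, not FastCrawl's API).
 * Returns an array keyed like $urls; each entry has 'body' and 'error'.
 */
function fetch_all(array $urls, int $timeoutSeconds = 30): array
{
    $multi = curl_multi_init();
    $handles = [];

    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // collect the body as a string
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers; sleep in select() instead of spinning the CPU.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(10000); // select() could not wait; back off briefly
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Per-transfer result codes are only available through curl_multi_info_read().
    $errors = array_fill_keys(array_keys($handles), '');
    while (($info = curl_multi_info_read($multi)) !== false) {
        $key = array_search($info['handle'], $handles, true);
        if ($key !== false && $info['result'] !== CURLE_OK) {
            $errors[$key] = curl_strerror($info['result']);
        }
    }

    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = [
            'body'  => (string) curl_multi_getcontent($ch),
            'error' => $errors[$key],
        ];
        curl_multi_remove_handle($multi, $ch);
    }

    return $results;
}
```

A call such as `fetch_all(['home' => 'http://example.com/'])` would return one entry per key. Total wall-clock time is governed by the slowest response rather than the sum of all of them.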

Enter FastCrawl. It uses cURL for the requests and MySQL as a thread-safe queue. In your driver code, you just load up URLs, pass an optional callback (the default callback populates the raw_data table), and then process the resulting data from the table as usual. Because the queue uses database transactions, you can even run multiple PHP instances at once.
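The idea of a database table acting as a queue shared by several workers can be sketched as follows. FastCrawl's real schema and API are not shown here, so everything in this example is an assumption: the `url_queue` table, its columns, and the `queue_*` functions are hypothetical, and an in-memory SQLite database stands in for MySQL so the sketch runs without a server.

```php
<?php
// Hypothetical queue table; SQLite in memory is a stand-in for MySQL.
function queue_open(): PDO
{
    $db = new PDO('sqlite::memory:');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec(
        "CREATE TABLE url_queue (
            id INTEGER PRIMARY KEY,
            url TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending'
        )"
    );
    return $db;
}

function queue_push(PDO $db, string $url): void
{
    $db->prepare("INSERT INTO url_queue (url) VALUES (?)")->execute([$url]);
}

/**
 * Claim the oldest pending URL, or return null if there is none
 * (or if another worker claimed it first).
 */
function queue_claim(PDO $db): ?string
{
    $db->beginTransaction();
    try {
        // On MySQL/InnoDB one would typically add FOR UPDATE to this SELECT.
        $row = $db->query(
            "SELECT id, url FROM url_queue WHERE status = 'pending' ORDER BY id LIMIT 1"
        )->fetch(PDO::FETCH_ASSOC);

        if ($row === false) {
            $db->commit();
            return null;
        }

        // The status check in WHERE makes the claim atomic: only one worker
        // can flip a given row from 'pending' to 'claimed'.
        $update = $db->prepare(
            "UPDATE url_queue SET status = 'claimed' WHERE id = ? AND status = 'pending'"
        );
        $update->execute([$row['id']]);
        $claimed = $update->rowCount() === 1;

        $db->commit();
        return $claimed ? $row['url'] : null;
    } catch (Throwable $e) {
        $db->rollBack();
        throw $e;
    }
}
```

Each PHP instance would call `queue_claim()` in a loop; because a row can only be claimed once, no URL is fetched twice even with several workers running.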

FastCrawl lets you parse pages any way you want, but just how much faster are PHP's built-in functions than simple_html_dom? Check out the interactive benchmark!
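For readers unfamiliar with the built-in route, this is a small sketch of parsing with PHP's DOMDocument and DOMXPath classes. The function name `extract_links` and its output format are illustrative assumptions, not part of FastCrawl.

```php
<?php
/**
 * Return every link in an HTML string as ['href' => ..., 'text' => ...].
 * Hypothetical helper built only on PHP's bundled DOM extension.
 */
function extract_links(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = [
            'href' => $anchor->getAttribute('href'),
            'text' => trim($anchor->textContent),
        ];
    }
    return $links;
}

$html = '<html><body><a href="/one">One</a><p><a href="/two"> Two </a></p><a>no href</a></body></html>';
print_r(extract_links($html));
```

The XPath expression `//a[@href]` selects only anchors that carry an href attribute, so the third anchor in the sample is skipped.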

FastCrawl has built-in support for CSV files, so your output always comes out in a format that you can use right away.
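How FastCrawl writes its CSV files is not described here, but in plain PHP the same result can be approximated with fputcsv(). In this sketch the function name `rows_to_csv` and the sample column names are assumptions.

```php
<?php
/**
 * Turn a list of associative rows into CSV text, using the keys of the
 * first row as the header line. Hypothetical helper for illustration.
 */
function rows_to_csv(array $rows): string
{
    if ($rows === []) {
        return '';
    }

    $out = fopen('php://temp', 'r+');
    // Delimiter, enclosure, and escape are passed explicitly so behavior does
    // not depend on version-specific defaults.
    fputcsv($out, array_keys($rows[0]), ',', '"', '\\');
    foreach ($rows as $row) {
        fputcsv($out, array_values($row), ',', '"', '\\');
    }

    rewind($out);
    $csv = stream_get_contents($out);
    fclose($out);
    return $csv;
}

echo rows_to_csv([
    ['url' => 'http://example.com/a', 'title' => 'Hello, world'],
    ['url' => 'http://example.com/b', 'title' => 'Plain'],
]);
```

fputcsv() takes care of quoting, so a title containing a comma stays in a single column when the file is opened in a spreadsheet.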

Download FastCrawl now!

See FastCrawl in action!

profile for Byron Whitlock at Stack Overflow, Q&A for professional and enthusiast programmers

Downloadable Tools

EasyCSV - Run SQL queries against CSV files (Windows).

FastCrawl - PHP web crawling framework.

Creative Commons License

Fun Stuff

DOM Benchmarks - So you think simple_html_dom is the greatest thing since sliced bread? Think again. The PHP built-in libraries DOM(inate).

Craigslist Gigfinder - Find gigs in the computer section. No need to click over multiple cities any more, the gigfinder puts it all in one place!

Online image resizer - Quick and dirty image resizer

Plate Techtonics - A fun way to show off jQuery transitions.