I have always wanted to automate an HTTP browsing process, for example to extract every article on a website concerning a specific subject, or to diagnose a website by checking the availability and speed of each page. There are plenty of applications for such a bot. It has to be able to make an HTTP request to a website, perform a specific action on each page, and then use some kind of heuristic function to choose which page to visit next.
The program is divided into a series of steps:
- Generating the HTTP request toward the right website/page.
- Wrapping the request with each layer's headers.
- Sending the request and getting the response.
- Extracting the HTML code from the response.
- Executing a callback function on the response.
- Deciding which page will be analyzed next.
- Repeating.
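The steps above can be sketched roughly as follows in Python, using only the standard library. This is a minimal illustration, not a finished bot: the names `fetch`, `choose_next`, `crawl`, the `max_pages` limit and the `example-bot/0.1` user agent are all assumptions made for the example, and the "heuristic" is simply first-discovered, first-visited.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Build the request, add headers, send it, and return the HTML text."""
    # The user agent string is a made-up placeholder.
    request = Request(url, headers={"User-Agent": "example-bot/0.1"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def choose_next(frontier):
    """Placeholder heuristic: visit the oldest discovered URL first."""
    return frontier.pop(0)


def crawl(start_url, callback, fetch_page=fetch, max_pages=10):
    """Visit up to max_pages URLs, calling callback(url, html) on each page."""
    frontier, visited = [start_url], set()
    while frontier and len(visited) < max_pages:
        url = choose_next(frontier)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = fetch_page(url)
        except (OSError, ValueError):
            continue  # unreachable page or malformed URL: skip it
        callback(url, html)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            absolute = urldefrag(urljoin(url, href)).url
            if absolute.startswith(("http://", "https://")) and absolute not in visited:
                frontier.append(absolute)
    return visited
```

Calling `crawl("http://example.com/", lambda url, html: print(url, len(html)))` would print the address and size of each page reached. Swapping `choose_next` for another function is where a smarter page-selection heuristic would go.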
This program could easily be developed as a native application for an operating system or as a web app using PHP. I won’t go any further since there are already a lot of bots doing this. But if you are actually building one, let me know!
List of existing HTTP bots (taken from Wikipedia):
- Aspseek is a crawler, indexer and search engine written in C++ and licensed under the GPL.
- crawler4j is a crawler written in Java and released under an Apache License. It can be configured in a few minutes and is suitable for educational purposes.
- DataparkSearch is a crawler and search engine released under the GNU General Public License.
- GNU Wget is a command-line-operated crawler written in C and released under the GPL. It is typically used to mirror Web and FTP sites.
- GRUB is an open source distributed search crawler that Wikia Search <http://wikiasearch.com> uses to crawl the web.
- Heritrix is the Internet Archive’s archival-quality crawler, designed for archiving periodic snapshots of a large portion of the Web. It was written in Java.
- ht://Dig includes a Web crawler in its indexing engine.
- HTTrack uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in C and released under the GPL.
- ICDL Crawler is a cross-platform web crawler written in C++ and intended to crawl Web sites based on Web-site Parse Templates using only the computer’s free CPU resources.
- mnoGoSearch is a crawler, indexer and search engine written in C and licensed under the GPL (Linux machines only).
- Nutch is a crawler written in Java and released under an Apache License. It can be used in conjunction with the Lucene text-indexing package.
- Open Search Server is a search engine and web crawler software released under the GPL.
- Pavuk is a command-line Web mirror tool with an optional X11 GUI crawler, released under the GPL. It has a bunch of advanced features compared to wget and httrack, e.g., regular-expression-based filtering and file creation rules.
- tkWWW Robot is a crawler based on the tkWWW web browser (licensed under GPL).
- YaCy is a free distributed search engine, built on principles of peer-to-peer networks (licensed under GPL).
June 5, 2012 at 10:00 pm
Hello, I am thinking of doing a mini project on a web crawler. Would that be a good idea?
Please reply ASAP.
June 6, 2012 at 11:09 pm
What do you want it to do?