PHP, MySQL and AJAX Web Crawler

Web crawlers (or spiders) have always been one of the things that keep me learning more about PHP (and even Perl), and I still have not mastered the art of making a perfect PHP web crawler. I believe PHP is simply not well equipped for the amount of processing an efficient web crawler needs. I once spent about three days learning Perl simply to convert a crawler I had written, and once it was finished the difference in processing speed was phenomenal.

Anyway, let's get on with dishing out some code.

The following code was my first attempt at writing a class, so to the experts out there, I am sorry!

<?php

class crawler {

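	public $url, $val, $ref;
	// settings and state assigned by the methods below (declared so they are not created dynamically)
	public $max_link, $local, $x_more, $tableName, $current_page;
	private $data;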
	
	public function __construct( $url ) {
		// let's define some stuff
		$this->url = $url;
		$this->max_link = false;
		$this->local = true;
		$this->tableName = self::get_table_name($url);
	}
	public function get_url(){
		// returns the url used in the object
		return $this->url;	
	}
	public function local( $val ) {
		// local refers to whether or not you want to store and follow outbound links or just stay local to the url
		if( is_bool( $val ) ) {
			$this->local = $val;
		}else{
			throw new Exception('Local Error: only boolean is valid');
		}
	}
	// static, because it is also called as crawler::get_table_name() before an object exists
	public static function get_table_name( $url ){
		// return the table name
		$info = parse_url( $url );
		return preg_replace("#^www\.#is", "", $info['host'] );	
	}
	public function max_link( $val ) {
		// this was implemented for testing. Stop processing after X amount of links have been indexed.
		$this->max_link = $val;	
	}
	public function get_cur_page(){
		$table = self::get_table_name( $this->url );
		$con = Database::getInstance();
		$sql = "SELECT url FROM `$table` WHERE crawling = '1'";
		if( $re = $con->query( $sql ) ) {
			if( $re->num_rows != 0 ) {
				return $re->fetch_object()->url;
			}
		}
	}
	public function x_more( $val ) {
		// if continuing from a previously terminated crawl, terminate after another X more links have been indexed
		$this->x_more = $val;	
	}
	public function complete_percentage(){
		// this is a rough percentage guide. Finds the number of indexed urls and the total number of urls found and works out a percentage.
		$con = Database::getInstance();
		$re = $con->query("SELECT COUNT(id) as `total_entries` FROM `" . $this->tableName . "`");
		$ar = $re->fetch_object();
		$re2 = $con->query("SELECT COUNT(id) as `indexed_entries` FROM `" . $this->tableName . "` WHERE indexed = '1'");
		$ar2 = $re2->fetch_object();
		// an empty table would otherwise cause a division by zero
		if( $ar->total_entries == 0 ) return sprintf( "%01.2f", 0 );
		$per_percent = ( ( 100 / $ar->total_entries ) * $ar2->indexed_entries );
		return sprintf( "%01.2f", $per_percent );
	}
	// static, because it is called as crawler::create_table() before an object exists
	public static function create_table( $table ) {
		$con = Database::getInstance();
		$table = "CREATE TABLE `isiteconvert`.`$table` (
				`id` INT( 10 ) NOT NULL AUTO_INCREMENT PRIMARY KEY ,
				`url` VARCHAR( 255 ) NOT NULL UNIQUE,
				`already_crawled` INT( 1 ) NOT NULL DEFAULT '0',
				`indexed` INT( 1 ) NOT NULL DEFAULT '0',
				`crawling` INT( 1 ) NOT NULL DEFAULT '0',
				`to_crawl` INT( 1 ) NOT NULL DEFAULT '0',
				`crawl_error` INT( 1 ) NOT NULL DEFAULT '0'
				) ENGINE = MYISAM ;";
		$con->query( $table );	
	}
	private function resolve_href ( $base, $href ) {
		// written by Isaac Z. Schlueter, posted on php.net
		// creates an absolute path from a base url and a relative path
		if (!$href)
			return $base;
		if( $rel_parsed = @parse_url($href) ) {
			if (array_key_exists('scheme', $rel_parsed))
				return $href;
			$base_parsed = parse_url("$base ");
			if (!array_key_exists('path', $base_parsed))
				$base_parsed = parse_url("$base/ ");
			if ($href[0] === "/") // the $href{0} offset syntax was removed in PHP 8
				$path = $href;
			else
				$path = dirname($base_parsed['path']) . "/$href";
			$path = preg_replace('~/\./~', '/', $path);
			$parts = array();
			foreach ( explode('/', preg_replace('~/+~', '/', $path)) as $part ) { 
				if ($part === "..")
					array_pop($parts);
				elseif ($part!="") 
					$parts[] = $part;
			}
			$dir =  ( ( array_key_exists('scheme', $base_parsed)) ? $base_parsed['scheme'] . '://' . $base_parsed['host'] : "" ) . "/" . implode("/", $parts);
			return str_replace( "\/", '', $dir );
		}else{
			return false;
		}
	}
	private function clean_continued_scan(){
		$con = Database::getInstance();	
		$sql = "SELECT id FROM `" . $this->tableName . "` WHERE crawling = '1'";
		if( $re = $con->query( $sql ) ) {
			$total = $re->num_rows;
			if( $total > 1 ) {
				$x = 1;
				while( $ob = $re->fetch_object() ) {
					if( $x != $total ) {
						$sql = "UPDATE `" . $this->tableName . "` SET crawling = '0', indexed = '0', to_crawl = '1', crawl_error = '0', already_crawled = '0' WHERE id = '" . $ob->id . "'";	
						if( ! $con->query( $sql ) ) die( $con->error );
					}
					$x++;
				}
			}
		}
	}
	public function scrape(){
		$this->clean_continued_scan();
		// this is where we begin.
		$con = Database::getInstance();		
		$con->query("INSERT INTO `" . $this->tableName . "` ( url, to_crawl ) VALUES ( '" . $con->real_escape_string($this->url) . "', '1' )");
		// check to see if there are any more links to be scraped ( 20 at a time to keep memory down )
		$sql = "SELECT * FROM `" . $this->tableName . "` WHERE crawling = '1' OR to_crawl = '1' ORDER BY crawling, to_crawl LIMIT 20";
		if( $main_re = $con->query( $sql ) ) {
			if( $main_re->num_rows != 0 ) {
				while( $main_data = $main_re->fetch_object() ) {
					$href = $main_data->url;
					// mark down that this url is currently being crawled, this will help re-processing a previously terminated process.
					$con->query("UPDATE `" . $this->tableName . "` SET crawling = '1' WHERE id = '" . $main_data->id . "'");
					$this->current_page = $href;
					// get the page's contents into a string
					if( ! $html = @file_get_contents( $href ) ) {
						// if not, mark down that it has returned an error and move on to the next url
						$con->query("UPDATE `" . $this->tableName . "` SET already_crawled = '1', indexed = '0', to_crawl = '0', crawl_error ='1', crawling = '0' WHERE id = '" . $main_data->id . "'");
						// skip the "indexed" update at the bottom of the loop so the error flag is kept
						continue;
					}else{
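						// let's find all the href's or area attribute values
						// ( [^>]* rather than [^>]+ after the closing quote, so a tag that ends straight after the url still matches )
						preg_match_all( "#<\s*a[^>]+(?:(?:href)|(?:area))=[\"']([^\"']+)[\"'][^>]*>#is", $html, $matches );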
						// matches[1] will hold the array of urls
						$links = $matches[1];
						// foreach of the urls
						foreach( $links as $link ) {
							if( $re = $con->query("SELECT * FROM `" . $this->tableName . "` WHERE indexed = '1'") ) {
								// finding out if the urls indexed exceed or match the max link set
								if( $this->max_link !== false && $re->num_rows > $this->max_link ) break;
								// turn the url into an absolute path if it is not one already
								if( $new_href = rtrim( self::resolve_href( $href, rtrim( $link, "/" ) ), "/") ) {
									// skip fragment links ( # ) and anything ending in .gif, .jpg, .png, .js, .ico or .css
									if( substr( $new_href, -1 ) != "#" && ! preg_match( "#(?:\.(?:gif|jpg|png|js|ico|css)$|\#)#is", $new_href) ) {
										$re = $con->query("SELECT * FROM `" . $this->tableName . "` WHERE url = '" . $con->real_escape_string($new_href) . "'");
										if( ( $this->local === true && preg_match( "#^" . preg_quote( $this->url ) . "#is", $new_href ) ) || $this->local === false ) {
											// if the url is ok and it has not already been added into our table
											if( $new_href !== false && $re->num_rows == 0 ) {
												// insert and start again!!
												$con->query("INSERT INTO `" . $this->tableName . "` ( url, to_crawl ) VALUES ( '" . $con->real_escape_string($new_href) . "', '1' )");
											}	
										}										
									}
								}
							}else die( $con->error );
						}
					}
					// mark that this url has been scanned for new urls and has been indexed correctly
					$con->query("UPDATE `" . $this->tableName . "` SET already_crawled = '1', indexed = '1', to_crawl = '0', crawl_error ='0', crawling = '0' WHERE id = '" . $main_data->id . "'");
				}
				$main_re->free_result();
			}
		}
	}
	
	public function __destruct(){
		$con = Database::getInstance();	
		$con->close();
	}
}

?>

Explanation

I’m not going to go into too much detail about the code above, but I will give a rundown of the methods that are not so self-explanatory.

max_link() sets a maximum number of URLs to be indexed; it was initially implemented for testing.

x_more() was also initially implemented for testing purposes. It is meant to allow indexing of x more URLs when continuing a previously incomplete scan. Note that the scrape() method shown above stores this value but never reads it.

complete_percentage() was the last thing I added to this class. It returns a completion percentage (indexed URLs against URLs found). It is only a rough guide: on larger sites new URLs are found faster than they are indexed, so the figure can jump from 70% back down to 20%.
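
The arithmetic behind that jump can be sketched on its own. This is a minimal sketch, not the class method: completion_percentage() is a hypothetical helper that takes the two row counts as plain integers instead of querying the database.

```php
<?php
// Hypothetical helper (not part of the crawler class): turns two row counts
// into the same "%01.2f" percentage string, guarding against an empty table.
function completion_percentage(int $indexed, int $total): string
{
    if ($total <= 0) {
        return '0.00'; // nothing found yet, so avoid dividing by zero
    }
    return sprintf('%01.2f', (100 / $total) * $indexed);
}

// 70 of the 100 URLs found so far are indexed
echo completion_percentage(70, 100), "\n";  // prints 70.00
// 30 more pages indexed, but 400 new URLs were found along the way
echo completion_percentage(100, 500), "\n"; // prints 20.00
```

The second call shows why the bar can move backwards: the number of indexed URLs went up, but the total grew faster.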

resolve_href() is a life saver. I spent many weeks trying to write my own long-winded version of this but always hit some anomaly that threw a spanner in the works. This version was written by Isaac Z. Schlueter and posted on php.net (http://www.php.net/manual/en/function.realpath.php#85388). It gives us an absolute URL no matter what form the URL found in the href attribute takes.
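
To show the kind of input and output involved, here is a simplified sketch of the same idea. resolve_url() is a hypothetical stand-in written for this example, not the php.net function used in the class. It assumes the base is an absolute http(s) URL and ignores ports, query strings in the base and other edge cases.

```php
<?php
// Simplified, hypothetical resolver for illustration only.
// Assumes $base is an absolute URL with a scheme and a host.
function resolve_url(string $base, string $href): string
{
    if ($href === '') {
        return $base;
    }
    if (is_string(parse_url($href, PHP_URL_SCHEME))) {
        return $href; // already absolute
    }
    $parts    = parse_url($base);
    $root     = $parts['scheme'] . '://' . $parts['host'];
    $basePath = $parts['path'] ?? '/';
    if ($href[0] === '/') {
        $path = $href; // root-relative link
    } else {
        // drop the last segment of the base path and append the link
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $href;
    }
    // collapse "." and ".." segments
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '' && $segment !== '.') {
            $out[] = $segment;
        }
    }
    return $root . '/' . implode('/', $out);
}

$base = 'http://example.com/blog/post.html';
echo resolve_url($base, 'next.html'), "\n"; // http://example.com/blog/next.html
echo resolve_url($base, '../about'), "\n";  // http://example.com/about
echo resolve_url($base, '/contact'), "\n";  // http://example.com/contact
```

Like the class, this sketch drops trailing slashes, so a link to dir/ comes back without the final slash.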

scrape() is where all the magic happens. Read the comments in the code if you would like a step-by-step insight into how it works. In short, it stores the URL, scans the page for new URLs, checks whether they are already indexed or marked for indexing and, if not, puts them in the database and marks them for indexing.
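
The "scan the page for new URLs" step can be tried without a database. The two functions below are hypothetical helpers written for this example, with a slightly different regular expression from the one in the class; treat them as a sketch of the idea rather than a drop-in replacement.

```php
<?php
// Hypothetical helper: pull the href values out of <a> and <area> tags.
function extract_hrefs(string $html): array
{
    preg_match_all('#<\s*(?:a|area)\b[^>]*\bhref\s*=\s*["\']([^"\']+)["\']#i', $html, $m);
    // one page often repeats the same link, so de-duplicate before queueing
    return array_values(array_unique($m[1]));
}

// Hypothetical helper: false for fragment links, javascript:/mailto: links
// and static assets, true for everything else.
function is_crawlable(string $url): bool
{
    if (strpos($url, '#') !== false) {
        return false;
    }
    if (preg_match('#^(?:javascript|mailto):#i', $url)) {
        return false;
    }
    return !preg_match('#\.(?:gif|jpg|png|js|ico|css)$#i', $url);
}

$html  = '<a href="/one">1</a> <a class="x" href=\'img.png\'>2</a> '
       . '<area href="/one"> <a href="#top">top</a>';
$links = array_values(array_filter(extract_hrefs($html), 'is_crawlable'));
print_r($links); // only "/one" survives the filter
```

Only the links of one page are held in memory at a time here; the queue of URLs still belongs in the database, as in the class.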

How To Use

To get the ball rolling the first piece of code is the following:

<?php
	// include relevant files
	require "./classes/class_mysqli.php";
	require "./classes/crawler.php";
	$con = Database::getInstance();
	$table = crawler::get_table_name( $_POST['url'] );
	// create a table for this url if one does not already exist
	if( ! $con->query("SHOW TABLES LIKE '"  . $table . "'")->num_rows ) {
		crawler::create_table( $table );
	}
	// init the crawler
	$crawl = new crawler($_POST['url']);
	// lets only go after the local links
	$crawl->local(true);
	// the sky's the limit, no messing with max links
	$crawl->max_link(false);
	// scrape me!
	$crawl->scrape();
?>

If you create a small form and post a domain name to this script you’ll begin your first crawl, yay. The only problem is that it is slightly boring, as you don’t see anything happening unless you go to phpMyAdmin (or another MySQL viewer) and watch the number of entries trickle upwards after numerous refreshes…

So we have a final piece of PHP code:

<?php
	
	// load relevant classes
	require_once "./classes/class_mysqli.php";
	require_once "./classes/crawler.php";
	// init the crawler class
	$crawl = new crawler($_POST['url']);
	// get a connection instance
	$con = Database::getInstance();
	// create a table name based on the url
	$table = crawler::get_table_name( $_POST['url'] );
	
	if( $re = $con->query("SELECT * FROM `$table` WHERE crawling = '1'") ) {
		// If we are not crawling, we might be finished
		if( $re->num_rows == 0 ) {
			if( $re = $con->query("SELECT * FROM `$table` WHERE to_crawl = '1'") ) {
				// if there are no other urls to crawl then we are finished
				if( $re->num_rows == 0 ) {
					$re = $con->query("SELECT * FROM `$table` WHERE indexed = '1'");
					// keep the raw numbers here; the markup is added once, in the final message
					$indexed = $re->num_rows;
					$re = $con->query("SELECT COUNT(id) as `amount` FROM `$table`");
					$amount = $re->fetch_object()->amount;
					die('<!-- die --><p>' . $indexed . ' urls were indexed out of ' . $amount . ' urls found.</p>' );
				}
			}
		}
		$obj = $re->fetch_object();
		// add a nice little ajax load
		echo "<p><img src=\"./scrape/ajax-loader.gif\" style=\"display:inline; margin-right:10px\" /><a target=\"_blank\" href=\"" . $crawl->get_cur_page() . "\">" . $crawl->get_cur_page() . "</a></p>";
		// query the table to see how many pages have been indexed
		$re = $con->query("SELECT * FROM `$table` WHERE indexed = '1'");
		echo "<p>" . $re->num_rows . " pages indexed.</p>";
		$re = $con->query("SELECT COUNT(id) as `amount` FROM `$table`");
		echo "<p>" . $re->fetch_object()->amount . " urls found.</p>";
		// work out the complete percentage
		$percent = $crawl->complete_percentage();
		// below is working out the width of the inner and outer percentage bar
		$percentBarWidth = 300; // in pixels
		$innerBarWidth = floor( ( $percentBarWidth / 100 ) * $percent );
		echo '<div style="width:' . $percentBarWidth . 'px;" id="outerBar"><div style="width:' . $innerBarWidth . 'px;" id="innerBar"></div></div>';
		echo '<p>' . $percent . '% complete...</p>';
	}else{
		echo "<p>Connection error... trying again to connect...</p>";	
	}
	
?>

This code lives in its own file (ajaxGetData.php), separate from the other code. Once the main script has been initiated via an AJAX request, another request is sent every 2 seconds (or whatever you set it to) to the above script, which tells us which URL we are currently scanning, the completion percentage and how many URLs we have indexed.

The great thing about using PHP and JavaScript instead of pure PHP is that we can cut down on the amount of memory used and CPU usage by over 75%. Having short bursts of up to 20 URLs per request (the LIMIT in scrape()) instead of one looped scan means I can actually type this post while running the scan! Brilliant!

So you’ll probably need the jQuery code now…

// A simple elapsed time plugin
// ACT Web Designs Mansfield
// 2011
(function($){
    $.fn.elapsed = function(options) {
        var defaults = { seconds: true, minutes: true, hours: true, days: true };
        var options = $.extend(defaults, options);
        var ob = this;
        var secs = 0, mins = 0, hours = 0, days = 0;
        function elapsed_time( secs, mins, hours, days ){
            if( secs == 60 ) { mins++; secs = 0; }
            if( mins == 60 ) { hours++; mins = 0; }
            if( hours == 24 ) { days++; hours = 0; }
            ob.html( ( options.days ? days + ':' : '') + 
                     ( options.hours ? hours + ':' : '') + 
                     ( options.minutes ? mins + ':' : '' ) + 
                     ( options.seconds ? secs : '') );
            window.setTimeout( function(){ secs++; elapsed_time( secs, mins, hours, days ); }, 1000 );
        }
        elapsed_time( secs, mins, hours, days );    
    };
})(jQuery);

$(document).ready(function() {
	var number = 0;
	var timer;

	function get_data( url ){
		$('#requests').html( number );
		$.ajax({ 
			type: 'POST', url: './includes/ajaxGetData.php', data: 'url=' + encodeURIComponent(url), cache: false, timeout: 10000,
			error : function(){ 
				if( timer != 'null' ) {
					timer = window.setTimeout( function(){ number++; get_data( url ) }, 2000 );
				}
			},
			success: function(html){ 
				if( html.substr(0,12) == '<!-- die -->' ) {
					$("#result").html('<p>Complete...</p>' + html );
					$('#requests').html('');
					timer = 'null';
				}else{
					$("#result").html(html);
					timer = window.setTimeout( function(){ number++; get_data( url ) }, 2000 );
				}
			}							
		});
	}
	$("input[name=submit]").live( "click", function(){
		if( timer != 'null' ) {
			var url = $("input[name=url]").val();
			if( $('#requests').html() == '' ) {
				$('#elapsed').elapsed();
				get_data( url );
			}
			$.ajax({ 
				type: 'POST', url: './index.php', data: 'submit=true&url=' + encodeURIComponent(url), cache: false,
				error : function(){ 
					$("input[name=submit]").trigger('click'); 
				},
				success: function(html){ 
					$("#result2").html(html); 
					if( html.substr(0,12) != '<!-- die -->' ) {
						$("input[name=submit]").trigger('click');
					}
				}							
			});
		}else{
			timer = 0;	
		}
	});
});

This code could do with some tidying up, but at the end of the day it does the job it was intended to do.

It kicks off by sending a request to our main script and also starts the ever-looping get_data() function, which every 2 seconds retrieves information about what we are currently doing.

The final piece of code we need is a simple html form and some placeholders.

<?php
if( isset( $_POST['submit'] ) ) {
	require_once "includes/init_crawler.php";
}else{
    ?>
    
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>ACT Web Crawler</title>
    <link href="css/css.css" rel="stylesheet" type="text/css" />
    <script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.4/jquery.min.js"></script>
    <script type="text/javascript" src="js/js.js"></script>
    </head>
    <body>
        <div id="wrapper">
            <h1>ACT Web Crawler</h1>
            <form method="post" action="index.php">
                <input type="text" name="url" value="http://www.">
                <input type="button" name="submit" value="submit">
            </form>
            <div id="elapsed">0:0:0:0</div>
            <div id="requests"></div>
            <div id="result"></div>
            <div id="result2"></div>
        </div>
        <p><small>Created by ACT Web Designs 2010</small></p>
    </body>
    </html>
    
    <?php
}
?>

The Verdict

Well, this piece of code has been stuck on my hard drive with other lost files for a while so I decided to put it to the test.

google.co.uk – I ran the code for about 20 hours on and off, in which time it stored 127,671 URLs and indexed only around 3% of them. The problem with Google was the number of long-winded, crazy URLs (http://www.google.co.uk/search?q=%23ily+site:twitter.com&hl=en&ie=UTF-8&prmd=ivns&ei=yGJvTavoGYSChQfxmYU9&start=10&sa=N) which mostly led to nothing rather than actual individual pages… something to think about for future updates.

smashingmagazine.com – I chose to test on this site because of its large number of well-structured URLs, and because I would like to build an external search engine for my own personal use (I read this site every day). The scan took around 24 hours and returned around 14,000 URLs; after a quick look through I found that there were a few incorrect links…

After these two scans I can only come to one conclusion: in its current form it cannot crawl fast enough to be considered efficient.

What could be implemented to make this crawler more efficient?

  1. Re-write the regular expression to only pick up relative paths, or absolute paths that point to the domain entered, if local() is set to true.
  2. Re-write a few lines of the scrape() method (making sure that more than one URL can be crawled at a time) and add the ability to multi-thread using jQuery, i.e. to run several AJAX requests in parallel.
  3. With the multi-threading in place, use several servers to crawl the same domain (the more the better), all updating a single database.
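
For the first point, one possible approach is to compare hosts rather than make the regular expression smarter. The sketch below is only an idea: bare_host() and is_local_link() are hypothetical names, and treating www.example.com and example.com as the same site is an assumption borrowed from get_table_name().

```php
<?php
// Hypothetical helper: lower-cases a host name and drops a leading "www.".
function bare_host(string $host): string
{
    return (string) preg_replace('#^www\.#i', '', strtolower($host));
}

// Hypothetical helper: true when $href stays on the same host as $siteUrl.
function is_local_link(string $siteUrl, string $href): bool
{
    $linkHost = parse_url($href, PHP_URL_HOST);
    if (!is_string($linkHost)) {
        // no host: a relative path is local, but mailto:, javascript: etc. are not
        return !is_string(parse_url($href, PHP_URL_SCHEME));
    }
    $siteHost = parse_url($siteUrl, PHP_URL_HOST);
    return is_string($siteHost) && bare_host($linkHost) === bare_host($siteHost);
}

var_dump(is_local_link('http://www.example.com', '/about'));                  // bool(true)
var_dump(is_local_link('http://www.example.com', 'http://example.com/blog')); // bool(true)
var_dump(is_local_link('http://www.example.com', 'http://other.org/'));       // bool(false)
```

A check like this could run before a link is resolved and stored, so that outbound links never reach the database when local() is true.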

Download

You can download all of the above, ready to go, from https://sourceforge.net/projects/actwebcrawler/files/

Just remember this is only a starting point for anyone interested in looking into web crawling. There are many errors in the above code, but with a little time it could be manipulated to function the way you need it to. From previous experience, to keep memory usage down, stay away from OOP crawlers, OOP tag extraction classes (such as the simple html DOM class) and storing URLs in arrays (the larger sites will kill your memory).


2 Responses to PHP, MySQL and AJAX Web Crawler

  1. Paul says:

    I love it but I have one request. Don’t create a new table for every site. Add WHERE site_base = “$URL”. Apart from that it’s amazing with a few tweaks here and there.

    • Luke Snowden says:

      I have an updated version somewhere… can’t remember where I put it though… I think it has what you mentioned and a few more tweaks including the regular expression. Does the job though and is a good starting point for anyone who is interested.
