So, search engines traverse the web using programs called spiders, a fitting name. A spider's purpose is to grab the HTML of the page it visits, look for useful information, and catalogue every link it finds. It discovers new websites by visiting those links in turn. You then rinse and repeat the process; this is essentially how web spiders/crawlers work.
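That visit-collect-repeat loop can be sketched in plain Python with nothing but the standard library. This is a minimal illustration, not my actual crawler: the fetch function is a stand-in for a real HTTP request, and here I feed it a fake three-page "web" instead of live sites.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch, limit=100):
    """Breadth-first crawl: visit a page, catalogue its links,
    queue the unseen ones, repeat until the frontier is empty."""
    frontier = deque([seed])
    visited = set()
    while frontier and len(visited) < limit:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(fetch(url))  # fetch() returns the page's HTML
        for link in parser.links:
            if link not in visited:
                frontier.append(link)
    return visited

# A fake three-page "web" stands in for real HTTP requests.
pages = {
    "a": '<a href="b">B</a> <a href="c">C</a>',
    "b": '<a href="a">A</a>',
    "c": "",
}
print(sorted(crawl("a", lambda url: pages.get(url, ""))))  # → ['a', 'b', 'c']
```

A real crawler swaps the fake fetch for an HTTP client and adds politeness (robots.txt, rate limits), but the loop is the same.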
Web spiders always have a starting point, called a seed. Seeds are usually hand-picked at the start: websites that link out to lots of other websites. Index-style sites like Wikipedia, or blogs, are great examples of this.
My plan was to make specialised search engines for specific topics, such as learning or sports, maybe even search engines for specific locations, using a crawler that could be tailored to each of those categories. Now, by no means am I trying to beat Google or Bing; I just wanted to see what volume of websites I could index by myself with some Python/PHP. The data could even be used in future projects if needed.
How to make one?
So I knew that there were some very useful and well-rounded libraries for this, one of them being Scrapy -> https://scrapy.org/. This is a Python library, which you can install simply by doing
pip install scrapy
Once installed, it comes with some great examples of premade web crawlers.
I also knew I wanted an easy-to-use web interface to schedule crawling sessions without having to SSH into a server. So I set up a Laravel website, and within 30 minutes had made a user interface where I could create new crawlers, set seed URLs, and set crawler targets, so that once a crawler has found, say, 10,000 websites, it stops for a while and tries again later on a different seed.
But how do you control the Python script with PHP?
Good question. I did this using a worker model: the PHP end would create a database table entry saying, okay, this workerID is ready for scraping, here are some seed URLs, and here is your target number to meet. On the Python end, I had an always-running Python script on a while loop that checked the database every 10 seconds to see if ANY workerIDs were ready to crawl. Once one was, it would grab all the relevant information such as the seed, target, etc., set currently crawling to 1 in the database so that other instances of the Python script wouldn't take the same workerID, and begin crawling. Before scraping, it would check the database to see if a website had already been crawled; if it had, it would not scrape the data, just collect the links and move on. This seemed like a very fast and efficient way to do it, as you were not re-scraping websites you had already visited before.
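A minimal sketch of that hand-off, using an in-memory SQLite table in place of the real database. The table and column names here are my own guesses for illustration, not the actual schema.

```python
import sqlite3

def claim_worker(conn):
    """Find one ready workerID, mark it as currently crawling so other
    instances won't take it, and return its seed and target.
    (In production, wrap this in a transaction or use row locking so
    two pollers cannot claim the same row.)"""
    cur = conn.cursor()
    cur.execute(
        "SELECT id, seed, target FROM workers "
        "WHERE ready = 1 AND currently_crawling = 0 LIMIT 1"
    )
    row = cur.fetchone()
    if row is None:
        return None  # nothing ready: the poller sleeps and tries again
    cur.execute("UPDATE workers SET currently_crawling = 1 WHERE id = ?", (row[0],))
    conn.commit()
    return {"id": row[0], "seed": row[1], "target": row[2]}

# Demo: the PHP side would normally insert this row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workers (id INTEGER PRIMARY KEY, seed TEXT, "
    "target INTEGER, ready INTEGER, currently_crawling INTEGER)"
)
conn.execute(
    "INSERT INTO workers (seed, target, ready, currently_crawling) "
    "VALUES ('https://en.wikipedia.org/', 10000, 1, 0)"
)
conn.commit()

job = claim_worker(conn)
print(job["seed"], job["target"])
print(claim_worker(conn))  # → None: the only job is already claimed
```

The real worker script wraps claim_worker in a `while True` loop with a `time.sleep(10)` between polls, and starts crawling whenever a job comes back.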
This method also meant I could put the worker Python script on ANY machine and it would automatically assign itself work, as directed by the database/user interface. If no workerIDs are set to ready, the Python script just keeps checking every 10–20 seconds.
Just to keep you aware, I had two types of crawlers. The one I mentioned above was the Website_Discovery crawler; its primary purpose was to find and collect as many domain names as fast as possible and put them in the database.
The website scraper's purpose was to find a previously unscraped domain, collected by the Website_Discovery crawler, and scrape that single domain's data up to two levels deep (so from the home page, click one link, then another from the next page).
With this method I could have, say, 20 instances of just one Python script running, and based on the user interface/database, those instances would change their crawler type depending on what the workerID wanted. It was a great success: within two days I had catalogued 500,000 websites and scraped 50,000 websites two levels deep. I was also keeping a record of which websites were linking to where, as this is very important for search capabilities.
So that is all the theory.
Next I will show you with code how this was achieved. (Post Coming Soon)