A web crawler, or spider, is simply a computer program that crawls the web starting from a given seed page. For example, once a web crawler is started on a web page, it fetches that page and extracts all the links on it. It then pushes these links onto the to_visit list, which is used as a stack. Each link is popped from the stack in turn, the page it points to is fetched, and all the links found on that page are pushed one by one onto the to_visit stack. The link that was just visited is added to a list called visited. The web crawler continues in this way until the to_visit stack is empty.
A web spider performs the following steps:
1. Visit the given web page and get the source code of the page.
2. Extract all the links on the web page from the source code.
3. Add the link of the visited page to a list called “visited”.
4. Push the extracted links onto the “to_visit” stack.
5. Pop a link from the “to_visit” stack and repeat from step 1 until the “to_visit” stack is empty.
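The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a production crawler: the names extract_links, crawl, fetch and max_pages are assumptions made for this example. The fetch argument is any function that takes a URL and returns the page source as a string, which keeps the sketch runnable without a network connection.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Step 2: return absolute URLs for all links in the page source."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(seed, fetch, max_pages=100):
    """Crawl from seed. fetch(url) is a hypothetical callable returning HTML.

    max_pages is an assumed safety limit so the loop always terminates.
    """
    to_visit = [seed]  # a plain Python list used as a stack
    visited = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop()            # step 5: pop the next link
        if url in visited:
            continue                    # never fetch the same page twice
        html = fetch(url)               # step 1: get the page source
        visited.append(url)             # step 3: record the visited page
        for link in extract_links(url, html):
            if link not in visited:
                to_visit.append(link)   # step 4: push extracted links
    return visited
```

For real pages, fetch could wrap urllib.request.urlopen from the standard library. For experimenting, a small dictionary of URL-to-HTML strings works just as well, for example crawl("http://example.test/", lambda url: pages.get(url, "")).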
Understanding how a web crawler works teaches you a lot about various computer science concepts. Several languages are available for building your web crawler. However, Python is a very popular choice, and for understandable reasons. Python constructs are easy to understand because they read much like English. Python is portable, i.e. platform independent, and extensible. Google, for one, has long used Python as one of its development languages.
Using Python, we can easily create a web crawler with indexing functionality in a few dozen lines. Keywords are mapped to the appropriate references and maintained in a dictionary. The dictionary (dict) is a built-in data structure in Python that stores a value mapped to a corresponding key. It can therefore easily be used to store references (values) mapped to the relevant keyword (key). When you search for a specific keyword, a dictionary lookup returns the references (values) associated with that key.
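As a rough sketch of such an index, the snippet below maps each keyword to the list of URLs where it appears. The function names add_page_to_index and lookup, and the simple rule of splitting text into lowercase words, are assumptions made for this example rather than a fixed design.

```python
import re


def add_page_to_index(index, url, text):
    """Map every keyword (key) in text to the URLs (values) containing it."""
    # set() ensures a URL is recorded once per keyword, even if the
    # keyword occurs several times on the same page.
    for word in set(re.findall(r"[a-z0-9]+", text.lower())):
        index.setdefault(word, []).append(url)


def lookup(index, keyword):
    """Return the URLs stored for keyword, or an empty list if none."""
    return index.get(keyword.lower(), [])
```

Here index is an ordinary dict that starts out empty. After a few calls to add_page_to_index, lookup(index, "python") returns every URL whose text contained that word.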
Once you’ve created a web spider in Python, you can easily modify it to suit your needs. For example, you can tweak the code so that your web crawler crawls the web and collects any “.mp3” links it encounters on a web page. You can also modify it to crawl the web, search for pages of a specific type, and index them by matching keywords in a Python dictionary. All this is achievable without much effort.
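To give an idea of how small such a tweak can be, here is a hedged sketch that collects only the “.mp3” links from a page's source. The names Mp3LinkCollector and collect_mp3_links are hypothetical, and checking the URL path for a “.mp3” ending is just one simple way to recognize such links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class Mp3LinkCollector(HTMLParser):
    """Keeps only the <a> links whose URL path ends in .mp3."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.mp3_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)  # resolve relative links
        # Compare the path only, so query strings such as ?dl=1 are ignored.
        if urlparse(url).path.lower().endswith(".mp3"):
            self.mp3_links.append(url)


def collect_mp3_links(base_url, html):
    """Return absolute URLs of all .mp3 links found in the page source."""
    collector = Mp3LinkCollector(base_url)
    collector.feed(html)
    return collector.mp3_links
```

A crawler could call collect_mp3_links on every page it fetches and accumulate the results in a single list.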
Learn how to build a web crawler using Python in a week. You will learn a lot about basic programming concepts along the way.
See for yourself why Python is a favorite language of hackers around the world. If you are new to programming, don’t worry: Python is a good language to start with. Let’s get started and learn Python for beginners.