Want to see something weird yet amazing?

 

Go ahead, open a new tab and type ‘donald trump’ into Google.

 

Done?

 

Alright, how many hits did you get and how long did it take?

 

406,000,000 results in just 0.92 seconds? (Don’t worry if yours is slightly different; it varies from second to second.)

 

 

Great, now type your name into Google.

 

How many hits did you get and how long did it take?

 

 

Now..

 

Why is it that googling ‘donald trump’ returns so many hits (web content) in just 0.92 seconds, while googling your own name returns far fewer hits yet seems to take longer to load?

 

That, my friend, is the work of web crawlers…

 

Keep reading.

 

 

 

How does a web crawler work?

 

Think of them as kind of like virtual WALL-Es.

 

Except rather than scraping around and looking for junk, these virtual crawlers’ destiny is to roam the whole World Wide Web seeking all kinds of information.

 

 

 

Hmm, what kind of information? (._.)

Mostly it comes down to these two kinds of information:

 

  • Web-page content

For example, all the images, videos and text that you see here on this page.

 

 

  • Links

Basically all links going out from this page to other pages on this website or other websites.

 

 

That’s what these little robots do.
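If you’re curious what that looks like in code, here is a rough, made-up sketch using Python’s built-in html.parser; the class name and the HTML snippet are mine, not part of the crawler at the end of this post.

from html.parser import HTMLParser

# A rough sketch of the two things a crawler records from a page:
# the text it can read, and the links it can follow next.
class PageInfo(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = []    # pieces of readable page text
        self.links = []   # href values found in <a> tags

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for key, value in attrs:
                if key == 'href':
                    self.links.append(value)

# A made-up snippet of HTML standing in for a real page
page = '<p>Buy warm socks here</p> <a href="/shop">Shop</a> <a href="/blog">Blog</a>'
info = PageInfo()
info.feed(page)
print(info.text)   # ['Buy warm socks here', 'Shop', 'Blog']
print(info.links)  # ['/shop', '/blog']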

 

 

 

I’ll give you an example

 

Say Saint is the website you want to crawl. You type the URL of the homepage into the spider() function, and it then looks at all the content on this website.

 

Now, this robot doesn’t view the multimedia (videos, pictures, etc.) on this website the way you and I do; instead, it only looks at content served as “text/html”, i.e. the HTML code that describes those videos and pictures.
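For instance, here is roughly how a crawler can tell whether a page is “text/html” before bothering to read it; the URL is only an example, and the same kind of check appears in the full code below.

from urllib.request import urlopen

response = urlopen('https://example.com')           # example URL
contentType = response.getheader('Content-Type')    # e.g. 'text/html; charset=UTF-8'

if contentType and contentType.startswith('text/html'):
    print('This is an HTML page -- worth reading')
else:
    print('Not HTML (an image, a PDF, etc.) -- skip it')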

 

 

So let’s say you now want the little robot to crawl this page and find the word ‘socks’.

What you do is tell it, in your code, to crawl this page and look for the word ‘socks’.

 

If the word ‘socks’ is not found in the text on this page, the robot moves on to the next link in its collection (which can be on any site, not just this one) and repeats the process again and again.

 

It keeps looking until it has either found the word ‘socks’ or run into the limit that you typed into the spider() function.

 

That’s it!

 

Is this how Google works?

 

Sort of.

 

Google has a whole army of these little guys, and they are the ones who crawl all over the web looking for new content.

 

 

Now I have a question especially for you:

 

How would you make what has already been crawled faster for people to see?

 

Think…

 

If you said ‘increase the number of web crawlers’, that wouldn’t do much. There are already more web crawlers than you can imagine crawling the web as you read this very second, and adding more would be counterproductive, since they would still need time to find your content and crawl it entirely.

 

If you said ‘seeding’, the way torrents seed by letting others use some of your downloaded data to access the same information: well, that’s a great idea, but it would just eat up more of your own bandwidth, which in turn means even more time for the search results to load, wouldn’t it?

 

So how would you tackle this problem?

 

You store it!

 

Like a library, Google stores whatever is crawled in its own collection, and this is where indexing comes into play.

 

If it weren’t for indexing, the web crawlers would have to crawl afresh for every search, and it would take ages for you to find what you’re looking for on Google.
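Here is a toy sketch of what that buys you. This is not Google’s actual system, just the idea, and the pages and text are made up: the index is built once from crawled pages, and after that every search is just a dictionary lookup.

# A toy inverted index: built once from crawled pages,
# then searched instantly afterwards.
crawled_pages = {
    'saintlad.com/socks':  'all about socks and other warm things',
    'saintlad.com/robots': 'web crawlers are little robots',
}

# Build the index ahead of time (the slow, crawl-once part)
index = {}
for url, text in crawled_pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(url)

# Answering a search is now a simple lookup -- no crawling needed
print(index.get('socks', set()))    # {'saintlad.com/socks'}
print(index.get('robots', set()))   # {'saintlad.com/robots'}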

 

 

 

Now this is why,

 

whenever you google ‘donald trump’ you get some 406,000,000 results in just 0.92 seconds, while a search for your own name takes longer.

 

Because..

 

The more content that is crawled and indexed into Google’s ever-growing library, the faster it can be served. That is why your own name took longer: the web crawlers haven’t gotten to meet you yet. 🙂

 

*Your search terms also hit a number of other systems simultaneously, such as spell checkers, translation services, analytics and tracking servers, etc., but again, indexing plays the central role in how fast you can view already-crawled content compared to what’s being crawled this very second.

 

Make a web crawler in under 50 lines of code

 

I tried the following code a few days ago on Python 3.6.1 (the latest as of 21st March 2017), and it should work for you too.

 

Just go ahead and copy+paste this into your Python IDE, then you can run it or modify it.

 

That’s it! 😀

 

from html.parser import HTMLParser  
from urllib.request import urlopen  
from urllib import parse

# We are going to create a class called LinkParser that inherits some
# methods from HTMLParser which is why it is passed into the definition
class LinkParser(HTMLParser):

    # This is a function that HTMLParser normally has
    # but we are adding some functionality to it
    def handle_starttag(self, tag, attrs):
        # We are looking for the beginning of a link. Links normally look
        # like <a href="www.someurl.com"></a>
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    # We are grabbing the new URL. We are also adding the
                    # base URL to it. For example:
                    # www.saintlad.com is the base and
                    # somepage.html is the new URL (a relative URL)
                    #
                    # We combine a relative URL with the base URL to create
                    # an absolute URL like:
                    # www.saintlad.com/somepage.html
                    newUrl = parse.urljoin(self.baseUrl, value)
                    # And add it to our collection of links:
                    self.links = self.links + [newUrl]

    # This is a new function that we are creating to get links
    # that our spider() function will call
    def getLinks(self, url):
        self.links = []
        # Remember the base URL which will be important when creating
        # absolute URLs
        self.baseUrl = url
        # Use the urlopen function from the standard Python 3 library
        response = urlopen(url)
        # Make sure that we are looking at HTML and not other things that
        # are floating around on the internet (such as
        # JavaScript files, CSS, or .PDFs for example)
        # Note: the header often looks like 'text/html; charset=UTF-8',
        # so check the prefix rather than requiring an exact match
        contentType = response.getheader('Content-Type', '')
        if contentType.startswith('text/html'):
            htmlBytes = response.read()
            # Note that feed() handles Strings well, but not bytes
            # (A change from Python 2.x to Python 3.x)
            htmlString = htmlBytes.decode("utf-8")
            self.feed(htmlString)
            return htmlString, self.links
        else:
            return "",[]

# And finally here is our spider. It takes in a URL, a word to find,
# and the number of pages to search through before giving up
def spider(url, word, maxPages):  
    pagesToVisit = [url]
    numberVisited = 0
    foundWord = False
    # The main loop. Create a LinkParser and get all the links on the page.
    # Also search the page for the word or string
    # In our getLinks function we return the web page
    # (this is useful for searching for the word)
    # and we return a set of links from that web page
    # (this is useful for where to go next)
    while numberVisited < maxPages and pagesToVisit != [] and not foundWord:
        numberVisited = numberVisited +1
        # Start from the beginning of our collection of pages to visit:
        url = pagesToVisit[0]
        pagesToVisit = pagesToVisit[1:]
        try:
            print(numberVisited, "Visiting:", url)
            parser = LinkParser()
            data, links = parser.getLinks(url)
            # Add the links found on this page to the end of our collection
            # of pages to visit, so the crawl can continue even when the
            # word isn't on this particular page:
            pagesToVisit = pagesToVisit + links
            if data.find(word) > -1:
                foundWord = True
                print(" **Success!**")
        except:
            print(" **Failed!**")
    if foundWord:
        print("The word", word, "was found at", url)
    else:
        print("Word never found")