daniel shiffman

Crawling

Examples:

  • simple URL retrieval: A2ZUrlReader.java, UrlGrab.java
  • Valentine’s day crawler: valentinecrawler.zip, Crawler.java, Crawling.java
  • Example that uses WebSphinx (a Java open-source crawling library): MyCrawler.java, WebSphinxTest.java, web_sphinx.zip.
  • Related:

  • Guidelines for Robot Writers — examples above are terribly impolite!
  • The Anatomy of a Large-Scale Hypertextual Web Search Engine
  • Web Dynamics, Web Search, Web Crawling, Web Characterization, Link Analysis, Search Engines and more (suggested)
  • Exercises (optional):

  • Rewrite the crawler to follow robots.txt. Investigate using Norbert
  • Rewrite the crawler to deal with relative paths, i.e. href="relative/index.html" as well as href="http://www.site.com/relative.html" (see the sketch after this list)
  • Rewrite the crawler to place all the words it encounters (ignoring HTML?) into a concordance.
  • Improve the crawler’s functionality to keep track of how many sites link to any given URL.
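
    For the relative-path exercise, one hint: the java.net.URL class can resolve a relative reference against a base URL for you. Here is a minimal sketch of that idea; the URLs and the class name are made up for illustration.

    import java.net.MalformedURLException;
    import java.net.URL;

    public class RelativeUrlExample {
      public static void main(String[] args) throws MalformedURLException {
        // The page the link was found on (hypothetical base URL)
        URL base = new URL("http://www.site.com/section/index.html");

        // An absolute reference resolves to itself
        URL absolute = new URL(base, "http://www.site.com/relative.html");

        // A relative reference is resolved against the base URL's directory
        URL relative = new URL(base, "relative/index.html");

        System.out.println(absolute);  // http://www.site.com/relative.html
        System.out.println(relative);  // http://www.site.com/section/relative/index.html
      }
    }
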
    Grabbing a URL

    So far, we’ve only looked at how to retrieve textual information from a local file. Of course, there’s also that wacky world wide web thing and it’s not terribly difficult for us to expand the functionality of our examples to include scraping information from a URL. Here, we’ll need to introduce the java.net package, which will allow us to open an InputStream from a URL object.

    String urlpath = "http://www.mycoolwebsite.com";
    URL url = new URL(urlpath);
    InputStream stream = url.openStream();
    

    Once we have the InputStream, we can use a BufferedReader to read the stream (i.e. the source from the URL) line by line, appending it all to one big StringBuffer.

    StringBuffer stuff = new StringBuffer();
    //Create a BufferedReader from the InputStream
    BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
    
    // Read the stream line by line and append to StringBuffer
    String line;
    while (( line = reader.readLine()) != null) {
      stuff.append(line + "\n");
    }
    // Close the reader
    reader.close();
    

    Finally, we can simply convert the StringBuffer into a good old fashioned String and have our way with it.

    String urlContent = stuff.toString();
    

    Source for URL reading class: A2ZUrlReader.java
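
    The A2ZUrlReader class linked above wraps these steps up behind a constructor that takes the URL path and a getContent() method (which is how it is used later on this page). As a rough idea of what such a class looks like, here is a minimal sketch along the same lines; the real A2ZUrlReader may differ in its details, so treat this as an illustration rather than the actual source.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.net.URL;

    // A minimal sketch of a URL-reading class (not the actual A2ZUrlReader source)
    public class SimpleUrlReader {

      private String content;   // The full source retrieved from the URL

      public SimpleUrlReader(String urlpath) throws IOException {
        // Open a stream from the URL
        URL url = new URL(urlpath);
        InputStream stream = url.openStream();
        // Read the stream line by line, appending to one big StringBuffer
        BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
        StringBuffer stuff = new StringBuffer();
        String line;
        while ((line = reader.readLine()) != null) {
          stuff.append(line + "\n");
        }
        reader.close();
        content = stuff.toString();
      }

      public String getContent() {
        return content;
      }
    }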

    Making a list and keeping it twice

    Sure, visiting one URL and grabbing data is super duper fun. But there is more joy out there and it comes in the form of crawling the web via a queue of URLs to visit. To accomplish this, we’ll dive into the linked list data structure. Sure, we could use an ArrayList here to keep track of multiple URLs in order, but a linked list will be more efficient because we only need some very basic functionality. We need to:

  • Add URLs to the end of the list (i.e. queue)
  • Get the first URL from the list
    (Note that we do not need to traverse the list itself; we only need to access the first element and add elements to the end.)

    It’s pretty simple to use the Java LinkedList class.

    LinkedList urlsToVisit = new LinkedList();           // A queue of URLs to visit
    urlsToVisit.add("http://www.yahoo.com");             // Add a URL (as a String) to the list
    urlsToVisit.add("http://www.google.com");
    urlsToVisit.add("http://itp.nyu.edu");
    String urlpath = (String) urlsToVisit.removeFirst(); // Get the first URL (as a String)
    

    We could do something fancier, such as remove all the elements from the list one by one (as long as the list isn’t empty):

    while (!urlsToVisit.isEmpty()) {
      String urlpath = (String) urlsToVisit.removeFirst();
      // Presumably do something here!
    }
    

    For our web crawler project, however, one list will not do. For example, the following code would be somewhat tragic:

    urlsToVisit.add("http://www.sadlittleurl.com");
    urlsToVisit.add("http://www.sadlittleurl.com");
    urlsToVisit.add("http://www.sadlittleurl.com");
    urlsToVisit.add("http://www.sadlittleurl.com");
    urlsToVisit.add("http://www.sadlittleurl.com");
    urlsToVisit.add("http://www.sadlittleurl.com");
    urlsToVisit.add("http://www.sadlittleurl.com");
    

    We shouldn’t be visiting the same URL over and over again. This is why we’ll need a second list, one that keeps track of the URLs previously visited. Ideally, a data structure that had constant time look-up — O(1) — would be great. All we want to figure out is — here’s a URL, are you in the list or not? Our friendly neighborhood hash table from last week will do just fine.

    LinkedList urlsToVisit = new LinkedList();   // A queue of URLs to visit
    HashMap urlsVisited = new HashMap();      // A hash table of visited URLs
    String url = "http://www.abcdefg.com";
    urlsVisited.put(url,url);
    urlsToVisit.add(url);
    

    For each URL, we place it in both lists. (Note that the HashMap requires a key and a value, whereas the LinkedList requires only a value.) This way, when it comes time to look at a new URL, we can check and see if it’s been visited first before we waste our time and look at it again.

    String url = "http://www.iamtiredofmakingupurls.com";
    if (!urlsVisited.containsKey(url)) {
      urlsVisited.put(url,url);
      urlsToVisit.add(url);
    }
    

    We can now see how we are well on the way to writing a web crawler. We start with one URL, grab it, look for other URLs, check if we’ve already visited them, if not add them to the list, and start all over with the next URL!

    Being Polite

    Ok, so we’re on our way to writing a web crawler. Web crawlers, sadly, can be somewhat rude. They can start poking their noses in places they shouldn’t. They can hog a lot of resources. As humble citizens of the web, we should do our best to follow these guidelines as well as the rules for robot exclusion. The examples included on this page, for the sake of simplicity, do not live up to these standards and may be construed as rather impolite. If you plan on using them extensively, you will likely want to consider making some significant improvements.
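
    As a rough starting point for the robots.txt exercise above, a crawler can fetch /robots.txt from each site and skip any path that appears in a Disallow rule. The sketch below is a deliberately simplified reading of the robots exclusion format (it ignores User-agent sections and Allow rules and applies every Disallow line to our crawler), just to show the shape of the check; the RobotsCheck class and allowed() method are names made up for this example.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;

    // A deliberately simplified robots.txt check (illustration only):
    // collect every Disallow path on a site and test whether a given
    // URL's path starts with one of them.
    public class RobotsCheck {

      public static boolean allowed(String urlpath) {
        try {
          URL url = new URL(urlpath);
          URL robots = new URL(url.getProtocol() + "://" + url.getHost() + "/robots.txt");
          BufferedReader reader = new BufferedReader(new InputStreamReader(robots.openStream()));
          ArrayList disallowed = new ArrayList();
          String line;
          while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.toLowerCase().startsWith("disallow:")) {
              String path = line.substring("disallow:".length()).trim();
              if (path.length() > 0) disallowed.add(path);
            }
          }
          reader.close();
          // Refuse the URL if its path falls under any Disallow rule
          String path = url.getPath();
          for (int i = 0; i < disallowed.size(); i++) {
            if (path.startsWith((String) disallowed.get(i))) return false;
          }
        } catch (Exception e) {
          // No robots.txt (or it couldn't be read): assume we're allowed
        }
        return true;
      }
    }

    One natural place to hook such a check in would be the read() method of the Crawler class below, adding a found URL only when the check passes.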

    Here is the pseudo-code for our simple crawler:

  • 1 — Set-up LinkedList and HashMap
  • 2 — Add one URL to both
  • 3 — Grab source from first URL in queue
  • 4 — Look for more URLs in source
  • 5 — For every found URL, if it is already in the HashMap, ignore, if not, add it to both the queue and hash
  • 6 — Repeat steps 3 – 5
    So far, we’ve covered the necessary code for every step except #4.

    Finding URLs

    Remembering regular expressions, we can create a Pattern object to match any reference to a URL.

    URL references on web pages generally look like this:

    <a href = "http://www.someurl.com">
    

    For simplicity’s sake, we’ll ignore any references to links specified by a relative path and assume that the URL has to start with “http”. We can begin to build a regular expression. . .

    href            # match href
    \s*             # match zero or more whitespace characters
    =               # match equal sign
    \s*             # match zero or more whitespace characters
    "               # match quote
    (http[^"\s]+)   # capture the URL itself: http followed by one or more non-whitespace, non-quote characters
    "               # end with a quote
    

    Sure, this could be improved a number of ways, but it will do as a start. In Java, it will look like this:

    Pattern href;
    // Match URLs
    // Note using Pattern.COMMENTS flag which ignores white spaces and anything after '#' in a regex
    href = Pattern.compile( "href              # match href\n" +
                            "\\s*=\\s*\"       # 0 or more spaces, =, 0 or more spaces, quote\n" +
                            "(http[^\"\\s]*)   # capture the URL itself, http followed by no spaces and no quotes\n" +
                            "\"                # ending with a quote\n",
                            Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);
    

    Now, as we grab some html source code, we can look for URLs and add them if appropriate:

    // Grab the URL content
    A2ZUrlReader urlr = new A2ZUrlReader(urlpath);
    String stuff = urlr.getContent();
    
    // Match the URL pattern to the content
    Matcher m = href.matcher(stuff);
    // While there are URLs
    while (m.find()) {
      // Grab the captured part of the regex (the URLPath itself)
      String newurl = m.group(1);
      // If it hasn't already been visited
      if (!urlsVisited.containsKey(newurl)) {
        // Add it to both the LinkedList and the HashMap
        urlsToVisit.add(newurl);
        urlsVisited.put(newurl,newurl);
      }
    }
    

    Now, not all URLs are treated equally. If our goal, for example, is to analyze text on the web, we can certainly ignore references to media files, such as jpgs, movs, etc. We’ll use a second regular expression for this:

    // We will ignore URLs ending with certain extensions
    String ignore = ".*(mov|jpg|gif|pdf)$";
    

    This allows us not only to check whether the URL has already been visited, but also to confirm that it is not one of the excluded file types.

    if (!newurl.matches(ignore) && !urlsVisited.containsKey(newurl)) {
      urlsToVisit.add(newurl);
      urlsVisited.put(newurl,newurl);
    }
    

    A Crawler Class

    The following Crawler class encompasses all of the basic elements described above. Its data consists of a queue of URLs, a hash table of visited URLs, a regex for matching URL references, and a regex for URLs to ignore. It has methods to add a new URL, crawl to the next URL, read the source of a URL (called in the ‘crawl’ method), and check whether the queue is empty.

    /* Daniel Shiffman               */
    /* Programming from A to Z       */
    /* Spring 2006                   */
    /* http://shiffman.net       */
    /* daniel.shiffman@nyu.edu       */
    
    // Simple example of a web crawler
    // URL queue: linked list
    // Sites already visited: hash table
    
    // Needs to be updated to comply with ROBOTS.TXT!
    
    package a2z;
    
    import java.util.*;
    import java.util.regex.*;
    
    public class Crawler {
    
      private LinkedList urlsToVisit; // A queue of URLs to visit
      private HashMap urlsVisited;    // A table of already visited URLs
      private Pattern href;           // A Pattern to match an href tag
      private String ignore;          // To be used as a regex for ignoring media files (JPG,MOV, etc.)
    
      public Crawler() {
        urlsToVisit = new LinkedList();
        urlsVisited = new HashMap();
    
        // Match URLs
        // Note using Pattern.COMMENTS flag which ignores white spaces and anything after '#' in a regex
        href = Pattern.compile( "href              # match href\n" +
                                "\\s*=\\s*\"       # 0 or more spaces, =, 0 or more spaces, quote\n" +
                                "(http[^\"\\s]*)   # capture the URL itself, http followed by no spaces and no quotes\n" +
                                "\"                # ending with a quote\n",
                                Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);
    
        // We will ignore URLs ending with certain extensions
        ignore = ".*(mov|jpg|gif|pdf)$";
      }
    
      public void addUrl(String urlpath) {
        // Add it to both the LinkedList and the HashMap
        urlsToVisit.add(urlpath);
        urlsVisited.put(urlpath,urlpath);
    
      }
    
      public boolean queueEmpty() {
        return urlsToVisit.isEmpty();
      }
    
      public void crawl() {
          String urlpath = (String) urlsToVisit.removeFirst();
          read(urlpath);
      }
    
      private void read(String urlpath) {
        System.out.println(urlsVisited.size() + " " + urlsToVisit.size() + " " + urlpath);
        try {
          // Grab the URL content
          A2ZUrlReader urlr = new A2ZUrlReader(urlpath);
          String stuff = urlr.getContent();
    
          // OK I COULD DO SOMETHING WITH ALL OF THE CONTENT HERE!!
    
          // Match the URL pattern to the content
          Matcher m = href.matcher(stuff);
          // While there are URLs
          while (m.find()) {
            // Grab the captured part of the regex (the URLPath itself)
            String newurl = m.group(1);
            // If it doesn't match the ignore pattern and hasn't already been visited
            if (!newurl.matches(ignore) && !urlsVisited.containsKey(newurl)) {
              addUrl(newurl);
            }
          }
        } catch (Exception e) {
          // System.out.println("Problem reading from " + urlpath + " " + e);
          // e.printStackTrace();
        }
      }
    
    }
    

    It is important to note that this crawler does not loop on its own. The driver program must check and see if the queue has elements and call the “crawl” method in a loop, i.e.:

    // Create a crawler object
    Crawler crawler = new Crawler();
    // Put the URL into the crawler object
    crawler.addUrl("http://www.blahlbah.com");
    while (!crawler.queueEmpty()) {
      crawler.crawl();
    }
    

    In the full example, I’ve built in a little safety net so that the crawler doesn’t run rampant (although it might be fun to let it do so!).

    // Since this crawler isn't particularly polite: http://www.robotstxt.org/wc/guidelines.html
    // I'm limiting it to 100 URL requests
    int count = 0; int limit = 100;
    while (!crawler.queueEmpty()) {
      crawler.crawl();
      count++;
      if (count > limit) break;
    }
    

    Writing one’s own crawler is a useful and productive exercise; however, if all you need is basic crawling functionality, another option is to use an existing crawling library, such as WebSphinx. To use WebSphinx, simply download websphinx.jar and import it into your Eclipse project (make sure to add it to the “build path” as well). You can then create your own class that extends Crawler. By overriding the visit() method, you can customize what you want the Crawler to do each time it visits a page.

    // What to do when we visit the page
    public void visit(Page page) {
        System.out.println("Visiting: " + page.getTitle());
        String content = page.getContent();
        // Do something with the content String!
    
        // Since we don't need to retain the pages
        // This code helps with memory management
        page.getOrigin().setPage(null);
        page.discardContent();
    }
    

    Here is an example of a basic Crawler implemented via WebSphinx. For other possibilities, take a look at the WebSphinx API.
