crawler4j example | java web crawler

Why write the whole thing from scratch if it is already available? Better to build on top of it, isn't it?

A framework says: concentrate on your objective rather than on the supporting pieces needed to accomplish it. The framework provides the facilities and tools and expects the programmer to build on top of them.

crawler4j is a framework that provides a multi-threaded crawler with some extra features like logging, customization, crawling password-protected sites, etc.

Here is a small example to get started with crawler4j.
Build environment: Maven

1. Create a simple Maven project.

2. Create a crawler class containing the crawl logic. Below is a sample class.

package vaibscrawl;

import java.util.logging.LogManager;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

/**
 * @author Vaibs
 *
 */

public class MVaibsWecker2 extends WebCrawler {
    // final static Logger logger = Logger.getLogger(MVaibsWecker.class);
    private final static Pattern FILTERS = Pattern
	    .compile(".*(\\.(css|js|gif|jpe?g" + "|png|mp3|mp4|zip|gz))$");

    /**
     * This method receives two parameters: the page in which the new url was
     * discovered, and the new url itself. Implement it to decide whether a
     * given url should be crawled or not (based on your crawling logic). In
     * this example we simply skip urls with css, js, gif, ... extensions; the
     * referringPage parameter is not needed for that decision.
     */
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
	String href = url.getURL().toLowerCase();

	return !FILTERS.matcher(href).matches();
    }

    /**
     * This function is called when a page is fetched and ready to be processed
     * by your program.
     */
    @Override
    public void visit(Page page) {
	String href = page.getWebURL().getURL();
	// Reset java.util.logging handlers to suppress verbose log output.
	LogManager.getLogManager().reset();

	System.out.println("      " + href);
	// Do whatever you want with the url here.

	// You can also read the page content and apply a regex to extract
	// emails, phone numbers, names, or any other specific info (see the
	// sketch below this class).
    }

}
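
The last comment above hints at pulling data out of each fetched page. As a rough sketch of that idea, the visit method could be extended like this (the email regex and printed labels are only illustrative, not part of crawler4j, and it additionally needs imports for edu.uci.ics.crawler4j.parser.HtmlParseData and java.util.regex.Matcher):

    @Override
    public void visit(Page page) {
	String href = page.getWebURL().getURL();
	System.out.println("      " + href);

	// crawler4j exposes the parsed page through getParseData();
	// for HTML pages it is an HtmlParseData instance.
	if (page.getParseData() instanceof HtmlParseData) {
	    HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
	    String text = htmlParseData.getText(); // plain text of the page
	    // Illustrative regex: print anything that looks like an email address.
	    Matcher m = Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.-]+").matcher(text);
	    while (m.find()) {
		System.out.println("email found: " + m.group());
	    }
	}
    }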


3. Create the controller class containing the main method.

package vaibscrawl;

import java.util.Date;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

/**
 * @author Vaibs
 * iseebug.com
 *
 */
public class Controller2 {
    public static void main(String[] args) throws Exception {
	System.out.println(new Date());
	String urli1 = "http://google.com/";

	try {

	    String crawlStorageFolder = "C:\\VaibsHack\\rootLog";
	    int numberOfCrawlers = 1;

	    CrawlConfig config = new CrawlConfig();
	    config.setCrawlStorageFolder(crawlStorageFolder);

	    /*
	     * Instantiate the page fetcher and robots.txt handler for this
	     * crawl.
	     */
	    PageFetcher pageFetcher = new PageFetcher(config);
	    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
	    RobotstxtServer robotstxtServer = new RobotstxtServer(
		    robotstxtConfig, pageFetcher);
	    // AuthInfo authInfo1 = new FormAuthInfo("your email", "your
	    // password", "login url", "idEmail", "");

	    CrawlController controller = new CrawlController(config,
		    pageFetcher, robotstxtServer);

	    /*
	     * For each crawl, you need to add some seed urls. These are the
	     * first URLs that are fetched and then the crawler starts following
	     * links which are found in these pages
	     */
	    controller.addSeed(urli1);
	    // config.addAuthInfo(authInfo1); // register auth info for password-protected sites
	    /*
	     * Start the crawl. This is a blocking operation, meaning that your
	     * code will reach the line after this only when crawling is
	     * finished.
	     */

	    controller.start(MVaibsWecker2.class, numberOfCrawlers);

	} catch (Exception e) {
	    e.printStackTrace();
	} finally {
	    System.out.println(new Date());
	}
    }
}
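
Two things are often added on top of this basic controller. The CrawlConfig has knobs for politeness and crawl size, and the commented-out AuthInfo lines above show how crawler4j can log in to password-protected sites. A rough sketch of both, to be placed where the config is built inside main() (the login URL and form field ids are placeholders, and the auth part needs imports for edu.uci.ics.crawler4j.crawler.authentication.AuthInfo and FormAuthInfo):

	    // Optional tuning on the same config object.
	    config.setPolitenessDelay(1000);   // wait 1000 ms between requests to a host
	    config.setMaxDepthOfCrawling(2);   // follow links at most 2 hops from the seeds
	    config.setMaxPagesToFetch(100);    // stop after 100 pages

	    // Form-based login for password-protected sites (placeholder values).
	    AuthInfo authInfo1 = new FormAuthInfo("your email", "your password",
		    "http://example.com/login", "idEmail", "idPassword");
	    config.addAuthInfo(authInfo1);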

4. pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>myCrawl</groupId>
	<artifactId>vaibscrawl</artifactId>
	<version>0.0.1-SNAPSHOT</version>
	<properties>
		<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
	</properties>
	<dependencies>
		<dependency>
			<groupId>edu.uci.ics</groupId>
			<artifactId>crawler4j</artifactId>
			<version>4.0</version>
			<type>jar</type>
			<scope>compile</scope>
		</dependency>
		<dependency>
			<groupId>log4j</groupId>
			<artifactId>log4j</artifactId>
			<version>1.2.17</version>
		</dependency>
	</dependencies>
</project>

Download link to the example: crawler4j example with maven
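
To try it out, build with mvn clean compile from the project root (assuming the two classes live under src/main/java/vaibscrawl) and run Controller2 from your IDE, or via mvn exec:java -Dexec.mainClass=vaibscrawl.Controller2 if the exec-maven-plugin is available. The crawled URLs are printed to the console and crawl state is kept under the configured storage folder.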
