Java web crawler searcher robot that sends e-mail



This java crawler is extremely useful if you need to search a webpage for a specific word, tag or whatever you want to analyze in the data retrieved from a given URL.

I’ve used it for example to search for a specific error message that appeared in a page when a connection to the database could not be done. It helped me to prove that the error was really caused as a consequence of the connection link failure to the database.

The crawler saves in the file system the page that contains the string you’re searching for. The name of the file contains the time from when the string was found within the page body. With this information I could match the time information present on the file name with the time accompanying the error present in the web server log.

The code was originally developed by Rodrigo Gama that is a fellow developer/coworker of mine. I just adapted the code a little bit to fit my needs.

What’s the idea behind the crawler?
The main idea behind the crawler is the following:

You pass 2 essential parameters to run the application - these are the string you want to search for and the URLs you want to verify.

A thread for each URL is then created. This is done using the PageVerificationThread.java class that implements Runnable.

The PageVerificationThread creates a notificator object that is responsible for calling the MailSender object that in its turn sends a notification (message) to the emails you hardcoded in the Main.java class.

The message is also hardcoded inside the run() method of PageVerificationThread class.

I advise you to read the comments in the code.

You’ll have to change some strings in the code as is the case of the username and password used to send the e-mails.

The Code
The crawler has 4 classes: MailSender.java, Main.java, Notificator.java and PageVerificationThread.java.

This is the Main class:

/**
 * @authors Rodrigo Gama (main developer)
 *          Leniel Macaferi (minor modifications and additions)
 * @year 2009
 */

import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.List;

public class Main
{
    public static void main(String[] args) throws MalformedURLException
    {
        // URLs that will be verified
        List<String> urls = new ArrayList<String>();

        // Emails that will receive the notification
        String[] emails = new String[]
        { "youremail@gmail.com" };

        // Checking for arguments
        if(args == null || args.length < 2 || args[0] == null)
        {
            System.out.println("Usage: <crawler.jar> <search string> <URLs>");

            System.exit(0);
        }
        else
        {
            // Printing some messages to the screen
            System.out.println("Searching for " + args[0] + "...");
            System.out.println("On:");

            // Showing the URLs that will be verified and adding them to the paths variable
            for(int i = 1; i < args.length; i++)
            {
                System.out.println(args[i]);

                urls.add(args[i]);
            }
        }

        // For each URL we create a PageVerificationThread passing to it the URL address, the token to
        // search for and the destination emails.
        for(int i = 0; i < urls.size(); i++)
        {
            Thread t = new Thread(new PageVerificationThread(urls.get(i), args[0], emails));

            t.start();
        }
    }
}

This is the PageVerificationThread class:

/**
 * @authors Rodrigo Gama (main developer)
 *          Leniel Macaferi (minor modifications and additions)
 * @year 2009
 */

import java.io.BufferedReader;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Calendar;
import java.util.Properties;
import java.util.TimeZone;

public class PageVerificationThread implements Runnable
{
    private String                strUrl;
    private String                searchFor;
    private static Notificator    notificator = null;
    private static Object         lock        = new Object();
    private int                   numAttempts = 0;

    public PageVerificationThread(String strUrl, String searchFor, String[] emails)
    {
        this.strUrl = strUrl;
        this.searchFor = searchFor;

        synchronized(lock)
        {
            if(notificator == null)
            {
                notificator = new Notificator();

                // For each email, adds it to the notificator "to" list.
                for(int i = 0; i < emails.length; i++)
                {
                    notificator.addDesetination(emails[i]);
                }
            }
        }
    }

    public void run()
    {
        try
        {
            URL url = new URL(strUrl);

            // Time interval to rerun the thread
            float numMinutes = 1;

            while(true)
            {
                try
                {
                    Properties systemProperties = System.getProperties();
                    systemProperties.put("http.proxyHost",
                            "proxy.yourdomain.com");
                    systemProperties.put("http.proxyPort", "3131");
                    System.setProperties(systemProperties);

                    URLConnection conn = url.openConnection();
                    conn.setDoOutput(true);

                    // Get the response content
                    BufferedReader rd = new BufferedReader(
                            new InputStreamReader(conn.getInputStream()));
                   
                    String line;
                   
                    StringBuilder document = new StringBuilder();

                    // A calendar to configure the time
                    Calendar calendar = Calendar.getInstance();
                    TimeZone tz = TimeZone.getTimeZone("America/Sao_Paulo");
                    calendar.setTimeZone(tz);
                    calendar.add(Calendar.SECOND, 9);
                    String timeStamp = calendar.getTime().toString();

                    boolean error = false;

                    // For each line of code contained in the response
                    while((line = rd.readLine()) != null)
                    {
                        document.append(line + "\n");

                        // If the line contains the text we're after...
                        if(line.contains(searchFor))
                        {// "is temporarily unavailable."))
                            // {
                            error = true;
                        }
                    }

                    // System.out.println(document.toString());
                   
                    // If we found the token...
                    if(error)
                    {
                        // Prints a message to the console
                        System.out.println("Found " + searchFor + " on " + strUrl);

                        // Sends the e-mail
                        notificator.notify("Found " + searchFor + " on " + strUrl);

                        // Writing the file to the file system with time information
                        FileWriter fw = null;

                        try
                        {
                            String dir = "C:/Documents and Settings/leniel-macaferi/Desktop/out/" + strUrl.replaceAll("[^A-Za-z0-9]", "_") + "/";

                            File file = new File(dir);
                            file.mkdirs();
                            file = new File(dir + timeStamp.replaceAll("[^A-Za-z0-9]", "_") + ".html");
                            file.createNewFile();

                            fw = new FileWriter(file);
                            fw.append(document);
                        }
                        finally
                        {
                            if(fw != null)
                            {
                                fw.flush();
                                fw.close();
                            }
                        }
                    }

                    // If error we reduce the time interval
                    if(error)
                    {
                        numMinutes = 0.5f;
                    }
                    else
                    {
                        numMinutes = 1;
                    }

                    try
                    {
                        Thread.sleep((long) (1000 * 60 * numMinutes));
                    }
                    catch(InterruptedException e)
                    {
                        e.printStackTrace();
                    }

                    // A counter to show us the number of attempts so far
                    numAttempts++;

                    System.out.println("Attempt: " + numAttempts + " on " + strUrl + " at " + calendar.getTime().toString());
                }
                catch(IOException e)
                {
                    e.printStackTrace();
                }
            }
        }
        catch(MalformedURLException m)
        {
            m.printStackTrace();
        }
    }
}

This is the Notificator class:

/**
 * @author Rodrigo Gama
 * @year 2009
 */

import java.util.ArrayList;
import java.util.List;
import javax.mail.MessagingException;
import javax.mail.internet.AddressException;

public class Notificator
{
    private List<String>    to   = new ArrayList<String>();
    private String          from = "leniel-macaferi";

    public void addDesetination(String dest)
    {
        to.add(dest);
    }

    public synchronized void notify(String message)
    {
        try
        {
            // Sends the e-mail
            MailSender.sendMail(from, to.toArray(new String[] {}), message);
        }
        catch (AddressException e)
        {
            e.printStackTrace();
        }
        catch (MessagingException e)
        {
            e.printStackTrace();
        }
    }
}

This is the MailSender class:

/**
 * @authors Rodrigo Gama
 * @year 2009
 */

import java.util.Properties;
import javax.mail.Authenticator;
import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.PasswordAuthentication;
import javax.mail.Session;
import javax.mail.Transport;
import javax.mail.internet.AddressException;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeMessage;

public class MailSender
{
    public static void sendMail(String from, String[] toArray,
            String messageText) throws AddressException, MessagingException
    {
        // Get system properties
        Properties props = System.getProperties();

        // Setup mail server (here we’re using Gmail) 
        props.put("mail.smtp.host", "smtp.gmail.com");
        props.put("mail.smtp.starttls.enable", "true");
        props.put("mail.smtp.auth", "true");

        // Get session
        Authenticator auth = new MyAuthenticator();
        Session session = Session.getDefaultInstance(props, auth);

        // Define message
        MimeMessage message = new MimeMessage(session);
        message.setFrom(new InternetAddress(from));

        for(int i = 0; i < toArray.length; i++)
        {
            String to = toArray[i];

            message.addRecipient(Message.RecipientType.TO, new InternetAddress(to));
        }

        message.setSubject("The e-mail subject goes here!");
        message.setText(messageText);

        // Send message
        Transport.send(message);
    }
}

class MyAuthenticator extends Authenticator
{
    MyAuthenticator()
    {
        super();
    }

    protected PasswordAuthentication getPasswordAuthentication()
    {
        // Your e-mail your username and password
        return new PasswordAuthentication("username", "password");
    }
}

How to use it?
Using Eclipse you just have to run it as shown in this picture:

Java Crawler Run Configuration in Eclipse

I hope you make good use of it!

Source Code
Here it is for your delight: http://leniel.googlepages.com/JavaCrawler.zip