Java web crawler searcher robot that sends e-mail

This java crawler is extremely useful if you need to search a webpage for a specific word, tag or whatever you want to analyze in the data retrieved from a given URL.

I’ve used it for example to search for a specific error message that appeared in a page when a connection to the database could not be done. It helped me to prove that the error was really caused as a consequence of the connection link failure to the database.

The crawler saves in the file system the page that contains the string you’re searching for. The name of the file contains the time from when the string was found within the page body. With this information I could match the time information present on the file name with the time accompanying the error present in the web server log.

The code was originally developed by Rodrigo Gama that is a fellow developer/coworker of mine. I just adapted the code a little bit to fit my needs.

What’s the idea behind the crawler?
The main idea behind the crawler is the following:

You pass 2 essential parameters to run the application - these are the string you want to search for and the URLs you want to verify.

A thread for each URL is then created. This is done using the PageVerificationThread.java class that implements Runnable.

The PageVerificationThread creates a notificator object that is responsible for calling the MailSender object that in its turn sends a notification (message) to the emails you hardcoded in the Main.java class.

The message is also hardcoded inside the run() method of PageVerificationThread class.

I advise you to read the comments in the code.

You’ll have to change some strings in the code as is the case of the username and password used to send the e-mails.

The Code
The crawler has 4 classes: MailSender.java, Main.java, Notificator.java and PageVerificationThread.java.

This is the Main class:

/**
 * @authors Rodrigo Gama (main developer)
 *          Leniel Macaferi (minor modifications and additions)
 * @year 2009
 */

import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.List;

public class Main
{
    public static void main(String[] args) throws MalformedURLException
    {
        // URLs that will be verified
        List<String> urls = new ArrayList<String>();

        // Emails that will receive the notification
        String[] emails = new String[]
        { "youremail@gmail.com" };

        // Checking for arguments
        if(args == null || args.length < 2 || args[0] == null)
        {
            System.out.println("Usage: <crawler.jar> <search string> <URLs>");

            System.exit(0);
        }
        else
        {
            // Printing some messages to the screen
            System.out.println("Searching for " + args[0] + "...");
            System.out.println("On:");

            // Showing the URLs that will be verified and adding them to the paths variable
            for(int i = 1; i < args.length; i++)
            {
                System.out.println(args[i]);

                urls.add(args[i]);
            }
        }

        // For each URL we create a PageVerificationThread passing to it the URL address, the token to
        // search for and the destination emails.
        for(int i = 0; i < urls.size(); i++)
        {
            Thread t = new Thread(new PageVerificationThread(urls.get(i), args[0], emails));

            t.start();
        }
    }
}

This is the PageVerificationThread class:

/**
 * @authors Rodrigo Gama (main developer)
 *          Leniel Macaferi (minor modifications and additions)
 * @year 2009
 */

import java.io.BufferedReader;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Calendar;
import java.util.Properties;
import java.util.TimeZone;

public class PageVerificationThread implements Runnable
{
    private String                strUrl;
    private String                searchFor;
    private static Notificator    notificator = null;
    private static Object         lock        = new Object();
    private int                   numAttempts = 0;

    public PageVerificationThread(String strUrl, String searchFor, String[] emails)
    {
        this.strUrl = strUrl;
        this.searchFor = searchFor;

        synchronized(lock)
        {
            if(notificator == null)
            {
                notificator = new Notificator();

                // For each email, adds it to the notificator "to" list.
                for(int i = 0; i < emails.length; i++)
                {
                    notificator.addDesetination(emails[i]);
                }
            }
        }
    }

    public void run()
    {
        try
        {
            URL url = new URL(strUrl);

            // Time interval to rerun the thread
            float numMinutes = 1;

            while(true)
            {
                try
                {
                    Properties systemProperties = System.getProperties();
                    systemProperties.put("http.proxyHost",
                            "proxy.yourdomain.com");
                    systemProperties.put("http.proxyPort", "3131");
                    System.setProperties(systemProperties);

                    URLConnection conn = url.openConnection();
                    conn.setDoOutput(true);

                    // Get the response content
                    BufferedReader rd = new BufferedReader(
                            new InputStreamReader(conn.getInputStream()));
                   
                    String line;
                   
                    StringBuilder document = new StringBuilder();

                    // A calendar to configure the time
                    Calendar calendar = Calendar.getInstance();
                    TimeZone tz = TimeZone.getTimeZone("America/Sao_Paulo");
                    calendar.setTimeZone(tz);
                    calendar.add(Calendar.SECOND, 9);
                    String timeStamp = calendar.getTime().toString();

                    boolean error = false;

                    // For each line of code contained in the response
                    while((line = rd.readLine()) != null)
                    {
                        document.append(line + "\n");

                        // If the line contains the text we're after...
                        if(line.contains(searchFor))
                        {// "is temporarily unavailable."))
                            // {
                            error = true;
                        }
                    }

                    // System.out.println(document.toString());
                   
                    // If we found the token...
                    if(error)
                    {
                        // Prints a message to the console
                        System.out.println("Found " + searchFor + " on " + strUrl);

                        // Sends the e-mail
                        notificator.notify("Found " + searchFor + " on " + strUrl);

                        // Writing the file to the file system with time information
                        FileWriter fw = null;

                        try
                        {
                            String dir = "C:/Documents and Settings/leniel-macaferi/Desktop/out/" + strUrl.replaceAll("[^A-Za-z0-9]", "_") + "/";

                            File file = new File(dir);
                            file.mkdirs();
                            file = new File(dir + timeStamp.replaceAll("[^A-Za-z0-9]", "_") + ".html");
                            file.createNewFile();

                            fw = new FileWriter(file);
                            fw.append(document);
                        }
                        finally
                        {
                            if(fw != null)
                            {
                                fw.flush();
                                fw.close();
                            }
                        }
                    }

                    // If error we reduce the time interval
                    if(error)
                    {
                        numMinutes = 0.5f;
                    }
                    else
                    {
                        numMinutes = 1;
                    }

                    try
                    {
                        Thread.sleep((long) (1000 * 60 * numMinutes));
                    }
                    catch(InterruptedException e)
                    {
                        e.printStackTrace();
                    }

                    // A counter to show us the number of attempts so far
                    numAttempts++;

                    System.out.println("Attempt: " + numAttempts + " on " + strUrl + " at " + calendar.getTime().toString());
                }
                catch(IOException e)
                {
                    e.printStackTrace();
                }
            }
        }
        catch(MalformedURLException m)
        {
            m.printStackTrace();
        }
    }
}

This is the Notificator class:

/**
 * @author Rodrigo Gama
 * @year 2009
 */

import java.util.ArrayList;
import java.util.List;
import javax.mail.MessagingException;
import javax.mail.internet.AddressException;

public class Notificator
{
    private List<String>    to   = new ArrayList<String>();
    private String          from = "leniel-macaferi";

    public void addDesetination(String dest)
    {
        to.add(dest);
    }

    public synchronized void notify(String message)
    {
        try
        {
            // Sends the e-mail
            MailSender.sendMail(from, to.toArray(new String[] {}), message);
        }
        catch (AddressException e)
        {
            e.printStackTrace();
        }
        catch (MessagingException e)
        {
            e.printStackTrace();
        }
    }
}

This is the MailSender class:

/**
 * @authors Rodrigo Gama
 * @year 2009
 */

import java.util.Properties;
import javax.mail.Authenticator;
import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.PasswordAuthentication;
import javax.mail.Session;
import javax.mail.Transport;
import javax.mail.internet.AddressException;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeMessage;

public class MailSender
{
    public static void sendMail(String from, String[] toArray,
            String messageText) throws AddressException, MessagingException
    {
        // Get system properties
        Properties props = System.getProperties();

        // Setup mail server (here we’re using Gmail) 
        props.put("mail.smtp.host", "smtp.gmail.com");
        props.put("mail.smtp.starttls.enable", "true");
        props.put("mail.smtp.auth", "true");

        // Get session
        Authenticator auth = new MyAuthenticator();
        Session session = Session.getDefaultInstance(props, auth);

        // Define message
        MimeMessage message = new MimeMessage(session);
        message.setFrom(new InternetAddress(from));

        for(int i = 0; i < toArray.length; i++)
        {
            String to = toArray[i];

            message.addRecipient(Message.RecipientType.TO, new InternetAddress(to));
        }

        message.setSubject("The e-mail subject goes here!");
        message.setText(messageText);

        // Send message
        Transport.send(message);
    }
}

class MyAuthenticator extends Authenticator
{
    MyAuthenticator()
    {
        super();
    }

    protected PasswordAuthentication getPasswordAuthentication()
    {
        // Your e-mail your username and password
        return new PasswordAuthentication("username", "password");
    }
}

How to use it?
Using Eclipse you just have to run it as shown in this picture:

Java Crawler Run Configuration in Eclipse

I hope you make good use of it!

Source Code
Here it is for your delight: http://leniel.googlepages.com/JavaCrawler.zip

Doing maintenance on Chemtech's site

During the second half of July and the first half of August I was working on Chemtech's site doing some maintenance. It was a good job because I could get to know new technology as is the case of Liferay. Liferay is a great enterprise portal that allows you to create a complete website solution coded in Java.

I also could verify the quality of the new Eclipse 3.5 IDE codenamed Galileo that I used during the maintenance. It's a great IDE to Java developers. It has lots of plugins that allow you to work with practically any kind of programming technology inside a fantastic set of windows for every type of task. Before using Eclipse I had only worked with NetBeans to do Java development.

I improved my skills about Tomcat too.

Chemtech's site also known as chemsite

During a time like this you accelerate the learning process and get to know new things which are very important for any software developer.

After fixing some bugs on the site and writing in chemsite’s project wiki everything I grasped and did I came back to Volta Redonda for a two week job on CSN's MES; more on this in the next post.

You see, for the past 10 months I’ve worked with ASP.NET and Oracle (Braskem) and then I switched to work with Java and MySQL for 1 month (chemsite). Now at CSN I'm working with Visual Basic and SQL Server.

This shows that in today's world there's no bullet proof technology when we talk about programming languages and database systems. Each company has its own legacy systems that date back to two or one decade ago and such systems require certain types of interfaces developed in certain types of technologies. For example, if you look to 10 years ago (1999), C# wasn't even a programming language and so Visual Basic predominated during that time. It is most of the times impossible to a company to redevelop a really big system in a new programming language that is today's bullet proof. Those big systems consumed a lot of time and money to be constructed and in the future the programming language that is today’s bullet proof may not stand out.

As software professional you must act with any tool that is put in your hands.

I like what I do and no matter the tool I use I'm always satisfied with my job because of course, I do what I like to do, that is, software development.

To explain why Chemtech is a great place to work for in Brazil I think the above text says it all. In just 1 year I had the opportunity to work in different projects that use different technologies. A great way to leverage a career.

Thanks Jesus for that! :)

Creating Excel spreadsheets .XLS and .XLSX in C#

Interesting discussion at StackOverflow:
Create Excel (.XLS and .XLSX) file from C#

If you want to see how to combine NPOI + Excel Table and Chart,
take a look at the post titled NPOI with Excel Table and dynamic Chart.

NPOI 2.0 series of posts scheduled

Recently I had to implement some code to create an Excel spreadsheet/report using C#.

The task was: given an Excel spreadsheet template - a .XLS file (with formulas, pivot tables, macros, etc) I had to fill some data in one of the sheets of the spreadsheet and send this modified spreadsheet back to the user requesting such an operation (Excel report).

The following attests the need for Excel nowadays:

Excel has long been recognized as the de facto standard when it comes to presenting management reports. The unique combination of great calculation engine, excellent charting facilities, pivot tables and the possibility to perform “what if” analysis, make it the “must have” business intelligence tool.
by John Tunnicliffe

I had a great time while studying the possible ways of doing what the task asks for.

It appears to be a simple task at first but as the time passes by you get to know that this is not the case, well, till the moment this blog post was written at least, I think. :-)

Firstly I tried automating Excel through COM automation, but as the application was meant to be used in a web context it is not recommended to use automation. Why? Because COM automation for Excel is not thread safe, that is, EXCEL.EXE was not constructed to be used by concurrent users accessing the same process, in this case EXCEL.EXE; besides, Microsoft Excel must be installed on the server what is not always possible.

For more details on why you shouldn’t use Excel on the server, read this article on Microsoft Help and Support site: Considerations for server-side Automation of Office. The key part is this:

Microsoft does not currently recommend, and does not support, Automation of Microsoft Office applications from any unattended, non-interactive client application or component (including ASP, ASP.NET, DCOM, and NT Services), because Office may exhibit unstable behavior and/or deadlock when Office is run in this environment.

I’ve just experienced the above. EXCEL.EXE insisted in being alive in task manager even after processing the data and being closed in code. Each call to process a spreadsheet opens an EXCEL.EXE process on the server. With such EXCEL.EXE processes don’t being closed as they should you get lots of those processes on memory which could overload the server.

Do not use COM automation if you are developing server-side code.

ExcelPackageAfter struggling with COM automation I finally got to know ExcelPackage  which works with Office Open Document Format (OOXML). It can read an .XLSX (Microsoft Excel 2007) template and create another .XLSX file based on such template, giving you lots of possibilities to work with the template copy.

I’ve gotten really happy because I had found a way of doing what the task was asking for but with a minor detail: ExcelPackage works only with .XLSX file format letting the .XLS (Microsoft Excel 2003) format out of the game. Well it turned out to be a big impediment because the client (software buyer) wouldn’t allow us to install the famous Microsoft Office Compatibility Pack for Word, Excel, and PowerPoint 2007 File Formats on user machines so that users could open and save OOXML file formats even using Microsoft Office 2003 suite.

Discovering ExcelPackage was good but it didn’t do the trick that implies the use of an .XLS template.

NPOII got back Googling again and strived to find an open source library that would allow me to do what I wanted. After some time I finally discovered NPOI.  Wow, it could read the .XLS template and generate the end result I wanted. Great. I downloaded NPOI 1.2.1 for .NET 2.0 binaries immediately and started playing with it to see what it could really do.

OK, after this short story, I’ll show you how to use both open source projects (ExcelPackage and NPOI).

I’ve created a new ASP.NET MVC project as can be seen in this picture:

 Excel Writer Solution Explorer

In the Content folder I’ve placed the template spreadsheets.
In the Libs folder I’ve placed the DLLs necessary to use both ExcelPackage and NPOI open source projects.

The controller that interests us is the one called ExcelWriterController:

Excel Writer Controller

The methods that handle the creation of the spreadsheet are: ExcelPackageCreate and NPOICreate.

For each controller action (method) there’s a corresponding view that renders the UI to the user. Those views are the ones shown inside the ExcelWriter folder: ExcelPackage.aspx and NPOI.aspx.

This is the Home Page of the Excel Writer MVC Application - take a look at the tabs (ExcelPackage and NPOI) that lead you to the View pages:

Excel Writer Home Page

Each view has a button which when clicked calls the corresponding action method on the ExcelWriterController.

This is the NPOI view page:

Excel Writer NPOI View Page

I’ll play with a simple spreadsheet I filled with the data I got from Excel’s blog post titled Formula to Access a List of Values Interspersed with Zeros or Blanks.

Let’s see the code that goes into the ExcelPackageCreate method:

/// <summary>
/// Creates a new Excel spreadsheet based on a template using the ExcelPackage library.
/// A new file is created on the server based on a template.
/// </summary>
/// <returns>Excel report</returns>
[AcceptVerbs(HttpVerbs.Post)]
public ActionResult ExcelPackageCreate()
{
    try
    {
        FileInfo template = new FileInfo(Server.MapPath(@"\Content\ExcelPackageTemplate.xlsx"));

        FileInfo newFile = new FileInfo(Server.MapPath(@"\Content\ExcelPackageNewFile.xlsx"));

        // Using the template to create the newFile...
        using(ExcelPackage excelPackage = new ExcelPackage(newFile, template))
        {
            // Getting the complete workbook...
            ExcelWorkbook myWorkbook = excelPackage.Workbook;

            // Getting the worksheet by its name...
            ExcelWorksheet myWorksheet = myWorkbook.Worksheets["Sheet1"];

            // Setting the value 77 at row 5 column 1...
            myWorksheet.Cell(5, 1).Value = 77.ToString();

            // Saving the change...
            excelPackage.Save();
        }

        TempData["Message"] = "Excel report created successfully!";

        return RedirectToAction("ExcelPackage");
    }
    catch(Exception ex)
    {
        TempData["Message"] = "Oops! Something went wrong.";

        return RedirectToAction("ExcelPackage");
    }
}

Let’s see the code that goes into the NPOICreate method:

/// <summary>
/// Creates a new Excel spreadsheet based on a template using the NPOI library.
/// The template is changed in memory and a copy of it is sent to
/// the user computer through a file stream.
/// </summary>
/// <returns>Excel report</returns>
[AcceptVerbs(HttpVerbs.Post)]
public ActionResult NPOICreate()
{
    try
    {
        // Opening the Excel template...
        FileStream fs =
            new FileStream(Server.MapPath(@"\Content\NPOITemplate.xls"), FileMode.Open, FileAccess.Read);

        // Getting the complete workbook...
        HSSFWorkbook templateWorkbook = new HSSFWorkbook(fs, true);

        // Getting the worksheet by its name...
        HSSFSheet sheet = templateWorkbook.GetSheet("Sheet1");

        // Getting the row... 0 is the first row.
        HSSFRow dataRow = sheet.GetRow(4);

        // Setting the value 77 at row 5 column 1
        dataRow.GetCell(0).SetCellValue(77);

        // Forcing formula recalculation...
        sheet.ForceFormulaRecalculation = true;

        MemoryStream ms = new MemoryStream();

        // Writing the workbook content to the FileStream...
        templateWorkbook.Write(ms);

        TempData["Message"] = "Excel report created successfully!";

        // Sending the server processed data back to the user computer...
        return File(ms.ToArray(), "application/vnd.ms-excel", "NPOINewFile.xls");
    }
    catch(Exception ex)
    {
        TempData["Message"] = "Oops! Something went wrong.";

        return RedirectToAction("NPOI");
    }
}

One drawback of the ExcelPackage library is that it must create a file on the server. There are some modifications to the library that enables you to create the template copy on memory and send it to the user as the NPOI library does. Take a look at the ExcelPackage’s discussion page at CodePlex and specifically this thread: Why create Excel spreadsheets on the server?

The great thing about NPOI of course is that it enables you to work with the template in code and then send a copy of the spreadsheet directly to the user. The template remains intact and the user receives a modified copy of the template which contains the data processed by the application.

With NPOI when you click on the Create Excel report button you get the download dialog window:

Excel Writer NPOI Download dialog

With ExcelPackage you’d have to get the path of the file created on the server, in this case \Content\ExcelPackageNewFile.xlsx and then send that file to the user. This is an extra step and adds an additional burden to the server. I didn’t implement it and so I let this as an exercise to you.

Well, the spreadsheet included in this simple project has only formulas but you can for sure have an Excel template with lots of formulas, pivot tables, macros, etc. This gives you the power of Excel in code in a clean fashion.

Hope this helps shed some light on this topic!

Note
Using the open source libraries presented in this post you won’t need Microsoft Excel installed on the server.

Updated on 9/22/2010

There’s now EPPlus that extends ExcelPackage.

EPPlus is a .net library that reads and writes Excel 2007 files using the Open Office XML format (XLSX).

EPPlus supports ranges, cell styling, charts, pictures, shapes, named ranges, autofilters and a lot of other stuff.

Updated on 6/2/2010

I’m working on an ASP.NET project that uses .NET Framework 1.1. In such a case I needed to use NPOI 1.2.1 for .NET 1.1.

The code you’ll use with NPOI 1.2.1 for .NET 1.1 is the same presented on this post. Just pay attention to include this using directive inside your C# code:

using NPOI.HSSF.UserModel

Updated on 1/3/2010

If you like NPOI, help spread the word about it voting up on this ad:

Help make NPOI even more Awesome!

Visual Studio 2008 C# ASP.NET MVC Web Application
You can get the Microsoft Visual Studio Project at:

http://sites.google.com/site/leniel/blog/ExcelWriterMvcProject.zip

To try out the code you can use the free Microsoft Visual Web Developer 2008 Express Edition that you can get at: http://www.microsoft.com/express/vwd/Default.aspx

References
ExcelPackage: Office Open XML Format file creation
http://excelpackage.codeplex.com/

ExcelPackage binaries download
http://excelpackage.codeplex.com/Release/ProjectReleases.aspx

NPOI
http://npoi.codeplex.com/

NPOI 1.2.1 for .NET 1.1 
http://npoi.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=33203

NPOI 1.2.1 for .NET 2.0
http://npoi.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=19351

NPOI samples
http://npoi.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=19351#DownloadId=70100

A* pathfinding search in C# - Part 2

A* pathfinding search in C# - Part 1
A* pathfinding search in C# - Part 3

Code available at GitHub: https://github.com/leniel/AStar

To illustrate the path finding problem and how it can be solved using A* (A star) I decided to use the Romania map (with the same Cities I used in Breadth and depth first search series of posts). Now I modified it adding more connections between the cities so that we have more fun while debugging the code.

This time I decided to get the real values of latitude and longitude for the cities shown in the Romania map. To accomplish this I used this Google Spreadsheet that is a geocoder. It uses geo-google-docs to get geolocalization data for the address column.

This is the Map created using the data stored in the spreadsheet with the help of a Google Map:

You can pass your mouse over each pin to see the name of the city it represents in the map.

The following picture shows the fictitious connections (paths) between the cities and the cost to move from city to city.

Romania Map

To fill the Graph that represents the Romania map above I created a static private method called FillGraphWithEarthMap inside the AStar class that does the job of creating the Nodes and adding them to the graph (bellow I show a shortened version of such a method because it is a really big method given the fact that we have 20 cities and 41 connections (paths) between them:

/// <summary>
/// Fills a Graph with Romania map information.
/// The Graph contains Nodes that represents Cities of Romania.
/// Each Node has as its key the City name and its Latitude and Longitude associated information.
/// Nodes are vertexes in the Graph. Connections between Nodes are edges.
/// </summary>
/// <param name="graph">The Graph to be filled</param>
/// <param name="distanceType">The DistanceType (KM or Miles) between neighbor cities</param>
private static void FillGraphWithEarthMap(Graph graph, DistanceType distanceType)
{
    // 20 Vertexes in total
    Node arad = new Node("Arad", null, 46.1792414, 21.3150154); // Creating a Node...
    graph.AddNode(arad); // Adding the Node to the Graph...

    Node bucharest = new Node("Bucharest", null, 44.4479237, 26.097879);
    graph.AddNode(bucharest);

    Node craiova = new Node("Craiova", null, 44.3182085, 23.8016427);
    graph.AddNode(craiova);

    .
.
.
    // 41 Edges in total
    // Arad <-> Sibiu
    graph.AddUndirectedEdge(arad, sibiu, Haversine.Distance(arad, sibiu, distanceType));
    // Arad <-> Timisoara
    graph.AddUndirectedEdge(arad, timisoara, Haversine.Distance(arad, timisoara, distanceType));
    // Arad <-> Zerind
    graph.AddUndirectedEdge(arad, zerind, Haversine.Distance(arad, zerind, distanceType));
.
.
. }

The Vertexes
You must create a vertex/Node for each City of the map, passing to the Node’s class constructor the name of the City and its Latitude and Longitude coordinates. Add each Node to the Graph passed as a parameter to the method.

The Edges
You must create an undirected edge/path for each connection between two cities, passing to the constructor the first city, the second city and the cost to go from one city to the other. In this case the cost is calculated through the Haversine class discussed in A* pathfinding search in C# - Part 1.

To help debug the code I created two methods: DistanceBetweenNodes and ViewOtherPaths. Both methods are kept in the AStar class.

This is the DistanceBetweenNodes method:

/// <summary>
/// Prints on screen the distance from a city to its neighbors.
/// </summary>
/// <param name="graph">The Graph</param>
/// <param name="distanceType">The distance type: KM or Miles</param>
private static void DistanceBetweenNodes(Graph graph, DistanceType distanceType)
{
    // First we cast the Graph.Nodes which is a NodeList to type Node and then we order the list of Nodes by the Node Key
    // so that we get the list of cities in ascending order.
    foreach(Node n in graph.Nodes.Cast<Node>().OrderBy(n => n.Key))
    {
        // For each city neighbor we gets its information and print it on screen.
        foreach(EdgeToNeighbor etn in n.Neighbors)
        {
            Console.WriteLine("Distance from {0} to {1} is -> {2:#.##} {3}", n.Key, etn.Neighbor.Key, etn.Cost, distanceType);
        }
    }
}

Inside the FindPath method presented on A* pathfinding search in C# - Part 1, I call the ViewOtherPaths method (it’s commented out):

/// <summary>
/// This is the method responsible for finding the shortest path between a Start and Destination cities using the A*
/// search algorithm.
/// </summary>
/// <typeparam name="TNode">The Node type</typeparam>
/// <param name="start">Start city</param>
/// <param name="destination">Destination city</param>
/// <param name="distance">Function which tells us the exact distance between two neighbours.</param>
/// <param name="estimate">Function which tells us the estimated distance between the last node on a proposed path and the
/// destination node.</param>
/// <returns></returns>
static public Path<TNode> FindPath<TNode>(
    TNode start,
    TNode destination,
    Func<TNode, TNode, double> distance,
    Func<TNode, double> estimate) where TNode : IHasNeighbours<TNode>
{
    var closed = new HashSet<TNode>();

    var queue = new PriorityQueue<double, Path<TNode>>();

    queue.Enqueue(0, new Path<TNode>(start));

    while(!queue.IsEmpty)
    {
        var path = queue.Dequeue();

        if(closed.Contains(path.LastStep))
            continue;

        if(path.LastStep.Equals(destination))
            return path;

        closed.Add(path.LastStep);

        foreach(TNode n in path.LastStep.Neighbours)
        {
            double d = distance(path.LastStep, n);

            var newPath = path.AddStep(n, d);

            queue.Enqueue(newPath.TotalCost + estimate(n), newPath);
        }

        //ViewOtherPaths(queue, estimate);
    }

    return null;
}

This is the ViewOtherPaths method:

/// <summary>
/// This method can be used to view the other paths inside the PriorityQueue.
/// </summary>
/// <typeparam name="TNode">The Node type</typeparam>
/// <param name="queue">The priority queue</param>
/// <param name="estimate">Function which tells us the estimated distance between the last node on a proposed path and the
/// destination node.</param>
private static void ViewOtherPaths<TNode>(PriorityQueue<double, Path<TNode>> queue, Func<TNode, double> estimate)
{
    // The priority queue is composed of KeyValuePairs which has as key a double value (the TotalCost) and
    // has as Value a Queue which contains Paths.
    foreach(KeyValuePair<double, Queue<Path<TNode>>> kvp in queue)
    {
        // For each path in the Queue...
        foreach(Path<TNode> otherPath in kvp.Value)
        {
            // Reverse the Path so that we get the order of the cities in a more meaningful way...
            var otherPathReversed = otherPath.Cast<Node>().Reverse();

            // Prints on screen the Cities that are part of this path.
            foreach(Node n in otherPathReversed)
            {
                Console.WriteLine(n.Key);
            }

            // Get the total cost of the other path.
            double otherPathTotalCost = otherPath.TotalCost;
            // Get the estimation cost of the other path.
            double otherPathEstimation = estimate(otherPath.LastStep);

            // Prints on the screen the relevant information so that it gets easier to debug the code and see how
            // the A* search algorithm really does the job...
            Console.WriteLine("Total Cost other path = {0}", otherPathTotalCost);
            Console.WriteLine("Estimation other path = {0}", otherPathEstimation);
            Console.WriteLine(@"Priority Queue Cost other path = {0} =
                            Total Cost other path + Estimation other path =
                            {1}", kvp.Key, otherPathTotalCost + otherPathEstimation);
        }

        Console.WriteLine();
    }

    Console.WriteLine();
}

I also use two additional methods to get the Start and Destination cities entered by the user when s/he runs the code. Both methods keep a loop until the user enters the name of a city that is indeed part of the Graph we’ve just created:

/// <summary>
/// Gets the Destination city.
/// </summary>
/// <param name="graph">The Graph</param>
/// <returns>Name of Destination city</returns>
private static string GetDestinationCity(Graph graph)
{
    string destinationCity;
do { Console.Write("\nEnter a Destination city: "); destinationCity = Console.ReadLine(); } while(!graph.Nodes.ContainsKey(destinationCity));
return destinationCity; }
/// <summary>
/// Gets the Destination city.
/// </summary>
/// <param name="graph">The Graph</param>
/// <returns>Name of Destination city</returns>
private static string GetDestinationCity(Graph graph)
{
    string destinationCity;
do { Console.Write("\nEnter a Destination city: "); destinationCity = Console.ReadLine(); } while(!graph.Nodes.ContainsKey(destinationCity));
return destinationCity; }

This is the main entry point of the Console Application I created:

static void Main(string[] args)
{
    do
    {
        // Creating the Graph...
        Graph graph = new Graph();

        //FillGraphWithGridMap(graph);

        FillGraphWithEarthMap(graph, DistanceType.Kilometers);

        // Prints on the screen the distance from a city to its neighbors.
        // Used mainly for debug information.
        // DistanceBetweenNodes(graph, DistanceType.Kilometers);

        Console.WriteLine("A* Search - Sample implementation by Leniel Macaferi, June 7-20, 2009\n");
        Console.WriteLine("These are the Cities you can choose as Start and Destination in Romania: \n");

        // Prints on screen the cities that you can choose as Start and Destination.
        foreach(Node n in graph.Nodes.Cast<Node>().OrderBy(n => n.Key))
        {
            Console.WriteLine(n.Key);
        }

        string startCity = GetStartCity(graph);

        string destinationCity = GetDestinationCity(graph);

        Node start = graph.Nodes[startCity];

        Node destination = graph.Nodes[destinationCity];

        // Function which tells us the exact distance between two neighbours.
        Func<Node, Node, double> distance = (node1, node2) => node1.Neighbors.Cast<EdgeToNeighbor>().Single(etn => etn.Neighbor.Key == node2.Key).Cost;

        // Estimation/Heuristic function (Manhattan distance)
        // It tells us the estimated distance between the last node on a proposed path and the destination node.
        //Func<Node, double> manhattanEstimation = n => Math.Abs(n.X - destination.X) + Math.Abs(n.Y - destination.Y);

        // Estimation/Heuristic function (Haversine distance)
        // It tells us the estimated distance between the last node on a proposed path and the destination node.
        Func<Node, double> haversineEstimation = n => Haversine.Distance(n, destination, DistanceType.Kilometers);

        //Path<Node> shortestPath = FindPath(start, destination, distance, manhattanEstimation);
        Path<Node> shortestPath = FindPath(start, destination, distance, haversineEstimation);

        // Prints the shortest path.
        foreach(Node n in shortestPath.Reverse())
        {
            Console.WriteLine(n.Key);
        }

        Console.Write("\nDo you wanna try A* Search again? Yes or No? ");
    }
    while(Console.ReadLine().ToLower() == "yes");
}

The main entry point above calls all the other methods I showed in this post.

This is all you need to run the A* pathfinding search algorithm! :-)

I think the code is well documented through comments.

I wish this helps you understand how to put together the working pieces necessary to run a test case.

Next time: I’ll run a test case using the Graph we have assembled in this post.

A* pathfinding search in C# - Part 1

A* pathfinding search in C# - Part 2
A* pathfinding search in C# - Part 3

Code available at GitHub: https://github.com/leniel/AStar

Last Sunday after I got back from church I read Kirill Osenkov’s post Algorithms in C#: shortest path around a polygon (polyline routing). Kirill mentions Eric Lippert’s excellent series of posts titled Path Finding Using A* in C# 3.0. I’ve read that series of posts a long time ago (2007 - I was on the last year of the computer engineering course).

See you how life is interesting: in 2007 after reading Lippert’s posts I wanted to see a running sample of the code but didn’t try to implement that until the time I read Osenkov’s post. Why did it took so long? I don’t know how to answer this.

I studied A* (A star) in the 9th term of the computer engineering course  (Feb-Jun, 2007). The discipline was Artificial Intelligence. At that time I didn’t implement any code for this specific algorithm (I think this is due the so famous excuse during the classes - we just don’t have enough time to do that. We have lots of topics to see yet.) . Therefore, we’ve just run test cases on sheets of paper. Not so good but a start.

Well, Eric wrote the base structure of the A* pathfinding algorithm but didn’t provide a complete running sample. He wrote the following:

I have a nice example of using A* to find the shortest path around a lake (but not the longest shortest path!) but I'm not going to post it here until Orcas ships. I'll just put up the whole project file at that time and you can check it out all at once.

Still on Sunday night I took the time to write some code that put up a running A* search.

I basically merged some code I had already posted on the series of posts titled Breadth and depth first search and created some properties here and there to get the whole thing running. It’s nothing more than code reuse.

The following is the basic project’s structure I created:

A* Search Solution Explorer

In Path Finding Using A* in C# 3.0, Part One the A* algorithm is described.

In Path Finding Using A* in C# 3.0, Part Two the data structures necessary to implement the A* algorithm are defined.

In this part Eric writes:

Let’s make an immutable stack of nodes which tracks the total cost of the whole path. Later on, we’ll see that we need to enumerate the nodes of this thing, so we’ll do a trivial implementation of IEnumerable while we’re at it. And since we don’t know exactly what a node will look like, let’s make the thing generic.

In part two we are presented to an immutable stack that I’ve put inside the Path class.

In Path Finding Using A* in C# 3.0, Part Three we are presented to a priority queue. This structure is responsible for putting the best paths so far in priority order, that is, the path with a lesser cost has a major priority.

In this part Eric writes:

In order to make the A* algorithm work we need to get the lowest-estimated-cost-path-discovered-so-far out of the list of paths under consideration. The standard data structure for doing so is called a “priority queue”. Priority queues are so-called because they are typically used to store a list of jobs where each job has an associated priority.

The PriorityQueue lies within its namesake class.

In Path Finding Using A* in C# 3.0, Part Four we get the FindPath method that is the one responsible for finding the shortest path possible. This method uses all of the remaining classes shown on the project structure above.

I put the FindPath method inside the AStar class.

In this part Eric writes:

What do we need to make the algorithm run? We need a start node, a destination node, a function which tells us the exact distance between two neighbours, and a function which tells us the estimated distance between the last node on a proposed path and the destination node.

This is the beautiful PathFind method:

static public Path<TNode> FindPath<TNode>(
    TNode start,
    TNode destination,
    Func<TNode, TNode, double> distance,
    Func<TNode, double> estimate) where TNode : IHasNeighbours<TNode>
{
    var closed = new HashSet<TNode>();

    var queue = new PriorityQueue<double, Path<TNode>>();

    queue.Enqueue(0, new Path<TNode>(start));

    while(!queue.IsEmpty)
    {
        var path = queue.Dequeue();

        if(closed.Contains(path.LastStep))
            continue;

        if(path.LastStep.Equals(destination))
            return path;

        closed.Add(path.LastStep);

        foreach(TNode n in path.LastStep.Neighbours)
        {
            double d = distance(path.LastStep, n);

            var newPath = path.AddStep(n, d);

            queue.Enqueue(newPath.TotalCost + estimate(n), newPath);
        }
    }

    return null;
}

When you get at the fourth part of Eric’s posts you may get thinking: How can I run this thing? That’s what I try to explain in this post.

The immutable stack implemented in part two of Eric’s posts is a generic one, that is, it accepts an object custom made to represent a Node in the path.

I already had a class Node that I used sometime ago while studying Artificial Intelligence. Take a look at Breadth and depth first search series of posts.

I also needed a Graph that represents the map formed with the connections between the nodes.

The Graph has a NodeList that stores all the nodes currently in the graph.

A node can have neighbors, that is, other nodes connected to it and such connections are represented in the code through the class AdjacencyList. Each node has an AdjacencyList that stores its neighbors.

A connection (edge) between two nodes (vertexes) is represented by the EdgeToNeighbor class that stores the neighbor of a node and the cost to get to that neighbor.

The AdjacencyList, EdgeToNeighbor, Graph, Node and NodeList classes are the outcome of an excellent series of articles by Scott Mitchell titled An Introduction to Data Structures. I got those classes and added them to the project.

This way I had all set to pass in the first two parameters of the PathFind method.

Now I needed to figure out how to create the distance and estimate functions to pass as the third and fourth parameters.

For the distance function I did this:

// Function which tells us the exact distance between two neighbours.
Func<Node, Node, double> distance = (node1, node2) => node1.Neighbors.Cast<EdgeToNeighbor>().Single(etn => etn.Neighbor.Key == node2.Key).Cost;

The distance function receives two nodes as parameters as can be seen in the PathFind method and returns a double value. I then use a lambda expression to find the Cost, (distance) in this case, between the two nodes. I get the neighbors of node1 and cast those to EdgeToNeighbor and execute a query using a lambda expression again to return the Cost only if the key of one of the neighbors of node1 is equal to the key of node2.

For the estimation function I did this:

// Estimation/Heuristic function (Haversine distance)
// It tells us the estimated distance between the last node on a proposed path and the destination node.
Func<Node, double> haversineEstimation = n => Haversine.Distance(n, destination, DistanceType.Kilometers);

Now I’m using as the estimation the Haversine formula. This formula gives the great-circle distances between two points on a sphere (Earth) from their longitudes and latitudes.

I got the Haversine formula code from the post Haversine Formula in C# and in SQL by Seth Long. I just adapted it a little bit so that I could use it with the Node data structure.

The haversine generic delegate receives as input a Node and return a double. Using a lambda expression I get the Haversine distance from node n to the destination node and I set that I want the distance in Kilometers.

With the distance and estimate functions on hand we’re OK to call the PathFind method.
Wait! Don’t we need something before that? Of course we need. We need to populate the graph with nodes. After all we’re after the shortest path between two nodes and we need some nodes to test the A* star pathfinding search. :-)

Next time: I’ll show you how to create a complete scenario using the same Romania map I used in the Breadth and depth first search series of posts.

References
[1] Lippert, Eric. Path Finding Using A* in C# 3.0, Part One. 2007. Available at <http://blogs.msdn.com/ericlippert/archive/2007/10/02/path-finding-using-a-in-c-3-0.aspx>. Accessed on June 13, 2009.

[2] Lippert, Eric. Path Finding Using A* in C# 3.0, Part Two. 2007. Available at <http://blogs.msdn.com/ericlippert/archive/2007/10/04/path-finding-using-a-in-c-3-0-part-two.aspx>. Accessed on June 13, 2009.

[3] Lippert, Eric. Path Finding Using A* in C# 3.0, Part Three. 2007. Available at <http://blogs.msdn.com/ericlippert/archive/2007/10/08/path-finding-using-a-in-c-3-0-part-three.aspx>. Accessed on June 13, 2009.

[4] Lippert, Eric. Path Finding Using A* in C# 3.0, Part Four. 2007. Available at <http://blogs.msdn.com/ericlippert/archive/2007/10/10/path-finding-using-a-in-c-3-0-part-four.aspx>. Accessed on June 13, 2009.

[5] Mitchell, Scott. An Extensive Examination of Data Structures. 2005. Available at <http://msdn.microsoft.com/en-us/library/ms379570(VS.80).aspx>. Accessed on June 13, 2009.

[6] Long, Seth. Haversine Formula in C# and in SQL. 2008. Available at <http://megocode3.wordpress.com/2008/02/05/haversine-formula-in-c>. Accessed on June 13, 2009.

Calculating prime numbers with LINQ in C#

If you want to see how to use PLINQ (Parallel LINQ) to get performance gains when performing this calculation, take a look at the post titled
Parallel LINQ (PLINQ) with Visual Studio 2010.

To serve as a processing test in a follow up post I’m posting how to calculate prime numbers using a LINQ query expression.

In mathematics, a prime number (or a prime) is a natural number which has exactly two distinct natural number divisors: 1 and itself. The first twenty-five prime numbers are:

2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97

The following code[1] shows how to calculate the prime numbers that are less equal ( <= ) a given number “max”:

Func<int, IEnumerable<int>> primeNumbers = max =>
     from i in Enumerable.Range(2, max - 1)
     where Enumerable.Range(2, i - 2).All(j => i % j != 0)
     select i;

IEnumerable<int> result = primeNumbers(10);
foreach(int i in result)
{
  Console.WriteLine(i);
}

In my final computer engineering assignment I presented LINQ - Language Integrated Query and its constructs. With such constructs it’s possible to achieve a high degree of method chaining.

LINQ is declarative, not imperative. It allows us to simply state what we want to do without worrying about how it is done.

In the code above we declare a Func<(Of <(T, TResult>)>) generic delegate which has an int as input parameter and an IEnumerable<int> as the result.
Func delegates are very useful for encapsulating user-defined expressions that are applied to each element in a set of source data.

Using a lambda expression ( => ) we assign a query expression to the delegate.
A lambda expression is an anonymous function that can contain expressions and statements, and can be used to create delegates or expression tree types.

The query expression states that from each value i in the Enumerable.Range(2, max - 1) where all elements of the range Enumerable.Range(2, i – 2) satisfy the condition All(j => i % j != 0), we select i.

max is the delegate input parameter.

The % symbol is the modulus operator in C#. It computes the remainder after dividing its first operand by its second.

For example: considering max = 10 we’d have the following…

The range in the from clause is: { 2, 3, 4, 5, 6, 7, 8, 9, 10 }

Taking the 1st value out of the from range, we have i = 2.

The range in the where clause is Range (2, 2 - 2) = Range (2 , 0) = {  }.

Since i = 2 is by default included in the result, what must be evaluated lies where the variable max is used. So, taking the 2nd value and assigning it to i we have i = 3.

The range in the where clause is Range (2, 3 - 2) = Range (2 , 1) = { 2 }.

Evaluating j => i % j we have 3 % 2 = 1, and so, 3 is a prime number.

Taking the 3rd value and assigning it to i we have i = 4.

The range in the where clause is Range (2,  4 - 2) = Range (2, 2) = { 2, 3 }.

Evaluating j => i % j we have 4 % 2 = 0 and 4 % 3 = 1, and so 4 is not a prime number because all elements of the where range must yield a result != 0 for the expression i % j != 0.

Now we have i = 5.

The range in the where clause is Range (2, 5 - 2) = Range (2 , 3) = { 2, 3, 4 }.

Evaluating j => i % j we have 5 % 2 = 1, 5 % 3 = 2 and 5 % 4 = 1. From this we have that 5 is a prime number.

Now we have i = 6.

The range in the where clause is Range (2, 6 - 2) = Range (2 , 4) = { 2, 3, 4, 5 }.

Evaluating j => i % j we have 6 % 2 = 0, 6 % 3 = 0, 6 % 4 = 2 and 6 % 5 = 1. From this we have that 6 is not prime number.

Now we have i = 7.

The range in the where clause is Range (2, 7 - 2) = Range (2 , 5) = { 2, 3, 4, 5, 6 }.

Evaluating j => i % j we have 7 % 2 = 1, 7 % 3 = 1, 7 % 4 = 3, 7 % 5 = 2 and 7 % 6 = 1. From this we have that 7 is prime number.

DebuggingPrimeNumbersLINQQueryExpression

Now we have i = 8.

The range in the where clause is Range (2, 8 - 2) = Range (2 , 6) = { 2, 3, 4, 5, 6, 7 }.

Evaluating j => i % j we have 8 % 2 = 0, 8 % 3 = 2, 8 % 4 = 0, 8 % 5 = 3, 8 % 6 = 2 and 8 % 7 = 1. 8 isn’t a prime number.

Now we have i = 9.

The range in the where clause is Range (2, 9 - 2) = Range (2 , 7) = { 2, 3, 4, 5, 6, 7, 8 }.

Evaluating j => i % j we have 9 % 2 = 1, 9 % 3 = 0, 9 % 4 = 1, 9 % 5 = 4, 9 % 6 = 3, 9 % 7 = 2 and 9 % 8 = 1. 9 isn’t a prime number.

Now we have i = 10.

The range in the where clause is Range (2, 10 - 2) = Range (2 , 8) = { 2, 3, 4, 5, 6, 7, 8, 9 }.

Evaluating j => i % j we have 10 % 2 = 0, 10 % 3 = 1, 10 % 4 = 2, 10 % 5 = 0, 10 % 6 = 4, 10 % 7 = 3, 10 % 8 = 2 and 10 % 9 = 1. 10 isn’t a prime number.

Finally we have 4 prime numbers <= 10; they are: { 2, 3, 5, 7 }.

The following table illustrates the result:

i where Range(2, i - 2) All(j => i % j != 0)
2 { }  
3 { 2 } 3 % 2 = 1
4 { 2, 3 } 4 % 2 = 0
4 % 3 = 1
5 { 2, 3, 4 } 5 % 2 = 1
5 % 3 = 2
5 % 4 = 1
6 { 2, 3, 4, 5 } 6 % 2 = 0
6 % 3 = 0
6 % 4 = 2
6 % 5 = 1
7 { 2, 3, 4, 5, 6 } 7 % 2 = 1
7 % 3 = 1
7 % 4 = 3
7 % 5 = 2
7 % 6 = 1
8 { 2, 3, 4, 5, 6, 7 } 8 % 2 = 0
8 % 3 = 2
8 % 4 = 0
8 % 5 = 3
8 % 6 = 2
8 % 7 = 1
9 { 2, 3, 4, 5, 6, 7, 8 } 9 % 2 = 1
9 % 3 = 0
9 % 4 = 1
9 % 5 = 4
9 % 6 = 3
9 % 7 = 2
9 % 8 = 1
10 { 2, 3, 4, 5, 6, 7, 8, 9 } 10 % 2 = 0
10 % 3 = 1
10 % 4 = 2
10 % 5 = 0
10 % 6 = 4
10 % 7 = 3
10 % 8 = 2
10 % 9 = 1

I’ll use the primeNumbers delegate in a future post to evaluate PLINQ (Parallel LINQ) and measure the performance gains when the same calculation is done in parallel instead of in sequence.

References
[1] Perfetti, Michel. LINQ Quiz: The list of prime numbers in 3 clauses? 2007. Available at <http://blogs.developpeur.org/raptorxp/archive/2007/11/26/quizz-linq-la-liste-des-nombres-premiers-en-3-clauses.aspx>. Accessed on May 31, 2009.