Leniel Maccaferri's blog: Processing Stack Overflow data dump with Apache Spark

This post is about the final project I did for one of the courses of the Master's degree I'm currently pursuing at UFRJ (Federal University of Rio de Janeiro). The degree is in Data and Knowledge Engineering (Databases), a branch of the Computer and Systems Engineering department at COPPE/UFRJ.

The course is called Special Topics in Databases IV and is taught by professor Alexandre Bento de Assis Lima.

The presentation (PPT slides) is in Brazilian Portuguese. I'll translate the slides to English in this blog post. They give an overview of the work done.

The final paper is written in English.

Files

- Trabalho prático sobre Apache Spark envolvendo um problema típico de Big Data (apresentação\presentation).pdf (in Portuguese)

- Processing Stack Overflow data dump with Apache Spark (in English)

Abstract. This paper describes the process involved in building an ETL tool based on Apache Spark. It imports XML data from the Stack Overflow data dump.
The XML files are processed using the Spark XML library and converted to a DataFrame object. The DataFrame data is then queried with the Spark SQL library.
Two applications were developed: spark-backend and spark-frontend. The first one contains the code responsible for dealing with Spark, while the latter is user-centric, allowing users to consume the data processed by Spark.

All the code developed is in English and should be easy to read.

Presentation

Objective
Problem
Technologies
Strategy used to acquire the data
Development
Conclusion
Links

Objective

Put into practice the concepts presented during the classes.
Have a closer contact with modern technologies used to process Big Data.
Automate the extraction/mining of valuable/interesting information hidden in the immensity of data.

Problem

Analyse the Stack Overflow data dump, which is made available on the internet on a monthly basis.
The data dump is composed of a set of XML files compressed in the .7z format.
Even after compression the biggest file is 15.3 GB, a size in line with the data volumes handled in Big Data.
Spark will at first be used as an ETL tool (ETL = Extract > Transform > Load) to prepare the data consumed by a front-end web app.
"At first" because there's also the possibility of using Spark to process, on demand, the data shown in the web app.

Technologies

Apache Spark 2.0.1 + Spark XML 0.4.1 + Spark SQL 2.0.2
Ubuntu 16.04 LTS (Xenial Xerus)
Linux VM (virtual machine) running on Parallels Desktop 12 for Mac
Scala 2.11.8
XML (Extensible Markup Language)
XSL (Extensible Stylesheet Language)
Play Framework 2.5 (front end)
Eclipse Neon 4.6.1 with Scala IDE 4.5.0 plugin as the IDE

Strategy used to acquire the data

Got the .torrent file that contains all the data dumps from the Stack Exchange family of sites - https://archive.org/details/stackexchange
Selected the eight .7z files related to Stack Overflow: stackoverflow.com-Badges.7z, stackoverflow.com-Comments.7z, stackoverflow.com-PostHistory.7z, stackoverflow.com-PostLinks.7z, stackoverflow.com-Posts.7z, stackoverflow.com-Tags.7z, stackoverflow.com-Users.7z, stackoverflow.com-Votes.7z

Development

To make the work viable (running locally, outside a cluster), a single .xml file [Users.xml] was used. This file has a total of 5,987,287 lines (1.8 GB). A subset with the first 100,000 lines (32.7 MB) was selected.

hadoop@ubuntu:/media/psf/FreeAgent GoFlex Drive/Downloads$ head -100000 Users.xml > Users100000.xml

The file Users.xsl was used to convert the Users100000.xml data to the format expected by the spark-xml library. The result was saved to Users100000.out.xml.
The .xml and .xsl files were placed in the input folder of the Scala project [spark-backend] inside Eclipse.
The application spark-backend reads the file Users100000.out.xml through Spark XML and transforms it into a DataFrame object.
The Spark SQL library is then used to query the data. Some sample queries were created.
Each query generates a CSV file (SaveDfToCsv) to be consumed at a later stage by a web application [spark-frontend]; that is, Spark is used as an ETL tool.
The result of each query is saved as multiple files in the output folder. This happens because Spark was designed to execute jobs in a cluster (multiple nodes/computers), so each partition of the result is written to its own part file.
For testing purposes, a method that renames the CSV file was created. This method copies the generated CSV to a folder called csv. The destination folder can be configured in the file conf/spark-backend.properties.
The application [spark-backend] can be executed inside Eclipse or from the command line in Terminal using the spark-submit command.
On the command line we use the JAR file produced by the project build in Eclipse, passing the necessary packages as parameters, as below:

spark-submit --packages com.databricks:spark-xml_2.11:0.4.1 --class com.lenielmacaferi.spark.ProcessUsersXml --master local com.lenielmacaferi.spark-backend-0.0.1-SNAPSHOT.jar
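
The backend steps described above (load the XML with Spark XML, query it with Spark SQL, write a CSV, then copy the single part file to the csv folder) can be sketched roughly as follows. This is a minimal sketch and not the actual spark-backend code: the object name ProcessUsersXmlSketch, the helper copySinglePart, the rowTag value "user", the column names DisplayName and Reputation, the sample query and the output paths are all assumptions made for illustration. It assumes Spark 2.x and the spark-xml package are on the classpath.

```scala
import java.nio.file.{Files, Path, Paths, StandardCopyOption}
import scala.collection.JavaConverters._
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical sketch of the backend flow; names and paths are illustrative.
object ProcessUsersXmlSketch {

  // Spark writes a folder of part files. Copy the first CSV part file found
  // in `outputDir` to `target`, giving it a friendlier name.
  def copySinglePart(outputDir: Path, target: Path): Path = {
    val stream = Files.list(outputDir)
    try {
      val part = stream.iterator().asScala
        .find { p =>
          val name = p.getFileName.toString
          name.startsWith("part-") && name.endsWith(".csv")
        }
        .getOrElse(sys.error(s"No CSV part file found in $outputDir"))
      Option(target.getParent).foreach(dir => Files.createDirectories(dir))
      Files.copy(part, target, StandardCopyOption.REPLACE_EXISTING)
    } finally {
      stream.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Local mode, convenient when running inside the IDE.
    val spark = SparkSession.builder()
      .appName("ProcessUsersXmlSketch")
      .master("local[*]")
      .getOrCreate()

    // Extract: read the converted XML into a DataFrame.
    // "user" is an assumed row tag; use whatever element Users.xsl emits.
    val users: DataFrame = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "user")
      .load("input/Users100000.out.xml")

    // Transform: query the data with Spark SQL (assumed column names).
    users.createOrReplaceTempView("users")
    val topUsers = spark.sql(
      "SELECT DisplayName, Reputation FROM users ORDER BY Reputation DESC LIMIT 10")

    // Load: coalesce to one partition so a single part file is produced.
    topUsers.coalesce(1)
      .write
      .mode("overwrite")
      .option("header", "true")
      .csv("output/top-users")

    copySinglePart(Paths.get("output/top-users"), Paths.get("csv/top-users.csv"))
    spark.stop()
  }
}
```

Calling coalesce(1) is reasonable for small query results like these; on a large result it would funnel all the data through a single task, so it should be used with care in a cluster.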

The application [spark-frontend] was built with Play Framework (The High Velocity Web Framework For Java and Scala).
The user opens the spark-frontend main page at localhost:9000 and sees the list of CSV files generated by the [spark-backend] application.
When the user clicks a file name, the CSV file is sent to the user's computer. The user can then use any spreadsheet software to open and post-process/analyse/massage the data.
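
The file-handling part of the frontend (listing the generated CSV files and locating the one the user clicked) can be sketched in plain Scala as below. This is only an illustration under assumptions: the object CsvCatalog and its methods are hypothetical names, not taken from spark-frontend, and the Play wiring (routes and controller actions that would render the list and send the file) is left out.

```scala
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

// Hypothetical helper a Play controller could delegate to.
object CsvCatalog {

  // Sorted names of the CSV files in `dir`; empty when the folder is missing.
  def listCsvFiles(dir: Path): Seq[String] =
    if (!Files.isDirectory(dir)) Seq.empty
    else {
      val stream = Files.list(dir)
      try {
        stream.iterator().asScala
          .filter(p => Files.isRegularFile(p))
          .map(_.getFileName.toString)
          .filter(_.toLowerCase.endsWith(".csv"))
          .toList
          .sorted
      } finally {
        stream.close()
      }
    }

  // Map a requested file name to a file inside `dir`.
  // Returns None for missing files and for names that escape the folder.
  def resolve(dir: Path, name: String): Option[Path] = {
    val base = dir.toAbsolutePath.normalize
    val candidate = base.resolve(name).normalize
    if (candidate.startsWith(base) && Files.isRegularFile(candidate)) Some(candidate)
    else None
  }
}
```

The resolve check matters because the file name comes from the request: without it, a name such as ../something could point outside the csv folder.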

Conclusion

With Spark's help we can develop interesting solutions, for example a daily job that downloads data and uploads it to an "input" folder, processing the data along the way in many different ways.
Using custom-made code we can work with the data in a cluster (fast processing) through a rich API full of methods and resources. In addition, we have at our disposal numerous additional libraries/plugins developed by the developer community, plus all the power of Scala and Java and their accompanying libraries.
The application demonstrated can be easily executed in a cluster. We only need to change some parameters in the SparkConf object.
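
As a rough illustration of that last point, the sketch below builds a SparkConf whose master can be switched between local mode and a cluster without touching the rest of the code. The object name ClusterConfigSketch, the buildConf helper and the example master URL are assumptions for illustration, not part of spark-backend.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: choose the master from an argument, else keep what
// spark-submit provided, else fall back to local mode.
object ClusterConfigSketch {

  def buildConf(master: Option[String]): SparkConf = {
    // new SparkConf() picks up spark.* system properties, which is how
    // spark-submit passes --master to the application.
    val conf = new SparkConf().setAppName("spark-backend")
    master match {
      case Some(url) => conf.setMaster(url)
      case None      => conf.setIfMissing("spark.master", "local[*]")
    }
  }

  def main(args: Array[String]): Unit = {
    // Example argument (made-up host): spark://master-host:7077
    val conf = buildConf(args.headOption)
    val spark = SparkSession.builder().config(conf).getOrCreate()
    println(s"Running with master: ${spark.sparkContext.master}")
    spark.stop()
  }
}
```

Using setIfMissing for the local fallback avoids overriding a master passed through spark-submit, since values set explicitly on SparkConf take precedence over the command-line flags.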