Processing Stack Overflow data dump with Apache Spark


This post is about the final project I did for one of the courses of the Master's degree I'm currently pursuing at UFRJ (Federal University of Rio de Janeiro), in the Data and Knowledge Engineering (Databases) track of the Computer and Systems Engineering program at COPPE\UFRJ.

The course is called Special Topics in Databases IV and is taught by Professor Alexandre Bento de Assis Lima.

The presentation (PPT slides) is in Brazilian Portuguese, so I'll translate the slides to English in this blog post. They give an overview of the work done.

The final paper is written in English.

Files

Trabalho prático sobre Apache Spark envolvendo um problema típico de Big Data (apresentação\presentation).pdf (in Portuguese)

Processing Stack Overflow data dump with Apache Spark (in English)

Abstract. This paper describes the process involved in building an ETL tool based on Apache Spark. It imports XML data from the Stack Overflow data dump.
The XML files are processed using the Spark XML library and converted to a DataFrame object. The DataFrame data is then queried with the Spark SQL library.
Two applications were developed: spark-backend and spark-frontend. The first one contains the code responsible for dealing with Spark, while the latter is user-centric, allowing users to consume the data processed by Spark.

All the code developed is in English and should be easy to read.

Presentation
  1. Objective
  2. Problem
  3. Technologies
  4. Strategy used to acquire the data
  5. Development
  6. Conclusion
  7. Links
  1. Objective
    • Put into practice the concepts presented during the classes.
    • Have a closer contact with modern technologies used to process Big Data.
    • Automate the Extraction\Mining of valuable\interesting information hidden in the immensity of data.

  2. Problem
    • Analyse the StackOverflow data dump, which is made available on the internet on a monthly basis.
    • The data dump is composed of a set of XML files compressed in the .7z format.
    • Even after compression, the biggest file is 15.3 GB, a size typical of the data volumes that Big Data deals with.
    • Spark at first will be used as an ETL tool (ETL = Extract > Transform > Load) to prepare the data consumed by a front-end web app.
    • "At first" because there's also the possibility of using Spark as a tool to process the data that'll be shown in the web app.

  3. Technologies
    • Apache Spark 2.0.1 +
    • Spark XML 0.4.1 +
    • Spark SQL 2.0.2
    • Ubuntu 16.04 LTS (Xenial Xerus)
    • Linux VM (virtual machine) running on Parallels Desktop 12 for Mac
    • Scala 2.11.8
    • XML (Extensible Markup Language)
    • XSL (Extensible Stylesheet Language)
    • Play Framework 2.5 (front end)
    • Eclipse Neon 4.6.1 with Scala IDE 4.5.0 plugin as the IDE

  4. Strategy used to acquire the data
    • Got the .torrent file that contains all the data dumps from the Stack Exchange family of sites - https://archive.org/details/stackexchange
    • Selected the eight .7z files related to StackOverflow: stackoverflow.com-Badges.7z, stackoverflow.com-Comments.7z, stackoverflow.com-PostHistory.7z, stackoverflow.com-PostLinks.7z, stackoverflow.com-Posts.7z, stackoverflow.com-Tags.7z, stackoverflow.com-Users.7z, stackoverflow.com-Votes.7z

  5. Development
    • To make the work viable (running locally, outside a cluster), a single .xml file [Users.xml] was used. The full file has 5,987,287 lines (1.8 GB), so a subset of 100,000 lines (32.7 MB) was selected.
    • hadoop@ubuntu:/media/psf/FreeAgent GoFlex Drive/Downloads$ head -100000 Users.xml > Users100000.xml
    • The file Users.xsl was used to convert the Users100000.xml data to the format expected by the spark-xml library. The result was saved to Users100000.out.xml.
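
The conversion step above can be sketched with the JDK's built-in XSLT support. This is only a minimal sketch: the real Users.xsl is not reproduced in this post, so the stylesheet below is a hypothetical one that turns each attribute of a <row> into a child element, and the names ApplyXsl and transform are made up for illustration. It also assumes the input is well-formed XML (a subset cut with head still needs its closing root tag before it can be transformed).

```scala
import java.io.{StringReader, StringWriter}
import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.{StreamResult, StreamSource}

object ApplyXsl {
  // Hypothetical stylesheet (NOT the original Users.xsl):
  // every attribute of a <row> becomes a child element of a <user>.
  val stylesheet: String =
    """<?xml version="1.0"?>
      |<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      |  <xsl:output method="xml" omit-xml-declaration="yes"/>
      |  <xsl:template match="/users">
      |    <users><xsl:apply-templates select="row"/></users>
      |  </xsl:template>
      |  <xsl:template match="row">
      |    <user>
      |      <xsl:for-each select="@*">
      |        <xsl:element name="{name()}"><xsl:value-of select="."/></xsl:element>
      |      </xsl:for-each>
      |    </user>
      |  </xsl:template>
      |</xsl:stylesheet>""".stripMargin

  // Applies the stylesheet `xsl` to the document `xml` and returns the result as a string.
  def transform(xml: String, xsl: String): String = {
    val transformer = TransformerFactory.newInstance()
      .newTransformer(new StreamSource(new StringReader(xsl)))
    val out = new StringWriter()
    transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out))
    out.toString
  }

  def main(args: Array[String]): Unit = {
    // Tiny made-up sample in the shape of Users.xml (one <row> per user).
    val sample = """<users><row Id="1" DisplayName="Ann" Reputation="101"/></users>"""
    println(transform(sample, stylesheet))
  }
}
```

For a file the size of Users100000.xml, the same calls accept new StreamSource(new java.io.File(...)) and new StreamResult(new java.io.File(...)) instead of in-memory strings.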


    • The .xml and .xsl files were placed into the input folder of the Scala project [spark-backend] inside Eclipse.
    • The application spark-backend reads the file Users100000.out.xml through Spark XML and transforms it into a DataFrame object.
    • The Spark SQL library is used subsequently to search the data. Some sample queries were created.
    • Each query generates a CSV file (SaveDfToCsv) to be consumed in a later stage by a web application [spark-frontend], that is, Spark is used as an ETL tool.
    • The result of each query is saved in multiple files in the folder output. This happens because Spark was conceived to execute jobs in a cluster (multiple nodes\computers).
    • For testing purposes, a method that renames the CSV file was created. This method copies the generated CSV to a folder called csv. The destination folder can be configured in the file conf/spark-backend.properties.
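
The read, query and save steps above can be sketched roughly as follows. This is not the original spark-backend code: the object name, the row tag "user", the column names, the query and the folder names are assumptions for illustration, and coalesce(1) is my own shortcut to get a single part file, which is only reasonable for small results like this one.

```scala
import java.io.File
import java.nio.file.{Files, Paths, StandardCopyOption}
import org.apache.spark.sql.SparkSession

object UsersToCsvSketch {
  // Copies the first part-* file Spark wrote in `sparkOutputDir` to `csvDir/targetName`.
  // Marker files such as _SUCCESS and the hidden .crc files are skipped.
  def collectCsv(sparkOutputDir: String, csvDir: String, targetName: String): File = {
    val part = Option(new File(sparkOutputDir).listFiles()).getOrElse(Array.empty[File])
      .find(f => f.isFile && f.getName.startsWith("part-"))
      .getOrElse(sys.error(s"no part file found in $sparkOutputDir"))
    Files.createDirectories(Paths.get(csvDir))
    val target = Paths.get(csvDir, targetName)
    Files.copy(part.toPath, target, StandardCopyOption.REPLACE_EXISTING)
    target.toFile
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("users-to-csv-sketch")
      .master("local[*]")
      .getOrCreate()

    // spark-xml turns every <user> element (assumed row tag) into one DataFrame row.
    val users = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "user")
      .load("input/Users100000.out.xml")
    users.createOrReplaceTempView("users")

    // Sample query; DisplayName and Reputation are assumed column names.
    val top = spark.sql(
      "SELECT DisplayName, Reputation FROM users ORDER BY Reputation DESC LIMIT 10")

    // Spark writes a folder of part files; coalesce(1) keeps it to a single one.
    top.coalesce(1).write.mode("overwrite").option("header", "true").csv("output/top-users")
    collectCsv("output/top-users", "csv", "top-users.csv")

    spark.stop()
  }
}
```

Running it needs the spark-sql and spark-xml dependencies on the classpath, the same ones the project already uses.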



    • The application [spark-backend] can be executed inside Eclipse or through the command line in Terminal using the command spark-submit.
    • In the command line we make use of the JAR file produced during the project build in Eclipse. We pass as parameters the necessary packages as below:
    • spark-submit --packages com.databricks:spark-xml_2.11:0.4.1 --class com.lenielmacaferi.spark.ProcessUsersXml --master local com.lenielmacaferi.spark-backend-0.0.1-SNAPSHOT.jar
    • The application [spark-frontend] was built with Play Framework (The High Velocity Web Framework For Java and Scala).
    • The user opens spark-frontend main page at localhost:9000 and has access to the list of CSV files generated by [spark-backend] application.
    • When clicking a file name, the CSV file is sent to the user's computer. The user can then use any spreadsheet software to open and post-process\analyse\massage the data.
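
The two things the front end does with the csv folder (list the files, then serve the one that was clicked) can be sketched without any framework code. This is not the actual Play controller from spark-frontend; CsvCatalog and its methods are hypothetical names, and a real controller would wrap them in its own routing and file response.

```scala
import java.io.File

object CsvCatalog {
  // Names of the CSV files available for download, sorted alphabetically.
  def list(csvDir: File): Seq[String] =
    Option(csvDir.listFiles()).getOrElse(Array.empty[File])
      .filter(f => f.isFile && f.getName.toLowerCase.endsWith(".csv"))
      .map(_.getName)
      .sorted
      .toSeq

  // Resolves a requested name only if it is one of the listed files,
  // so a request such as "../secret.txt" never leaves the csv folder.
  def resolve(csvDir: File, requested: String): Option[File] =
    list(csvDir).find(_ == requested).map(name => new File(csvDir, name))

  def main(args: Array[String]): Unit = {
    val dir = new File(args.headOption.getOrElse("csv"))
    list(dir).foreach(println)
  }
}
```

Matching the requested name against the listing, instead of building a path from user input, is a simple way to keep downloads restricted to files that spark-backend actually produced.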


  6. Conclusion
    • With Spark's help we can develop interesting solutions, for example a daily job that can download data and upload it to an "input" folder, processing it along the way in many different ways.
    • Using custom-made code we can work with the data in a cluster (fast processing) through a rich API full of methods and resources. In addition, we have at our disposal numerous additional libraries\plugins developed by the developer community. Add to that all the power of Scala and Java and their accompanying libraries.
    • The application demonstrated can be easily executed in a cluster. We only need to change some parameters in the object SparkConf.
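
As a rough illustration of the last point, the master URL is the main SparkConf parameter that differs between a local run and a cluster run. The sketch below is not the project's code: ConfSketch and buildConf are hypothetical names, and the cluster host in the comment is a placeholder.

```scala
import org.apache.spark.SparkConf

object ConfSketch {
  // master examples:
  //   "local[*]"                 -> run on this machine using all cores
  //   "spark://some-host:7077"   -> standalone cluster (placeholder host)
  def buildConf(master: String): SparkConf =
    new SparkConf()
      .setAppName("spark-backend")
      .setMaster(master)

  def main(args: Array[String]): Unit = {
    val conf = buildConf(args.headOption.getOrElse("local[*]"))
    println(conf.get("spark.master"))
  }
}
```

Another option is to leave setMaster out of the code and pass --master to spark-submit; values set directly on SparkConf take precedence over the command-line flags.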

  7. Links

Using Zotero to convert Springer Link CSV search result to BibTeX format


Today I needed to generate a BibTeX file to serve as input to Parsif.al.

Parsif.al is an online tool designed to help researchers perform systematic literature reviews within the context of Software Engineering.

I hit a brick wall while doing a search in Springer Link because it only gives us a CSV file with the entire search result, capped at the first 1000 records. It'd be a pain to click and open each and every search result to be able to export the corresponding BibTeX.

Using Zotero it's easy to get a BibTeX file out of the CSV file generated by Springer Link.


Follow these simple steps:

1 - Open the CSV file in Excel, for example, and copy the column that contains the item DOI [Digital Object Identifier];

2 - Paste the DOI(s) into Zotero's Add item(s) by identifier (see Figure 1 above). Wait while it imports...

3 - Select the folder where you imported the DOI(s) (Player Modeling in Figure 1);

4 - Right-click the folder, select Export collection... and pick BibTeX.
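
If you'd rather not open a spreadsheet for step 1, the DOI column can also be pulled out with a few lines of Scala. This is just a sketch: the column name "Item DOI" is an assumption about the Springer Link CSV header (pass another name if your file differs), DoiColumn is a made-up name, and the tiny parser does not handle quoted fields that span several lines.

```scala
import scala.io.Source

object DoiColumn {
  // Splits one CSV line into fields, honouring double-quoted fields and "" escapes.
  def parseLine(line: String): Vector[String] = {
    val fields = Vector.newBuilder[String]
    val cur = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"' && i + 1 < line.length && line.charAt(i + 1) == '"') {
          cur += '"' // escaped quote inside a quoted field
          i += 1
        } else if (c == '"') inQuotes = false
        else cur += c
      } else if (c == '"') inQuotes = true
      else if (c == ',') { fields += cur.toString; cur.clear() }
      else cur += c
      i += 1
    }
    fields += cur.toString
    fields.result()
  }

  // Returns the non-empty values of `column`, using the first line as the header.
  def dois(lines: Seq[String], column: String = "Item DOI"): Seq[String] = lines match {
    case header +: rows =>
      val idx = parseLine(header).indexOf(column)
      if (idx < 0) Seq.empty
      else rows.map(parseLine).collect { case f if f.length > idx && f(idx).nonEmpty => f(idx) }
    case _ => Seq.empty
  }

  def main(args: Array[String]): Unit = {
    // args(0): path to the CSV file exported by Springer Link
    val source = Source.fromFile(args(0), "UTF-8")
    try dois(source.getLines().toList).foreach(println)
    finally source.close()
  }
}
```

It prints one DOI per line, which is the shape you can paste into Zotero's Add item(s) by identifier box in step 2.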

You're done.

Hope it helps.

References:

Adding Items to your Zotero Library

Italian Ancestry - Maccaferri & Cantamessa families


First of all, I like genealogy and I like history.

I like Italy because some of my ancestors\antenati come from there. The affection I feel for the country is undeniable.

Knowing my origins is something that stirs my feelings, and this post is my attempt to find relatives scattered all over the world.

I've created a Family Tree at MyHeritage:

https://www.myheritage.com.br/site-family-tree-343227191/leniel-macaferis-family

Below I tell my "recent" Italian family history in 3 languages: Italian, English and Portuguese.

If you think we may be relatives, just drop me a line at leniel@gmail.com or use the contact form. It'll be a pleasure to get to know you. :)

By the way, this is my Facebook profile: https://www.facebook.com/leniel.macaferi
My Family Tree
Family tree created with the help of the FamilySearch website

ITALIANO
Il mio nome è Leniel Macaferi e sono nato il 06.10.1983 a Carangola, Minas Gerais, Brasile. Sono di origine italiana.
In linea paterna, sono il nipote di una coppia di italiani: Giuseppe Maccaferri, nato il 10.07.1893 a San Felice sul Panaro, Modena, Emilia-Romagna, Italia, e Anna Rosa Cantamessa, nata il 23.07.1906 a Valtesse, Bergamo, Lombardia, Italia.
Giuseppe Maccaferri [1893-1941] è arrivato in Brasile all'età di 2 anni il 16.05.1896 a bordo della nave\vapore [Attività] con i suoi genitori Sperindio Maccaferri [1861 ~ 1935] e Maria Cirelli [1867 ~ 1935]. I miei bisnonni si sono sposati a San Felice sul Panaro il 03.11.1887.
Il registro dell'arrivo si trova sul sito dell'Arquivo Público Mineiro. http://www.siaapm.cultura.mg.gov.br/modules/imigrantes/brtacervo.php?cid=1383
Anna Rosa Cantamessa [1906-1983] è arrivata in Brasile all'età di 7 anni il 20.08.1913 a bordo della nave\vapore [Regina Elena] con i suoi genitori Giuseppe Andrea Cantamessa [1875-1952] e Maria Camilla Irma Colombo [1880 ~ 1921]. I miei bisnonni si sono sposati a Valtesse il 24.09.1901.
Il registro dell'arrivo si trova sul sito del Sistema de Informações do Arquivo Nacional a Rio de Janeiro. http://imagem.sian.an.gov.br/anexos/sian/arquivos/1139635_40882.pdf
Il mio trisnonno Giacomo Cantamessa (padre di Giuseppe) è arrivato in Brasile il 16.12.1897 sulla nave\vapore [Spagne] con la moglie e gli altri figli. Giuseppe fu l'unico a rimanere in Italia ed emigrò in Brasile circa 15 anni più tardi.
Il registro dell'arrivo di Giacomo Cantamessa e della famiglia si trova sul sito dell'Arquivo Público Mineiro. http://www.siaapm.cultura.mg.gov.br/modules/imigrantes/brtacervo.php?cid=16126
Questi sono i cognomi nel mio albero genealogico fino ad oggi:
# | Cognome | Antenato(a) | Relazione | Livello
1 | Maccaferri | Giuseppe Maccaferri | nonno | 2
2 | Cantamessa | Anna Rosa Cantamessa | nonna | 2
3 | Colombo | Maria Camilla Irma Colombo | bisnonna | 3
4 | Cirelli | Maria Cirelli | bisnonna | 3
5 | Bocchi | Maria Bocchi | trisnonna | 4
6 | Bergamini | Candida Maria Filomena Bergamini | trisnonna | 4
7 | Baldis | Eufrosina Baldis | trisnonna | 4
8 | Locatelli | Antonia Locatelli | trisnonna | 4
9 | Dotti | Maria Dotti | quadrisavola | 5
10 | Calzolari | Eleonora Calzolari | quadrisavola | 5
11 | Cattaneo | Orsola Cattaneo | quadrisavola | 5
12 | Falci | Maria Teresa Falci | quadrisavola | 5
13 | Guidetti | Maria Guidetti | quinquisavola | 6
14 | Viscardi | Angela Viscardi | quinquisavola | 6
15 | Luiselli | Antonia Luiselli | quinquisavola | 6

Il mio cognome Macaferi è il modo "brasiliano" di Maccaferri.

ENGLISH
My name is Leniel Macaferi and I was born on October 6, 1983 in Carangola, Minas Gerais, Brazil. I am a descendant of Italians.
On my father's side, I am the grandson of an Italian couple: Giuseppe Maccaferri, born on July 10, 1893 in San Felice sul Panaro, Modena, Emilia-Romagna, Italy, and Anna Rosa Cantamessa, born on July 23, 1906 in Valtesse, Bergamo, Lombardy, Italy.
Giuseppe (Joseph) Maccaferri [1893-1941] arrived in Brazil at the age of 2 on May 16, 1896 aboard the ship [Attività] with his parents Sperindio Maccaferri [1861 ~ 1935] and Maria Cirelli [1867 ~ 1935]. My great-grandparents were married in San Felice sul Panaro on November 3, 1887.
The arrival record is on the website of the Arquivo Público Mineiro. http://www.siaapm.cultura.mg.gov.br/modules/imigrantes/brtacervo.php?cid=1383
Anna Rosa Cantamessa [1906-1983] arrived in Brazil at the age of 7 on August 20, 1913 aboard the ship [Regina Elena] with her parents Giuseppe Andrea Cantamessa [1875-1952] and Maria Camilla Irma Colombo [1880 ~ 1921]. My great-grandparents were married in Valtesse on September 24, 1901.
The arrival record is on the website of the National Archive Information System in Rio de Janeiro. http://imagem.sian.an.gov.br/anexos/sian/arquivos/1139635_40882.pdf
My great-great-grandfather Giacomo Cantamessa (Giuseppe's father) arrived in Brazil on December 16, 1897 aboard the ship [Spagne] with his wife and the other children. Giuseppe was the only one who stayed in Italy; he emigrated to Brazil about 15 years later.
The arrival record of Giacomo Cantamessa and his family is on the website of the Arquivo Público Mineiro. http://www.siaapm.cultura.mg.gov.br/modules/imigrantes/brtacervo.php?cid=16126
These are the surnames in my family tree so far:
# | Surname | Ancestor | Relationship | Level
1 | Maccaferri | Giuseppe Maccaferri | grandfather | 2
2 | Cantamessa | Anna Rosa Cantamessa | grandmother | 2
3 | Colombo | Maria Camilla Irma Colombo | great-grandmother | 3
4 | Cirelli | Maria Cirelli | great-grandmother | 3
5 | Bocchi | Maria Bocchi | 2x-great-grandmother | 4
6 | Bergamini | Candida Maria Filomena Bergamini | 2x-great-grandmother | 4
7 | Baldis | Eufrosina Baldis | 2x-great-grandmother | 4
8 | Locatelli | Antonia Locatelli | 2x-great-grandmother | 4
9 | Dotti | Maria Dotti | 3x-great-grandmother | 5
10 | Calzolari | Eleonora Calzolari | 3x-great-grandmother | 5
11 | Cattaneo | Orsola Cattaneo | 3x-great-grandmother | 5
12 | Falci | Maria Teresa Falci | 3x-great-grandmother | 5
13 | Guidetti | Maria Guidetti | 4x-great-grandmother | 6
14 | Viscardi | Angela Viscardi | 4x-great-grandmother | 6
15 | Luiselli | Antonia Luiselli | 4x-great-grandmother | 6

My last name, Macaferi, is the "Brazilianized" form of Maccaferri.

PORTUGUÊS
Me chamo Leniel Macaferi e nasci em 06.10.1983 em Carangola, Minas Gerais, Brasil. Sou descendente de italianos.
Na linha paterna, sou neto de um casal de italianos: Giuseppe Maccaferri, nascido em 10.07.1893 em San Felice sul Panaro, Modena, Emilia-Romagna, Itália, e Anna Rosa Cantamessa, nascida em 23.07.1906 em Valtesse, Bergamo, Lombardia, Itália.
Giuseppe (José) Maccaferri [1893-1941] chegou ao Brasil com 2 anos de idade em 16.05.1896 no navio\vapor [Attività] com seus pais Sperindio Maccaferri [1861~1935] e Maria Cirelli [1867~1935]. Meus bisavós casaram-se em San Felice sul Panaro em 03.11.1887.
O registro de chegada está no site do Arquivo Público Mineiro. http://www.siaapm.cultura.mg.gov.br/modules/imigrantes/brtacervo.php?cid=1383
Anna Rosa Cantamessa [1906-1983] chegou ao Brasil com 7 anos de idade em 20.08.1913 no navio\vapor [Regina Elena] com seus pais Giuseppe Andrea Cantamessa [1875-1952] e Maria Camilla Irma Colombo [1880~1921]. Meus bisavós casaram-se em Valtesse em 24.09.1901.
O registro de chegada está no site do Sistema de Informações do Arquivo Nacional no Rio de Janeiro. http://imagem.sian.an.gov.br/anexos/sian/arquivos/1139635_40882.pdf
Meu trisavô Giacomo Cantamessa (pai de Giuseppe) chegou ao Brasil em 16.12.1897 no navio\vapor [Espagne] com a mulher e os outros filhos. Giuseppe foi o único que ficou na Itália e emigrou para o Brasil aproximadamente 15 anos depois.
O registro da chegada de Giacomo Cantamessa e família está no site do Arquivo Público Mineiro. http://www.siaapm.cultura.mg.gov.br/modules/imigrantes/brtacervo.php?cid=16126
Estes são os sobrenomes na minha árvore familiar até o presente momento:
# | Sobrenome | Antepassado(a) | Relação | Nível
1 | Maccaferri | Giuseppe Maccaferri | avô | 2
2 | Cantamessa | Anna Rosa Cantamessa | avó | 2
3 | Colombo | Maria Camilla Irma Colombo | bisavó | 3
4 | Cirelli | Maria Cirelli | bisavó | 3
5 | Bocchi | Maria Bocchi | trisavó | 4
6 | Bergamini | Candida Maria Filomena Bergamini | trisavó | 4
7 | Baldis | Eufrosina Baldis | trisavó | 4
8 | Locatelli | Antonia Locatelli | trisavó | 4
9 | Dotti | Maria Dotti | tetravó | 5
10 | Calzolari | Eleonora Calzolari | tetravó | 5
11 | Cattaneo | Orsola Cattaneo | tetravó | 5
12 | Falci | Maria Teresa Falci | tetravó | 5
13 | Guidetti | Maria Guidetti | pentavó | 6
14 | Viscardi | Angela Viscardi | pentavó | 6
15 | Luiselli | Antonia Luiselli | pentavó | 6

Meu sobrenome Macaferi é a forma “abrasileirada” de Maccaferri.