Wednesday 8 April 2015

Apache Spark 1.3 data frames in IPython Notebook

I recently installed the latest version (1.3) of Apache Spark and started looking at some of the things that have changed since my last post on Spark and IPython Notebook.

There seems to be a lot going on around the Spark SQL libraries, including the new DataFrame abstraction, so I put together a quick IPython notebook exploring just a few aspects of Spark SQL.

Spark SQL is one of the main reasons my own organisation is looking at Spark for large-scale data applications, because it gives us an easier, more consistent and relatively familiar abstraction over data from a variety of sources. The new DataFrame API seems to fit well with this trend towards more generic approaches to heterogeneous data sources.

In this demo notebook, I loaded a file of sample CSV data on house sales from the UK Land Registry (see licence), registered it as a Spark SQL table, then ran some simple aggregation queries using both the older API and the new DataFrame functions. This is just a quick experiment rather than a proper benchmark, but my impression is that the DataFrame API queries are significantly faster than the older equivalents.
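For readers who want the general shape of that workflow without opening the notebook, here is a minimal sketch. It is not the code from my notebook. It assumes a headerless CSV with the price in column 1 and the county in column 13 (both positions are assumptions, so adjust them for your own file), and it treats "the older API" as SQL strings run against a registered temporary table. The function names `parse_sale` and `average_price_by_county` are made up for illustration.

```python
import csv

# Assumed column positions in a headerless house-sales CSV; adjust for your file.
PRICE_IDX = 1
COUNTY_IDX = 13


def parse_sale(line, price_idx=PRICE_IDX, county_idx=COUNTY_IDX):
    """Parse one CSV line into a (county, price) tuple."""
    fields = next(csv.reader([line]))
    return (fields[county_idx], int(fields[price_idx]))


def average_price_by_county(sc, path):
    """Run the same aggregation via a SQL string and via the DataFrame API."""
    # Imported here so the parsing helper can be used without Spark installed.
    from pyspark.sql import SQLContext, Row

    sqlContext = SQLContext(sc)
    rows = sc.textFile(path).map(parse_sale).map(
        lambda t: Row(county=t[0], price=t[1]))
    sales = sqlContext.createDataFrame(rows)

    # SQL-string style: register a temporary table, then query it.
    sales.registerTempTable("sales")
    by_sql = sqlContext.sql(
        "SELECT county, AVG(price) AS avg_price FROM sales GROUP BY county")

    # DataFrame style: the same aggregation expressed as method calls.
    by_df = sales.groupBy("county").avg("price")

    return by_sql.collect(), by_df.collect()
```

In a notebook launched through PySpark, the SparkContext `sc` already exists, so a call would look something like `average_price_by_county(sc, "sales.csv")`, where the file name is a placeholder. Wrapping each query in a timing call is a simple way to compare the two styles on your own data.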

As an old database developer, I'm not sure yet how far this API will provide the same flexibility as good old SQL, but it's certainly an interesting and potentially useful option.

My notebook is available for download. To run it, you'll need to install Apache Spark version 1.3 and IPython Notebook (I use the excellent Anaconda Python distribution, which includes IPython Notebook).

The download includes my sample CSV file, but you should be able to adapt the code to work with your own CSV data fairly easily. Enjoy!