Friday 29 May 2015

Strata + Hadoop World, London, 2015

Introduction



I recently attended the Strata + Hadoop World conference in London from 5 to 7 May 2015, including two half-day tutorial sessions (on Apache Cassandra and R/RStudio) on the first day.

This conference is the main annual Hadoop conference in Europe and is strongly oriented towards business users and the technology industry.  It is sponsored by the technology publisher O’Reilly and by Cloudera, one of the main commercial Hadoop suppliers, with additional sponsorship from several other major companies in the Big Data arena.  The conference exhibitors also included smaller suppliers of tools and services to the commercial Big Data sector.  IT service providers and consultancies were heavily represented here.

This year’s conference offered a wide range of talks, tutorials and (at extra expense) workshops on a variety of topics broadly relating to Hadoop and Big Data, including several sessions on Apache Spark, which seems to be emerging as a significant cross-platform technology for Big Data.

Other talks looked at business applications of Big Data technology e.g. in banking, retail and telecoms, as well as less commercial topics such as Big Data for health research.

From the attendees’ directory provided by the conference organisers, there seemed to be around 900 attendees.  A large proportion of these were from outside the UK, mainly from elsewhere in Europe but also from Asia.  The conference organisers announced that this year would see the first Strata Hadoop World in Asia (in Singapore), which is clearly aimed at a growing market for these Big Data events worldwide.

General themes


Hadoop is going mainstream but moving past the hype


Not surprisingly, the main focus of many sessions was around Hadoop and related tools and applications.  It was striking to see that several major organisations are already integrating Hadoop into their business processes.  Although many of us still struggle with basic Hadoop installation and architecture, it is clear that these technologies are increasingly seen as a mainstream platform for providing business-critical services.  Prominent Hadoop users mentioned here included BT, Barclays, Bloomberg and IBM.

It was also encouraging to see that there is a growing body of expertise around how to use these technologies e.g. sessions on designing and implementing enterprise applications with Hadoop.  Access to this expertise is vital, as there is still a shortage of reliable and up-to-date documentation on how to build large scale applications with Big Data technologies in practice.

It is still early days compared to traditional relational databases systems, where the industry has decades of experience, but my impression is that Hadoop is no longer seen as a risky cutting-edge technology by mainstream commercial users.

However, it is interesting to note that InfoWorld reported recently that demand for Hadoop is falling as people start to look more closely at alternative Big Data platforms.  It is suggested that this is partly due to the lack of Hadoop skills on the market, and to the availability of tools like NoSQL databases that may be more suitable for many users.  And of course, it may simply be that Hadoop is finally moving past the initial "peak of inflated expectations" in the Gartner hype cycle.

If so, then this is a welcome development, as a more critical attitude towards the Hadoop eco-system on the part of customers may further encourage suppliers to focus on providing tools and expertise to make it easier (and cheaper) to adopt and exploit Hadoop where appropriate.

Hadoop eco-system is growing


The industry already seems to recognise this need for a broader range of services and support for organisations that are starting to explore Hadoop as a platform for enterprise applications.  A growing number of technology and service providers are building their own products and services around Hadoop.  These range from massive companies like Amazon, eBay and Google, through established technology suppliers such as SAS Institute, Autodesk and Informatica, to smaller specialist consultancies and software suppliers offering a range of tools and services.

To some extent, one might still ask whether some of these companies are simply applying “Big Data” gloss to their existing products (e.g. Informatica), without necessarily representing a major transformation in their products themselves.  But this desire to link their products to Hadoop at least indicates a general sense of confidence that Hadoop is a viable technology platform on which to build enterprise data services for the longer term.  Even if the glory days of the hype cycle are behind us.

Data science still cool


If Hadoop is no longer quite so cool, there was more reassuring news for Big Data hipsters, as another prominent strand in the conference offering was the growing application of data science to real-world business needs.  This ranged from simple statistical analysis and imaginative data visualisation to machine learning for analytics and reporting.  One of the short keynote talks was from Shazam's Cait O'Riordan, describing how the music identification service is able to spot hits before they happen.

There were all-day workshops on Apache Spark and on machine learning with Python, which ran alongside the conference and were sold out well in advance, despite being quite expensive.  There is obviously a strong demand for training in these technologies, although it is not clear how far these technical skills are being applied in practice or to what extent they are complemented and informed by a solid understanding of basic statistics.

At least some “data science” is probably still more “data” than “science”, but many organisations are clearly aware of the need to build their data science skills, and several were actively recruiting for “data scientists” and “information architects” at the conference.

Big Data > Hadoop, and Google wants a slice of the action


Although this conference was primarily about Hadoop, other Big Data storage and processing technologies featured in several sessions.

There was a half-day tutorial on Apache Cassandra, which provided a quick introduction to this distributed, high-availability, column-family database, used for commercial Big Data applications such as video streaming (Netflix, Sky) and time series data (Apple, British Gas, eBay).  Cassandra's column-family data model is restrictive but very powerful in the right context.
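To make that data model a little more concrete, here is a toy plain-Python sketch (not real Cassandra; the function names and sensor data are invented for illustration) of the column-family idea: each partition key maps to rows kept sorted by a clustering key, and queries must always target a single partition, which is exactly the restriction that makes it so effective for time-series workloads:

```python
from bisect import insort

# partition key -> list of (clustering key, column values), kept sorted
table = {}

def insert(pk, ck, values):
    insort(table.setdefault(pk, []), (ck, values))

def query(pk, ck_from=None, ck_to=None):
    # A query always names exactly one partition; within it, range scans
    # on the clustering key are cheap because rows are stored in order.
    return [(ck, v) for ck, v in table.get(pk, [])
            if (ck_from is None or ck >= ck_from)
            and (ck_to is None or ck <= ck_to)]

insert("sensor-1", "2015-05-05T11:00", {"temp": 19.0})
insert("sensor-1", "2015-05-05T10:00", {"temp": 18.2})
insert("sensor-2", "2015-05-05T10:00", {"temp": 21.5})

print(query("sensor-1"))  # both sensor-1 rows, in clustering-key order
```

There is no way in this model to ask "all readings across all sensors for 11:00" efficiently; you design your tables around your queries, which is the trade-off the tutorial emphasised.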

Apache Mesos is an abstraction layer for managing distributed resources transparently, which can be combined with cluster-based tools such as Spark or Cassandra.  Mesosphere, the company behind the commercial Mesos-based “data centre operating system” of the same name, has also partnered with Google to offer it as an easy-deploy option on Google Cloud Platform.

Google announced a new version of their BigTable NoSQL database service, which will support the existing Hadoop HBase API, effectively making BigTable a potential replacement for HBase in some applications.  BigTable was one of the inspirations for HBase (and Apache Cassandra) so this may represent a certain degree of convergence.

Of course, using Google BigTable would require users to commit to the Google cloud platform for this aspect of their application architecture.  Google has recently been promoting its Google Cloud Platform as a potential alternative to Amazon’s more comprehensive (and complex) cloud services, so this announcement is clearly part of a larger strategy to gain a greater share of the market for cloud-based PaaS products.

Apache Spark is definitely one to watch


Databricks is the company behind Apache Spark and was set up by the core Spark creators. They were strongly represented at this conference, with contributions from Paco Nathan and co-founder Patrick Wendell.  Databricks supported the Spark training course that ran alongside the conference, and there were also several talks on various aspects of Apache Spark for machine learning and general data processing beyond the Hadoop platform.

Apache Spark offers a distinctive approach to distributed data processing, which can be deployed in various modes e.g. on top of a Hadoop cluster, on a Mesos cluster, on a stand-alone cluster or on a local machine e.g. for development or data exploration.

IBM veteran Rod Smith highlighted Spark in a short talk on emerging technologies, pointing to its unified programming model for batch, real-time and interactive processing that offers a way to integrate data from multiple sources, including streaming data.  He also mentioned growing interest in notebook-style interfaces e.g. IPython Notebook.

BT and Barclays are both looking at or already using Spark in some of their data pipelines, while other companies are starting to build tools and services around Spark e.g. Spanish start-up Stratio demonstrated their new SparkTA data analytics platform.

DataStax, the company behind the Apache Cassandra database, now offers a Spark Connector to allow Cassandra and Spark to work together on the same cluster, although this is not yet compatible with the current version of Spark.

Meanwhile, Databricks is currently testing its cloud-based PaaS Databricks Cloud, which will provide a notebook-style interface to a hosted Spark installation, allowing users to upload, transform, analyse and report on their data entirely via the browser.

This looks like a promising and powerful tool, especially for data scientists/analysts, although it is not yet clear when it will be available, or what the hosting model will be.  However, Patrick Wendell, one of the developers of Spark and co-founder of Databricks, told me that it will be possible to deploy Databricks Cloud onto a private AWS environment, which might help to address concerns over issues such as data location and confidentiality.

Spark is still developing quickly, with new features being added every few months, but despite its relative immaturity there seems to be a strong interest in exploring Spark’s potential for Big Data applications from developers, tool providers and businesses.

Conclusion


As discussed above, this conference gave a strong sense that the core Hadoop platform has come of age and is now being used for mainstream business applications in the real world.  This is helping to generate a wider eco-system of skills, services and tools around Hadoop, which should help later arrivals to make an easier transition to these distinctive distributed data technologies by exploiting the knowledge and expertise now available in the market.

The future of large scale data storage in the enterprise is clearly going to be based on distributed platforms, whether this is on Hadoop, Mesos or some other technology such as NoSQL databases, or indeed a combination of these.  Streaming for real-time data processing is a particularly hot topic, offering scope for new applications that would not previously have been possible.

Technology and service providers are all keen to exploit this relatively new market, both through improvements to existing products (SAS, Informatica, HP etc) and through new and innovative products (e.g. Databricks Cloud) that take advantage of the power of these new technologies.

Perhaps ironically, other popular topics covered technologies that have been developed or adopted as a response to perceived historical shortcomings in the core Hadoop platform.  Google’s HBase-on-BigTable announcement is clearly aimed at attracting existing HBase users away from Hadoop. Meanwhile, Apache Cassandra has long offered an alternative column-family database with its own distinctive features compared to HBase.

Apache Spark seems to be especially promising as a widely applicable and very flexible general-purpose distributed processing engine for data from a wide range of sources. Spark offers a way to integrate processing for a variety of common data-processing tasks, from data acquisition and ETL through traditional analytics to machine learning and even real-time analysis of streaming data.

Beyond technology, there seems to be a growing awareness of the need for and usefulness of data science skills in industry as well as academia.  Several keynote talks touched on topics familiar to traditional data analysts, such as issues of data quality, provenance, security and confidentiality, as well as the need for a more general understanding of how to make intelligent practical use of Big Data.

Data technology - especially for Big Data - is changing rapidly.  Many organisations are understandably reluctant to put themselves at the bleeding edge of technology, but we still need to be looking at what industry leaders are doing today, in order to help us identify what we should be doing tomorrow.  This conference provided a welcome opportunity to do this.


Wednesday 8 April 2015

Apache Spark 1.3 data frames in IPython Notebook

I recently installed the latest version (1.3) of Apache Spark and started looking at some of the things that have changed since my last post on Spark and IPython Notebook.

There seems to be a lot going on around the Spark SQL libraries, including the new DataFrame abstraction, so I put together a quick IPython notebook exploring just a few aspects of Spark SQL.

Spark SQL is one of the key factors in why my own organisation is looking at Spark for large scale data applications, because it gives us an easier, more consistent and relatively familiar abstraction over data from a variety of sources.  The new DataFrame API seems to fit well within this trend towards more generic approaches to heterogeneous data sources.

In this demo notebook, I loaded up a file of sample CSV data on house sales from the UK Land Registry (see licence), registered it as a Spark SQL table, then performed some simple aggregation queries using the older API and the new DataFrame functions.  This is just a quick experiment, but my impression is that the DataFrame API queries are significantly faster than the older equivalents.
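For readers without a Spark installation to hand, here is a rough plain-Python illustration of the kind of group-by-average aggregation the notebook performs (the column names and prices are invented and do not match the Land Registry schema exactly; in Spark 1.3 the DataFrame equivalent would be something like df.groupBy("county").avg("price")):

```python
from collections import defaultdict

# Toy rows standing in for the Land Registry CSV (column names invented)
rows = [
    {"county": "GREATER LONDON", "price": 300000},
    {"county": "GREATER LONDON", "price": 500000},
    {"county": "KENT", "price": 200000},
]

def avg_price_by_county(rows):
    sums = defaultdict(lambda: [0, 0])  # county -> [running total, count]
    for r in rows:
        s = sums[r["county"]]
        s[0] += r["price"]
        s[1] += 1
    return {county: total / count for county, (total, count) in sums.items()}

print(avg_price_by_county(rows))
# {'GREATER LONDON': 400000.0, 'KENT': 200000.0}
```

The point of Spark, of course, is that it distributes exactly this kind of computation across a cluster when the CSV is too big for one machine.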

As an old database developer, I'm not sure yet how far this API will provide the same flexibility as good old SQL, but it's certainly an interesting and potentially useful option.

My notebook is available for download, and you'll need to install Apache Spark version 1.3 and IPython Notebook (I use the excellent Anaconda Python distribution which includes IPython Notebook).

The download includes my sample CSV file, but you should be able to adapt the code to work with your own CSV data fairly easily.  Enjoy!






Monday 23 February 2015

Apache Spark and IPython Notebook - simple demo

I've been playing around with Apache Spark for a while now, mainly using Scala.  A couple of my colleagues are interested in learning about Spark as well, but they're data scientists, not developers, and they are more comfortable using Python.  So far, so good: Spark has a nifty Python shell and a comprehensive Python API, after all.

But what my colleagues really like is nice interactive tools without too much command-line voodoo.  And they're already using IPython Notebook to give them all this goodness for their Python data science work.

Happily, it turns out you can use IPython Notebook with a local Apache Spark installation, so I wrote a quick demo notebook to illustrate how you can use Spark to do the traditional word count inside a notebook.
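For anyone who wants a feel for what the notebook does before installing Spark, the pipeline boils down to the classic flatMap/map/reduceByKey pattern, which can be sketched in plain Python like this (the tokenisation rule here is a simplification, not exactly what the notebook uses):

```python
import re
from collections import Counter

def word_count(lines):
    # flatMap: each line -> a stream of lower-cased words
    words = (w for line in lines for w in re.findall(r"[a-z']+", line.lower()))
    # map + reduceByKey: emit (word, 1) pairs and sum the counts per word
    return Counter(words)

counts = word_count(["To be or not to be", "that is the question"])
print(counts["to"], counts["be"])  # 2 2
```

Spark's version looks almost the same, except each step runs in parallel across the cluster's partitions of the input files.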

The repo is on BitBucket so you can do a git clone or download the code as a zip-file.  It's just a single notebook, a data folder containing a couple of text files, and a sample shell-script for starting the notebook on Linux.

So if you're new to using Spark with IPython Notebook, feel free to try it out.


Friday 16 January 2015

MongoDB User Authentication - the basics

Overview


When you're just starting out with MongoDB, e.g. playing with it on your home PC, you don't need to worry about setting up database users, because authentication is disabled by default.

But once you start thinking about building an application with MongoDB, you'll probably need to think about which users should be able to read/write to which databases/collections. For example, ordinary users should probably only be able to read/write certain collections in a given database, while your application admin would need to be able to do things like create collections and create indexes etc in the application database.

Coming from an RDBMS background, I found MongoDB's user set-up a bit confusing, so I put together this post on how to set up a couple of users for an application database, where some users can only read/write in a specific database/collection.  I hope this will help you to get started with MongoDB's approach to user permissions etc.

Remember that this is just a learning example, so don't rely on this for your production systems.

Make sure you refer to the MongoDB authentication guide for reliable and up-to-date advice on how to do this for real!

This example also relies on the localhost exception when enabling authentication and creating your first user, so make sure this is working on your server machine.  You don't want to lock yourself out of your own database!

What we will do here


This post will show you how you can do the following:

  • enable authentication on a local MongoDB server
  • create a user administrator (you need to do this as soon as you enable authentication)
  • create a DBA super-user who can do anything
  • create an application-specific database
  • create an admin user for the application database only
  • create a couple of collections inside the application database
  • create an application-specific user role with read/write permissions for specific collections only
  • create an application-specific user with that role
  • check that this user can only access the specified collections

That should be enough to get you started with MongoDB user authentication.

Databases


  • admin:  this is where you define privileges that apply to any database.
  • myappdb: an application-specific database where you want to define local user privileges.

Roles


  • root: can do anything.
  • userAdminAnyDatabase:  can manage users on any database.
  • dbOwner:  can manage the relevant DB (we use this in myappdb).
  • myAppUserRole:  has limited access to specific collections in myappdb.

Users


  • siteUserAdmin:  this user can manage users across the MongoDB server.
  • superuser:  this user can do anything on any DB via the “root” role.
  • myappowner:  this user manages the “myappdb” database via the “dbOwner” role.
  • bob: this user has the “myAppUserRole” role within “myappdb” only.


Enable authentication on MongoDB server


Modify configuration to enable authentication


This is based on an Ubuntu server (other platforms may have slightly different config files).

Stop MongoDB:

sudo service mongod stop 

Edit /etc/mongod.conf (in Ubuntu) and set auth=true.   
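The relevant fragment of the old key=value-style config file might then look something like this (the dbpath/logpath values shown are just typical Ubuntu package defaults; newer MongoDB releases use a YAML config with security.authorization: enabled instead):

```ini
# /etc/mongod.conf (old key=value format)
dbpath=/var/lib/mongodb
logpath=/var/log/mongodb/mongod.log
logappend=true
auth=true
```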

This is the simplest authentication method, but you can also set up other mechanisms such as keyfile or x.509 certificate authentication, as described in the MongoDB documentation (see above).

Restart MongoDB:

sudo service mongod start 

Create default user administrator (must connect via localhost)


Connect to MongoDB shell on the DB server (localhost): 

mongo 

Switch to admin DB and create the site user administrator:

use admin 
db.createUser({ 
    user: "siteUserAdmin", 
    pwd: "secret", 
    roles: [ { role: "userAdminAnyDatabase", db: "admin" } ] 
  }) 

Reconnect to the MongoDB shell on the DB server (localhost) as this user:

mongo -u siteUserAdmin -p secret --authenticationDatabase admin

Create a super-user


Log into the MongoDB shell as the site user administrator (as above). 

Switch to admin DB and create a “root” user called “superuser”: 

use admin 
db.createUser(
    {
      user: "superuser",
      pwd: "123456",
      roles: [ "root" ]
    }
)

Set up authentication for your application


Create an application database


Connect to the MongoDB shell as the super-user (authenticated via the admin DB):

mongo -u superuser -p 123456 --authenticationDatabase admin

Create your application database:  

use myappdb 

Create a DB owner for the application database


Make sure you are logged in as the super-user and working in myappdb (see above).

Create the application owner called “myappowner” for this database: 

db.createUser( 
  { user: "myappowner",  
    pwd: "secret",  
    roles:["dbOwner"]}) 

Reconnect to MongoDB as the DB owner (authenticated via your application DB):

mongo -u myappowner -p secret --authenticationDatabase myappdb

Create collections in application database


Log in as application owner (see above) and switch to myappdb: 

use myappdb 

Create a collection that we will later define as read-only for ordinary users:

db.myappreadonly.insert({"Foo":"Bar"}) 

Check it worked: 

db.myappreadonly.find() 
{ "_id" : ObjectId("54b7a63850d47a6b57031616"), "Foo" : "Bar" } 

Now create a read-write collection:

db.myappreadwrite.insert({"Baz":"Bax"}) 

Check it worked: 

db.myappreadwrite.find() 
{ "_id" : ObjectId("54b7a67b50d47a6b57031617"), "Baz" : "Bax" }

Create an application-specific user role


Make sure you are logged in as the application owner (see above) and using myappdb.  Now create a local user role on this DB, with appropriate access to the collections we created above:

db.createRole({
  role: "myAppUserRole",
  privileges: [
    { resource: { db: "myappdb", collection: "myappreadonly" },
      actions: [ "find" ] },
    { resource: { db: "myappdb", collection: "myappreadwrite" },
      actions: [ "find", "update", "insert" ] }
  ],
  roles: []
})

Create a specific user with this role:

db.createUser( 
  {user:"bob",  
   pwd: "secret",  
   roles:["myAppUserRole"]}) 



Check your application user has correct permissions


Reconnect as this user and switch to myappdb:

mongo -u bob -p secret --authenticationDatabase myappdb
use myappdb 

See if you can read the collections correctly:

db.myappreadonly.find() 

{ "_id" : ObjectId("54b7a63850d47a6b57031616"), "Foo" : "Bar" } 
db.myappreadwrite.find() 

{ "_id" : ObjectId("54b7a67b50d47a6b57031617"), "Baz" : "Bax" } 

You should be able to insert into the read-write collection:

db.myappreadwrite.insert({"bob":"allowed"}) 

WriteResult({ "nInserted" : 1 }) 

db.myappreadwrite.find() 

{ "_id" : ObjectId("54b7a67b50d47a6b57031617"), "Baz" : "Bax" }
{ "_id" : ObjectId("54b7a98099923fa639b3c70f"), "bob" : "allowed" }

You should not be able to insert into the read-only collection:

db.myappreadonly.insert({"bob":"not allowed"}) 

WriteResult({
  "writeError" : {
    "code" : 13,
    "errmsg" : "not authorized on myappdb to execute command { insert: \"myappreadonly\", documents: [ { _id: ObjectId('54b7a97299923fa639b3c70e'), bob: \"not allowed\" } ], ordered: true }"
  }
})


Check the application user cannot do anything else


e.g. cannot show all collections in myappdb:

show collections 

2015-01-15T11:49:29.040+0000 error: {
  "$err" : "not authorized for query on myappdb.system.namespaces",
  "code" : 13
} at src/mongo/shell/query.js:131

e.g. cannot create collections or insert data on another DB:

use test 

switched to db test 

db.bobcoll.insert({"bob":"bad"}) 

WriteResult({
  "writeError" : {
    "code" : 13,
    "errmsg" : "not authorized on test to execute command { insert: \"bobcoll\", documents: [ { _id: ObjectId('54b7aa6699923fa639b3c710'), bob: \"bad\" } ], ordered: true }"
  }
})


Conclusion


  • Using these techniques, it should be possible to set up a simple set of users/roles to manage basic access to your database and application.  
  • MongoDB provides many more roles, e.g. for managing access to clusters, but these are beyond the scope of this simple introduction. 
  • See the MongoDB manual for comprehensive advice on security.