Friday, 29 May 2015

Strata + Hadoop World, London, 2015

Introduction



I recently attended the Strata + Hadoop World conference in London from 5 to 7 May 2015, including two half-day tutorial sessions (on Apache Cassandra and R/RStudio) on the first day.

This conference is the main annual Hadoop conference in Europe and is strongly oriented towards business users and the technology industry.  It is sponsored by the technology publisher O’Reilly and by Cloudera, one of the main commercial Hadoop suppliers, with additional sponsorship from several other major companies in the Big Data arena.  The conference exhibitors also included smaller suppliers of tools and services to the commercial Big Data sector.  IT service providers and consultancies were heavily represented here.

This year’s conference offered a wide range of talks, tutorials and (at extra expense) workshops on a variety of topics broadly relating to Hadoop and Big Data, including several sessions on Apache Spark, which seems to be emerging as a significant cross-platform technology for Big Data.

Other talks looked at business applications of Big Data technology e.g. in banking, retail and telecoms, as well as less commercial topics such as Big Data for health research.

From the attendees’ directory provided by the conference organisers, there seemed to be around 900 attendees.  A large proportion of these were from outside the UK, mainly from elsewhere in Europe but also from Asia.  The conference organisers announced that this year would see the first Strata Hadoop World in Asia (in Singapore), which is clearly aimed at a growing market for these Big Data events worldwide.

General themes


Hadoop is going mainstream but moving past the hype


Not surprisingly, many sessions focused on Hadoop and related tools and applications.  It was striking to see that several major organisations are already integrating Hadoop into their business processes.  Although many of us still struggle with basic Hadoop installation and architecture, it is clear that these technologies are increasingly seen as a mainstream platform for providing business-critical services.  Prominent Hadoop users mentioned here included BT, Barclays, Bloomberg and IBM.

It was also encouraging to see that there is a growing body of expertise around how to use these technologies, e.g. sessions on designing and implementing enterprise applications with Hadoop.  Access to this expertise is vital, as there is still a shortage of reliable and up-to-date documentation on how to build large-scale applications with Big Data technologies in practice.

It is still early days compared to traditional relational database systems, where the industry has decades of experience, but my impression is that Hadoop is no longer seen as a risky cutting-edge technology by mainstream commercial users.

However, it is interesting to note that InfoWorld reported recently that demand for Hadoop is falling as people start to look more closely at alternative Big Data platforms.  The report suggests that this is partly due to the lack of Hadoop skills on the market, and to the availability of tools like NoSQL databases that may be more suitable for many users.  And of course, it may simply be that Hadoop is finally moving past the initial "peak of inflated expectations" in the Gartner hype cycle.

If so, then this is a welcome development, as a more critical attitude towards the Hadoop eco-system on the part of customers may further encourage suppliers to focus on providing tools and expertise to make it easier (and cheaper) to adopt and exploit Hadoop where appropriate.

Hadoop eco-system is growing


The industry already seems to recognise this need for a broader range of services and support for organisations that are starting to explore Hadoop as a platform for enterprise applications.  A growing number of technology and service providers are building their own products and services around Hadoop.  These range from massive companies like Amazon, eBay and Google, through established technology suppliers such as SAS Institute, Autodesk and Informatica, to smaller specialist consultancies and software suppliers offering a range of tools and services.

To some extent, one might still ask whether some of these companies (e.g. Informatica) are simply applying a “Big Data” gloss to their existing products, without any major transformation in the products themselves.  But this desire to link their products to Hadoop at least indicates a general sense of confidence that Hadoop is a viable technology platform on which to build enterprise data services for the longer term, even if the glory days of the hype cycle are behind us.

Data science still cool


If Hadoop is no longer quite so cool, there was more reassuring news for Big Data hipsters, as another prominent strand in the conference offering was the growing application of data science to real-world business needs.  This ranged from simple statistical analysis and imaginative data visualisation to machine learning for analytics and reporting.  One of the short keynote talks was from Cait O’Riordan, describing how music identification service Shazam is able to spot hits before they happen.

There were all-day workshops on Apache Spark and on machine learning with Python, which ran alongside the conference and were sold out well in advance, despite being quite expensive.  There is obviously a strong demand for training in these technologies, although it is not clear how far these technical skills are being applied in practice or to what extent they are complemented and informed by a solid understanding of basic statistics.

At least some “data science” is probably still more “data” than “science”, but many organisations are clearly aware of the need to build their data science skills, and several were actively recruiting for “data scientists” and “information architects” at the conference.

Big Data > Hadoop, and Google wants a slice of the action


Although this conference was primarily about Hadoop, other Big Data storage and processing technologies featured in several sessions.

There was a half-day tutorial on Apache Cassandra, which provided a quick introduction to this distributed, high-availability, column-family database.  Cassandra is used for commercial Big Data applications such as video streaming (Netflix, Sky) and time series data (Apple, British Gas, eBay).  Its data model is restrictive but very powerful in the right context.
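
To give a feel for why a column-family model suits time series data, here is a minimal sketch in plain Python.  It is a toy in-memory stand-in, not Cassandra or its driver API: the class name ColumnFamily, the method names and the meter readings are all made up for illustration.  The idea is that each partition key (e.g. a meter ID) owns a run of rows kept sorted by a clustering key (e.g. a timestamp), so a range query within one partition is cheap, while queries that cut across partitions are not directly supported.

```python
from bisect import insort
from collections import defaultdict


class ColumnFamily:
    """Toy in-memory stand-in for a column-family table (not Cassandra itself)."""

    def __init__(self):
        # partition key -> list of (clustering key, value), kept in sorted order
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        # insort keeps each partition ordered by clustering key on every write
        insort(self._partitions[partition_key], (clustering_key, value))

    def slice(self, partition_key, start, end):
        """Return rows of ONE partition with start <= clustering key < end."""
        rows = self._partitions.get(partition_key, [])
        return [(ck, v) for ck, v in rows if start <= ck < end]


# Hypothetical meter readings, written out of time order on purpose.
readings = ColumnFamily()
readings.insert("meter-42", "2015-05-05T10:00", 1.2)
readings.insert("meter-42", "2015-05-05T09:00", 0.9)
readings.insert("meter-7", "2015-05-05T09:30", 3.4)

one_day = readings.slice("meter-42", "2015-05-05", "2015-05-06")
print(one_day)
```

The restriction is visible in the interface: there is no way to ask for "all readings above 3.0" without scanning every partition, which is roughly the trade-off the tutorial described.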

Apache Mesos is an abstraction layer for managing distributed resources transparently, which can be combined with cluster-based tools such as Spark or Cassandra.  Google also offers the commercial Mesos-based “data centre operating system” from Mesosphere as an easy-deploy option on its Google Cloud Platform.

Google announced a new version of their BigTable NoSQL database service, which will support the existing Hadoop HBase API, effectively making BigTable a potential replacement for HBase in some applications.  BigTable was one of the inspirations for HBase (and Apache Cassandra) so this may represent a certain degree of convergence.

Of course, using Google BigTable would require users to commit to the Google cloud platform for this aspect of their application architecture.  Google has recently been promoting its Google Cloud Platform as a potential alternative to Amazon’s more comprehensive (and complex) cloud services, so this announcement is clearly part of a larger strategy to gain a greater share of the market for cloud-based PaaS products.

Apache Spark is definitely one to watch


Databricks is the company behind Apache Spark and was set up by the core Spark creators.  The company was strongly represented at this conference, with contributions from Paco Nathan and co-founder Patrick Wendell.  Databricks supported the Spark training course that ran alongside the conference, and there were also several talks on various aspects of Apache Spark for machine learning and general data processing beyond the Hadoop platform.

Apache Spark offers a distinctive approach to distributed data processing, which can be deployed in various modes e.g. on top of a Hadoop cluster, on a Mesos cluster, on a stand-alone cluster or on a local machine e.g. for development or data exploration.
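
In practice, the deployment mode is usually selected through the "master URL" passed to Spark when an application starts.  The helper below is only a sketch in plain Python that builds such a URL string.  The function name, the mode labels and the host names are my own invention; the URL formats and default ports (7077 for a stand-alone master, 5050 for a Mesos master) follow the Spark documentation as far as I know, and older Spark releases used "yarn-client" or "yarn-cluster" rather than plain "yarn" for Hadoop clusters.

```python
def spark_master_url(mode, host=None, port=None, local_threads=None):
    """Build a Spark master URL for a deployment mode (illustrative helper only)."""
    if mode == "local":
        # local[*] uses all available cores; local[N] uses N worker threads
        threads = "*" if local_threads is None else str(local_threads)
        return "local[{}]".format(threads)
    if mode == "yarn":
        # On Hadoop/YARN the cluster location comes from the Hadoop configuration
        return "yarn"

    # Modes that need an explicit master host: (URL scheme, default port)
    schemes = {"standalone": ("spark", 7077), "mesos": ("mesos", 5050)}
    if mode not in schemes:
        raise ValueError("unknown deployment mode: {!r}".format(mode))
    if host is None:
        raise ValueError("mode {!r} needs a master host".format(mode))
    scheme, default_port = schemes[mode]
    return "{}://{}:{}".format(scheme, host, port or default_port)


# Hypothetical host names, for illustration only.
print(spark_master_url("local"))
print(spark_master_url("standalone", host="cluster-head"))
print(spark_master_url("mesos", host="mesos-master", port=5051))
```

The point is that the application code itself stays the same across these modes, so a job developed on a laptop in local mode can be pointed at a cluster by changing this one setting.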

IBM veteran Rod Smith highlighted Spark in a short talk on emerging technologies, pointing to its unified programming model for batch, real-time and interactive processing, which offers a way to integrate data from multiple sources, including streaming data.  He also mentioned growing interest in notebook-style interfaces, e.g. IPython Notebook.
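
The appeal of a unified model is that the same transformation logic can be reused for data at rest and for data arriving as a stream.  The sketch below illustrates that idea in plain Python only; it does not use the Spark API, and all function names are hypothetical.  It loosely mirrors the micro-batch approach of Spark Streaming, where a stream is cut into small batches and each batch is processed with ordinary batch code.

```python
from collections import Counter
from itertools import islice


def count_words(lines):
    """Shared transformation: works on any iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def micro_batches(stream, size):
    """Cut a (possibly unbounded) iterator into lists of at most `size` items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def run_batch(lines):
    """Batch mode: process the whole data set in one go."""
    return count_words(lines)


def run_streaming(stream, batch_size=2):
    """Streaming mode: reuse count_words per micro-batch, yielding running totals."""
    running = Counter()
    for batch in micro_batches(stream, batch_size):
        running.update(count_words(batch))
        yield Counter(running)  # copy, so earlier snapshots are not mutated


lines = ["Spark on Hadoop", "Spark on Mesos", "Spark standalone"]
batch_result = run_batch(lines)
stream_results = list(run_streaming(iter(lines), batch_size=2))

# Once the stream is exhausted, the running total matches the batch answer.
print(stream_results[-1] == batch_result)
```

Only the thin run_batch and run_streaming wrappers differ; the business logic in count_words is written once, which is the practical benefit being claimed for a unified engine.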

BT and Barclays are both looking at or already using Spark in some of their data pipelines, while other companies are starting to build tools and services around Spark e.g. Spanish start-up Stratio demonstrated their new SparkTA data analytics platform.

DataStax, the company behind the Apache Cassandra database, now offers a Spark Connector to allow Cassandra and Spark to work together on the same cluster, although this is not yet compatible with the current version of Spark.

Meanwhile, Databricks is currently testing its cloud-based PaaS Databricks Cloud, which will provide a notebook-style interface to a hosted Spark installation, allowing users to upload, transform, analyse and report on their data entirely via the browser.

This looks like a promising and powerful tool, especially for data scientists/analysts, although it is not yet clear when it will be available, or what the hosting model will be.  However, Patrick Wendell, one of the developers of Spark and co-founder of Databricks, told me that it will be possible to deploy Databricks Cloud onto a private AWS environment, which might help to address concerns over issues such as data location and confidentiality.

Spark is still developing quickly, with new features being added every few months, but despite its relative immaturity there seems to be a strong interest in exploring Spark’s potential for Big Data applications from developers, tool providers and businesses.

Conclusion


As discussed above, this conference gave a strong sense that the core Hadoop platform has come of age and is now being used for mainstream business applications in the real world.  This is helping to generate a wider eco-system of skills, services and tools around Hadoop, which should help later arrivals to make an easier transition to these distinctive distributed data technologies by exploiting the knowledge and expertise now available in the market.

The future of large-scale data storage in the enterprise is clearly going to be based on distributed platforms, whether this is on Hadoop, Mesos or some other technology such as NoSQL databases, or indeed a combination of these.  Streaming for real-time data processing is a particularly hot topic, offering scope for new applications that would not previously have been possible.

Technology and service providers are all keen to exploit this relatively new market, both through improvements to existing products (SAS, Informatica, HP etc.) and through new and innovative products (e.g. Databricks Cloud) that take advantage of the power of these new technologies.

Perhaps ironically, other popular topics covered technologies that have been developed or adopted as a response to perceived historical shortcomings in the core Hadoop platform.  Google’s HBase-on-BigTable announcement is clearly aiming to attract existing HBase users away from Hadoop.  Meanwhile, Apache Cassandra has long offered an alternative column-family database with its own distinctive features compared to HBase.

Apache Spark seems to be especially promising as a widely applicable and very flexible general-purpose distributed processing engine for data from a wide range of sources. Spark offers a way to integrate processing for a variety of common data-processing tasks, from data acquisition and ETL through traditional analytics to machine learning and even real-time analysis of streaming data.

Beyond technology, there seems to be a growing awareness of the need for and usefulness of data science skills in industry as well as academia.  Several keynote talks touched on topics familiar to traditional data analysts, such as issues of data quality, provenance, security and confidentiality, as well as the need for a more general understanding of how to make intelligent practical use of Big Data.

Data technology, especially for Big Data, is changing rapidly.  Many organisations are understandably reluctant to put themselves at the bleeding edge of technology, but we still need to be looking at what industry leaders are doing today, in order to help us identify what we should be doing tomorrow.  This conference provided a welcome opportunity to do this.