
Surfing the Big Data Wave at H2O World


The recent H2O World event showcased H2O's open-source, scalable machine learning in the cloud, intended for people who are familiar with R but limited by its scalability. H2O can run on Hadoop and also on Apache Spark.



By Arun Swami, special to KDnuggets, Nov 2014.

0xdata (http://h2o.ai/) provides software that lets data scientists run machine learning models at scale quickly and easily. The intended audience seems to be people who are familiar with R but limited by its scalability. H2O lets data scientists distribute machine learning algorithms over a cluster. Not all machine learning algorithms are supported yet, but the supported list is quite impressive. Many R functions and structures are supported, but H2O is unlikely ever to be a clone of R. 0xdata seems to use a freemium model: the basic software is free and open source, and enterprises can buy a premium license that provides 24/7 support, help with optimizing and scaling clusters, etc. For details, please refer to the company Web site.

H2O World took place on November 18-19 in Mountain View, CA at the Computer History Museum. This is a report on Day 1, primarily devoted to sessions that were significantly “hands on” to help attendees get a feel for how they could use the H2O suite of tools.

The day started with an introduction to the company and its mission by Sri Ambati (CEO, Co-founder). The rationale for H2O can be summarized by:

  • Faster: Minutes vs. hours/days
  • Bigger: Bigger dataset / Cluster Mode
  • Better: Ease of Sampling and Feature Selection

Cliff Click (CTO, Co-founder) gave a high-level presentation of the architecture. He showed how data and computation are distributed across the cluster (the platform is written in Java). According to him, 100GB datasets can be handled easily, and the team is moving towards handling 1TB datasets. Analysis can be run using either a Web UI or RStudio.
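To give a feel for what distributing data and computation means, here is a minimal toy sketch in Python. It is not H2O code and does not reflect H2O's Java implementation: the function names are hypothetical, and threads on one machine stand in for cluster nodes. The idea it illustrates is that each worker summarizes only its own chunk of the data, and the small partial results are then combined into a global answer.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_chunks):
    """Partition a list into roughly equal contiguous chunks."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_stats(chunk):
    """Map step: a worker summarizes only the chunk it holds."""
    return sum(chunk), len(chunk)


def distributed_mean(values, n_workers=4):
    """Reduce step: combine per-chunk partial sums into a global mean."""
    if not values:
        raise ValueError("values must not be empty")
    chunks = split_into_chunks(values, n_workers)
    # Threads are a stand-in for cluster nodes in this illustration.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


if __name__ == "__main__":
    data = list(range(1, 1001))
    print(distributed_mean(data))  # 500.5
```

Only the per-chunk sums and counts travel between workers, never the raw rows, which is why this pattern can scale to datasets that do not fit on a single machine.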

Amy Wang gave a fast-paced tutorial on running different machine learning models on H2O. H2O ships with a number of models out of the box that can run on a distributed cluster, and more are being added. For many models, only some of the features are supported. For example, Generalized Linear Models do not support weights, and Gradient Boosting Machines (GBM) do not support different loss functions.

Tom Kraljevic (VP Engineering) gave a talk on "Using H2O in Big Data Environments". H2O can be run on Hadoop (YARN), and there is a project (Sparkling Water) to run H2O as an application on top of Spark. The team plans to interoperate with Spark MLlib.

