The Hortonworks Connected Data Platforms

As part of the Big Data Platform Distributions week, I will have a closer look at the Hortonworks distribution.

Hortonworks was founded in 2011 by 24 engineers from the original Hadoop team at Yahoo!, among them the founders Rob Bearden, Alan Gates, Arun Murthy, Devaraj Das, Mahadev Konar, Owen O’Malley, Sanjay Radia, and Suresh Srinivas. The name Hortonworks refers to Horton the Elephant, a nod to the toy elephant after which Hadoop is named.

The Hortonworks solution aims to offer a platform that can process and store both data-in-motion and data-at-rest. This platform combines Hortonworks Data Flow (HDF) and Hortonworks Data Platform (HDP®). Hortonworks is therefore not only about running Hadoop (HDP); it also connects data platforms via HDF.

Since its founding, Hortonworks has held one fundamental belief: “The only way to deliver infrastructure platform technology is completely in open source.” Hortonworks is also a member of the Open Data Platform Initiative: “A nonprofit organization committed to simplification & standardization of the Big Data ecosystem with common reference specifications and test suites.”

Hortonworks Data Flow

The Hortonworks Data Flow solution for data-in-motion includes three key components:

  • Data Flow Management Systems – a drag-&-drop visual interface based on Apache NiFi / MiNiFi. Apache NiFi is a robust and secure framework for routing, transforming, and delivering data across a multitude of systems. Apache MiNiFi, a lightweight agent, was created as a subproject of Apache NiFi and focuses on collecting data at the source.
  • Stream Processing – HDF supports Apache Storm and Kafka. The added value is in the GUI of Streaming Analytics Manager (SAM), which eliminates the need to code streaming data flows.
  • Enterprise Services – making sure that everything works together in an enterprise environment. HDF supports Apache Ranger (security) and Ambari (provisioning, management and monitoring). The Schema Registry builds a catalog so data streams can be reused.
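
To make the "collect, transform, route, deliver" idea behind the data flow management component a bit more concrete, here is a minimal sketch in plain Python. It is not the NiFi or MiNiFi API: the FlowFile class only borrows NiFi's term for a unit of data, and the functions collect, transform, route and run_flow, as well as the "alerts" and "archive" destinations, are hypothetical names made up for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FlowFile:
    """A unit of data moving through the flow: a payload plus attributes.

    Hypothetical stand-in; this is not NiFi's actual FlowFile class.
    """
    content: str
    attributes: Dict[str, str] = field(default_factory=dict)


def collect(lines: List[str]) -> List[FlowFile]:
    """Collect data at the source (the role a lightweight edge agent plays)."""
    return [FlowFile(content=line, attributes={"source": "sensor"}) for line in lines]


def transform(flow_file: FlowFile) -> FlowFile:
    """Normalise the payload and record that it was transformed."""
    attributes = dict(flow_file.attributes, transformed="true")
    return FlowFile(content=flow_file.content.strip().lower(), attributes=attributes)


def route(flow_file: FlowFile) -> str:
    """Choose a destination based on the content (assumed rule for this demo)."""
    return "alerts" if "error" in flow_file.content else "archive"


def run_flow(lines: List[str]) -> Dict[str, List[FlowFile]]:
    """Run every collected record through transform and route, then deliver it."""
    destinations: Dict[str, List[FlowFile]] = {"alerts": [], "archive": []}
    for flow_file in collect(lines):
        flow_file = transform(flow_file)
        destinations[route(flow_file)].append(flow_file)
    return destinations


result = run_flow(["  Temperature OK ", "ERROR: sensor offline"])
for destination, flow_files in result.items():
    print(destination, [flow_file.content for flow_file in flow_files])
```

In HDF itself these steps are configured visually rather than coded; the sketch only shows the shape of the flow that such a canvas describes.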

[Image: HDF data-in-motion platform]

Streaming Analytics Manager and Schema Registry are both open source projects. At the time of writing, neither of them is an Apache Software Foundation project.

Hortonworks Data Platform

The Hortonworks solution for data-at-rest is the Hortonworks Data Platform (HDP). HDP consists of the following components.

[Image: Hortonworks Data Platform components]

Hortonworks is also available in the cloud, through two specific products:

  • Azure HDInsight – a collaboration between Microsoft and Hortonworks to offer a Big Data Analytics platform on the Azure Cloud.
  • Hortonworks Data Cloud for AWS – deploy Hortonworks Data Cloud Hadoop clusters on AWS infrastructure.

How to get started?

The best way to get to know the Hortonworks products is by getting your hands dirty. Hortonworks offers Sandboxes for both HDP and HDF. These Sandboxes come in different flavours: VMware, VirtualBox and Docker. Go and download a copy here. For questions and other interactions, go to the Hortonworks community.

Thanks for reading.