As part of the Big Data Platform Distributions week, I will have a closer look at the Hortonworks distribution.
Hortonworks was founded in 2011 by 24 engineers from the original Hadoop team at Yahoo!, including founders Rob Bearden, Alan Gates, Arun Murthy, Devaraj Das, Mahadev Konar, Owen O’Malley, Sanjay Radia, and Suresh Srinivas. The name Hortonworks refers to Horton the Elephant, echoing the elephant theme behind the name Hadoop.
The Hortonworks solution aims to offer a platform to process and store data-in-motion as well as data-at-rest. The platform is a combination of Hortonworks Data Flow (HDF™) and Hortonworks Data Platform (HDP®). This way Hortonworks is not only about doing Hadoop (HDP); it also connects data platforms via HDF.
Since its founding, Hortonworks has held a fundamental belief: “The only way to deliver infrastructure platform technology is completely in open source.” Hortonworks is also a member of the Open Data Platform Initiative, “a nonprofit organization committed to simplification & standardization of the Big Data ecosystem with common reference specifications and test suites.”
Hortonworks Data Flow
The Hortonworks Data Flow solution for data-in-motion includes 3 key components:
- Data Flow Management Systems – a drag-&-drop visual interface based on Apache NiFi / MiNiFi. Apache NiFi is a robust and secure framework for routing, transforming, and delivering data across a multitude of systems. Apache MiNiFi, a lightweight agent created as a subproject of Apache NiFi, focuses on collecting data at the source.
- Stream Processing – HDF supports Apache Storm and Apache Kafka. The added value is the GUI of the Streaming Analytics Manager (SAM), which eliminates the need to hand-code streaming data flows.
- Enterprise Services – making sure that everything works together in an enterprise environment. HDF supports Apache Ranger (security) and Apache Ambari (provisioning, management, and monitoring). The Schema Registry builds a catalog so data streams can be reused.
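To make the streaming idea concrete: a typical SAM flow consumes a stream of events, aggregates them over a time window, and emits counts. Here is a toy sketch of that tumbling-window aggregation in plain Python — no Kafka, Storm, or SAM involved, and the click events and window size are made-up assumptions purely for illustration:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed, non-overlapping
    time windows and count occurrences of each key per window."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        # Align the event to the start of its window.
        window_start = ts - (ts % window_seconds)
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

# Made-up click events: (epoch seconds, page visited)
events = [(0, "home"), (3, "home"), (7, "cart"), (12, "home"), (14, "cart")]
print(tumbling_window_counts(events, window_seconds=10))
# {0: {'home': 2, 'cart': 1}, 10: {'home': 1, 'cart': 1}}
```

In HDF, the same logic would be assembled visually in SAM on top of a Kafka topic rather than written as code — that is exactly the point of the tool.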
Streaming Analytics Manager and Schema Registry are both open-source projects, but at the time of writing neither is part of the Apache Software Foundation.
Hortonworks Data Platform
The Hortonworks solution for data-at-rest is the Hortonworks Data Platform (HDP). HDP consists of the following components:
- Data Management – YARN (resource management) and HDFS (scalable, fault-tolerant, and cost-effective storage) are the two key components of the Hortonworks Data Platform solution.
- Data Access – Interact with the data in any way, from batch to streaming, based on Apache™ Hadoop® open-source projects such as Apache Pig, Apache Hive, Apache HBase, Apache Storm, and Apache Spark.
- Data Governance and Integration – Quickly and easily load data, and manage it according to policy, using Apache Knox and Apache Ranger.
- Security – Authenticate, authorise, account and protect data.
- Operations – Provision, manage, monitor, and operate Hadoop clusters at scale using Apache Ambari, Apache Oozie, and Apache ZooKeeper.
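The batch side of Data Access — Pig, Hive, and classic Hadoop jobs — ultimately comes down to the map/shuffle/reduce pattern over records stored in HDFS. As a rough illustration of that pattern (a word count, the canonical Hadoop example), here is a toy sketch in plain Python with no Hadoop involved; the sample lines are made up:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all values by key, as Hadoop does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["Hadoop stores data", "YARN schedules data jobs"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'hadoop': 1, 'stores': 1, 'data': 2, 'yarn': 1, 'schedules': 1, 'jobs': 1}
```

On a real cluster the map and reduce phases run in parallel across nodes, with YARN scheduling the work and HDFS providing the input and output storage — and tools like Hive and Pig generate this kind of job for you from higher-level queries.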
Hortonworks is also available in the cloud with two specific products:
- Azure HDInsight – a collaboration between Microsoft and Hortonworks to offer a Big Data Analytics platform on the Azure Cloud.
- Hortonworks Data Cloud for AWS – deploy Hortonworks Data Cloud Hadoop clusters on AWS infrastructure.
How to get started?
The best way to get to know the Hortonworks products is by getting your hands dirty. Hortonworks offers sandboxes on a VM for both HDP and HDF. These VMs come in different flavours, such as VMware, VirtualBox, and Docker. Go and download a copy here. For questions and other interactions, head to the Hortonworks community.
Thanks for reading.