Big Data Platform Distributions week – Wrap up

Wrapping up a week of Big Data Platform comparisons. A closer look @ #Cloudera, #MapR and #Hortonworks.

This last week I took a slightly closer look at three of the most well-known Big Data Platform Distributions: Cloudera, MapR and Hortonworks. It’s interesting to see how differently the various distributions approach the same data challenge.

Which Big Data Platform Distribution is the best?

The three Big Data Platform Distributions each have a different focus. Here are a few things that make each of the top three vendors stand out:

  • Cloudera – Proven, user-friendly technology.
    • Use Case: Enterprise Data Hub. Let the Hadoop platform serve as a central data repository.
  • MapR – Stable platform with a generic file system and fast processing.
    • Use Case: Integrated platform with a focus on streaming.
  • Hortonworks – 100% open source with minimal investment.
    • Use Case: Modernising your traditional EDW.

There is no easy answer to the question “Which Big Data Platform Distribution is the best?”. My answer would be: “It depends”. It depends on various factors:

  • Performance – MapR has an extra focus on speed and performance and therefore developed its own file system (MapR-FS) as well as its own NoSQL database, MapR-DB.
  • Scalability – Hadoop is known to scale very well. All three offer software to manage this effectively; Cloudera & MapR go for proprietary solutions.
  • Reliability – Before Hadoop 2.0 the NameNode was the single point of failure (SPOF) in an HDFS cluster. MapR takes a different, more distributed approach with its own file system, MapR-FS.
  • Manageability – Cloudera & MapR add (proprietary) management software to their distribution. Hortonworks opts for the open-source equivalents.
  • Licenses – All three offer downloadable free versions of their software. Both Cloudera & MapR add additional features for their paying customers.
  • Support – All three are part of the Hadoop community as contributors & committers. They contribute and commit (updated) code back to the open source repository.
  • Upgrades – Cloudera & Hortonworks are both known for their quick adoption of new technologies. Hortonworks seems to be the quickest to get things production-ready.
  • OS Support – Hortonworks supports the Microsoft Windows OS. Microsoft packaged Hortonworks into its own HDInsight (both on-premises and in the Azure cloud).
  • Training – It looks like Cloudera offers the most complete and professional training program. This is also reflected in the price.
  • Tutorials – All three offer various tutorials and sandboxes to get started.

Back to the question “Which Big Data Platform Distribution is the best?”: go ahead and find out for yourself. Determine which of the points above are important to your situation and try them out for yourself.

If you have anything to contribute, please let me know. I haven’t performed a thorough comparison yet. Maybe Gartner can help out a bit as well.

Thanks for reading.

The Cloudera Enterprise Data Hub

As part of the Big Data Platform Distributions week, I will have a closer look at the Cloudera distribution.

Cloudera was founded in 2008 by a few people from the Silicon Valley scene, with backgrounds at companies like Google, Yahoo!, Facebook and Oracle.

Doug Cutting, co-creator of Hadoop, joined the company in 2009 as Chief Architect, a role in which he is still active.

Cloudera offers an architecture which serves as an Enterprise Data Hub (EDH). The Hadoop platform serves as a central data repository. As opposed to traditional data management systems, data is not transferred (ETL / ELT) from A to B. The data is ingested into and stored on the EDH, and processed, analysed and served where it resides on the data platform.

Cloudera Enterprise Data Hub

The core of the Cloudera Distribution is based on the Apache™ Hadoop® open-source project, complemented by projects like Impala, Kudu and Sentry, which were created inside Cloudera and contributed back to the open-source community.

One product which sets Cloudera apart from the other distributions is Cloudera Manager (CM). According to Cloudera, Cloudera Manager is the best way to install, configure, manage, and monitor the Apache Hadoop stack. Some will argue that the open-source Apache Ambari is the better option. I won’t go into the details of which one is better; as with a lot of things, the answer would be: it depends. From what I hear and read, Cloudera Manager is a must-have for administrators because of its rich functionality. A downside of CM is that it is a proprietary product and therefore cannot benefit from the innovations of the community. That doesn’t necessarily mean the open-source alternatives are more open in practice, because they too tend to be used within one specific distribution.
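Beyond the UI, Cloudera Manager also exposes a REST API that administrators can script against. A minimal sketch, assuming the Python requests library; the host, credentials and API version below are placeholders for your own installation:

```python
# Hedged sketch: list the clusters managed by a Cloudera Manager
# instance through its REST API. Host, credentials and API version
# are placeholders -- adjust them to your own CM installation.
import requests

CM_HOST = "http://cm-host.example.com:7180"  # 7180 is CM's default port

resp = requests.get(
    CM_HOST + "/api/v12/clusters",           # API version varies per CM release
    auth=("admin", "admin"),                 # placeholder credentials
)
resp.raise_for_status()

for cluster in resp.json()["items"]:
    print(cluster["name"], cluster.get("fullVersion", ""))
```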

The Enterprise Data Hub is the flagship product within Cloudera, complemented by a number of other products.

As the leader in Apache Hadoop-based data platforms, Cloudera has the enterprise quality and expertise that make them the right choice to work with on Oracle Big Data Appliance.
— Andy Mendelson, Senior Vice President, Oracle Server Technologies

Oracle has taken the “don’t DIY” philosophy one step further. They have created the Oracle Big Data Appliance (BDA), of which the sixth hardware generation was recently announced and is now generally available.

Check out more about the collaboration between Cloudera and Oracle here.

“Oracle, Intel, and Cloudera partner to co-engineer integrated hardware and software into the Oracle Big Data Appliance, an engineered system designed to provide high performance and scalable data processing environment for Big Data.” See the video for the interview.

How to get started?

The best way to get to know the Cloudera product(s) is by getting your hands dirty. Cloudera offers Quickstart virtual machine (VM) images. These VMs come in different flavours, like VMware, VirtualBox and Docker. Go and download a copy of the current 5.12 version here. For questions and other interactions, go to the Cloudera community.

Thanks for reading.

Big Data Platform Distributions week

There is a lot going on when it comes to Big Data: all kinds of new and improved techniques to use data. Have a look at things like Machine Learning, Deep Learning or Artificial Intelligence. All these techniques use (Big) Data. I will not go into the discussion of what Big Data exactly means. In the end it’s all about data, whether it is structured (e.g. relational, spreadsheet, etc.), semi-structured (e.g. log files) or unstructured (e.g. pictures, videos).

This blog is the start of a series of blogs taking a closer look at technical implementations of Big Data. I am aware of the fact that there is a whole world around Big Data. Things like a (full) data architecture or the actual request for information are often forgotten. The tension between Business and IT also deserves special attention.

What is Hadoop?

If we look at data from a technical perspective, we see one term popping up every time: “Hadoop”. What is Hadoop and why would I need it?

“The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.”

Hadoop (/həˈduːp/) is based on work done at Google in the late 1990s/early 2000s. According to the co-founders of Hadoop, Doug Cutting & Mike Cafarella, Hadoop originated from the Google File System paper that was published in October 2003. Doug Cutting named the project after his son’s toy elephant.

Back then, Google had a challenge. They wanted to index the entire web, which required massive amounts of storage and a new approach to processing these large amounts of data. Google found a solution in the Google File System (GFS) and distributed MapReduce (described in a paper released in 2004).

Hadoop was first built as part of the Nutch project. It was meant to serve as an infrastructure to crawl the web and store a search engine index for the crawled pages. HDFS is used as a distributed filesystem that can store data across thousands of servers, while MapReduce distributes jobs across the various machines, running the work close to the data.

According to the project page, Hadoop is built around three core components:


  • Hadoop Distributed File System (HDFS) – Stores data
  • Hadoop MapReduce – Processes data
  • Hadoop YARN – Schedules work
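To make the MapReduce component a bit more concrete: with Hadoop Streaming, the Map and Reduce steps can be written as plain scripts that read from stdin and write to stdout. Below is a minimal word-count sketch in Python; the file names are placeholders and it assumes a cluster with the Hadoop Streaming jar available.

```python
#!/usr/bin/env python
# mapper.py -- the Map step: emit a (word, 1) pair for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- the Reduce step: sum the counts per word.
# Hadoop's sort & shuffle phase delivers the mapper output sorted
# by key, so a simple running total per word suffices.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

The same pair can be tried out without a cluster by piping a text file through mapper.py, a local sort and reducer.py, where the sort stands in for Hadoop’s shuffle phase.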

These core Hadoop components are surrounded by a whole ecosystem of Hadoop projects. This open-source ecosystem provides all kinds of projects to solve real data problems and to support the different challenges within a data-driven environment.

This is just an impression of the possible Hadoop ecosystem projects. There is a more up-to-date list here, which provides “…a summary to keep the track of Hadoop related projects…”.

Why would I need Hadoop?

There are a few reasons why one would need Hadoop. The most important one is that the current amount of data is growing faster than the ability of e.g. RDBMS systems to store and process it. Next to that, the traditional data storage alternatives are no longer cost-effective. Hadoop offers an approach based on low-cost commodity hardware, which makes it easy to scale up and down when necessary. Data is distributed over this hardware when it is stored, and the processing of this data takes place where it is stored.

One of the big challenges while setting up a Hadoop environment is: “Where to start?” Starting a single-node Hadoop cluster could be a first step, but that is just the start. What to do next? Which projects (and which versions) to include? When to upgrade which project? Is a project already production-ready? And what about things like support (issues, bugs, technical assistance), service level agreements (SLAs), compliance, etc.?

There are several distributions which provide a solution to the questions above. An additional benefit is that the organisations behind these distributions are part of the Hadoop community. They contribute and commit (updated) code back to the open source repository.

For this series I will focus on three of the largest distributions within the community: Cloudera, MapR and Hortonworks. Please check out my findings in the blogposts of this series.

Thanks for reading.

 

RM BI Forum 2014 Notes – Cloudera Hadoop Masterclass


The Rittman Mead BI Forum started off with a one-day Hadoop Masterclass, provided by Lars George. As he messaged us the day before, we would learn what Hadoop is all about, what its major components are, and how to acquire, process and provide data as part of a production data processing pipeline. To that effect, Lars advised that it would be useful to follow along with the examples in the course and to have an environment handy. That would allow us to experiment at our convenience during and after the class. He directed us to the following link: the Cloudera Quickstart VM.

Lars recommends the following: “Select the CDH5 version of the VM. Please select a virtual machine image matching your VM platform of choice. If you do not have a VM host application installed yet, you can choose from a few available ones. VirtualBox is provided by Oracle and a great choice to use. It can be downloaded here. Set up the VM application, then download and start the Cloudera Quickstart VM to run on top of it. It is as easy as that.”
Find below a few notes I took during the Masterclass.
Lars divided the Masterclass into four parts.

I – Introduction into Hadoop

  • What is Big Data? – It’s not necessarily about volume, but also about format and speed. Three V’s: Volume, Variety and Velocity.
  • Introduction to Hadoop
  • HDFS
  • MapReduce
  • YARN
  • Cluster Planning

Hadoop is Open Source and Apache licensed — http://hadoop.apache.org
Many developers & contributors (e.g. Cloudera, Apple)
Many related projects, applications, tools
Hadoop is not a single system but a set of tools and projects which work together. You should decide, for each part of the architecture, which tool to use and how to use it.

Hadoop – where to get it?
Load, Process and Analyze data
Hadoop Concept – distribute data in the system.
Process the data where it resides
No network processing
High-level code (Java)
No communication between nodes
Data stored on different machines in advance

MapReduce Data Flow (illustrated below):
  • Map
  • Sort and Shuffle
  • Reduce
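A tiny, self-contained Python sketch of that flow (illustrative only; a real job runs the three phases distributed across the cluster):

```python
# In-memory illustration of the MapReduce data flow:
# Map emits key/value pairs, Sort & Shuffle groups them by key,
# Reduce aggregates each group.
from itertools import groupby
from operator import itemgetter

lines = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: one (word, 1) pair per word
mapped = [(word, 1) for line in lines for word in line.split()]

# Sort & Shuffle: bring identical keys together
mapped.sort(key=itemgetter(0))

# Reduce: aggregate the values per key
for word, group in groupby(mapped, key=itemgetter(0)):
    print(word, sum(count for _, count in group))
```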

II – Ingress and Egress

Ingress – moving data into Hadoop (HDFS)
Flume (Near Real-Time Pipeline)
  • Source
  • (File) Channel
  • Sink —> polls, collects and writes to e.g. HDFS
Sqoop – transfer data between relational databases (Oracle, Teradata, SQL Server, etc.) and HDFS
Oracle Database Driver for Sqoop – OraOop by Quest 
File formats are important to keep in mind when you want to get the data out again.
Simple File versus Container (Structured) File
Parquet vs Google Dremel
BI Integration
  • Sqoop
  • HDFS Connector
  • ODBC/JDBC

III – NoSQL and Hadoop

ACID (atomicity, consistency, isolation, durability) 

IV – Analyzing Big Data

  • Pig
  • Hive (HiveServer2 instead of HiveServer1)
  • Impala
  • Search – Lucene
  • Data Pipelines (micro – macro)
  • Oozie (Workflow Server)
  • Information Architecture – Where / How to store data and how to secure this structure
  • Spark (code in Java, Python or Scala; see the sketch below)
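As a taste of that Spark API, a minimal word count in PySpark (a sketch, assuming a local Spark installation; the input path is a placeholder):

```python
# Minimal PySpark word count -- run with spark-submit or inside pyspark.
# Assumes a local Spark installation; "input.txt" is a placeholder path.
from pyspark import SparkContext

sc = SparkContext("local", "WordCount")

counts = (sc.textFile("input.txt")                # read lines
            .flatMap(lambda line: line.split())   # split into words
            .map(lambda word: (word, 1))          # emit (word, 1)
            .reduceByKey(lambda a, b: a + b))     # sum per word

for word, count in counts.collect():
    print("%s\t%d" % (word, count))

sc.stop()
```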
I think Lars could have talked about Hadoop for two more days (with or without slides). Hadoop is all about making choices. There are many similar tools, projects, concepts, etc. It all depends on what you want to achieve.
Although this Masterclass was very informative, I still struggle to see the use case at this moment. A lot of my customers are still struggling with their ’normal’ data…

Big Data and Analytics Top Ten Trends for 2014

Oracle recently published their view on the Top Ten “Big Data” & “Analytics” Trends for 2014. Find the list below:

1. Business Users Get Hooked on Mobile Analytics –> Oracle Business Intelligence Mobile App Designer

2. Analytics Take to the Cloud –> Oracle Applications Cloud

3. Hadoop-Based Data Reservoirs Unite with Data Warehouses –> Your Data Warehouse and Hadoop – Better Together Featuring Cloudera Webcast

4. New Skills Bolster Big Data Investments –> In search of Insight and Foresight – Getting more out of Big Data

5. Big Data Discovery is the Secret to Workforce Success for HCM –> Harnessing the Power of Employee Sentiment

6. Predictive Analytics Lend Fresh Insight into Big Data Strategies –> Oracle Advanced Analytics Option

7. Predictive Analytics Bring New Insight to Old Business Processes –> Big Data @ work – Real-Time Fraud Detection and Prevention

  • White Paper Big Data Analytics: Advanced Analytics in Oracle Database
  • White Paper Bringing R to the Enterprise
  • White Paper Oracle Engineered Systems – Engineered for Extreme Performance

8. Decision Optimization Technologies Enhance Human Intuition –> Oracle Real-Time Decisions Review by James Taylor

9. Business Leaders Embrace Packaged Analytics –> Packaged Analytic Applications: Accelerating Time and Value

10. New Skills Launch New Horizons of Analysis –> Oracle Training

If you are interested in Succeeding with Enterprise Performance Management in 2014, please check here.

Check it out yourself

Check here for the Pre-Built Developer VMs (for Oracle VM VirtualBox). Pay special attention to the ‘SampleApp V309’ (Oracle BI Foundation) and the ‘BigDataLite 2.5’. If you are interested in Endeca, you might want to download an Endeca Virtual Machine here.

There are also some hosted environments online for Oracle BI (prodney / Admin123) and Endeca (publicuser@oracle.com / Admin123).

If you would like some guidance, you might want to check out the Oracle Learning Library (OLL).