The MapR Converged Data Platform

As part of the Big Data Platform Distributions week, I will have a closer look at the MapR distribution.

John Schroeder founded MapR in 2009 and served as the company’s CEO until 2016.

MapR offers the Converged Data Platform (CDP). The vision behind this platform is to offer one integrated platform for Big Data, which enables batch (e.g. ETL offload, log file analytics), interactive (e.g. BI / analytics) and streaming (e.g. sensor analytics) capabilities. MapR offers one integrated platform to prevent organisations from building data silos and point solutions.

Mapr_Converged_Data_Platform

It would be incorrect to think of the MapR CDP as purely proprietary. Admittedly, the MapR CDP is powered by three proprietary platform services:

  • MapR-FS –> It’s good to understand why MapR decided to introduce its own file system instead of HDFS. Check out the explanation in one of MapR’s Whiteboard videos.
  • MapR-DB –> also explained via a Whiteboard video.
  • MapR Streams –> How does MapR differentiate itself from similar products in the market? Also explained via a Whiteboard video.

Apart from that, MapR supports several projects from the Apache™ Hadoop® ecosystem (Apache Storm, Apache Pig, Apache Hive, Apache Mahout, YARN, Apache Sqoop, Apache Flume, etc.). Apache Drill™ is MapR’s SQL query engine of choice. So the MapR Converged Data Platform is a mix of proprietary and open-source software.
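To give a feel for how Drill is used, here is a minimal sketch that submits a SQL statement to Drill’s REST API with Python. It assumes a Drillbit running locally on the default web port 8047; the file path in the query is made up for illustration:

```python
import requests

# Hypothetical local Drillbit; 8047 is Drill's default web/REST port.
DRILL_URL = "http://localhost:8047/query.json"

payload = {
    "queryType": "SQL",
    # Drill can query raw files (JSON, Parquet, CSV) in place; this path is illustrative.
    "query": "SELECT name, age FROM dfs.`/data/customers.json` LIMIT 10",
}

response = requests.post(DRILL_URL, json=payload, timeout=30)
response.raise_for_status()
print(response.json())  # the result set comes back as JSON
```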

Central to the MapR philosophy is ‘Convergence’. I do not know the exact definition of ‘Convergence’, but in the MapR context it is all about “integrating Data-in-Motion & Data-at-Rest to support real-time applications”. Since the end of 2016 MapR has supported this philosophy with Event-Driven Microservices. The idea behind these microservices (combined with specific APIs) is that they unify all kinds of data (structured, semi-structured and unstructured) as well as streaming & event data. These microservices are designed in such a way that they remove complexity and simplify a range of tasks (a small sketch of such a service follows after the list below).

MapR_api

To get the most out of the CDP and to speed up the process, MapR delivers a Converged Application Blueprint to get things started. This blueprint includes:

  • Sample apps (incl. source)
  • Architecture guides
  • Community-supported best practices (use these wisely)
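To make the event-driven microservices idea a bit more concrete, here is a minimal sketch of a service that publishes events to MapR Streams. MapR Streams exposes a Kafka-compatible API, so the sketch uses the generic Kafka producer interface from the confluent-kafka Python package; the stream path, topic name and client id are made up for illustration, and the code assumes it runs on a node with the MapR client installed:

```python
import json
import time

from confluent_kafka import Producer

# MapR Streams addresses a topic as "/<stream-path>:<topic>"; this path is hypothetical.
TOPIC = "/apps/sensor-stream:temperature"

# On a node with the MapR client, brokers are resolved by the platform itself,
# so no bootstrap.servers entry is given here (assumption for this sketch).
producer = Producer({"client.id": "sensor-microservice"})


def publish_reading(sensor_id, value):
    """Publish one sensor reading as a JSON-encoded event."""
    event = {"sensor": sensor_id, "value": value, "ts": time.time()}
    producer.produce(TOPIC, value=json.dumps(event).encode("utf-8"))


for i in range(10):
    publish_reading("sensor-42", 20.0 + i * 0.1)

producer.flush()  # block until all buffered events have been delivered
```

A consuming microservice would subscribe to the same stream path with the matching Kafka consumer interface, which is how Data-in-Motion and Data-at-Rest end up behind one API.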

The Converged Data Platform is the flagship product within MapR. Other products include:

MapR Converged Data Platform Now Available in Oracle Cloud Marketplace

There are a few areas where MapR and Oracle connect. One is the Oracle Cloud Marketplace, which enables Oracle Cloud customers to run MapR in the Oracle Cloud. Oracle Data Integrator, being open and heterogeneous, also seems to integrate with MapR very well. Check out Issam Hijazi’s findings.

How to get started?

The best way to get to know the MapR product(s) is by getting your hands dirty. Try MapR and download a Sandbox. For questions and other interactions go to the MapR community.

Thanks for reading.

The Cloudera Enterprise Data Hub

As part of the Big Data Platform Distributions week, I will have a closer look at the Cloudera distribution.

Cloudera was founded in 2008 by a few people from the Silicon Valley scene:

Doug Cutting, co-creator of Hadoop, also joined the company in 2009 as Chief Architect. He is still active in that role.

Cloudera offers an architecture which serves as an Enterprise Data Hub (EDH). The Hadoop platform serves as a central data repository. As opposed to traditional data management systems, data is not transferred (ETL / E-LT) from A to B. The data is ingested into and stored on the EDH, and processed, analysed and served where it resides on the platform.

Cloudera Enterprise Data Hub

The core of the Cloudera Distribution is based on the Apache™ Hadoop® open-source project and related Apache projects. These include projects like Impala, Kudu and Sentry, which were created inside Cloudera and contributed back to the open-source community.

One product which sets Cloudera apart from the other distributions is Cloudera Manager (CM). According to Cloudera, Cloudera Manager is the best way to install, configure, manage, and monitor the Apache Hadoop stack. People will argue whether CM really is the best option, because there is also Apache Ambari, which is open-source. I won’t go into details about which one is better; as with a lot of things, the answer would be: it depends. From what I hear and read, Cloudera Manager is a must-have for administrators because of its rich functionality. A downside of CM is that it is a proprietary product and therefore cannot benefit from the innovations of the community. On the other hand, that doesn’t necessarily make the open-source alternatives more open in practice, because they are often only used within a specific distribution.
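Cloudera Manager also exposes a REST API, which gives an impression of the automation it enables beyond the web console. A minimal sketch, assuming a CM server on the default port 7180 and placeholder admin credentials, could list the managed clusters like this (the API version segment varies per CM release):

```python
import requests

# Placeholder Cloudera Manager host and credentials, for illustration only.
CM_HOST = "http://cm-host.example.com:7180"
AUTH = ("admin", "admin")

# The CM REST API is versioned; v19 is just an example and depends on the CM release.
resp = requests.get(f"{CM_HOST}/api/v19/clusters", auth=AUTH, timeout=30)
resp.raise_for_status()

for cluster in resp.json().get("items", []):
    print(cluster["name"], "-", cluster.get("fullVersion", "unknown version"))
```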

The Enterprise Data Hub is the flagship product within Cloudera. Other products include:

As the leader in Apache Hadoop-based data platforms, Cloudera has the enterprise quality and expertise that make them the right choice to work with on Oracle Big Data Appliance.
— Andy Mendelson, Senior Vice President, Oracle Server Technologies

Oracle_Big_Data_Appliance

Oracle has taken the ‘don’t DIY’ philosophy one step further. They have created the Oracle Big Data Appliance (BDA), of which the latest, sixth hardware generation was recently announced and is now generally available.

Check out more about the collaboration between Cloudera and Oracle here.

“Oracle, Intel, and Cloudera partner to co-engineer integrated hardware and software into the Oracle Big Data Appliance, an engineered system designed to provide high performance and scalable data processing environment for Big Data.” See the video for the interview.

How to get started?

The best way to get to know the Cloudera product(s) is by getting your hands dirty. Cloudera offers Quickstart virtual machine images. These VMs come in different flavours, like VMware, VirtualBox and Docker. Go and download a copy of the current 5.12 version here. For questions and other interactions go to the Cloudera community.

Thanks for reading.

Big Data Platform Distributions week

There is a lot going on when it comes to Big Data: all kinds of new or improved techniques to use data. Have a look at things like Machine Learning, Deep Learning or Artificial Intelligence; all of these techniques use (Big) Data. I will not go into the discussion of what Big Data exactly means. In the end it is all about data, whether it is structured (e.g. relational, spreadsheets), semi-structured (e.g. log files) or unstructured (e.g. pictures, videos).

This blog is the start of a series of blog posts taking a closer look at technical implementations of Big Data. I am aware that there is a whole world around Big Data; things like the (full) data architecture or the actual information need are often forgotten. The tension between Business and IT also deserves special attention.

What is Hadoop?

If we look at data from a technical perspective, we see one term popping up every time; “Hadoop”. What is Hadoop and why would I need it?

“The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.”

Hadoop (/həˈduːp/) is based on work done at Google in the late 1990s and early 2000s. According to Hadoop’s co-founders, Doug Cutting & Mike Cafarella, Hadoop originated from the Google File System paper that was published in October 2003. Doug Cutting named the project after his son’s toy elephant.

Back then, Google had a challenge: they wanted to index the entire web, which required massive amounts of storage and a new approach to processing such large amounts of data. Google found a solution in the Google File System (GFS) and distributed MapReduce (described in a paper released in 2004).

Hadoop was first built as part of the Nutch project, which was meant to serve as an infrastructure to crawl the web and store a search-engine index for the crawled pages. HDFS is used as a distributed filesystem that can store data across thousands of servers, while MapReduce runs jobs across those machines, executing the work close to the data.

According to the project page, Hadoop is built around three core components:

Hadoop-logo

  • Hadoop Distributed File System (HDFS) – Stores data
  • Hadoop MapReduce – Processes data (see the sketch below)
  • Hadoop YARN – Schedules work
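To make the MapReduce component a bit more tangible, here is a minimal word-count sketch in the style of Hadoop Streaming, where the mapper and reducer are plain scripts that read from standard input and write key/value pairs to standard output. The input and output paths in the example invocation are hypothetical:

```python
#!/usr/bin/env python
"""Minimal word count in the style of Hadoop Streaming.

Example (hypothetical) invocation:
  hadoop jar hadoop-streaming.jar \
      -input /data/books -output /data/wordcount \
      -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py
"""
import sys
from itertools import groupby


def mapper(lines):
    # Emit "<word>\t1" for every word; Hadoop sorts these pairs by key
    # before they reach the reducer.
    for line in lines:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer(lines):
    # Input arrives sorted by word, so consecutive lines with the same key
    # can be summed with a simple group-by.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")


if __name__ == "__main__":
    if sys.argv[1:] == ["map"]:
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)
```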

These core Hadoop components are surrounded by a whole ecosystem of Hadoop projects. This open-source ecosystem provides all kinds of projects to solve real data problems. There are projects to support the different challenges within a data-driven environment:

This list is just an impression of the possible Hadoop ecosystem projects. There is a more up-to-date list here, which provides “…a summary to keep the track of Hadoop related projects…”.

Why would I need Hadoop?

There are a few reasons why one would need Hadoop. The most important one is that the amount of data is growing faster than the ability of, for example, RDBMS systems to store and process it. Next to that, traditional data-storage alternatives are no longer cost-effective. Hadoop offers an approach based on low-cost commodity hardware, which makes it easy to scale up and down when necessary. Data is distributed over this hardware when it is stored, and the processing of that data takes place where it is stored.
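As a small illustration of talking to that distributed storage layer, HDFS also exposes a REST interface (WebHDFS). A minimal sketch that lists a directory could look like this; the NameNode host, port and path are placeholders:

```python
import requests

# Placeholder NameNode host; 50070 is the classic NameNode HTTP port in Hadoop 2.x.
NAMENODE = "http://namenode.example.com:50070"

# List all files and directories under a (hypothetical) HDFS path via WebHDFS.
resp = requests.get(f"{NAMENODE}/webhdfs/v1/data/raw",
                    params={"op": "LISTSTATUS"}, timeout=30)
resp.raise_for_status()

for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["type"], entry["pathSuffix"], entry["length"])
```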

One of the big challenges when setting up a Hadoop environment is: “Where to start?” Setting up a single-node Hadoop cluster could be a first step, but that is only the start. What to do next? Which projects (and which versions) to include? When to upgrade which project? Is a project already production-ready? What about things like support (issues, bugs, technical assistance), service level agreements (SLAs), compliance, etc.?

There are several distributions which provide a solution to the questions above. An additional benefit is that the organisations behind these distributions are part of the Hadoop community: they contribute and commit (updated) code back to the open-source repositories.

For this series I will focus on three of the largest distributions within the community: Cloudera, MapR and Hortonworks. Please check out my findings in the following blog posts:

Thanks for reading.

 

Extracting Data from BICS / Apex via RESTful Webservice

A few days ago, I read the following post: Extracting Data from BICS / Apex via RESTful Webservice, on the Oracle A-Team Chronicles website. I tried the example to see how things work.

Cocoa Rest Client

The blog post refers to cURL to test the REST endpoints. I tried this out on a Mac and used the Cocoa Rest Client instead.

If you test the created RESTful Service Handler, you get a window with a JSON-formatted result set. You can copy the URL.

Test_SQLWorkshop

Enter the copied URL in the Cocoa Rest Client.

Test_Cocoa

You will find the result in the Response Body of the Cocoa Rest Client. Don’t forget to enter the credentials of your BICS environment.
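If you prefer to script the call instead of using a GUI client, a minimal sketch with Python’s requests library looks like this; the service URL and credentials are placeholders for your own BICS / APEX environment:

```python
import requests

# Placeholder URL of the RESTful Service Handler created in APEX / BICS.
SERVICE_URL = "https://yourinstance.example.com/apex/rest/demo/employees/"

# The handler is protected with your BICS credentials (basic authentication assumed here).
resp = requests.get(SERVICE_URL, auth=("your_user", "your_password"), timeout=30)
resp.raise_for_status()

print(resp.json())  # the handler returns a JSON-formatted result set
```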

Oracle BI Cloud Service (BICS) – Planned Upgrade for 10-July-2015

As of the 10th of July, the Oracle BI Cloud Service environment will be upgraded. I am looking forward to the new features and enhancements. Find below a list of things we can expect.

Fresh new look to Oracle BI Cloud Service

Improved design that’s simple to navigate and easy to use. Includes a brand new Academy to help you get the most out of Oracle BI Cloud Service.

View your data in a heat matrix

Use a heat matrix view to see a two-dimensional depiction of data where values are represented by a gradient of colors.

Visualize geographical data on maps

Use a map view to display data on a map in several different formats and to interact with the data.

Manage your data files

Review, download, and delete the data files you’ve uploaded for analysis. Quickly see whether you’re close to reaching your quota.

Enhancements to Data Modeler

  • Let Oracle BI Cloud Service recommend fact and dimension tables when you first start building the data model.
  • Override the aggregation set for a measure for one or more dimensions.
  • Sort attribute values by a different column.
  • Define variables that return multiple values.
  • Programmatically clear data cached for your model through new REST APIs.
  • A copy of your data model is saved automatically when you publish changes to the model. This makes it easy to revert to a previous version if something goes wrong.

Enjoy.