Creating a Docker Image for Hadoop MapReduce (CDH5)


There are many ways to use Docker. In this blog post I’ll summarize the steps I took to create a runnable Hadoop Docker image for the Cloudera distribution (CDH5) of Hadoop, covering both MapReduce MRv1 (the “old” MapReduce) and MRv2 (the “new” MapReduce, aka YARN). Before we start, please make sure Docker is installed on your local OS.

Ready-to-use images for MapReduce v1 and MapReduce v2

To keep it simple, you can of course use one of my previously built Hadoop Docker images, execute Hadoop MapReduce jobs right away, and skip the remaining steps.

MapReduce v1

docker pull seppinho/cdh5-hadoop-mrv1:latest
docker run -it -p 50030:50030 seppinho/cdh5-hadoop-mrv1:latest
sh /usr/bin/execute-wordcount.sh

MapReduce v2 - YARN Architecture

docker pull seppinho/cdh5-hadoop-mrv2:latest
docker run -it -p 19088:19088 seppinho/cdh5-hadoop-mrv2:latest
sh /usr/bin/execute-wordcount.sh

Step by step tutorial

If you want to start from scratch, you first need a basic OS image to work with. For that, pull a fresh Docker Ubuntu image (14.04) and run it. The run command starts a new container, i.e. a running instance of the Ubuntu image.

docker pull ubuntu:14.04
docker run -i -t ubuntu:14.04
# verify that it's really Ubuntu 14.04 and not your local OS
lsb_release -a

Now, back on your local OS (type exit to close the Ubuntu container from before), create a new folder containing an empty file named Dockerfile. The Dockerfile includes all commands necessary to build the new image. Have a look at my GitHub repository for a complete Dockerfile.

mkdir new-docker-image
cd new-docker-image
touch Dockerfile
vi Dockerfile   # or open the file in any editor of your choice
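To give you an idea of what goes into the Dockerfile, here is a rough sketch for the MRv1 case. This is an illustration only: the CDH repository setup, package names, and the start script are assumptions and differ from the real Dockerfile in my repository.

```dockerfile
# Sketch only -- package names, repository setup, and the start
# script below are assumptions, not the repository's actual file.
FROM ubuntu:14.04

# Java is a prerequisite for Hadoop
RUN apt-get update && apt-get install -y openjdk-7-jre-headless curl

# Add Cloudera's CDH5 apt repository and install the
# pseudo-distributed MRv1 packages (names assumed):
# RUN apt-get install -y hadoop-0.20-conf-pseudo

# Expose the JobTracker web UI
EXPOSE 50030

# Start script (hypothetical) that formats HDFS, launches the
# daemons, and keeps the container running in the foreground
CMD ["/usr/bin/hadoop-start.sh"]
```

The EXPOSE instruction only documents the port; it still has to be published with -p 50030:50030 at run time, as shown below.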

When you are satisfied with your Dockerfile, you are ready to build your first Docker image. Just execute the following commands in the directory where the Dockerfile is located. Keep in mind that you need to rerun the build command every time the Dockerfile changes.

docker build --no-cache=false -t hadoop-image .
docker run -i -t -p 50030:50030 hadoop-image

You should now be able to reach the JobTracker web interface at http://localhost:50030 from your local OS and execute a MapReduce job on the command line.

sh /usr/bin/execute-wordcount.sh
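The contents of execute-wordcount.sh are not shown here, but conceptually a WordCount run looks like the following sketch. The HDFS paths and the examples jar location are assumptions (they vary between CDH versions and between MRv1 and MRv2), not the actual script from the image.

```
#!/bin/sh
# Hypothetical sketch -- paths and jar names are assumptions.
# Put some input text into HDFS
hadoop fs -mkdir -p /user/root/input
hadoop fs -put /etc/hosts /user/root/input

# Run the WordCount example shipped with Hadoop (jar path assumed)
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  wordcount /user/root/input /user/root/output

# Inspect the result
hadoop fs -cat /user/root/output/part-r-00000
```

While the job is running you can follow its progress in the JobTracker web interface mentioned above.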

This is all for today. In the next blog post, I’ll show how our Hadoop execution platform Cloudgene uses such an image for an easy installation and execution process.
