Some years ago, my colleague Lukas Forer and I developed a platform to simplify the complete lifecycle of a Hadoop program. This includes steps like putting data into HDFS, executing a MapReduce job from the command line, and exporting data back to the local file system. These steps are of course doable with some basic Unix knowledge. Nevertheless, for people who are used to working with graphical interfaces, or who want to combine several tools (Hadoop, Spark, Unix, R) into a workflow, this can be a major barrier to discovering the beauty of Hadoop. The simplest possible MapReduce command looks like this:
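As a stand-in for the original snippet, here is a minimal run of the WordCount example that ships with Hadoop; the jar name and HDFS paths are placeholders you would adapt to your installation:

```sh
# Copy local input into HDFS
hadoop fs -put input.txt /user/me/input/

# Run the bundled WordCount job (the jar path varies by distribution)
hadoop jar hadoop-mapreduce-examples.jar wordcount /user/me/input /user/me/output

# Export the results back to the local file system
hadoop fs -get /user/me/output/part-r-00000 result.txt
```

Note how even this toy example already involves all three lifecycle steps: import, execution, and export.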
To abstract this command away, we developed Cloudgene, consisting of (a) a workflow system (including the workflow definition language WDL) and (b) a cloud orchestration platform. The idea behind Cloudgene is quite simple: if you are able to execute your Hadoop program on the command line, take a few minutes and write a YAML configuration to connect your program with Cloudgene. By doing so, you can transform your Hadoop command line program (or a set of programs) into a web-based service, present your collaborators with a scalable, best-practice workflow, and provide reproducible science. The YAML configuration for the command above looks like this:
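The original configuration isn't reproduced here; the sketch below illustrates the general shape of such a file, wiring the WordCount jar to named HDFS inputs and outputs. Treat the keys as illustrative and consult the Cloudgene documentation for the authoritative schema:

```yaml
name: WordCount
version: 1.0
description: Counts words in a text file stored in HDFS

mapred:
  # Mirrors the command line call above; $input and $output
  # are resolved from the declared parameters below
  jar: hadoop-mapreduce-examples.jar
  params: wordcount $input $output

  inputs:
    - id: input
      description: Input Folder
      type: hdfs-folder

  outputs:
    - id: output
      description: Output Folder
      type: hdfs-folder
```

From a description like this, Cloudgene can render the declared inputs as a web form, submit the job to the cluster, and offer the outputs for download, which is exactly the abstraction described above.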
Cloudgene in Action
Cloudgene has simplified our lives a lot over the last few years, and two recent services built on it show the success of the platform: the first is the mtDNA-Server, a heteroplasmy and contamination pipeline for next-generation sequencing data developed by my colleague Hansi Weißenteiner. The second is the quite popular Michigan Imputation Server for genotype imputation based on minimac3.
For now, check out one of the web services to get a feeling for what Cloudgene can do for you. Both services will be introduced here in more detail in the near future.
In the next post, I’ll show you how to combine the ideas of my first blog post with Cloudgene, resulting in a Hadoop-As-A-Service approach for local usage.