By: Gigaom
Netflix shows off its Hadoop architecture
Netflix is at it again, this time showing off its homemade architecture for running Hadoop workloads in the Amazon Web Services cloud. It's all about the flexibility of being able to run, manage and access multiple clusters while eliminating as many barriers as possible.

Netflix is the undeniable king of computing in the cloud, running almost entirely on the Amazon Web Services platform, and its reign extends into big data workloads, too. In a Thursday evening blog post, the company shared the details of its AWS-based Hadoop architecture and a homemade Hadoop Platform as a Service that it calls Genie.

That Netflix is a heavy Hadoop user is hardly news, though. In June, I explained just how much data Netflix collects about users and some of the methods it uses to analyze that data. Hadoop is the storage and processing engine for much of this work.

As blog post author Sriram Krishnan points out, however, Hadoop is more than a platform on which data scientists and business analysts can do their work. Aside from their 500-plus-node cluster of Elastic MapReduce instances, there’s another equally sized cluster for extract-transform-load (ETL) workloads: essentially, taking data from other sources and making it easy to analyze within Hadoop. Netflix also deploys various “development” clusters as needed, presumably for ad hoc experimental jobs.

And while Netflix’s data-analysis efforts are pretty interesting, the cloud makes its Hadoop architecture pretty interesting, too. For starters, Krishnan explains how using S3 as the storage layer instead of the Hadoop Distributed File System means, among other things, that Netflix can run all of its clusters separately while sharing the same data set. It does, however, use HDFS at some points in the computation process to make up for the inherently slower method of accessing data via S3.
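To make the shared-storage idea concrete, here is a minimal Python sketch. It is a toy model, not Netflix's code: `SharedStore` stands in for the role S3 plays, `Cluster.scratch` stands in for cluster-local storage such as HDFS, and every name in it is hypothetical. The local-copy-on-first-read behavior is an assumption chosen to illustrate why local storage helps, not a description of how Netflix actually uses HDFS.

```python
from typing import Dict


class SharedStore:
    """Toy stand-in for a shared object store (the role S3 plays in the article)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


class Cluster:
    """Hypothetical cluster that reads from the shared store.

    `scratch` is cluster-local storage (an assumption standing in for HDFS):
    the first read copies the object locally, later reads reuse that copy.
    """

    def __init__(self, name: str, store: SharedStore) -> None:
        self.name = name
        self.store = store
        self.scratch: Dict[str, bytes] = {}

    def read(self, key: str) -> bytes:
        if key not in self.scratch:
            self.scratch[key] = self.store.get(key)
        return self.scratch[key]


# Two independent clusters see the same data set without copying it between them.
store = SharedStore()
store.put("events/part-0", b"user,title,rating")

query_cluster = Cluster("query", store)
etl_cluster = Cluster("etl", store)

query_view = query_cluster.read("events/part-0")
etl_view = etl_cluster.read("events/part-0")
```

Because neither cluster owns the data, either one could be shut down or replaced without moving the data set, which is the flexibility the article attributes to using S3 as the storage layer.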

Netflix also built its own PaaS-like layer for Amazon Elastic MapReduce, called Genie. It lets engineers submit jobs via a REST API without having to know the specifics of the underlying infrastructure. This is important because it means Hadoop users can submit jobs to whatever clusters happen to be available at any given time, without worrying about the sometimes-transient nature of cloud resources. (Krishnan goes into some detail about the resource-management aspects of Genie.)
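The routing idea described above can be sketched in a few lines of Python. This is an illustration only and does not reproduce Genie's actual REST API or scheduling logic: the `Cluster` and `JobRouter` classes, the tag-matching rule, and the cluster names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Cluster:
    """Hypothetical record describing one Hadoop cluster."""
    name: str
    tags: Set[str]            # kinds of work the cluster accepts, e.g. {"adhoc"}
    available: bool = True    # cloud clusters can come and go


@dataclass
class JobRouter:
    """Toy stand-in for a job-submission service; not Genie's real API."""
    clusters: List[Cluster] = field(default_factory=list)

    def register(self, cluster: Cluster) -> None:
        self.clusters.append(cluster)

    def submit(self, job_name: str, tag: str) -> str:
        """Return the name of the first available cluster matching the tag."""
        for cluster in self.clusters:
            if cluster.available and tag in cluster.tags:
                return cluster.name
        raise RuntimeError(
            f"no available cluster for job {job_name!r} with tag {tag!r}"
        )


router = JobRouter()
router.register(Cluster("query-cluster-a", {"adhoc"}, available=False))
router.register(Cluster("query-cluster-b", {"adhoc"}))
router.register(Cluster("etl-cluster", {"etl"}))

# The caller names the kind of work, not the cluster that will run it.
chosen = router.submit("daily-report", "adhoc")
print(chosen)  # prints "query-cluster-b"
```

The point of the sketch is the indirection: the submitter names only the job and the kind of work, so a cluster that has gone away (here, `query-cluster-a`) is skipped without the user changing anything.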

We’ve long been pushing the intersection of big data and cloud computing, although the reality is that there aren’t really a lot of commercial options that mix user-friendliness and heavy-duty Hadoop workload management. There’ll no doubt be more offerings in the future — Infochimps and Continuuity are certainly working in this direction, and Amazon is also pushing its big data offerings forward — but, for now, leave it to Netflix to build its own. (And if you’re interested in custom-built Hadoop tools, check out our recent coverage of Facebook’s latest effort.)
