Use dependency injection to get Hadoop *out* of your application code

Hadoop MapReduce provides transparent parallelization but often results in specialized code bases that interact with low-level data formats. We present a means of using dependency injection to manage data flows in MapReduce, which in turn supports reusable, Hadoop-agnostic application code that interacts with high-level business domain objects. An example applies dependency injection to the Hadoop WordCount example and shows how the same code invoked from the WordCount MapReduce job can be reused in a real-time context. We then discuss Opower’s application of this pattern to employ the same core calculations both in batch processing and in servicing real-time requests from end users. This topic will be of interest to those who want to reuse core batch calculations in real-time contexts. It also provides a way forward for organizations moving to Hadoop that have existing code components they would like to employ in batch MapReduce computations.

Quick public service announcement: there was a small typo on the printed schedule that listed this session as “user dependency injection,” when in fact I’ll be talking about *using* dependency injection. So if you came to hear about user dependency injection, I’m sorry to disappoint, but you’re in the wrong room. Maybe if there’s some time at the end, we can brainstorm on what user dependency injection might be.

In any case, my name is Eric Chang and I’m the technology lead of the data services team at Opower.
- We build code infrastructure on top of Hadoop and HBase.
- We’re also practitioners solving problems for Opower’s customers. These are pretty interesting problems that I’ll talk about later on.
- As practitioners, we use the tools at hand to solve problems, and in our case that has historically included dependency injection.

Now, dependency injection isn’t a mainline topic at Hadoop Summit and is in fact a pretty well established approach, but I’m hoping to convince you that what’s old can be new again when applied the right way in your Hadoop infrastructure.

“Hadoop is hard.” Of course, I’m being tongue in cheek, but there is a broader point here, namely the principle of separation of concerns. Relying on that principle, I’ll make the bold claim that there are parts of your code whose focus is *not* Hadoop interactions. For these parts of your code, Hadoop should be “hard”/“not my concern” or, even better, entirely invisible. The most salient example is core application/business logic, which should be focused entirely on higher-level business functionality, not Hadoop plumbing. So the challenge posed is: how do we build an effective separation of concerns and deploy Hadoop-agnostic code to our cluster?

Let’s dive a little deeper into the justification for separation of concerns to frame this discussion. Why does one need a separation of concerns?

While quite a few Hadoop deployments are greenfield from a code perspective, you might find yourself in the position -- as we did -- of having to migrate existing code components to Hadoop. Irrespective of whether your code is brand new or legacy, separation of concerns via good code componentization enables reuse in interesting ways, as we’ll see later in the talk.

Keeping some parts of your code blissfully ignorant of MapReduce plumbing also allows for more focus within your organization: you can have developers focused on core business logic who aren’t distracted by Hadoop plumbing, and developers focused on Hadoop infrastructure who aren’t bogged down in the details of complex business logic.

Finally, while there are testing frameworks like MRUnit, by definition a test that ends at the map() or reduce() method boundary is fairly coarse grained. Code componentization lends itself to more granular (and lighter weight) testing.

The solution to our separation-of-concerns problem is dependency injection. It provides a means by which application logic doesn’t explicitly configure its dependencies but instead interacts with abstractions (interfaces). Implementations of these interfaces are managed by an Inversion of Control (IoC) container such as Spring or Guice. A main application injects appropriate implementations of these interfaces into the IoC container, then retrieves and invokes methods on the managed code components.

Even if you are already familiar with dependency injection, or DI for short, these next two slides help set the groundwork for the rest of the concepts in this talk. Let’s start by illustrating the use of code components in a traditional real-time access pattern, and then show how they can be adapted to Hadoop.

We start with our application/business logic, which is defined by the business service interface and implemented by the BizServiceImpl code component that you see in the lower right-hand corner. The service interacts with a ReadDAO interface that describes a means of retrieving domain objects. DAO in this case is shorthand for the Data Access Object design pattern, which is used to encapsulate access to an underlying data store. The service also interacts with a WriteDAO to save domain objects. A disclaimer for those of you familiar with the DAO pattern: bifurcating read and write operations into separate interfaces isn’t something we would normally do in production code, but it’s helpful for the purposes of this illustration.

The service, along with its associated DAOs, is managed by an Inversion of Control container. Since this is a real-time context, we’ll inject implementations of both DAOs that are backed by a real-time data store such as HBase. At runtime, a servlet container processes a user request and delegates to the aRealtimeCallFromTheWeb method, which:
- requests an instance of the business service managed by the IoC container, and
- invokes the run() method on the service to execute our core application logic.

Let’s walk through that again:
- our business service interacts with read and write interfaces
- we configure our container to return real-time read and write DAOs that are backed by a real-time data store
- when a real-time request is made, the business service is invoked and pulls the data it needs from the real-time store through these DAOs

As you can see, dependency injection allows calling code to interact with interfaces instead of concrete implementations.
We’ll see how this works to our advantage on the next slide.
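The real-time wiring above can be sketched in a few lines of plain Java. This is a minimal illustration using constructor injection by hand rather than a real IoC container, and the in-memory data source and sink are stand-ins for HBase-backed DAOs; the names (ReadDao, WriteDao, BizServiceImpl) mirror the diagram but the bodies are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Read and write abstractions: the service knows only these interfaces.
interface ReadDao { List<String> read(); }
interface WriteDao { void write(String result); }

// Core business logic: no knowledge of what backs the DAOs.
class BizServiceImpl {
    private final ReadDao reads;
    private final WriteDao writes;

    BizServiceImpl(ReadDao reads, WriteDao writes) {
        this.reads = reads;
        this.writes = writes;
    }

    // Toy "business logic": transform each item and persist the result.
    void run() {
        for (String item : reads.read()) {
            writes.write(item.toUpperCase());
        }
    }
}

public class RealtimeWiring {
    public static List<String> demo(List<String> input) {
        List<String> sink = new ArrayList<>();
        // In production, Spring or Guice would supply HBase-backed DAOs here;
        // for the sketch we wire in-memory implementations by hand.
        BizServiceImpl service = new BizServiceImpl(() -> input, sink::add);
        service.run();
        return sink;
    }
}
```

The point of the sketch is that swapping the two lambda arguments for HBase-backed (or MapReduce-backed) implementations requires no change to BizServiceImpl.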

Let’s take the same code components and see how DI can support reuse of code in a Hadoop MapReduce context. Say you’re writing code that executes in the reduce phase of a MapReduce job. Remember that our application code -- represented by BizServiceImpl -- is supposed to remain entirely ignorant of Hadoop.

One way to think of the parameters to a reduce method is:
- the keys and values are a data source
- the context is a data sink, a front-end to an output format of some sort

Using these generalizations, we can construct a ReadDAO that provides a data source for domain objects. Domain objects are constructed from a combination of the keys and values passed in to the reduce method and provided to the ReadDAO during its construction, before it is injected into the IoC container. Similarly, we also inject a ContextBackedWriteDAO that uses a Reducer.Context as a data sink for any domain objects. The reduce method provides the appropriate DAOs to the IoC container at runtime... and then invokes the same method on an instance of the business service managed by the IoC container.

Digging a little deeper, one thing to point out is that all of this is made possible by short-lived IoC containers whose DAOs have a lifetime scoped to the reduce() method call. Compare this to the real-time case, where we have a persistent data store like HBase; there, the DAOs are long-lived because their backing store is long-lived. But in the MapReduce case, the data store is the values iterable, which has new content on each subsequent invocation of reduce(). It wouldn’t make sense to keep piling values into the same DAO instance, so instead we make it an ephemeral pass-through to the values iterable and discard it after each call to reduce().
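The lifetime point can be made concrete with a small sketch: each simulated reduce() invocation builds a fresh, throwaway DAO over that call’s values and discards it afterward. The class and interface names here are illustrative, not from the talk’s actual code.

```java
import java.util.ArrayList;
import java.util.List;

public class DaoLifetimeSketch {
    // Data-source abstraction; in the real-time case its implementation
    // would be long-lived because its backing store (e.g. HBase) is.
    interface ReadDao { List<String> read(); }

    // Simulates successive reduce() calls, each with its own values iterable.
    public static List<Integer> simulateReduceCalls(List<List<String>> perCallValues) {
        List<Integer> sizesSeen = new ArrayList<>();
        for (List<String> values : perCallValues) {
            // Ephemeral DAO: a pass-through to this call's values only.
            ReadDao dao = () -> values;
            sizesSeen.add(dao.read().size());
            // dao goes out of scope here; the next "reduce()" gets a new one.
        }
        return sizesSeen;
    }
}
```

Because the DAO is rebuilt per call, values from one reduce() invocation can never leak into the next, which is exactly the behavior the ephemeral-container approach guarantees.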

To demonstrate more concretely how dependency injection can help you with Hadoop, we’ll use the traditional WordCount example and flip it on its head. We’ll follow recent Hollywood trends and tell the WordCount origin story, before it was a massively parallel, petabyte-scale example of how to use Hadoop. Don’t worry if you’re not familiar with WordCount. Just imagine that there was a time when people used to count words one at a time, in small, artisanal batches. As the disclaimer says, this is a somewhat contrived example, but it’s useful for the sake of illustration.

Imagine you live in a borough of NYC and have a beard, and that you’ve built a great business around counting words in small, artisanal batches, in linear time. You get files from your customers and process them, one at a time, using your elegantly simple code. But you knew you had to scale up at some point, so you componentized your code, as we’ll see in the next slide.

Let’s look at the way the code has been decomposed, starting with the domain-specific data transfer object you see here, the WordCountDTO. The transfer object pattern allows us to decouple our business logic from details like storage formats and persistence-layer implementation. As you can see, we are only concerned with capturing two items: the word and the number of occurrences of that word in the input file provided to us.

Our core application logic is encapsulated by the WordCountService interface and its method countWord(). For this example, we’ll have only one implementation of the interface, namely WordCountServiceImpl, which:
- retrieves words from a provided WordCountDAO interface by calling getWords()
- counts words
- writes the results back as WordCountDTOs to the WordCountDAO API by calling writeWordCount()

In the artisanal WordCount case, we’ll be using an ArtisanalWordCountDAO implementation. As its name implies, this DAO, on every call to getWords():
- opens the input file provided by your customer
- scans the file in its entirety
- returns all words to the caller

Maybe it’s a little smart and does some caching, but in general it’s a linear-time implementation, because we’re solving the problem at hand and dealing with small batches.
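A sketch of this decomposition might look as follows. The names (WordCountDTO, WordCountDAO, ArtisanalWordCountDAO) follow the slides, but the bodies are illustrative guesses at the behavior described, not the talk’s actual source.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Transfer object: decouples business logic from storage details.
class WordCountDTO {
    final String word;
    final int count;
    WordCountDTO(String word, int count) { this.word = word; this.count = count; }
}

interface WordCountDAO {
    List<String> getWords();               // data source: all words in the input
    void writeWordCount(WordCountDTO dto); // data sink: persist one result
}

// Linear-time DAO: every getWords() call rescans the entire input file.
class ArtisanalWordCountDAO implements WordCountDAO {
    private final Path input;
    private final Path output;

    ArtisanalWordCountDAO(Path input, Path output) {
        this.input = input;
        this.output = output;
    }

    public List<String> getWords() {
        try {
            List<String> words = new ArrayList<>();
            for (String line : Files.readAllLines(input)) {
                words.addAll(Arrays.asList(line.split("\\s+")));
            }
            return words;
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public void writeWordCount(WordCountDTO dto) {
        try {
            Files.writeString(output, dto.word + "\t" + dto.count + "\n",
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }
}
```

Note that nothing in the DAO contract mentions files; the file handling is entirely an implementation detail of the artisanal variant.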

Let’s look at the implementation of WordCountServiceImpl. For every word provided to the countWord() method, it asks the WordCountDAO for the words from the word store. For each match in the returned list, it increments a counter. A WordCountDTO is then constructed with the results and written back to the WordCountDAO. This class is DAO-implementation-agnostic and has no knowledge of backing storage formats: it’s just pure business logic.
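The service logic just described can be sketched like this. The DTO and DAO types are repeated so the snippet stands alone; as before, the names match the slides but the bodies are illustrative.

```java
import java.util.List;

class WordCountDTO {
    final String word;
    final int count;
    WordCountDTO(String word, int count) { this.word = word; this.count = count; }
}

interface WordCountDAO {
    List<String> getWords();
    void writeWordCount(WordCountDTO dto);
}

// Pure business logic: asks the DAO for words, counts matches, writes back.
// It never sees a file, an HBase table, or a Reducer.Context.
class WordCountServiceImpl {
    private final WordCountDAO dao;

    WordCountServiceImpl(WordCountDAO dao) { this.dao = dao; }

    void countWord(String word) {
        int count = 0;
        for (String w : dao.getWords()) {
            if (w.equals(word)) count++;
        }
        dao.writeWordCount(new WordCountDTO(word, count));
    }
}
```

Because the class depends only on the interface, a test can inject a stub DAO with curated words and assert on what gets written back, which is exactly the granular-testing benefit mentioned earlier.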

Here we see how we configure the Inversion of Control container (in this case, Google Guice). Don’t worry if you aren’t familiar with this API; the main takeaways are:
- we always return (“bind”) a WordCountServiceImpl when an implementor of the WordCountService interface is requested
- we allow for any implementation of WordCountDAO to be injected

In other words, the WordCountService implementation returned stays constant, but we can vary the implementation of the WordCountDAO returned.
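A Guice module along these lines might look as follows. This is a hedged reconstruction, not the talk’s actual code: it assumes Guice’s AbstractModule API and the WordCountService/WordCountDAO interfaces from the earlier slides.

```java
import com.google.inject.AbstractModule;

// Binding configuration: the service implementation is fixed, while the DAO
// implementation is whatever instance the caller hands to the module.
public class WordCountModule extends AbstractModule {
    private final WordCountDAO dao;

    public WordCountModule(WordCountDAO dao) {
        this.dao = dao;
    }

    @Override
    protected void configure() {
        // Constant: requests for WordCountService always yield the same impl.
        bind(WordCountService.class).to(WordCountServiceImpl.class);
        // Variable: artisanal today, MapReduce-backed tomorrow.
        bind(WordCountDAO.class).toInstance(dao);
    }
}
```

The asymmetry between the two bindings is the whole trick: every caller gets the same business logic, while the module’s constructor argument decides where the data comes from and goes to.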

The ArtisanalWordCount main class builds an ArtisanalWordCountDAO from a provided input file and target output file. It injects the ArtisanalWordCountDAO into a Guice module and then asks the Guice module for an implementation of the WordCountService. We know that the module will return the WordCountServiceImpl implementation that we just reviewed, with the ArtisanalWordCountDAO injected. Let’s assume there is a getWordsToCount() method that determines which words we’re interested in counting, maybe via command-line arguments or some other input file. For each word that we’re supposed to count, we call the countWord() method on the service we retrieved.
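The whole artisanal flow can be sketched end to end. To keep the sketch self-contained and runnable, an in-memory DAO stands in for the file-backed one and a plain constructor call stands in for the Guice lookup; the names mirror the talk, the bodies are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ArtisanalWordCount {
    interface WordCountDAO {
        List<String> getWords();
        void writeWordCount(String word, int count); // DTO flattened for brevity
    }

    interface WordCountService { void countWord(String word); }

    static class WordCountServiceImpl implements WordCountService {
        private final WordCountDAO dao;
        WordCountServiceImpl(WordCountDAO dao) { this.dao = dao; }

        public void countWord(String word) {
            int n = 0;
            for (String w : dao.getWords()) {
                if (w.equals(word)) n++;
            }
            dao.writeWordCount(word, n);
        }
    }

    // words      : stand-in for the customer's input file
    // wordsToCount: stand-in for getWordsToCount()
    public static Map<String, Integer> run(List<String> words, List<String> wordsToCount) {
        Map<String, Integer> results = new HashMap<>();
        // "Inject" the artisanal DAO (in-memory here for the sketch).
        WordCountDAO dao = new WordCountDAO() {
            public List<String> getWords() { return words; }
            public void writeWordCount(String w, int c) { results.put(w, c); }
        };
        // What the Guice module would hand back for WordCountService.
        WordCountService service = new WordCountServiceImpl(dao);
        for (String w : wordsToCount) {
            service.countWord(w);
        }
        return results;
    }
}
```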

We represent the artisanal implementation using our DI illustration method from earlier. The WordCountServiceImpl service is managed by the IoC container and is injected with an artisanal implementation of the DAO that reads a file one line at a time. The artisanalWordCount() method is a single-threaded batch process that invokes the service methods on the WordCountService interface to calculate word counts.

Fast forward a bit, and imagine that the artisanal days are over and you have customers who want petabytes of words counted. Linear time won’t cut it. It just so happens that your boss used to work at Yahoo and suggests you look into Hadoop. You study MapReduce a bit, figure out how to partition words in your map phase, and come up with the classic WordCount reduce method. You build a MapReduceWordCountDAO that fulfills the WordCountDAO API contract and supports writing calculated WordCountDTOs back to the MR context to be collected in a TextOutputFormat.

So, how does DI give us the best of both worlds and allow us to keep our small-batch code roots but apply them at scale? As mentioned earlier, we’ll start by partitioning our input files by word and emitting a count for each word found. By the time we’re ready to enter the reduce phase, we have keys, which are words, and values, which are occurrence counts. So far, we’re no different than the classic Hadoop WordCount example.

Here’s where we fork to employ some code reuse via DI. Let’s break down the parameters to reduce():
- the key is the word
- the values are a list of IntWritables, one for each occurrence of the word
- the context is a front-end to a text output format

Going back to the principles called out earlier, the keys and values are our data source and the context is our data sink. The MapReduceWordCountDAO is constructed with the values, which it can sum up to know how many times the word occurred in the file. The MapReduceWordCountDAO.getWords() method then merely “echoes” back the word the correct number of times. Additionally, MapReduceWordCountDAO writes a provided WordCountDTO to the Reducer context, which is then written out to a TextOutputFormat at job completion.

We wire the MapReduceWordCountDAO into the IoC container, and since it satisfies the WordCountDAO API contract, we can use the same WordCountServiceImpl that we used in the artisanal flow. We call countWord() and reuse the same core logic to count word occurrences.

Here’s how it’s all wired up. In the reduce method, we construct an instance of a MapReduceWordCountDAO using the key, values, and context provided to the reduce method. We construct a Guice module with this DAO and then ask it for the WordCountService so we can invoke the countWord() business method.
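The reduce-side wiring just described can be sketched as follows. Simple Java types stand in for Hadoop’s Text, IntWritable, and Reducer.Context so the snippet runs standalone, and the constructor call stands in for the Guice lookup; names follow the talk, bodies are illustrative.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class WordCountReduceSketch {
    interface WordCountDAO {
        List<String> getWords();
        void writeWordCount(String word, int count);
    }

    // Same core logic as the artisanal flow: count matches, write the tally.
    static class WordCountServiceImpl {
        private final WordCountDAO dao;
        WordCountServiceImpl(WordCountDAO dao) { this.dao = dao; }

        void countWord(String word) {
            int n = 0;
            for (String w : dao.getWords()) {
                if (w.equals(word)) n++;
            }
            dao.writeWordCount(word, n);
        }
    }

    // key    = the word (stand-in for Text)
    // values = one count per occurrence emitted by the mappers (IntWritables)
    // sink   = stand-in for the Reducer.Context / TextOutputFormat
    public static void reduce(String key, Iterable<Integer> values, Map<String, Integer> sink) {
        int occurrences = 0;
        for (int v : values) occurrences += v;
        final int total = occurrences;
        // Ephemeral, values-backed DAO scoped to this reduce() call:
        // getWords() "echoes" the key back once per occurrence.
        WordCountDAO dao = new WordCountDAO() {
            public List<String> getWords() { return Collections.nCopies(total, key); }
            public void writeWordCount(String w, int c) { sink.put(w, c); }
        };
        new WordCountServiceImpl(dao).countWord(key);
    }
}
```

The service class is byte-for-byte the same logic used in the artisanal sketch; only the injected DAO changed.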

Utility companies provide us with smart meter reads as well as a definition of the rates to use to calculate costs. We use a forecasting algorithm to project usage. We then apply rates, when provided to us by the utility company, to calculate a projected cost. The resulting calculation is made available via multiple channels, including the web as you see here, but also via push channels such as email and SMS. Shameless plug: if you are a PGE customer, you can see this by clicking on the My Usage top-level tab, then the My Dashboard sub-tab.

One item I’d like to draw your attention to on this slide: this is one of the first user-facing features we built on Hadoop at Opower. The Rate Engine was an existing, fairly complex set of code components that were already in use at Opower in non-Hadoop application stacks. We wanted to preserve existing uses of the Rate Engine outside of Hadoop. So the question was: how do we maintain existing legacy use cases of the Rate Engine while integrating into Hadoop?

Dependency injection lets us employ the same code components in batch and in real time. We use the same code in batch workflows for push channels like SMS and email, and reuse that same code base in real-time channels such as the web.

One other thing to highlight is that this code componentization also gives us a testing story. Because the core components are platform-agnostic, and because we can inject whatever data sources and sinks we’d like, we can provide curated test data inputs and assert on results that are posted to our test data sinks. All this means we can have a two-pronged integration testing approach:
- finer-grained tests using curated inputs and outputs in a real-time context
- coarser-grained tests that exercise flows of data in a MapReduce container

The entry point to our bill projection calculations is the BillForecastService. It interacts with our Rate Engine, which is the pre-Hadoop code component. Both the Rate Engine and the forecast service access a DAO to retrieve usage, and all code components are managed in a Spring IoC container.

In a Hadoop/MR context, we map over usage stored in HBase and collect usage, grouped by customer, in the reduce phase. In the reduce phase, we construct DAOs at runtime to feed usage to the bill projection components. We rely on the data source/sink principle here and use the values parameter as the data source. What I don’t show here is the DAO that writes to the context as a data sink; in our case the context is a TableOutputFormat that maps back to HBase. Again, our application logic in the BillForecastService doesn’t need to know about any of these details. These DAOs transparently serve up data to the bill forecast and Rate Engine code components; no changes to the existing Rate Engine were necessary. The reduce phase then calls forecast() once it has set up the Spring container.

We’ve covered batch calculations; however, depending on the scale of the utility company and the frequency with which we want to update bill projections, we also support real-time calculations. In this case, everything stays the same on the right-hand side, but the invoking code is driven by a web request, represented by the calculationBillProjection() method. HBase-backed DAOs are wired in to pull usage from HBase on a per-customer basis and are injected into the Spring container. The BillForecastService remains unaware that it’s being invoked in a real-time context, where it’s getting its reads from, and where it’s writing its results.
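The dual wiring described over the last two slides can be sketched abstractly. Everything here is hypothetical and heavily simplified -- the names (BillForecastService, UsageDao) echo the slides, the "forecast" is a toy average rather than the real Rate Engine calculation, and a Map stands in for HBase -- but it shows the same service running unchanged behind both a reduce-scoped DAO and a long-lived-store DAO.

```java
import java.util.List;
import java.util.Map;

public class DualWiringSketch {
    interface UsageDao { List<Double> getUsage(String customerId); }
    interface ForecastSink { void write(String customerId, double projected); }

    static class BillForecastService {
        private final UsageDao usage;
        private final ForecastSink sink;

        BillForecastService(UsageDao usage, ForecastSink sink) {
            this.usage = usage;
            this.sink = sink;
        }

        // Toy "forecast": project the average read forward. The real logic
        // would delegate to the Rate Engine; the wiring is the point here.
        void forecast(String customerId) {
            List<Double> reads = usage.getUsage(customerId);
            double avg = reads.stream().mapToDouble(Double::doubleValue).average().orElse(0);
            sink.write(customerId, avg);
        }
    }

    // Batch path: DAO built from the values handed to reduce().
    public static void reduceStyleCall(String customerId, List<Double> values,
                                       Map<String, Double> out) {
        new BillForecastService(id -> values, out::put).forecast(customerId);
    }

    // Real-time path: DAO backed by a long-lived store (a map stands in for HBase).
    public static void webStyleCall(String customerId, Map<String, List<Double>> store,
                                    Map<String, Double> out) {
        new BillForecastService(store::get, out::put).forecast(customerId);
    }
}
```

Both entry points construct the same service; only the injected data source and sink differ, which is the reuse claim of the talk in miniature.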

Somebody still has to know MR. This is not about turnkey MR; it’s about separation of concerns. We know that we’re not alone in having existing code bases that drive a successful business and need a story for transitioning to Hadoop. For those of you who find yourselves in a situation similar to ours, I hope I’ve been able to provide a few insights and maybe given you another approach to consider as you face the challenge of taking something that’s already successful and deploying it at scale on Hadoop.

Opower @ Hadoop Summit North America 11
Example: Artisanal WordCount
» You live in a borough of NYC and have a beard
» You’ve built a great business around counting words, one at a time, in small, handcrafted batches in linear O(n) time
» You receive files from customers and run your simple but effective code
» You had the foresight to know that some day you would need to scale up. So you created a properly componentized architecture:
• Domain objects
• Data access layer
• Service layer (application logic)