The issue here is that when Java forks a process (in this case bash), Linux tries to allocate as much memory for the child as the current Java process is using, even though the command you are running might need very little. When a large process runs on a machine that is low on memory, the fork can fail because that memory cannot be allocated.

The workaround here is to either use an instance with more memory (m2 class), or reduce the number of mappers or reducers you are running on each machine to free up some memory.

Since the task I was running was reduce-heavy, I chose to just drop the number of mappers from 4 to 2. You can do this pretty easily with the EMR bootstrap actions.
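As a rough illustration of that bootstrap action, the sketch below assembles the extra arguments for the elastic-mapreduce command line client. It assumes Amazon's stock configure-hadoop bootstrap action and the mapred.tasktracker.map.tasks.maximum setting; neither is named above, so treat both (and the helper name) as assumptions to check against your Hadoop version.

```python
# Assumed location of Amazon's stock configure-hadoop bootstrap action.
CONFIGURE_HADOOP = "s3://elasticmapreduce/bootstrap-actions/configure-hadoop"


def map_slot_bootstrap_args(max_map_tasks):
    """Build CLI arguments that cap the number of map slots per node.

    Hypothetical helper: it only assembles strings and never calls EMR.
    """
    if max_map_tasks < 1:
        raise ValueError("need at least one map slot per node")
    setting = "mapred.tasktracker.map.tasks.maximum=%d" % max_map_tasks
    # Assumption: "-s" tells configure-hadoop to write a site-level key=value pair.
    return ["--bootstrap-action", CONFIGURE_HADOOP, "--args", "-s," + setting]


if __name__ == "__main__":
    # Print the fragment you would append to the job creation command.
    print(" ".join(map_slot_bootstrap_args(2)))
```

Fewer map slots means fewer task JVMs per node, which leaves more free memory for a fork to succeed.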

Tuesday, September 21, 2010

I’ve recently started some work that involves extending Salesforce for our Ad Ops team. For our most recent Hack Day, I decided to do a little project to continue learning about development with the Salesforce cloud platform, Force.com.

After thinking about what I wanted to work on, I decided to build a custom button that would allow a user to update an Account record in Salesforce with an Advertiser ID from DART, our primary ad serving platform, for the following reasons:

It’s a tool that I could see being used in our live Salesforce instance.

It seems like a typical use case for extending Salesforce (i.e. integrating with a 3rd party SOAP service).

The back of the napkin design looked like this:

At a high-level, I wanted to call DART’s DFP API from within Salesforce and then update an Account object in Salesforce with the Advertiser Id returned from DART. However, I first needed to authenticate with Google’s ClientLogin service in order to get an authentication token for calling the DFP API.

APEX

APEX is the programming language that allows a developer to customize a Salesforce installation. APEX’s syntax, not surprisingly, is very similar to Java. The really interesting thing is that none of the code you write actually compiles or runs on your machine. All compilation and execution happen “in the cloud”.

DART Integration

Salesforce has a strict security model. In order to make a request to a Web Service you actually need to configure any URLs you are accessing as a Remote Site. Instructions for doing this can be found here. For this project, I simply needed to add https://www.google.com as a Remote Site.

There are a couple of options for calling a Web Service via APEX:
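For the authentication step, here is a minimal Python sketch (not the Apex from this project) of a ClientLogin exchange. It assumes the https://www.google.com/accounts/ClientLogin endpoint, a form-encoded POST, and a response body made of KEY=value lines that includes an Auth= line. The "gam" service name, the "sampleapp" source value, and all function names are assumptions for illustration. Google has since retired ClientLogin, so only the request-building and parsing halves can be exercised today.

```python
import urllib.parse
import urllib.request

# Assumed ClientLogin endpoint (on the google.com Remote Site mentioned above).
CLIENT_LOGIN_URL = "https://www.google.com/accounts/ClientLogin"


def build_login_body(email, password, service="gam", source="sampleapp"):
    """Form-encode the ClientLogin fields; 'service' and 'source' are assumed values."""
    fields = {
        "accountType": "GOOGLE",
        "Email": email,
        "Passwd": password,
        "service": service,
        "source": source,
    }
    return urllib.parse.urlencode(fields).encode("ascii")


def parse_auth_token(response_text):
    """Pull the Auth token out of a KEY=value response body."""
    for line in response_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key == "Auth":
            return value.strip()
    raise ValueError("no Auth token in ClientLogin response")


def fetch_auth_token(email, password):
    """POST the credentials and return the token (network call, not run here)."""
    request = urllib.request.Request(
        CLIENT_LOGIN_URL, data=build_login_body(email, password)
    )
    with urllib.request.urlopen(request) as response:
        return parse_auth_token(response.read().decode("utf-8"))
```

Keeping the parsing separate from the HTTP call makes the token handling easy to check against a canned response.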

This piece of code would fetch an authToken for the given username and password. Once I had the authToken, I could then call the DFP API. For this part, I used WSDL/SOAP, the 2nd method for calling web services.

Salesforce provides a way to import a WSDL file via its Admin UI. It then parses and generates APEX code that allows you to call methods exposed by the WSDL. However, when I tried importing DFP’s Company Service WSDL, I ran into some errors:

It turns out that the WSDL contains an element named ‘trigger’ and trigger is a reserved APEX keyword. In any event, I ended up copy/pasting the generated code and fixing it so that it compiled correctly (I also ran into a problem where generated exception classes were not extending Exception). Once the code to call the DFP Company Service was compiling, I created an APEX controller to perform the update on an Account record.

        // get Account with the given id
        for (Account o : [select id, name from Account where id = :theId]) {
            DartCompanyService.CompanyServiceInterfacePort p = new DartCompanyService.CompanyServiceInterfacePort();
            p.RequestHeader = new DartCompanyService.SoapRequestHeader();
            p.RequestHeader.applicationName = 'sampleapp';

            if (page.totalResultSetSize > 0) {
                // update the record if we get a result
                o.Dart_Advertiser_Id__c = page.results.get(0).id;
                update o;
            }
        }

        // Redirect the user back to the original page
        PageReference pageRef = new PageReference('/' + theId);
        pageRef.setRedirect(true);
        return pageRef;
    }
}

UI updates

Then, I created a simple Visualforce page to invoke the controller:

<apex:page standardController="Account" extensions="SyncDartAccountController" action="{!onLoad}">
  <apex:sectionHeader title="Auto-Running Apex Code"/>
  <apex:outputPanel >
    You tried calling Apex Code from a button. If you see this page, something went wrong. You should have been redirected back to the record you clicked the button from.
  </apex:outputPanel>
</apex:page>

Finally, I added a custom button to the Account page which would invoke the Visualforce page. You can do this in the Salesforce UI:

1) Click on ‘Buttons and Links’:

2) Click New:

3) Enter the info for the new button:

4) After clicking on Save, we can add the button to the Account page layout. The final result:

Final Thoughts

This was my first foray into APEX programming in Salesforce and I was pleased with the overall set of tools and the ability to be productive quickly. The only hiccup I encountered was in the WSDL generation step, and this issue was fairly easy to overcome. There are good developer docs and there are ways to add debug logging (which I didn’t go over) as well as a framework for unit testing.

Monday, September 20, 2010

We write a lot of hive reports. Frequently we want to email the resulting report to a list. In the past I've usually done this with some one-off post processing scripts, but I thought it would be nice to write a reusable emr job step that will execute as part of the hive job.

The script will download files from an s3 url, concatenate them together, zip up the results and send it as an attachment to a specified email address. It sends email through smtp.mail.com, using account credentials you specify.
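The concatenate, zip, and attach stages can be sketched roughly as follows. This is a Python illustration rather than the actual send-report.rb; it takes the downloaded part files as bytes, and the function names, default file names, and the port 587 with STARTTLS choice are all assumptions.

```python
import io
import smtplib
import zipfile
from email.message import EmailMessage


def build_report_zip(parts, report_name="report.txt"):
    """Concatenate the report parts in order and zip the result in memory."""
    combined = b"".join(parts)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(report_name, combined)
    return buffer.getvalue()


def build_report_email(zip_bytes, sender, recipient, subject,
                       attachment_name="report.zip"):
    """Create a message with the zipped report as its only attachment."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content("The report is attached.")
    message.add_attachment(zip_bytes, maintype="application",
                           subtype="zip", filename=attachment_name)
    return message


def send_report(message, host, user, password, port=587):
    """Send through an authenticated SMTP server (assumes STARTTLS on port 587)."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.login(user, password)
        server.send_message(message)
```

Building the message separately from sending it means the attachment can be inspected without touching a mail server.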

I wanted to make it easy to just append an additional step to any existing job, not requiring any additional machine setup or dependencies. I was able to do this by making use of Amazon's script-runner (s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar). The script-runner.jar step will let you execute an arbitrary script from a location in S3 as an EMR job step.

As I mentioned, the intended usage is to run it as a job step with your Hive script, passing in the location of the resulting report.

Above you can see I'm starting a hive report as normal, then simply appending the script-runner step, calling the emr-mailer send-report.rb, telling it where the report will end up, and details about the email.
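The shape of that appended step can be sketched as a small Python helper. It relies on script-runner taking the script location as its first argument and handing the remaining arguments to the script. The helper name, the dictionary layout, and the example bucket and arguments are hypothetical; the real options of send-report.rb are not reproduced here.

```python
SCRIPT_RUNNER_JAR = (
    "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar"
)


def script_runner_step(script_url, script_args):
    """Describe a job step that runs a script from S3 through script-runner.jar.

    The first jar argument is the script location; everything after it is
    passed on to the script itself.
    """
    if not script_url.startswith("s3://"):
        raise ValueError("script_url should be an s3:// location")
    return {"jar": SCRIPT_RUNNER_JAR, "args": [script_url] + list(script_args)}


if __name__ == "__main__":
    # Placeholder bucket and positional values, for illustration only.
    step = script_runner_step(
        "s3://example-bucket/emr-mailer/send-report.rb",
        ["s3://example-bucket/reports/daily/", "reports@example.com"],
    )
    print(step["jar"], ",".join(step["args"]))
```

Because the step is just a jar plus arguments, it can be appended to any existing job without extra setup on the cluster machines.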