Archive for the ‘Amazon Web Services AWS’ Category

Machine-readable data from certain electronic 990 forms filed with the IRS from 2011 to present are available for anyone to use via Amazon S3.

Form 990 is the form used by the United States Internal Revenue Service to gather financial information about nonprofit organizations. Data for each 990 filing is provided in an XML file that contains structured information that represents the main 990 form, any filed forms and schedules, and other control information describing how the document was filed. Some non-disclosable information is not included in the files.

This data set includes Forms 990, 990-EZ and 990-PF which have been electronically filed with the IRS and is updated regularly in an XML format. The data can be used to perform research and analysis of organizations that have electronically filed Forms 990, 990-EZ and 990-PF. Forms 990-N (e-Postcard) are not available within this data set. Forms 990-N can be viewed and downloaded from the IRS website.

I could use AWS but I’m more interested in deep analysis of a few returns than analysis of the entire dataset.

Fortunately the webpage continues:

…
An index listing all of the available filings is available at s3://irs-form-990/index.json. This file includes basic information about each filing including the name of the filer, the Employer Identification Number (EIN) of the filer, the date of the filing, and the path to download the filing.

All of the data is publicly accessible via the S3 bucket’s HTTPS endpoint at https://s3.amazonaws.com/irs-form-990. No authentication is required to download data over HTTPS. For example, the index file can be accessed at https://s3.amazonaws.com/irs-form-990/index.json and the example filing mentioned above can be accessed at https://s3.amazonaws.com/irs-form-990/201541349349307794_public.xml (emphasis in original).
…

Once you have the index.json file, with grep, a little awk and wget, you can quickly explore IRS 990 filings for further analysis or to prepare queries for running on AWS (such as discovery of common directors, etc.).
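The grep-and-awk step can also be sketched in Python. The sketch below assumes that the index is a list of JSON objects with `OrganizationName`, `EIN` and `URL` fields; those field names, the sample records and the helper `find_filings` are illustrative assumptions, not taken from the AWS page, so check them against the real index.json before relying on them.

```python
import json

# Hypothetical sample shaped like the index described above. The field names
# (OrganizationName, EIN, URL) and all values are assumptions for illustration.
SAMPLE_INDEX = json.loads("""
[
  {"OrganizationName": "EXAMPLE FOOD BANK INC", "EIN": "000000001",
   "URL": "https://s3.amazonaws.com/irs-form-990/EXAMPLE_1_public.xml"},
  {"OrganizationName": "EXAMPLE ARTS COUNCIL", "EIN": "000000002",
   "URL": "https://s3.amazonaws.com/irs-form-990/EXAMPLE_2_public.xml"}
]
""")


def find_filings(entries, name_fragment):
    """Return (EIN, URL) pairs for filers whose name contains name_fragment."""
    needle = name_fragment.upper()
    return [
        (entry["EIN"], entry["URL"])
        for entry in entries
        if needle in entry.get("OrganizationName", "").upper()
    ]


if __name__ == "__main__":
    # Print the download paths for every matching filer.
    for ein, url in find_filings(SAMPLE_INDEX, "food bank"):
        print(ein, url)
```

The resulting URLs can be handed to wget (or any HTTP client) to pull down only the handful of returns you care about.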

We are excited to announce that over one million electronic IRS 990 filings are available via Amazon Simple Storage Service (Amazon S3). Filings from 2011 to the present are currently available and the IRS will add new 990 filing data each month.

Form 990 is the form used by the United States Internal Revenue Service (IRS) to gather financial information about nonprofit organizations. By making electronic 990 filing data available, the IRS has made it possible for anyone to programmatically access and analyze information about individual nonprofits or the entire nonprofit sector in the United States. This also makes it possible to analyze the data in the cloud without having to download or store it yourself, which lowers the cost of product development and accelerates analysis.

Each electronic 990 filing is available as a unique XML file in the “irs-form-990” S3 bucket in the AWS US East (N. Virginia) region. Information on how the data is organized and what it contains is available on the IRS 990 Filings on AWS Public Data Set landing page.
…

Some of the forms and instructions that will help you make sense of the data reported:

The DevOps series covers how to get started with the leading open source distributed technologies. In this tutorial, we step through how to deploy a Spark Standalone cluster on AWS Spot Instances for less than $1. In a follow up post, we will show you how to use a Jupyter notebook on Spark for ad hoc analysis of reddit comment data on Amazon S3.

One of the significant hurdles in learning to build distributed systems is understanding how these various technologies are installed and their inter-dependencies. In our experience, the best way to get started with these technologies is to roll up your sleeves and build projects you are passionate about.

The following tutorial shows how you can deploy your own Spark cluster in standalone mode on top of Hadoop. Due to Spark’s memory demand, we recommend using m4.large spot instances with 200GB of magnetic hard drive space each.

m4.large spot instances are not within the free-tier package on AWS, so this tutorial will incur a small cost. The tutorial should not take any longer than a couple hours, but if we allot 6 hours for your 4 node spot cluster, the total cost should run around $0.69 depending on the region of your cluster. If you run this cluster for an entire month we can look at a bill of around $80, so be sure to spin down your cluster after you are finished using it.
…

How does $0.69 to improve your experience with distributed systems sound?

It’s hard to imagine a better deal.

The only reason to lack experience with distributed systems is lack of interest.
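The quoted figures are easy to sanity-check. The sketch below back-calculates an hourly rate from the $0.69 estimate for 4 nodes over 6 hours and applies it to a 30-day month; the rate is an assumption derived from the quote, not a published spot price, and `cluster_cost` is a name made up for this example.

```python
def cluster_cost(nodes, hours, rate_per_instance_hour):
    """Total cost of running `nodes` instances for `hours` at a flat hourly rate."""
    return nodes * hours * rate_per_instance_hour


# Rate implied by the quoted figure: $0.69 for 4 nodes over 6 hours.
# This is a back-calculated assumption, not a published spot price.
IMPLIED_RATE = 0.69 / (4 * 6)

if __name__ == "__main__":
    print(f"implied rate:    ${IMPLIED_RATE:.4f} per instance-hour")
    print(f"6-hour tutorial: ${cluster_cost(4, 6, IMPLIED_RATE):.2f}")
    print(f"30-day month:    ${cluster_cost(4, 24 * 30, IMPLIED_RATE):.2f}")
```

At that implied rate a full month comes to a little under $83, in line with the "around $80" in the quote. Spot prices move, so treat the numbers as an illustration only.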

I’ve been meaning to learn Theano for a while and I’ve also wanted to build a chess AI at some point. So why not combine the two? That’s what I thought, and I ended up spending way too much time on it. I actually built most of this back in September but not until Thanksgiving did I have the time to write a blog post about it.

Chess sets are a common holiday gift so why not do something different this year?

Pretty print a copy of this post and include a gift certificate from AWS for a GPU instance for say a week to ten days.

I don’t think AWS sells gift certificates, but they certainly should. Great stocking stuffer, anniversary/birthday/graduation present, etc. Not so great for Valentines Day.

If you ask AWS for a gift certificate, mention my name. They don’t know who I am so I could use the publicity. 😉

We have come a long way in five years, but there’s always room to do better! The database engines that I listed above were designed to function in a constrained and somewhat simplistic hardware environment — a constrained network, a handful of processors, a spinning disk or two, and limited opportunities for parallel processing or a large number of concurrent I/O operations.

The RDS team decided to take a fresh look at the problem and to create a relational database designed for the cloud. Starting from a freshly scrubbed white board, they set as their goal a material improvement in the price-performance ratio and the overall scalability and reliability of existing open source and commercial database engines. They quickly realized that they had a unique opportunity to create an efficient, integrated design that encompassed the storage, network, compute, system software, and database software, purpose-built to handle demanding database workloads. This new design gave them the ability to take advantage of modern, commodity hardware and to eliminate bottlenecks caused by I/O waits and by lock contention between database processes. It turned out that they were able to increase availability while also driving far more throughput than before.

…

In preview now but you can sign up at the end of Jeff’s post.

Don’t confuse Apache Aurora (“a service scheduler that runs on top of Mesos”) with Amazon Aurora, the MySQL-compatible database from Amazon. (I guess all the good names have been taken for years.)

What am I missing?

Oh, following the open source announcements from Microsoft, Intel, Mapillary (to name the ones I noticed this week), I can’t find any reference to the source code for Amazon Aurora.

Do you think Amazon Aurora is closed source? One of those hiding places for government surveillance/malware? Hopefully not.

Perhaps Jeff just forgot to mention the GitHub repository with the Amazon Aurora source code.

It’s Friday (my location) so let’s see what develops by next Monday, 17 November 2014. If there is no announcement that Amazon Aurora is open source, …, well, at least everyone can factor that into their database choices.

PS: Open source does not mean bug or malware free. Open source means that you have a sporting chance at finding (and correcting) bugs and malware. Non-open source software may have bugs and malware which you will experience but not be able to discover/fix/correct.

President Obama has announced a series of executive actions to reduce carbon pollution and promote sound science to understand and manage climate impacts for the U.S.

Following the President’s call for developing tools for climate resilience, OpenNEX is hosting a workshop that will feature:

Climate science through lectures by experts

Computational tools through virtual labs, and

A challenge inviting participants to compete for prizes by designing and implementing solutions for climate resilience.

Whether you win any of the $60K in prize money or not, this looks like a great way to learn about climate data, approaches to processing climate data and the Amazon cloud all at one time!

Processing in the virtual labs is on the OpenNEX (Open NASA Earth Exchange) nickel. You can experience cloud computing without fear of the bill for computing services. Gain valuable cloud experience and possibly make a contribution to climate science.

The Wikimedia Foundation publishes page view statistics for Wikimedia projects here; this server is rate-limited so it took roughly a month to transfer this 4 TB data set into S3 Storage in the AWS cloud. The photo on the left is of a hard drive containing a copy of the data that was produced with AWS Import/Export.

Once in S3, it is easy to process this data with Amazon Map/Reduce using the Open Source telepath software.

Future projects require that this data be integrated with semantic data from :BaseKB and that has me working on tools such as RDFeasy. In the meantime, a mirror of the Wikipedia pagecounts from Jan 2008 to Feb 2014 is available in a requester pays bucket in S3, which means you can use it in the Amazon Cloud for free and download data elsewhere for the cost of bulk network transfer.

Interesting isn’t it?

That “open” data can be so difficult to obtain and manipulate that it may as well not be “open” at all for the average user.

Something to keep in mind when big players talk about privacy. Do they mean private from their prying eyes or yours?

I think you will find in most cases that “privacy” means private from you and not the big players.

If you want to do a good deed for this week, support this data set at Gittip.

This paper describes by example how astronomers can use cloud-computing resources offered by Amazon Web Services (AWS) to create new datasets at scale. We have created from existing surveys an atlas of the Galactic Plane at 16 wavelengths from 1 μm to 24 μm with pixels co-registered at spatial sampling of 1 arcsec. We explain how open source tools support management and operation of a virtual cluster on AWS platforms to process data at scale, and describe the technical issues that users will need to consider, such as optimization of resources, resource costs, and management of virtual machine instances.

In case you are interested in taking your astronomy hobby to the next level with AWS.

Mesosphere, a startup that focuses on developing Mesos, a technology that makes running complex distributed applications easier, is launching Elastic Mesos today. This new product makes setting up a Mesos cluster on Amazon Web Services a basic three-step process that asks you for the size of the cluster you want to set up, your AWS credentials and an email where you want to get notifications about your cluster’s state.

Given the complexity of setting up a regular Mesos cluster, this new project will make it easier for developers to experiment with Mesos and the frameworks Mesosphere and others have created around it.

As Mesosphere’s founder Florian Leibert describes it, for many applications, the data center is now the computer. Most applications now run on distributed systems, but connecting all of the distributed parts is often still a manual process. Mesos’ job is to abstract away all of these complexities and to ensure that an application can treat the data center and all your nodes as a single computer. Instead of setting up various server clusters for different parts of your application, Mesos creates a shared pool of servers where resources can be allocated dynamically as needed.
…

In the near future, all forms of digital communication will be secure from the NSA and others. Before Snowden, it was widely known in a vague sense that the NSA and others were spying on U.S. citizens and others. Post-Snowden, user demand will result in vendors developing secure communications with two settings, secure and very secure.

Ironic that overreaching by the NSA will result in greater privacy for everyone of interest to the NSA.

Magazine Luiza, one of the largest retail chains in Brazil, developed an in-house product recommendation system, built on top of a large knowledge graph. AWS resources like Amazon EC2, Amazon SQS, Amazon ElastiCache and others made it possible for them to scale from a very small dataset to a huge Cassandra cluster. By improving their big data processing algorithms on their in-house solution built on AWS, they improved their conversion rates on revenue by more than 25 percent compared to market solutions they had used in the past.

Not a lot of technical details but a good success story to repeat if you are pushing graph-based services.

Looking to save the world through data? Amazon, in conjunction with the NASA Earth Exchange (NEX) team, today released over 20 terabytes of NASA-collected climate data as part of its OpenNEX project. The goal, they say, is to make important datasets accessible to a wide audience of researchers, students, and citizen scientists in order to facilitate discovery.

“Up until now, it has been logistically difficult for researchers to gain easy access to this data due to its dynamic nature and immense size,” writes Amazon’s Jeff Barr in the Amazon blog. “Limitations on download bandwidth, local storage, and on-premises processing power made in-house processing impractical. Today we are publishing an initial collection of datasets available (over 20 TB), along with Amazon Machine Images (AMIs), and tutorials.”

The OpenNEX project aims to give open access to resources to aid earth science researchers, including data, virtual labs, lectures, computing and more.

Excellent!

Isaac also reports that NASA will be hosting workshops on the data.

Anyone care to wager on the presence of semantic issues in the data sets? 😉

Netflix runs a lot of Hadoop jobs on the Amazon Web Services cloud computing platform, and on Friday the video-streaming leader open sourced its software to make running those jobs as easy as possible. Called Genie, it’s a RESTful API that makes it easy for developers to launch new MapReduce, Hive and Pig jobs and to monitor longer-running jobs on transient cloud resources.

In the blog post detailing Genie, Netflix’s Sriram Krishnan makes clear a lot more about what Genie is and is not. Essentially, Genie is a platform as a service running on top of Amazon’s Elastic MapReduce Hadoop service. It’s part of a larger suite of tools that handles everything from diagnostics to service registration.

It is not a cluster manager or workflow scheduler for building ETL processes (e.g., processing unstructured data from a web source, adding structure and loading into a relational database system). Netflix uses a product called UC4 for the latter, but it built the other components of the Genie system.

It’s not very futuristic to say that AWS (or something very close to it) will be your next utility bill.

Like paying for water, gas, cable, electricity, it will be an auto-pay setup on your bank account.

What will you say when clients ask if the service you are building for them is hosted on AWS?

Are you going to say your servers are more reliable? That you don’t “trust” Amazon?

Both of which may be true but how will you make that case?

Without sounding like you are selling something the client doesn’t need?

As the price of cloud computing drops, those questions are going to become common.

Amazon Web Services, Inc. today announced that Amazon Redshift, a managed, petabyte-scale data warehouse service in the cloud, is now broadly available for use.

Since Amazon Redshift was announced at the AWS re:Invent conference in November 2012, customers using the service during the limited preview have ranged from startups to global enterprises, with datasets from terabytes to petabytes, across industries including social, gaming, mobile, advertising, manufacturing, healthcare, e-commerce, and financial services.

Traditional data warehouses require significant time and resources to administer. In addition, the financial cost associated with building, maintaining, and growing self-managed, on-premise data warehouses is very high. Amazon Redshift aims to lower the cost of a data warehouse and make it easy to analyze large amounts of data very quickly.

Amazon Redshift uses columnar data storage, advanced compression, and high performance IO and network to achieve higher performance than traditional databases for data warehousing and analytics workloads. Redshift is currently available in the US East (N. Virginia) Region and will be rolled out to other AWS Regions in the coming months.

“When we set out to build Amazon Redshift, we wanted to leverage the massive scale of AWS to deliver ten times the performance at 1/10 the cost of on-premise data warehouses in use today,” said Raju Gulabani, Vice President of Database Services, Amazon Web Services….

Wondering what impact a 90% reduction in cost, if borne out over a variety of customers, will have on the cost of on-premise data warehouses?

Suspect the cost for on-premise warehouses will go up because there will be a smaller market for the hardware and people to run them.

Something to consider as a startup that wants to deliver big data services.

Do you really want your own server room/farm, etc.?

Or for that matter, will VCs ask: Why are you allocating funds to a server farm?

PS: Amazon “Redshift” is another example of semantic pollution. “Redshift” had (past tense) a well-known and generally accepted semantic. Well, except for the other dozen or so meanings for “redshift” that I counted in less than a minute. 😉

In addition to delivering great services and features to our customers, we are constantly working towards helping customers so that they can build highly-scalable, highly-available cost-effective cloud solutions using our services. We not only provide technical documentation for each service but also provide guidance on economics, cross-service architectures, reference implementations, best practices and details on how to get started so customers and partners can use the services effectively.

In this post, let’s review all the content that we published in 2012 so you can help build and prioritize our content roadmap for 2013. We are looking for feedback on content topics that you would like us to build this year.

A mother lode of technical content on AWS!

Definitely a page to bookmark even as new content appears in 2013!

Posted in Amazon Web Services AWS | Comments Off on 2012 Year in Review: New AWS Technical Whitepapers, Articles and Videos Published

I’m happy to be able to report that we have sold all of the available seats at AWS re:Invent! The halls here are ablaze with excitement and we’re all working 18 hours per day to bring you a conference that will be fun, informative, and memorable. We’ve lined up an amazing array of speakers and a good time will be had by all.

The entire team of AWS evangelists is committed to doing everything possible to bring the excitement of the conference online. We’ll be live-blogging, tweeting (using the #reinvent hashtag), posting pictures, posting videos, and posting the slide decks to the Amazon Web Services SlideShare page.

Way cool! The program is stunning.

I would rather be in Las Vegas but will instead be moving meetings that conflict with the stream.

Amazon Web Services invites you to AWS re:Invent, our first global customer and partner conference. Your whole team can ramp up on everything needed to thrive in the AWS Cloud. AWS re:Invent will feature deep technical content on popular cloud use cases, new AWS services, cloud migration best practices, architecting for scale, operating at high availability and making your cloud apps secure.

Sessions: There are 16 tracks and 150+ sessions. The choices are going to be really hard.

A “streaming” registration was due to appear a month before the conference but as of 4 November 2012, no such option is available.

Unlike some conferences, it looks like conference content is going to be limited to registered attendees who physically attend the conference.

This new option will allow you to build, test, and run your low-traffic database-backed applications at a cost starting at $30 per month ($0.04 per hour) using the License Included option. If you have a more intensive application, the micro instance enables you to get hands on experience with Amazon RDS before you scale up to a larger instance size. You can purchase Reserved Instances in order to further lower your effective hourly rate.

These instances are available now in all AWS Regions. You can learn more about using Amazon RDS for managing Oracle database instances by attending this webinar.

Oracle databases aren’t for the faint of heart but they are everywhere in enterprise settings.

If you are or aspire to be working with enterprise information systems, the more you know about Oracle databases the more valuable you become.

The following is a guest post kindly offered by Adam Kawa, a 26-year old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!

Recently I have found an interesting dataset, called Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo, duration, danceability, and loudness as well as artist name, popularity, localization (latitude and longitude pair), and many other things. There are no music files included here, but the links to MP3 song previews at 7digital.com can be easily constructed from the data.

The dataset consists of 339 tab-separated text files. Each file contains about 3,000 songs and each song is represented as one separate line of text. The dataset is publicly available and you can find it at Infochimps or Amazon S3. Since the total size of this data sums up to around 218GB, processing it using one machine may take a very long time.

Definitely, a much more interesting and efficient approach is to use multiple machines and process the songs in parallel by taking advantage of open-source tools from the Apache Hadoop ecosystem (e.g. Apache Pig). If you have your own machines, you can simply use CDH (Cloudera’s Distribution including Apache Hadoop), which includes the complete Apache Hadoop stack. CDH can be installed manually (quickly and easily by typing a couple of simple commands) or automatically using Cloudera Manager Free Edition (which is Cloudera’s recommended approach). Both CDH and Cloudera Manager are freely downloadable here. Alternatively, you may rent some machines from Amazon with Hadoop already installed and process the data using Amazon’s Elastic MapReduce (here is a cool description written by Paul Lamere of how to use it and pay as little as $1, and here is my presentation about Elastic MapReduce given at the second meeting of Warsaw Hadoop User Group).

An example of offering the reader their choice of implementation detail, on or off a cloud. 😉

While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees.

We develop graph based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show 1-2 orders of magnitude performance gains over Hadoop-based implementations.
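For readers who have not met the Chandy-Lamport snapshot algorithm mentioned in the abstract, here is a minimal single-threaded simulation of the classic algorithm on a toy money-transfer system. It is not GraphLab's implementation; the `Process` and `Network` classes, the balances and the message schedule are all made up for illustration, and channels are assumed to be reliable and FIFO.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance      # application state
        self.snapshot = None        # recorded local state, once taken
        self.recording = {}         # sender -> messages seen while recording
        self.channel_state = {}     # sender -> finalized in-flight messages


class Network:
    """Processes connected by reliable FIFO channels, one per ordered pair."""

    def __init__(self, processes):
        self.procs = {p.name: p for p in processes}
        self.channels = {
            (a, b): deque() for a in self.procs for b in self.procs if a != b
        }

    def send(self, src, dst, amount):
        # Application message: move `amount` from src to dst.
        self.procs[src].balance -= amount
        self.channels[(src, dst)].append(amount)

    def start_snapshot(self, name):
        # Record local state, send a marker on every outgoing channel,
        # and start recording every incoming channel.
        proc = self.procs[name]
        proc.snapshot = proc.balance
        for (src, dst), queue in self.channels.items():
            if src == name:
                queue.append(MARKER)
            elif dst == name:
                proc.recording[src] = []

    def deliver(self, src, dst):
        # Deliver the oldest message on channel src -> dst.
        msg = self.channels[(src, dst)].popleft()
        proc = self.procs[dst]
        if msg == MARKER:
            if proc.snapshot is None:
                self.start_snapshot(dst)
            # The channel's state is whatever was recorded before the marker
            # (empty if this marker triggered the local snapshot).
            proc.channel_state[src] = proc.recording.pop(src)
        else:
            proc.balance += msg
            if src in proc.recording:
                proc.recording[src].append(msg)


def snapshot_total(net):
    """Sum recorded balances plus money recorded as in flight."""
    return sum(
        p.snapshot + sum(sum(msgs) for msgs in p.channel_state.values())
        for p in net.procs.values()
    )


if __name__ == "__main__":
    net = Network([Process("A", 100), Process("B", 50)])
    net.send("A", "B", 10)     # 10 is in flight; A now holds 90
    net.start_snapshot("B")    # B records 50 and sends a marker to A
    net.deliver("A", "B")      # transfer arrives while B records channel A -> B
    net.deliver("B", "A")      # A receives the marker, records 90, sends its own
    net.deliver("A", "B")      # B receives A's marker and closes channel A -> B
    print(snapshot_total(net))
```

The point of the exercise is that the snapshot accounts for all 150 units even though it was taken while a transfer was in flight: the 10 units show up in the recorded state of channel A -> B rather than in either balance.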

I’m going to bet that you (or your organization) spend a lot of time and a lot of money archiving mission-critical data. No matter whether you’re currently using disk, optical media or tape-based storage, it’s probably a more complicated and expensive process than you’d like which has you spending time maintaining hardware, planning capacity, negotiating with vendors and managing facilities.

True?

If so, then you are going to find our newest service, Amazon Glacier, very interesting. With Glacier, you can store any amount of data with high durability at a cost that will allow you to get rid of your tape libraries and robots and all the operational complexity and overhead that have been part and parcel of data archiving for decades.

Glacier provides – at a cost as low as $0.01 (one US penny, one one-hundredth of a dollar) per Gigabyte, per month – extremely low cost archive storage. You can store a little bit, or you can store a lot (Terabytes, Petabytes, and beyond). There’s no upfront fee and you pay only for the storage that you use. You don’t have to worry about capacity planning and you will never run out of storage space. Glacier removes the problems associated with under or over-provisioning archival storage, maintaining geographically distinct facilities and verifying hardware or data integrity, irrespective of the length of your retention periods.

The caveat is that you don’t have immediate access to your data (it is called “Glacier” for a reason), but it is still an impressive price.

Unless you are monitoring nuclear missile launch signatures or are a day trader, do you really need arbitrary and random access to all your data?

Or is that a requirement because you read some other department or agency was getting “real time” big data?

Titan is an Apache 2 licensed, distributed graph database capable of supporting tens of thousands of concurrent users reading and writing to a single massive-scale graph. In order to substantiate the aforementioned statement, this post presents empirical results of Titan backing a simulated social networking site undergoing transactional loads estimated at 50,000–100,000 concurrent users. These users are interacting with 40 m1.small Amazon EC2 servers which are transacting with a 6 machine Amazon EC2 cc1.4xl Titan/Cassandra cluster.

The presentation to follow discusses the simulation’s social graph structure, the types of processes executed on that structure, and the various runtime analyses of those processes under normal and peak load. The presentation concludes with a discussion of the Amazon EC2 cluster architecture used and the associated costs of running that architecture in a production environment. In short summary, Titan performs well under substantial load with a relatively inexpensive cluster and as such, is capable of backing online services requiring real-time Big Graph Data.

This poster presents an overview of Titan along with some excellent stress testing done by Matthias and Dan LaRoque. The stress test uses a 6 machine Titan cluster with 14 read/write servers slamming Titan with various reads and writes. The results are presented in terms of the number of bytes read from and written to disk, the average runtime of the queries, the cost of a transaction on Amazon EC2, and an estimate of the number of users interacting concurrently.

Being a poster you will have to pump up the size for legibility but I think you will like the poster.

Impressive numbers. Including the Amazon EC2 cost.

Makes me wonder when governments are going to start requiring cost comparisons for system bids versus use of Amazon EC2?

Amazon is tooting the horn of one of its larger customers, Netflix, when it says:

Our friends at Netflix have embraced AWS whole-heartedly. They have shared much of what they have learned about how they use AWS to build, deploy, and host their applications. You can read the Netflix Tech Blog to benefit from what they have learned.

Earlier this week they released Asgard, a web-based cloud management and deployment tool, in open source form on GitHub. According to Norse mythology, Asgard is the home of the god of thunder and lightning, and therefore controls the clouds! This is the same tool that the engineers at Netflix use to control their applications and their deployments.

Asgard layers two additional abstractions on top of AWS — Applications and Clusters.

Even if you are just in the planning (dreaming?) stages of cloud deployment for your topic map application, it would be good to review the Netflix blog, both the Asgard post and the others.

You know how I hate to complain, ;-), but the Elder Edda does not report “Asgard” as the “home of the god of thunder and lightning.” All the gods resided at Asgard.

Even the link in the quoted part of Jeff’s post gets that much right.

Most of the time old stories told aright are more moving than modern misconceptions.

This is part three of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re exploring the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in Hive, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

Series Part One: Avroizing the Enron Emails. In that post, we used Pig to extract, transform and load a MySQL database of the Enron emails to document format and serialize them in Avro. The Enron emails are available in Avro format here.

Series Part Two: Mining Avros with Pig, Consuming Data with Hive. In part two of the series, we extracted new and interesting properties from our data for consumption by analysts and users, using Pig, EC2 and Hive. Code examples for this post are available here: https://github.com/rjurney/enron-hcatalog.

Series Part Three: Booting HCatalog on Elastic MapReduce. Here we will use HCatalog to streamline the sharing of data between Pig and Hive, and to aid data discovery for consumers of processed data.

Russell continues walking the Enron Emails through a full data lifecycle in the Hadoop ecosystem.

Given the current use and foreseeable use of email, these are important lessons for more than one reason.

What about periodic discovery audits on enterprise email archives?

To see what others may find, or to identify poor wording/disclosure practices?

Over the past couple of decades the medical research community has witnessed a huge increase in the creation of genetic and other biomolecular data on human patients. However, their ability to meaningfully interpret this information and translate it into advances in patient care has been much more modest. The difficulty of accessing, understanding, and reusing data, analysis methods, or disease models across multiple labs with complementary expertise is a major barrier to the effective interpretation of genomic data. Sage Bionetworks is a non-profit biomedical research organization that seeks to revolutionize the way researchers work together by catalyzing a shift to an open, transparent research environment. Such a shift would benefit future patients by accelerating development of disease treatments, and society as a whole by reducing the cost and improving the efficacy of health care.

To drive collaboration among researchers, Sage Bionetworks built an on-line environment, called Synapse. Synapse hosts clinical-genomic datasets and provides researchers with a platform for collaborative analyses. Just like GitHub and SourceForge provide tools and shared code for software engineers, Synapse provides a shared compute space and suite of analysis tools for researchers. Synapse leverages a variety of AWS products to handle basic infrastructure tasks, which has freed the Sage Bionetworks development team to focus on the most scientifically-relevant and unique aspects of their application.

Amazon Simple Workflow Service (Amazon SWF) is a key technology leveraged in Synapse. Synapse relies on Amazon SWF to orchestrate complex, heterogeneous scientific workflows. Michael Kellen, Director of Technology for Sage Bionetworks states, “SWF allowed us to quickly decompose analysis pipelines in an orderly way by separating state transition logic from the actual activities in each step of the pipeline. This allowed software engineers to work on the state transition logic and our scientists to implement the activities, all at the same time. Moreover, by using Amazon SWF, Synapse is able to use a heterogeneity of computing resources including our servers hosted in-house, shared infrastructure hosted at our partners’ sites, and public resources, such as Amazon’s Elastic Compute Cloud (Amazon EC2). This gives us immense flexibility in where we run computational jobs, which enables Synapse to leverage the right combination of infrastructure for every project.”
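The separation Kellen describes, state transition logic on one side and activities on the other, can be sketched in a few lines of plain Python. This does not use the Amazon SWF API and says nothing about how Synapse is built; the activity names, the transition table and `run_workflow` are hypothetical and everything runs in a single process.

```python
# Activities: the work done at each step. Names and bodies are hypothetical.
def normalize(data):
    """Scale values so the largest becomes 1.0."""
    peak = max(data)
    return [x / peak for x in data]


def summarize(data):
    """Reduce the values to their mean."""
    return sum(data) / len(data)


ACTIVITIES = {"normalize": normalize, "summarize": summarize}

# State-transition logic: which step follows which. It knows nothing about
# what the activities compute, so the two parts can be developed separately.
TRANSITIONS = {"start": "normalize", "normalize": "summarize", "summarize": "done"}


def decide(state):
    """Return the name of the next step for the given workflow state."""
    return TRANSITIONS[state]


def run_workflow(data):
    """Drive the pipeline by alternating decisions and activity executions."""
    state, history = "start", []
    while True:
        step = decide(state)
        if step == "done":
            return data, history
        data = ACTIVITIES[step](data)
        history.append(step)
        state = step


if __name__ == "__main__":
    result, history = run_workflow([2, 4, 8])
    print(history, result)
```

Reordering or inserting steps touches only `TRANSITIONS`, while changing what a step computes touches only its activity function, which is the division of labor between engineers and scientists that the quote points to.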

The Sage Bionetworks case study (above) and another one, NASA JPL and Amazon SWF, will get you excited about reaching out to the documentation on Amazon Simple Workflow Service (Amazon SWF).

In ways that presentations that consist of reading slides about management advantages to Amazon SWF simply can’t reach. At least not for me.

Take the tip and follow the case studies, then on to the documentation.

Full disclosure: I have always been fascinated by space and really hard bioinformatics problems. And have < 0 interest in DRM antics on material that, if piped to /dev/null, would raise a user's IQ.