It has become something of a cliché for IT environments to claim that they offer a hybrid cloud when in most cases they are actually providing parallel environments with little functional crossover between local and elastic resources. Without a shared “namespace,” users must duplicate data and modify applications should they wish to compute in the cloud. This might be understandable in the presence of sensitive data or intellectual property, but those exceptions aside, organizations should work towards a seamless access environment that doesn’t require end users to move data and alter workflows simply to run remotely. While Virtual Private Clouds can help local IT shops carve out cloud resources that appear to be extensions of the local network, not much has been accomplished if using them requires a lengthy data upload over a network choked with general campus traffic. (Lack of a dedicated research network is a big problem, but that’s another article.) It is always easier to bring the computation to the data than vice versa, especially in genomic and biomedical computing.

What Are Your Use Cases?

One of the first things a classic IT support person will ask a user is “What is your use case?” In a typical IT shop this is a reasonable question, since workloads are usually static, access patterns are predictable, and storage performance is rarely an issue. A suitable architecture can thus easily be defined that doesn’t require any adjustment to existing governance policies. (Music to an IT manager’s ears!) However, it is largely lost on the classic IT analyst that in research computing, determining an appropriate environment can itself be an ongoing research project involving frequent experimentation before the ideal setup is identified. And even that setup might require subsequent change. This is why research computation deserves its own leadership who understand that “spiky,” volatile workloads and obnoxiously large data sources are de rigueur in research computation. Unfortunately, this is a persistent problem, and it is one reason many investigators choose to access cloud resources independently of any institutional path: they get a direct route to an elastic resource without having to jump through local policy hoops that were probably designed for a generic web server or departmental database rather than a dynamic, high-throughput computational resource.

Overcoming the Past

All of this said, it is not trivial to overlay an existing heterogeneous environment with a common namespace. Outside of the technical realm, it takes careful planning and a willingness to adjust policies to accommodate wider variation in workloads. At a technical level, it involves reconciling naming conventions, filesystem parameters, and any APIs that might be in use, in addition to linkage to virtual machines and instances (both local and remote). The good news is that there are software stacks to help address these issues while providing clean management tools and interfaces for intelligent provisioning and allocation of resources. It’s been my experience that the various technical solutions can be applied as long as the up-front identification of key resources is carefully accomplished. This can sometimes be a primary obstacle to adoption of a true hybrid cloud, as it involves revisiting longstanding policies and procedures that were probably developed for an earlier time when resources were static. It also forces IT shops to examine their inventory and legacy services. But this is going to be essential anyway as organizations move towards the cloud.

In wrapping this up, computational researchers can go directly to the cloud to accomplish their work, but this assumes they have the requisite knowledge to do so. Ideally there would be institutional resources available to help make the management and analysis of their data easier and within the policies of the associated institution. Moreover, accessing data and writing code should not involve changes on the part of the investigator, which is why a true hybrid setup is necessary to animate campus research computation. Need help with this? Let me know.

This is the companion blog to this post which gives an overview of the NSF I-Corps program and the dynamics therein.

The 100 Interviews

You will soon discover that doing interviews is a HUGE part of the I-Corps experience. In fact, most people hear about this aspect of the program long before they apply for the funding. Given that interviews are so essential to the I-Corps program, you must become adept at talking to people, preferably in person. This can be daunting to those who aren’t generally fond of social interaction, but it’s a must if you are to succeed in this program. Your team mentor should be able to provide names from his or her Rolodex, but you should always be on the lookout for someone to interview: the course is only 7 weeks long and you are expected to have at least 100 interviews by its conclusion, preferably accumulated at a rate of about 15 per week to keep an even pace. I thought I would offer some detailed comments on this phase of the program so you can see how one might approach getting the 100 minimum interviews.

Do they really mean 100?

They mean AT LEAST 100, and that number is firm. Obviously some weeks will be drier than others, but they really mean 100 interviews! The first week is the toughest since the Workshop has obligations into the evening, which cuts into your time for getting interviews, so some teams turn in only a few. The instructors will drill into you incessantly about upping the number, which adds to the stress of what is already a pretty intense experience. We were able to leverage contacts in the Bay Area to get about 12 before the opening workshop had concluded. Not all teams did as well, and it wasn’t because they were unmotivated or lazy – their customer segment simply didn’t have many representatives in the area, or making contact on short notice (as with physicians) was difficult. This is why it’s important to have interviews lined up in advance, before you arrive on site for the Workshop.

Business to Consumer?

The approach you employ to get interviews is contingent upon the nature of your technology. If you have a “B2C” (Business to Consumer) solution then you don’t necessarily need high-tech interview candidates – you can simply engage laypeople who might have an interest in your technology. For example, in my cohort there was a team with a solution for snoring, and they were able to get over 50 interviews in one day just by setting up in a busy area and asking couples if they could spare a few minutes to talk about snoring. The same might be true of exercise or stress-reduction technologies you are considering commercializing. However, if your technology relates to niche customer segments then it can be challenging to find the right people, though this is part of the process and also why I think conferences and trade shows are the best approach.

Beware the Changing Customer Segments

It is to be expected that your previously held assumptions about your customer segments will change each week – sometimes many times within a given week. As an example, our team believed that the university research market was solid, but we found out that it isn’t, except as a trainer/feeder market into drug discovery and later companion diagnostics (a hypothesis which is still under consideration). So while we did interview lots of researchers, we came to understand that we would have to look much more broadly to get the perspective we needed to identify and define more lucrative customer segments. This meant doubling down on interview efforts and led to lots of anxious moments about who we were going to contact and how we would get in front of them. Any I-Corps team can tell you about this problem. You thought you had the 100 interviews in the bag? Guess again – one or more of your customer segments just pulled out on you, or the teaching team challenged you, so you have to move into uncharted territory. The good news is that you will have lots of company.

Attend Conferences!!

In my opinion this is the best way to get interviews. You could (and probably should) plan the selection of your I-Corps Cohort around conference season relative to your technology. Some industries are “conference heavy” in the Fall whereas others are more abundant in the Spring, with the latter months of the Summer being less popular. I’m assuming that your anticipated customer segment does attend conferences and trade shows, but you will have to make that determination. Keep in mind that if you do attend a conference and rack up a bunch of interviews, you will still be on the hook for presenting at your weekly WebEx conference, so make sure you have plenty of time to develop your presentation, upload it, and get online for the required 1.5 hours. Most hotels have decent enough internet, but you need to make sure that your connectivity is good.

Make friends with your departmental travel representative, since you will be going on the road and will need to make reservations and arrange for reimbursement. By the time I was done with the 7 weeks I was an expert at using my institution’s travel reservation services and submitting expense reports. Also, schedule your conference travel earlier in the program, as by week 6 you will need to start developing your “lessons learned” draft video and making plans for the closing workshop. I started conference travel in the 2nd week of the program and by week 5 had been to 3 different conferences, which yielded plenty of interviews. I had considered going to another conference but was able to get enough interviews locally to make the quota with room to spare.

How to Proceed at Conferences

It’s better to go to conferences with a team member who can help you cover more ground. For example, if you have a Co-EL then you should be able to get lots of interviews. I went by myself to two conferences and got about 21 interviews at each, so imagine how many you could get with two people. Don’t go to a conference with the idea of actually attending it in the traditional sense. You are there to do interviews – not to sit in presentations, except perhaps as a means to identify people who might be interesting to interview. Scan the bios and job descriptions in the conference attendee list and try to prearrange meetings. Even so, I had better success just approaching people at breaks, poster sessions, end-of-day receptions, and in the vendor exhibits, which are also great since the vendors might be your competition, potential partners, or even customers. In fact, some of the best advice and input I received came from vendors.

If the conference starts at 8, be there at 7:15 to catch the early birds (they are always there). Also, if possible, show up a day early, especially if there are pre-conference training sessions. Most people are bored and will be happy to talk. Of course the after-hours social events sponsored by the conference are also excellent ways to get an interview. I walked into the hotel bar with my conference badge on and people started talking to me. The idea is to start a conversation – don’t launch into a robotic recitation of “Hi, I’m conducting customer discovery – could I ask you some questions?” I found that that approach didn’t work so well. Your mileage may vary.

Keep in mind that initiating a technical or industry-specific conversation at a conference is entirely reasonable and expected, which is precisely why conferences are great for interviews. In our case we were trying to figure out the pain points of those engaged in large-scale Next Generation Sequencing analysis, so I stood next to a poster that dealt with this topic and started a conversation with people who approached: “Do you deal a lot with NGS data?” After that I just let the conversation flow. Not every conversation turned into an interview, but most did. Remember that most people will gladly tell you their problems, so make sure to let them do most of the talking. If you are taking notes on a laptop or a tablet, just politely inform them that you are funded by the NSF to do some customer discovery. No one had a problem with me doing that. After about 25 interviews I got pretty good at listening and identifying only the “good stuff,” which I logged into my notes after the conversation had ended. This way I didn’t have to type as I listened, which made the conversation more natural and flowing.

Email – Brute Force Methods Work

Sending out lots of emails and invite requests will work as long as you do so respectfully. I found it useful to emphasize in my email that I was funded by the NSF I-Corps program to perform discovery and that I was NOT a salesperson. That seems to be everyone’s major concern – that you are just a salesperson trying to find an angle. I dispatched 125 requests over the 7 weeks and got perhaps 40 responses, from which I got about 22-25 interviews. My team members got an additional 15-20 interviews, which, combined with the conference interviews, totaled 120 by the 7-week mark. See what I mean? I just ended the program and I’m still getting responses from people who are only now in a position to do an interview. I had an odd experience in that my email to one company contained a typo wherein I had accidentally referenced the NSA instead of the NSF, and the company president responded by referring me to their lawyers!

LinkedIn

I’m not a fan of LinkedIn, so this section will be short. I got maybe 5 interviews out of this medium, but others in my cohort knew how to “work” LinkedIn and were expert at connecting in a way that yielded lots of interviews. I’m pretty sure that I-Corps will pay for a temporary upgrade to the higher levels of LinkedIn membership, which will allegedly give you the ability to send more “InMails” and view more distant connections. So if you are a social media expert then maybe this would be a good angle to pursue.

Meetups

This can be a great way to get some interviews, especially if your technology coincides with an existing Meetup group. The trouble I find with Meetups is that many groups aren’t active, so always check their events calendar to ensure that the group hasn’t gone stale. I was able to attend a couple and conduct some interviews, but because what my team was looking for was a bit specialized, it didn’t turn out to be a big win. I did consider creating a Meetup for Next Generation Sequencing analysis just to see if anyone would join the group, after which I would schedule a meeting at a nearby bar. It costs some money to create the group, and if you get lots of responses then it might work. You are in effect creating your own focus group solely for the purpose of getting interviews.

In The End?

We presented at the closing workshop that we were a “No Go,” which means that we are not moving forward with starting a company within the next 3-6 months. It was only very late in the program that we found what we believe to be a reasonable customer segment, though that itself is a hypothesis that we have to (in)validate before we pursue incorporation. We’ve also been approached by some of the companies we interviewed to explore the possibility of a partnership, so it’s not as if we are slowing down our efforts at this point. Had we not done all of those interviews we would not have made these contacts or be signing NDAs to pursue further discussions. This has been a great experience overall. It is time consuming, so make sure you pursue it when you don’t have teaching duties or other involved obligations. But I think you will be pleased with the result.

National Science Foundation Innovation Corps

I just concluded participation in the NSF’s I-Corps program as part of Bay Area Cohort Number 2, which started in mid-July and ended on August 29th. It was an exciting, nonstop 7 weeks with lots of twists and turns. While I’m exhausted, I thoroughly enjoyed it and learned a great deal which, after all, was the goal. There were some really talented people in the Cohort and I benefited just from interacting with them. It helped reignite some of my dormant social skills, which I now intend to put to good use in developing our technology further. It will more likely be a significant variation of what we originally took to I-Corps, but that’s really the point of the program (in my opinion). What you come out with can be totally different in a surprising way.

The history and intent of the program are outlined in the above link, so I won’t reproduce that information except to say that if you are a scientist, have had some recent NSF funding, and currently possess some technology that you think might be marketable, then you can apply for $50,000 to figure out the commercial potential of your work. Keep in mind that the purpose of I-Corps is to teach approaches and techniques that will enable you to assess the commercial viability of your own product. No one “sits down” with you, analyzes your technology, and suggests a path or arranges meetings with potential funding sources. There are in fact paid consultants who can provide this service, but that is not what I-Corps is about. The idea is to provide a set of skills that will enable you to do these things for yourself, both for your existing technology and any you subsequently develop.

Team Composition

The team that goes to an I-Corps Cohort is typically three people:

The PI, or “Principal Investigator,” who is typically the originator of the technology or product being considered for commercialization

The EL, or “Entrepreneurial Lead,” who is usually a student or a Postdoc and might also be a co-developer of the technology. This person could also be another investigator or professor. There can also be a second, or “Co-EL,” though most teams seem to show up with one.

The IM, or “Industry Mentor,” who is someone with industry experience, preferably within the domain(s) to which you believe your product or technology belongs.

The EL does ALL of the presenting throughout the 7 weeks. Part of the reason for the emphasis on these presentations is to condition you for the possibility of one day interacting with Venture Capitalists, local Research Alliances, Angel Investors, and anyone else who might be in a position to give you money for your technology. Better to hone your presentation chops at I-Corps than in front of a group of unsympathetic strangers. Don’t worry – everyone in your cohort is in the same boat and everyone gets a similar level of scrutiny, so there is no need to take comments personally. During the opening and closing workshops, when it’s your turn to present, your team goes to the podium and the EL presents according to that day’s assignment. In the interim weeks the teams present via WebEx.

Funding your I-Corps Experience

The grant provides $50,000 to cover the costs of the workshops, which are usually $1,500 per person, with the rest of the money being available for conference travel and upgrades such as Dropbox space to accommodate the course materials and the content you will generate over the 7 weeks. There are restrictions on the types of travel: for example, the NSF does not want you going to purely academic conferences. Instead you need to be going to trade shows and “applied” conferences, since your I-Corps experience relates to products and applied tech as opposed to, say, emerging theory discussions. They will go over these guidelines and regulations in the Opening Workshop.

Cohort Dynamics

Your team will be “married” for the duration of the seven weeks of the program. The I-Corps Teaching Team (usually comprised of 6 people) wants all team members to be present at ALL meetings, which include the opening onsite 3.5-day workshop, the weekly remote WebEx presentations, and the onsite 2-day closing workshop. Just accept the fact that you will be super busy during the onsite workshops and you won’t have any problems. Most teams don’t get very much sleep during the opening 3 days. If there is an emergency then by all means tell the Teaching Team or the TA and they will work with you. But they won’t be sympathetic to the idea that, for example, one of the team members has a business meeting or needs to be on a conference call during any of these sessions.

Here is how the calendar looked for my Cohort. Note that the Sunday arrival in San Francisco (on July the 16th) was necessary to attend the mandatory opening reception before the course began on Monday.

The reception on the 16th involved an introductory exercise to get the teams talking to each other and to the group as a whole. The instructors also used this time to introduce themselves and to prepare us for the upcoming busy schedule. After completing the Monday through Wednesday workshop we returned home and for each of the next 6 weeks participated in a 1.5-hour WebEx presentation. We returned to San Francisco for the closing workshop on the 28th and 29th of August to offer our final conclusions.

Everything is a Hypothesis

You will learn that many, if not all, of the ideas you have about your technology and its value are simply hypotheses that have to be (dis)proven based on questions you present to the potential customers you interview over the course of seven weeks. Interestingly, even the domain(s) from which you imagine your potential customers will come are hypotheses, and they can change rapidly – sometimes within the space of a week. Don’t worry: as a student of I-Corps, you will learn to grow comfortable with rapid change and the need to refocus (or “pivot”) to different customer segments as you continue to refine the value propositions of your products. Just consider that any assumption you have about the appeal of your technology has to be vetted via interviews, which is why arranging interactions with potential customers is such an important part of the process.

The so-called “Business Model Canvas” (BMC) is where you register key ideas/hypotheses about your product. This will change from week to week and sometimes multiple times within a week. It looks like the following – well, yours will have more content; this is just the blank template.

The I-Corps experience is concerned primarily with the right side of the canvas, but all areas are important since you will need to find a good product-market fit and identify potential key partners who might be able to help you. My team spent most of our time iterating towards solid Customer Segments (the rightmost column), which in turn influence the formation of compelling Value Propositions (the middle column). Ideas about Channels and Revenue Streams are equally important, so there will be hypotheses about these areas also. Just get used to the idea that interviews will yield information that might require you to abandon strongly held ideas about your technology. But this is necessary if you are to arrive at an informed decision about moving forward with commercialization. The Business Model Canvas winds up being the primary tool for capturing your (dis)proven hypotheses from week to week and for communicating these changes to the I-Corps Teaching Team.

This Sounds Hard – Why Bother?

Why bother with physical exercise? It’s hard too. One way to think about the experience is that it approximates the chaos, uncertainty, fatigue, and general dynamic surrounding startup projects. The process forces you to get out of the lab and talk to people (potential customers and/or partners) who will help determine the appeal and usability of your technology. While there were times when I was not happy with how things were going, I now feel much more confident about approaching potential funding sources. We also made some valuable partner contacts who are reaching out to us to have next-phase discussions relating to our technology. It’s very refreshing to have people calling us instead of us chasing them! I’m very sure we wouldn’t have been able to do that as easily without I-Corps, or it would have taken much longer to transpire.

As you interview the 100 people (yes, that many) over the 7-week period you will learn that your strongly held ideas about your technology will have to change to accommodate reality. This can be a tough realization, but that, to me, is what it’s all about. You go in thinking, “I know EXACTLY who will buy my technology,” only to find out via interviews that your anticipated customers are like, “Err – no thanks.” The cool thing is that you can rebound (or “pivot” in I-Corps speak) to a new, unexpected direction that could hold much greater promise. There are no guarantees that this will happen, of course, but the more interviews you do the more apparent it becomes who is interested in you (or not). So I think there is a lot to be gained from the experience and I would recommend it.

Preparing for the Experience

As you apply for acceptance into I-Corps, please consider the location and time of year, since you are expected to have 15 interviews lined up in advance. In-person interviews are strongly encouraged; Skype and phone interviews are accepted, but the program pushes in-person contacts. (More on the interview process here.) The time required for participation in I-Corps is significantly more than the 15 hours stated by the NSF and the syllabus, especially for the Entrepreneurial Lead; the 15-hour figure is a minimum, at least in my experience. In any case, once you are accepted into the program you will need to select a Cohort. Pick a location that promises a significant number of in-person interview possibilities relative to your technology.

Background:

Over the past year I have been inundated with questions from a wide variety of researchers across a number of institutions (national and international) seeking guidance on how best to “relocate” or “retool” their research to use on-demand, utility computing services such as those offered by Amazon. Given the frequency and similarity of these inquiries, I decided to put together this FAQ targeted at cloud newcomers who want straightforward information about the viability of on-demand computation tools and their appeal to funding agencies. Most of my experience has been with Amazon Web Services and, to a lesser extent, Google, although I feel that Amazon has a long head start in this space. This FAQ is aimed at biomedical researchers engaged in the management and analysis of genomic data, although most, if not all, of the answers apply to other research domains.

The Data Deluge:

Unless you have been in hiding, you are definitely aware that biomedical and bioinformatics-assisted projects are overrun with data as Next Generation Sequencing technology continues to offer better coverage and faster run times at decreasing cost. Consequently we are running out of places to put this data, let alone analyze it. Being able to “park,” for example, twenty terabytes of data on a storage device is useful, but if that data is not also visible to an associated computational grid via a high-speed network and a well-performing file system, then analysis will be difficult. “Data has gravity,” which means that it has become easier to bring the computation to the data than the other way around. If your local environment has no difficulty providing this capability and is fully committed to helping you scale your research, then you probably don’t need cloud resources. However, it is important to develop a longer-range strategy, since institutions as a whole are looking at cloud computing to reduce costs, and ongoing local hardware acquisitions and associated data center upgrades are being reconsidered.
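To see why “data has gravity” in practice, it helps to run the arithmetic. Here is a minimal back-of-the-envelope sketch; the 20 TB figure comes from the example above, while the 1 Gbps link and 50% effective throughput are my own illustrative assumptions, not measurements from any particular campus.

```python
# Rough transfer-time estimate for moving research data to the cloud.
# All inputs are illustrative assumptions; adjust for your own link.

def transfer_time_days(data_tb, link_gbps, efficiency=0.5):
    """Days to move `data_tb` terabytes over a `link_gbps` link.

    `efficiency` is a rough fudge factor for protocol overhead and
    competing campus traffic (0.5 = half the nominal bandwidth).
    """
    data_bits = data_tb * 1e12 * 8               # terabytes -> bits
    effective_bps = link_gbps * 1e9 * efficiency # usable bits/second
    return data_bits / effective_bps / 86400     # seconds -> days

# 20 TB over a shared 1 Gbps campus link: roughly 3.7 days of sustained
# transfer before any analysis can even begin.
print(round(transfer_time_days(20, 1), 1))
```

Even with optimistic assumptions, the upload alone consumes days, which is why a Virtual Private Cloud that still depends on a congested general-purpose network solves very little.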

Cloud services provide on-demand, custom compute resources that can be configured and optimized for your unique workload. The same is true of storage, networking (to an extent), databases, and at-scale technologies such as Apache Spark. Cloud computing is offered on a “pay only for what you use” model, so there is no subscription cost. The resources you “spin up” are yours alone, so these environments can be highly customized, with the ability to create templates and associated images to facilitate reproducible research and ease distribution of any software products you develop as part of your research. Another way to consider this is that anyone with a laptop, an Internet connection, and a credit card can set up and manage large-scale computational resources including databases, replicated storage, and fast networking. Of course, knowing how to accomplish these things is a prerequisite, so optimal use of the cloud becomes a function of one’s willingness to learn and/or engage knowledgeable collaborators.
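To make “configured and optimized to your unique workload” concrete, here is a sketch of the kind of sizing decision you make before spinning anything up: given memory and CPU requirements, pick the cheapest instance that fits. The catalog below is a tiny, hand-picked illustration with approximate prices; check current provider pricing rather than relying on these numbers.

```python
# Hypothetical sketch: choose the cheapest instance type that satisfies
# a workload's vCPU and memory needs. The catalog and hourly prices are
# illustrative assumptions, not a live price list.

CATALOG = [
    # (name, vCPUs, RAM in GB, approx. $/hour on-demand)
    ("t2.large",      2,    8,  0.093),
    ("m4.4xlarge",   16,   64,  0.862),
    ("r4.8xlarge",   32,  244,  2.128),
    ("x1.32xlarge", 128, 1952, 13.338),
]

def cheapest_fit(need_vcpus, need_ram_gb):
    """Return the name of the cheapest entry meeting both requirements,
    or None if nothing in the catalog is large enough."""
    fits = [i for i in CATALOG if i[1] >= need_vcpus and i[2] >= need_ram_gb]
    return min(fits, key=lambda i: i[3])[0] if fits else None

print(cheapest_fit(8, 32))     # a mid-sized job fits on m4.4xlarge
print(cheapest_fit(64, 1000))  # a memory-hungry job needs x1.32xlarge
```

The real benefit of the pay-as-you-go model is that this choice is cheap to revise: run the workload, measure, and resize, rather than committing to hardware up front.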

Are you sure this isn’t just a fad?

Hardly. Independently of research computing interests, the use of cloud resources is immensely popular in many domains. Ever watch Netflix? That runs on Amazon. Your banking solutions are cloud based (SaaS – Software as a Service), as is your email (MS Office 365), so don’t expect the trend to stop anytime soon. CIOs are under pressure to reduce costs associated with infrastructure and data centers, which means the only viable alternative solutions will be cloud-based. As it relates to computational research, there will always be a need for some type of local experimental activity (see other questions in this FAQ for more detail), so it’s not as if moving to the cloud requires a total evacuation of local resources – although some IT shops might like that, since it takes the pressure off of them to continue to provide local resources. On the other hand, the cloud generally provides far more flexibility in compute and storage services than most IT shops, so consider that the cloud isn’t something that happens to you – it happens for you. Also consider that there are a number of excellent large-scale computational resources such as TACC, Oak Ridge, and the SouthDB hub, which offer a more specialized form of compute services.

What questions should I ask of my existing computational provider?

Chances are that your local provider of computational resources is aware of cloud and utility computing, so they might in fact have some ideas about what the future holds. However, you should be proactive and ask them what plans and timelines exist to migrate from local to cloud resources. In reality they should be pursuing a hybrid model – a mixture of local and cloud resources – where some workloads might be so large as to be feasible only in the cloud. A hybrid model allows users to continue within a familiar environment as they learn how to move their workloads into the cloud. An even more important question is how your local facility intends to support users in the transition. This is critical to facilitate adoption of cloud resources and to make users productive.

Is using the cloud better than using my in-house computational cluster?

Are you happy with what you’ve got?

The answer is mostly a function of your satisfaction with your current environment. But the two aren’t mutually exclusive: many people use both cloud and local resources and over time are moving towards a “cloud preferred” model at their convenience. Relative to local resources, an important question is to what extent those resources will continue to be available and at what cost. Also ask your local HPC provider to what extent they intend to expand using cloud resources. Hybrid environments can be powerful in that anything too large to run locally can be spun out into the cloud, although that linkage has to be set up for this to occur transparently. Many local providers recognize (or should recognize) the utility of the hybrid approach, as it allows them to expand into the cloud without disturbing local resources. It also provides opportunities for them to train personnel to fully exploit the at-scale technologies offered by the cloud. User support is a key consideration, and if your projects remain heavily reliant upon support personnel then you will need to budget for similar assistance when moving to the cloud.

In the cloud it all belongs to you

One idea that cloud novices frequently miss is that when using on-demand compute resources, the resulting environment is yours and yours alone unless you wish to share it. That is, you do not need grid management software; you can run jobs at will and without waiting. Moreover, you can start up multiple servers of arbitrary memory and storage size to arrive at the optimal configuration for your workload. To illustrate this point, I recently assisted a researcher in creating an Amazon x1.32xlarge instance (128 vCPUs and 1,952 GB of RAM). As his computational career to date had exclusively involved shared Beowulf-style clusters, his first question was, “So what is the average wait length of the job queue?” He simply didn’t grasp that the resource was for his use alone, and that instead of waiting six days for his queued jobs to be accepted and completed (his usual experience) he could be finished in approximately eight hours. Of course this comes at a cost ($13.33 per hour as of this writing), and if you are paying little or nothing for local resources then perhaps waiting six days rather than eight hours is acceptable. You’ll have to make that decision.
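To make the trade-off concrete, here is a back-of-the-envelope comparison using the figures quoted above (substitute your own runtime estimate and the current on-demand rate):

```python
# Back-of-the-envelope: shared cluster queue vs. a dedicated on-demand instance.
# Figures are those quoted in the text; rates and runtimes will vary.
HOURLY_RATE = 13.33   # USD/hour for an x1.32xlarge (as of this writing)
CLOUD_HOURS = 8       # estimated runtime on the dedicated instance
QUEUE_DAYS = 6        # typical wait-plus-runtime on the shared cluster

cloud_cost = HOURLY_RATE * CLOUD_HOURS
speedup = (QUEUE_DAYS * 24) / CLOUD_HOURS

print(f"Cloud cost: ${cloud_cost:.2f}")       # Cloud cost: $106.64
print(f"Turnaround speedup: {speedup:.0f}x")  # Turnaround speedup: 18x
```

About $107 to get results in eight hours instead of six days; whether that is worth it depends on what your local resource costs you and how much your time is worth.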

What workloads are NOT appropriate for the cloud?

An obvious case is any project involving protected health information or code associated with proprietary software development. While Amazon is able to accommodate these use cases, you should definitely ask someone locally to determine the relevant policies. Not all IT shops are supportive of, or enamored with, cloud usage in general, let alone when it intersects with sensitive data, so be prepared for some strongly worded responses. Again, protecting health information is essential, so ask around before moving data anywhere! This includes USB sticks, Dropbox accounts, wherever.

Another scenario wherein you might be better off running locally would be any service that can be hosted for free. After all – you aren’t paying anything so unless performance or scalability issues exist, or the subsidies that keep the service free are in jeopardy of ending, then no one would blame you for using such resources in perpetuity. That said, it’s easy to experiment with low cost Amazon instances configured with the LAMP stack to move, for example, a lab web site into the cloud if you wish to do more aggressive web development or make your site independent of an institution.

Also, if you have a collaborator who has access to computational hardware, or a relationship with a computing facility, at an attractive price, then obviously you would explore that option, especially if the collaborator is going to take the lead on analysis and data management.

Does the arrival of cloud computing imply that supercomputing entities are no longer viable?

Not at all. Dedicated high performance computing installations such as XSEDE, Open Science Grid, Oak Ridge, and TACC (to name a few) can and do provide excellent computing resources as well as expert user support. Depending on the nature of your research and your funding agency, you might receive computing allocations from one or more of these organizations. While many workloads can be “spun up” in the cloud on your own, it might be beneficial to first leverage these resources, especially if they bring specific expertise to bear on your research problem.

Does the NIH endorse cloud computing?

Consider reading the publication “25 Point Implementation Plan to Reform Federal Information Technology Management.” Published in 2010, it describes a shift to a “cloud first” approach. The point is that as far back as seven years ago the federal government, the NIH included, recognized the viability of cloud computing and documented its intent to pursue it as a future direction. It is also important to consider that access to cloud technology has had a democratizing effect: institutions and research laboratories with comparatively limited computational resources can now compete with better endowed institutions by leveraging the enormous power of on-demand computing. The conclusion is that the NIH has been “on board” with cloud computing for years now.

Also check out the NIH Data Commons Pilot Project associated with the BD2K initiative, which uses “a cloud-based data commons model to find and interact with data directly in the cloud without spending time and resources downloading large datasets to their own servers”. The concept here is that “data has gravity,” meaning it is easier to bring the compute resources to where the data is stored than to drag data (possibly in the terabyte to petabyte range) to where the compute resources live.
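A little arithmetic shows why data gravity matters. The sketch below estimates transfer time for a dataset over a sustained network link (an idealized figure; real transfers over shared campus networks are usually slower):

```python
# Why "data has gravity": time to move a dataset over a network link.
# Assumes a sustained, uncontended link; real-world transfers are slower.
def transfer_days(terabytes, gigabits_per_sec):
    bits = terabytes * 1e12 * 8                 # dataset size in bits
    seconds = bits / (gigabits_per_sec * 1e9)   # ideal transfer time
    return seconds / 86400                      # convert to days

print(f"{transfer_days(100, 1):.1f} days")    # 100 TB over 1 Gbps: 9.3 days
print(f"{transfer_days(1000, 10):.1f} days")  # 1 PB over 10 Gbps: 9.3 days
```

Even with a tenfold faster link, a tenfold larger dataset takes just as long, which is why moving the computation to the data usually wins at these scales.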

In 2015 the NIH issued a statement outlining its position on storing controlled-access information in the cloud. This is less an actual endorsement than a recognition of the value and increased use of cloud resources, but since that time the use of Amazon and Google cloud resources within NIH-funded research has grown considerably. Check out the NCI Cancer Genomics Cloud project page for an idea of some of the pilot projects.

When applying for grants it is helpful to view computational cycles as “consumables,” much as we view office supplies. The idea is that funding agencies might not be intrigued by the underlying technical details of a computer’s chipset (unless, of course, that is part of the research); what matters is that the price per cycle is competitive and that any awarded funding will be used optimally.

What is involved in moving to the cloud?

If you do not have an existing workload then it’s as simple as signing up for an account and launching compute instances. Of course, knowing how to do this, and doing it in a productive way, requires knowledge, so if you aren’t up to speed you will need assistance. But this is no different than moving to a new institution and needing help acclimating to the resident computational resource. Some places offer lots of user support whereas others offer almost none. It really depends on the environment.

If you have existing workloads then you will do what is known as a “lift and shift,” wherein you reproduce your local computational environment in the cloud, upload your code and data, and then confirm that the results you are accustomed to getting locally can be reproduced in the cloud. Given the wide variety of available compute instances it is fairly easy to match a local configuration. After your computation is complete you can allow your data to remain in cloud storage, push it to backup storage (e.g. Amazon Glacier), or download it if you have a good deal on local storage.
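Matching a local configuration amounts to picking the cheapest instance type that meets your CPU and memory requirements. The sketch below does this over a tiny illustrative catalog (the instance specs and prices are examples only; check current AWS offerings before relying on them):

```python
# Sketch: pick the cheapest instance type that covers a local node's specs.
# The mini-catalog is illustrative; consult current AWS instance listings.
CATALOG = [
    # (name, vCPUs, RAM in GB, USD/hour) -- illustrative figures only
    ("m4.4xlarge",   16,   64,  0.80),
    ("r4.8xlarge",   32,  244,  2.13),
    ("x1.32xlarge", 128, 1952, 13.33),
]

def match_instance(cpus_needed, ram_needed_gb):
    """Return the cheapest catalog entry satisfying both requirements."""
    candidates = [(price, name) for name, cpus, ram, price in CATALOG
                  if cpus >= cpus_needed and ram >= ram_needed_gb]
    return min(candidates)[1] if candidates else None

print(match_instance(16, 128))  # r4.8xlarge (m4 lacks the RAM, x1 costs more)
```

The same idea scales to the real instance catalog, which has dozens of families tuned for memory, compute, or I/O.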

Consider, though, that simply reproducing workloads in the cloud might underutilize the capabilities of elastic computing. Instead of trying to reproduce a Beowulf-style cluster environment, note that Amazon offers systems with, for example, 64 cores and 256 GB of RAM, so much can be achieved using a single instance. In most cases this simplifies the deployment of your work since you would not need to run jobs across many nodes unless desired. That said, one can spin up a classic cluster within Amazon using software such as CfnCluster or MIT’s StarCluster.

I don’t know much at all about configuring Linux machines or how to architect computational environments. What can I do?

In some cases you might benefit from SaaS (Software as a Service) solutions that neatly sidestep the requirement to create everything from scratch. Services such as Galaxy CloudMan allow you to create a fully functioning, scalable version of Galaxy, a popular bioinformatics framework for analyzing genomic data. Of course you are still left with the job of using the software, but you did not have to agonize over how many servers to implement, how much storage to set up, what operating system to use, or what versions of various genomes to download. This is a true convenience, especially for someone just breaking into large-scale genomic computation. There are also SaaS solutions such as Seven Bridges Genomics that provide professional support services, so if you anticipate a need for that type of involvement, check them out.

If you don’t want to do any of this, or if you are transitioning to a phase of your career that will involve more in-depth computation, you will need to recruit knowledgeable students. Knowing how to instantiate, manage, and effectively exploit scalable cloud resources is a career path all its own, so many students might be interested in acquiring these skills (or might already have them). However, if you anticipate relying upon students, remember that their “day job” is as a student in a graduate program, so they still need to focus on science and research. Identify students with some programming experience, ideally in multiple languages (Python, C++, Java, R), hopefully some Linux command line experience, and, if you are lucky, some system administration experience. Spinning up Linux instances is very common, so being able to install, upgrade, and configure open source packages is essential. Thankfully, many bioinformatics analysis environments are being distributed as AMIs or Docker images for convenient use with little or no up-front configuration.

I have collaborators at other institutions. Is it possible to include them in my research when using cloud resources?

Yes. In fact this is a key feature of Amazon AWS, which uses Identity and Access Management (IAM) policies to assign roles to team members independently of location. It is also possible to set up computing instances in VPCs (Virtual Private Clouds) that may or may not be reachable from the general internet, though this is completely under your control. The cloud provides an easy-to-use shared middle ground for collaboration, and ongoing work can be “mothballed” and then reanimated at will without losing configurations and various workflow dependencies.
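As a small illustration, a collaborator can be granted read-only access to a shared project bucket with a minimal IAM policy along these lines (the bucket name my-project-data is hypothetical; your own resources and action list may differ):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-project-data",
        "arn:aws:s3:::my-project-data/*"
      ]
    }
  ]
}
```

Attaching a policy like this to an IAM user or role lets a remote collaborator list and download project data without granting any ability to modify or delete it.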

What are the basic concepts behind cloud computing?

There is an incredible amount of hype and terminology overlaying the cloud space, much of it emanating from vendors seeking to reposition themselves and their products to profit from cloud technologies. However, as a biomedical researcher your concerns probably relate more to processing terabytes of sequencing data, building an accurate and robust predictive model, or processing billions of remote sensor readings, all of which require far more than a canned service. This is not to say that all vendors offering solutions are flawed or are hawking yesterday’s technology under a new name; just that there are enough of them that caution is warranted.

Consider that there are 1) computational environments, such as those offered by Amazon and Google, and 2) whatever you wind up putting “on top” of these computational environments. For example, you will hear of SaaS (Software as a Service) solutions that are “one stop shops” provided by a vendor or institution. Think of Microsoft 365 as such a solution, wherein you access email, calendaring, and file sharing from one interface via a web browser. In the world of bioinformatics, think of something like the Galaxy CloudMan package, which offers a comprehensive environment for genomic data processing and transparently provisions the necessary servers, storage, operating systems, and networking required to support such a tool.

However, for the typical computational researcher the point of entry into the cloud will be IaaS (Infrastructure as a Service), which allows you to specify the amount of storage, the number and type of servers, the networking between them, and the operating system. Shared file systems can be set up, as can event-driven processing. IaaS provides the greatest flexibility but also requires the most knowledge to provision. Some researchers slide comfortably into this role, though most seek out collaborators and/or recruit students who can help.

What type of improvements can I expect by moving my code to the cloud?

If you are experiencing performance bottlenecks then it is natural to assume that any “larger” resource will probably provide enhanced performance. Frequently that is the case, though it isn’t always clear which key factor(s) are most responsive to an enhanced resource. That is, was it the increased network bandwidth, the additional memory, or the additional core count that led to better performance? There are ways to make this determination, but estimating a “percent improvement” simply by moving to the cloud (or even to another local computer) is difficult in the absence of an existing benchmark, yet it is one of the questions I am asked most frequently. I understand that many researchers (or their students) are executing pipelines given to them by someone else, so intimate knowledge of which workflow components might need refactoring might not be a primary consideration, though it really should be, especially when anticipating a need for massive scale-up. Frequently the code and data are moved to an instance and that becomes the benchmark, which is fine. Just keep in mind that arbitrarily selecting configurations in the hope of a generalized performance increase is not a very scientific approach. If you need help, ask for it.

What do I need to watch out for when experimenting with the cloud?

The unanticipated big bill

The biggest fear anyone has involves receiving a large bill after engaging in some basic experimentation. Some researchers fear that merely signing up for a Google or Amazon cloud account will incur cost. Don’t worry: this isn’t a cable subscription. You pay only for what you use, so if you don’t use anything, no charges will accrue. There is a considerable amount of Amazon training material available on YouTube that helps novices become comfortable with navigating the AWS Dashboard, so even for total newcomers it is easy to set up an account and begin experimenting.

A typical experimentation session on Amazon might involve logging in, creating an instance, and having some fun with it, but then you get interrupted. Let’s say that 4 hours have passed and you come back to see your instance still running. Guess what? You are being charged the applicable hourly rate, although if you booted a smaller instance you would be looking at maybe $2.50 total. So you pay for what you use. The solution, however, is very easy: Amazon allows you to set “alarms” that send email or text messages based on user-specified thresholds, so the moment your usage exceeds a certain dollar amount, you get notified. This saves you money and also motivates you to watch your workloads. As your computing becomes more complex, you can use load and/or activity triggers to deactivate instances that have been idle for a specified period of time.

Amazon has training wheels

Amazon does provide 750 free hours for experimentation, though this requires use of “free tier” resources. That is, you can’t just log in and spin up a 128-core instance with 1.9 TB of RAM for free. The free tier involves micro instances that I consider the “training wheels” of the cloud. These instances are not very useful for real genomic computing, but they are quite useful for learning how to configure virtual machines and becoming familiar with the basics of managing instances.

I’m currently negotiating a startup package. Should I request cloud credits instead of actual hardware?

This is an interesting question, as being able to continue your work in a new environment is essential to your success. For whatever reason, some researchers are shy about drilling deeply into the computational capabilities of a potential employer, when that should probably be one of the first things discussed. This is particularly important given that institutions are reviewing (or should be) their approach to maintaining and refreshing local hardware in light of the potential cost savings offered by Amazon and Google. The last thing you want is to show up only to find minimal support for your work. All that said, I’m pretty sure you could request cloud credits, but this assumes you have enough familiarity and experience with AWS to reasonably project workload needs. In the absence of that, you could still request some credits for experimentation while using local resources. Independently of the cloud-versus-local issue, always make sure that good user support for your project will be available. I’ve talked to many researchers in transition who observe wide variation between institutions in terms of computational support. Always ask detailed questions about the computational environment before accepting the job.

I use licensed software like MATLAB. Can I still use the cloud?

This is an important question because many institutions have site licenses for commercial products such as MATLAB that allow them to run jobs locally. The best answer is to contact the vendor directly and ask them to describe approved cloud options. Most will direct you to an Amazon Machine Image (AMI) that has the software pre-loaded and licensed in a way that makes it available once you launch that AMI. The vendor will typically charge licensing fees on top of the Amazon usage fees. MathWorks (MATLAB’s parent company) has its own cloud service, in turn based on Amazon AWS, that one can use by signing up on the MathWorks site.

You might also consider refactoring your code to remove the dependency on the commercial solution. Of course, this assumes you can do so without impacting the results you are accustomed to getting. This isn’t to say that these commercial solutions are somehow bad; just that you might want to determine whether the code you are using can be replaced, if only in part, with open source tools. As an example, the Julia language provides a robust, well-performing, parallel language for matrix manipulation (among other things), all for free. More common open source substitutes include Python and R. Note that I am not suggesting that replacing your commercial code with these alternatives is easy or could be done overnight; depending on what your code does, it could become an involved process. However, if you need to scale up your code, or you want to disperse or share it with a larger community, moving to open source will help accomplish that aim.
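As a small taste of what such a substitution looks like, a MATLAB-style linear solve maps directly onto NumPy. This is only a sketch of a single operation; real migrations involve far more than one call:

```python
import numpy as np

# MATLAB:  x = A \ b;   (solve the linear system A x = b)
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)  # NumPy equivalent of MATLAB's backslash operator
print(x)                   # [2. 3.]
```

Much of MATLAB’s core numerical functionality has similarly direct counterparts in NumPy and SciPy, which is why this path, while laborious, is usually feasible.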

From a practical point of view, how do I start using Amazon Web Services?

First, sign up for an account. Don’t worry, it doesn’t cost anything, though you will be prompted to enter a credit card number. You might also consider applying for Amazon research grant credits that can be used for aggressive experimentation. Consider completing the AWS “getting started” tutorials, which are designed to acquaint you with the basic procedures behind spinning up example Linux-based instances. For institutions with a lynda.com subscription, Amazon orientation videos are available there as well. Keep in mind that most successful computational researchers assemble a team over time across which work can be distributed, so always think collaboratively: Amazon has many services in addition to “elastic computing” that will require an investment of time and effort to master. On the other hand, this is true of any field wherein innovative services are on offer; learning to integrate them into your work does take time, so budget accordingly. User support is always important, so ask your local IT or computational resource to what extent they will assist (or not) your use of the cloud. Ideally they would be enthusiastic about helping you, though understand that the “cloud vs local” issue can represent a political hot point for IT organizations concerned about protecting “turf”.

Pittard Consulting of Decatur LLC was founded in 2014 in recognition of the unique technology needs of Higher Education. We provide training and consultation services involving the management, preparation, and analysis of large scale data and cloud-based infrastructure. In particular our areas of expertise are High Performance Computing (HPC), large-scale data storage, R and Python education and classes, Amazon cloud computing, and online learning solutions. We also work closely with Central and Departmental IT groups to assess, update, and optimize service offerings to better support computational researchers. See our Services page for a full description of what we can do for you.

R and Python Programming and Data Science Training

R is the most frequently used analytics / data science software, followed by Python and SQL.

For researchers, post-docs, and students, we can help you become facile with software tools such as R and Python as well as visualization packages including ggplot2 and Tableau. The annual KDnuggets survey indicates that R remains the most popular data science language. While data wrangling and analysis skills are essential in areas such as genomic research, the need now extends to the Humanities and Public Health, where being able to mine social media data has become an important skill. Being able to clean and manage data is a valuable skill on its own, in addition to knowing how to perform basic statistical analysis and visualization. In some cases, in-depth programming knowledge is not necessary to build preliminary prediction or classification models, in which case tools like Orange and Weka can be used to rapidly assess data. We also provide education on how to author reproducible research documents that effectively represent your work, with the added capability of others being able to reproduce it. We can also assist in the preparation of documentation and budget estimates for inclusion in grant proposals and external funding requests.

Technology is undeniably playing a much larger role in research of all types, which in turn requires solid computational expertise. The volume of data, as well as the speed and granularity at which it is generated, is overwhelming for the typical Information Technology department, let alone the individual researcher. Let us help you understand your computational and educational options so you can remain competitive and successful in your computationally based research and NIH grant applications.