Getting Started with The Cloud: The Ecosystem

The first in a series of hands-on articles by Allen Holub about programming for the public cloud

Storage

Before finishing up, there are a few other issues that you need to consider that are just part and parcel of cloud applications, regardless of the provider.

First of all, you need to rethink your relationship with your data. Data may or may not be replicated by the service, and it may or may not be stored on the same physical machine (or on the machine that's running the service that accesses the data). Securing the data can be a significant problem. You can encrypt sensitive information like credit-card numbers, but encrypting everything is not a practical solution because you can't issue queries on encrypted data. More to the point, if you're doing the encryption, then there's always a point at which the data is unencrypted and your encryption key is in plain sight. If your application is running on someone else's server, you're potentially vulnerable. Of course, a dedicated VM is less vulnerable than a shared server at an ISP that allows shell access, but you'll never be as secure as you would be in your own data center.
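To make the trade-off concrete, here is a minimal sketch of encrypting a single sensitive field before it goes into storage. It is an illustration rather than a recommended design: the class name FieldCrypto and its methods are hypothetical, it assumes Java 8 or later (for java.util.Base64 and AES-GCM support), and it sidesteps the hard part, which is where the key lives.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical helper: encrypts one field (say, a card number) with AES-GCM.
final class FieldCrypto {
    private static final int IV_BYTES = 12;   // IV size commonly used with GCM
    private static final int TAG_BITS = 128;  // authentication tag length
    private static final SecureRandom RANDOM = new SecureRandom();

    // Generates a fresh key. A real application must store this somewhere,
    // which is exactly the exposure discussed above.
    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns Base64(IV + ciphertext). A new random IV is used on every call.
    static String encrypt(SecretKey key, String plaintext) {
        try {
            byte[] iv = new byte[IV_BYTES];
            RANDOM.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] body = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
            byte[] out = new byte[iv.length + body.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(body, 0, out, iv.length, body.length);
            return Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reverses encrypt(): splits off the IV, then decrypts and verifies the tag.
    static String decrypt(SecretKey key, String stored) {
        try {
            byte[] in = Base64.getDecoder().decode(stored);
            byte[] iv = Arrays.copyOfRange(in, 0, IV_BYTES);
            byte[] body = Arrays.copyOfRange(in, IV_BYTES, in.length);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            return new String(cipher.doFinal(body), StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because each call uses a fresh IV, encrypting the same card number twice yields two different stored strings. That is good for secrecy, but it also shows why you can't run an ordinary equality query against an encrypted column. And note that the SecretKey has to be in the application's memory whenever encrypt() or decrypt() runs.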

Also, bear in mind that your data is certainly replicated on many servers. This organization can be an important performance enhancer. Consider a video-streaming application that stores its data in the cloud. The actual application may be in only one place, but once we start streaming, we hope that we're connected to the version of the data with the fastest transmission time. However, more servers in more data centers means more vulnerability. And every cloud-service-provider employee who has administrative access to a physical cloud server also has access to your data, so your vulnerability is replicated along with the data.

There's also the issue of backup. A cloud provider may or may not actually back up your data. Google, for example, does tape backups of Gmail, but as far as I know, does not back up its general storage system. Feel free to correct me if I'm wrong, but practically speaking, cloud applications have to assume that there's no underlying backup mechanism. The data is replicated on many servers, and the servers themselves use RAID drives (if they have disks in them), so it would take a global catastrophe to lose your data altogether, but it's not possible to go back in time as you could do using a backup tape. You can write a web application that transfers data from Google to your own local repository, but that's actually a surprisingly difficult (and painfully slow) operation to perform using JDO/JPA, which is the only access method that Google provides.

Security

The biggest problem with cloud applications is actually application security. Security is, of course, a huge problem with most software. Programmers are simply unaware of how to write secure applications, and management is typically unwilling to spend the paltry sum required for the training that would eliminate 90% of the problem. It's symptomatic that most of the really big breaches I've read about in the past few years have been done using SQL injection, which is not only a venerable, well-understood exploit, but is also trivial to defeat. This is a case where 10 minutes of training could eliminate millions of dollars of vulnerability, but the training still doesn't happen.

One of the problems is perception. Application security has nothing at all to do with things like firewalls and SSL. The IT department is simply not involved. Most exploits attack an application through a bug of some sort, usually a minor one. And most hackers get at that bug simply by using the program in predictable ways, just like all your other users do. There are no secret back doors, and the vulnerabilities are typically in plain sight. For instance, the classic SQL-injection exploit that you find in the books shows you how to get a dump of someone's entire database by exploiting a very simple bug in a website's password-recovery page (access to which doesn't typically require a password, I would hope). It doesn't matter whether you access that page using HTTPS; if you can access the page at all, you can do the damage.
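Here is a rough sketch of the kind of bug in question and the usual fix, written against plain JDBC. The class PasswordRecovery, the users table, and its id and email columns are all hypothetical names invented for illustration; the point is only the difference between pasting input into the SQL text and binding it as a parameter.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical password-recovery lookup; table and column names are made up.
final class PasswordRecovery {

    // VULNERABLE: the user's input becomes part of the SQL text itself,
    // so a quote character in the input can change the query's meaning.
    static String unsafeQuery(String email) {
        return "SELECT id FROM users WHERE email = '" + email + "'";
    }

    // The SQL text is fixed; the input is supplied separately as a parameter.
    static final String SAFE_QUERY = "SELECT id FROM users WHERE email = ?";

    // Safer lookup: the driver treats the bound value as data, never as SQL.
    static boolean emailExists(Connection connection, String email) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(SAFE_QUERY)) {
            statement.setString(1, email);
            try (ResultSet rows = statement.executeQuery()) {
                return rows.next();
            }
        }
    }
}
```

If someone types x' OR '1'='1 into the form, unsafeQuery() produces SELECT id FROM users WHERE email = 'x' OR '1'='1', a condition that is true for every row. Passed to emailExists(), the same input is simply an odd-looking email address that matches nothing.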

So, the biggest security problem with a cloud-based Web 2.0 application is the size of the attack surface — the number of places where a hacker can potentially use a bug to break into the system. Most Web 2.0 applications use remote procedure calls, or an equivalent mechanism like REST, heavily; and every one of those calls — in fact, every argument to every one of those calls — represents a potential vulnerability. It's easy to deal with these problems if you know about them (e.g., check that all arguments are valid and reasonable on both the client and server side; if you're using Google's Web Toolkit, you can literally use the same Java code on both sides to do the checking).
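As a sketch of what such a shared check might look like, the class below validates two arguments of an imagined remote call. ArgumentChecks, the user-name rules, and the quantity limits are assumptions made up for this example. It deliberately sticks to java.lang so that it has a reasonable chance of being usable on both sides, though whether a given class compiles for the client depends on your toolkit's setup.

```java
// Hypothetical validation rules shared by the client and the server.
final class ArgumentChecks {
    private static final int MIN_NAME_LENGTH = 3;
    private static final int MAX_NAME_LENGTH = 20;
    private static final int MAX_QUANTITY = 100;

    // A user name is 3 to 20 characters: ASCII letters, digits, or underscore.
    static boolean isValidUserName(String name) {
        if (name == null || name.length() < MIN_NAME_LENGTH || name.length() > MAX_NAME_LENGTH) {
            return false;
        }
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean allowed = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9') || c == '_';
            if (!allowed) {
                return false;
            }
        }
        return true;
    }

    // "Valid and reasonable": a quantity must be positive and not absurdly large.
    static boolean isValidQuantity(int quantity) {
        return quantity >= 1 && quantity <= MAX_QUANTITY;
    }

    // Server-side guard: reject the call outright instead of trusting the client.
    static String requireUserName(String name) {
        if (!isValidUserName(name)) {
            throw new IllegalArgumentException("invalid user name");
        }
        return name;
    }
}
```

The client would call isValidUserName() to give the user immediate feedback; the server would call requireUserName() at the top of the remote method, because a hacker can bypass the client entirely and call the server directly.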

The real solution to this problem is simple: training.

The Entire Ecosystem

There are a few loose ends to cover. First, Google provides a reasonably rich set of support services for your web application. Get complete details at http://code.google.com/appengine/docs/java/apis.html, but here's a list of functionalities to be aware of. I'll demonstrate a few of these in future articles.

Blobstore: Lets you store very large objects that can't be handled by the standard JDO mechanism, and serve them directly to your users if you like. It's handy for things like big images.

Capabilities: A management API that lets you dynamically detect whether other Google services are operational. You can use it to disable features of your own app when a Google service on which it depends goes down for maintenance.

Channel: A mechanism for pushing information down to a browser-based client, so that client can update itself without polling.

Mail: Lets you both send and receive email from your application. You send using standard JavaMail APIs. You receive by writing a servlet that waits for email to arrive. (Google receives the email, then posts it to your servlet.) This facility is useful, but Google limits the number of emails that you can send in a day to 500, so you can't use this service for mailing lists or bulk mail.

Memcache: A mechanism for caching chunks of data in "memory." This is a wrapper around Java's JCache APIs. Caching through this API is better than rolling your own cache because Memcache can scale properly if the application is running on multiple machines.

Multitenancy: Effectively adds namespaces to the storage system so that you can partition your data easily.

OAuth: Provides a mechanism to grant third-party access to Google services. For example, a customer of yours could use this mechanism to allow your application to access his Google Docs files for persistent storage.

Task Queues: Allows your application to execute background tasks that are not necessarily triggered by a user action.

URL Fetch: A wrapper around java.net.URL and related classes that lets you access other web content using URLs. Handy for doing things like sending bulk email from a non-Google server.

Users: Allows you to use Google's login mechanism for your application. That is, one of your users can log in to your application using Google's login page. I have mixed feelings about this service because it's one of the few services that doesn't just implement a standard Java library. If you use it as your sole log-in mechanism, you're effectively giving your user list to Google, and I'd rather know who my users are, thank you.

XMPP: Support for XMPP-compatible IM services (like Google Talk).

APIs

The final thing to think about is the set of services that can coexist with your web application. Google, under the moniker "GData," provides API access to literally all of its web applications, from Calendar to YouTube, making it easy to do things like integrate a Google Docs page into your application or update an appointment on a Google Calendar.

You can find the complete list of APIs at http://code.google.com/apis/gdata/docs/directory.html. Most of these are simple REST-based APIs. You typically encode a request in a URL, send it with an HTTP GET or POST, and receive a result in JSON. However, Google provides both Java and Python libraries that wrap the REST APIs, and it also provides an Eclipse plug-in that makes it easier to write to the APIs. I'll talk about how to use these APIs in future articles.
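To give a feel for what encoding a request in a URL means, here is a small sketch that builds a GET URL from a base address and a set of query parameters. Nothing here is taken from a real Google API: the class RestRequest, the host api.example.com, and the q and alt parameter names are placeholders chosen for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

// Hypothetical helper that encodes a request as a URL with query parameters.
final class RestRequest {

    // Appends each parameter as name=value, percent-encoding both parts.
    static String buildUrl(String base, Map<String, String> params) {
        StringBuilder url = new StringBuilder(base);
        char separator = '?';
        for (Map.Entry<String, String> entry : params.entrySet()) {
            url.append(separator)
               .append(encode(entry.getKey()))
               .append('=')
               .append(encode(entry.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    // URLEncoder applies form encoding, so a space becomes '+'.
    private static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

Given the base https://api.example.com/feeds/items and the parameters q=cloud storage and alt=json (in that order, using a LinkedHashMap), buildUrl() returns https://api.example.com/feeds/items?q=cloud+storage&alt=json. You could then hand that string to java.net.URL to issue the GET and parse whatever comes back.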

