Archive for July 4th, 2017

Note: This is a continuation of my post on exploring Google Cloud Platform. There is also a mirror of it on dev.to.

At first glance, Datastore is equivalent to DynamoDB. I personally think that Datastore is better compared to SimpleDB. Unfortunately, SimpleDB isn’t accepting new customers/isn’t being deployed to new regions, so it’s not a good comparison.

That said, the fundamental idea behind Datastore is the same (Hosted NoSQL database), but how it’s implemented is very different.

App Engine Datastore vs Cloud Datastore?

At one point, Cloud Datastore was (is?) part of App Engine, but it’s since been split out. Presumably as part of this legacy, Datastore appears limited to the regions that App Engine is in, which unfortunately isn’t all of Google’s regions.

Additionally, an App Engine account is created for Cloud Datastore. It’s required if you use the Datastore SDKs as well. Why this dependency exists, and why it’s exposed are open questions.

If you’re unsure which of the two a documentation page is describing, the only thing I can suggest is to check the URL: make sure the docs you’re reading start with https://cloud.google.com/datastore/.

Pros:
1. SQL-like semantics (transactions!)
2. More granular breakdowns for multi-tenancy: namespaces/’Kind’/’ancestor path’ (Google says a Kind is functionally equivalent to a table). I’m not sure about the usefulness of the namespace/kind distinction, but it’s an extra way to get multi-tenancy and is ignored by default, so meh.
3. Per request pricing! DynamoDB charges for the capacity you provision, not what you actually use. Given AWS’s obsessive focus on “pay what you use”, Dynamo’s provisioned read/write units are odd.
4. Automatic indexes on every property enable arbitrary querying, unlike DynamoDB, where you must define every index you want.
5. A dashboard that allows SQL-like queries to be run (but only SELECT queries)

Cons:
1. Nothing like DynamoDB streams (which are awesome for replication/async actions that are implicitly triggered off a data change)
2. Dynamo has 25x the storage on the free tier compared to Datastore (25GB vs 1GB)
3. Dynamo offers more total read/write operations per day – good if you have a consistent request rate, bad if you have bursts
4. Index (created by default, you have to opt out) data storage seems to be charged for
5. Creating a custom index requires the use of the gcloud CLI tool. There is no mention of any other method in the index documentation.
6. If you have a query that involves filtering on more than one property, you might run into a situation that isn’t covered by the built-in indexes or is otherwise impacted by one of a decently long list of query restrictions.

While you could get away with doing a scan + filter combination in Dynamo, GQL will reject the query with a "Your Datastore does not have the composite index (developer-supplied) required for this query." error. (My use case was select * from kind where property1 < value order by property2.)

I haven’t found a way to get Datastore to scan and filter server side, so I have to iterate over everything and throw away data that I don’t want – after retrieving it.
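The workaround described above boils down to fetching every entity of the kind, then filtering and sorting on the client. Here is a minimal sketch of that shape, using plain dicts in place of real Datastore entities. The fetch_all callable, the function name and the sample data are hypothetical stand-ins, not part of any Google SDK.

```python
from operator import itemgetter


def filter_and_sort(fetch_all, filter_prop, upper_bound, sort_prop):
    """Emulate `WHERE filter_prop < upper_bound ORDER BY sort_prop` client side.

    fetch_all is any callable returning an iterable of dict-like entities;
    in real use it would wrap an unfiltered query for the whole kind.
    """
    kept = [
        entity
        for entity in fetch_all()  # every entity is retrieved first
        if entity[filter_prop] < upper_bound  # unwanted ones are discarded here
    ]
    return sorted(kept, key=itemgetter(sort_prop))


# Stand-in data so the sketch runs without a Datastore project.
entities = [
    {"property1": 5, "property2": "b"},
    {"property1": 50, "property2": "a"},
    {"property1": 1, "property2": "c"},
    {"property1": 3, "property2": "a"},
]

result = filter_and_sort(lambda: entities, "property1", 10, "property2")
```

The cost of this approach grows with the size of the kind rather than the size of the result, which is exactly why it feels wrong.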

Pricing

A bit more about the price, because the pricing models of the two products are really different.

So if you’re trying to save money, and drop your read/write units to 1, and you do something request heavy, you’re going to have a bad time unless you increase the units before running your operation. Dynamo’s new auto scaling feature takes some time to kick in as well (the scale up alarms take 5 minutes to kick in – the CloudWatch alarm is set on ConsumedWriteCapacityUnits > NN for 5 minutes).
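To put a rough number on that five-minute gap, here is a back-of-the-envelope sketch. It assumes a constant request rate and that one write consumes one unit, and it ignores DynamoDB’s burst capacity and SDK retries, both of which soften the picture in practice. The function name and the figures are illustrative only.

```python
def writes_over_capacity(provisioned_units, requested_per_second, alarm_seconds=300):
    """Count writes that exceed provisioned capacity before scaling can react.

    Simplification: one write consumes one unit, the request rate is constant,
    and nothing is absorbed by burst capacity or retried successfully.
    """
    excess_per_second = max(0, requested_per_second - provisioned_units)
    return excess_per_second * alarm_seconds


# A table dropped to 1 write unit, then hit with 50 writes/second:
over = writes_over_capacity(provisioned_units=1, requested_per_second=50)
print(over)  # 14700 writes over capacity during the 5-minute alarm window
```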

If you’re doing any sort of table scanning in Dynamo to find elements by properties, or you have indexes on single properties, chances are Datastore will work better for you by virtue of the built-in-by-default indexes. You can get the same functionality out of Dynamo, but it’s harder to set up, and functions as (and is charged as) a separate table.

If you have composite (multi-property) indexes, that’s a bit more complicated. Datastore does a far better job of hiding the index complexity (once it’s set up) and actually using the indexes. But the setup process is hit or miss, requiring you to know in advance things like sort orders.

If you’re not doing anything fancy, and just accessing everything directly by key, Dynamo is better for small scale stuff by virtue of the massively greater free storage space (25GB vs 1GB).

Note: This is a continuation of my post on exploring Google Cloud Platform. There is also a mirror of it on dev.to.

Storage price wise, S3 and GCS are mostly comparable, with a note that GCS bandwidth is more expensive, and you don’t really get a choice of what rate you want to pay.

GCS Single Region is pretty much directly equivalent to standard S3, and Nearline is equivalent to S3 Infrequent Access, complete with per GB retrieval fees. Storage for both classes is cheaper than the respective S3 classes. Operation fees are exactly the same though.

There are two other storage classes though:

Multi-region Buckets

One major feature is multi-region buckets for $0.006/GB more. Presumably built to avoid incidents like us-~~tirefire~~-east-1 falling over and your buckets disappearing, it’s a definite point if you need high availability (and are willing to trust that Google has proper HA).

Assuming single region stores 3 copies, I speculate that for the price, multi-region stores 2 copies in each region for a total of 4 copies. That (suspected) single additional copy would be why multi-region buckets are only a third more expensive.

Coldline

Coldline is a bit more interesting. As the equivalent of Glacier, it has one big advantage – quick retrieval times (on the order of a second), much better than Glacier. The downside is the fixed retrieval costs – $0.05/GB. Just comparing Coldline to Nearline means you shouldn’t retrieve anything more than once every 2 years(!) if you want to save money compared to Nearline. (Nearline is $0.01/GB, Coldline is $0.007/GB, so you save $0.003/GB/month going for Coldline.)
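The break-even arithmetic can be sketched as follows. It uses only the three prices quoted above and ignores Nearline’s own retrieval fee. On those numbers, one full retrieval eats roughly 17 months of storage savings, so the two-year figure leaves some margin.

```python
def break_even_months(retrieval_fee_per_gb, monthly_saving_per_gb):
    """Months of storage savings needed to pay for one full retrieval."""
    return retrieval_fee_per_gb / monthly_saving_per_gb


NEARLINE_STORAGE = 0.01    # $/GB/month, as quoted above
COLDLINE_STORAGE = 0.007   # $/GB/month, as quoted above
COLDLINE_RETRIEVAL = 0.05  # $/GB, as quoted above

saving = NEARLINE_STORAGE - COLDLINE_STORAGE
months = break_even_months(COLDLINE_RETRIEVAL, saving)
print(round(months, 1))  # 16.7
```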

In comparison, Glacier has a number of retrieval speeds and corresponding retrieval pricing. Even Glacier’s expedited requests (the most expensive option) cost 40% less than Coldline. (Interestingly, the bulk retrieval option works out to about 25% of the cost of retrieving from S3 IA, but takes on the order of 8-12 hours.)

Considering Glacier costs almost 50% less than Coldline, I really question the utility of Coldline. With the restore costs it only makes sense if you’re storing archival data that doesn’t need to be accessed for around 2 years. I think Glacier has a much better handle on the expected use cases here.

I can see Nearline being used (and people having the expectation of immediate access), but the retrieval price of Coldline with no way to change that makes me very leery of using it.

Note: This is a continuation of my post on exploring Google Cloud Platform. There is also a mirror of it on dev.to.

Compute Engine is one of the more important services for me. While I’d love to have all my stuff on managed services like Lambda/Cloud Functions, it’s not possible. The fundamentals of GCE are pretty similar to EC2, but there are a few features that would be interesting to see EC2 adopt (if ever).

Pros:
1. The GCE console supports SSHing to an instance within your browser. There’s also automatic SSH key adding, the lack of which had annoyed people (me) using EC2. You can still have a master SSH keypair, but you can also add keypairs for individual users.
2. Linux OSes (Debian, CentOS, etc) are natively supported, not “support for Amazon Linux, and maybe other distros”. This is pretty much required for some of the stuff that GCE is offering. The auto-key pair adding? Done by a daemon waiting for instructions. Said daemon has been added to the OS for you.
3. Custom machine types (mix and match memory and CPU cores). I’d love to see EC2 adopt this, but it’s not going to happen any time soon.
4. Sustained usage discounts: No reserved instances required for discounts
5. Preemptible instances have a 24 hour limit, unlike EC2’s spot blocks, which max out at 6 hours. Having a fixed price means you don’t have to worry about bill shock. The tradeoff is that there’s no spot instance equivalent, where if your bid is high enough, the instance will practically never be terminated.
6. Live migration for maintenance events! It’s coming to EC2 (it’s in the Xen mainline), just a question of when.

Cons:
1. No security groups equivalent
2. Firewall rules applied to the entire VPC, or specific labelled instances (I suppose you could bludgeon this into security groups).
3. No comparison graphing. Are they depending on external providers (e.g. Datadog) like they do for sending email? (Yes, the creators of Gmail recommend using another party to send email…)
4. The bandwidth out prices. Google might have a super awesome network, with private backhaul to endpoints that are close… but outbound bandwidth is more expensive than on AWS.

Also, it’s subjective, but I find that the GCE console UI is a lot cleaner than EC2’s.

I have a bunch of experience with AWS (Disclaimer: I worked there). My AWS account dates back to 2010, but I’ve only really started using AWS heavily in the last 2 years.

I’ve been speccing out and performing cost estimations for a new project recently, and with the introduction of the GCP free tier and the $300 of credit, I decided to look into some of the services GCP offers to see how it compares to AWS.

Google has a comparison between AWS and GCP, which is useful but pretty dry. I decided to just dive in and experiment – that $300 of credit means I’m pretty safe!

Registering for GCP

It was a matter of going to the GCP console and logging in with my Google account. I had to sign up for the free trial and provide my credit card details, but that was it. Compared to the AWS signup process, this was a lot simpler.

However, it’s simple because Google has effectively split the account verification steps – I used my Gmail account, which was already verified. A side effect of this is that created resources are associated with this account by default. An AWS account is trivial to transfer – update the email address and be done with it. My Google account? Less easily transferred, but that brings me to the first major difference.

AWS Accounts vs GCP Projects

Google doesn’t really expound on the account/project difference, merely saying this in their comparison:

Cloud Platform groups your service usage by project rather than by account. In this model, you can create multiple, wholly separate projects under the same account. In an organizational setting, this model can be advantageous, allowing you to create project spaces for separate divisions or groups within your company.

In practice, it’s an entirely different way of handling resources. If you wanted to run something in its own isolated silo in AWS, you would generally create an entirely separate account, and use consolidated billing/AWS Organizations (which is a whole other set of problems). In GCP, each project is its own little silo, with no communication between projects by default.

After getting used to AWS (and using AWS Organizations to handle the account-per-project), this is a very different way of thinking. To me, there are two main benefits.

The first is that switching ownership of resources is incredibly simple – assign a new project owner, and remove the existing owner, and it’s done. What’s most impressive is that (as far as I can tell), the transfer will be done without interrupting anything currently running. Compute Engine instances will continue to run, Cloud Storage buckets don’t need to have contents copied out, the bucket deleted, then recreated in the new account, hoping that no one else steals the bucket name in the meantime.

The second benefit is that segmentation of projects is far easier. You don’t have to have the equivalent of an AWS account per project if you want separation for security.

A nerfed IAM?

The downside of separation by project is that GCP seems not to have an active equivalent of AWS IAM’s ability to restrict access to individual resources. The GCP documentation explicitly calls this out:

I am conflicted over this situation. Best practice says that accounts/roles should have the fewest permissions possible. I try to lock down my IAM policies to specific resources wherever possible. For example, a user can only interact with a single SQS queue because I restrict the attached IAM policy by queue name.
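For reference, this is roughly what that kind of resource-scoped AWS policy looks like, built as a Python dict. It is a sketch only: the region, account ID, queue name and the choice of actions are placeholders, not a policy I’m actually running.

```python
import json


def single_queue_policy(region, account_id, queue_name):
    """Build an IAM policy document scoped to one SQS queue."""
    queue_arn = "arn:aws:sqs:{}:{}:{}".format(region, account_id, queue_name)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "sqs:SendMessage",
                    "sqs:ReceiveMessage",
                    "sqs:DeleteMessage",
                ],
                # Naming the queue here, instead of "*", is the restriction.
                "Resource": queue_arn,
            }
        ],
    }


# Placeholder region, account ID and queue name.
policy = single_queue_policy("us-east-1", "123456789012", "orders-queue")
print(json.dumps(policy, indent=2))
```

Swapping the Resource value for "*" gives the permissive version that a lot of real-world policies end up with.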

On GCP, it’s all or nothing within a project. I have to allow access to all PubSub topics if I want to allow access to one.

Now, what actually happens is that people can and do liberally use * in their IAM policies in AWS, so Google’s just making it really easy to get up and running.

But the fact that the restrictions aren’t available by default is worrying, especially for large companies that do have the capability to manage IAM policies (and not operate accounts per service).

I think Google’s realised this, and is extending IAM (still in Alpha) to allow permissions to be defined on individual resources where supported (eg PubSub, Datastore). It looks like it’s possible to use the IAM API to define custom roles, but I haven’t successfully done so. I just ended up using project isolation, which works, but feels bad.

Authentication

GCP has a greater variety of ways to authenticate with their APIs compared to AWS.

Compute Engine/App Engine work with IAM and get credentials, much like EC2’s instance roles. These are limited to individual projects.

Developers using the gcloud CLI can authenticate using OAuth2, and switch between projects.

Non-interactive systems outside GCP use a service account that’s tied to a specific project.

Using the SDK requires creating a service account, which generates a JSON file (or PKCS12, but let’s ignore that). The easiest way is to use an environment variable GOOGLE_APPLICATION_CREDENTIALS to set the location of the file when using the SDK, and let the SDK handle everything.

You can also define the file location in code, as you can with Boto. (And presumably other AWS SDKs; I’ve only really used the Python version.)
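A minimal sketch of both options in Python, assuming the google-cloud-datastore package is installed (I’m using Datastore as the example client). The resolve_credentials_path helper is a hypothetical illustration of which path ends up being used. It is not part of the SDK, which does its own lookup of GOOGLE_APPLICATION_CREDENTIALS.

```python
import os


def resolve_credentials_path(explicit_path=None, env=None):
    """Return the service account key file path that would be used, or None.

    An explicit path wins; otherwise fall back to the
    GOOGLE_APPLICATION_CREDENTIALS environment variable.
    """
    env = os.environ if env is None else env
    return explicit_path or env.get("GOOGLE_APPLICATION_CREDENTIALS")


def make_datastore_client(explicit_path=None):
    """Build a Datastore client (needs the google-cloud-datastore package)."""
    # Imported lazily so the helper above works without the package installed.
    from google.cloud import datastore

    if explicit_path:
        # Key file location defined in code.
        return datastore.Client.from_service_account_json(explicit_path)
    # No path given: the library reads GOOGLE_APPLICATION_CREDENTIALS itself.
    return datastore.Client()
```

Calling make_datastore_client() with no arguments is the “let the SDK handle everything” route. Passing a path is the in-code equivalent.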