Cloud Bigtable Schema Design for Time Series Data

Whenever you measure something, and you record the time together with the
measurement, you're building a time series. Time series are all around us:

When you look at the plot of memory usage on your computer because it's
running slow, you're looking at a time series.

When you look at temperature over time on a news report, you're looking at a
time series.

If you're a foreign exchange trader, and your job involves plotting 5-, 10-,
and 30-day moving average prices for USD/JPY, you're looking at time series.

Time series are also incredibly important:

Time series help us optimize resource usage, decrease energy usage, minimize
environmental impact, and reduce cost.

Time series help us identify trends in data, letting us demonstrate concretely
what happened in the past and make informed estimates about what will happen
in the future.

Time series underpin some of the complex analysis and machine learning in
fields such as financial services, retail, insurance, physics, and chemistry.

This guide provides detailed strategies and a walkthrough for storing and
querying time-series data in Cloud Bigtable.

Time series and Cloud Bigtable

Storing time-series data in Cloud Bigtable is a natural fit.
Cloud Bigtable stores data as unstructured columns in rows; each row has a
row key, and row keys are sorted lexicographically.

There are two commonly used ways to retrieve data from Cloud Bigtable:

You can get a single row by specifying the row key.

You can get multiple rows by specifying a range of row keys.

These methods are ideal for querying time-series data, since you often want data
for a given time range (for example, all of the market data for the day, or
server CPU statistics for the last 15 minutes). As a result, Cloud Bigtable
is functionally a great fit for time series.
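The two read patterns above can be sketched in a few lines. The following is a toy in-memory model (not the Cloud Bigtable API; the `SERVER` keys and values are hypothetical): rows are kept sorted lexicographically by row key, so both a point lookup and a contiguous range scan are cheap.

```python
from bisect import bisect_left

# Rows sorted lexicographically by row key, as Bigtable stores them.
rows = {
    "SERVER1#20150301#1200": {"CPU": 0.42},
    "SERVER1#20150301#1215": {"CPU": 0.48},
    "SERVER1#20150301#1230": {"CPU": 0.51},
    "SERVER2#20150301#1200": {"CPU": 0.12},
}
sorted_keys = sorted(rows)

def get_row(key):
    """Read a single row by its exact row key."""
    return rows.get(key)

def scan_range(start_key, end_key):
    """Read all rows with start_key <= row key < end_key."""
    lo = bisect_left(sorted_keys, start_key)
    hi = bisect_left(sorted_keys, end_key)
    return [(k, rows[k]) for k in sorted_keys[lo:hi]]

# All of SERVER1's data for the day, without touching SERVER2's rows:
day = scan_range("SERVER1#20150301", "SERVER1#20150302")
```

Because the keys are sorted, the range scan touches only the contiguous block of SERVER1 rows; this locality is exactly what makes time-range queries efficient.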

Of course, the devil is always in the details. For Cloud Bigtable, the
devil is that the schema for your data—the columns and the row-key
structure—must be designed carefully. A good schema results in excellent
performance and scalability, and a bad schema can lead to a poorly performing
system. However, there is no single schema design that provides the best fit for
all use cases.

The remainder of this paper presents a number of patterns for schema design in
Cloud Bigtable. You can use these patterns to design an ideal schema for your
use case. After enumerating and explaining the patterns for schema design, you
can learn from examples for the following use cases:

Financial market data

Server metrics (for example, CPU, memory, and network usage)

Intelligent energy meters (part of the "Internet of Things", or
IoT)

Schema design patterns for time series

The schema design patterns for storing time series in Cloud Bigtable fit into
the following categories:

General patterns

Patterns for row key design

Patterns for data column design

General patterns

Keep names short but meaningful

When you transfer data from Cloud Bigtable, you're also transferring
metadata, including:

The row key

The column family, an identifier that's used to group related columns

The column qualifier, a unique name within a column family

As a result, it's important to choose meaningful names that are also as short as
possible, because the size of each name contributes to storage and RPC overhead.
For example, rather than using CELLPHONE_NUMBER as a column qualifier, you
might use CELL as a short but meaningful abbreviation.

Patterns for row key design

Use tall and narrow tables

A tall and narrow table has a small number of events per row, which could be
just one event, whereas a short and wide table has a large number of events per
row. As explained in a moment, tall and narrow tables are best suited for
time-series data.

For example, suppose you take the temperature in your vegetable garden each
morning. Because you take one reading each day, one row per day is
appropriate; as a result, your table is tall and narrow. Note that the
timestamp isn't the first element of the row key. As explained later, using a
timestamp as the first element of a row key can cause a variety of problems.

Row key                  Column data
VEGGIEGARDEN#20150301    DAILY:TEMP:60.4
VEGGIEGARDEN#20150302    DAILY:TEMP:61.2
VEGGIEGARDEN#20150303    DAILY:TEMP:61.0
VEGGIEGARDEN#20150304    DAILY:TEMP:65.1
VEGGIEGARDEN#20150305    DAILY:TEMP:62.2
...                      ...
VEGGIEGARDEN#20150331    DAILY:TEMP:60.4

Note: In the table representations in this paper, the row key is shown in its
own column, with elements separated by # symbols to aid readability. The
column data is shown in the form column family, column qualifier, and data,
separated by : symbols.

In contrast, suppose you want to plot the temperature over each
month, so one row per month is appropriate. The following example shows the
short and wide table that you get as a result:

Row key                  Column data
VEGGIEGARDEN#20150301    TEMP:1:60.4
                         TEMP:2:61.2
                         TEMP:3:61.0
                         TEMP:4:65.1
                         TEMP:5:62.2
                         ...
                         TEMP:31:60.4

For time series, you should generally use tall and narrow tables. This is
for two reasons: Storing one event per row makes it easier to run queries
against your data. Storing many events per row makes it more likely that the
total row size will exceed the recommended maximum (see Rows can be big but are
not infinite).

As an optimization, you can use short and wide tables, but avoid unbounded
numbers of events. For example, if you usually need to retrieve an entire
month of events at once, the temperature table above is a reasonable
optimization—the row is bounded in size to the number of days in a month.

Prefer rows to column versions

In Cloud Bigtable, columns can have timestamped versions. As a result, it's
theoretically possible to store a time series as a set of versions of a column.
For example, if you wanted to record the closing price of ZXZZT shares each day,
you could have a single column with a timestamped version for each day:

Row key    Column data
ZXZZT      STOCK:PRICE (V1 03/01/15):558.40
           STOCK:PRICE (V2 03/02/15):571.34
           STOCK:PRICE (V3 03/03/15):573.64
           STOCK:PRICE (V4 03/04/15):573.37
           STOCK:PRICE (V5 03/05/15):575.33

However, this isn't the best way to store this data.

By default, use new rows instead of column versions. Using multiple rows,
with a single version of an event in each row, is the simplest way to represent,
understand, and query your data.

It is acceptable to use versions of a column where the use case is actually
amending a value, and the value's history is important. For example, suppose
you did a set of calculations based on the closing price of ZXZZT, and initially
the data was mistakenly entered as 559.40 for the closing price instead of
558.40. In this case, it might be important to know the value's history in case
the incorrect value had caused other miscalculations.

Design your row key with your queries in mind

When Cloud Bigtable stores rows, it sorts them by row key in lexicographic
order. There is effectively a single index per table, which is the row key.
Queries that access a single row, or a contiguous range of rows, execute quickly
and efficiently. All other queries result in a full table scan, which will be
far, far slower. A full table scan is exactly what it sounds like—every
row of your table is examined in turn. For Cloud Bigtable, where you could be
storing many petabytes of data in a single table, the performance of a full
table scan will only get worse as your system grows.

For example, consider a table where players' scores from video games are stored,
which might be designed as follows.

Note: Two of the row keys in this example are identical, which indicates
that the table is storing two versions of the data for that row key. Row keys
must always be unique in a Cloud Bigtable table.

Row key               Column data
LoL#20150301          GAME:PLAYER:Corrie
                      GAME:WIN:false
                      GAME:KDA:4.25
LoL#20150302          GAME:PLAYER:Jo
                      GAME:WIN:true
                      GAME:KDA:7.00
LoL#20150302          GAME:PLAYER:Sam
                      GAME:WIN:true
                      GAME:KDA:7.00
LoL#20150303          GAME:PLAYER:Corrie
                      GAME:WIN:true
                      GAME:KDA:9.50
Starcraft#20150303    GAME:PLAYER:Eriko
                      GAME:WIN:true
                      GAME:KDA:6.00

Suppose you want to query this data to answer the question "How many games of
LoL did Corrie win in March?" With the schema shown above, you will have to scan
most of the table to answer this question. In contrast, if you design the table
as follows, you could complete this query by retrieving a specific range of row
keys:

Row key                     Column data
LoL#Corrie#20150301         GAME:WIN:false
                            GAME:KDA:4.25
LoL#Corrie#20150303         GAME:WIN:true
                            GAME:KDA:9.50
LoL#Jo#20150302             GAME:WIN:true
                            GAME:KDA:7.00
LoL#Sam#20150302            GAME:WIN:true
                            GAME:KDA:7.00
Starcraft#Eriko#20150303    GAME:WIN:true
                            GAME:KDA:6.00

Choosing a row key that facilitates common queries is of paramount importance
to the overall performance of the system. Enumerate your queries, put them in
order of importance, and then design row keys that work for those queries.
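A sketch of how the redesigned key pays off (a hypothetical in-memory model, not the Bigtable API): with row keys of the form GAME#PLAYER#DATE, the question "How many games of LoL did Corrie win in March?" becomes a single contiguous range scan instead of a full table scan.

```python
def game_row_key(game, player, yyyymmdd):
    # Key layout from the redesigned table: GAME#PLAYER#DATE.
    return f"{game}#{player}#{yyyymmdd}"

rows = {
    game_row_key("LoL", "Corrie", "20150301"): {"WIN": "false", "KDA": "4.25"},
    game_row_key("LoL", "Corrie", "20150303"): {"WIN": "true", "KDA": "9.50"},
    game_row_key("LoL", "Jo", "20150302"): {"WIN": "true", "KDA": "7.00"},
    game_row_key("LoL", "Sam", "20150302"): {"WIN": "true", "KDA": "7.00"},
    game_row_key("Starcraft", "Eriko", "20150303"): {"WIN": "true", "KDA": "6.00"},
}

def prefix_scan(prefix):
    # In Bigtable a prefix scan is a contiguous row-key range;
    # startswith() simulates that here.
    return {k: v for k, v in sorted(rows.items()) if k.startswith(prefix)}

# "How many games of LoL did Corrie win in March?"
march = prefix_scan("LoL#Corrie#201503")
wins = sum(1 for v in march.values() if v["WIN"] == "true")
```

The scan reads only Corrie's LoL rows; Jo's, Sam's, and Eriko's rows are never examined.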

How do you deal with a situation where there is no perfect row key? For example,
suppose the queries for all LoL games in March and all LoL games played by
Corrie in March were equally important. The schema above would let you query
for Corrie's LoL games in March, but it wouldn't help with all LoL games in
March—the best you could do is query all LoL games and then filter for
March. There are two ways to solve this problem:

Denormalization

Use two tables, each with a row key appropriate to one of the queries. This is
a good solution, because it results in a robust, scalable system.

Query and filter

Stick with the schema shown above, and have one query (all LoL games in March)
that underperforms because you are filtering a large number of rows. This is
not normally a good solution, because it results in a less scalable system
that could easily deteriorate as usage increases.

Ensure that your row key avoids hotspotting

The most common issue for time series in Cloud Bigtable is hotspotting. This
issue can affect any type of row key that contains a monotonically increasing
value.

In brief, when a row key for a time series includes a timestamp, all of your
writes will target a single node, fill that node, and then move on to the next
node in the cluster, resulting in hotspotting. For example, if you're storing a
cell phone's battery status, and your row key consists of the word "BATTERY"
plus a timestamp (as shown below), the row key will always increase in sequence.
Because Cloud Bigtable stores adjacent row keys on the same server node, all
writes will focus on one node until that node is full, at which point writes
will move to the next node in the cluster.

Row key                      Column data
BATTERY#20150301124501001    METRIC:USER:Corrie
                             METRIC:PERCENTAGE:98
BATTERY#20150301124501002    METRIC:USER:Jo
                             METRIC:PERCENTAGE:54
BATTERY#20150301124501003    METRIC:USER:Corrie
                             METRIC:PERCENTAGE:96
BATTERY#20150301124501004    METRIC:USER:Sam
                             METRIC:PERCENTAGE:43
BATTERY#20150301124501005    METRIC:USER:Sam
                             METRIC:PERCENTAGE:38

There are a few ways to solve this problem:

Field promotion. Move fields from the column data into the row key to make
writes non-contiguous.

Salting. Add an additional calculated element to the row key to
artificially make writes non-contiguous.

Field promotion

In this example, you will promote USER from a column to an element of the row
key. This change would solve the hotspotting issue because user identifiers will
provide a more uniform distribution of row keys. As a result, writes will be
split across multiple nodes in your cluster.

The advantage of field promotion is that it often makes your queries more
efficient as well, making this strategy a clear winner. The (slight)
disadvantage is that your queries are constrained by your promoted fields,
leading to rework if you don't promote the right fields.

Row key                             Column data
BATTERY#Corrie#20150301124501001    METRIC:PERCENTAGE:98
BATTERY#Corrie#20150301124501003    METRIC:PERCENTAGE:96
BATTERY#Jo#20150301124501002        METRIC:PERCENTAGE:54
BATTERY#Sam#20150301124501004       METRIC:PERCENTAGE:43
BATTERY#Sam#20150301124501005       METRIC:PERCENTAGE:38

Salting

In this example, you will take a hash of the timestamp and divide it by 3; take
the remainder of this calculation; and add the remainder to the row key. Why 3?
This is an estimate of the number of nodes in the cluster in this case and would
provide a good division of activity across those nodes.

The advantage of salting is its simplicity—it's essentially a simple
hashing function. One disadvantage is that when you query for time ranges,
you'll have to do multiple scans—one scan per salt value—and combine
the results in your own code. Another disadvantage is that it's difficult to
choose a salt value that both distributes activity across nodes and operates
well as you scale your system up or down. Because of these disadvantages, and
because it's best to use human-readable row keys, avoid salting unless you can
find no other way to prevent hotspotting.

Row key                        Column data
BATTERY#1#20150301124501003    METRIC:USER:Corrie
                               METRIC:PERCENTAGE:96
BATTERY#1#20150301124501004    METRIC:USER:Sam
                               METRIC:PERCENTAGE:43
BATTERY#2#20150301124501002    METRIC:USER:Jo
                               METRIC:PERCENTAGE:54
BATTERY#3#20150301124501001    METRIC:USER:Corrie
                               METRIC:PERCENTAGE:98
BATTERY#3#20150301124501005    METRIC:USER:Sam
                               METRIC:PERCENTAGE:38
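The salting scheme above can be sketched as follows. This is a hypothetical illustration: md5 is used only as a convenient stable hash, and 3 is the cluster-size estimate from the example.

```python
import hashlib

NUM_SALTS = 3  # estimate of the number of nodes in the cluster

def salted_key(metric, timestamp_ms):
    # Hash the timestamp and keep the remainder mod NUM_SALTS, so the
    # same event always lands in the same salt bucket.
    digest = hashlib.md5(str(timestamp_ms).encode()).hexdigest()
    salt = int(digest, 16) % NUM_SALTS
    return f"{metric}#{salt}#{timestamp_ms}"

def range_scan_prefixes(metric):
    # The price of salting: a single time-range query becomes NUM_SALTS
    # separate scans whose results you must merge in client code.
    return [f"{metric}#{s}#" for s in range(NUM_SALTS)]
```

Note how the read path mirrors the disadvantage described above: every time-range query fans out into one scan per salt value.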

By default, prefer field promotion. Field promotion avoids hotspotting in
almost all cases, and it tends to make it easier to design a row key that
facilitates queries.

Use salting only where field promotion does not resolve hotspotting. In the
rare case where you apply a salting function, be careful not to make too many
assumptions about the underlying size of the cluster. The example above uses a
salting function that assumes there are 3 nodes in the cluster;
this assumption is safe because it would scale to the limited number of nodes
that can exist in a Cloud Bigtable cluster. If you could create clusters with
hundreds of nodes, you would want to use a different salting function.

Reverse timestamps only when necessary

You can reverse timestamps by subtracting the timestamp from your programming
language's maximum value for long integers (such as Java's
java.lang.Long.MAX_VALUE). By reversing the timestamp, you can design a row
key where the most recent event appears at the start of the table instead of the
end. As a result, you can get the N most recent events simply by retrieving the
first N rows of the table.
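A minimal sketch of the reverse-timestamp trick, using Java's Long.MAX_VALUE (2**63 - 1) as the text suggests; the EVENT prefix and zero-padding width are illustrative assumptions.

```python
LONG_MAX = 2**63 - 1  # java.lang.Long.MAX_VALUE

def reversed_ts_key(prefix, timestamp_ms):
    # Subtract the timestamp from the maximum long value, and zero-pad
    # to a fixed width so the reversed value sorts correctly as a string.
    return f"{prefix}#{LONG_MAX - timestamp_ms:019d}"

keys = [reversed_ts_key("EVENT", ts) for ts in (1000, 2000, 3000)]
# After sorting, the most recent event (ts=3000) comes first, so the
# "N most recent events" are simply the first N rows.
newest_first = sorted(keys)
```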

Prefer reverse timestamps only where your most common query is for the latest
values. This is because reversing timestamps makes every other query more
complex and complicates the overall schema.

Patterns for data column design

Rows can be big but are not infinite

Rows in Cloud Bigtable can contain ~100 column families and millions of
columns, with a 100 MB limit on each value stored in a column. These generous
limits provide great flexibility. However, you shouldn't assume that big rows
are the right way to store data, and that you should therefore fill each row
with as much data as possible. Always keep in mind that retrieving large values
takes extra time and memory.

In general, keep row sizes below approximately 100 MB. This is more of a
guideline than a rule--rows can be larger than 100 MB. However, if you have many
rows larger than this, you should expect performance issues.

In general, keep column values below approximately 10 MB. Again, this is
more of a guideline than a rule--you can store some values that are larger than
10 MB, but they are likely to cause performance issues.

To reiterate, if you often rely on large rows, or large individual values,
expect performance issues in your system.

Keep related data in the same table, keep unrelated data in different tables

Cloud Bigtable is a key/value store, not a relational store. It does not
support joins, nor does it support transactions except within a single row. As a
result, it's best to access data in individual rows or in a set of contiguous
rows.

One result of this pattern is fairly obvious: In the vast majority of cases,
time-series queries are accessing a given dataset for a given time period.
Therefore, make sure that all of the data for a given time period is stored in
contiguous rows, unless doing so would cause hotspotting.

Another result is that when you read data for a row, or a range of rows, that
data should be useful on its own—you should not need to combine it with
other data. For example, suppose you are storing user activity on a shopping
website, and you often need to retrieve the last five actions performed by the
user so you can display them in a sidebar. In this case, you should consider
denormalizing your data and including some user and product details in the table
of recent actions. In contrast, with a relational database, you would likely
store the user ID and product ID in one table, then join that table with
separate user and product tables in your SQL query.

That said, you don't need to include every piece of data about an entity in
every single row. For example, if you're displaying information about a user's
recent actions, you don't need to store the user's phone number, or the address
of a product's manufacturer, because you aren't going to display that
information in a sidebar.

Look for opportunities to denormalize data to satisfy queries, but only
include as much data as required by the queries.

Store data you will access in a single query in a single column family

Column qualifiers in a single column family have a physical as well as a logical
relationship. In general, all of the column qualifiers in a single column family
are stored together, accessed together and cached together. As a result, a query
that accesses a single column family might execute more efficiently than a query
spanning column families.

Ensure your common queries are as efficient as possible by retrieving data
from as few column families as possible.

Don't exploit atomicity of single rows

Cloud Bigtable does not support transactions, with one exception: Operations
on a single row are transactional. Transactions are also expensive, meaning that
a system that relies on transactions will not perform as well as one that does
not.

When working with time series, do not leverage the transactional behavior of
rows. Changes to data in an existing row should be stored as a new, separate
row, not changed in the existing row. This is an easier model to construct, and
it enables you to maintain a history of activity without relying upon column
versions.

Schema design examples

Now you will apply the schema design patterns to create examples for the
following types of data:

Financial market data

Server metrics

Intelligent energy meters (Internet of Things)

Remember, these are just examples! To find the best schema for your time-series
data, you'll need to consider what data you're storing and how you plan to query
the data, then apply the design patterns from the previous section.

Financial market data

This example takes a hypothetical equity market data message that represents
information about an imaginary stock:

Field            Example data
Ticker Symbol    ZXZZT
Bid              600.55
Ask              600.60
Bid size         500
Ask size         1500
Last sale        600.58
Last size        300
Quote time       12:53:32.156
Trade time       12:53:32.045
Exchange         NASDAQ
Volume           89000

Some observations about equity market data messages before beginning:

The message aggregates quote data and trade data, which are logically
separate.

There are a relatively large number of ticker symbols (tickers): several
thousand in all.

Several hundred of these tickers will account for 90% of messages received,
because relatively few stocks are actively traded.

Messages come frequently, from hundreds to tens of thousands per second, with
an average of several thousand per second.

Typical queries will be for quote data or for trade data separately, not for
both at the same time.

For both quote data and trade data, a typical query will specify:

An exchange (such as NASDAQ)

A ticker symbol (such as ZXZZT)

A start and end time

Now you can design the table for this use case:

Keep related data in the same table, keep unrelated data in different tables

Store quote data in a table called QUOTE.

Store trade data in a table called TRADE.

Queries can include arbitrary time ranges, so you will store data from a
single message in each row.

Rows can be big, but not infinitely big

Each row stores data from a single message. This does not raise any size
concerns.

Don't exploit atomicity of single rows

Each message is self-contained, so there are no concerns.

Keep names short but meaningful

For simplicity, this example uses the fields from the message, capitalized
with spaces removed, as names.

These decisions give you the following column layout:

QUOTE table example:

Column data

MD:SYMBOL:ZXZZT

MD:BID:600.55

MD:ASK:600.60

MD:BIDSIZE:500

MD:ASKSIZE:1500

MD:QUOTETIME:1426535612156

MD:EXCHANGE:NASDAQ

TRADE table example:

Column data

MD:SYMBOL:ZXZZT

MD:LASTSALE:600.58

MD:LASTSIZE:300

MD:TRADETIME:1426535612045

MD:EXCHANGE:NASDAQ

MD:VOLUME:89000

Next, design the row key:

Use tall and narrow tables

Each row will store data from one message, resulting in a very large number
of relatively narrow rows.

Prefer rows to column versions

Use column versions only in the exceptional circumstance where a
value was incorrect.

Design your row key with your queries in mind

QUOTE and TRADE row keys can follow the same form.

Because this is a time series, you can assume by default that QUOTETIME
will be part of the row key.

To query by exchange and ticker for a given start and end time, you will
need to use the values of EXCHANGE, SYMBOL, and QUOTETIME.

Therefore, you will promote EXCHANGE (as a 6-character code; exchanges
with fewer than 6 characters will be right-padded with spaces), SYMBOL (as
a 5-character code; tickers with fewer than 5 characters will be right-padded
with spaces), and QUOTETIME (as a 13-digit numeral). By padding EXCHANGE
and SYMBOL with spaces, you ensure that each part of the row key is at a
predictable offset.

Taking these values together, the row key will be of the form EXCHANGE +
SYMBOL + QUOTETIME (for example, NASDAQ#ZXZZT#1426535612156).
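The padded key described above can be sketched with a short helper. The # separators follow this paper's readability convention; the function name is illustrative.

```python
def quote_row_key(exchange, symbol, quote_time_ms):
    # EXCHANGE padded to 6 characters, SYMBOL padded to 5, and
    # QUOTETIME as a 13-digit numeral, so every field sits at a
    # predictable offset in the key.
    return f"{exchange:<6}#{symbol:<5}#{quote_time_ms:013d}"

key = quote_row_key("NASDAQ", "ZXZZT", 1426535612156)
```

Because every key has the same length, shorter exchanges and tickers such as NYSE and IBM still produce keys whose fields align at fixed offsets.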

Ensure that your row key avoids hotspotting

Having EXCHANGE and SYMBOL in the leading positions in the row key will
naturally distribute activity.

Given that 90% of the messages are concentrated in a few hundred tickers,
there is some risk of hotspotting, but you would need to stress test the
system before making further changes. If this concentration resulted in poor
performance, you could apply salting to break up activity more effectively.

Reverse timestamps only when necessary

You will not reverse timestamps in this case, because queries do not always
require access to the latest data.

After this design exercise, you have the following tables:

QUOTE table example:

Row key                       Column data
NASDAQ#ZXZZT#1426535612156    MD:SYMBOL:ZXZZT
                              MD:BID:600.55
                              MD:ASK:600.60
                              MD:BIDSIZE:500
                              MD:ASKSIZE:1500
                              MD:QUOTETIME:1426535612156
                              MD:EXCHANGE:NASDAQ

TRADE table example:

Row key                       Column data
NASDAQ#ZXZZT#1426535612045    MD:SYMBOL:ZXZZT
                              MD:LASTSALE:600.58
                              MD:LASTSIZE:300
                              MD:TRADETIME:1426535612045
                              MD:EXCHANGE:NASDAQ
                              MD:VOLUME:89000

These tables will grow at the rate of hundreds of millions of rows per day,
which Cloud Bigtable can handle without difficulty.

Server metrics

The following example uses a hypothetical server-monitoring system that
collects a large variety of metrics (such as per-core CPU, memory, and disk
usage) from a large inventory of machines. You will go through multiple
iterations of the schema in this example.

You can make the following assumptions about the data:

Collect 100 metrics per machine.

Collect metrics from 100,000 machines.

Metrics are collected every 5 seconds.

Typical queries will be one of the following:

Metrics for a given machine for a given start and end time

The latest metrics for the entire inventory of machines

With that use case in mind, you can design the table:

Iteration 1

Keep related data in the same table, keep unrelated data in different tables

You will store metrics data in a table called METRIC.

There are several categories of metrics, and you will group these using
appropriate column families.

Queries can include arbitrary time ranges, so you will have each row store a
single set of metrics from a machine at a given time.

Rows can be big, but not infinitely big

Each row stores a single set of metrics, which causes no size concerns.

Don't exploit atomicity of single rows

You won't rely upon the atomicity of rows in your schema design.

Keep names short but meaningful

For simplicity, you will use the field names from the metrics, capitalized
with spaces removed, as column qualifier names.

These decisions give you the following column layout:

Column data
METRIC:HOSTNAME:server1.bbb.com
METRIC:CPU/CPU1_USR:0.02
METRIC:CPU/CPU1_NICE:0.00
...
METRIC:IO/BLK_READ:253453634
METRIC:IO/BLK_WRTN:657365234

Next design the row key based on the patterns:

Use tall and narrow tables

Each row in the table will store one set of metrics for one machine,
resulting in a very large number of rows.

Prefer rows to column versions

You will not use column versions.

Design your row key with your queries in mind

Because this is a time series, include the timestamp, TS, in the row key.

To retrieve metrics for a given machine for a given start and end time,
you will retrieve a range of rows using HOSTNAME plus TS.

Retrieving the latest metrics for the entire inventory of machines is
complicated. You can't simply reverse the timestamp and scan N rows, because
there is no guarantee that would pick up each machine in the inventory.

Now, you have run into a problem with designing the row key. The solution here
is denormalization. You will create a separate table that holds the latest
versions of the metrics, called CURRENT_METRIC, which you will update whenever
you update METRIC. When you update the existing metrics for a machine, you
will simply overwrite the row for that machine.
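The dual write described above can be sketched as follows. This is a hypothetical in-memory model, not the Bigtable API, and for illustration the current-metrics table here is keyed by hostname alone so that each machine has exactly one row to overwrite.

```python
metric = {}          # append-only history: one row per sample
current_metric = {}  # bounded: one row per machine, always the latest

def record_sample(hostname, ts_ms, sample):
    # Write the sample to the history table, keyed HOSTNAME#TS...
    metric[f"{hostname}#{ts_ms}"] = sample
    # ...and overwrite that machine's row in the current table.
    current_metric[hostname] = sample

record_sample("server1.bbb.com", 1426535612045, {"CPU/CPU1_USR": 0.02})
record_sample("server1.bbb.com", 1426535617045, {"CPU/CPU1_USR": 0.05})
```

Scanning the current table now touches one row per machine, which is what makes the "latest metrics for the entire inventory" query cheap.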

Next iterate on your original design:

Iteration 2

Keep related data in the same table, keep unrelated data in different tables

You will store metrics data in a table called METRIC.

You will store the latest version of the metrics in a table called
CURRENT_METRIC.

Other information stays the same as iteration 1.

Rows can be big, but not infinitely big

Stays the same as iteration 1.

Don't exploit atomicity of single rows

You will rely upon the atomicity of rows for updating each machine's data in
CURRENT_METRIC. This is a simple row mutation, with little potential for
contention, so it will not cause any issues.

Keep names short but meaningful

Stays the same as iteration 1.

Next you design the row key based on the patterns:

Use tall and narrow tables

In both tables, each row in the table will store one set of metrics for one
machine, resulting in a large number of rows.

Prefer rows to column versions

Stays the same as iteration 1.

Design your row key with your queries in mind

Because this is a time series, you will include the timestamp, TS, in the
row key. To retrieve metrics for a given machine for a given start and end
time, you will retrieve a range of rows from METRIC using HOSTNAME plus
TS.

Therefore, you will promote HOSTNAME to the row key and use a row key of
the form HOSTNAME + TS.

To find the latest metrics for the entire inventory of machines, you will do
a scan of CURRENT_METRIC.

Therefore, you do not need to promote any additional fields to the row key,
and the row key will have the simple form HOSTNAME + TS. This leads to a
simple, easily understood schema that makes it possible for tables to be
sharded effectively.

In the CURRENT_METRIC table, even though you know that it is always
storing the latest metric for each machine, for the sake of simplicity the row
key will again be HOSTNAME + TS.

For both METRIC and CURRENT_METRIC, there is no concern about
hotspotting, because having the hostname at the start of the row key will
distribute write activity across nodes.

Reverse timestamps only when necessary

You are storing data for the most recent metrics in a separate table, so
there's no need to reverse the timestamps.

After the design exercise, you have the following:

METRIC and CURRENT_METRIC table example:

Row key                          Column data
server1.bbb.com#1426535612045    METRIC:CPU/CPU1_USR:0.02
                                 METRIC:CPU/CPU1_NICE:0.00
                                 ...
                                 METRIC:IO/BLK_READ:253453634
                                 METRIC:IO/BLK_WRTN:657365234

These tables will grow at the rate of approximately 2 billion rows per day,
which Cloud Bigtable can handle without difficulty.

Intelligent energy meters (Internet of Things)

This example uses a hypothetical IoT scenario where there are intelligent energy
meters sending sensor readings periodically to a centralized system. Once again,
you will go through multiple iterations of the schema in this example.

You can make the following assumptions about the data:

There are 10,000,000 operational meters.

Each meter sends a sensor reading every 15 minutes.

The meter ID is a unique numeric ID.

Typical queries will be one of the following:

All data for a given meter for a given day

All data for a given day

Given that use case, you can design the table:

Keep related data in the same table, keep unrelated data in different tables

You will store the sensor data in a table called SENSOR.

Queries are in daily increments, so you will have each row store data for
one meter for one day.

Rows can be big, but not infinitely big

Each row will contain data for one meter for one day, for a total of 96
columns (24 hours in a day * 60 minutes in an hour / 1 reading every 15
minutes), which causes no size concerns.

Don't exploit atomicity of single rows

You will exploit the atomicity of rows because when the data is received, it
will be added to the appropriate daily row. This is a simple row mutation,
with little potential for contention, so it will not cause any issues.

Keep names short but meaningful

ID to store the meter ID (a unique integer).

A series of columns with names 0000 through 2345 to contain the 96
values recorded every 15 minutes during the day. This scheme is adopted
because it enables you to change to frequencies other than 15 minutes if
necessary.

These decisions give you the following column layout:

Iteration 1

Column data
METER:ID:987654
METER:0000:12.34
METER:0015:13.45
...
METER:2330:27.89
METER:2345:28.90

Next design the row key:

Use tall and narrow tables

Each row will store data for one day, as previously explained.

Prefer rows to column versions

Use column versions only in the exceptional circumstance where a value was
incorrect.

Design your row key with your queries in mind

Because this is a time series, include DATE in the row key. Because you
are interested only in precision to the day, the last five digits of the
timestamp will be zero and are therefore omitted.

To query a given meter for a given day, you will retrieve a single row using
METER and DATE.

To query all meters for a given day, retrieve a range of rows using DATE.

Taking the queries together, the row key will need to be of the form
DATE + METER (for example, 20170726#0000987654).

At this point, you might notice a problem: At the start of each day, all the
meters will hit a single node, because the date is in the leading position in
the row key. With 10 million meters, this is likely to be an issue that impacts
performance every single day. The solution is to find a better way to query
meter data for a specific day. If you run a single query each night and store
the results as a new table, called SENSOR_YYYYMMDD, you won't need to optimize
your row key for date-based queries.

Let's iterate to solve this problem:

Iteration 2

Keep related data in the same table, keep unrelated data in different tables

Store the sensor data in a table called SENSOR.

Queries will retrieve data in daily increments, so you will have each row
store data for one meter for one day.

You would run an overnight batched query that produces another table called
SENSOR_YYYYMMDD (the date) to store all meter data for that date.

Rows can be big, but not infinitely big

Stays the same as iteration 1.

Don't exploit atomicity of single rows

Stays the same as iteration 1.

Keep names short but meaningful

Stays the same as iteration 1.

Putting this all together, an example row would look like the following for both
the SENSOR table and the SENSOR_YYYYMMDD tables:

Column data
METER:ID:987654
METER:0000:12.34
METER:0015:13.45
...
METER:2330:27.89
METER:2345:28.90

Next, design the row key:

Use tall and narrow tables

Stays the same as iteration 1.

Prefer rows to column versions

Stays the same as iteration 1.

Design your row key with your queries in mind

Because this is a time series, include DATE in the row key.

To query a given meter for a given day, you will need to retrieve a single
row using the METER and DATE.

The row key will need to be of the form METER + DATE (for example,
"0000987654#20170726"), which will satisfy your queries and lead to a good
distribution of activity across nodes.

You will also run a batched query once per day that scans the entire
table and stores yesterday's data in a new table called SENSOR_YYYYMMDD,
where YYYYMMDD is yesterday's date.
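The METER + DATE key above can be sketched with a short helper; the 10-digit zero-padding width matches the "0000987654" example and is otherwise an assumption.

```python
def sensor_row_key(meter_id, yyyymmdd):
    # Zero-pad the numeric meter ID to 10 digits so all keys have the
    # same length and sort consistently as strings.
    return f"{meter_id:010d}#{yyyymmdd}"

key = sensor_row_key(987654, "20170726")
```

With the meter ID in the leading position, writes from the 10 million meters spread across the keyspace instead of piling onto the node holding today's date.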

Ensure that your row key avoids hotspotting

Hotspotting isn't a concern for the SENSOR table. Writes will be
distributed evenly because the row key has METER in the leading position.

Hotspotting isn't a concern for the SENSOR_YYYYMMDD tables. Each table is
constructed only once, as a batched query, where performance is a lesser
concern. However, creating these tables requires a full scan of SENSOR, so
you will want to create the SENSOR_YYYYMMDD tables when there are few
other queries of the SENSOR table.

Reverse timestamps only when necessary

In this case, you don't need to reverse timestamps.

After the design exercise, you have the following for both the SENSOR table
and the SENSOR_YYYYMMDD tables:

Row key                Column data
0000987654#20170726    METER:ID:987654
                       METER:0000:12.34
                       METER:0015:13.45
                       ...
                       METER:2330:27.89
                       METER:2345:28.90

This table will grow at the rate of a little less than 10 million rows per day,
which Cloud Bigtable can handle without difficulty.