Table/Key Optimizations

Video

In this unit, you will learn about table/key optimizations. Most likely the majority of the optimizations we can make to our data model involve making changes to our table's schema, particular to the primary key.

Transcript:

Probably the majority of the optimizations we are able to do with our data model involve making changes to our table schemas, particular with the primary key of our table. It's no wonder, considering the primary key determines how data gets distributed to different nodes across the cluster, and helps to determine how large partitions will grow over time. In this video, we'll be taking a look at some of the optimization techniques that can be used to help manage performance issues you may have with your tables and how they can be modified for better performance and efficiency.

When thinking back to the keys used in traditional relational databases, the whole purpose of the primary key is to maintain uniqueness for each row that is inserted into the table, while minimizing the number of columns needed to form the primary key. In our example below, both video tables here work perfectly fine for describing a unique row. However the table on the left includes several other columns that are not necessary for the purposes of uniqueness.

With a primary key in Apache Cassandra, there are several parts to the key that fulfill different purposes. It is divided between the partition key, which makes some number of columns in the primary key, and the clustering columns which make up the rest of the columns in the primary key. The partition key is what defines the partitions that be holding up to certain CQL rows, and also determines where the partition will be stored in the cluster. Clustering columns also have some useful attributes as well, namely being able to determine the ordering of CQL rows stored within a cluster, and makes searching on those clustering columns slightly easier. For our latest_videos column here, we are interested in having rows ordered by the timestamp when a video is uploaded, but in descending order. Of course, the primary key still needs be able to maintain uniqueness using a minimum number of columns, which is why the video_id is included in the primary key, although it's placed at the very end.

When considering what keys to use to make up your primary key, you usually decide between two different types - natural keys and surrogate keys. Natural keys represent attributes that already exist within your data. When thinking about a users table, you can define a unique user using their email address, or maybe a combination of their first and last name and date of birth, or if you're with us here in the United States, you could make use of each person's unique social security number or taxpayer number.

In comparison, surrogate keys are keys that are generated for the sole purpose of establishing uniqueness for a row. This can be something like an autoincrementing integer or sequence that you may be familiar with from relational databases, or something like a UUID that would be more commonly used in Cassandra. Most importantly, there shouldn't be any particular relationship with the data that its representing.

The nice things about surrogate keys is that they are intended to be completely unique. This means that you should never generate the same key twice, and therefore wouldn't have to worry about the possibility of overwriting an existing row that has the same primary key. Surrogate keys are also immutable and do not change over time. Whereas it is possible that attributes used for natural keys change, and therefore changes the key itself, a surrogate key is generated specifically for a row, and will always be the same value for that row. Because there is some mechanism to generate a surrogate key, they also tend to be of uniform size and performance.

Let's now take a look at some different techniques that can be used to optimize a table. There are several characteristics we can optimize for, such as the speed of reading partitions, or managing the size of the partitions or even removing some of the complexity of maintaining the data in the table in comparison to other tables.

The first technique we'll take a look at is splitting a table partition. Having partitions grow too large is a frequently encountered problem which may require some adjustment to how the partition, or basically the partition key is defined. A typical example might be the case of a table that tracks actions performed by individual users. It may be possible that a highly active user may have more than 1000 different video interactions per day, in the form of of playing videos, pausing videos, skipping ahead in the video, and so forth. Our table that we define here is the video_interactions_by_user table, which uses the user id as the primary key and the event timestamp and video_id as the clustering columns. This means that each partition represents a user and each CQL row represents a unique interaction for that user. With so many interactions done by this highly active user, the size of this partition can easily exceed the recommended guidelines for the partition size within two months. What do we want to do if we don't want operations that touch this partition to be impacted? Well we can see if it's possible to split the partition by redefining our partition key.

The general strategy is to find an existing column that you can use to logically split the partition into multiple pieces. Sometimes this may be an existing clustering column or a regular column that you have in the table.

If it doesn't seem like there are any viable choices, it's possible that you will need to create your own artificial column specifically for splitting a partition. A bit later we'll take a look at an example of creating an artificial column for partition splitting.

Continuing on with our example of the highly active user and his or her's large partition, let's say that we want to split the partition using an existing column. Just looking at the primary key, let's see if there are any possible choices here. If using event_timestamp, that would mean that the partition will store all user interactions that occurred at that specific point of time. This may not make too much sense though, since it's is unlikely to have more one interaction occurring within the same second.

The next clustering column, video_id, just so happens to be a good candidate. Instead of defining each partition to store all of the interactions for all videos that a user has done, it will instead represent all interactions by that the user for a specific video.

All we need to do now to make this happens is to move the video_id column in the primary key, so that it becomes part of the partition key. In this way you optimize the table to create smaller partitions while still satisfying the requirements of the query.

Let's now take a look at using an artificial column to split the partition. Here we need to come up with some sort of column that can split up the partiton. Although we have previously ruled out the event_timestamp, what if we were able to make that value less granular? For example, we can split the partition by the date of the event, rather than the exact timestamp. This way all interactions for a user on a particular date would be stored in one partition. Even taking into account our highly active user, this should still make the partition quite manageable. Therefore we create a new column called event_date, that will be part of the partition key. The application will then need to handle populating that column using the value from the event_timestamp.

It is also possible to split a partition using a bucket column, where you and the application can manually control how much a partition is split. You basically control the max bucket value, and the application will then need to ensure that each split partition is uniform. For example, maybe we want to store 1000 interactions in each partition. As a user has more and more interactions, it'll fill up the first partition, which would have a bucket value of zero, and then proceed to write to the next partition with a bucket value of 1, and so forth.

Another technique that we'll talk about here is a way to split up the table. Instead of storing all of the data into a single table, it may make sense to split up the table, basically the columns in the table, into separate tables. This will allow some queries to perform faster, if they do not necessarily need to read all of the columns in the original table. Table partitions will also be smaller when split. Finally because the partitions are smaller it may allow more of them to be cached in memory, and thus making the cache more useful.

In this example, users will query for video metadata more often than watching the video. We split the video streaming data to its own table so our application does not unnecessarily pull it when displaying just the title, tags, etc.

Depending on your queries, you may need to adjust the primary key when you split a table.

Be careful to ensure that you don't lose or orphan records in either table when performing the split. For example, if we are not careful in our application, we could store Interstellar's metadata but fail to also store its streaming blob (or vice-versa).

Also ensure that you don't require client-side joins when splitting a table. For example, splitting the streaming blob here seems like a good choice, but if we never allowed users to view video metadata without watching the video in the same action, we forced our application to do two queries now instead of one.

If you can split tables and partitions, you can also do the reverse, which is merging them together. This may be helpful in eliminating duplication, which reduces complexity, but may come at the expense of having queries run slower.

When merging partitions, the general strategy is to introduce a new partition key and nest objects in the new partition. The new key may can use either existing columns or a new artificial column. In our example below, we have a table users which defines each partition as a single row, or a single user. Our goal is to merge users together so that there are multiple users in a single partition. There are many different ways this can be done, with three possible tables shown below. One merges users based on their first name and last name, using a new name column as the partition key. Another merges users based on a related attribute called symbol. The final one uses an arbitrary bucket column to merge users together. In this case, what do you think would be the best choice? The answer of course, is that it depends. Each one has certain advantages and disadvantages, so it really is up to you to decide what works best.

Finally let's take a look at adding new columns to a table. This is pretty straightforward since you just need to alter the current table to include your new columns, so long as the new columns do not need to be in the primary key. If you take a look at our example, this is a scenario we've seen in a different video where we want to compute and save the avg_rating of a video. We have the original videos table as well as a ratings_by_video table with the count and sum of the ratings for each video. All we need to do then is to alter the table to include our new avg_rating column so that the calculated average rating can be updated for the video.

This is quite a lot of information to absorb, so why don't we try some applications of the optimization techniques that we've covered in this video? There's an exercise you can do right now on table optimizations. Give that a try, and once you've finished you can come back and listen to be in my next video.