A deep look at the CQL WHERE clause

While their syntaxes look similar, there are many differences between CQL and SQL. These differences stem mainly from the fact that Cassandra deals with distributed data and aims to prevent inefficient queries.

One of the places where CQL differs a lot from SQL is the WHERE clause. The goal of this post is to describe what is supported by the CQL WHERE clause and the reasons why it differs from normal SQL.

Primary key columns

In Cassandra, two types of columns have a special role: the partition key columns and the clustering columns. Together, they define your row's primary key.

The partition key columns are the first part of the primary key and their role is to spread data evenly around the cluster. Rows are distributed around the cluster based on the hash of the partition key.

The clustering columns are used to cluster the data of a partition, allowing a very efficient retrieval of rows.

Due to the differences in the roles that they play, partition key, clustering and regular columns support different sets of restrictions within the WHERE clause. Furthermore, those sets of restrictions differ depending on the type of query: SELECT, UPDATE or DELETE.

WHERE clause restrictions for SELECT statements

Partition keys restrictions

The partition key columns support only two operators: = and IN.

IN restriction

Prior to 2.2, IN restrictions could only be applied to the last column of the partition key. So, for example, if your table was:
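For illustration, assume a table along these lines (the column names are taken from the queries in this post; the types and exact layout are assumptions):

CREATE TABLE numberOfRequests (
    cluster text,
    date text,
    time text,
    numberOfRequests int,
    PRIMARY KEY ((cluster, date), time)
);

Here cluster and date form the partition key and time is a clustering column; later examples also assume additional clustering columns such as datacenter and server.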

In 2.1, you could only use an IN operator on the date column. In 2.2, you can use the IN operator on any partition key column. Consequently, a query like:

SELECT * FROM numberOfRequests
WHERE cluster IN ('cluster1', 'cluster2')
AND date = '2015-05-06'
AND time >= '12:00'
AND time <= '14:00';

is valid since 2.2 but invalid in prior releases.

This change makes CQL more consistent, but you should nevertheless be careful with IN restrictions on partition key columns. Ryan Svihla's excellent post will give you a clear explanation of why you should try to avoid them.

Another change, introduced in 2.2, is that the results are no longer returned in the order in which the partition keys were specified in the IN clause. From 2.2 onward, the results are returned in the natural order of the column type and duplicates are ignored.

Unrestricted partition key columns

Cassandra requires that you restrict either all the partition key columns or none of them, unless the query can use a secondary index.

This means that a query like:

SELECT * FROM numberOfRequests WHERE cluster='cluster1' AND time ='12:00';

will be rejected as the date column is not restricted.

The reason is that Cassandra needs all the partition key columns to compute the hash that allows it to locate the nodes containing the partition.

If no restrictions are specified on the partition keys but some are specified on the clustering keys, Cassandra will require ALLOW FILTERING to be added to the query. For more information on ALLOW FILTERING you should look at ALLOW FILTERING explained.
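For example, a query that restricts only a clustering column (assuming time is a clustering column of the numberOfRequests table) must be flagged explicitly:

SELECT * FROM numberOfRequests
WHERE time = '12:00'
ALLOW FILTERING;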

>, >=, <= and < operators

Cassandra distributes the partitions across the nodes using the selected partitioner. As only the ByteOrderedPartitioner keeps an ordered distribution of data, Cassandra does not support the >, >=, <= and < operators directly on the partition key.

Instead, it allows you to use the >, >=, <= and < operators on the partition key through the token function.
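For example, assuming the (cluster, date) partition key used in the earlier examples, a range query over partition token values could look like this:

SELECT * FROM numberOfRequests
WHERE token(cluster, date) > token('cluster1', '2015-06-03')
AND token(cluster, date) <= token('cluster1', '2015-06-05');

Note that with the default Murmur3Partitioner, this selects a range of hashes, not a range of cluster/date values.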

If you use a ByteOrderedPartitioner, you will then be able to perform some range queries over multiple partitions. You should nevertheless be careful: using a ByteOrderedPartitioner is not recommended, as it can result in unbalanced clusters.

Prior to 2.2, multi-column IN restrictions could only be applied to the last set of clustering columns being restricted; 2.2 lifted that limitation.

You should nevertheless be careful with filtering. It can be an expensive operation.
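As an illustration, a multi-column IN restriction on the clustering columns server and time (assumed here to be the last clustering columns, as in the slice examples below) could look like:

SELECT * FROM numberOfRequests
WHERE cluster = 'cluster1'
AND date = '2015-06-05'
AND datacenter = 'US_WEST_COAST'
AND (server, time) IN (('196.8.0.0', '12:00'), ('196.8.0.1', '14:00'));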

Partition key restrictions and Secondary indices

When Cassandra must perform a secondary index query, it will contact all the nodes to check the part of the secondary index located on each node. If all the partition key components are restricted, Cassandra will use that information to query only the nodes that contain the specified partition keys, which makes the query more efficient.

For secondary index queries, only = restrictions are supported on partition key columns.

Clustering column restrictions and Secondary indices

For each indexed value, Cassandra stores the full primary key (partition key columns + clustering columns) of each row containing the value. When an index query is performed, Cassandra will retrieve the primary keys of the rows containing the value from the index. It will then retrieve the rows from the table and perform any filtering needed on them.

If the first clustering columns have been restricted, Cassandra will perform an early filtering on the primary keys returned by the index, making the filtering more efficient.

For that type of filtering, Cassandra will accept the following clustering column restrictions: =, IN, >, >=, <= and <.

So, if we add the following secondary index to the numberOfRequests table:
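The choice of indexed column here is purely illustrative; suppose, hypothetically, that the index is on the numberOfRequests value column:

CREATE INDEX ON numberOfRequests (numberOfRequests);

-- A query combining the indexed value with a clustering column slice.
-- ALLOW FILTERING is included defensively; depending on the version it
-- may be required for the extra clustering restriction.
SELECT * FROM numberOfRequests
WHERE cluster = 'cluster1'
AND date = '2015-06-05'
AND time >= '12:00'
AND numberOfRequests = 42
ALLOW FILTERING;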

Multi-column slice restrictions are allowed on the last set of clustering columns being restricted.

SELECT * FROM numberOfRequests
WHERE cluster = 'cluster1'
AND date = '2015-06-05'
AND datacenter = 'US_WEST_COAST'
AND (server, time) >= ('196.8.0.0', '12:00') AND (server, time) <= ('196.8.255.255', '14:00');

If both sides of the slice are specified, the restrictions must start with the same column. Therefore the query:

SELECT * FROM numberOfRequests
WHERE cluster = 'cluster1'
AND date = '2015-06-05'
AND datacenter = 'US_WEST_COAST'
AND (server, time) >= ('196.8.0.0', '12:00') AND (server) <= ('196.8.255.255');

is valid.

Reading them, most people will expect results where time is after '12:00' (and before '14:00' in the former).

The actual results, however, are all values between ('196.8.0.0', '12:00') and ('196.8.255.255') inclusive.

That is, all rows from '196.8.0.1' to '196.8.255.254' are returned regardless of their time value. Only the first '196.8.0.0' row is restricted to time values from '12:00' onward, and only the last '196.8.255.255' row is restricted to time values up to '14:00'.

These two statements would have been more intuitive against a schema like:

CREATE TABLE numberOfRequests (
    cluster text,
    date text,
    datacenter text,
    day int,
    hour int,
    numberOfRequests int,
    PRIMARY KEY ((cluster, date), datacenter, day, hour)
);

cqlsh:apprepo> select * from users3 where userid=112 AND count >100 AND count < 200 AND last_name='ZZTop2';
Bad Request: PRIMARY KEY column "last_name" cannot be restricted (preceding column "count" is either not restricted or by a non-EQ relation)

For the benefit of beginners coming from an SQL background, I think it is a good idea to include a line saying: "in CQL, only primary key columns and secondary index columns can be restricted" before diving into the WHERE clause restrictions.

Another question. As far as I understand, the token() function returns the hash value of the keys, so comparing `token(cluster, date) > token('cluster1', '2015-06-03')` does not necessarily compare the values of `cluster` and/or `date`. In other words, the token() function does not preserve the ordering.

"SELECT * FROM numberOfRequests
WHERE cluster = 'cluster1'
AND date = '2015-06-05'
AND hour = 14
AND minute = 0;

Cassandra will reject the query as it has to scan the entire partition to find the requested data, which is inefficient."
This doesn't tell the complete picture. A query restricting only some of the clustering columns will work, but they must be specified in order.
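For instance, assuming clustering columns (datacenter, hour, minute), restricting a prefix of them in order is accepted:

SELECT * FROM numberOfRequests
WHERE cluster = 'cluster1'
AND date = '2015-06-05'
AND datacenter = 'US_WEST_COAST'
AND hour = 14;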

You are right. Prior to 3.6, you could have unrestricted clustering columns, but a clustering column could not be restricted if one of the previous ones was not.
Since 3.6, it is allowed if filtering is used.

The multi-column slice feature can be confusing, but the returned results should be correct. If you have found an issue, do not hesitate to open a JIRA ticket on the ASF Cassandra project. I will look into it.

One possible approach is to use filtering on your second column. Unfortunately, filtering on clustering columns is only supported since 3.6. So if your version is older than that but is a 3.0.x version, you can solve your problem by converting your second clustering column into a regular one, in which case the following query will give you the results that you are expecting:

SELECT * FROM foo WHERE part_key='B' AND start_ts = 7 ALLOW FILTERING;

Now, I do not know your data and its cardinality so filtering might not be a good option.

Another option would be to change your data model. There are different ways to model the data so that a query fulfills your needs in an efficient way.

If you always query your data with those 2 columns, they should be your partition key. If you sometimes query your data using only the first column, then it should be your partition key and the other one a clustering column.
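The two options sketched in CQL (table and column names are hypothetical):

-- Always queried by both columns: make them a composite partition key.
CREATE TABLE lookup_by_both (
    col1 text,
    col2 text,
    value int,
    PRIMARY KEY ((col1, col2))
);

-- Sometimes queried by col1 alone: partition on col1, cluster on col2.
CREATE TABLE lookup_by_first (
    col1 text,
    col2 text,
    value int,
    PRIMARY KEY (col1, col2)
);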

Since 3.10, filtering is supported on partition key columns, so the following query would work:

SELECT * FROM numberOfRequests
WHERE date = '2015-06-05'
AND datacenter = 'US_WEST_COAST'
AND hour = 12
AND minute >= 0 AND minute <= 30
ALLOW FILTERING;

But it is the type of query that I would avoid, as it will be pretty inefficient. In my opinion, you should probably change your model to have the date column (maybe together with your datacenter) as your partition key.

I have a table with 36 int columns that form a combination, and 2 counter columns for analytics.

1. What if I combine the 36 int columns into one text value and use it as the partition key? It would have 320 characters. How would Cassandra perform in this case?
2. What if I keep the 36 int columns as 36 partition key columns?