Menu

Virtual Columns and Effective Functional Indexes in InnoDB

In April, I wrote a blog post introducing the new InnoDB virtual column and effective functional index work for the JSON lab release. Now the feature is officially in MySQL 5.7.8. It’s worth revisiting the topic again in order to write about what is in the 5.7 Release Candidate, along with providing some additional examples. I also ran some quick performance tests comparing some simple operations on a table with and without virtual columns and indexes, comparing them with those using materializedgenerated columns.

Key Features

The earlier article already described the design in detail. If you have not already read it, I would recommend that you check it out first. I won’t go into all of the design detail again here, but I will restate some key points:

The virtual columns are no longer materialized in the table. They are truly “virtual”, which means they are NOT stored in InnoDB rows (InnoDB stores the table data in its primary key records—a clustered index—so it means that virtual columns will not be present in InnoDB primary key records), thus decreasing the overall size of the table. This then allows for faster table scans and other large operations.

Since they are truly “virtual”, adding and dropping virtual columns does not require a table rebuild. Rather, it only requires a quick system table update that registers the new metadata. This makes the schema changes simple and fast.

Creating an index on a virtual column (only secondary indexes are allowed) will essentially “materialize” the virtual column in the index records as the computed values are then stored in the secondary index, but the values are not stored in the primary key (clustered index). So the table itself is still small, and you can quickly look up the computed (and stored) virtual column values in the secondary index.

Once an index is created on a virtual column, the value for such a virtual column is MVCC logged so as to avoid unnecessary re-computation of the generated column value later when we have to perform a rollback or purge operation. However, since its purpose is in maintaining the secondary index only, we will only log only up to a certain limited length of the data in order to save space, since our index has a key length limitation of 767 bytes for COMPACT and REDUNDANT for formats, and 3072 bytes for DYNAMIC and COMPRESSED row formats.

Changes Since the Lab Release

There are a few noteworthy and useful changes/additions since the initial Lab release:

A single “functional index” can now be created on a combination of both virtual columns and non-virtual generated columns. That is, you can create a composite index on a mix of virtual and non-virtual generated columns.

Users can create functional indexes ONLINEusing the in-place algorithm so that DML statements can still be processed while the index is being created. In order to achieve that, the virtual column values used within the concurrent DML statements are computed and logged while the index is being created, and later replayed on the functional index.

Users can create virtual columns based on other virtual columns, and then index them.

The next improvement is less visible to user, but still worth mentioning. It is about enhancements to the purge related activities on indexed virtual columns. A new callback (WL#8841) provides a server layer function that can be called by InnoDB purge threads to compute virtual column index values. Generally this computation is done from connection threads (or sessions), however, since internal InnoDB purge threads do not correspond to connections/sessions and thus don’t have THDs or access to TABLE objects, this work was necessary in order to provide a server layer callback which enables the purge threads to make the necessary computations.

Limitations

There are still some notable restrictions around the “functional indexes”, some of which will be lifted by later work:

Primary Keys cannot be added on virtual columns.

You cannot create a spatial or fulltext index on virtual columns (this limitation will eventually be lifted).

Adding and dropping of virtual columns can be done in-place or online only when done as single operations by themselves, and not when combined with other table alterations. This limit will be removed later, but you can always go around it by keeping the in-place ADD/DROP virtual columns operations contained within a separate DDL statement.

A Few More Examples

In the previous blog post we have an example on how to use a “functional index” in conjunction with some JSON functions. Users can essentially use any functions for virtual columns, except for those that are non-deterministic (such as NOW()). So let’s next walk through some additional examples using some non-JSON functions:

Some Quick Performance Benchmarks

As expected, there will be some additional write cost when using an index on a virtual column due to the necessary computation of the virtual columns when they need to be materialized (e.g. INSERT or UPDATE), in other words the costs will be associated with creating and maintaining the index. If the column does not have a functional index, however, then the cost will instead be associated with reads as the value will need to be materialized any time that the row is examined. The added cost is also directly related to the complexity of the computation functions used.

However, even with such additional costs, using virtual columns and “functional indexes” can still be far better than creating the table with such STOREDgenerated columns as the latter materialize the data in the clustered index (primary key), thus resulting in a larger table (both on disk and in memory).

These quick tests are conducted on 3 types of table:

A table with virtual columns:

MySQL

1

2

3

4

5

6

7

8

9

10

11

CREATETABLE`t`(

`h`INTNOT NULLPRIMARY KEY,

`a`varchar(30),

`b`BLOB,

`v_a_b`BLOBGENERATEDALWAYSAS(CONCAT(a,b))VIRTUAL,

`v_b`BLOBGENERATEDALWAYSAS(b)VIRTUAL,

`e`int,

`v_h_e`INT(11)GENERATEDALWAYSAS(h+e)VIRTUAL,

`v_e`INTGENERATEDALWAYSAS(e)VIRTUAL,

`v_a`INTGENERATEDALWAYSAS(char_length(a))VIRTUAL

)ENGINE=InnoDB;

This table has 4 normal columns and 5 VIRTUAL columns. We make the computation function used very simple so as to minimize the impact from the function itself.

A “normal” table without a any generated columns at all (neither VIRTUAL or STORED):

MySQL

1

2

3

4

5

6

CREATETABLE`t_nv`(

`h`INTNOT NULLPRIMARY KEY,

`a`VARCHAR(30),

`b`BLOB,

`e`INT

)ENGINE=InnoDB;

A table with materialized or STORED generated columns:

MySQL

1

2

3

4

5

6

7

8

9

10

11

CREATETABLE`t_m`(

`h`INTNOT NULLPRIMARY KEY,

`a`varchar(30),

`b`BLOB,

`v_a_b`BLOBGENERATEDALWAYSAS(CONCAT(a,b))STORED,

`v_b`BLOBGENERATEDALWAYSAS(b)STORED,

`e`int,

`v_h_e`INT(11)GENERATEDALWAYSAS(h+e)STORED,

`v_e`INTGENERATEDALWAYSAS(e)STORED,

`v_a`INTGENERATEDALWAYSAS(char_length(a))STORED

)ENGINE=InnoDB;

We then use the following procedure to INSERT rows into each of these tables:

All tests were conducted on a 48 core x86_64 GNU/Linux machine, with a 10GB InnoDB buffer pool. All tests were run with a single thread. Each number comes from averages of over 3 runs. Here are the results:

Insertion without an index:

Insert of 500,000 row

Insert of 1,000,000 row

Table 1. with virtual column (t)

3 min 24.65 sec

6 min 59.91 sec

Table 2. without virtual column (t_nv)

3 min 21.41 sec

6 min 31.82 sec

Table 3. with materialized column

4 min 25.58 sec

8 min 43.66 sec

So for insertion into tables without any secondary index, the time is very similar for a table without any generated columns at all and the one with VIRTUAL columns. The latter does not have those columns materialized, so the amount of data inserted is exactly the same for the first two tables. However, if the columns are materialized/stored (as in table 3), the time will take substantially longer.

One thing to note is that even though the time taken for table 1 and table 2 are very similar, table 1 insertion still takes slightly longer. This is due an issue that some unnecessary computation is still done for a table with virtual columns, this will be fixed soon.

Creating an index

Create index on similar columns for 3 different tables:

Table with 1,000,000 row

Create index on t(v_e)

2.90 sec

Create index on t_nv(e)

2.40 sec

Create index on t_m(v_e) (Table 3. with materialized column )

3.31 sec

Create the index on table 1’s virtual column v_e is a bit more expensive than its base column in table 2, since indexing virtual column(s) requires computing the value. However, in this case, it is still faster than creating an index on the same column with a STOREDgenerated column, simply because the table with generated columns is so much bigger, a simple scan takes more time.

A few more runs of CREATE INDEX on the table t, just to show the scale of the costs when adding indexes on a virtual column.

Time to create index on 1,000,000 rows on table 1 (table t with virtual columns)

Create index on column v_e

2.90 sec

Create index on column e

2.47 sec

Create index on column v_b(3)

3.26 sec

Create index on column b(3)

2.67 sec

Create index on column v_h_e

2.97 sec

Create index on column v_a

3.06 sec

Create index on column v_a_b(10)

4.19 sec

As mentioned, creating an index on virtual columns are a bit more costly than creating an index on normal columns, since the computation needs to be performed on each row.

Adding a new column

ALTER TABLE ... ADD COLUMN would usually require a full table rebuild for normal or STORED generated columns. But if you add a virtual column, it is not required and thus will be almost instant.

ALTER TABLE ADD COLUMN on table with 1million rows

alter table t_nv add column col1 int;

1 min 20.50 sec

alter table t_nv add column col2 int GENERATED ALWAYS AS (e) stored;

1 min 32.40 sec

alter table t_nv add column col3 int GENERATED ALWAYS AS (e) virtual;

0.10 sec

So if you add a virtual column and then materialize it via CREATE INDEX, it will only take a few seconds (2 to 3 sec for creating the index according to previous experiment). If you do that for a normal column or generated column, it will take 50x to 60x more time (mostly spent in rebuilding the table).

Dropping a column

Similarly, dropping a virtual column is far faster for the same reasons.

ALTER TABLE DROP COLUMN on table with 1million rows

alter table t_nv drop column col1;

47.02 sec

alter table t_nv drop column col2; (A GENERATED column)

50.41 sec

alter table t_nv add column col3; (A virtual column)

0.10 sec

DMLs with a virtual index or “functional index”:

INSERT

Insert of 500,000 rows

Table 1 with functional index on column v_e

6 min 57.31 sec

Table 2 with index on column e

6 min 33.09 sec

Table 3 with index on column v_e

9 min 5.24 sec

As shown in this example, for a table with indexed virtual columns, ts insertion times are significantly less than table 3, which materializes the value in the clustered index (primary key).

UPDATE

The following UPDATE statements were then performed on the 3 tables, and results were:

mysql> update t set e=e+1;

Update time on table with 1,000,000 rows

Update on table 1 with index on virtual column v_e

1 min 20.39 sec

Update on table 2 with index on column e

52.26 sec

Update on table 3 with index on materialized column v_e

1 min 2.52 sec

As you can see, UPDATE statements on indexed virtual columns are more expensive. This demonstrates the additional MVCC costs associated with the operation (in addition to any operation associated with column e) because 1) The old value for v_e needs to be computed (for the UNDO log) and 2) The old and new values for v_e will need to be UNDO logged.

DELETE

Delete of 1,000,000 rows

Delete all row with index on virtual column ‘v_e’

21.52 sec

Delete all row with index on ‘e’

20.54 sec

Delete on table 3 with index on materialized column v_e

32.09 sec

The DELETE statements were faster on table with VIRTUALgenerated columns than those for table with the STOREDgenerated columns. The table with virtual columns is apparently much smaller than that with materialized column. So the deletion operation is much faster.

The DELETE operation will also require a little extra MVCC work if there are indexes on virtual columns because 1) The old value for v_e needs to be computed (for the UNDO log) and 2) The old and new values for v_e will need to be UNDO logged.

So the DELETE statement is a little bit more expensive than when using a regular column, but much faster than those with STOREDgenerated columns.

SELECT Queries:

Of course, as expected, the table with STOREDgenerated columns is much larger than the one with VIRTUALgenerated columns. And this is clearly shown with a quick table scan (after the initial run to bring the data into the buffer pool)

While the table with virtual columns and indexes remains small, it still takes advantage of having a materialized (secondary) index to facilitate efficient queries:

Query on char_length(a) on table with 1,000,000 rows

Table 1 with virtual column and index on ‘char_length(a)’

0.00 sec

Table 2 without index on ‘char_length(a)’

0.66 sec

Table 3 with stored column and index on ‘char_length(a)’

0.00 sec

Without the “functional index” on the ‘char_length(a)‘ value, table 2 requires a full table scan to get the results.

Summary

The virtual column and functional index work is now officially in 5.7! This is a great feature that allows users to ADD and DROP VIRTUALgenerated columns, along with adding optional secondary indexes on those columns, all done using ONLINE operations. As shown in the simple performance study above, the data materialized in such a way keeps the base table small (as it does not have duplicate copies in the InnoDB clustered/primary index) and thus making more efficient use of persistent storage space and memory, while at the same time providing vastly improved query performance!

Thanks for benchmarking! Please, confirm if I understood correctly, STORED keyword is less or more useless and in any possible case it is better to stick with VIRTUAL generated column? I just trying to figure out what to use in what situation.

The short answer is that there are some cases where STORED is required, but the longer more detailed answer is that you may also want STORED for the hypothetical case that the generated column definition is computationally heavy and you need to scan through a lot of data.