Speaker introduction

First, about me. I am an infrastructure engineer and I am also the CEO of PingCAP. Currently, my team and I are working on two open source projects: TiDB and TiKV. Ti is short for Titanium, which is a chemical element known for its corrosion resistance and it is widely used in high-end technologies.

So today we will cover the following topics:

Why another database?

What kind of database we want to build?

How to design such a database, including the principles, the architecture, and design decisions?

How to develop such a database, including the architecture and the core technologies for TiKV and TiDB?

Why another database

Before we start, let's go back to the very beginning and ask yourself a question: Why another database. We all know that there are many databases, such as the traditional Relational database and NoSQL. So why another one?

Relational databases like MySQL, Oracle, PostgreSQL, etcetera: they are very difficult to scale. Even though we have sharding solutions, YouTube/vitess, MySQL proxy, but none of them supports distributed transactions and cross-node join.

NoSQL like HBase, MongoDB, and Cassandra: They scale well, but they don't support SQL and consistent transactions.

NewSQL, represented by Google Spanner and F1, which is as scalable as NoSQL systems and it maintains the ACID transactions. That's exactly what we need. Inspired by Spanner and F1, we are making a NewSQL database. Of course, it's open source.

How to design?

Now we have a clear picture of what kind of database we want to build, the next step is how, how to design it, how to develop it and how to test it. In the next few slides, I am going to talk about how to design TiDB.

In this section, I will introduce how we design TiDB, including the principles, the architecture and design decisions.

The principles or the philosophy

It must ensure that no data is ever lost and the system can automatically recover from machine failures or even downtime of the entire datacenters.

It should be easy to use.

It should be cross-platform and can run on any environment, no matter it's on premise, cloud or container.

As an open source project, we are dedicated to being an important part of the big community through our active engagement, contribution and collaboration.

We need TiDB to be easy to maintain so we chose the loose coupling approach. We design the database to be highly layered with a SQL layer and a Key-Value layer. If there is a bug in SQL layer, we can just update the SQL layer.

The alternatives: Although our project is inspired by Google Spanner and F1, we are different from those projects. When we design TiDB and TiKV, we have our own practices and decisions in choosing different technologies.

Disaster recovery

The first and foremost design principle is to build a database where no data is lost. To ensure the safety of the data, we found that multiple replicas are just not enough and we still need to keep Binlog in both the SQL layer and the Key-Value layer. And of course, we must make sure that we always have a backup in case the entire cluster crashes.

Easy to use

The second design principle is about the usability. After years of struggling among different workarounds and trade-offs, we are fully aware of the pain points of the users. So when it comes to us to design a database, we are going to make it easy to use and there should be no scary sharding keys, no partition, no explicit handmade local index or global index, and making scale transparent to the users.

The community and ecosystem

The next design principle is about the community and ecosystem. We want to stand on the shoulders of the giants instead of creating something new and scary. TiDB supports MySQL protocol and is compatible with most of the MySQL drivers (ODBC, JDBC) and SQL syntax, MySQL clients and ORM, and the following MySQL management tools and bench tools.

etcd

etcd is a great project. In our Key-Value store, TiKV, which I will dive deep into later, we have been working with the etcd team very closely. We share the Raft implementation, and we do code reviews on Raft module for each other.

RocksDB

RocksDB is also a great project. It's mature, fast, tunable, and widely used in very large scale production environments, especially in facebook . TiKV uses RocksDB as it's local storage. While we were testing it in our system, we found some bugs. The RocksDB team fixed those bugs very quickly.

Namazu

A few months ago, we need a tool to simulate slow, unstable disk, and the team member found Namazu. But at that time, Namazu didn't support hooking fsync. When the team member raised this request to their team, they responded immediately and implement the feature in just a few hours and they are very open to implement other features as well. We are deeply impressed by their responsiveness and their efficiency.

Spark connector

We are using the Spark connector in TiDB. TiDB is great for small or medium queries and Spark is better for complex queries with lots of data. We believe we can learn a lot from the Spark community too, and of course we would like to contribute as much as possible.

So overall, we'd like to be a part of the big open source community and would like to engage, contribute and collaborate to build great things together.

Loose coupling – the logical architecture

This diagram shows the logical architecture of the database.

As I mentioned earlier about our design principle, we are adopting the loose coupling approach. From the diagram, we can see that it is highly-layered. We have TiDB to work as the MySQL server, and TiKV to work as the distributed Key-Value layer. Inside TiDB, we have the MySQL Server layer and the SQL layer. Inside TiKV, we have transaction, MVCC, Raft, and the Local Key-Value Storage, RocksDB.

For TiKV, we are using Raft to ensure the data consistency and the horizontal scalability. Raft is a consensus algorithm that equals to Paxos in fault-tolerance and performance. Our implementation is ported from etcd, which is widely used, very well tested and highly stable. I will cover the technical details later.

From the architecture, you can also see that we don't have a distributed file system. We are using RocksDB as the local store.

Atomic clocks / GPS clocks VS TimeStamp Allocator

If you've read the Spanner paper, you might know that Spanner has TrueTime API, which uses the atomic clocks and GPS receivers to keep the time consistent between different data centers.

The first alternative technology we chose is to replace the TrueTime API with the TimeStamp Allocator. It goes without any doubt that time is important and that Real time is vital in distributed systems. But can we get real time? What about clock drift?

The sad truth is that we can't get real time precisely because of clock drift, even if we use GPS or Atomic Clocks.

In TiDB, we don't have Atomic clocks and GPS clocks. We are using the Timestamp Allocator introduced in Percolator, a paper published by Google in 2006.

The pros of using the Timestamp Allocator are its easy implementation and no dependency on any hardware. The disadvantage lies in that if there are multiple datacenters, especially if these DCs are geologically distributed, the latency is really high.

Distributed file system VS RocksDB

Spanner uses Colossus File System, the successor to the Google File System (GFS), as its distributed file system. But in TiKV, we don't depend on any distributed file system. We use RocksDB. RocksDB is an embeddable persistent key-value store for fast storage. The primary design point for RocksDB is its great performance for server workloads. It's easy for tuning Read, Write and Space Amplification. The pros lie in that it's very simple, very fast and easy to tune. However, it's not easy to work with Kubernetes properly.

Paxos VS Raft

The next choice we have made is to use the Raft consensus algorithm instead of Paxos. The key features of Raft are: Strong leader, leader election and membership changes. Our Raft implementation is ported from etcd. The pros are that it's easy to understand and implement, widely used and very well tested. As for Cons, I didn't see any real cons.

C++ VS Go & Rust

As for the programming languages, we are using Go for TiDB and Rust for TiKV. We chose Go because it's very good for fast development and concurrency, and Rust for high quality and performance. As for the Cons, there are not as many third-party libraries.

That's all about how we design TiDB. I have introduced the principles, the architecture, and design decisions about using the alternative technologies. The next step is to develop TiDB.

The architecture

MVCC, Multiversion concurrency control. I believe many of you are pretty familiar with MVCC. TiKV is a multi-versioned database. MVCC enables us to support lock-free reads and ACID transactions.

Transaction: The transaction model is inspired by Google's Percolator. It's mainly a two-phase commit protocol with some practical optimizations. This model relies on a timestamp allocator to assign monotone increasing timestamp for each transaction, so the conflicts can be detected. I will cover the details later.

KV API: it's a set of programming interfaces and allows developers to put or get data.

Placement Driver: Placement driver is a very important part, and it helps to achieve geo-replication, horizontal scalability and consistent distributed transactions. It's kind-of the brain of the cluster.

About the TiDB architecture:

MySQL clients: The top layer is a set of MySQL clients. These clients send requests to the next layer. You can still use any MySQL driver that you are already familiar with.

Load Balancer: This is an optional layer. Such as HAProxy or LVS.

TiDB Server: It's stateless, and a client may connect to any TiDB server. Within the TiDB server, the top layer is MySQL Protocol, it provides MySQL protocol support; the next layer is SQL optimizer, which is used to translate MySQL requests to TiDB SQL plan.

The bottom layer is KV API and Distributed SQL API. If the lower level storage engine supports coprocessor, TiDB SQL Layer will use DistSQL API, which is much more efficient than KV API. TiDB supports pluggable storage engines. We recommend TiKV as the default storage engine.

TiKV core technologies

TiKV software stack

Let's take a look at the software stack.

First, we can see that there is a client connecting to TiKV. We also have several nodes. And within each node, we have stores, one per physical disk. Within each store, we have many regions. Region is the basic unit of data movement and is replicated by Raft. Each region is replicated to several nodes. A Raft group consists of the replicas of one Region. And region is more like a logical concept, in a single store, many regions may share the same Rocksdb instance.

Raft

In TiKV, we use the Raft for scaling and replication. We have multiple Raft groups. Workload is distributed among multiple regions. There could be millions of regions in one big cluster. Once a region is too large, it will be split into two smaller regions, just like cell division.

MVCC

Each transaction sees a snapshot of the database at the beginning time of this transaction. Any changes made by this transaction will not be seen by other transactions until the transaction is committed.

Data is tagged with versions in the following format: Key_version: value.

Transaction

These are Transaction APIs. As a programmer, I want to write code like this:

txn := store.Begin() // start a transaction

txn.Set([]byte("key1"), []byte("value1"))

txn.Set([]byte("key2"), []byte("value2"))

err = txn.Commit() // commit transaction

if err != nil {

txn.Rollback()

}

Speak of Transaction, It's mainly a two-phase commit protocol with some practical optimizations. In the transaction model, there are 3 column families, namely, cf:lock, cf:write and cf:data.

cf: lock: An uncommitted transaction is writing this cell and contains the location/pointer of primary lock. For each transaction, we choose a primary lock to indicate the state of the transaction.

cf: write: It stores the commit timestamp of the data

cf: data: Stores the data itself

Let's see an example: If Bob wants transfer 7 dollars to Joe.

Initial state: Joe has 2 dollars in his account, Bob has 10 dollars.

The transfer transaction begins by locking Bob's account by writing the lock column. This lock is the primary for the transaction. The transaction also writes data at its start timestamp, 7.

The transaction now locks Joe's account and writes Joe's new balance. The lock is secondary for the transaction and contains a reference to the primary lock; So we can use this secondary lock to find the primary lock.

The transaction has now reached the commit point: it erases the primary lock and replaces it with a write record at a new timestamp (called the commit timestamp): 8. The write record contains a pointer to the timestamp where the data is stored. Future readers of the column "bal" in row "Bob" will see the value $3.

The transaction completes by adding write records and deleting locks at the secondary cells. In this case, there is only one secondary: Joe.

Mapping table data to Key-Value store

Let's use an example to show how a SQL table is mapped to Key-Value pairs.

If we have a simple user table in database. It has 2 rows and 3 columns: user id, name and email. And user id is the primary key.

INSERT INTO user VALUES (1, "bob", "huang@pingcap.com");

INSERT INTO user VALUES (2, "tom", "tom@pingcap.com");

If we map this table to key-value pairs, it should be put in the following way.

Of course, TiDB supports secondary index. It's a global index. TiDB puts data and index updates into the same transaction, so all the indexes in TiDB are transactional and fully consistent. And it's transparent to the users.

Indexes are just key-value pairs that the values point to the row key. After we create indexes for the user name, the key-value storage looks like this:

The key of the index consists of two parts: the name and the user id as the suffix. So here "bob" is the name, and 1 is the user id, and the value points to the row key.

Predicate push-down

For some operations like count some columns in a table, TiDB pushes down these operations to the corresponding TiKV nodes, the TiKV nodes do the computing and then TiDB merges the final results. This diagram shows the process of a simple predicate push-down.

Schema changes

This slide is about schema changes. Why online schema change is a must-have feature? It's because we need the full data availability all the time and minimal performance impact so that the ops people can have a good-night's sleep.

Something the same as Google F1

Each TiDB server has a copy of a relational schema that describes tables, columns, indexes, and constraints.

Any modification to the schema requires a distributed schema change to update all servers.

Shared data storage

All TiDB servers in all datacenters have access to all data stored in TiKV.

There is no partitioning of data among TiDB servers.

No global membership

Because TiDB servers are stateless, there is no need for TiDB to implement a global membership protocol.This means there is no reliable mechanism to determine currently running TiDB servers, and explicit global synchronization is not possible.

Schema states

Basically schema changes is a multiple state multiple phase protocol. There are two states which we consider to be non-intermediate: absent and public.

There are two internal, intermediate states: delete-only and write-only

Delete-only: A delete-only table, column, or index cannot have their key–value pairs read by user transactions and

If E is a table or column, it can be modified only by the delete operations.

If E is an index, it is modified only by the delete and update operations. Moreover, the update operations can delete key–value pairs corresponding to updated index keys, but they cannot create any new one.

For the write-only state, it is defined for columns and indexes as follows:

A write-only column or index can have their key–value pairs modified by the insert, delete, and update operations, but none of their pairs can be read by user transactions.

TiDB: status of Adding index (delete-only)

Here is one of the screenshots for adding an index.

We can use any MySQL client to query the status of the online DDL job. Just simply run the "show status" statement and we can see that the current state is "delete-only" as I highlighted and that the action is "add index". There is some other information such as who is doing the DDL job, the state of the current job and the current schema version.