Tuesday, October 19, 2010

UPDATE: Oracle officially released memcached daemon plugin that talks with InnoDB. I'm glad to see that NoSQL+MySQL has become an official solution. It's still preview release but will be very promising. Let's try it to make it better!

Most of high scale web applications use MySQL + memcached. Many of them use also NoSQL like TokyoCabinet/Tyrant. In some cases people have dropped MySQL and have shifted to NoSQL. One of the biggest reasons for such a movement is that it is said that NoSQL performs better than MySQL for simple access patterns such as primary key lookups. Most of queries from web applications are simple so this seems like a reasonable decision. Like many other high scale web sites, we at DeNA(*) had similar issues for years. But we reached a different conclusion. We are using "only MySQL". We still use memcached for front-end caching (i.e. preprocessed HTML, count/summary info), but we do not use memcached for caching rows. We do not use NoSQL, either. Why? Because we could get much better performance from MySQL than from other NoSQL products. In our benchmarks, we could get 750,000+ qps on a commodity MySQL/InnoDB 5.1 server from remote web clients. We also have got excellent performance on production environments. Maybe you can't believe the numbers, but this is a real story. In this long blog post, I'd like to share our experiences.(*) For those who do not know.. I left Oracle in August 2010. Now I work at DeNA, one of the largest social game platform providers in Japan.

Is SQL really good for fast PK lookups?

How many times do you need to run PK lookups per second? Our applications at DeNA need to execute lots of PK lookups, such as fetching user info by user id, fetching diary info by diary id. memcached and NoSQL certainly fit very well for such requirements. When you run simple multi-threaded "memcached get" benchmarks, you can probably execute 400,000+ get operations per second, even though memcached clients are located on remote servers. When I tested with the latest libmemcached and memcached, I could get 420,000 get per sec on a 2.5GHz x 8 core Nehalem box with a quad-port Broadcom Gigabit Ethernet card.

How frequently can MySQL execute PK lookups? Benchmarking is easy. Just run concurrent queries from sysbench, super-smack, mysqlslap, etc.

MYSQLparse() and MYSQLlex() are called during SQL parsing phase. make_join_statistics() and JOIN::optimize() are called during query optimization phase. These are "SQL" overhead. It's obvious that performance drops were caused by mostly SQL layer, not by "InnoDB(storage)" layer. MySQL has to do a lot of things like below while memcached/NoSQL do not neeed to do.

MySQL also has to do lots of concurrency controls. For example, fcntl() are called lots of times during sending/receiving network packets. Global mutexes such as LOCK_open, LOCK_thread_count are taken/relesed very frequently. That's why my_pthread_fastmutex_lock() were ranked #2 in the oprofile output and %system were not small. Both MySQL development team and external community are aware of concurrency issues. Some issues have already been solved in 5.5. I'm glad to see that lots of fixes have been done so far. But it is also important that %user reached 60%. Mutex contentions result in %system increase, not %user increase. Even though all mutex issues inside MySQL are fixed, we can not expect 300,000 queries per second. You may be heard about HANDLER statement. Unfortunately HANDLER statement was not so much helpful to improve throughput because query parsing, opening/closing tables still be needed.

CPU efficiency is important for in-memory workloads

If little active data fit in memory, SQL overheads become relatively negligible. This is simply because disk i/o costs are much much higher. We do not need to care so much about SQL costs in this case. But, on some of our hot MySQL servers, almost all data fit in memory and they became completely CPU bound. Profiling results were similar to what I described above: SQL layer spent most of resources. We needed to execute lots of primary key lookups(i.e. SELECT x FROM t WHERE id=?) or limited range scans. Even though 70-80% of queries were simple PK lookups from the same table (difference was just values in WHERE), every time MySQL had to parse/open/lock/unlock/close, which seemed not efficient for us.

Have you heard about NDBAPI?

Is there any good solution to reduce CPU resources/contentions around SQL layer in MySQL? If you are using MySQL Cluster, NDBAPI would be the best solution. When I worked at MySQL/Sun/Oracle as a consultant, I had seen lots of customers who were dissapointed at SQL Node + NDB performance, then became happy after they could get N times bettern performance by writing NDBAPI clients. You can use both NDBAPI and SQL in MySQL Cluster. It's recommended using NDBAPI for frequent access patterns, and using SQL + MySQL + NDB for ad-hoc or infrequent query patterns. This was what we wanted. We wanted faster access APIs, but we also wanted to use SQL for ad-hoc or complex queries. But DeNA is using InnoDB, like many other web services. Switching to NDB is not trivial. Embedded InnoDB does neither support SQL nor network interface so it's not an option for us.

We thought that the best approach was implementing a NoSQL network server inside MySQL. That is, writing a network server as a MySQL plugin (daemon plugin) which listens on specific ports, accepting NoSQL protocols/APIs, then accessing to InnoDB directly by using MySQL internal storage engine APIs. This approach is similar to NDBAPI, but it can talk with InnoDB. This concept was initially invented and prototyped by Kazuho Oku at Cybozu Labs last year. He wrote MyCached UDF that speaks memcached protocols. My colleague Akira Higuchi implemented another plugin: HandlerSocket. The below picture shows about what Hanldersocket can do.

< Fig 1 > What is Hanldersocket?

HandlerSocket is a MySQL daemon plugin so that applications can use MySQL like NoSQL. The biggest purpose of the HandlerSocket is that it talks with storage engines like InnoDB without SQL-related overheads. To access MySQL tables, of course HandlerSocket needs to open/close tables. But HandlerSocket does not open/close tables every time. It keeps tables opened for reuse. Opening/closing tables is very costly and causes serious mutex contentions so it's very helpful to improve performance. Of course HandlerSocket closes tables when traffics become small etc so that it won't block administrative commands (DDL) forever. What is different from using MySQL + memcached? By comparing Fig 1 with Fig 2, I think you'll notice lots of differences. Fig 2 shows typical memcached and MySQL usage. memcached is aggressively used for caching database records. This is mainly because memcached get operation is much faster than in-memory / on-disk PK lookups in MySQL. If HandlerSocket can fetch records as fast as memcached, we don't need memcached for caching records.

< Fig 2 > Common architecture pattern for MySQL + memcached

Using HandlerSocket

As an example, here is a "user" table. Suppose we need to fetch user information by user_id.

Since HandlerSocket is a MySQL plugin, you can use it like other plugins such as InnoDB Plugin, Q4M, Spider, etc. That is, you do not need to modify MySQL source code itself. MySQL version has to be 5.1 or later. You need both MySQL source code and MySQL binary to build HandlerSocket.

#2. initializing an index so that we can use in main logics. # MySQL tables will be opened here (if not opened)my $res = $hs->open_index(0, 'test', 'user', 'PRIMARY', 'user_name,user_email,created');die $hs->get_error() if $res != 0;

For most web applications, it's a good practice to keep lightweight HandlerSocket connections established (persistent connections), so that lots of requests can focus on main logic (the #3 in the above code). HandlerSocket protocol is a small-sized text based protocol. Like memcached text protocol, you can use telnet to get rows through HandlerSocket.

Benchmarking

Now it's good time to show our benchmarking results. I used the above user table, and tested how many PK lookup operations can be done from multi-threaded remote clients. All user data fit in memory (I tested 1,000,000 rows). I also tested memcached with similar data (I used libmemcached and memcached_get() to fetch a user data). In MySQL via SQL tests, I used a traditional SELECT statement: "SELECT user_name, user_email, created FROM user WHERE user_id=? ". Both memcached and HandlerSocket client codes were written in C/C++. All client programs were located on remote hosts, connecting to MySQL/memcached via TCP/IP.

MySQL via HandlerSocket could get over 7.5 times higher throughput than traditional MySQL via SQL statements, even though %us was 3/4. This shows that SQL-layer in MySQL is very costly and skipping the layer certainly improves performance dramatically. It is also interesting that MySQL via HandlerSocket was 178% faster than memcached, and memcached spent too much %system resources. Though memcached is an excellent product, there are still rooms for optimizations.

Features and Advantages of HandlerSocket

HandlerSocket has lots of features and advantages like below. Some of them are really beneficial for us.

* Supporting lots of query patterns HandlerSocket supports PK/unique lookups, non-unique index lookups, range scan, LIMIT, and INSERT/UPDATE/DELETE. Operations that do not use any index are not supported. multi_get operations (similar to IN(1,2,3..), fetching multiple rows via single network round-trip) are also supported. See documentation for details.

* Can handle lots of concurrent connections HandlerSocket connection is light. Since HandlerSocket employs epoll() and worker-thread/thread-pooling architecture, the number of MySQL internal threads is limited (can be controlled by handlersocket_threads parameter in my.cnf). So you can establish thousands or tens of thousands of network connections to HandlerSocket, without losing stability(consuming too much memory, causing massive mutex contentions, etc: such as bug#26590, bug#33948, bug#49169).

* Extremely high performance HandlerSocket is possible to gain competitive enough performance against other NoSQL lineups, as already described. Actually I have not seen any NoSQL product that can execute 750,000+ queries on a commodity server from remote clients via TCP/IP. Not only HandlerSocket eliminates SQL related function calls, but also it optimizes around network/concurrency issues.

** Smaller network packets HandlerSocket protocol is much simpler and smaller than normal MySQL protocols. So overall network transfer size can be much smaller.

** Running limited number of MySQL internal threads See above.

** Grouping client requests When lots of concurrent requests come to HandlerSocket, each worker thread gathers as many requests as possible, then executing gathered requests at one time, and sending back results. This can improve performance greatly, by sacrificing response time a bit. For example, you can gain the following benefits. I'll explain them in depth in later posts, if anybody is interested.*** Can reduce the number of fsync() calls*** Can reduce replication delay* No duplicate cache When you use memcached to cache MySQL/InnoDB records, records are cached in both memcached and InnoDB buffer pool. They are duplicate so less efficient (Memory is still expensive!). Since HandlerSocket plugin accesses to InnoDB storage engine, records are cached inside InnoDB buffer pool, which can be reused by other SQL statements.

* No data inconsistency Since data is stored at one place (inside InnoDB), data consistency check between memcached and MySQL is not needed.

* Crash-safe Backend storage is InnoDB. It's transactional and crash safe. Even though you use innodb-flush-log-at-trx-commit!=1, you lose only < 1s of data on server crash.

* SQL can be used from mysql clients In many cases people still want to use SQL (i.e to generate summary reports). This is why we can't use Embedded InnoDB. Most NoSQL products don't support SQL interface, either. HandlerSocket is just a plugin for MySQL. You can usually send SQL statements from MySQL clients, and use HandlerSocket protocols when you need high throughput.

* All operational benefits from MySQL Again, HandlerSocket runs inside MySQL, so all MySQL operations such as SQL, online backups, replication, monitoring by Nagios / EnterpriseMonitor, etc are supported. HandlerSocket activities can be monitored by regular MySQL command such as SHOW GLOBAL STAUTS, SHOW ENGINE INNODB STATUS, SHOW PROCESSLIST, etc.

* No need to modify/rebuild MySQL Since it's a plugin, it runs on both MySQL Community and MySQL Enterprise Servers.

* Independent from storage engines HandlerSocket is developed so that it can talk with any storage engine, though we have tested and used with 5.1 and 5.5 InnoDB Plugin only.

Notes and Limitations

* Need to learn HandlerSocket APIs You need to write a program to talk with HandlerSocket, though it's pretty easy to use. We provide C++ API and Perl bindings.

* No security Like other NoSQL databases, HandlerSocket does not provide any security feature. HandlerSocket's worker threads run with system user privileges, so applications can access to all tables through HandlerSocket protocols. Of course you can use firewalls to filter packets, like other NoSQL products.

* No benefit for HDD bound workloads For HDD i/o bound workloads, a database instance can not execute thousands of queries per second, which normally results in only 1-10% CPU usage. In such cases, SQL execution layer does not become bottleneck, so there is no benefit to use Hanldersocket. We use HandlerSocket on servers that almost all data fit in memory.

DeNA is using HandlerSocket in production

We already use HandlerSocket plugin in our production environments. The results are great. We could have reduced lots of memcached and MySQL slave servers. Overall network traffics have been reduced, too. We haven't seen any performance problem (slow response time, stalls etc) so far. We've been very satisfied with the results. I think MySQL has been underrated from NoSQL/Database communities. MySQL actually has much longer history than most of other products, and lots of unique and great enhancements have been done so far by excellent my ex-colleagues. I know from NDBAPI that MySQL has very strong potentials as a NoSQL. Storage engine API and daemon plugin interface are completely unique, and they made Akira and DeNA develop HandlerSocket possible. As an ex-employee at MySQL and a long-time fun for MySQL, I'd like to see MySQL becomes better and more popular, not only as an RDBMS, but also as Yet Another NoSQL lineup. Since HandlerSocket plugin is Open Source, feel free to try. We'd be appreciated if you give us any feedback.

About Me

I am a database engineer at Facebook.
Before joining Facebook, I was a principal database and infrastructure architect at DeNA. My primary responsibility at DeNA was to make our database infrastructure more reliable, faster and more scalable. Before joining DeNA, I worked at MySQL/Sun/Oracle as a lead MySQL consultant in APAC for four years.
You can contact me on Yoshinori.Matsunobu_at_gmail.com (replace _at_ with @).