Drupal's database interaction

I used XDebug to profile the behavior of Drupal, and to study the interaction with the database server. I aggregated the profile information of 100 requests to the main page using the "Apache, mod_php, PHP4, APC" configuration used for previous benchmark experiments. More information about my experimental setup is available at that page. XDebug generates a trace file with all the profile information which I visualized using KCacheGrind.

Drupal has a page cache mechanism which stores dynamically generated web pages in the database. By caching a web page, Drupal does not have to create the page each time it is requested. Only pages requested by anonymous visitors (users that have not logged on) are cached. Once users have logged on, caching is disabled for them since the pages are personalized in various ways. Because this represents two different modes of operation, I investigated Drupal's behavior with and without page caching.

With page caching

This figure shows how much time is spent in each function when page caching is enabled. The functions are sorted by the 'Self' column, which shows the time spent in each function without the time spent in its children. The second column shows how often the function was called to serve 100 requests to the main page.

We observe that more than 45% (14.37% + 14.18% + 8.9% + 5.54% + 2.9% + ...) of the execution time is spent in the database related functions. Sending the SQL queries to the database server and waiting for the results takes 14% of the total execution time. Drupal preparing the queries (eg. database prefixing, sanitizing the input to prevent SQL query injection, etc) takes more than 31% of the total execution time. We should look into optimizing the functions that prepare the queries (db_query(), _db_query and _db_query_callback()).

The figure above depicts that PHP's mysql_query() function is called 1401 times. This means that we need 14 SQL queries to serve a cached page. This is where these SQL queries come from:

This figure shows the Drupal functions responsible for querying the database when serving 100 cached pages. The first column shows how much time is spent in the calls to <code>db_query()</code>. The second column shows how many times each function queried the database and the last column shows the functions' names and source files.

Without page caching

This figure shows how much time is spent in each function when page caching is disabled. The functions are sorted by the 'Self' column, which shows the time spent in each function without the time spent in its children. The second column shows how often the function was called to serve 100 requests to the main page.

When the page cache is disabled, we see that on average we need 144 SQL queries to generate the main page. That is a lot. We also observe that 25% of the total execution time is spent in database related functions: 13% of the total execution time is spent executing queries, and 12% of the total execution time is spent preparing the queries. This is where the SQL queries come from:

This figure shows the Drupal functions responsible for querying the database when serving 100 pages. The first column shows how much time is spent in the calls to <code>db_query()</code>. The second column shows how many times each function queried the database and the last column shows the functions' names and source files.

The performance gain that you can obtain from enabling the MySQL query cache is bound by the time you spend executing SQL queries. In this particular case, we spent 13-14% of the time executing SQL queries (see figures). The rest of the time is spent in PHP code which doesn't benefit directly from the MySQL query cache. Thus, the upper bound of the potential performance improvement is 13-14%. In practice it will be less because even with the MySQL query cache enabled, you have to do a minimal amount of work. For pages with more expensive SQL queries, the performance gain is likely to be more significant.

Are the Self and Time values the total time for that function, versus the average time for that function? I am guessing yes.

If so, then there needs to be some optimization in how many times some of the functions are called, not just in how they perform. For example, why is drupal_lookup_path() being called about 6.5 times per node on this page? Is that number even node count related?

Note that several of the functions are called 10 times each, exactly the number of nodes being displayed, so clearly they are node-related.

drupal_lookup_path() is responsible for half of the SQL queries on the page; that is a lot. Yet we tried to cache the whole alias table with bad results. Not sure what to do here, but that looks like a smoking gun if we really care about reducing the number of queries.

I think about this post from time to time and I still can't come to terms with it. Specifically, the statement "The functions are sorted by the 'Self' column, which shows the time spent in each function without the time spent in its children". How is it possible that db_query() is at the top? that function hardly does anything on its own. It hands off all processing to other functions.

I verified the results at the time so I'm confident it is not my fault. I'll do some additional analysis in the future; maybe I can use alternative tools or newer versions of the same tools to get more accurate results or to better understand the problem. I'll investigate this further as time permits.

This thread has been inactive for 1.75 years. I'm considering using Drupal 6.X for a new major site where performance matters a lot. How does 6.X stack up in these investigations? Has the number of queries been reduced since 2006? Any other news to report?