Four Ways to Optimize Your MySQL Database

March 26, 2007

From its inception, speed has been a strong point of the MySQL database server. In fact, its developers have long been cautious to add new features at the expense of performance even when faced by withering pressure from detractors. Yet over time MySQL's features caught up with its blazing speed, and today its used to power some of the highest traffic websites in the world, Yahoo! Finance, Craiglist, and TicketMaster among them. Accordingly, for most applications chances are MySQL is going to perform to your expectations.

However, with Web traffic exploding, and the complexity of applications increasing all the time, you should strive to optimize your database from its very inception, and continue to review and refine its structure and query activity over its lifetime. You can do so by following a process I refer to as NICM, or preferably, "Neglecting to Index Causes Misery" (those of you new to database optimization will soon get the joke). This process consists of five steps, including Normalize, Index, Cache, and Monitor. In this inaugural installment of my new MySQL series for Developer.com, I'll introduce these five steps, highlighting not only how they will ensure your database is running at full tilt, but also how they will actually help you to more effectively manage your data.

Normalize

The relational database is aptly named because it promotes the strategy of managing data through well-defined relations. By creating and enforcing relations, its possible to greatly reduce the possibility inconsistencies could creep into the data. This strategy is known as database normalization, of which there are several well-defined states, also known as forms.

To illustrate both the concept and importance of normalization, consider a scenario in which you're tasked with creating a corporate human resources application capable of managing employee information. This application would allow the HR manager to easily insert, update, view and delete employee data. Each employee record would contain the usual information, including name, phone number, position, salary, and supervisor.

Note that in the opening sentence I stated that relational databases promote the strategy of managing data through well-defined relations—you're certainly not constrained to do so. Therefore there's nothing to prevent you from creating a single table that looks like this:

Yet several problems with this approach should be evident. Because the data types used to define these columns are all indicative of an input method requiring the manager to manually enter each value. For example, the manager might type the following values into the web form when adding a new employee:

But what if over time the manager begins to enter the "Software Developer" title as "Sft. Dev.", and occasionally forgets the periods, using just "Sft Dev"? These inconsistencies eliminate the possibility of using simple SELECT queries to retrieve the roster of software developers. The simple answer to the problem is to normalize the position data, creating a new table intended to contain the master list of all possible positions:

Then you revise the employees table so that it refers to a position's primary key rather than the name (in the revised table, the _fk postfix is simply a naming convention to remind the database administrator that this column points to a foreign table key):

Now you can retrieve the positions from the positions table and use them to populate a drop-down list. This forces the manager to choose from a constrained set rather than haphazardly type in the titles.

A similar normalization improvement can be made to the supervisor column, although this is a case where an additional table isn't necessary. All you need to do is modify the employees table anew so it looks like this:

Can you figure out the origin of the data used to populate the supervisor_pk column? That's right, the very same employees table!

This is just a taste of how database normalization can greatly reduce the data inconsistencies that can arise over time. For a great summary of the concept and the various forms I alluded to earlier in the section, see the Wikipedia article on this topic.

I also suggest downloading MySQL's free MySQL Query Browser application, which can help you to visualize the data and its various relationships in a manner much more efficient than if you were using the command-line client.

Index

Once the tables have been properly designed and normalized, you should next take some time to think about what data will most commonly be queried, and create special data structures known as indexes which will dramatically improve the performance of these query operations. Indexes are important because they organize the indexed data in a way that allows MySQL to retrieve the desired record in the fastest possible fashion. In the previous section you already encountered an index, although it isn't apparent: the primary key. But the primary key will only speed query operations when you're searching for a record by using that primary key as the identifier, for example:

SELECT name, telephone FROM employees WHERE id='3'

What if you wanted to search for one or several records based on the employee name? The query might look like this:

SELECT id FROM employees WHERE name='John Smith'

To speed these sorts of queries, you'll need to add an index to the name column. The revised employees table looks like this:

Cache

MySQL is often used to dynamically generate web pages based on similar queries and rarely changing data. For instance, consider the home page of Developer.com, which likely depends on a single query used to repeatedly retrieve the latest ten or so articles. New articles are typically published early to mid-morning, meaning the relevant data found on this page really only chances perhaps once or twice a day. Why not cache the data returned by these queries, thereby bypassing the need to repeatedly retrieve it from the database? Introduced in version 4, enabling MySQL's query caching mechanism can result in a huge performance gain over neglecting to do so.

To learn more about how your MySQL installation's query caching mechanism is configured, run the following command:

Monitor

The reasoning behind this step should be obvious: how are you going to know what should be optimized if you're not actively monitoring how MySQL is operating? Thankfully, MySQL's developers have been particularly proactive in providing developers with the tools for keeping abreast of database performance.

For starters, you should familiarize yourself with MySQL's EXPLAIN query, which will provide detailed information regarding how MySQL goes about executing a SELECT query. You'll learn valuable tips regarding how MySQL is executing table joins, and whether additional indexes should be added.

If you're running a MySQL version earlier than 5.0, check out mytop, a console-based utility for monitoring MySQL performance.

Finally, be sure to have a solid understanding of MySQL's configuration variables, because they control crucial performance-related matters such as how much memory is allocated to queries, what sort of data is logged, how many simultaneous collections are allowed, and much more. You can review a complete list of these variables by executing:

mysql>SHOW VARIABLES;

I hope this introductory article left you thinking about how you can go about making your MySQL database absolutely scream. And moving forward, every time you begin a new database project, don't forget to approach it the NICM way!