Monday, June 11, 2012

Yet another Blog on Query Optimization for MySQL Server

If you have been in MIS development for some time, then you may have realized that buying the latest multi-thousand-dollar machine, stuffed with a top-notch processor and an army of memory chips, is not sufficient for your needs when it comes to processing large data, especially when your DBMS is MySQL Server.

In this article, I have tried to put together the tips and techniques to follow - some general and some specific to MySQL Server; but I will, like every blogger, repeat the same common phrase: "in the end, it all depends on your scenario".

The results you are going to see will mostly be in milliseconds, so before thinking "is it worth the effort if the gain is a few milliseconds?", do know that these results are derived from a very simple database with no more than 100,000 records in a table. With complex databases and records in the millions, the effort will pay you back.

Coming straight to the topic, here are some points you should not ignore when designing tables and writing queries to retrieve data.

Choosing a Storage Engine

MySQL is unique in that it offers different engines for storage. The two most used are InnoDB and MyISAM. When designing tables, choosing the appropriate storage engine serves as a lifesaver afterwards.

The most significant differences between InnoDB and MyISAM are that InnoDB supports transactions, foreign keys, row-level locks, data caching and clustered indexes; it serves best for Insert, Update and Delete operations, more controlled locking, and maintaining referential integrity.

In short, InnoDB should be selected for a table when DML operations are more frequent than Select - not because of speed but for reliability and durability - while MyISAM is best for data warehousing; more precisely, for tables used for search and reporting purposes.

Indexing and Primary Key

Indexing is a method of sorting a number of records on one or more columns. The index itself works as a representative of the records in a table; it can be created from one or more columns, and functions as a pointer to the records it relates to. When you define a primary key for a table, the DBMS creates a primary key index automatically.

Although creating indexes boosts read performance, remember that each index you create requires some additional disk space, and write performance decreases as the number of indexes grows.

Below is an experiment on read/write performance in indexed and non-indexed tables.
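The original experiment is summarized here as a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for MySQL (the table and index names are made up, and absolute timings will differ), but the principle carries over: an index turns a full table scan into a direct lookup.

```python
# Sketch (sqlite3 stand-in): compare a lookup before and after adding an index.
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("create table person (id integer, name text)")
conn.executemany("insert into person values (?, ?)",
                 [(i, "name_%d" % i) for i in range(100000)])

def lookup_seconds():
    start = time.perf_counter()
    conn.execute("select name from person where id = 99999").fetchone()
    return time.perf_counter() - start

scan_time = lookup_seconds()                     # full table scan
conn.execute("create index idx_person_id on person (id)")
indexed_time = lookup_seconds()                  # index lookup

# The query plan confirms the index is actually used for the second run.
plan = conn.execute(
    "explain query plan select name from person where id = 99999").fetchall()
uses_index = any("idx_person_id" in str(row) for row in plan)
print("scan: %.6fs, indexed: %.6fs, uses index: %s"
      % (scan_time, indexed_time, uses_index))
```

On MySQL you would verify the same thing with EXPLAIN before and after CREATE INDEX.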

It doesn't really matter which data type you choose so much as the length of the data itself. The length should be chosen very carefully, trying to keep it low. You don't want to make your primary key column(s) BIGINT unless you are willing to register every person, animal, fish and tree on the planet in your system.

Writing Select Statement

Needless to say, the primary purpose of a database isn't pushing data in; in fact, data that is never read shouldn't have been there in the first place. You retrieve data in chunks and bulks, in various combinations and dimensions, and most of the time you do so using a Select query. So, in the abstract, the efficiency of your data retrieval depends - secondarily, after design - on how beautifully you write your Select queries.

Avoid Select * ...

I know I sound annoying, but this is true. The more columns you select, the more bytes the DB engine processes. This matters most where you have full-text columns like Address, Password_Hint, Essay_on_National_Hero, etc. Restricting your columns to only those needed relieves your processor, especially when fetching data. Have a look at the results:
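The results can be reproduced with a small sketch (sqlite3 stand-in, hypothetical person table with a wide essay column): select * drags the heavy column through every row, while a narrow column list does not.

```python
# Sketch (sqlite3 stand-in): select * vs. selecting only needed columns.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table person (id integer, name text, essay_on_national_hero text)")
conn.executemany("insert into person values (?, ?, ?)",
                 [(i, "name_%d" % i, "x" * 2000) for i in range(1000)])

wide = conn.execute("select * from person").fetchall()           # drags the essays along
narrow = conn.execute("select id, name from person").fetchall()  # only what's needed

wide_bytes = sum(len(row[2]) for row in wide)  # essay text fetched by select *
print(len(wide[0]), len(narrow[0]), wide_bytes)
```

Here select * hauls roughly 2 MB of essay text that the report never uses.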

Choosing the Base Table

Base tables are actually the tables from which Views are derived, but here I mean the first table you name right after the term "from" in a Select statement, as in select blah_blah from base_table. The other tables will be called join tables.

For example, suppose you want to retrieve the names from the person table for which the contact number in the contact table ends in 9. What you would typically do is something like:

select p.name, c.contact from person as p inner join contact as c using (id) where contact like '%9';

Result: 0.031sec

Because the Where clause is executed first, if you choose contact as your base table and person as the join table, the unnecessary records will be filtered out before the two tables are joined. Interchanging the positions of the two tables:

select p.name, c.contact from contact as c inner join person as p using (id) where c.contact like '%9';

Result: 0.015sec

Hierarchy of Joins

Following the above principle, correct placement of each join limits the data for the next join. In addition, when joining, use the column(s) of the table right before the join table instead of using the base table's columns throughout the list of joins. Here is an example:
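As a sketch of the chaining (sqlite3 stand-in, made-up data): each join condition references the table joined just before it, so every join narrows the row set handed to the next one.

```python
# Sketch (sqlite3 stand-in): join conditions chained table-to-table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table person (id integer primary key, name text);
create table contact (id integer, contact text);
create table person_job (id integer, designation text);
insert into person values (1, 'Ali'), (2, 'Sara');
insert into contact values (1, '0300-1111119'), (2, '0300-2222222');
insert into person_job values (1, 'Engineer'), (2, 'Manager');
""")
rows = conn.execute("""
    select p.name, c.contact, j.designation
    from contact as c
    inner join person as p on p.id = c.id      -- joins on the previous table
    inner join person_job as j on j.id = p.id  -- joins on p, not back on c
    where c.contact like '%9'
""").fetchall()
print(rows)
```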

Writing the Where Clause

Last but not least, the Where clause is where you should focus most when optimizing queries. There are various tips on how to organize a Where clause:

Order: ... where gender = 'M' and age < 25 should be ordered as ... where age < 25 and gender = 'M', because there is a better chance of the age condition limiting the number of records than the gender condition. Note: some clever DBMSs like Oracle do such optimization automatically; however, I am unsure whether MySQL is on that list.

Between: using age between 13 and 19 is slightly better than age >= 13 and age <= 19.

IN vs EXISTS: where exists (... sub-query ...) is more efficient than where column_name in (... sub-query ...) when you must use a sub-query.
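A quick sketch (sqlite3 stand-in, made-up tables) shows the two forms return the same rows; Exists can stop probing as soon as one matching row is found per person, which is where the efficiency claim comes from.

```python
# Sketch (sqlite3 stand-in): IN vs EXISTS give identical results.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table person (id integer primary key, name text);
create table contact (id integer, contact text);
insert into person values (1, 'Ali'), (2, 'Sara'), (3, 'Omar');
insert into contact values (1, '111-1111'), (3, '333-3333');
""")
with_in = conn.execute(
    "select name from person where id in (select id from contact)").fetchall()
with_exists = conn.execute(
    "select name from person as p "
    "where exists (select 1 from contact as c where c.id = p.id)").fetchall()
print(sorted(with_in) == sorted(with_exists))
```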

Grouping

The Group by clause is typically used with an aggregate function like max(...), sum(...), etc. In situations where you are applying one or more aggregate functions in your main query to create a summary of records, the Group by clause saves your server from extra burden and you from a headache. The example below shows how to avoid a sub-query by grouping:

select j.designation, (select min(annual_income) from person_job where designation = j.designation) as min_income from person_job as j;

Disastrous Result: 45.692sec

select j.designation, min(j.annual_income) as min_income from person_job as j group by j.designation;

Result: 0.015sec

Now imagine the same with max, count and sum functions added.

Use Existing Functions

The assumption that "the DBMS function is doing the same thing in the back-end that I am doing in mine" is fallacious; you don't know the back-end, so don't predict. It is essential for a query writer to learn to use the built-in aggregate functions for calculations rather than reinventing them. Here is an example:

select j.designation, sum(j.annual_income) / count(id) as average from person_job as j group by j.designation;

Result: 0.031sec

select j.designation, avg(j.annual_income) as average from person_job as j group by j.designation

Result: 0.015sec

And it gets worse with dates: writing select substring(dob, 1, 4) as year instead of simply saying select year(dob) as year.

Difference between WHERE and HAVING

The only difference (which is a big one) between Where and Having is that the Having clause is executed after the data has been fetched, i.e. after the Select query is executed. It works like a filter and, instead of column names, it uses aliases (if no alias is defined, the column name is used). Example: we are interested in finding, for each service provider, the designations with an average annual income greater than 500000:

select m.sp_id, j.designation, avg(j.annual_income) as average from person_mobile_sp as m

inner join person_job as j using (id)

group by m.sp_id, j.designation

having average > 500000;

Using Having where Where would suffice will not give you false results, but it filters after the data has been fetched, so it is wasteful; the reverse - putting an aggregate condition in Where - simply fails, since aggregates are not available when Where runs. So do not use Having for any purpose other than filtering on aggregates.
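A sketch of both points (sqlite3 stand-in, made-up incomes): Having filters the grouped result and can see the alias, while an aggregate inside Where is rejected outright because Where runs before any grouping happens.

```python
# Sketch (sqlite3 stand-in): Having filters groups; Where cannot see aggregates.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table person_job (id integer, designation text, annual_income integer);
insert into person_job values
  (1, 'Engineer', 600000), (2, 'Engineer', 700000),
  (3, 'Clerk', 200000), (4, 'Clerk', 300000);
""")
rows = conn.execute("""
    select designation, avg(annual_income) as average
    from person_job
    group by designation
    having average > 500000
""").fetchall()
print(rows)  # only the Engineer group survives

where_failed = False
try:
    conn.execute(
        "select designation from person_job where avg(annual_income) > 500000")
except sqlite3.OperationalError:
    where_failed = True  # aggregates are not allowed in Where
print(where_failed)
```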

Choosing between Join and Sub-query

Rule of thumb: when you want more than one column from a table, don't use a sub-query. However, sometimes you need to join with a sub-query itself when a table is unavailable, but that again is a join. Look at the structure below:

select a.col1, b.col1, c.col1 from table_a as a inner join table_b as b on b.id = a.id

inner join (select id, col1, sum(col2) as total from table_c group by id, col1) as c on c.id = b.id

Sub-queries can be used in situations where grouping is not desired and you want to apply an aggregate function to a separate table. For example, you want to write a query listing, for each person, his id, name, designation, annual income, the highest annual income for his designation, his mobile service provider and the total subscribers of that provider:

select p.id, p.name, j.designation, j.annual_income, (select max(annual_income) from person_job where designation = j.designation) as max_income,

sp.sp_name, (select count(id) from person_mobile_sp where sp_id = m.sp_id) as total_subscribers from person as p

inner join person_job as j on j.id = p.id inner join person_mobile_sp as m on m.id = p.id inner join mobile_sp as sp on sp.sp_id = m.sp_id;

Result: Duration 14.758sec. Fetched in: 31.356sec

Obviously, this is an undesirable time, so you optimize the query. Instead of writing sub-queries for the aggregate functions, you can join against separate queries; that way, the derived tables are created only once:

select p.id, p.name, j.designation, j.annual_income, a.max_income, sp.sp_name, b.total_subscribers from person as p

inner join person_job as j on j.id = p.id inner join person_mobile_sp as m on m.id = p.id inner join mobile_sp as sp on sp.sp_id = m.sp_id

inner join (select designation, max(annual_income) as max_income from person_job group by designation) as a on a.designation = j.designation

inner join (select sp_id, count(id) as total_subscribers from person_mobile_sp group by sp_id) as b on b.sp_id = m.sp_id;

Result: 1.794sec

Views and Temporary Tables

Views are literally nothing more than Select statements stored for reuse. In MySQL, views are handled using two algorithms, Merge and Temptable; the third, Undefined, is the default option, which leaves the choice of algorithm to the DBMS. The Merge algorithm cannot be used if the statement contains an aggregate function, a sub-query, grouping or a union.

I prefer using temporary tables instead of views because, first, I almost always need a separate data container when I am having a hard time with sub-queries, groups and aggregate functions (which the Merge algorithm doesn't support), and secondly, I often need to define additional indexes for better performance, something that cannot be done with views.

For more on MySQL views, check this old-yet-useful blog on performance.

Creating temporary tables is one of the best techniques in query optimization, not specific to MySQL but in general. And if you choose to create them in memory, the performance will be magical. The example below illustrates how useful temporary tables can be:

Our objective is to create a report for each person containing his id, contact number, name, mobile service provider, total subscribers of that provider, job designation, annual income, minimum and maximum income for his designation, and the difference between his annual income and the average annual income for his designation.

Without temporary tables or joins against derived queries, your server will weep like a starving baby:
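A scaled-down sketch of the report (sqlite3 stand-in, made-up data; in MySQL you would add ENGINE=MEMORY for the in-memory version): the per-designation summary is computed once into a temporary table and then joined like any ordinary table.

```python
# Sketch (sqlite3 stand-in): build a summary once in a temp table, then join it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table person_job (id integer, designation text, annual_income integer);
insert into person_job values
  (1, 'Engineer', 600000), (2, 'Engineer', 800000), (3, 'Clerk', 250000);

create temporary table designation_summary as
  select designation,
         min(annual_income) as min_income,
         max(annual_income) as max_income,
         avg(annual_income) as average_income
  from person_job
  group by designation;
""")
rows = conn.execute("""
    select j.id, j.annual_income, s.min_income, s.max_income,
           j.annual_income - s.average_income as diff_from_average
    from person_job as j
    inner join designation_summary as s on s.designation = j.designation
    order by j.id
""").fetchall()
print(rows)
```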

But this is not the end of the story: you can avoid temporary tables altogether if you are okay trading simplicity for performance, by using the same method of joining against queries, which will be even faster. Look at the results below:

inner join (select sp_id, count(id) as total_subscribers from person_mobile_sp group by sp_id) as a on m.sp_id = a.sp_id

inner join (select designation, min(annual_income) as min_income, max(annual_income) as max_income, avg(annual_income) as average_income from person_job group by designation) as b on j.designation = b.designation;

Result: Duration 0.31sec. Fetched in 0.31sec

So again, when should you use temporary tables? The answer is simple: not when the data inside them is not meant to be reused, and not for jobs that can be done with a Union. In our example, the tables we created may be used in other queries as well.

An example of how Temporary tables can be avoided by using Union:

Bad Practice:

create table tmp

select a.col1, a.col2 from table_a as a;

insert into tmp

select b.col1, b.col2 from table_b as b

where not exists (select col1 from table_a where col1 = b.col1);

insert into tmp

select c.col1, c.col2 from table_c as c

where not exists (select col1 from table_a where col1 = c.col1)

and not exists (select col1 from table_b where col1 = c.col1);

Good Practice:

select a.col1, a.col2 from table_a as a

union

select b.col1, b.col2 from table_b as b

union

select c.col1, c.col2 from table_c as c;

Keep in mind that Union gives you unique values only; for repeated values, use Union all.
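A tiny sketch of the difference (sqlite3 stand-in, made-up tables): Union deduplicates, Union all keeps everything - which is also why Union all is cheaper when you know the inputs are already distinct.

```python
# Sketch (sqlite3 stand-in): Union vs Union all on overlapping tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table table_a (col1 integer);
create table table_b (col1 integer);
insert into table_a values (1), (2);
insert into table_b values (2), (3);
""")
distinct_rows = conn.execute(
    "select col1 from table_a union select col1 from table_b").fetchall()
all_rows = conn.execute(
    "select col1 from table_a union all select col1 from table_b").fetchall()
print(len(distinct_rows), len(all_rows))  # union drops the repeated 2
```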

Because the use of temporary tables requires more than one statement to execute, a good way to organize these statements is to create a Stored Procedure and call it from your code.

Problem with Update

In MySQL Server, the performance of the Update statement is much slower than Insert, so the idea of creating a temporary table with an empty column and then filling that column with a bulk Update query backfires. Thinking that a sub-query in Select will be slower, and dividing the operation into parts, is good in general but disappointing in MySQL when one of the parts is an Update statement.

Let's take an example. You want to display, for each person, his name, his mobile service provider and the total number of subscribers for that provider. Using the divide-and-conquer approach described above, you will have an unpleasant surprise:
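Both paths can be sketched side by side (sqlite3 stand-in, made-up data): they give identical rows, but on MySQL the bulk-Update path is the slow one, so the single Select joining a grouped derived table is the safer bet.

```python
# Sketch (sqlite3 stand-in): temp-table-plus-Update vs a single grouped join.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table person (id integer primary key, name text, sp_id integer);
insert into person values (1, 'Ali', 10), (2, 'Sara', 10), (3, 'Omar', 20);

-- divide-and-conquer: temp table with a placeholder column, then bulk Update
create temporary table report as
  select id, name, sp_id, 0 as subscribers from person;
update report set subscribers =
  (select count(*) from person where person.sp_id = report.sp_id);
""")
via_update = conn.execute(
    "select name, subscribers from report order by id").fetchall()

# single pass: join against a grouped derived table instead
via_join = conn.execute("""
    select p.name, t.subscribers
    from person as p
    inner join (select sp_id, count(*) as subscribers
                from person group by sp_id) as t on t.sp_id = p.sp_id
    order by p.id
""").fetchall()
print(via_update == via_join)
```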

Splitting the Time between Server and Client

In almost 9 out of 10 reports (my perception), you will be using a client-side application to run queries instead of the MySQL console or other querying tools. You can save query time by shifting some load to the client end. For instance, if you are designing reports in a reporting tool like iReports, a Crystal Reports alternative, you can completely avoid:

grand aggregates like "sum of counts", "sum of sums", or "average of sums".

The Order by clause, by sorting in the reporting tool instead.

There is more to this list, depending on the tool you are using. But if your tool's name is something like "[some super] Java IDE", "C++ Monster" or "PHP Guru" (these are fictitious, so don't google them), or any other programming interface, then I strictly advise sticking to the query, because whatever the smartness of your programming IDE, the way data is handled through the data structures of a programming language is different from SQL, which is designed and built to handle data.

When nothing works... the Warehouse does

Data warehousing is (not exaggerating) the final solution to most of your complex reporting problems: you get a separate repository containing all your historical data (NOT live) and a free hand to create as many tables - formally called dimension tables and fact tables in data warehousing language - in any way you like.

Data warehousing is itself a complete branch of data reporting, which supports massive analysis for decision making, including data mining.

You can experiment with this by creating a replica of your transaction database and deriving different tables for reporting purposes from it. The new warehouse can be fed fresh data at a specific interval (daily, weekly, bi-weekly, etc.). However, the biggest setback is that you cannot use it for live reporting, as in dashboards.

I hope the blog helped you. These tips and hints are not a hundred percent guaranteed to work for you but, as I explained, it all depends on your application's architecture and preferences. Secondly, they are all from my own limited experience of programming and query design.