5 Answers

When data is stored on disk-based storage devices, it is stored as blocks of data. These blocks are accessed in their entirety, making them the atomic disk access operation. Disk blocks are structured in much the same way as linked lists; both contain a section for data and a pointer to the location of the next node (or block), and neither needs to be stored contiguously.

Because a set of records can be physically sorted on only one field, searching on a field that isn't sorted requires a linear search, which takes N/2 block accesses on average, where N is the number of blocks that the table spans. If that field is a non-key field (i.e. its entries are not unique), the entire table space must be searched, at N block accesses.

With a sorted field, however, a binary search may be used, which takes log2 N block accesses. Moreover, since the data is sorted, on a non-key field the rest of the table need not be searched for duplicate values once a higher value is found. The performance increase is therefore substantial.
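As a rough sketch of the difference (hypothetical Python, where each element of a sorted list stands in for one disk block):

```python
def binary_search(blocks, target):
    """Search a sorted list, counting comparisons.

    Each comparison stands in for one disk block access.
    """
    lo, hi, accesses = 0, len(blocks) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        accesses += 1
        if blocks[mid] == target:
            return mid, accesses
        elif blocks[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, accesses

# 1,024 sorted "blocks": found in at most log2(1024) + 1 = 11 accesses,
# versus an average of 512 for a linear scan.
idx, cost = binary_search(list(range(1024)), 777)
```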

What is indexing?

Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value and a pointer to the record it relates to. This index structure is then sorted, allowing binary searches to be performed on it.
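A minimal illustration of such a structure (toy Python, with a list position standing in for the record pointer):

```python
from bisect import bisect_left

# Toy table: records stored in insertion (unsorted) order.
table = [
    {"id": 1, "firstName": "Carol"},
    {"id": 2, "firstName": "Alice"},
    {"id": 3, "firstName": "Bob"},
]

# The index holds (field value, pointer-to-record) pairs, kept sorted
# by value so it can be binary searched.
index = sorted((rec["firstName"], pos) for pos, rec in enumerate(table))

def lookup(name):
    i = bisect_left(index, (name,))       # binary search on the index
    if i < len(index) and index[i][0] == name:
        return table[index[i][1]]          # follow the pointer to the record
    return None

print(lookup("Bob"))   # {'id': 3, 'firstName': 'Bob'}
```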

The downside to indexing is that indexes require additional space on the disk. Since (with the MyISAM engine) the indexes are stored together in one file, that file can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.

Note: char was used in place of varchar to allow for an accurate size-on-disk value.
This sample database contains five million rows and is unindexed. The performance of several queries will now be analyzed: a query using the id (a sorted key field) and one using the firstName (a non-key, unsorted field).

Example 1

Given our sample database of r = 5,000,000 records of a fixed size, giving a record length of R = 204 bytes, stored in a MyISAM table with the default block size B = 1,024 bytes: the blocking factor of the table is bfr = (B/R) = 1024/204 = 5 records per disk block, and the total number of blocks required to hold the table is N = (r/bfr) = 5000000/5 = 1,000,000 blocks.
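These figures can be checked with a few lines of Python (the variable names mirror the notation above):

```python
r = 5_000_000   # records in the table
R = 204         # record length in bytes
B = 1024        # disk block size in bytes

bfr = B // R                # blocking factor: whole records per block
N = (r + bfr - 1) // bfr    # blocks needed, rounded up

print(bfr, N)   # 5 1000000
```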

A linear search on the id field would require an average of N/2 = 500,000 block accesses to find a value, given that the id field is a key field. But since the id field is also sorted, a binary search can be conducted, requiring an average of log2 1000000 = 19.93, i.e. 20 block accesses. Instantly we can see this is a drastic improvement.

The firstName field, however, is neither sorted (so a binary search is impossible) nor unique, and thus the table must be searched to the very end, for exactly N = 1,000,000 block accesses. It is this situation that indexing aims to correct.
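Side by side, with the same figures as above:

```python
import math

N = 1_000_000   # blocks spanned by the table

linear_key    = N / 2                     # linear search, unique key field
binary_sorted = math.ceil(math.log2(N))   # binary search on the sorted id
linear_nonkey = N                         # full scan for a non-key field

print(linear_key, binary_sorted, linear_nonkey)   # 500000.0 20 1000000
```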

Given that an index record contains only the indexed field and a pointer to the original record, it stands to reason that it will be smaller than the multi-field record that it points to. So the index itself requires fewer disk blocks than the original table, and therefore fewer block accesses to iterate through. An index record on the firstName field holds just the field value and a record pointer, giving the index record length R = 54 bytes used in Example 2.

Note: Pointers in MySQL are 2, 3, 4 or 5 bytes in length depending on the size of the table.

Example 2

Given our sample database of r = 5,000,000 records, with an index record length of R = 54 bytes and the default block size B = 1,024 bytes: the blocking factor of the index is bfr = (B/R) = 1024/54 = 18 records per disk block, and the total number of blocks required to hold the index is N = (r/bfr) = 5000000/18 = 277,778 blocks.

Now a search using the firstName field can utilise the index to increase performance. This allows a binary search of the index with an average of log2 277778 = 18.08, i.e. 19 block accesses. The index entry holds the address of the actual record, which requires a further block access to read, bringing the total to 19 + 1 = 20 block accesses, a far cry from the 1,000,000 block accesses required to search the non-indexed table.
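The index arithmetic can likewise be verified:

```python
import math

r = 5_000_000    # records
R_index = 54     # index record length in bytes
B = 1024         # block size in bytes

bfr = B // R_index              # index records per block
N_index = (r + bfr - 1) // bfr  # blocks needed to hold the index
total = math.ceil(math.log2(N_index)) + 1   # binary search + 1 record fetch

print(bfr, N_index, total)   # 18 277778 20
```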

When should it be used?

Given that creating an index requires additional disk space (277,778 extra blocks in the above example), and that too many indexes can cause issues arising from the file system's size limits, careful thought must be given to selecting the correct fields to index.

Since indexes are only used to speed up the searching for a matching field within the records, it stands to reason that indexing fields used only for output is simply a waste of disk space and of processing time on insert and delete operations, and thus should be avoided.

Also, given the nature of a binary search, the cardinality, or uniqueness, of the data is important. Indexing on a field with a cardinality of 2 would split the data in half, whereas a cardinality of 1,000 would return approximately 5,000 of the 5,000,000 records per value. With such a low cardinality the effectiveness approaches that of a linear search, and the query optimizer will avoid using the index if the cardinality is less than roughly 30% of the record number, effectively making the index a waste of space.
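A rough way to see the effect of cardinality (a back-of-the-envelope sketch, assuming values are spread evenly across the rows):

```python
def rows_per_value(total_rows, cardinality):
    """Expected matching rows per distinct value, assuming an even spread."""
    return total_rows // cardinality

total = 5_000_000
print(rows_per_value(total, 2))      # 2500000 -> index barely narrows anything
print(rows_per_value(total, 1000))   # 5000
print(rows_per_value(total, total))  # 1 -> unique key, ideal for an index
```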

Awesome explanation! I have one important question. If I have a non-unique field that is not sorted, and the cardinality of the field is 2, would the index table of this field have just 2 rows? If yes, would there be multiple pointers for each of these 2 rows pointing to all the rows that have this value? If no, would the index table have as many rows as there are for this field in the original table?
–
Abhishek Shivkumar, Aug 4 '12 at 10:42


@CHAPa: He has already explained it, but since you didn't get it yet, here is a detailed explanation by example. Worst case: total records = 20 and I'm searching for the 20th element; in a linear search the number of comparisons = N. But there are also other cases: say you are searching for the 4th element, which takes only 4 comparisons. Best case: finding the 1st element, bingo, you get it on the first shot! To compare algorithms you cannot assume the best or worst case; you need the average case. The average case is the element you want sitting in the middle. How many comparisons would you make to reach the middle? N/2.
–
Saurabh Patil, Jul 8 '13 at 4:02


@AbhishekShivkumar: Great question! I think the index table will have as many rows as there are in the data table. And as this field has only 2 values (boolean true/false), say you want a record with the value true: you can only halve the result set in the first pass. In the second pass all your records have the value true, so there is no basis to differentiate, and you now have to search the data table in linear fashion. Hence he said cardinality should be considered when deciding the indexed column. In this case, it's worthless to index such a column. Hope I'm correct :)
–
Saurabh Patil, Jul 8 '13 at 4:20


Shouldn't the number of block accesses in the average case be (N+1)/2? If we sum the number of block accesses for all possible cases and divide by the number of cases, we get N*(N+1)/(2*N), which comes out to (N+1)/2.
–
ajay, Jan 30 '14 at 12:11
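The (N+1)/2 average from the comment above can be confirmed by summing the cost of every possible target position (a quick exhaustive check, not a benchmark):

```python
N = 20
# A linear search that finds its target at position i (1-based) costs
# i comparisons, so average over all N equally likely positions:
total = sum(range(1, N + 1))
average = total / N

print(average)   # 10.5, which equals (N + 1) / 2
```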


I think there are a few typos in this answer, for example, in the sentence: "a far cry from the 277,778 block accesses required by the non-indexed table." doesn't the author mean 1,000,000 block accesses? 277,778 is the number of blocks required by the index itself. There seems to be a couple of other inaccuracies too :(
–
jcm, Aug 24 '14 at 4:02

Since then I have gained some insight into the downside of creating indexes:
if you write into a table (UPDATE or INSERT) with one index, you actually perform two write operations in the file system: one for the table data and another for the index data (plus the re-sorting of it, and, if clustered, the re-sorting of the table data). If table and index are located on the same hard disk, this costs more time. Thus a table without an index (a heap) allows for quicker write operations. (If you had two indexes you would end up with three write operations, and so on.)
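A simple model of the write amplification described above (an engine-agnostic sketch, not an exact accounting of any particular storage engine):

```python
def writes_per_insert(num_indexes):
    """One write for the table data plus one per index structure."""
    return 1 + num_indexes

print(writes_per_insert(0))   # 1: a heap, cheapest to write into
print(writes_per_insert(1))   # 2: table data + one index
print(writes_per_insert(2))   # 3, and so on
```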

However, placing index data and table data in two different locations on two different hard disks can reduce or eliminate the extra time cost. This requires defining additional file groups, with corresponding files on the desired hard disks, and defining the table/index location as desired.

Another problem with indexes is their fragmentation over time as data is inserted. REORGANIZE helps, but you must write routines to have it run.

In certain scenarios a heap is more helpful than a table with indexes, e.g. if you have lots of competing writes but only one nightly read outside business hours for reporting.

Also, a differentiation between clustered and non-clustered indexes is rather important.

I think these indexing issues can be resolved by maintaining two different databases, as master and slave: the master could be used to insert or update records, without indexing, and the slave could be used for reads, with proper indexing. Right?
–
bharatesh, May 23 '14 at 9:51

No, wrong, sorry. Not just the content of the tables must be updated, but also the index structure and content (B-tree, nodes). Your concept of master and slave makes no sense here. What can be feasible, though, is replicating or mirroring to a second database on which analytics take place, to take that workload away from the first database. That second database would hold copies of the data and indexes on that data.
–
Der U, May 29 '14 at 16:11

Ya...! Try to read my comment and understand it properly. I said the same thing: I referred to master and slave (whatever) as "replicating or mirroring to a second database on which analytics take place to take that workload away from the first database. That second database would hold copies of data and indexes on that data."
–
bharatesh, Jun 2 '14 at 11:04

The second database, to which the mirroring or replicating is done (the slave), would experience all the data manipulation that the first one does. With each DML operation, the indexes on that second database would experience "these indexing issues". I don't see the gain in that: wherever the indexes are needed and built for quick analysis, they need to be kept up to date.
–
Der U, Jun 3 '14 at 13:23

This answer should have more votes; it complements the accepted one perfectly. Thanks.
–
j.a., Jul 24 at 10:57

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
–
Ilya, Oct 10 '14 at 8:35


Link still up; great info in there for those interested. This answer should provide quotes from the site, for the reasons stated in the comment above.
–
Michael, Dec 9 '14 at 8:15

Although the other answers are very good, I would say that:
an index is just a data structure that makes searching faster for a specific column in a database. This structure is usually a B-tree, but it can also be a hash table or some other logical structure.
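As a toy illustration of the hash-table variant, a dictionary mapping each value to the positions of matching rows behaves like a hash index (hypothetical data):

```python
from collections import defaultdict

rows = [("Alice", 30), ("Bob", 25), ("Alice", 41)]

# Hash index on the name column: value -> list of row positions
hash_index = defaultdict(list)
for pos, (name, _age) in enumerate(rows):
    hash_index[name].append(pos)

# O(1) average-case lookup instead of scanning every row
print([rows[i] for i in hash_index["Alice"]])   # [('Alice', 30), ('Alice', 41)]
```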

Just a quick suggestion: as indexing costs you additional writes and storage space, if your application requires more insert/update operations you might want to use tables without indexes, but if it requires more data retrieval operations, you should go for an indexed table.