If your dataset is anywhere between 1 million to 180 million then you have come to the right blog. Normally a procedure like this will require atleast 1 day or more depending on dataset. But I will teach you how to get results in as quick as 3 hrs. You will be able to execute queries on indexes fully utilizing them without going through the painful process of creating them on the the entire table. Recently at work I had to deal with a dataset of 180 million and this entry gives details of the work flow that I adopted. I will not effectively sort the complete table but will tell you how to achieve the same effect in less than 2-3 hrs for 180 million rows. I tried well known methods to achieve indexing. I tried to use Alter table statement after 2 hrs stopped it. I tried to dump the data in hope of inserting it in an indexed table but after 1 hr stopped that as well. Then I explored Mysql functionality a bit to discover 3 features which can really reduce time, those were Memory tables, Merge tables and Partitioning(not necessary). Fortunately sorting using merge sort is a recursive job, so in the process of sorting 180 million rows I also had to learn how to sort 2 million faster than normal index creation. So the first thing that you have to do is find out the order of rows that you have.

If you have less than 1 million rows then you are better off using mysql’s inbuilt commands

Alter table <tblname> create index (<column name>);

If you have more than than then read on…..

The concepts I will be using are as follows:

Memory Tables:

These tables are hash maps that mysql stores in memory. These tables are temporary (in the sense they disappear after mysql server restart) and have limited rows. However they are lighting fast on index creation or sorting.

Partitioning:

Somebody in Mysql finally realized that to be fast database has to split the data across servers. Finally in version 5 they have introduced partitioning. If you have the luxery of more than 1 server you can use this. I did not have this luxury so will do indexing without it.

Merge tables:

So these are 1 server counterpart of Partitioning. If you want to split your dataset into multiple tables and run single query on all of them you can use Merge tables. They have their own share of problems, but for general functionality of insertion, deletion, searching and dumping, they work great. The only 2 major limitations one that they only work if base tables are MyISAM and searching on them is multiples of logn(see below if interested)

Irrelevant Mathematical jargon(feel free to skip)

So if you search table on indexed columns then search takes Log(n) time (n = number of tuples). So imagine you have 100*n size of table and you create 100 sub tables of n each. Now searching on one table will take Log(n) time and total time is 100*log(n)=log(n to the power 100). If it were a single table it would have taken Log(100*n) =Log(100)+log(n). So in a single table it scale logarithmically but merge table scales exponentially.

I will explain the steps that I took below, you can customize them to your dataset.

There are several ways of creating index however mine will require you to have quite a powerful machine and a reliable power supply. Atleast 2gb of ram is required and more is better here. If you have around 2 million rows then skip to step 4.

Step 0

This step is about reconnaissance. It will determine the numerical data in the next steps. You need to know the approximate amount of space your row takes. This will determine the size of your memory table. You can determine the size of memory table by trial and error also but a numerical estimate will help. In the steps below I will assume initial table as client_data and final table as client_data_sorted. All other tables are temp and generated on the spot.

Step 1: Increasing allowed size of Memory table to fit data

The first thing you need to do is extend the allowed memory limit of mysql. To sort faster we will need memory tables. Do so by adding the lines marked as red below under [mysqld] in my.cnf file. Typically located in /etc/my.cnf or /etc/mysql/my.cnf depending on your distribution.

Step 2: Dump data in memory table for indexing

The second line alters the engine to memory. The only thing to keep in mind is that memory engine keeps all the data in memory hence if your mysql server restarts or machine restarts then all the data is lost. Ensure that you never rely on them as a reliable storage.

Figure out how many rows your memory table can contain. In a typical scenario this would be order of 10 million. This value will change depending on your system’s memory and values set in step 1. You can test it out by inserting data from source table to memory table. After the limit the process will interrupt in the middle giving error like table is full.

Step 4: Storing data from Memory table to MyISAM table

If you have order of 10 million rows then the process stops for you here. Just use an alter table command to create index on memory table(3-4 minutes) then insert the memory table data into client_data_sorted table. You just saved yourself hours of pain. If you have more than 10 million then skip this step.

Alter table client_data_memory add index (<column name>);

(or sort the table)

Insert into client_data_sorted select * from client_data_memory;

If you can store 10million rows in your memory table and total tables are around 40 million then you are better off iteratively repeating the above steps. Merely sort 10 million then insert, then truncate memory table , then insert next 10 million , sort then insert. The process will take exponentially more time every time but still will finish far faster than normal. If you have more than 40 million then read on. A world of pain awaits you….

Step 5: Create Merge table for accessing data

The above process will not work iteratively for you if you have more than 50 million rows as while inserting you have to merge too. Suppose your memory can store 10 million and you have 180 million rows then create 18 temporary tables of engine type MyISAM.

This will take quite a long time, for me it took 10 min for every 10 million so 180 min in total. Meanwhile consider some of the more worthwhile alternatives like hive, its too late for it now but its worth a look. This is also useful.

Now create a merge table. A merge table is created on top of several MyISAM tables of same structure. Its important to have same structure and MyISAM tables. You can use this as a single table however all the searching is performed on all tables and hence is slower than single table.

Step 5: Final Sorted table(if required)

For most temporary uses this merge table will work. You can use this table normally. Insert data into the table etc etc. However if the limitations of merge tables are not acceptable you have to create the final table from this merge table. You can use merge tables to dump the data in a sorted fashion according to indexes or simply use

Insert into <tblname> Select * from <merge table> order by <index1>,<index2>…..

This way you get to Insert much faster as the data is already sorted and mysql simply has to insert the data and update the indexes. If this method helped.. do let me know in the comments.

Very few people know that every synaptics trackpad supports multitouch and as long as your OS supports it you can always use multitouch to enhance your computers interface. Apple has patented several UI design elements ,hence has inhibited Windows and Linux to come out with similar UI elements like pinch and zoom, scroll etc.

If you were looking for MP4/Mpeg4 Part 10 CoreAVC vs FFmpeg video codec performance review then click here

Codecs:

Though there are several codecs that perform the encoding owing to the universal use of this encoding in dvds, common implementations are Nero, Divx, Xvid etc. Among these arguably the most popular codecs are Xvid and Divx. We will be seeing side by side comparison of both in the article. You can find both codecs individual history on their wiki pages. I find the history of Divx quite amusing,They adopted their name Divx to mock a company that used to charge users for viewing by the hour and themselves used spyware in their later version hence becoming a subject of mockery. You might have already noticed that “Xvid” is “Divx” backwards.

I will talk of decoding performance here. Decoding probably makes much more sense to a normal user as encoding is done by few but everybody decodes videos. Decoding is quite a subjective criteria hence besides my own comments I have also attached screenshots that allow the user to make his/her own judgment. The codecs were taken from Final build site.

Xvid
There is a very good chance that any video that you come across is encoded with Xvid. Across my search of encoders I found Xvid is the most preferred one for encoding. Its opensource hence free, much more configurable hence lets you draw the last byte’s worth. Though it seems logical that if video is encoded with encoder its respective decoder should also be the best, however I found the results didn’t fare as I expected.

Divx

The good old closed source version. Divx is costly and costing is on per PC basis hence encoding costly. Decoding can be done for free using their web player. The pro version is not free however there are other ways of using the codec in Media Player classic shipped with Klite Codec as explained below.

Here is the screenshot of Gspot codec analyzation. This provides the codec information with which the file was encoded in:

2. Audio Encoding was done by MPEG-1 Layer 3 more popularly known as MP3

3. The default resolution is 352×272 and should be judged at that resolution

Further information can also be seen from the screenshot

PS: you can also use other codec information utils like AVI codec etc.

Following are the screenshots taken from the video with different filters(All the left screenshots are Divx and right ones are Xvid)

Click the pics and view them at full-size and compare them. As the videos clearly depicted the Divx Filter far outperforms Xvid Filter in visual apeal.

There is a very important decoder that I have not mentioned here, that is the FFmpeg decoder(default decoder in most open players like mplayer, media player classic and vlc). This decoder generally uses libavcodec to decode the media files and even though the performance is not as good as the above decoders but it is good enough but the difference in CPU usage is extreme. Ex for a normal movie where Divx and Xvid take as much as 25-30% CPU usage, FFmpeg will take barely 10-15%. So if its quality you are looking for then look no further than Divx but for performance nobody matches FFmpeg

So overall the results are quite ambigous. If you are looking for quality then Divx, for free encoding/decoding Xvid and for decoding performance FFmpeg . You can download the filters from here, and you can learn how to switch your filter from here. So what are you going to do with your player???

MPEG-4 is used for AV data for web (streaming media) and CD distribution, voice (telephone, videophone) and broadcast television applications.MPEG-4 adds new features such as (extended) VRML support for 3D rendering, object-oriented composite files (including audio, video and VRML objects), support for externally-specified Digital Rights Management and various types of interactivity. AAC (Advanced Audio Codec) was standardized as an adjunct to MPEG-2 (as Part 7) before MPEG-4 was issued.The key parts to be aware of are MPEG-4 part 2 (MPEG-4 SP/ASP, used by codecs such as DivX, Xvid, Nero Digital and 3ivx and by Quicktime 6) and MPEG-4 part 10 (MPEG-4 AVC/H.264, used by the x264 codec, by Nero Digital AVC, by Quicktime 7, and by next-gen DVD formats like HD DVD and Blu-ray Disc).

If you were looking for MPEG 4 Part 2/ AVI video codec performance review: Divx vs Xvid then click here

Codecs:

As told previously 2 parts of MPEG-4 format are popular. There are several codecs available that implement MPEG-4 with different level of perfection, popular ones are Quicktime(Part 10), FFmpeg(Part 10), Xvid(Part 2), Divx(Part 2) and CoreAVC(Part 10). After consulting several forums, I found 2 of those very talked about CoreAVC and FFmpeg. I have compared the 2 codecs below.

FFmpeg
Whether you are a codec geek or not, this is the most probable codec that you have been using since time immemorial. Its free, opensource, can allow a lot of tweaking, plays a major number of containers and codecs. I will try not to go beyond H.264 in this article but one thing is for sure, whether you are using CoreAVC or not, you definitely have to use FFmpeg due to its overwhelmingly large number of codec compatibility.

CoreAVC
Its a closed source alternative for decoding H.264, infact its one of the accepted formats for Blue-ray. When it came out it astonished everyone with the sheer speed and performance outputs. Its known to exceed several hardware implementations (reminds me of John Carmack implementation of square root in Quake 3 which was faster than FPU). Corecodec people call it the fastest codec on earth.(and I am nobody to argue).

Left one is CoreAVC and right is FFmpeg

Doesnt take a rocket scientist to notice that better one. However when I discussed this on outside forums, I faced considerable shouting and screaming so I did more tweaking and testing and found another bit of information that convinced me to use CoreAVC for good.
Left is CoreAVC and right is FFmpeg.
Top is W/O post processing and bottom is with Post Processing

Difference between CPU usage when using FFmpeg and CoreAVC is huge. It can be seen from the above CPU usage or by seeing CPU cycles consumption . Considering that I have a Core 2 duo system with 1.8Ghz this difference is a substantial one, infact with preprocessing the player hung at 100% cpu usage in FFmpeg.

I found out to my amazement that Corecodecs are actually the fastest codecs in the world. I used all possible codecs at my disposal(VLC, ffmpeg, quicktime, windows media player and Nero) but CoreAVC not only gave better performance compared to all of them but also astonishingly smooth one at that. I ran it on 1080p trailer of 10000BC on MPC using quicktime alternative(had to rename .mov to .hdmov to use coreAVC) and performance was mind boggling.
Bottomline, I’d suggest you one thing. just go right now to Final build site and download the coreavc format and start using it. You can find out how to do that here.

Well, you are an opensource user. Your geek rating is much higher than any of your other puny friends on facebook. You bask in the glory of being clean and untainted. But dont confuse being opensource as “not being evil”. The real trick is to use the good will of opensource and still use it for your nefarious (defination) purposes. Behold the sins of Opensource and how/where to use it in your own company:

As Strategy (aka Greed): Most evil are those who employ opensource as a strategy, this not only allows them to publisize there product but also gives them a GNU loving status. Google like always leading the way with Google Gears and Libjingle. By making Gears opensource it caused a 3 way effect

It allowed the people to look into the code and ensure that there are no google spies sitting in it, hence building trust.Without this nobody would opt for it

It also became instant favorite to all those who hate The Microsoft way, giving them the satisfaction of using an opensource alternative on an opensource platform(firefox).

Allowing other people to step in with it and start making more tools to replace more softwares(ZOHO, Offline WIkipedia etc)

So using it as a stratigic weapon you can assasinate a lot more quickly than otherwise.

As a User (aka Gluttony): You reading this blog in firefox and me writing it in wordpress doesnt make things a whole lot better. If you have used a product of opensource and have never cared to:

Submit a bug report.

donate for it(and are never planning to).

Contribute back.

then you are evil. You are another one of those selfish people who just wanna use the product for free and run away with the benifits. I wouldnt blame you for it but we are all sinners

As a developer(aka Pride): If you are a developer associated with a opensource community because of any of the following reasons then you are also a sinner:

Even if you see your own project suffering because of unsufferable GUI, you still stand by and keep working on core internal library which is already ruthlessly efficient. Since GUI is one of the most boring things to work on you choose to ignore it and because of that your project suffers. This is the sin that leads to poor interfaces of so many opensource software.

You have been laid off, and just want to code to keep your skills sharp and to build your CV and have no interest in the concept of “greater good” and benifit of all man kind.

As a startup(aka Sloth): If you are a startup or for that matter any company and you are using opensource software because of any of the reasons then you are also sinners:

Using your expert skills you are able to disprove Microsoft’s TCO model and use opensource as cheap software.

Use plenty of opensource tools in your company but have never released any of the scripts etc neither have cared enough to put even a simple logo on your official site.

You rely of free service of opensource community that is 10 times larger than that of any company’s support staff.

Worse are all are those developers who use the free experience provided by opensource community in development of features that they eventually want to put in there own software(best oppurtunity is GSOC).

Disclaimer: I’d like to apologize to anybody who feels offended. This article fundamentally represents how opensource is practically being used by different catagory of people(coders, users and companies). There are several other ways of employing/deploying opensource which will be discussed in future articles.

GNU license (whether 1 or 2 or…) makes sense to few. But whats more important in this world is compatibility. As Linux grows, the arrogance of few are blocking the freedom that comes with it. Its sad when somebody giving something as “free” also wants people using it to be bound by their philosophy, its like a religion being forced upon others.

Recently bought a PCI-express gigabit ethernet card(Agere chipset ET1310)]. I put it in linux (secretly praying for a stable driver), reboot and pop goes ubuntu, restricted driver found, turns out the restricted driver is put up, under BSD license. And so another day goes by where ubuntu saves countless hours of setup time and futile searching and debugging, thanx to a few people who can accept other people for not following their philosophy as bible.