Tuesday, June 20, 2006

I will be at FrOSCon 2006 on June 24, with a talk about Pivot tables in MySQL 5. I will arrive in Sankt Augustin on June 23, flying from Alghero Fertilia, Italy, to Frankfurt Hahn, Germany, and then driving to the conference site (travel plan). The funny thing is that, when I booked my flight, I had no idea that I was about to land in the middle of the Soccer World Cup. Being no soccer fan, I realized this only when the media circus started and I could no longer avoid being informed. Oh, well, now I understand why it wasn't easy to find a hotel, even booking two months in advance!

The slides and the source code for this talk are available from my site.

Friday, June 02, 2006

Let's say that you are building a new application, and you need to test it against a large set of data. You would need either to borrow the data from some known source or to create it yourself.

If you have such a collection at your disposal, good for you. But more often than not the data you need is not sitting ready in your repositories, and then you need a quick method to create it.

I will tell you three quick tricks to create large datasets of simple data types, namely numbers, words, and dates.

Let's assume that we need at least one million records. Here's how to go about it.

numbers

Creating a large table with just a numeric field is quite simple.

You may be tempted to run a loop in your favorite language, or even in a SQL stored procedure, but this approach would run for quite a long time. There is a better solution, one that will fill your large table in a few seconds.
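The original procedure is not reproduced here, but a minimal sketch of the idea (the table layout and the procedure name fill_numbers are my own assumptions) could look like this:

```sql
drop table if exists numbers;
create table numbers (id int not null primary key);

delimiter //
create procedure fill_numbers()
begin
    declare counter int default 0;
    insert into numbers (id) values (1);
    while counter < 20 do
        -- each pass doubles the table: the new ids start
        -- right past the current maximum
        insert into numbers (id)
            select id + (select count(*) from numbers) from numbers;
        set counter = counter + 1;
    end while;
end //
delimiter ;

call fill_numbers();
select count(*) from numbers;  -- 1,048,576
```

One seed record followed by 20 doublings yields exactly 2^20 rows, with ids from 1 to 1,048,576 and no gaps.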

Rather than inserting 1,000,000 rows one by one, this procedure inserts just one record, and then doubles the table 20 times, until we end up with a table containing 1,048,576 lines (2^20). This operation runs in less than 8 seconds on my laptop, which is quite slow compared to my usual servers.

Even without a stored procedure, you could insert one line manually and then execute this query 20 times:

```sql
insert into numbers (id)
    select id + (select count(*) from numbers) from numbers;
select count(*) from numbers;
```

And it should not take you more than 30 seconds.

words

If your case calls for a large list of unique words, you could of course build it with a program, but then again the insertion process would be quite slow to complete. A faster method is to load an existing list from a file.

All Unix systems include a word list, whose size ranges from a few thousand to half a million entries. If you don't have such a list available on your box, you can get one (or build one by merging several) from various places. Good places to start looking for words are this and this.

At the end of the exercise, let's say that you've got a list of about half a million distinct words in /usr/share/dict/words. Thus, you can proceed to build your table.
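A sketch of the load, assuming the table layout below (the table and column names are my choice, not the original code):

```sql
drop table if exists words;
create table words (word varchar(60) not null);

-- LOAD DATA is far faster than row-by-row inserts;
-- LOCAL reads the file from the client host.
load data local infile '/usr/share/dict/words'
    into table words (word);
select count(*) from words;
```

With one word per line in the source file, no field or line terminators need to be specified beyond the defaults.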

That was quick, but we got only about half of what we needed. We said that we wanted one million records, and here we have only a little more than five hundred thousand.

Since we need unique words, we can ask the database to produce the missing half by reversing the existing ones.
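The reversal takes a single statement; this sketch assumes the words table from the previous step:

```sql
-- REVERSE() flips each string, so "mood" becomes "doom";
-- this roughly doubles the row count in one pass.
insert into words (word)
    select reverse(word) from words;
select count(*) from words;
```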

Almost there. Now we have passed the million-record mark, but we are not sure that the words are all unique, since the reverse of one word could duplicate an existing word (think of mood and doom, for example). Thus, to complete the task, let's add a unique index with the IGNORE clause, so that we get rid of any duplicates.
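Assuming the same words table, the deduplicating index might look like this:

```sql
-- ALTER IGNORE (valid in MySQL 5.0, current when this was written;
-- removed in later versions) silently drops rows that would
-- violate the new unique key.
alter ignore table words add unique index (word);
select count(*) from words;
```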

dates

Finally, let's see how to create a large list of dates. Actually, you would seldom need a million dates, since a million days cover more than 2,700 years. A table of just dates, thus, will usually range from 1,000 to 10,000 records. When you need a million records, you are more likely talking about DATETIME values, with intervals of hours, minutes, or seconds. Nothing prevents you from using this technique to create one thousand DATE values, but let's keep to our original goal of filling a large table.

Then, if we want records at one-minute intervals, we can proceed like this:
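The example here might be sketched as follows (the table name and the starting date are my assumptions):

```sql
drop table if exists dates;
create table dates (dt datetime not null primary key);
insert into dates (dt) values ('2006-01-01 00:00:00');

-- Execute 20 times (or wrap in a procedure, as for the numbers):
-- each pass shifts a copy of the table forward by as many
-- minutes as there are existing rows.
insert into dates (dt)
    select dt + interval (select count(*) from dates) minute
    from dates;
```

Starting from one seed row, each doubling appends a contiguous block of minutes, so after 20 passes the table holds 1,048,576 consecutive one-minute timestamps.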

Does it look familiar? It should, since it's the same technique we've used to put a set of numbers into a table.

Only this time we use the record count to calculate the interval in minutes between the existing records and the ones being inserted. Also in this case, we double the table 20 times, to pass the one-million-record mark.

It took a little longer than the numbers, because of the overhead of calculating one million date intervals, but it was about ten seconds in total, which is a reasonable time to get a test table.

There are other techniques as well, but these three will get the job done even without dedicated tools.