Monday, November 28, 2005

As you know, the MySQL 5 stored routines language does not provide any sort of array structure. While this is not a problem when dealing with normal SQL, since arrays are a symptom of non-normalized data, when using a programming language the lack of them becomes quite a strain: the programmer often struggles to achieve results that would be natural and simple with an array, but are made unnecessarily hard without one. Thus comes into existence this library of routines that emulate arrays in MySQL 5.

Implementation

Arrays, as part of the MySQL General Purpose Stored Routines Library, are implemented as datasets, all belonging to one table, and identified by a unique name. The details of the implementation are hidden by the interface, and you should not worry much about them.

General concepts about the array library

Whenever possible, each routine is implemented both as a procedure and as a function. Only when a function returns an array value or an array property is there no corresponding procedure. This is a design choice, so that you are free to call the routine in the most convenient way. When you use a procedure that has a function synonym, a user variable with the same name as the procedure is set to the function's return value. For instance, calling the function array_create() returns the array ID.

If you choose to call the procedure instead, then @array_create will contain the array ID.
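A quick sketch of this duality (assuming the library is installed; the array names are just samples):

```sql
-- function form: the array ID is the return value
select array_create('mylist', 5);

-- procedure form: the array ID ends up in @array_create
call array_create('mylist2', 5);
select @array_create;
```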

Array creation

You can create an array using array_create(name, size). 'name' is any string not yet used as an array name; if you want to be sure that the name is not taken, use the function array_exists(name). 'size' is the number of items that you want to create (they will be set to NULL). Using a zero size will create an empty array.

call array_create('myfirst',0);

You can also create an array from a string containing a comma-separated list, with the function array_from_list(string, array_name).

If you want to use a different separator, you may want to look at array_from_list_complete(string, array_name, separator).
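For instance (the array names here are made up):

```sql
-- comma-separated list, default separator
select array_from_list('one,two,three', 'words');

-- the same list with a custom separator
select array_from_list_complete('one|two|three', 'words2', '|');
```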

Arrays in this library can also behave as associative arrays, known as hashes in Perl or maps in C++: you can set a value by key rather than by index. array_set_value_by_key(array_name, key, value) is the routine for this, and array_set(array_name, key, value) can be used as an alias for it. Notice that when you set a value by key, the array still keeps its automatic index, so you can always access an array by index, even if you set the values by key.
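A small sketch of access by key (the array name and values are made up):

```sql
call array_create('capitals', 0);
call array_set('capitals', 'Italy', 'Rome');
call array_set('capitals', 'France', 'Paris');

-- the automatic index is still available:
select array_get('capitals', 'Italy');  -- by key
select array_get('capitals', 0);        -- by index
```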

First in, first out is the principle of queues. If you want this behavior, insert your values using array_unshift(array_name, value), which adds a value at the beginning of the array, and retrieve them using array_shift(array_name), which returns (and deletes) the last element in the array. Notice that array_unshift is the most expensive method in this collection, because all the array indexes need to be decreased by 1.
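A queue sketch, following the semantics described above (the array name and values are made up):

```sql
call array_create('jobs', 0);
call array_unshift('jobs', 'first in');
call array_unshift('jobs', 'second in');

-- FIFO: per the semantics above, this should return 'first in'
select array_shift('jobs');
```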

Last in, first out is the principle of stacks, one of the most used devices in programming. array_push(array_name, value) adds an item to the end of an array, and array_pop(array_name) retrieves the last item and deletes it.
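A stack sketch (again with made-up names):

```sql
call array_create('stack1', 0);
call array_push('stack1', 'a');
call array_push('stack1', 'b');

-- LIFO: this should return 'b', leaving only 'a' in the array
select array_pop('stack1');
```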

Retrieving data

No matter how you entered data into an array, you can always retrieve your items by index. The lowest index is zero; the highest is returned by array_max_index(array_name). Knowing this, you can quickly create a loop to walk through an array, using array_get_value_by_index(array_name, index). If you inserted your values by key, you can retrieve them in the same way: array_get_value_by_key(array_name, key) does the job. For your convenience, there is also an array_get(array_name, key_or_index) function that you can use to get values by key or index. If you want to set a key for an existing array value, you can use array_set_key_by_index(array_name, index, key).
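The loop described above could be sketched as a stored procedure (the procedure name is made up):

```sql
delimiter //

create procedure show_array(p_name varchar(100))
begin
    declare i int default 0;
    -- walk the array from index 0 to the highest index
    while i <= array_max_index(p_name) do
        select array_get_value_by_index(p_name, i) as item;
        set i = i + 1;
    end while;
end //

delimiter ;
```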

Information on arrays

To see the list of all available arrays, use array_list(), which prints a list of all public arrays. Private arrays are all arrays whose name starts with an underscore character ('_'). These arrays are usually created and used inside a procedure and should not be shared. If you want to see them as well, use array_full_list().

Advanced array usage

Inserting several items at once

The array routines were designed to allow flexibility in insertion. In several programming languages you can fill an array with just one statement, be it a declaration or an action. Of course, the quickest way is to use the array_from_list function, but if you already have an array and need to add more items, you can do one of the following:

Populate an array from a dataset

To achieve this result, we need help from another part of the library. For example, if we want to insert into an array the list of USA cities as keys, with their population as values, we can do this:

call for_each_table_value(
    'world',              -- the database
    'City',               -- table name
    'Name',               -- the first column to use
    'Population',         -- the second column to use
    null,                 -- no need for a third column: null
    'CountryCode="USA"',  -- filter clause
    'call array_set("USA_cities","$I1","$I2")');

In this statement we invoke the for_each_table_value engine to execute our query for each row in table City where CountryCode = 'USA'. The query to execute calls the array_set procedure, using the columns as arguments ($I1 and $I2). As a result, we'll have an array of 264 rows, containing all USA cities with their population.
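Once filled, the array can be queried by key like any other (the city name is just an example):

```sql
select array_get('USA_cities', 'New York') as population;
```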

If your dataset is reasonably short, there is another trick you can pull without using a 'for each' routine: collapse the dataset into a list with group_concat. Let's assume that you want to put into an array just the districts for Italy, taken from table 'City' in the example world database.
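The original example is not preserved in this copy; a sketch of the idea, assuming the world sample database is installed and using a made-up array name, might look like this:

```sql
-- feed the group_concat result straight into the array loader
select array_from_list(
    (select group_concat(distinct District)
       from world.City
      where CountryCode = 'ITA'),
    'italian_districts');
```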

Notice that if the list of entries is longer than 1024 bytes, group_concat will truncate it, and you need to increase the group_concat_max_len system variable to get correct results. But for short lists this is a trick to keep in mind. Even better, using array_from_pair_list, you can populate an array with a list of key/value pairs:

set group_concat_max_len = 8000; -- necessary, or group_concat can't work well in this case
select array_from_pair_list(
    (select group_concat(concat(Name, "=>", Population))
       from world.City
      where CountryCode = 'USA'),
    'USA_cities')\G

And USA_cities will get the same contents we have seen when using the for_each_table_value method.

What else can you do with arrays? Wonderful things, as you'll see in the next articles about this library.

Sunday, November 27, 2005

There are rumours in the MySQL community that some sort of routine repository is going to be organized sooner or later, and after I heard them I almost dismissed my initial idea. However, some people convinced me that starting such an initiative could always be beneficial for the community, and after all, if the official repository shows up, it would be easy to merge this work into it.

Therefore, here it goes. This is the first article of a series that will explain what I have made in the field of general purpose routines.

The project is available at savannah.org. The documentation is still in the making, so I will start with a general overview.

PURPOSE

This routine library will collect general purpose MySQL 5 stored procedures and functions. What is to be considered "general purpose"? Any routine that enhances the language itself or that can be useful in an abstract database fits the bill.

Since its appearance, the MySQL stored routines language has proved itself useful and powerful, and yet it lacks many features that we have become accustomed to in other languages and would like to see in MySQL::SP as well. Stored routines that improve the language's expressiveness and usability fall in this category.

Furthermore, there are common tasks that can be accomplished by a stored routine and are not related to a particular database. These routines, too, are good candidates for this library.

CURRENT CATEGORIES

At the very beginning of this project, I set up some categories, which got populated faster than others, and thus became, in my view, the ones with the highest priority.

ARRAYS AND BASIC DATA STRUCTURES

Perhaps the biggest disappointment when the MySQL::SP language was disclosed was the lack of arrays. Not as a SQL type, which frankly I can do without, as it breaks normalization, but as a language component. Without arrays, and without things like queues and stacks, I feel that my programming abilities are severely cut down. Thus this first component, which implements named arrays where you access items by index, but which can also be treated like Perl hashes, where items are accessed by name. Built in with them are some stack and queue features, so that an array is automatically a queue, if accessed with shift/unshift, or a stack, if accessed with push/pop. Further routines to transform arrays to strings and strings to arrays make this library ready for prime time.

"FOR EACH" LOOPS

How often do I need to get a list out of MySQL, treat the list with some sort of filter, and then feed it back to the database? Sometimes I fire up a quick Perl script to do the whole task, but most of the time the whole thing would be just a simple matter if MySQL had a FOR loop that you could use from a regular SQL statement. This component came out of this need. Now I have routines that accept some parameters and perform a loop for me with a counter, or using a dataset of table values, or walking through an array.

NAMED PARAMETERS

In the process of making the "for each" routines, I came to the disappointing discovery that I had to use quite a lot of parameters if I wanted to achieve some powerful results. When you have routines that need more than three parameters, as every programmer knows, it is quite difficult to get the order right, and many mistakes can come from this situation. Thus, I wanted to implement a simple way of calling my complex routines with named parameters, something that in Perl I do by passing name/value pairs to a subroutine.

This way, if I misplace one of the parameters, it does not matter, because they are retrieved by name. My routine does almost this ("almost", because the limited power of the language prevents an exact implementation). It lets you pass an array of named parameters (key + value), and a hidden mechanism sorts out the parameters and passes them to the receiving routine in the correct order.

As a bonus, I now have an array of parameters that I can reuse when I needthem, perhaps after changing one or two parameter values.

SYNTAX HELPERS

As I said, my routines can grow their parameter lists quite fast, and remembering all of them can be difficult even for their author. Here comes another component, which registers the syntax of all the routines and has a way of showing you a quick excerpt of their usage. This helper is also a key component of the routines that deal with named parameters. Just as an example, the reference guide that is currently in the repository was created with just a couple of calls to these routines.

The first parameter of the DBI connect() call is compulsory, and it's made of "dbi", which is the same for all connections; the database driver, in our case "mysql"; and the database name, in our case "test". For a self-respecting database, we should also specify the user name and password. There is nothing to complain about if this script is safely stored in your home directory, screened from prying eyes by the OS permissions. But what happens when you need to distribute this script to somebody? You need to remember to change the user name and password to some dummy text, to avoid the burden of having to change your real credentials if by chance you send away your script with the password for everybody to see.

Luckily, there is an easy way to hide connection variables from others: store them in a configuration file, which can be kept in a guarded location and never distributed. It will be visible to your script as long as the script stays within the boundaries of your system, but it will be invisible to others; moreover, it won't follow the script if you need to send it away.

It works like this. In your home directory (or the home directory of the user that will execute the script, for example, for a cron job) you create a file named ".my.cnf", containing the following:

[client]
user=user_name
password=my_secret_password

The file needs to be protected with the right privileges (chmod 600 .my.cnf) so that it can't be accessed by other users. Then, you change the connection statement, replacing user name and password with two undefined values and pointing the DSN at the configuration file.

The "mysql_read_default_file" option points to the file with the connection variables. Notice that instead of user name and password there are two null values (undef). The file name, in this case, is built from the environment variable "$HOME" (on Unix-like systems). You could also state it literally, but this trick makes the script usable by different users without modifications. The only thing each of them needs is a configuration file. For more stuff on Perl/DBI, see DBI recipes.

Tuesday, November 08, 2005

Data on Demand is a paradigm for storing data in a data warehouse with huge storage requirements. There are cases where creating statistical tables is not enough, and the detail data should be kept online for long periods. This technique has been applied to MySQL data warehouses associated with industrial production control, saving up to 80% storage space and improving performance and scalability.

Saturday, November 05, 2005

MySQL's success is due not only to its efficiency, but also to its ease of administration. And despite what its detractors say about lack of features (that was before 5.0, of course), there is one feature that is crucial for enterprise database management, and MySQL has had it built in since its early versions. I am talking about data replication, which can transform a single server into a scalable and highly reliable system with minimum effort and expense.

Recently, one more element was added to MySQL's set of features. MySQL Cluster is an ambitious engine that turns your servers into a full-fledged cluster that can be accessed from any node without worry. Every node is master and slave at the same time. However, these features come at the price of higher hardware requirements (MySQL Cluster is, as of today, an all-in-RAM database).

MySQL data transfer methods

There are differences between these two approaches: choosing between MySQL Cluster and standard replication is a trade-off.

replication: asynchronous transfer, based on one master and N slaves.
    The good: fast, reliable, robust, easy to set up, no special hardware required, transaction-safe.
    The bad: needs some care in the code.

MySQL Cluster: synchronous transfer, based on a cluster engine.
    The good: fast, no code modification needed.
    The bad: requires huge amounts of RAM, not easy to set up, minimal transaction support.

If you start from scratch, it's easy to build your applications in such a way that they are replication-aware, thus benefitting from the low-cost, high-scalability qualities of this paradigm. On the other hand, if you have an application that is already established (and perhaps complex and large), adapting it to the one-master-many-slaves architecture could be a problem. There are external tools that promise to fill the gap and offer clever solutions, mostly based on in-between filtering layers that give your application the illusion that there is just one server, while in truth there are several. These solutions, too, cost you something, not only in mere money (since they are, with limited exceptions, commercial tools), but also in features, because the filters are mostly not transaction-safe.

Different needs for data transfers

Aside from the mere method of data movement, there are other enterprise needs. Notice that this is not some stretch of my imagination, but the practical result of several years of consulting in various database-related matters.

One thing that most medium to large companies want is the reverse of replication, i.e. they want the master to be updated by many slaves. The need is clear when you think of several branches of a company that must send updates about sales records and stock levels. Unfortunately, simple replication does not offer this feature, and even the cluster cannot easily be bent to this need. Another thing that companies want is conditional replication, based on rules that are far beyond the limited configuration range of both replication and the cluster. For example, some companies wanted a table migrated from server A to server B, but into different tables, depending on a time frame.

Enter MySQL 5, and a new world of possibilities is open to the clever administrator.

New data transfers paradigms: FEDERATED tables

Reverse replication can be achieved with tables using the FEDERATED engine. For each table you need to replicate from the branch database to the head office, you just need to establish a FEDERATED table and a couple of triggers. Let's say, just for the sake of it, that you have the main server at big.central.mycompany.com, and the branches at far.away.mycompany.com (code name: FAW) and at fairly.distant.mycompany.com (code name: FDI). The main table you need to update at big.central lives in the "biz" database.
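The original table and trigger definitions are not preserved in this copy; a minimal sketch of the mechanism, with hypothetical column names and credentials, could look like this (FEDERATED with a CONNECTION string requires MySQL 5.0.13 or later):

```sql
-- On big.central.mycompany.com, database "biz"
-- (hypothetical columns: the real table is not shown here)
CREATE TABLE all_sales (
    branch_id CHAR(3),
    sale_date DATETIME,
    amount    DECIMAL(10,2)
);

-- On far.away.mycompany.com: a FEDERATED "window" on the central table
CREATE TABLE central_sales (
    branch_id CHAR(3),
    sale_date DATETIME,
    amount    DECIMAL(10,2)
) ENGINE=FEDERATED
CONNECTION='mysql://repl_user:repl_pwd@big.central.mycompany.com:3306/biz/all_sales';

-- A trigger copies every local insert to the central server
CREATE TRIGGER branch_sales_ai AFTER INSERT ON branch_sales
FOR EACH ROW
  INSERT INTO central_sales VALUES ('FAW', NEW.sale_date, NEW.amount);
```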

Similar triggers (with a branch_id of 'FDI') will be created in the second branch. Now, every record inserted into branch_sales is replicated to the central table without human intervention. For extra safety, you may add some code to check that the record does not already exist before inserting it.

New data transfers paradigms: BLACKHOLE tables

Have you heard about the BLACKHOLE engine? It's a simple concept: a table with a full description that does not store anything. Everything you send to such a table is simply discarded, which is not surprising, since there is no storage device associated with BLACKHOLE tables.

What's the use, then, you may ask? The use is to take advantage of the side effects of such a table, rather than its storage capabilities. For one, records sent to a BLACKHOLE table leave a trail in the binary log, thus allowing you to replicate their data through a server that does not have much storage available, when you need it just as an intermediate level to spread your data to several more slaves.

Moreover, and here we come to the point of our quest, BLACKHOLE tables can fire triggers, and we are going to use them to meet the second need we have outlined, i.e. the conditional update. Let's say that big.central.mycompany.com wants the sales records sent to different tables depending on the time of day (it's a silly example, but bear with me for a while, just pretending it's something sensible). However, the application that deals with the data transfer is already working, and nobody dares to touch it, for fear of disrupting something. A simple change can achieve the goal: let's change the structure of all_sales.
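The original statements are not preserved in this copy; a hypothetical sketch, reusing the column names assumed earlier and two made-up destination tables, might read:

```sql
-- all_sales stops storing anything; inserts become pure events
ALTER TABLE all_sales ENGINE=BLACKHOLE;

-- morning_sales and afternoon_sales are assumed to be real tables
-- with the same structure as all_sales

DELIMITER //
CREATE TRIGGER all_sales_bi BEFORE INSERT ON all_sales
FOR EACH ROW
BEGIN
    -- route each incoming record by time of day
    IF HOUR(NEW.sale_date) < 12 THEN
        INSERT INTO morning_sales VALUES (NEW.branch_id, NEW.sale_date, NEW.amount);
    ELSE
        INSERT INTO afternoon_sales VALUES (NEW.branch_id, NEW.sale_date, NEW.amount);
    END IF;
END //
DELIMITER ;
```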

Similarly, you can create triggers for updates and deletes, and of course you can get creative and update several tables at once with different parts of the incoming data. There: now you have some food for thought to launch yourselves into a new data transfer enterprise.