By default, SQL Server does not provide built-in functions for pattern matching with regular expressions. If you want to use Talend Open Profiler to analyze how your data matches some regular expressions (for example, to analyze a column that contains email addresses, dates, …), you need to install a CLR User-Defined Function (UDF) on your SQL Server.

Amine, who did an internship at Talend, has tried it. He provided us with a step-by-step guide on how to create the UDF, install it on your SQL Server, and call it from Talend Open Profiler. This will allow you to use all the predefined patterns to check the quality of your data.

You can also find new regular expressions for Talend Open Profiler in Talend Exchange.

1. Create an SSH key with “ssh-keygen” (without a passphrase, otherwise you will be prompted for it each time you do an update).

2. Connect to the remote server via SSH.

3. Edit or create the file “.ssh/authorized_keys” on the server and paste in the content of your local “.ssh/id_rsa.pub” file (if the “.ssh” folder does not exist, run “ssh-keygen” there to generate it, then remove the generated key files “id_rsa” and “id_rsa.pub”, which are not needed on the server).
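The three steps above can be sketched as a short script; this is a minimal sketch assuming an OpenSSH client, and “user@remote-server” is a placeholder, so the copy step is only shown commented out:

```shell
# Step 1: generate a key pair without a passphrase (here into a scratch
# directory; normally you would accept the default ~/.ssh/id_rsa path)
mkdir -p /tmp/sshdemo
rm -f /tmp/sshdemo/id_rsa /tmp/sshdemo/id_rsa.pub
ssh-keygen -q -t rsa -N "" -f /tmp/sshdemo/id_rsa

# Steps 2 and 3: the public key is what goes into the remote
# ".ssh/authorized_keys". With a reachable server, ssh-copy-id
# does both steps in one go:
#   ssh-copy-id -i /tmp/sshdemo/id_rsa.pub user@remote-server

# The public key is a single line starting with the key type:
head -c 7 /tmp/sshdemo/id_rsa.pub   # prints "ssh-rsa"
```

On systems that ship ssh-copy-id, it also creates the remote “.ssh” folder and the “authorized_keys” file if they do not exist yet.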

What’s the link between quantum physics and data quality? In a recent discussion, Jim Harris suggests that data quality projects are subject to a measurement problem, as in quantum physics. He asks:

“When does a data quality project stop existing as potential success or failure and become one or the other?”

The question makes sense, but I don’t think there is any paradox in it. In my opinion, as long as we do not measure the data quality, nothing can be said about the success or failure of the data quality project. It’s like any probabilistic matter: as long as the throw of a die has not been read, we don’t know the outcome, but there is an outcome. In other words, the status of the data quality project is in an undefined state, but it is certainly not in both states at the same time like Schrödinger’s cat.

Now what could be common to quantum physics and data quality?

I think that something common to both fields is that a context is required in order to perform the measurement. Data quality is contextual. We cannot speak about quality in an absolute manner (see this discussion). In some contexts the data will be of good quality, whereas in other contexts it could be of bad quality. It depends on the intended use or purpose, as Henrik Liliendahl Sørensen says. But that is not yet sufficient to be quantum. Other features are required before we can speak about quantum quality. I suppose that we would need to identify non-commuting contexts in data quality projects. This would mean that the order of the measurements of data quality plays a crucial role. I am not sure that we can exhibit this kind of behavior. And we probably need to define all these concepts more precisely before we can show that data quality has anything quantum about it.

Here is a tip for detecting random text with regular expressions. But before I give the expression, let me explain when and why it is useful.

Nowadays, we find web forms almost everywhere on the Internet. These forms are used to collect data entered by users. If the form is well designed, the required fields cannot be left empty by the user. Other checks can also be performed, for example on the address, the country, or the phone number format.
Sometimes these checks are not done, either because the field is a free text field or simply because the developer of the web site did not implement them. In that case, the user can type anything in the text area, and then anything can end up in the database.

If a field is required but the user does not want to fill it in, what is going to happen?
Either the user goes away and does not fill in the form, or he types something like “lmqkjfdmklgj”.

If you are doing data quality, you may need to identify this kind of bad data. The question is: how do you detect something that can be anything but clearly has no meaning?
Several solutions exist. The first one could be to check that the data is composed of real words. For this solution, you need a dictionary, and it should be rather complete in order not to miss any words. And then, how do you handle proper names?

Another solution is to use regular expressions. But what would be the regular expression that matches random text?

When I looked at my keyboard (an AZERTY keyboard, as in the picture above), I saw that all the vowels are on the second row. Moreover, the default starting position of the hands on the keyboard is to put the left index finger on the “F” key and the right index finger on the “J” key. These keys are on the third row of keys (called the home row). This means that when you type something randomly, there is a good chance that you will type only letters from the home row. And on a French keyboard, this means that there will be no vowel in the entered text.

Given these considerations, random text is a string of characters without any vowels. A regular expression to match it could be something like:
[zrtypqsdfghjklmwxcvbnZRTYPQSDFGHJKLMWXCVBN]{4,}
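You can try the expression quickly from a command line before loading it into the profiler; here grep -E plays the role of the matcher, and the sample strings are made up:

```shell
# A class of all consonants (lower and upper case), repeated 4 times or more
PATTERN='[zrtypqsdfghjklmwxcvbnZRTYPQSDFGHJKLMWXCVBN]{4,}'

# Keyboard mashing: a long run of consonants, so it matches
if echo "lmqkjfdmklgj" | grep -qE "$PATTERN"; then echo "random"; fi       # prints "random"

# Normal text: no run of 4 consecutive consonants, so no match
if ! echo "hello world" | grep -qE "$PATTERN"; then echo "looks fine"; fi  # prints "looks fine"
```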

This expression matches any run of 4 consecutive consonants. Maybe that’s not enough, and some real words will be matched by this expression; for example, it matches the word “length”.
You can either require at least 5 consecutive consonants or restrict the expression to the letters of the home row:
[qsdfghjklmQSDFGHJKLM]{3,}
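A quick check shows that the restricted expression still catches the mashed string while no longer flagging “length”, whose “n” and “t” are not home-row keys:

```shell
# Only the ten keys of the French home row, repeated 3 times or more
HOMEROW='[qsdfghjklmQSDFGHJKLM]{3,}'

# The whole mashed string is made of home-row keys, so it matches
if echo "lmqkjfdmklgj" | grep -qE "$HOMEROW"; then echo "random"; fi   # prints "random"

# "length" has no run of 3 home-row letters, so no match
if ! echo "length" | grep -qE "$HOMEROW"; then echo "real word"; fi    # prints "real word"
```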

Try it on your data with Talend Open Profiler. You can either create your own “pattern” or download it from Talend Exchange.

On English keyboards, the vowel “a” appears in the home row. This adds some difficulty because there are probably several words that can be formed with “a” and the letters of the home row. I’ll let you adapt the regular expression to your needs and your keyboard…

Here is an interesting screencast showing how to create, in 16 minutes, a web repository to manage your data quality projects.

This application allows you to define data quality tasks and assign them to people. If you are starting a new data quality project, this example application can really help you to structure and organize your project.

In the latest milestone version of Talend Open Profiler, it’s possible to define tasks attached to your analyses or to other items such as a database table or column… This feature should help you manage your data quality projects at the user level. Try it.

A new service is offered on the Talend Forge website. It’s called Talend Exchange and aims at providing a place where you can share TOS components, TOS templates, TOS jobs, and TOP patterns. It is an enhancement of the service previously called “Ecosystem”.

I have added a few new regular expressions for Talend Open Profiler which may be useful to French users. You can download them and import them into Talend Open Profiler-1.1.x via the “import patterns” menu available on the “Libraries/Patterns” folder.

You can also add your own regular expressions. For the Talend Open Profiler-1.1.x platform, you need to create a simple xls file with the following columns:
* C1: Label: the label of the pattern
* C2: Purpose: the purpose of the pattern
* C3: Description: the description of the pattern
* C4: Regular expression: the regular expression applicable to all databases
* C5: MySQL Regexp: the regular expression applicable to MySQL databases
* C6: Oracle Regexp: the regular expression applicable to Oracle databases
* C7: Author: the author of the regular expression
Then compress it in zip format and you’re ready to upload it to Talend Exchange.
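As an illustration, here is what one row with these seven columns could look like. The pattern content and file name are made up, and the exact cell separators and quoting expected by the importer are assumptions on my part, so treat this as a sketch to paste into a spreadsheet before saving it as xls:

```shell
# One header row (the seven columns listed above) and one example row,
# semicolon-separated for readability; the phone regex is illustrative only
cat > my_patterns.csv <<'EOF'
Label;Purpose;Description;Regular expression;MySQL Regexp;Oracle Regexp;Author
French phone;Check phone numbers;Matches French phone numbers;'^0[1-9]([ .-]?[0-9]{2}){4}$';'^0[1-9]([ .-]?[0-9]{2}){4}$';'^0[1-9]([ .-]?[0-9]{2}){4}$';Your name
EOF

# Once converted to xls, zip it for upload:
#   zip my_patterns.zip my_patterns.xls
wc -l < my_patterns.csv   # two lines: the header and one pattern
```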

For the current development version of Talend Open Profiler-1.2.0xx, the task is even easier because there is a menu to export your regex patterns and/or your SQL patterns in CSV format. For this version, it is suggested to export your patterns individually. Then compress your CSV file(s) and upload them to Talend Exchange to share them with other users. The “import” menu will be able to read these files (once unzipped).

We’ll work later on a more integrated mechanism to share your patterns in one click.