This month’s T-SQL Tuesday is being hosted by Pat Wright (blog | twitter). He selected the topic of automation with T-SQL or PowerShell. As it happened, I’d just had a scenario come up that seemed appropriate for a little PowerShell, so I went with it for this post.

The Varigence toolset is based on a language called BIML (Business Intelligence Markup Language), which is an XML-based language for creating BI solutions. We needed to make some mass updates to some of our sample files, but there were quite a few individual ones to update. I already had an XSLT transform that would update each file individually, so all I needed was a quick way to run it on each file in a directory.

I’m pretty new to PowerShell, but it seemed like a good fit for this. So, after a little trial and error, I came up with the following script:

Basically, it goes through a directory, renames all the existing .XML files to create a backup, and then gets a list of the files. It then loads an XSLT transformation (used to transform XML documents by inserting or removing markup). Finally, it loops through the files in the folder, and uses the XSLT transform to create an updated version of the file with the original filename.

There are probably better ways to do this, but it does exactly what I needed, and it let me learn a little more about PowerShell. It also struck me that this could be very useful to transform XML into a friendlier shape for processing in SSIS.

One of the interesting things about T-SQL Tuesday is that it’s a 24-hour event. So, being on the east coast of the US, I get the chance to read some of the blog posts for the event on Monday night, before I actually post mine, though I usually write them ahead of time. Last night I saw that Rob Farley (twitter | blog) had posted his T-SQL Tuesday post. Reading it, I came to a horrifying realization – Rob and I think alike! (If you know Rob, you’ll realize that’s a lot scarier than it sounds). His post and thought process was very similar to mine – though, while I did think about the screen resolution joke, I had the good sense not to mention it.

Seriously, since Rob’s one of the smartest guys I know, I think it’s pretty cool that we had the same approach.

This month’s T-SQL Tuesday is being hosted by Jen McCown (twitter | blog). She chose the topic of resolutions, with the new year in mind. I’ve never been big on the traditional idea of resolutions, and usually focus more on goals that I revisit on a regular basis.

So I started thinking about other ways that I could blog about resolutions, and the idea of resolving problems came to mind pretty quickly. In the roles I’ve filled over the last several years, I was often called in to deal with problems (technical and non-technical). The scope of the problems could vary quite a bit, from simple, 5 minute technical issues (“Why won’t my cube process?”) to business problems requiring full projects (“I need to analyze student performance and determine what factors impact it most heavily.”). One of the trends that I’ve noticed around solving problems is the shift from resolving the problem to simply fixing the glitch.

What’s the difference?

Well, just a second there, professor. We, uh, we fixed the *glitch*. So he won’t be receiving a paycheck anymore, so it’ll just work itself out naturally.
-Office Space

When you fix the glitch, you are basically looking for the quickest, cheapest, least effort patch to getting the immediate results you desire. When you resolve a problem, you are making sure the root cause of the issue is fixed, so that the problem doesn’t resurface in a new form a few months later, and so that you address the logical ramifications of the problem.

Of course, fixing the glitch is sometimes appropriate. If you have a server down situation and 50 people sitting around twiddling their thumbs till it’s back up, you need to get that fixed immediately. However, all too often, it stops there. Rather than investigating the root cause and how to prevent it, often the reaction is “It’s working – now back to business as usual.”

The biggest issue with fixing the glitch is that it sets you up for future failure. Most of the time, the patch fails to prevent issues because it didn’t address the problem completely. Even if the patch continues working, it’s creating something that will have to be worked around in the future. We’ve all dealt with those systems that have so many patches and quick fixes that everyone is scared to make any changes to them, for fear of breaking something else.

Resolving Problems

One of my goals for this year is to avoid quick fixes, and focus on resolving the root problems. That’s sometimes painful in the short term, as you don’t get that immediate satisfaction of resolving a problem 5 minutes into the conversation. The benefit, though, is that you won’t have to keep dealing with that same problem on a recurring basis for the next 5 years, so it’s a price I’m willing to pay.

This post is part of T-SQL Tuesday #13 – organized by Adam Machanic (Blog | Twitter), and hosted this month by Steve Jones (Blog | Twitter). The topic this month is “What issues have you had in interacting with the business to get your job done?”

Working in business intelligence, I do a lot of work with business users. For the most part, it’s an enjoyable process and a good interchange of information. I learn about the business, and in the process help the users gain new insights about their data and how it can be used to answer their questions. There is one consistent item, though, that I’ve found difficult to communicate to business users, and that’s the state of their data quality.

Data quality has a pretty big impact on the value of a business intelligence solution. Bad data quality = bad answers, and if your business intelligence solution isn’t delivering good answers, it’s not really worth much. But the data quality in a business intelligence solution depends in large part on the data being fed into it. And unfortunately, the data in a lot of source systems is not in good shape.

It’s Not Me, It’s You

It’s really difficult to communicate this to the business users, though. After all, they’ve been running their business for years on this data, and now I’m telling them it’s bad? Why haven’t people been complaining about it all along? There are a few things that I try mention at this point.

First, the level of data quality required to get good information from a BI system is very different from that required to run an operational system. For operations, there are usually some required, key pieces of information necessary. These are kept to a minimum, though, because operationally, the more data quality checks you implement, the more you impede the business process. So you want to have just enough data quality to for the system to not fall apart. You don’t really care that the customer entered the wrong zip code with their state (after all, the Post Office will work that out when you send them the bill, right?)

For BI work, though, you are flipping that around. To analyze and get meaning form the data, you need classification of the data, and some of those optional, not-so-important from an operational perspective pieces of information start coming into play. Knowing that your states actually align with your zip codes becomes pretty important if you want to display your data on a map.

Also, people probably have been complaining about data quality – they just aren’t complaining to the business users. The DBAs in most companies I’ve worked with are well aware of the issues in the company’s data. But when they complain about it, they complain to their managers or other DBAs. They don’t complain to business users, because, after all, we technology people are just supposed to make it work, right?

Can We Make This Work?

Convincing the business users of these two points can be pretty difficult, though. In the past, I’ve had to extract data, profile it, and then identify problem areas. Even then, going to a business user with a list of the 1000s of customers who have zip codes that aren’t valid for their state of residence often didn’t help. Business users can’t easily correlate the impact of those types of data issues on their reporting and analytics until they see it in the context of the actual analysis.

So, in the past, I often ended up doing a first pass on the data where I let anything in – quality data or not. Inevitably, as soon as users started looking at the numbers, they’d become much more interested in data quality. Today, tools like PowerPivot can help, because I can quickly build a functional prototype on top of the dirty data, and visually show the impact much more easily.

Once the business users understand the impact of bad quality data, they’ve always seen the value in either fixing it or putting in the appropriate safeguards to prevent it from entering the BI system in the first place. It can be challenging to communicate it, but the faster I could get this point across with the business users, the more successful the projects turned out. One of the keys things that I’ve learned from this process over time is that, while it’s difficult to show someone quality data, you can show clearly show the impact that it has on the solutions. For many business users, this communicates far better than all the talking in the world.

The topic for T-SQL Tuesday #006 is LOBdata, selected by this month’s host, Michael Coles. If you aren’t familiar with T-SQL Tuesdays, Michael has a nice summary in his post.

In database terms, LOBs are Large OBjects, also referred to as BLOBS (Binary Large OBjects). These are data types that can exceed the maximum row size of 8 KB in SQL Server. Most often, they are used for storing large amounts of text or binary data in the database. An example of this would be an application that stores documents in a database table. SSIS has some special features for handling LOBs in the dataflow, and this post will provide an overview of them and provide a few tips on using them wisely, as well as some approaches to keep them from killing your data flow performance.

Data Types for LOBs

LOBs are represented by three types in SSIS: DT_TEXT, DT_NTEXT, and DT_IMAGE. These map to the SQL Server types VARCHAR(MAX), NVARCHAR(MAX), and VARBINARY(MAX). Each of these types can hold 2,147,483,647 bytes, or almost 2GB of data. Columns with these data types have some special rules.

First, if you want to use them in a Derived Column expression, you’ll probably need to cast it to another type, like DT_STR or DT_WSTR, as the LOB types aren’t supported by most of the built-in functions. One thing to be careful of with this is that the entire contents of the LOB may not fit in the data type you are casting to, which will cause a truncation error. You can either set the Derived Column transformation to ignore truncation errors, or redirect error rows so you can handle the longer values differently.

If you are accessing LOBs in a Script component, you have to use the AddBlobData(), GetBlobData(), and ResetBlobData() methods on the column to get to the data.

Note that you should use the System.Text.AsciiEncoding class in the above code if you are working with non-Unicode data.

Transforms for LOB Data

There are two transformations in SSIS specifically designed for working with LOB data: Import Column and Export Column. The Import Column transformation is used to import the contents of a file from disk into a column in the data flow, as a LOB. The Export Column is used to do perform the reverse operation – that is, it saves the contents of a LOB column to a file on disk. Since you can process one file per row, these transformations can be really useful for moving multiple files in and out of a database.

There’s a few good walkthroughs on using these transformations, so rather than reposting the same information, here’s a few links:

Performance

Because these data types have the potential to hold so much data, SSIS handles them a little differently than the standard data types. They are allocated separately from regular data in the buffers. When there is memory pressure, SSIS spools buffers to disk. The potential size of LOB data makes it very likely to be spooled, which can be a pretty major performance bottleneck. To minimize it, you want to make sure that the LOB data is being spooled to a nice, fast disk that doesn’t have a lot of contention. To control what disk they are spooled to, you can set the BLOBTempStoragePath property on the Data Flow task. Note that this is a different setting than BufferTempStoragePath, which controls where regular data is spooled. Ideally, you want to make sure each of these settings points to a different physical disk.

The other item to consider is whether you actually need LOB data. In some scenarios, the LOB data types are used for convenience or because there is a possibility that a small percentage of the data would exceed the maximum length of a regular data type. If you are processing a lot of data with LOB types, and most of the LOB data is small enough to fit in an regular data type, then you may be better off using a two pass approach. In one pass, select only the data with LOB values small enough to fit in a standard data type, and cast it in the SELECT statement. SSIS will process this set of data using the normal buffer approach, which should be faster. Then, do a second pass where you select only the data that is too large for a regular data type. This will minimize the amount of data that has to be processed using LOB data types. To figure out the length of a LOB column in SQL Server, you can use the DATALENGTH function.

SELECT DATALENGTH(MyLOBColumn)
FROM MyLOBTable

As usual with performance, your mileage may vary on this one. It’s worked well for me in a few scenarios, but there’s some overhead involved in using two passes. So, as with any performance tips, test it in your environment, with a solid representative sample of your data, before implementing it in production.

There it is – a whirlwind tour of LOBs in SSIS. If you have any questions or comments, please post them.