Wednesday, December 29, 2010

I recently had a project that involved processing FTP log files. I chose to use SSIS for the task as it is great for this type of flat file ETL work. One of my requirements was to continue loading remaining log files even if an unhandled exception was encountered during the processes.

What I found is that the default behavior of the Foreach Loop container is such that any unhandled exceptions within the container cause it to exit the loop. Not what I wanted.

What I did want was for the container to do something on error (move the file to an error folder and send a notification email) and then continue on to the next file. You can see this in the screenshot of the container below. Note the failure precedence constraint on the right-hand side (red arrow).

Just like in a C# application, unhandled exceptions "bubble up" to their parent, and if no object handles the exception the program fails. In this case I wanted to know about the error, fail the child executable (in this case a Data Flow), but continue executing the parent (the Foreach Loop Container).

It turns out this is pretty easy to do. The first thing you need to do is create an OnError event handler for the Data Flow (or any other child executable). Once the event handler is created, show the system variables in the Variables window, locate the Propagate variable, and set it to False.

The Foreach Loop Container is now aware that the Data Flow has failed, but the exception is handled in the child and therefore does not cause the container to fail and exit. We can now use the failure Precedence Constraint to do our additional processing (move the file to an error folder and send a notification email).

Thursday, November 4, 2010

For anyone that doesn't know what SQL Azure is, it's essentially SQL Server hosted in Microsoft Data Centers and accessible via the internet. Throw the word cloud in there if you want the executives and sales guys to listen. A competing product would be Amazon's SimpleDB.

SQL Azure is cool. No doubt about it. In a matter of minutes I can provision a SQL Server database somewhere in the ether and connect to it from my machine using the same familiar tools like SQL Server Management Studio and Visual Studio. Reporting Services has a SQL Azure connection type and SSIS allows for SQL Azure sources and destinations using ADO.NET type connection managers.

The benefits of having your database be a SQL Azure database are almost in lockstep with anything "in the cloud." You don't need your own hardware, you get high availability, no software patching/updating, etc. All that stuff becomes someone else's responsibility. Additionally you get scalability and you pay for only what you need. In the traditional, on-premise model of hosting your own database(s), if you have a business where you need to support peak load that only occurs a few days or weeks out of the year, you have to pay for that horsepower in both hardware and software licensing costs. Move those same databases to SQL Azure and now you have a platform that scales to your needs on-demand and you only pay for the resources consumed. There's no doubt in my mind that this model will become ubiquitous, just as virtualization has over the years.

So let's move all of our SQL Server databases to SQL Azure! Well, not quite yet. There are some major limitations that will keep 99% of "real" applications off of this platform in the near-term. My biggest "deal-breakers" are listed below.

Database Size Limitations
SQL Azure databases are limited to 5GB for Web Edition databases and 50GB for Business Edition databases. Of course there are plenty of databases that might fall within these limitations but I wouldn't want to risk my job on deploying to this platform only to hit this limitation and have a catastrophe on my hands.

No Windows Authentication
The platform currently supports only SQL Server logins. Intuitively that makes sense since the server doesn't live in your domain but this can become painful if you're an organization that has standardized on using Windows authentication. It might also mean major rework of existing applications that rely on user-based permissions.

No local restore of backups
This one kills me, and I think many others as well. While you can now restore a database to another SQL Azure database, you cannot bring a backup to your local environment and restore it on a local server. I imagine this could make things rather difficult. Microsoft recommends transferring your database locally by using SSIS. No thanks.

No Replication
For a lot of organizations this might not matter but if you are an organization that relies heavily on replication you simply can't move a subscriber or publisher out to the cloud.

No SSIS, SSRS, or SSAS
These services are not available on the SQL Azure platform. I would expect that they will be at some point, but if I were a betting man I wouldn't put any money on them coming anytime soon. That doesn't mean you can't use these services on local servers while consuming Azure databases, but this could hurt your bottom line. I know plenty of organizations that run any combination of these services on one physical machine without issue. So if you're already paying for the hardware to support SSRS, why pay a subscription fee to move your database out to the cloud? Maybe you can make some arguments around high availability or scalability for your relational engine, but if you're running them on the same server to begin with then these are probably not a concern.

Lack of Profiler
You can't attach a Profiler instance to a SQL Azure database. That could get rather dicey when trying to track down all sorts of issues. Seems scary to me. There might be alternatives, but I'm not aware of them.

Bottom line: SQL Azure is cool, and it will mature to the point that it begins to see adoption, but it ain't quite ready for prime time.

Tuesday, October 12, 2010

Most people that work with SSRS (2005, 2008, 2008 R2) are aware of two ways to deploy Reporting Services catalog items (reports, data sources, shared datasets, images, etc.): deploying directly from BIDS (Visual Studio) and uploading files using Report Manager. There are also a number of web services that enable us to do this in a programmatic, automated fashion. These allow for things such as dynamic data sources, so we can deploy reports to any environment without having to modify data source properties manually, and for specifying where in the Report Manager folder hierarchy we want our reports installed - again without having to modify data sources, datasets, or references to either.

The following web services are available for administrative tasks in SQL Server 2008 R2. Note that there is also a ReportExecution2005 web service that provides methods for actually executing reports.

ReportService2005

ReportService2006

ReportService2010

ReportService2005 is used for SSRS instances running in native mode.
ReportService2006 is used for SSRS instances running in SharePoint integrated mode.
ReportService2010 ties the two together so that one web service works with both native and SharePoint integrated mode.

While you can work directly with these web services using your programming language of choice, Microsoft was kind enough to create RS.exe. This console application acts as a wrapper for these web services. It is installed with SQL Server and like any other executable it can be called by tools like NANT, MSBuild, InstallShield, etc.

My need for this tool arose when I was developing a set of reports for a commercial product that required installation through InstallShield. The development environment included ANT, which we also had call RS.exe for deployment as part of an automated build environment.

Now I'll show you how to use RS.exe from the command line. I won't go into detail about how to use it from specific tools as that should be straightforward for anyone that is familiar with the tool they are using.
Below is a sample RS call with three input parameters.
-i is the file that contains your rss script. More on that later.
-s is the address of your ReportServer web service
-e is the web service end point. Remember from above that there are multiple web services available and you can target any of these endpoints using RS.
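A sample call might look like the following (the script name and server URL here are hypothetical placeholders). The -e argument accepts Mgmt2005, Mgmt2006, or Mgmt2010 to target the corresponding web service endpoint.

```
rs.exe -i DeployReports.rss -s http://myserver/ReportServer -e Mgmt2010
```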

Now that we know how to call RS, the key to making it useful is your custom .rss (Reporting Services Script). It is in these scripts that we can write custom code to deploy reports, data sources, images and create folders as well as delete items.

I have posted a sample rss script that has the following generic methods.
CreateFolder: Creates folders in Report Manager
PublishItem: Publishes reports and shared datasets
PublishImage: Publishes images
DeleteItem: Deletes any item

These methods are essentially wrappers around the rs.CreateCatalogItem, rs.DeleteItem, rs.CreateFolder, rs.SetItemDataSources, and rs.SetItemReferences methods. They add additional exception handling and deal with the nuances of publishing reports in locations different than specified in your Visual Studio project and .rdl files.

Note the use of what are referred to as Global Variables. You can pass values into your rss scripts by using Global Variables. You don't declare these variables; you simply refer to a non-declared "variable" in your rss script, and RS is smart enough to know that you want to set it at the command line. The way to set them is by using the -v argument. See below for another example.
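For example (the script, server, and variable names here are made up for illustration), a script that refers to Global Variables named targetFolder and dataSourcePath could be invoked like this:

```
rs.exe -i DeployReports.rss -s http://myserver/ReportServer -e Mgmt2010 -v targetFolder="/Sales Reports" -v dataSourcePath="/Data Sources/SalesDW"
```

Inside the rss script you would then simply reference targetFolder and dataSourcePath as if they were already-declared variables.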

Sunday, October 3, 2010

I was recently troubleshooting a severe query performance issue when I ran into a query plan that used a nested loops join to combine data from two tables. The join was 97% of the cost. When I looked closer I saw that there were no join predicates specified. The join in the query looked pretty straightforward to me and used a single column on each side. Why would the optimizer do what equated to a Cartesian join?

It turns out the query was a little more complicated than it looked. I was actually joining two views that were part of a view-based (non-persisted) dimension model. I was joining on logical primary and foreign keys which made sense.

When I dug into the views I found that one of the key columns was based on a user-defined function that was converting a datetime to an integer value so that it could be used to join to a date dimension view. Once I changed the underlying view to apply the same logic as the UDF at the view level, so that a call to the UDF was unnecessary, the query executed within a second as expected.
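To illustrate the pattern (the object and function names below are made up, not the actual client code), the problem view looked something like the first definition, and the fix was to inline the conversion logic as in the second:

```sql
-- Problem: a scalar UDF on the join column hides the expression from the optimizer
CREATE VIEW dbo.vFactOrders AS
SELECT o.OrderID,
       o.OrderAmount,
       dbo.fnDateToIntKey(o.OrderDate) AS OrderDateKey   -- scalar UDF call
FROM dbo.Orders AS o;
GO

-- Fix: the same conversion logic applied inline at the view level
ALTER VIEW dbo.vFactOrders AS
SELECT o.OrderID,
       o.OrderAmount,
       CONVERT(INT, CONVERT(CHAR(8), o.OrderDate, 112)) AS OrderDateKey  -- e.g. 20100829
FROM dbo.Orders AS o;
GO
```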

Other behavior I noticed was that if I changed the inner join to a left join, the optimizer came up with a different, much more efficient plan. This appears to be a flaw in the optimizer, but I would like to speak to someone at Microsoft before making that claim.

The lesson learned here is that if you have a poorly performing query due to a nested loops join with no join predicate(s), it's not necessarily the case that you're missing join criteria in your query (as all the posts I was able to find seemed to point to). The culprit could be a UDF on a join column.

Wednesday, September 15, 2010

I often find myself needing a list of columns from a table or a view for a query, documentation, troubleshooting, etc. Easy enough to drag and drop them into a query window using SSMS but that's a pain. I have better things to do with my time. All of this information is available through SQL Server system views. The following query is also ANSI compliant.
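A query along these lines does the trick (the table name is a placeholder); since the INFORMATION_SCHEMA views are part of the ANSI standard, it should work on other platforms as well:

```sql
SELECT COLUMN_NAME,
       DATA_TYPE,
       IS_NULLABLE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'Customer'   -- placeholder table/view name
ORDER BY ORDINAL_POSITION;
```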

Sunday, August 29, 2010

I was recently faced with a difficult performance issue. In an effort to save a client some time and money I convinced them that a view-based dimensional model was the way to go over a table/ETL-based solution. I will blog about when this option makes sense and the benefits of going this route some time in the near future. Anyhow, based on this decision I ended up with a performance issue that I wouldn't have otherwise had.

The view-based dimensional database consisted of 30-40 fact and dimension views. As I created these views I made sure to test them for performance issues and tuned accordingly. I tested each view on its own and I tested individual joins between fact and dimension views.

Everything was working fine until I began creating sample queries to show end-users possible queries they could write against the new database. What I found was that when a particular dimension view was joined to a fact view along with one or more other dimension views I had unexpected performance issues. Queries that should take a few seconds to execute were taking 45-50 seconds.

In looking at the execution plans I saw that the optimizer was not doing what I expected based on the plans for the views executed on their own. I saw weird things, like changing an INNER join to a view into a LEFT OUTER join resolving or partially resolving the issue.

I couldn't understand why the optimizer would treat the SQL in my views differently now that it was joined to other tables or views. To be honest, I'm still not sure. What I am sure of is that there is a query hint that forces the optimizer to create a plan for the view portion of the SQL irrespective of any other objects in the statement by forcing the optimizer to preserve the table/join order contained in the view definition. This hint is the FORCE ORDER hint and can be added to the end of the SQL statement using the following syntax.

Friday, August 27, 2010

Lately I've been getting a lot of requests for step-by-step documentation for non-technical end-users. Well actually not just lately. I always have. But lately I've been getting tired of writing the same document over and over so I've begun to create a little library of step-by-step guides as well as templates for standard database related documentation.

My latest request was for creating a SQL Server DSN on Windows 7. Attached are the steps for creating one. You can find these same instructions in an MS Word document that contains screen-shots for the entire process here.

Follow the steps below to create a DSN on a Windows 7 machine. These steps are also very similar for prior Windows versions.

Click Start->Run. Type ODBCAD32.exe and hit enter.

Another option would be to open the Windows Control Panel and then open Administrative Tools. You should see Data Sources (ODBC). Double click this icon.

Click on the System DSN tab.

Click Add.

Select the appropriate driver.

Click Finish.

The steps below are specific to setting up a SQL Server Native 10.0 DSN.

Enter the Name you wish to use for the DSN. This is usually the name of the database you’re connecting to or if you intend to use it for multiple databases you can use the server name.

Enter the name of the Server your database(s) reside on.

*If your database(s) reside on a local non-named instance of SQL Server your Server name should be localhost or the name of your machine. If you are running a named instance of SQL Server (which is the case for the default installation of SQL Server Express Edition) your Server name should be something like localhost\SQLExpress.

Click Next.

Choose your authentication type. If your Windows Active Directory account has the necessary privileges use Integrated Windows authentication. Otherwise use SQL Server authentication with a SQL Server user name and password that has the necessary level of permissions for the database you are connecting to.

Click Next.

Click the Change the default database to checkbox and select the database you want to connect to. This will be the default database when using the DSN. If you intend to connect to multiple databases you can leave this box unchecked.

Click Next.

Click Finish.

Click the Test Data Source . . . button to ensure that the DSN has been configured correctly.

Wednesday, July 28, 2010

I've recently begun using Visual Studio 2010 database projects (DBPro). I think it's a great tool and having been a .NET developer for many years it is nice to see SQL Server development get the respect it deserves in a full-fledged (well almost) Visual Studio project type.

While development productivity has been great with features like object-level dependency checking, automated deployment with VSDBCMD, database dependencies, data generation, pre- and post-build and deployment scripts, etc., I've recently encountered a rather glaring shortcoming that keeps Database Projects from having the full functionality of a C#, VB.NET, or ASP.NET Visual Studio project.

The problem is related to the build process. Currently, to build a database project you need to have Visual Studio present on the machine that you are executing the build on. That might not seem like a big deal to many people since you might assume that any developer doing a build would have Visual Studio installed. Not the case. I'm currently working in a Java shop that uses SQL Server as the relational database platform. There are about 12+ developers that run any number of ANT targets to build their development environments on a frequent basis.

ANT can call MSBuild targets. That's not the problem. The problem is that to build a Visual Studio database project you must have certain MSBuild targets and a number of Microsoft assemblies in the GAC that are only installed with Visual Studio 2010. This is not the case for other Visual Studio project types. In most cases all that is needed is .NET 4.0.

While we're not dead in the water we have to rely on a solution that is less than ideal although not altogether terrible. What we have to do is build the project on one of the database developer's machines that has Visual Studio present. We then take the build output (.dbschema, .deploymanifest, etc.) and commit it to source control. From there any developer can run the ANT targets which then call VSDBCMD and deploy the database to their local development environment.

What we have works, but it is a departure from how a standard Visual Studio project would work. It also requires committing "compiled" code to source control, which is never a good idea. The biggest struggle is making sure the database developers build in the right configuration (Release) and commit the correct build output. It's an easy thing to get wrong and can cause a lot of wasted troubleshooting time downstream if it is done incorrectly.

I have spoken to Microsoft about the issue. I am waiting to hear back from a Microsoft support escalation engineer who is in turn getting in touch with the DBPro project manager about alternative options. Additionally, I have submitted a request on Microsoft Connect. Feel free to check if any progress has been made on the Microsoft Connect website.

Monday, July 26, 2010

Reports are often an afterthought and commonly believed to be an easy thing to create. As a result they are often overlooked during the requirements gathering and planning phases of a project and when they are considered they tend to be the red-headed stepchild and don't get the attention they deserve. This is usually a time consuming and costly mistake.

Reporting can often be complex, and requirements can be interpreted many different ways by many stakeholders (e.g. business users, business managers, developers, architects, etc.). Reporting tools have come a long way and can do many different things many different ways. The tools can actually serve as a fully functioning UI (user interface) in many cases. Not to mention, data itself is just that - data. It can be manipulated in infinite ways to provide information that not only looks different but actually is different, leading to different values and therefore different conclusions.

Most, if not all, of the time, report writing will be an iterative process whether you're an agile shop or not. The key here is to not iterate to the point that a single report costs $20k to create (yes, I have seen this) and/or that the developers, business analysts, end-users, etc. are completely frustrated and just want to say it's done so they don't have to deal with it anymore (yes, I have seen this as well).

I believe there are a few key practices here that can help avoid the $20k meaningless report.

Gather the minimum set of requirements necessary to create the fundamental pieces of a report. Use a template to collect this information. This will ensure that you have consistent information from which you can create a consistent set of reports.

Require the report requester to provide mock-ups. When possible, have them create the mock-ups in Excel and use real formulas for calculated columns. The greatest value mock-ups provide is that they force the requester to think about what they are asking for. They make the requester realize things like their idea was malformed and what they're asking for doesn't make sense and/or isn't possible. It also helps them think through things they wouldn't have thought about until the first iteration of the report was delivered, which helps cut down on the number of iterations and the need for developers and/or BAs to ask "stupid" questions that can annoy and embarrass the requester.

Minimum Set of Requirements
It doesn't matter what tool(s) you are using. It can be Crystal Reports, Cognos Report Studio, Jasper Reports, SQL Server Reporting Services, etc. They all have the same basic functionality and therefore the same requirements needed to develop a report using them.

A meaningful title for the report (e.g. Active Customers Name & Address)

A concise description of the report that includes

The primary "thing" being reported on (e.g. customer)

The subject of the attributes being reported (e.g. name and address)

The main criteria (e.g. status = active)

For example, "The Active Customers Name and Address report includes the names and addresses of all customers with a Status of "Active" and is grouped by State of residence."

A list of the columns with

Database table and column name if possible

Label text for the report

Any calculation

If it's a link to another report or external source and if so a description of how the link should work

If it's a sortable column

Right, left, or middle alignment

Value formatting (MM-DD-YYYY, YYYY-MM-DD, $100, ($100.00), etc.)

Layout

Column order

Landscape or portrait

Cross-tab or row/column

Headers and footers

Image/logo

Creation date

Page number and format (p. 1, page 1 of 10, etc.)

Selected parameter value

Grouping

If grouping, what are the group levels?

How should each group be sorted?

Should the group have subtotals and if so which columns and what is the calculation if complex (e.g. weighted average)?

Should there be a page break and/or line before/between/after the group?

Sorting

How should the report be sorted (which column(s), in what order, and which direction - ascending or descending)?

Should the user be able to specify the sort order (dynamic sorting)?

If grouping is present, what order should the groups be sorted?

Is run-time sorting allowed on a column after report execution?

Filter Criteria

What filter criteria (WHERE clause) are there, if any? For example, should only "active" customers be included? Another example would be "completed" orders.

Should these criteria be evaluated using an AND operator or an OR operator?

The column(s) that the filter is applied to should be clearly stated.

Parameters/Filters

Label/text for the parameter

Drop-down, text, yes/no

Are multiple selections allowed?

What are the values, or where do they come from (a static list of values or a table)?

Should we assume that all criteria are applied to the result set, or are there "OR" conditions?

Does one parameter drive the value list of another parameter (e.g. Country changes values in state/region parameter drop-down)?

What column or field in the result set should this parameter/filter be applied to?

Schedule

If the report is deployed to a "report server" should it have one or more standard schedules on which it runs?

Export format

If the report is deployed to a "report server" what is the default format (Excel, Word, CSV, PDF) that it should be written to? If there are scheduled instances, what format should those be written to?

Recipients (burst or data-driven - what is the logic?)

If a scheduled report instance is burstable (Cognos) or a data-driven email subscription (SSRS) who are the recipients? Is it a static list or is it data-driven?

Misc

Charts or graphs?

Drill-downs?

Links to other reports (drill-through)?

This isn't an exhaustive list but it is a good starting place to build a template from which you can more effectively develop reports. Open communication and an iterative approach is good but why waste iterations on requirements you could have known upfront? Save time, save money, and produce a quality product. Our job as developers is to deliver value. This is one way to help do that.

Thursday, July 15, 2010

I'm currently working on a Business Intelligence project, and depending on who I'm talking to we are either building a Data Mart or a Data Warehouse, and everyone's definitions of both seem to be vastly different. Since this makes communication a bit difficult at times, I took it upon myself to attempt to clarify the difference between a mart and a warehouse.

My project is following the Kimball approach of dimensional modeling so I tailored my definitions to this approach. Keep in mind that the detailed definitions differ depending on whether you're talking about the top-down (Inmon) approach or the bottom-up (Kimball) approach. The differences are mostly around the modeling techniques and whether or not the data warehouse is in 3NF or de-normalized.

My non-technical description of data mart and data warehouse are as follows.

A data mart is typically a single star schema representing a specific subject area. Sales is a good example of this and Manufacturing is another. A data warehouse contains the subject specific data marts and attempts to tie an entire organization’s subject data together using conformed dimensions (attribute data that can be used consistently across multiple facts/marts/star schemas).

That is the very simple description I provided the group I am working with. That said, I still needed to determine if we were building a data mart or a data warehouse. From my definition you would think it would be an easy thing to figure out. Since I tend to overthink most things in life I struggled a bit with this.

The database I'm building to support the Business Intelligence (BI) solution consists of subject areas such as Orders, Complaints, Deliveries, and about four more. They will each have their own fact table and share a number of dimensions. Based on that I would say it is a data warehouse, because it is a set of data marts/star schemas providing data across numerous subject areas and sharing conformed dimensions. However, this solution is part of a commercial software product that supports Restaurant Order Taking and is not being developed for or within a specific organization. So now I believe it's a question of perspective. From the customer's/restaurateur's perspective, the order taking application and the new BI add-on are simply a small part of a much larger picture. The data contained within the mart/warehouse might be considered a single subject area to this organization. The subject area would be Order Taking, or more specifically Web and/or Call Center Order Taking. The customer might have other data around accounting, human resources, inventory, supply chain, etc. The solution I'm building could eventually become part of a bottom-up enterprise data warehouse, at which point it would be considered a data mart by the purchasing organization.

So which one is it? I believe you can argue that it's both. However, I believe we must look at it from the perspective of my client, a software products company that sells a suite of order taking products to the restaurant industry. From within the software product company this is a data warehouse that brings together all aspects (subject areas) of their products.

You could argue that it doesn't matter and that this is purely a semantics or philosophical debate. Or possibly I've missed the boat. What do you think?

In relation to the product and the company that owns and sells it to restaurateurs, this solution can be considered a data warehouse, as I stated above, because it captures data across the entire solution and consists of many subject areas. From the customer's (restaurant owner's) perspective, though, it may simply be one subject area in a much larger picture.

If you wanted to know what games each store sold from 1/10/2010 through 1/16/2010 you could write either of the following queries:

SELECT DISTINCT
Store,
Product
FROM
Sales
WHERE
SalesDate BETWEEN '2010-01-10' AND '2010-01-16' -- assuming a sale date column such as SalesDate

-OR-

SELECT
Store,
Product
FROM
Sales
WHERE
SalesDate BETWEEN '2010-01-10' AND '2010-01-16'
GROUP BY
Store,
Product

So which one? I hate to admit it, but back when I first got into SQL performance tuning I was working in an Oracle 8i environment. If I remember correctly, GROUP BY performed better than DISTINCT for one reason or another. In the days of SQL Server 2005 and 2008, GROUP BY and DISTINCT perform equivalently. Therefore, my rule of thumb is to use DISTINCT when no aggregations on additional, non-unique columns are necessary. If you just want unique combinations of store and game, use DISTINCT. If you want to know how many of each game were sold by store, then use GROUP BY.
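For instance, assuming the Sales table has a Quantity column (an assumption for illustration), the aggregate version requires GROUP BY; DISTINCT can't express this:

```sql
SELECT Store,
       Product,
       SUM(Quantity) AS UnitsSold   -- the aggregation is what makes GROUP BY necessary
FROM Sales
GROUP BY Store,
         Product;
```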

Wednesday, June 16, 2010

A few things to keep in mind when deciding whether you want a unique index or constraint.

1) The index creation options other than FILLFACTOR that are available for a unique index are not available for a unique constraint.

2) A unique constraint can be referenced by a foreign key constraint. (SQL Server also allows a foreign key to reference the columns of a unique index, but only primary key and unique constraints are part of the ANSI standard, so the constraint is the more portable choice.)

3) A less subtle difference is related to the timing of validation. Constraints are checked before indexes, which can cause a large multi-row INSERT...SELECT or UPDATE to fail before any modification takes place. Indexes are often validated at the end of a large modification, so with a unique index it will take longer for a failure to occur, with a rollback at the end of the modification.
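For reference, here is how each is created (the table and column names are hypothetical):

```sql
-- Unique constraint: can be referenced by a foreign key
ALTER TABLE dbo.Customer
    ADD CONSTRAINT UQ_Customer_Email UNIQUE (Email);

-- Unique index: exposes index creation options the constraint does not
CREATE UNIQUE NONCLUSTERED INDEX IX_Customer_Email
    ON dbo.Customer (Email)
    WITH (PAD_INDEX = ON, FILLFACTOR = 90);
```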

Ever had to try to decipher SQL that was generated by a report writer or sql generation tool such as Cognos Framework Manager? How about trying to read through a co-worker's messy SQL? Not fun.

One of my favorite tools is the Instant SQL Formatter. It will clean up any valid SQL statement based on user configurable options. I usually just stick with the defaults. It's a great tool and it's free. Can't beat that. One of the first things I do when I set up a machine at a new client site is to add it to my favorites in IE.

Here are a few hints if you're extracting data from an Oracle database using SSIS.

1) Use the Oracle 11g client. The 10g client has bugs around the naming of the Program Files directory on 64-bit machines; it doesn't like the parentheses in Program Files (x86). There are workarounds for using the 10g client, but it's been my experience that it's not worth the effort.

2) Make sure to install the 32-bit client for running SSIS in the IDE (debug mode) and the 64-bit client for the 64-bit SSIS runtime.

3) Use the Attunity connector for Oracle. This component can extract data up to 100 times faster than the out-of-the-box OLE DB provider for Oracle.

Wednesday, May 12, 2010

Are you in a dev environment where, by screwing around with large DML statements, you've created a monster log file that you need to truncate, and you're not worried about loss of data? Is your database in the simple recovery model? If you answered yes to these questions . . . fire away:

DBCC SHRINKFILE (<Log Name>, 0, TRUNCATEONLY)

If you get an error stating that the log file is in use but you have an immediate need to truncate the log file and don't want to wait you can also do the following:

1) Detach the database
2) Delete the log file from the file system
3) Attach the database and remove the reference to the log file prior to completing the process in the UI

Upon attach, SQL Server will create a new log file at the minimum size specified.
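The same steps can be scripted instead of using the UI. A sketch, assuming a hypothetical database name and file path (delete the old .ldf from disk between the two statements):

```sql
-- Detach the database (assumes no active connections).
EXEC sp_detach_db @dbname = N'MyDevDb';

-- After deleting MyDevDb_log.ldf from the file system, attach the
-- data file only; SQL Server rebuilds a new, minimally sized log.
CREATE DATABASE MyDevDb
    ON (FILENAME = N'C:\Data\MyDevDb.mdf')
    FOR ATTACH_REBUILD_LOG;
```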

Tuesday, April 13, 2010

Well it's not. One of the most important things you can do when performance tuning is to remember to clear the procedure cache between test cases.

Don't be fooled by the word procedure. The procedure cache deals with almost all queries submitted to the SQL Server engine, not just those that live in stored procedures. It stores compiled execution plans for later use which helps speed things up by not having to regenerate execution plans when one is already available.

Back to clearing . . .

It's important to clear the cache between query submissions so that we start from scratch when we submit our new query. If we have "left-over" plans in the cache the optimizer won't necessarily create a newly optimized plan thus corrupting your tuning/testing efforts.

DBCC FREEPROCCACHE will clear all execution plans from the cache causing all subsequent SQL statements, stored procs or not, to be recompiled the next time they run.

It's also helpful to run DBCC DROPCLEANBUFFERS to clear the data buffers. This will ensure more accurate testing as all queries will have to retrieve their data from disk. Running CHECKPOINT prior to DROPCLEANBUFFERS will move all dirty data pages to disk, even further ensuring accurate testing results.
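Putting the three together, a typical reset between test cases looks like this (run it on a test server only, never in production):

```sql
-- Flush dirty pages to disk so the buffer clear starts from a clean slate.
CHECKPOINT;
-- Empty the buffer pool; subsequent queries must read their data from disk.
DBCC DROPCLEANBUFFERS;
-- Remove all compiled plans; subsequent queries recompile from scratch.
DBCC FREEPROCCACHE;
```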

Monday, March 29, 2010

I love the PIVOT and UNPIVOT commands. They have saved me a ton of time on quite a few projects. UNPIVOT especially, since many times when working on BI projects I'm handed a monster spreadsheet by someone on the business side. Business people like seeing pivoted data. Database people like seeing normalized data. I won't get into the keywords/commands here since there is plenty of material out there, most of which does a better job than I ever could of explaining their use.

However, I will tell you that SQL Server does not like "UNPIVOTing" a large number of columns, as I recently found out. I should note that the query I have contains a few nested SELECTs (which I'm sure is part of the problem). I don't know the internals of what the database engine is doing when you use these commands but I do know they produce some pretty ugly execution plans. Since I just started this project and have a year of work to deliver in about 2 months (I'm not kidding) I haven't had the time to deconstruct the execution (nor should I really be writing about this topic).

What I have figured out is that if I break up a 280+ column UNPIVOT into 20 column chunks (an unpivot for every 20 columns - yes this is a maintenance nightmare) then I get reasonable performance (a couple of minutes to execute) whereas my original query never completed.

There's no way I'm going to be ok with having 15 queries instead of one to solve a performance issue but this is what I know for now and I'm able to present a working proof-of-concept. I hope to have a better understanding of the issue and as a result, the solution soon. I'll keep you posted. In the meantime, if you know what's going on please share . . .
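For the curious, the chunking approach looks roughly like this. Table and column names are hypothetical; each statement unpivots about 20 columns and the results are combined in a staging table:

```sql
-- One chunk of ~20 columns instead of a single 280+ column UNPIVOT.
INSERT INTO dbo.MeasureStaging (RowId, MeasureName, MeasureValue)
SELECT RowId, MeasureName, MeasureValue
FROM (SELECT RowId, Col001, Col002, /* ... */ Col020
      FROM dbo.WideSpreadsheet) AS src
UNPIVOT (MeasureValue FOR MeasureName IN (Col001, Col002, /* ... */ Col020)) AS u;

-- Repeat for Col021..Col040, Col041..Col060, and so on.
```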

Sunday, February 28, 2010

Most SQL Server mortals assume that their primary keys should be clustered because that is the default behavior of SQL Server Management Studio when using the designer to create a table. Unfortunately, the primary key is not always the best candidate for the one and only clustered index you're allowed on a table.

I won't go into detail here about what a clustered index is but I'll say that it is essentially your table and the order in which your table rows/records/tuples are stored is based on the column(s) used in your clustered index.

In an OLTP database where your primary keys consist of a single integer identity column you're usually in good shape if your clustered index is also your primary key.

However, if you use composite and/or "intelligent" keys you're going to experience a number of performance problems. I won't expound on these problems here because I almost never use intelligent or composite keys in an OLTP system and don't believe they should be used in such systems.

The place I most often have a non-primary key clustered index is in an OLAP fact table. OLAP systems are more concerned about high performance reads of blocks of data as opposed to reads and writes of individual rows as is the case in transactional systems.

In a data warehouse/OLAP database date is often an attribute used in most analyses. By creating a clustered index on the most important date in a fact table (i.e. order date in an order table), you enable better I/O and therefore faster query response times for the majority of queries executed.

Some might argue that the unique surrogate key of the fact table (if you even have one) should have a clustered index. Hmmm . . . how often do you query a fact table and have a surrogate key based predicate? The answer should probably be never (although I almost never say never)!

How about writes? Fact tables are often written to based on a date. By clustering on date we typically have sequential writes to the disk. When we load today's data we end up adding data to the end of the last data page for the table because today's date comes after yesterday's date. If we were to cluster on some kind of an intelligent key we'd end up having to insert rows into pre-existing data pages which leads to much higher I/O and therefore decreased performance.

How about archiving and partitioning? In most OLAP scenarios the business requirement is to have something like 25 months worth of data available in the system at all times. It's often beneficial for performance and disk expense reasons to take old data offline. The best way to do this in a SQL Server environment is to partition your fact tables using a year/month based partition function. For partitioning to be successful it is necessary for your partition column to be part of the clustered index. If you cluster on your date column and partition on your date column as well, you're all set.
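A sketch of the date-based clustering and partitioning described above, using a hypothetical fact table and monthly boundaries:

```sql
-- Monthly partition function and scheme on the fact table's date column.
CREATE PARTITION FUNCTION pfOrderDate (date)
    AS RANGE RIGHT FOR VALUES ('2010-01-01', '2010-02-01', '2010-03-01');

CREATE PARTITION SCHEME psOrderDate
    AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

-- Cluster the fact table on the same date column used to partition it,
-- so new loads append to the end and old months can be switched out.
CREATE CLUSTERED INDEX ix_FactOrder_OrderDate
    ON dbo.FactOrder (OrderDate)
    ON psOrderDate (OrderDate);
```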

In short, if you're working in a data warehouse/OLAP environment pay careful attention to what column(s) you create your clustered indexes on. It can make or break the performance of your solution!

Thursday, January 14, 2010

I was recently attempting to explain to someone the difference between derived tables and subqueries. I didn't do a great job and came to the conclusion that I wasn't completely clear on the terminology. I don't like not knowing things so I did a little research and this is what I found . . .

A derived table is not what I thought it was. I always referred to a derived table as being a query in a FROM clause. For example:

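A minimal illustration of both forms (table and column names are hypothetical):

```sql
-- A query in the FROM clause -- what I'd been calling a derived table:
SELECT d.CustomerId, d.OrderCount
FROM (SELECT CustomerId, COUNT(*) AS OrderCount
      FROM dbo.Orders
      GROUP BY CustomerId) AS d
WHERE d.OrderCount > 10;

-- A correlated subquery -- the inner WHERE clause references the outer query:
SELECT c.CustomerId, c.Name
FROM dbo.Customers AS c
WHERE EXISTS (SELECT 1
              FROM dbo.Orders AS o
              WHERE o.CustomerId = c.CustomerId);
```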
As you can see, the subquery has a WHERE clause that refers (correlates) to the outer/parent query.

So there you have it. Stop saying derived tables! Most likely you're misusing it the way I have ever since I learned about the concept (of subqueries - not derived tables ;)). Or maybe you're not as anal as I am. However, if you've read this far I'd say you are.

Wednesday, January 6, 2010

People seem to be confused when it comes to indexing of foreign key columns. For one, I believe some people assume that when you create a foreign key constraint (in SQL Server) an index is also created on the column(s) that make up the key/constraint. This is not the case.

That leads us to the original question . . . Should we create indexes on our foreign keys??

I believe that in most cases the answer is yes.

There are a few things to consider here:
1) Constraint checks
2) Cascading updates and deletes
3) Joins

1) If you delete a row in a table whose primary key or unique constraint is referenced by one or more foreign keys, SQL Server will search the foreign key tables to determine if the row can be deleted (assuming you have no cascade behavior defined). An index on the foreign key column(s) will speed this search.

2) The same is also true for update and delete cascades. If either are defined, the child/foreign key rows must be identified and updated and/or deleted. Again, the index helps find these rows more efficiently.

3) More often than not foreign key columns will be joined to primary key columns in queries. Joins are always more efficient when an index exists on both sides of the join.
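To make the point concrete, here's a hypothetical parent/child pair. Note that the foreign key constraint by itself creates no index; the index is a separate, explicit step:

```sql
CREATE TABLE dbo.Parent (ParentId int IDENTITY PRIMARY KEY);

CREATE TABLE dbo.Child
(
    ChildId  int IDENTITY PRIMARY KEY,
    ParentId int NOT NULL
        CONSTRAINT FK_Child_Parent REFERENCES dbo.Parent (ParentId)
);

-- Index the foreign key column explicitly to speed constraint checks,
-- cascades, and joins back to the parent table.
CREATE NONCLUSTERED INDEX ix_Child_ParentId ON dbo.Child (ParentId);
```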

If you ask me, it's a no-brainer to index your foreign key columns. How often do you have a database with foreign key relationships that don't fall under one of the 3 above scenarios??