I read this while on holiday and could not put it down. A compelling unifying theory describes and explains the psychology of ordinary people's decision making, especially where and why we are not always logical.
There are many practical aspects that can be applied to project estimation, sales and relationship building, but mostly I found this book interesting because it's full of the fascinating quirks and foibles of the human mind. These are laid bare through surveys and experiments that reveal effects such as anchoring, priming, re-framing, time-insensitivity and many other delights.

OneNote

Office OneNote (part of Office 2010) is a great, fast authoring and note-taking tool. My kids use it at school for notes. You can cut and paste multiple media formats without worrying about layout. Notebooks can be shared on SkyDrive or SharePoint.

The Problems

The problems I had were twofold:

I could never get OneNote to open my SkyDrive notebook on my corporate desktop machine. When I tried to open a web-based notebook in OneNote, it asked me to authenticate my Live ID and failed. This happened repeatedly and I could never get it to work. I could authenticate fine through the browser to access Hotmail and other browser-based apps that need a Live ID. However, OneNote has a special authentication window that pops up, and this always failed.

Last week I returned to my corporate desktop at the office, and after a restart I could no longer connect to Live Mesh. I tried various things, including a complete uninstall and re-install of Live Mesh.

Anything to Learn

Clearly I am no network guru, but this is what I learned:

WinHTTP is a programming interface to HTTP services on Windows. The WinHTTP proxy configuration can differ from the IE proxy configuration set in IE/Options/Connections/LAN Settings – which is the only setting I knew about, and I suspect the only one most people know about.

Here is where I am guessing:

There seems to be a service whose job is to keep these IE settings and the WinHTTP settings in sync, but sometimes this goes wrong and needs to be fixed with NETSH.
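If the two proxy configurations have drifted apart, these are the NETSH commands commonly used to inspect and fix the WinHTTP side – a sketch only, run from an elevated command prompt:

```
:: Show the current WinHTTP proxy settings
netsh winhttp show proxy

:: Copy the current IE proxy settings into the WinHTTP configuration
netsh winhttp import proxy source=ie

:: Or remove the WinHTTP proxy entirely (direct access)
netsh winhttp reset proxy
```

The import command is the one that re-synchronises WinHTTP with the IE settings most people actually maintain.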

SSIS provides flexible and varied methods for configuring its packages. These allow operators or developers to change a package’s behaviour at run time to adapt to changing environments.

While SSIS is flexible, I personally found there was a lack of guidance as to the best choices. It took some time and a good deal of trial and error to find techniques that worked best. This article aims to distil that experience.

Only Create Configurations for Changeable Items

Use configurations for items that are likely to change when the operating environment changes – for instance, when a system is moved to test or production, or when a server name changes.

Do not create configurations for items that could conceivably change but are unlikely to do so. For instance there would normally be no need to create configurations for database names or the relative paths to individual files.

Paths to Files should be an Expression

Usually a connection manager for a file path should use an expression referencing a configurable file share concatenated with a relative path.
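For example, the ConnectionString property of a file connection manager might be set by an expression like the following (the variable and relative path are illustrative):

```
@[User::ImportFileShare] + "\\Products\\ProductList.csv"
```

Only the ImportFileShare variable then needs a configuration; the relative path stays fixed inside the package.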

Configurations should be re-used

When a number of SSIS packages reference the same servers and file shares, the configurations of these packages should be coordinated, such that each logical server/file share has a single configuration that applies to all relevant packages. Note the term logical server: sometimes a single server is used for multiple purposes – for instance, to house both sales data and logging data. In this case a judgement call must be made on the likelihood of logging and sales remaining on the same server in all environments. If they are likely to be deployed on separate servers in some environment (e.g. test), then there should be a separate logical sales server and logical logging server.

Avoid creating lots of logical servers – only do so if it is probable that a separate server will be required.

This requires planning, but pays off in simpler management of the packages at execution time.

It is usually better to use configurations to set variable values rather than setting the connection manager properties directly.

Each configuration should set the value of a variable. This allows multiple connection managers to use the configured variable. For instance multiple file connection managers can share a common root file share via a variable. Similarly multiple database connections of different types can share the same server name via a variable.
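As a sketch, a database connection manager's ConnectionString can be derived from the configured variable by an expression like this (the variable name and provider are illustrative):

```
"Data Source=" + @[User::SalesReportingServer] + ";Initial Catalog=Sales;Provider=SQLNCLI10.1;Integrated Security=SSPI;"
```

Several connection managers, even of different types, can reference @[User::SalesReportingServer] in their own expressions, so a single configuration entry updates them all.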

Choose the Name with Care

The name of the variable being configured should be chosen with care, because the name is global to the entire suite of packages. Do not use the physical server name; use a logical name where possible. For instance, use the name SalesReportingServer, not Dev09.

Use Matching Names

Where possible the internal name of the configuration, the variable name and the external name of the configuration should match. This allows people maintaining packages to quickly identify and line up configurations.

The internal name of a configuration is the name given to the configuration within a package. By the external name I mean the name of the filter for a SQL configuration, or the name of the XML file for an XML configuration.

Choosing between XML, SQL and Environment Configurations

SSIS supports four main types of configurations:

XML files

SQL Server

Environment variables

Parent Package

There are many options and combinations.

One problem with configurations is that, used in an uncontrolled way, they can become confusing. It's best to set and stick to some simple standards.

A simple scheme that I have found works well is to use XML configurations located in a standard folder on the C: drive (say C:\SSISConfig). The C: drive is chosen because it's generally available on servers as well as developers' machines.
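For reference, a minimal XML configuration file in such a folder looks roughly like this (the variable name and value are illustrative; real files generated by SSIS also carry a header element):

```
<?xml version="1.0"?>
<DTSConfiguration>
  <Configuration ConfiguredType="Property"
                 Path="\Package.Variables[User::SalesReportingServer].Properties[Value]"
                 ValueType="String">
    <ConfiguredValue>DEVSQL01</ConfiguredValue>
  </Configuration>
</DTSConfiguration>
```

Each package that uses the SalesReportingServer variable simply points its XML configuration at this one file.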

A slightly more sophisticated scheme I have used for one client is a combination of SQL Server configurations and environment variable configurations. The location of the SQL Server (server and database) containing the main configurations is set by environment variable configurations.

Use Parent Package Configurations for Passing Parameters

Parent package configurations allow a child package value to be overridden by the calling parent package. This mechanism allows parent packages to pass “parameters” to children. For instance a child package may export all data for a particular catalog and so takes the catalog ID from the parent. The parent package may have a loop and iteratively call the child passing different values for the catalog ID.

Parent configurations should not generally be used to set connection manager properties, such as server, database and file shares. One reason to avoid using parent package configuration for connection manager properties is that validation occurs before parent package configurations are set, so validation errors could occur in the child package before the “correct” value comes from the parent.

Overview

Auditing of user actions is a common business requirement. Typically users change data on a record-at-a-time basis using an interface that allows maintenance of a single record. In this case each change results in a single entry in an audit log. Sometimes, however, a user will initiate a batch job that results in a change to many records at a time. This document describes a design pattern for auditing in the latter situation.

To distinguish this from other auditing techniques I will call this snapshot auditing.

High Level Design

The snapshot audit design pattern uses a single generic table to hold a record of the user action and a snapshot table for each table that is being audited.

In the examples that follow the single table is called Ctrl.LogSnapshot and the individual snapshot tables are called Ctrl.LogSnapshot<tableName> where <tableName> is the name of the table being audited.

A single generic stored procedure call can be made to create the required log entries in both tables.

The same stored procedure also implements a simple archiving scheme: currently, snapshots are removed when they are over one month old and at least three more recent snapshots exist for the same table.
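As a sketch, the two tables and the archiving rule might look like this in T-SQL (column names are inferred from the example data below and should be adjusted to the real schema):

```sql
-- Generic action log: one row per user action
CREATE TABLE Ctrl.LogSnapshot (
    SnapshotId   int IDENTITY(1,1) PRIMARY KEY,
    SnapshotTime datetime      NOT NULL DEFAULT GETDATE(),
    UserName     sysname       NOT NULL,
    TableName    sysname       NOT NULL,
    [Message]    nvarchar(500) NULL
);

-- Per-table snapshot: a full copy of the audited table per snapshot
CREATE TABLE Ctrl.LogSnapshotMyProductMapping (
    SnapshotId   int         NOT NULL,
    RowNumber    int         NOT NULL,
    PackCode     varchar(20) NOT NULL,
    InternalCode varchar(20) NOT NULL
);

-- Archiving rule: delete snapshots over a month old when at least
-- three more recent snapshots exist for the same table
DELETE s
FROM Ctrl.LogSnapshot s
WHERE s.SnapshotTime < DATEADD(month, -1, GETDATE())
  AND (SELECT COUNT(*)
       FROM Ctrl.LogSnapshot n
       WHERE n.TableName  = s.TableName
         AND n.SnapshotId > s.SnapshotId) >= 3;
```

The matching rows in each Ctrl.LogSnapshot&lt;tableName&gt; table would be deleted in the same pass.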

Example

For example, suppose a user is making changes to a product mapping table dbo.MyProductMapping. The table maps an external product pack code to an internal product code, which allows external transactions to be processed and matched against the internal product database. The batch job first validates all codes and, if everything is OK, updates MyProductMapping and creates the logs.

We have two tables to hold the logging:

Ctrl.LogSnapshot – holds a record for each user action

Ctrl.LogSnapshotMyProductMapping – holds a copy of the entire dbo.MyProductMapping table for each snapshot

The relationship between these tables is shown in the following diagram (although these relationships should not be enforced on the database, because of the necessary archiving of the snapshot logs described later).

Data from LogSnapshot

Snapshot ID | Snapshot Time    | User  | Table                 | Message
12          | 2011-06-17 17:02 | USER1 | ProductMarketHierachy | Loaded file ProductMarket.xls
13          | 2011-06-17 17:34 | USER1 | MyProductMapping      | loaded file ExportMappedProductsFixed.xls
14          | 2011-06-20 11:21 | USER2 | PolicyTolerance       | loaded file ExportTolerances.xls
15          | 2011-06-20 11:30 | USER2 | PolicyTolerance       | Loaded file NewTolerances.xls
16          | 2011-06-20 11:33 | USER3 | PolicyTolerance       | Loaded file NewTolerances .xls
17          | 2011-06-20 11:33 | USER2 | PolicyTolerance       | loaded file NewTolerances.xls
18          | 2011-06-20 11:51 | USER4 | CustomerMapping       | loaded file ExportMappedCustomers.xls
19          | 2011-06-20 11:55 | USER1 | MyProductMapping      | loaded file ExportMappedProductsFixed.xls

Data from LogSnapshotMyProductMapping

This table only contains data from snapshots #13 and #19.

Snapshot ID | Row Number | Pack Code | Internal Code
13          | 1          | AAX11     | 100391
13          | 2          | AAY11     | 100392
13          | 3          | BBC22     | 100393
…           | …          | …         | …
19          | 1          | AAX11     | 100391
19          | 2          | AAY11     | 100392
19          | 3          | BBC22     | 100395
19          | 4          | CCZ22     | 100396
…           | …          | …         | …

An advantage of snapshot logging is that it is decoupled from the actual updates themselves.

Other logging methods to which this can be compared include Triggers and Envers.

Here, Triggers means the creation of database triggers on the target table, so that INSERT, UPDATE or DELETE statements result in the creation of an audit entry.

Sometimes data needs to be imported into systems using an Excel file format.

This article is about understanding the problems encountered when using Excel for data import, and some ways to avoid them.

Let me say straight up that while I love Excel, there are issues with using Excel files for data transfer. In theory it should be easy, but in practice it's fraught with problems. Typically, Excel is not the best format to use for data transfer. Far better to use a simple open format such as fixed-width, comma- or tab-separated text files, or XML.

Why is Excel used as a data transfer format?

This is usually because users need Excel to modify or view the file. Most users know Excel, and it gives them the opportunity to view or modify a file before it's loaded into another system. This is especially true when the user is a competent spreadsheet user who needs some ability to sort, compare, count or calculate with the data. Often Excel is a cost-effective alternative to having a custom application developed to manipulate data.

So, in some cases the benefits of using an Excel format for data transfer outweigh the problems. But let's get better acquainted with these problems….

The Problems

Problem #1: Mixed Data Types in a Column Cause Missing Data.

Internally, Excel stores numbers and text as different object types. The Excel OLEDB driver is used by most programs (other than Excel itself) to read Excel files. The OLEDB driver decides whether a column is text, numeric or memo (>255 characters) by examining the contents of the first 8 rows: the column is given the type of the first non-empty cell. If the first non-empty cell is text, the driver treats the column as a text column and looks only at text cells in that column. Numeric and memo data in a text column is ignored, as is text data in a numeric or memo column, and so on. There is nothing the programmer can do about this, apart from bypassing OLEDB and using another access method.

Even SQL Server Integration Services (SSIS) suffers from this problem, because its Excel data source uses the OLEDB driver under the hood.

Problem #2: Users Change the Structure of the Workbook

Excel is very flexible, and it's easy for users to add extra worksheets, remove or re-arrange columns, add an extra row at the top, or change column headings.

Any of these things will cause a data upload to stuff up.

The problem can also occur when data is loaded into Excel from a .CSV or text file, edited and then saved.

Problem #3: Lost leading zeroes

Codes with leading zeroes are common in business systems (e.g. employee ID 0004578). These leading zeroes are silently truncated by Excel when the string 0004578 is typed directly into a cell or into the formula bar: Excel ends up with the number 4578 instead of the text string 0004578.

Note that leading zeroes will also be truncated when data is loaded into Excel from a .CSV or text file and then saved.

So problem #2 and problem #3 are both due to using Excel to edit data, rather than to the Excel format itself.

Solutions

My first rule is to avoid using the Excel format unless users need to use the Excel tool to edit the data.

Given that Excel must be used, we will deal with these problems in reverse order.

Problems #2 and #3

These can only be dealt with by user education and good error handling in the data load.

Users can be taught to:

Ensure that leading zeroes are captured (by prefixing the value with a single quote)

Leave the worksheet name alone

Leave the column names alone

Don’t introduce extra columns or remove columns

Error handling can attempt to validate that the expected columns are present, that codes match existing master tables where possible, and that values fall within expected ranges.

The feedback of errors to the correct users can be tricky if the data load is occurring unattended or asynchronously.

Problem #1

In an MSDN article, Microsoft describes a work-around that involves changing the registry so that more than 8 rows are read to determine the data type. This is not a great solution, because there can still be mixed types in a column.
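For reference, the registry values in question live under the Jet Excel engine key (these are the commonly cited settings; the path differs for the newer ACE provider):

```
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel
    TypeGuessRows     (DWORD)  default 8; 0 means scan the whole sheet
    ImportMixedTypes  (String) "Text" reads mixed columns as text, but
                               only takes effect when IMEX=1 is set in the
                               connection string's Extended Properties
```

Even with these settings, a column that genuinely mixes types is still at the mercy of the driver's guess.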

I think you need to either force users to enter data of a single type consistently in every row, or bypass the Microsoft Excel Jet/DAO/OLEDB driver.

Given the problems with the Microsoft OLEDB driver, I feel it's best to bypass it and find another BIFF reader.

Alternative #1: ExcelDataReader

I recently tried this library, available on CodePlex. It's a lightweight and fast library written in C# for reading Microsoft Excel files ('97-2007).

It is open source and works on .Net 2.0 and later as well as mobile and mono.

I used both the convert-to-DataSet approach and direct use of the reader. It worked fine for the two files I was working with (a customer master of about 8000 rows and a product file of about 500 rows). It recognised data even in mixed columns, and was able to convert the data whenever C# had a valid cast. I tested this from the file upload click event on an ASPX form.
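A minimal sketch of that upload handler, using the old CodePlex ExcelDataReader API (the control names and column name are illustrative):

```csharp
protected void btnUpload_Click(object sender, EventArgs e)
{
    // Read the posted .xls with ExcelDataReader rather than OLEDB
    using (Stream stream = fileUpload.PostedFile.InputStream)
    using (IExcelDataReader reader = ExcelReaderFactory.CreateBinaryReader(stream))
    {
        reader.IsFirstRowAsColumnNames = true;

        // AsDataSet returns one DataTable per worksheet
        DataSet ds = reader.AsDataSet();
        DataTable products = ds.Tables[0];

        foreach (DataRow row in products.Rows)
        {
            // Mixed-type columns come back as whatever each cell holds,
            // so convert explicitly where a valid cast exists
            string packCode = Convert.ToString(row["PackCode"]);
        }
    }
}
```

For .xlsx files the same library offers CreateOpenXmlReader in place of CreateBinaryReader.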

Alternative #2: The Office Automation API

Using the Office API to read the spreadsheet data will get around the problems of mixed types, and also provides a huge amount of functionality.

However, the Office API is not typically chosen to provide an Excel data upload, for these reasons:

The Office API error handling is not suitable for unattended execution (http://support.microsoft.com/kb/257757): "Microsoft does not currently recommend, and does not support, Automation of Microsoft Office applications from any unattended, non-interactive client application or component (including ASP, ASP.NET, DCOM, and NT Services), because Office may exhibit unstable behaviour and/or deadlock when Office is run in this environment."

Office needs to be installed on the machine running the upload. A server would not typically have Office installed, and Office requires licensing and service packs.

Each version of Office has a different API, so although the load will probably work across versions, there may be some versioning problems.

Alternative #3: SpreadsheetGear

Another possibility for error-free reading of spreadsheets is SpreadsheetGear.

It can be used for unattended execution. This is a great product that has worked well for me in a project that required programmatically pushing values into a spreadsheet to get calculated results.

SpreadsheetGear does require a license, however, and is probably too expensive and feature-rich for what should be just a lightweight utility.

I have spent a lot of time migrating database changes between environments to support the various stages in the software development lifecycle. I have succeeded in automating aspects of this, and it has always been something of a holy grail of mine to create better automation in this area. Now a product has been released that seems to have achieved this, and done it very well.

The Scene

Almost all serious software development now makes use of source control, and many projects use continuous integration. Code written in C# and Java is nicely controlled using products such as SVN and TFS. Developers can view the history of check-ins and releases, see what changes were made, revert to old versions, and so on. Source control also provides the basis for continuous integration using products such as TeamCity, which marshal scripted builds and tests. Some products, such as TFS, go even further and provide traceability between source changes and project tasks and requirements. All lovely stuff, and it provides massive productivity gains!

There is however one major component that has been slow in coming to the party: the database.

A lot of software development involves the design and development of database objects such as tables and sometimes views, functions and procedures. In addition to this structure, a database often contains static reference data that needs to be present in order for a system to operate. Examples are transaction types, or user privilege types that need to be stored as records in the database. The changes in the database are typically co-dependent with code changes, so that related changes in code and database should be released at the same time, and the system fails if they are not.

The Problem

So here is the problem:

Today it is rare that database development and release is smoothly integrated with the development and release of code. This lack of integration and standardisation creates a number of difficulties:

Mistakes often occur due to poor synchronisation of changes between code and database

To make this concrete, let's use the example of a team developing a financial modelling product in C#, which stores financial models and client information in a SQL Server database. The team uses TeamCity, so that when developers check in a change, the code is automatically built and tested on a server. Check-ins can easily be tagged, and a release built and deployed to a test or production environment. Now a change is made, involving the addition of a new field to the model and a change of another field from integer to string. To support this change, a column is added to the FinancialModel table in the database, and another column is changed from Int to Varchar. When the code change is deployed, the system is broken unless the database changes are also deployed. Conversely, if the database changes are deployed, the system is broken until the code changes are deployed.

The database changes would ideally be managed in the same way as code, using source control and automated deployment. But this is not so easy, and here is why:

Database objects don’t exist as simple files that can be versioned.

Database deployment cannot simply involve the copying and replacement of the target objects, because this may involve loss of user data. For instance, deploying a change to the FinancialModel table cannot involve simply replacing the FinancialModel table, because valuable data would be lost.

Deployment of database changes involves the creation of alter and update scripts. Often these are hand-crafted, although there are tools that can automate the production of database synchronisation/deployment scripts. Examples of these tools are Red Gate SQL Compare for SQL Server, and Toad and SQL Developer for Oracle.
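For the FinancialModel example above, a hand-crafted deployment script might look like this (the column names are invented for illustration):

```sql
-- Add the new field introduced by the code change
ALTER TABLE dbo.FinancialModel
    ADD RiskCategory varchar(20) NULL;

-- Widen the field whose type changed from integer to string in code
ALTER TABLE dbo.FinancialModel
    ALTER COLUMN ModelCode varchar(20);
```

Unlike simply replacing the table, these statements preserve the existing user data while applying the structural change.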

The Solution

Now there is a new tool that takes the next step and integrates the database with source control. This is an important enabling technology for software developers, because it allows database objects to be included in the continuous build and release cycle that works so well with code.