Tuesday, June 12, 2012

The Error

I'm importing from a text qualified, pipe delimited flat file with a the following columns (Name (Datatype, maxlength)):

CustomerIdentifier (WSTR, 120)

GroupName (WSTR, 120)

UserName (WSTR, 100)

UserStatus (WSTR, 15)

UserType (WSTR, 20)

My package was chugging along, until it failed.

Error: 2012-06-12 09:38:17.73

Code: 0xC02020A1

Source: InsertUserRecords _ Pipe Flat File Source [870]

Description: Data conversion failed. The data conversion for column "UserName" returned status value 4 and status text "Text was truncated or one or more characters had no match in the target code page.".

End Error

Error: 2012-06-12 09:38:17.73

Code: 0xC020902A

Source: InsertUserRecords _ Pipe Flat File Source [870]

Description: The "output column "UserName" (888)" failed because truncation occurred, and the truncation row disposition on "output column "UserName" (888)" specifies failure on truncation. A truncation error occurred on the specified object of the specified component.

I highlighted the offending column. The text (from the first 'S' to the last '0') is 99 characters. The column width is 100, so it should have fit. I loaded it up in a hex editor and went through every character to see if there were any spurious nonprinting characters. Nope.

The Problem

Out of curiosity, I deleted a character and tried to import it into a test database. That worked.

Let me reiterate: It worked with 98 characters, with a column width of 100. If you count the text qualifiers (the " character on either side of the string), that's 100 total characters.

I put the missing character back and deleted the " characters. I then changed the Flat File Source such that the UserName column was NOT text qualified, and tried again. 99 characters, no quotes, and it worked. I put the " characters back, turned Text Qualified back on, and it failed.

The Conclusion

The only conclusion I can draw from this is that SSIS initializes a column 100 characters wide, attempts to insert the whole string into it (text qualifiers included), and then strips the text qualifiers out.

The Solution

Here's how I wound up handling the issue gracefully. I set the UserName column width to 102 characters, to allow for 100 characters and the text qualifiers. In the Data Flow task where I handle that Flat File Source, I have a Derived Column transform adding a new column, ShortenedUserName (WSTR, 100) = Subtstring(Username, 1, 100). In theory there should be no instances where UserName.Length > 100 characters once the text qualifiers have been stripped, but I'm not going to take that chance in production. I'm also going to have it write to an error file whenever it encounters this scenario, so we can double-check the data integrity in the database.

Update

It appears this is only the case if the text qualifier is not set on the file level, but Text Qualified is set to True on the column level.

Second Update

I was wrong. Text qualifier was not set in the General tab. Observe:

The value I circled and put arrows toward needs to be set to the actual text qualifier. If it isn't, well, then all hell breaks loose as I described above.

Monday, June 11, 2012

Someone gave me a directory of compressed text files that had been compressed using the built in Windows zip program. That's a terrible idea for two reasons:

A bunch of separate .zip files will not compress as efficiently as one .zip file with all the text files inside of it.

Windows has no built in faculty for unzipping a bunch of files.

Bandwidth is cheap, so I'm ignoring the first point for now. How can I quickly and easily unzip a bunch of .zip files? Using Cygwin. Cygwin is a *nix shell that sits on top of Windows, and it works spectacularly. You also have to install the unzip program for Cygwin, gunzip won't work.

Here's a tricky gotcha:

$ unzip *.zip

Archive: blah1.zip

caution: filename not matched: blah2.zip

caution: filename not matched: blah3.zip

caution: filename not matched: blah4.zip

caution: filename not matched: blah5.zip

Doesn't work. I Googled around, and here's the trick:

$ unzip \*.zip

You have to use the \ in front of the * in order for unzip to figure out the wildcard. Now I have a bunch of text files to process.

I hate this time of year (late May, early June). So many references to gifts for "Dads and Grads". As a college graduate who took way too long to actually graduate, that kinda sticks in my craw. To combat this, I wrote the Dads & Grads Eliminator Greasemonkey script. It's not too special, but install it if you like.

Very simply. It gets all the elements in the page, iterates through them, and replaces all the permutations of "Dads and Grads" I could think of with "bar". It works surprisingly well. I actually had to disable it to write this post, and to upload it to UserScripts.org.

Update: Not even I'm using this script any more. (un)Surprsingly, it breaks some things (like Blogger), and takes a really long time to go through every element of complex pages. Still, a fun introduction to Greasemonkey.

The Symptoms

Some geeks on StackOverflow pointed me in the right direction. Environmental variables sometimes don't persist when you sudo. I su'd as my hadoop user and ran the command again. Success!

The Cure

Of course, that's not the end of it. I went to start Hadoop using:

/usr/local/hadoop/bin/start-all.sh

I can't remember the exact error I got. Namenode and Jobtracker started right up, but Datanode, SecondaryNamenode and Tasktracker didn't. I did some digging, and the ones that worked are part of the Namenode, started by hadoop-daemon.sh. The ones that didn't are part of the Hadoop Datanode, and are started by hadoop-daemons.sh. The processes that were not starting all had error logs complaining about, guess what, JAVA_HOME not being set. Finally, I bit the bullet and hard-coded JAVA_HOME in conf/hadoop-env.sh.

Tuesday, June 5, 2012

Background
As I understand it, Oracle retired the Operating System Distributor's License for Java, meaning that Canonical could no longer include the JDK or JRE in their APT repositories. This means no more

I ran into a bunch of Google search results detailing how to install Java using PPA repositories, but I couldn't make any work. Finally, got fed up and decided to install Java the hard way.

Download Java
The first obvious step is to download the Java installer. I went here, clicked "Accept", and copied the hyperlink for Linux x64 (64-bit) (jdk-6u32-linux-x64.bin). On my linux machine, I ran

wget http://download.oracle.com/otn-pub/java/jdk/6u32-b05/jdk-6u32-linux-x64.binA file, jdk-6u32-linux-x64.bin appeared in my home directory. Per these instructions, I set the file executable and attempted to execute it.

chris@linux01:~/bar$ chmod +x jdk-6u32-linux-x64.binchris@linux01:~/bar$ ./jdk-6u32-linux-x64.bin./jdk-6u32-linux-x64.bin: line 1: html: No such file or directory./jdk-6u32-linux-x64.bin: line 2: head: No such file or directory./jdk-6u32-linux-x64.bin: line 3: title: No such file or directory./jdk-6u32-linux-x64.bin: line 4: META: No such file or directory./jdk-6u32-linux-x64.bin: line 5: link: No such file or directory