Wednesday, December 28, 2011

A few weeks ago I stumbled on this article "A co-Relational Model of Data for Large Shared Data Banks" ("coRel" hereon) in the on-line version of ACM Queue (this is the Association of Computing Machinery's magazine). The article was authored by two employees (Erik Meijer and Gavin Bierman) of Microsoft.

The authors summarize their thesis as thus: "Contrary to popular belief, SQL and noSQL are really just two sides of the same coin." The problem with this article, is that the author's are asking the entirely wrong the question. They are looking at the world with a very narrow and single minded perspective of data - a perspective which was conventional wisdom up until the 1970s (academically), and the 1980s (commercially).

In technical terms, the authors are basically asking this question: If it is possible to implement a Network Model using a Key/Value [NoSQL] database, and it is possible to implement a Network Model using a Relational database, then can the two be queried and modified by the same declarative language? The answer to this question is a resounding yes. Unfortunately, the Network Model is not perspective neutral, which is why the Relational Model was invented.

Backing up a big, allow me to explain what I mean. The Relational Model takes a perspective neutral approach, and regards all entities, no matter how insignificant they may seem, as "first class citizens". Other data models such as the Network Model and Hierarchical Model lock the data into a given perspective and make certain entities "first class citizens" and others "second class citizens". For example, if you have ever organized your inbox e-mails into folders, or documents on your computer into folders, you have probably chosen a certain hierarchy. Maybe you organized your folders by customer, so that way when a customer asks a question, you can quickly go to the right folder and find all the necessary information. But what if a project manager comes to you and starts asking questions about a particular project, and that project cuts across customers? What normally happens here is you start searching through each customer folder looking for e-mails or documents that pertain to the project. Most people will just copy (or create short-cuts) to those project documents to another folder. We've all been through these searching and sorting exercises. When software developers are confronted with the same problem, they pretty much do the same thing - they reorganize (or refactor) the data.

However, if the data were Normalized (i.e. modeled relationally) to begin with, no such reorganization would be necessary.

The author's (and the majority of software developer's) myopia is apparent in their view of history, the example data model they provide, and even in computer science theory.

Let's talk about history first. coRel has this to say:

Codd's relational model and SQL allowed implementations from different vendors to be (near) perfect substitutes, and hence provided the conditions for perfect competition. Standardizing on the relational model and SQL created a secondary network effect around complementary producers such as educators, tool vendors, consultants, etc., all targeting the same underlying mathematical principles. Differences between actual relational database implementations and SQL dialects became to a large extent irrelevant.

While it's true that standardization around SQL led to wide adoption, such standardization had already emerged prior to the introduction of the relational model. Namely, CODASYL (the same body that created COBOL) developed a standard around the aforementioned Network Model, often referred to as the Data Base Task Group (DBTG). Much of what you see in modern SQL standards actually comes from this standard - in particular the separation of DDL (data definition language) from DML (data manipulation language). However CODASYL vendors (and there were a lot of them), were blindsided by Codd's relational model.

Interestingly, Codd's original language was not SQL, but rather Alpha. Also, the first two major RDBMS vendors had competing standards: Ingress used a language called Quel; and Oracle and IBM used SQL. But because Ingres was always based on the relational model, it was able to simply slap on support for SQL. Ingress lives on to this day in the form of PostgreSQL. The other non-RDBMS vendors also live on to this day, but tend to serve particular niches (e.g. IBM's IMS is still heavily used in banking). There was nothing inherently special about SQL, and other relational languages are still around and continue to be invented. What is special is the underlying Relational Model.

For a better explanation of why the Relational Model entered the marketplace, here's a passage I scanned in from the article "The Commercialization of Database Management Systems, 1969-1983" found in the IEEE Annals of the History of Computer, Volume 31, Number 4, October-December 2009

I am now going to talk about the sample data model used by coRel. The example is based on an example taken from Amazon's SimpleDB. Here is what the original data looks like, as described by Amazon:

Here is how the authors model this using an object model (essentially a Network Model):

Here is how the authors model this using a Relational Model:

If you're an experienced relational modeller, you will observe that the model is not in BCNF (Boyce-Codd Normal Form), as the Ratings and Keywords entities two of the entities have overlapping candidate keys.

Here is what the data model should look like in BCNF:

From the perspective of the Product Catalog application, this change seems somewhat academic. However, if we extend our normalized data model to include Tweets retrieved via Twitter keyword searches - which is important from the perspective of a marketer - things get more interesting. Here's the updated model:

What is important to note in this normalized model is that all entities are "first class citizens". As an Product Catalog application developer, I can ask questions or make changes to Keywords and Products, without having to involve Tweets. Or, as a marketer I can ask questions about Tweets and Products without having to involve Products.

When data is seen in a larger context with many different perspective, the relational model makes sense. While it may be more efficient to model data for a particular perspective (i.e. the product catalog application) using a Network or Object model, the same model can be very inefficient and lead to anomalies and contradictions in the data. This point is lost on many developers, since most only deal with a single perspective of the data. The following paragraph in coRel makes this very clear:

Summarizing what we have learned so far, we see that in order to use a relational database, starting with a natural hierarchical object model, the designer needs to normalize the data model into multiple types that no longer reflect the original intent; the application developer must reencode the original hierarchical structure by decorating the normalized data with extra metadata; and, finally, the database implementer has to speed up queries over the normalized data by building indexes that essentially re-create the original nested structure of the data as well.

See the problem? There is rarely such thing 'natural' hierarchy. Perspective is everything, and depending on how we view an ontology, we can ascribe many different hierarchies.

Now, you may be wondering if it is possible to represent a relational model in a NoSQL database, such as a key/value store. The answer is: sort of. While it is possible to recreate the structure of the relational model, it is not possible to centralize the integrity of the relational model. This is not a trivial point. Referential integrity (and other forms of integrity, such as uniqueness, nullability, and value domain constraints) are what ensure the correctness of ad hoc queries. When such constraints are removed, it is up to the application developer to examine the underlying data and perform numerous tests to ensure its integrity. The end result is poor data integrity and poor data quality. I can speak quite frankly on this last point as I see the difference between poorly constrained data models and well constrained data models, all the time. Just like the second law of thermodynamics, when unconstrained, over time data entropy tends to infinity.

In other words, the relational model allows us to manage information as a separate concern. You might even say this is the whole point of the relational model.

I now want to move on to the theoretical aspects of the paper, in particular the question of compositionality. CoRel's authors define compositionality as: "the ability to arbitrarily to combine complex values from simpler values without falling outside the system" they go on to say

SQL is rife with noncompositional features. For example, the semantics of NULL is a big mess: why does adding the number 13 to a NULL value, 13+NULL, return NULL, but summing the same two values, SUM(13, NULL), returns 13?

A more precise definition of compositionality comes from Wikipedia which states "An important aspect of denotional semantics of programming languages is compositionality, by which the denotion of a program is constructed from denotions of its parts."

SQL guarantees compositionality since it doesn't have any side effects. Contrast this with most concurrent programming languages (e.g. Java, C#, Python), and they do not guarantee compositionality since it's possible to write modules which impact the other modules.But I don't think the authors were thinking along these lines. They're really arguing that SQL is inconsistent and point out the example with the NULLs.

The reason why there is this perception of non-compositionality is that sets and tuples are treated as primitives. You cannot make an aggregate function out of tuple functions, and you cannot make a scalar function out of aggregate functions.

NULLs are controversial to this day, and Codd even wanted to take things a step further, and distinguish between "unknown but applicable" and "unknown but inapplicable".Codd's basic argument for the inclusion of NULLs (and three-value logic) can be summarized as thus: In the real world, handling unknown values is inherently complex. Instead of thrusting the complexity back to the user, the RDBMS should handle unknowns "correctly" - in so far as the behavior correctly models real world behavior. This can result in NULLs being counter-intuitive, but just because something is counter-intuitive it doesn't mean its wrong (think flat earth intuition, round earth reality).

For example, let's say you are hotel manager and you want to know the average number of days a guest stayed for. Assuming you have a database where front reception can log check-in and check-out times. Obviously when you haven't checked out, the check out time is unknown (or doesn't exist), so that attribute would be NULL. When taking the average of checkin [minus] checkout, you will only be including rows where both the check-in and check-out times are known, since when tuple contains an unknown element [unless stated otherwise] the tuple as a whole cannot be known, and should be eliminated from the set. This is what you want. Putting it in a SQL query, it would look like this:

SELECT AVG(DATEDIFF(CheckOutTime, CheckInTime))FROM BOOKINGS

As you can see, it's a very simple query to write and validate, and reflects the correct handling of NULLs. Quite the opposite of a "big mess".

But the whole question of compositionality is also missing the point with SQL. SQL is not a computational language - it's a data retrieval language, based on the relational model. When we're talking about data, compositionality is not our main concern - normalization is. Data which is not normalized, is like a program which is non-compositional. Nasty side effects can and will arise.---The funny thing about the relational model is that it is predicated on Relational Algebra which is completely orthogonal to the Universal Turing Machine. The former is about logic, and the latter is about flow. They are not in competition.

But when I hear people say things like "it's about time somebody built a better database than those stupid RDBMSs", it's akin to saying "it's about time somebody build a better Universal Turing Machine". Makes no sense really.

Before I conclude this post, I want to share with you an excerpt from Joe Celko's "Thinking In Sets" which is very telling:

Many years ago, the INCITS H2 Database Standards Committee (née ANSI X3H2 Database Standards Committee) had a meeting in Rapid City, South Dakota. We had Mount Rushmore and Bjarne Stoustrup as special attractions. Mr. Stoustrup did his slide show with overhead transparencies (yes, this was before PowerPoint was ubiquitous!) about Bell Labs inventing C++ and OO programming, and we got to ask questions.

One of the questions was how we should put OO features into the working model of the next version of the SQL standard, which was known as SQL3 internally. His answer was that Bell Labs, with all their talent, had tried four different approaches to this problem and they came to the conclusion that it should not be done. OO was great for programming but deadly for data.

Summing up. While I am being critical of the coRel paper, there was clearly a lot of thought that went into it, and the authors come across as being intelligent and having a good pedigree. My point is that there is an institutional bias towards application-centric data modeling, which comes at the expense of perspective neutral data modeling - i.e. the Relational Model.It has been my experience that this bias has led to a great deal of friction between software or application architects, and data architects. Much to the frustration of both.

My greatest hope is that by educating students at an earlier age, this deep rooted bias can be avoided. This, I will point out is a long running project of mine. A blog for another day.

Sunday, October 09, 2011

A common problem faced by database developers involves keeping schemas and data in sync between environments. The problem isn't even restricted to development. Often databases must be kept synchronized in operational scenarios. Data is often replicated for reasons of high availability and performance.In this posting, I want to illustrate a real-world scenario that frequently comes up, and a modeling approach that ensures this scenario never turns into a problem. Even if you have not encountered this scenario, it's worth reading this blog in its entirety for tips on better data modeling, and why the relational model is still very relevant.

As anyone who has had to keep their calendar or contacts in sync between devices, or for developers accustomed to using Version Management systems like Visual Source Safe (VSS) or Subversion, the problem really boils down to the following three requirements:

If the data is in the source data store but not the target, then add it to the target.

If the data is in the target but not in the source, then remove it from the target.

If the data is both in the source and the target, overwrite the target's attributes with the source's attributes.

Simple right? Indeed everything is simple... IF you have identifiers which are consistent between source and target data store. For the purposes of this blog posting, I'm going to focus on the use case of syncing data between a development server and a test server, which is equivalent to syncing between a test server and a production server.

I am not going to get into the intricacies of syncing calendar or personal contact data - that scenario is actually quite different since there is never a universally agreed upon key or way of identifying individuals, so you really don't have a proper key to compare against. That's why most PIM synchronization software tends to take a "fuzzy" approach when syncing up contacts. Companies like Facebook and LinkedIn are in a privileged position to address this ongoing problem, as they are effectively becoming a de facto registry for contact information. But I digress...

For this posting, to make my examples more explicit I'll refer to the Microsoft technology stack, although all of the major RDBMS vendors support the same functionality and tools.

The crux of the problem when synchronizing most data comes down to the alignment of keys (or identifiers if you prefer).

A common conundrum that comes up when modeling data is defining the right primary key. For example, I can create an employee table and use the employee's social insurance number as my primary key. Or perhaps I don't have access to the social insurance number, and instead use telephone number and first and last name as my primary key. In the latter case, people are known to change their name or their telephone number. The key is effectively out of our control, in order to have reliable key we end up creating what I refer to as a "technical key". Technical keys are often auto-generated from incrementing sequences, but they can just as easily be generated from a GUID. It is common to find auto-generated primary keys because not only is the data modeler in complete control of the keys values, but they are also compact, especially when referenced as a foreign key from other [dependent] tables. Data synchronization problems begin here simply because the technical key depends on the database or application that generated the key. If you import the same data into a different database for the first time it can and probably will have a different auto-generated key. I am well aware that it is possible to load data into other databases while preserving keys, and if you're taking a "master slave" approach to data synchronization you have nothing to worry about, since you're effectively just copying the data over. However if you already have data loaded in the target table and you simply need to update a few columns then you're going to run into problems.

A good solution to this problem is to define an alternate natural key. Basically this is just a UNIQUE index on one or more columns. For our above example, this could be comprised of name and telephone number. Given that the natural keys can be different, the developer should perform some analysis to assess and quantify the risk that natural keys are different.Once you have determined the natural keys are in sync, you can [as an example] use Visual Studio 2010 Schema Compare to perform a data schema comparison which will generate a SQLCMD script (assuming you are promoting schema changes) and you can also use Visual Studio 2010 Data Compare to generate a SQL script to perform all the UPDATEs, INSERTs, and DELETEs. In VS 2010 you can choose whether to use a table's primary key, or one of the alternate natural keys you have defined.

I should point out that if your table doesn't have any set of columns (apart from the primary key) which can be guaranteed to be unique, and the primary keys themselves are generated and therefore database server specific, you should try your best to rectify this situation as you now have a more fundamental problem, which I won't be addressing in this blog (hint: You need to start looking at the target's change history).

So far so good. But what if you have a dependent/child table whose natural key depends on the parent table? Let's say there is a table called EMPLOYEE_INVENTORY which is a list of items that have been provisioned to the employee. The natural key for this table might be the composite of EMPLOYEE_ID and the INVENTORY_ITEM_ID (and the Primary Key for EMPLOYEE_INVENTORY is an auto-generated key). To keep things simple, the Primary Key and Natural Key for INVENTORY_ITEM is one and the same, and that it is a universal SKU # of the inventory item. So in summary, the primary key for EMPLOYEE_INVENTORY "ID" is a technical key, and the natural alternate key is "EMPLOYEE_ID" + "INVENTORY_ITEM_ID"

Now we have a bit of a problem when it comes to keeping EMPLOYEE_INVENTORY in synch. Namely, half of its Natural Key is derived from the Auto-generated Primary Key in its parent table (EMPLOYEE). Should we wish to synch based on Natural Keys we're forced to develop code to perform lookups, comparisons, UPDATEs, INSERTs, and DELETEs - we can no longer rely on Data Synching software like VS 2010 Data Compare to do this for us. While there is nothing inherently complicated about this, it will invariably take you a chunk of time to write this code, whether you do it in an ETL tool like SSIS, or stick to a procedural code. If you have another dependent/child table which in turn depends on EMPLOYEE_INVENTORY (e.g. EMPLOYEE_INVENTORY_LOG) things get more complicated and you're spending considerably more time to complete the task.

The preferred approach is to create what is known as an UPDATABLE VIEW which will allow you to substitute the parent table's technical primary key for its natural key. Just to clarify, an UPDATABLE VIEW is exactly what its name says it is: a VIEW you can UPDATE. As you can imagine, there are limitations as to which VIEWs can be updated (clearly anything with an aggregate would not be updatable). For our scenario though creating an UPDATABLE VIEW might look like this:

For SQLServer if you want to make this VIEW UPDATABLE, there are two things you need to do. First, you need define it with the "SCHEMABOUND" option. Second, you need to defined a CLUSTERED UNIQUE INDEX on the natural key columns of the view (i.e. "TELEPHONE_NUMBER", "FIRST_NAME", "LAST_NAME", "INVENTORY_ITEM_ID").

You are now in a position where you can use data synchronization tools like VS 2010 Data Compare to automatically synchronize the data for you. Because everything we haven't had to write any procedural code, we can focus instead on problems which arise from the data itself (e.g. a natural key mismatch), as opposed to betting bogged down in throwaway code.

Problem solved.

********

What I want to illustrate in this blog is the power that comes from mindful modelling and adhering to relational principles.

In the last couple of years, NoSQL databases like MongoDB, Redis, Cassandra, Google DataStore have flourished. Indeed, these databases provide significant advantages for application developers in terms of scalability. They also feel like an ORM layer, but with much greater efficiency, so they are very desirable to application developers. And not to be ignored, many of these new databases are open source and can be used for little or no money. Case-in-point I'm planning on building a new hobby application using Google Data Store since I get up and going without paying a cent.

The downside of these modern database technologies is that they suffer from many of the same limitations that plagued pre-relational database developers. Namely, there is a "perspective lock-in". What this means is that once the application developer has modeled data for their application's use cases, it may be difficult for future applications to use the data for their own purposes. It will also be difficult to run any sort of ad hoc queries without first exporting the data to an analytical RDBMS.

This is not to say that NoSQL databases should not be used. This is to say that they should be chosen with eyes wide open and a clear understanding of the trade-offs involved. In fact, I believe the most compelling reason to use a NoSQL database is for low cost (in particular hardware costs). Let me repeat that: If you absolutely cannot afford to scale using an RDBMS and you can't see yourself bootstrapping yourself along, then go with a NoSQL database.

Put another way, virtually every NoSQL innovation I've seen, has been absorbed into an RDBMS. Take for example binary large objects (BLOBS) like videos, music, and documents: Microsoft (and presumably others) now allow you access these objects directly through the file system, while allowing them to be managed transactionally through the RDBMS. Analytical databases like ParAccel an Vertica allow for Petabyte scaling - and still allow full relational capabilities. Contrast this with Google DataStore which doesn't even support basic JOINs or aggregations (GROUP BYs). This means you have to write this code on your own. Not only is that going to be cumbersome and error prone, it's also going to perform worse as well. This is why you're beginning to see "SQL layers" Google Tenzing being added to NoSQL databases to speed up the commonly requested tasks. Even Facebook, which started the Cassandra project, still uses MySQL (combined with Memcached in a sharded configuration).

There's also the notion of BASE versus ACID, and the types of business models each can tolerate. But that's a discussion for another day...

Monday, December 27, 2010

Earlier this year I blogged about an alleged case of silver manipulation as raised by Andrew Maguire to the CFTC. If you haven't already, I urge you to read that post before continuing on with this post.

I ended my post with a challenge to any serious journalist to properly investigate Andrew Maguire's case. So far I have yet to see anything resembling a real investigation undertaken by the media establishment. While I cannot say for sure why this is given how much of a juicy story the Andrew Maguire case appears to be. I am beginning to see an interesting pattern emerge that may partly explain why this is. I must say that I'm a bit surprised by my own findings, but they are enough for me to draw conclusions.

For a period of time the only newspaper that reported the Andrew Maguire story was the New York Post. You can read the first report here which was first published in March (shortly after Maguire blew the whistle). A follow-up report can be found here. However, a third story was published shortly after I posted my last blog entry which in hindsight is most interesting. You can read it here. What's interesting about the last post is that one week following its original publication, a correction was issue which basically nullified the entire story. Namely, JPMorgan responded by saying there was no investigation to begin with. The correction is posted at the bottom of the story. The NYPost never countered.

Fast forward to October 8th of this year. Reuters reported that CFTC commissioner Bart Chilton had publicly announced that the CFTC was investigating silver manipulation. Oddly, the statement wasn't from the CFTC itself. Rather Chilton felt the public deserved some answers, speaking on his own accord. You can read that story here.

By the end of October, Reuters published a follow-up story reporting that the CFTC was in particular investigating JPMorgan and HSBC for silver manipulation. You can read that story here. This would appear to directly contradict the statements made to the NYPost back in May, that they were NOT being investigated for silver manipulation. JPMorgan declined to comment on the matter.

A day later, Reuters reported the first of at least five class action suits lawsuit filed by investors targeting JPMorgan and HSBC. Here is the article, and here is a scanned in copy of the court filing as filed in the Southern District of New York. About a week later, a second class action lawsuit was filed. You can read the court filing here. A week after that, a third law firm posted a press release announcing they were investigating silver manipulation by JPMorgan and HSBC on behalf of their clients. As a follow-up, Reuters posted another article providing historical context to these cases. The article begins with the obligatory mention of the Hunt brothers, some mention of "gold bugs" and "conspiracies", but no mention of Andrew Maguire himself.

To be fair to the media establishment, on November 3rd Andrew Maguire was mentioned by name in this Wall Street Journal article describing a fourth lawsuit against JMorgan and HSBC. You can read that article here. Apart from that brief mention in the WSJ, most newspapers either leave out mentioning Andrew Maguire by name, or they describe a "London based Metals Trader".

For those that are curious about how the manipulation works, a fifth class action lawsuit was just launched on December 28th, which describes in detail HOW the manipulation works and how JPMorgan and HSBC profited from it. It also describes WHY they were in a unique position to do so.

Let's pause for a moment and ask some questions. Why is it that only the NYPost originally reported on the story? Why didn't NYPost continue to follow the story they claim to have broken after it developed? Why doesn't Reuters mention the name of Andrew Maguire? Why doesn't the WSJ and Reuters print or mention the original e-mail transcript of the correspondences by Andrew Maguire to the CFTC? Is this just a one-off oversight?

Perhaps.

But around the same time Bart Chilton's announcement was made, another interesting and related story came to light. Namely, The Washington Post reported that one of the two CFTC judges (Judge Painter) had announced his retirement and that he had issued an Order requesting that all seven of his open cases NOT be transferred to the other CFTC judge, Judge Levine. I have quoted the key paragraph of the one page order (which you can read here) below so you can see for yourself:

There are two administrative law judges at the Commodity Futures Trading Commission: myself and the Honorable Bruce Levine. On Judge Levine's first week on the job, nearly twenty years ago, he came into my office and stated that he had promised Wendy Gramm, then Chairwoman of the Commission, that we would never rule in a complainant's favor. A review of his rulings will confirm that he has fulfilled his vow. Judge Levine, in the cynical guise of enforcing the rules, forces pro se complainants to run a hostile procedural gauntlet until they lose hope, and either withdraw their complainant or settle for a pittance, regardless of the merits of the case. See Michael Schroeder, If You've Got a Beef With a Futures Broker, This Judge Isn't for You - In Eights [sic] Years at the CFTC, Levine Has Never Ruled in Favor of an Investor, Wall St. J., Dec. 13, 2000, at A1 (copy attached).

In the last sentence of the previous paragraph, Painter cites an article that appeared on the front page of the Wall Street Journal back in 2000. You can read that article following Painter's Order here. It is a well researched investigation into the conduct of Judge Levine and sure enough does raise some important and relevant questions. It's not entirely one sided, but it does put Levine on the defensive. You would not be unreasonable to think the Wall Street Journal would publish a follow-up article vindicating their original investigation. Right?

Wrong. The Wall Street Journal has taken an entirely different and rather underhanded tack. Namely, Sarah Lynch penned an article which is a blatant smear on Judge Painter, describing him as a mentally ill alcoholic who would sleep at work and who was a failure in his private life. Sandwiched in the middle of the article we find a very brief mention of Painter's Order against Levine, which is immediately followed by "Judge Levine declined to comment, but a former colleague defended Judge Levine's record and said he is fair." The article continues to smear Painter to the very end quoting a doctor that diagnosed him with "cognitive impairment, alcoholism and depression". The tone and content is more in line with something you might read in The National Enquirer or US Weekly. The article seems highly uncharacteristic and rather beneath a newspaper like the Wall Street Journal with such a distinguished history.

Reading Lynch's WSJ article after reading Schroeder's original investigative article and the recent Washington Post article left me feeling stunned. The only thing I will say in the article's defense is that it does help to explain why Painter kept this to himself for so long. I suspect his enemies were holding a grenade over his head, and clearly they've pulled the pin.

Lynch wrote back to note she did a story on the judge last Friday, but it ran on newswires but was not picked up by WSJ. (Reporters have no control over those editorial decision). The current article is a follow up to that prior piece.

This would all suggest that Lynch's article was taken out of context and something immoral is going on within the higher ranks of the WSJ with respect to reporting around the CFTC and the silver manipulation story.

******

I recently went to see the movie "Inside Job". The film documents the reasons that led up to the 2008 economic meltdown, the impact it has had so far, and why nothing has substantially changed in terms of policy and regulation. What surprised me was how interconnected the corruption is. There is no grand conspiracy of the sort you might think is required to fake the moon landing, or plan 9/11 as an inside job. Rather, the conspiracy is of incentives and motives which readers of Freakonomics will quickly recognize.

The movie describes the interconnectedness between the big investment banks (e.g. Goldman Sachs, JP Morgan, Morgan Stanley, etc.), the Federal Reserve, the government executive branch (both Bush and Obama), the government regulators (i.e. SEC and CFTC), the rating agencies (i.e. Moodys, S&P, and Fitch), and even the prestigious business schools (e.g. Harvard, Columbia, Wharton, etc.). On that last one, I have to admit I never realized that so many business professors were on the payrolls of these banks. Actually, they're usually not on a direct payroll, but rather work as consultants through intermediaries like Analysis Group. Inside Job argues that influence into academia is so pervasive that most of these school's curriculum has morphed to reflect the will of the banks, and tends to be very negative in view of regulation.

In light of this I would argue then that the media establishment, which has become highly consolidated is also susceptible to the same conspiracy of motives that has adversely influenced some of the world's most respected economic professors. In this blog post I have attempted to present evidence supporting my theory.

Newspapers like the Wall Street Journal are generally regarded as pillars of the fourth estate. I no longer believe the WSJ should be trusted and I'm sure others feel the same way. Since 2008, trust in governments, banks, rating agencies, business schools, and the media has been in steady decline. This is not good for any society.

I don't believe there is any silver bullet to resolve this. However, by continuing to shine as much light on media biases I believe it is possible for us through Citizen Journalism and other forms of grass roots reporting to get out of these dark days, hold the media to account, and in turn hold the government, banks, and corporations to account. Looks like an uphill battle, but I'm an optimist.

Monday, August 23, 2010

Over the past couple months I've been jumping between three different ETL tools (well PowerShell is not exactly an ETL tool but has some overlapping functionality). The experience has given me new perspective on the strengths and weaknesses of each tool. I hope to share with you my experiences, opinions, and recommendations.

This report is not going to be as structured, or cover as many tools as something you might find from a Gartner report. My approach here is to get into specific experiences with the tools as well as discuss why I think ETL tools are important to begin with. My perspective is not intended to be a definitive decision making tool, but rather a useful component to the decision making process when choosing an ETL tool.

Before getting into my experiences I'll give you a brief overview of what ETL is, and how these tools fit into the ETL landscape.

ETL stands for "Extract Transform Load". ETL tools first started showing up in the mid nineties as a response to the growing demand of data warehouses. The first major ETL tool was Informatica, and continues to be one of the best ETL tools available on the market today. The main reason ETL tools were invented was that it has traditionally been time consuming and error prone to extract data from multiple source systems, merge those data together, and load into a data warehouse for reporting and analytics.

To this day, the majority of techies in IT are unaware of what an ETL tool is. However those same people are often tasked with solving the very problems ETL tools are designed to solve. More experienced developers will achieve their goals through a combination of shell scripts and SQL scripts. Less experienced developers will simply stay in their "comfort zone" and use whatever programming language they happen to be the most familiar with, with little regard for software maintenance and support.

The task itself of writing a computer program to extract data, merge it (or perform some other transformation), and loading those data into a target data warehouse is conceptually very simple for the average developer. However, it is not until you start running into the following problems do you realize that the "follow your nose" approach doesn't work so well. Here are some of the issues you will likely run into when maintaining ETL jobs for a data warehouse (I've run into them all). In other words, here is what an ETL tool is designed for:

Code readability: This is probably the biggest difference between an ETL tool and pretty much every other programming language out there. It is driven off of a visual metaphor. When I first saw an ETL job, it reminded me of one of those crazy Rube-Goldberg machines. However, once you acquaint yourself with the iconography, and what all the connecting lines mean, it becomes very easy to look at an ETL job and understand what it is doing, and how it works. Unfortunately, many developers scoff at this visual approach, after all "real coders write real code". I also wonder if many developers are afraid of developing ETL jobs as it lays their code bare for all to see. Managers on the other hand love this visual approach, as it allows them to scrutinize and partially understand how code is working. Furthermore, an ETL tool will visually show your data coursing through it from one step to the next, much like observing water flow through tributaries into rivers into lakes. When a full load takes several hours to complete, it is very reassuring to see what is actually going on with the data.

Data Element mapping: Much of what you're doing in data warehousing involves mapping a source data element to a target schema. This is tedious work and it's potentially error prone - especially if you're lining up an INSERT statement with a SELECT statement. ETL tools make it very easy and safe to line up and map data elements in a fraction of the time you would normally take.

Impact Analysis: Agility is often seen as a function of management. However, if it takes you a month to figure out what has being impacted as a result of a minor schema change (e.g. adding a column to a table), then you're going to get stuck in your tracks before you have the chance to yell "Scrum!". ETL tools make it easy to identify where data is being sourced from, and where it is being targeted. Some ETL tools will even produce impact reports for you. Keep in mind that this only one aspect of impact analysis, but every little bit helps.

Incremental loading: By default most developers will develop code to perform a full load on a data warehouse. This code is not only easy to develop, but is fairly reliable. Unfortunately, full loads can take a lot of time to complete, and as time goes on will take longer and longer. Switching to an incremental load is tricky since it involves figuring out what data is new, changed, or has been removed. Depending on your source system, your options may be limited. Nevertheless, ETL tools provide functionality to assist with different scenarios

Parallelism: Writing concurrent software is not for the "faint of heart". There is an inherent overhead required to manage semaphores and other forms of inter-process communication. Furthermore, the threat of deadlocks or livelocks in complex systems can often only be discovered through trial-and-error. There are modern frameworks which greatly simplify parallelism in languages like Java or Python. However it is hard to argue against the simplicity of the ETL approach when developing parallel data processing applications. Furthermore, some ETL tools (e.g. IBM DataStage Enterprise Edition) can even parallelize JOINs. This makes it possible to outperform an in-place JOIN within an RDBMS like Oracle (most SQL queries still run single threaded).

Checkpointing and job recovery: Imagine you have just developed a script to perform a large data warehouse load. Let's say you run the script on a weekly basis, and the load takes 8 hours to run (you run it over night). One morning you arrive to find the script has failed 80% into the job. Some of the tables have been updated, but there are a few that haven't been updated. You can run the job from the beginning, but this is problematic since you need to back out all the data you have already inserted (otherwise you'll get duplicate data). Alternatively, you can pinpoint the point of failure and run from there. However, this requires a code change, which means you will need to test your modified code before getting it into production (and let's not forget the psychological pressure when people are waiting on fresh data). ETL tools make it relatively easy to checkpoint your code, and resume from a failed step. Some ETL tools have built-in checkpointing which means you don't have to even instrument your code for checkpointing. Furthermore, if code changes are required [due to a failed job], they are more often a change in configuration, rather than a real change in code, so it is generally safe to make a configuration change and resume the job with less risk than traditional scripting. That said, not all ETL tools support checkpointing. However, even ETL tools which don't explicitly support checkpointing can be more easily instrumented to checkpoint than traditional scripting or programming languages.

Logging and monitoring: I've already touched upon this point with respect to monitoring. To reiterate, ETL tools allow you to see (seeing is believing) data flowing from one step to the next. It's powerful data visualization that is not unlike a floor plan or map visualization. Namely, it can communicate far more information than a bar chart or line chart ever will. As for logging, ETL tools by default centralize logs and implicitly logs what you need to log (with configurable views of granularity), without having to instrument your code.

Centralized Error Handling: Scripting languages don't provide centralized error handling (this has more to do with the OS and legacy applications), so you're forced to always explicitly check for errors and call the appropriate error handling routine. ETL tools often include generic error handling routines which are invoked regardless of where the error originated from or how it is raised.

Slowly Changing Dimensions: Although I've listed this as #9, perhaps this should be #1 from a data quality perspective. If you don't implement a Type II (or higher) SCD policy, you're precious data warehousing investment will most certainly degrade over time. Slowly Changing Dimensions are crucially important, as they allow you to show a consistent view of history. I could go on at greater length as to what an SCD is, but if you're not sure just read the Wikipedia article. At any rate, ETL tools have built-in SCD steps which allow you to define the technical key, the natural key, the effective begin and end date fields, and in some cases a version field and "current version" flag field. Without an ETL tool SCDs are a major pain in the butt to manage, and developers will often not implement a Type II SCD policy (often because they aren't aware of what an SCD policy is), but also due to the added effort. Don't forget the classic cop-out: "Hey, if the business analyst doesn't ask for it, I'm not going to build it". SCDs are one of those classic business requirements that few people realize they need until it's too late.

Pivoting. It is increasingly popular to see data being stored and logged in what are known as Entity Attribute Value data models. I generally prefer to see explicit data models over generalized data models (which effectively treat the database as "bit bucket"). But they're here to stay, and in some cases actually make sense. However, it is virtually impossible to analyze EAV data without pivoting it back into columns. This is another task which is well served through an ETL tool, and which can be very cumbersome if you intend to accomplish through traditional scripting. In short, an ETL tool allows you specify the grouping ID (essentially the primary key for the row), the pivot key column, and the various pivot values you wish to map to columns. Furthermore, if you need to extract data out of some kind of BLOB or something like an XML or JSON document, this too is easily achieved through standard ETL steps. If you were to do this through scripting, you would probably have to rely on temporary tables and plenty of code. In other words, just like with Slowly Changing Dimensions, there are fewer Function Points to worry about with an ETL tool

In-memory batch manipulation: Often when manipulating data sets in bulk, many developers will rely on temporary tables, and from there perform batch operations against those temporary tables. While this works, and is sometimes even necessary, ETL tools allow you to more easily process large batches of data as part of a continuous in-memory pipeline of records. It is this notion of a "data pipeline" which also distinguishes ETL from other programming paradigms, such as scripting approaches (even those scripting languages that support piping operations.

Distributed transaction control: While it is possible to create distributed transactions using a transaction coordinator, often developers who eschew ETL tools will also avoid using other high level tools, such as a distributed transaction co-ordinator. Some ETL tools include built-in integration with distributed transaction co-ordinators. However, in my own experience I have never had the need for two-phase-commit in an ETL tool. I suppose if I needed to ensure that my data warehouse was always 100% consistent, even during load times, then this makes sense. However, I would urge caution when using transaction managers, since large transactions can easily lead to locking, not to mention they perform horribly under full load scenarios, and a single COMMIT statement can take hours (depending on how you've configured indexes).

Before I get into my current experiences with PowerShell, SSIS, and Pentaho DI (PDI), I'd like to point out that there is also something called ELT (Extract Load Transform). ELT differs from ETL in that it tends to rely on the target DBMS to perform the transformations. ELT tools achieve this through code generation. Sometimes they are referred to as "code generating ETL" versus most other ETL which distinguishes itself with the moniker "engine-based ETL". ELT works like this:

Popular ELT tools include the open source TalenD and Oracle Data Integration (formerly Synoptic). The strength of this approach is that the target DBMS can be leveraged to achieve superior performance. Furthermore, it is possible to override the code generation and hand-tune the generated code.

If you target a robust DBMS (e.g. Oracle), the results can be impressive. However, since you are relying on the target DBMS, your options are more limited. Yet this is not the main reason I am wary of the ELT approach. My worry with ELT tools is that they can easily break the visual metaphor by relying too much on hand generated code. Furthermore, during the transformation stages it is impossible to see what is going on (I prefer maximum transparency during long running transformations). Over time, your ELT jobs can devolve into something resembling scripting.

I should point out that when I first started developing transformations and jobs using Ascential DataStage (now an IBM product), I found that most data was being read from and written to the same Oracle database, where I was working. At my workplace I was quickly able to run circles around most ETL developers using bread-and-butter SQL queries. Indeed, Oracle has one of the best (possibly THE best) query optimizer out there, and few of my peers understood how to take advantage of DataStage's parallel extender, and were probably fairly average ETL developers. But since that time I have come full circle and am a firm believer in the visual metaphor. Yes, I may sacrifice some performance, but at the time I was doing this I didn't have the perspective that I'm giving you in the aforementioned 12 reasons to use an ETL tool.

PowerShell

Getting back to my recent experiences, I'll start with PowerShell. To be clear, PowerShell is NOT an ETL tool. It's a shell scripting tool. When I first discovered it (as part of MS SQL Server 2008) I was very impressed with what I saw. There was a time many years ago when I was a proficient UNIX shell script programmer. As any UNIX buff knows, mastery of the Bourne/Korn or C Shell (and all the standard command line tools) is the fastest route to becoming a UNIX power user. Occassionally I'll still fire up Cygwin (I used to use MKS toolkit) and will pipe stuff through sed or awk. Vi is still my favourite text editor.

I was therefore very eager to embrace PowerShell as a potential ETL disruptor (one should always be on the lookout for disruptive technologies). There's a lot to love about PowerShell. As much as I will diss Microsoft for its monopolistic practices, they are one of the few true innovators of the back-office. PowerShell is no exception. Let me quickly point out some of the more powerful features of PowerShell:

Objects can be piped: Piping in UNIX is a way of life, but it can also be tedious as you're forced to serialize and deserialize streams of text, in order to achieve the desired result. PowerShell allows you to pipe .NET objects from one shell app to the next. For example, it's possible to pipe a result set, or directory listing as an object. This comes in handy quite a bit. For example, it is very easy to retrieve a particular set of columns from a query's result set, without having to sift through and parse a whole bunch of text

Snap-ins can be used to move between shell contexts: Currently the only snap-ins I've used are for the OS shell, and the MS SQL Server shell. The OS shell is the default shell, and listing "entities" will basically just list files and directories as you would normally expect. But while in SQLServer mode, the shell lists databases and tables as though they were directories and files. While most DBAs are accustomed to running database commands using a query shell window, PowerShell goes far beyond what is easily possible using a basic SQL (or T-SQL) query interface.

Parsing files is easy: Going back to the object pipes, it is also very easy to pipe the contents of a file, and parse those contents as a file object. PowerShell has also reduced the number of steps required to open and read a file, making such operations second nature. Let's face it, processing text files is as common activity in shell processing as ever (e.g. you need to parse a log file for an error message), so it's very much a pleasure to use PowerShell for these tasks

Great documentation: It's not often I compliment software documentation, but I feel it should be acknowledged that Microsoft even rethought those dry UNIX "man" pages. Perhaps it's just one technical writer at MS making this difference. But whoever you are, hats off for having a good sense of humour and making the dry and technical a pleasure to read.

With all that said, I've decided that PowerShell in its current form (I used version 2.0) has some major drawbacks from an ETL perspective. First off, by definition it's a "scripting language". So by definition it will never follow the visual metaphor. Depending on your perspective this is may be a good thing. But I for one am a huge believer in code visualization. While it is possible to run a script and examine log output in real time, it's not easy to tell what is happening where when tasks are running in parallel, and what the relationship between those tasks are.

The second major issue I have with PowerShell is that it simply cannot perform the function of what an ETL tool can do. Originally I had coded my ETL tasks as T-SQL Stored Procedures, which were basically wrappers for INSERT/SELECT statements. Unfortunately under high volumes, these queries caused database locking to occur. Apart from using TEMP tables, my only other option was to perform my JOINs within an ETL tool. This all but makes PowerShell a complete showstopper from a data warehousing perspective when compared to an ETL tool. Nevertheless, I did for a while attempt to call my individual ETL transformations within PowerShell, and instead treat PowerShell as a job controller.

The third issue with PowerShell is that I couldn't find a simple way to log everything. Depending on the type of error being raised I had to capture it in any one of three ways. Furthermore, I basically had only one easy option for where to put the log: into a text file.

Overall I am very impressed with PowerShell from the point of view that Microsoft has elevated scripting to the next level. But as an ETL job controller, it simply cannot do as good a job as an ETL tool like Pentaho DI or SSIS.

SSISAt first I wasn't overly impressed with SSIS. It struck me as overly architected, and too reliant on scripting steps, thus undermining the all important visual metaphor. I have come to believe that it's architecture is more elegant than meets the eye, but I still believe it could use a heck of a lot more built-in steps. For example, the lack of an INSERT/UPDATE step is a glaring omission. I also couldn't find any built-in step for hashing (which comes in very handy when comparing multiple columns for changes) or an explicit DELETE or UPDATE step (you must accomplish DELETEs and UPDATEs using an OLE DB transformation step). SSIS also lacks a Slowly Changing Dimension step. Instead, SSIS provides you with a wizard which will generate multiple steps for you out of existing components. I don't agree with Microsoft's approach here since it makes SCDs less configurable, and more reliant on moving parts. Contrast this with Pentaho DI which includes all of these steps, as well as supporting a single configurable Slowly Changing Dimension step. That said, it is possible to add in third party steps, or simply code the step using the built-in scripting components. However, the problem with third-party add-ins is that they complicate deployments. And the problem with over reliance on scripting steps is that they can't be explicitly configured, rather logic must be coded. As I keep mentioning, scripting steps break with the visual metaphor, so you have to either read the step's label or open it up to know what it means (ETL iconography makes it easily possible to see what is going on in a single glance).

Continuing my gripes with SSIS, there is some confusion as to what steps to use when it comes to database connectivity. Namely, SSIS 2008 offers both OLE DB and ADO.NET connectors. I'm currently using OLE DB, but should I be using ADO.NET now? I'm not too concerned about that dilemma. What drives me bananas is that they also include a SQLServer destination step. It turns out that this step only works properly if you're running your SSIS server on the same server as your SQL Server. For myself this is maddening, and I'm not sure why Microsoft even bothered including this step - I couldn't discern any difference in performance from the OLE DB destination step.

Okay, that's the bad news. The good news is that SSIS has some pretty nifty features making it worthwhile, especially if you're executing on a Microsoft strategy. I'll list out the strong points of SSIS, and what sets it apart from Pentaho:

Checkpointing and transaction management: This is probably my favourite feature of SSIS, when compared to Pentaho DI. Checkpointing is more or less what you would expect. If a job fails part way through, SSIS records (in a local XML file) the point of failure, as well as all of the job variable values. This allows an operator to investigate the cause of failure (typically by examining logs), take corrective action, and then restart the job from point of failure. In the world of ETL, this scenario is not uncommon. On top of checkpointing, SSIS also makes it possible to contain multiple steps as part of a single transaction, so that if any step within the transaction fails, previous steps are rolled back. This is accomplished through tight integration with Microsoft's Distributed Transaction Controller - a component of the Windows operating system. I have experimented with this feature, and have got it to work successfully, although at first it was less than straightforward to configure. That said, I've never had a need for this feature. Also, this transaction management (from what I can tell) only works at the job level, and not at the transformation level, which is where it's more needed. One last thing I should point out about checkpointing: Don't confuse job checkpointing with incremental loading. Incremental loading (as opposed full refreshes) requires the ETL developer to identify when data has been INSERTed, UPDATEd, or DELETEd since the last time the job was run, which requires upfront analysis and cannot be addressed solely by a tool.

Consolidated logging, and ability to log to Windows Event Log. All ETL tools produce a single log. Most ETL tools allow for varying levels of verbosity, all from a single setting. SSIS goes one step further and nicely integrates with the Windows Event Viewer which if you're a Microsoft shop is HUGE benefit. As I mentioned earlier, logging was a sore spot for PowerShell. So the difference between SSIS and PowerShell is day-and-night.

Consolidated error handling. Typically when a job fails (for any reason), you want your ETL job to fire off an e-mail to technical support. SSIS makes this very easy to do, as there is a single consolidated Error Handling job which you can develop for. Both PowerShell and Pentaho DI appear to lack this feature.

C# and VB.NET scripting steps: Most ETL tools include a scripting step. This makes it possible to perform arbitrary transformations, as well as doing other tasks like adding or removing rows from the pipeline, or even adding and removing columns. Pentaho DI relies on either Java or JavaScript for its main scripting step. Personally, I'm fine with just the JavaScript step in Pentaho. SSIS allows you to script in either VB.NET or C# (C# is only supported in SSIS 2008 or later) which if you're already a .NET developer is a major benefit. Furthermore, it's significantly easier for that same code to me made into a standalone data flow step component (which gets you back to the visual metaphor). My only warning with SSIS's scripting step, is that I can easily see developers overly relying on it to perform the majority of their transformations. As I keep saying: the visual metaphor is the most powerful concept behind ETL, and dramatically lowers post-implementation costs, such as support costs, and the cost of impact analysis.

Persistent look-up cache: This is new with SSIS 2008. This feature allows you to physically store data in a local binary file optimized for keyed look-ups. I've only gone so far as to test that this thing works, and can be used in a look-up step. I'm not exactly sure what the primary business driver is behind this feature. I would be nice if Microsoft would elaborate on what the recommend use cases are, and just as importantly, where NOT to use this feature. My concern is that developers may use this feature in bulk look-up situations, where a sort/merge/join would perform just as well if not better. Compare this to, Pentaho DI offers "stream lookup" step. Can't say which approach is better though.

Excellent SQLServer integration: This goes without saying. My only gripe here is the dedicated SQLServer target step which appears only to work if the job is running on the same server as the SQLServer DB. I found out the hard way that it won't work otherwise. While I could go on about some of the subtle benefits of the SQLServer integration (e.g. you can easily configure batch INSERTs into cluster indexed tables to perform very well), the biggest advantage is the same advantage you get whenever you homogenize technology under a single vendor. Namely, by using SSIS and SQLServer together, if something goes wrong Microsoft can't blame some other vendor. Nor can they say "well, it's really not designed to work for _that_ database)." These games that vendors will play with you is the main reason why IT departments prefer homogeneous architectures. On the downside, there is also very tight MS Excel integration. As of this writing though, this integration doesn't work when running SSIS in 64-bit mode. Maybe with Office 10 this will be fixed now, since Office 10 is the first version of Office to ship a 64-bit version.

Dynamic step configurations using expressions: SSIS includes a declarative expression language which allows you to dynamically configure most settings. The most obvious use of expressions is to configure the database connection string(s), based on a variable value (which is set at run-time). At a higher level, expressions make it much easier to develop generic packages/jobs which share the same code base, but have different source data stores and target data marts. I feel Pentaho DI is lacking in this department.

Built-in text analytics steps. This is one of the first things I noticed about SSIS. Basically there are two Data Flow (i.e. transformation) steps which support basic text analytics. The first is a keyword extractor. Namely, the step can be configured to extract relevant keywords from documents. These keywords can then be used to classify documents, making them easier to search against and report on. The second text analytic step allows you to search a given set of text against a list of keywords retrieved from a configurable database query. This step is very useful for doing things like sentiment analysis (i.e. determining if a block of text includes words like "cool" or "fun" versus words like "sucks" or "fail"), but it could also be used to flag sensitive information, such as people's names. While, Pentaho DI does not include any built-in steps like these, it does have a regular expression step (which SSIS does not explicitly include), but if I had to choose, I prefer SSIS's text analytics functionality.

So as you can see SSIS has some very powerful features. If you're already a Microsoft shop and already use MS SQLServer, and have a pool of developers and application support personnel already trained on Microsoft's technology, I doubt you'll find a better SSIS tool to meet your requirements. Even if you're not using SQLServer but are doing most of your development on a Microsoft stack, it's probably a very good fit. However, if you're on non-Microsoft platform, there are many other options you should to consider.

Pentaho Data Integration (PDI)

Which brings me to Pentaho DI. One of the most platform neutral ETL tools out there. I am an unabashed fan of Pentaho, partly I'll admit, because I have quite a bit of experience working with it. Before I get into what it is I like and dislike about Pentaho DI I'll explain how it's different from other ETL tools at a very high level.

Pentaho DI is an "Open Source" ETL tool, based on the Kettle project. However, unlike Linux which has many different flavours and supporting vendors, there is only one vendor, Pentaho, that supports the tool. This is not unlike MySQL which was only being supported through SUN and now Oracle. I'm fine with that.

Just to digress for a moment, years ago I read "The Cathedral and the Bazaar" which is basically a manifesto for open source development. It argues that people will self select projects to work on, and this will drive the future of software development. I don't agree with the overarching thesis, but there are merits to this argument. The problem with volunteer based projects is that people tend to volunteer for selfish reasons. In the case of Linux, it is seen by many as a way of combating Microsoft's perceived tyranny. Shared adversaries are after all the strongest bonds between humans (I heard that from an anthropologist, and believe it to be true). When it comes to other forms of crowdsourcing, people often derive enjoyment from contributing. For example, I've contributed to Open Street Map adding in streets and landmarks in my neighbourhood. If I had more time on my hands I'd probably be doing more of it. It's kind of fun. I can also see how millions of other people might find Open Street Map fun. And let's not forget how popular Wikipedia is. But the problem is that most of the work people get paid to do is not necessarily fun - even in the software world. Some things need to get done, even if nobody wants to do them. The other problem with open source projects, is that until they reach a critical mass, competing projects can pop up, which not only dilutes the overall pool of talent but also reduces the likelihood of any one project succeeding. For example, if there were many different open source operating systems at the time Linux began to emerge, all competing with each other, would we have Linux today? I'm not so sure. My point in all this is that I support companies like Pentaho backing open source projects, even if that means more corporate control over the project. The important thing is that the entire source code is code is available, and anyone can contribute to that source code. This therefore puts Pentaho in the position of being a _service_ company, which is really the future of software any way you look at it.

Okay, moving on to Pentaho DI itself. First, the bad news. As with any piece of software, there are bugs. Now, if you have a support contract with Pentaho, this is never a problem as they will provide you with a patch (and you will receive intermediate patch releases). It's also possible to install those patches yourself (or fix the bug yourself), but that involves building the application from source files. That said, all of the bugs I've encountered can be worked around. And to be fair, I haven't found any bugs yet in the latest version (version 4.0).

Now on to the good news. Here is what I love about Pentaho DI:

Plenty of built-in steps: Although I've had to resort to the JavaScript scripting step on more than one occasion (version 4.0 now includes a Java scripting step), most transformations can be entirely built with their built-in steps.

Web ETL steps built-in: PDI comes with several web extraction steps, include GET and POST lookups, WSDL lookups, and a step to check if a web service is even available. There is also a step for decoding XML documents. For JSON documents, I just use the JavaScript step (it's not too hard to find JavaScript code to decode JSON documents). I mention all of these steps because I believe web ETL is a big part of the future of ETL. This is especially true now with so many data services out there. Contrast this to SSIS, which is rather weak in this department.

Single Type II Slowly Changing Dimension step: Unlike SSIS (which doesn't have a single SCD step, but rather uses a wizard to generate multiple steps), PDI includes a single Type II SCD step. I've used it quite a bit, and it works very well and is simple and straightforward to configure.

Normalized database repository: The other ETL tools I have used (SSIS and DataStage) will allow you to save jobs in the form of a file, or within a database. However, I find those files to be relatively opaque and difficult to query (I suppose with the right tools it would be possible to query them). PDI on the other hand stores all of its Jobs and Transformations in a normalized database. This makes it possible to query your ETL code using standard SQL queries. I have taken advantage of this by writing queries to produce simple impact analysis reports, showing which tables are impacted by which jobs. Since impact analysis is a huge aspect of data Enterprise Architecture, this is a real benefit. Furthermore, because PDI jobs are stored within a centralized RDBMS database, I find that it is much easier to deploy and troubleshoot jobs in production than it is with SSIS. To be fair, SSIS does allow packages to be deployed to SQLServer, but their really just deployed as an opaque "BLOB" which can't be easily queried.

Java based (runs on any platform): There was a time where Java applications were not considered appropriate for applications requiring high performance, such as an ETL tool. Well, to my surprise, Java has caught up pretty closely to binary native compiled code. As such, the benefits of "write once, run anywhere" are starting to come into focus. Indeed, I have run PDI on both Windows and Linux machines without incident. I also can't tell the difference between the performance of SSIS versus PDI

Clustering/Cloud deployment: I honestly don't have any hands-on experience deploying PDI as part of a cluster, or within a cloud. Although I have heard some impressive benchmarks, with reports of over 200,000 records processed per second within a single transformation. Because it is possible to cluster PDI servers, this means it is possible to scale up individual ETL transformations across multiple server nodes. Furthermore, Pentaho is now beginning to offer cloud based solutions. This is also something I have no hands-on experience with, but I like what I'm hearing. Contrast this with SSIS, which is not "cluster aware" and must be installed as "standalone" in Windows Cluster. To be fair, I believe part of the reason for SSIS's shortcoming is that it is more transaction aware than PDI, and with that comes certain challenges. I also think that it is possible to run SSIS as part of a cluster, but Microsoft does not recommend this.

Well, that's my experience report. Hope you learned something. Feel free to e-mail me or comment any corrections. It's quite possible I've made an incorrect assumption, so if any PowerShell, SSIS or PDI developers are reading this, don't hesitate to chime in. I'll happily update this blog if I agree there is a problem with something I've written.

Monday, July 19, 2010

"What do you think of the long form census" I ask inquisitively, adding "You know, the census form, which one out of every five persons needs to fill out. In order to provide information about stuff like ethnicity, religion, and income".

"I don't want to give out information to the government if I don't have to." I'm sternly told.

"Why not?" I ask politely. "It will never get released." I elaborate: "I've worked quite a bit with the Canadian census and with StatCan. I know from dealing with them that it is impossible for them to release any private information." I qualify my argument further "StatCan has strict privacy regulations preventing them from releasing identifiable information of either an individual or a business. It's the law." I go on "they have a 'rule of 3' whereby information must be aggregated to at least 3 parties. Furthermore, they also have a 'round to 5 rule', whereby all respondent counts are rounded to the nearest 5". I natter on some more "there has never been a privacy breech, even for a terrorism related investigation. The chief statistician reports directly to the prime minister of Canada. The OECD ranks StatCan as the best national statistics body in the world, and their workers take great pride in all of this, and see privacy protection as their 'Prime Directive' - I know these guys."

No change. "I don't care. You never know what they'll do with this information. That's just what they tell you."

Hmmm, I had to dig deeper. "But they 'the big bad government' [not StatCan] already know how many bedrooms you have, where you live, what you buy, who you're friends are. If you believe they are corrupt, they can access bank records, your Internet history, your telephone calls, your property records. Even I can bring up Google Street Maps and take a look at your house. I can browse your Facebook profile [which Facebook surreptitiously has opened up unbeknownst to many]. They can check your vehicle registration, check your credit rating, your insurance premiums, your credit and debit card purchases. All they need to do is send a bunch of faxes and make a few phone calls. They have the authority already. You're too late." Feeling a bit frustrated, I become more forceful in my argument "The only way you can get around them is by living like The Unibomber Ted Kaczynski. You would have to abandon your home and all your possessions and start again. You'd have to drive up to northern Ontario in the cover of night, build a log cabin, and live there for the rest of your life, living off the land - even making your own soap. And there's no guarantee that will work."

It's no use, his mind is made up. No counterargument, just "I don't want to give out my personal information to the government."

I give up on this conversation and change the subject. I can see it's going nowhere. Thank goodness for blogs...

The problem is, without a census, it's really difficult to make educated policy decisions. The census was specifically designed to be analyzed in aggregate. It was not and is not intended to be used as a repository of personal information. Although detail records are required in order to generate accurate pictures at a higher level, and these analyses almost always are based around geography.

As you may well be aware, our Prime Minister, Stephen Harper has single-handedly dismantled one of the most essential and proud pieces of Canadian infrastructure: the Canadian Census Long Form. As such, he is hobbling all levels of government, including provincial and municipal, not to mention the thousands of businesses that depend on this information to make informed decisions like where to open their storefront. To make matters worse, he is greatly hindering foreign investors from investing in Canada. The biggest obstacle to making any kind of investment is basic information. Canada has traditionally been a very safe place for companies to invest, in large part due to the educated work force and middle class consumers. However, if you don't know where that workforce lives, or where those consumers are, it's a lot tougher to justify those multimillion dollar investments.

But the reason why Harper has decided to drop the Long Form census is that he knows there are a lot of people who have unfounded and irrational fears of the Census Long Form. Many of those people are immigrants who come from tyrannical governments who do abuse their power (although ironically these governments also don't need the census to locate and persecute).

Stephen Harper understand this hysteria all too well, and is using it bolster support for himself and his government, facts be damned. And in selling this hatchet job, he is using every dirty trick in the book. Last week the line was "the government doesn't need to know how many bedrooms you have in your house" - a clear attempt conflate the census with "the government should stay out of your bedroom", which was a common tag-line that comes up when sexual liberties are being discussed. Today it was "why does the government need to infringe on people's privacy?", a line that Glenn Beck or Bill O'Reilly may attempt to deliver with a straight face.

There is something also very "meta" about all this. Harper is effectively abandoning Evidence Based Decision making in order to... have everyone else abandon Evidence Based Decision making. It is shameless fear mongering, and it is bad for everyone, including those who don't like filling out forms.

On that note, I would like to conclude with a plea to reach out and help people understand why the census is so important, and why privacy fears are unfounded. A good start is to keep this conversation going. This should be our prime ministers job. Instead he would rather us be fearful of the bogeyman and hide under our beds.

Friday, May 07, 2010

In August 2008 economic world history was made, as a major world record was smashed. You probably never heard this news - even if you read the business section of a major newspaper, or work in the finance industry. Instead, what you may have heard was the consequence of this news. Namely: gold and silver prices hitting a major slump.

But behind those price slumps was the biggest and most concentrated naked short position ever held for any commodity. To give you some numbers, world silver production is 680 millon oz annually. The short position of 3 (or fewer) US banks increased 11 fold from July to August, with a naked short position of 169 million oz. That's 25% of all annual silver production! Gold followed a similar pattern and the same 3 banks increased their short position in August to account for 11% of all annual gold prodution. For those that aren't familiar with "short selling", it's the practice of selling something NOW (that you have borrowed), and paying for it in the future. So, if the price goes down you make a profit. If the price goes up you get "squeezed" and have to take a loss.

If you'd like to see the numbers for yourself, they're all available in the official Bank Participation Report for July and August. While it is possible that the banks themselves were not ordering the short-selling (but rather possibly clients), it is very unlikely that all short-sellers just happened to go to the same bank. It should be noted that this short-selling of silver caused the price of silver to halve from $20 down to $10.

So why would these banks be betting so heavily against silver and gold? Only the banks themselves know for sure, and they aren't saying much. Nevertheless, things worked out well for them and as a result of dumping so much bullion on the market at once, in effect they triggered a larger sell-off (in part due to stop loss orders) which created a prophecy that fulfilled itself. These banks more than likely generated huge profits.

Not surprisingly many people (especially "Gold Bugs") cried foul and claimed there was market manipulation. Independent Toronto based investor Harvey Organ submitted a written statement to the CFTC, which he accompanied with oral testimony at their March 25th hearing. You can see the video here (skip to 4:40:00 into the webcast). Silver analyst Ted Butler also wrote a piece on this event referring to it as The Smoking Gun.

Of course this doesn't actually prove there was any manipulation going on. A common retort to the conspiracy theorists was "If there was a conspiracy, surely somebody would spill the beans". Well, as if a Sasquatch were to walk out of the forest and into broad daylight, so arrived Andrew Maguire, a London based metals trader who mainly trades in the silver market. Maguire was privy to a very small network of metals traders (ostensibly with JP Morgan Chase at the helm) who would co-ordinate "takedowns" in the sliver market through the aforementioned "naked short selling" of huge concentrated quantities of silver contracts. Whether this constitutes long term or short term manipulation is certainly up for debate. But what Mr. Maguire's revelations appear to show (if they are true) is that manipulation is happening on a fairly regular basis, and at the great expense of honest investors. It should be pointed out that Maguire himself profited from these manipulations and is said to be a wealthy man.

For several months, Andrew Maguire provided evidence to the Commodities Future Trading Commission (this is the regulatory body [similar to the SEC] that regulates commodities trading, including gold and silver) that demonstrated how the manipulations were orchestrated, and how they played out. I have pasted the entire e-mail trail (which was later leaked to the Gold Anti-trust Action Committee) at the bottom of this blog. However, as you can hear in this audio interview, Maguire himself was prevented from attending the CFTC hearing on bank position limits on March 25th. Nevertheless, Adrian Douglas and Bill Murphy (of GATA) were able to attend, and were able to provide testimony on Maguire's behalf. You can see the footage on YouTube here.

But for me, what's most disturbing is that there has been no media coverage from any major newspaper or television network. And what's even weirder is that if you listen to the interview it's stated that interviews were lined up with Bloomberg and Reuters, but were both cancelled within 24 hours with no explanation. It is this fact alone that has prompted me to write this blog entry.

I have spent the last four weeks research this, and so far all signs point to market manipulation. In fact after a while it becomes hard to understand how the market could not be manipulated. To make matters worse, if you read the prospectuses on Gold and Silver ETFs (e.g. here is the GLD prospectus and SLV prospectus) your eyes will pop out at how complex and risky they appear to be. In fact just this week an excellent in depth investigation into these prospectuses was released which showed that there are major undisclosed conflicts of interest for JPMorgan (the custodian for SLV) and HSBC (the custodian for GLD). The report goes on to suggest that the ETF has been architected to effectively avoid any obligations around reporting these conflicts of interest. In some regards you could say if you wanted to design an investment product that avoids all forms of regulation, but which can be sold by trusted investment banks, you couldn't do a better job than SLV and GLD. What's also weird is that in Canada I can't even find the prospectus for IGT (the Canadian equivalent of GLD) on the Canadian iShares web site. And a search conducted just now on the Canadian ishares site for "IGT" returns nothing (but it was there in their web site 2 days ago, and a mention can still be found in the PDF). For a complete analysis of the SLV and GLD ETFs, I highly recommend you read the well researched report by Catherine Austin Fitts and Carolyn Betts.

One of the interesting items brought up in this report is the conflict of interest that the US Government itself has with respect to this manipulation. I quote the following paragraph:

The absence of these disclosures is particularly disturbing given the potential conflicts between the banks' responsibilities serving as custodians and trustee and their responsibilities and liabilities as members and shareholders of the Federal Reserve Bank of New York. The NY Fed serves as the depository for the US government and as agent for the Exchange Stabilization Fund on behalf of the U.S. Secretary of Treasury. The ESF allows the Secretary of the Treasury to deal in gold, foreign exchange, and other instruments of credit and securities. NY Fed member banks typically serve as agents of the NY Fed in providing services. In addition, JPMorgan Chase and HSBC maintain responsibilities as Primary Dealers of U.S. government securities. When called upon to defend the U.S. government’s interests in the bond market, or the U.S. dollar’s interest in the currency market, or to help prevent another financial meltdown, whose interests will be primary? Will it be the central bank and government with pressing national security interests or retail investors?

I was curious to see what if anything the US government has to say about the obvious decline in sliver prices due to that massive short sell in August of 2008. As such I referred to the US Geological Survey Annual Summary on silver for 2008 and 2009 (released in 2009 and 2010 reports, respectively). Given that they comment on changes on the average price of silver, I was curious to see what they would say about the precipitous drop in average price for 2009. Well, it's either complete ignorance or Orwellian (depending on your point of view), but the report states (contradicting its own reported numbers) that the price actually went up in 2009. And by up, they clearly mean down. Cue Seth Myers and Amy Poeler: "Really USGS! Really!!!"

To be fair, I have found the following rebuttal from Jeffrey Christian of The CPM Group (a consultancy which in part serves the bullion banks). While Mr. Christian's points are valid, they don't do much to put my mind at ease. Firstly, he mainly rebuts "red herring" arguments that aren't important, and doesn't bother to comment on the heart of Mr. Maguire's allegations or the activity in August 2008. Secondly, none of his explanations use much real data, instead relying on the abstract providing scant details of any names, numbers, or dates. I am skeptical of his sincerity in getting to the bottom of this problem. I think a truly dedicated consultant would say "Hey, I don't know for sure, but let's figure out how to get these facts". He doesn't go there.

So what's my point? Why am I blogging this?The reason I am blogging this is that I simply cannot figure out why no serious journalist has investigated Andrew Maguire's story. I mean let's face it: Either this is all one big hoax (which is a story unto itself). Or this is part of a massive fraud involving rigged markets and major international players. This is the kind of story journalists dream of. What's going on here people!

So, before I post a copy of Andrew Maguire's e-mails to the CFTC, I put forth the following challenge to any and all serious journalists out there:

Obtain a copy of or confirm or deny the existence of conversations between Andrew Maguire and the CFTC

Obtain a comment from the CFTC on their interactions with Andrew Maguire

Obtain a comment from the five (5) Bullion banks on the massive short of silver for August 2008 clarifying whether they were behind the short or not, and why

Record an interview with Andrew Maguire to learn more about his relationship with the bullion banks.

And so I end this blog post with a copy of the allegedly leaked e-mails between Maguire and the CFTC. Enjoy!

I thought you might be interested in looking into the silver trading today. It was a good example of how a single seller, when they hold such a concentrated position in the very small silver market, can instigate a selloff at will.

These events trade to a regular pattern and we see orchestrated selling occur 100% of the time at options expiry, contract rollover, non-farm payrolls (no matter if the news is bullish or bearish), and in a lesser way at the daily silver fix. I have attached a small presentation to illustrate some of these events. I have included gold, as the same traders to a lesser extent hold a controlling position there too.

Please ignore the last few slides as they were part of a training session I was holding for new traders.

I brought to your attention during our meeting how we traders look for the "signals" they (JPMorgan) send just prior to a big move. I saw the first signals early in Asia in thin volume. As traders we profited from this information but that is not the point as I do not like to operate in a rigged market and what is in reality a crime in progress.

As an example, if you look at the trades just before the pit open today you will see around 1,500 contracts sell all at once where the bids were tiny by comparison in the fives and tens. This has the immediate effect of gaining $2,500 per contract on the short positions against the long holders, who lost that in moments and likely were stopped out. Perhaps look for yourselves into who was behind the trades at that time and note that within that 10-minute period 2,800 contracts hit all the bids to overcome them. This is hardly how a normal trader gets the best price when selling a commodity. Note silver instigated a rapid move lower in both precious metals.

This kind of trading can occur only when a market is being controlled by a single trading entity.

I have a lot of captured data illustrating just about every price takedown since JPMorgan took over the Bear Stearns short silver position.

I am sure you are in a better position to look into the exact details.

It is my wish just to bring more information to your attention to assist you in putting a stop to this criminal activity.

Thought it may be helpful to your investigation if I gave you the heads up for a manipulative event signaled for Friday, 5th Feb. The non-farm payrolls number will be announced at 8.30 ET. There will be one of two scenarios occurring, and both will result in silver (and gold) being taken down with a wave of short selling designed to take out obvious support levels and trip stops below. While I will no doubt be able to profit from this upcoming trade, it is an example of just how easy it is to manipulate a market if a concentrated position is allowed by a very small group of traders.

I sent you a slide of a couple of past examples of just how this will play out.

Scenario 1. The news is bad (employment is worse). This will have a bullish effect on gold and silver as the U.S. dollar weakens and the precious metals draw bids, spiking them higher. This will be sold into within a very short time (1-5 mins) with thousands of new short contracts being added, overcoming any new bids and spiking the precious metals down hard, targeting key technical support levels.

Scenario 2. The news is good (employment is better than expected). This will result in a massive short position being instigated almost immediately with no move up. This will not initially be liquidation of long positions but will result in stops being triggered, again targeting key support levels.

Both scenarios will spell an attempt by the two main short holders to illegally drive the market down and reap very large profits. Locals such as myself will be "invited" on board, which will further add downward pressure.

The question I would expect you might ask is: Who is behind the sudden selling and is it the entity/entities holding a concentrated position? How is it possible for me to know what will occur days before it will happen?

Only if a market is manipulated could this possibly occur.

I would ask you watch the "market depth" live as this event occurs and tag who instigates the move. This would surly help you to pose questions to the parties involved.

This kind of "not-for-profit selling" will end badly and risks the integrity of the COMEX and OTC markets.

I am aware that physical buyers in large size are awaiting this event to scoop up as much "discounted" gold and silver as possible. These are sophisticated entities, mainly foreign, who know how to play the short sellers and turn this paper gold into real delivered physical.

Given that the OTC market (where a lot of the selling occurs) runs on a fractional reserve basis and is not backed up by 1-1 physical gold, this leveraged short selling, where ownership of each ounce of gold has multi claims, poses a very large risk.

I leave this with you, but if you need anything from me that might help you in your investigation I would be pleased to help.

If you get this in a timely manner, with silver at 15.330 post data, I would suggest you look at who is adding short contracts in the silver contract while gold still rises after NFP data. It is undoubtedly the concentrated short who has "walked silver down" since Wednesday, putting large blocks in the way of bids. This is clear manipulation as the long holders who have been liquidated are matched by new short selling as open interest is rising during the decline.

There should be no reason for this to be occurring other than controlling silver's rise. There is an intent to drive silver through the 15 level stops before buying them back after flushing out the long holders.

A final e-mail to confirm that the silver manipulation was a great success and played out EXACTLY to plan as predicted yesterday. How would this be possible if the silver market was not in the full control of the parties we discussed in our phone interview? I have honored my commitment not to publicize our discussions.

I hope you took note of how and who added the short sales (I certainly have a copy) and I am certain you will find it is the same concentrated shorts who have been in full control since JPM took over the Bear Stearns position.

It is common knowledge here in London among the metals traders that it is JPM's intent to flush out and cover as many shorts as possible prior to any discussion in March about position limits. I feel sorry for all those not in this loop. A serious amount of money was made and lost today and in my opinion as a result of the CFTC's allowing by your own definition an illegal concentrated and manipulative position to continue.

Bart, you made reference to it at the energy meeting. Even if the level is in dispute, what is not disputed is that it exists. Surely some discussions should have taken place between the parties by now. Obviously they feel they can act with impunity.

Now that gold is undergoing short covering, please look at market depth right now in silver and evidence the large selling blocks in a thin market being put in the way of silver regaining the technical 15 level, which would cause a short covering rally and new longs being instigated. This is resulting in the gold-silver ratio being stretched to ridiculous levels.

I hope this day has given you an example of how silver is "managed" and gives you something more to work with.

If this was long manipulation in, say, the energy market, the shoe would be on the other foot, I suspect.

I hadn't received any acknowledgement from you regarding the series of e-mails sent by me last week warning you of the planned market manipulation that would occur in silver and gold a full two days prior to the non-farm payrolls data release.

My objective was to give you something in advance to watch, log, and follow up in your market manipulation investigation.

You will note that the huge footprints left by the two concentrated large shorts were obvious and easily identifiable. You have the data.

The signals I identified ahead of the intended short selling event were clear.

The "live" action I sent you 41 minutes after the trigger event predicting the next imminent move also played out within minutes and exactly as I outlined.

Surely you must at least be somewhat mystified that a market move could be forecast with such accuracy if it was free trading.

All you have to do is identify the large seller and if it is the concentrated short shown in the bank participation report, bring them to task for market manipulation.

I have honored my commitment to assist you and keep any information we discuss private,however if you are going to ignore my information I will deem that commitment to have expired.

All I ask is that you acknowledge receipt of my information. The rest I leave in your good hands.