Tuesday, July 24, 2012

I’ve been working on a little project that allows me to merge my love of baseball with my knowledge of XML technologies. In the process of working through this project, I am creating XQuery modules that encapsulate the logic for the data. Part of the data that I’m looking at must account for different outcomes during the June amateur draft.

It turns out that the MLB June amateur draft is quite interesting, because drafting prospects is a big gamble. Draftees may or may not sign in any given year, and those who don’t sign remain eligible for drafts in subsequent years. They could be drafted by another team in a following year. Alternatively, they could be selected again by the same team and sign. However, even if they do sign, there’s no guarantee that they’ll make it to the big leagues. And even if they do, they might not make it with the team they originally signed with (in other words, they were traded before reaching the MLB).

In effect there are several scenarios, depending on how the data is aggregated or filtered. However, these scenarios are well defined and constrained to a finite set of possibilities:

All draft picks

All signed draft picks

All signed draft picks who never reach the MLB (the vast majority don’t)

All signed draft picks who reached the MLB with the club that signed them

All signed draft picks who reached the MLB with another club

All unsigned draft picks

All unsigned draft picks who reached the MLB with a different club

All unsigned draft picks who reached the MLB with the same club, but at a later time

All unsigned draft picks who never reach the MLB

All of these scenarios essentially create subsets of information that I can work with, depending on whether I’m interested in analyzing a single draft year or a range of draft years. They’re essentially the same query, with minor variations in the filter to match a specific scenario.

Working with various strongly typed languages like C# or Java, I would use a construct like an enum to encapsulate these possibilities into one object. Then I can pass this into a single method that will allow me to conditionally process the data based on the specified enum value. Pretty straightforward. For example, in C# or Java I would write:

public enum DraftStatus {
    ALL,                  // All draft picks (signed and unsigned)
    UNSIGNED,             // All unsigned draft picks
    UNSIGNED_MLB,         // All unsigned picks who made it to the MLB
    SIGNED,               // All signed draft picks
    SIGNED_NO_MLB,        // Signed but never reached the MLB
    SIGNED_MLB_SAME_TEAM, // Signed and reached the MLB with the same team
    SIGNED_MLB_DIFF_TEAM  // Signed and reached the MLB with another club
};

The important aspect of enumerations is that each item can be descriptive and still map to a constant integer value (the underlying value in C#, the ordinal in Java). For example, UNSIGNED is much more intuitive and meaningful than 1, even though both identify the same option.

Working with XQuery, I don’t have the luxury of an enumeration. Well, at least not in the OOP sense. I could write a separate function for each of the scenarios above, have each one perform its specific query, and return the subset I need. But that’s just added maintenance down the road.

At first I toyed with the idea of using an XML fragment containing a list of elements that mapped the element name to an integer value:
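A minimal sketch of what such a fragment could look like. The element names and values here are assumptions chosen to mirror the status names used elsewhere in this post, and $draft-status is a hypothetical variable name:

```xquery
xquery version "1.0";

(: Hypothetical lookup fragment: each element name is a draft status,
   and its text content is the integer code for that status. :)
declare variable $draft-status :=
  <draft-status>
    <ALL>0</ALL>
    <UNSIGNED>1</UNSIGNED>
    <UNSIGNED_MLB>2</UNSIGNED_MLB>
    <SIGNED>3</SIGNED>
    <SIGNED_NO_MLB>4</SIGNED_NO_MLB>
    <SIGNED_MLB>5</SIGNED_MLB>
    <SIGNED_MLB_SAME_TEAM>6</SIGNED_MLB_SAME_TEAM>
    <SIGNED_MLB_DIFF_TEAM>7</SIGNED_MLB_DIFF_TEAM>
  </draft-status>;

(: The element content is untyped, so it has to be cast
   before it can be used as an integer. :)
xs:integer($draft-status/SIGNED)
```

Every lookup needs both the path step and the cast, which is where the extra noise comes from.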

It works, but it’s not very elegant. Every value in the XML fragment has to be extracted through the xs:integer() function, which adds logic and makes the code less readable. On top of that, the code completion (and code hinting) offered by IDEs like Oxygen doesn’t work with this approach.

What does work well (at least in Oxygen, and I suspect in other XML/XQuery IDEs) is code completion for variables and functions, which led me to another idea. Prior to Java 5, there were no enum types. Instead, enumerated values were declared as constants encapsulated in a class:

public class DraftStatus {
    public static final int ALL = 0;
    public static final int UNSIGNED = 1;
    public static final int UNSIGNED_MLB = 2;
    public static final int SIGNED = 3;
    public static final int SIGNED_NO_MLB = 4;
    public static final int SIGNED_MLB = 5;
    public static final int SIGNED_MLB_SAME_TEAM = 6;
    public static final int SIGNED_MLB_DIFF_TEAM = 7;
}

This allowed static access to the constant values via the class, e.g., DraftStatus.SIGNED_MLB_SAME_TEAM.
The same principle can be applied to XQuery. Although there isn’t the notion of object encapsulation by class, we do have encapsulation by namespace. Likewise, XQuery supports code modularity by allowing little bits of XQuery to be stored in individual files, much like .java files. In Java, to access a class’s members you (almost always) have to import that class into the current class. The same is true in XQuery: you can import modules into the current module by declaring the referenced module’s namespace and location.
Using this approach, we get the following:
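A minimal sketch of such a constants module. The file name draft-status.xqy, the ds prefix, and the namespace URI are all made up for illustration; the values mirror the Java class above:

```xquery
xquery version "1.0";

(: draft-status.xqy: hypothetical file name and namespace URI. :)
module namespace ds = "http://example.com/baseball/draft-status";

(: One typed variable per status, playing the role of a static final int. :)
declare variable $ds:ALL                  as xs:integer := 0;
declare variable $ds:UNSIGNED             as xs:integer := 1;
declare variable $ds:UNSIGNED_MLB         as xs:integer := 2;
declare variable $ds:SIGNED               as xs:integer := 3;
declare variable $ds:SIGNED_NO_MLB        as xs:integer := 4;
declare variable $ds:SIGNED_MLB           as xs:integer := 5;
declare variable $ds:SIGNED_MLB_SAME_TEAM as xs:integer := 6;
declare variable $ds:SIGNED_MLB_DIFF_TEAM as xs:integer := 7;
```

Because each variable is declared as xs:integer, no cast is needed at the point of use.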

Which gives us direct access to all the members, like an enumeration:
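A sketch of how a main module might use the constants, assuming a library module saved as draft-status.xqy under a made-up namespace that declares $ds:ALL, $ds:SIGNED and $ds:UNSIGNED as integers. The local:picks function, the draft and pick elements, and the signed attribute are all hypothetical stand-ins for the real data:

```xquery
xquery version "1.0";

(: Import the constants module by namespace and location (both hypothetical). :)
import module namespace ds = "http://example.com/baseball/draft-status"
    at "draft-status.xqy";

(: One function handles every scenario; the status constant selects the filter.
   Only three statuses are handled here to keep the sketch short. :)
declare function local:picks(
  $draft  as element(draft),
  $status as xs:integer
) as element(pick)*
{
  if ($status = $ds:ALL) then $draft/pick
  else if ($status = $ds:SIGNED) then $draft/pick[@signed = "true"]
  else if ($status = $ds:UNSIGNED) then $draft/pick[@signed = "false"]
  else ()
};

(: Placeholder data, not real draft results. :)
let $draft :=
  <draft year="2012">
    <pick name="Player A" signed="true"/>
    <pick name="Player B" signed="false"/>
  </draft>
return local:picks($draft, $ds:SIGNED)
```

With the placeholder data above, the call returns only the pick whose signed attribute is "true". The call site reads much like DraftStatus.SIGNED would in Java.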
The bottom line is that this approach has worked really well for me. I can use descriptive constant names that map to specific values throughout my code, and it shows how you can add a little rigor to your XQuery coding.