Ed-Fi Special Interest Group Tackles Issue of Controlled Vocabularies

Ed-Fi recently formed a Special Interest Group (or “SIG”) to look at emerging needs around how Ed-Fi standards and technologies support controlled vocabularies. As we all know, categorizing elements of real-world experience is crucial to strong data analysis, and as a result, our data models are full of elements that require controlled vocabularies.

These vocabularies — or enumeration sets — sort real-world experiences and entities into various pre-defined categories. Common examples we see all the time in K12 are race, gender, grade level, or job type. Most enterprise data models have dozens if not hundreds of such category sets, and the Ed-Fi core data model is no different.

We assembled this SIG to study the inherent diversity of such sets in the ecosystem, and how Ed-Fi should approach that diversity. We would like to thank and recognize those who joined us for this effort.

Warning: from this point on, the discussion wades waist deep into a few abstractions often faced by data analysts and by those who work across different data models. In other words, if you’re a data geek, keep reading!

The main struggle for Ed-Fi is that different sets of enumerations exist for the same concept. A good Ed-Fi example is the ExitWithdrawDescriptor, which captures the reason a student is exiting or withdrawing from a school. By default, the Ed-Fi Data Standard values you might expect are:

Transferred

Graduated

Expelled

Reached maximum age

But there could be another list (say, a state-mandated one) that has different values, ones not in Ed-Fi’s core enumeration set. For example, a second set might look like this:

Transferred

Graduated

Expelled

Reached maximum age

Performing service year

Family move

…and so on. The additional values in this set – “Performing service year” and “Family move” – may be critical to the operations of the state education department, as the agency might be mandated by law or policy to use these values for various reports or contexts.

Now, we all know that no particular enumeration set is right or wrong; each set of values supports different “use cases.” In some contexts (that is, for some use cases), we may not care if a student left school because their family moved, and in another context that information may be critical to capture.

What happens is that we often need to move, or map, between sets of enumerations. Data is collected or managed according to one set of enumerations, then translated into another vocabulary. In the above example, that could result in a mapping like this:

Context A → Context B

Transferred → Transferred

Graduated → Graduated

Expelled → Expelled

Reached maximum age → Reached maximum age

Performing service year → Transferred

Family move → Transferred

As you can see, this chart maps values from an expressive vocabulary to a narrower one, resulting in data loss: three values from Context A (“Transferred”, “Performing service year” and “Family move”) map to a single value in Context B (“Transferred”). But presumably, this loss of fidelity is not one that anyone in Context B cares about.
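The mapping in the chart can be sketched as a simple lookup table. This is only an illustration in Python, not part of any Ed-Fi tooling; the names `CONTEXT_A_TO_B`, `to_context_b` and `collapsed_values` are hypothetical. The second helper makes the data loss visible by listing the Context B values that more than one Context A value maps to.

```python
from __future__ import annotations

# Hypothetical mapping from the more expressive Context A vocabulary
# to the narrower Context B vocabulary (values taken from the chart).
CONTEXT_A_TO_B = {
    "Transferred": "Transferred",
    "Graduated": "Graduated",
    "Expelled": "Expelled",
    "Reached maximum age": "Reached maximum age",
    "Performing service year": "Transferred",
    "Family move": "Transferred",
}


def to_context_b(value_a: str) -> str:
    """Translate a Context A value to Context B; raises KeyError if unmapped."""
    return CONTEXT_A_TO_B[value_a]


def collapsed_values(mapping: dict[str, str]) -> dict[str, list[str]]:
    """Return each target value that more than one source value maps to."""
    by_target: dict[str, list[str]] = {}
    for source, target in mapping.items():
        by_target.setdefault(target, []).append(source)
    return {t: sources for t, sources in by_target.items() if len(sources) > 1}
```

Calling `collapsed_values(CONTEXT_A_TO_B)` reports only “Transferred”, together with the three Context A values that collapse into it.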

And as you can imagine, there are any number of possible contexts, all of them valid representations of this enumeration set. The reason why a student left school might require one set of values for a student transcript or report card, another for the Civil Rights Data Collection, another for a rostering system, and still another for school district funding calculations.

What does this mean in the context of Ed-Fi?

Currently, the Ed-Fi data model envisions two contexts for enumeration sets: a local context and a cross-sector context. In the Ed-Fi Data Standard 2.0, these are actually two different classes of entities: the fields that support local context are called descriptors and the fields that support cross-sector values are types. Types are the immutable values governed as part of the data standard and apply anywhere in the K12 enterprise, while descriptors can be used to support local needs.
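One way to picture the two classes of enumerations is the rough Python sketch below. It is not the Ed-Fi schema: the class names, the `namespace` field, the optional `maps_to` link from a descriptor to a type, and the example namespace string are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ExitWithdrawType(Enum):
    """Cross-sector values, fixed as part of the standard (illustrative subset)."""
    TRANSFERRED = "Transferred"
    GRADUATED = "Graduated"
    EXPELLED = "Expelled"
    REACHED_MAXIMUM_AGE = "Reached maximum age"


@dataclass(frozen=True)
class ExitWithdrawDescriptor:
    """Local value, defined by whoever owns the (hypothetical) namespace."""
    namespace: str
    code_value: str
    # Assumption: a local value may optionally point at a cross-sector type.
    maps_to: Optional[ExitWithdrawType] = None


# A state-specific value that the cross-sector list does not contain.
family_move = ExitWithdrawDescriptor(
    namespace="state.example",  # hypothetical namespace
    code_value="Family move",
    maps_to=ExitWithdrawType.TRANSFERRED,
)
```

In this sketch the type list cannot be extended at runtime, while any number of descriptors can be created for local needs.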

Having two different kinds of enumerations has been an effective abstraction for Ed-Fi, but we are starting to see signs that it needs to change:

As the above points assert, there are many possible contexts for any concept, not just two. Two has worked well for a while, but now that Ed-Fi is used for broader sets of use cases, we are finding more contexts.

In many cases, our community seeks to collaborate with other standards efforts (for example, Ed-Fi’s CEDS collaboration). We also need to support enumeration sets governed by other communities.

The presence of two different classes of enumerations in the model is confusing to many new Ed-Fi users.

As a result, Ed-Fi’s enumerations strategy needs an update. We need to revisit this pattern of types and descriptors to understand how the Ed-Fi standards and technology can support multiple use cases (and therefore multiple enumeration sets for the same concept) and do so while making adoption of the standards easier.

Some New Directions

Broadly speaking, the SIG (as well as the Ed-Fi Technical Advisory Group) reached a few conclusions:

1. Support for multiple sets of enumerations in the Ed-Fi technology makes sense, and providing for flexible and simple mapping between those sets is a strong candidate for the technology roadmap.

2. Related to #1, we need to deprecate the concept of “types” currently in the model and remove types from the Unified Data Model. Instead, there would be a single entity type (descriptors), and Ed-Fi would govern a set of “core” descriptors to replace the types.

3. Continued community governance around a core, shared set of values and support for convergence are critical. Since Ed-Fi’s mission is data interoperability, support for multiple enumeration sets does not mean anything goes as far as controlled vocabularies for concepts like ExitWithdrawal are concerned. Not at all. There is a role for the Alliance and the Ed-Fi Community to play in standardizing particular sets of enumeration values for particular use cases and generating ecosystem buy-in to those specific sets.

4. Some complexities at the API level remain. Since Ed-Fi’s standards and technology also define and support API interactions, the need for more flexibility around enumeration sets raises new questions about how different API clients involved in different use cases (and hence using different enumeration sets) can use the same APIs.

As an example, and returning to our previous mapping of the ExitWithdrawal set:

Context A → Context B

Transferred → Transferred

Graduated → Graduated

Expelled → Expelled

Reached maximum age → Reached maximum age

Performing service year → Transferred

Family move → Transferred

What happens when an application in Context B tries to write the ExitWithdrawal value “Transferred” to an entity whose value is currently a value from the Context A list? The problem – as you can see from the chart above – is that “Transferred” can map to any one of three possible Context A values. Of course, there are possible solutions (e.g. define a reverse mapping from Context B to Context A and accept some data loss; refuse to do the data write; store both values and risk ambiguity; etc.) but all have their pros and cons.
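To make the trade-offs concrete, here is a minimal Python sketch of the three options just listed. Nothing here reflects actual Ed-Fi API behavior: the strategy names (`reverse_default`, `refuse`, `store_both`), the `StoredValue` record and the choice of default reverse mapping are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

A_TO_B = {
    "Transferred": "Transferred",
    "Graduated": "Graduated",
    "Expelled": "Expelled",
    "Reached maximum age": "Reached maximum age",
    "Performing service year": "Transferred",
    "Family move": "Transferred",
}

# Hypothetical reverse mapping: one Context A value picked per ambiguous B value.
B_TO_A_DEFAULT = {"Transferred": "Transferred"}


class AmbiguousWriteError(ValueError):
    """A Context B value has several possible Context A values."""


@dataclass(frozen=True)
class StoredValue:
    context_a: str
    context_b: Optional[str] = None  # only set by the "store_both" strategy


def reverse_candidates(value_b: str) -> list[str]:
    """All Context A values that map to the given Context B value."""
    return [a for a, b in A_TO_B.items() if b == value_b]


def write_from_context_b(stored_a: str, incoming_b: str, strategy: str) -> StoredValue:
    """Decide what to store when a Context B client writes over a Context A value."""
    candidates = reverse_candidates(incoming_b)
    if not candidates:
        raise ValueError(f"{incoming_b!r} is not a known Context B value")
    if len(candidates) == 1:
        return StoredValue(context_a=candidates[0])  # unambiguous, no choice needed
    if strategy == "reverse_default":
        # Accept data loss: overwrite with the designated default.
        return StoredValue(context_a=B_TO_A_DEFAULT[incoming_b])
    if strategy == "store_both":
        # Keep the existing value and record the incoming one alongside it.
        return StoredValue(context_a=stored_a, context_b=incoming_b)
    if strategy == "refuse":
        raise AmbiguousWriteError(f"{incoming_b!r} could mean any of {candidates}")
    raise ValueError(f"unknown strategy {strategy!r}")
```

With a stored value of “Family move” and an incoming “Transferred”, `reverse_default` replaces the finer-grained value with “Transferred”, `store_both` keeps “Family move” alongside the incoming value, and `refuse` raises an error. Each outcome corresponds to one of the pros and cons mentioned above.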

We will continue discussing this topic, and the community should expect changes in the future, likely with regard to the Ed-Fi API 3.0 and the next iteration of the Ed-Fi Data Standard.

As always, we encourage you to join the conversation and please don’t hesitate to send your ideas to me at eric.jansson@ed-fi.org.