Parsing EDI to XML (and vice versa)

The supplementary source code for this article can be downloaded from edifabric.com

Introduction

I've been meaning to provide a brief overview of the EDI format from the perspective of parsing and interpretation. I have an affinity for data formats and parsers. My first encounter with EDI was when I worked for an airline services provider. I was tasked with engineering the EDI communication for the company's new line of business applications. Information on the subject was scarce and scattered across multiple resources on the web. I had to traverse numerous websites in order to assemble a finite set of rules, clear enough to be unambiguously understood by a normal human being and eventually automated. It was like solving a giant puzzle.

Most of the articles related to EDI revolved around business controversies and comparisons between the different formats and dialects, all completely irrelevant to my research. I still don't understand why so many EDI formats (> 5000) co-exist nowadays. It appeared to me that EDI was veiled in mystery, and that the lack of information and cooperation was not simply a matter of chance...

I will leap over the entertaining side of EDI, like the conspiracy behind the multiple formats, the rebellious movement against VANs, and the ever ongoing discussion on whether XML will eventually bury EDI (with UBL being the latest contender). My goal here is to share my knowledge on the basics of parsing an EDI message, and hope that someone else may find that useful.

I'll refrain from the trivial and won't go into detail on the format itself. Anyone interested, please have a look at the resources below, which proved useful to me (in no particular order):

A single group can only contain messages of the same type and version. EDIFACT has a very limited use of groups and an interchange usually contains messages of the same type, therefore the group segments in EDIFACT are optional.

A Group is identified by its first segment, which is mandatory and always has a repetition of one.

Although the EDI format may seem loose enough to allow ambiguous structures, e.g. the same segment appearing twice on the same level, it's up to the owner of the definitions to ensure the structure is valid. However, when writing a parser, make sure you don't end up in an endless loop because of a faulty structure.

Group is an EDIFACT term. In X12 they are called Loops and their visual representation is slightly different.

Loops can be bounded or unbounded - the former having a start and an end segment, whereas the latter repeats according to a count, with the first segment being unique. The corresponding terminology in EDIFACT is Group and explicit loop. They are very rarely used in EDIFACT, and in X12 only a few transactions support them.

Right, assuming that at this stage we are well familiar with the format, let's get down to something real. Let's parse an EDI message!

Parsing the message

For the purpose I'll need two samples - one for EDIFACT and one for X12:

Details on the communication channels widely used to exchange EDI messages, with AS2 and FTP being the most popular, are also out of the scope of this article. The story starts after a message has been physically received.

The steps I took to parse the message are:

1. Identify the message format - is it EDIFACT, X12 or something else? EDIFACT always starts with a UNB or UNA segment. X12 always starts with an ISA segment. Make sure you get rid of the BOM (byte order mark) before you proceed, as we'll be counting characters here and any interference can ruin our efforts. No BOM, no leading blanks, no extra spaces.
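As a rough illustration of step 1, here is a minimal Python sketch (not ediFabric's .NET code). The function name detect_format and its return values are made up for this example, and it only handles a UTF-8 BOM.

```python
import codecs


def detect_format(raw: bytes) -> str:
    """Return 'EDIFACT', 'X12' or 'UNKNOWN' based on the leading segment tag."""
    # Drop a UTF-8 byte order mark if present (other encodings are not handled here).
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # Ignore leading blanks, then look at the first three characters only.
    head = raw.lstrip()[:3].decode("ascii", errors="replace")
    if head in ("UNA", "UNB"):
        return "EDIFACT"
    if head == "ISA":
        return "X12"
    return "UNKNOWN"
```

The cleaned bytes, not the original ones, should then be passed on to the next step, so that character positions are counted from the start of the first segment.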

2. Identify the separators - once the format is identified the relevant message properties must be extracted. These properties are the separators:

component data element separator is the zero character after the UNA tag (zero indexed) :

data element separator is the first character after the UNA tag (zero indexed) +

repetition separator is *

segment terminator is the 5th character after the UNA tag (zero indexed) '

release indicator is the 3rd character after the UNA tag (zero indexed) ?
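The positions above can be read with a few lines of Python. This is only a sketch: the Separators class and the function name are invented for the example, and falling back to the common default characters when there is no UNA segment (or when the repetition position is blank) is my assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Separators:
    component: str
    data_element: str
    release: str
    repetition: str
    segment: str


# Assumed defaults for interchanges that do not start with a UNA segment.
DEFAULTS = Separators(component=":", data_element="+", release="?",
                      repetition="*", segment="'")


def read_edifact_separators(message: str) -> Separators:
    """Read the six service characters that follow the UNA tag."""
    if not message.startswith("UNA"):
        return DEFAULTS
    service = message[3:9]
    if len(service) < 6:
        raise ValueError("UNA segment is too short")
    # Index 2 is the decimal mark, which the parser does not need.
    # Index 4 is blank in older syntax versions; keep the default in that case.
    repetition = service[4] if service[4] != " " else DEFAULTS.repetition
    return Separators(component=service[0], data_element=service[1],
                      release=service[3], repetition=repetition,
                      segment=service[5])
```

For the usual header UNA:+.? ' this yields : + ? * and ' in that order.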

3. Iterate through the interchange groups - once the separators are known we can proceed with the interchange. The parser should traverse the interchange structure and iterate through the interchange groups\loops.
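A possible way to walk the interchange, sketched in Python under a few assumptions: the segments carry no line breaks between them, the helper names split_escaped and iter_messages are hypothetical, and envelope validation (control numbers, counts) is left out. The splitter keeps the release character in place so that the lower levels (data elements, components) can still honour it.

```python
def split_escaped(text, separator, release):
    """Split text on separator, skipping separators escaped by the release character."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if release and ch == release and i + 1 < len(text):
            # Keep the escape pair untouched; it is resolved at the lowest level.
            current.append(ch)
            current.append(text[i + 1])
            i += 2
            continue
        if ch == separator:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if current:
        parts.append("".join(current))
    return parts


GROUP_START, GROUP_END = {"GS", "UNG"}, {"GE", "UNE"}
MESSAGE_START, MESSAGE_END = {"ST", "UNH"}, {"SE", "UNT"}


def iter_messages(segments, data_separator):
    """Yield (group_header, message_segments) for every message in the interchange."""
    group_header, message = None, None
    for segment in segments:
        tag = segment.split(data_separator, 1)[0]
        if tag in GROUP_START:
            group_header = segment
        elif tag in GROUP_END:
            group_header = None
        elif tag in MESSAGE_START:
            message = [segment]
        elif message is not None:
            message.append(segment)
            if tag in MESSAGE_END:
                yield group_header, message
                message = None
```

For EDIFACT interchanges without UNG/UNE the group header is simply None. X12 has no release indicator, so an empty string can be passed for it.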

4. Identify the message type and version - for every group the parser needs to identify the message type and version of the transaction. X12 contains the version information in the interchange group start segment. The parser will loop through all transactions and start parsing them one by one.

Message type and version properties are extracted as follows:

X12

- version is the 8th data element in GS segment, first 6 characters.
- message type is the first data element in ST segment.

It needs to be noted that X12 comes with two different versions - one for the message and one for the ISA segment. The latter is contained in the ISA segment itself and is used to parse the interchange header. This means that X12 messages can have an interchange header and transaction messages in two different versions. In our example the ISA version is 00204.
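In Python the X12 lookups could be sketched like this. The function names are hypothetical and the segments are assumed to be already split off the interchange, with * as the data element separator. The interchange version is read from the 12th data element of the ISA segment.

```python
def x12_type_and_version(gs_segment, st_segment, data_separator="*"):
    """Return (message type, version) from the GS and ST segments."""
    gs = gs_segment.split(data_separator)
    st = st_segment.split(data_separator)
    version = gs[8][:6]    # 8th data element of GS, first 6 characters
    message_type = st[1]   # 1st data element of ST
    return message_type, version


def x12_interchange_version(isa_segment, data_separator="*"):
    """Return the version of the interchange header (12th data element of ISA)."""
    return isa_segment.split(data_separator)[12]
```

Index 0 of each split is the segment tag itself, which is why the n-th data element sits at index n.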

EDIFACT

- version is the second data element in UNH segment, second and third component data elements joined together: D00A
- message type is the second data element in UNH segment, first component data element: INVOIC
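The same extraction for EDIFACT, again as a small Python sketch with a made-up function name. It assumes the default + and : separators unless others are passed in, and it ignores the release indicator for brevity.

```python
def edifact_type_and_version(unh_segment, data_separator="+", component_separator=":"):
    """Return (message type, version) from a UNH segment."""
    elements = unh_segment.split(data_separator)
    # Second data element, e.g. INVOIC:D:00A:UN
    identifier = elements[2].split(component_separator)
    message_type = identifier[0]
    version = identifier[1] + identifier[2]
    return message_type, version
```

For UNH+1+INVOIC:D:00A:UN this gives the message type INVOIC and the version D00A.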

5. Parse the transaction according to a formal grammar - once we know exactly what message we've got, the real parsing begins. I'll assume that most of you are familiar with the terminology and techniques of parsing (otherwise why would you still be reading). In order to parse an EDI message we need a formal grammar, which is the actual definition of the EDI rules and in our case is in the form of an XML schema or a .NET class.

ediFabric has a predefined set of definitions, which can be extended or amended to suit every dialect or requirement. I also added an additional property, called Origin, which together with message type and version forms a unique key identifying the definition. This allows you to cater for multiple customer versions of the same message at the same time. It's an old pattern to combine the two halves in a single key - one is part of the external content and the other is under our control.
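To make the composite key concrete, here is a hypothetical Python sketch of such a lookup. It is not how ediFabric stores its definitions; the DefinitionRegistry class and the fallback to an empty Origin for the standard definition are assumptions made for the example.

```python
from typing import Dict, Tuple

DefinitionKey = Tuple[str, str, str]  # (origin, message type, version)


class DefinitionRegistry:
    def __init__(self) -> None:
        self._definitions: Dict[DefinitionKey, object] = {}

    def register(self, origin: str, message_type: str, version: str, grammar: object) -> None:
        self._definitions[(origin, message_type, version)] = grammar

    def resolve(self, origin: str, message_type: str, version: str) -> object:
        # Prefer the partner-specific definition, then the standard one
        # registered under an empty origin (assumed convention).
        for key in ((origin, message_type, version), ("", message_type, version)):
            if key in self._definitions:
                return self._definitions[key]
        raise KeyError(f"no definition for {message_type} {version} (origin {origin!r})")
```

Message type and version come from the message itself, while the origin is decided on our side, for instance from the channel the message arrived on.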

Anyway, regardless of how the grammar is retrieved, it is used to uniquely parse the EDI message according to the rules of that same grammar. It defines the form we would like to see at the end of the parsing process.

What is the main challenge in parsing an EDI message? Undoubtedly it's the conversion from a linear structure of segments to a hierarchical one. An EDI message contains a simple flat list of segments. It's the parser's function to transform that flat structure into a hierarchical tree, where every node is either a parent or a child.

The goal is to build a structure, every element of which is connected to one or more other elements and is aware of three things - which is its parent, which are its children and what is its order on the same level.

The parser will process the message according to the grammar and will produce a parse tree, which is a well ordered hierarchical set. Our EDI message has been converted into an object and we know how to manipulate that object. Here comes XML.
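The flat-to-tree conversion can be illustrated with a tiny recursive descent in Python. Everything here is a simplification made up for the example: the grammar is a toy one rather than a real INVOIC definition, mandatory/optional checks are skipped, and a group is assumed to start with a plain segment (its trigger). Since every matched item consumes at least that trigger segment, the loop always advances and cannot spin forever on a faulty structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Seg:
    tag: str
    max_repeat: int = 1


@dataclass
class Group:
    name: str
    children: list        # the first child must be a Seg: the trigger segment
    max_repeat: int = 1


@dataclass
class Node:
    name: str
    value: str = ""
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)   # list position is the order on this level
        return child


def parse_items(items, segments, pos, parent, sep="+"):
    """Match grammar items in order against segments from pos; return the new pos."""
    for item in items:
        trigger = item.tag if isinstance(item, Seg) else item.children[0].tag
        count = 0
        while count < item.max_repeat and pos < len(segments):
            if segments[pos].split(sep, 1)[0] != trigger:
                break
            if isinstance(item, Seg):
                parent.add(Node(item.tag, segments[pos]))
                pos += 1
            else:
                pos = parse_items(item.children, segments, pos, parent.add(Node(item.name)), sep)
            count += 1
    return pos


def parse_message(name, grammar, segments, sep="+"):
    root = Node(name)
    pos = parse_items(grammar, segments, 0, root, sep)
    if pos != len(segments):
        raise ValueError(f"unexpected segment at position {pos}: {segments[pos]}")
    return root


# Toy grammar: a header, a repeating line group and a trailer.
GRAMMAR = [Seg("UNH"), Seg("BGM"),
           Group("G_LIN", [Seg("LIN"), Seg("QTY", 5)], max_repeat=99),
           Seg("UNT")]
SEGMENTS = ["UNH+1+INVOIC:D:00A:UN", "BGM+380+123", "LIN+1", "QTY+47:10",
            "LIN+2", "QTY+47:5", "QTY+46:5", "UNT+7+1"]
tree = parse_message("INVOIC", GRAMMAR, SEGMENTS)
```

Here the eight flat segments end up as five nodes under the root, two of which are G_LIN groups holding their own LIN and QTY segments.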

The resulting object model

The resulting object model conforms to ISO/TS 20625. It can generate XML or be instantiated from XML, which adds the necessary cross platform flavor and allows for easy transformation (XSLT, XQuery, etc.).
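As a rough picture of that round trip, the Python sketch below turns a nested structure into XML and back with the standard library. The (name, content) tuples stand in for the real object model, and the element names follow the short-code style (S_ for segments, C_ for composites, D_ for data elements); whether they match ISO/TS 20625 exactly for a given message is not something this example verifies.

```python
import xml.etree.ElementTree as ET


def to_xml(name, content):
    """content is either a string (leaf) or a list of (name, content) pairs."""
    element = ET.Element(name)
    if isinstance(content, str):
        element.text = content
    else:
        for child_name, child_content in content:
            element.append(to_xml(child_name, child_content))
    return element


def from_xml(element):
    """Rebuild the (name, content) structure from an XML element."""
    children = list(element)
    if not children:
        return element.tag, element.text or ""
    return element.tag, [from_xml(child) for child in children]


# UNH+1+INVOIC:D:00A:UN expressed as a nested structure.
unh = ("S_UNH", [
    ("D_0062", "1"),
    ("C_S009", [("D_0065", "INVOIC"), ("D_0052", "D"),
                ("D_0054", "00A"), ("D_0051", "UN")]),
])
xml_text = ET.tostring(to_xml(*unh), encoding="unicode")
restored = from_xml(ET.fromstring(xml_text))
```

The restored value equals the original structure, which is the property that makes the XML usable as an interchangeable form of the parsed message.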

The Parser

Closing words

To close this already stretched narrative, I'd like to express my opinion on the use of parsing EDI to XML. What would be the use of it?

I'm far from the heretic thought that a product like ediFabric can compete with a full blown, commercial and costly EDI parser. At the same time, I couldn't find an open source or low budget solution to offer me the flexibility I needed. It is the alternative I was after.

My interest was not only to design an EDI to XML parser. As a software professional I wanted it to be robust, extendable and to require very little maintenance. It was designed to cater for multiple custom formats and to easily change existing or add new definitions.

I don't look at XML as an alternative to EDI. I believe the two are complementary - EDI is popular in its own business domain, lightweight in size, and with established communication channels. XML is standard and natively supported by almost every programming language. I felt there was a gap and I had to connect the dots. It was an isomorphism, which should have supposedly made EDI more application-friendly.

Comments and Discussions

.Net? Really? Sheesh. How about a solution for more popular languages...e.g. Java rather than a solution for a language that's number 10.....http://langpop.com.... on the search for something other than a Microsucks .Nut Solution...

Is this guy for real? Someone goes to the trouble of putting this together, walking you through the how's and the why's, and all this guy has to say is why did you use this particular technology? No good deed goes unpunished, I guess. Do you think Java-boy there could write all of this in C for us? LOL

I am trying to parse an edifact message that I can convert to XML with your V1.0 (I think) DLLs, but it keeps failing. What I have done was to take your very first edifact test and replace it with my CUSDEC message, and I included the class Edifact_CUSDEC_D96B that I have converted from the xsd. Is there something wrong with my XSD or do you have any ideas?

I'd have preferred not to have to give you my email address to look at your code, this is a source code site after all

Did you use an automatic process to translate from X12/EDIFACT 'specs' to your xsd's? That's a great approach, but be mindful that often the specs are 'cut down' for local conditions - I know at least two such implementations in the Australian Government that implement various messages, but not the whole spec.

Actually, in looking at the above statement, it's not so much the translation, but the tools used to test a translation or derivative against a message etc that become more important.

You wrote "identify the message format" .. I hope that was just in the context of this article, for illustrative purposes - in reality, would you not have a trading partner agreement or such that says for a given partner, what transmission method they would use and what they would send to that 'end-point' ?

You are absolutely right - it is very hard (and usually costly) to obtain a set of specs in electronic form.
I really don't understand why there isn't a standard or unified approach to that. I had to create a small tool to convert the set I had into another, which conforms to ISO/TS 20625. This is supposedly the standard to describe EDI, although not very popular.

This is where the parser flexibility comes in - you should be able to hand craft the XSDs you need and use them immediately. How you are going to do that and convert the specs into ISO/TS 20625 schemas is out of scope. It is worth the effort though (I think).

It's correct that in the real world, an agreement between the partners will dictate the format exchanged on a particular channel. I got carried away and added this as a parsing step. This is how my online translator works, but it's only a toy of course.

I went back and checked the UNECE web site - I was sure at one stage the (for example) D97B messages were available in a form that with a bit of skill could be managed electronically - but my memory seems to be at fault here ...

A good thing to note, is that looking at D11A for example, it seems to have XML schemas already

Please don't think I'm picky - this is just constructive discussion - I looked at your XML spec/schema - I'm not denying that it all does what you say it does - the thing that concerned me was you used short element codes eg C_S010, D_0070_1 for example .. that makes it hard to read unless you have the spec open and with you .. I would have preferred something like a short_name and a long_name where short name was S009-0065 and long name was for example "Message Type Identifier" - that way the xsd is an entire reference within itself

I'll look at your work further - I'm intrigued - thanks a lot for posting it

I've taken out the email address requirement - it's easier to download from the web site now. Nice link to the flat file parser btw, I'll look at it in more detail when time allows.

I haven't checked the UNECE web site for a long time, the issue with their definitions was exactly what you said - some could have schemas, but most of them don't. I wanted to wrap up a larger set if possible.

Regarding the schemas using the short element codes - it is true that at one stage I was wondering what would be the best way to describe EDI as an XML. Initially I had my own proprietary schemas, which contained the full reference.
The downside of this approach was:
1. The schemas are proprietary. I looked for a standard way of representing EDI as XML. Then I found ISO/TS 20625 format, which should supposedly do exactly that. The format itself specifies all node names by using the short name only.
2. Adding the long names to the schema would make the resulting XML very large. I tested an EDIFACT message (can't remember which type), which was 50 KB in size and the converted XML became 50 MB. Imagine what would have happened, had it used the long names.

I understand it's very hard to read a schema like this, there is a way around that of course, I just didn't have the time.
The schemas can remain just like they are, and the long name can be added to the annotation of each node. It'll be a description for the node (apart from the long name you can put anything you want there). This will allow you to generate low size XMLs and have the schema opened in a proper XSD reader\editor\mapper, where the description for each node is displayed.

Initially I had an idea of hosting EDI to XML conversion as a cloud service, but the large size of the XML was the main issue. The EDIFACT conversion I mentioned above took ~45 seconds to complete - that is from uploading the EDI message to the cloud, conversion and then downloading the produced XML.

You are correct, that other aspects of the parser could be amended and improved - this is the reason I published it.

It should be possible even now to take a random xsd, amend it to use long names instead of short names (or a combination of these), rename each node accordingly (preserving the M,S,C,D and the counter at the end) and import it to the repository. I will include more samples and general guides whenever I find the time.

I've just purchased a copy of ISO/TS 20625 - it will make interesting reading

I agree that adding long names to the schema would make the xml large - point taken, and it may be that the way ISO/TS 20625 says to build the schemas precludes long names for this reason

btw, I have seen at least two other methods to 'represent' schemas/data, one came out of Edinburgh University, and was for more 'binary' data iirc, and the other was using a compact self-describing data format - almost what JSON is today - obviously, those of us who trade between partners using X12/EDIFACT are constrained, but as we both know, the standards have evolved and continue to do so.