Source data interface standardisation

We want to set a standard format for batch/bulk data interfaces to the data warehouses within our company. Can anyone point me to a best practice? Is XML an option here?
And secondly: does anybody have guidelines on (non-bulk) XML messages in the context of a DW?


XML and bulk don't go together. XML is for transactional transmission
between two entities with a high contractual wall, such as a laboratory
and a hospital, or a vendor and a retailer. It only has practical use in
B2B and not for internal transmissions - the overhead is unnecessary.

XML has no place in a DW. XML interacts with a transactional system, and once stored, the data is moved in bulk to the DW.

Hi Patrick,
I'm going to respectfully disagree with remediators' response to your post. It is true that XML has overhead in more than one respect, particularly in bulk. However, previous technologies were not that great either. It is also true that the original push behind XML was B2B exchange of information, and this is probably still the case for many businesses considering adopting XML. Still, all of the major players disassemble the XML to store it in a far more compressed format in a more traditional database, not just to save disk space but to gain the powerful indexing, reporting and analytical power of their applications and BI/DW tools. So though everything in that post was true, you should not reject XML.
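As a rough illustration of that "disassembly" (often called shredding), here is a minimal sketch; the <order> structure, the column names, and the in-memory SQLite target are all invented for illustration, not any vendor's actual mechanism:

```python
# Minimal sketch of "shredding" XML into relational rows.
# The <order> structure and column names are hypothetical.
import sqlite3
import xml.etree.ElementTree as ET

doc = """
<orders>
  <order id="1"><customer>Acme</customer><total>99.50</total></order>
  <order id="2"><customer>Globex</customer><total>12.00</total></order>
</orders>
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")

# Walk the XML once and store each element as a plain relational row;
# from here on, indexing and reporting use the database, not the XML.
for order in ET.fromstring(doc).iter("order"):
    conn.execute(
        "INSERT INTO orders VALUES (?, ?, ?)",
        (int(order.get("id")),
         order.findtext("customer"),
         float(order.findtext("total"))),
    )
conn.commit()
print(conn.execute("SELECT * FROM orders").fetchall())
```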

You may be unaware that the M in the MAPI (email) applications you use does not stand for "Mail"; it stands for MESSAGING. Another term for these tools is workflow applications, because for many years they have been used to transmit all manner of transactions (messages) around the world. The types of messages we are most familiar with are email, appointments, tasks, contacts and e-commerce or web site inquiries. However, these messages can contain any kind of transaction-like data to be received by the recipient, including data for your DW.

However, once again, once the rather bulky message has been received (your email aside), it is not kept in its original form but extracted and stored in the appropriate database in the correct format, this determined magically from nothing more than the message type. So there is a second consideration to take into account: obviously, you could use a workflow tool that sends XML-formatted data, but why?

Now to the point. XML is about creating formal definitions of data of any type. Your company should be aware of what is going on, not only as a means of serving your business, clients and suppliers; XML also has many parallel technologies that can TRANSFORM data to other formats and PRESENT data. As in, present the data in a WEB BROWSER. You probably have web browsers installed on one or two computers at work (pun intended).
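To make the TRANSFORM/PRESENT point concrete, here is a small hand-rolled sketch; in practice XSLT would do this job, and the element names here are invented:

```python
# Sketch of "transforming" XML into an HTML presentation.
# In practice this is XSLT's territory; element names are invented.
import xml.etree.ElementTree as ET

doc = "<products><product><name>Widget</name><price>4.99</price></product></products>"

rows = []
for p in ET.fromstring(doc).iter("product"):
    rows.append(f"<tr><td>{p.findtext('name')}</td><td>{p.findtext('price')}</td></tr>")

# The same source data, now presentable in any web browser.
html = "<table><tr><th>Name</th><th>Price</th></tr>" + "".join(rows) + "</table>"
print(html)
```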

BIG IMPORTANT QUESTION. In your post, you did not say what industry your company is in. XML is about creating formal definitions of data specific to each industry or area of knowledge, in a standard way that should completely define the topic. These definitions are created by industry experts working in collaboration in their field. As a shared endeavor, your company is accessing a large volume of research by some of the best minds in your industry (in all likelihood).
If you have a DW, your organization cannot be too small, so it needs to be plugged in not only to your clients and competitors, but to trends and innovations within your industry. Did I mention it's free? You were previously paying $120 - $250 per hour for this at the low end. There is of course some hope that your organization will also contribute something back into the whole process, but free riders and deadbeats are allowed, so jump on board. W3C or OMG are two good places to start, but Google isn't a bad choice, I suppose.

The second part of your post, non-bulk guidelines for your DW, is closely related. The page you are reading (HTML) would be covered by XHTML (note the X-whatever acronyms). XML is also about having standard ways to deal with both documents (this page) and data. Despite the huge definitions (schemas, like your database has), the important thing about XML is that all of the data fields (columns) are optional on a per-row basis, so it allows you to keep all the data using very little space. So XML should not only serve you very well for the non-bulk stuff, but also allow you to store a lot of information you would previously have thrown out as being too difficult to identify and store. That comment may not be relevant to the bulk transactions, but it should be mentioned that most vendors have very high performance XML databases existing alongside your traditional data, for you to take advantage of where the bulky stuff is concerned.
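A toy illustration of that per-row optionality, with invented element names: two records against the same implied schema, where absent fields simply come back as NULL:

```python
# Sketch: optional fields on a per-record basis (element names invented).
# Each <contact> may carry any subset of the schema's elements;
# absent elements simply read back as None (i.e., NULL).
import xml.etree.ElementTree as ET

doc = """
<contacts>
  <contact><name>Ann</name><phone>555-0100</phone><fax>555-0101</fax></contact>
  <contact><name>Bob</name></contact>
</contacts>
"""

FIELDS = ["name", "phone", "fax"]
for c in ET.fromstring(doc).iter("contact"):
    print({f: c.findtext(f) for f in FIELDS})
# {'name': 'Ann', 'phone': '555-0100', 'fax': '555-0101'}
# {'name': 'Bob', 'phone': None, 'fax': None}
```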

Which gets us to the end. Your software vendors are an ideal source for best practices and standards for dealing with all your DW needs. They will be happy to (again, freely) supply you with many guides and white papers, as well as information on where to find additional sources. If not, give them the boot.
Your company should digest all of this in developing a short- and long-term strategy for where you are going database-wise and software-wise (interfaces?), and this of course ties in with the marketing, sales and product development folks, and of course the clients that justify our existence in the first place.

Ignore the assertions that XML and bulk processing can co-exist with any degree of scale or harmony. You can store XML at scale but you can't process it this way. If you attempt to perform bulk processing with XML you will need an engine to make it happen, and since Rule #10 adjures us to NEVER process bulk in the RDBMS, you will need a processing engine to pull it out and process it. Using XML in any part of this is like sludge in the engine.

If you attempt to perform bulk processing in the RDBMS, you will rapidly run out of gas and wonder what you could possibly have been thinking.

Bulk processing is handled internally - you don't need XML to make this
happen, and those who have used it (bulk meaning millions of records in
a data warehouse) have rapidly backed it out because it creates far too
much overhead.

Thank you for responding with additional information; your experience in this area is appreciated. I was not disagreeing with any of your original post so much as placing focus on the importance of your DW folks getting involved, and why. Though I would like to see you continue experimenting, it would be, um... silly... to ever store transactional data in its native XML format, no matter how few transactions a business has or how they were delivered over a distributed network.

I had to research to see if there are special RULES FOR BI/DW, as this is the second time I have seen references to RULE #? in this forum, and I make no claims to being the foremost expert in DW. I will do a new post on this subject, but in summary I must stress that database experts frequently make use of denormalization to gain performance improvements in their designs. Performance is everything in bulk transaction processing.
TRIGGERS and processing in databases are based on leveraging set theory within the system, by not moving database data out to code, calculating, and then returning it to the database. It is an optimization related to processor, network and memory usage. This is not denormalization in the traditional sense, but there is very sound rationale behind these technologies. I am a very strong advocate of the use of mathematics in database theory and of the need for advances in this area. At the same time, Codd's attempt at formalizing database theory through set theory is excellent, and it reflects the deeper intent in the vast and complex schemas produced in XML, without the penalty of forcing an instance of every column present in the schema. Relational database purists have the same appreciation for the importance of mathematics in databases and the cognitive sciences as I do, but it is imperative that we get past these repetitive debates on theory and advance these sciences into new territory.

All large enterprise databases and applications avoid any but the most basic processing like the plague. You folks in DW and BI are the technologists who led business to see the wisdom of doing more with this data, but still very efficiently.

At the small to medium scale, where bulk is not such an issue, other requirements are a higher priority. Comments that appear in SQL circles regarding how horrid concepts such as cursors are can be quite amusing, but they are not incorrect; they come from a different perspective. I appreciate the rationale being applied here and why DW views lean this way.

In our applications we don't do badly in dealing with transaction volumes, and we strongly emphasize performance and disk storage optimization, but we deal with different requirements. In my own work I continue to draw on work in the DW and back-office areas as quite important. Hardware and performance are a big issue at any scale, and your strategies and "rules" translate to a competitive advantage for applications and enterprises of any scale.

I have no issue with your point, in that you are speaking of millions of transactions where performance and TCO are dependent on extreme efficiency and optimization. Your comment regarding engines is a very good example; I would love to hear what you would say regarding SQL cursors, which would not be too kind, I have no doubt.

I was wondering if you have had any chance to check the performance stats on IBM's Native XML and similar developing technologies. They claim they are getting to the point where the two can co-exist and perform, keeping in mind that the bulk transaction aspects are already well handled.

I would really appreciate additional comments from yourself or anyone else in this group particularly if you could quantify any of these concerns or experiments.

To return to Patrick: the man is giving you very sound advice here in plain English, despite our little sidebar on processing in the database. He is quite correct; ignore it at your peril. There are many ways to misuse these technologies.

> I must stress that database experts frequently make use of denormalization to gain performance improvements in their designs. Performance is everything in bulk transaction processing.
Denormalization may be a questionable undertaking in an OLTP database. It has so many well-documented and researched negative consequences that one has to have a really good reason to do it.
In a DW database, on the other hand, denormalization can be used for performance gains: since we are dealing with second-hand data, it does not carry the same negatives as in an OLTP environment.
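As a toy illustration of that trade-off (table and column names invented): the normalized design pays for a join at query time, while the denormalized DW copy folds the dimension into the fact row so reads avoid the join:

```python
# Toy illustration of denormalization for read performance (names invented).
import sqlite3

conn = sqlite3.connect(":memory:")
# Normalized: fact rows reference a dimension, so queries pay for a join.
conn.execute("CREATE TABLE dim_product (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE fact_sales (product_id INTEGER, amount REAL)")
# Denormalized DW copy: the product name is folded into the fact row.
conn.execute("CREATE TABLE fact_sales_dn (product_name TEXT, amount REAL)")

conn.execute("INSERT INTO dim_product VALUES (1, 'Widget')")
conn.execute("INSERT INTO fact_sales VALUES (1, 9.99)")
conn.execute("INSERT INTO fact_sales_dn VALUES ('Widget', 9.99)")

# OLTP-style read: join required.
print(conn.execute("""SELECT d.name, f.amount FROM fact_sales f
                      JOIN dim_product d ON d.id = f.product_id""").fetchall())
# DW-style read: no join, at the cost of redundancy and update anomalies.
print(conn.execute("SELECT product_name, amount FROM fact_sales_dn").fetchall())
```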

When we resort to these engineering tactics, we are propping up the weaknesses of the RDBMS in performing work it was never designed to do. RDBMS engines never scale for bulk processing, hence the need for Rule #10. If you really want to get in touch with the other rules, see the book Netezza Underground on Amazon's web site - it provides significant detail on why the RDBMS will never scale, and insight as to why vendors like Business Objects have declared RDBMS engines like Oracle to be "secondhand" technology for data warehousing.

Bulk requires physics. RDBMS engines derive their processing strength from software. Software does not provide performance; it is only a cost-center against the physics - in the hardware. You can never scale the SMP hardware high enough to keep up with an ever-growing data warehouse.

Patrick,
You raise a good point....I looked into this in 2002 when I was designing SeETL...
What I wanted was a single standard self-describing file format, so that a file could be sent through firewalls etc. from machine to machine, where the receiving machine had no connection to the source machine and no secondary DTD file needed to be transmitted...
This is because I wanted to be able to place extractors and delta detectors inside the client's firewall and then transmit only the delta files to a centralised DW server......we have finally got this setup in place with some clients now......we have a centralised data center in Germany and some clients sending their data to our data center, with VPN connections to the apps back out...
I dismissed XML because the processing overhead was too high. I dismissed simple delimited files as the null support is simply not good enough....null support in most ETL tools is frustrating at best....
In the end I designed a 'self describing file format' that has a header row to define the field names and the data types and other attributes of the data and then data records that contain the data, a null indicator for each field, and delimiters.....
I felt this was a good compromise between delimited files and the needed null support. It does add overhead, but CPUs are getting faster and faster.....so when we use SeETL, what we do is convert any non-RDBMS data to this internal file format first, and then all the programs use that format....We transmit files in this internal format....its only real weakness is that it uses newlines to denote a new record, and therefore you must translate embedded newlines to something else to move the data around; if you really want the newlines back, you must make sure the character you translate them to is also translated out of the file, and then translate that character back into a newline before putting it back into the database.....newlines inside data are a pain in the neck....as are occurrences of the character you want to use as a delimiter.
If you want to standardise on that format for your company you are welcome to.....the software you need is all free on Windows, and file manipulation like this on Windows is faster than on Solaris in any case.........there are various converters from various delimited formats to the internal format, as well as converters back to what we called 'Load Interface Files'. A LIF is a file that is prepared for a specific database loader.
Any format that does not give you a null indicator really does suffer for it.....as much as we don't like them...nulls are a fact of life and we can't ignore them or get rid of them like we used to.....
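For what it's worth, here is a minimal sketch of a format along the lines Peter describes; the header layout, null-indicator convention, and newline substitution character are assumptions for illustration, not SeETL's actual format:

```python
# Sketch of a self-describing delimited format with explicit null
# indicators and newline substitution. The header layout and the
# choice of markers are assumptions, not SeETL's actual format.

DELIM = "|"
NL_SUB = "\x01"   # stand-in character for embedded newlines

def write_records(fields, types, records):
    lines = [DELIM.join(f"{f}:{t}" for f, t in zip(fields, types))]
    for rec in records:
        cells = []
        for value in rec:
            if value is None:
                cells.append("N" + DELIM)          # null indicator, empty value
            else:
                # (values containing DELIM would still need escaping -
                #  the delimiter problem mentioned in the thread)
                cells.append("V" + DELIM + str(value).replace("\n", NL_SUB))
        lines.append(DELIM.join(cells))
    return "\n".join(lines)

def read_records(text):
    lines = text.split("\n")
    # Header is self-describing: field names and types, no external DTD.
    fields = [h.split(":")[0] for h in lines[0].split(DELIM)]
    rows = []
    for line in lines[1:]:
        cells = line.split(DELIM)
        # Cells alternate: indicator, value, indicator, value, ...
        row = []
        for i in range(0, len(cells), 2):
            row.append(None if cells[i] == "N"
                       else cells[i + 1].replace(NL_SUB, "\n"))
        rows.append(row)
    return fields, rows

data = write_records(["name", "note"], ["char", "char"],
                     [["Ann", "line1\nline2"], ["Bob", None]])
print(read_records(data))
# (['name', 'note'], [['Ann', 'line1\nline2'], ['Bob', None]])
```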
Let me know if you would like to know more...
Best Regards
Peter Nolan
www.peternolan.com

Both your comments are greatly appreciated, and thanks for the book reference. You have obviously gone to great lengths to eliminate the problem of verbose "marshalling". Your comments regarding nulls reflect the errors we observed when trying to avoid the issue despite their overhead.

Your experience with newline characters and delimiters is very consistent with work I was involved in on communications protocols. Escape characters were used when needed, usually with binary file transmissions, but they added some annoying overhead. Performance was of course much better in the mode where characters were accepted as-is, without the need to escape them.

I have considered several alternate strategies that would gain somewhat better performance in certain cases, but only that. For instance, a large file with very few escaped characters could carry a header listing those characters and their insertion points, but I am sure you can see the difficulties and deficiencies of most of these alternatives.
It might work in your case if one of the header fields was for newline insertion points, but you would have to have a significant number of them to justify the wasted delimiter when there are zero newline characters in an item. If you had other problem characters, you would be looking at pairs of (character type, insertion point) information. There is also a lot of added internal string processing when these characters are inserted; it would need some analysis and benchmarking even if initially considered viable.

Do you truncate nulls off the end of these records and assume null? Or is a field count not known in advance, or included in the header? In addition, once you are using escaping, you can define an escape sequence whose body carries arbitrary codes.

This works in that after starting the escape, subsequent meaningful codes are not limited in complexity. Like the nulls, once you are forced to embed escape sequences you might as well develop some powerful syntax because you are already paying the price on both ends.
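A hedged sketch of that kind of scheme (the escape character and code table are invented): once an escape introducer exists, the byte after it can carry an arbitrary code - here a newline, a literal delimiter, or the escape character itself:

```python
# Sketch of a generic escape scheme (escape character and codes invented).
ESC = "\\"
ENCODE = {"\n": "n", "|": "d"}           # newline, literal delimiter
DECODE = {"n": "\n", "d": "|", ESC: ESC}

def escape(s):
    out = []
    for ch in s:
        if ch == ESC:
            out.append(ESC + ESC)        # escape the escape itself
        elif ch in ENCODE:
            out.append(ESC + ENCODE[ch])
        else:
            out.append(ch)
    return "".join(out)

def unescape(s):
    out, i = [], 0
    while i < len(s):
        if s[i] == ESC:
            out.append(DECODE[s[i + 1]])  # the code after ESC can be anything
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

raw = "a|b\nc\\d"
assert unescape(escape(raw)) == raw
```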

Interesting thread. Not speaking here to the technology, but I prefer to have a service-level contract in place for interfaces that is easily enforced. I suggest a fixed-format, file-based interface is easy to enforce. XML could naturally see changes that the receiver might be required to address without notice.
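For illustration, here is a sketch of why a fixed-format contract is mechanically enforceable (the field names and widths are invented): the contract is just offsets and lengths, and a record that violates them is rejected outright:

```python
# Sketch of enforcing a fixed-format record contract (widths invented).
LAYOUT = [("cust_id", 8), ("amount", 10), ("currency", 3)]
RECORD_LEN = sum(width for _, width in LAYOUT)

def parse(record):
    # Enforcement is mechanical: wrong length means a broken contract.
    if len(record) != RECORD_LEN:
        raise ValueError(f"expected {RECORD_LEN} chars, got {len(record)}")
    fields, pos = {}, 0
    for name, width in LAYOUT:
        fields[name] = record[pos:pos + width].strip()
        pos += width
    return fields

print(parse("00001234     99.50EUR"))
# {'cust_id': '00001234', 'amount': '99.50', 'currency': 'EUR'}
```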

That's probably a relevant comment from the viewpoint of more than one level in the organization.
There are a few ways for you to pull the definitions onto your own systems (usually for performance reasons) and then extend them per your own requirements.

That being the technical aspect: given the value of the databases, an SLA management strategy fits and avoids the in-house software development costs. I think the other guys could speak to this much more effectively than I.