Now I am doing Machine Learning at SkyHive, a startup doing very interesting things in job search.

We are in beta now, and all employers can place their job offers for free. Of course, job seekers can post their information for free at any time. So, you are welcome. Hope you find it helpful. It is SkyHive and it is awesome!

Any distributed system requires serialization to transfer data between systems and applications. Serializers used to be hidden in adapters and proxies, where developers did not deal with the serialization process explicitly. WCF serialization is an example, where all we need to know is where to place the [Serializable] attributes. Contemporary tendencies bring serializers to the surface. In Windows .NET development, it might have started when James Newton-King created the Json.Net serializer, and Microsoft even officially declared it the recommended serializer for .NET.

There are many kinds of serializers, and many of them produce very compact data very fast. There are serializers for messaging, for data stores, for marshaling objects.

What is the best serializer in .NET?

Nope… this project is not about the best serializer. Here I show, in several lines of code, how to use different .NET serializers. Want to serialize an object and are looking for sample code? You are in the right place: just copy-paste this code into your project. The goal is to help developers with samples. Samples should be simple enough to copy-paste without effort. They should also be effective enough for most messaging scenarios. I want to show each serializer in the simplest way, but it is good to know that this simplicity will not hurt your code performance. That is why I added some measurements, so you can make the right decisions.
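
As a minimal example of the copy-paste style this project aims for, here is a round trip with Json.Net (the Person fields are illustrative assumptions, simplified from the test object):

```csharp
using System;
using Newtonsoft.Json; // NuGet package: Newtonsoft.Json

public class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

public static class Sample
{
    public static void Main()
    {
        var person = new Person { FirstName = "John", LastName = "Doe" };

        // Serialize the object to a Json string...
        string json = JsonConvert.SerializeObject(person);

        // ...and deserialize it back into an object.
        var copy = JsonConvert.DeserializeObject<Person>(json);

        Console.WriteLine(json);
        Console.WriteLine(copy.LastName); // the round trip preserves the data
    }
}
```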

Please, do not take these measurements too seriously. I have some numbers, but this project is not the right place to draw conclusions about serializer performance. I did not spend time getting the best results. If you have the expertise, please feel free to modify the code to get more reliable numbers.

Note: I have not tested the serializers that require an IDL for serialization: Thrift, Cap'n Proto, FlatBuffers, Simple Binary Encoding. Those sophisticated beasts are not easy to work with; they are needed for something more special than straightforward serialization for messaging. These serializers are on my to-do list. The ProtoBuf implementation for .NET was upgraded to use attributes instead of an IDL, kudos to Marc Gravell. The single exception is the new Microsoft Bond (thanks to OniBait again for coding the Bond part).

Installation

Most serializers are installed as NuGet packages. Look at the “packages.config” file to get the name of each package. I have included comments about it in the code.

Tests

The test data is created by the Randomizer. It fills the fields of the Person object with randomly generated data. This object is used for a single test cycle with all serializers, and then it is regenerated for the next cycle.

If you want to test serializers for different object sizes or for different primitive types, change the Person object.

The measured time covers the combined serialization and deserialization of the same object. When a serializer is called for the first time, it runs the longest. This longest time span is important and is measured: it is the Max time. If we need only a single serialization/deserialization step, this is the most significant value for us. If we repeat serialization/deserialization many times, the most significant values are the Average time and the Min time.
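
A sketch of this measurement loop (the serialize/deserialize delegates stand in for whatever serializer is under test; they are placeholders, not the project's actual interface):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class Measure
{
    public static List<double> Run(Func<object, string> serialize,
                                   Func<string, object> deserialize,
                                   object testObject, int repetitions)
    {
        var times = new List<double>();
        var sw = new Stopwatch();
        for (int i = 0; i < repetitions; i++)
        {
            sw.Restart();
            string data = serialize(testObject);  // serialization...
            object copy = deserialize(data);      // ...and deserialization, timed together
            sw.Stop();
            times.Add(sw.Elapsed.TotalMilliseconds);
        }
        // times[0] is usually the Max time: the first call initializes the serializer.
        return times;
    }
}
```

The Max, Min, and Average values described below are then taken from this list.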

There are two average measurements:

The Avg-100%: all measured times are used in the calculation.

The Avg-90%: the 10% slowest measurements are ignored.
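
The two averages can be computed like this (a hypothetical helper, not code from the project):

```csharp
using System.Linq;

public static class Averages
{
    // Avg-100%: all measured times are used.
    public static double Avg100(double[] times) => times.Average();

    // Avg-90%: the 10% slowest measurements are dropped first.
    public static double Avg90(double[] times)
    {
        int keep = (int)(times.Length * 0.9);
        return times.OrderBy(t => t).Take(keep).Average();
    }
}
```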

Some serializers serialize to strings, others to byte arrays. I used a string as the common denominator; byte arrays are converted to strings in the base64 format. I know it is not fair, because in many cases we serialize only to a byte array, not a string, and UTF-8 could also be a more compact format.
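
So for a serializer that returns a byte array, the measured size is that of the base64 string, roughly like this (a sketch, not project code):

```csharp
using System;
using System.Text;

// A byte-array result is converted to a string via base64,
// so all serializers are compared by string length.
byte[] payload = Encoding.UTF8.GetBytes("example payload");
string asBase64 = Convert.ToBase64String(payload);

// Base64 inflates the size by about one third compared to the raw bytes.
Console.WriteLine($"{payload.Length} bytes -> {asBase64.Length} chars");
```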

Test Results

Again, do not take the test results too seriously. I have some numbers, but this project is not the right place to draw conclusions about serializer performance. You have to take this code and run it on your specific data in your specific workflows.

Here are the test results for 100 repetitions. This number of repetitions shows stable results.

There is no such thing as the “best serializer”. If you invest time in optimizing the code, the loser becomes the winner. If you change the test data, the winner is not the winner anymore. So I mark a serializer as a “winner” only if it shows significant leadership in the numbers, not just 1-3%.

There are several winners in each category.

Compression

This category is important if you need the smallest data size (on the wire, in a store…). Look at the Size: Avg measurement. All winners in this category use proprietary, unreadable formats.

Winners are:

Solar.Bois

Avro

MsgPack

MessageSharkSerializer

NetSerializer

Bond

ProtoBuf

Notes:

All Json serializers create strings of almost the same size.

You can see the output serialized strings. They are written to the trace output, so use DebugView to see them.

Many Json serializers do not work well with the DateTime format out of the box. Only NetSerializer and Json.Net take care of the DateTime format without additional customization.

Speed on Single Run

This category is important if you only need to make a single serialization/deserialization. Look at the Max measurement.

Winners are:

Microsoft NetDataContract (surprise)

Json.Net

Microsoft Binary

Notes:

Avro, Bond, and NetSerializer also show good results.

Jil shows the worst result but, again, it means nothing.

The serializers show the biggest variance in this category.

Speed on the Large Scale

This category is important if you need many serialization/deserialization acts on the same objects. Look at the Min and Avg-90% measurements.

Winners are:

NetJSON

Bond

MessageSharkSerializer

NetSerializer

Jil

Avro

ProtoBuf

Solar.Bois

Notes:

MsgPack and BondJson also show good results.

Speed on Several Cycles

Winners are:

Avro

Bond

NetSerializer

Json.Net

This category is important if you need several serialization/deserialization acts on the same objects. Look at the Avg-100% measurement. It is a mix of the slowest and fastest times. As you can see, MessageShark, Jil, and ProtoBuf are not among the leaders anymore because of suboptimal initialization. But Json.Net is now a leader because of its good initialization time on our test data.

Notes:

The classic Json.Net serializer is used in two tests: Json.Net (Helper) and Json.Net (Stream). The tests show the difference between using streams and using the serializer helper classes. The helper classes can seriously decrease performance. Therefore, streams are a good way to keep up the speed without too much hassle.
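
The difference between the two styles can be sketched like this (assuming Json.Net's public API; the object being serialized is arbitrary):

```csharp
using System.IO;
using Newtonsoft.Json; // NuGet package: Newtonsoft.Json

public static class JsonNetStyles
{
    // Helper style: convenient, but builds the whole string in memory.
    public static string WithHelper(object value) =>
        JsonConvert.SerializeObject(value);

    // Stream style: writes directly into the stream, no intermediate string.
    public static void WithStream(object value, Stream stream)
    {
        var serializer = new JsonSerializer();
        using (var writer = new StreamWriter(stream))
        using (var jsonWriter = new JsonTextWriter(writer))
        {
            serializer.Serialize(jsonWriter, value);
        }
    }
}
```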

The first call to a serializer initializes it; that is why the following calls can be a thousand times faster.

For Microsoft Avro, my simple serialization interface is patched. For some reason it is not possible to pass a generic T type into it, so the type was hardcoded. Because of this hack, Avro runs the tests faster.

The Json and Binary formats do not bring too much difference in compacting the serialized strings.

Serializing classes with explicit constructors usually has a negative impact on the serialization speed. That is why the creation of the test class (Person) is implemented not by a constructor but by a method.

Many serializers do not work well with the Json DateTime format out of the box. Only NetSerializer and Json.Net take care of the DateTime format without special treatment.

The test prints the results on the console. It also traces the errors and the serialized strings, which can be seen in DebugView, for example.

The Apolyton.Json and HaveBoxJSON serializers failed in the tests; that is why we see zeroes in several measurements.

So, I have added this part. It shows how a small optimization can change the test results.

Actually, I have added just two lines of code, that's it.

The next update is related to garbage collection. The results were unstable; now they are not. The change again was a small piece of code in the test cycle:

GC.Collect(); GC.WaitForFullGCComplete(); GC.Collect();

The measurements got additional numbers. The Max time is important in cases where we want to do a single transformation. The Min and Average-90% times are important when we need a lot of cycles transforming the same objects. The Average-100% time is important when we are somewhere in between. The 99% time shows how fast the times stabilize and move towards the Min.

Updated Test Results

Updated Conclusions:

The Xslt transformation speed is on a par with the Object transformation speed on a broad spectrum of message sizes.

If you use Json instead of Xml, Json is always faster. The object size matters.

The object size is more important than the method of transformation for large objects. So the best way to improve speed for large objects is to compact the serialized object string. Removing all indents, spaces, and other white space significantly speeds up the transformation. Remove it from both the source and the target strings.

The speed differences between the three mentioned transformation methods are not significant, so choose based on development skills and speed. Development skills and knowledge beat the technology.
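
The whitespace removal mentioned in the conclusions can be sketched with the standard XmlWriterSettings (a minimal sketch, not project code; the gain depends on how heavily the document was indented):

```csharp
using System;
using System.IO;
using System.Xml;

var compact = new XmlWriterSettings
{
    Indent = false,                         // no indents or extra spaces
    NewLineHandling = NewLineHandling.None  // no new-line symbols
};

var output = new StringWriter();
using (var writer = XmlWriter.Create(output, compact))
{
    writer.WriteStartElement("Person");
    writer.WriteElementString("FirstName", "John");
    writer.WriteEndElement();
}

// The result is a single-line, compact Xml document.
Console.WriteLine(output.ToString());
```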

In the first part, I compared the Xslt transformations of XML documents with the Object transformations. It happens that the Object transformations are faster in most cases. They are also simpler from the developer's point of view.

The next natural question is: “If we use Json instead of XML, how much faster would the transformations be?”

If we have the freedom to choose Object transformations instead of Xslt transformations, it may be that we have even more freedom and can choose Json or another contemporary serialization format instead of XML. If this is the case, we should know the possible improvements, right?

Changes in the Test Project

I have added the NetSerializer Json serializer. You can get it as the NuGet “NetSerializer” package. It is one of the fastest Json serializers in .NET. The Object transformations are now executed in two modes: with XML serialization and with Json serialization.

Test Results

Practical Conclusions:

It is worth using Json instead of XML in Object transformations, if we have a choice.

Json gives us a speed improvement:

for small documents of several KBytes, the improvement could be >100%

for documents of several MBytes, the improvement could be >30%

The code for the Object transformation itself is the same, so there is no additional development effort to use Json instead of XML.

The Xml transformation is an important part of system integration. Xml documents are everywhere despite surging JSON.

When we need to transform [or map] one Xml document to another, we have several options. Two of them prevail. The first is the Xslt language. The second is the object transformation.

Xslt Transformations

The Xslt language was created exactly for this purpose: to transform one Xml document to another Xml document. I am copying the abstract of the Xslt standard here: “This specification defines the syntax and semantics of XSLT, which is a language for transforming XML documents into other XML documents...”

In reality, to make an Xslt map we possibly need the XML Schemas for the source and target Xml documents. The XML Schemas are not mandatory, but many Xslt editors use them to create Xslt maps. The BizTalk Server Mapper is one such example.

The Xslt map is itself defined as an Xml document. The Xslt operators and expressions are defined with the help of XPath, another Xml-related standard.
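
In .NET an Xslt map is applied with the standard XslCompiledTransform class. Using the stylesheet from the test project it might look like this (the input and output file names are assumptions for illustration):

```csharp
using System.Xml.Xsl;

var transform = new XslCompiledTransform();

// Load the Xslt stylesheet (itself an Xml document)...
transform.Load("PersonToEmployee.xsl");

// ...and transform the source Xml document into the target one.
transform.Transform("person.xml", "employee.xml");
```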

Object Transformations

An Object transformation is a transformation where the Xml is converted into the object graph of a programming language like Java or C#. Then the objects are mapped to other objects, which are converted back to Xml. The two conversions can be performed by the XmlSerializer, which is part of .NET. The mapping is written and executed as generic C# code.
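
A sketch of this flow with XmlSerializer (the Person and Employee classes and their fields here are illustrative assumptions, not the project's exact types):

```csharp
using System.IO;
using System.Xml.Serialization;

public class Person   { public string FirstName; public string LastName; }
public class Employee { public string Name; }

public static class ObjectTransform
{
    public static string Transform(string sourceXml)
    {
        // Xml -> object graph
        var source = new XmlSerializer(typeof(Person));
        Person person;
        using (var reader = new StringReader(sourceXml))
            person = (Person)source.Deserialize(reader);

        // object -> object mapping in plain C#
        var employee = new Employee { Name = person.FirstName + " " + person.LastName };

        // object graph -> Xml
        var target = new XmlSerializer(typeof(Employee));
        var output = new StringWriter();
        target.Serialize(output, employee);
        return output.ToString();
    }
}
```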

The Object transformation is not an official term. Usually in development you see the terms “mapping”, “transformation”, “converting”.

Comparison

Theoretically, the more specialized tool should always beat the less specialized one. Xslt was designed for exactly this purpose, so the question is: how much better is it? Transformation speed is the most important feature, so I tested the speed.

Test Project

Tests have two parameters: the number of repetitions and the number of the nested objects.

The repetitions should stabilize the test measurements and make them statistically more correct. The number of nested objects implicitly defines the size of the transformed Xml document.

Test Data

The test Xml document is created by XmlSerializer from the Person class.

Transformers

There are two transformers:

XsltTransformer

NetTransformer

The XsltTransformer uses the PersonToEmployee.xsl and PersonToPerson.xsl stylesheets. If you want to use third-party mappers to create or change the stylesheets, I have generated Xml schemas for you; they are in the XmlSchema.xsd file.

The NetTransformer uses the same XmlSerializer. The transformation code is simple and boring; nothing to say about it.

The transformers do not try to produce exactly the same transformations. Small differences in the transformations do not matter for our case.

Transformation and Enrichment

I have chosen two transformation types:

Enrichment, when the source and target Xml documents have the same schema. It is used to change the content of the documents without changing the document structure.

Transformation, when the source and target Xml documents have different schemas. It creates a new target document, with a different schema, using the data of the source document.

For enrichment we operate on the same document structure, so theoretically enrichment should be simpler and faster.

How We Test

A test document is created and tested with the Xslt and Object transformations, for both the Enrichment and Transformation types. That is one test cycle.

A new test document is created for each test cycle. This eliminates the possibility of optimizations that could be performed at the OS or memory-management level. The data for the test document is created randomly by the Randomizer class.

The test classes are initialized on the first cycle, so the first test takes much more time. I measure this maximum time, because it is important for the real-life situation when we need only one transformation.

To measure the average time, the 5% largest and 5% smallest values are removed from the calculation.

I also measure the size of the transformed Xml document. It shows the importance of spaces and new-line symbols for the document size.

Note about Result Xml document

The sizes of the result Xml documents for the Xslt and Object transformations should not be very different. Yes, you read that right: the result Xml documents will not be identical symbol-for-symbol for both transformations, and still the documents can be recognized as equal. It is because of the ambiguity of the Xml standard. For example, the namespace prefixes can be different for the same namespace. In one result we can get the “ns0:” prefix and in another the “abc12:” prefix, but both resolve to the same namespace. As a result, the Xml documents get different sizes, but both are equal in terms of data values and structure. Because of this, we cannot compare the Xml documents as strings. We could convert the Xml documents to object graphs and compare the resulting object sets: if all objects are equal, the Xml documents are equal. I decided not to compare the results because it is not the test goal. I just output the target Xml documents, so they can easily be compared if needed.
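
The object-graph comparison described above could be sketched like this (a sketch, not code from the project; it assumes the deserialized type overrides Equals with value equality):

```csharp
using System.IO;
using System.Xml.Serialization;

public static class XmlComparer
{
    // Two Xml documents are "equal" if they deserialize into equal object
    // graphs, regardless of namespace prefixes and other cosmetic
    // differences in the Xml text.
    public static bool AreEquivalent<T>(string xmlA, string xmlB)
    {
        var serializer = new XmlSerializer(typeof(T));
        T a, b;
        using (var readerA = new StringReader(xmlA))
            a = (T)serializer.Deserialize(readerA);
        using (var readerB = new StringReader(xmlB))
            b = (T)serializer.Deserialize(readerB);
        return a.Equals(b); // relies on value equality of T
    }
}
```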

Test Results

The result is surprising. The Object transformation wins in both the Transformation and Enrichment tests.

For this test, the size of the Xml document was about 10K. I have changed the size of the tested documents. When the size grows, the difference between the Xslt and Object transformations starts to decrease. The average times were almost equal for documents of 1M size.

It is worth mentioning the unstable result times. I write all measured times into the Trace output.

As you can see, the measured times are mostly stable, but sometimes they grow significantly. Possibly it is a result of the garbage collection. Please take this into account.

Practical Conclusions:

The Xslt transformation is not faster than the Object transformation, at least for Xml documents smaller than 1M.

Is it worth using Xslt, if we have a choice?

To use Xslt, we have to study a new language, Xslt. We have to study the XPath, XML, and maybe XML Schema standards. All this requires time and effort. There are also special skills for testing and debugging Xslt code.

The Object transformation is created in a pure, generic programming language. There is almost zero new knowledge we need to do it.

What about the time spent on creating the transformations? There are not too many good Xslt mappers on the market. The BizTalk Server Mapper is one of the best. It is very productive in simple cases. But if there is a little bit more complexity, the mapper is not a booster anymore but a stopper. There are several books with tips and tricks on how to do things with this Mapper. Are there such books for Object mapping? Not sure there are, because Object mapping is just generic programming, ABSOLUTELY NOTHING SPECIAL.

So, the conclusion is: use Xslt only for a very, very, very good reason. It is not faster at execution; it might be faster, but without a significant margin. It is not faster at development in most cases, especially if we count all the time spent on studying. Any programmer can develop Object transformations, but only a skilled Xslt developer can develop Xslt transformations.

This post is obsolete and kept only for sentimental reasons :)

Any distributed system requires serialization to transfer data through the wires. Serializers used to be hidden in adapters and proxies, where developers did not deal with the serialization process explicitly. WCF serialization is an example, where all we need to know is where to place the [Serializable] attributes. Contemporary tendencies bring serializers to the surface. In Windows .NET development, it might have started when James Newton-King created the Json.Net serializer, and Microsoft even officially declared it the recommended serializer for .NET.

There are many kinds of serializers, and many of them produce very compact data very fast. There are serializers for messaging, for data stores, for marshaling objects.

What is the best serializer in .NET?

No, no, no, this project is not about the best serializer. Here I gather code that shows, in several lines, how to use different .NET serializers. Just copy-paste the code into your project. That is the goal. I want to use each serializer in the simplest way, but it is good to know whether this way would really hurt your code performance. That is why I added some measurements, as a byproduct.

Please, do not take these measurements too seriously. I have some numbers, but this project is not the right place to draw conclusions about serializer performance. I did not spend time getting the best results. If you have the expertise, please feel free to modify the code to get more reliable numbers.

Note: I have not tested the serializers that require an IDL for serialization: Thrift, the new Microsoft Bond, Cap'n Proto, FlatBuffers, Simple Binary Encoding. Those sophisticated beasts are not easy to work with; they are needed for something more special than straightforward serialization for messaging. These serializers are on my to-do list. The ProtoBuf implementation for .NET was upgraded to use attributes instead of an IDL, kudos to Marc Gravell.

Installation

Most serializers are installed as NuGet packages. Look at the “packages.config” file to get the name of each package. I have included comments about it in the code.

Tests

The test data is created by the Randomizer. It fills the fields of the Person object with randomly generated data. This object is used for one test cycle with all serializers, then it is regenerated for the next cycle.

If you want to test serializers for different object sizes or for different primitive types, change the Person object.

The measured time covers the combined serialization and deserialization of the same object. When a serializer is called for the first time, it runs the longest. This longest time span is also important and is measured: it is the Max time. If we need only a single serialization/deserialization, this is the most significant value for us. If we repeat serialization/deserialization many times, the most significant values are the Average time and the Min time.

For the Average time I calculated three values:

For the Average 100%, all measured times are used.

For the Average 90%, the 5% slowest and 5% fastest results are ignored.

For the Average 80%, the 10% slowest and 10% fastest results are ignored.

If we see a significant difference between the 80% and 90% average times, we probably need to increase the number of tests to get more stable and correct results.

I also provide two result sets for different numbers of test repetitions, so we can make sure the tests show stable results.

Some serializers serialize to strings, others just to byte arrays. I used the base64 format to convert byte arrays to strings. I know it is not fair, because we mostly use a byte array after serialization, not a string, and UTF-8 could also be a more compact format.

Test Results

Again, do not take the test results too seriously. I have some numbers, but this project is not the right place to draw conclusions about serializer performance. You had better take this code and run it on your specific data in your specific workflows.

The test results below are for 100 and 200 repetitions.

The winner is… not ProtoBuf but NetSerializer by Tomi Valkeinen. Jil and MsgPack also show good speed and compact strings.

Notes:

The classic Json.Net serializer is used in two tests: Json.Net (Helper) and Json.Net (Stream). The tests show the difference between using streams and using the serializer helper classes. The helper classes can seriously decrease performance. Therefore, streams are a good way to keep up the speed without too much hassle.

The first call to a serializer initializes it; that is why the next calls can be a thousand times faster.

For Microsoft Avro, I did not find a fast serialization method, but its serialized string size is good. It has a bug preventing it from passing the serialized type to the class (see the comments in the code). I am really frustrated by Avro: it cannot run fast in my extremely simple code, and it cannot fit into my simple serializing interface. I would appreciate it if the Avro experts optimized my code on GitHub.

The Json and Binary formats do not bring too much difference in the serialized string size.

Many serializers do not work well with the Json DateTime format out of the box. Only NetSerializer and Json.Net take care of the DateTime format.

The core .NET serializers from Microsoft (XmlSerializer, BinarySerializer, DataContractSerializer, NetDataContractSerializer) are not bad. They show good speed, but they are not so good at serialized string size. The JavaScriptSerializer produces compact strings but is not fast. The DataContractJsonSerializer is more compact than the DataContractSerializer.

The NetDataContractSerializer, BinarySerializer, and Json.Net show the smallest Max times. That means they are the optimal choice when we need only a single serialization/deserialization cycle.

The test prints the results on the console. It also traces the errors, the serialized strings, and the individual test times, which can be seen in DebugView, for example.

So far I have seen only the waterfall methodology in BizTalk project development. I have worked for small and big companies, and everywhere I saw only waterfall. Is there something special in BTS projects that Agile is never used with them?

So far we, BizTalk developers, have all the disappointments of waterfall development: long project stages, huge and unusable documentation, disagreements between users, stakeholders, and developers, bloated code, unsatisfactory code quality, scary deployments and modifications. Many such problems are described here by Charles Young.

Recently our team decided to use Agile principles to address these issues. We had our victories and our defeats, but right now we feel better on our journey to the “Agile world”.

The business goal is simple: we desperately need faster development. When we deploy and test new code in hours, not weeks, we make a lot more iterations; we make, find, and fix a lot more errors. Which is just great. Now an error is not a catastrophe; it is a small thing. That means more reliable code. Now our applications are more reliable, and we fix errors very fast.

The main reason to use Agile is economics. It is not only faster and more reliable, it is cheaper.

We decided to use Agile together with SOA and a microservice architecture. At first we thought BizTalk was too heavy a tool set to be used with Agile. But it happens that BizTalk Server has a very special set of attributes that suits a microservice architecture very well out of the box. If you think about orchestrations and ports as microservices, this part of BizTalk fits SOA perfectly.

Stoppers

Three main things keep BizTalk developers from using Agile: artifact dependencies, “niche, unnecessary tools”, and manual deployment.

Here I am talking only about the technical side of the problem. The management side will be touched on later.

Artifact Dependencies

Dependency is the other side of code reuse. And here lies one of the main differences between BizTalk applications and generic applications. The latter are created as a set of dlls in C#, Java, or any other programming language. In many cases we prefer to simplify things with code reuse, which creates some dependency problems, but usually this is not a big issue. What about BizTalk applications? BizTalk Server keeps strict control of the working artifacts. Reliability is the king. We cannot just replace one buggy artifact if there is a dependency on it. It hurts redeployment but, remember, reliability is the king. Anything else is not so important. So for us, as BizTalk developers, the cost of a dependency is really high.

Moreover, dependency is a stopper for an SOA application. Keep services independent, and it is easy to modify them, to add new ones, and to do service versioning.

Niche, Unnecessary Tools

BizTalk Server is a big toolset. It is impossible to hire a full team of expert BizTalk developers. Most of the folks are not specialized too deeply in BizTalk.

So if we keep the technology stack limited, the time to make a new developer productive is short. Any tool can be replaced by C# code, and the decision was to use a bare minimum of the BizTalk tools: Schemas, Maps, Orchestrations.

Completely prohibited are BRE, BAM, and the ESB Toolkit.

Custom pipelines, direct binding, and Xslt are limited to very special cases.

The differentiator was the question: “Is this tool 100% necessary, and does it require a special skill set?”

I am not going to start another holy war. Our decisions are based only on our use cases. In your company, in your zoo, the decisions would be different.

Deployment

How do we make development iterations fast? One part of the development cycle is deployment. It is not a problem at all if you write your program in Python or Go. It is not so big a problem if you write your app in .NET. But in BizTalk development it is The Problem. So any methods to keep deployment quick are very important.

BTDF was a necessary tool in our case, but the BizTalk PowerShell Provider is also used.

Deployment is always about dependencies. Penalize more dependencies and prize fewer dependencies, that is the idea.

Technology Rules

We started by defining rules about the technology side of the projects. We enforced several SOA and microservice rules:

Service size: a service encapsulates a single business function and exposes only a single interface. This rule effectively cuts a BizTalk application down to a single orchestration (or a couple of ports) in most cases.

Shared Contracts, API: services must communicate only through contracts. The only permitted dependencies between services are the contracts. We never share maps, orchestrations, ports, or pipelines between applications. We only share schemas and APIs.

Versioning: a service upgrade is published as a new service. A published service is never changed; it can only be removed.

Tests: tests are an important part of the application. An application without tests is not approved. We need only user acceptance tests. The minimum test set should cover successful tests and failure tests. Performance tests and unit tests are not mandatory. The special test data is part of the design. We tried to design our applications in such a way that test data can be used in production together with production data.

Automation: automated deployment is an important part of the application. An application without automated deployment is not approved.

As you can see, this list has some specifics for BizTalk projects. BizTalk is oriented to XML; that is why we say an XML Schema when we mean a Contract. In BizTalk projects it is really hard to test endpoints, so we tried to be easy with testing. The deployment of a BizTalk application is a complex and long process, so we enforce deployment automation.

These rules were not easy to push into development. There were many unanswered questions on the way: What size is “small” and what is “big”? What modification is considered a new version? What test coverage is enough?

These rules are not ideal. We had discussions, we added and removed rules, we changed them. We are still in the process.

Team Rules

Those are the “technological” rules. But we also need management, team rules, because Agile is about the team structure and communications, right? Conway's law cannot be ignored if you think about Agile. Also, we cannot avoid all this hype around the DevOps movement. So we put into practice some additional “team” rules:

A single service is implemented by one developer.

A single service is implemented in a one-week sprint.

The service developer is a DevOp; he or she is responsible for deployment and all operations of the service in all environments, including production.

The “one week” rule just happened after a couple of months of experiments. We tried 2-3 days and 2 weeks. One week works because we want to keep all projects in one team. All developers work remotely; we have never met together as a team. And our team is not big, so “one week” came out of those factors.

One of the key issues with Agile and DevOps is that the service knowledge is tightly coupled with a single developer. Enterprises cannot tolerate this, because it drastically increases the risk of losing a service if the developer is not accessible. So we added a new role (the senior developer) and two more “team” rules:

A senior developer approves the service design. This senior developer performs the tests and the deployment of the service in production and signs the service off to production.

If the service developer or the service senior developer is not accessible, a second developer should take responsibility for this role.

The “Senior Developer” rule is tricky. With this rule, a senior developer cannot just sign documents, approve something, and voila. No way. This rule effectively forces the senior developer to do a good code review and monitor all development steps. Officially this rule covers two short tasks, but they are short only if this team of two invests a good amount of time in communicating all the details of the service.

Dependency Rule

The true heart of SOA and microservices lies in restricted dependencies. With BizTalk applications, the dependency problem is even more important than in standalone applications. BizTalk controls dependencies and prohibits many shortcuts permitted in standalone applications. So eliminating dependencies simplifies development and makes it possible to create simple services, the microservices, which is our goal.

But we cannot just remove all dependencies. Service dependencies are a necessary evil. We have to share schemas, dlls, and services. So there are special “dependency” rules:

Any change to shared resources, any change that could reach outside the service boundaries, must be approved by the whole team. All team members should agree on the change, and any team member can veto it.

For example, we implemented a shared infrastructure for logging. The first approach was to create a shared library (dll) and force everybody to use it. It was vetoed. The second approach was to create a special logging service and expose it as a single endpoint. It was also vetoed, because the code needed to use this service was too big. The next try was to use log4net or NLog and standardize only the log format. Then there was the next attempt, and more. Now we are discussing InfluxDB, but what matters is that we now know much, much more about what we really need and what we don’t.

Complete Rule Set

Now we have this rule set:

Single Interface

Shared Contract

Never Change [Published]

Test inside Application

Automated Deployment

DevOp

Senior Developer

Second Developer

New Dependency

These rules are not ideal. For example, we are still struggling with the Wrong Requirements problem. How do we fix it?

Our rule set is too long. We are now considering merging the Test and Deployment rules.

This rule set works for our team. What is special about our team? Half of the team members are full-time employees, which enables the DevOp and Senior Developer rules. I am not sure these rules would work if all members were contractors. Our team has a big list of projects to develop and a big list of applications to support. If you mostly have applications to support, our approach is possibly not the best fit for you.

I did not describe the communications with our customers and stakeholders, which are an important part of the whole picture.

I did not describe the Agile practices we use (we definitely use a Kanban board); that is not the point of this topic.

We were happy with our management, which took the risk of changing the processes and the team. Now the management is happy (hmm… almost happy) with BizTalk development and operations. Now we are a SUPERFAST team! We are not sharks, not yet, but we are not jellyfish anymore.

The comparison process is simple. We create a sieve and filter the technologies through it. In the end we get several winners, one, or even zero.

The filters are a mix of technology, life-cycle, and plain common-sense questions. The whole filtering process looks unscientific and unsystematic, but it is sane (I hope) and simple.

Here are the filters:

What systems do we integrate?

What are your development resources: the team skills, the team agility, the team size?

Java, .NET or both?

What is the life horizon of the integration? 1, 2, 5, 10 years?

The nonfunctional requirements:

reliability

sustainability

scalability

performance: throughput (messages per sec/day); message size; latency

etc.

An integration technology can fail on just one filter, and that is enough to remove it from the competition.

So how do NServiceBus and MassTransit compete here? They are very similar in technological aspects, so a non-technology factor has to give us the winner. It is the life-horizon factor.

Here is the development activity on the code base of both systems:

Do I need to comment on those graphs? Probably not.

OK, let’s do one more check with Google Trends:

There is a company that supports NServiceBus: Particular Software. MassTransit is supported only by the open-source community, which showed a lot of enthusiasm in the early years but has pretty much stopped by now. The key developers of MassTransit have moved on to other projects.

I have not worked with either of these systems and I am not, by any means, choosing the “best” integration system. Some features of one system could be much better than the other’s; it doesn’t matter now. I am not choosing the better system; I am only identifying the system that fails my specific requirements.

It happens that the customer wants the integration code to work, and to keep improving, for at least the next 5 years. From this perspective MassTransit failed.

[Update 2014-07-16: BTW, one of the core developers of NServiceBus, Udi Dahan, commented on this post about the first picture. It seems a good proof of a healthy system.]

The domain standard schemas are repositories of domain knowledge. The message schemas describe the real data transferred between real systems.
Use the domain standard schemas as a reference model for your schemas. Do not use the domain standard schemas as your message schemas.

In EDI processing we need to transform data from the EDI format into the data formats of our applications and back.

What is the best way to do this?

The most popular approach uses an intermediate XML format. We use the XML Schema and XSLT standards to transform one XML format into another. Is it the best way?

Let’s look at the whole data processing chain:

EDI to XML transformation;

data transformation (with XML Schema and XSLT);

XML to SQL transformation.

Now I’m going to step back and look at the EDI document structure a little bit.

Does the EDI structure resemble the XML structure or the SQL structure?

It definitely resembles SQL. The EDI segments are like SQL table records. The EDI elements are like table columns. (Sometimes an element is composed of several subelements, but that is unimportant now.)

Does the EDI structure resemble the XML structure? It does not. XML relations are expressed by nesting structures. For example, in XML the Order Detail records are nested inside an Order record. In EDI the Order Detail and Order segments are not nested; the relations are defined the same way as in SQL, by correlated IDs in the related segments.

So the EDI structure resembles the SQL structure, not the XML structure.

Also remember our final goal, which is the application data in SQL format. Can we just bypass the XML format and transform EDI directly into the SQL format?

This is more natural. We throw out the two complex EDI-to-XML and XML-to-SQL transformations and replace them with a single EDI-to-SQL transformation. Why are EDI to XML and XML to SQL so complex? Because we have to map the reference relations to the nesting relations, or vice versa. It is not simple. There are whole books teaching us tips and tricks for these transformations.

Why is the EDI-to-SQL transformation simple? Because it is a one-to-one mapping: a segment to a record and an element to a field, the SQL-like EDI structures to the SQL application structures.

There is one problem with this new approach: we have to create code for the EDI-to-SQL transformation. It is not a hard problem if we use contemporary techniques like LINQ or Entity Framework. Those techniques look competitive even against the adapters implemented in specialized integration systems like BizTalk Server or Mule ESB. The code is pretty straightforward if we use a two-step transformation:

EDI to SQL transformation;

data transformation (SQL to SQL)

The first step is to transform EDI segments and elements into SQL records and fields that mirror the EDI structure (EDI SQL); the second step is to map those records to the SQL records of our application database (App SQL).
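As an illustrative sketch of this two-step idea (plain Python with sqlite3 standing in for a real EDI parser and database; the segment layout, table, and column names are all invented):

```python
import sqlite3

# A hypothetical, much-simplified EDI fragment: one header (BEG) segment
# and repeating line-item (PO1) segments, related by position, not nesting.
edi = "BEG*00*NE*PO-1001~PO1*1*10*EA*9.95~PO1*2*4*EA*2.50~"

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Step 1 target: "EDI SQL" tables mirroring the segments one-to-one.
    CREATE TABLE edi_beg (po_number TEXT);
    CREATE TABLE edi_po1 (line INTEGER, qty INTEGER, unit TEXT, price REAL);
    -- Step 2 target: the application ("App SQL") tables.
    CREATE TABLE app_order (po_number TEXT);
    CREATE TABLE app_order_line (po_number TEXT, line INTEGER, qty INTEGER, amount REAL);
""")

# Step 1: EDI -> EDI SQL, a one-to-one mapping
# (a segment becomes a record, an element becomes a field).
for segment in filter(None, edi.split("~")):
    tag, *elems = segment.split("*")
    if tag == "BEG":
        db.execute("INSERT INTO edi_beg VALUES (?)", (elems[2],))
    elif tag == "PO1":
        db.execute("INSERT INTO edi_po1 VALUES (?, ?, ?, ?)",
                   (int(elems[0]), int(elems[1]), elems[2], float(elems[3])))

# Step 2: EDI SQL -> App SQL, a plain SQL-to-SQL transformation.
db.execute("INSERT INTO app_order SELECT po_number FROM edi_beg")
db.execute("""
    INSERT INTO app_order_line
    SELECT b.po_number, p.line, p.qty, p.qty * p.price
    FROM edi_po1 p CROSS JOIN edi_beg b
""")

print(db.execute("SELECT * FROM app_order_line ORDER BY line").fetchall())
```

In a real .NET solution the same two steps would be done with LINQ or Entity Framework, but the shape of the work is the same: a trivial flattening pass, then set-oriented SQL.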

This example shows how we can simplify our solutions, i.e. how to do some real architecture.

Is there something special about BizTalk Server application deployment? Why is it so special?

BizTalk Deployment Hell

For .NET applications life is simple. There is an exe file and maybe several additional dlls. Copy them, and that is pretty much all the deployment we need.

BizTalk Server requires the dlls to be placed in the GAC and registered in the special Management database. Why is this required? Mostly because BizTalk automatically hosts applications in a cluster, and because of reliability. A BizTalk application may not be easy to deploy, but it is also not easy to break. For example, if application A depends on application B, BizTalk will prevent us from removing B accidentally.

This “Why?” question is a big theme, and I will not cover it here.

Another factor is that BizTalk has many pieces that require special treatment in deployment and at run-time.

Yet another factor is that a BizTalk application integrates independent applications and systems, which means the application has to take care of many configuration parameters, such as endpoint addresses, user names, passwords, and so on.

As a result, BizTalk application deployment is complex, error prone, slow, unreliable, and requires a good amount of resources.

And look here, there is a savior: the BizTalk Deployment Framework.

The BizTalk Deployment Framework - the Champion

The BizTalk Deployment Framework (BTDF) is an essential tool in the arsenal of BizTalk Server developers and administrators. It solves many problems and speeds up development and deployment enormously.

It definitely deserves a prominent place in the BizTalk Server Hall of Fame.

BTDF was created by Scott Colestock and Thomas Abraham.

It is an open-source project, despite the fact that BTDF is more powerful, reliable, and thorough than most commercial BizTalk Server third-party tools. I think it is fair to donate several dollars to those incredible guys on CodePlex. Just think about the days and months we save in our projects and in our private lives.

BTDF is an integration tool created by guys with a pure integration mindset. It integrates a whole bunch of open-source products into one beautiful BTDF. Below I copy the "Contributors and Acknowledgements" topic from the BTDF Help:

Tim Rayburn for contributing a bug fix and unit tests for the ElementTunnel tool

Giulio Van for contributing ideas and bug fixes.

The hundreds of active users across the world who have promoted the Deployment Framework, reported bugs, offered suggestions and taken the time to help other users! …”

And how exactly does BTDF save our lives?

BTDF is installed and tuned up. Now it is time to deploy an application.

The deployment was successful, and I got a long output describing exactly what was done in this deployment.

Here is a log. I've marked my comments in the deployment log with “+++”. The full redeployment lasted 2 minutes.

If I performed the same tasks manually, I would spend at least 3-4 additional minutes fully concentrated on this long list of deployment tasks. The probability of missing some step would be quite high, as would the probability of errors. With BTDF I start the deployment and am free to do other tasks.

So my gain is 3+ minutes on each deployment. The risk of an error is zero; everything is automated.

There is one more, psychological, problem with manual deployment. It is complicated and requires full concentration. As a developer I am concentrated on the application logic, and a manual deployment task breaks my concentration. After each deployment I have to concentrate on my application logic again. And this is where development performance goes to hell.

There are other helpful BTDF commands:

Restart all host instances (only those used by the application) and restart IIS if I deploy web services, all with the "Bounce BizTalk" command.

Terminate all orchestration and messaging instances remaining after tests.

Install the .NET component assemblies into the GAC.

Include the modified configuration parameters into a binding file.

Import this binding file.

Update SSO with modified configuration parameters.

All this with one click.

You do not have to use the slow BizTalk Administration Console for those things anymore.

Check my comments in the BTDF deployment log. Several of the automated tasks are like a blessing! We do not create the drop folders anymore and do not assign permissions to those folders. BTDF does this automatically.

We do not care about undeployment and deployment order; BTDF gets everything right.

We do not stop and start ports, orchestrations, applications, host instances, IIS. Everything is automated.

Look at the list of the top level parameters in BTDF:

Remember I told you that BizTalk application deployment is a little bit complicated? All the application components in this list require slightly different treatment in deployment. In a simple application we do not use all those components; a typical application uses maybe 1/3 or 1/2 of them, but you get the idea.

How do we tune up BTDF?

One thing I love in BTDF is the Help. It is exemplary, ideal, flawless.

If you have never tried BTDF, there is a detailed description of all parts and all tasks. Moreover, there is the most unusual part: discussions of the BTDF principles and the BizTalk deployment processes. I got more knowledge about BizTalk Server deployment there than from the official Microsoft BizTalk Server Help documentation.

The BTDF Help helps whether you are a new user or have used BTDF for several years in a row. The descriptions are clear, they are not dumbed down, and they are arranged in a clear hierarchy. The BTDF Help is one of the best. You are never lost.

Of course, there is a detailed tutorial in the Help, and there are sample applications.

OK, now we have to start with the tuning.

A typical BTDF workflow for setting up the deployment of a BizTalk application:

Create a BTDF project.

Set up the deployment project.

Create an Excel table with the configuration parameters.

Set up a binding file.

All configuration parameters are managed inside an Excel table:

Forget about managing different binding files for each environment. Everything is inside one Excel table. BTDF will pass parameters from this table into the binding file and other configuration stores.

Excel helps when we compound parameters from several sources. For example, we keep all file port folders under one root. The folder structure below this root is the same in all environments; only the root itself is different. So there is a "RootFolder" parameter, and we use it as part of the full folder paths for all file port folders. For example, we have a "GLD_Samples_BTDFApp_Port1_File_Path" parameter, which is defined in a cell with the formula "RootFolder" + "GLD_Samples_Folder" + "BTDFApp_Folder" + "Order\\*.xml" (of course, in the Excel cell it would be something like = C22 & C34 & C345 & "Order\\*.xml"). If we modify the RootFolder path, all related folder paths are modified automatically.

OK, we worked hard setting up the application deployment. Now, surprise: we have to create a new environment. Here is the best part of the BTDF configuration fiesta: all configuration parameters for ALL ENVIRONMENTS for an application are in just one table. (If you are tenacious enough, you could keep a single table for ALL applications, but you will have to ask me how.)

Settings for different environments: in our Excel table we copy-paste a column of one of the existing environments into a new column for the new environment. Then we modify the values in this new column. Again, this is an Excel table: we define a single RootFolder value for the new environment and, voila, all file port paths for this environment are updated.

Now we have to pass those configuration parameters to the binding files, right?

We replace all values in the binding file that differ between environments, such as host names, NTGroupName, addresses, and transport parameters like connection strings. We replace these values with variables, like this:
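The original screenshot is not reproduced here; as a hedged illustration, a fragment of such a binding file might look like the following, with BTDF's ${...} placeholders standing in for the environment-specific values (the port and setting names are invented):

```xml
<!-- Hypothetical fragment of PortBindingsMaster.xml. BTDF replaces each
     ${...} variable with the value from the settings spreadsheet column
     for the target environment. -->
<SendPort Name="GLD_Samples_BTDFApp_SendOrder" IsStatic="true">
  <PrimaryTransport>
    <Address>${GLD_Samples_BTDFApp_Port1_File_Path}</Address>
  </PrimaryTransport>
  <SendHandler Name="${SendHostName}" />
</SendPort>
```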

Now we save this binding file with the PortBindingsMaster.xml name.

That is pretty much everything we need. Now execute the Deploy BizTalk Application command, and the application is deployed.

Deployment into a production environment is different. It is limited by the installed tools, and we have to perform additional installation steps. BTDF creates an msi and a command file. This msi includes all the additional pieces we need to install. We no longer have to manually add resources to the BizTalk application in the Administration Console.

Conclusion: the BizTalk Deployment Framework is a mandatory tool in BizTalk development. If you are a BizTalk developer, you must know it and use it.

Here I am going to show details of the BizTalk Server Publishers and Subscribers. We will investigate all of them step by step.

Publishers

There are two main publishers: receive locations and orchestrations. The real number of publishers is bigger, but let's talk about that later.

Receive Location

I have created a receive port with a File receive location and the PassThruReceive pipeline. Let’s drop a file into this location. There are no subscribers for it, so we get a suspended message with these promoted properties:

Note the PassThruReceive pipeline does not promote any additional properties.

We see several ID properties, like ReceivePortID, which are used internally for the binding subscriptions. I am not sure it is a good idea to use them explicitly.

Now we change the PassThruReceive pipeline to the XMLReceive pipeline. One more property is promoted, the MessageType:

Orchestration

I have created an orchestration with one send port shape that publishes messages of the XmlDocument type. This is officially called an "untyped" message in BizTalk, because of the special meaning of the message type. For .NET developers it can be confusing, because the messages do have a type, the XmlDocument class. Anyway...

Let’s send a message to this shape and check the properties promoted on this port shape.

There are two promoted properties: Operation and SPTransportID. The SPTransportID is the ID of this send shape in the orchestration, which is a constant for the current deployment.

Note the inconsistency: the ReceivePortName is promoted for the receive port, but neither the orchestration send shape name nor the orchestration name is promoted. So we can set up a filter expression on the receive port name, but we cannot set one up on the orchestration name. This promotion makes sense as part of the binding design, but the inconsistency is worth mentioning.

Let's publish a typed message now. That means the Message Type is set to a schema, not to the XmlDocument class. Let’s check the context of the sent message:

A new MessageType property is promoted now. Note that under the hood the orchestration send port shape uses the same mechanism as the XMLReceive pipeline, which also promotes MessageType. All we have to do to get a MessageType property is assign a schema to the Message Type parameter of the port type request:

Subscribers and Binding

There are three main subscribers: send ports, send port groups, and orchestrations.

We use this query to reveal subscriptions:

Send Port

Note: if we unenlist a send port, its subscription is removed from the subscription list.

Important! If a send port doesn’t have any filter expression and if it is not bound, it still has a subscription:

Any message addressed to its ID will be routed properly. Technically it means that if this property is promoted in a published message, the message will be routed to the port.

If we add a Filter expression, the subscription adds an OR predicate. In our case we added a predicate that subscribes to the messages from a specific receive port:

Send Port Group

If we unenlist the send port group or all of its send ports, its subscription is removed from the subscription list.

If the send port group doesn't have any filter expression and if it is not bound, it still has the subscription:

A send port group behaves like a single send port: it creates a separate subscription. Messages are delivered to the ports included in the group for the group subscription. Messages are also delivered to the individual ports for the port subscriptions. If you have the same filters on the group and on a send port included in this group, the messages will be delivered twice.

Rule of thumb: create filters on the individual send ports OR on the send port group, never on both.

Orchestration

If the orchestration is unenlisted, its subscription is still there with the Disabled status; it is not removed as happens for a send port.

If the receive shape of the orchestration is untyped and has the Specify later binding, the subscription includes the ReceivePortID. It is the ID of the orchestration receive shape.

If the receive shape of the orchestration is typed and has the Specify later binding, the subscription includes the ReceivePortID and the MessageType.

Note: the AND predicate is used.

If the receive shape of the orchestration is typed and has the Direct binding, the subscription includes the MessageType. The ReceivePortID is not in the subscription anymore.

Note: the send port and the send port group get default subscriptions on IDs, which we cannot remove. These subscriptions are to the SPTransportID and the SPGroupID, respectively. The orchestration has a default subscription to the ReceivePortID, but not for the Direct binding.

Compare standards defined in the form of XML schemas with standards defined in the form of documents. It is almost impossible to verify whether data satisfies a standard if the standard is defined in a text document. But it is possible to validate it, and validate it automatically, if we use XML Schemas.

The domain specialists use XML Schemas to define standards in an unambiguous, machine-verifiable format.

Those schemas tend to be large, even huge, and very detailed. And that is for very good reasons.

But if we start to use XML Schemas for the first task, processing XML documents in our programs, we need something different: we need small schemas. In system integration we need small schemas.

I want to emphasize this. If you work with a hundred values but use a schema with a thousand nodes, it is completely wrong. It smells all around; it poisons all your code. You don't want to know what programmers call this type of code.

We don't need the abundance of the HIPAA schemas in most applications. We only need a small portion of a schema to validate or transform the part of the data that is significant for the application.

We upload megabyte-sized schemas, we perform mapping on these huge schemas, and it lasts an eternity and consumes a huge amount of CPU and memory.

For most integration projects we don’t want to validate that the data satisfies the standard. We want to transfer data between systems as fast as possible with minimal development effort.

How do we work with those wealthy schemas? How do we make our integration fast, both at run-time and in development?

First we have to decide: does our application require the whole schema or not?

If the answer is "No", read on to the solution.

How to Simplify?

The solution is to simplify the schema: cut out all unused parts of the schema.

The first step in our simplification is to decide which parts of the original schema we want to transfer further, i.e. map to another schema. We keep these parts unchanged and simplify all the other, unnecessary schema parts.

The second step is to research whether the target system performs validation of the input data or not. A good system usually validates input data. Validation includes data format validation (is this field an integer or a date, does it match a regex?), data range validation (is this string too long, is this integer too big?), encoding validation (does this value belong to the code table?), etc.

If the target system performs this validation, it doesn't make sense for us to perform the same data validation in the integration layer. We just pass the data to the target system without any validation. Let that system validate the data and decide what to do with errors: send them back to the source system, try to repair them, or something else. Actually, it is not good architecture if an intermediary (our integration system) tries to make such validations and decisions. It means spreading the business logic between systems, where the target system delegates its data validation logic to the intermediary. The integration system should deal with data validation only when it is really needed.

Example: HIPAA Schema Simplification in the BizTalk Server

Now let's get more technical. The next example is implemented with BizTalk Server and the HIPAA schemas, but you can apply the same principles to other systems and standards.

The first step in the schema simplification is the structural modification. It is pretty simple. We replace the unused schema parts with <any> tags [http://www.w3.org/TR/xmlschema-0/#any]. If we still want to map a schema part, but without any details, we can use the Mass Copy functoid.
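To make this concrete, here is a schematic sketch (not taken from a real HIPAA schema; the element name is invented) of what such a replacement looks like: the detailed definition of an unused loop becomes a lax wildcard, so the documents still validate while the schema no longer carries the full type tree.

```xml
<!-- Before: "UnusedLoop" contained dozens of typed child elements.
     After: a lax xs:any wildcard accepts the same content without
     describing it. -->
<xs:element name="UnusedLoop">
  <xs:complexType>
    <xs:sequence>
      <xs:any processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
```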

The second part of the schema simplification is the type simplification.

Open the schema again with the Schema Editor, make any small change, and undo it. The editor recalculates the type information and pops up the Clean Up Global Data Types window. Check all types and click OK.

This cleans up all unused Global Data Types.

Previously we replaced all those types with the “xs:string” type, so they are not used anymore.

It takes 5 minutes for this replacement. What is the result?

The modified schema is half the size.

- is the dll size with the original schema.

- is the dll size with the modified schema.

The assembly for the modified schema is also cut in half.

Not a bad result for a 5-minute job.

How do these simplified schemas change performance?

All projects with schemas and maps compile notably faster in Visual Studio. As a developer, I like this improvement.

How about the run-time performance?

I have made a simple proof-of-concept project to check the performance changes.

Test project

The solution is composed of two BizTalk applications and two BizTalk Visual Studio projects. Do not do this in production projects! One Visual Studio solution should keep one and exactly one BizTalk application.

Each project keeps one HIPAA schema, one very simple schema, one “complex” map (HIPAA schema to HIPAA schema), and one simple map (HIPAA schema to the very simple schema).

The first project works with the original HIPAA schema and the second with the simplified HIPAA schema.

Build and Deploy one project.

Each BizTalk application is composed of a file receive location and a file send port. The receive location uses the EdiReceive pipeline to convert the text EDI documents into XML documents, so we need to add a reference to the “BizTalk EDI Application”:

After deployment, import the binding file found in the project folder. Create the In and Out folders and apply the necessary permissions to them. Change the folder paths in the file locations to your folders.

There is also a UnitTests project with several unit tests. Change the folder paths in the test code.

Perform tests.

Then delete the application, deploy the second BizTalk project, and perform the tests again.

Do not deploy both projects side by side.

Performance results:

Note: before each test, start the Backup BizTalk job to clean up the MessageBox a little.

Tests with 1, 10, and 100 messages did not show a visible difference. In my environment the difference became noticeable in the 1000-message and 3K-message batch tests. The table above shows the test results for the 3K batch tests.

The performance gain is about 10%. It is not breathtaking, but it is not bad for 5 minutes of effort.

Conclusion: schema type simplification is worth doing if the application expects sustained high payloads or high peak payloads, and anywhere you want the best possible performance.

Sometimes we need to do complex decoding in data transformations. This happens especially with big EDI documents such as HIPAA.

Let’s start with examples.

In the first example we need to decode a single field. We have the source codes and the target codes for this field. The number of codes on both sides is small, and the mapping is one-to-one or many-to-one (1-1, M-1). One of the simplest solutions is to create a Decode .NET component. Store the code table as a dictionary, and the decoding will be fast. We can hard-code the code table if the codes are stable, or cache it, reading it from a database or a configuration file.
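A minimal sketch of such a dictionary-based Decode component (illustrative Python; the codes, field semantics, and the pass-through policy for unknown codes are invented assumptions, not part of the original):

```python
# Hard-coded code table: many-to-one (M-1) mapping,
# two source codes map to the same target code.
SOURCE_TO_TARGET = {
    "A1": "ACTIVE",
    "A2": "ACTIVE",
    "T9": "TERMINATED",
}

def decode(source_code: str) -> str:
    """Decode a single field value; unknown codes pass through unchanged."""
    return SOURCE_TO_TARGET.get(source_code, source_code)

print(decode("A2"))   # ACTIVE
print(decode("X0"))   # X0 (unknown code, passed through)
```

In a real solution the dictionary would be loaded once from a database or a configuration file and cached, exactly as described above; lookups stay O(1) either way.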

The next example is at the opposite end of the complexity scale. Here we need to decode several fields. The target codes are related to several source codes/values (M-1). It is not plain value-to-value decoding; this decoding also includes several if-then-else conditions, which opens a can of worms with 1-M and M-M cardinality. Moreover, the code tables are big and cannot be kept in memory.

We can implement this decoding with numerous calls to the database to get the target codes, performing these calls inside a map or inside a .NET component. As a result, for each document transformation we call the database many times.

But there is another way to implement this without flooding the database with these calls. I call this method “SQL Decoding”.

Remember that SQL operations are set operations working with relational data directly. The SQL server is very powerful at executing these operations. Set operations are so powerful that we might decode all fields in a single operation. It is possible, but all source values must be in the database at that moment. All we have to do is load the whole source document into SQL data. Hence our method is:

Load the source message into the SQL database.

Execute the decoding as a SQL operation or a series of operations.

Extract the target message back from SQL.
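The three steps above can be sketched with an embedded database (illustrative Python with sqlite3; the table layout, the "context" column, and the codes are invented to show the M-M case, not taken from a real project):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE code_table (src TEXT, context TEXT, tgt TEXT);
    CREATE TABLE src_message (field_no INTEGER, context TEXT, value TEXT);
""")

# The big code table lives in the database, not in application memory.
db.executemany("INSERT INTO code_table VALUES (?, ?, ?)", [
    ("A1", "claim",  "ACTIVE"),
    ("A1", "member", "ADMITTED"),   # same source code, different context
    ("T9", "claim",  "TERMINATED"),
])

# Step 1: load the source message values into the database.
db.executemany("INSERT INTO src_message VALUES (?, ?, ?)", [
    (1, "claim", "A1"), (2, "member", "A1"), (3, "claim", "T9"),
])

# Step 2: decode ALL fields in one set operation
# instead of one database call per field.
rows = db.execute("""
    SELECT m.field_no, c.tgt
    FROM src_message m
    JOIN code_table c ON c.src = m.value AND c.context = m.context
    ORDER BY m.field_no
""").fetchall()

# Step 3: extract the decoded values back for the target message.
print(rows)
```

The conditional logic (the if-then-else conditions mentioned above) moves into the JOIN conditions and CASE expressions, where the SQL engine can optimize it as a single set operation.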

We can do all structure transformations in maps (XSLT) and perform only the decoding in SQL. Or we can also do some structure transformations in SQL. It is up to us.

The Pros of this implementation are:

It is fast and highly optimized.

It does not flood database with numerous calls.

It nicely utilizes the SQL engine for complex decoding logic.

The Cons are:

Steps 1 and 3 may not be simple.

In real life we usually don’t have a clear separation between these scenarios, and intermediate solutions can be handy. For example, we can load and extract not the whole message but only the part related to the decoding.

Personally, I use SQL Decoding in the most complex cases, where the mapping takes more than 2 days of development.

Note:

If you are familiar with LINQ, you can avoid steps 1 and 3 and execute the set operations directly on the XML document. I personally prefer LINQ, but if the XML document is really big, the SQL approach works better.
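As a stand-in sketch of that set-oriented, LINQ-style pass applied directly to the XML document (illustrative Python with ElementTree playing the role of LINQ to XML; the element names and codes are invented):

```python
import xml.etree.ElementTree as ET

# Small code table kept in memory, as in the simple scenario.
CODES = {"A1": "ACTIVE", "T9": "TERMINATED"}

doc = ET.fromstring(
    "<Claims><Claim status='A1'/><Claim status='T9'/><Claim status='A1'/></Claims>"
)

# One declarative pass over the set of matching nodes, like a LINQ query,
# with no database round-trips at all.
for claim in doc.iter("Claim"):
    claim.set("status", CODES.get(claim.get("status"), claim.get("status")))

print(ET.tostring(doc, encoding="unicode"))
```

This works nicely while the document and code tables fit in memory; as the note says, for really big documents the SQL approach wins.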

Conclusion: if we need complex decoding/encoding of an XML document, consider using set operations with SQL or LINQ.