May 31, 2011

How to speed up a data conversion program that loads Microsoft CRM 4.0

Overview

Our organization recently adopted Microsoft CRM 4.0. We developed a data conversion program to convert the data from the source systems and insert it into the new CRM system. Our initial effort was unsatisfactory because it ran way too slowly. We eventually solved our performance problems, and experienced a six-fold improvement in performance over our initial design! This report summarizes the challenges we faced, and our solutions, so that others who face performance issues with data conversion programs might benefit from the lessons we learned. Some of the lessons are specific to Microsoft CRM 4.0. But other lessons are of general interest.

Current and Historical Context

The data volumes are not large by common standards of data conversion programs. There are roughly ten target tables, with data volumes ranging from 10,000 to 100,000 records.

There were several factors that caused us to minimize concern about performance at first. First, the conversion program grew gradually, as new requirements were discovered. So the final scope was not recognized initially, or we might have paid more attention to speed. Second, we thought the conversion would run only one time in production. We later learned that we would be releasing the features over a several month period. During this time, we must run the conversion every single night. Finally, the hunger for features and defect correction compete with the desire for performance, and our scarce programming resources were devoted to the former. For these reasons and others, performance was not addressed early. It was like putting a frog in cool water and heating the pot slowly.

Finally, at one point, the program took 17 hours to run. Fortunately, we got uncomfortable enough to hop out of the pot before we boiled to death. Therefore, various techniques were applied to speed up the operations.

Technical Summary

The CRM data load program is a console application, written in C#. There are also some SSIS packages that are called from the console application. It updates the Microsoft CRM 4.0 Database via the CRM SDK and web services. Microsoft SQL Server 2008 is the backend database for CRM 4.0. Microsoft SQL Server 2005 is the database for the legacy source systems.

Created a dedicated CRM web services server, with Keep Alive Disabled

We setup a dedicated server to handle CRM Web Services calls from the data load program. We discovered that running high volumes of calls against the main CRM application server caused network socket exhaustion errors. We further determined that to avoid these socket errors, and run our load at maximum speed, we had to disable the HTTP Keep Alives on the web services web site. But if you disabled HTTP Keep Alives on your main CRM server, it no longer works as a CRM server via the web browser. Hence, we setup a separate server for the web services, disabled HTTP Keep Alives, and disabled the Microsoft CRM Async Service. We left the async processing to the main server, so it could handle the workflows that might get triggered. If there are workflows registered for the entities being loaded, then high volumes of activity can overwhelm the background services. Our main CRM server could handle this, but we could have setup another server to offload this if needed.

This technique is good to know if your data loader is constrained by writing to a web service. If we had not resolved the constraint of exhausted sockets, we could never have used parallel processing techniques as described next. The extra processing rates provided by parallel techniques would simply have overwhelmed our web services.

Applied parallel processing to the load

With the 4.0 version of the .Net Framework, Microsoft included the Task Parallel Library. See http://msdn.microsoft.com/en-us/library/dd460717.aspx. There are two main approaches they describe: data parallelism and task parallelism. By applying both of these techniques, we gained an amazing six-fold improvement in speed. The comparison is shown in the table and charts below.

Runtime (Hrs:Mins)

Sequential

Parallel

Savings

Speedup

13:56

2:32

11:24

6

How did we achieve these gains?

We applied data parallelism aggressively to spawn multiple threads for our loops. So a loop that processed 90,000 records might be processing several at one time. Once we did this, we increased the load on our dedicated web services server. So we added two processors to the web server and configured the application pool to run up to two processes at once. The table below and its accompanying graph reveal dramatic increases in load rates with this technique.

Gain through Data Parallelism (Parallel.ForEach)

Processing Rate (recs/min)

Module

Sequential

Parallel

Speedup

# of Records

Account Loader

282

598

2

14,000

Parent Contact Deleter

440

1,837

4

40,000

Relationship Deleter

413

5,530

13

90,000

Since we were already using data parallelism (Parallel.ForEach) on our loops, we did not want to run several parallel loops in parallel with each other. But we did use task parallelism in other ways. There is a 60 minute segment at the beginning when we extract data from the source systems, and consolidate it into staging tables. During that time, those operations do not use the CRM servers at all. This seems like a waste of idle CRM server resources. Most of the later processes have data dependencies on this first segment, and thus, they must wait. But fortunately, there are a few CRM tasks that do not depend on any earlier data to run. They can be run at the beginning. These are thee tasks that cleanup data on the target. We chose to run those in parallel with the Non-CRM segment. See the diagram below for an illustration.

Notice, that by running the deletion processes during the Non-CRM extraction processes, we utilize the CRM servers that would otherwise be idle. And we take 40 minutes off of the total load time.

Redesign the data load to convert only data that has changed

Since this was no longer a “one-time” load, but a nightly load process that would run for months, we reconsidered our design. We revised some of the largest modules to load everything once. We then updated only changes on subsequent runs.

Conclusion

By removing technical constraints in our web server, using parallel processing techniques, and applying smart design, we dramatically improved the performance of the CRM data load.

If we had done this work early in the project, and given a priority to maintaining performance, we would have accelerated the development process over the life of the project. It is hard to test a program that takes 17 hours to run. Fewer test cycles per week, means fewer fixes. By reducing runtimes to a more modest level, we can run tests more frequently and validate our changes sooner.

In addition to the reasons cited above, additional reasons we did not pursue some of these techniques earlier include:

We had no idea we could solve the socket exhaustion constraint. We did a lot of searching early on, and found no web references to our specific problem. I’m sure the relationship between HTTP Keep Alives and Socket Exhaustion is well understood by someone. But it was not easy for us to find it. I wrote about this in a previous post here https://codecontracts.info/2011/05/31/how-to-avoid-socket-exhaustion-on-web-servers-by-disabling-http-keep-alive/ . The thought of doing parallel processing crossed our minds, as we could see that the servers were underutilized. But until we resolved this constraint, we knew that we would simply generate more socket errors faster.

We had not learned of the Task Parallel Library until our local Microsoft developer evangelist, Jeff Brand, gave a talk on it. Once we saw the possibilities, it rekindled our interest in finding a solution to the socket errors. The features in the Task Parallel Library and the concurrent debugging tools in Visual Studio 2010 were just too tempting to ignore.

Thank you David. It’s always interesting to hear about techniques that .NET developers use when they cross into the ETL discipline. This sort of cross-pollination is important to our trade. The Task Parallel library will give me some additional tools in my utility belt. Is there a link to Jeff Brand’s talk?

Thanks for the interest, Shane. Jeff’s talk is not posted on any link, but here is one from the Parallel Developer’s learning site that resembles his talk. It starts with a nice motivation for parallel programming, followed with some real code and great references for follow-up.Video on Parallel Programming

Excellent article. We are going through the same process ourselves, we have used SCRIBE for years, but its unreliable and very slow and out source data is nested XML. SCRIBE can only handle CSV style non repeating node fixed XML

Our data import is real-time 24/7 with files from multi-tenant clients, sizes from 1 record to 200,000 at any time of day. We are using MSMQ as a chucking mechanism and failover but are finding the matching to test if a record is an update or an insert and the subsequent web service calls required to be an incredibly expensive operation before we even insert or update entitiy records from the source files. CRM match codes do not allow you to replicate their logic in external c# code due to unpredicatable match code ordering issues, and writing our own is PLUGIN time consuming. Also, all the monitoring we perform to update how an import is progressing and retry handling is again web service expensive.

We are trying to put the service that performs the calls on the same VM as the CRM dedicated sedrvice with HTTP KEEP ALIVE turned off, but we find that the amount of web service calls required are still causing latency problems.

You have a challenging scenario, Garry. In all of those operations you describe, are any of them read operations? If so, what we tried to do when possible, was to read from SQL views to do all read operations, and avoid using the web services at all for reads, conserving its resources for performing our updates and inserts. Our principle here was to acknowledge to cost of CRM Web services calls and do any clever thing we could to minimize them. I am curious: which version of Microsoft CRM are you using? We were using Microsoft CRM 4.0, and we hoped that future versions would experience some improvement in web services operations. Lastly, are you using concurrency as much as possible like we did in the article above? We found a sweep spot – without concurrency, it ran at one speed, with some concurrency, we gained higher throughput, and at too high a concurrency, we overloaded the server.