Introduction

The main point of extensibility for Windows Workflow Foundation (WF) is the class System.Workflow.ComponentModel.Activity. It is trivial to subclass the Activity class, override the Execute method, and add your own behavior. You can then compose your new custom Activity into a workflow. However, the workflow engine is designed to support only fast activities with this recipe: blocking activities or long running activities are not supported. That class of activities requires a bit more work.

Unfortunately, the recommended way to write a long running activity is quite a bit more involved. The activity must ask a background thread to perform the work, then let the workflow engine know it will be a while before the activity completes, and finally notify the workflow engine when the background work is finished. The workflow runtime has specific interfaces for this interaction which are driven, in large part, by the requirement that it be possible to write the workflow out to a persistence store. Loosely, the accepted method to do this is as follows:

1. In the activity's Execute method, use the System.Workflow.Runtime.WorkflowQueuingService to create a queue, and let the runtime know that data arriving on this queue should be processed by another method in this activity. This queue will be used to ship notifications and data back from the external thread to the activity.

2. Use a System.Workflow.Runtime.Hosting.WorkflowRuntimeService or some other external agent to start the long running operation on a different thread, or even in a separate process.

3. Tell the workflow runtime that this activity is being suspended temporarily, pending notification of results from the queue.

4. Once the long running operation is finished, its results are placed on the queue created in step 1.

5. The WF runtime notices the data on the queue, wakes up the workflow activity, and calls the callback method specified in step 1.

6. The activity dequeues the results, perhaps acts on them, and then lets the workflow runtime know this activity is finished (or continues to wait for more data).

Note that the time between step 3 and 4 can be considerable. For example, if your activity sends an email request to a user it may be days! Further, things get tricky if workflow persistence is involved. For example, after step 3 the WF runtime may write the workflow out to persistent storage and unload it from memory. Thus, in step 4, the various activity variables may not be available for use!! To do this correctly the activity author will have to make a copy of all the activity variables and make sure they are properly cached to get around this. Needless to say, there is a fair amount of boilerplate code required.
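The steps above can be sketched against the WF 3.x API roughly as follows. This is a minimal sketch, not the code from the download: SlowActivity and SlowWorkService are hypothetical names, the 30 second sleep stands in for real work, and the service must be added to the WorkflowRuntime by the host before the workflow starts. It needs the .NET Framework System.Workflow assemblies to compile.

```csharp
using System;
using System.Threading;
using System.Workflow.ComponentModel;
using System.Workflow.Runtime;
using System.Workflow.Runtime.Hosting;

// Hypothetical runtime service that does the slow work off the workflow thread.
public class SlowWorkService : WorkflowRuntimeService
{
    public void Start(Guid instanceId, IComparable queueName)
    {
        // Step 2: run the long operation on a thread-pool thread.
        ThreadPool.QueueUserWorkItem(delegate
        {
            Thread.Sleep(TimeSpan.FromSeconds(30)); // stand-in for real work

            // Step 4: place the result on the workflow's queue.
            WorkflowInstance instance = Runtime.GetWorkflow(instanceId);
            instance.EnqueueItem(queueName, "finished", null, null);
        });
    }
}

public class SlowActivity : Activity
{
    protected override ActivityExecutionStatus Execute(ActivityExecutionContext context)
    {
        // Step 1: create the queue and register the callback.
        WorkflowQueuingService queues = context.GetService<WorkflowQueuingService>();
        WorkflowQueue queue = queues.CreateWorkflowQueue(QualifiedName, false);
        queue.QueueItemAvailable += OnResult;

        context.GetService<SlowWorkService>().Start(WorkflowInstanceId, QualifiedName);

        // Step 3: report that the activity is still executing, so the workflow can go idle.
        return ActivityExecutionStatus.Executing;
    }

    // Step 5: called by the runtime when an item arrives on the queue.
    private void OnResult(object sender, QueueEventArgs e)
    {
        ActivityExecutionContext context = (ActivityExecutionContext)sender;
        WorkflowQueuingService queues = context.GetService<WorkflowQueuingService>();

        // Step 6: dequeue the result, clean up, and close the activity.
        object result = queues.GetWorkflowQueue(e.QueueName).Dequeue();
        queues.DeleteWorkflowQueue(e.QueueName);
        context.CloseActivity();
    }
}
```

Note that the background delegate captures only the instance id and the queue name, never the activity itself, which is what keeps it safe if the workflow is unloaded in the meantime.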

The point of the sample code included with this article is to automate some of these steps and take care of the boilerplate code. It will:

take care of marshaling the data required for the long running code from the activity arguments so that it is immune to the activity being unloaded by the persistence mechanism;

manage the communication between the activity and the long running code via the WorkflowQueuingService automatically;

and provide some support for restarting a long running operation if the WF host program is restarted and the workflow is loaded from the persistence store.

There is still some boilerplate work that needs to be done, as is shown below, but it is much simpler. The libraries included in this project, for example, mean you never have to create the queue required above in step 1.

The reason that the WF runtime doesn't deal with long running activities naturally lies in the design of its execution engine. It doesn't spin up a new thread for each Activity.Execute method call. Thus, if your activity ties up one of the engine's threads, that thread cannot be used to execute other workflows or workflow activities, and scalability can become an issue. Further, the interaction with persistence, especially if an Activity is to run for days, would be tricky.

Background and Specific Requirements

There are plenty of examples out there describing how to write a long running activity. I learned what I know from a blog posting by Paul Andrew, as well as many others that one can find using a good search engine. However, a project I was writing required about 10 of these long running activities and so I decided a more generic approach was required.

I wanted the code to be able to deal with several different types of long running activities:

A calculation that takes a long time to run, but is run in the same process as the workflow. For example, rotating 1000 images.

Making a web query that might take several minutes to complete

Running an external program that might take 10-12 hours to complete.

Further, I wanted the activities to be robust against the workflow runtime crashing and the host machine being rebooted as well (i.e. once I have my service up and running I don't want to need to touch it!).

If an external program was being executed and the WF host is restarted, I don't want it to try to re-run the external program, but just continue to wait on its results. On the other hand, if a machine restart had occurred then I did want the external program to restart automatically.

If the long running calculation was interrupted by a host restart or the external web request was interrupted then I wanted them restarted when the WF host came back up.

The long running workflow activities should fully support the persistence interface used by the workflow runtime.

Make the coding as simple as possible. The persistence requirement, actually, adds quite a bit of complexity. While I think that what I have now is significantly simpler, there is still some boilerplate code that has to be written because of this (see below).

One thing that was not a major requirement was robustness against workflow cancellation. I don't think that would be too hard to implement, but I've done no testing. Finally, the asynchronous method of supporting long running tasks is fully supported in the workflow runtime out of the box. In this method the programmer uses one activity to initiate a long running operation on a background thread (or a web request, etc.). The workflow then moves to an activity that waits for an event to fire. The event contains the resulting data from the external operation. This method was not fully explored because of the perceived difficulty in implementing the restart behavior and reusability. However, there is no reason the design presented here could not, with some work, be implemented using callbacks and out-of-the-box activities.

Using the Code

The code download contains three projects:

CommonWFLibrary - Contains all of the base classes and a few concrete implementations of the LongRunningActivityBase workflow activity.

CommonWFLibraryTest - Unit tests for all of the WF activities. Lots of examples of how to use and extend the LongRunningActivityBase activity can be found in this project.

RunAndPauseWithArg - An external program that is used in the unit testing of the ExternalProgramActivity activity in CommonWFLibraryTest.

The CommonWFLibrary contains a number of classes, whose use is demonstrated below. A quick outline provides a useful guide as we write some simple sample code:

LongRunningActivityBase - Any long running activity should inherit from this class. There are actually no methods to override. Rather new methods are defined and then marked with Attributes, which the LongRunningActivityBase picks up.

ExternalProgramActivity - This activity will run an external program and pause the WF while the external program is running. If the workflow host is restarted and the external program is still running the activity will wait until that instance completes. Otherwise the external program is restarted.

RunLongrunningWorkflow - This activity runs a second workflow and waits for it to finish. Because workflow persistence is expected to be running, this activity does very little when the host is restarted.

Let's focus on the LongRunningActivityBase, and extending it. First, a very simple activity that just sleeps for 10 seconds.
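A minimal sketch of such an activity is shown here. It assumes the download's CommonWFLibrary assembly is referenced and that its types live in a namespace of the same name (the namespace is a guess); the class name DelayFor10Seconds is the one used later in this article.

```csharp
using System.Threading;
using CommonWFLibrary; // assumed namespace of the download's library

public class DelayFor10Seconds : LongRunningActivityBase
{
    // Picked up by LongRunningActivityBase through the attribute.
    // The method must be both public and static.
    [LongRunningMethod]
    public static void Run()
    {
        Thread.Sleep(10 * 1000); // 10 seconds, in milliseconds
    }
}
```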

That is it! When this workflow activity is encountered by the workflow runtime it will run the above Run method on a threadpool thread, and allow the workflow runtime to continue on with other tasks!

The extension mechanism for LongRunningActivityBase is via .NET attributes, like LongRunningMethod. There are a few very important things to note about the signature of the LongRunningMethod method in particular:

The method is static - this is because the workflow may be unloaded by the time the long running process starts executing on a background thread. There is no guarantee instance variables will be available. This does not apply in general to other extension methods that are described below.

The method is public - this is due to some .NET reflection requirements. This applies, currently, to every single extension method below.

I call those out explicitly because this will not work if that isn't the case for this particular method! An exception will be thrown if your LongRunningMethod method isn't both public and static!
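To illustrate why both modifiers matter, here is a sketch of how an attribute-driven lookup of this kind could be written with reflection. It is not the library's implementation: LongRunningMethodAttribute below is a local stand-in for the real attribute, and LongRunningMethodFinder, GoodActivity, and BadActivity are hypothetical names used only for this example.

```csharp
using System;
using System.Reflection;

// Stand-in for the library's marker attribute.
[AttributeUsage(AttributeTargets.Method)]
public class LongRunningMethodAttribute : Attribute { }

public static class LongRunningMethodFinder
{
    // Returns the first method tagged [LongRunningMethod]; throws if it is
    // not public and static, or if no tagged method exists.
    public static MethodInfo Find(Type activityType)
    {
        const BindingFlags all = BindingFlags.Public | BindingFlags.NonPublic |
                                 BindingFlags.Static | BindingFlags.Instance;
        foreach (MethodInfo method in activityType.GetMethods(all))
        {
            if (!method.IsDefined(typeof(LongRunningMethodAttribute), false))
                continue;
            if (!method.IsPublic || !method.IsStatic)
                throw new InvalidOperationException(
                    method.Name + " must be public and static.");
            return method;
        }
        throw new InvalidOperationException("No [LongRunningMethod] method found.");
    }
}

public class GoodActivity
{
    [LongRunningMethod]
    public static void Run() { }
}

public class BadActivity
{
    [LongRunningMethod]
    public void Run() { } // instance method: rejected by Find
}
```

A static method can be invoked with `Find(type).Invoke(null, args)`, with no activity instance at all, which is exactly what is needed when the workflow may have been unloaded.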

Next, it is worth taking a quick look at the code that sets up a workflow host. Besides the usual WF runtime startup we also have to add a LongRunningActivityBaseService service to the runtime. LongRunningActivityBase uses this service to coordinate these long running activities.

The RegisterService method is provided as a convenience - you don't have to call that if you don't wish to. If you are interested in persistency then you would modify the RegisterService call as follows - as well as adding a persistency service for the workflow runtime:

For a workflow with long running activities to be fully persistent two types of persistence must be supplied. First is the usual workflow runtime persistence service. In this case I've used FilePersistenceService. This code was pulled from the Windows Workflow Foundation SDK Code Samples, supplied by Microsoft. It will take care of writing out all the Activity based variables and state for a workflow as it is written out. Here it is configured to write out the workflow each time it enters the idle state. This is exactly the state that the long running activity enters after LongRunningActivityBase starts the background method. Unfortunately, LongRunningActivityBaseService also has state that must be saved if it is to recover from a host crash. The LongRunningActivityBaseService takes two delegates that are responsible for writing out that state and reading it back. The state is written out after every change, but is only read back when the workflow host is started: it keeps its state in memory at all times (see note below). The LongRunningActivityBinaryPersister is actually very simple, and could easily be replaced by database code or whatever was required to save the host state.
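As an illustration of how small such a persister can be, here is a sketch of a pair of save and load methods backed by a single file. It is not the library's LongRunningActivityBinaryPersister: FileStatePersister is a hypothetical name, and the delegate types the service actually expects are not shown in this article, so plain object-based signatures are assumed. It uses BinaryFormatter, which fits the .NET Framework 3.5 era this article targets.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

public class FileStatePersister
{
    private readonly string _path;

    public FileStatePersister(string path)
    {
        _path = path;
    }

    // Writes the whole state object out; called after every state change.
    public void Save(object state)
    {
        using (FileStream stream = File.Create(_path))
        {
            new BinaryFormatter().Serialize(stream, state);
        }
    }

    // Reads the state back; returns null when nothing was saved yet.
    public object Load()
    {
        if (!File.Exists(_path))
            return null;
        using (FileStream stream = File.OpenRead(_path))
        {
            return new BinaryFormatter().Deserialize(stream);
        }
    }
}
```

The two methods convert directly to delegates, for example `Action<object> save = persister.Save;` and `Func<object> load = persister.Load;`, and the same shape would work with a database behind it.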

The above code is a recipe. There are lots of different ways to accomplish the same thing - and they are often dictated by how the workflow host is actually implemented. See below for a short discussion on how to properly implement this in a WCF service, for example.

DelayFor10Seconds is a very boring long running activity and worse, not all that useful! Much more interesting is to have an activity that has input and output arguments. This requires extending the previous example a bit:
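The shape of such an activity can be sketched as follows. This is a self-contained illustration rather than the library code: RotateImages, RotateArgs, RotateResult, and MarshalDemo are hypothetical names, the class does not derive from LongRunningActivityBase, and the attributes that tag the three methods are left out because their names are not given in this article. MarshalDemo only simulates what the base class and service do between the calls.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

// Marshaled input and output. Both must be marked [Serializable].
[Serializable]
public class RotateArgs { public int ImageCount; }

[Serializable]
public class RotateResult { public int ImagesRotated; }

public class RotateImages
{
    public int ImageCount { get; set; }     // input property
    public int ImagesRotated { get; set; }  // output property

    // Instance method: copies what Run needs out of the activity.
    public RotateArgs GatherArgs()
    {
        RotateArgs args = new RotateArgs();
        args.ImageCount = ImageCount;
        return args;
    }

    // Static method: runs in the background and sees only its argument.
    public static RotateResult Run(RotateArgs args)
    {
        RotateResult result = new RotateResult();
        for (int i = 0; i < args.ImageCount; i++)
            result.ImagesRotated++; // stand-in for rotating one image
        return result;
    }

    // Instance method: copies the result back into the activity.
    public void DistributeArgs(RotateResult result)
    {
        ImagesRotated = result.ImagesRotated;
    }
}

public static class MarshalDemo
{
    // Simulates persistence by round-tripping through a binary serializer.
    public static T RoundTrip<T>(T value)
    {
        using (MemoryStream stream = new MemoryStream())
        {
            BinaryFormatter formatter = new BinaryFormatter();
            formatter.Serialize(stream, value);
            stream.Position = 0;
            return (T)formatter.Deserialize(stream);
        }
    }

    // Gather, run, distribute, with a serialization hop after each hand-off.
    public static int Execute(RotateImages activity)
    {
        RotateArgs args = RoundTrip(activity.GatherArgs());
        RotateResult result = RoundTrip(RotateImages.Run(args));
        activity.DistributeArgs(result);
        return activity.ImagesRotated;
    }
}
```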

There is less going on here than the amount of code implies: earlier I referred to the fact that I was unable to eliminate all the boilerplate code. Well, here it is! These are the steps that will happen to this activity when the workflow engine executes it:

The GatherArgs method is first called, which "marshals" the data required for the long running activity to run on the background thread. As noted earlier, there is no guarantee that the Activity will be in memory when the background method executes and so one can't rely on the instance variables, unfortunately. Note that your class that contains the marshaled object must be serializable.

The LongRunningActivityBase then saves the marshaled arguments and queues a ThreadPool work item with these arguments. It tells the workflow engine it is now idle waiting for some other work to complete.

The background work item calls the Run method with the marshaled arguments.

When the Run method is finished the Result object is again serialized (note same requirement!) and the workflow runtime is notified that the long running task is done and the workflow should be woken up.

Back in the context of the active workflow, the DistributeArgs is called to "unmarshal" the results from the long running activity back into the Activity's instance variables.

There are a few things to notice here:

The Gather and Distribute methods are instance methods, not static methods. They are also public as before. What name you give them does not matter, as long as they are Attribute'd appropriately.

The Gather signature returns a .NET CLR type that is passed as an argument to the Run method. Run must return the same type that is passed to the distribute method as an argument. Violating these rules can cause a silent failure in the current code. So make sure to unit test!!

If your activity has no results it is fine for the Run method to return void and then drop the Distribute method from your code. It is also fine for the Run method to accept no arguments but return a result. In that case you can drop the Gather method from your code.

The .NET CLR types for the arguments and results must be marked serializable, as noted above!! If not, you will get errors at runtime - and they are hard to track down to what causes the problem. When I was coding up my 10 long running activities this was the most frustrating error I encountered. The error messages tend to get swallowed by the runtime. I'm sure there are some usability improvements that could be implemented to help with this.

We also now can explain why the LongRunningActivityBaseService must be able to persist data. If the workflow host crashes while a long running activity is in progress, it must be restarted. This means calling the LongRunningMethod method again. And if the LongRunningMethod requires arguments they must be cached somewhere.

There is one final use case the LongRunningActivityBase was designed to handle. Consider the activity that is to run an external program that may take 5 hours to complete. We want the activity to prevent the workflow from proceeding until that external program is done. Further, if the workflow host is restarted (say it is hosted in a web service) after 4.5 hours, it would be nice not to have to restart the program - rather just wait for it to finish in 30 minutes and the workflow to pick up as if nothing went wrong.

In order for that to happen, the activity will have to be able to track something like the PID of the external running program. That way when it is resumed, it can query the system to see if the process is still running and wait on it, or restart it if it is no longer running. The ExternalProgramActivity activity supplied in the sample code does exactly this. Here is the meat of the code (full source code is included in the download):
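The restart decision can be sketched as below. This is not the download's ExternalProgramActivity: ProgramState and ExternalProgramSketch are hypothetical names, the exePath parameter is an assumption about how the program is identified, and in the library the method would additionally carry the LongRunningMethodStarter attribute and receive a LongRunningContext.

```csharp
using System;
using System.Diagnostics;

// Whatever the method needs to recognize its earlier run. Must be [Serializable].
[Serializable]
public class ProgramState { public int ProcessId; }

public static class ExternalProgramSketch
{
    // lastTime is null on the first call; on a host restart it holds
    // whatever this method returned the previous time.
    public static ProgramState RunExternalProgram(string exePath, ProgramState lastTime)
    {
        if (lastTime != null && IsRunning(lastTime.ProcessId))
            return lastTime; // earlier instance is still going: keep waiting on it

        // First call, or the earlier instance is gone: start the program.
        Process process = Process.Start(exePath);
        ProgramState state = new ProgramState();
        state.ProcessId = process.Id;
        return state;
    }

    // True if a process with this id currently exists on the local machine.
    public static bool IsRunning(int processId)
    {
        try
        {
            Process.GetProcessById(processId).Dispose();
            return true;
        }
        catch (ArgumentException)
        {
            return false; // no running process has this id
        }
    }
}
```

One caveat with this sketch: operating systems reuse process ids, so a production version would probably also record something like the process start time or executable name and compare it on restart.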

This is the main run method. Previously we tagged it with LongRunningMethod. To signify the different behavior we now tag it with LongRunningMethodStarter. This tells LongRunningActivityBaseService that it should be called each time the host restarts. Whether this is a restart or a first-time call is determined by looking at the argument lastTime. The first time this method is called it will be null. Note the return type of the RunExternalProgram method is the same as lastTime's type. If the workflow host crashes and restarts, this method will be called again with whatever it returned the first time. The if statement in the method body is where the decision happens in this code. The contents of the class lastTime are completely up to you. Again, however, it must be serializable.

The second half to this implementation is how the code informs the workflow host that the process has completed execution. The key here is the LongRunningContext object. Note the context object is stashed in the ProcessController object. That object sets up a callback that is fired when the process finishes execution:
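A sketch of such a controller follows. It is an approximation, not the download's source: ILongRunningContext is a local stand-in for the library's LongRunningContext (only a Complete method is assumed), and ProgramResult is a hypothetical result type.

```csharp
using System;
using System.Diagnostics;

// Stand-in for the library's LongRunningContext.
public interface ILongRunningContext
{
    void Complete(object result);
}

// Result handed back to the activity. Must be [Serializable].
[Serializable]
public class ProgramResult { public int ExitCode; }

public class ProcessController
{
    private readonly ILongRunningContext _context;
    private readonly Process _process;

    public ProcessController(Process process, ILongRunningContext context)
    {
        _process = process;
        _context = context;

        // Subscribe first, then ask the Process object to raise Exited.
        _process.Exited += OnExited;
        _process.EnableRaisingEvents = true;
    }

    // Fired when the external program finishes.
    private void OnExited(object sender, EventArgs e)
    {
        ProgramResult r = new ProgramResult();
        r.ExitCode = _process.ExitCode;
        _context.Complete(r); // wakes the workflow with the result
    }
}
```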

The instance variable _context is set in the constructor of the ProcessController object. The Complete method is called with the result object. Keep in mind that the Distribute method must have the same type argument as "r" above in order for it to be properly called so that the results from the long running activity can be placed back in the Activity's properties and made available to the rest of the workflow!

This pattern underlies all the usage patterns I've described. Each one is just a specialization of this more general one.

Other Implementation and Usage Notes

If you are starting a workflow that contains a LongRunningActivityBase activity in a WCF service it can be a little tricky to make sure that you get the workflow persistence and the LongRunningActivityBaseService persistence running properly. Below is some sample code I have used to create the LongRunningActivityBaseService in a WCF service. The actual location is grabbed from the name-value section of the app.config file.

Limitations and Possible Future Directions

Judging from several of the workflow talks from PDC 2008 there are some changes in .NET 4.0 that will directly impact this work. Specifically, it will be possible to keep the activity's instance variables pinned in memory while the long running activity is executing. This may obviate the need for the argument and results marshaling, and that would be a significant reduction in the boilerplate code. It was not clear from the PDC talks if you have to give up some types of persistence in this case, however, nor if this forced you to keep the entire workflow in memory.

The ideal form (in my mind) would be something like an iterator. The first section, running in the context of the workflow activity, would gather up the arguments required and then call yield. It would then be resumed to execute on the background thread, but any local variables prepared in the first step would still be relevant. Once again, when done it would yield. And finally it would be called back in the context of the live workflow. Even if this was possible this would be very error prone for the programmer: all the workflow parameters would remain in scope according to the IDE and the compiler, though untouchable at runtime. Those errors would be awful to sort out! Sort of like modifying GUI elements when on a background thread! Another possibility would be to mark the workflow Activity variables that you wanted to be available, and the code would use reflection to make sure they were saved. However, this has several problems, one of which is that it becomes difficult to deal with inherited properties.

A very nice feature of the design of the workflow that this code limits is the ability to move a workflow from host to host: you can persist a workflow on machine A and then start running it on machine B. Since everything relevant has been written out it shouldn't have any problem. However, the persistence for the LongRunningActivityBaseService's state breaks that: the state is no longer carried along with the workflow. To fix this issue the service state would have to be persisted with the activity's state. It wasn't until late in the coding process that I realized this limitation and so it wasn't an initial requirement of the design.

Along the same lines there is another potential problem with the dual implementation of the states. The service writes out its state each time its in-memory state changes. The workflow, however, may be writing out its state on a different schedule. All of my tests always write out the state when the workflow becomes idle. This is well matched: that is almost the only time that the service state changes. However, a different persistence schedule may mean the states on disk get out of sync, and if the workflow host crashes right at that moment it may cause some confusion when the host is restarted. This possible limitation was discovered during the writing of this article and has not been explored at all.

The last limitation is the serialization requirement for all the marshaling objects (arguments, results, and context classes). The difficulty is that this is infectious: not only must your class be marked serializable, but every class that your class contains must also be similarly marked! This is a real problem if you are trying to store a 3rd party library object. But even in your own libraries you can find yourself modifying lots of structures to be marked Serializable. The requirement that they be public can also be a potential problem, though not as frequently.

About the Author

I'm a professor of physics at the University of Washington - my field of research is particle physics. I went into this because of the intersection of physics, hardware, and computers. I've written large experiment data acquisition systems (I've done a lot of multi-thread programming). My hobby is writing tools and other things that tend to be off-shoots of work-related projects.

When I click on the "download the source" link at the top it asks me to log in. Once that is done I can click on the link again and it asks me where to save the zip file. So it seems like it works just fine for me... Try again and let me know.

You've done a lot of really interesting work here - but I'm not sure what your approach achieves that couldn't be simulated with a local data exchange service. If I wanted to start a long-running task from within a workflow instance and then idle to wait for completion of that task, I would use the CallExternalMethod activity, followed directly by a HandleExternalEvent activity.

The implementation of the external method would dispatch the work asynchronously so that control is returned immediately to the workflow instance, whereupon it would idle, waiting for an external event to arrive. Once the external work is completed, the external method would raise an event into the workflow, matching the signature expected by the HandleExternalEvent activity. Any result information required to be passed into the workflow is sent as part of the event arguments.

This decouples the workflow from the task implementation, via a well-known service interface, and keeps the workflow implementation light and simple.

I apologise if I'm totally missing the point, but workflow foundation is pretty complicated already, and I'd be interested to know what the benefits of this solution are over what is already available in the framework.

Yeah -- that is an excellent question. There is no reason that one could not accomplish the same thing using the external method you discuss above. In fact, when I was trying things out for my project this is what I used originally.

But I had a lot of trouble with that approach. There was a lot of boilerplate code spread over several different source files. Things that should be together (like the Activity variables and the code that used them) were actually in different files. And I had to repeat a lot of this code each time I created a new activity (in the project I'm working on I have on the order of 10 of these long running activities).

And once you have that done you have to support persistence as well for those external method calls (or recovery, etc.). My framework has a pattern built in, and if that pattern works for your task then it makes things a lot simpler. Nothing is preventing you from writing a similar pattern for the call/handle way of doing it, of course!

That was how I came up with the above approach. It keeps the code close to where it is used. There is still more boilerplate code than I'd like (specifically having to explicitly code the marshaling/unmarshaling of the Activity arguments), but at least it is all in one place.

Finally, the code I'd previously written didn't have events for the long running methods. So, this pattern meant I could avoid having to write a separate thread for each long running activity that would fire an event when the method was finished. Of course, it should be possible to code up a framework similar to this that took advantage of that.

My code is actually fairly similar to the decoupling you describe. The code in the long running activity is responsible for taking the Activity arguments, putting them in the proper form for the method call, and making the method call. When the call returns, it does whatever is required for the results and sends them back. So the real code is decoupled from the WF code itself, and the code that handles the long running aspects is decoupled from that code. There are some cases where the code is mixed. For example, I have a long running activity to delete directories (some of the directories I have to delete can contain over 300,000 small image files) - then there isn't a lot of decoupling.

You are right about complexity. I really like WF, and now that I've tried to do something mildly difficult in it I think I understand a lot more about why they made some of the design decisions they made. When I was a beginner it often felt like these were getting in the way. I still think they could do some work on the API to make stupid things simpler.

Which brings me to my last point. I'm a relative beginner when it comes to WF. This work is a hobby for me, not my real job (though I just found out I'm supposed to give a talk on the project this is connected with, sheesh). So there may be other solutions that are simpler. People should feel free to suggest here so others reading this article are aware of other approaches (like the Call/Handle External Method/Event)!

Thanks for the response. Here's a really good case study on multithreaded parallelism using WF that illustrates the approach to using CallExternalMethod / HandleExternalEvent with correlation for asynchronous invocation. Not all the code is provided, unfortunately, but the discussion is very detailed:

That is a great article. I've only had a quick read (more after the break, I guess). Really explains what is going on inside WF. They address the "cancel" issue, which I do not, but they don't seem to explicitly address the persistence issue (at least not in the article).

The approach looks almost identical to mine except they use the external activity rather than the WF queues to communicate. In that sense I think my approach is less complex. I haven't had time to download the code, but it looks like they may have the same marshal/unmarshal difficulties I had. Also, it wasn't clear to me how arbitrary arguments/results were passed back and forth between the activity and the external service.

Thanks for the pointer. I'll download the code when I have a chance and take a closer look. I should definitely add this article as a reference; it does a very good job explaining the internals of WF, something I skipped on purpose.