Saturday, February 28, 2009

Parallelizing video routines

Basics

One project shortly after the last gmerlin release was to push the CPU usage beyond 50 % on my dual-core machine. There is lots of talk about parallelizing commonly used video routines, but little actual code.

In contrast to other areas (like scientific simulations), video routines are extremely simple to parallelize, because most of them fulfill the following two conditions:

The outermost loop runs over scanlines of the destination image

The destination image is (except for a few local variables) the only memory area where data is written

The data argument points to a struct containing everything needed for the conversion: destination frame, source frame(s), lookup tables etc. The height argument was replaced by start and end arguments, which means that the function now processes a slice (i.e. a number of consecutive scanlines) instead of the whole image. All that remains is to call the conversion function from multiple threads. It is important to split the outermost loop, to keep the overhead as small as possible.

The API perspective

Everything was implemented:

With a minimal public API

Backwards compatible (i.e. if you ignore the API extensions things still work)

Completely independent of the actual thread implementation (i.e. no -lpthread is needed for gavl)

/* Set the maximum number of available worker threads (gavl might not need all of them
   if you have more CPUs than scanlines) */
void gavl_video_options_set_num_threads(gavl_video_options_t * opt, int n);

As always, the gavl_video_options_set_* routines have corresponding gavl_video_options_get_* routines. These can be used to reuse the same multithreading mechanism outside gavl (e.g. in a video filter).

The application perspective

As noted above, there is no pthread-specific code inside gavl. Everything is passed via callbacks. libgmerlin has a pthread-based thread pool, which does exactly what gavl needs.

The thread pool is just an array of context structures, one per thread:

typedef struct
  {
  /* Thread handle */
  pthread_t t;

  /* gavl -> thread: Do something */
  sem_t run_sem;

  /* thread -> gavl: I'm done */
  sem_t done_sem;

  /* Mechanism to make the function finish */
  pthread_mutex_t stop_mutex;
  int do_stop;

The worker threads are launched at program start and run all the time. As long as there is nothing to do, they wait on the run_sem semaphore (using zero CPU). Launching new threads for each little piece of work would have a much higher overhead.

Passing work to a worker thread happens through a gavl_video_run_func callback.

Benchmarks

For benchmarking, one needs a scenario where the parallelized processing routine(s) get the lion's share of the total CPU usage. I decided to run a (completely nonsensical) Gaussian blur with a radius of 50 over 1000 frames of a 720x576 PAL sequence. All code was compiled with default (i.e. optimized) options. Enabling 2 threads decreased the processing time from 81.31 s to 43.35 s, a speedup of 1.88 (about 94 % parallel efficiency on two cores).

Credits

Finally found this page for making source snippets in Blogger suck less.