
Computer vision should not be confused with image processing (as we all know). I love building computer vision pipelines, but sometimes menial tasks of pure image processing and automated editing come up.

Suppose you had the same astronauts from one of the previous posts participating in a study, where they are actually filmed watching something, say an episode of Star Wars. You ran your favorite face detection (Dlib-based, of course) on a sample of frames from that video, and found that your viewers don’t move around much. You then applied a clustering algorithm to determine the region for each of the viewers where their faces are most likely going to be during the entire video.

Now, for the next step of this study, you don’t want to keep the entire video; you only want the viewers’ faces. So the idea is to split the original video into individual small videos of just the faces, 14 of them in this case. Also, this doesn’t need to be done on every video frame, only on a fraction of them: every 3rd, every 5th, etc. The graph of what you want to accomplish looks like this:

(It seems like skip & crop should be refactored into one operation; see below for why they are not.)

It’s simple enough to code something that does what you need (remember, the cut-out regions remain constant throughout the video), but wouldn’t it be neat if there already were a powerful component that could take this graph as a parameter and do what’s required very fast?! FFmpeg does just this! FFmpeg is a command line tool, and in our case we need to specify lots of things on the command line, so wouldn’t it be even better if there were a great scripting language/tool that made creating these command lines a breeze? There is one, of course: PowerShell. However, F# is a great scripting language as well, and I look for any excuse to use it.

FFmpeg has a nice command line sublanguage that allows you to build video editing graphs. They are described nicely here as well as in a few other places.

Our graph is split into as many branches as there are faces in the video (see above). The branches are separated by “;”, and each one’s output is named in “[]” as f0 – f<n-1>. For each branch, we instruct ffmpeg to take video stream 0 ([0:v]), keep every 2nd frame of the stream, halve the frame rate, and crop out a region described as (width, height, left, top). We are ignoring the audio since we are only interested in the faces.
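To make the shape of that graph concrete, here is a minimal sketch of how such a string could be generated. It is written in Python rather than F#, and it is not the script from this post: `Region`, `build_filter_graph`, `out_fps` and the pixel values are all made up for illustration. `framestep` and `crop` are real ffmpeg filters; using `fps` for the rate-reduction step is only an assumption, so it is optional here.

```python
from typing import NamedTuple, Optional, Sequence


class Region(NamedTuple):
    """Hypothetical crop rectangle in pixels: (width, height, left, top)."""
    width: int
    height: int
    left: int
    top: int


def build_filter_graph(regions: Sequence[Region], step: int = 2,
                       out_fps: Optional[float] = None) -> str:
    """Return a filter graph with one labelled branch per face region."""
    branches = []
    for i, r in enumerate(regions):
        chain = [f"framestep={step}"]         # keep every `step`-th frame
        if out_fps is not None:
            chain.append(f"fps={out_fps:g}")  # assumed rate-setting filter
        # crop takes width:height:x:y, i.e. (width, height, left, top)
        chain.append(f"crop={r.width}:{r.height}:{r.left}:{r.top}")
        # every branch reads video stream 0 and writes to its own label f<i>
        branches.append(f"[0:v]{','.join(chain)}[f{i}]")
    return ";".join(branches)


# Two made-up face regions; the study in the post has 14.
regions = [Region(200, 200, 40, 60), Region(200, 200, 300, 80)]
graph = build_filter_graph(regions, step=2)
print(graph)
# [0:v]framestep=2,crop=200:200:40:60[f0];[0:v]framestep=2,crop=200:200:300:80[f1]
```

Note that the common prefix is repeated in every branch, which is exactly the repetition discussed next.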

One thing that took me a while to figure out was that I needed to repeat, at every branch, what would normally be factored out: I couldn’t just say “framestep, reducerate” once and append the custom crop operation, which is different for every branch, to that. However, it appears that these common operations are executed only once in ffmpeg, so the entire process is very fast: it takes about 90 seconds per 45 minutes of H.264-encoded video.
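For completeness, here is a rough sketch of how the whole command line might be assembled around such a graph, again in Python and again under assumptions: `build_ffmpeg_args`, the file names and the example graph are hypothetical, and this is not the command from the post. It relies only on the standard `-i`, `-filter_complex` and `-map` options, mapping each labelled branch to its own output file. Since only the video labels are mapped, no audio ends up in the outputs.

```python
import shlex
from typing import List


def build_ffmpeg_args(input_path: str, graph: str, n_faces: int,
                      out_pattern: str = "face_{}.mp4") -> List[str]:
    """Build an ffmpeg argument list writing branch f<i> to its own file."""
    args = ["ffmpeg", "-i", input_path, "-filter_complex", graph]
    for i in range(n_faces):
        # Map only the labelled video output of branch i to output file i.
        args += ["-map", f"[f{i}]", out_pattern.format(i)]
    return args


# Made-up two-face graph with the common prefix repeated in each branch.
graph = ("[0:v]framestep=2,crop=200:200:40:60[f0];"
         "[0:v]framestep=2,crop=200:200:300:80[f1]")
args = build_ffmpeg_args("viewers.mp4", graph, n_faces=2)

# Print a shell-quoted version for inspection.
print(shlex.join(args))
# To actually run it (needs ffmpeg on PATH):
#   import subprocess; subprocess.run(args, check=True)
```

Passing the arguments as a list rather than one long string sidesteps shell quoting of the “;” and “[]” characters in the graph.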