Stress-testing NVidia GPU for live transcoding

Nimble Streamer is one of our flagship products. It is server software that takes live streams and files as input and makes them available to a large number of viewers. It's a native C++ application that has been ported to all popular OSes (Linux, Windows, MacOS) and platforms (x64, ARM). Low resource usage and high performance were key requirements from day one, and we're showing good results so far.

Last year we introduced Live Transcoder, an add-on for Nimble Streamer. It takes live streams and files in various formats as input and performs content transformations in real time. This includes decoding (both software and hardware-accelerated), changing the media with various filters (re-scaling, overlay etc.) and encoding (also software and hardware).

The Transcoder is controlled via the WMSPanel web service, and transcoding scenarios are defined via a drag-n-drop web UI. This allows visualizing the process and running various scenarios in a few clicks. Check these videos to see what it looks like.

Transcoding scenario example

Decoding is done only once for each incoming stream, before any further transformations. This saves resources on this costly operation. You'll see the importance of this approach later during the tests.

One of the content transformation technologies available in the Transcoder is hardware decoding and encoding via NVidia GPU. Recent generations of their devices can handle some typical streaming tasks and so off-load the CPU.

We contacted NVidia representatives to arrange joint stress-testing of Live Transcoder for Nimble Streamer with NVidia GPU. This would show us the economic effect of such a tandem compared to a CPU-only configuration. We also wanted to see how the GPU could be used in the most optimal way, so that we could give some good recipes to our customers.

We needed proper hardware with the latest GPU for this research, and cloud access would be the best solution. We saw that AWS doesn't have VMs with a Maxwell generation GPU, and Azure doesn't have them either and only plans to introduce them.

1. NVidia GPU in Softlayer cloud, setting up Nimble Streamer

In cooperation with NVidia, IBM gave us access to their Bluemix Cloud Platform (formerly known as Softlayer). It's a big grid of data centers around the globe (about 50 at this moment), connected via a private network and providing a decent number of cloud infrastructure services. All data centers are unified, and they allow renting up to hundreds of virtual and hardware servers within a few hours, as well as balancers, storage systems, firewalls etc. to build a reliable infrastructure.

IBM provided us with full access to the web portal for controlling cloud services and to a server with the required configuration, which we used for further testing of our transcoding solution.

Hardware

First we were provided with a bare-metal server with 128GB RAM and 2x NVidia Tesla M60 GPU, running Ubuntu 14.04. All the server details like credentials, software versions, communications, hardware status etc. can be tracked right from the dashboard. There you can do all the manipulations yourself, which keeps tech support interactions to a minimum.

Once we started testing, we saw that we could not utilize this configuration in the most optimal way due to context creation issues on the GPU side; we'll describe them later in this article. So we decided to reduce the configuration. As we used a cloud platform, we requested that from tech support. The entire operation took about 2 hours within the proper support window in the Amsterdam data center. This is pretty convenient for developers as they don't need to deal with hardware configuration themselves.

We came to the following server configuration:

Dual Intel Xeon E5-2690 v3 (2.60GHz)
24 Cores
64GB RAM
1TB SATA

So it's 2 CPUs with 12 cores each, and thanks to Hyper-Threading we got twice as many, i.e. 48 logical cores.
Also, we didn't use a hypervisor, in order to get the maximum from the hardware resources.

Please notice that no affinity, chip tuning, overclocking or any other magic was applied, just out-of-the-box CPU and GPU. For the GPU we used the official driver from the NVidia website.

So we had the server. A brief overview of the web UI, then SSH access, and there we were in the familiar Ubuntu command line, installing Nimble Streamer, registering a new transcoder license and doing the initial setup.

Nimble Streamer Transcoder

Nimble Streamer was set up for GPU context cache initialization. This is needed because the GPU limits the maximum number of simultaneously created decoding and encoding contexts, and because creating them takes significant time.
Please read more about this GPU problem in section 4 below, and also read this article for more details on contexts in Nimble Streamer.

Nimble Streamer context parameters were as follows (this is an example for the first series of tests):

Before running each new series, we set the context parameters according to the requirements of each task.

Creating transcoding scenarios

From here on we were using the WMSPanel service, where all transcoding is set up. As mentioned before, all the operations are performed via the web interface, which makes it easy and clear. We created a number of scenarios which use different variations of CPU and GPU transcoding, different renditions and encoding options (CPU/GPU, profile, bitrate etc.)

Scenarios can be launched simultaneously in combined sets, which allows involving various processing queues, increasing the load and changing it according to the test plan. We just check the required scenarios and resume or pause them.

2. Transcoding FullHD 1080p streams

First we tried the highest load scenarios to see the limits of our hardware. From a practical point of view, the "heaviest" use case among our customers is FullHD 1080p.

To generate source streams we took a file with FullHD (1920*1080) high profile H.264 video. The content itself is a city tour with a medium intensity of changing scenes, views, colors etc., so it makes the transcoder's job neither easy nor difficult. So it's a typical load.

We used 36 equal input streams in various scenarios.

The transcoding scenario is typical: we take 1080p high profile input, then generate 720p, 480p, 360p main profile and 240p, 160p baseline profile. So it's 1 input and 5 output streams. Usually a pass-through stream is added in order to provide the viewer with the original 1080p stream. We didn't add it because it doesn't need transcoding, as the data is passed from input to output. This case is highly optimized in Nimble Streamer and it doesn't consume significant resources: it takes some RAM, but not much.

There is no audio in the output streams. Adding audio doesn't add any significant CPU load, but we excluded it for the sake of clarity.

Testing with CPU, no GPU

First we tried to process the streams without GPU by using the software decoder and encoder.

Only 16 input streams could be processed with 80 output streams total for all renditions.

CPU load was 4600%, i.e. it used ~46 cores. RAM consumption was about 15GB.

Testing with CPU and GPU

Context cache was set to "0:30:15,1:30:15" which means 30 encoding contexts and 15 decoding contexts for each GPU. Notice that we have 2 GPUs which allows running tasks in parallel.
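To make the notation concrete, here is a small Python sketch that splits a cache string of this form into per-GPU counts. It assumes the layout is gpu_index:encoding_contexts:decoding_contexts with entries separated by commas, as inferred from the description above rather than from official documentation. The names `parse_context_cache` and `ContextCounts` are hypothetical and are not part of Nimble Streamer.

```python
from typing import Dict, NamedTuple


class ContextCounts(NamedTuple):
    encoders: int
    decoders: int


def parse_context_cache(spec: str) -> Dict[int, ContextCounts]:
    """Parse a string such as "0:30:15,1:30:15" into {gpu_index: counts}.

    Assumed layout (inferred from the article text, not from official docs):
    comma-separated entries of gpu_index:encoding_contexts:decoding_contexts.
    """
    result: Dict[int, ContextCounts] = {}
    for entry in spec.split(","):
        gpu, enc, dec = (int(part) for part in entry.strip().split(":"))
        result[gpu] = ContextCounts(encoders=enc, decoders=dec)
    return result


cache = parse_context_cache("0:30:15,1:30:15")
total_encoders = sum(c.encoders for c in cache.values())
total_decoders = sum(c.decoders for c in cache.values())
print(cache)
print("encoders:", total_encoders, "decoders:", total_decoders)
```

Under this reading, the two GPUs together offer 60 encoding and 30 decoding contexts.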

The maximum load was processed using the following configuration of stream scenarios.

GPU0 and GPU1 got 15 streams each as decoding input. So we got 30 decoded streams ready for further transformations. Each stream is decoded just once, regardless of the number of scenarios it is used in.

GPU0 and GPU1 encoders got 15 input streams each and produced 720p, which means 30 streams of 720p for output.

GPU0 and GPU1 also got 15 input streams each for 480p encoding, so it was also 30 streams of 480p for output.

As we ran out of encoding contexts, all other renditions were moved to CPU software encoding:

30 streams of 360p

30 streams of 240p

30 streams of 160p

CPU load was 2600%, decoder was 75% busy, encoder was at 32%.

After that we loaded the CPU with 6 more streams for decoding, each having 5 renditions, which gave another 30 output streams.
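The stream counts above can be tallied with a few lines of Python. This is only a bookkeeping sketch of the numbers reported in this section; the variable names are ours.

```python
OUTPUT_RENDITIONS = 5            # 720p, 480p, 360p, 240p, 160p per 1080p input

gpu_decoded_inputs = 15 * 2      # 15 streams decoded on each of GPU0 and GPU1
cpu_decoded_inputs = 6           # extra streams decoded and encoded on CPU only

# 720p and 480p of the GPU-decoded streams are encoded on GPU.
gpu_encoded_outputs = gpu_decoded_inputs * 2
# 360p, 240p, 160p of the GPU-decoded streams, plus everything for the CPU-only streams.
cpu_encoded_outputs = gpu_decoded_inputs * 3 + cpu_decoded_inputs * OUTPUT_RENDITIONS

total_inputs = gpu_decoded_inputs + cpu_decoded_inputs
total_outputs = gpu_encoded_outputs + cpu_encoded_outputs

cpu_only_inputs = 16             # best result of the CPU-only test
gain = total_inputs / cpu_only_inputs

print(total_inputs, total_outputs, gain)
```

So the CPU+GPU setup handled 36 inputs and 180 outputs, which is 2.25 times the 16 inputs of the CPU-only run.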

A few details

We wanted to check the use case where we would process the "heaviest" tasks on GPU. That would be 1080p decoding with 720p and 480p encoding. The rest should go via CPU.

First we checked the limits of the decoder. With 22 decoding contexts enabled we ran into a problem: the contexts could not be created at all. Reducing it to 21 allowed the creation, but the decoders were 100% loaded and we saw image artifacts. So we ended up with 20 streams: both decoding and 160p encoding were working normally.

Moving forward, we discovered that the current graphics card with 16GB RAM could work with 47 contexts at most, regardless of whether they were encoding, decoding or both. Notice that we're talking about the Tesla M60 GPU; other graphics cards may work differently. We assume that a card with 24GB RAM might allow creating more contexts. However, this needs to be tested.

As a result we used the "15 decoding contexts + 30 encoding contexts" formula, which gives us 30 input streams with 2 output renditions. So we let the upper renditions (720p and 480p) be processed on GPU while the rest (360p, 240p, 160p) were processed on CPU. As the CPU was still far from being overloaded, we added more streams to process on it.
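The "15 + 30" formula follows from the 47-context limit if each GPU-decoded stream needs one decoding context plus one encoding context per rendition encoded on the same GPU. The sketch below works that out; the function name `plan_gpu_contexts` is hypothetical, and the one-context-per-stream-per-rendition model is our simplifying assumption.

```python
from typing import Tuple


def plan_gpu_contexts(max_contexts: int, gpu_renditions: int) -> Tuple[int, int]:
    """Return (decoding_contexts, encoding_contexts) that fit on one GPU.

    Assumes every stream decoded on the GPU uses 1 decoding context and
    `gpu_renditions` encoding contexts on that same GPU.
    """
    streams = max_contexts // (1 + gpu_renditions)
    return streams, streams * gpu_renditions


MAX_CONTEXTS_PER_GPU = 47  # limit observed on the Tesla M60 in these tests

dec_1080, enc_1080 = plan_gpu_contexts(MAX_CONTEXTS_PER_GPU, gpu_renditions=2)
dec_720, enc_720 = plan_gpu_contexts(MAX_CONTEXTS_PER_GPU, gpu_renditions=1)
print(dec_1080, enc_1080)
print(dec_720, enc_720)
```

With two GPU renditions this yields 15 decoding and 30 encoding contexts per GPU, using 45 of the 47 available. With one GPU rendition it yields 23 and 23, i.e. 46 streams across two GPUs, which matches the number used in the 720p test in section 3.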

3. Transcoding HD 720p streams

This is a scenario with typical load, as the majority of content is still provided as HD.

To generate source streams we took HD (1280*720) high profile content similar to the one we used in section 2.

We used 70 equal input streams in various scenarios.

The transcoding scenario is as follows: the input is 720p high profile, the outputs are 480p, 360p main profile and 240p, 160p baseline profile. So we have 1 input stream with 4 outputs. As in the previous example, we didn't do pass-through of the source stream. There is also no audio output.

Testing with CPU, no GPU

As in the previous section, first we tried to transcode with CPU only. The top result was 22 input streams with 88 output streams across all renditions. CPU load was 4700%, i.e. 47 cores were used. 20 GB of RAM was used overall.

Testing with CPU and GPU

46 streams of 720p were decoded using both GPUs. We also encoded 46 streams to 480p there. After that we encoded 360p, 240p and 160p on CPU, that was 46 streams of each rendition.
The load was: 2100% CPU, 61% decoder, 16% encoder.

In addition we added 24 input streams to transcode on CPU, each having 4 renditions.

As in the previous case, we faced the contexts issue, and more RAM on the graphics card might help with the problem. But that also needs to be checked.
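One way to compare the runs in sections 2 and 3 is the CPU load spent per input stream. The sketch below uses only the figures reported above; note that the CPU+GPU loads were measured before the extra CPU-only streams were added, so they are divided by the number of GPU-decoded streams. The metric itself is our illustration, not something measured directly.

```python
# (reported CPU load in percent, number of input streams at that load)
measurements = {
    "1080p CPU only": (4600, 16),
    "1080p CPU+GPU": (2600, 30),
    "720p CPU only": (4700, 22),
    "720p CPU+GPU": (2100, 46),
}

# CPU percent consumed per input stream; 100% corresponds to one logical core.
per_input = {name: load / streams for name, (load, streams) in measurements.items()}

for name, value in per_input.items():
    print(f"{name}: {value:.1f}% CPU per input stream")
```

For 1080p this is about 287.5% per stream on CPU alone against about 86.7% with the GPU, and for 720p about 213.6% against about 45.7%.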

4. NVidia GPU contexts creation issues

This section is obsolete, as Nimble Streamer Transcoder now supports the NVENC context sharing mechanism, which allows using contexts efficiently. Please read this article for more details.

A few words on the issue which didn't allow us to process more streams via GPU.

Last year we collaborated with the NVidia team to run tests with multiple graphics cards. Using several cards at a time, we faced a reduction in server performance. Each new encoding or decoding context took more time to create than the previous one. It took ~300ms to create the first context, while each subsequent one added some time on top, so after a couple of dozen operations each context took 3-4 seconds to create. A transcoding scenario is defined by the end user, who expects it to start functioning with no delays, so this issue made Nimble Streamer unusable under high load and eliminated all its advantages.

First we suspected Nimble Streamer, but then we checked the reference ffmpeg package provided by NVidia and found that the GPU itself was spending time on context creation.

The problem was reported to NVidia, but it would take time to fix on their side. So we implemented a context cache mechanism which creates contexts on Nimble Streamer start. This solved the end-user problem, but Nimble's initial start-up may take some time.

Nimble Streamer contexts setup is described in this article.

Creating contexts is not the end of the story. If the number of contexts is big enough, the NVENC API starts returning the "The API call failed because it was unable to allocate enough memory to perform the requested operation." error when any related scenario is launched.

After running several tests we found that one GPU may effectively work with 47 simultaneous contexts. It makes no difference whether they are encoding or decoding contexts. We assume this is related to graphics card RAM. We had 16GB RAM cards, and we might get more contexts with a 24GB RAM graphics card; however, this needs to be checked.

The results we got are true for the particular card we had. Other cards must be tested separately.

The contexts limitation is the main obstacle to putting a higher load on the GPU.
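The idea behind a context cache can be illustrated with a generic pre-allocated pool. This is a minimal sketch of the general pattern, not Nimble Streamer's actual implementation: the class `ContextCache` and the stand-in `fake_create_context` are hypothetical, and a real system would call the GPU API where the stand-in builds a dictionary.

```python
import queue
from contextlib import contextmanager


class ContextCache:
    """Pre-creates a fixed number of expensive contexts and lends them out."""

    def __init__(self, factory, size):
        self._pool = queue.Queue()
        for _ in range(size):
            # The slow creation happens here, once, at start-up.
            self._pool.put(factory())

    def available(self):
        return self._pool.qsize()

    @contextmanager
    def borrow(self, timeout=None):
        # Blocks until a context is free; raises queue.Empty after `timeout` seconds.
        ctx = self._pool.get(timeout=timeout)
        try:
            yield ctx
        finally:
            self._pool.put(ctx)  # hand the context back for the next scenario


created = []


def fake_create_context():
    # Hypothetical stand-in for a real encoder/decoder context creation call.
    ctx = {"id": len(created)}
    created.append(ctx)
    return ctx


cache = ContextCache(fake_create_context, size=3)
with cache.borrow() as ctx:
    borrowed_id = ctx["id"]
    free_while_borrowed = cache.available()
free_after_return = cache.available()
print(borrowed_id, free_while_borrowed, free_after_return)
```

The trade-off is the one described above: scenarios start without waiting for context creation, at the price of a longer start-up and a fixed upper bound on simultaneous contexts.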

5. Conclusions

The goal of our testing was to learn more about the GPU efficiency for live streaming transcoding and to make some recipes and best practices. What do we have as a result?

Economic effect

As we saw above, there is a huge difference between the number of streams which can be processed using CPU alone and using the CPU+GPU tandem. Let's see what this means from a cost perspective, taking rental prices from Softlayer as a baseline.

The GPU configuration costs $1729/month in the Amsterdam data center; you can check the prices here. When a GPU is used, the cost is higher as the form factor is 2U in this case. The economic effect may be higher in case of purchase, but this needs a TCO analysis of NVidia GPU products.

You can see the total rental savings yourself. Please also take bandwidth costs into account, as they will add to the total.

We didn't consider the purchase option as its TCO depends on various factors. However, preliminary calculations show the advantage of GPU-based solutions.
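One way to check the savings against your own prices is to compute the break-even rental price of a CPU-only server. The sketch below uses the $1729/month GPU price and the stream counts from sections 2 and 3. It assumes that CPU-only servers have the same CPU configuration as the tested one and that streams can be split across servers freely; the function names are hypothetical, and the CPU-only price is deliberately left as something you supply.

```python
import math

GPU_SERVER_MONTHLY_USD = 1729  # GPU configuration price quoted above


def cpu_servers_needed(gpu_server_inputs: int, cpu_only_inputs: int) -> int:
    """CPU-only servers required to match one CPU+GPU server's input streams."""
    return math.ceil(gpu_server_inputs / cpu_only_inputs)


def break_even_cpu_price(gpu_price: float, gpu_server_inputs: int,
                         cpu_only_inputs: int) -> float:
    """Monthly CPU-only server price above which the GPU server is cheaper."""
    return gpu_price / cpu_servers_needed(gpu_server_inputs, cpu_only_inputs)


# 1080p: 36 inputs with CPU+GPU vs 16 with CPU only.
servers_1080 = cpu_servers_needed(36, 16)
price_1080 = break_even_cpu_price(GPU_SERVER_MONTHLY_USD, 36, 16)

# 720p: 70 inputs with CPU+GPU vs 22 with CPU only.
servers_720 = cpu_servers_needed(70, 22)
price_720 = break_even_cpu_price(GPU_SERVER_MONTHLY_USD, 70, 22)

print(servers_1080, round(price_1080, 2))
print(servers_720, round(price_720, 2))
```

Under these assumptions, one GPU server replaces 3 CPU-only servers for 1080p and 4 for 720p, so it pays off whenever a CPU-only server rents for more than roughly $576 or $432 per month respectively.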

Scaling

We found the single-card option more cost-efficient than the two-card option. If a server has 1 graphics card, its resources are used more productively.
The hardware decoder is always loaded more than the encoder because of the context creation problem. So if a second card is added, both cards cannot be fully loaded and most of the tasks will need to be completed on CPU.

We tested the 2-card option with the help of Softlayer, but we haven't shown all the details here due to its lower ROI.

So if you'd like to scale your transcoding, you should add more servers with a single GPU rather than adding more cards to existing servers.

If the number of input and output streams is low for your streaming project, e.g. a dozen HD streams with a few filters and renditions for output, then you may consider using a server without GPU and fully utilizing the CPU. Also notice that the amount of RAM is not as important as computing power for transcoding, so you can save on RAM as well.

Summary

The described hardware solution of CPU with Tesla M60 GPU works pretty well for high load tasks. The GPU does all the heavy lifting, like decoding and encoding of high renditions, while the CPU handles the remaining low renditions perfectly.