Description

I can observe from the debug data that very few frames are actually encoded using NVENC even when I have h264 selected from the UI. When I run it using the above command, it works wonderfully. I'm using r20858 on CentOS7 with VirtualGL on the server and an r20882 client on Windows. The driver is 390.87. I'm using a C2075 for render and a K20c for encode.

I do not think the problem has much to do with my hardware or revision, because I'm having this trouble on all of my configurations, including a brand-new computer (the other one is pretty old) with a GTX 1060 running the release 2.3.4 server with the release 2.4 client.

Change History (99)

Xpra will only use video encoders when it thinks it is necessary; for more details on the original feature see #410 - there have been many updates and fixes since.

Roughly speaking, the video encoder will not be triggered unless something on screen refreshes rapidly enough.
By only allowing h264, you are bypassing all the automatic encoding logic.

To test, you can run glxgears or anything else that generates lots of frames and verify that NVENC does get used.

If you feel that your screen updates would be better compressed with a video encoder and that the automatic logic is too slow to switch to it, you can debug the video region detection with -d regiondetect and the encoding selection logic with -d encoding.
Bear in mind that GPU encoders take longer to set up, so it is best to only use them when we're certain that the content is video-ish and is going to keep producing screen updates regularly.
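For example, something like this (the display number and test application are placeholders) exercises the detection logic while capturing both debug categories:

xpra start :100 --start=glxgears -d regiondetect,encoding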

Maybe we could tweak the logic to switch to video sooner when we know we're dealing with an efficient encoder (ie: NVENC).

Here's a "-d compress" log. For context, this log was collected during a *sustained* rotation operation on an OpenGL rendering of a finite element model of vehicle impact (it is a standard benchmark for us). The log looks like the below excerpt the whole way through. When XPRA jumps between RGB and YUV the screen briefly locks hard.

I have my system tray UI set to speed=auto, quality=auto and encoding = h264.

There are two cases I want to include:

(1) Without "XPRA_FORCE_CSC_MODE=BGRX" it is slow and sometimes throws "fail to encode" errors for NVENC, especially when I set speed to 100 (for this log, speed was set to auto). In this case it didn't throw those errors, but the whole picture *did* turn yellow a few times during the collection of this log.

(2) With XPRA_FORCE_CSC_MODE=BGRX it is also very slow, but not as slow as before (though nothing like the speed and smoothness of having only NVENC enabled). There were noticeable jerks when it switched between encodings, but never any NVENC encoding errors or random color changes like when BGRX isn't set.

First things first, something is not right with your setup if rgb ends up using zlib compression. You should have lz4 installed, and if not you should see a big warning in your server log on startup.
You can confirm the status by running xpra/net/net_util.py on Linux and Network_info.exe on win32.

As for the logs, I meant "-d compress" as well as (or combined with) the -d regiondetect and -d encoding from comment:1.

From the first log sample, most of the time it seems to be taking 500ms to compress at a resolution of 1636x958. That's dreadfully slow. You may want to raise min-speed and leave the encoding set to "auto" so it can at least switch to vp8 / vp9 when not using NVENC.
Why it isn't switching to nvenc would require the -d score log output. I think that is going to be the key.

it is slow and sometimes throws "fail to encode" errors for NVENC

Please post the exact error message, including the full stacktrace.
You may need to add -d video,encoder to see it.
That could also cause encoder selection problems, we should be able to fix those or at least avoid the conditions that trigger it.

with XPRA_FORCE_CSC_MODE=BGRX

Why are you changing this switch exactly?
Setting a high enough (min-)quality should give the same result.
There are two ways that NVENC can handle BGRX input: natively for the cards + driver combinations that support it (can be turned off with XPRA_NVENC_NATIVE_RGB=0), or via a CUDA conversion step otherwise.
It's not clear what your card supports.
Please post the output of the server startup with -d nvenc
Either way, NVENC should win the scoring compared to doing the CSC step via another module (libyuv or whatever).

There are some scroll encoding packets in the middle, those can cause problems when encoding video as we then skip frames and encode them differently.
You can play with raising:

XPRA_SCROLL_MIN_PERCENT=80

Or even disable scroll encoding altogether:

XPRA_SCROLL_ENCODING=0

As for the IRC discussion regarding setting hints, we can worry about that once the encoding selection bug is fixed - it will help (see #1952 and #2023)

After a lot of struggling I've gotten LZ4 to work, but I don't understand why it was wrong. For some reason, on my computer, MINGW_BUILD.sh doesn't include lz4 correctly for version 2.1.0 or 2.1.1. When I backed down to version 1.1.0, my Windows build correctly included lz4, and I see this:

Version 2.1.0 works only if I run the python script from within the mingw shell. However, lz4 doesn't seem to be getting put into the build directory correctly when I use version 2.1.0. Consequently, it doesn't make it into the installer. I'm not sure why it's wrong, and I'll look into it a little more before giving up and moving on to your other suggestions. (This problem is less important to me than the encoder one, especially since I've got a workaround, namely downgrading to 1.1.0.)

Making lz4 work didn't resolve the other issues regarding the NVIDIA encoder.

For this capture I had encoder set to H264 (everything else was set to auto). I rendered a 3D model and rotated it. I had XPRA_SCROLL_ENCODING=0 (because other experiments showed this helps some). This particular run was *very* jerky on the gigabit LAN. I have no idea why.

After a lot of struggling I've gotten LZ4 to work, but I don't understand why it was wrong.
For some reason, on my computer, MINGW_BUILD.sh doesn't include lz4 correctly for version 2.1.0 or 2.1.1.

So you're using custom builds rather than our official releases. Please make sure to mention that every time, as per wiki/ReportingBugs.
And test with official builds too to make sure that the problem doesn't come from your builds.
As for lz4, this is a well documented issue: ticket:1883#comment:4 with its own ticket: #1929

Making lz4 work didn't resolve the other issues regarding the NVIDIA encoder.

It wasn't meant to.
It was essential to find out why your output was not what was expected.
And it will help with performance (outdated benchmark here: https://xpra.org/stats/lz4/)

I had XPRA_SCROLL_ENCODING=0 (because other experiments showed this helps some)

"scroll" encoding is not used for windows we correctly tag as "video", hence why you should look into tagging your video windows.

So it is choosing the correct encoding (h264) and the correct encoder (nvenc) but for some reason the context gets recycled every time.
And since nvenc is slow to setup, that's probably where the delay comes from.

by the time the log samples start, the batch delay is already through the roof:

send_delayed for wid 2, batch delay is 233ms, elapsed time is 233ms

So something already happened before that which reduced the framerate to less than 5fps.
Could well be the wacky settings, or the re-cycling issue; -d stats should be enough to see why that is.

I *have* been testing with both builds consistently on the client side, and I've been using the builds from the winswitch beta on CentOS7. There are no other surprises in my configs. I have not, for instance, set LD_LIBRARY_PATH to anything interesting. I had not seen wiki/ReportingBugs before (I should have thought to look for such a document). My apologies!

Regarding tagging, can you point me to some information on how that works (I'm ignorant about how a program passes this kind of information through)? I will pass it on to our UI guys, and I expect that if it's straightforward I can get them to put it in our UI trunk right away. (They are using wxWidgets with gtk2 on X. Our software's window has a rectangular video region surrounded by buttons.)

Regarding the additional debug captures, I will try to do that in a few hours.

After uploading the 7th one, only as a control, I tried "--encodings=h264,h265" (this is not what I did for 8.txt). I found that it locks up just a tiny little bit at the beginning, unless I let it settle before starting to rotate. On the 8.txt run (which had all encoders enabled and was with the stock client) I was extra careful to let it "settle". As I expected, with settling it still has the issue for which I opened this ticket, but I figure it's best to isolate one problem at a time. I therefore regard 8.txt as the nicest debug log.

And that causes failures when we reach the limit. 2 contexts should be enough if you only have one "video" window though.
When we do hit this limit, things go downhill fast:

setup_pipeline(..) failed! took 2197.59ms

That's the encoding pipeline blocked for over 2 seconds! ouch.

the batch delay is already high because of this:

update_batch_delay: congestion : 1.87,8.74 {}

You may want to turn off bandwidth detection with --bandwidth-detection=no to see if that helps (#1912).
Or post the -d bandwidth log output so we can fix this.
But then it also goes up because of this:

I did not understand what you meant by "do not disable the other video encoders". For some reason I thought you meant don't use "--encodings="; I now realize you don't want "--video-encoders=" either. I apologize for that. I was not being intentionally obtuse. I am very grateful for your help. I do not mean to waste your time!

I was aware of the limit regarding encoder streams, but I didn't realize I was bumping up against it. Is there an easy way to buy or request licenses from NVIDIA (for testing)?

All of my users have GTX cards and, except for a solitary Tesla card, that's all I have. I have two Kepler-era K20c Tesla cards. That is the card that sometimes draws frames with weird colors (lots of yellow) with XPRA. I infer older cards have some limitations in this regard. That card, by the way, does work great for the render window if I add the "XPRA_FORCE_CSC_MODE=BGRX" environment variable and an --encodings option. I think that card can also be made to work. If getting a Quadro (or some other card) will expedite this I'll buy whatever you recommend. Any recommendations?

Ultimately, I'd like to get XPRA to do NVENC only in my CAD render window and not anywhere else (everywhere else works just fine), so hopefully we can eventually use GTX cards. I'll try to get into the office later today so that I can perform the followup tests you requested.

I was aware of the limit regarding encoder streams, but I didn't realize I was bumping up against it

If we can fix all the bugs, you should be OK with just 2 contexts.

Is there an easy way to buy or request licenses from NVIDIA (for testing)?

I don't know.

I have two Kepler-era K20c Tesla cards. That is the card that sometimes draws frames with weird colors (lots of yellow) with XPRA

Could be bugs with lossless and / or YUV444 mode with those older cards.
If you can get us some "-d video" debug of when that happens, we may be able to restrict them to the modes that do work.

If getting a Quadro (or some other card) will expedite this I'll buy whatever you recommend. Any recommendations?

I've used both gamer cards (GTX 7xx to GTX 10xx - with and without license keys) and Quadros myself.
There is absolutely no difference in performance, the Quadros just cost at least 4 times as much.
See the Video Encode and Decode GPU Support Matrix.

Ultimately, I'd like to get XPRA to do NVENC only in my CAD render window and not anywhere else (everywhere else works just fine)

I've installed the dummy driver. I've updated to r20927. I'm running with the environment variable you specified. I've turned off bandwidth detection. Bandwidth detection made no noticeable difference. The DPI problem seems to be resolved (based on my browsing of the log file).

If you can get us some "-d video" debug of when that happens, we may be able to restrict them to the modes that do work.

I just tried this exercise from the gigabit LAN. This older config is working much better than the GTX config. With bandwidth detection turned off my eyeball cannot detect anything wrong (when bandwidth detection is on it locks occasionally). Last time I saw the colors go weird I was using broadband, so I'll go back home and try that (I came to work for this test).

For completeness I'm posting my experience on gigabit on my old config. I'm rendering to a C2075 (which is so old it cannot encode) and encoding on a K20c (which doesn't support rendering).

Update: From home I find that with nvenc and only nvenc enabled (--encodings) performance is great. With normal options (nothing disabled) performance is bad and my client eventually crashes. I've attached the server log from that.

this log sample seems to be missing XPRA_DEBUG_VIDEO_CLEAN=1 so I still cannot tell why the video encoder context gets recycled...

The sample from comment:17 lacks the log context around the backtrace. A single backtrace is not enough in any case, some video context re-cycling is normal.
That said, there aren't many instances of re-cycling during this test, so this may not tell us much.

the nvenc encoder seems to stick around for quite some time, long enough for over 750 frames:

I agree that the K20c run on the LAN was good (I meant to indicate that in my original post). I've since added 14.txt (I did not realize that you had already replied, otherwise I would have made a new reply).

14.txt was on broadband. Over broadband I'm seeing that --encodings=h264 is much better than the default options (I am only sending logs for the default options per your request).

Regarding the environment variable, you are right. I did forget to add the new environment variable on 12.txt and 14.txt. However, I did have the environment variable set correctly on 11.txt (my most recent run on the gtx). I'll redo 12.txt and 14.txt in the morning.

We could use the failures to detect the actual context limit value for each card, and then use this value to avoid trying to create new contexts when the limit is reached. (we have code for doing this already - it just uses the wrong limit: the default is 32)
It is difficult to see why this happened just from the log output. If you don't have too many windows then I suspect it could be because the encoder context cleanup happens in the encode thread: when we destroy an nvenc encoder (for resizing or whatever), it isn't actually freed immediately, whereas we may try to create the new one immediately... I'm not sure how easy it would be to change that.
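A minimal sketch of that suspected race - all names and numbers here are illustrative, not xpra's actual internals:

import threading, queue, time

CONTEXT_LIMIT = 2              # consumer cards without a license key
cleanup_queue = queue.Queue()  # encoders waiting to be freed by the encode thread
lock = threading.Lock()
contexts_in_use = 0

def destroy_encoder(encoder):
    # called on resize: the context is only queued for cleanup here,
    # the actual free happens later in the encode thread
    cleanup_queue.put(encoder)

def encode_loop():
    global contexts_in_use
    while True:
        cleanup_queue.get()
        time.sleep(0.025)          # cleanup is not instantaneous
        with lock:
            contexts_in_use -= 1   # only now is the slot really released

def create_encoder():
    global contexts_in_use
    with lock:
        if contexts_in_use >= CONTEXT_LIMIT:
            # the "cannot create context" situation seen in the logs
            raise RuntimeError("no NVENC context available")
        contexts_in_use += 1
    return object()                # stand-in for a real NVENC context

threading.Thread(target=encode_loop, daemon=True).start()
a = create_encoder()
b = create_encoder()
destroy_encoder(a)
try:
    create_encoder()               # races with the deferred cleanup above
except RuntimeError as e:
    print("immediate re-create failed:", e)
time.sleep(0.1)                    # give the encode thread time to catch up
print("after waiting, re-create works:", create_encoder() is not None)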

there aren't many instances of context recycling:

switch from x264 to nvenc (as per above)

a window is removed (2018-11-04 14:35:01,187)

we detect a sub-rectangle of the window where the "video" seems to happen: video subregion was None, now rectangle(3, 27, 1558, 879) (window size: 1652,998)

and later on the other way: video subregion was rectangle(3, 27, 1558, 879), now None (window size: 1652,998)

and finally at the end when there is no activity for ~10 seconds (video_encoder_timeout)

There's a dbus API you could use to turn off video region detection, or to set a particular region: #1060, #1401.
Maybe we could just enhance the content-type to support values like "video:full", "video:subregion" or something like that.
Re-cycling contexts is expensive with nvenc.

With normal options (nothing disabled) performance is bad and my client eventually crashes.

How did it crash?
Can you run it from the command line and see its output?
Please try with the latest builds, preferably the 64-bit ones.

I made a video demonstrating what our software looks like. This was rendered on the machine with the Tesla K20c (this is similar to 12.txt). I made it so that you can see what I am doing. For instance, the window closing events are for the "help" box and the "animate" box. The only part that needs nvenc encoding is the box with the 3d drawing.

This case works pretty well. It locks up during a rotation at 9 1/3 seconds (it would not do that with --encodings=h264), but it's pretty good. (I don't have a proper debug log because I was making it for demonstration rather than debug purposes).

Again, thanks for all your help. I'll be following up shortly with more debug logs.

I was able to get a repeat of 14.txt (with XPRA_DEBUG_VIDEO_CLEAN=1). Same server, different client. This time I was using the stock XPRA 64-bit client (from the web site) connecting through my cell phone hotspot. When tested with --encoding=h264 (for control) the performance was shockingly good (on a cell phone hotspot). However, with the normal options it was slow and eventually crashed.

I *was not* able to replicate using the same commands when connected on either 802.11g or 802.11n. In all cases I was port forwarding ssh. This trouble arises *only* on cell phone and broadband. Broadband is an especially important case, however, because we have several offsite users.

the context re-cycling issue is not a major factor in this log (see details below), so maybe this was a red herring or just an aggravating factor

the batch delay goes up because of bandwidth congestion (at one point it is above 100ms, which would limit your screen updates to less than 10fps); try running the client with --bandwidth-detection=no --bandwidth-limit=100mbps (see the combined example after the next item) and add -d bandwidth to the server debug flags to confirm (the client overrules the server's bandwidth-detection flag - and maybe we shouldn't allow that?) - h264 is more efficient, so it could be avoiding some bandwidth issues

we haven't looked at the impact of auto-refresh - maybe that's what is killing the performance over slower connections; try --auto-refresh-delay=0 on the client and see if that helps
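For example, a complete client invocation combining the two suggestions above (the connection string is just a placeholder):

xpra attach ssh:user@server:100 --bandwidth-detection=no --bandwidth-limit=100mbps --auto-refresh-delay=0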

The video_context_clean events come from:

windows that are destroyed (_lost_window), twice - in both cases, the windows used h264 for a single frame (and webp, jpeg for a few more):

Normally this isn't a huge problem, but with nvenc it is more costly because of the initial setup cost and the fact that it "wastes" a precious encoder context.
If the video encoders trigger too early, we could favour software encoders until we have enough frames to be certain that a hardware video encoder is worth using.
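A sketch of what that gating could look like - the names and threshold are illustrative, not xpra's actual code:

MIN_FRAMES_FOR_HW = 10   # illustrative threshold

def choose_encoder(frame_count, hw_available):
    # favour cheap-to-start software encoders until the stream has
    # proven itself, then pay the hardware setup cost just once
    if hw_available and frame_count >= MIN_FRAMES_FOR_HW:
        return "nvenc"
    return "x264"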

... with the normal options it was slow ...
This trouble arises *only* on cell phone and broadband.

Can you describe this "trouble"?
Is it stuttering, occasionally pausing, or just generally slow?
If the problems are intermittent, can you match them up with timecodes from the log output? (roughly)
Try capturing "xpra info" at a time when things are slow.
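For example (assuming the server runs on display :100):

xpra info :100 > xpra-info-slow.txt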

I've been away all day dealing with some other stuff. I think you are on to something. When I turn off the bandwidth detection on the client side the performance is better than I've ever seen. I (almost) cannot distinguish broadband from LAN.

The interesting part: the longer I leave the session running the slower things get. I'll try to get debug data before the day is out. Otherwise I'll have it first thing tomorrow.

Added the --auto-refresh-delay option. Everything runs smoothly (with the Tesla card) for as long as I leave the window open.

I like auto-refresh very much so I'll be doing anything I can to help fix it. (With auto-refresh turned on, the video streams become slower and slower, seemingly with each refresh, until the client crashes; after reconnecting it's perfect at first, and then the process repeats exactly as before.) I will be taking debug logs tomorrow. I'll also try to match lock-ups to timestamps in the logs.

I like auto-refresh very much so I'll be doing anything I can to help fix it.

The debug we need to get to the bottom of this would be:

leave --bandwidth-detection=yes on the client and post the server's -d bandwidth logs - it looks like the bandwidth detection is detecting the difference between LAN and broadband / phone connections, but maybe over-compensating

for auto-refresh issues, post the -d refresh,regionrefresh,compress,video for when things get out of whack - the bandwidth detection should have an impact on the auto-refresh so you may want to try with / without, nvenc's slow startup could be causing the auto-refresh to trigger too early, or maybe it triggers and causes context re-cycling - we'll see

Since I've updated my server to the latest beta and disabled the bandwidth detection, I've not been able to replicate any trouble on the Tesla card when I use clients compiled by xpra.org. I've tried both 2.4 and the latest betas and they work!

The good news for me is that my builds give the same bad results with or without my patches. I really want to get my patches incorporated into mainline, but I don't want to send them until I make them work on Linux and MacOS.

The bad news (for me, not for you) is that, until I get our UI team to add the X properties and until I go back to testing the gamer cards, I think my only problems are created by something in the way I'm building XPRA on Windows.

I'm not really sure where to begin debugging this. I'm going to start with the suggestions in your previous replies (I haven't tried them all yet). The log captures on the server side look pretty similar for both clients (lots of h264). My client is slow even when it *is* using h264 on the server side. I can't figure out how to read "-d all" because there are so many GTK messages.

I'm not exactly sure why my custom builds were not working on Windows. I think I was accidentally setting some "speed=" things in my "conf.d" folder. Anyway, I can now connect flawlessly to my computer having a K20c Tesla (for encoding, and a C2075 for render) if I disable the bandwidth detection.

I still cannot make the computer with the GTX card work right. I'm attaching a log from an XPRA session on a computer with a single GTX card running the opengl job.

I am using gigabit Ethernet.

I'm still working on getting _XPRA_CONTENT_TYPE added to our software.

Since I've updated my server to the latest beta and disabled the bandwidth detection

As per comment:29, it would really help if you could reproduce the problems with bandwidth-detection enabled and -d bandwidth.
Then we can make it less aggressive and work better out of the box for everyone.

I still cannot make the computer with the GTX card work right.
I'm attaching a log from an XPRA session on a computer with a single GTX card running the opengl job.

r21052 raises the context limit when we have multiple devices to choose from, and records context failures with NULL values

r21053 allows you to specify the number of contexts the card can handle (this applies to all cards - if you have a heterogeneous setup, you are better off just disabling the ones that have a low limit). So, for consumer cards without a license key, use:

XPRA_NVENC_CONTEXT_LIMIT=2 xpra start ...

There is no overcommit, so this should prevent the worst of the stuttering as we won't be trying to use nvenc once two contexts are in use (trying and failing takes time)

r21055 allows you to use both contexts rather than aiming for just one (half the limit - which is fine when you have 32...)

If you still have problems, then that probably means that we're choosing the wrong window for nvenc.
Please post -d encoding,damage with an explanation of which window id should be using the video encoders.

Note: I have seen what looks like some malformed frames in the client output: unknown format returned - ffmpeg fails to decode a frame then we reset the encoder.. which takes a bit of time. New ticket: #2051.

I will do the tests with bandwidth detection ASAP (probably within the next week). At the moment I am looking, among other things, at the ssh patch I submitted and the gnome-terminal ticket. Thanks for all your help!

Stumbled upon this hack for bypassing the licensing restrictions: https://github.com/keylase/nvidia-patch. It patches the driver so that the license key check always succeeds.
Another approach which can be used is to break on the license key checking code in gdb and then extract the license keys included in the blob.

Using the debug options discussed in this ticket I have found that, at this point, a good deal of my NVENC context recycling is caused by the video-region-detection dimensions changing. I am now working with our LS-PrePost developers to get me a callback so that I can plug in some code I've written to notify XPRA on dbus of the OpenGL canvas coordinates.

I have already demonstrated that this helps a lot using the "xpra control" commands. It may even completely resolve the NVENC context recycling issues, but I'm not sure yet...
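For reference, the notification from Python looks roughly like this - the bus name, object path and method names below are assumptions from memory, so verify them against your server's dbus introspection data first (see #1060 / #1401):

import dbus

def notify_video_region(display_no, wid, x, y, w, h):
    # ASSUMPTION: the bus name, object path and method names may differ,
    # check them with d-feet or dbus introspection before relying on this
    bus = dbus.SessionBus()
    server = bus.get_object("org.xpra.Server%i" % display_no, "/org/xpra/Server")
    iface = dbus.Interface(server, "org.xpra.Server")
    iface.SetVideoRegion(wid, x, y, w, h)        # pin the OpenGL canvas area
    iface.SetVideoRegionDetection(wid, False)    # stop detection overriding it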

When I do this the video region gets stuck forever on x264 unless I disable x264 at startup (and x264 is laggy). I'm not sure if this is a bug, or if I'm doing something wrong.

When I do this the video region gets stuck forever on x264 unless I disable x264 at startup

That's what setting a video region does.

(and x264 is laggy). I'm not sure if this is a bug, or if I'm doing something wrong.

Are you talking about the lagginess?
What exactly is the symptom? Did you capture xpra info?

Using "video" may trigger some delayed frames with x264. (but not with nvenc - not yet anyway)
Assuming that you're using the x264 encoder, you may want to only enable vpx and not x264: --video-encoders=vpx.
Or you can also try to cap the number of delayed frames with x264:

XPRA_X264_MAX_DELAYED_FRAMES=1 xpra start ..

Maybe what we need to do is to define a new content-type ("animation" perhaps?) that enables video encoders but without the features that cause delayed frames (b-frames, multiple reference frames) as those are only really useful for true video content (ie: video player or youtube).

x264 and vpx are both laggy, but I think this might be an insurmountable computational intensity issue (on the old Westmere chip I'm testing this on, at least). I'm seeing other behaviors consistent with this theory: if I turn off the mesh, for instance, it gets less laggy (and the lag goes away sometimes).

When I set the video region by hand (after using it a little) XPRA ends up in nvenc, but when I set it right at start it is in x264 and stays in x264 forever, and it never switches to nvenc.

My users are comparing xpra to VNC (and NX), and the only consistent complaint I have heard (from a small number of them) has to do with (at least I think it has to do with) the time it takes nvenc to start.

This region that I am flagging with DBUS is either resting (a constant lossless frame) or animating. Nothing like scrolling or UI happens in this region. My experiments tell me that VPX and X264 cannot keep up, but nvenc is amazing and turbojpeg is the best when there is no nvenc (at least on old Westmere chips).

Based on my experience, therefore, and assuming I've made no other mistakes, I would like for my DBUS calls to make a *permanent* nvenc context in that region, as the shape of the region never changes without triggering my resize callback which will notify xpra.

I've done a little attempt at debugging and I've found that the score for nvenc never gets better than x264 when I set the region using dbus. I think it is because the setup penalty always swamps everything else for some reason, but I'm not sure that I'm reading this right.

x264 and vpx are both laggy, but I think this might be an insurmountable computational intensity issue (on the old Westmere chip I'm testing this on, at least). I'm seeing other behaviors consistent with this theory: if I turn off the mesh, for instance, it gets less laggy (and the lag goes away sometimes).

I'm not sure what "turn off the mesh" means here.
Setting a high speed setting should reduce the encoding latency, it will just use up more bandwidth instead.

When I set the video region by hand (after using it a little) XPRA ends up in nvenc, but when I set it right at start it is in x264 and stays in x264 forever, and it never switches to nvenc.

Two things are likely at play here:

we don't use nvenc immediately unless we are certain that the video is here to stay - so initially x264 will win the scoring, even more so recently: see r21057, you can try lowering it with XPRA_MIN_FPS_COST

we try to stick to the same encoder to prevent constant recycling of encoder contexts - though nvenc should still win out eventually
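Roughly, the trade-off the scoring makes can be pictured like this - a much simplified illustration, not the real get_pipeline_score:

def pipeline_score(quality, speed, setup_cost, fps):
    # the one-off setup cost is amortized over the frames the region is
    # expected to produce: a low (or zero) fps estimate makes nvenc's
    # large setup cost unbeatable by x264's near-zero one
    if fps <= 0:
        return 0                # never amortized: hardware cannot win
    return quality + speed - setup_cost / fps

print(pipeline_score(90, 90, 500, 25))   # nvenc on a busy region: 160.0
print(pipeline_score(70, 60, 5, 25))     # x264 on the same region: 129.8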

My users are comparing xpra to VNC (and NX), and the only consistent complaint I have heard (from a small number of them) has to do with (at least I think it has to do with) the time it takes nvenc to start.

We can make it start quicker, but then users may complain about jerkiness for the first few frames. (due to the nvenc initialization delay)

I've done a little attempt at debugging and I've found that the score for nvenc never gets better than x264 when I set the region using dbus.
I think it is because the setup penalty always swamps everything else for some reason, but I'm not sure that I'm reading this right.

Could be the fps change (r21057), and / or maybe setting the video region interferes with that.

we don't use nvenc immediately unless we are certain that the video is here to stay - so initially x264 will win the scoring, even more so recently: see r21057, you can try lowering it with XPRA_MIN_FPS_COST

What's going on is that identify_video_subregion is returning before it can update the fps when detection is disabled, so the fps stays at 0 forever. I'm trying to come up with a patch...
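The shape of the fix is simple enough - here is a self-contained sketch with made-up names rather than the actual window_video_source.py code:

class VideoSource:
    def __init__(self):
        self.detection_enabled = False
        self.fps = 0

    def update_fps(self):
        # stand-in for the real per-region frame rate estimation
        self.fps = 30

    def identify_video_subregion(self):
        # update the fps estimate *before* the early return taken when
        # detection is disabled, so the fps no longer stays stuck at 0
        self.update_fps()
        if not self.detection_enabled:
            return
        # ...the actual subregion detection would follow here...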

forces a context recycle when the quality changes. This resolves the problem I was having with quality changes not taking effect. This is in window_video_source.py.

passes the detection property through to get_pipeline_score and then discards setup costs if the detection is turned off. I do this because if detection is turned off it is safe to assume that there won't be context recycling because the window's layout won't change for a while.

changes video_encoder_timeout in window_video_source.py so that the context does not time out unless detection is turned on. This is related to the previous item.

r21260 uses a much longer timeout for the video encoder context: the default is 10 minutes, and you can set XPRA_VIDEO_NODETECT_TIMEOUT=0 to disable the timeout completely - there is no need for a get_detection method as this can be replaced with 2 lines of code

r21264 updates the fps value even when detection is turned off (tiny style changes)
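For example, to disable the timeout completely on these new builds:

XPRA_VIDEO_NODETECT_TIMEOUT=0 xpra start ...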

All of my clients also had bandwidth detection disabled, so my claim that I had tested with bandwidth detection re-enabled might have been false. However, my user from last night certainly had bandwidth detection *on*. I'll be retesting that momentarily.

This is a real bug, but it has been present for a long time, so this may just be an unlucky symptom of another problem: you may be hitting this race condition now because there are many congestion events.
Fixed in r21288.

We have a target latency of 272ms for this connection: 118ms for network (based on measured latency), 54ms for sending the packet, and 100ms tolerance.
Yet somehow the ack packet comes back 102ms late. Spurious events like these can happen, so this may or may not be a problem.
There are a couple more events, before this important one:

you have nvenc enabled and it looks like that is causing a 500ms delay when it starts up:

DOH: regarding the 500ms delay, that's obviously something holding onto the python GIL during nvenc / pycuda setup.
If you can provide the -d cuda,nvenc output and also set XPRA_NVENC_DEBUG_API=1, I should be able to figure out which call is blocking the whole python interpreter and then we can cythonize it and release the GIL.
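That is, start the server with something like:

XPRA_NVENC_DEBUG_API=1 xpra start ... -d cuda,nvenc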

Also relatively slow is the encoder context cleanup call (after cuda_clean) - but since we call it twice, it could just be elsewhere... r21292 will help clarify that.

More and more, I am thinking of solving delay problems (see comment:21 and ticket:2048#comment:2) by moving video encoding to its own thread.
This would only fix some of the problems (#2048 mostly), but not if pycuda holds the GIL...

I've modified it as follows (added init); I also reduced the sleep interval from 0.1 to 0.01:

from pycuda import driver
driver.init()    # initialize the CUDA driver API before any device calls
print("initilizing device")

output:

[root@curry nathan]# !py
python test_cuda_context_gil.py
******initilizing device
*device=<pycuda._driver.Device object at 0x7ff61455a680>
context=<pycuda._driver.Context object at 0x7ff5ce4ad668>
*done
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------
Aborted (core dumped)
[root@curry nathan]#

The error printed at the end can probably be avoided by calling cuda_context.detach().
In hindsight, this test code output is not very useful... I wanted to know if the print thread was being blocked during the call to make_context. Was it? (Did the "*" stop being printed out?)
A better way would have been to print timestamps every time instead.
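Something along these lines would do it - a sketch of such a timestamped test, with illustrative names, not the actual test script:

import threading, time
from pycuda import driver

done = False

def ticker():
    # print a timestamp every 10ms from another thread: if the GIL is held
    # during make_context, there will be a visible gap in this output
    while not done:
        print("tick %.3f" % time.time())
        time.sleep(0.01)

t = threading.Thread(target=ticker)
t.start()
driver.init()
device = driver.Device(0)
print("before make_context: %.3f" % time.time())
context = device.make_context()
print("after make_context: %.3f" % time.time())
done = True
t.join()
context.pop()       # as noted above, pop / detach avoid the cleanup abort
context.detach()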

There is a "print time.time()" on either side of the make_context. The stars are printing fast enough that I think they should have appeared. But you are right it would be clearer with timestamps. Here it is:

So, as I expected, the problem is that python-pycuda is stalling the whole python interpreter during both device init and context creation.
I'll ask for help, otherwise we may have to rewrite the cuda glue in cython so we can call it as described in Cython: Interfacing with External C Code: Releasing the GIL.
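The pattern would look something like this .pyx sketch - cuda_glue.h and cuda_init_device are hypothetical stand-ins for whatever C wrapper we end up with:

# release the GIL around the blocking driver call so the rest of the
# interpreter (including any ticker thread) keeps running
cdef extern from "cuda_glue.h":
    int cuda_init_device(int device_id) nogil   # hypothetical C wrapper

def init_device(int device_id):
    cdef int r
    with nogil:
        r = cuda_init_device(device_id)
    if r != 0:
        raise RuntimeError("device init failed: %i" % r)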

As per this reply: No particular reason from my end. I'd be happy to consider a PR with such a change. Should be pretty straightforward.
Time for me to work on pycuda a bit then.

In the meantime, this should achieve the same results as setting XPRA_BATCH_MAX_DELAY=50 but better: we expire the delayed regions quicker, but they only actually get sent if there really isn't any reason to wait longer.
This doesn't fix the problem where we are too sensitive to latency going up and not sensitive enough when it goes back down, but it should mitigate most of it.

cuda_context.detach() can take ~25ms. I have not modified it because it uses CUDAPP_CALL_GUARDED_CLEANUP, which does not have a THREADED variant, and I think there might well be a reason for that - this can be revisited if the call takes much longer on some setups.

Based on comment:35, I've created #2111.
As per comment:47 and others, without the log sample we won't be improving the detection heuristics so you will need to use the dbus video-region api to avoid context recycling.

I hope that the congestion events were caused by pycuda and that they're now all gone.

Yes please, something like "better video region detection".
With a log of screen updates, and what the video region should have been.
Then we can see if we can tune the code to auto-detect it properly, given time.