FPS filter stutters when converting framerate

Description

I am seeing the fps filter choose the wrong frames. YouTube videos sometimes show this problem in both mp4 (with x264) and webm (with vp9) formats, but usually in only one of the two. It can happen when the input video file is 60 fps and the user is watching a 30 fps version. The usual pattern is that one out of every 3 frames is wrong. If the source video actually had a duplicate frame every other frame, then when this happens, one out of three frames in the 30 fps version is a duplicate of one of the other two, and a frame that was doubled in the 60 fps video is missing entirely from the 30 fps one.

I thought this was because YouTube was using an fps filter that selected the 'next' frame or something, because in recent years it also started handling variable-fps videos incorrectly. But I just found it, and was able to replicate it, in ffmpeg.

The manual says out of the different frame selection methods, the fps filter chooses the nearest frame.

It's safe to say I don't understand exactly why this is happening. My first attempt to replicate it failed.

The result is slightly different from the original file I found this in, but it still shows the problem. The filter selects frames with n= 1, 2, 4, 7, instead of every other frame.

This is somehow related to the offset. In the original file, this is apparently because the audio starts sooner. Removing the offset might fix the problem but cause slight desync issues. If YouTube is in fact using ffmpeg to process input, video streams not starting at 0 could be the reason it's sometimes bugged.

In the original file, I tried looking at the detailed output from -v trace, and finding the frame that was nearest to an interval based on the output framerate, but could not understand why a frame was being dropped when it was nearest to the interval's middle.

Frame number (from 0) is the mean:[x] value. With round=near, it's [1,2,4,7]. With round=up and round=inf, it's [0,1,4,6]. With round=zero and round=down, it's [1,4,5,7]. With start_time=1 and '-t 1.2', it's [31,34,35,37]; however, with 'start_time=0.11:round=near' and '-t 0.31', it's [4,6,8,10].

This report would be much better if I offered a patch, but I don't have the expertise.

When framerate is halved, the fps filter uses an interval or point that's midway between the start point and the next input frame. I don't know how it's done mathematically or in the code, but this is how it works. Due to timebase rounding, sometimes the next frame is closer, and sometimes the previous one is.
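To illustrate the timebase rounding, here is a small Python sketch. It is not taken from the ffmpeg source. It assumes 30 fps input whose timestamps are stored with a 1/1000 timebase and rounded to the nearest millisecond, and the function name is made up for this example. When the framerate is halved, every odd-numbered input frame would ideally sit exactly midway between two output frames, so the sketch reports whether the stored timestamp ends up before, on, or after that ideal point.

```python
from fractions import Fraction

def midpoint_drift(in_fps=30, tb_den=1000, n_frames=8):
    """Return (frame index, stored pts, side) for the odd-numbered input frames.

    side is -1 if the stored timestamp is earlier than the ideal time,
    0 if it is exact, and +1 if it is later.
    """
    rows = []
    for n in range(1, n_frames, 2):
        # Ideal time of frame n, expressed in timebase units (may be fractional).
        ideal = Fraction(n * tb_den, in_fps)
        # The same time rounded to the nearest integer timestamp (assumed storage).
        stored = (2 * n * tb_den + in_fps) // (2 * in_fps)
        side = (stored > ideal) - (stored < ideal)
        rows.append((n, stored, side))
    return rows

for n, stored, side in midpoint_drift():
    print(n, stored, side)
```

Under these assumptions the odd frames are stored at 33, 100, 167 and 233 ms, which is before, exactly on, after, and again before the ideal midpoint. That small variation is enough to let a different neighbour win each time.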

With input framerate = output framerate, the fps filter will still duplicate and drop frames with some rounding methods; in this case, 'fps=30:start_time=0:round=down' leads to [1,1,2,4,4,5,7].

The frame duplication in the first case can be avoided by using a start_time offset that exceeds the variation in frames. But if YouTube can get this wrong, normal users can as well. At least one of the rounding methods should work without adjusting the start time, and it might as well be 'near'.

Conceptually, the interval should be centered on the first frame, not on the average between first and last frames of the input interval. So for input=30fps and output=30fps, the interval is [-1/60,1/60]. Frame at 0 is closest. For output=15fps, interval is [-1/30,1/30]. Frame at 0 is still closest. Second interval for 15 fps is [1/30,3/30], and the frame at pts=67:pts_time=0.067 is closest. There's one very obvious candidate instead of two very similar candidate frames.
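A rough Python sketch of this idea, under stated assumptions: input timestamps are given in milliseconds (30 fps rounded to a 1/1000 timebase), each output frame k is centered on k/out_fps, and the function name is hypothetical. For each output frame it returns the index of the nearest input frame, together with how much further away the runner-up is.

```python
from fractions import Fraction

def centred_selection(pts_ms, out_fps, n_out):
    """For each output frame, return (nearest input index, margin in ms).

    margin is how much further from the interval centre the second-best
    input frame is, compared with the best one.
    """
    picks = []
    for k in range(n_out):
        centre = Fraction(1000 * k, out_fps)  # output time in ms, kept exact
        # Rank all input frames by their distance from the centre.
        ranked = sorted(range(len(pts_ms)), key=lambda i: abs(pts_ms[i] - centre))
        best, second = ranked[0], ranked[1]
        margin = abs(pts_ms[second] - centre) - abs(pts_ms[best] - centre)
        picks.append((best, float(margin)))
    return picks

pts_30fps = [0, 33, 67, 100, 133, 167, 200, 233]
print(centred_selection(pts_30fps, 15, 4))
```

With this input the chosen indices are 0, 2, 4, 6, and the runner-up is always about 33 ms further away, so the millisecond rounding cannot change which frame wins.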

This can still lead to problems if the input timebase isn't divisible by framerate, and the offset gradually or suddenly jumps so that the decision point is equidistant from two frames. Could happen with variable framerate video that has been edited; when concatenating two videos with ffmpeg, which leads to odd offsets due to audio having a different length than video, like the second clip being 0.01 sec early or late; or when converting ~59.97 fps video to 30 fps.

The muxers in ffmpeg, or ffmpeg itself, will add or drop a frame based on an acceptable offset, like 0.5 or 1.5 times the distance between frames based on framerate from -r [rate]. The fps filter could try to select the next input frame to use for output based on the previous one used, attempting to moderate the high-frequency variations introduced by timebase rounding. But this would be a more extensive patch, and just making round=near work for halving framerate would be a good fix by itself.

Conceptually, as decision point goes from 'low' to 'high', no tracking of previous input frame used would lead to low frame used, then high-frequency 'noise' as the filter switches between low and high, then high frame used. With tracking/average done, the 'low' frame would be used a little bit longer until it switches to high. If time goes down for some reason, possibly a video source that has variable lag, the transition will be delayed again. It's easy to set the default to a value that is unnoticeable for these edge cases but exceeds variation from timebase rounding. The choice of a timebase of 1/1000 for .webm videos is based on the assumption that 1/1000 sec is unnoticeable.

Maybe people don't discuss design choices in comments here, and instead use other venues like IRC. But I feel out of place there, particularly since I don't know any programming languages, and have not found it to be useful.

Current details of fps filter, as far as I can gather from limited testing without understanding the code:
Filter gets start point from user or first frame. I don't understand this comment or the code that follows it:

+ * The dance with offsets is required to match the rounding behaviour of the
+ * previous version of the fps filter when using the start_time option. */

But anyway, having timestamps 0.04 sec later seems to cause the same output as using start_time=-0.04.

The input timestamps are rounded up or down to an output timestamp, as opposed to rounding the output timestamps to an input one.

The frame with the last input timestamp corresponding to an output timestamp is used for that output frame.

So with output r=10 and round=near, two input frames at 0.04 and 1.04 (1 fps with offset added), the output frame at 1.0 uses the second input frame. All frames before it duplicate the first frame.

With 100 fps input and 10 fps output, the out_frame at 1.0 uses in_frame at 1.04 for round=near; in_frame at 1.09 for round=down; in_frame at 1.0 for round=up.
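The behaviour described above can be written down as a small Python model. It is only a model of the observations in this report, not the actual libavfilter code. Input timestamps are assumed to be non-negative integers in milliseconds, start_time is assumed to be 0, and the helper names (rescale, last_frame_per_tick, output_sequence) are made up for this sketch.

```python
def rescale(pts, num, den, rounding):
    """Rescale a non-negative integer pts by num/den with the given rounding."""
    x = pts * num
    if rounding in ("down", "zero"):
        return x // den
    if rounding in ("up", "inf"):
        return -(-x // den)              # ceiling division
    if rounding == "near":
        return (x + den // 2) // den     # halves round away from zero for x >= 0
    raise ValueError(rounding)

def last_frame_per_tick(pts_ms, out_fps, rounding="near"):
    """Map each output tick to the index of the last input frame rounded onto it."""
    chosen = {}
    for i, pts in enumerate(pts_ms):
        tick = rescale(pts, out_fps, 1000, rounding)
        chosen[tick] = i                 # a later frame overwrites an earlier one
    return chosen

def output_sequence(pts_ms, out_fps, rounding="near"):
    """Input frame index used for every output tick, repeating frames over gaps."""
    chosen = last_frame_per_tick(pts_ms, out_fps, rounding)
    seq = []
    prev = chosen[min(chosen)]
    for tick in range(min(chosen), max(chosen) + 1):
        prev = chosen.get(tick, prev)    # no new input frame: duplicate the previous one
        seq.append(prev)
    return seq

pts_100fps = list(range(0, 1200, 10))    # 100 fps, timestamps in ms
for mode in ("near", "down", "up"):
    idx = last_frame_per_tick(pts_100fps, 10, mode)[10]
    print(mode, pts_100fps[idx] / 1000)

print(output_sequence([40, 1040], 10))   # two frames at 0.04 and 1.04

pts_30fps = [0, 33, 67, 100, 133, 167, 200, 233]
print(output_sequence(pts_30fps, 15, "near"))
```

In this model the output frame at 1.0 takes the input frame at 1.04, 1.09 and 1.0 for near, down and up respectively, and the two-frame case repeats the first frame ten times before switching. Given 30 fps input in a millisecond timebase and 15 fps output, the same model also yields [1,2,4,7] for near, [1,4,5,7] for down and a sequence starting [0,1,4,6] for up, matching the index lists reported earlier.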

One concern for design and use is whether the video remains synchronized. With the default round=near, the frame that's displayed for [1,1.1> is an average of what's happening during that time. At least one media player, totem, shows the upcoming frame if you pause it, which may or may not be a bug. If a 30-fps video is converted to 60 fps by duplicating frames and the user is processing the 60-fps version, then the second of each frame could be slightly higher quality, though I suppose it could be the opposite if the first displayed frame is a B-frame. I would tend to say that it's better for content to be displayed late (or rather, "on time") than early, but this might be my biases like the way low-fps lag works in computer games. Low fps triggers an expectation of what the content should be, even if it isn't being updated.

But more important is whether there are unwanted duplicates or skipping of frames. Suppose a user wants the input frame at 1.0 to be displayed at 1.0 in the output. Currently they could use round=up, which (maybe counterintuitively) does this. But if it causes every third frame to jitter by being one input frame early, this filter option is not useful.

Out of the three options (for positive timestamps), round=near is in the middle. Output frames should be displayed around the input timestamps used for this method. The other two rounding methods can return frames that are either later or earlier than this time.

With the 100 fps to 10 fps example, start_time=0:round=near could go one of two ways. It could continue to select the input frames at 0.04, 0.14, 0.24 etc. but display them at those same times, by using a timebase lower than 1/fps (timestamp interval higher than 1). Or it could use the frames at 0, 0.1, 0.2 etc. and display them at those times.

Currently, round=near appears to work like the other methods, by processing all input frames up to the next output frame's time interval then outputting the last one. Interval for 10 fps, 1.0 timestamp appears to be [0.95,1.05>, so if input fps is 30.1 with frames at 1.03 and 1.063, the frame at 1.03 will be used even though 1.063 is closer to 1.05. The input frames are associated with the nearest output frame; the output frame does not select the input frame that's closest.

I think this should be changed so the output frame at time X is, in fact, the input frame closest to X, but the other rounding methods also have this bias. If only 'near' is adjusted, then on average it might be slightly above the average of 'round=up' and 'round=down'.

But anyway, as an adjustment to current code, there would be a check to see if each input frame is closer to the output frame, instead of just using the last one. Then adjust the intervals for round={up,down} down by half a frame.
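As a sketch of that check in Python, for round=near only: this is not a patch against the real filter, just an illustration under the same assumptions as the rest of this report (integer millisecond input timestamps, start_time of 0), with a hypothetical function name. Each input frame is still assigned to its nearest output tick, but it only replaces the current candidate for that tick if it is strictly closer to the tick's time.

```python
def closest_frame_per_tick(pts_ms, out_fps):
    """Return, per output tick in order, the index of the closest input frame."""
    chosen = {}                                # tick -> (input index, distance)
    for i, pts in enumerate(pts_ms):
        scaled = pts * out_fps                 # position in output ticks, times 1000
        tick = (scaled + 500) // 1000          # nearest output tick (round=near)
        dist = abs(scaled - tick * 1000)       # distance from that tick, same scale
        # Keep the earlier candidate unless this frame is strictly closer.
        if tick not in chosen or dist < chosen[tick][1]:
            chosen[tick] = (i, dist)
    return [chosen[t][0] for t in sorted(chosen)]

pts_30fps = [0, 33, 67, 100, 133, 167, 200, 233]
print(closest_frame_per_tick(pts_30fps, 15))   # [0, 2, 4, 6]
```

For the 30 fps to 15 fps case this gives every other frame, and for 100 fps to 10 fps the output frame at 1.0 takes the input frame at 1.0 rather than 1.04. The shifted intervals for round={up,down} are not covered by this sketch.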

This only fixes jittering or stuttering for the simple case, as discussed above.