Problems in swscale:
- poor API
* does not use ffmpeg internal data structures (e.g. AVFrame) or other approaches (e.g. AVOptions)
* has all kind of random crap that shouldn't be there (vector API, context caching, sws_convertPalette8ToPacked24 ???
- very static in its core conversion approach: "fast" (direct) AtoB converters vs. "scaled path" (the generic case)
* the fast AtoB converters are cute but cannot be chained. so anything not directly covered by a fast path goes through generic path even if it's slower than a chain of two fast paths
- the pre-swscale imgconvert.c implementation (https://ffmpeg.org/doxygen/0.5/imgconvert_8c-source.html) at least attempted to do chaining, so strictly speaking this is a regression
* the scaled path is static and limited
- only YUV allowed as internal common format
- almost always invoked for non-8bpp pixel format conversions
- several functions do multiple things at once for performance reasons, which makes adding new types of work very hard
> for example, since yuv is the internal common format, all rgb input is immediately converted to yuv. however, that means if we want to convert colorspaces, we need to convert back to rgb, ungamma, xyz, gamma, rgb, and yuv before we can move on, which is ridiculous
- assembly optimizations are hard
* the scaled path has some fairly modern optimizations, but they can't set constraints on buffer sizes (since that's an ABI change)
- sometimes we revert to C for end of image
- sometimes we don't convert last few pixels at all
- sometimes we overread/overwrite
* the unscaled paths are almost universally underoptimized (only mmx/mmx2 coverage, very little sse2 and virtually no avx2)
* fast conversion simd often uses contexts to pass around information, which isn't very portable
A good fix-up of swscale would have the following elements:
- completely rewrite the API
* should "feel" like all the other libs. libswresample may serve as a good example here
* use AVFrame, AVOptions, AVPixelFormat, AVColor*, hide most internal details away from the API
- it is specifically OK if this breaks mplayer.
* some features will disappear
- sws_convertPalette8ToPacked24 etc.
- context caching
- vector API
- SWS_CS_* (replace by AVColor*)
- more dynamic scaling paths
* "fast" direct conversions become chainable
* the "scaled" path becomes dynamic
- dynamic internal format
- dynamic ordering (chaining) of filters
- merging of operations (e.g. reading yuv input and yuv2rgb conversion to prevent memory stores) where possible and where it gains significant speed
* it may well be possible for the "fast" and "scaled" codepaths to be mixed, although this obviously needs more thought
- simd / platform optimizations
* these should ideally not be exposed in the common code (i.e. a grep for x86 in libswscale/*.c should be mostly empty)
* simd should be allowed to set constraints. E.g., over-reading and -writing of buffers should be enabled in some scenarios where that makes the assembly easier to write. This should be conveyed to the user so they can choose fast conversion by having padded buffers, or a slower conversion by not having padded buffers
* a "cleaner" SIMD API, see libavcodec/libavfilter/libswresample for examples, particularly for the "fast" versions
* bringing some sanity into the choice of which function is assigned to which pointer under which conditions, or at least documenting it in human-understandable form, would also help a lot
- documentation
* API docs
* user docs
- description of filter trade-offs
- filter recommendations for end user scenarios
- maybe even some comparisons of different filters on real-world content
Things to figure out:
- slicing
{{{
[8:43am] michaelni: poke
[..]
[9:02am] BBB, pong ?
[9:02am] michaelni: so who’s going to work on swscale?
[9:03am] like, you really seem to like swscale. I think it needs tons of love to get that kind of status. who will do that work?
[9:03am] (the loving)
[9:03am] i had hoped that a GSoC student would move it forward but i think noone submitted a proposal
[9:04am] that’s because it needs love
[9:04am] BBB: swscale can't be saved
[9:04am] you can’t expect students to give it love; students have no background, they add some new features
[9:04am] pedro does work slowly on improving it it seems
[9:04am] maintainers need to give it love, ideally the original developer(s)
[9:04am] swscale needs massive amounts of love
[9:05am] because it's not a code problem but a development problem
[9:05am] wm4: ^
[9:05am] that
[9:05am] wm4: I think it can be salvaged - I’m not sure any of the original code will be left at the end
[9:05am] BBB with comments like wm4&kieranks i will not even think about touching the code, iam not enjoying this environment
[9:06am] michaelni: but you’re not talking to them, you’re talking to me
[9:06am] michaelni: and I didn’t say we should rm -fr swscale
[9:06am] michaelni: I just said it needs love
[9:06am] yes it does
[9:06am] michaelni: so… who will give it that kind of love?
[9:06am] michaelni: well it's because you actively resists any radical measures which might actually save swscale
[9:06am] we need some maintainers stepping up the love game, otherwise it will go down the drain
[9:07am] michaelni: did you ever read https://wiki.libav.org/Blueprint/AVScale ? (ignore the politics for now, focus on the technical complaints in the document)
[9:07am] i read it long ago
[9:07am] not recently if it changed
[9:08am] I don’t think it did
[9:08am] so … let’s say (hypothetically) that we wanted to salvage swscale
[9:08am] can we totally redesign it? and will you actively help in coding it up?
[9:09am] iam happy if its totally redesigned, thats not a problem at all
[9:10am] i want the main practically used codepathes to stay fast though
[9:10am] a 5% slowdown should not happen
[9:10am] well, there’s many dimensions to the total set of problems in swscale… api is one, lack of integration with the rest of ffmpeg is another
[9:10am] the first thing you'd have to accept to make progress at all (instead of burning developer time on all these crazy details every time someone dares to take a look at it) is (temporary) deoptimization
[9:11am] swscale isnt very fast to begin with
[9:11am] all its "fast" optimization are super low quality
[9:11am] not true
[9:11am] the internals are very … unextendable, which isn’t necessarily a problem for corner case optimizations, but at this point, xyz support is a total hack, whereas if we are moving from to bt2020 support, we really should start thinking about xyz being a central component of the processing chain
[..]
[9:12am] michaelni: obviously one that’s skipped if it’s not necessary, but swscale as it is right now would be merely one of the steps
[9:12am] michaelni: fast paths for direct conversions are obviously ok, but it’s mostly very static and unpredictable right now
[9:12am] » michaelni need to awnser apoe call iam bac in 5min
[9:12am] michaelni: I still believe we go through a full yuv conversion and scaling path if we convert gbrp14 to gbrp12
[9:12am] ok
[..]
[9:13am] anotherother big question is whether to include the pseudo pixel formats
[9:14am] r210, v210 etc
[9:14am] nevcairiel: most optimizations are from 1985, so they’re mmx (or sometimes mmx2) at best
[9:14am] there’s little bits of sse2/avx2 added by various people a few years ago
[9:14am] and now finally some arm
[9:14am] but it’s mostly still a mess
[9:14am] optimisation behaves strangely when constraints don't fit btw
[..]
[9:15am] like, it seems x86 yuv2rgb will just ignore the last pixels if linesizes are not enough
[9:15am] the problem I hacked around is when doing chroma conversions swscale does a full luma multiply and shift which ends up being a NOOP
[9:15am] it would be great to have some kind of automatic c-fallback to fill the padding
[9:15am] but burns an entire cpu core
[9:16am] and make sure all simd received aligned & padded stuff
[9:16am] btw, anyone for my vscale question earlier? in yuv2planeX, offset can only be 0 or 3?
[9:17am] (x86 simd seems to assume so, and it looks like it's the case in practice in my tests so far)
[9:18am] ubitux: let me check
[9:20am] only one where it could not be 3 or 0 would be where it is use_mmx_vfilter ? (c->uv_offx2 >> 1) : 3
[9:20am] oh god the dithering
[9:20am] yes, dithering should be optional btw, it's badly done currently in the options
[..]
[9:24am] ubitux: I don’t know what uv_offx2 does ...
[9:24am] libswscale/utils.c: c->uv_offx2 = dst_stride + 16;
[9:24am] pile of hacks upon hacks let it rest in peace
[9:25am] durandal_1707: please be more constructive
[9:25am] we're not going to depend on an external lib for converts
[9:26am] where is use_mmx_vfilter set?
[9:26am] libswscale/x86/swscale_template.c: c->use_mmx_vfilter= 1;
[9:27am] so, it can only be ≠ (0, 3) when this special yuv2yuvX function is used
[9:28am] so it shouldn't matter
[9:28am] i was surprised the other day when it was triggering this yuv2yuvX instead of the ff_yuv2planeX_{mmx,see,...}
[9:28am] (which also exists)
[9:29am] while it is triggering yuv2planeX_8_c when running on arm
[9:29am] (or basically with no asm)
[9:29am] it seems we don’t set the sse2 versions if this was already initialized
[9:29am] that looks like a bug
[9:30am] ../libswscale/x86/swscale.c: case 8: if ((condition_8bit) && !c->use_mmx_vfilter) vscalefn = ff_yuv2planeX_8_ ## opt; break; \
[9:30am] we should just force c->use_mmx_vfilter to 0?
[9:30am] anyway
[9:30am] yes you’re right it’s only 0 and 3 then, for your use case
[9:34am] ubitux: re simd behaving stragenyl, that’s certainly something I’d like to discuss with michaelni later on in this conversation
[9:34am] ubitux: but let’s start by keeping things simple
[9:34am] ubitux: I'm still waiting for nlmeans
[9:34am] yeah, i should get done with this one
[9:36am] BBB: btw, i agree with michaelni about having the colorspace convert in sws
[9:36am] since it's already there...
[9:37am] what is already there?
[9:37am] BBB all scaling goes through yuv, unscaled rgb converts might be without yuv
[9:38am] BBB: SWS_CS_*
[9:38am] theres some avx in swscale
[9:38am] let’s do one dimension of this problem set at a time
[9:38am] let’s start with api: swscale api is from the 80s. it needs to die and the approach in the avscale blueprint isn’t so bad. backwards compatibility to hell, no part of the public api (except maybe the version macros) should stay
[9:39am] do you agree?
[9:39am] i think wm4 had a patch for that?
[9:39am] unless you're not talking about the AVFrame wrapping?
[9:39am] the api should integrate with the rest of ffmpeg (e.g. use AVOptions, like swresample does; use AVFrame; etc.), and no context caching, “sws_convertPalette8ToPacked24” or vector api
[9:39am] BBB the old API interface should be provided by using wrapers around the new
[9:40am] no, the old api should just die
[9:40am] yeah i agree with deprecating the old api
[9:40am] it’s useless and nobody wants it except mplayer, which is almost actual proof that it should die
[9:40am] no kill the old api
[9:40am] it's horrible
[9:40am] which implies a new lib
[9:40am] michaelni: I think you have pretty much consensus that the old api needs to die here
[9:41am] sure, if thats what people want
[9:41am] michaelni: so, that’s dimension 1. now, dimension 2: scaling stages…
[9:41am] (I’ll talk simd/platform ops later)
[9:41am] so, right now, we have two kind of paths: “direct conversions” like yuv422p_to_yuyv
[9:41am] and we have the scaled path, as invoked (iirc) by gbrp14 to gbrp12
[..]
[9:42am] (fortunately it uses a filter size of 1 so it’s not that bad, but still)
[9:42am] we need a more generic approach to “scaled path”
[9:42am] yes
[9:42am] I think avscale calls this kernels
[9:42am] you can sort of see this in the colorspace thing also, although that’s obviously fairly limited (on purpose)
[9:43am] internal format should be dynamic, it can’t be only yuv
[9:43am] if input and output is xyz, that should be ok
[9:43am] also, to go from rgb to rgb should not involve a yuv conversion, and xyz to rgb shouldn’t either
[9:43am] it was planed since the 80ties that the internal format should be anything just wasnt ever done
[9:43am] right, but so this is a component of “major love"
[9:44am] we’ve been saying for years that stuff needs to be done and nobody does it
[9:44am] it may well be that a new approach will be so fundamentally different that we need new implementations of every simd function, and that pretty much all existing code will eventually be rewritten
[9:44am] also you(plural) should talk with pedro arthur he imlemented a more generic filter path last year
[9:45am] and that code is enabled and works but i think it itself needs cleanup and documentation
[9:45am] a kernel style approach has issues btw, as can be seen in the avscale document
[9:45am] BBB: small parenthesis about xyz: it appears many people use lut3d to do the convert
[9:45am] (vf lut3d does that)
[9:46am] (i had a local patch to support xyz as input, needs to upstream it, but i have some format negociation issues)
[9:46am] I actually wrote avscale years ago, long before libav started caring about it, I tihnk I talked about it on irc back then
[9:46am] » michaelni maybe wasnt on iRC back then
[9:46am] the issue with kernels is that you tend to want to design it in such a way (naturally) that the number of memory intermediates goes up a lot compared to swscale
[..]
[9:47am] so this needs to be designed by ery knowledgable people who understand performance
[9:47am] and will end up with a lot of functions, some near-duplicates, to do the same thing, if you want the same performance as you have with swscale
[9:47am] (in all cases)
[9:47am] BBB i think you absolutely must talk with pedro arthur about this
[9:47am] so, lots of code, or slightly slower
[9:48am] as pedro did alot of work on swscale filter stuff and that is i think very similar to kernels
[9:48am] (and you can see why I eventually threw my avscale out of the window, I just didn’t care enough to write so much code)
[9:48am] so, simd is largely a game about assumptions, right? as kierank sayd, we have tons of simd that doesn’t work right on odd image sizes or non-multiple-of-4-s etc.
[9:49am] a nice thing of a new api is that we can document assumptions
[9:49am] e.g. “buffers should have at least this much padding"
[9:49am] and then just require that to be the case
[9:49am] and then we can just overwrite in our simd instead of underconvert
[9:49am] (or, worst of all, revert to C, which is just so ugly that I don’t know what to say)
[9:50am] (swscale does a combination of all 3)
[9:50am] the lack of SIMD assumtation docs is a real problem, yes
[9:50am] yes
[9:50am] and about pedro… I can look back at what he did, but you see my point about students, right? this is the job of a maintiner, not a student
[9:51am] if we want pedro to help discussing this, he should be here, or we should get him here. is he on irc?
[9:51am] pedro is no student anymore in the gsoc sense
[9:51am] is he ready to be maintainer?
[9:51am] i suspect he isnt on irc but yes he should join ideally
[9:51am] in the short term, none of this will happen, which is why I wrote the filter
[9:52am] it fixes my immediate problem and will allow me to move forward with actual things I’d like to do
[9:52am] i think he is ready but maybe he lacks confidence or maybe time i dont know
[9:52am] pedro agreed to co/backup mentor a swscale task if we had a student (which we dont)
[..]
[9:56am] michaelni: is pedro’s filter chaining work documented anywhere? or where do I find it?
[9:58am] ubitux: for video, I doubt people use xyz much, it’s primarily an image thing to actually ever want the data in xyz… but xyz is an intermediate useful to convert between various rgb chromaticity primaries (“colorspaces”)
[9:59am] BBB iam not sure its documented all that much, pedro last summer ended up doing more than i wanted and possibly good docs where one thing that he forgot
[10:00am] where do I look for the code?
[10:00am] I should make a wiki page out of this discussion so we can refer to it in future continuations
[10:01am] git log -p --author Arthur
[10:02am] I can help with coding if blueprints are there
[..]
[10:27am] BBB: about assembly, the unscaled path is actually very complex to handle
[10:27am] first, it has way too much arguments
[10:28am] 2nd, to workaround this, since it's using inline asm in x86, it passes the sws context
[..]
[10:28am] in which the fields are duplicated and directly readable
[10:28am] I think we should add that, I hadn’t complained that much about the unscaled optimizations yet, indeed
[10:28am] (by offsetting the context with some)
[10:28am] but some of that code is hideous also
[10:29am] (with some +macro)
[10:29am] yeah I remember
[10:29am] the fast bilinear scaler is also very “"interesting""
[10:29am] (it’s fast, I admit)
[10:30am] BBB: also slicing
[10:30am] I have surprisingly few opinions on slicing
[10:30am] some people seem to hate it, I don’t really care
[10:30am] do you know how it's done currently?
[10:30am] roughly, yes
[10:31am] I wonder if it isn’t easier if slicing was done internally and externally it just had a “-threads” parameter that users can set
[10:31am] it's not threadable slicing, it's just for the destination
[10:31am] I always thought it was for threading
[10:31am] no
[10:32am] :-P
[10:32am] well good thing I had no opinion on it I guess
[..]
[10:32am] apparently it was about some cache locality of some sort
[10:32am] btw as slicing is mentioned, theres the use case where the whole image doesnt fit in memory (gimp does that) this may be worth a thought in case of a redesign
[..]
[10:33am] michaelni: doesn't work by squares?
[10:33am] only slices?
[10:34am] photoshop works by big blocks (call it tiles/squares/bigblocks/whatever)
[10:34am] to clarify we dont support that currently, gimp uses its own scaler
[10:34am] i just saw that gimp caching code a long time ago (which used squares)
[10:34am] or rectangles dont remember
[10:36am] also some kind of square / slice /tiling for multitreading is probably "important"
[10:37am] yeah
[10:37am] BBB: about the poor api, let's talk about the options...
[10:38am] dithering configuration is broken
[10:38am] hm, dithering
[10:38am] like, you haven't a dither=none
[10:38am] it's just enabled, sometimes
[10:38am] (0=auto)
[10:38am] yeah, nobody knows when it’s enabled
[10:38am] btw, iam not sure though how important/usefull such non memory squares are ...
[10:38am] also the scaler selection mixed within flags
[..]
[10:39am] the avscale blueprint is pretty good at explaining that, the filter selection should be an enum
[10:39am] (AVOption)
[10:39am] and the rest of the flags can just be bools or whatevers as separate options
[10:39am] I Guess pre-AVOption extending api was hard so this made sense, but with AVOption we realy don’y need it anymore
[10:39am] small nit: wth are param0 and param1
[10:40am] hahaha right
[10:40am] they are options for some filters
[10:40am] wasn’t it alpha/beta for one of the larger filters?
[10:40am] error_diffusion belongs to dithering btw
[10:40am] it's currently a sws falgs
[10:40am] what are the implication of accurate_rnd too
[10:40am] » michaelni doesnt remember exactly what param0 and 1 did for each filter but yes they where filter params
[10:41am] For SWS_BICUBIC param[0] and [1] tune the shape of the basis function, param[0] tunes f(1) and param[1] f´(1) | For SWS_GAUSS param[0] tunes the exponent and thus cutoff frequency | For SWS_LANCZOS param[0] tunes the width of the window function
[10:41am] no other comment so far
[10:41am] these all should be AVOptions
[10:42am] there’s also SWS_FULL_CHR_H_INT/INP
[10:42am] in fact, I just noticed SWS_DIRECT_BGR, ...
[10:43am] I believe int/inp had to do with non-420p support
[10:43am] (I mean, practically speaking)
[10:43am] doesn’t accurate_rnd increase precision of some internal codepath/something?
[10:43am] small note: there are a few dithering algorithms in vf paletteuse filter
[10:43am] BBB, yes accuarte_rnd uses more accurate code
[10:44am] IIRC no pmulhw
[10:44am] which would loose the lsbs
[10:44am] it has different meaning depending on the codepath
[10:44am] I think it’s totally fine to basically eliminate all these flags and go back to “let’s just make it do the right thing”
[10:44am] it's not well defined
[10:44am] I can see how pmulh(u)w was critically important for performance in the mplayer era
[10:45am] I think it’s totally fair to say that with axv2, it is not all that relevant anymore
[10:45am] it's used between 32 vs 16 in some rgb code iirc
[10:45am] I also wonder if half of the filters should be deleted
[10:45am] like sinc, gauss
[10:46am] maybe even fast-bilinear
[10:46am] (that would clean up the code so much)
[10:46am] the filters like sinc gauss should have nearly no complexity as its just different numbers
[10:46am] it’s user complexity
[10:46am] we should expose the ideal configuration settings to our user
[10:47am] when to use spline or lanczos: when upscaling
[10:47am] (and caring about quality)
[10:47am] when to use bicublin: when speed is critical and you’re downscaling
[10:47am] that’s very helpful to end users
[10:47am] when to use gaussian?
[10:47am] I don’t know… I don’t think anyone knows
[10:47am] with scalig different people will want different options and some people just like to have the choice
[10:47am] i'd keep the different filters
[10:47am] it's useful to make various visual comparison
[10:48am] sinc is something that some people "know" is best until they try it
[10:48am] :-D
[10:48am] but doesn’t that mean we should remove it?
[10:48am] why keep the option there
[10:48am] people will bug you to implement it because it's the perfect filter
[10:48am] many of the various filters are just different kernels over the same kind of filter, so preserving them costs you practically nothing
[10:48am] do you know how many people thought x264 was the best encoder in the world but they were using it with default ffmpeg parameters (instead of presets)?
[10:49am]