Version 2 (modified by rbultje, 3 years ago) (diff)


Problems in swscale:

  • poor API
    • does not use ffmpeg internal data structures (e.g. AVFrame) or other approaches (e.g. AVOptions)
    • has all kind of random crap that shouldn't be there (vector API, context caching, sws_convertPalette8ToPacked24 ???
  • very static in its core conversion approach: "fast" (direct) AtoB converters vs. "scaled path" (the generic case)
    • the fast AtoB converters are cute but cannot be chained. so anything not directly covered by a fast path goes through generic path even if it's slower than a chain of two fast paths
    • the scaled path is static and limited
      • only YUV allowed as internal common format
      • almost always invoked for non-8bpp pixel format conversions
      • several functions do multiple things at once for performance reasons, which makes adding new types of work very hard

        for example, since yuv is the internal common format, all rgb input is immediately converted to yuv. however, that means if we want to convert colorspaces, we need to convert back to rgb, ungamma, xyz, gamma, rgb, and yuv before we can move on, which is ridiculous

  • assembly optimizations are hard
    • the scaled path has some fairly modern optimizations, but they can't set constraints on buffer sizes (since that's an ABI change)
      • sometimes we revert to C for end of image
      • sometimes we don't convert last few pixels at all
      • sometimes we overread/overwrite
    • the unscaled paths are almost universally underoptimized (only mmx/mmx2 coverage, very little sse2 and virtually no avx2)

A good fix-up of swscale would have the following elements:

  • completely rewrite the API
    • should "feel" like all the other libs. libswresample may serve as a good example here
    • use AVFrame, AVOptions, AVPixelFormat, AVColor*, hide most internal details away from the API
      • it is specifically OK if this breaks mplayer.
    • some features will disappear
      • sws_convertPalette8ToPacked24 etc.
      • context caching
      • vector API
      • SWS_CS_* (replace by AVColor*)
  • more dynamic scaling paths
    • "fast" direct conversions become chainable
    • the "scaled" path becomes dynamic
      • dynamic internal format
      • dynamic ordering (chaining) of filters
      • merging of operations (e.g. reading yuv input and yuv2rgb conversion to prevent memory stores) where possible and where it gains significant speed
  • simd / platform optimizations
    • these should ideally not be exposed in the common code (i.e. a grep for x86 in libswscale/*.c should be mostly empty)
    • simd should be allowed to set constraints. E.g., over-reading and -writing of buffers should be enabled in some scenarios where that makes the assembly easier to write. This should be conveyed to the user so they can choose fast conversion by having padded buffers, or a slower conversion by not having padded buffers
    • fast conversion simd often uses contexts to pass around information, which isn't very portable

Things to figure out:

  • slicing
[8:43am] <BBB> michaelni: poke
[9:02am] <michaelni> BBB, pong ?
[9:02am] <BBB> michaelni: so who’s going to work on swscale?
[9:03am] <BBB> like, you really seem to like swscale. I think it needs tons of love to get that kind of status. who will do that work?
[9:03am] <BBB> (the loving)
[9:03am] <michaelni> i had hoped that a GSoC student would move it forward but i think noone submitted a proposal
[9:04am] <BBB> that’s because it needs love
[9:04am] <wm4> BBB: swscale can't be saved
[9:04am] <BBB> you can’t expect students to give it love; students have no background, they add some new features
[9:04am] <michaelni> pedro does work slowly on improving it it seems
[9:04am] <BBB> maintainers need to give it love, ideally the original developer(s)
[9:04am] <BBB> swscale needs massive amounts of love
[9:05am] <wm4> because it's not a code problem but a development problem
[9:05am] <kierank> wm4: ^
[9:05am] <kierank> that
[9:05am] <BBB> wm4: I think it can be salvaged - I’m not sure any of the original code will be left at the end
[9:05am] <michaelni> BBB with comments like wm4&kieranks i will not even think about touching the code, iam not enjoying this environment
[9:06am] <BBB> michaelni: but you’re not talking to them, you’re talking to me
[9:06am] <BBB> michaelni: and I didn’t say we should rm -fr swscale
[9:06am] <BBB> michaelni: I just said it needs love
[9:06am] <michaelni> yes it does
[9:06am] <BBB> michaelni: so… who will give it that kind of love?
[9:06am] <wm4> michaelni: well it's because you actively resists any radical measures which might actually save swscale
[9:06am] <BBB> we need some maintainers stepping up the love game, otherwise it will go down the drain
[9:07am] <BBB> michaelni: did you ever read ? (ignore the politics for now, focus on the technical complaints in the document)
[9:07am] <michaelni> i read it long ago
[9:07am] <michaelni> not recently if it changed
[9:08am] <BBB> I don’t think it did
[9:08am] <BBB> so … let’s say (hypothetically) that we wanted to salvage swscale
[9:08am] <BBB> can we totally redesign it? and will you actively help in coding it up?
[9:09am] <michaelni> iam happy if its totally redesigned, thats not a problem at all
[9:10am] <michaelni> i want the main practically used codepathes to stay fast though
[9:10am] <michaelni> a 5% slowdown should not happen
[9:10am] <BBB> well, there’s many dimensions to the total set of problems in swscale… api is one, lack of integration with the rest of ffmpeg is another
[9:10am] <wm4> the first thing you'd have to accept to make progress at all (instead of burning developer time on all these crazy details every time someone dares to take a look at it) is (temporary) deoptimization
[9:11am] <nevcairiel> swscale isnt very fast to begin with
[9:11am] <nevcairiel> all its "fast" optimization are super low quality
[9:11am] <kierank> not true
[9:11am] <BBB> the internals are very … unextendable, which isn’t necessarily a problem for corner case optimizations, but at this point, xyz support is a total hack, whereas if we are moving from to bt2020 support, we really should start thinking about xyz being a central component of the processing chain
[9:12am] <BBB> michaelni: obviously one that’s skipped if it’s not necessary, but swscale as it is right now would be merely one of the steps
[9:12am] <BBB> michaelni: fast paths for direct conversions are obviously ok, but it’s mostly very static and unpredictable right now
[9:12am] » michaelni  need to awnser apoe call iam bac in 5min
[9:12am] <BBB> michaelni: I still believe we go through a full yuv conversion and scaling path if we convert gbrp14 to gbrp12
[9:12am] <BBB> ok
[9:13am] <kierank> anotherother big question is whether to include the pseudo pixel formats
[9:14am] <kierank> r210, v210 etc
[9:14am] <BBB> nevcairiel: most optimizations are from 1985, so they’re mmx (or sometimes mmx2) at best
[9:14am] <BBB> there’s little bits of sse2/avx2 added by various people a few years ago
[9:14am] <BBB> and now finally some arm
[9:14am] <BBB> but it’s mostly still a mess
[9:14am] <ubitux> optimisation behaves strangely when constraints don't fit btw
[9:15am] <ubitux> like, it seems x86 yuv2rgb will just ignore the last pixels if linesizes are not enough
[9:15am] <kierank> the problem I hacked around is when doing chroma conversions swscale does a full luma multiply and shift which ends up being a NOOP
[9:15am] <ubitux> it would be great to have some kind of automatic c-fallback to fill the padding
[9:15am] <kierank> but burns an entire cpu core
[9:16am] <ubitux> and make sure all simd received aligned & padded stuff
[9:16am] <ubitux> btw, anyone for my vscale question earlier? in yuv2planeX, offset can only be 0 or 3?
[9:17am] <ubitux> (x86 simd seems to assume so, and it looks like it's the case in practice in my tests so far)
[9:18am] <BBB> ubitux: let me check
[9:20am] <ubitux> only one where it could not be 3 or 0 would be where it is use_mmx_vfilter ? (c->uv_offx2 >> 1) : 3
[9:20am] <BBB> oh god the dithering
[9:20am] <ubitux> yes, dithering should be optional btw, it's badly done currently in the options
[9:24am] <BBB> ubitux: I don’t know what uv_offx2 does ...
[9:24am] <ubitux> libswscale/utils.c:    c->uv_offx2 = dst_stride + 16;
[9:24am] <durandal_1707> pile of hacks upon hacks let it rest in peace
[9:25am] <ubitux> durandal_1707: please be more constructive
[9:25am] <ubitux> we're not going to depend on an external lib for converts
[9:26am] <BBB> where is use_mmx_vfilter set?
[9:26am] <ubitux> libswscale/x86/swscale_template.c:                c->use_mmx_vfilter= 1;
[9:27am] <ubitux> so, it can only be ≠ (0, 3) when this special yuv2yuvX function is used
[9:28am] <ubitux> so it shouldn't matter
[9:28am] <ubitux> i was surprised the other day when it was triggering this yuv2yuvX instead of the ff_yuv2planeX_{mmx,see,...}
[9:28am] <ubitux> (which also exists)
[9:29am] <ubitux> while it is triggering yuv2planeX_8_c when running on arm
[9:29am] <ubitux> (or basically with no asm)
[9:29am] <BBB> it seems we don’t set the sse2 versions if this was already initialized
[9:29am] <BBB> that looks like a bug
[9:30am] <BBB> ../libswscale/x86/swscale.c:    case 8: if ((condition_8bit) && !c->use_mmx_vfilter) vscalefn = ff_yuv2planeX_8_  ## opt; break; \
[9:30am] <BBB> we should just force c->use_mmx_vfilter to 0?
[9:30am] <BBB> anyway
[9:30am] <BBB> yes you’re right it’s only 0 and 3 then, for your use case
[9:34am] <BBB> ubitux: re simd behaving stragenyl, that’s certainly something I’d like to discuss with michaelni later on in this conversation
[9:34am] <BBB> ubitux: but let’s start by keeping things simple
[9:34am] <durandal_1707> ubitux: I'm still waiting for nlmeans
[9:34am] <ubitux> yeah, i should get done with this one
[9:36am] <ubitux> BBB: btw, i agree with michaelni about having the colorspace convert in sws
[9:36am] <ubitux> since it's already there...
[9:37am] <BBB> what is already there?
[9:37am] <michaelni> BBB all scaling goes through yuv, unscaled rgb converts might be without yuv
[9:38am] <ubitux> BBB: SWS_CS_* 
[9:38am] <michaelni> theres some avx in swscale
[9:38am] <BBB> let’s do one dimension of this problem set at a time
[9:38am] <BBB> let’s start with api: swscale api is from the 80s. it needs to die and the approach in the avscale blueprint isn’t so bad. backwards compatibility to hell, no part of the public api (except maybe the version macros) should stay
[9:39am] <BBB> do you agree?
[9:39am] <ubitux> i think wm4 had a patch for that?
[9:39am] <ubitux> unless you're not talking about the AVFrame wrapping?
[9:39am] <BBB> the api should integrate with the rest of ffmpeg (e.g. use AVOptions, like swresample does; use AVFrame; etc.), and no context caching, “sws_convertPalette8ToPacked24” or vector api
[9:39am] <michaelni> BBB the old API interface should be provided by using wrapers around the new
[9:40am] <BBB> no, the old api should just die
[9:40am] <ubitux> yeah i agree with deprecating the old api
[9:40am] <BBB> it’s useless and nobody wants it except mplayer, which is almost actual proof that it should die
[9:40am] <kierank> no kill the old api
[9:40am] <kierank> it's horrible
[9:40am] <kierank> which implies a new lib
[9:40am] <BBB> michaelni: I think you have pretty much consensus that the old api needs to die here
[9:41am] <michaelni> sure, if thats what people want
[9:41am] <BBB> michaelni: so, that’s dimension 1. now, dimension 2: scaling stages…
[9:41am] <BBB> (I’ll talk simd/platform ops later)
[9:41am] <BBB> so, right now, we have two kind of paths: “direct conversions” like yuv422p_to_yuyv
[9:41am] <BBB> and we have the scaled path, as invoked (iirc) by gbrp14 to gbrp12
[9:42am] <BBB> (fortunately it uses a filter size of 1 so it’s not that bad, but still)
[9:42am] <BBB> we need a more generic approach to “scaled path”
[9:42am] <michaelni> yes
[9:42am] <BBB> I think avscale calls this kernels
[9:42am] <BBB> you can sort of see this in the colorspace thing also, although that’s obviously fairly limited (on purpose)
[9:43am] <BBB> internal format should be dynamic, it can’t be only yuv
[9:43am] <BBB> if input and output is xyz, that should be ok
[9:43am] <BBB> also, to go from rgb to rgb should not involve a yuv conversion, and xyz to rgb shouldn’t either
[9:43am] <michaelni> it was planed since the 80ties that the internal format should be anything just wasnt ever done
[9:43am] <BBB> right, but so this is a component of “major love"
[9:44am] <BBB> we’ve been saying for years that stuff needs to be done and nobody does it
[9:44am] <BBB> it may well be that a new approach will be so fundamentally different that we need new implementations of every simd function, and that pretty much all existing code will eventually be rewritten
[9:44am] <michaelni> also you(plural) should talk with pedro arthur he imlemented a more generic filter path last year
[9:45am] <michaelni> and that code is enabled and works but i think it itself needs cleanup and documentation
[9:45am] <BBB> a kernel style approach has issues btw, as can be seen in the avscale document
[9:45am] <ubitux> BBB: small parenthesis about xyz: it appears many people use lut3d to do the convert
[9:45am] <ubitux> (vf lut3d does that)
[9:46am] <ubitux> (i had a local patch to support xyz as input, needs to upstream it, but i have some format negociation issues)
[9:46am] <BBB> I actually wrote avscale years ago, long before libav started caring about it, I tihnk I talked about it on irc back then
[9:46am] » michaelni  maybe wasnt on iRC back then 
[9:46am] <BBB> the issue with kernels is that you tend to want to design it in such a way (naturally) that the number of memory intermediates goes up a lot compared to swscale
[9:47am] <BBB> so this needs to be designed by ery knowledgable people who understand performance
[9:47am] <BBB> and will end up with a lot of functions, some near-duplicates, to do the same thing, if you want the same performance as you have with swscale
[9:47am] <BBB> (in all cases)
[9:47am] <michaelni> BBB i think you absolutely must talk with pedro arthur about this
[9:47am] <BBB> so, lots of code, or slightly slower
[9:48am] <michaelni> as pedro did alot of work on swscale filter stuff and that is i think very similar to kernels
[9:48am] <BBB> (and you can see why I eventually threw my avscale out of the window, I just didn’t care enough to write so much code)
[9:48am] <BBB> so, simd is largely a game about assumptions, right? as kierank sayd, we have tons of simd that doesn’t work right on odd image sizes or non-multiple-of-4-s etc.
[9:49am] <BBB> a nice thing of a new api is that we can document assumptions
[9:49am] <BBB> e.g. “buffers should have at least this much padding"
[9:49am] <BBB> and then just require that to be the case
[9:49am] <BBB> and then we can just overwrite in our simd instead of underconvert
[9:49am] <BBB> (or, worst of all, revert to C, which is just so ugly that I don’t know what to say)
[9:50am] <BBB> (swscale does a combination of all 3)
[9:50am] <michaelni> the lack of SIMD assumtation docs is a real problem, yes
[9:50am] <michaelni> yes
[9:50am] <BBB> and about pedro… I can look back at what he did, but you see my point about students, right? this is the job of a maintiner, not a student
[9:51am] <BBB> if we want pedro to help discussing this, he should be here, or we should get him here. is he on irc?
[9:51am] <michaelni> pedro is no student anymore in the gsoc sense 
[9:51am] <BBB> is he ready to be maintainer?
[9:51am] <michaelni> i suspect he isnt on irc but yes he should join ideally
[9:51am] <BBB> in the short term, none of this will happen, which is why I wrote the filter
[9:52am] <BBB> it fixes my immediate problem and will allow me to move forward with actual things I’d like to do
[9:52am] <michaelni> i think he is ready but maybe he lacks confidence or maybe time i dont know
[9:52am] <michaelni> pedro agreed to co/backup mentor a swscale task if we had a student (which we dont)
[9:56am] <BBB> michaelni: is pedro’s filter chaining work documented anywhere? or where do I find it?
[9:58am] <BBB> ubitux: for video, I doubt people use xyz much, it’s primarily an image thing to actually ever want the data in xyz… but xyz is an intermediate useful to convert between various rgb chromaticity primaries (“colorspaces”)
[9:59am] <michaelni> BBB iam not sure its documented all that much, pedro last summer ended up doing more than i wanted and possibly good docs where one thing that he forgot
[10:00am] <BBB> where do I look for the code?
[10:00am] <BBB> I should make a wiki page out of this discussion so we can refer to it in future continuations
[10:01am] <michaelni> git log -p --author Arthur
[10:02am] <durandal_1707> I can help with coding if blueprints are there
[10:27am] <ubitux> BBB: about assembly, the unscaled path is actually very complex to handle
[10:27am] <ubitux> first, it has way too much arguments
[10:28am] <ubitux> 2nd, to workaround this, since it's using inline asm in x86, it passes the sws context
[10:28am] <ubitux> in which the fields are duplicated and directly readable
[10:28am] <BBB> I think we should add that, I hadn’t complained that much about the unscaled optimizations yet, indeed
[10:28am] <ubitux> (by offsetting the context with some)
[10:28am] <BBB> but some of that code is hideous also
[10:29am] <ubitux> (with some +macro)
[10:29am] <BBB> yeah I remember
[10:29am] <BBB> the fast bilinear scaler is also very “"interesting""
[10:29am] <BBB> (it’s fast, I admit)
[10:30am] <ubitux> BBB: also slicing
[10:30am] <BBB> I have surprisingly few opinions on slicing
[10:30am] <BBB> some people seem to hate it, I don’t really care
[10:30am] <ubitux> do you know how it's done currently?
[10:30am] <BBB> roughly, yes
[10:31am] <BBB> I wonder if it isn’t easier if slicing was done internally and externally it just had a “-threads” parameter that users can set
[10:31am] <ubitux> it's not threadable slicing, it's just for the destination
[10:31am] <BBB> I always thought it was for threading
[10:31am] <ubitux> no
[10:32am] <BBB> :-P
[10:32am] <BBB> well good thing I had no opinion on it I guess
[10:32am] <ubitux> apparently it was about some cache locality of some sort
[10:32am] <michaelni> btw as slicing is mentioned, theres the use case where the whole image doesnt fit in memory (gimp does that) this may be worth a thought in case of a redesign
[10:33am] <ubitux> michaelni: doesn't work by squares?
[10:33am] <ubitux> only slices?
[10:34am] <ubitux> photoshop works by big blocks (call it tiles/squares/bigblocks/whatever)
[10:34am] <michaelni> to clarify we dont support that currently, gimp uses its own scaler
[10:34am] <michaelni> i just saw that gimp caching code a long time ago (which used squares)
[10:34am] <michaelni> or rectangles dont remember