Opened 9 years ago
Closed 6 years ago
#5568 closed enhancement (fixed)
POWER8 VSX vectorization libswscale/swscale.c
Reported by: David Edelsohn
Priority: wish
Component: swscale
Version: git-master
Keywords: bounty vsx
Cc: cand@gmx.com
Reproduced by developer: no
Analyzed by developer: no
Description
Optimize approximately 17 functions in libswscale/swscale.c for POWER8 VSX SIMD instructions on PPC64 Linux.
Change History (45)
comment:1 by , 9 years ago
comment:2 by , 9 years ago (follow-up: comment:9)
swscale methods frequently appear high in profiles. The routines have been optimized for x86. IBM is sponsoring bounties to enable similar SIMD optimizations for the POWER architecture.
comment:3 by , 9 years ago
Keywords: bounty added
Version: unspecified → git-master
comment:4 by , 9 years ago
Are there already developers working on this task, or will you add information about what you expect (at least) and how big the bounty is?
comment:5 by , 9 years ago
Some developers have already stated interest and started to work on the project. As a bounty, anyone is welcome to work on it. The bounties are posted at bountysource.com.
https://www.bountysource.com/issues/34315029-power8-vsx-vectorization-libswscale-swscale-c
comment:6 by , 9 years ago
Priority: normal → wish
Status: new → open
Please consider looking at ticket #5508
comment:9 by , 8 years ago
Replying to edelsohn:
swscale methods frequently appear high in profiles. The routines have been optimized for x86. IBM is sponsoring bounties to enable similar SIMD optimizations for the POWER architecture.
It would be helpful if access to such hardware was provided or if a machine could be donated. Is this a possibility?
comment:10 by , 8 years ago
IBM is already running and reporting FATE regularly, and free access to VMs is available for Open Source developers.
comment:11 by , 6 years ago (follow-up: comment:12)
Hi guys. I have something toward a vectorized hScale8To15_c from swscale.c. Profiling with callgrind shows it's a little faster than the unoptimized version, but not as fast as the existing AltiVec version. I hope to figure out what's causing the difference and improve it.
Unoptimized - 11,140,574
VSX optimized - 9,670,008
Altivec optimized - 3,511,966
Does this seem like the right direction?
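For context, the scalar routine being vectorized is roughly the following (a paraphrase of hScale8To15_c from libswscale/swscale.c; exact details such as the clipping may differ in the real source). Each output sample is a filter-weighted sum of input bytes, shifted down to 15-bit range:

    static void hScale8To15_c(SwsContext *c, int16_t *dst, int dstW,
                              const uint8_t *src, const int16_t *filter,
                              const int32_t *filterPos, int filterSize)
    {
        for (int i = 0; i < dstW; i++) {
            int srcPos = filterPos[i];
            int val    = 0;
            for (int j = 0; j < filterSize; j++)
                val += (int)src[srcPos + j] * filter[filterSize * i + j];
            dst[i] = FFMIN(val >> 7, (1 << 15) - 1); /* clip to 15 bits */
        }
    }

The inner loop is a short dot product, which is why SIMD multiply-sum instructions map onto it well.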
comment:12 by , 6 years ago
Replying to bookmoons:
Hi guys. I have something toward a vectorized hScale8To15_c from swscale.c. Profiling with callgrind
(You are expected to test with FFmpeg's TIMER macros unless this does not work for some reason.)
shows it's a little faster than the unoptimized version, but not as fast as the existing AltiVec version. I hope to figure out what's causing the difference and improve it.
Unoptimized - 11,140,574
VSX optimized - 9,670,008
Altivec optimized - 3,511,966
Does this seem like the right direction?
The direction may be all right, but to be accepted into the codebase the speed has to improve significantly.
If you are interested in the bounty, make sure to first optimize one function to learn about our requirements: a patch that unfortunately contained a lot of work was rejected because it only offered minimal speed improvements.
comment:13 by , 6 years ago
Thank you very much, cehoyos. I will look into those TIMER macros and start with one function.
If anyone knows: is there a resource with guidance on PowerPC optimization? For x86, for example, Intel provides an optimization reference manual.
comment:14 by , 6 years ago
POWER8 processor optimization guide: https://www.redbooks.ibm.com/redbooks/pdfs/sg248171.pdf
comment:15 by , 6 years ago
In addition to comment:12, this is a relevant message:
https://ffmpeg.org/pipermail/ffmpeg-devel/2016-July/196395.html
comment:17 by , 6 years ago
Additional optimization information for Power:
https://developer.ibm.com/linuxonpower/2018/05/29/lop-port-new-articles/
comment:18 by , 6 years ago
This optimization guide is a satisfying read. Some of those programs look great, @edelsohn; I will look into them.
I think I've got my head around the TIMER macros, so I'm going to try profiling soon.
comment:19 by , 6 years ago (follow-up: comment:27)
Is there some pattern for using the TIMER macros?
I've tried wrapping them around the callsites. That seems to be how it's done everywhere else.
    START_TIMER
    c->hyScale(c, (int16_t *)dst[dst_pos], dstW,
               (const uint8_t *)src[src_pos], instance->filter,
               instance->filter_pos, instance->filter_size);
    STOP_TIMER("hyScale")
Then I build with --enable-linux-perf. It fails with:
error: implicit declaration of function 'syscall'
syscall is used in the START_TIMER macro. Its header is unistd.h, which is included by timer.h, so I'm not sure what's happening.
comment:20 by , 6 years ago
Sorry, I got some numbers by building without --enable-linux-perf. I guess it falls through to some CPU-specific timing method.
follow-up: 26 comment:21 by , 6 years ago
I'm getting some numbers from the TIMER macros. But the run counts bounce up and down. Does that seem right?
3521 UNITS in hscale, 32 runs, 0 skips
6714 UNITS in hscale, 64 runs, 0 skips
6456 UNITS in hscale, 128 runs, 0 skips
3182 UNITS in hscale, 64 runs, 0 skips
3344 UNITS in hscale, 64 runs, 0 skips
3375 UNITS in hscale, 128 runs, 0 skips
3419 UNITS in hscale, 128 runs, 0 skips
6693 UNITS in hscale, 255 runs, 1 skips
What I've done is wrap the callsites as shown above. There are 4 of them, all through function pointers. I gave all 4 the same name "hscale".
comment:22 by , 6 years ago
You should choose four different names.
The unit count for the highest number of runs with (nearly) no skips is the relevant number.
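For example (a sketch with hypothetical call sites; only the distinct id strings matter), each wrapped call then reports its statistics under its own label:

    #include "libavutil/timer.h"

    void scale_luma_row(void);      /* hypothetical call sites */
    void scale_chroma_row(void);

    static void run_both(void)
    {
        {
            START_TIMER
            scale_luma_row();
            STOP_TIMER("hscale_luma")    /* distinct id per call site */
        }
        {
            START_TIMER
            scale_chroma_row();
            STOP_TIMER("hscale_chroma")
        }
    }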
comment:23 by , 6 years ago
Alright, initial numbers for a small run with the relevant versions.
21410 Plain C
19616 New VSX optimized
06340 Altivec optimized
comment:26 by , 6 years ago (follow-up: comment:28)
Replying to bookmoons:
What I've done is wrap the callsites as shown above.
That is correct.
There are 4 of them, all through function pointers.
I only found two instances; assuming alpha scaling is less important (and done the same way), only the first call in libswscale/hscale.c is relevant for a speed measurement.
How does the command line look that you used for testing?
comment:27 by , 6 years ago
Replying to bookmoons:
Then building with --enable-linux-perf. It fails with:
error: implicit declaration of function 'syscall'
This is a particularly difficult feature to use ;-)
You have to add #include "libavutil/timer.h" (or #define _GNU_SOURCE) at the top of the file where you use the TIMER macros; this was mentioned in the original commit message: f61379cbd45a91b26c7a1ddd3f16417466c435cd
(Took me some time to understand.)
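Concretely, the define has to come before the first include of unistd.h, otherwise syscall() remains undeclared (a minimal sketch):

    /* very top of the file, before any other header */
    #define _GNU_SOURCE
    #include "libavutil/timer.h"   /* pulls in unistd.h; syscall() now visible */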
comment:28 by , 6 years ago
How does the command line look that you used for testing?
I used the scaling example in the wiki, with the image given there:
https://trac.ffmpeg.org/wiki/Scaling
ffmpeg -i input.jpg -vf scale=320:240 output.png
A call trace showed that this hits the hScale8To15_c function (with -cpuflags 0), so I chose that one to start.
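(Presumably that means running the wiki command with SIMD disabled, something like the following, so the plain C path is exercised:)
ffmpeg -cpuflags 0 -i input.jpg -vf scale=320:240 output.png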
comment:29 by , 6 years ago
The following may provide more stable numbers (you can compare with your results to verify):
$ ffmpeg -loop 1 -i input.jpg -vf scale=320:240 -vframes 100 -f null -
comment:31 by , 6 years ago
I have a first build of a skeleton asm version. This is promising.
Numbers from the new command line (with my poor exploratory code mentioned above). Now timing just the one call as recommended.
21340 Plain C
19609 New VSX optimized
 6214 Altivec optimized
It does help stabilize: with the old command, the first run would sometimes be wildly higher. There are a few lines with skips now, so I'm taking the last line with 0-2 skips.
comment:32 by , 6 years ago
Timed the same command on the major x86 extensions so we can compare. Column 3 is the cpuflags value.
0.19  9374 sse2
0.19  9383 sse4.1
0.19  9383 ssse3
1.00 48616 0
1.10 53473 sse3
1.10 53509 mmx
1.10 53543 fma4
1.10 53548 avx
1.10 53622 fma3
1.11 53730 sse4.2
1.11 54172 avx2
1.12 54221 sse
Most of them are strangely slower than plain C, and consistently so; I don't know what that's about. But sse2, sse4.1, and ssse3 are all >4x faster.
comment:34 by , 6 years ago
Relative times for the current code.
0.29  6214 altivec
0.92 19609 vsx
1.00 21340 0
comment:35 by , 6 years ago
Minicloud is under maintenance until 7 Jul, so I've lost my test environment for a while.
@edelsohn The IBM Power Development Cloud seems like a nice alternative. Registration asks for company details. Do you know if there's a way to get access as an individual developer?
I looked at the Brno offering. FYI, they seem to have expanded the available OSs: the cloud resources page says RHEL only, while the request form now lists Fedora, openSUSE, Debian, and Ubuntu, and no RHEL. In case you can get word to whoever can update that page.
comment:36 by , 6 years ago
If you want an alternative, please request an account for the OSUOSL Power development systems for Open Source developers:
http://osuosl.org/services/powerdev/
You can list me as the sponsor.
comment:38 by , 6 years ago
I believe I have a reasonable unoptimized version. I'm sure there are kinks to work out; I'll have to find them when I get access to a VM again.
Once that's done, would it be OK to submit it for an early review, to hopefully catch any obvious problems?
comment:39 by , 6 years ago
Once your VSX function clearly beats AltiVec, it would be a good idea to send the first patch for review, yes.
comment:40 by , 6 years ago
The swscale.c functions with x86 versions now have ppc versions.
Speedups:
hyscale_fast: 4.27
hcscale_fast: 4.48 (x86 MMX is 4.8)
hScale8To19: 2.26 (x86 SSE2 is 2.32)
hScale16To19: 2 (x86 SSE2 is 2.37)
hScale16To15: 2.06
We're within a few percent of the x86 versions. I think this is a good result, since the ppc code is generic for all filter sizes, while x86 goes to some lengths to get the best performance: the _fast MMX versions use runtime-generated in-memory code, and the SSE2 hscale functions have hardcoded versions for specific filter sizes, one of which was hit by my test case (it's possible the generic SSE2 version is slower than the generic ppc one).
I didn't see a need to touch the one existing ppc hscale function mentioned in the comments above. It was already fast and wouldn't benefit from the newer instructions, and it already uses VSX unaligned loads on VSX platforms, so it's not AltiVec-only.
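To give a flavor of what such a routine involves (a minimal sketch only, not the submitted patch: it assumes filterSize is a multiple of 8, widens source bytes with a scalar loop where real code would use vector permutes, and needs -maltivec -mvsx on POWER8):

    #include <altivec.h>
    #include <stdint.h>

    /* One hScale8To15-style output sample via AltiVec/VSX intrinsics. */
    static int16_t hscale_one_sample(const uint8_t *src, const int16_t *filter,
                                     int srcPos, int filterSize)
    {
        vector signed int acc = vec_splats(0);
        int16_t tmp[8] __attribute__((aligned(16)));

        for (int j = 0; j < filterSize; j += 8) {
            for (int k = 0; k < 8; k++)       /* widen u8 -> s16 */
                tmp[k] = src[srcPos + j + k];
            vector signed short pix  = vec_ld(0, tmp);
            vector signed short coef = vec_vsx_ld(0, filter + j); /* unaligned ok */
            acc = vec_msum(pix, coef, acc);   /* 8 MACs -> 4 partial sums */
        }
        /* endian-safe horizontal sum of the 4 partial sums */
        int val = vec_extract(acc, 0) + vec_extract(acc, 1) +
                  vec_extract(acc, 2) + vec_extract(acc, 3);
        val >>= 7;
        return val > 32767 ? 32767 : val;     /* clip to 15 bits */
    }

vec_msum does the eight 16-bit multiplies and partial accumulation in one instruction, which is the core of the speedup over the scalar inner loop.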
comment:41 by , 6 years ago
Cc: added
comment:43 by , 6 years ago
I think the patches provided are good enough for the initial VSX optimization this feature request asked for.
comment:45 by , 6 years ago
Resolution: → fixed
Status: open → closed
Could you elaborate? Ideally before opening more tickets...
Are you planning to send patches?