Opened 5 years ago

Closed 2 years ago

#5568 closed enhancement (fixed)

POWER8 VSX vectorization libswscale/swscale.c

Reported by: David Edelsohn
Owned by:
Priority: wish
Component: swscale
Version: git-master
Keywords: bounty vsx
Cc: cand@gmx.com
Blocked By:
Blocking:
Reproduced by developer: no
Analyzed by developer: no

Description

Optimize approximately 17 functions in libswscale/swscale.c for POWER8 VSX SIMD instructions on PPC64 Linux.

Change History (45)

comment:1 by Carl Eugen Hoyos, 5 years ago

Could you elaborate? Ideally before opening more tickets...
Are you planning to send patches?

comment:2 by David Edelsohn, 5 years ago

swscale methods frequently appear high in profiles. The routines have been optimized for x86. IBM is sponsoring bounties to enable similar SIMD optimizations for the POWER architecture.

Last edited 5 years ago by David Edelsohn (previous) (diff)

comment:3 by David Edelsohn, 5 years ago

Keywords: bounty added
Version: unspecified → git-master

comment:4 by Carl Eugen Hoyos, 5 years ago

Are there already developer(s) working on your task or will you add information about what you expect (at least) and how big the bounty is?

comment:5 by David Edelsohn, 5 years ago

Some developers have already stated interest and started working on the project. Since this is a bounty, anyone is welcome to work on it. The bounties are posted at bountysource.com.

https://www.bountysource.com/issues/34315029-power8-vsx-vectorization-libswscale-swscale-c

comment:6 by Carl Eugen Hoyos, 5 years ago

Priority: normal → wish
Status: new → open

Please consider looking at ticket #5508

comment:7 by Carl Eugen Hoyos, 5 years ago

Keywords: vsx added; bounty removed

The bounty has disappeared.

comment:8 by Carl Eugen Hoyos, 5 years ago

Keywords: bounty added

Or maybe not.

in reply to:  2 comment:9 by llogan, 5 years ago

Replying to edelsohn:

swscale methods frequently appear high in profiles. The routines have been optimized for x86. IBM is sponsoring bounties to enable similar SIMD optimizations for the POWER architecture.

It would be helpful if access to such hardware was provided or if a machine could be donated. Is this a possibility?

comment:10 by David Edelsohn, 5 years ago

IBM is already running and reporting FATE regularly, and free access to VMs for Open Source developers is available.

comment:11 by bookmoons, 3 years ago

Hi guys. I have something toward a vectorized hScale8To15_c from swscale.c. Profiling with callgrind shows it's a little faster than the unoptimized version, but not as fast as the extant altivec version. Hope to figure out what's causing the difference and improve it.

Unoptimized - 11,140,574
VSX optimized - 9,670,008
Altivec optimized - 3,511,966

Does this seem like the right direction?

Last edited 3 years ago by bookmoons (previous) (diff)

in reply to:  11 comment:12 by Carl Eugen Hoyos, 3 years ago

Replying to bookmoons:

Hi guys. I have something toward a vectorized hScale8To15_c from swscale.c. Profiling with callgrind

(You are expected to test with FFmpeg's TIMER macros unless this does not work for some reason.)

shows it's a little faster than the unoptimized version, but not as fast as the extant altivec version. Hope to figure out what's causing the difference and improve it.

Unoptimized - 11,140,574
VSX optimized - 9,670,008
Altivec optimized - 3,511,966

Does this seem like the right direction?

The direction may be all right, but to be accepted into the codebase the speed has to be improved significantly.
If you are interested in the bounty, make sure to first optimize one function to learn about our requirements: a patch that unfortunately contained a lot of work was rejected because it only offered minimal speed improvements.

comment:13 by bookmoons, 3 years ago

Thank you very much cehoyos. Will look into those TIMER macros and start with one function.

If anyone knows: is there a resource with guidance on PowerPC optimization? For x86 for example Intel provides an optimization reference manual.

comment:14 by David Edelsohn, 3 years ago

comment:15 by Carl Eugen Hoyos, 3 years ago

Last edited 3 years ago by Carl Eugen Hoyos (previous) (diff)

comment:16 by bookmoons, 3 years ago

Wonderful, thanks guys.

comment:17 by David Edelsohn, 3 years ago

comment:18 by bookmoons, 3 years ago

This optimization guide is a satisfying read. Some of those programs look great @edelsohn, will look into them.

Think I've got my head around the TIMER macros, so I'm going to try profiling soon.

comment:19 by bookmoons, 3 years ago

Is there some pattern for using the TIMER macros?

I've tried wrapping them around the callsites. That seems to be how it's done everywhere else.

START_TIMER
c->hyScale(c, (int16_t*)dst[dst_pos], dstW, (const uint8_t *)src[src_pos], instance->filter,
           instance->filter_pos, instance->filter_size);
STOP_TIMER("hyScale")

Then building with --enable-linux-perf. It fails with:

error: implicit declaration of function 'syscall'

syscall is used in the START_TIMER macro. The header for it is unistd.h, which is included by timer.h, so I'm not sure what's happening.

comment:20 by bookmoons, 3 years ago

Sorry, I got some numbers by building without --enable-linux-perf. I guess it falls back to some CPU-specific timing method.

comment:21 by bookmoons, 3 years ago

I'm getting some numbers from the TIMER macros. But the run counts bounce up and down. Does that seem right?

3521 UNITS in hscale, 32 runs, 0 skips
6714 UNITS in hscale, 64 runs, 0 skips
6456 UNITS in hscale, 128 runs, 0 skips
3182 UNITS in hscale, 64 runs, 0 skips
3344 UNITS in hscale, 64 runs, 0 skips
3375 UNITS in hscale, 128 runs, 0 skips
3419 UNITS in hscale, 128 runs, 0 skips
6693 UNITS in hscale, 255 runs, 1 skips

What I've done is wrap the callsites as shown above. There are 4 of them, all through function pointers. I gave all 4 the same name "hscale".

comment:22 by Carl Eugen Hoyos, 3 years ago

You should choose four different names.
The unit count for the highest number of runs with (nearly) no skips is the relevant number.

comment:23 by bookmoons, 3 years ago

Alright, initial numbers for a small run with the relevant versions.

21410 Plain C
19616 New VSX optimized
06340 Altivec optimized

comment:24 by Carl Eugen Hoyos, 3 years ago

Are you testing this on the same hardware?

comment:25 by bookmoons, 3 years ago

Yeah, all on the same machine. I'm using a VM from Minicloud.

in reply to:  21 ; comment:26 by Carl Eugen Hoyos, 3 years ago

Replying to bookmoons:

What I've done is wrap the callsites as shown above.

That is correct.

There are 4 of them, all through function pointers.

I only found two instances; assuming alpha scaling is less important (and done the same way), only the first call in libswscale/hscale.c is relevant for a speed measurement.
How does the command line look that you used for testing?

in reply to:  19 comment:27 by Carl Eugen Hoyos, 3 years ago

Replying to bookmoons:

Then building with --enable-linux-perf. It fails with:

error: implicit declaration of function 'syscall'

This is a particularly difficult feature to use;-)
You have to add #include "libavutil/timer.h" (or #define _GNU_SOURCE) at the top of the file where you use the TIMER macros; this was mentioned in the original commit message: f61379cbd45a91b26c7a1ddd3f16417466c435cd
(Took me some time to understand.)

in reply to:  26 comment:28 by bookmoons, 3 years ago

How does the command line look that you used for testing?

I used the scaling example in the wiki, with the image given there:
https://trac.ffmpeg.org/wiki/Scaling

ffmpeg -i input.jpg -vf scale=320:240 output.png

A call trace showed that this hits the hScale8To15_c function (with -cpuflags 0). So I chose that one to start.

comment:29 by Carl Eugen Hoyos, 3 years ago

The following may provide more stable numbers (you can compare with your results to verify):

$ ffmpeg -loop 1 -i input.jpg -vf scale=320:240 -vframes 100 -f null -

comment:30 by bookmoons, 3 years ago

Nice, thank you. Will give that a try.

comment:31 by bookmoons, 3 years ago

I have a first build of a skeleton asm version. This is promising.

Numbers from the new command line (with my poor exploratory code mentioned above). Now timing just the one call as recommended.

21340  Plain C
19609  New VSX optimized
 6214  Altivec optimized

It does help stabilize. With the old command the first run would sometimes be wildly higher. There are a few lines with skips now, so I'm taking the last line with 0-2 skips.

comment:32 by bookmoons, 3 years ago

Timed the same command on the major x86 extensions so we can compare. Column 3 is the cpuflags value.

0.19  9374 sse2
0.19  9383 sse4.1
0.19  9383 ssse3
1.00 48616 0
1.10 53473 sse3
1.10 53509 mmx
1.10 53543 fma4
1.10 53548 avx
1.10 53622 fma3
1.11 53730 sse4.2
1.11 54172 avx2
1.12 54221 sse

Most of them are strangely slower than plain C. It's consistent. I don't know what that's about. But SSE2, SSE4.1, and SSSE3 are all >4x faster.

comment:33 by Carl Eugen Hoyos, 3 years ago

In this specific case, you have to (clearly) beat the AltiVec speed.

comment:34 by bookmoons, 3 years ago

Relative times for the current code.

0.29  6214 altivec
0.92 19609 vsx
1.00 21340 0

comment:35 by bookmoons, 3 years ago

Minicloud is under maintenance until 7 Jul, so I've lost my test environment for a while.

@edelsohn The IBM Power Development Cloud seems like a nice alternative. Registration asks for company details. Do you know if there's a way to get access as an individual developer?

I looked at the Brno offering. FYI, they seem to have expanded the available OSes: the cloud resources page says RHEL only, but the request form now lists Fedora, openSUSE, Debian, and Ubuntu, with no RHEL. In case you can get word to whoever can update that page.

comment:36 by David Edelsohn, 3 years ago

If you want an alternative, please request an account at OSUOSL Power development systems for Open Source developers

http://osuosl.org/services/powerdev/

You can list me as the sponsor.

comment:37 by bookmoons, 3 years ago

Thank you very much @edelsohn. I've sent a request.

comment:38 by bookmoons, 3 years ago

I believe I have a reasonable unoptimized version. I'm sure there are kinks to work out; I'll have to find them when I get access to a VM again.

Once that's done, would it be OK to submit it for an early review, to hopefully catch any obvious problems?

comment:39 by Carl Eugen Hoyos, 3 years ago

Once your VSX function clearly beats AltiVec, it would be a good idea to send the first patch for review, yes.

comment:40 by cand, 2 years ago

The swscale.c funcs with x86 versions now have ppc versions.

Speedups:
hyscale_fast: 4.27
hcscale_fast: 4.48 (x86 MMX is 4.8)
hScale8To19: 2.26 (x86 SSE2 is 2.32)
hScale16To19: 2 (x86 SSE2 is 2.37)
hScale16To15: 2.06

We're within a few percent of the x86 versions. I think this is a good result, since the ppc code is generic for all filter sizes, while the x86 code goes to great lengths to get the best performance. The _fast MMX versions use runtime-generated in-memory code, while the SSE2 hscale funcs have hardcoded versions for specific filter sizes (one of which was hit by my test case - it's possible the generic SSE2 version is slower than the generic ppc one).

I didn't see a need to touch the one existing ppc hscale func mentioned in the above comments. It was already fast, and wouldn't benefit from the newer instructions. It already uses the VSX unaligned loads on VSX platforms, so it's not Altivec-only.

comment:41 by cand, 2 years ago

Cc: cand@gmx.com added

comment:42 by cand, 2 years ago

@edelsohn: Please let me know if other changes are needed.

comment:43 by David Edelsohn, 2 years ago

I think the patches provided are good enough for the initial VSX optimization of this feature request. Have all of the patches been approved and merged?

Last edited 2 years ago by David Edelsohn (previous) (diff)

comment:44 by cand, 2 years ago

Yes, they're all in as of Tuesday.

comment:45 by David Edelsohn, 2 years ago

Resolution: fixed
Status: open → closed