Opened 2 years ago

Last modified 6 weeks ago

#5568 open enhancement

POWER8 VSX vectorization libswscale/swscale.c

Reported by: edelsohn
Owned by:
Priority: wish
Component: swscale
Version: git-master
Keywords: bounty vsx
Cc:
Blocked By:
Blocking:
Reproduced by developer: no
Analyzed by developer: no

Description

Optimize approximately 17 functions in libswscale/swscale.c for POWER8 VSX SIMD instructions on PPC64 Linux.

Change History (39)

comment:1 Changed 2 years ago by cehoyos

Could you elaborate? Ideally before opening more tickets...
Are you planning to send patches?

comment:2 follow-up: Changed 2 years ago by edelsohn

swscale methods frequently appear high in profiles. The routines have been optimized for x86. IBM is sponsoring bounties to enable similar SIMD optimizations for the POWER architecture.


comment:3 Changed 2 years ago by edelsohn

  • Keywords bounty added
  • Version changed from unspecified to git-master

comment:4 Changed 2 years ago by cehoyos

Are there already developer(s) working on your task or will you add information about what you expect (at least) and how big the bounty is?

comment:5 Changed 2 years ago by edelsohn

Some developers already have stated interest and started to work on the project. As a bounty, anyone is welcome to work on it. The bounties are posted at bountysource.com.

https://www.bountysource.com/issues/34315029-power8-vsx-vectorization-libswscale-swscale-c

comment:6 Changed 2 years ago by cehoyos

  • Priority changed from normal to wish
  • Status changed from new to open

Please consider looking at ticket #5508

comment:7 Changed 23 months ago by cehoyos

  • Keywords vsx added; bounty removed

The bounty has disappeared.

comment:8 Changed 23 months ago by cehoyos

  • Keywords bounty added

Or maybe not.

comment:9 in reply to: ↑ 2 Changed 23 months ago by llogan

Replying to edelsohn:

swscale methods frequently appear high in profiles. The routines have been optimized for x86. IBM is sponsoring bounties to enable similar SIMD optimizations for the POWER architecture.

It would be helpful if access to such hardware was provided or if a machine could be donated. Is this a possibility?

comment:10 Changed 23 months ago by edelsohn

IBM already is running and reporting FATE regularly. And free access to VMs for Open Source developers is available.

comment:11 follow-up: Changed 8 weeks ago by bookmoons

Hi guys. I have something toward a vectorized hScale8To15_c from swscale.c. Profiling with callgrind shows it's a little faster than the unoptimized version, but not as fast as the extant altivec version. Hope to figure out what's causing the difference and improve it.

Unoptimized - 11,140,574
VSX optimized - 9,670,008
Altivec optimized - 3,511,966

Does this seem like the right direction?
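
For readers unfamiliar with the function under discussion, the scalar loop being vectorized has roughly the following shape. This is a minimal sketch, not a copy of the upstream source: the name hscale_8_to_15_sketch is hypothetical, the SwsContext argument is omitted, and the shift by 7 with a 15-bit clamp is an assumption about how the 8-bit to 15-bit path scales its result.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the scalar horizontal scaler: each output
 * sample is a dot product of filter_size source bytes with 16-bit
 * coefficients, shifted down by 7 and clamped to 15 bits.
 * The shift and clamp values are assumptions, not taken from the thread. */
void hscale_8_to_15_sketch(int16_t *dst, int dst_w,
                           const uint8_t *src,
                           const int16_t *filter,
                           const int32_t *filter_pos,
                           int filter_size)
{
    assert(dst_w >= 0 && filter_size > 0);
    for (int i = 0; i < dst_w; i++) {
        int src_pos = filter_pos[i];   /* first source pixel for output i */
        int val = 0;
        for (int j = 0; j < filter_size; j++)
            val += (int)src[src_pos + j] * filter[filter_size * i + j];
        val >>= 7;
        dst[i] = (int16_t)(val > 32767 ? 32767 : val);
    }
}
```

The inner loop is the part a SIMD version would replace, computing several taps of the dot product per instruction instead of one.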


comment:12 in reply to: ↑ 11 Changed 8 weeks ago by cehoyos

Replying to bookmoons:

Hi guys. I have something toward a vectorized hScale8To15_c from swscale.c. Profiling with callgrind

(You are expected to test with FFmpeg's TIMER macros unless this does not work for some reason.)

shows it's a little faster than the unoptimized version, but not as fast as the extant altivec version. Hope to figure out what's causing the difference and improve it.

Unoptimized - 11,140,574
VSX optimized - 9,670,008
Altivec optimized - 3,511,966

Does this seem like the right direction?

The direction may be all right but to be accepted in the codebase the speed has to be improved significantly.
If you are interested in the bounty, make sure to first optimize one function to learn about our requirements: A patch that unfortunately contained a lot of work was rejected because it only offered minimal speed improvements.

comment:13 Changed 8 weeks ago by bookmoons

Thank you very much cehoyos. Will look into those TIMER macros and start with 1 function.

If anyone knows: is there a resource with guidance on PowerPC optimization? For x86 for example Intel provides an optimization reference manual.

comment:14 Changed 8 weeks ago by edelsohn

comment:15 Changed 8 weeks ago by cehoyos


comment:16 Changed 8 weeks ago by bookmoons

Wonderful, thanks guys.

comment:17 Changed 8 weeks ago by edelsohn

comment:18 Changed 8 weeks ago by bookmoons

This optimization guide is a satisfying read. Some of those programs look great @edelsohn, will look into them.

Think I've got my head around the TIMER macros, so I'm going to try profiling soon.

comment:19 follow-up: Changed 8 weeks ago by bookmoons

Is there some pattern for using the TIMER macros?

I've tried wrapping them around the callsites. That seems to be how it's done everywhere else.

START_TIMER
c->hyScale(c, (int16_t*)dst[dst_pos], dstW, (const uint8_t *)src[src_pos], instance->filter,
           instance->filter_pos, instance->filter_size);
STOP_TIMER("hyScale")

Then building with --enable-linux-perf. It fails with:

error: implicit declaration of function 'syscall'

syscall is used in the START_TIMER macro. The header for it is unistd.h, which is included by timer.h, so I'm not sure what's happening.

comment:20 Changed 8 weeks ago by bookmoons

Sorry, I got some numbers by building without --enable-linux-perf. Guess it falls through to some CPU-specific timing method.

comment:21 follow-up: Changed 8 weeks ago by bookmoons

I'm getting some numbers from the TIMER macros. But the run counts bounce up and down. Does that seem right?

3521 UNITS in hscale, 32 runs, 0 skips
6714 UNITS in hscale, 64 runs, 0 skips
6456 UNITS in hscale, 128 runs, 0 skips
3182 UNITS in hscale, 64 runs, 0 skips
3344 UNITS in hscale, 64 runs, 0 skips
3375 UNITS in hscale, 128 runs, 0 skips
3419 UNITS in hscale, 128 runs, 0 skips
6693 UNITS in hscale, 255 runs, 1 skips

What I've done is wrap the callsites as shown above. There are 4 of them, all through function pointers. I gave all 4 the same name "hscale".

comment:22 Changed 8 weeks ago by cehoyos

You should choose four different names.
The unit count for the highest number of runs with (nearly) no skips is the relevant number.
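
The selection rule above can be sketched as a small parser over the timer output. This is only an illustration: the helper names parse_timer_line and pick_relevant_line are hypothetical, the line layout is assumed to match the lines quoted in this ticket, and "nearly no skips" is expressed as a caller-chosen max_skips threshold.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed line of timer output, e.g.
 * "3375 UNITS in hscale, 128 runs, 0 skips". */
struct timer_line {
    long units;
    long runs;
    long skips;
};

/* Parse one line; returns 1 on success, 0 if the line does not match.
 * The unit label (second word) is skipped, whatever it is. */
int parse_timer_line(const char *line, struct timer_line *out)
{
    char name[64];
    assert(line && out);
    return sscanf(line, "%ld %*s in %63[^,], %ld runs, %ld skips",
                  &out->units, name, &out->runs, &out->skips) == 4;
}

/* Return the index of the line with the most runs among those with at
 * most max_skips skips (later lines win ties), or -1 if none qualifies. */
int pick_relevant_line(const char *const *lines, int n, long max_skips)
{
    int best = -1;
    long best_runs = -1;
    for (int i = 0; i < n; i++) {
        struct timer_line t;
        if (!parse_timer_line(lines[i], &t) || t.skips > max_skips)
            continue;
        if (t.runs >= best_runs) {
            best_runs = t.runs;
            best = i;
        }
    }
    return best;
}
```

Letting later lines win ties is a design choice of this sketch, on the assumption that later averages are at least as settled as earlier ones with the same run count.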

comment:23 Changed 8 weeks ago by bookmoons

Alright, initial numbers for a small run with the relevant versions.

21410  Plain C
19616  New VSX optimized
 6340  Altivec optimized

comment:24 Changed 8 weeks ago by cehoyos

Are you testing this on the same hardware?

comment:25 Changed 8 weeks ago by bookmoons

Yeah, all on the same machine. I'm using a VM from Minicloud.

comment:26 in reply to: ↑ 21 ; follow-up: Changed 8 weeks ago by cehoyos

Replying to bookmoons:

What I've done is wrap the callsites as shown above.

That is correct.

There are 4 of them, all through function pointers.

I only found two instances. Assuming alpha scaling is less important (and done the same way), only the first call in libswscale/hscale.c is relevant for a speed measurement.
How does the command line look that you used for testing?

comment:27 in reply to: ↑ 19 Changed 8 weeks ago by cehoyos

Replying to bookmoons:

Then building with --enable-linux-perf. It fails with:

error: implicit declaration of function 'syscall'

This is a particularly difficult feature to use ;-)
You have to add #include "libavutil/timer.h" (or #define _GNU_SOURCE) at the top of the file where you use the TIMER macro; this was mentioned in the original commit message: f61379cbd45a91b26c7a1ddd3f16417466c435cd
(Took me some time to understand.)
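
The ordering requirement can be reproduced outside FFmpeg. The following is a minimal sketch, assuming glibc on Linux, where unistd.h only declares syscall() in a strict-standard build if a feature-test macro such as _GNU_SOURCE is defined before the first system header is included. The helper name pid_via_syscall is hypothetical and exists only to exercise the declaration.

```c
/* Must come before any system header is included. */
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>   /* SYS_* syscall numbers */
#include <sys/types.h>
#include <unistd.h>        /* declares syscall() once _GNU_SOURCE is set */

/* Hypothetical helper: fetch the process ID through the raw syscall()
 * interface, the function the thread reports START_TIMER using. */
long pid_via_syscall(void)
{
    long pid = syscall(SYS_getpid);
    assert(pid > 0);
    return pid;
}
```

If the define is moved below the includes, a strict build is expected to report the same implicit declaration error quoted above.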

comment:28 in reply to: ↑ 26 Changed 8 weeks ago by bookmoons

How does the command line look that you used for testing?

I used the scaling example in the wiki, with the image given there:
https://trac.ffmpeg.org/wiki/Scaling

ffmpeg -i input.jpg -vf scale=320:240 output.png

A call trace showed that this hits the hScale8To15_c function (with -cpuflags 0). So I chose that one to start.

comment:29 Changed 8 weeks ago by cehoyos

The following may provide more stable numbers (you can compare with your results to verify):

$ ffmpeg -loop 1 -i input.jpg -vf scale=320:240 -vframes 100 -f null -

comment:30 Changed 8 weeks ago by bookmoons

Nice, thank you. Will give that a try.

comment:31 Changed 8 weeks ago by bookmoons

I have a first build of a skeleton asm version. This is promising.

Numbers from the new command line (with my poor exploratory code mentioned above). Now timing just the one call as recommended.

21340  Plain C
19609  New VSX optimized
 6214  Altivec optimized

It does help stabilize. With the old command the first run would sometimes be wildly higher. There are a few lines with skips now, so I'm taking the last line with 0-2 skips.

comment:32 Changed 7 weeks ago by bookmoons

Timed the same command on the major x86 extensions so we can compare. Column 3 is the cpuflags value.

0.19  9374 sse2
0.19  9383 sse4.1
0.19  9383 ssse3
1.00 48616 0
1.10 53473 sse3
1.10 53509 mmx
1.10 53543 fma4
1.10 53548 avx
1.10 53622 fma3
1.11 53730 sse4.2
1.11 54172 avx2
1.12 54221 sse

Most of them are strangely slower than plain C. It's consistent. I don't know what that's about. But sse2, sse4.1, and ssse3 are all >4x faster.

comment:33 Changed 7 weeks ago by cehoyos

In this specific case, you have to (clearly) beat the AltiVec speed.

comment:34 Changed 7 weeks ago by bookmoons

Relative times for the current code.

0.29  6214 altivec
0.92 19609 vsx
1.00 21340 0
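
The first column is each unit count divided by the plain C count (cpuflags 0). A minimal sketch of that arithmetic follows; the function names are hypothetical and the rounding to two decimals is assumed to be round-to-nearest.

```c
#include <assert.h>

/* Time relative to the plain C baseline; below 1.0 means faster. */
double relative_time(long units, long baseline_units)
{
    assert(baseline_units > 0);
    return (double)units / (double)baseline_units;
}

/* Same ratio in hundredths, rounded to nearest, for table output. */
long relative_hundredths(long units, long baseline_units)
{
    assert(units >= 0 && baseline_units > 0);
    return (units * 100 + baseline_units / 2) / baseline_units;
}
```

Read the other way around, 21340 / 6214 puts the AltiVec version at roughly 3.4x the speed of plain C, while 21340 / 19609 puts the current VSX code at roughly 1.09x.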

comment:35 Changed 7 weeks ago by bookmoons

Minicloud is under maintenance until 7 Jul, so I've lost my test environment for a while.

@edelsohn The IBM Power Development Cloud seems like a nice alternative. Registration asks for company details. Do you know if there's a way to get access as an individual developer?

I looked at the Brno offering. FYI, they seem to have expanded the available OSs. The cloud resources page says only RHEL. The request form now lists Fedora, openSUSE, Debian, and Ubuntu, and no RHEL, in case you can get word to whoever can update that page.

comment:36 Changed 7 weeks ago by edelsohn

If you want an alternative, please request an account at OSUOSL Power development systems for Open Source developers

http://osuosl.org/services/powerdev/

You can list me as the sponsor.

comment:37 Changed 7 weeks ago by bookmoons

Thank you very much @edelsohn. I've sent a request.

comment:38 Changed 6 weeks ago by bookmoons

Believe I have a reasonable unoptimized version. I'm sure there are kinks to work out; I'll have to find them when I get access to a VM again.

Once that's done, would it be OK to submit for an early review? To hopefully catch any obvious problems.

comment:39 Changed 6 weeks ago by cehoyos

Once your VSX function clearly beats AltiVec, it would be a good idea to send the first patch for review, yes.
