#7706 closed defect (fixed)
20-30% perf drop in FFmpeg (H264) transcode performance with VAAPI
Reported by: | eero-t | Owned by: | |
---|---|---|---|
Priority: | important | Component: | avcodec |
Version: | git-master | Keywords: | vaapi regression |
Cc: | linjie.fu@intel.com | Blocked By: | |
Blocking: | | Reproduced by developer: | yes
Analyzed by developer: | no | |
Description
Summary of the bug:
VAAPI H264 transcode performance dropped 20-30% between the following commits:
- d92f06eb6663dce8cb8942a1314f736a07f255e0 2019-01-22 19-59-10 avformat/img2enc: mention -frames:v in error message
- 3224d6691cdc59ef0d31cdb35efac27494ff515b 2019-01-24 16-38-34 avfilter/afade+acrossfade: allow skipping fade on inputs
There's no drop with the QSV backend. The indicated commit range contains a series of changes to FFmpeg's VAAPI support (and a couple of other changes).
Setup:
- Ubuntu 18.04
- drm-tip kernel v4.20
- FFmpeg and iHD driver built from git
- HW supported by iHD driver
How to reproduce:
$ ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.264 -c:v h264_vaapi -y output.h264
I see the drop on all platforms supported by the iHD driver, i.e. Broadwell and newer. With the i965 driver (which also supports older platforms), the drop is visible on Braswell too, but I don't see it on Sandy Bridge or Haswell.
This drop may also affect other codecs, but I've tested only H264.
The GPU is running at full speed before and after the change, but both it and the CPU draw less power, i.e. they're underutilized compared to the earlier situation. If one runs many instances of FFmpeg at the same time, so that the HW is definitely fully utilized, the same performance is retained => FFmpeg's VAAPI usage seems to have become more synchronous.
Change History (17)
comment:1 by , 6 years ago
Keywords: | vaapi regression added |
---|---|
Version: | unspecified → git-master |
comment:2 by , 6 years ago
Did a manual bisect (listed from newer to older commits):
- slow: 2019-01-23 23-04-12 916b3b9079f783f0e00823e19bba85fa0f7d012f: vaapi_encode_vp9: Support more complex reference structures
- slow: 2019-01-23 23-04-11 494bd8df782efe53e85de8ce258a079cea4eca72: vaapi_encode: Let the reconstructed frame pool be sized dynamically
- slow: 2019-01-23 23-04-11 5fdcf85bbffe7451c227478fda62da5c0938f27d: vaapi_encode: Convert to send/receive API
- fast: 2019-01-23 23-04-11 26ce3a43a35fe3a43c895945252aa22c6b46ffb7: vaapi_encode: Allocate picture-private data in generic code
=> encode perf regression came from:
commit 5fdcf85bbffe7451c227478fda62da5c0938f27d
Author:     Mark Thompson <sw@jkqxz.net>
AuthorDate: Thu Dec 20 20:39:56 2018 +0000
Commit:     Mark Thompson <sw@jkqxz.net>
CommitDate: Wed Jan 23 23:04:11 2019 +0000

    vaapi_encode: Convert to send/receive API

    This attaches the logic of picking the mode of the next picture to the
    output, which simplifies some choices by removing the concept of the
    picture for which input is not yet available. At the same time, we
    allow more complex reference structures and track more reference
    metadata (particularly the contents of the DPB) for use in the
    codec-specific code. It also adds flags to explicitly track the
    available features of the different codecs.

    The new structure also allows open-GOP support, so that is now
    available for codecs which can do it.
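The commit above converts the VAAPI encoder to FFmpeg's send/receive API (avcodec_send_frame() / avcodec_receive_packet()). As a rough, self-contained illustration of that call pattern only, here is a minimal model; ModelEncoder, model_send_frame() and model_receive_packet() are hypothetical stand-ins, not FFmpeg code, and the one-packet-per-frame FIFO is a simplifying assumption:

```c
#include <assert.h>

/* Minimal model of the send/receive pattern: send_frame() queues
 * input, receive_packet() drains output, and each side can return
 * an EAGAIN-style code telling the caller to use the other side. */
#define MODEL_EAGAIN (-11)
#define QCAP 8

typedef struct {
    int q[QCAP];
    int head, tail;
} ModelEncoder;

static int model_send_frame(ModelEncoder *e, int frame) {
    if ((e->tail + 1) % QCAP == e->head)
        return MODEL_EAGAIN;          /* queue full: caller must drain */
    e->q[e->tail] = frame;
    e->tail = (e->tail + 1) % QCAP;
    return 0;
}

static int model_receive_packet(ModelEncoder *e, int *pkt) {
    if (e->head == e->tail)
        return MODEL_EAGAIN;          /* nothing encoded yet */
    *pkt = e->q[e->head];
    e->head = (e->head + 1) % QCAP;
    return 0;
}
```

The point of the real API is that input and output are decoupled, so the encoder is free to hold several frames in flight; how the VAAPI code actually schedules work behind this interface is what the rest of this ticket is about.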
comment:3 by , 6 years ago
Component: | undetermined → avcodec |
---|---|
Priority: | normal → important |
Status: | new → open |
follow-up: 5 comment:4 by , 6 years ago
VAAPI transcode performance is now slower than with QSV, whereas earlier it was faster in most cases (at least for H264, on Intel).
The guilty commit is not codec-specific, so it's likely to regress VAAPI encoding performance for other codecs than H264 as well.
comment:5 by , 6 years ago
Replying to eero-t:
The guilty commit is not codec-specific, so it's likely to regress VAAPI encoding performance for other codecs than H264 as well.
Ticket #7797 could also be due to this regression, as there has been no improvement in VA-API performance since this regression in January (apart from the drm-tip kernel v4.20 -> 5.0 upgrade).
comment:6 by , 5 years ago
The VAAPI performance drop is not seen on Kaby Lake or Coffee Lake with kernel version 4.20 when the regression patch is applied.
comment:7 by , 5 years ago
Cc: | added |
---|---|
Reproduced by developer: | set |
comment:8 by , 5 years ago
One possible reason:
With the old encode2 API, vaBeginPicture and vaSyncSurface are called in a more "asynchronous" way:
two pictures are sent to the encoder without an intervening vaSyncSurface, so the encoder is not blocked by the sync-and-map procedure.
[mpeg2_vaapi @ 0x55a3ad371200] vaBeginPicture happens here.
    Last message repeated 1 times
[mpeg2_vaapi @ 0x55a3ad371200] vaSyncSurface happens here.
    Last message repeated 1 times
With the new send/receive API, each vaBeginPicture is immediately followed by vaSyncSurface.
[mpeg2_vaapi @ 0x55bb10ea9200] vaBeginPicture happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaSyncSurface happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaBeginPicture happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaSyncSurface happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaBeginPicture happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaSyncSurface happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaBeginPicture happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaSyncSurface happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaBeginPicture happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaSyncSurface happens here.
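The difference between the two orderings can be sketched with a tiny self-contained model; begin_picture() and sync_surface() below are hypothetical logging stubs standing in for vaBeginPicture and vaSyncSurface, not libva calls, and the "one extra picture in flight" depth is an assumption for illustration:

```c
#include <assert.h>
#include <string.h>

/* Stubs that only record call order: "B" = begin, "S" = sync. */
static char trace[64];
static void begin_picture(int pic) { (void)pic; strcat(trace, "B"); }
static void sync_surface(int pic)  { (void)pic; strcat(trace, "S"); }

/* Old encode2-style path (simplified): keep one extra picture in
 * flight, i.e. begin(i+1) is issued before sync(i), so the driver can
 * encode picture i+1 while picture i is synced and mapped. */
static void encode2_order(int npics) {
    trace[0] = '\0';
    if (npics > 0)
        begin_picture(0);
    for (int i = 0; i < npics; i++) {
        if (i + 1 < npics)
            begin_picture(i + 1);
        sync_surface(i);
    }
}

/* Send/receive path after 5fdcf85b: every submit is immediately
 * followed by a blocking sync, serializing CPU and GPU work. */
static void send_receive_order(int npics) {
    trace[0] = '\0';
    for (int i = 0; i < npics; i++) {
        begin_picture(i);
        sync_surface(i);
    }
}
```

For three pictures the first path yields the trace "BBSBSS" (a new submit always precedes each blocking sync), while the second yields the strictly alternating "BSBSBS" seen in the log above.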
comment:9 by , 5 years ago
Currently, vaapi encodes a picture once all of its reference pictures are ready,
and then outputs it immediately by calling vaapi_encode_output() (which calls vaSyncSurface).
While the output procedure runs, the hardware could be handling further encoding
tasks in the meantime for better performance.
So a more efficient approach is to encode, within one receive_packet() call, all
pictures whose references are ready, and output the packet while the encoder is
encoding a new picture that is waiting for its references.
That is what vaapi originally did before the regression, and the performance could be
improved by ~20%.
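The scheme described above could be sketched as follows; Pic, issue_all_ready() and sync_oldest() are hypothetical names modelling the described behaviour, not the actual code in the patch:

```c
#include <assert.h>

/* Model of a picture in the encode queue. */
typedef struct {
    int refs_ready;   /* are all reference pictures encoded? */
    int submitted;    /* has the begin-picture call been issued? */
} Pic;

/* First submit every picture whose references are ready, without any
 * blocking sync in between (stands in for a vaBeginPicture loop), so
 * the GPU has a backlog of work while the CPU handles output. */
static int issue_all_ready(Pic *pics, int n) {
    int submitted = 0;
    for (int i = 0; i < n; i++) {
        if (pics[i].refs_ready && !pics[i].submitted) {
            pics[i].submitted = 1;
            submitted++;
        }
    }
    return submitted;
}

/* Only now block on the oldest submitted picture (stands in for
 * vaSyncSurface) and return its index as the packet to output. */
static int sync_oldest(const Pic *pics, int n) {
    for (int i = 0; i < n; i++)
        if (pics[i].submitted)
            return i;
    return -1;   /* nothing in flight */
}
```

The key property is that all ready pictures are in flight on the GPU before the single blocking sync, instead of syncing after every individual submit.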
CMD:
ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i bbb_sunflower_1080p_30fps_normal.mp4 -c:v h264_vaapi -f h264 -y /dev/null
Source:
https://download.blender.org/demo/movies/BBB/
Before:
~164 fps
After:
~198 fps
However, it still didn't fully match the pre-regression performance in my experiment.
Hi Eero,
Would you please help to verify this patch:
https://patchwork.ffmpeg.org/patch/16156/
comment:10 by , 5 years ago
I've started some VA-API tests with that patch applied to FFmpeg. I'll check the results (FPS & PSNR) tomorrow.
comment:11 by , 5 years ago
I ran several variants of 6 transcode operations and a few other media tests:
- In 8-bit (max FullHD) AVC transcode tests perf improves up to 20%, when running single transcode operation
- In 10-bit 4K HEVC transcode [1], perf increase was 3-4%
- When running multiple transcode operations in parallel, there was no perf change (all changes were within daily variance)
- There were no performance regressions
Even with the patch, there's still a very clear gap to the original January performance. Because the perf drop concerned only single transcode operations (parallel ones were not impacted), it's possible that part of the gap is due to P-state power management (I'm intentionally not fixing CPU & GPU speeds in my tests).
I was testing this on KBL i7 GT3e NUC with 28W TDP. Some observations on power usage:
- In the tests that improve most, the patch increases GPU power usage without increasing CPU power usage, i.e. FFmpeg was better able to feed work to the GPU
- When many instances of the same test are run in parallel, things are TDP limited. Either there's no change in power usage, or the patch causes slightly higher CPU usage, which results in the GPU using less power. No idea how the latter behavior is able to maintain the same speed; maybe P-state management is better able to save GPU power with the interaction patterns caused by the patch?
[1] Note: I'm seeing marginal reproducible quality drop (0.1% SSIM, 2-3% PSNR) in this test-case: https://trac.ffmpeg.org/ticket/8328
I assume that's something related to frame timings like with QSV, not a change in encoded frame contents.
follow-up: 13 comment:12 by , 5 years ago
Thanks for verifying this patch.
There is still room for performance improvement.
So based on the test results, does this regression affect single process only?
If that's true, one possible reason is that multi-process encoding makes full use of the resources (hardware/encoder, maybe), while a single process leaves the encoding procedure idle or waiting for some time.
And it's kind of weird that it benefits HEVC so little, since the modification is in the generic vaapi encode code path.
And since the test covers the whole transcoding procedure, how does the performance of decode and encode look separately?
comment:13 by , 5 years ago
Replying to fulinjie:
So based on the test results, does this regression affect single process only?
Looking at the old results, correct. Running multiple (5-50) parallel transcode processes wasn't affected by the regression, even when they were not TDP limited.
If that's true, one possible reason is that multi-process encoding makes full use of the resources (hardware/encoder, maybe), while a single process leaves the encoding procedure idle or waiting for some time.
Yes, a single 8-bit AVC transcode doesn't fill any of the GPU engines to 100%; that happens only with multiple parallel transcode operations (it's easy to see from the IGT "intel_gpu_top" output).
If the GPU is "full" all the time, work needs to be queued. For average throughput (= what I'm measuring), it doesn't matter when work enters the queue; the GPU is fully utilized anyway. Your patch might help a bit with VA-API latency in the multiple-parallel-transcode case, but I don't measure that.
And it's kind of weird that it benefits HEVC little since the modification is in the general vaapi encode code path.
My HEVC test-case is 10-bit instead of 8, and 4K instead of 2K or smaller. Therefore, it's processing >4x more data than my AVC test-cases. HEVC encoding is also heavier.
=> As each frame takes longer, feeding GPU timely is less of a problem for keeping average GPU utilization high.
(I was a bit worried about potential extra CPU usage, because that takes away from the power/temperature "budget" shared with the iGPU, but it seems to be low enough not to be a problem.)
And since the test covers the whole transcoding procedure, how about the performance of decode/encode separately?
I did some decode tests with HEVC (for 2K 10-bit data) with and without hwdownload, and as expected, perf of that wasn't impacted.
(RAW data encoding is less of interest for me as the input data is so large that end-to-end perf can be bottlenecked more by disk/network data transfer, rather than GPU usage.)
comment:14 by , 3 years ago
Resolution: | → fixed |
---|---|
Status: | open → closed |
This issue is fixed by commits e0ff86993052b49a64d434bac345e92fc149f446 and d165ce22a4a7cc4ed60238ce8f3d5dcbbad3e266
comment:15 by , 3 years ago
Verified.
There's a large improvement in VA-API performance for all single transcode operations, whether small-resolution AVC or 10-bit 4K HEVC transcode. In a few cases, the improvement reaches a significantly higher level than before the regression, and in most cases at least the same level.
For parallel transcode operations which (more than) fill the whole GPU, there can be up to a couple of percent regression, but in at least one such case there was also a (1%) improvement.
=> I.e. in total, this is a significant improvement over the previous state.
comment:16 by , 3 years ago
For parallel transcode operations which (more than) fill the whole GPU, there can be up to a couple of percent regression
Command?
comment:17 by , 3 years ago
For parallel transcode operations which (more than) fill the whole GPU, there can be up to a couple of percent regression
Command?
The general use-case for them is:
- Start dozen(s) of *identical* FFmpeg transcode operations in parallel
- Calculate the resulting average FPS over all of them
(Tested transcode use-cases are HEVC->HEVC, AVC->AVC, AVC->MPEG2, and include several different resolutions.)
While one use-case goes a couple of percent down on one HW, it goes up on another HW, and for another use-case the change is the other way round. For every use-case where the result goes marginally down on some HW, it goes marginally up on another. While the results are reproducible, there's no uniform regression from the fix, i.e. you should ignore that effect.
(With slightly more work being done by commit d165ce22a4a7cc4ed60238ce8f3d5dcbbad3e266, yet it improving perf, that kind of behaviour could even be expected, as both changes affect kernel scheduling.)
Note that after this fix to a huge perf regression, which in some cases even improves things beyond just fixing the regression, VA-API is now faster in *all* of those cases than doing the identical thing through the QSV API. IMHO, what should be looked at now is QSV perf.