Opened 10 months ago

Last modified 13 days ago

#7706 open defect

20-30% perf drop in FFmpeg (H264) transcode performance with VAAPI

Reported by: eero-t
Owned by:
Priority: important
Component: avcodec
Version: git-master
Keywords: vaapi regression
Cc: linjie.fu@intel.com
Blocked By:
Blocking:
Reproduced by developer: yes
Analyzed by developer: no

Description

Summary of the bug:

VAAPI H264 transcode performance dropped 20-30% between the following commits:

There's no drop with the QSV backend. The indicated commit range contains a series of changes to FFmpeg VAAPI support (and a couple of other changes).

Setup:

  • Ubuntu 18.04
  • drm-tip kernel v4.20
  • FFmpeg and iHD driver built from git
  • HW supported by iHD driver

How to reproduce:

$ ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.264 -c:v h264_vaapi -y output.h264

I see the drop on all platforms supported by the iHD driver, i.e. Broadwell and newer. With the i965 driver (which also supports older platforms), the drop is visible on Braswell too, but I don't see it on Sandybridge or Haswell.

This drop may also affect other codecs, but I've tested only H264.

The GPU runs at full speed both before and after the change, but both it and the CPU use less power afterwards, i.e. they're underutilized compared to the earlier situation. If one runs many instances of FFmpeg at the same time so that the HW is definitely fully utilized, performance stays the same as before => FFmpeg's VAAPI usage seems to have become more synchronous.

Change History (13)

comment:1 Changed 10 months ago by cehoyos

  • Keywords vaapi regression added
  • Version changed from unspecified to git-master

Please use git bisect to find the commit introducing the regression.

comment:2 Changed 10 months ago by eero-t

Did manual bisect (they go from newer to older commits):

=> encode perf regression came from:

commit 5fdcf85bbffe7451c227478fda62da5c0938f27d
Author:     Mark Thompson <sw@jkqxz.net>
AuthorDate: Thu Dec 20 20:39:56 2018 +0000
Commit:     Mark Thompson <sw@jkqxz.net>
CommitDate: Wed Jan 23 23:04:11 2019 +0000

    vaapi_encode: Convert to send/receive API
    
    This attaches the logic of picking the mode of for the next picture to
    the output, which simplifies some choices by removing the concept of
    the picture for which input is not yet available.  At the same time,
    we allow more complex reference structures and track more reference
    metadata (particularly the contents of the DPB) for use in the
    codec-specific code.
    
    It also adds flags to explicitly track the available features of the
    different codecs.  The new structure also allows open-GOP support, so
    that is now available for codecs which can do it.

comment:3 Changed 10 months ago by cehoyos

  • Component changed from undetermined to avcodec
  • Priority changed from normal to important
  • Status changed from new to open

comment:4 follow-up: Changed 10 months ago by eero-t

VAAPI transcode performance is now slower than with QSV, whereas earlier it was faster in most cases (at least for H264, on Intel).

Guilty commit is not codec specific, so it's likely to regress VAAPI encoding perf also with other codecs than H264.

comment:5 in reply to: ↑ 4 Changed 8 months ago by eero-t

Replying to eero-t:

Guilty commit is not codec specific, so it's likely to regress VAAPI encoding perf also with other codecs than H264.

Ticket #7797 could also be due to this regression, as there has been no improvement in VA-API performance since the regression in January (apart from the drm-tip kernel v4.20 -> 5.0 upgrade).

comment:6 Changed 5 months ago by hbommi

There is a 10-15% performance drop in HEVC (H265) transcoding with the regressing patch applied.
Test command:
ffmpeg -v verbose -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -hwaccel_output_format vaapi -c:v hevc -i hevc_inputstream -c:v hevc_vaapi -vframes 500 -y output.h265

Last edited 5 months ago by hbommi

comment:7 Changed 3 weeks ago by fulinjie

  • Cc linjie.fu@intel.com added
  • Reproduced by developer set

comment:8 Changed 3 weeks ago by fulinjie

One possible reason:

With the old encode2 API, vaBeginPicture and vaSyncSurface are called in a more "asynchronous" way:

Two pics are sent to the encoder before any vaSyncSurface call, so the encoder is not blocked by the sync and map procedure.

[mpeg2_vaapi @ 0x55a3ad371200] vaBeginPicture happens here.
    Last message repeated 1 times
[mpeg2_vaapi @ 0x55a3ad371200] vaSyncSurface happens here.
    Last message repeated 1 times

With the new send/receive API, each vaBeginPicture is strictly followed by a vaSyncSurface.

[mpeg2_vaapi @ 0x55bb10ea9200] vaBeginPicture happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaSyncSurface happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaBeginPicture happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaSyncSurface happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaBeginPicture happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaSyncSurface happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaBeginPicture happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaSyncSurface happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaBeginPicture happens here.
[mpeg2_vaapi @ 0x55bb10ea9200] vaSyncSurface happens here.
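
The effect of the two call patterns above can be illustrated with a toy timing model. This is only a sketch under stated assumptions: the function name `total_time`, the cost parameters (`submit`, `gpu`, `output`) and their values are hypothetical and not measured from FFmpeg or any driver. `depth` is the number of pictures submitted before the CPU blocks on the oldest one (1 for the new send/receive pattern, 2 for the old encode2 pattern in the log).

```python
from collections import deque


def total_time(frames, submit, gpu, output, depth):
    """Toy model: time to encode `frames` pictures with `depth` in flight.

    submit: CPU time to issue one picture (the vaBeginPicture side)
    gpu:    GPU time to encode one picture
    output: CPU time to map and write one packet after the sync
    All values are hypothetical time units, not measurements.
    """
    cpu = 0              # time at which the CPU thread is free
    gpu_free = 0         # time at which the GPU finishes its queued work
    in_flight = deque()  # GPU completion times of submitted pictures

    def sync_oldest():
        nonlocal cpu
        done = in_flight.popleft()
        # Block until the oldest picture is done (like vaSyncSurface),
        # then spend the output cost on the CPU.
        cpu = max(cpu, done) + output

    for _ in range(frames):
        cpu += submit
        gpu_free = max(cpu, gpu_free) + gpu
        in_flight.append(gpu_free)
        if len(in_flight) >= depth:
            sync_oldest()
    while in_flight:
        sync_oldest()
    return cpu


sync_time = total_time(100, submit=1, gpu=4, output=1, depth=1)
async_time = total_time(100, submit=1, gpu=4, output=1, depth=2)
print(sync_time, async_time)
```

With these made-up costs the strictly synchronous pattern takes 600 time units and the two-in-flight pattern 402, because in the second case the GPU no longer idles while the CPU submits and outputs.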

comment:9 Changed 2 weeks ago by fulinjie

Currently, vaapi encodes a pic once all its referenced pics are ready,
and then outputs it immediately by calling vaapi_encode_output() (vaSyncSurface).

While the output procedure is running, the hardware is able to handle further
encoding tasks in the meantime, which gives better performance.

So a more efficient way is to encode all the pics whose refs are ready
within one receive_packets() call, and to output the pkt while the encoder is encoding
the new pics that were waiting for their references.

This is what vaapi originally did before the regression, and the performance could be
improved by ~20%.

CMD:
ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128
-hwaccel_output_format vaapi -i bbb_sunflower_1080p_30fps_normal.mp4
-c:v h264_vaapi -f h264 -y /dev/null

Source:
https://download.blender.org/demo/movies/BBB/

Before:

~164 fps

After:

~198 fps

However, in my experiment it didn't fully reach the pre-regression performance.
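
As a quick arithmetic check of the numbers above, here is a small sketch. The helper names are hypothetical; the only inputs taken from this ticket are the ~164 and ~198 fps figures and the reported 20-30% drop. The drop was measured on the reporter's setup, so combining it with these fps figures is indicative only.

```python
def percent_gain(before_fps, after_fps):
    """Relative improvement in percent when going from before_fps to after_fps."""
    return (after_fps - before_fps) / before_fps * 100.0


def gain_needed_to_undo(drop_percent):
    """Gain (in percent of the regressed speed) needed to undo a given drop."""
    remaining = 1.0 - drop_percent / 100.0
    return (1.0 / remaining - 1.0) * 100.0


print(round(percent_gain(164, 198), 1))    # 20.7
print(round(gain_needed_to_undo(20), 1))   # 25.0
print(round(gain_needed_to_undo(30), 1))   # 42.9
```

A gain of about 20.7% is less than the 25-43% that would be needed to undo a 20-30% drop, which is consistent with the patch recovering only part of the original performance.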

Hi Eero,

Would you please help to verify this patch:
https://patchwork.ffmpeg.org/patch/16156/

comment:10 Changed 2 weeks ago by eero-t

I've started some VA-API tests with that patch applied to FFmpeg. I'll check the results (FPS & PSNR) tomorrow.

comment:11 Changed 13 days ago by eero-t

I ran several variants of 6 transcode operations and a few other media tests:

  • In 8-bit (max FullHD) AVC transcode tests, perf improves by up to 20% when running a single transcode operation
  • In the 10-bit 4K HEVC transcode [1], the perf increase was 3-4%
  • When running multiple transcode operations in parallel, there was no perf change (all changes were within daily variance)
  • There were no performance regressions

Even with the patch, there's still a very clear gap to the original January performance. Because that perf drop concerned only single transcode operations (parallel ones were not impacted), it's possible that some part of the gap is due to P-state power management (I'm not fixing CPU & GPU speeds in my tests, on purpose).

I was testing this on a KBL i7 GT3e NUC with a 28W TDP. Some observations on power usage:

  • In the tests that improve most, the patch increases GPU power usage without increasing CPU power usage, i.e. FFmpeg was better able to feed work to the GPU
  • When many instances of the same test are run in parallel, things are TDP limited. Either there's no change in power usage, or the patch causes slightly higher CPU usage, which results in the GPU using less power. No idea how the latter behavior is able to maintain the same speed; maybe P-state is better able to save GPU power with the interaction patterns caused by the patch?

[1] Note: I'm seeing a marginal but reproducible quality drop (0.1% SSIM, 2-3% PSNR) in this test-case: https://trac.ffmpeg.org/ticket/8328

I assume that's something related to frame timings like with QSV, not a change in encoded frame contents.

comment:12 follow-up: Changed 13 days ago by fulinjie

Thanks for verifying this patch.
There is still room for performance improvement.

So based on the test results, does this regression affect single process only?
If that's true, one possible reason is that multiple process encoding makes full use of the resource (hardware/encoder maybe), while single process seems to keep the encoding procedure idle or waiting for some time.

And it's kind of weird that it benefits HEVC little since the modification is in the general vaapi encode code path.

And since the test covers the whole transcoding procedure, how about the performance of decode/encode separately?

comment:13 in reply to: ↑ 12 Changed 13 days ago by eero-t

Replying to fulinjie:

So based on the test results, does this regression affect single process only?

Looking at the old results, correct. Running multiple (5-50) parallel transcode processes wasn't affected by the regression, even when they were not TDP limited.

If that's true, one possible reason is that multiple process encoding makes full use of the resource (hardware/encoder maybe), while single process seems to keep the encoding procedure idle or waiting for some time.

Yes, a single 8-bit AVC transcode doesn't fill any of the GPU engines to 100%; that happens only with multiple parallel transcode operations (it's easy to see from IGT "intel_gpu_top" output).

If the GPU is "full" all the time, work needs to be queued. For average throughput (= what I'm measuring), it doesn't matter when you enter the queue; the GPU is fully utilized anyway. Your patch might help a bit with VA-API latency in the multiple parallel transcode cases, but I don't measure that.
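
The throughput argument above can be sketched with an idealized bound. Everything here is an assumption made for illustration: the function name `throughput_bound`, the cost values, one CPU core per process, and a single shared GPU engine that handles one picture at a time. Real drivers and hardware are more complicated.

```python
from fractions import Fraction


def throughput_bound(processes, submit, gpu, output, pipelined):
    """Idealized upper bound on pictures per time unit for parallel encodes.

    Each process is assumed to have its own CPU core; all processes share
    one GPU engine. Costs are hypothetical time units per picture.
    """
    if pipelined:
        cycle = max(gpu, submit + output)  # CPU work overlaps GPU work
    else:
        cycle = submit + gpu + output      # everything is serialized
    per_process_limit = Fraction(processes, cycle)
    gpu_limit = Fraction(1, gpu)           # the shared engine is the ceiling
    return min(per_process_limit, gpu_limit)


for n in (1, 2, 5):
    print(n,
          throughput_bound(n, 1, 4, 1, pipelined=False),
          throughput_bound(n, 1, 4, 1, pipelined=True))
```

In this model a single process goes from 1/6 to 1/4 pictures per time unit when submission is pipelined, while with two or more processes both variants are capped at 1/4 by the GPU, so only the single-process case shows a difference.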

And it's kind of weird that it benefits HEVC little since the modification is in the general vaapi encode code path.

My HEVC test-case is 10-bit instead of 8, and 4K instead of 2K or smaller. Therefore, it's processing >4x more data than my AVC test-cases. HEVC encoding is also heavier.
=> As each frame takes longer, feeding the GPU in a timely manner is less of a problem for keeping average GPU utilization high.
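
A rough way to see why heavier frames benefit less is to compare serialized and overlapped per-picture times in a toy model. The function name `pipelining_speedup` and all cost values are hypothetical; in particular, the 4x larger GPU time below is only an assumed stand-in for the heavier HEVC case, not a measurement.

```python
def pipelining_speedup(submit, gpu, output):
    """Ratio of serialized to overlapped per-picture time in a toy model.

    submit/output are CPU-side costs, gpu is the GPU encode time,
    all in hypothetical time units.
    """
    serialized = submit + gpu + output
    overlapped = max(gpu, submit + output)
    return serialized / overlapped


print(pipelining_speedup(1, 4, 1))    # 1.5
print(pipelining_speedup(1, 16, 1))   # 1.125
```

With the same CPU-side overhead, the potential gain from overlapping shrinks as the GPU time per frame grows, since the overhead becomes a smaller share of each frame.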

(I was a bit worried about potential extra CPU usage, because that takes away from the power/temperature "budget" shared with the iGPU, but that seems to be low enough not to be a problem.)

And since the test covers the whole transcoding procedure, how about the performance of decode/encode separately?

I did some decode tests with HEVC (for 2K 10-bit data) with and without hwdownload, and as expected, perf of that wasn't impacted.

(RAW data encoding is of less interest to me, as the input data is so large that end-to-end perf can be bottlenecked more by disk/network data transfer than by GPU usage.)
