Opened 5 years ago

Closed 2 years ago

#7797 closed defect (fixed)

AVC->MPEG-2 transcoding with VA-API 2-3x slower than with QSV

Reported by: eero-t
Owned by:
Priority: normal
Component: undetermined
Version: git-master
Keywords:
Cc: linjie.fu@intel.com
Blocked By: 7706
Blocking:
Reproduced by developer: no
Analyzed by developer: no

Description

Setup:

  • Ubuntu 18.04 with drm-tip 5.x kernel from Git
  • iHD media driver, MediaSDK and FFmpeg built from Git

Summary of the bug:

  • Bad VA-API performance with transcoding. Doing AVC -> MPEG-2 transcoding with QSV is 2-3x faster than using VA-API

How to reproduce:

$ export LIBVA_DRIVER_NAME=iHD
$ ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i 720x480p_30.00_4mb_h264_cabac_180s.264 -c:v mpeg2_vaapi -b:v 2000K -compression_level 4 -y output.mpg
ffmpeg version N-93330-g7ff89574c7 Copyright (c) 2000-2019 the FFmpeg developers
  built with gcc 7 (Ubuntu 7.3.0-27ubuntu1~18.04)
...
Input #0, h264, from 'input/720x480p_30.00_4mb_h264_cabac_180s.264':
  Duration: N/A, bitrate: N/A
    Stream #0:0: Video: h264 (High), 1 reference frame, yuv420p(tv, smpte170m, progressive, left), 720x480 [SAR 10:11 DAR 15:11], 30 fps, 30 tbr, 1200k tbn, 60 tbc
Stream mapping:
  Stream #0:0 -> #0:0 (h264 (native) -> mpeg2video (mpeg2_vaapi))
Press [q] to stop, [?] for help
[h264 @ 0x55a62883c380] Reinit context to 720x480, pix_fmt: vaapi_vld
[graph 0 input from stream 0:0 @ 0x55a628872f40] w:720 h:480 pixfmt:vaapi_vld tb:1/1200000 fr:30/1 sar:10/11 sws_param:flags=2
[mpeg2_vaapi @ 0x55a62883eb80] Input surface format is nv12.
[mpeg2_vaapi @ 0x55a62883eb80] Using VAAPI profile VAProfileMPEG2Main (1).
[mpeg2_vaapi @ 0x55a62883eb80] Using VAAPI entrypoint VAEntrypointEncSlice (6).
[mpeg2_vaapi @ 0x55a62883eb80] Using VAAPI render target format YUV420 (0x1).
[mpeg2_vaapi @ 0x55a62883eb80] RC mode: VBR.
[mpeg2_vaapi @ 0x55a62883eb80] RC target: 50% of 4000000 bps over 500 ms.
[mpeg2_vaapi @ 0x55a62883eb80] RC buffer: 2000000 bits, initial fullness 1500000 bits.
[mpeg2_vaapi @ 0x55a62883eb80] RC framerate: 30/1 (30.00 fps).
[mpeg2_vaapi @ 0x55a62883eb80] Using intra, P- and B-frames (supported references: 1 / 1).
[mpeg2_vaapi @ 0x55a62883eb80] Driver does not support some wanted packed headers (wanted 0x3, found 0x10).
[mpeg2_vaapi @ 0x55a62883eb80] Sample aspect ratio 10:11 is not representable, signalling square pixels instead.
[mpeg @ 0x55a62883a580] VBV buffer size not set, using default size of 230KB
If you want the mpeg file to be compliant to some specification
Like DVD, VCD or others, make sure you set the correct buffer size
Output #0, mpeg, to 'output/0039_SD03MP2_1.0.mpg':
  Metadata:
    encoder         : Lavf58.26.101
    Stream #0:0: Video: mpeg2video (mpeg2_vaapi) (Main), vaapi_vld, 720x480 [SAR 10:11 DAR 15:11], q=-1--1, 2000 kb/s, 30 fps, 90k tbn, 30 tbc
    Metadata:
      encoder         : Lavc58.47.103 mpeg2_vaapi
...

And QSV:

$ ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v h264_qsv -i 720x480p_30.00_4mb_h264_cabac_180s.264 -c:v mpeg2_qsv -b:v 2000K -compression_level 4 -y output.mpg
...
[AVHWDeviceContext @ 0x565392bd4280] Initialize MFX session: API version is 1.28, implementation version is 1.28
[AVHWDeviceContext @ 0x565392bd4280] MFX compile/runtime API: 1.28/1.28
[AVHWDeviceContext @ 0x565392bf2f00] VAAPI driver: Intel iHD driver - 1.0.0.
[AVHWDeviceContext @ 0x565392bf2f00] Driver not found in known nonstandard list, using standard behaviour.
[graph 0 input from stream 0:0 @ 0x565392d785c0] w:720 h:480 pixfmt:qsv tb:1/1200000 fr:30/1 sar:10/11 sws_param:flags=2
[mpeg2_qsv @ 0x565392bd1f40] Using the variable bitrate (VBR) ratecontrol method
[AVHWDeviceContext @ 0x565392cfc340] VAAPI driver: Intel iHD driver - 1.0.0.
[AVHWDeviceContext @ 0x565392cfc340] Driver not found in known nonstandard list, using standard behaviour.
[mpeg2_qsv @ 0x565392bd1f40] profile: main; level: 8
[mpeg2_qsv @ 0x565392bd1f40] GopPicSize: 250; GopRefDist: 4; GopOptFlag: closed ; IdrInterval: 0
[mpeg2_qsv @ 0x565392bd1f40] TargetUsage: 4; RateControlMethod: VBR
[mpeg2_qsv @ 0x565392bd1f40] BufferSizeInKB: 500; InitialDelayInKB: 500; TargetKbps: 2000; MaxKbps: 2000; BRCParamMultiplier: 1
[mpeg2_qsv @ 0x565392bd1f40] NumSlice: 30; NumRefFrame: 0
[mpeg2_qsv @ 0x565392bd1f40] RateDistortionOpt: unknown
[mpeg2_qsv @ 0x565392bd1f40] RecoveryPointSEI: unknown IntRefType: 0; IntRefCycleSize: 0; IntRefQPDelta: 0
[mpeg2_qsv @ 0x565392bd1f40] MaxFrameSize: 0; MaxSliceSize: 0; 
[mpeg2_qsv @ 0x565392bd1f40] BitrateLimit: unknown; MBBRC: unknown; ExtBRC: unknown
[mpeg2_qsv @ 0x565392bd1f40] Trellis: auto
[mpeg2_qsv @ 0x565392bd1f40] VDENC: OFF
[mpeg2_qsv @ 0x565392bd1f40] RepeatPPS: unknown; NumMbPerSlice: 0; LookAheadDS: unknown
[mpeg2_qsv @ 0x565392bd1f40] AdaptiveI: unknown; AdaptiveB: unknown; BRefType: auto
[mpeg2_qsv @ 0x565392bd1f40] MinQPI: 0; MaxQPI: 0; MinQPP: 0; MaxQPP: 0; MinQPB: 0; MaxQPB: 0
[mpeg2_qsv @ 0x565392bd1f40] FrameRateExtD: 1; FrameRateExtN: 30 
[mpeg @ 0x565392bd1500] VBV buffer size not set, using default size of 230KB
If you want the mpeg file to be compliant to some specification
Like DVD, VCD or others, make sure you set the correct buffer size
Output #0, mpeg, to 'output/0039_SD03MP2_1.0.mpg':
    Metadata:
      encoder         : Lavc58.47.103 mpeg2_qsv
    Side data:
      cpb: bitrate max/min/avg: 0/0/2000000 buffer size: 0 vbv_delay: -1
...
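A minimal sketch of how the gap between the two runs could be quantified, assuming GNU date (for %N) and awk are available. The helper names elapsed_s and speedup are hypothetical and not part of FFmpeg; this is not necessarily how the numbers in this ticket were measured.

```shell
# Run a command and print its wall-clock duration in seconds on stdout.
# The command's own output is discarded so that only the number is printed.
# Assumption: GNU date, which supports the %N (nanoseconds) format.
elapsed_s() {
    start=$(date +%s.%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" 'BEGIN { printf "%.2f\n", e - s }'
}

# Ratio of two durations: speedup A B prints A / B,
# so a value above 1 means run B finished sooner than run A.
speedup() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Hypothetical usage with the two commands from this ticket:
#   t_vaapi=$(elapsed_s ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 \
#       -hwaccel_output_format vaapi -i 720x480p_30.00_4mb_h264_cabac_180s.264 \
#       -c:v mpeg2_vaapi -b:v 2000K -compression_level 4 -y output.mpg)
#   t_qsv=$(elapsed_s ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 \
#       -c:v h264_qsv -i 720x480p_30.00_4mb_h264_cabac_180s.264 \
#       -c:v mpeg2_qsv -b:v 2000K -compression_level 4 -y output.mpg)
#   speedup "$t_vaapi" "$t_qsv"
```

Wall-clock time is used rather than CPU time because the work of interest happens on the GPU, which CPU time does not capture.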

GPU is running at full speed in both cases, so this isn't related to ticket #7690. It could be related to regression #7706, but I can't test it because ticket #7650 ("invalid RC mode") was fixed only after that regression.

When looking at CPU utilization and power usage, QSV utilizes more CPU but also has more iowait, and correspondingly it uses more CPU and GPU power than VA-API. Maybe VA-API isn't running asynchronously enough?

There are also single-stream (AVC) transcode cases where VA-API is slower, but the gap there is much smaller, and if one runs multiple processes in parallel, VA-API is actually slightly faster. In this case, VA-API is also slower with multiple parallel transcode processes.
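The ticket does not say how the parallel runs were launched. One possible way, sketched here under that caveat, is to start N copies of a transcode in the background and wait for all of them. The helper name run_parallel is hypothetical; it appends the instance index as the last argument so that each instance can write to its own output file.

```shell
# Start N copies of a command in the background and wait for all of them.
# Each copy receives its index (0..N-1) as an extra last argument.
# Prints the number of copies that exited with a non-zero status.
run_parallel() {
    n=$1
    shift
    i=0
    pids=""
    while [ "$i" -lt "$n" ]; do
        "$@" "$i" >/dev/null 2>&1 &
        pids="$pids $!"
        i=$((i + 1))
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed"
}

# Hypothetical wrapper around the VA-API command from this ticket;
# the per-instance output name is an assumption made for illustration.
transcode_vaapi() {
    ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 \
        -hwaccel_output_format vaapi \
        -i 720x480p_30.00_4mb_h264_cabac_180s.264 \
        -c:v mpeg2_vaapi -b:v 2000K -compression_level 4 \
        -y "output_$1.mpg"
}

# Example: run_parallel 4 transcode_vaapi
```

Separate output files matter here: with a shared output.mpg and -y, the instances would overwrite each other's results.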

I'm seeing a similar perf gap on all the Core devices [1] currently supported by iHD: BDW, SKL, KBL & CFL, on both GT2 & GT3e devices, i.e. the issue isn't platform specific.

[1] This test-case doesn't work on the only GEN9+ non-core device I have (BXT/APL).

Extra info:

  • With a larger 1280x720p_29.97_10mb_h264_cabac input, the performance gap was still about the same (>2x)
  • When using an even larger 1920x1080i_29.97_20mb_mpeg2_high as input, the gap decreased to ~25%, but performance with both APIs had also dropped to a fraction of what it was with the smaller inputs.

Change History (7)

comment:1 by Dennis E. Mungai, 5 years ago

Same issue here.
QSV, with the same source, is literally 2x faster than VAAPI.
Tested on CFL, with current ffmpeg git head.

comment:2 by eero-t, 4 years ago

There's been a clear improvement to AVC -> MPEG2 performance this last week, between...

These versions:

  • ffmpeg (dd019473) 2019-10-15 avformat/latmenc: abort if no extradata is available
  • media-driver (e370de92) 2019-10-15 [VP] Add HDR flag support in vphal render.

And these:

  • ffmpeg (e831f601) 2019-10-16 fate/source: add libavfilter/af_arnndn.c
  • media-driver (6e7275cf) 2019-10-16 encode_mpeg2: use 16kb as the buf unit for mpeg2 vbvBuf

Especially in multi-process VA-API encoding, where it's now about the same perf as MSDK and FFmpeg QSV.

(Single process encoding is still behind a bit; I guess that's mostly due to threading differences between the QSV & VA-API backends and the resulting kernel power management handling.)

comment:3 by Linjie.Fu, 4 years ago

Hi Eero,

Would you please help verify whether this issue also benefits from the patch mentioned in #7706:
https://patchwork.ffmpeg.org/patch/16156/

in reply to:  3 comment:4 by eero-t, 4 years ago

Replying to eero-t:

There's been a clear improvement to AVC -> MPEG2 performance this last week, between...

[...]

Especially in multi-process VA-API encoding, where it's now about the same perf as MSDK and FFmpeg QSV.

In the multi-process case, VA-API is now slightly faster than QSV on SKL GT2 & KBL GT3e; on SKL GT4e, it's still slightly slower. I.e. that was a 2x perf improvement over earlier.

(Single process encoding is still behind a bit; I guess that's mostly due to threading differences between the QSV & VA-API backends and the resulting kernel power management handling.)

Single process AVC->MPEG2 VA-API encode+downscale performance improved from <1/2 to 2/3 of the QSV perf.

(The gap is now close to the gap for the same conversion with AVC output, where VA-API perf is 3/4 of QSV perf.)

---

Replying to fulinjie:

Would you please help verify whether this issue also benefits from the patch mentioned in #7706:
https://patchwork.ffmpeg.org/patch/16156/

Your patch increased single process performance by 5-10%, from slightly over 2/3 to almost 3/4 of the QSV perf.

(For the same conversion with AVC output, the improvement from your patch is 10-15%, from 3/4 to 7/8 of QSV performance.)
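As a quick check of how those fractions relate to the quoted percentages, the relative gain between two "fraction of QSV perf" values can be computed as below. This is only arithmetic on the rounded fractions, not a measurement, and the helper name rel_gain_pct is hypothetical.

```shell
# Relative gain in percent when going from one fraction of QSV
# performance to another. Arguments are fractions written as "n/d".
rel_gain_pct() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        split(before, b, "/")
        split(after, a, "/")
        printf "%.1f\n", ((a[1] / a[2]) / (b[1] / b[2]) - 1) * 100
    }'
}

rel_gain_pct 2/3 3/4   # MPEG-2 output: nominal 2/3 -> 3/4
rel_gain_pct 3/4 7/8   # AVC output:    nominal 3/4 -> 7/8
```

The nominal fractions give 12.5% and 16.7%, somewhat above the measured 5-10% and 10-15%. That is consistent with the fractions being rounded: the starting point was "slightly over" 2/3 and the end point "almost" 3/4.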

---

In FullHD MPEG2 -> AVC bitrate conversion with VA-API:
ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i 1920x1080i_29.97_20mb_mpeg2_high.mpv -c:v h264_vaapi -b:v 6000K -compression_level 7 output.h264

vs. QSV:
ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v mpeg2_qsv -i 1920x1080i_29.97_20mb_mpeg2_high.mpv -c:v h264_qsv -b:v 6000K -compression_level 7 output.h264

VA-API is also slower, at 2/3 of QSV performance in single process transcode. Most of that gap is naturally due to the #7706 regression, but it was also slower than QSV earlier (by 1/8, 10-15%). Your patch improves it to 3/4 of QSV performance.

In my other test-cases, VA-API is close to QSV speed (or better).

comment:5 by Linjie.Fu, 4 years ago

Cc: linjie.fu@intel.com added

Thanks for the info, it really helps.
It seems #7706 should be addressed first; then we can turn to MPEG-2 specific transcoding.

comment:6 by eero-t, 4 years ago

Blocked By: 7706

I agree, #7706 should be handled first, so I added it as a blocker.

Based on my last observation, there may also be something going on with the MPEG-2 decoding, but let's look at that after #7706.

comment:7 by eero-t, 2 years ago

Resolution: fixed
Status: new → closed

Closing this as fixed.

VA-API performance was fixed for parallel AVC->MPEG2 transcoding sometime in fall 2019, and the single transcoding case performance was fixed a month ago with #7706.

In all the transcode cases I'm testing (on GEN9 + TGL), VA-API performance is now the same as or better than QSV's, sometimes significantly.

(Except in one parallel use-case, where 4K HEVC is downscaled + converted to YUV, downloaded to CPU side and discarded. That one is faster with QSV than VA-API on all platforms.)
