Opened 8 months ago

Last modified 8 months ago

#7690 new defect

FFmpeg QSV decode + VPP performance is just a fraction of what one gets with VA-API and MediaSDK

Reported by: eero-t Owned by:
Priority: normal Component: undetermined
Version: git-master Keywords: qsv
Cc: Blocked By:
Blocking: Reproduced by developer: no
Analyzed by developer: no

Description

Summary of the bug: 10-bit HEVC decode + downscale + download with the QSV backend reaches only 20-30% of the performance of the same pipeline with the VA-API backend, or with the Intel MediaSDK sample application.

Setup:

  • Distro: Ubuntu 18.04
  • FFmpeg: latest compiled from Git
  • MediaSDK & its deps: latest compiled from Git
  • HW: tested on KBL GT2, KBL GT3e and CFL GT2

Steps to reproduce:

  1. Encoding 1080p 10-bit HEVC test input video with libx265:
    $ ffmpeg -i 4k_uhd_hevc_10bit_60fps.mkv -frames 4800 -s 1920x1080 -pix_fmt yuv420p10le -x265-params level=5.2 1920x1080_10bit_60fps.h265
    
  2. export LIBVA_DRIVER_NAME=iHD
  3. Decode + VPP with QSV for that video:
    $ ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v hevc_qsv -i 1920x1080_10bit_60fps.h265 -vf scale_qsv=w=300:h=300,hwdownload,format=p010 -f null -
    
  4. Decode + VPP with VA-API:
    $ ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i 1920x1080_10bit_60fps.h265 -vf scale_vaapi=w=300:h=300,hwdownload,format=p010 -f null -
    
  5. Decode + VPP with MediaSDK sample app:
    $ sample_decode -hw h265 -w 300 -h 300 -p010 -i 1920x1080_10bit_60fps.h265 -o /dev/null
    

Expected output:

  • Similar performance for all 3 cases

Actual outcome:

  • MediaSDK and VA-API performance is within a few percent of each other
  • QSV performance is only a fraction of MediaSDK and VA-API performance, at best ~30%
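
The percentages above can be derived from the "speed=" value that ffmpeg prints in its progress line. The following is a minimal sketch, assuming the ffmpeg stderr output of each run was saved to a log file; the function names `last_speed` and `speed_percent` are made up for this illustration and are not part of FFmpeg.

```shell
# Extract the last "speed=<N>x" value from an ffmpeg log read on stdin.
# ffmpeg separates progress updates with carriage returns, so those are
# turned into newlines first.
last_speed() {
    tr '\r' '\n' | sed -n 's/.*speed= *\([0-9.][0-9.]*\)x.*/\1/p' | tail -n 1
}

# Print speed $1 as a rounded percentage of speed $2.
speed_percent() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", 100 * a / b }'
}

# Example usage (log file names are assumptions):
#   speed_percent "$(last_speed < qsv.log)" "$(last_speed < vaapi.log)"
```

The logs can be captured by appending e.g. "2> qsv.log" to the commands in the steps to reproduce.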

When looking at the GPU information:

  • GPU runs at minimum freq with QSV, but at max with others
  • despite this, video engine is only half utilized with QSV, fully utilized with others

As QSV uses less CPU than the other two cases (CPU utilization percentage is nearly the same, but according to RAPL, CPU core power usage is much smaller with QSV), the issue could be some extra synchronization between CPU & GPU in the FFmpeg QSV backend.
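
The GPU frequency observation can be checked from the i915 sysfs files while a pipeline is running. This is a rough sketch, assuming the Intel GPU is card0; the function name `gpu_freqs` and the default directory are assumptions for this illustration.

```shell
# Print current, minimum and maximum GPU frequency (MHz) from i915 sysfs.
# The sysfs directory is an optional argument; card0 is an assumed default.
gpu_freqs() {
    dir="${1:-/sys/class/drm/card0}"
    printf 'cur=%s min=%s max=%s\n' \
        "$(cat "$dir/gt_cur_freq_mhz")" \
        "$(cat "$dir/gt_min_freq_mhz")" \
        "$(cat "$dir/gt_max_freq_mhz")"
}

# Example: sample once per second while ffmpeg runs in another terminal.
#   while sleep 1; do gpu_freqs; done
```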

QSV outputs the following errors at the beginning:

Stream mapping:
  Stream #0:0 -> #0:0 (hevc (hevc_qsv) -> wrapped_avframe (native))
Press [q] to stop, [?] for help
[NULL @ 0x556f79e64840] missing picture in access unit with size 1
    Last message repeated 1 times
[hevc_qsv @ 0x556f79e75dc0] A decode call did not consume any data: expect more data at input (-10)
[NULL @ 0x556f79e64840] missing picture in access unit with size 1
[hevc_qsv @ 0x556f79e75dc0] A decode call did not consume any data: expect more data at input (-10)
[NULL @ 0x556f79e64840] missing picture in access unit with size 1
[hevc_qsv @ 0x556f79e75dc0] A decode call did not consume any data: expect more data at input (-10)
[NULL @ 0x556f79e64840] missing picture in access unit with size 1
    Last message repeated 2 times
Output #0, null, to 'pipe:':

And a huge number of the following warnings during the rest of the pipeline:

[NULL @ 0x556f79e64840] missing picture in access unit with size 1peed=7.51x    
    Last message repeated 224 times
[NULL @ 0x556f79e64840] missing picture in access unit with size 1peed=7.49x    
    Last message repeated 262 times

Change History (14)

comment:1 Changed 8 months ago by lizhong1008

Do you have performance data for the H.264 or 8-bit HEVC cases?

comment:2 follow-up: Changed 8 months ago by eero-t

I transcoded the 10-bit video used in this ticket to 8-bit (with Ubuntu 18.04 FFmpeg):
ffmpeg -i 1920x1080_10bit_60fps.h265 -pix_fmt yuv420p 1920x1080_8bit_60fps.h265

And repeated the above test cases using that video, with the hwdownload format changed to nv12.

=> Results are similar: the VA-API backend with both iHD & i965 drivers, and the MediaSDK sample application, are all 3x-4x faster than QSV.

Btw. I had earlier tried upscaling 1920x540 8-bit HEVC to 1920x1080 and doing hwdownload to nv12. In that case QSV was also clearly slower than the other alternatives, but the perf gap was smaller, "only" about 40-60%.

PS. 10-bit cases should also work on BXT, but there's a bug with scaling, see: https://github.com/intel/media-driver/issues/499

comment:3 Changed 8 months ago by cehoyos

  • Keywords qsv added

comment:4 follow-up: Changed 8 months ago by oromit

QSV on Linux is just a wrapper around a wrapper around VAAPI.
There is really no reason to use it, especially as it's lacking a lot of features VAAPI (and Windows QSV) has, and apparently also performs notably worse.

comment:5 in reply to: ↑ 4 Changed 8 months ago by lizhong1008

Replying to oromit:

There is really no reason to use it, especially as it's lacking a lot of features VAAPI (and Windows QSV) has

Would you please specify which features are lacking compared with vaapi and Windows QSV?

comment:7 in reply to: ↑ 6 Changed 8 months ago by lizhong1008

Replying to oromit:

http://git.videolan.org/?p=ffmpeg.git;a=blob;f=libavcodec/qsvenc.h#l49

None of these can be supported by vaapi either, and QVBR has been enabled; the patch is under review: https://patchwork.ffmpeg.org/patch/11222/

If we just talk about Linux and compare ffmpeg-vaapi with ffmpeg-qsv, I like ffmpeg-vaapi's decoder too. But for encoding, qsv supports more features, such as Look_ahead and ICQ, which can provide better quality, and MFE, which can provide better performance.

comment:8 in reply to: ↑ 2 Changed 8 months ago by lizhong1008

Replying to eero-t:

I transcoded the 10-bit video used in this ticket to 8-bit (with Ubuntu 18.04 FFmpeg):
ffmpeg -i 1920x1080_10bit_60fps.h265 -pix_fmt yuv420p 1920x1080_8bit_60fps.h265

And repeated the above test cases using that video, with the hwdownload format changed to nv12.

=> Results are similar: the VA-API backend with both iHD & i965 drivers, and the MediaSDK sample application, are all 3x-4x faster than QSV.

I suppose this is an issue with QSV hwdownload. FFmpeg-vaapi can get an image directly via vaDeriveImage() (https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/hwcontext_vaapi.c#L788). But MSDK possibly goes via vaGetImage() (https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/hwcontext_vaapi.c#L815), which makes a copy and causes a performance drop. The MSDK sample works in system memory, so it needs no hwdownload step.

I will take a deeper look and get back to this issue.

comment:9 Changed 8 months ago by eero-t

There is really no reason to use it, especially as it's lacking a lot of features VAAPI (and Windows QSV) has, and apparently also performs notably worse.

In FFmpeg transcode operations, their performance seems on par: depending on the case and HW, sometimes VA-API is marginally faster, sometimes QSV. In most cases both are a bit faster than the old i965 driver, but not always (because the old i965 and new iHD drivers seem to split work differently between compute and video engines).

From the feature support side, VA-API has issue #7650, which isn't a problem with QSV.

I suppose this is an issue with QSV hwdownload

I'll do some additional comparisons for doing just decoding, and doing decoding + downscaling, to see how it impacts the perf gap, and report the results tomorrow.

comment:10 Changed 8 months ago by eero-t

I tried running multiple FFmpeg pipeline processes in parallel.

With 2 processes, GPU is already running at full speed and with 3-4 processes video engine is 100% utilized.

While there's still a clear performance gap to VA-API with multiple processes on KBL GT2, the gap is *much* smaller (<10%) than with a single pipeline.

So, the main performance issue seems to be the QSV backend somehow managing to fool the 4.20 kernel (P-state powersave governor) into thinking that QSV isn't GPU bound, and keeping the GPU at its lowest speed, on all platforms I was able to test.
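
The parallel runs described in this comment can be scripted along these lines. This is only a sketch; `run_parallel` is a hypothetical helper name, not something FFmpeg provides.

```shell
# Run N copies of a command in parallel and wait for all of them.
# Returns non-zero if any copy failed.
run_parallel() {
    n="$1"; shift
    pids=""
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" &
        pids="$pids $!"
        i=$((i + 1))
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Example: three QSV pipelines at once, using the command from the ticket.
#   run_parallel 3 ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 \
#       -c:v hevc_qsv -i 1920x1080_10bit_60fps.h265 \
#       -vf scale_qsv=w=300:h=300,hwdownload,format=p010 -f null -
```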

comment:11 Changed 8 months ago by eero-t

If I force the KBL GT2 GPU to max speed (gt_min_freq_mhz = gt_max_freq_mhz), and run just a single pipeline, QSV performance doubles, but there's still a 40-50% gap to the other options.
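
Forcing the GPU to max speed as described can be done by writing to the i915 sysfs files. A minimal sketch, assuming the GPU is card0 and the shell runs as root; the function names `gpu_force_max` and `gpu_restore_min` are made up for this illustration.

```shell
# Raise the GPU minimum frequency to the maximum frequency.
# Prints the previous minimum so that it can be restored afterwards.
gpu_force_max() {
    dir="${1:-/sys/class/drm/card0}"
    old_min=$(cat "$dir/gt_min_freq_mhz")
    cat "$dir/gt_max_freq_mhz" > "$dir/gt_min_freq_mhz"
    echo "$old_min"
}

# Restore a previously saved minimum frequency ($1), optional sysfs dir ($2).
gpu_restore_min() {
    dir="${2:-/sys/class/drm/card0}"
    echo "$1" > "$dir/gt_min_freq_mhz"
}

# Example:
#   old=$(gpu_force_max); ffmpeg ...; gpu_restore_min "$old"
```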

Same performance gap between QSV & VA-API backends is visible also when doing just decode (or decode+downscale):

ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v hevc_qsv -i 1920x1080_10bit_60fps.h265 -f null -

I guess the difference is much smaller when running parallel pipelines because things hit another bottleneck that's common to all drivers (100% video engine utilization), thereby diminishing the impact of the QSV pipeline issue visible when running just one pipeline.

Note: I've earlier seen VA-API being much slower than QSV when doing 1920x540 8-bit HEVC decode. I need to retest that, in case there's been some recent change in the kernel, MediaSDK or FFmpeg, or in case things are resolution dependent.

comment:12 Changed 8 months ago by eero-t

More testing with KBL-i7 GT2 (and latest Git versions of everything).

With 8-bit 1920x540 HEVC decode, QSV is clearly faster than VA-API:

ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v hevc_qsv -i 1920x540_60_yuv420p_4800.h265 -f null -
...
ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -i 1920x540_60_yuv420p_4800.h265 -y -f null -

Whereas with (8-bit & 10-bit) 1920x1080 HEVC decode, QSV was clearly slower than VA-API.
In both cases, also when GPU is forced to run at max speed.

In the 1920x540 case, QSV gets (clearly) slower than VA-API when something is done to the decoded data. If I do a VPP upscale from the decoded 1920x540 to 1920x1080, then, unlike plain 1920x540 decoding, QSV performance is also impacted by kernel power management (perf is much lower when the GPU isn't forced to max).

To summarize findings so far:

  • Resolution determines whether (HEVC) decoding is slower with the QSV or the VA-API backend
  • At larger resolutions, VPP operations with the QSV backend are slower than with VA-API
  • Depending on what the pipeline does and at what resolutions, the QSV backend can fool the kernel into lowering GPU speed so that it's much slower than VA-API (when only a single pipeline is running). IMHO this is the larger of the issues, but it could be related to the VPP issue

I suppose this is an issue with QSV hwdownload. FFmpeg-vaapi can get an image directly via vaDeriveImage() (https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/hwcontext_vaapi.c#L788). But MSDK possibly goes via vaGetImage() (https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/hwcontext_vaapi.c#L815), which makes a copy and causes a performance drop

...

I will take a deeper look and get back to this issue.

Any conclusions? Something like that might explain operations being synchronous enough that kernel power management doesn't consider the use-case GPU bound (enough).

comment:13 Changed 8 months ago by eero-t

Binding FFmpeg QSV process to a single core will also help in keeping the GPU speed up (e.g. with "taskset 0x1 ffmpeg ...").
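
As a small sketch of the pinning described above: the mask given to taskset is a bit mask with one bit per CPU, so 0x1 means CPU 0. The helper names `cpu_mask` and `run_pinned` are hypothetical and only illustrate the idea.

```shell
# Print the taskset affinity mask for a single CPU index (0 -> 0x1, 3 -> 0x8).
cpu_mask() {
    printf '0x%x\n' "$((1 << $1))"
}

# Run a command bound to the given CPU index (requires taskset from util-linux).
run_pinned() {
    core="$1"; shift
    taskset "$(cpu_mask "$core")" "$@"
}

# Example: run_pinned 0 ffmpeg -hwaccel qsv ...
```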

I ftraced kernel task migrations and those aren't the problem. I guess taskset helps because having all FFmpeg/QSV threads on the same core helps keep the CPU speed up, which in turn can help keep the GPU speed up (this avoids a vicious cycle where one side being slow and waited on causes the frequency of the other side to be dropped, which is then slow when control switches between GPU and CPU).

When tracing per-frame vaBeginPicture() calls done by QSV & VA-API backends, I noticed that QSV backend does all of them from a single thread, whereas VA-API backend splits those calls evenly across all (5) threads. Work split differences like that can certainly explain differences in how kernel power management handles these cases...
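
A per-thread call count like the one described above could be produced roughly as follows. The ticket does not say which tracing tool was used; this sketch assumes, purely for illustration, a trace where each line carries a "[pid N]" prefix (as "ltrace -f" prints), and `calls_per_thread` is a made-up helper name.

```shell
# Count vaBeginPicture calls per thread from a trace read on stdin.
# Assumed input format: lines such as "[pid 1234] vaBeginPicture(...)".
# Output: one "<thread id> <call count>" line per thread.
calls_per_thread() {
    grep 'vaBeginPicture' \
        | sed -n 's/.*\[pid \([0-9][0-9]*\)\].*/\1/p' \
        | sort -n | uniq -c | awk '{ print $2, $1 }'
}

# Example (trace file name is an assumption):
#   calls_per_thread < qsv-trace.txt
```

A single output line would match the QSV behaviour reported here, whereas several lines with similar counts would match the VA-API behaviour.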

comment:14 Changed 8 months ago by eero-t

Btw. this is the source for the converted video http://4ksamples.com/ses-astra-uhd-test-1-2160p-uhdtv/.
