Opened 5 years ago

Last modified 3 years ago

#7690 new defect

FFmpeg QSV decode + VPP performance is just a fraction of what one gets with VA-API and MediaSDK

Reported by: eero-t Owned by:
Priority: normal Component: undetermined
Version: git-master Keywords: qsv
Cc: Blocked By:
Blocking: Reproduced by developer: no
Analyzed by developer: no

Description

Summary of the bug: 10-bit HEVC decode + downscale + download with the QSV backend reaches only 20-30% of the performance of the VA-API backend, or of the Intel MediaSDK sample application.

Setup:

  • Distro: Ubuntu 18.04
  • FFmpeg: latest compiled from Git
  • MediaSDK & its deps: latest compiled from Git
  • HW: tested on KBL GT2, KBL GT3e and CFL GT2

Steps to reproduce:

  1. Encoding 1080p 10-bit HEVC test input video with libx265:
    $ ffmpeg -i 4k_uhd_hevc_10bit_60fps.mkv -frames 4800 -s 1920x1080 -pix_fmt yuv420p10le -x265-params level=5.2 1920x1080_10bit_60fps.h265
    
  2. export LIBVA_DRIVER_NAME=iHD
  3. Decode + VPP with QSV for that video:
    $ ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v hevc_qsv -i 1920x1080_10bit_60fps.h265 -vf scale_qsv=w=300:h=300,hwdownload,format=p010 -f null -
    
  4. Decode + VPP with VA-API:
    $ ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i 1920x1080_10bit_60fps.h265 -vf scale_vaapi=w=300:h=300,hwdownload,format=p010 -f null -
    
  5. Decode + VPP with MediaSDK sample app:
    $ sample_decode -hw h265 -w 300 -h 300 -p010 -i 1920x1080_10bit_60fps.h265 -o /dev/null
    

Expected output:

  • Similar performance for all 3 cases

Actual outcome:

  • MediaSDK and VA-API performance is within a few percent of each other
  • QSV performance is only a fraction of MediaSDK and VA-API performance, at best ~30%
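
For the two FFmpeg cases, one way to put numbers on this comparison is to pull the last speed= value out of ffmpeg's progress output. A minimal sketch: the helper name last_speed is made up for illustration, and it assumes progress lines that end in speed=<number>x.

```shell
# Hypothetical helper: print the last "speed=" figure seen on stdin.
# ffmpeg separates progress updates with carriage returns, so those are
# turned into newlines before matching.
last_speed() {
    tr '\r' '\n' |
        sed -n 's/.*speed= *\([0-9][0-9.]*\)x.*/\1/p' |
        tail -n 1
}

# Usage (ffmpeg writes its progress to stderr, hence 2>&1):
#   ffmpeg -hwaccel qsv ... -f null - 2>&1 | last_speed
```

The MediaSDK sample application reports its results differently, so this only covers the QSV and VA-API runs.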

When looking at the GPU information:

  • GPU runs at minimum freq with QSV, but at max with others
  • despite this, video engine is only half utilized with QSV, fully utilized with others
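
The report does not say which tool showed the GPU frequency. One possible way to watch it is to read the i915 frequency files in sysfs while a pipeline runs. A sketch under assumptions: the card0 path and the helper name gpu_freq_snapshot are illustrative, and the card index may differ per system.

```shell
# Print the i915 frequency files for one card.
# $1: sysfs directory of the card (card0 is an assumption).
gpu_freq_snapshot() {
    dir=${1:-/sys/class/drm/card0}
    for f in gt_min_freq_mhz gt_cur_freq_mhz gt_act_freq_mhz gt_max_freq_mhz; do
        # Skip files that do not exist or are not readable.
        if [ -r "$dir/$f" ]; then
            printf '%s=%s\n' "$f" "$(cat "$dir/$f")"
        fi
    done
}

# Usage: sample once per second in another terminal while ffmpeg runs:
#   while sleep 1; do gpu_freq_snapshot; echo; done
```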

As QSV uses less CPU than the other two cases (CPU utilization percentage is nearly the same, but according to RAPL, CPU core power usage is much smaller with QSV), the issue could be some extra synchronization between CPU & GPU in the FFmpeg QSV backend.
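
The RAPL comparison boils down to sampling an energy counter twice and dividing by the elapsed time. A sketch, assuming the Linux powercap sysfs interface is used to read RAPL (the report does not say which tool was used); avg_watts is a made-up helper name, and the zone path in the usage comment is an assumption.

```shell
# Average power in watts between two energy_uj readings (microjoules)
# taken $3 seconds apart. Counter wrap-around is ignored in this sketch.
avg_watts() {
    awk -v a="$1" -v b="$2" -v s="$3" \
        'BEGIN { printf "%.2f\n", (b - a) / 1e6 / s }'
}

# Usage sketch (check the zone's "name" file to find the core domain;
# reading energy_uj may need root):
#   zone=/sys/class/powercap/intel-rapl:0:0
#   e1=$(cat "$zone/energy_uj"); sleep 10; e2=$(cat "$zone/energy_uj")
#   avg_watts "$e1" "$e2" 10
```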

QSV outputs the following errors at the beginning:

Stream mapping:
  Stream #0:0 -> #0:0 (hevc (hevc_qsv) -> wrapped_avframe (native))
Press [q] to stop, [?] for help
[NULL @ 0x556f79e64840] missing picture in access unit with size 1
    Last message repeated 1 times
[hevc_qsv @ 0x556f79e75dc0] A decode call did not consume any data: expect more data at input (-10)
[NULL @ 0x556f79e64840] missing picture in access unit with size 1
[hevc_qsv @ 0x556f79e75dc0] A decode call did not consume any data: expect more data at input (-10)
[NULL @ 0x556f79e64840] missing picture in access unit with size 1
[hevc_qsv @ 0x556f79e75dc0] A decode call did not consume any data: expect more data at input (-10)
[NULL @ 0x556f79e64840] missing picture in access unit with size 1
    Last message repeated 2 times
Output #0, null, to 'pipe:':

And a huge number of the following warnings during the rest of the pipeline:

[NULL @ 0x556f79e64840] missing picture in access unit with size 1peed=7.51x    
    Last message repeated 224 times
[NULL @ 0x556f79e64840] missing picture in access unit with size 1peed=7.49x    
    Last message repeated 262 times

Change History (18)

comment:1 by Zhong,Li, 5 years ago

Have you got the performance data of H264 or 8bit hevc cases?

comment:2 by eero-t, 5 years ago

I transcoded the 10-bit video used in this ticket to 8-bit (with Ubuntu 18.04 FFmpeg):
ffmpeg -i 1920x1080_10bit_60fps.h265 -pix_fmt yuv420p 1920x1080_8bit_60fps.h265

And repeated the above test-cases using that video, with the hwdownload format changed to nv12.

=> Results are similar, iHD & i965 drivers VA-API backend, and MediaSDK sample application are all 3x-4x faster than using QSV.

Btw. I had earlier tried upscaling 1920x540 8-bit HEVC to 1920x1080 and doing hwdownload to nv12. In that case QSV was also clearly slower than the other alternatives, but the perf gap was smaller, "only" about 40-60%.

PS. 10bit cases should work also on BXT, but there's some bug with scaling, see: https://github.com/intel/media-driver/issues/499

comment:3 by Carl Eugen Hoyos, 5 years ago

Keywords: qsv added

comment:4 by Timo R., 5 years ago

QSV on Linux is just a wrapper around a wrapper around VAAPI.
There is really no reason to use it, especially as it's lacking a lot of features VAAPI (and Windows QSV) has, and apparently also performs notably worse.

in reply to:  4 comment:5 by Zhong,Li, 5 years ago

Replying to oromit:

There is really no reason to use it, especially as it's lacking a lot of features VAAPI (and Windows QSV) has

Would you please specify which features are lacking compared with vaapi and Windows QSV?

in reply to:  6 comment:7 by Zhong,Li, 5 years ago

Replying to oromit:

http://git.videolan.org/?p=ffmpeg.git;a=blob;f=libavcodec/qsvenc.h#l49

None of these can be supported by vaapi either, and QVBR has been enabled; the patch is under review: https://patchwork.ffmpeg.org/patch/11222/

If we just talk about Linux and compare ffmpeg-vaapi with ffmpeg-qsv, I like ffmpeg-vaapi's decoder too. But for encoding, qsv supports more features, such as Look_ahead and ICQ, which can provide better quality, and MFE, which can provide better performance.

in reply to:  2 comment:8 by Zhong,Li, 5 years ago

Replying to eero-t:

I transcoded the 10-bit video used in this ticket to 8-bit (with Ubuntu 18.04 FFmpeg):
ffmpeg -i 1920x1080_10bit_60fps.h265 -pix_fmt yuv420p 1920x1080_8bit_60fps.h265

And repeated the above test-cases using that video, with the hwdownload format changed to nv12.

=> Results are similar, iHD & i965 drivers VA-API backend, and MediaSDK sample application are all 3x-4x faster than using QSV.

I suppose this is an issue with QSV hwdownload. FFmpeg-vaapi can get an image directly via vaDeriveImage() (https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/hwcontext_vaapi.c#L788). But MSDK possibly goes via vaGetImage() (https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/hwcontext_vaapi.c#L815), which makes a copy and causes the performance drop. The MSDK sample works in system memory, so it needs no hwdownload step.

I will take a deeper look and get back to this issue.

comment:9 by eero-t, 5 years ago

There is really no reason to use it, especially as it's lacking a lot of features VAAPI (and Windows QSV) has, and apparently also performs notably worse.

In FFmpeg transcode operations, their performance seems on par, depending on case and HW, sometimes VA-API is marginally faster, sometimes QSV. In most cases both are a bit faster than the old i965 driver, but not always (because old i965 and new iHD driver seem to split work differently between compute and video engines).

From the feature support side, VA-API has issue #7650, which isn't a problem with QSV.

I suppose this is an issue with QSV hwdownload

I'll do some additional comparisons for doing just decoding, and doing decoding + downscaling, to see how it impacts the perf gap, and report the results tomorrow.

comment:10 by eero-t, 5 years ago

I tried running multiple FFmpeg pipeline processes in parallel.

With 2 processes, GPU is already running at full speed and with 3-4 processes video engine is 100% utilized.

While there's still clear performance gap to VA-API with multiple processes on KBL GT2, the gap is *much* smaller (<10%) than with a single pipeline.

So, the main performance issue seems to be the QSV backend somehow managing to fool the 4.20 kernel (P-state powersave governor) into thinking that QSV isn't GPU bound, and keeping the GPU at its lowest speed, on all platforms I was able to test.
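
Running several pipelines in parallel, as described above, can be scripted with a small helper. A sketch: run_parallel is a made-up name, and the usage comment assumes ffmpeg's -nostdin option is wanted so that background copies do not compete for the terminal.

```shell
# Start N copies of a command in the background and wait for all of them.
# $1: number of copies, remaining arguments: the command to run.
# Returns non-zero if any copy failed.
run_parallel() {
    n=$1
    shift
    pids=""
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" &
        pids="$pids $!"
        i=$((i + 1))
    done
    rc=0
    for pid in $pids; do
        wait "$pid" || rc=1
    done
    return "$rc"
}

# Usage:
#   run_parallel 3 ffmpeg -nostdin -hwaccel qsv ... -f null -
```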

comment:11 by eero-t, 5 years ago

If I force the KBL GT2 GPU to max speed (gt_min_freq_mhz = gt_max_freq_mhz), and run just a single pipeline, QSV performance doubles, but there's still a 40-50% gap to the other options.
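
Setting gt_min_freq_mhz = gt_max_freq_mhz can be done through the i915 sysfs files. A minimal sketch, assuming the GPU is card0 (this may differ) and that the script runs as root; pin_gpu_freq_to_max is a made-up helper name.

```shell
# Raise the minimum GPU frequency to the maximum so the GPU cannot clock
# down. Prints the previous minimum so it can be restored afterwards.
# $1: sysfs directory of the card (card0 is an assumption).
pin_gpu_freq_to_max() {
    dir=${1:-/sys/class/drm/card0}
    old_min=$(cat "$dir/gt_min_freq_mhz") || return 1
    max=$(cat "$dir/gt_max_freq_mhz") || return 1
    echo "$max" > "$dir/gt_min_freq_mhz" || return 1
    echo "$old_min"
}

# Usage:
#   old=$(pin_gpu_freq_to_max)
#   ... run the benchmark ...
#   echo "$old" > /sys/class/drm/card0/gt_min_freq_mhz   # restore
```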

Same performance gap between QSV & VA-API backends is visible also when doing just decode (or decode+downscale):

ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v hevc_qsv -i 1920x1080_10bit_60fps.h265 -f null -

I guess when running parallel pipelines, the difference is much smaller because things hit another bottleneck that's common to all drivers (100% video engine utilization), thereby diminishing the impact of the QSV pipeline issue visible when running just one pipeline.

Note: I've earlier seen VA-API being much slower than QSV when doing 1920x540 8-bit HEVC decode. I need to retest that in case there's been some recent change in the kernel, MediaSDK or FFmpeg, or in case things are resolution dependent.

comment:12 by eero-t, 5 years ago

More testing with KBL-i7 GT2 (and latest Git versions of everything).

With 8-bit 1920x540 HEVC decode, QSV is clearly faster than VA-API:

ffmpeg  -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v hevc_qsv -i 1920x540_60_yuv420p_4800.h265 -f null -
...
ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -i 1920x540_60_yuv420p_4800.h265 -y -f null -

Whereas with (8-bit & 10-bit) 1920x1080 HEVC decode, QSV was clearly slower than VA-API.
In both cases, also when GPU is forced to run at max speed.

In the 1920x540 case, QSV gets (clearly) slower than VA-API when something is done to the decoded data. If I do VPP upscale from decoded 1920x540 to 1920x1080, then unlike with plain 1920x540 decoding, QSV perf is also impacted by kernel power management (it is much lower when the GPU isn't forced to max).

To summarize findings so far:

  • Resolution impacts whether doing (HEVC) decoding is slower with QSV or VA-API backends
  • In larger resolutions, VPP operations with QSV backend are slower than with VA-API
  • Depending on what the pipeline does and at what resolutions, the QSV backend can fool the kernel into lowering GPU speed so that it's much slower than with VA-API (when only a single pipeline is running at a time). IMHO this is the larger of the issues, but it could be related to the VPP issue

I suppose this is an issue with QSV hwdownload. FFmpeg-vaapi can get an image directly via vaDeriveImage() (https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/hwcontext_vaapi.c#L788). But MSDK possibly goes via vaGetImage() (https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/hwcontext_vaapi.c#L815), which makes a copy and causes the performance drop

...

I will take a deeper look and get back to this issue.

Any conclusions? Something like that might explain operations being synchronous enough that kernel power management doesn't consider the use-case to be GPU bound (enough).

comment:13 by eero-t, 5 years ago

Binding FFmpeg QSV process to a single core will also help in keeping the GPU speed up (e.g. with "taskset 0x1 ffmpeg ...").

I ftraced kernel task migrations and those aren't the problem. I guess taskset helps because having all FFmpeg/QSV threads on the same core helps keep the CPU speed up, which in turn can help keep the GPU speed up (you avoid the vicious cycle where one side being slow and waited on causes the frequency of the other side to be dropped, which is then slow when control switches between GPU and CPU).

When tracing per-frame vaBeginPicture() calls done by QSV & VA-API backends, I noticed that QSV backend does all of them from a single thread, whereas VA-API backend splits those calls evenly across all (5) threads. Work split differences like that can certainly explain differences in how kernel power management handles these cases...
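
The comment does not say which tracer was used. Assuming a trace whose lines start with a "[pid N]" prefix (the per-thread prefix that ltrace -f and strace -f print), the per-thread split can be counted like this; calls_per_thread is a made-up helper name.

```shell
# Count how often a function name appears per thread in a trace on stdin.
# Assumes lines of the form "[pid 1234] name(...)"; prints "tid count".
# $1: function name to count, e.g. vaBeginPicture.
calls_per_thread() {
    awk -v fn="$1" '
        index($0, fn) && $1 == "[pid" {
            tid = $2
            sub(/\]$/, "", tid)   # strip the closing bracket from "1234]"
            count[tid]++
        }
        END { for (tid in count) print tid, count[tid] }
    ' | sort -n
}

# Usage:
#   calls_per_thread vaBeginPicture < trace.log
```

A single output line would match the QSV behaviour described above, while several lines with similar counts would match the VA-API behaviour.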

comment:14 by eero-t, 5 years ago

Btw. this is the source for the converted video http://4ksamples.com/ses-astra-uhd-test-1-2160p-uhdtv/.

in reply to:  12 comment:15 by Linjie.Fu, 5 years ago

Hi eero-t:

The performance evaluation may be a bit confusing, and there is a related discussion in MSDK about this performance issue:
https://github.com/Intel-Media-SDK/MediaSDK/issues/1550

With 8-bit 1920x540 HEVC decode, QSV is clearly faster than VA-API:

ffmpeg  -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v hevc_qsv -i 1920x540_60_yuv420p_4800.h265 -f null -
...
ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -i 1920x540_60_yuv420p_4800.h265 -y -f null -

The above comparison is not fair.

For VAAPI, "-f null -" means no copy from video surface to system memory.
For QSV, even if "-f null -" is set, there is memory copy from video surface to system memory internally in MSDK:

1) App initializes MSDK to produce system memory.
2) MSDK internally decodes to video memory and then internally makes a copy from video memory to system memory. It can be done by sw copy ("vaDeriveImage->vaMapBuffer->memcpy->vaUnmapBuffer->vaDestroyImage") or GPUCopy.
3) Application gets system memory.

That's the root cause for

  • Resolution impacts whether doing (HEVC) decoding is slower with QSV or VA-API backends
    • In larger resolutions, VPP operations with QSV backend are slower than with VA-API

(VAAPI without copy, but QSV with copy)

The performance gap is related to copying video memory to system memory.

For qsv, the best performance may be

  1. gpucopy for Tile surface data (like nv12):
     http://git.ffmpeg.org/gitweb/ffmpeg.git/commit/5345965b3f088ad5acd5151bec421c97470675a4
  2. hwmap=mode=direct (if possible) for Linear surface data, derive data in the surface and use it directly to avoid any memory copy.
  3. hwdownload

You can compare the results when actually writing output to /dev/null to evaluate the performance.

comment:16 by eero-t, 5 years ago

In the original bug report, video is scaled so small that output copy doesn't matter.

Let's let ticket #7943 be about the extra copy (related to later comments in this ticket), and this one about the issue in the original bug report.

I just checked it with the latest media-driver and FFmpeg git versions, and one still needs both to:

  • bind FFmpeg to single core (prefix FFmpeg with "taskset 0x1")
  • set fixed GPU frequency

to get QSV performance close to VA-API performance. Without those, QSV / VPP performance is half of the corresponding VA-API performance.

comment:17 by eero-t, 4 years ago

Since I last commented on this, there have been some noticeable changes to the performance of the different APIs, but unfortunately they're all over the place, and too device / platform specific, for me to be able to draw any conclusions from them.

Btw. #7943 is now mainly about the 2x QSV memory usage, so I wonder whether gpucopy support should have another ticket?

comment:18 by wenbin,chen, 3 years ago

Now I tried the two backends and they have similar performance (qsv fps: 554, vaapi fps: 615). The gap may be caused by the memory copy mentioned by linjie.
