Opened 5 years ago

Last modified 20 months ago

#7943 reopened enhancement

FFmpeg QSV backend uses >2x more GPU memory compared to VAAPI or MSDK

Reported by: eero-t
Owned by:
Priority: normal
Component: undetermined
Version: git-master
Keywords:
Cc: linjie.fu@intel.com
Blocked By:
Blocking:
Reproduced by developer: no
Analyzed by developer: no

Description

Summary of the bug:

GPU-accelerated video transcoding can use GBs of RAM, but it's held in DRI/GEM objects, which don't show up in normal process statistics, can't be limited, can't be swapped out, and can therefore easily cause OOM-kill havoc in the rest of the system.

The FFmpeg QSV backend uses far too much of these resources.

How to reproduce:

  1. Monitor GEM object usage:

# watch cat /sys/kernel/debug/dri/0/i915_gem_objects

  2. Transcode with FFmpeg / QSV:

$ LIBVA_DRIVER_NAME=iHD ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v hevc_qsv -i Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265 -c:v hevc_qsv -b:v 20M -async_depth 4 output.h265

  3. Transcode with the MediaSDK sample app:

$ LIBVA_DRIVER_NAME=iHD sample_multi_transcode -i::h265 Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265 -o::h265 output.h265 -b 20000 -async 4 -hw

  4. Transcode with FFmpeg / VAAPI:

$ LIBVA_DRIVER_NAME=iHD ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265 -c:v hevc_vaapi -b:v 20M output.h265

Expected outcome:

  • Both FFmpeg backends use about the same amount of GPU resources, since they do roughly the same thing, and that resource usage is reasonable
  • "Reasonable" being along the lines of some (tens of) frames needed for prediction, i.e. 4K * 16-bit * frames = a few hundred MBs (see the sketch below)
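
For reference, a back-of-the-envelope version of that estimate as a small C sketch (an illustration; assumes P010-style storage, i.e. 10-bit 4:2:0 samples held in 16-bit containers):

/* Rough estimate of "reasonable" GPU memory for reference frames.
 * Assumption: P010 layout (10-bit 4:2:0 stored as 16-bit samples). */
#include <stdio.h>

int main(void)
{
    long w = 4096, h = 2160;
    long bytes_per_frame = w * h * 2 * 3 / 2; /* luma + 2 quarter-size chroma planes, 2 bytes/sample */
    int  frames = 16;                         /* roughly a DPB's worth of frames */

    printf("%.1f MB/frame, %.1f MB for %d frames\n",
           bytes_per_frame / 1e6, frames * bytes_per_frame / 1e6, frames);
    /* Prints: 26.5 MB/frame, 424.7 MB for 16 frames */
    return 0;
}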

Actual outcome:

  • VAAPI backend uses ~1.1GB of GEM resources
  • QSV backend uses ~2.5GB of GEM resources, more than double
  • MSDK app uses ~1.1GB of GEM resources like VAAPI backend, so the QSV backend issue doesn't seem to be due to MSDK

I tried playing with the -async/-async_depth options: I was able to raise the MSDK app's GEM object usage to 2.5GB with "-async 20", but even with "-async_depth 1" the QSV backend only dropped to 2.2GB of GEM objects, so it must be something else.

Change History (10)

comment:1 by Linjie.Fu, 5 years ago

This seems to be similar to the performance gap issue #7690.

There is an internal memory allocation and memory copy in the MSDK core library (which may not be involved in sample_multi_transcode):
CommonCORE::DoFastCopyExtended

https://github.com/Intel-Media-SDK/MediaSDK/blob/master/_studio/shared/src/libmfx_core.cpp#L1522

Thus the memory usage is twice as large.

https://github.com/Intel-Media-SDK/MediaSDK/issues/1550#issuecomment-515417010

It can be verified with a gpu_copy decode, which does not allocate memory internally:

Allocating memory internally:
ffmpeg -y -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v hevc_qsv -an -i SES.Astra.UHD.Test.1.2160p.UHDTV.AAC.HEVC.x265-LiebeIst.mkv -vf hwdownload,format=p010 -f rawvideo /dev/null

ffmpeg: 155 objects, 1680379904 bytes
1.6 GB

No internal memory allocation with gpu_copy:
ffmpeg -y -qsv_device /dev/dri/renderD128 -gpu_copy on -c:v hevc_qsv -an -i SES.Astra.UHD.Test.1.2160p.UHDTV.AAC.HEVC.x265-LiebeIst.mkv -f rawvideo /dev/null

ffmpeg: 140 objects, 393277440 bytes
400 MB

The memory gap is obvious.
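
(For completeness, a minimal sketch of enabling the same option through the libavcodec API; this assumes gpu_copy is exposed as a QSV decoder private option, mirroring the -gpu_copy CLI flag used above:)

/* Minimal sketch: request QSV GPU copy via the libavcodec API.
 * Assumes "gpu_copy" is a private option of the hevc_qsv decoder,
 * matching the -gpu_copy command line flag. */
#include <libavcodec/avcodec.h>
#include <libavutil/opt.h>

AVCodecContext *open_qsv_decoder(void)
{
    const AVCodec *dec = avcodec_find_decoder_by_name("hevc_qsv");
    AVCodecContext *ctx = dec ? avcodec_alloc_context3(dec) : NULL;
    if (!ctx)
        return NULL;

    /* Copy decoded frames from video to system memory on the GPU. */
    av_opt_set(ctx->priv_data, "gpu_copy", "on", 0);

    if (avcodec_open2(ctx, dec, NULL) < 0) {
        avcodec_free_context(&ctx);
        return NULL;
    }
    return ctx;
}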

comment:2 by Linjie.Fu, 5 years ago

Cc: linjie.fu@intel.com added

comment:3 by Linjie.Fu, 5 years ago

Another input:
FFmpeg QSV sets initial_pool_size to a larger value (64) than VAAPI (29):

https://github.com/FFmpeg/FFmpeg/blob/feaec3bc3133ff143b8445c919f2c4c56048fdf9/fftools/ffmpeg_qsv.c#L96
which makes sure there is enough memory in the pool for allocation.

hwframe_pool_prealloc
https://github.com/FFmpeg/FFmpeg/blob/feaec3bc3133ff143b8445c919f2c4c56048fdf9/libavutil/hwcontext.c#L301
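
For context, that preallocation step works roughly like this (a paraphrase of the linked function, not the verbatim source): every slot in the initial pool is allocated up front at init time, so a pool size of 64 costs 64 full 4K surfaces immediately, whether or not the codec ever uses them all.

/* Paraphrase of hwframe_pool_prealloc() in libavutil/hwcontext.c. */
static int hwframe_pool_prealloc(AVBufferRef *ref)
{
    AVHWFramesContext *ctx = (AVHWFramesContext *)ref->data;
    AVFrame **frames;
    int i, ret = 0;

    frames = av_calloc(ctx->initial_pool_size, sizeof(*frames));
    if (!frames)
        return AVERROR(ENOMEM);

    /* Allocate every surface in the pool up front... */
    for (i = 0; i < ctx->initial_pool_size; i++) {
        frames[i] = av_frame_alloc();
        if (!frames[i])
            goto fail;
        ret = av_hwframe_get_buffer(ref, frames[i], 0);
        if (ret < 0)
            goto fail;
    }

fail:
    /* ...then drop the AVFrame wrappers; the underlying buffers stay
     * cached in the pool for later reuse. */
    for (i = 0; i < ctx->initial_pool_size; i++)
        av_frame_free(&frames[i]);
    av_free(frames);
    return ret;
}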

If you set the initial_pool_size to a relatively small value, for example:

frames_ctx->initial_pool_size = 22 + s->extra_hw_frames;

Similar GPU memory occupation can be observed:

ffmpeg: 321 objects, 1179193344 bytes
1.1GB

However, a small initial_pool_size may lead to some unexpected errors like:

[hevc_qsv @ 0x55febd948300] get_buffer() failed
Error while decoding stream #0:0: Cannot allocate memory

comment:4 by Linjie.Fu, 5 years ago

Resolution: wontfix
Status: new → closed

in reply to:  3 comment:5 by eero-t, 4 years ago

Resolution: wontfix
Status: closed → reopened

Replying to fulinjie:

Resolution set to wontfix

Why can't the QSV backend be made to do whatever the MSDK sample transcode application does?


Replying to fulinjie:

It can be verified with a gpu_copy decode, which does not allocate memory internally:

This workaround option seems to require the very latest FFmpeg git version, but it doesn't help, either with the original transcode command or when doing just decode. Memory usage is the same, and FFmpeg outputs this warning:

[hevc_qsv @ 0x5638f71fd840] GPU-accelerated memory copy only works in MFX_IOPATTERN_OUT_SYSTEM_MEMORY.
    Last message repeated 1 times

(Above output is with last night's Git versions of the drm-tip kernel, media-driver, MSDK and FFmpeg.)


Replying to fulinjie:

However, a small initial_pool_size may lead to some unexpected errors like:

Why doesn't VA-API fail with those errors when using the same initial pool size?

(A smaller initial pool size causing alloc errors sounds like a bug; normally such things should only affect performance.)


PS. IMHO wontfix would be an acceptable resolution if the QSV backend were dropped and the few extra things it provides [1] were added to the VA-API backend (besides QSV being slower and using 2x the GEM memory, it lacks support for most of the formats supported by VA-API, listed in #7691, and the need to identify & specify the decoding codec is annoying).

[1] low power mode & extra bit-rate control modes with HuC. Is there anything else QSV provides over VA-API with the iHD Media driver?

comment:6 by Linjie.Fu, 4 years ago

Type: defect → enhancement

Based on the discussion with Eero:

QSV uses a pre-allocation pool for memory allocation. One of the reasons it needs relatively more memory than VA-API is look_ahead.

Pre-allocated memory allows look_ahead to analyze LookAheadDepth frames to find per-frame costs using a sliding window of DependencyDepth frames.

https://github.com/Intel-Media-SDK/MediaSDK/blob/master/doc/mediasdk-man.md#figure-6-lookahead-brc-qp-calculation-algorithm

If the pre-allocated memory is set to a value similar to VA-API's, there will be memory allocation errors when look_ahead is enabled with a large look_ahead_depth.

It can also be verified with sample_multi_transcode that MSDK does use more memory with LA. (However, it scales the memory usage according to the option.)

This could be improved in the FFmpeg QSV backend by scaling the memory allocation according to the relevant options. We need to find a proper method to handle this and make sure it won't introduce other concerns/regressions. Let's keep this issue open and change it into an enhancement.
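
(A sketch of the kind of option-based scaling meant here; the helper and its constants are illustrative assumptions, not FFmpeg's actual code:)

/* Hypothetical sizing helper: grow the QSV frame pool only with the
 * options that actually consume surfaces, instead of a fixed 64.
 * The baseline approximates the VAAPI-style value (DPB + margin). */
static int qsv_pool_size(int async_depth, int look_ahead_depth,
                         int extra_hw_frames)
{
    int size = 22;            /* DPB-ish baseline, as with VAAPI */
    size += async_depth;      /* frames in flight between pipeline stages */
    size += look_ahead_depth; /* the LA window holds its own surfaces */
    size += extra_hw_frames;  /* user-requested headroom */
    return size;
}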

Details:


  1. Why can't the QSV backend be made to do whatever the MSDK sample transcode application does?

Actually, the QSV backend is able to match the behavior of the sample transcode app in MSDK.

It *works well* if you set initial_pool_size to *a similar value* to VAAPI's:

frames_ctx->initial_pool_size = 22 + s->extra_hw_frames;

And similar GPU memory occupation can be observed.

If you set the value too small, errors may happen.

  2. If a value similar to VA-API's works well, why not use that as a fix?

QSV and VA-API are not the same codec and have different features, for example look_ahead (currently specific to h264_qsv).

Good point.

One of the important reasons for the pre-allocated memory is to allow look_ahead to analyze LookAheadDepth frames to find per-frame costs using a sliding window of DependencyDepth frames.

Details:
https://github.com/Intel-Media-SDK/MediaSDK/blob/master/doc/mediasdk-man.md#figure-6-lookahead-brc-qp-calculation-algorithm

If the pre-allocated memory is set to a value similar to VA-API's, there will be memory allocation errors when look_ahead is enabled.

CMDLINE for testing:
ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v hevc_qsv -i Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265 -vf scale_qsv=format=nv12 -c:v h264_qsv -b:v 20M -look_ahead 1 -look_ahead_depth 40 -async_depth 4 output.h264

Gives:

----------------
Error while filtering: Cannot allocate memory
Failed to inject frame into filter network: Cannot allocate memory
----------------

With LA depth 20 it still works; 30 or more doesn't.

Also, would you please help to verify whether sample_multi_transcode uses more GEM memory when look_ahead is enabled?

Same with MSDK:
sample_multi_transcode -i::h265 Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265 -ec::nv12 -o::h264 output.h264 -b 20000 -async 4 -hw

MSDK does use more memory with LA:

  • No LA: 0.67 GB
  • LA depth 10: 0.89 GB (options: -la -lad 10)
  • LA depth 100: 2.12 GB

But FFmpeg QSV uses much more:

  • LA depth 10: 2.54 GB (!)

This workaround option seems to require the very latest FFmpeg git version, but it doesn't help, either with the original transcode command or when doing just decode. Memory usage is the same, and FFmpeg outputs this warning:

GPU copy can only be used when copying data from video memory to system memory, so it won't help in a transcode pipeline. I tried to explain the memory usage in MSDK with GpuCopy as an example.

In case I was unclear, it fails also with decoding:

----------------------------------------------------------------
$ ffmpeg -an -y -hwaccel qsv -qsv_device /dev/dri/renderD128 -gpu_copy on -c:v hevc_qsv -i Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265 -vf hwdownload,format=p010 -f rawvideo /dev/null
...
Input #0, hevc, from 'input/Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265':
    Duration: N/A, bitrate: N/A
      Stream #0:0: Video: hevc (Main 10), yuv420p10le(tv), 4096x2160, 60 fps, 60 tbr, 1200k tbn, 60 tbc
Stream mapping:
    Stream #0:0 -> #0:0 (hevc (hevc_qsv) -> rawvideo (native))
Press [q] to stop, [?] for help
[hevc_qsv @ 0x55f94862d840] GPU-accelerated memory copy only works in MFX_IOPATTERN_OUT_SYSTEM_MEMORY.
      Last message repeated 1 times
Output #0, rawvideo, to '/dev/null':
----------------------------------------------------------------

There is something wrong with the test cmdline: "-hwaccel qsv" should be removed,

Thanks, I hadn't noticed. Without that it works!

and "-vf hwdownload,format=p010" is not needed.

Try the provided cmdline in:

https://trac.ffmpeg.org/ticket/7943#comment:1

or use the following cmd:

ffmpeg -an -y -qsv_device /dev/dri/renderD128 -gpu_copy on -c:v hevc_qsv -i Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265 -f rawvideo /dev/null

=> 0.47GB

  3. Why doesn't VA-API fail with those errors when using the same initial pool size?

Both VAAPI and QSV work well.

  4. IMHO, the pre-allocation doesn't mean QSV really uses all the memory allocated in the pool.

If it shows up as a GEM allocation in sysfs, I think it's really used in the sense that it's taken away from everybody else in the system, but I don't know for sure.

(With normal memory, non-dirtied allocations are all mapped to the same zero page, but GEM memory doesn't show up in SMAPS statistics, and I don't know whether QSV dirties all of its allocations.)

Btw. because DRI/GEM allocations aren't visible to the rest of the system, they can't be controlled [1] and can easily cause OOM-kills of innocent (other) processes. I've seen this happen with 3D on an X desktop, and with media services in a Kubernetes environment (the control plane gets killed and the node needs to be rebooted).

Yes, GEM memory is allocated in the pre-allocation pool. So from
this perspective, all allocated GEM memory is used by FFmpeg.

And from another perspective, FFmpeg QSV gets memory from the pre-allocation pool to "actually" use it. So the exact memory used by QSV may be smaller than the GEM memory, but this cannot be observed through "GEM object usage".

I assume that's what GPU cgroups support is going to be looking at when it finally gets implemented (cgroups are mandatory for any reasonable GPU sharing with container loads). If it does, it would matter, a lot.

Summary:

  1. Larger memory pre-allocation for QSV is reasonable for some features (look_ahead, for example).
  2. GpuCopy could work if the cmdline is refined.

It works, but IMHO isn't really usable; normal users aren't going to find out all these weird command line option combinations needed to get performance & reasonable memory usage out of QSV (the -gpu_copy option doesn't even seem to be documented).

GpuCopy mainly helps to improve performance (on APL, for 4K NV12 decode, performance improves from 3.4 fps to 50+ fps); memory usage is kind of a side benefit.

And thanks for the reminder; documentation for this option is reasonable, I'll think about sending a patch to add it.

I think these kinds of optimal encoding settings should be handled by the FFmpeg backend automatically (at least for cases with >= 2x differences). It should be enough for the user to tell it what to do, not how.

  3. GEM memory is all allocated and used by FFmpeg, but QSV may only use part of the memory allocated in the pre-allocation pool.

I'm not sure whether there are other potential concerns if we initialize a smaller pre-allocation pool.

One possible solution is to leave this as an enhancement: enable the large allocation only for the specific features that need it.

The MSDK sample application seems to scale its memory usage according to what is actually needed. Couldn't the FFmpeg QSV backend do the same?

Agreed, this could be improved in the FFmpeg QSV backend by scaling the memory allocation according to the relevant options. We need to find a proper method to handle this and make sure it won't introduce other concerns/regressions. Let's keep this issue open and change it into an enhancement.

comment:7 by eero-t, 4 years ago

Is there any progress on making the QSV backend do allocations based on how much memory is actually needed?

(Instead of it just pre-allocating a huge amount of (non-swappable) GPU memory...)

comment:8 by Linjie.Fu, 4 years ago

Previous patch in the community:
https://patchwork.ffmpeg.org/project/ffmpeg/patch/1572576045-5252-1-git-send-email-linjie.fu@intel.com/

It needs to take look_ahead_depth into account.
(I had some attempts locally long ago; they need to be better integrated with this option.)

comment:9 by eero-t, 4 years ago

What about gpu_copy being used automatically by QSV when applicable, or at least being documented?

in reply to:  7 comment:10 by eero-t, 20 months ago

2.5 years have passed. Any progress on these?

Replying to eero-t:

Is there any progress on making the QSV backend do allocations based on how much memory is actually needed?

(Instead of it just pre-allocating a huge amount of (non-swappable) GPU memory...)

Replying to eero-t:

What about gpu_copy being used automatically by QSV when applicable, or at least being documented?
