Opened 6 years ago
Last modified 2 years ago
#7943 reopened enhancement
FFmpeg QSV backend uses >2x more GPU memory compared to VAAPI or MSDK
Reported by: eero-t          Owned by: (none)
Priority:    normal          Component: undetermined
Version:     git-master      Keywords: (none)
Cc:          linjie.fu@intel.com
Blocked By:  (none)          Blocking: (none)
Reproduced by developer: no
Analyzed by developer: no
Description
Summary of the bug:
GPU-accelerated video transcoding can use gigabytes of RAM, but that memory sits in DRI/GEM objects which do not show up in normal process accounting, cannot be limited, cannot be swapped out, and can therefore easily cause OOM-kill havoc in the rest of the system.
The FFmpeg QSV backend uses far too much of these resources.
How to reproduce:
- monitor GEM object usage
# watch cat /sys/kernel/debug/dri/0/i915_gem_objects
- Transcode with FFmpeg / QSV
$ LIBVA_DRIVER_NAME=iHD ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v hevc_qsv -i Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265 -c:v hevc_qsv -b:v 20M -async_depth 4 output.h265
- Transcode with MediaSDK sample app:
$ LIBVA_DRIVER_NAME=iHD sample_multi_transcode -i::h265 Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265 -o::h265 output.h265 -b 20000 -async 4 -hw
- Transcode with FFmpeg / VAAPI
$ LIBVA_DRIVER_NAME=iHD ffmpeg -hwaccel vaapi -vaapi_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265 -c:v hevc_vaapi -b:v 20M output.h265
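The GEM usage watched in the first step can also be sampled programmatically. A minimal sketch, assuming the summary-line format the i915 driver prints on the test machine (`N objects, B bytes`); other kernels may format this file differently:

```python
import re

# Parse the summary line of /sys/kernel/debug/dri/0/i915_gem_objects,
# which starts with e.g. "321 objects, 1179193344 bytes".
def parse_gem_objects(text):
    m = re.search(r"(\d+) objects, (\d+) bytes", text)
    if m is None:
        raise ValueError("unexpected i915_gem_objects format")
    return int(m.group(1)), int(m.group(2))

# Example with a reading quoted later in this ticket:
objs, nbytes = parse_gem_objects("321 objects, 1179193344 bytes")
print(f"{objs} objects, {nbytes / 2**30:.2f} GiB")  # 321 objects, 1.10 GiB
```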
Expected outcome:
- Both FFmpeg backends use about the same amount of GPU resources, since they do about the same thing, and that resource usage is reasonable
- "Reasonable" being along the lines of some (tens of) frames for prediction, i.e. 4K * 16-bit * frames = a few hundred MBs
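The estimate above can be checked with simple arithmetic. A sketch assuming a 4K 10-bit 4:2:0 frame stored at 16 bits per sample (P010-style), ignoring the padding and alignment a real driver adds:

```python
# Packed size of one 4K frame at 16 bits per sample, 4:2:0 subsampling:
# luma plane is width*height*2 bytes, chroma adds half of that.
def frame_bytes(width, height, bytes_per_sample=2):
    luma = width * height * bytes_per_sample
    return luma + luma // 2

one = frame_bytes(4096, 2160)
print(f"one frame: {one / 2**20:.1f} MiB")  # ~25.3 MiB
for n in (16, 29, 64):
    print(f"{n} frames: {n * one / 2**30:.2f} GiB")
```

So a pool of a few tens of frames stays in the hundreds-of-MB range, while 64 frames already approaches 1.6 GiB.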
Actual outcome:
- VAAPI backend uses ~1.1GB of GEM resources
- QSV backend uses ~2.5GB of GEM resources, more than double
- MSDK app uses ~1.1GB of GEM resources like VAAPI backend, so the QSV backend issue doesn't seem to be due to MSDK
I tried playing with the -async / -async_depth options, and was able to raise the MSDK app's GEM object usage to 2.5GB with "-async 20", but even with "-async_depth 1" the QSV backend dropped only to 2.2GB of GEM objects, so it must be something else.
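For scale, the reported totals can be converted into equivalent 4K 10-bit surface counts. This is a rough upper bound, since real driver allocations are padded and include non-surface objects:

```python
# Plain packed size of one 4K 16-bit-per-sample 4:2:0 surface (~25.3 MiB).
SURFACE_BYTES = 4096 * 2160 * 2 * 3 // 2

# Approximate readings reported above.
for name, gem_bytes in [("FFmpeg/QSV",   2_500_000_000),
                        ("FFmpeg/VAAPI", 1_100_000_000)]:
    print(f"{name}: {gem_bytes / 2**30:.2f} GiB "
          f"~= {gem_bytes // SURFACE_BYTES} surfaces")
```

By this estimate the QSV backend holds roughly 90+ surfaces' worth of GEM memory versus ~40 for VAAPI.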
Change History (10)
comment:1 by , 5 years ago
comment:2 by , 5 years ago
Cc: linjie.fu@intel.com added
comment:3 by , 5 years ago (follow-up: comment:5)
Another input:
FFmpeg QSV sets initial_pool_size to a larger value (64) than VAAPI (29):
https://github.com/FFmpeg/FFmpeg/blob/feaec3bc3133ff143b8445c919f2c4c56048fdf9/fftools/ffmpeg_qsv.c#L96
which ensures there is enough memory in the pool for allocation.
The pool is allocated up front in hwframe_pool_prealloc:
https://github.com/FFmpeg/FFmpeg/blob/feaec3bc3133ff143b8445c919f2c4c56048fdf9/libavutil/hwcontext.c#L301
If you set the initial_pool_size to a relatively small value, for example:
frames_ctx->initial_pool_size = 22 + s->extra_hw_frames;
similar GPU memory occupation can be observed:
ffmpeg: 321 objects, 1179193344 bytes
(~1.1 GB)
However, a small initial_pool_size may lead to some unexpected errors like:
[hevc_qsv @ 0x55febd948300] get_buffer() failed
Error while decoding stream #0:0: Cannot allocate memory
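The failure mode is what you would expect from a fixed-size surface pool: once all pre-allocated surfaces are in flight, get_buffer() has nothing left to hand out. A toy sketch of that behavior (not FFmpeg's actual pool code):

```python
# Toy fixed-size frame pool: requests fail once the pool is exhausted
# instead of growing it, mirroring the ENOMEM above.
class FramePool:
    def __init__(self, size):
        self.free = list(range(size))   # pre-allocated surface ids
        self.in_use = set()

    def get_buffer(self):
        if not self.free:
            raise MemoryError("Cannot allocate memory")
        surf = self.free.pop()
        self.in_use.add(surf)
        return surf

    def release(self, surf):
        self.in_use.discard(surf)
        self.free.append(surf)

pool = FramePool(2)
a, b = pool.get_buffer(), pool.get_buffer()
try:
    pool.get_buffer()                   # third request: pool is empty
except MemoryError as e:
    print("pool exhausted:", e)
pool.release(a)
print("after release, allocation succeeds:", pool.get_buffer())
```

This is why too small an initial_pool_size surfaces as an allocation error rather than just a slowdown.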
comment:4 by , 5 years ago
Resolution: → wontfix
Status: new → closed
comment:5 by , 5 years ago
Resolution: wontfix removed
Status: closed → reopened
Replying to fulinjie:
Resolution set to wontfix
Why can't the QSV backend be made to do whatever the MSDK sample transcode application does?
Replying to fulinjie:
It could be verified by gpu_copy decode, which does not allocate memory internally:
This workaround option seems to require the very latest FFmpeg git version, but it doesn't help, neither with the original transcode command nor when doing just decode. Memory usage is the same, and FFmpeg outputs this warning:
[hevc_qsv @ 0x5638f71fd840] GPU-accelerated memory copy only works in MFX_IOPATTERN_OUT_SYSTEM_MEMORY.
    Last message repeated 1 times
(The above output is with last night's git versions of the drm-tip kernel, media-driver, MSDK and FFmpeg.)
Replying to fulinjie:
However, a small initial_pool_size may lead to some unexpected errors like:
Why doesn't VA-API fail with those errors when using the same initial pool size?
(A smaller initial pool size causing allocation errors sounds like a bug; normally such things should affect only performance.)
PS. IMHO wontfix would be an acceptable resolution if the QSV backend were dropped and the few extra things it provides [1] were added to the VA-API backend (besides QSV being slower and using 2x the GEM memory, it lacks support for most of the formats supported by VA-API, listed in #7691, and needing to identify & specify the decoding codec is annoying).
[1] Low power mode & extra bit-rate control modes with HuC. Is there anything else QSV provides over VA-API with the iHD media driver?
comment:6 by , 5 years ago
Type: defect → enhancement
Based on the discussion with Eero:
QSV uses a pre-allocation pool for memory allocation.
One of the reasons it needs relatively large memory usage compared with VA-API is look_ahead:
pre-allocated memory allows look_ahead to analyze LookAheadDepth frames to find per-frame costs using a sliding window of DependencyDepth frames.
If the pre-allocation is set to a value similar to VA-API's, there will be a memory allocation error when look_ahead is enabled with a large look_ahead_depth.
It can also be verified with sample_multi_transcode that MSDK does use more memory with LA
(however, it scales the memory usage according to the option).
This could be improved in the FFmpeg QSV backend by scaling the memory allocation according to the relevant options. We need to find a proper method to handle this and make sure it won't introduce other concerns/regressions.
Let's keep this issue open and maybe change it into an enhancement.
Details:
- Why can't the QSV backend be made to do whatever the MSDK sample transcode application does?
Actually, the QSV backend is able to match the behavior of sample transcode in MSDK.
It *works well* if you set initial_pool_size to *a similar value* to VAAPI's:
frames_ctx->initial_pool_size = 22 + s->extra_hw_frames;
and similar GPU memory occupation can be observed.
If you set the value too small, errors may happen.
- If a value similar to VA-API's works well, why not use that as a fix?
QSV and VA-API are not the same encoder and they have different features, for example look_ahead (currently h264_qsv-specific).
Good point.
One of the important reasons for the pre-allocated memory is to allow look_ahead to analyze LookAheadDepth frames to find per-frame costs using a sliding window of DependencyDepth frames.
If the pre-allocation is set to a value similar to VA-API's, there will be a memory allocation error when look_ahead is enabled.
CMDLINE for testing:
ffmpeg -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v hevc_qsv -i Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265 -vf scale_qsv=format=nv12 -c:v h264_qsv -b:v 20M -look_ahead 1 -look_ahead_depth 40 -async_depth 4 output.h264
Gives:
----------------
Error while filtering: Cannot allocate memory
Failed to inject frame into filter network: Cannot allocate memory
----------------
With LA depth 20 it still works; 30 or more doesn't.
Also, would you please help verify whether sample_multi_transcode uses more GEM memory when look_ahead is enabled?
Same with MSDK:
sample_multi_transcode -i::h265 Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265 -ec::nv12 -o::h264 output.h264 -b 20000 -async 4 -hw
MSDK does use more memory with LA:
- No LA: 0.67 GB
- LA depth 10: 0.89 GB (options: -la -lad 10)
- LA depth 100: 2.12 GB
But FFmpeg QSV uses much more:
- LA depth 10: 2.54 GB (!)
This workaround option seems to require the very latest FFmpeg git version, but it doesn't help, neither with the original transcode command nor when doing just decode. Memory usage is the same, and FFmpeg outputs this warning:
GPU copy can only be used when copying data from video memory to system memory, thus it won't help in a transcode pipeline.
I tried to explain the memory usage in MSDK with GpuCopy as an example.
In case I was unclear, it fails also with decoding:
----------------------------------------------------------------
$ ffmpeg -an -y -hwaccel qsv -qsv_device /dev/dri/renderD128 -gpu_copy on -c:v hevc_qsv -i Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265 -vf hwdownload,format=p010 -f rawvideo /dev/null
...
Input #0, hevc, from 'input/Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265':
  Duration: N/A, bitrate: N/A
  Stream #0:0: Video: hevc (Main 10), yuv420p10le(tv), 4096x2160, 60 fps, 60 tbr, 1200k tbn, 60 tbc
Stream mapping:
  Stream #0:0 -> #0:0 (hevc (hevc_qsv) -> rawvideo (native))
Press [q] to stop, [?] for help
[hevc_qsv @ 0x55f94862d840] GPU-accelerated memory copy only works in MFX_IOPATTERN_OUT_SYSTEM_MEMORY.
    Last message repeated 1 times
Output #0, rawvideo, to '/dev/null':
----------------------------------------------------------------
There is something wrong with the test cmdline:
"-hwaccel qsv" should be removed, and "-vf hwdownload,format=p010" is not needed.
Thanks, I hadn't noticed. Without that it works!
Try the provided cmdline in:
https://trac.ffmpeg.org/ticket/7943#comment:1
or use the following cmd:
ffmpeg -an -y -qsv_device /dev/dri/renderD128 -gpu_copy on -c:v hevc_qsv -i Netflix_FoodMarket_4096x2160_10bit_420_100mbs_600.h265 -f rawvideo /dev/null
=> 0.47GB
Why doesn't VA-API fail with those errors when using the same initial pool size?
Both VAAPI and QSV work well.
IMHO, the pre-allocation doesn't mean QSV really uses all the memory allocated in the pool.
If it shows up as a GEM allocation in sysfs, I think it's really used, in the sense that it's away from everybody else in the system, but I don't know for sure.
(With normal memory, non-dirtied allocations are all mapped to the same zero-page, but GEM memory doesn't show up in SMAPS statistics, and I don't know whether QSV dirties all of its allocations.)
Btw, because DRI/GEM allocations aren't visible to the rest of the system, they can't be controlled [1] and they can easily cause OOM-kills of innocent (other) processes. I've seen this happen with 3D in an X desktop, and with media services in a Kubernetes environment (the control plane gets killed and the node needs to be rebooted).
Yes, GEM memory is allocated in the pre-allocation pool. So from this perspective, all allocated GEM memory is used by FFmpeg.
From another perspective, FFmpeg QSV gets memory from the pre-allocation pool to "actually" use it. So the exact memory used by QSV may be smaller than the GEM memory, but that can't be observed through "GEM object usage".
I assume that's what GPU cgroups support is going to be looking at when it finally gets implemented (cgroups support is mandatory for any reasonable GPU sharing with container loads). If it does, it would matter, a lot.
Summary:
1. Larger memory pre-allocation for QSV is reasonable for some features (look_ahead, for example).
2. GpuCopy could work if the cmdline is refined.
It works, but IMHO isn't really usable; normal users aren't going to find out all the weird command line option combinations needed to get performance & reasonable memory usage out of QSV (the -gpu_copy option doesn't even seem to be documented).
GpuCopy mainly helps improve performance (on APL, 4K NV12 decode performance can be improved from 3.4 fps to 50+ fps); memory usage is kind of a side benefit.
And thanks for the reminder, documentation for this option is reasonable; I'll think about sending a patch to add it.
I think these kinds of optimal encoding settings should be handled by the FFmpeg backend automatically (at least for cases with >= 2x differences). It should be enough for the user to tell it what to do, not how.
- GEM memory is all allocated and used by FFmpeg, but QSV may only use part of the memory allocated in the pre-allocation pool.
I'm not sure whether there are other potential concerns if we initialize a smaller pre-allocation pool.
One possible solution is to leave it as an enhancement and enable the large allocation only for the specific features that need it.
The MSDK sample application seems to scale its memory usage according to what is actually needed. Couldn't the FFmpeg QSV backend do the same?
Agreed, this could be improved in the FFmpeg QSV backend by scaling the memory allocation according to the relevant options. We need to find a proper method to handle this and make sure it won't introduce other concerns/regressions.
Let's keep this issue open and maybe change it into an enhancement.
comment:7 by , 5 years ago (follow-up: comment:10)
Is there any progress on making the QSV backend do allocations based on how much memory is actually needed?
(Instead of it just pre-allocating a huge amount of (non-swappable) GPU memory...)
comment:8 by , 5 years ago
Previous patch in the community:
https://patchwork.ffmpeg.org/project/ffmpeg/patch/1572576045-5252-1-git-send-email-linjie.fu@intel.com/
It needs to use look_ahead_depth.
(I had some attempts locally long ago; it needs to be better integrated with this option.)
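The idea could be sketched as sizing the pool from what the pipeline actually needs rather than a fixed 64. All names and constants below are illustrative assumptions, not FFmpeg's actual formula:

```python
# Hypothetical pool sizing that scales with the options in use,
# instead of a fixed worst-case pre-allocation.
def pool_size(codec_refs=16, async_depth=4, look_ahead_depth=0,
              extra_hw_frames=0):
    # frames the codec keeps as references, frames in flight,
    # plus the look-ahead window when enabled
    return codec_refs + async_depth + look_ahead_depth + extra_hw_frames

print(pool_size())                     # baseline, no look-ahead: 20
print(pool_size(look_ahead_depth=40))  # grows with -look_ahead_depth: 60
```

With something like this, a plain transcode would pre-allocate a VAAPI-sized pool, while look-ahead runs would still get the surfaces they need.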
comment:9 by , 5 years ago
What about gpucopy being used automatically by QSV when applicable, or it at least being documented?
comment:10 by , 2 years ago
2.5 years have passed. Any progress on these?
Replying to eero-t:
Is there any progress on making the QSV backend do allocations based on how much memory is actually needed?
(Instead of it just pre-allocating a huge amount of (non-swappable) GPU memory...)
Replying to eero-t:
What about gpucopy being used automatically by QSV when applicable, or it at least being documented?
This seems to be similar to the performance gap issue #7690.
There is an internal memory allocation and memory copy in the MSDK core library (which may not be involved in sample_multi_transcode):
CommonCORE::DoFastCopyExtended
https://github.com/Intel-Media-SDK/MediaSDK/blob/master/_studio/shared/src/libmfx_core.cpp#L1522
Thus the memory usage is twice as large.
https://github.com/Intel-Media-SDK/MediaSDK/issues/1550#issuecomment-515417010
It can be verified with gpu_copy decode, which does not allocate memory internally:
With internal memory allocation:
ffmpeg -y -hwaccel qsv -qsv_device /dev/dri/renderD128 -c:v hevc_qsv -an -i SES.Astra.UHD.Test.1.2160p.UHDTV.AAC.HEVC.x265-LiebeIst.mkv -vf hwdownload,format=p010 -f rawvideo /dev/null
ffmpeg: 155 objects, 1680379904 bytes
1.6 GB
No internal memory allocation with gpu_copy:
ffmpeg -y -qsv_device /dev/dri/renderD128 -gpu_copy on -c:v hevc_qsv -an -i SES.Astra.UHD.Test.1.2160p.UHDTV.AAC.HEVC.x265-LiebeIst.mkv -f rawvideo /dev/null
ffmpeg: 140 objects, 393277440 bytes
400 MB
The memory gap is obvious.
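Putting the two quoted readings side by side:

```python
# GEM byte counts quoted above for the same decode.
internal = 1_680_379_904   # default path, internal MSDK allocation + copy
gpu_copy = 393_277_440     # with -gpu_copy on
print(f"{internal / 2**30:.2f} GiB vs {gpu_copy / 2**30:.2f} GiB, "
      f"{internal / gpu_copy:.1f}x more without gpu_copy")
```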