Opened 4 years ago
Last modified 3 years ago
#9285 open defect
Excessive GPU memory usage with nvdec hwaccel
| Reported by: | Ridley Combs | Owned by: | Timo R. |
|---|---|---|---|
| Priority: | normal | Component: | avcodec |
| Version: | unspecified | Keywords: | nvdec nvidia |
| Cc: | | Blocked By: | |
| Blocking: | | Reproduced by developer: | yes |
| Analyzed by developer: | yes | | |
Description
When decoding video using the CUDA hwaccel, `ff_nvdec_decode_init()` sets both `ulNumDecodeSurfaces` and `ulNumOutputSurfaces` to `frames_ctx->initial_pool_size`, which in turn is set by `ff_nvdec_frame_params()` to `dpb_size + 2`, which in turn has 3 added by `ff_decode_get_hw_frames_ctx()` and `extra_hw_frames + thread_count` added by `avcodec_get_hw_frames_parameters()`.
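For illustration, here is the arithmetic the above adds up to, written as a standalone snippet with example inputs (a 16-frame DPB and default frame threading on a 16-core machine; this is not FFmpeg source code, just the accumulation restated):

```c
/* Illustration of the surface-count accumulation described above;
 * the inputs are example values, not taken from any particular stream. */
#include <stdio.h>

int main(void)
{
    int dpb_size        = 16; /* e.g. H.264 worst-case DPB */
    int thread_count    = 16; /* default frame threading on a 16-core machine */
    int extra_hw_frames = 0;

    int pool = dpb_size + 2;                /* ff_nvdec_frame_params()            */
    pool += extra_hw_frames + thread_count; /* avcodec_get_hw_frames_parameters() */
    pool += 3;                              /* ff_decode_get_hw_frames_ctx()      */

    /* ff_nvdec_decode_init() then uses this single total for both
     * NVDEC surface counts. */
    printf("initial_pool_size   = %d\n", pool); /* 37 */
    printf("ulNumDecodeSurfaces = %d\n", pool);
    printf("ulNumOutputSurfaces = %d\n", pool);
    return 0;
}
```

That works out to 37 decode surfaces plus 37 output surfaces for this example, which is far more than either side actually needs, as argued below.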
This is excessive. Only `ulNumDecodeSurfaces` needs additional frames based on thread count (the output surfaces are only used in `nvdec_retrieve_data()`, which runs on the consumer's single thread), while only `ulNumOutputSurfaces` needs the 3 additional output frames from `ff_decode_get_hw_frames_ctx()` or the ones from `extra_hw_frames` (the decode surfaces are never exposed to the consumer).
I'm not sure what the best way to handle this is. Maybe nvdec should ignore what the generic code sets `initial_pool_size` to altogether and instead calculate its buffer counts internally, duplicating the generic code's behavior only where appropriate? The `initial_pool_size` value seems to be designed for systems where the decoder's internal buffered frames are returned directly to the user, but that's not the case here.
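A minimal sketch of what such an internal calculation could look like, assuming the split argued for above; the helper and struct names here are hypothetical (only `ulNumDecodeSurfaces`/`ulNumOutputSurfaces` correspond to real `CUVIDDECODECREATEINFO` fields), and the exact constants would need review:

```c
/* Hypothetical sketch only, not a proposed patch: decode surfaces get the
 * DPB plus frame-threading headroom, output surfaces get only the frames
 * the consumer can actually hold on to. */
struct nvdec_surface_counts {
    int decode_surfaces; /* -> CUVIDDECODECREATEINFO.ulNumDecodeSurfaces */
    int output_surfaces; /* -> CUVIDDECODECREATEINFO.ulNumOutputSurfaces */
};

struct nvdec_surface_counts
nvdec_calc_surfaces(int dpb_size, int thread_count, int extra_hw_frames)
{
    struct nvdec_surface_counts c;
    /* Reference frames plus decodes in flight across frame threads. */
    c.decode_surfaces = dpb_size + 2 + thread_count;
    /* Frames exposed to the consumer: the generic code's 3 plus whatever
     * the caller requested via extra_hw_frames. */
    c.output_surfaces = 3 + extra_hw_frames;
    return c;
}
```

With the example numbers from above, that would be 34 decode surfaces and 3 output surfaces instead of 37 of each.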
Additionally, multithreading doesn't seem to serve any purpose with the CUDA hwaccel; I see no performance gain when using multiple threads vs. 1. Is it useful with any hardware decoder? Should we default multithreading off when using a hwaccel, or force it off unless the hwaccel fails and software fallback occurs? As it stands, the default can result in some pretty hefty memory usage for no reason on many-core machines.
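In the meantime, API users who know they want the CUDA hwaccel can sidestep the `thread_count` term themselves by opening the decoder with a single thread. A rough sketch (the helper is hypothetical; the calls are plain libavcodec API, and a full integration would normally also install a `get_format` callback selecting `AV_PIX_FMT_CUDA`, elided here to focus on the threading aspect):

```c
/* Workaround sketch, not part of the report proper: open a decoder with the
 * CUDA device attached and a single decoding thread, so the thread_count
 * term described above drops out of the surface pool. */
#include <libavcodec/avcodec.h>
#include <libavutil/buffer.h>
#include <libavutil/error.h>

int open_cuda_decoder(AVCodecContext *avctx, const AVCodec *codec,
                      AVBufferRef *cuda_device_ref)
{
    avctx->thread_count  = 1;  /* no frame threading on the hw path */
    avctx->hw_device_ctx = av_buffer_ref(cuda_device_ref);
    if (!avctx->hw_device_ctx)
        return AVERROR(ENOMEM);
    return avcodec_open2(avctx, codec, NULL);
}
```

On the ffmpeg command line, passing `-threads 1` as an input option has a similar effect. That's only a workaround, though, and doesn't answer what the defaults should be.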
Can you comment?