Excessive GPU memory usage with nvdec hwaccel
|Reported by:||Ridley Combs||Owned by:||Timo R.|
|Blocking:||Reproduced by developer:||yes|
|Analyzed by developer:||yes|
When decoding video using the CUDA hwaccel, ff_nvdec_decode_init() sets both ulNumDecodeSurfaces and ulNumOutputSurfaces to frames_ctx->initial_pool_size, which in turn is set to dpb_size + 2, which in turn has 3 added by ff_decode_get_hw_frames_ctx() and thread_count added on top.
This is excessive. Only ulNumDecodeSurfaces needs additional frames based on thread count (the output surfaces are only used in nvdec_retrieve_data, which runs on the consumer's single thread), while only ulNumOutputSurfaces needs the 3 additional output frames from ff_decode_get_hw_frames_ctx() or the ones from extra_hw_frames (the decode surfaces are never exposed to the consumer).
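To make the arithmetic concrete, here is a minimal sketch in C of the two calculations: the shared pool size as described above, and a split where each count only receives the terms it needs. The struct and function names are hypothetical and do not appear in nvdec.c, and OUTPUT_BASE (a single base output surface) is an assumption made purely for illustration.

```c
#include <assert.h>

/* Hypothetical inputs; the field names mirror the ticket, not nvdec.c. */
struct surface_inputs {
    int dpb_size;        /* decoded picture buffer size for the stream */
    int thread_count;    /* frame threads configured on the codec context */
    int extra_hw_frames; /* user-requested extra frames, 0 if unset */
};

struct surface_counts {
    int decode_surfaces; /* value that would feed ulNumDecodeSurfaces */
    int output_surfaces; /* value that would feed ulNumOutputSurfaces */
};

/* Behavior described in the ticket: one pool size is used for both counts. */
struct surface_counts counts_current(struct surface_inputs in)
{
    int pool = in.dpb_size + 2 + 3 + in.thread_count + in.extra_hw_frames;
    struct surface_counts c = { pool, pool };
    return c;
}

/* Assumption: one base output surface before the extras are added. */
#define OUTPUT_BASE 1

/* Possible split: thread-based frames go to the decode surfaces only,
 * the 3 extra frames and extra_hw_frames go to the output surfaces only. */
struct surface_counts counts_split(struct surface_inputs in)
{
    struct surface_counts c;

    assert(in.dpb_size >= 0 && in.thread_count >= 0 && in.extra_hw_frames >= 0);
    c.decode_surfaces = in.dpb_size + 2 + in.thread_count;
    c.output_surfaces = OUTPUT_BASE + 3 + in.extra_hw_frames;
    return c;
}
```

With dpb_size = 16, thread_count = 8 and no extra_hw_frames, counts_current() gives 29 for both values, while counts_split() gives 26 decode surfaces and 4 output surfaces.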
I'm not sure what the best way to handle this is. Maybe nvdec should ignore what the generic code sets initial_pool_size to altogether and instead calculate its buffer counts internally, duplicating the generic code's behavior only where appropriate? The initial_pool_size value seems to be designed for systems where the decoder's internal buffered frames are returned directly to the user, but that's not the case here.
Additionally, it doesn't seem like multithreading in CUDA actually serves any purpose; I see no performance gain when using multiple threads vs 1. Is it useful with any hardware decoder? Should we be defaulting multithreading off when using a hwaccel, or forcing it off unless the hwaccel fails and software fallback occurs? This can result in some pretty hefty memory usage for no reason by default on many-core machines.
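As a rough illustration of how the thread-based frames scale on a many-core machine, the following C sketch estimates the memory taken by thread_count extra surfaces in a single pool. It assumes 8-bit NV12 surfaces and ignores pitch alignment and any driver-side overhead, so the real figure will differ; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate size of one 8-bit NV12 surface: a full-resolution luma plane
 * plus a half-height interleaved chroma plane. Assumption: no pitch padding. */
uint64_t nv12_surface_bytes(uint32_t width, uint32_t height)
{
    return (uint64_t)width * height * 3 / 2;
}

/* Memory attributable to the thread_count surfaces added to one pool. */
uint64_t thread_surface_bytes(int thread_count, uint32_t width, uint32_t height)
{
    assert(thread_count >= 0);
    return (uint64_t)thread_count * nv12_surface_bytes(width, height);
}
```

Under these assumptions, thread_surface_bytes(64, 1920, 1080) returns 199065600, i.e. roughly 190 MiB for one pool of a single 1080p decoder on a 64-thread machine.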