Opened 6 years ago
Last modified 4 years ago
#7563 new defect
hwaccel nvdec consumes much more VRAM when compared to hwaccel cuvid
Reported by: | malakudi | Owned by: |
---|---|---|---
Priority: | normal | Component: | undetermined
Version: | git-master | Keywords: |
Cc: | | Blocked By: |
Blocking: | | Reproduced by developer: | no
Analyzed by developer: | no | |
Description
Using hwaccel nvdec makes ffmpeg consume much more VRAM than hwaccel cuvid, reducing the total number of transcode sessions that can run on a given card. Tests were done on a Quadro P2000, which has 5120 MB of VRAM.
Here are some examples:
The following command allocates 193 MB of VRAM:
./ffmpeg-git -hwaccel nvdec -hwaccel_output_format cuda -f mpegts -i input_hdready_progressive_ntsc.ts -vcodec h264_nvenc -refs 4 -bf 2 -c:a copy -f mpegts -y /dev/null
while the equivalent command with hwaccel cuvid allocates 155 MB:
./ffmpeg-git -hwaccel cuvid -c:v h264_cuvid -f mpegts -i input_hdready_progressive_ntsc.ts -vcodec h264_nvenc -refs 4 -bf 2 -c:a copy -f mpegts -y /dev/null
and cuvid with surfaces limited to 10 (enough for this input) allocates 125 MB:
./ffmpeg-git -hwaccel cuvid -c:v h264_cuvid -surfaces 10 -f mpegts -i input_hdready_progressive_ntsc.ts -vcodec h264_nvenc -refs 4 -bf 2 -c:a copy -f mpegts -y /dev/null
VRAM allocation can be seen with nvidia-smi
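Per-process usage can also be polled while a transcode runs, e.g. (field names as listed by nvidia-smi --help-query-compute-apps):
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv -l 1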
The differences grow with input resolution: for a 1920x1080i50 input, cuvid allocates 190 MB and nvdec 278 MB (a 278-190=88 MB difference, versus 193-125=68 MB for the first input). And if I add scale_npp to the command line, as in the following commands:
./ffmpeg-git -hwaccel cuvid -c:v h264_cuvid -surfaces 10 -f mpegts -i input_1080i50.ts -vf scale_npp=w=iw/2 -vcodec h264_nvenc -refs 4 -bf 2 -c:a copy -f mpegts -y /dev/null
./ffmpeg-git -hwaccel nvdec -hwaccel_output_format cuda -f mpegts -i input_1080i50.ts -vf scale_npp=w=iw/2 -vcodec h264_nvenc -refs 4 -bf 2 -c:a copy -f mpegts -y /dev/null
then nvdec allocates 295 MB and cuvid 167 MB, a 295-167=128 MB difference.
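As a rough sanity check (an estimate, assuming uncompressed NV12 decode surfaces): a single 1920x1088 NV12 surface takes 1920 * 1088 * 1.5 bytes ≈ 3 MB, so the 88-128 MB gaps correspond to roughly 30-40 extra frame-sized buffers held on the nvdec side.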
This makes nvdec unusable if you want to utilise 100% of the hardware with multiple concurrent transcodes.
Change History (5)
follow-up: 2 comment:1 by oromit, 6 years ago
nvdec keeps more of a buffer because it hands through the directly mapped frames from the driver, while cuvid makes a VRAM-internal copy and thus can work with a minimal buffer and potentially zero copies all the way through nvenc.
There's not much that can be done about that difference, as the buffer is needed for the decoder and filter chain to not run out of frames in the predicted worst-case scenario.
comment:2 by malakudi, 6 years ago
Replying to oromit:
> nvdec keeps more of a buffer because it hands through the directly mapped frames from the driver, while cuvid makes a VRAM-internal copy and thus can work with a minimal buffer and potentially zero copies all the way through nvenc.
> There's not much that can be done about that difference, as the buffer is needed for the decoder and filter chain to not run out of frames in the predicted worst-case scenario.
Maybe this buffer could be made configurable; the worst-case scenario may not be needed at all for specific workloads, so why lose so much VRAM? If it were user-configurable (as with the cuvid decoding surfaces, for example), it might reach similar VRAM allocation levels. As it is, at least for the workloads I use, nvdec makes me lose 10-12 transcodes per card due to the VRAM limit.
comment:3 by , 6 years ago
You can only increase its size via -extra_hw_frames; the minimum is calculated based on the codec parameters.
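For reference, the syntax looks like this (a sketch based on the reporter's first command; note the option only adds frames on top of the computed minimum, so it increases VRAM use rather than lowering it):
./ffmpeg-git -hwaccel nvdec -hwaccel_output_format cuda -extra_hw_frames 8 -f mpegts -i input_hdready_progressive_ntsc.ts -vcodec h264_nvenc -refs 4 -bf 2 -c:a copy -f mpegts -y /dev/null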
comment:4 by malakudi, 6 years ago
Then hwaccel nvdec cannot be used for workloads that run multiple realtime transcodes. Having the GPU power to do more realtime transcodes but spending VRAM on buffers that are not really needed for specific workloads (since cuvid works fine without them) is not a good deal.