Opened 8 months ago

Last modified 8 months ago

#7563 new defect

hwaccel nvdec consumes much more VRAM when compared to hwaccel cuvid

Reported by: malakudi
Owned by:
Priority: normal
Component: undetermined
Version: git-master
Keywords:
Cc:
Blocked By:
Blocking:
Reproduced by developer: no
Analyzed by developer: no

Description

Using hwaccel nvdec makes ffmpeg consume much more VRAM when compared to hwaccel cuvid, thus reducing the total amount of transcode sessions that can run on certain hardware. Tests have been done on Quadro P2000 that has 5120 MB of VRAM.

Here are some examples:

The following command allocates 193 MB of VRAM:

./ffmpeg-git -hwaccel nvdec -hwaccel_output_format cuda -f mpegts -i input_hdready_progressive_ntsc.ts -vcodec h264_nvenc -refs 4 -bf 2 -c:a copy -f mpegts -y /dev/null

while the similar command with hwaccel cuvid allocates 155 MB:

./ffmpeg-git -hwaccel cuvid -c:v h264_cuvid -f mpegts -i input_hdready_progressive_ntsc.ts -vcodec h264_nvenc -refs 4 -bf 2 -c:a copy -f mpegts -y /dev/null

and cuvid with surfaces limited to 10 (which is enough for this input) allocates 125 MB:

./ffmpeg-git -hwaccel cuvid -c:v h264_cuvid -surfaces 10 -f mpegts -i input_hdready_progressive_ntsc.ts -vcodec h264_nvenc -refs 4 -bf 2 -c:a copy -f mpegts -y /dev/null

VRAM allocation can be seen with nvidia-smi.
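For example, per-process allocations can be listed with a query like the following (field names may differ slightly between driver versions):

nvidia-smi --query-compute-apps=pid,used_gpu_memory --format=csv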

The difference grows with input resolution: for a 1920x1080i50 input, cuvid allocates 190 MB and nvdec 278 MB, so the gap widens from 68 MB (193-125) at NTSC resolution to 88 MB (278-190) at 1080i. And if I add scale_npp to the command line, as in the following commands:

./ffmpeg-git -hwaccel cuvid -c:v h264_cuvid -surfaces 10 -f mpegts -i input_1080i50.ts  -vf scale_npp=w=iw/2 -vcodec h264_nvenc -refs 4 -bf 2 -c:a copy -f mpegts -y /dev/null

./ffmpeg-git -hwaccel nvdec -hwaccel_output_format cuda -f mpegts -i input_1080i50.ts  -vf scale_npp=w=iw/2 -vcodec h264_nvenc -refs 4 -bf 2 -c:a copy -f mpegts -y /dev/null

then nvdec allocates 295 MB versus 167 MB for cuvid, a difference of 128 MB.

This makes it impossible to use nvdec when you want to utilise 100% of the hardware with multiple concurrent transcodes.

Change History (4)

comment:1 follow-up: Changed 8 months ago by oromit

nvdec keeps a larger buffer because it passes through the frames mapped directly from the driver, while cuvid makes a VRAM-internal copy and can therefore work with a minimal buffer and potentially zero copies all the way through to nvenc.
There's not much that can be done about that difference, as the buffer is needed so that the decoder and filter chain do not run out of frames in the predicted worst-case scenario.

comment:2 in reply to: ↑ 1 Changed 8 months ago by malakudi

Replying to oromit:

nvdec keeps a larger buffer because it passes through the frames mapped directly from the driver, while cuvid makes a VRAM-internal copy and can therefore work with a minimal buffer and potentially zero copies all the way through to nvenc.
There's not much that can be done about that difference, as the buffer is needed so that the decoder and filter chain do not run out of frames in the predicted worst-case scenario.

Maybe this buffer can be made configurable; the worst-case scenario might not be needed at all for specific workloads, so why lose so much VRAM? If it were user configurable (as with cuvid decoding surfaces, for example), it might be possible to reach similar VRAM allocation levels. As it is, at least for the workloads I use, nvdec makes me lose 10-12 transcodes per card due to the VRAM limit.

comment:3 Changed 8 months ago by oromit

You can only increase its size via -extra_hw_frames; the minimum is calculated based on the codec parameters.
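For example, something like the following would add 8 extra frames on top of the computed minimum for the first command from the description (the number here is purely illustrative; the option cannot go below the minimum):

./ffmpeg-git -hwaccel nvdec -hwaccel_output_format cuda -extra_hw_frames 8 -f mpegts -i input_hdready_progressive_ntsc.ts -vcodec h264_nvenc -refs 4 -bf 2 -c:a copy -f mpegts -y /dev/null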

comment:4 Changed 8 months ago by malakudi

Then hwaccel nvdec cannot be used for workloads that run multiple realtime transcodes. Having the GPU power to do more realtime transcodes but spending VRAM on buffers that are not really needed in specific workloads (since cuvid works fine without them) is not a good deal.
