Opened 5 years ago

Closed 5 years ago

Last modified 5 years ago

#7674 closed defect (fixed)

ffmpeg with cuvid transcoding after version 3.4.1 works unstably on a heavily loaded CUDA card

Reported by: Maxim
Owned by:
Priority: normal
Component: avcodec
Version: git-master
Keywords: nvenc regression
Cc:
Blocked By:
Blocking:
Reproduced by developer: no
Analyzed by developer: no

Description

Hi!

For transcoding media streams I use Nvidia Quadro P5000 video cards and ffmpeg version 3.4.1 (OS Linux Ubuntu 16.04.5, kernel 4.15).

With version 3.4.1 everything works fine; on this video card it was possible to generate 79 H264 streams:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P5000        Off  | 00000000:01:00.0 Off |                  Off |
| 49%   76C    P0    79W / 180W |  12792MiB / 16278MiB |     45%      Default |
+-------------------------------+----------------------+----------------------+

# nvidia-smi -i 0 | grep ffmpeg | wc -l
79

---

Utilization

Gpu : 42 %
Memory : 16 %
Encoder : 42 %
Decoder : 92 %

When I try to upgrade ffmpeg to version 3.4.2 or higher (4.X), ffmpeg works unstably after 40 streams.

ffmpeg configuration:
ffmpeg -hwaccel cuvid -c:v mpeg2_cuvid -deint 2 -drop_second_field 1 -i udp://232.10.10.1:1234?fifo_size=300000 -b:v 2800k -b:a 192k -c:v h264_nvenc -profile:v high -preset hp -c:a aac -f flv rtmp://127.0.0.1:1935/live/001

What could be the issue?

Change History (25)

comment:1 by Carl Eugen Hoyos, 5 years ago

Component: ffmpeg → undetermined

Please test current FFmpeg git head and provide a simplified command line (if possible without network input or output) including the complete, uncut console output to make this a valid ticket. If you see a crash, please provide backtrace, disassembly and register dump as explained on https://ffmpeg.org/bugreports.html

comment:2 by Carl Eugen Hoyos, 5 years ago

If you believe there is a regression, run git bisect to find the change introducing the issue.

comment:3 by Timo R., 5 years ago

Also, please be _a lot_ more specific than "ffmpeg was work unstable".

in reply to:  3 comment:5 by Maxim, 5 years ago

Replying to oromit:

Also, please be _a lot_ more specific than "ffmpeg was work unstable".

Unstable work looks like this: if up to 40 ffmpeg processes are running at the same time, everything is OK; if I add more, for example 55 streams, the nvidia-smi utility becomes slow to show statistics and the ffmpeg processes randomly exit after some time.

comment:6 by Carl Eugen Hoyos, 5 years ago

Resolution: needs_more_info
Status: new → closed

Please reopen this ticket if you can confirm the issue is reproducible with current FFmpeg git head, if you can point us to the commit introducing the regression and if you can provide simplified command line including the complete, uncut console output.

comment:7 by Maxim, 5 years ago

Hi!

I checked the issue on the git-master version; the issue is present there as well.

Then I reverted patches from version 4.1, and now my issue is resolved :)

I think these patches need analyzing for the case when cuvid transcoding is used:

932037c6bb6b41a24e75b031426844a2e6472a74
48e52e4edd12adbc36eee0eebe1b97ffe0255be3
32bc4e77f61a5483c83a360b9ccbfc2840daba1e
bbe1b21022e4872bc64066d46a4567dc1b655f7a

comment:8 by Maxim, 5 years ago

Resolution: needs_more_info
Status: closed → reopened

comment:9 by Carl Eugen Hoyos, 5 years ago

To make this a valid bug report please provide the command line you tested together with the complete, uncut console output and point us to the change (it is one commit, not four) that introduced the regression.

comment:10 by Maxim, 5 years ago

My step-by-step:

git clone -b release/4.1 https://git.ffmpeg.org/ffmpeg.git ffmpeg41

dependent patches, reverted first so that "32bc4e77f61a5483c83a360b9ccbfc2840daba1e" can be reverted cleanly:
git revert 932037c6bb6b41a24e75b031426844a2e6472a74
git revert 48e52e4edd12adbc36eee0eebe1b97ffe0255be3

regression patch:
git revert 32bc4e77f61a5483c83a360b9ccbfc2840daba1e

Test configuration:

ffmpeg -hwaccel cuvid -c:v mpeg2_cuvid -deint 2 -drop_second_field 1 -i udp://232.10.10.1:1234?fifo_size=300000 -b:v 2800k -b:a 192k -c:v h264_nvenc -profile:v high -preset hp -c:a aac -f flv rtmp://127.0.0.1:1935/live/001
.
.
.
ffmpeg -hwaccel cuvid -c:v mpeg2_cuvid -deint 2 -drop_second_field 1 -i udp://232.10.10.60:1234?fifo_size=300000 -b:v 2800k -b:a 192k -c:v h264_nvenc -profile:v high -preset hp -c:a aac -f flv rtmp://127.0.0.1:1935/live/060

With the problematic patch in place, at 59 streams the next stream takes over 20 seconds to start, and the video freezes on the other streams.

comment:11 by Carl Eugen Hoyos, 5 years ago

Resolution: needs_more_info
Status: reopened → closed

comment:12 by Maxim, 5 years ago

Summary: ffmpeg with cuvid transcoding after version 3.4.1 work with crash on heavy load CUDA card → ffmpeg with cuvid transcoding after version 3.4.1 work unstable on heavy load CUDA card

comment:13 by Timo R., 5 years ago

The patch you claim introduced the regression fixes a leak and potential crash. It seems very unlikely to me that it would introduce performance issues.

comment:14 by Maxim, 5 years ago

Maybe, when using cuvid transcoding, the video frames are processed entirely inside the GPU; it is possible this case was not considered when the patch was developed.

comment:15 by Timo R., 5 years ago

That patch is specifically only for the case of pure on-GPU transcoding. It sits in the path that registers a CUDA frame with nvenc.

comment:16 by Maxim, 5 years ago

That's right, but when these two lines were added:

+ p_nvenc->nvEncUnregisterResource(ctx->nvencoder, ctx->registered_frames[tmpoutsurf->reg_idx].regptr);
+ ctx->registered_frames[tmpoutsurf->reg_idx].regptr = NULL;

the GPU adapter begins to work slowly when the encoder load reaches 80-90% :(

comment:17 by malakudi, 5 years ago

Resolution: needs_more_info
Status: closed → reopened

I confirm the issue reported in this ticket with current git. Running multiple transcoding instances (above 50) starts to show significant performance degradation.

With the following patch on current git

--- ffmpeg/libavcodec/nvenc.c	2019-04-08 20:53:19.745925070 +0300
+++ ffmpeg/libavcodec/nvenc.c	2019-04-08 20:55:51.619074973 +0300
@@ -1846,13 +1846,6 @@
                 res = nvenc_print_error(avctx, nv_status, "Failed unmapping input resource");
                 goto error;
             }
-            nv_status = p_nvenc->nvEncUnregisterResource(ctx->nvencoder, ctx->registered_frames[tmpoutsurf->reg_idx].regptr);
-            if (nv_status != NV_ENC_SUCCESS) {
-                res = nvenc_print_error(avctx, nv_status, "Failed unregistering input resource");
-                goto error;
-            }
-            ctx->registered_frames[tmpoutsurf->reg_idx].ptr = NULL;
-            ctx->registered_frames[tmpoutsurf->reg_idx].regptr = NULL;
         } else if (ctx->registered_frames[tmpoutsurf->reg_idx].mapped < 0) {
             res = AVERROR_BUG;
             goto error;

the issue is resolved. I am not in a position to understand why this patch fixes the issue, but it does. And I see no other problem (memory leak or otherwise) when applying the above patch.

The issue affects nvenc in general, it is not related to mpeg-2 input that original author reported.

comment:18 by Carl Eugen Hoyos, 5 years ago

Please either:
Send your patch (made with git format-patch) to the FFmpeg development mailing list; patches are ignored on this bug tracker.
Or provide the command line you tested together with the complete, uncut console output to make this a valid ticket.

comment:19 by malakudi, 5 years ago

I am sorry, but the output is irrelevant. To test the issue you need at least two Quadro P5000 or two Quadro RTX 5000 cards on the same computer, in order to be able to run that many instances in parallel.
Sample bash script to test the issue:

#!/bin/bash
for j in `seq 0 $1` ;
do
for i in `seq 1 $2` ;
do 
ffmpeg-git -nostdin -loglevel error -stats \
-hwaccel cuvid -hwaccel_device $j -c:v h264_cuvid -surfaces 12 \
-i input_1080i.ts \
-vf yadif_cuda=1:-1:1,scale_npp=w=1280:h=720 \
-c:v h264_nvenc \
-preset fast \
-acodec copy -f mpegts -y /dev/null &
done
done
wait
echo done

Sample input file can be downloaded from http://207.154.237.57/files/input_1080i.ts
It is a 1080i input and we do deinterlacing. That way we can push many more frames through nvenc, because if we used, for example, a 50fps or 60fps input, nvdec would limit us first.

You run it as: ./testbench.sh 1 30, where 1 is the number of GPUs minus 1 (if you have 3, you put 2, etc.) and 30 is the number of concurrent sessions per GPU.
Increasing the sessions above 25-30 per GPU will show the issue immediately. Applying the above patch resolves the issue.

comment:20 by malakudi, 5 years ago

Another test case has been reported at https://devtalk.nvidia.com/default/topic/1050306/video-codec-and-optical-flow-sdk/video-transcoding-using-multiple-gpus-32-live-streaming-jobs-/ with the same behaviour. Applying above patch also fixes that test case. It has been confirmed now from at least 3 different test cases, I don't understand why you don't revert the problematic code change.

comment:21 by malakudi, 5 years ago

I have added some debug output to check what is going on when this code is executed. For every processed frame, the code calls nvEncUnregisterResource. For my test sample of 2984 frames, the code path is executed 2984 times.

/testnvidia4.sh 0 1
[h264_nvenc @ 0x562ba7c98e80] DEBUG: Unmapped, need(?) to unregister
    Last message repeated 2983 times
...   13568kB time=00:00:47.01 bitrate=2364.1kbits/s speed=31.3x
frame= 2984 fps=1542 q=27.0 Lsize=   18214kB time=00:01:01.56 bitrate=2423.8kbits/s speed=31.8x    

Obviously, when running multiple encodes, making thousands of calls to nvEncUnregisterResource creates the performance issue.

comment:22 by malakudi, 5 years ago

With the above patch applied, nvEncRegisterResource is called 5 times at startup and nvEncUnregisterResource is called 5 times at the end of the process. Without it (current git code), nvEncRegisterResource is called 2984 times and nvEncUnregisterResource is also called 2984 times (for the test input file of 2984 frames). So not calling nvEncUnregisterResource at this specific code location does not leave any garbage, since the nvEncRegisterResource and nvEncUnregisterResource calls still match; it just creates unnecessary overhead by registering and unregistering on every frame. Please fix.

in reply to:  13 comment:23 by malakudi, 5 years ago

Replying to oromit:

The patch you claim introduced the regression fixes a leak and potential crash. It seems very unlikely to me that it would introduce performance issues.

As I showed above, it does not. There is no leak. It just creates really unnecessary overhead registering and unregistering on every processed frame.

comment:24 by malakudi, 5 years ago

Commit 23ed147e8fc2b6b51a88af66b40f99049e5fa0d8 fixes the issue. Thank you very much, although I feel a bit disappointed by the wording used in the commit description. This was not a "super rare edge case"; it was a typical workload with many transcodes on multiple GPUs. The original commit (which is now reverted) mentioned a "blew up" that was never experienced; all it actually did was make the code keep unregistering and registering on every processed frame. So I would expect a "sorry, we screwed up the code originally, so we are reverting now" rather than a claim that this was a "super rare edge case".

Last edited 5 years ago by Carl Eugen Hoyos

comment:25 by Carl Eugen Hoyos, 5 years ago

Component: undetermined → avcodec
Keywords: nvenc regression added
Resolution: fixed
Status: reopened → closed
Version: unspecified → git-master