Opened 5 years ago

Closed 5 years ago

Last modified 5 years ago

#7582 closed defect (fixed)

hwaccel cuvid/nvenc performance degradation when using aq (temporal-aq or spatial-aq) with multiple concurrent encodes

Reported by: malakudi Owned by:
Priority: important Component: undetermined
Version: git-master Keywords: regression cuda nvenc
Cc: Blocked By:
Blocking: Reproduced by developer: no
Analyzed by developer: no

Description (last modified by Carl Eugen Hoyos)

Running multiple hwaccel cuvid/nvenc sessions that use temporal-aq or spatial-aq AND 3 or more reference frames results in a performance degradation since the following commits:
9b82e333b7c4235a3de7ce8d8fe115c53c11f50c
93d1756af2908150f7c8c0590b9ed246951d474a
Those commits enabled the use of cuMemcpy2DAsync instead of cuMemcpy2D. With this change, aq enabled, and 3 or more reference frames, throughput drops to around 50% of the nvenc capacity. It may be a driver problem, but it still makes ffmpeg problematic when running multiple realtime encodes. With -hwaccel nvdec this doesn't happen, but since -hwaccel nvdec uses much more VRAM, I cannot run the same number of concurrent sessions.

To reproduce, I use as input the following file: https://download.blender.org/demo/movies/BBB/bbb_sunflower_1080p_30fps_normal.mp4

I run it with the following bash script:

#!/bin/bash
for i in `seq 1 16` ;
do 
./ffmpeg-git -nostdin -loglevel error -hwaccel cuvid -c:v h264_cuvid -re -i bbb_sunflower_1080p_30fps_normal.mp4 -vf scale_npp=w=1280:h=720 -c:v h264_nvenc -preset medium -refs 4 -bf 3 -temporal-aq 1 -acodec copy -f mpegts -y /dev/null &
done
wait
echo done

Checking with nvidia-smi shows very low utilization, and if you run one more session interactively you will see that it cannot keep encoding at 30 fps, even though nvenc utilization is very low. If you set -temporal-aq 0 in the same script, you will see much higher utilization.
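To compare runs without eyeballing the monitor output, the enc and dec columns can be averaged. This is only a sketch: the helper name dmon_avg is made up for illustration, and it assumes the column layout shown in this ticket (enc in column 7, dec in column 8), which may differ with other driver versions.

```shell
#!/bin/bash
# dmon_avg: hypothetical helper, not part of ffmpeg or nvidia-smi.
# Reads `nvidia-smi dmon` output on stdin and prints the mean enc/dec utilization.
# Assumes enc is column 7 and dec is column 8, as in the output in this ticket.
dmon_avg() {
    LC_ALL=C awk '!/^#/ && NF >= 8 { enc += $7; dec += $8; n++ }
                  END { if (n) printf "enc=%.1f dec=%.1f\n", enc / n, dec / n }'
}

# Example using two samples copied from this report.
printf '%s\n' \
    '# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk' \
    '    0    56    49     -    11     4    30    37  6800  1560' \
    '    0    54    49     -    10     4    30    39  6800  1560' | dmon_avg
# prints: enc=30.0 dec=38.0
```

On a live system the same function should work when fed directly, for example with `nvidia-smi dmon -c 10 | dmon_avg`, while the 16 background sessions are running.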

Sample output of an interactive encoding session started while 16 sessions were already running, followed by nvidia-smi dmon output:

./ffmpeg-git -hwaccel cuvid -c:v h264_cuvid -re -i bbb_sunflower_1080p_30fps_normal.mp4 -vf scale_npp=w=1280:h=720 -c:v h264_nvenc -preset medium -refs 4 -bf 3 -temporal-aq 1 -acodec copy -f mpegts -y /dev/null
ffmpeg version N-92462-g529debc987 Copyright (c) 2000-2018 the FFmpeg developers
  built with gcc 8 (Debian 8.2.0-9)
  configuration: --enable-runtime-cpudetect --disable-decoder=amrnb --disable-decoder=libopenjpeg --disable-mips32r2 --disable-mips32r6 --disable-mips64r6 --disable-mipsdsp --disable-mipsdspr2 --disable-mipsfpu --disable-msa --disable-libopencv --disable-podpages --disable-sndio --disable-debug --enable-libaom --enable-avfilter --enable-gcrypt --enable-gnutls --enable-gpl --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libfdk-aac --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libilbc --enable-libkvazaar --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenh264 --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvo-amrwbenc --enable-libvorbis --enable-libvpx --enable-libx265 --enable-libxvid --enable-libzvbi --enable-libnpp --enable-cuda-sdk --enable-nonfree --enable-opencl --enable-opengl --enable-postproc --enable-pthreads --enable-static --disable-shared --enable-version3 --enable-libwebp --incdir=/usr/include/x86_64-linux-gnu --libdir=/usr/lib/x86_64-linux-gnu --prefix=/usr --toolchain=hardened --enable-frei0r --enable-chromaprint --enable-libx264 --enable-libiec61883 --enable-libdc1394 --enable-vaapi --enable-libmfx --disable-altivec --shlibdir=/usr/lib/x86_64-linux-gnu
  libavutil      56. 23.101 / 56. 23.101
  libavcodec     58. 39.100 / 58. 39.100
  libavformat    58. 22.100 / 58. 22.100
  libavdevice    58.  6.100 / 58.  6.100
  libavfilter     7. 44.100 /  7. 44.100
  libswscale      5.  4.100 /  5.  4.100
  libswresample   3.  4.100 /  3.  4.100
  libpostproc    55.  4.100 / 55.  4.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'bbb_sunflower_1080p_30fps_normal.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 1
    compatible_brands: isomavc1
    creation_time   : 2013-12-16T17:44:39.000000Z
    title           : Big Buck Bunny, Sunflower version
    artist          : Blender Foundation 2008, Janus Bager Kristensen 2013
    comment         : Creative Commons Attribution 3.0 - http://bbb3d.renderfarming.net
    genre           : Animation
    composer        : Sacha Goedegebure
  Duration: 00:10:34.53, start: 0.000000, bitrate: 3481 kb/s
    Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p, 1920x1080 [SAR 1:1 DAR 16:9], 2998 kb/s, 30 fps, 30 tbr, 30k tbn, 60 tbc (default)
    Metadata:
      creation_time   : 2013-12-16T17:44:39.000000Z
      handler_name    : GPAC ISO Video Handler
    Stream #0:1(und): Audio: mp3 (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 160 kb/s (default)
    Metadata:
      creation_time   : 2013-12-16T17:44:42.000000Z
      handler_name    : GPAC ISO Audio Handler
    Stream #0:2(und): Audio: ac3 (ac-3 / 0x332D6361), 48000 Hz, 5.1(side), fltp, 320 kb/s (default)
    Metadata:
      creation_time   : 2013-12-16T17:44:42.000000Z
      handler_name    : GPAC ISO Audio Handler
    Side data:
      audio service type: main
Stream mapping:
  Stream #0:0 -> #0:0 (h264 (h264_cuvid) -> h264 (h264_nvenc))
  Stream #0:2 -> #0:1 (copy)
Press [q] to stop, [?] for help
Output #0, mpegts, to '/dev/null':
  Metadata:
    major_brand     : isom
    minor_version   : 1
    compatible_brands: isomavc1
    composer        : Sacha Goedegebure
    title           : Big Buck Bunny, Sunflower version
    artist          : Blender Foundation 2008, Janus Bager Kristensen 2013
    comment         : Creative Commons Attribution 3.0 - http://bbb3d.renderfarming.net
    genre           : Animation
    encoder         : Lavf58.22.100
    Stream #0:0(und): Video: h264 (h264_nvenc) (Main), cuda, 1280x720 [SAR 1:1 DAR 16:9], q=-1--1, 2000 kb/s, 30 fps, 90k tbn, 30 tbc (default)
    Metadata:
      creation_time   : 2013-12-16T17:44:39.000000Z
      handler_name    : GPAC ISO Video Handler
      encoder         : Lavc58.39.100 h264_nvenc
    Side data:
      cpb: bitrate max/min/avg: 0/0/2000000 buffer size: 4000000 vbv_delay: -1
    Stream #0:1(und): Audio: ac3 (ac-3 / 0x332D6361), 48000 Hz, 5.1(side), fltp, 320 kb/s (default)
    Metadata:
      creation_time   : 2013-12-16T17:44:42.000000Z
      handler_name    : GPAC ISO Audio Handler
    Side data:
      audio service type: main
frame= 1372 fps= 14 q=21.0 Lsize=   14369kB time=00:00:45.73 bitrate=2573.8kbits/s speed=0.483x    
video:11382kB audio:1781kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 9.155905%

nvidia-smi dmon
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    56    49     -    11     4    30    37  6800  1560
    0    54    49     -    10     4    30    39  6800  1560
    0    54    49     -    10     4    32    43  6800  1590
    0    55    49     -    10     4    31    40  6800  1515
    0    57    49     -    10     4    31    40  6800  1635

The interactive session gets just under 15 fps (speed=0.483x) instead of 30.

If you check with ffmpeg-4.0.3 (which doesn't have the above-mentioned commits), you will see correct utilization even when using -temporal-aq 1.
If you use -hwaccel nvdec or don't use -hwaccel at all (software decoding and scaling), the problem doesn't happen either.
The problem also doesn't show if you use nvidia-cuda-mps to handle the encodes.

For comparison, sample output of running ffmpeg-4.0.3 interactively while 16 sessions were already running, followed by nvidia-smi dmon output:

./ffmpeg-4.0.3 -hwaccel cuvid -c:v h264_cuvid -re -i bbb_sunflower_1080p_30fps_normal.mp4 -vf scale_npp=w=1280:h=720 -c:v h264_nvenc -preset medium -refs 4 -bf 3 -temporal-aq 1 -acodec copy -f mpegts -y /dev/null
ffmpeg version 4.0.3 Copyright (c) 2000-2018 the FFmpeg developers
  built with gcc 8 (Debian 8.2.0-9)
  configuration: --enable-runtime-cpudetect --disable-decoder=amrnb --disable-decoder=libopenjpeg --disable-mips32r2 --disable-mips32r6 --disable-mips64r6 --disable-mipsdsp --disable-mipsdspr2 --disable-mipsfpu --disable-msa --disable-libopencv --disable-podpages --disable-sndio --disable-stripping --enable-libaom --enable-avfilter --enable-gcrypt --enable-gnutls --enable-gpl --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libfdk-aac --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libilbc --enable-libkvazaar --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenh264 --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvo-amrwbenc --enable-libvorbis --enable-libvpx --enable-libx265 --enable-libxvid --enable-libzvbi --enable-libnpp --enable-cuda-sdk --enable-nonfree --enable-opencl --enable-opengl --enable-postproc --enable-pthreads --enable-static --disable-shared --enable-version3 --enable-libwebp --incdir=/usr/include/x86_64-linux-gnu --libdir=/usr/lib/x86_64-linux-gnu --prefix=/usr --toolchain=hardened --enable-frei0r --enable-chromaprint --enable-libx264 --enable-libiec61883 --enable-libdc1394 --enable-vaapi --enable-libmfx --disable-altivec --shlibdir=/usr/lib/x86_64-linux-gnu
  libavutil      56. 14.100 / 56. 14.100
  libavcodec     58. 18.100 / 58. 18.100
  libavformat    58. 12.100 / 58. 12.100
  libavdevice    58.  3.100 / 58.  3.100
  libavfilter     7. 16.100 /  7. 16.100
  libswscale      5.  1.100 /  5.  1.100
  libswresample   3.  1.100 /  3.  1.100
  libpostproc    55.  1.100 / 55.  1.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'bbb_sunflower_1080p_30fps_normal.mp4':
  Metadata:
    major_brand     : isom
    minor_version   : 1
    compatible_brands: isomavc1
    creation_time   : 2013-12-16T17:44:39.000000Z
    title           : Big Buck Bunny, Sunflower version
    artist          : Blender Foundation 2008, Janus Bager Kristensen 2013
    comment         : Creative Commons Attribution 3.0 - http://bbb3d.renderfarming.net
    genre           : Animation
    composer        : Sacha Goedegebure
  Duration: 00:10:34.53, start: 0.000000, bitrate: 3481 kb/s
    Stream #0:0(und): Video: h264 (High) (avc1 / 0x31637661), yuv420p, 1920x1080 [SAR 1:1 DAR 16:9], 2998 kb/s, 30 fps, 30 tbr, 30k tbn, 60 tbc (default)
    Metadata:
      creation_time   : 2013-12-16T17:44:39.000000Z
      handler_name    : GPAC ISO Video Handler
    Stream #0:1(und): Audio: mp3 (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 160 kb/s (default)
    Metadata:
      creation_time   : 2013-12-16T17:44:42.000000Z
      handler_name    : GPAC ISO Audio Handler
    Stream #0:2(und): Audio: ac3 (ac-3 / 0x332D6361), 48000 Hz, 5.1(side), fltp, 320 kb/s (default)
    Metadata:
      creation_time   : 2013-12-16T17:44:42.000000Z
      handler_name    : GPAC ISO Audio Handler
    Side data:
      audio service type: main
Stream mapping:
  Stream #0:0 -> #0:0 (h264 (h264_cuvid) -> h264 (h264_nvenc))
  Stream #0:2 -> #0:1 (copy)
Press [q] to stop, [?] for help
Output #0, mpegts, to '/dev/null':
  Metadata:
    major_brand     : isom
    minor_version   : 1
    compatible_brands: isomavc1
    composer        : Sacha Goedegebure
    title           : Big Buck Bunny, Sunflower version
    artist          : Blender Foundation 2008, Janus Bager Kristensen 2013
    comment         : Creative Commons Attribution 3.0 - http://bbb3d.renderfarming.net
    genre           : Animation
    encoder         : Lavf58.12.100
    Stream #0:0(und): Video: h264 (h264_nvenc) (Main), cuda, 1280x720 [SAR 1:1 DAR 16:9], q=-1--1, 2000 kb/s, 30 fps, 90k tbn, 30 tbc (default)
    Metadata:
      creation_time   : 2013-12-16T17:44:39.000000Z
      handler_name    : GPAC ISO Video Handler
      encoder         : Lavc58.18.100 h264_nvenc
    Side data:
      cpb: bitrate max/min/avg: 0/0/2000000 buffer size: 4000000 vbv_delay: -1
    Stream #0:1(und): Audio: ac3 (ac-3 / 0x332D6361), 48000 Hz, 5.1(side), fltp, 320 kb/s (default)
    Metadata:
      creation_time   : 2013-12-16T17:44:42.000000Z
      handler_name    : GPAC ISO Audio Handler
    Side data:
      audio service type: main
frame= 1106 fps= 30 q=26.0 Lsize=   11893kB time=00:00:36.92 bitrate=2638.3kbits/s speed=0.999x    
video:9454kB audio:1444kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 9.133109%

# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    90    55     -    21     9    52    73  6800  1950
    0    90    55     -    20     8    53    73  6800  1950
    0    91    55     -    20     9    52    78  6800  1950
    0    90    55     -    20     8    52    75  6800  1950
    0    88    55     -    20     8    51    75  6800  1950
    0    87    55     -    21     8    53    71  6800  1950
    0    88    55     -    20     8    49    75  6800  1950
    0    85    55     -    21     8    53    74  6800  1950
    0    87    55     -    20     8    49    74  6800  1950
    0    85    55     -    21     9    54    74  6800  1950
    0    87    55     -    20     8    49    75  6800  1950
    0    84    55     -    21     9    54    70  6800  1950

Attachments (1)

disable_cuda_sync.diff (2.4 KB) - added by malakudi 5 years ago.
disable cuda sync

Change History (6)

comment:1 by Carl Eugen Hoyos, 5 years ago

Description: modified (diff)
Keywords: regression cuda nvenc added
Priority: normal → important

by malakudi, 5 years ago

Attachment: disable_cuda_sync.diff added

disable cuda sync

comment:2 by malakudi, 5 years ago

The problem still exists with current git. I have changed the testing process a bit and now run this script:

#!/bin/bash
for i in `seq 1 $1` ;
do 
ffmpeg-git -nostdin -loglevel error -stats \
-hwaccel cuvid -c:v h264_cuvid -surfaces 12 \
-i bbb_sunflower_1080p_30fps_normal.mp4 \
-vf scale_npp=w=1280:h=720 \
-c:v h264_nvenc \
-preset medium \
-refs 4 -bf 3 \
-temporal-aq 1 \
-acodec copy -f mpegts -y /dev/null &
done
wait
echo done

You can run several instances and check the average fps achieved. With current git code and temporal-aq enabled, one instance gets 653 fps on my RTX 2080; two instances get 175 fps each, so the total drops to 350 fps.
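One possible way to collect the per-instance fps figures is to keep each instance's stats output and read the last fps value from it. The helper below is a sketch under assumptions: last_fps is a made-up name, and it relies on the `fps=` field of the ffmpeg progress line having the form shown in the logs in this ticket.

```shell
#!/bin/bash
# last_fps: hypothetical helper, not part of ffmpeg.
# Reads ffmpeg's -stats output on stdin and prints the last fps value reported.
# ffmpeg rewrites the progress line using carriage returns, so split on them first.
last_fps() {
    tr '\r' '\n' | sed -n 's/.*fps= *\([0-9.][0-9.]*\).*/\1/p' | tail -n 1
}

# Example with a progress line copied from this report.
echo 'frame= 1372 fps= 14 q=21.0 Lsize=   14369kB time=00:00:45.73 bitrate=2573.8kbits/s speed=0.483x' | last_fps
# prints: 14
```

In the script above, each instance would need its stderr redirected to its own file (for example `2> run_$i.log`, a hypothetical name) so that `last_fps < run_$i.log` can be called once all instances have finished.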

I have attached a diff that disables cuda sync on current git. With sync disabled, performance is OK and there is no degradation. One instance runs at 669 fps (slightly higher than with the original git code), and two instances get 334 fps each, for a total of 668 fps.

So we see an almost 50% performance degradation when cuda sync and temporal-aq are both used. If temporal-aq is not used, there is no performance degradation.

Could a developer please confirm the issue and either revert the usage of cuda sync or escalate the issue to nvidia?

comment:3 by malakudi, 5 years ago

I also ran tests on a Quadro P2000 with driver 390.87 (the RTX 2080 tests used driver 418.56), where I can run more than two instances. The results are different, but there is still a big performance hit.

1 instance => 578 fps
1 instance with temporal-aq 0 => 604 fps
1 instance with cuStreamSynchronize disabled => 600 fps
2 instances => 2*229 => 458 fps
4 instances => 4*114 => 456 fps
8 instances => 8*57 => 456 fps
16 instances => 16*27 => 432 fps
16 instances with temporal-aq 0 => 16*37 => 592 fps
16 instances with cuStreamSynchronize disabled => 16*37 => 592 fps
24 instances => 24*17 => 408 fps
24 instances with cuStreamSynchronize disabled => 24*24 => 576 fps

In every test above with more than one session, nvidia-smi dmon shows that utilization never reaches 100% when using current git. The use of cuStreamSynchronize hurts performance badly, and the impact grows as the number of concurrent encoding sessions increases.
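The size of the hit can be worked out from the figures above. The following is a small sketch of that arithmetic; the function names total_fps and loss_pct are made up for illustration, and the run with cuStreamSynchronize disabled is taken as the baseline.

```shell
#!/bin/bash
# Hypothetical helpers for the arithmetic in this comment.

# total_fps INSTANCES FPS_PER_INSTANCE: aggregate throughput across all instances.
total_fps() {
    echo $(( $1 * $2 ))
}

# loss_pct TOTAL BASELINE: how far TOTAL falls below BASELINE, in percent.
loss_pct() {
    LC_ALL=C awk -v t="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - t) * 100 / b }'
}

# 16 instances: 27 fps each with sync, 37 fps each with sync disabled.
loss_pct "$(total_fps 16 27)" "$(total_fps 16 37)"   # prints: 27.0
# 24 instances: 17 fps each with sync, 24 fps each with sync disabled.
loss_pct "$(total_fps 24 17)" "$(total_fps 24 24)"   # prints: 29.2
```

Applied to the RTX 2080 totals from comment:2, `loss_pct 350 668` gives 47.6, which matches the "almost 50%" figure reported there.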

comment:4 by Philip Langdale, 5 years ago

Resolution: fixed
Status: new → closed

I've removed the unnecessary synchronisation calls, so performance should be reasonable now.

comment:5 by malakudi, 5 years ago

I confirm that the fix you committed works fine. May I suggest also backporting it to the 4.1 branch for the 4.1.3 release, because the whole 4.1 series is affected by this issue. Thank you.
