Opened 17 months ago

Last modified 7 weeks ago

#6156 reopened defect

Issue with FFMPEG ALSA CPU usage

Reported by: mushm0m Owned by:
Priority: normal Component: avdevice
Version: git-master Keywords: alsa
Cc: Blocked By:
Blocking: Reproduced by developer: no
Analyzed by developer: no

Description (last modified by llogan)

I am on a Raspberry Pi Zero with a Pi Camera and an I2S MEMS mic, running the latest Raspbian Jessie. I am attempting to livestream audio and video. I've discovered that when I record ALSA audio with FFMPEG, that process takes 95% of my CPU and produces stuttering, unusable audio in the stream.

I isolated the issue to the following:

  1. ffmpeg -f alsa -acodec pcm_s32le -i mic_sv temp.wav

This single command takes 95% of my CPU.

  1. arecord -Dmic_sv -c2 -r48000 -fS32_LE -twav temp.wav

This single command takes <5% of my CPU.

I have added a custom hook to ffmpeg so we can see the PCM settings. I have ensured that settings (buffer size, period size, etc.) are identical between the two commands.

Here is the full console output of FFMPEG:

$ ALSA_CONFIG_PATH=/usr/share/alsa/alsa.conf:./dumpasoundrc  /home/pi/special/ffmpeg/ffmpeg -v 9 -loglevel 99  -f alsa -acodec pcm_s32le -i michooks -f  null -
ffmpeg version git-2017-02-05-e57fd92 Copyright (c) 2000-2017 the FFmpeg developers
  built with gcc 4.9.2 (Raspbian 4.9.2-10)
  configuration: --extra-cflags=-I/opt/vc/include/IL --enable-nonfree --enable-omx-rpi --logfile=CONFIG.TXT
  libavutil      55. 46.100 / 55. 46.100
  libavcodec     57. 75.100 / 57. 75.100
  libavformat    57. 66.101 / 57. 66.101
  libavdevice    57.  2.100 / 57.  2.100
  libavfilter     6. 73.100 /  6. 73.100
  libswscale      4.  3.101 /  4.  3.101
  libswresample   2.  4.100 /  2.  4.100
Splitting the commandline.
Reading option '-v' ... matched as option 'v' (set logging level) with argument '9'.
Reading option '-loglevel' ... matched as option 'loglevel' (set logging level) with argument '99'.
Reading option '-f' ... matched as option 'f' (force format) with argument 'alsa'.
Reading option '-acodec' ... matched as option 'acodec' (force audio codec ('copy' to copy stream)) with argument 'pcm_s32le'.
Reading option '-i' ... matched as input url with argument 'michooks'.
Reading option '-f' ... matched as option 'f' (force format) with argument 'null'.
Reading option '-' ... matched as output url.
Finished splitting the commandline.
Parsing a group of options: global .
Applying option v (set logging level) with argument 9.
Successfully parsed a group of options.
Parsing a group of options: input url michooks.
Applying option f (force format) with argument alsa.
Applying option acodec (force audio codec ('copy' to copy stream)) with argument pcm_s32le.
Successfully parsed a group of options.
Opening an input file: michooks.
[ALSA Debug Hook] Installing hooks.
Hooks PCM
Slave: Soft volume PCM
Control: Boost Capture Volume
min_dB: -3
max_dB: 50
resolution: 256
Slave: Route conversion PCM
  Transformation table:
    0 <- 0
    1 <- 0
Slave: Mmap emulation PCM
Slave: Hardware PCM card 1 'mems-mic' device 0 subdevice 0
[ALSA Debug Hook] hw_params:
Hooks PCM
Slave: Soft volume PCM
Control: Boost Capture Volume
min_dB: -3
max_dB: 50
resolution: 256
Its setup is:
  stream       : CAPTURE
  access       : RW_INTERLEAVED
  format       : S32_LE
  subformat    : STD
  channels     : 2
  rate         : 48000
  exact rate   : 48000 (48000/1)
  msbits       : 32
  buffer_size  : 65536
  period_size  : 16384
  period_time  : 341333
  tstamp_mode  : NONE
  period_step  : 1
  avail_min    : 16384
  period_event : 0
  start_threshold  : 1
  stop_threshold   : 65536
  silence_threshold: 0
  silence_size : 0
  boundary     : 1073741824
Slave: Route conversion PCM
  Transformation table:
    0 <- 0
    1 <- 0
Its setup is:
  stream       : CAPTURE
  access       : MMAP_INTERLEAVED
  format       : S32_LE
  subformat    : STD
  channels     : 2
  rate         : 48000
  exact rate   : 48000 (48000/1)
  msbits       : 32
  buffer_size  : 65536
  period_size  : 16384
  period_time  : 341333
  tstamp_mode  : NONE
  period_step  : 1
  avail_min    : 16384
  period_event : 0
  start_threshold  : 1
  stop_threshold   : 65536
  silence_threshold: 0
  silence_size : 0
  boundary     : 1073741824
Slave: Mmap emulation PCM
Its setup is:
  stream       : CAPTURE
  access       : MMAP_INTERLEAVED
  format       : S32_LE
  subformat    : STD
  channels     : 2
  rate         : 48000
  exact rate   : 48000 (48000/1)
  msbits       : 32
  buffer_size  : 65536
  period_size  : 16384
  period_time  : 341333
  tstamp_mode  : NONE
  period_step  : 1
  avail_min    : 16384
  period_event : 0
  start_threshold  : 1
  stop_threshold   : 65536
  silence_threshold: 0
  silence_size : 0
  boundary     : 1073741824
Slave: Hardware PCM card 1 'mems-mic' device 0 subdevice 0
Its setup is:
  stream       : CAPTURE
  access       : RW_INTERLEAVED
  format       : S32_LE
  subformat    : STD
  channels     : 2
  rate         : 48000
  exact rate   : 48000 (48000/1)
  msbits       : 32
  buffer_size  : 65536
  period_size  : 16384
  period_time  : 341333
  tstamp_mode  : NONE
  period_step  : 1
  avail_min    : 16384
  period_event : 0
  start_threshold  : 1073741824
  stop_threshold   : 65536
  silence_threshold: 0
  silence_size : 0
  boundary     : 1073741824
  appl_ptr     : 0
  hw_ptr       : 0
[alsa @ 0x31ea450] All info found
[alsa @ 0x31ea450] stream 0: start_time: 1487382765.290 duration: -9223372036854.775
[alsa @ 0x31ea450] format: start_time: 1487382765.290 duration: -9223372036854.775 bitrate=3072 kb/s
Guessed Channel Layout for Input Stream #0.0 : stereo
Input #0, alsa, from 'michooks':
  Duration: N/A, start: 1487382765.289814, bitrate: 3072 kb/s
    Stream #0:0, 1, 1/1000000: Audio: pcm_s32le, 48000 Hz, stereo, s32, 3072 kb/s
Successfully opened the file.
Parsing a group of options: output url -.
Applying option f (force format) with argument null.
Successfully parsed a group of options.
Opening an output file: -.
Successfully opened the file.
detected 1 logical cores
[graph_0_in_0_0 @ 0x31f9310] Setting 'time_base' to value '1/48000'
[graph_0_in_0_0 @ 0x31f9310] Setting 'sample_rate' to value '48000'
[graph_0_in_0_0 @ 0x31f9310] Setting 'sample_fmt' to value 's32'
[graph_0_in_0_0 @ 0x31f9310] Setting 'channel_layout' to value '0x3'
[graph_0_in_0_0 @ 0x31f9310] tb:1/48000 samplefmt:s32 samplerate:48000 chlayout:0x3
[format_out_0_0 @ 0x31f9f10] Setting 'sample_fmts' to value 's16'
[format_out_0_0 @ 0x31f9f10] auto-inserting filter 'auto_resampler_0' between the filter 'Parsed_anull_0' and the filter 'format_out_0_0'
[AVFilterGraph @ 0x31f9980] query_formats: 4 queried, 6 merged, 3 already done, 0 delayed
[auto_resampler_0 @ 0x31faf10] [SWR @ 0x321e130] Using fltp internally between filters
[auto_resampler_0 @ 0x31faf10] ch:2 chl:stereo fmt:s32 r:48000Hz -> ch:2 chl:stereo fmt:s16 r:48000Hz
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf57.66.101
    Stream #0:0, 0, 1/48000: Audio: pcm_s16le, 48000 Hz, stereo, s16, 1536 kb/s
    Metadata:
      encoder         : Lavc57.75.100 pcm_s16le
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s32le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
cur_dts is invalid (this is harmless if it occurs once at the start per stream)
size=N/A time=00:00:17.84 bitrate=N/A speed=   1x
video:0kB audio:3345kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown
Input file #0 (michooks):
  Input stream #0:0 (audio): 26372 packets read (6851424 bytes); 26372 frames decoded (856428 samples);
  Total: 26372 packets (6851424 bytes) demuxed
Output file #0 (pipe:):
  Output stream #0:0 (audio): 26372 frames encoded (856428 samples); 26372 packets muxed (3425712 bytes);
  Total: 26372 packets (3425712 bytes) muxed
26372 frames successfully decoded, 0 decoding errors
Exiting normally, received signal 2.

Here is the console output of Arecord:

$ arecord -v -r48000 -c2 -fS32_LE --period-size=16384 --buffer-size=65536 -Dmic_sv -twav /dev/null
Recording WAVE '/dev/null' : Signed 32 bit Little Endian, Rate 48000 Hz, Stereo
Soft volume PCM
Control: Boost Capture Volume
min_dB: -3
max_dB: 50
resolution: 256
Its setup is:
  stream       : CAPTURE
  access       : RW_INTERLEAVED
  format       : S32_LE
  subformat    : STD
  channels     : 2
  rate         : 48000
  exact rate   : 48000 (48000/1)
  msbits       : 32
  buffer_size  : 65536
  period_size  : 16384
  period_time  : 341333
  tstamp_mode  : NONE
  period_step  : 1
  avail_min    : 16384
  period_event : 0
  start_threshold  : 1
  stop_threshold   : 65536
  silence_threshold: 0
  silence_size : 0
  boundary     : 1073741824
Slave: Route conversion PCM
  Transformation table:
    0 <- 0
    1 <- 0
Its setup is:
  stream       : CAPTURE
  access       : MMAP_INTERLEAVED
  format       : S32_LE
  subformat    : STD
  channels     : 2
  rate         : 48000
  exact rate   : 48000 (48000/1)
  msbits       : 32
  buffer_size  : 65536
  period_size  : 16384
  period_time  : 341333
  tstamp_mode  : NONE
  period_step  : 1
  avail_min    : 16384
  period_event : 0
  start_threshold  : 1
  stop_threshold   : 65536
  silence_threshold: 0
  silence_size : 0
  boundary     : 1073741824
Slave: Mmap emulation PCM
Its setup is:
  stream       : CAPTURE
  access       : MMAP_INTERLEAVED
  format       : S32_LE
  subformat    : STD
  channels     : 2
  rate         : 48000
  exact rate   : 48000 (48000/1)
  msbits       : 32
  buffer_size  : 65536
  period_size  : 16384
  period_time  : 341333
  tstamp_mode  : NONE
  period_step  : 1
  avail_min    : 16384
  period_event : 0
  start_threshold  : 1
  stop_threshold   : 65536
  silence_threshold: 0
  silence_size : 0
  boundary     : 1073741824
Slave: Hardware PCM card 1 'mems-mic' device 0 subdevice 0
Its setup is:
  stream       : CAPTURE
  access       : RW_INTERLEAVED
  format       : S32_LE
  subformat    : STD
  channels     : 2
  rate         : 48000
  exact rate   : 48000 (48000/1)
  msbits       : 32
  buffer_size  : 65536
  period_size  : 16384
  period_time  : 341333
  tstamp_mode  : NONE
  period_step  : 1
  avail_min    : 16384
  period_event : 0
  start_threshold  : 1073741824
  stop_threshold   : 65536
  silence_threshold: 0
  silence_size : 0
  boundary     : 1073741824
  appl_ptr     : 0
  hw_ptr       : 0
^CAborted by signal Interrupt...

After a long chat in the FFMPEG IRC channel, we believe there may be a bug in the way FFMPEG waits for ALSA packets. Arecord uses snd_pcm_wait() to wait for packets, and this command is not used in FFMPEG sources. Inserting -re before -i into my FFMPEG command reduces CPU usage to <5%, but is not the right approach, and causes other problems (bitrate is almost 0kbps, audio file generated is <5 kilobytes).

Change History (13)

comment:1 Changed 17 months ago by mushm0m

Apologies for any incorrect formatting, it's my first time submitting a ticket. Please let me know if I can provide additional info!

comment:2 Changed 17 months ago by llogan

  • Component changed from avcodec to undetermined
  • Description modified (diff)
  • Keywords alsa added; ALSA removed
  • Version changed from unspecified to git-master

comment:3 Changed 17 months ago by mushm0m

Two things I've noticed, unsure if significant, from the FFMPEG debug output I've shared:

  1. ALSA duration is reported as negative value - why?
  2. ALSA bitrate is reported as 3072kbps, which seems high (I only have a Pi Zero). Is this configurable? How do I figure out the ALSA bitrate my Arecord command is using, to see if they are different?

comment:4 Changed 8 months ago by cehoyos

  • Resolution set to invalid
  • Status changed from new to closed

Please reopen this ticket if the issue is also reproducible with the following command:

$ ffmpeg -f alsa -i mic_sv -acodec pcm_s32le temp.wav

comment:5 follow-up: Changed 3 months ago by kainz

  • Resolution invalid deleted
  • Status changed from closed to reopened

So on a pi0w using a similar setup, but with a PCM line mic that i've got rigged to show up as a alsa PCM capture device @48000/s16le, I can absolutely reproduce the CPU usage issue when running ffmpeg -f alsa -i default -acodec pcm_s16le out.wav, which produces the following output:

`
ffmpeg -f alsa -i default -acodec pcm_s16le out.wav
ffmpeg version N-90737-gb3b3a3e3ce Copyright (c) 2000-2018 the FFmpeg developers

built with gcc 6.3.0 (Raspbian 6.3.0-18+rpi1+deb9u1) 20170516
configuration: --prefix=/usr/local --arch=armhf --target-os=linux --enable-gpl --enable-mmal --enable-omx --enable-omx-rpi --enable-nonfree --enable-libopus
libavutil 56. 15.100 / 56. 15.100
libavcodec 58. 19.100 / 58. 19.100
libavformat 58. 13.100 / 58. 13.100
libavdevice 58. 4.100 / 58. 4.100
libavfilter 7. 17.100 / 7. 17.100
libswscale 5. 2.100 / 5. 2.100
libswresample 3. 2.100 / 3. 2.100
libpostproc 55. 2.100 / 55. 2.100

Guessed Channel Layout for Input Stream #0.0 : stereo
Input #0, alsa, from 'default':

Duration: N/A, start: 1524040707.062083, bitrate: 1536 kb/s

Stream #0:0: Audio: pcm_s16le, 48000 Hz, stereo, s16, 1536 kb/s

Stream mapping:

Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))

Press [q] to stop, ? for help
Output #0, wav, to 'out.wav':

Metadata:

ISFT : Lavf58.13.100
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, stereo, s16, 1536 kb/s
Metadata:

encoder : Lavc58.19.100 pcm_s16le

size= 893kB time=00:00:04.76 bitrate=1536.5kbits/s speed=1.01x
video:0kB audio:893kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.008531%
`

Looking at the general in/filter/out process, libavdevice/alsa_dec.c (http://git.videolan.org/?p=ffmpeg.git;a=blob;f=libavdevice/alsa_dec.c;h=c50ce715064ad20f21c7d3c41d1e3003c69fef3f;hb=refs/heads/master#l116) returns AVERROR(EAGAIN) when snd_pcm_readi does not have ready data. This triggers the root cause of this issue, which is that in most places in fftools/ffmpeg.c, when a loop receives EAGAIN, it immediately loops. This means that if you are running faster than the sample rate, you are going to be pegging the CPU looping in and out of audio_read_packet.

I suppose this behavior is not unique to alsa_dec.c. In fact, if we take a quick survey of other libavdevice implementations, we can see a difference of implementation choices --

  • jack.c - here, a jack thread runs to deal with jackd's realtime callbacks to fill a FIFO that then gets 'polled' by audio_read_packet. If a packet is not available, the jack 'frontend' waits up to two seconds and then hardfails.
  • oss.c - the lower layer file read error is returned directly. I presume this includes -EAGAIN from the driver interface.
  • iec61883.c - raw1394 polls its file descriptor to wait.
  • openal-dec.c - AVERROR(EAGAIN) pattern is used here.
  • pulse_audio_dec.c - here we call into the pulseaudio main loop wait-on-the-event-thread and let it deal with things. Unless pulseaudio dies or the connection fails somehow (or the stream is terminated/deleted), this loop will not terminate.

Given a lack of consensus in the code base on how this loop is handled, I'd propose implementing use of snd_pcm_wait. That said, there's also a possible API shortcoming in general here -- should libavdevices/AVFormats implement a wait/ready/whats-my-pollfd callback and push this handling out somewhere more generic?

Barring the above, I'm going to drop a snd_pcm_wait(s->h, 16); into my alsa_dec.c and report back. That should induce a roughly 1/60th second wait if data is not yet available.

Presuming the above would work, any reccomendations/ideas on implementing a proper fix?

comment:6 Changed 3 months ago by kainz

So something about snd_pcm_wait seems to not be a solution on its own... For comparison though, using ffmpeg with the JACK input only ate about 2-7% CPU for each of jackd and ffmpeg, running jackd in 48000/s16LE mode, and using ffmpeg to write a 48000/S16LE PCM wav file. Still investigating...

Last edited 3 months ago by kainz (previous) (diff)

comment:7 in reply to: ↑ 5 ; follow-up: Changed 3 months ago by cehoyos

Replying to kainz:

So on a pi0w using a similar setup, but with a PCM line mic that i've got rigged to show up as a alsa PCM capture device @48000/s16le, I can absolutely reproduce the CPU usage issue when running ffmpeg -f alsa -i default -acodec pcm_s16le out.wav

Please confirm that the issue is not reproducible with arecord.

comment:8 in reply to: ↑ 7 Changed 3 months ago by kainz

Replying to cehoyos:

Replying to kainz:

So on a pi0w using a similar setup, but with a PCM line mic that i've got rigged to show up as a alsa PCM capture device @48000/s16le, I can absolutely reproduce the CPU usage issue when running ffmpeg -f alsa -i default -acodec pcm_s16le out.wav

Please confirm that the issue is not reproducible with arecord.

arecord -D hw:0 -c 2 -f S16_LE -r 48000 out.wav uses approximately 0.7-1% cpu on the pi0w, as does using -D default.

I get the pegged CPU on ffmpeg using either -i default or -i hw:0 on ffmpeg.

So, I just tested with pulseaudio as well now, ensuring that I have pulseaudio daemon's default sample rate set to 48k. This gives me ffmpeg -y -f pulse -i default -r 48000 out.wav results of 1.6% CPU use on ffmpeg and 0.7% cpu use on pulseaudio, as well.

comment:9 Changed 3 months ago by cehoyos

  • Component changed from undetermined to avdevice

comment:10 Changed 3 months ago by kainz

Still digging... as a further comparison, using gstreamer, the following pipeline produces about 8% cpu use for me on the pi0:

gst-launch-1.0 -v alsasrc device=hw:0 ! audio/x-raw, format=S16LE, rate=48000 ! queue ! wavenc ! filesink location=out.wav

comment:11 follow-up: Changed 3 months ago by kainz

So after having done some profiling runs on ffmpeg, I think it's just ffmpeg's processing loop causing most of the cpu use in this case, via smaller buffers (but large enough to make malloc/free expensive) and an accordingly larger amount of allocations.

I added some debugging code to print the selected ALSA pcm hw buffer size, and compared profiling runs between ffmpeg with alsa-direct and ffmpeg with alsa-jack. What happens is on the raspberrypi hardware, the i2s/pcm input has a 128KB buffer, and ffmpeg ends up allocating/deallocating this on every frame. For comparison, the alsapulse plugin ends up using a 1MB buffer, but provides larger frames, so the allocations dont happen as often.

Given all of the interactions above, I'm not really sure how best to mitigate this, and my usecase was aggravated by trying to combine the above with a concurrent v4l2 capture, which has proven challenging for other people with ffmpeg in the past.

If the above behavior is considered a design bug, I suppose I could help find and/or test a solution to it, but I think the fix would end up going deep into the ffmpeg/libavcodec/transcoding code base, probably more than is comfortable. Otherwise, I suppose this is really an unfortunate sideeffect of the hardware implementation.

An easy mitigation could be to add a commandline argument(s?) to tell the alsa demuxer to limit the buffer(or periods etc) sizes to better match the platform vs what the drivers report as the maximum permissible values.

Thoughts?

comment:12 in reply to: ↑ 11 ; follow-up: Changed 3 months ago by cehoyos

Replying to kainz:

So after having done some profiling runs on ffmpeg, I think it's just ffmpeg's processing loop causing most of the cpu use in this case, via smaller buffers (but large enough to make malloc/free expensive) and an accordingly larger amount of allocations.

Does increasing the buffer size fixes the issue for you?

comment:13 in reply to: ↑ 12 Changed 7 weeks ago by kainz

Replying to cehoyos:

Does increasing the buffer size fixes the issue for you?

Oddly enough, I was seeing the opposite. I needed to reduce the size of the buffer to fix the issue. I'm not where I can test easily right now, but after 6/3 I'll have access to what i'd need to test with various buffer sizes to show its effect on CPU.

Note: See TracTickets for help on using tickets.