Opened 3 years ago

Closed 3 years ago

#4148 closed defect (fixed)

Crash in ff_add_bytes_l2_sse2 when decoding attached APNG file

Reported by: benoit Owned by:
Priority: important Component: avcodec
Version: git-master Keywords: png crash SIGSEGV regression
Cc: Blocked By:
Blocking: Reproduced by developer: yes
Analyzed by developer: no

Description

How to reproduce:

$ gdb --args ./ffmpeg_g  -i dolske/whee.png -f null /dev/null
GNU gdb (GDB) 7.6.1-ubuntu
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/hack/ffmpeg/videolan.org/ffmpeg_g...done.
(gdb) r
Starting program: /home/hack/ffmpeg/videolan.org/./ffmpeg_g -i dolske/whee.png -f null /dev/null
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
ffmpeg version N-68130-gb50e003 Copyright (c) 2000-2014 the FFmpeg developers
  built on Dec  1 2014 14:20:21 with gcc 4.8 (Ubuntu/Linaro 4.8.1-10ubuntu9)
  configuration: --enable-gpl --enable-libmp3lame --enable-libx264 --samples=/home/hack/ffmpeg/videolan.org/fate-suite
  libavutil      54. 15.100 / 54. 15.100
  libavcodec     56. 13.100 / 56. 13.100
  libavformat    56. 15.101 / 56. 15.101
  libavdevice    56.  3.100 / 56.  3.100
  libavfilter     5.  2.103 /  5.  2.103
  libswscale      3.  1.101 /  3.  1.101
  libswresample   1.  1.100 /  1.  1.100
  libpostproc    53.  3.100 / 53.  3.100
Input #0, apng, from 'dolske/whee.png':
  Duration: N/A, start: 0.000000, bitrate: N/A
    Stream #0:0: Video: apng, rgba, 230x200, 12.08 fps, 12.05 tbr, 100k tbn, 100k tbc
[New Thread 0x7ffff2d87700 (LWP 9343)]
[New Thread 0x7ffff2586700 (LWP 9344)]
[New Thread 0x7ffff1d85700 (LWP 9345)]
[New Thread 0x7ffff1584700 (LWP 9346)]
[New Thread 0x7ffff0d83700 (LWP 9347)]
[New Thread 0x7ffff0582700 (LWP 9348)]
[New Thread 0x7fffefd81700 (LWP 9349)]
[New Thread 0x7fffef580700 (LWP 9350)]
[New Thread 0x7fffeed7f700 (LWP 9351)]
[New Thread 0x7fffee57e700 (LWP 9352)]
[New Thread 0x7fffedd7d700 (LWP 9353)]
[New Thread 0x7fffed57c700 (LWP 9354)]
[New Thread 0x7fffecd7b700 (LWP 9355)]
[New Thread 0x7fffec57a700 (LWP 9356)]
[New Thread 0x7fffebd79700 (LWP 9357)]
[New Thread 0x7fffeb578700 (LWP 9358)]
[New Thread 0x7fffead77700 (LWP 9359)]
[New Thread 0x7fffea576700 (LWP 9360)]
Output #0, null, to '/dev/null':
  Metadata:
    encoder         : Lavf56.15.101
    Stream #0:0: Video: rawvideo (RGBA / 0x41424752), rgba, 230x200, q=2-31, 200 kb/s, 12.05 fps, 12.05 tbn, 12.05 tbc
    Metadata:
      encoder         : Lavc56.13.100 rawvideo
Stream mapping:
  Stream #0:0 -> #0:0 (apng (native) -> rawvideo (native))
Press [q] to stop, [?] for help
[null @ 0x1ae2260] Encoder did not produce proper pts, making some up.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffead77700 (LWP 9359)]
ff_add_bytes_l2_sse2.loop_v () at libavcodec/x86/pngdsp.asm:90
90      ADD_BYTES_FN 2
(gdb) bt
#0  ff_add_bytes_l2_sse2.loop_v () at libavcodec/x86/pngdsp.asm:90
#1  0x0000000000918e76 in png_filter_row (dsp=0x1aa3000, dst=0x7fffc400e754 "", filter_type=2, src=0x7fffc40300b0 "\002\002\002", last=0x7fffc400e3b4 "\034_\274\377 c\300\377\036e\277\377", size=352, bpp=4) at libavcodec/pngdec.c:255
#2  0x00000000009197a9 in png_handle_row (s=0x1aa3000) at libavcodec/pngdec.c:313
#3  0x0000000000919de0 in png_decode_idat (s=0x1aa3000, length=6125) at libavcodec/pngdec.c:398
#4  0x000000000091b373 in decode_idat_chunk (avctx=0x1aa2780, s=0x1aa3000, length=6125, p=0x1aa35c0) at libavcodec/pngdec.c:679
#5  0x000000000091d4c1 in decode_frame_common (avctx=0x1aa2780, s=0x1aa3000, p=0x1aa35c0, avpkt=0x1ad0478) at libavcodec/pngdec.c:1014
#6  0x000000000091dfd5 in decode_frame_apng (avctx=0x1aa2780, data=0x1aa2c60, got_frame=0x1ad04e0, avpkt=0x1ad0478) at libavcodec/pngdec.c:1171
#7  0x000000000092955d in frame_worker_thread (arg=0x1ad0378) at libavcodec/pthread_frame.c:158
#8  0x00007ffff741ff6e in start_thread (arg=0x7fffead77700) at pthread_create.c:311
#9  0x00007ffff657a9cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) disassemble $pc-32,$pc+32
Dump of assembler code from 0xc4d3cc to 0xc4d40c:
   0x0000000000c4d3cc <ff_diff_pixels_sse2.loop+74>:    add    %al,(%rax)
   0x0000000000c4d3ce <ff_diff_pixels_sse2.loop+76>:    add    %al,(%rax)
   0x0000000000c4d3d0 <ff_add_bytes_l2_sse2+0>: movslq %ecx,%rcx
   0x0000000000c4d3d3 <ff_add_bytes_l2_sse2+3>: xor    %r9,%r9
   0x0000000000c4d3d6 <ff_add_bytes_l2_sse2+6>: mov    %rcx,%r8
   0x0000000000c4d3d9 <ff_add_bytes_l2_sse2+9>: and    $0xffffffffffffffe0,%rcx
   0x0000000000c4d3dd <ff_add_bytes_l2_sse2+13>:        jmp    0xc4d40a <ff_add_bytes_l2_sse2.end_v>
   0x0000000000c4d3df <ff_add_bytes_l2_sse2.loop_v+0>:  movdqa (%rsi,%r9,1),%xmm0
   0x0000000000c4d3e5 <ff_add_bytes_l2_sse2.loop_v+6>:  movdqa 0x10(%rsi,%r9,1),%xmm1
=> 0x0000000000c4d3ec <ff_add_bytes_l2_sse2.loop_v+13>: paddb  (%rdx,%r9,1),%xmm0
   0x0000000000c4d3f2 <ff_add_bytes_l2_sse2.loop_v+19>: paddb  0x10(%rdx,%r9,1),%xmm1
   0x0000000000c4d3f9 <ff_add_bytes_l2_sse2.loop_v+26>: movdqa %xmm0,(%rdi,%r9,1)
   0x0000000000c4d3ff <ff_add_bytes_l2_sse2.loop_v+32>: movdqa %xmm1,0x10(%rdi,%r9,1)
   0x0000000000c4d406 <ff_add_bytes_l2_sse2.loop_v+39>: add    $0x20,%r9
   0x0000000000c4d40a <ff_add_bytes_l2_sse2.end_v+0>:   cmp    %rcx,%r9
End of assembler dump.
(gdb) info registers all
rax            0xc4d3d0 12899280
rbx            0x1aa2780        27928448
rcx            0x160    352
rdx            0x7fffc400e3b4   140736481780660
rsi            0x7fffc40300b0   140736481919152
rdi            0x7fffc400e754   140736481781588
rbp            0x1ad0448        0x1ad0448
rsp            0x7fffead76b88   0x7fffead76b88
r8             0x160    352
r9             0x0      0
r10            0x5865b  362075
r11            0x1aa34f8        27931896
r12            0x13b2060        20652128
r13            0x1acf740        28112704
r14            0x1ad0378        28115832
r15            0x1ad0478        28116088
rip            0xc4d3ec 0xc4d3ec <ff_add_bytes_l2_sse2.loop_v+13>
eflags         0x10287  [ CF PF SF IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
st0            -inf     (raw 0xffff0000000000000000)
st1            -inf     (raw 0xffff0000000000000000)
st2            -inf     (raw 0xffff0000000000000000)
st3            -inf     (raw 0xffff0000000000000000)
st4            -inf     (raw 0xffff0000000000000000)
st5            -inf     (raw 0xffff0000000000000000)
st6            -inf     (raw 0xffff0000000000000000)
st7            -inf     (raw 0xffff0000000000000000)
fctrl          0x37f    895
fstat          0x0      0
ftag           0xaaaa   43690
fiseg          0x0      0
fioff          0x0      0
foseg          0x0      0
fooff          0x0      0
fop            0x0      0
mxcsr          0x1fa8   [ OE PE IM DM ZM OM UM PM ]
ymm0           {v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0x0, 0x8000000000000000, 0x0, 0x0}, v32_int8 = {0x2, 0x2, 0x2, 0x0, 0x3, 0x3, 0x3, 0x0, 0x2, 0x2, 0x2, 0x0, 0x21, 0x68, 0xc2, 0xff, 
    0x0 <repeats 16 times>}, v16_int16 = {0x202, 0x2, 0x303, 0x3, 0x202, 0x2, 0x6821, 0xffc2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int32 = {0x20202, 0x30303, 0x20202, 0xffc26821, 0x0, 0x0, 0x0, 0x0}, v4_int64 = {0x3030300020202, 
    0xffc2682100020202, 0x0, 0x0}, v2_int128 = {0xffc26821000202020003030300020202, 0x00000000000000000000000000000000}}
ymm1           {v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0x0, 0x0, 0x0, 0x0}, v32_int8 = {0x1f, 0x69, 0xc4, 0xff, 0xff, 0xff, 0xff, 0x0, 0xfe, 0xfe, 0xfe, 0x0 <repeats 21 times>}, v16_int16 = {0x691f, 0xffc4, 
    0xffff, 0xff, 0xfefe, 0xfe, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int32 = {0xffc4691f, 0xffffff, 0xfefefe, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int64 = {0xffffffffc4691f, 0xfefefe, 0x0, 0x0}, v2_int128 = {
    0x0000000000fefefe00ffffffffc4691f, 0x00000000000000000000000000000000}}
ymm2           {v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0x0, 0x0, 0x0, 0x0}, v32_int8 = {0x0 <repeats 32 times>}, v16_int16 = {0x0 <repeats 16 times>}, v8_int32 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 
  v4_int64 = {0x0, 0x0, 0x0, 0x0}, v2_int128 = {0x00000000000000000000000000000000, 0x00000000000000000000000000000000}}
ymm3           {v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0x0, 0x0, 0x0, 0x0}, v32_int8 = {0x0 <repeats 32 times>}, v16_int16 = {0x0 <repeats 16 times>}, v8_int32 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 
  v4_int64 = {0x0, 0x0, 0x0, 0x0}, v2_int128 = {0x00000000000000000000000000000000, 0x00000000000000000000000000000000}}
ymm4           {v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0x0, 0x0, 0x0, 0x0}, v32_int8 = {0x0 <repeats 32 times>}, v16_int16 = {0x0 <repeats 16 times>}, v8_int32 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 
  v4_int64 = {0x0, 0x0, 0x0, 0x0}, v2_int128 = {0x00000000000000000000000000000000, 0x00000000000000000000000000000000}}
ymm5           {v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0x0, 0x8000000000000000, 0x0, 0x0}, v32_int8 = {0x0 <repeats 14 times>, 0xfe, 0xfe, 0x0 <repeats 16 times>}, v16_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 
    0x0, 0xfefe, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int32 = {0x0, 0x0, 0x0, 0xfefe0000, 0x0, 0x0, 0x0, 0x0}, v4_int64 = {0x0, 0xfefe000000000000, 0x0, 0x0}, v2_int128 = {0xfefe0000000000000000000000000000, 
    0x00000000000000000000000000000000}}
ymm6           {v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0x0, 0x0, 0x0, 0x0}, v32_int8 = {0xfe, 0x0, 0xff, 0xff, 0xff, 0x0 <repeats 27 times>}, v16_int16 = {0xfe, 0xffff, 0xff, 0x0 <repeats 13 times>}, 
  v8_int32 = {0xffff00fe, 0xff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int64 = {0xffffff00fe, 0x0, 0x0, 0x0}, v2_int128 = {0x0000000000000000000000ffffff00fe, 0x00000000000000000000000000000000}}
---Type <return> to continue, or q <return> to quit---
ymm7           {v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0x0, 0x0, 0x0, 0x0}, v32_int8 = {0x0 <repeats 32 times>}, v16_int16 = {0x0 <repeats 16 times>}, v8_int32 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 
  v4_int64 = {0x0, 0x0, 0x0, 0x0}, v2_int128 = {0x00000000000000000000000000000000, 0x00000000000000000000000000000000}}
ymm8           {v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0x0, 0x0, 0x0, 0x0}, v32_int8 = {0x0 <repeats 32 times>}, v16_int16 = {0x0 <repeats 16 times>}, v8_int32 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 
  v4_int64 = {0x0, 0x0, 0x0, 0x0}, v2_int128 = {0x00000000000000000000000000000000, 0x00000000000000000000000000000000}}
ymm9           {v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0x0, 0x0, 0x0, 0x0}, v32_int8 = {0x0 <repeats 14 times>, 0x3, 0x3, 0x0 <repeats 16 times>}, v16_int16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x303, 0x0, 
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v8_int32 = {0x0, 0x0, 0x0, 0x3030000, 0x0, 0x0, 0x0, 0x0}, v4_int64 = {0x0, 0x303000000000000, 0x0, 0x0}, v2_int128 = {0x03030000000000000000000000000000, 0x00000000000000000000000000000000}}
ymm10          {v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0x0, 0x0, 0x0, 0x0}, v32_int8 = {0x0, 0x0, 0x46, 0x84, 0x24, 0x59, 0xd6, 0x3e, 0x0 <repeats 24 times>}, v16_int16 = {0x0, 0x8446, 0x5924, 0x3ed6, 
    0x0 <repeats 12 times>}, v8_int32 = {0x84460000, 0x3ed65924, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int64 = {0x3ed6592484460000, 0x0, 0x0, 0x0}, v2_int128 = {0x00000000000000003ed6592484460000, 0x00000000000000000000000000000000}}
ymm11          {v8_float = {0x9689a800, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0x0, 0x0, 0x0, 0x0}, v32_int8 = {0x6a, 0xa2, 0x65, 0x50, 0xf2, 0xea, 0x8f, 0xbd, 0x0 <repeats 24 times>}, v16_int16 = {0xa26a, 0x5065, 0xeaf2, 
    0xbd8f, 0x0 <repeats 12 times>}, v8_int32 = {0x5065a26a, 0xbd8feaf2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int64 = {0xbd8feaf25065a26a, 0x0, 0x0, 0x0}, v2_int128 = {0x0000000000000000bd8feaf25065a26a, 
    0x00000000000000000000000000000000}}
ymm12          {v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0x0, 0x0, 0x0, 0x0}, v32_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc4, 0x3c, 0x0 <repeats 24 times>}, v16_int16 = {0x0, 0x0, 0x0, 0x3cc4, 
    0x0 <repeats 12 times>}, v8_int32 = {0x0, 0x3cc40000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int64 = {0x3cc4000000000000, 0x0, 0x0, 0x0}, v2_int128 = {0x00000000000000003cc4000000000000, 0x00000000000000000000000000000000}}
ymm13          {v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0x0, 0x0, 0x0, 0x0}, v32_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x80, 0x59, 0xbc, 0x0 <repeats 24 times>}, v16_int16 = {0x0, 0x0, 0x8000, 0xbc59, 
    0x0 <repeats 12 times>}, v8_int32 = {0x0, 0xbc598000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int64 = {0xbc59800000000000, 0x0, 0x0, 0x0}, v2_int128 = {0x0000000000000000bc59800000000000, 0x00000000000000000000000000000000}}
ymm14          {v8_float = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0x0, 0x0, 0x0, 0x0}, v32_int8 = {0x8e, 0x85, 0x83, 0xe8, 0xf0, 0x24, 0x53, 0x3c, 0x0 <repeats 24 times>}, v16_int16 = {0x858e, 0xe883, 0x24f0, 0x3c53, 
    0x0 <repeats 12 times>}, v8_int32 = {0xe883858e, 0x3c5324f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int64 = {0x3c5324f0e883858e, 0x0, 0x0, 0x0}, v2_int128 = {0x00000000000000003c5324f0e883858e, 0x00000000000000000000000000000000}}
ymm15          {v8_float = {0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_double = {0x2d, 0x0, 0x0, 0x0}, v32_int8 = {0xc0, 0x9, 0xf2, 0x16, 0xb5, 0xdf, 0x46, 0x40, 0x0 <repeats 24 times>}, v16_int16 = {0x9c0, 0x16f2, 0xdfb5, 0x4046, 
    0x0 <repeats 12 times>}, v8_int32 = {0x16f209c0, 0x4046dfb5, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int64 = {0x4046dfb516f209c0, 0x0, 0x0, 0x0}, v2_int128 = {0x00000000000000004046dfb516f209c0, 0x00000000000000000000000000000000}}

Note that I tried to disable the first loop in add_bytes_l2, and the output seems correct:

diff --git a/libavcodec/x86/pngdsp.asm b/libavcodec/x86/pngdsp.asm
index 8e23ccf..d4be8ad 100644
--- a/libavcodec/x86/pngdsp.asm
+++ b/libavcodec/x86/pngdsp.asm
@@ -50,8 +50,8 @@ cglobal add_bytes_l2, 4, 6, %1, dst, src1, src2, wa, w, i
     mova  [dstq+iq+mmsize], m1
     add                 iq, mmsize*2
 .end_v:
-    cmp                 iq, waq
-    jl .loop_v
+;    cmp                 iq, waq
+;    jl .loop_v
 
 %if mmsize == 16
     ; vector loop

Attachments (1)

whee_trnc.png (2.5 MB) - added by benoit 3 years ago.

Change History (10)

Changed 3 years ago by benoit

comment:1 Changed 3 years ago by cehoyos

  • Keywords png crash SIGSEGV added; pngdsp removed
  • Priority changed from normal to important
  • Status changed from new to open

comment:2 Changed 3 years ago by cehoyos

  • Keywords regression added

Could be considered a regression.

comment:3 Changed 3 years ago by benoit

Given this is triggered by the APNG demuxer/decoder combo which has just been added, I don't think this really qualify as a regression.

comment:4 Changed 3 years ago by kurosu

void (*add_bytes_l2)(uint8_t *dst /* align 16 */,

uint8_t *src1 /* align 16 */,
uint8_t *src2 /* align 16 */, int w);

rdx 0x7fffc400e3b4
rdi 0x7fffc400e754

dst/src2 are no longer abiding per the alignment constrain. Either the DSP must be modified (most probably that), a specific unaligned version be written, or the caller modified to use aligned addresses.

comment:5 Changed 3 years ago by benoit

Well, there is already the scalar loop in the DSP. What would you think of just adding a test (on dst or src2, as they have the same alignment), like this (note that I don't speak yasm ;-):

diff --git a/libavcodec/x86/pngdsp.asm b/libavcodec/x86/pngdsp.asm
index 8e23ccf..b43fb18 100644
--- a/libavcodec/x86/pngdsp.asm
+++ b/libavcodec/x86/pngdsp.asm
@@ -36,9 +36,16 @@ cglobal add_bytes_l2, 4, 6, %1, dst, src1, src2, wa, w, i
     movsxd             waq, wad
 %endif
     xor                 iq, iq
+    mov                 wq, waq
+
+    ; if dst buffer is unaligned, use scalar loop
+    mov                 waq, dstq
+    and                 waq, 15
+    test                waq, waq
+    jnz .end_s
 
     ; vector loop
-    mov                 wq, waq
+    mov                waq, wq
     and                waq, ~(mmsize*2-1)
     jmp .end_v
 .loop_v:

comment:6 Changed 3 years ago by kurosu

Crap, my comment was not posted, so I'll make it briefer:

0) Are you sure src1 is always aligned? I think not
1) Can the caller be fixed to use aligned addresses without copying around data? If yes, that should be done to have as few unaligned loads as possible
2) Speed generally matters, so deciding in case of unaligned addresses to use the slow, scalar version depends on how often and how large the areas it applies to are
3) If speed matters, a SIMD handling unaligned addresses should be used
4) Compare its speed to the aligned version for aligned addresses. Larger areas, older PCs should show larger differences.
5) If the speed difference is large enough, add a branch for unaligned addresses, otherwise just replace the aligned code
6) If you're right about src1:

test dstq, 15
jnz .handle_unaligned

If you're not:

mov iq, dstq
or   iq, src1q
test iq, 15
jnz .handle_unaligned

then xor iq on each path.

comment:7 Changed 3 years ago by kurosu

Note: that code should not present for mmx.
15 should be replaced by mmsize-1 (although the probability of an AVX2 version is low).

comment:8 Changed 3 years ago by benoit

0) yes, I am sure, at least it is advertised as being so (see libavcodec/pngdec.c:674)
1) no, it cannot, it's working on (arbitrary) subsets of a buffer, and the alignment cannot be guaranteed
2) the only thing I can tell is that it is only affecting certain APNG files, but their number is something that can vary
3) I thought of something like a "prologue" for the function, just like the end of the buffer is handled. My yasm skills are void, though, so I think a first approach would be to branch as I proposed (using your first version in 6) above)

I'll send a patch to do that shortly.
Thank you.

comment:9 Changed 3 years ago by cehoyos

  • Resolution set to fixed
  • Status changed from open to closed

Fixed by Christophe Gisquet in 9fa056ba

Note: See TracTickets for help on using tickets.