Opened 5 years ago

Last modified 22 months ago

#3118 open enhancement

SAMI: multiple languages not detected

Reported by: eelco Owned by:
Priority: normal Component: avformat
Version: git-master Keywords:
Cc: nfxjfg@googlemail.com Blocked By:
Blocking: Reproduced by developer: no
Analyzed by developer: no

Description

Summary of the bug:

SAMI files can contain multiple languages, but ffmpeg handles the file as containing a single stream, with no way to select only one language.

How to reproduce:

./ffmpeg -i multiple_languages.smi out.srt
ffmpeg version N-57932-g89a3be8 Copyright (c) 2000-2013 the FFmpeg developers
  built on Nov  5 2013 16:30:18 with Apple LLVM version 5.0 (clang-500.2.78) (based on LLVM 3.3svn)
  configuration: --prefix=/Users/eelco/Projects/Beamer/FFmpeg/build --disable-shared
  libavutil      52. 52.100 / 52. 52.100
  libavcodec     55. 41.100 / 55. 41.100
  libavformat    55. 21.100 / 55. 21.100
  libavdevice    55.  5.100 / 55.  5.100
  libavfilter     3. 90.102 /  3. 90.102
  libswscale      2.  5.101 /  2.  5.101
  libswresample   0. 17.104 /  0. 17.104
Input #0, sami, from 'multiple_languages.smi':
  Duration: N/A, bitrate: N/A
    Stream #0:0: Subtitle: sami
Output #0, srt, to 'out.srt':
  Metadata:
    encoder         : Lavf55.21.100
    Stream #0:0: Subtitle: subrip
Stream mapping:
  Stream #0:0 -> #0:0 (sami -> subrip)
Press [q] to stop, [?] for help
size=      38kB time=00:11:43.56 bitrate=   0.4kbits/s    
video:0kB audio:0kB subtitle:23 global headers:0kB muxing overhead 63.508757%

The input file (multiple_languages.smi) defines the different languages in the ‘style sheet’:

...
            <STYLE TYPE="text/css">
            <!--
            P { margin-left:2pt; margin-right:2pt; margin-bottom:1pt;
                font-size:20pt; text-align:center; font-weight:bold;
                color:white; }
            .ENCC { Name:English; lang:en-US; SAMIType:CC; }
            .KRCC { Name:한국어; lang:ko-KR; SAMIType:CC; }
            -->
            </STYLE>
...
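
For illustration, a minimal Python sketch (not FFmpeg code) of how the language classes could be read out of such a style sheet. The function name parse_sami_languages is hypothetical, and the regular expression assumes simple rules of the form shown above; it is not a full CSS parser.

```python
import re

def parse_sami_languages(style_text):
    """Return (class_name, lang, name) tuples in style sheet order."""
    languages = []
    # Match class rules such as ".ENCC { Name:English; lang:en-US; SAMIType:CC; }".
    for match in re.finditer(r"\.(\w+)\s*\{([^}]*)\}", style_text):
        class_name, body = match.group(1), match.group(2)
        props = {}
        for decl in body.split(";"):
            if ":" in decl:
                key, value = decl.split(":", 1)
                props[key.strip().lower()] = value.strip()
        # Only classes that declare a lang property describe a language.
        if "lang" in props:
            languages.append((class_name, props["lang"], props.get("name", "")))
    return languages

style = """
P { margin-left:2pt; margin-right:2pt; margin-bottom:1pt;
    font-size:20pt; text-align:center; font-weight:bold;
    color:white; }
.ENCC { Name:English; lang:en-US; SAMIType:CC; }
.KRCC { Name:한국어; lang:ko-KR; SAMIType:CC; }
"""

print(parse_sami_languages(style))
# → [('ENCC', 'en-US', 'English'), ('KRCC', 'ko-KR', '한국어')]
```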

And uses the classes to mark the language:

...
<SYNC Start=10109><P Class=KRCC>
<br>사랑과 배신<br>탐욕과 살육의 이야기죠
<SYNC Start=13977><P Class=KRCC>&nbsp;
<SYNC Start=17667><P Class=KRCC>
<br>선악의 정의에 대해서<br>대립하는 가치관을 가진
...
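
As a rough sketch of what per-language handling would involve, the Python snippet below groups cues by their Class attribute. It is not FFmpeg code; cues_by_class is a hypothetical helper, the regular expression only covers the simple `<SYNC Start=...><P Class=...>` layout shown above, and the ENCC line in the sample is an assumed counterpart to the Korean cue, based on the English text seen in the output.

```python
import re

# One cue: a SYNC tag, a P tag with a class, then text up to the next SYNC.
SYNC_RE = re.compile(
    r"<SYNC\s+Start=[\"']?(\d+)[\"']?\s*>\s*"
    r"<P\s+Class=[\"']?(\w+)[\"']?\s*>(.*?)(?=<SYNC|\Z)",
    re.IGNORECASE | re.DOTALL,
)

def cues_by_class(body):
    """Map each class name to its list of (start_ms, text) cues."""
    tracks = {}
    for start, cls, text in SYNC_RE.findall(body):
        tracks.setdefault(cls.upper(), []).append((int(start), text.strip()))
    return tracks

# The ENCC cue below is assumed for illustration; the ticket only quotes KRCC cues.
sample = """<SYNC Start=10109><P Class=ENCC>
There is love and betrayal,<br>greed and murder.
<SYNC Start=10109><P Class=KRCC>
<br>사랑과 배신<br>탐욕과 살육의 이야기죠
<SYNC Start=13977><P Class=KRCC>&nbsp;
"""

for cls, cues in cues_by_class(sample).items():
    print(cls, cues)
```

Each key of the returned dictionary would correspond to one subtitle track if the languages were exposed separately.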

The output, however, mixes both languages:

...
4
00:00:10,109 --> 00:00:13,979

There is love and betrayal,
greed and murder. 

5
00:00:17,667 --> 00:00:17,667

선악의 정의에 대해서
대립하는 가치관을 가진 

6
00:00:17,667 --> 00:00:21,717

It's set in this interesting
world of contrasting ideology, 
...

Attachments (3)

multiple_languages.smi (56.4 KB) - added by eelco 5 years ago.
out.srt (37.7 KB) - added by eelco 5 years ago.
0001-avformat-basic-language-support-in-SAMI-subtitles.patch (12.2 KB) - added by Jehan 22 months ago.
Basic language support in SAMI files.


Change History (10)

Changed 5 years ago by eelco

Changed 5 years ago by eelco

comment:1 Changed 5 years ago by Cigaes

  • Component changed from undetermined to avformat
  • Status changed from new to open
  • Type changed from defect to enhancement
  • Version changed from unspecified to git-master

As said above, the language is set using a CSS property applied through classes.

Currently, the SAMI demuxer just copies the CSS stylesheet into the extradata, and the decoder ignores it.

Handling styled text is still an open issue.

comment:2 follow-up: Changed 5 years ago by gjdfgh

The different languages should really be handled as separate subtitle tracks.

This is also how they're meant to be used AFAIK.

Note that SAMI has other issues, and the ffmpeg SAMI decoder is apparently completely unusable for Korean users.

comment:3 Changed 5 years ago by gjdfgh

  • Cc nfxjfg@googlemail.com added

comment:4 Changed 22 months ago by Jehan

I very often get SMI files containing several languages, and it is very annoying because the only way right now to view them properly in ffmpeg-based media players is to edit the file and delete the unwanted language. It would be good to have multi-language support.

Here is a link to the spec, which explains how this works: https://msdn.microsoft.com/en-us/library/ms971327.aspx

comment:5 in reply to: ↑ 2 Changed 22 months ago by Jehan

Replying to gjdfgh:

Note that SAMI has other issues, and the ffmpeg SAMI decoder is apparently completely unusable for Korean users.

A side note on this specific sentence: maybe it was true 3 years ago when it was written (though I am not sure about that either, because for a dozen years all my media players of choice have been based on ffmpeg, and I have read Korean subtitles for many years). Currently, apart from the fact that I have to edit out the English part of SAMI files before playing, SAMI subtitles work OK. I play videos in mpv (hence with ffmpeg) with SAMI subtitles in Korean nearly every day (with a native Korean speaker reading said subtitles).

Changed 22 months ago by Jehan

Basic language support in SAMI files.

comment:6 Changed 22 months ago by Jehan

I have had a look at the code for SAMI support, and I propose this first basic version. It is not yet full support, which would extract several subtitle tracks from a single SAMI file. Instead, it simply extracts only the subtitles of the default language (the first language in the list, cf. the section "Class and Localization" in the spec).

It is not perfect, but it is still better than the current situation.
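
To make the described behaviour concrete, here is a small Python sketch of "keep only the default language", where the default is taken to be the first class in the style sheet that declares a lang property, as the comment above describes. This is not the attached patch and not FFmpeg code: the function names and the sample document are hypothetical, and the English cues in the sample are assumed for illustration.

```python
import re

CUE_RE = re.compile(
    r"<SYNC\s+Start=[\"']?(\d+)[\"']?\s*>\s*"
    r"<P\s+Class=[\"']?(\w+)[\"']?\s*>(.*?)(?=<SYNC|</BODY|\Z)",
    re.IGNORECASE | re.DOTALL,
)

def default_language_class(sami_text):
    """Return the first class in <STYLE> that declares lang, or None."""
    style = re.search(r"<STYLE[^>]*>(.*?)</STYLE>", sami_text, re.I | re.S)
    if not style:
        return None
    for cls, body in re.findall(r"\.(\w+)\s*\{([^}]*)\}", style.group(1)):
        if re.search(r"\blang\s*:", body, re.I):
            return cls
    return None

def default_language_cues(sami_text):
    """Return (start_ms, text) cues for the default language only."""
    default = default_language_class(sami_text)
    return [
        (int(start), text.strip())
        for start, cls, text in CUE_RE.findall(sami_text)
        # If no language class is declared, keep every cue.
        if default is None or cls.lower() == default.lower()
    ]

# Hypothetical sample; the English cues are assumed for illustration.
sami = """<SAMI><HEAD><STYLE TYPE="text/css">
<!--
.ENCC { Name:English; lang:en-US; SAMIType:CC; }
.KRCC { Name:한국어; lang:ko-KR; SAMIType:CC; }
-->
</STYLE></HEAD><BODY>
<SYNC Start=10109><P Class=ENCC>There is love and betrayal,<br>greed and murder.
<SYNC Start=10109><P Class=KRCC><br>사랑과 배신<br>탐욕과 살육의 이야기죠
<SYNC Start=13977><P Class=ENCC>&nbsp;
</BODY></SAMI>
"""

print(default_language_class(sami))
print(default_language_cues(sami))
```

With the sample above, ENCC comes first in the style sheet, so only the English cues are kept.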

comment:7 Changed 22 months ago by cehoyos

Please send your patch to the development mailing list; it will be ignored here.
