Generating captions
06 Feb 2021

(Assumed audience: folks familiar with Combine and, optionally, Point-Free’s swift-composable-architecture and -parsing packages.)
While RIFF wasn’t quite ready for App Store prime time during my time on it, it was our responsibility as a primarily audio-based app to make sure the eventual launch was accessible to the Deaf and Hard of Hearing community¹.
And in hopes of making captions generation more widely practiced, I wanted to write through an approach — but first, some context on the status quo of captions files.
SRT files
A decent chunk of RIFFs are cross-posted to Twitter, and the service’s captions file format of choice is SubRip Text (.srt, for short).
Thankfully, SubRip Text files are relatively straightforward to generate, and in turn, parse.
The plaintext format is as follows:
- Subtitle groups are sequentially numbered from 1.
- On the next line, the timecode at which the subtitle should appear, followed by a --> separator, and then the timecode for its disappearance.
- Timecodes are formatted in hours:minutes:seconds,milliseconds with two zero-padded digits for the hours, minutes, and seconds and three for the milliseconds component (i.e. hh:mm:ss,SSS).
- The subtitle text on one or more lines.
- A blank line, indicating the end of the subtitle group.
- (Repeating the above for as many groups as needed.)
Here’s an example (from the end of an old Mama Singh voicemail):
(…tail of file.)
97
00:00:32,189 --> 00:00:32,490
OK
98
00:00:33,060 --> 00:00:33,329
love
99
00:00:33,329 --> 00:00:33,509
you
100
00:00:33,509 --> 00:00:33,740
take
101
00:00:33,740 --> 00:00:33,910
care
102
00:00:33,910 --> 00:00:34,289
bye
Apple’s Speech framework reports back transcription segments as an array of SFTranscriptionSegments — we’ll get to requesting those in a bit, but let’s start with the assumption that we have a collection of them in hand and want to write a function from ([SFTranscriptionSegment]) -> String, where the returned string is SRT-formatted. Here’s some scaffolding we’ll work under:
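A minimal sketch of that scaffolding, assuming a subRipTimingLine helper we’ll fill in next:

```swift
import Speech

/// Renders transcription segments as an SRT-formatted string.
func subRipText(from segments: [SFTranscriptionSegment]) -> String {
  segments
    .enumerated()
    .map { offset, segment in
      [
        "\(offset + 1)",                 // SRT sequence numbers start at 1, not 0.
        subRipTimingLine(for: segment),  // e.g. "00:00:33,060 --> 00:00:33,329"
        segment.substring                // The transcribed text itself.
      ]
      .joined(separator: "\n")
    }
    .joined(separator: "\n\n")           // A blank line closes out each group.
}
```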
In prose, we’re:

- enumerating segments (shifting the zero-based offset up one) to construct the subtitle group sequence numbers,
- assuming we have a function subRipTimingLine in hand that’ll generate the -->-separated timecodes,
- tacking on SFTranscriptionSegment.substring (the transcription Speech relays back to us),
- and finally joining each group’s lines together, then joining all groups with blank lines.
Let’s fill in that subRipTimingLine gap.
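Here’s a sketch of the helper and its supporting formatter (the exact spelling is an assumption; the (1) and (2) markers are called out just below):

```swift
import Foundation
import Speech

let millisecondsDigits = 3

let subRipTimeIntervalFormatter: DateComponentsFormatter = {
  let formatter = DateComponentsFormatter()
  formatter.allowedUnits = [.hour, .minute, .second]
  formatter.unitsStyle = .positional
  formatter.zeroFormattingBehavior = .pad // (1) Zero-pads the hh, mm, and ss positions.
  return formatter
}()

func subRipTimestamp(for time: TimeInterval) -> String {
  // (2) Hand-rolling the milliseconds component: take the fractional part,
  // shift it up three decimal places, round, and keep at most three digits.
  let milliseconds = String(
    Int((time.truncatingRemainder(dividingBy: 1) * 1_000).rounded())
  )
  .prefix(millisecondsDigits)
  let zeroPadding = String(repeating: "0", count: millisecondsDigits - milliseconds.count)

  // Force unwrap for sketch brevity — nonnegative, finite intervals format fine.
  return "\(subRipTimeIntervalFormatter.string(from: time)!),\(zeroPadding)\(milliseconds)"
}

func subRipTimingLine(for segment: SFTranscriptionSegment) -> String {
  "\(subRipTimestamp(for: segment.timestamp)) --> \(subRipTimestamp(for: segment.timestamp + segment.duration))"
}
```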
That was…a lot. Thankfully, subRipTimeIntervalFormatter can do the zero-padding for us in the hours, minutes, and seconds positions (over at (1)), and we can hand-roll it for the milliseconds value over at (2). time.truncatingRemainder(dividingBy: 1) returns time’s fractional component, which is then shifted up three decimal places by multiplying by 1_000, rounded, converted to a string, and trimmed to at most the first three characters of that final Int-turned-String. From there, we pad millisecondsDigits - milliseconds.count many zeros and append the result onto the formatted string from subRipTimeIntervalFormatter.
Which brings us to the question I punted on earlier: how do we get an array of SFTranscriptionSegments from Speech in the first place?
Speech recognition requests
The framework packs an SFSpeechRecognitionRequest base class for recognition requests and two subclasses: SFSpeechAudioBufferRecognitionRequest and SFSpeechURLRecognitionRequest. The former transcribes live audio and the latter, existing audio files — since RIFF recording is backed by AVAudioRecorder, which requires an on-disk location for the final audio file, we’ll step through URL-backed recognition requests.
(No sweat if you’re more familiar with Combine, proper, and not Point-Free’s Effect wrapper type over it. You can read Effect<Output, Failure> as AnyPublisher<Output, Failure> and Effect.future as a Deferred Future, in the usual sense.)
We start off by guarding against a nil SFSpeechRecognizer (the initializer can nil out “if the user’s default language is not supported for speech recognition”) and checking that the recognizer is available; availability is flipped to true after pinging SFSpeechRecognizer.requestAuthorization and permissions are granted (we’ll get to this in a bit).
Lastly, we construct the request with the audioURL argument and kick off the recognition task. For RIFF’s case, we only need the final transcription, but if you’d like intermediate transcription results, flipping SFSpeechURLRecognitionRequest.shouldReportPartialResults to true will pipe them through the completion handler.
We’ll give the effect a spin with Point-Free’s Composable Architecture (abbreviated TCA). However, the logic above is UI-agnostic (sans Effect being included in TCA’s framework), so it can be dropped into a vanilla SwiftUI-and-Combine or UIKit-backed app with minimal changes.
Here’s a recording of a sample project with transcriptionEffect in action.
There are two bits of the project to focus in on in ContentView.swift: the reducer calling SFSpeechRecognizer.requestAuthorization when Action.onAppear is dispatched, and how to call transcriptionEffect with a bundled audio file while making sure it’s subscribed to off the main thread and that results are delivered back on the main thread for the reducer to handle.
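In sketch form, with hypothetical AppState, AppAction, and AppEnvironment names standing in for the sample project’s types:

```swift
import ComposableArchitecture
import Speech

struct AppState: Equatable {
  var transcript: String = ""
}

enum AppAction: Equatable {
  case onAppear
  case transcribeButtonTapped
  case transcriptionResult(Result<SFTranscription, TranscriptionError>)
}

struct AppEnvironment {
  var audioURL: URL // URL of the bundled sample audio file.
  var backgroundQueue: AnySchedulerOf<DispatchQueue>
  var mainQueue: AnySchedulerOf<DispatchQueue>
}

let appReducer = Reducer<AppState, AppAction, AppEnvironment> { state, action, environment in
  switch action {
  case .onAppear:
    // Ping for speech-recognition permissions up front.
    return .fireAndForget {
      SFSpeechRecognizer.requestAuthorization { _ in }
    }

  case .transcribeButtonTapped:
    return transcriptionEffect(audioURL: environment.audioURL)
      .subscribe(on: environment.backgroundQueue) // Recognition work off the main thread…
      .receive(on: environment.mainQueue)         // …results back on it for the reducer.
      .catchToEffect()
      .map(AppAction.transcriptionResult)

  case let .transcriptionResult(.success(transcription)):
    state.transcript = subRipText(from: transcription.segments)
    return .none

  case .transcriptionResult(.failure):
    return .none
  }
}
```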
⬦
…aaaand there we have it — this sketch of captions generation and the above sample project should hopefully help folks building audio-based apps with SRT file generation. It’s a baseline level of accessibility I’d love to see met more often.
For learning’s sake, let’s wrap up by writing an SRT parser to check that transcriptionEffect’s output is formatted to spec.
SRT parsing
In compositional parsing fashion — which definitely isn’t high fashion — let’s start with the most involved piece, the timecode line, and then glue it together with the remaining bits.
The timecode format string is hh:mm:ss,SSS --> hh:mm:ss,SSS. Or, in plain English: two zero-padded digits each for the hours, minutes, and seconds, a comma, a three-digit, zero-padded milliseconds component, a --> separator, and then the timecode part repeated once more.
Let’s start with the zero-padded two- and three-digit numbers (flipping isSigned to false disallows leading - or + signs).
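A sketch in swift-parsing’s then-current .take/.skip style; for brevity, this leans on Int.parser alone and stays lenient about the exact two- and three-digit widths (a stricter parser could layer a length check on top):

```swift
import Parsing

// An unsigned, base-10 integer parser over Substring input. isSigned: false
// rejects leading "-"/"+" signs.
let paddedInt = Int.parser(of: Substring.self, isSigned: false)
```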
And now splitting them across literal parsers for the : and , separators and repeating twice around the --> arrow.
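A sketch, spelling the literal separators with StartsWith:

```swift
// hh:mm:ss,SSS: hours, minutes, seconds, and milliseconds, split across
// the literal ":" and "," separators.
let timecodeParser = paddedInt
  .skip(StartsWith(":"))
  .take(paddedInt)
  .skip(StartsWith(":"))
  .take(paddedInt)
  .skip(StartsWith(","))
  .take(paddedInt)

// …and the full timing line repeats the timecode on either side of the arrow.
let subRipTimesParser = timecodeParser
  .skip(StartsWith(" --> "))
  .take(timecodeParser)
```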
Ah! Almost forgot to map on timecodeParser to squash (Int, Int, Int, Int) down to a TimeInterval value we can roll into a SubtitleGroup type that we’ll introduce soon.
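Something like this, superseding the earlier definitions:

```swift
// Squash (hours, minutes, seconds, milliseconds) down to a single TimeInterval.
let timeIntervalParser = timecodeParser.map { hours, minutes, seconds, milliseconds in
  TimeInterval(hours) * 3_600
    + TimeInterval(minutes) * 60
    + TimeInterval(seconds)
    + TimeInterval(milliseconds) / 1_000
}

let subRipTimesParser = timeIntervalParser
  .skip(StartsWith(" --> "))
  .take(timeIntervalParser)
```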
Much better.
Now we can zoom back out and start parsing subtitle groups at a time.
Oof. Newline is constrained to UTF8 code units, so we’ll need to tee up our parsers to work with that constraint (note the added .utf8s in the snippet below).
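Re-tooled, the digit and separator parsers keep the same shape (a sketch, as before):

```swift
// Int.parser now consumes Substring.UTF8View, and the literal separators
// gain .utf8 so everything speaks UTF8 code units, matching Newline.
let paddedInt = Int.parser(of: Substring.UTF8View.self, isSigned: false)

let timecodeParser = paddedInt
  .skip(StartsWith(":".utf8))
  .take(paddedInt)
  .skip(StartsWith(":".utf8))
  .take(paddedInt)
  .skip(StartsWith(",".utf8))
  .take(paddedInt)
  .map { hours, minutes, seconds, milliseconds in
    TimeInterval(hours) * 3_600
      + TimeInterval(minutes) * 60
      + TimeInterval(seconds)
      + TimeInterval(milliseconds) / 1_000
  }
```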
Onto the last two parts of each subtitle group: the caption itself and the double-newline separator. To package each group, let’s introduce a SubtitleGroup struct.
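A sketch (for simplicity, this caption parser only handles single-line captions):

```swift
struct SubtitleGroup: Equatable {
  let sequenceNumber: Int
  let startTimecode: TimeInterval
  let endTimecode: TimeInterval
  let caption: String
}

// The caption: everything up to the next newline. (Multi-line captions are
// left as an exercise.)
let captionParser = Prefix<Substring.UTF8View> { $0 != UInt8(ascii: "\n") }
  .map { String(decoding: $0, as: UTF8.self) }

// Inlining the arrow separator here keeps the output tuple flat.
let srtGroupParser = paddedInt
  .skip(Newline())
  .take(timecodeParser)
  .skip(StartsWith(" --> ".utf8))
  .take(timecodeParser)
  .skip(Newline())
  .take(captionParser)
  .map { sequenceNumber, startTimecode, endTimecode, caption in
    SubtitleGroup(
      sequenceNumber: sequenceNumber,
      startTimecode: startTimecode,
      endTimecode: endTimecode,
      caption: caption
    )
  }
```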
Woot woot. We can finally repeat srtGroupParser with Many’s help and check that we’ve consumed all input with a trailing .skip(End()).
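A sketch, assuming any trailing newline has been trimmed from the input:

```swift
// Repeat groups around blank-line separators and insist on consuming all input.
let srtParser = Many(srtGroupParser, separator: StartsWith("\n\n".utf8))
  .skip(End())
```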
We could stop here after checking that it parses some sample SRT-formatted strings. But let’s add a few validation checks to make sure sequence numbers are in (1...) form, that startTimecodes are strictly less than endTimecodes within each group (i.e. a caption group must have a positive presentation time), and that adjacent groups have nondecreasing timecodes (a subsequent group shouldn’t overlap with — or come before — another).
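Sketched as an extension on the parsed output:

```swift
extension Array where Element == SubtitleGroup {
  var isValidSubRipText: Bool {
    // Sequence numbers count up from 1…
    zip(map(\.sequenceNumber), 1...).allSatisfy { $0 == $1 }
      // …every group has a positive presentation time…
      && allSatisfy { $0.startTimecode < $0.endTimecode }
      // …and adjacent groups never overlap or run backwards.
      && zip(self, dropFirst()).allSatisfy { $0.endTimecode <= $1.startTimecode }
  }
}
```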
⬦
I’ll leave writing tests as an exercise for the reader — but, if you’ve made it through these 1.2k+ words, please take a break before then! You’ve more than earned it.
Until next time.
■
1. Maya Gold’s apology for the lack of accessibility in audio tweets was an incredibly honest and powerful example of how to admit responsibility that more of our industry should learn from. ↩