Generating captions

(Assumed audience: folks familiar with Combine and, optionally, Point-Free’s swift-composable-architecture and -parsing packages.)


While RIFF wasn’t quite ready for App Store prime time during my time on it, it was our responsibility as a primarily audio-based app to make sure the eventual launch was accessible to the Deaf and Hard of Hearing community.1

And in hopes of making captions generation more widely practiced, I wanted to write through an approach — but first, some context on the status quo of captions files.

SRT files

A decent chunk of RIFFs are cross-posted to Twitter, and the service’s captions file format of choice is SubRip Text (.srt, for short).

Subtitles, as we’re defining them, are transcripts of a video’s dialogue or audio contained in .srt files that are attached to videos through Media Studio, ads.twitter.com, or the API.

Thankfully, SubRip Text files are relatively straightforward to generate, and in turn, parse.

The plaintext format is as follows:

  • Subtitle groups are sequentially numbered from 1.
  • On the next line, the timecode at which the subtitle should appear, followed by a --> separator and the timecode at which it disappears.
    • Timecodes are formatted in hours:minutes:seconds,milliseconds with two zero-padded digits for the hours, minutes, and seconds and three for the milliseconds component (i.e. hh:mm:ss,SSS).
  • The subtitle text on one or more lines.
  • A blank line, indicating the end of the subtitle group.
  • (Repeating the above for as many groups as needed.)

Here’s an example (from the end of an old Mama Singh voicemail):

(…tail of file.)

97
00:00:32,189 --> 00:00:32,490
OK

98
00:00:33,060 --> 00:00:33,329
love

99
00:00:33,329 --> 00:00:33,509
you

100
00:00:33,509 --> 00:00:33,740
take

101
00:00:33,740 --> 00:00:33,910
care

102
00:00:33,910 --> 00:00:34,289
bye

Apple’s Speech framework reports back transcription segments as an array of SFTranscriptionSegments — we’ll get to requesting those in a bit, but let’s start with the assumption that we have a collection of them in hand and want to write a function from ([SFTranscriptionSegment]) -> String, where the returned string is SRT-formatted. Here’s some scaffolding we’ll work under:

(Gist permalink.)

In prose, we’re,

  • enumerating segments — and shifting the offset up by one — to construct the subtitle group sequence numbers,
  • assuming we have a function ‌subRipTimingLine in hand that’ll generate the -->-separated timecodes,
  • tacking on SFTranscriptionSegment.substring (the transcription Speech relays back to us),
  • and finally joining each group’s lines together, then joining all groups with blank lines.
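To make that shape concrete, here’s a minimal sketch of such a function. The names and the subRipTimingLine(start:end:) signature are guesses on my part, not necessarily what the gist uses:

```swift
import Speech

func srtFormattedCaptions(from segments: [SFTranscriptionSegment]) -> String {
  segments
    .enumerated()
    .map { offset, segment in
      [
        // SRT sequence numbers start at 1, while `enumerated()` starts at 0.
        "\(offset + 1)",
        // The `-->`-separated appearance and disappearance timecodes.
        subRipTimingLine(
          start: segment.timestamp,
          end: segment.timestamp + segment.duration
        ),
        // The transcribed text Speech hands back for this segment.
        segment.substring
      ]
      .joined(separator: "\n")
    }
    // Groups are separated by a blank line.
    .joined(separator: "\n\n")
}
```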

Let’s fill in that subRipTimingLine gap.

(Gist permalink.)

That was…a lot. Thankfully subRipTimeIntervalFormatter can do the zero-padding for us in the hours, minutes, and seconds positions (over at (1)), and we can hand roll it for the milliseconds value over at (2). time.truncatingRemainder(dividingBy: 1) returns time’s fractional component, which we then shift up three decimal places by multiplying by 1_000, round, convert to a string, and trim to at most its first three characters. From there, we pad with millisecondsDigits - milliseconds.count zeros and append the result onto the formatted string from subRipTimeIntervalFormatter.
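Sketched out, those two steps look roughly like the following. I’m assuming subRipTimeIntervalFormatter is a DateComponentsFormatter configured for zero-padded, positional output; the exact configuration and helper names are mine:

```swift
import Foundation

// (1) Zero-pads the hours, minutes, and seconds positions.
let subRipTimeIntervalFormatter: DateComponentsFormatter = {
  let formatter = DateComponentsFormatter()
  formatter.allowedUnits = [.hour, .minute, .second]
  formatter.unitsStyle = .positional
  formatter.zeroFormattingBehavior = .pad
  return formatter
}()

let millisecondsDigits = 3

func subRipTimecode(for time: TimeInterval) -> String {
  // (2) Hand-roll the milliseconds: shift the fractional component up three
  // decimal places, round, and keep at most three characters.
  let fractional = time.truncatingRemainder(dividingBy: 1)
  let milliseconds = String(String(Int((fractional * 1_000).rounded())).prefix(millisecondsDigits))

  // Left-pad with zeros so that, say, 60ms renders as "060".
  let padding = String(repeating: "0", count: millisecondsDigits - milliseconds.count)

  let hoursMinutesSeconds = subRipTimeIntervalFormatter.string(from: time) ?? "00:00:00"
  return hoursMinutesSeconds + "," + padding + milliseconds
}

func subRipTimingLine(start: TimeInterval, end: TimeInterval) -> String {
  "\(subRipTimecode(for: start)) --> \(subRipTimecode(for: end))"
}
```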

Which brings us to the question I punted earlier: how do we get an array of SFTranscriptionSegments from Speech in the first place?

Speech recognition requests

The framework packs an SFSpeechRecognitionRequest base class for recognition requests and two subclasses: SFSpeechAudioBufferRecognitionRequest and SFSpeechURLRecognitionRequest. The former transcribes live audio and the latter, existing audio files — since RIFF recording is backed by AVAudioRecorder, which requires an on-disk location for the final audio file, we’ll step through URL-backed recognition requests.

(Gist permalink.)

(No sweat if you’re more familiar with Combine, proper, and not Point-Free’s Effect wrapper type over it. You can read Effect<Output, Failure> as AnyPublisher<Output, Failure> and Effect.future as a Deferred Future, in the usual sense.)

We start off by guarding against a nil SFSpeechRecognizer (the initializer can nil out “if the user’s default language is not supported for speech recognition”) and then check whether the recognizer is available, which is flipped to true after pinging SFSpeechRecognizer.requestAuthorization and permissions are granted (we’ll get to this in a bit).

Lastly, we construct the request with the audioURL argument and kick off the recognition task. For RIFF’s case, we only need the final transcription, but if you’d like intermediate transcription results, flipping SFSpeechURLRecognitionRequest.shouldReportPartialResults to true will pipe them through the completion handler.
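Here’s a plain-Combine sketch of that flow; per the note above, you can read the Deferred Future as Effect.future. The error type is a stand-in of mine, and the real effect’s shape likely differs:

```swift
import Combine
import Speech

// Hypothetical error type for the sketch.
enum TranscriptionError: Error {
  case recognizerUnavailable
  case failed(Error)
}

func transcriptionPublisher(audioURL: URL) -> AnyPublisher<[SFTranscriptionSegment], TranscriptionError> {
  Deferred {
    Future<[SFTranscriptionSegment], TranscriptionError> { promise in
      // The initializer can nil out if the user's default language isn't
      // supported, and `isAvailable` guards against the recognizer not being
      // ready to take requests.
      guard let recognizer = SFSpeechRecognizer(), recognizer.isAvailable else {
        return promise(.failure(.recognizerUnavailable))
      }

      let request = SFSpeechURLRecognitionRequest(url: audioURL)
      // We only need the final transcription; flip this to `true` to receive
      // intermediate results, too.
      request.shouldReportPartialResults = false

      _ = recognizer.recognitionTask(with: request) { result, error in
        if let error = error {
          promise(.failure(.failed(error)))
        } else if let result = result, result.isFinal {
          promise(.success(result.bestTranscription.segments))
        }
      }
    }
  }
  .eraseToAnyPublisher()
}
```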

We’ll give the effect a spin with Point-Free’s Composable Architecture (abbreviated TCA). However, the logic above is UI-agnostic (sans Effect being included in TCA’s framework), so it can be dropped into a vanilla SwiftUI-and-Combine or UIKit-backed app with minimal changes.

Here’s a recording of a sample project with transcriptionEffect in action.

There are two bits of the project to focus on in ContentView.swift: the reducer calling SFSpeechRecognizer.requestAuthorization when Action.onAppear is dispatched, and how to call transcriptionEffect with a bundled audio file while making sure it’s subscribed to off the main thread and that results are delivered back on the main thread for the reducer to handle.

(Gist permalink.)
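TCA specifics aside, the same two beats can be sketched with vanilla Combine. This is an approximation of the gist’s wiring, not a copy: the bundled file name is a placeholder, and it leans on the transcriptionPublisher and srtFormattedCaptions sketches from earlier:

```swift
import Combine
import Foundation
import Speech

var cancellables = Set<AnyCancellable>()

func onAppear() {
  // Mirrors the reducer's `Action.onAppear` handling: ask for permission first.
  SFSpeechRecognizer.requestAuthorization { status in
    guard
      status == .authorized,
      let audioURL = Bundle.main.url(forResource: "voicemail", withExtension: "m4a")
    else { return }

    transcriptionPublisher(audioURL: audioURL)
      // Kick the recognition work off the main thread…
      .subscribe(on: DispatchQueue.global(qos: .userInitiated))
      // …and deliver segments back on the main thread for the UI (or reducer) to handle.
      .receive(on: DispatchQueue.main)
      .sink(
        receiveCompletion: { _ in },
        receiveValue: { segments in print(srtFormattedCaptions(from: segments)) }
      )
      .store(in: &cancellables)
  }
}
```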

…aaaand there we have it — this sketch of captions generation and the above sample project should hopefully help folks building audio-based apps with SRT file generation. It’s a baseline level of accessibility I’d love to see met more often.

For learning’s sake, let’s wrap up by writing an SRT parser to check that transcriptionEffect’s output is formatted to spec.

SRT parsing

In compositional parsing fashion — which definitely isn’t high fashion — let’s start with the most involved piece, the timecode line, and then glue it together with the remaining bits.

The timecode format string is hh:mm:ss,SSS --> hh:mm:ss,SSS. Or, in plain English: zero-padded, two-digit hours, minutes, and seconds components separated by colons, then a comma, a zero-padded, three-digit milliseconds component, the --> separator, and the same timecode format repeated once more.

Let’s start with the zero-padded two- and three-digit numbers (flipping isSigned to false disallows leading - or + signs).

(Gist permalink.)

And now, splitting them across literal parsers for the : and , separators and repeating the full timecode on either side of the --> arrow.

(Gist permalink.)

Ah! Almost forgot to map on timecodeParser to squash (Int, Int, Int, Int) down to a TimeInterval value we can roll into a SubtitleGroup type that we’ll introduce soon.

(Gist permalink.)

Much better.
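The gists use the fluent take/skip-style API from the swift-parsing release this post was written against. As a rough, non-authoritative equivalent in the library’s later result-builder style (so the spelling below is an assumption, not a copy of the gists), the timecode and timing-line parsers come out to something like:

```swift
import Foundation
import Parsing

// A single hh:mm:ss,SSS timecode, squashed down to a TimeInterval.
struct Timecode: Parser {
  var body: some Parser<Substring, TimeInterval> {
    Parse {
      Digits(2) // Zero-padded hours…
      ":"
      Digits(2) // …minutes…
      ":"
      Digits(2) // …seconds…
      ","
      Digits(3) // …and milliseconds (like isSigned: false, Digits rejects leading signs).
    }
    .map { hours, minutes, seconds, milliseconds in
      TimeInterval(hours) * 3_600
        + TimeInterval(minutes) * 60
        + TimeInterval(seconds)
        + TimeInterval(milliseconds) / 1_000
    }
  }
}

// The full `hh:mm:ss,SSS --> hh:mm:ss,SSS` line as a (start, end) pair.
struct TimingLine: Parser {
  var body: some Parser<Substring, (TimeInterval, TimeInterval)> {
    Parse {
      Timecode()
      " --> "
      Timecode()
    }
  }
}

// try TimingLine().parse("00:00:32,189 --> 00:00:32,490")
// // => (32.189, 32.49)
```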

Now we can zoom back out and start parsing one subtitle group at a time.

(Gist permalink.)

Oof. Newline is constrained to UTF8 code units, so we’ll need to tee up our parsers to work with that constraint (note the added .utf8s in the snippet below).

(Gist permalink.)

Onto the last two parts of each subtitle group: the caption itself and the double-newline separator. To package each group, let’s introduce a SubtitleGroup struct.

(Gist permalink.)
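As a guess at the shape of that type (the field names here are mine, not necessarily the gist’s):

```swift
import Foundation

struct SubtitleGroup: Equatable {
  let sequenceNumber: Int
  let startTimecode: TimeInterval
  let endTimecode: TimeInterval
  // "The subtitle text on one or more lines."
  let lines: [String]
}
```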

Woot woot. We can finally repeat srtGroupParser with Many’s help and check that we’ve consumed all input with a trailing .skip(End()).

(Gist permalink.)
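Continuing the result-builder sketch from above (again, an approximation under assumed swift-parsing APIs rather than a copy of the gists, and building on the Timecode, TimingLine, and SubtitleGroup sketches), the group and whole-file parsers might read as:

```swift
import Parsing

// One numbered group: sequence number, timing line, then one or more
// caption lines, all newline-separated.
struct SRTGroup: Parser {
  var body: some Parser<Substring, SubtitleGroup> {
    Parse {
      Int.parser()
      "\n"
      TimingLine()
      "\n"
      Many(1...) {
        // A caption line: at least one character up to the next newline.
        Prefix(1...) { $0 != "\n" }.map { String($0) }
      } separator: {
        "\n"
      }
    }
    .map { sequenceNumber, timecodes, lines in
      SubtitleGroup(
        sequenceNumber: sequenceNumber,
        startTimecode: timecodes.0,
        endTimecode: timecodes.1,
        lines: lines
      )
    }
  }
}

// The whole file: groups separated by blank lines, with End() making sure
// we've consumed all input.
struct SRTDocument: Parser {
  var body: some Parser<Substring, [SubtitleGroup]> {
    Parse {
      Many(1...) {
        SRTGroup()
      } separator: {
        "\n\n"
      }
      End()
    }
  }
}

// try SRTDocument().parse(srtString)
```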

We could stop here after checking that it parses some sample SRT-formatted strings. But let’s add a few validation checks to make sure sequence numbers count up from 1, that startTimecodes are strictly less than endTimecodes within each group (i.e. a caption group must have a positive presentation time), and that adjacent groups have nondecreasing timecodes (a subsequent group shouldn’t overlap with — or come before — another).

(Gist permalink.)
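Here’s a sketch of that validation pass against the SubtitleGroup shape from earlier; the error cases and names are mine:

```swift
enum SRTValidationError: Error {
  case nonSequentialNumbering
  case nonPositiveDuration
  case overlappingGroups
}

func validate(_ groups: [SubtitleGroup]) throws {
  // Sequence numbers should count up from 1 without gaps.
  guard groups.enumerated().allSatisfy({ $0.offset + 1 == $0.element.sequenceNumber }) else {
    throw SRTValidationError.nonSequentialNumbering
  }

  // Each group needs a positive presentation time…
  guard groups.allSatisfy({ $0.startTimecode < $0.endTimecode }) else {
    throw SRTValidationError.nonPositiveDuration
  }

  // …and adjacent groups shouldn't overlap or run backwards.
  guard zip(groups, groups.dropFirst()).allSatisfy({ previous, next in
    previous.endTimecode <= next.startTimecode
  }) else {
    throw SRTValidationError.overlappingGroups
  }
}
```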

I’ll leave writing tests as an exercise for the reader — but, if you’ve made it through these 1.2k+ words, please take a break before then! You’ve more than earned it.

Until next time.


  1. Maya Gold’s apology for the lack of accessibility in audio tweets was an incredibly honest and powerful example of how to admit responsibility that more of our industry should learn from.