Hi Jeff, are there any plans to support dual-channel audio recordings (e.g., Twilio phone call audio) for speech-to-text models? Currently, we have to either process each channel separately and lose conversational context, or merge the channels and lose speaker identification. In more detail:
1. Merge both channels into one (this is what Whisper does with dual-channel recordings), then map transcription timestamps back to the original channels. This works only when speakers don't talk over each other, which is often not the case (a rough sketch of this follows the list).
2. Transcribe each channel separately, then merge the transcripts by timestamp (also sketched below). This preserves exact channel identification but discards valuable conversational context that improves the model's accuracy (e.g., Speaker A's question often disambiguates Speaker B's otherwise unintelligible answer).
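For reference, here is a minimal sketch of workaround 1 using the open-source `openai-whisper` and `soundfile` packages. The file names, model size, and the energy-based channel attribution are my assumptions, not anything Whisper provides, and the heuristic fails exactly where noted above: on overlapping speech.

```python
# Workaround 1: downmix to mono, transcribe once, then attribute each
# segment to whichever channel carries more energy in that time window.
# NOTE: "call.wav", the model size, and the RMS heuristic are assumptions.
import numpy as np
import soundfile as sf
import whisper

def transcribe_downmixed(path: str, model_name: str = "base") -> list[dict]:
    audio, sr = sf.read(path)                 # stereo -> shape (frames, 2)
    assert audio.ndim == 2 and audio.shape[1] == 2, "expected a stereo file"
    sf.write("mono_mix.wav", audio.mean(axis=1), sr)

    result = whisper.load_model(model_name).transcribe("mono_mix.wav")

    segments = []
    for seg in result["segments"]:
        lo, hi = int(seg["start"] * sr), int(seg["end"] * sr)
        rms = np.sqrt((audio[lo:hi] ** 2).mean(axis=0))   # per-channel RMS
        segments.append({
            "speaker": "A" if rms[0] >= rms[1] else "B",  # louder channel wins
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"].strip(),
        })
    return segments
```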
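And a sketch of workaround 2 under the same assumptions. Channel identity is exact here, but each transcription pass sees only one side of the conversation:

```python
# Workaround 2: transcribe each channel on its own, then interleave the
# segments by start time. Whisper never sees the other speaker's audio.
import soundfile as sf
import whisper

def transcribe_per_channel(path: str, model_name: str = "base") -> list[dict]:
    audio, sr = sf.read(path)                 # stereo -> shape (frames, 2)
    assert audio.ndim == 2 and audio.shape[1] == 2, "expected a stereo file"

    model = whisper.load_model(model_name)
    merged = []
    for ch, label in enumerate(("A", "B")):
        ch_path = f"channel_{label}.wav"      # temp mono file per channel
        sf.write(ch_path, audio[:, ch], sr)
        for seg in model.transcribe(ch_path)["segments"]:
            merged.append({
                "speaker": label,
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"].strip(),
            })

    # Sorting by timestamp approximates turn order; overlapping speech
    # simply yields overlapping segments, which is acceptable here.
    return sorted(merged, key=lambda s: s["start"])
```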
So yes, there are two technically trivial workarounds, but you either get somewhat inaccurate channel identification (option 1) or degraded transcription quality (option 2). A better solution would be a model trained to accept an additional token indicating the channel ID, preserving it in the output while still benefiting from the context of both channels; a hypothetical sketch follows.
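To make the request concrete, a purely hypothetical interface. None of these parameters or fields exist in any current API; they only illustrate what the feature would look like from the caller's side:

```python
# Purely hypothetical: "channel_aware" and the "channel" field are invented
# names to illustrate the feature request, not part of any real library.
result = model.transcribe(
    "twilio_call.wav",
    channel_aware=True,   # hypothetical flag: decode both channels jointly
)
for seg in result["segments"]:
    # "channel" would be carried through from the input, not inferred
    print(f'[ch {seg["channel"]}] {seg["start"]:5.1f}s  {seg["text"]}')
```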