Hi Jeff, are there any plans to support dual-channel audio recordings (e.g., Twilio phone call audio) for speech-to-text models? Currently, we have to either process each channel separately and lose conversational context, or merge the channels and lose speaker identification. In more detail:
1. Merge both channels into one (this is what Whisper does with dual-channel recordings), then map transcription timestamps back to the original channels. This works only when speakers don't talk over each other, which is often not the case (a rough sketch of this follows the list).
2. Transcribe each channel separately, then merge the transcripts by timestamp (also sketched below). This preserves exact channel identification but discards valuable conversational context that improves the model's accuracy (e.g., Speaker A's question often disambiguates Speaker B's otherwise unintelligible answer).
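For reference, here is a minimal sketch of workaround 1 using the open-source `openai-whisper` and `soundfile` packages. The file names, model size, and the energy-based channel attribution are my assumptions, not anything Whisper provides, and the heuristic fails exactly where noted above: on overlapping speech.

```python
# Workaround 1: downmix to mono, transcribe once, then attribute each
# segment to whichever channel carries more energy in that time window.
# NOTE: "call.wav", the model size, and the RMS heuristic are assumptions.
import numpy as np
import soundfile as sf
import whisper

def transcribe_downmixed(path: str, model_name: str = "base") -> list[dict]:
    audio, sr = sf.read(path)                 # stereo -> shape (frames, 2)
    assert audio.ndim == 2 and audio.shape[1] == 2, "expected a stereo file"
    sf.write("mono_mix.wav", audio.mean(axis=1), sr)

    result = whisper.load_model(model_name).transcribe("mono_mix.wav")

    segments = []
    for seg in result["segments"]:
        lo, hi = int(seg["start"] * sr), int(seg["end"] * sr)
        rms = np.sqrt((audio[lo:hi] ** 2).mean(axis=0))   # per-channel RMS
        segments.append({
            "speaker": "A" if rms[0] >= rms[1] else "B",  # louder channel wins
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"].strip(),
        })
    return segments
```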
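And a sketch of workaround 2 under the same assumptions. Channel identity is exact here, but each transcription pass sees only one side of the conversation:

```python
# Workaround 2: transcribe each channel on its own, then interleave the
# segments by start time. Whisper never sees the other speaker's audio.
import soundfile as sf
import whisper

def transcribe_per_channel(path: str, model_name: str = "base") -> list[dict]:
    audio, sr = sf.read(path)                 # stereo -> shape (frames, 2)
    assert audio.ndim == 2 and audio.shape[1] == 2, "expected a stereo file"

    model = whisper.load_model(model_name)
    merged = []
    for ch, label in enumerate(("A", "B")):
        ch_path = f"channel_{label}.wav"      # temp mono file per channel
        sf.write(ch_path, audio[:, ch], sr)
        for seg in model.transcribe(ch_path)["segments"]:
            merged.append({
                "speaker": label,
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"].strip(),
            })

    # Sorting by timestamp approximates turn order; overlapping speech
    # simply yields overlapping segments, which is acceptable here.
    return sorted(merged, key=lambda s: s["start"])
```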
So yes, there are two technically trivial workarounds, but you either get somewhat inaccurate channel identification (option 1) or degraded transcription quality (option 2). A better solution would be a model trained to accept an additional token indicating the channel ID, preserving it in the output while still benefiting from the context of both channels; a hypothetical sketch follows.
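To make the request concrete, a purely hypothetical interface. None of these parameters or fields exist in any current API; they only illustrate what the feature would look like from the caller's side:

```python
# Purely hypothetical: "channel_aware" and the "channel" field are invented
# names to illustrate the feature request, not part of any real library.
result = model.transcribe(
    "twilio_call.wav",
    channel_aware=True,   # hypothetical flag: decode both channels jointly
)
for seg in result["segments"]:
    # "channel" would be carried through from the input, not inferred
    print(f'[ch {seg["channel"]}] {seg["start"]:5.1f}s  {seg["text"]}')
```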