Having read the docs (I used ChatGPT to summarize them), I see no mention of speaker diarization for these models.
This is _very_ low-hanging fruit that anyone with a couple of DGX H100 servers can solve in a month, and it is a real-world problem that needs solving.
Right now _no_ tools on the market - paid or otherwise - can solve this with better than 60% accuracy. One killer feature for decision makers is the ability to chat with meetings to figure out who promised what, when and why. Without speaker diarization this only reliably works for remote meetings where you assume each audio stream is a separate person.
In short: please give us a diarization model. It's not that hard - I've done one for a board of 5, with a 4090 over a weekend.
> This is _very_ low-hanging fruit that anyone with a couple of DGX H100 servers can solve in a month, and it is a real-world problem that needs solving.
I am not convinced it is low-hanging fruit. It's something that is super easy for humans but not trivial for machines, though you are right that it is being neglected by many. I work for speechmatics.com and we have spent a significant amount of effort on it over the years. We now believe we have the world's best real-time speaker diarization system. You should give it a try.
After throwing an average meeting at your system as an mp3: yes, you have diarization solved much better than everyone else I've tried, by far. I'd say you're 95% of the way to being good enough to become the backbone of monolingual corporate meeting transcription, and I'll be buying API tokens the next time I need to do this instead of training a custom model. Your transcription, however, isn't that great - but it's good enough for LLMs to figure out the minutes of the meeting.
That said, the trick to extracting voices is to work in frequency space. Not sure what your model does, but my homemade version first ran all the audio through an FFT. From there it essentially became a vision problem: find speech patterns that match in pitch, then output extremely fine-grained timestamps for where they were found. Some Python glue then threw those segments into an open source Whisper speech-to-text model.
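To make the frequency-space idea concrete, here is a rough sketch of the first two steps (FFT, then timestamps grouped by pitch). It is not my actual model: the function names, the frame sizes, and the single `boundary_hz` threshold are all assumptions for illustration, and it only separates two sources whose dominant frequencies sit on opposite sides of that threshold. Real voices overlap in pitch, which is why the real thing needed a learned pattern matcher instead of a threshold.

```python
import numpy as np

def stft_magnitude(signal, sample_rate, frame_len=1024, hop=512):
    """Hann-windowed short-time FFT: returns (magnitudes, freqs, frame start times)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    mags = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    times = np.arange(n_frames) * hop / sample_rate
    return mags, freqs, times

def pitch_segments(signal, sample_rate, boundary_hz, frame_len=1024, hop=512):
    """Label each frame 0/1 by dominant frequency vs. boundary_hz (a toy
    stand-in for a learned speaker model) and merge equal-label runs into
    (start_seconds, end_seconds, label) tuples."""
    mags, freqs, times = stft_magnitude(signal, sample_rate, frame_len, hop)
    dominant = freqs[np.argmax(mags, axis=1)]
    labels = (dominant >= boundary_hz).astype(int)
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the audio or on a label change.
        if i == len(labels) or labels[i] != labels[start]:
            end_time = times[i - 1] + frame_len / sample_rate
            segments.append((float(times[start]), float(end_time), int(labels[start])))
            start = i
    return segments
```

Each `(start, end, label)` tuple is what the glue code would use to slice the audio before handing the slices to a transcription model.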