
What is SWIFT, behind the scenes?


Probably a TCP socket and an old, well-established EDI format.


Does anyone use this? How does AssemblyAI compare to Google’s? We are considering adding speech recognition to a small part of our product.


Dylan from Assembly here. Most of our customers have actually switched over to us from Google - this Launch HN from a YC startup that uses our API goes into a bit more detail if you're interested:

https://news.ycombinator.com/item?id=26251322

My email is in my profile if you want to reach out to chat more!


We run tens of thousands of hours of audio through AssemblyAI each day. We did a boatload of benchmarking on manually transcribed audio when we decided to use them, and they were by far the best across the usual suspects (Amazon, etc.) and against smaller startups. They've only gotten better in the 2-3 years we've been using them.


I believe most people have already moved to offline engines; there's no need to send your data to some random third party like Assembly. There are a dozen options: Nemo Conformer from Nvidia, Robust Wav2Vec from Facebook, Vosk. And the cost is $0.01 per hour, not $0.89 per hour like here.

Another advantage is that you can do more custom things: add words to the vocabulary, identify speakers by voice biometrics, detect emotions.
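For a concrete sense of the offline route, here's a minimal sketch using Vosk (pip install vosk); the model and audio paths below are placeholders for a downloaded Vosk model directory and a 16 kHz mono WAV file:

    # Minimal offline speech-to-text with Vosk.
    # Assumes a 16 kHz mono PCM WAV file and a downloaded model directory;
    # both paths below are placeholders.
    import json
    import wave

    from vosk import KaldiRecognizer, Model

    model = Model("model")             # unpacked Vosk model directory
    wf = wave.open("audio.wav", "rb")  # 16 kHz mono PCM WAV

    rec = KaldiRecognizer(model, wf.getframerate())
    rec.SetWords(True)  # include per-word timestamps in results

    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            print(json.loads(rec.Result())["text"])

    print(json.loads(rec.FinalResult())["text"])

With models that support it, you can also pass a restricted vocabulary as a JSON list in a third argument to KaldiRecognizer, which is one way to do the custom-vocabulary trick mentioned above.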


Without talking about accuracy, any comparison is meaningless.


You don't even need to compare accuracy; you can just look at the technology. Facebook's model is trained on 256 GPUs, you can fine-tune it to your domain in a day or two, and the release was two months ago. There is no way a cloud startup with access to just four Titan cards can have something better in production.


Also curious, are there any 'independent' performance benchmarks in this space?


This is tricky. The de facto metric to evaluate an ASR model is Word Error Rate (WER). But results can vary widely depending on the pre-processing that's done (or not done) to transcription text before calculating a WER.

For example, if you compare "I live in New York" against "i live in new york", the WER would be 60%, because three of the five words differ only in capitalization.

This is why public WER results vary so widely.

We publish our own WER results and normalize both the human and the automatic transcripts as much as possible, to get close to "true" numbers. But in reality, we see a lot of people comparing ASR services simply by diffing transcripts.
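To make the normalization point concrete, here's a rough sketch of a word-level WER with and without a deliberately naive normalizer (lowercasing plus punctuation stripping; real comparisons also handle numerals, fillers, and so on):

    # Rough WER sketch: word-level edit distance over reference length.
    # The normalize() step is deliberately naive; production comparisons
    # also handle numerals ("10" vs "ten"), disfluencies, etc.
    import string

    def wer(ref: str, hyp: str) -> float:
        r, h = ref.split(), hyp.split()
        # Standard dynamic-programming edit distance over words.
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                cost = 0 if r[i - 1] == h[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(r)][len(h)] / len(r)

    def normalize(text: str) -> str:
        return text.lower().translate(str.maketrans("", "", string.punctuation))

    ref, hyp = "I live in New York", "i live in new york"
    print(wer(ref, hyp))                        # 0.6: three case-only "errors"
    print(wer(normalize(ref), normalize(hyp)))  # 0.0 after normalization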


I used both Google's speech-to-text APIs and Assembly's APIs, as well as some others, to build Twilio Voice phone-calling applications. The out-of-the-box accuracy was way better with Assembly, and it's far easier to quickly customize the language model for higher accuracy in specific domains (for example, programming-language keywords). Generally I avoid Google APIs whenever possible, since they always seem overly complicated to get started with and have incomplete documentation, even when I'm working in Python, which should be one of the better-supported languages.


I would strongly advise against using Google's ML APIs.

First, at my company Milk Video, we are huge fans of AssemblyAI. The quality, speed, and cost of their transcription are galaxies beyond the competition.

Having worked at machine-learning-focused companies for a few years, I have been researching this exact question: how to forecast the amount of ML talent I should expect to build into our team (we are a seed-stage company), and how much I can confidently outsource to best-in-class vendors.

A lot of the ML services we use now are utilities that we don't want to manage (speech-to-text, video content processing, etc.) but do want to see improve. We took a lot of time deciding whom to outsource these things to, like working with AssemblyAI, because we were very conscious of the pace of improvement in speech-to-text quality.

When we were comparing products, the most important questions were:

1. How accurate is the speech-to-text API?

1.a Word error rate

1.b Accuracy of word start/end timestamps

2. How fast does it process our content?

3. How much does it cost?

AssemblyAI was the only tool using modern web patterns (i.e., not Google's horrible API, or non-tech companies trying to provide transcription services), which made it easy to integrate in a single Sunday morning. The API is also surprisingly better than other speech-to-text services, because it's trained on the kind of audio/video content being produced today (instead of old call-center data or perfect audio from studio-grade media).

Google's API forced you to host your assets in GCP and handle tons of unnecessary configuration around auth, file access, and identity, and it's insanely slow and inaccurate. Some other transcription services we used were embarrassingly bad from a developer-experience perspective, in that they required you to actually talk to a person before giving you access.

The reason Assembly is so great is that you can literally make an API request with a media file URL (video or audio), and boom, you get back a nice, intuitive JSON-formatted transcript. You can also add params to get speaker labels, topic analysis, and personal-information detection; it's just a matter of changing the payload of the first API request.
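For illustration, here's a minimal sketch of that submit-then-poll flow against the v2 REST endpoints; the API key and media URL are placeholders, and you should check the current docs for exact parameter names:

    # Sketch of AssemblyAI's submit-then-poll flow (v2 REST endpoints).
    # The API key and media URL are placeholders.
    import time

    import requests

    HEADERS = {"authorization": "your-api-key"}  # placeholder key

    # 1. Submit a publicly reachable media URL (video or audio).
    job = requests.post(
        "https://api.assemblyai.com/v2/transcript",
        headers=HEADERS,
        json={
            "audio_url": "https://example.com/interview.mp4",  # placeholder
            "speaker_labels": True,  # optional: diarized utterances
        },
    ).json()

    # 2. Poll until the transcript is ready.
    while True:
        result = requests.get(
            "https://api.assemblyai.com/v2/transcript/" + job["id"],
            headers=HEADERS,
        ).json()
        if result["status"] in ("completed", "error"):
            break
        time.sleep(5)

    print(result["text"])  # word timings, speakers, etc. in the same JSON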

I'm very passionate about this because I spent so much time fighting previously implemented transcript services, and want to help anyone avoid the pain because Assembly really does it correctly.


How good is their speaker labeling? We've been using the Google API but their diarization has been basically unusable for our application (transcripts of group conversations).


Dylan from Assembly here. If you want to send me one of your audio files (my email is in my profile) I'd be happy to send you back the diarized results from our API.

You can also sign up for a free account and test from the dashboard without having to write any code, if that's easier.

Other than lots of crosstalk in your group conversations, is there anything else challenging about your audio (e.g., distance from microphones, background noise)?


We use AssemblyAI at our YC startup https://pickleai.com for our transcripts, and deploy our own sentiment and summary models to help users take more efficient notes on Zoom calls! Super happy with them!


Maybe relevant in context: you can now use Siri's offline transcription inside your apps, for free.


This doesn’t answer the question at all, but huggingface also has some decent ASR models available.


Huggingface ASR models are not really recommended. The simple fact that they don't use a beam-search decoder with a language model makes them much less accurate for practical applications. If you compare them to setups like Nemo + pyctcdecode, they will be 30% less accurate.

Also, most of the models there are undertrained.
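For context, here's a rough sketch of what a beam decoder with an LM looks like in practice, using pyctcdecode over per-frame CTC log-probabilities. The alphabet and the KenLM file are placeholders; any CTC acoustic model (Nemo, Wav2Vec, ...) that emits logits over the same alphabet plugs in the same way:

    # Beam-search CTC decoding with an n-gram LM via pyctcdecode,
    # compared to plain greedy decoding. Alphabet and LM path are placeholders.
    import numpy as np
    from pyctcdecode import build_ctcdecoder

    labels = [""] + list(" abcdefghijklmnopqrstuvwxyz'")  # "" is the CTC blank

    decoder = build_ctcdecoder(
        labels,
        kenlm_model_path="domain.arpa",  # placeholder n-gram LM
    )

    # Stand-in for per-frame log-probabilities from your acoustic model,
    # shape (time_steps, vocab_size).
    raw = np.random.randn(200, len(labels)).astype(np.float32)
    logp = raw - np.log(np.exp(raw).sum(axis=1, keepdims=True))  # log-softmax

    # Greedy decoding: collapse repeated frames, drop blanks; no LM involved.
    ids = logp.argmax(axis=1).tolist()
    greedy = "".join(
        labels[i]
        for n, i in enumerate(ids)
        if (n == 0 or i != ids[n - 1]) and labels[i] != ""
    )

    # Beam search rescored with the LM; on real logits this is where
    # the accuracy gap the parent comment describes shows up.
    beam = decoder.decode(logp)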


Perhaps the most notable fact is how upvoted this article is. Shows you how much people want to believe, maybe?

