Otter CEO and Silicon Valley-based serial entrepreneur Sam Liang created a cloud-based artificial intelligence engine to power the company's automated speech-to-text transcription service.
In my 25-plus-year career—and counting!—as a journalist, I’ve done thousands of interviews and attended even more meetings, either face to face or over the phone. For many of them, I had to manually transcribe the conversations to make sure I got my interlocutors’ comments right, a chore I dreaded.
So when my friend Marie Domingo introduced me to Otter.ai at the TechCrunch Disrupt conference in San Francisco last fall—she was handling their public relations at that time—I was, of course, curious to know more about how this mobile app works but highly skeptical that it would actually help me.
How wrong I was! After more than six months of using the free version of the service, which includes 600 minutes of transcription per month, Otter.ai has changed my life. In a later post, I will publish a more in-depth technical review of Otter’s paid “Premium” version, which costs just $10 a month for 6,000 minutes of transcription.
Not only does the app (for iOS and Android) do an excellent job of transcribing my live interviews and meetings from speech to text with great accuracy—letting me focus on the actual conversations rather than taking verbatim notes—but it also makes the notes fully searchable, an amazing time saver when I’m looking for specific keywords.
I recently sat down with Otter’s CEO and co-founder Sam Liang in Palo Alto to talk about his entrepreneurial journey, from his humble beginnings as a computer science student at Peking University, to his years as a software engineer (at Cisco and Google, among others), to finally starting the 30-person startup, which has raised $13 million since it was founded more than three years ago.
Naturally, I used Otter for the live transcription of our conversation. This exclusive, nearly two-hour-long interview has been edited for clarity and length.
Otter can transcribe meeting conversations in real time and share the notes within a team.
Jean Baptiste Su: How did you get the idea of starting Otter?
Sam Liang: There are billions of people in the world, and everybody talks several hours every day. Voice is such a pervasive way for people to communicate. However, I just keep forgetting things. I have so many meetings back to back every day; by the end of my third meeting, I’ve already forgotten maybe 70% of what happened in the first one. So I myself need an easy way to take notes. And not only to take notes but to actually search them, because a month later, if I want to recall what you told me about Notre Dame Cathedral, I vaguely remember something but I don’t remember the details. At that moment, I could be at a Starbucks, so I may not have my notebook with me, but I really need that information instantly. With an app like this, I can just get it and search easily. That’s one thing.

Also, we have a small team of a few dozen people, but we have tons of meetings going on: the product meeting, the Android meeting, the marketing meeting, business development, sales, customer service, etc. We have all that information, but how do you share it with each other? We’re having this meeting with you today, but somebody in our organization may be interested in some of this conversation. Of course, I can tell him or her verbally, but whatever I say will be somewhat processed. If they want the original conversation, it’s really hard. With Otter, we can share it easily, which is really important for enterprises.

You’ve probably seen the type of study that shows that in enterprises, people spend at least 30%, and sometimes 40% or 50%, of their time in all kinds of meetings, whether in-person meetings, phone calls or video conferences. If you think about it, if you pay an employee $100,000 a year, they actually spend $30,000 or $40,000 of that just attending meetings. That’s how expensive meetings are. But then, how much information is actually being discussed or generated?
Or how much return are you getting out of that investment? Right? You pay them $40,000 to attend meetings, so what is the return on that? We’re not building a transcription company; Otter is a collaboration company. This is in the same domain as Zoom, Slack or the storage companies (Box or Dropbox), because the information is also stored in the cloud and is very easy to access, both for yourself and for your team members.
JBS: Talk about your entrepreneurial journey.
SL: I studied computer science at Peking University, and it was my dream to study at Stanford. I applied and was rejected several times, but the third time, they finally accepted me into the Ph.D. program with professor David Cheriton, and I did my thesis on large-scale distributed systems. Then I worked at Cisco for a little while and joined Google in 2006 to work on a metro Wi-Fi system, putting Wi-Fi routers on light poles in the street. While working on that, I realized that when I connected my laptop to the Wi-Fi router on the light pole, we could actually estimate its location. That was the beginning of my location work: I started the Google location service platform. Later, they realized this was critical for mobile maps, so we integrated the two. That’s why in 2007, when Steve Jobs was launching the first iPhone, one critical app he wanted to provide was a map service. We actually helped them put the map on the first iPhone; at that time, Google and Apple were still good friends because Android hadn’t been released yet.

But I always wanted to do something crazy, so I quit Google in 2010 to start my first startup: a mobile startup related to location that automatically tracks location and detects all the places a user visits without manual check-ins. That company went pretty well, and it was later acquired by Alibaba. The whole team was based here in Palo Alto, and I worked there a couple more years until I found something even crazier to do. Then I realized that I had this big pain point remembering meetings, and also in how to share meeting information with others. That’s the origin of Otter. Because I have an engineering background and I like to build things from the ground up, I built a team that created the speech recognition engine. We are not using the Google API or the Microsoft API; this is entirely our own technology.
It turned out that this actually works better than the Google API; we beat Microsoft by a big margin in terms of accuracy, especially in very noisy environments. And when you have a meeting like this, the speech recognition task is actually more difficult than for Alexa, Cortana or Siri. The reason is this: when you talk to Siri or Alexa, you say, “Hey Alexa, what’s the weather tomorrow?” The whole sentence is very short, only a few seconds, and it’s a very short question or command. We call that a chatbot system. Our system is much more challenging because it is trying to listen to a conversation between multiple people; it could be three, five or 10 in an enterprise meeting. It’s a lot more complicated, because whenever you have multiple people speaking, they interrupt each other, they have different accents, they have different paces, and their distances to the microphone are different, so somebody’s voice is louder and somebody else’s softer. We had to build the system from the ground up to handle all this complexity. It is a lot more complicated than handling an Alexa question.
JBS: Did you use some open source technologies to build your engine from scratch?
SL: We used some academic systems, but there was a lot more work to do, because projects from academia are not usable in a real product as-is. For example, we use millions of hours of recordings to train our deep learning algorithms.
JBS: How is artificial intelligence (AI) used in Otter?
SL: The speech recognition itself involves a lot of AI and machine learning to separate out the noise—we actually inject a lot of noise into the training data to make it harder, so that the model can learn how to handle it. Accents are a very complicated issue to solve as well: even in the U.S., people from Texas, Boston and California talk differently. So how do you know they’re saying the same words? Humans have been trained to understand those differences after 20 or 30 years of training, right? That’s a long time to train your brain. But still, the first time you hear somebody from Ireland or Scotland, you’ll have trouble understanding them, right? Maybe after a few hours, you get better. But how do you teach a machine to do that? There’s a huge amount of AI there.

Also, many words have similar pronunciations. When you first hear a pronunciation, how do you know which word it is? That’s where context comes into play: the words before it, the words after it, and even the words spoken 10 minutes ago can help. So there’s a lot of AI there to incorporate that semantic knowledge into the system. That’s part of natural language processing (NLP): partially understanding the meaning of the words, so you know which words usually go together and which words rarely go together.

In the past, before the current generation of AI, systems were rules-based: if this happens, do that. You may have a hundred or a few thousand rules, but the problem with that approach is that it doesn’t scale, because the number of variations is just too big for you to write a rule for every single scenario. That’s where machine learning comes into play: by using a huge amount of data to train the model, in a very large neural network with many nodes, you create a huge matrix from thousands of different variations. It’s a totally different approach.
Machine learning works the same way humans learn: when you grow up, you learn a language; nobody tells you the rules, but you just sort of emulate how other people speak. And when you say something, other people respond, so you get reinforcement: okay, you’re saying the right thing. Machine learning tries to emulate that type of learning. And even for humans, we may not be able to explain very clearly how we do it. That’s one issue with machine learning, too: sometimes there’s no clear explanation of why it works. Somehow it just learns from past data.
JBS: Do you think that machines will surpass humans?
SL: I think it’s just a matter of time before machines are smarter than humans. The number of brain cells a human has is huge, but it is limited. A machine, once you have the internet, can take advantage of all the machines on the internet, and of all the knowledge on the internet, and then make decisions based on that knowledge. No single human being would be able to do that. Right now, the hard thing for machines is to better understand context. That’s still hard: when you say the same words with different tones, they can mean different things, and that’s a level of subtlety machines cannot fully understand yet. But I think it’s a matter of time, and I’m really optimistic about this.
JBS: What has been Otter’s traction so far?
SL: So far, we’ve transcribed almost 10 million meetings—uploaded to our platform or live—for a total of over 250 million minutes. Our technology is also integrated into other services, such as Zoom’s cloud-based video conferencing platform. Two years ago, Zoom was searching for a solution to transcribe their meetings; they looked at all the usual suspects you can think of to provide a voice API, found that we worked better than the other guys, and decided to license our technology.
JBS: What’s the history behind the name of your company, Otter?
SL: When we were looking for a name for our company, we considered a lot of different choices. We thought about trees, like the sequoia or the redwood; we looked at fruits (the apple is the most famous fruit in the world); and we thought about animals that are friendly, adorable and smart, and finally settled on the otter. Many people don’t know this, but otters are among the smartest animals in the world. Dolphins and elephants are also smart, but otters have a very high IQ: they can be taught to play games—if you look on YouTube, you can find videos of otters playing basketball in the water! Otters can learn to use rocks as tools to open clams. They also have really good memories: if you teach them something, they will still remember it many years later. And they are very friendly: there are lots of pictures of otters holding hands while floating in the water, which also fits our narrative, as we see Otter.ai as a collaboration system.
JBS: How do you see the future of Otter?
SL: We don’t see Otter as a boring transcription service but as a collaboration system. Transcription is the first step toward better collaboration; the next step is more and more natural language processing and natural language understanding (NLU). We’re working on new algorithms for summarization as well, so that after a long meeting, Otter can intelligently detect the important sentences, including action items that can be extracted and automatically put into your calendar, turned into reminders, or integrated with project management systems like Jira. And if you look at Otter, it’s actually similar to Slack, except that we’re focused on voice communication rather than text. In Slack, you have a concept called channels; similarly, in Otter, you can create groups (a product group, engineering, marketing…), and all the meetings and conversations happening in that group are automatically shared with everybody, both the voice and the transcript. So you can make sure everybody’s on the same page, even if they missed the meetings.