- Accurately interpreting speech remains challenging because meaning and context shift constantly in ordinary conversation.
- Entity formatting and number interpretation are notoriously difficult: people might say 'oh' instead of 'zero,' or 'triple three' instead of '3-3-3.'
Speechmatics was founded in 2006 by Dr. Tony Robinson, a pioneer in applying recurrent and deep neural networks to automatic speech recognition (ASR), and is based in Cambridge, UK. The privately held company employs approximately 250 people. In 2012, it began offering its ASR software to enterprise customers on a usage-based revenue model.
ASR helps businesses reduce workflow friction and eliminate time-consuming manual processes. ASR systems are currently deployed in a variety of settings, including voice bots, contact centers, financial institutions, and subtitling for film and television. To transform speech into readable text, ASR systems rely on three levels of processing. The first, the signal level, extracts critical features from the audible portion of the speech stream and removes extraneous noise from the file. The acoustic level maps those features to acoustic units, such as phoneme states. Finally, at the language level, the ASR system assembles the recognized units into meaningful sentences.
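The three levels described above can be sketched as a toy pipeline. This is purely illustrative: the function names and bodies are placeholders chosen for this article, not any real recognizer's internals.

```python
# Illustrative sketch of the three ASR processing levels.
# Every function body here is a toy stand-in for a trained model.

def signal_level(samples):
    """Slice audio into frames and drop near-silent (noise-floor) frames."""
    frames = [samples[i:i + 4] for i in range(0, len(samples), 4)]
    # Keep only frames whose energy clears a simple noise threshold.
    return [f for f in frames if sum(x * x for x in f) > 0.01]

def acoustic_level(features):
    """Map each feature frame to its most likely acoustic unit."""
    # Toy lookup; a real system runs a trained acoustic model here.
    return ["AH" if sum(f) > 0 else "S" for f in features]

def language_level(units):
    """Assemble the recognized acoustic units into readable output."""
    return " ".join(units)

def transcribe(samples):
    return language_level(acoustic_level(signal_level(samples)))
```

The point of the sketch is the staging: each level consumes the previous level's output, so noise removed at the signal level never reaches the acoustic or language models.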
The use of AI and network architectures such as deep neural networks and recurrent neural networks has proven helpful in advancing the performance and accuracy of ASR systems. Deep learning is especially effective for ASR, but building such models requires large amounts of labeled data to teach a machine the subtleties of human speech. Unsupervised learning algorithms promise to reduce the time needed to train ASR systems and eliminate the need to acquire or create tagged data. As with similar techniques in natural language processing, unsupervised learning should reduce cost and allow ASR systems to support more languages.
The addition of entity formatting relieves companies of a persistent problem: the interpretation of numbers in ASR transcripts. To solve it, Speechmatics uses inverse text normalization, a process that converts the spoken form of the output to the corresponding written form. This covers simple numbers, currencies, addresses, email addresses, dates, times, and Uniform Resource Identifiers (URIs). Recurrent neural networks have helped improve accuracy here, but the error rate has not been low enough for production systems. Some companies, such as Apple, have employed a memory-augmented neural network architecture. Unfortunately, this approach relies on manually curated rules and is neither scalable nor economically feasible, since native speakers are required to curate the transformation rules. The problem has also stymied Apple's engineers, who note that label assignments can produce correct number formats. However, an Apple post commented that "a few phenomena require additional post-processing," including time expressions, financial symbols, and amounts; this post-processing disrupts the workflow and can delay delivery of the final transcript to customers.
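To make inverse text normalization concrete, here is a minimal rule-based sketch that handles the spoken-digit patterns mentioned earlier ('oh' for zero, 'triple three' for 333). It is an assumption-laden toy, not Speechmatics' actual, unpublished method, and real ITN systems cover far more phenomena (currencies, dates, addresses).

```python
# Toy inverse text normalization (ITN) for spoken digit strings.
# Illustrative only; not Speechmatics' approach.

MULTIPLIERS = {"double": 2, "triple": 3}
DIGITS = {"oh": "0", "zero": "0", "one": "1", "two": "2", "three": "3",
          "four": "4", "five": "5", "six": "6", "seven": "7",
          "eight": "8", "nine": "9"}

def itn_digits(spoken: str) -> str:
    """Convert a spoken digit sequence, e.g. 'triple three oh five' -> '33305'."""
    out, repeat = [], 1
    for token in spoken.lower().split():
        if token in MULTIPLIERS:
            # Remember the multiplier and apply it to the next digit.
            repeat = MULTIPLIERS[token]
        elif token in DIGITS:
            out.append(DIGITS[token] * repeat)
            repeat = 1
        else:
            # Unknown token: pass it through unchanged.
            out.append(token)
            repeat = 1
    return "".join(out)
```

Even this tiny example shows why hand-curated rules do not scale: every language, and every regional convention within a language, needs its own token tables and combination logic.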
Exactly how Speechmatics ASR has achieved its breakthrough in number recognition where others have failed is unclear. Depending on formatting conventions pre-selected by the customer, numbers can be rendered in the transcript either in written form or as spoken. Some clues are likely to surface once Speechmatics files for patent protection on the functionality. Any improvement that reduces transcript errors and human intervention can shorten production times and increase customer satisfaction. Even if Speechmatics is only 95% effective, the reduction in time spent manually correcting errors will increase the adoption of ASR and help companies become more productive.