Scientists believe that a language that fails to spread in the electronic medium is bound to become obsolete before long. Tilde IT, a language technology company, was the first in Lithuania to undertake software localization and to develop advanced speech recognition and synthesis technology. Its leader, Renata Špukienė, emphasizes that since we are Lithuanians, technology must also speak to us in Lithuanian.
– Among other projects, Tilde IT develops Lithuanian language recognition and synthesis. What kind of technology is it and what principle is it based on?
– Tilde IT has been working with language technology for a long time, and we started developing this speech recognition and synthesis technology as early as 2019. We took part in projects organized by the Lithuanian Business Support Agency, which finances intellectual projects, and developed the technology as a result. The project itself lasted 24 months, and various studies and experiments were carried out, because it was a completely new technology at the time.
Its accuracy is more than 80%, which matches the recognition accuracy and quality of all other major languages, so Lithuanian is keeping up with the recognition and synthesis technologies of other world languages.
The technology is based on deep neural networks: it recognizes natural spoken language and converts it into written text, which requires collecting a large number of samples of spoken language. Men’s and women’s voices of all ages are gathered, the resulting speech corpora are processed and prepared for the neural network model, and the networks themselves then work in a way similar to the human brain: they recognize sounds, assemble them, and turn them into text.
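The pipeline described above, audio samples in, a trained network scoring them, text out, can be sketched in miniature as follows. Everything here is a toy illustration of the principle: the frame size, features, weights, and symbol set are invented for the example and have nothing to do with Tilde IT's actual models.

```python
# Toy speech-to-text sketch: frame the raw audio, extract simple features
# per frame, score them with a tiny linear "network", and map each frame's
# best-scoring class to a symbol. Illustrative only, not a real ASR system.

def frame_signal(samples, frame_len=4):
    """Split raw samples into fixed-length frames (last partial frame dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def features(frame):
    """Toy per-frame features: a bias term, mean amplitude, and energy."""
    mean = sum(frame) / len(frame)
    energy = sum(x * x for x in frame) / len(frame)
    return [1.0, mean, energy]

def classify(feats, weights):
    """Linear layer + argmax: pick the class with the highest score."""
    scores = [sum(w * f for w, f in zip(row, feats)) for row in weights]
    return scores.index(max(scores))

def recognize(samples, weights, symbols):
    """Full pipeline: audio samples -> frames -> features -> symbols."""
    return "".join(symbols[classify(features(fr), weights)]
                   for fr in frame_signal(samples))

# Toy "trained" weights for two classes: silence ('_') vs. a voiced sound ('a').
WEIGHTS = [[1.0, 0.0, -1.0],   # class 0 ('_'): fires on low-energy frames
           [0.0, 0.0,  1.0]]   # class 1 ('a'): fires on high-energy frames
SYMBOLS = "_a"

audio = [0.1, 0.1, 0.1, 0.1, 2.0, 2.0, 2.0, 2.0]  # one quiet, one loud frame
print(recognize(audio, WEIGHTS, SYMBOLS))  # -> "_a"
```

In a real system the hand-set weights would be learned from the speech corpora the interview describes, and the features and network would be far richer, but the flow from audio frames to symbols is the same.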
Users are now free to connect and use this service. They can upload an audio file and get written text, or they can dictate and have the text transcribed on screen. During voice synthesis, meanwhile, the user can choose to have the written text read aloud in a male or a female voice.
– What are the links between your technology and the project “LIEPA” (Lithuanian Speech-Based Services) implemented by Vilnius University?
– The “LIEPA” and “LIEPA-2” projects create their own recognition and we create ours. At the time, in 2016–2017, when the technology was being developed, the difference was that we were developing technology that recognized and distinguished human speech in any environment. Whether you were in a noisy place, music was playing in the background, a car passed by, or someone was talking next to you, the technology was designed to pick out your speech by eliminating the sounds around you.
“LIEPA” was working with laboratory-quality sound at the time; later they, too, switched to noisy environments, and now their technology works successfully. Back then, however, we were the first whose technology was able to recognize a conversation and turn it into text in real time. “LIEPA” created a speech corpus within the framework of its project and collected 100 hours of speech. We used that corpus for our speech recognition technology; of course we also had our own, but we used theirs as well, because the more resources we have and the more diverse they are, the better the result.
– What challenges do you face in developing such technology?
– First of all, resources. Such a task requires huge volumes of speech corpora and audio recordings, which must be made up of voices of different ages, genders, and dialects, because the technology must be equally able to recognize a Russian speaker speaking Lithuanian or a Samogitian speaking standard Lithuanian.
The next challenge is noise. The technology must recognize speech not only in sterile, quiet environments but also in noisy ones, so we need to teach it to eliminate noise. For example, if someone speaks during a meeting and a pen is clicking in the background, the technology doesn’t care whether it’s a person speaking or a pen clicking: both are sounds, and the technology captures all of them. It must recognize that the click of a pen is not a sound that forms a word, a phrase, or a syllable, and eliminate it from the overall soundtrack.
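One simple way to see the pen-click problem is that speech energy is sustained across neighbouring frames, while a click is an isolated spike. The sketch below suppresses lone high-energy frames on that basis; it is a minimal illustration of the idea, not Tilde IT's actual noise-elimination method, and the threshold and energy values are invented for the example.

```python
# Minimal click-suppression sketch: zero out loud frames whose immediate
# neighbours are both quiet, on the assumption that speech is sustained
# while a pen click is an isolated spike. Illustrative only.

def suppress_clicks(frame_energies, threshold=1.0):
    """Return a copy of the energies with isolated loud spikes zeroed."""
    cleaned = list(frame_energies)
    for i, energy in enumerate(frame_energies):
        prev = frame_energies[i - 1] if i > 0 else 0.0
        nxt = frame_energies[i + 1] if i + 1 < len(frame_energies) else 0.0
        if energy > threshold and prev <= threshold and nxt <= threshold:
            cleaned[i] = 0.0  # lone spike: treat as a click, not speech
    return cleaned

# Sustained "speech" in frames 0-2 survives; the lone spike at frame 5
# (the pen click) is removed before recognition would run.
energies = [2.0, 3.0, 2.5, 0.2, 0.1, 5.0, 0.1]
print(suppress_clicks(energies))  # -> [2.0, 3.0, 2.5, 0.2, 0.1, 0.0, 0.1]
```

Real systems use far more sophisticated spectral and learned filters, but the underlying decision, "does this sound behave like speech over time?", is the same one the example encodes.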
Another challenge is the small market: we are a small nation with a small language and a limited number of users. We would love to go global with this technology, but the world doesn’t need the Lithuanian language much; it needs it only as much as there are Lithuanian speakers.
– Tilde IT currently offers Lithuanian language recognition only for desktop computer systems. Do you plan to apply such a service to smart/mobile devices as well?
– Our goal was not to create an app for mobile devices; we see our services in a slightly wider context. We aim to adapt them to other electronic services and integrate them into customer systems that can work with them, such as a chatbot. As it stands, you need to type your question to an automated assistant, while we are working on making sure that chatbots can communicate in voice rather than in text: they hear what is being said to them, recognize the speech, and then give a spoken answer, in other words, synthesize it.
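The voice-enabled chatbot loop described here (hear, recognize, answer, synthesize) can be sketched as three pluggable stages. All function names and the toy reply table below are hypothetical stand-ins, not Tilde IT's real API; in an integration, each stub would call the actual recognition, dialogue, and synthesis services.

```python
# Sketch of a voice chatbot turn: audio in, recognized text, a reply,
# and a synthesized spoken answer out. The three stages are stubs that
# a real integration would replace with actual services.

def recognize_speech(audio: bytes) -> str:
    """Stand-in for a speech-to-text service."""
    return audio.decode("utf-8")  # pretend the audio bytes are their transcript

def answer(question: str) -> str:
    """Stand-in for the chatbot's dialogue logic."""
    replies = {"labas": "Labas! Kuo galiu padėti?"}
    return replies.get(question.lower(), "Atsiprašau, nesupratau.")

def synthesize_speech(text: str, voice: str = "female") -> bytes:
    """Stand-in for a text-to-speech service (male or female voice)."""
    return f"[{voice}] {text}".encode("utf-8")

def voice_chatbot_turn(audio: bytes) -> bytes:
    """One full turn: spoken question in, spoken (synthesized) answer out."""
    return synthesize_speech(answer(recognize_speech(audio)))

print(voice_chatbot_turn(b"labas").decode("utf-8"))
```

The point of the structure is the one made in the interview: the chatbot itself stays text-based, and speech recognition and synthesis wrap around it as separate, integrable services.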
It’s true that we’ve developed the “Tildės Balsas” app, but it works more like a demonstration tool: with it we aim to show what our speech recognition technology is capable of and how it works. The app allows you to dictate texts and use a variety of commands. For example, you can dictate text messages, notes, and letters, arrange your schedule, and speak addresses into Google Maps or Waze; if you are driving and have a headset on, you can search for contacts by voice and have text messages read aloud. The app works well, so you can try it, and maybe it will become part of your normal routine, such as putting together a task or grocery list on your way to work.
– How often do people use the Lithuanian language recognition service online?
– Perhaps fewer people use the mobile app, but the speech recognition service itself is widely used by both businesses and individuals.
I would recommend it to journalists, who can upload a voice recording of an interview and get text. The technology is also used by media monitoring companies, which provide customers with analyses of what has been said about one company or another in the news and the press. Companies that produce subtitles use it as well. Another area of use is the recording of meetings: for example, when there is a meeting and no time to take the minutes, you can transcribe the recording into text and produce the minutes from it, which saves a lot of manual work.
– In terms of language recognition, does the Lithuanian language offer any advantages or is it rather a language that poses additional challenges?
– There are no advantages or disadvantages; each language is unique. A speech corpus is collected and the engines are trained, which is the standard procedure, as for any other language.
Why have we achieved such a high level of quality in speech recognition? Because all speech recognition technologies are based on the same principle: the more resources you have, the more accurate your recognition will be.
– Will the development of the Lithuanian language in cyberspace be in demand in the future, given the younger generation’s tendency to use English more often?
– I would say that the Lithuanian language will be in demand in the future as long as we speak Lithuanian. When we wake up, we think in Lithuanian first; we dream in Lithuanian, too, so it is natural for the Lithuanian language to exist in technology. Our goal is for Lithuanian speech recognition to appear on every device.
– Could the Lithuanian language attract the attention of such giants as Apple or Microsoft when it comes to speech recognition technology?
– To put it simply and briefly, the Lithuanian language will appear in Apple, Google, and other manufacturers’ products when they add Lithuanian language support. Speech recognition technology has enormous potential, and all the big players understand that.
They focus on large markets, with assistants such as Siri, Amazon Alexa, and Google Assistant speaking the major languages (English, Russian, German, French, Italian, etc.). Amazon Alexa currently speaks 8 languages and supports 10 additional dialects. A dialect here is similar to how we have the Aukštaitian (aukštaičių), Samogitian (žemaičių), and Dzūkian (dzūkų) dialects. They also have Australian English, British English, and American English, which are called dialects because of subtle differences in usage and pronunciation. Google Assistant currently has 12 languages and 13 dialects, while Siri has 21 languages and countless dialects.
So, it is not a matter of having our own language, it is a matter of when. All of this comes down to human and financial resources.
To develop technologies for smaller languages, you need people who can speak, or at least understand, the language. Adding new languages always comes down to resources: how much material you can collect to develop the technology and how much research you can do. Of course, large producers can always look for solutions that already exist on the market. In our case, they could work with us; perhaps that will happen someday. We’re ready.