Speech Services

Speech is the next step in language technology. Though speech technology already exists for the world’s larger languages – such as English, Spanish, and German – smaller languages are underrepresented. Tilde is currently working on building speech technology services for Europe’s smaller languages like Latvian, Lithuanian, and Estonian.

Latvian Automated Speech Recognition (ASR)
Tilde was the world’s first company to create ASR for Latvian. The ASR service is based on a huge database of spoken Latvian. Since completion, the service has been integrated into a mobile app that recognizes spoken numerals.
Latvian text-to-speech service
Speech Synthesis (Text-to-Speech, TTS) technology transforms the wording of an utterance into sounds that are output to the user.
Tilde has worked on Speech Synthesis technology development since the late 1990s. Technology for pronouncing Latvian words and texts is also included in our product Tildes Birojs, the market-leading proofreading software for Latvian.
Endpoint URL
The V1 API main endpoint URL is: https://runa.tilde.lv/v1
Available systems

The following speech recognition systems are available:

  • Offline recognition systems
    • LVASR – speech recognition for Latvian (default)
    • LTASR – speech recognition for Lithuanian
    • EEASR – speech recognition for Estonian
  • Online recognition systems
    • LVASR-ONLINE – speech recognition for Latvian (default)
    • LTASR-ONLINE – speech recognition for Lithuanian
    • EEASR-ONLINE – speech recognition for Estonian
    • ADDRESS-ONLINE – street address recognition for Latvian
    • ADDRESS-LT-ONLINE – street address recognition for Lithuanian

The following speech synthesis systems are available:

  • lt-regina – Lithuanian, Regina voice
  • lt-edvardas – Lithuanian, Edvardas voice
  • flite – Latvian, female voice (default)
  • flite-bern – Latvian, child voice

 

Audio Format Support
 
The following audio formats (containers) are supported:
 
  • AIFF (Audio Interchange File Format), file extensions: .aiff
  • WAV (Waveform Audio File Format), file extensions: .wav, .wave
  • 3GA (3GPP file format), file extensions: .3ga, .3gpp
  • ASF (Advanced Systems Format), file extensions: .asf, .wma
  • Matroska, file extensions: .mkv, .mka
  • MP4 (MPEG-4 Part 14), file extensions: .mp4, .m4a
  • Ogg, file extensions: .ogg, .oga

The following audio codecs are supported:

  • Uncompressed WAV
  • AAC (Advanced Audio Coding)
  • AMR (Adaptive Multi-Rate audio codec)
  • MP3 (MPEG Layer III Audio)
  • OGG Speex, Vorbis, Opus
  • FLAC (Free Lossless Audio Codec)
  • WMA (Windows Media Audio)

The following audio signal discretization parameters are supported:

  • 8kHz, 16kHz, 22kHz, 44.1kHz and 48kHz
  • 8/16-bit signed/unsigned integer, mu-law or a-law (for uncompressed formats)

Audio can be mono, stereo or multi-channel. However, only the first channel will be used for recognition.

 

Authentication

 

In order to use the API, one should first authenticate with the Speech service using the OpenID Connect 1.0 protocol[1] and obtain a JWT token.

The authorization and token endpoints can be obtained here:

https://testspeechcloud.b2clogin.com/testspeechcloud.onmicrosoft.com/v2.0/.well-known/openid-configuration?p=b2c_1_signupsignin1

When performing any API request, the obtained JWT token should be provided in the HTTP Authorization header, e.g. “Authorization: Bearer <jwt>”.
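A minimal token-acquisition sketch in Python using the requests library. The grant type, client ID, scope and credentials below are placeholders, not part of this specification; the flow that your Speech service registration actually allows may differ:

import requests

# Discover the token endpoint from the OpenID Connect configuration document.
OIDC_CONFIG = ("https://testspeechcloud.b2clogin.com/testspeechcloud.onmicrosoft.com"
               "/v2.0/.well-known/openid-configuration?p=b2c_1_signupsignin1")
token_endpoint = requests.get(OIDC_CONFIG).json()["token_endpoint"]

# NOTE: client_id, scope, username and password are placeholders; the grant type
# permitted by the tenant (password, authorization code, etc.) may be different.
token_response = requests.post(token_endpoint, data={
    "grant_type": "password",
    "client_id": "<client-id>",
    "scope": "<scope> openid",
    "username": "<user>",
    "password": "<password>",
})
jwt = token_response.json()["access_token"]

# Every API request must carry the token in the Authorization header.
headers = {"Authorization": f"Bearer {jwt}"}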

 

Offline recognition API

The offline recognition API is intended for the transcription of long audio recordings, where response time is not critical. By default, each file submitted to the offline recognition API goes through the full pipeline: audio format conversion, segmentation, diarization, decoding and post-processing.

 

Sending requests

Use the HTTP POST[1] method to send a “multipart/form-data”[2] request for speech recognition.

The request should be sent to the address:

<url>/recognize/file?system=<system>

“<url>”: speech service main endpoint address.

“<system>”: optional ASR system string identifier, e.g. “LTASR”. If omitted, the default ASR system is used.

Request form fields are described in Table 1. 

Table 1. POST request parameters

  Field            Description                                            Required
  audio            Audio file                                             Yes
  mode             List of enabled features. See Table 2.                 No
  email            Email address for notifications                        No
  email_url        Result page URL to be included in email,               No
                   e.g. https://asr_results?token=?id?
  email_template   E-mail template name (e.g. EEASR, LVASR etc.)          No
  callback         Callback endpoint URI to be called after job           No
                   is finished
  callback_type    File to include in the callback (ctm, srt, etc.;       No
                   see 5.3)

NB: The “audio” field MUST be the last parameter in the request body!

The “mode” parameter is a comma-separated list of enabled features. See Table 2.

Table 2. Offline API decoding features

  Name               Description
  skip_diariz        Skip segmentation and diarization
  speakers           Include speaker IDs in the transcription
  ctm                Generate CTM[1] file
  full               Default mode. Perform diarization and CTM generation
  skip_postprocess   Skip all postprocessing
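As an illustration, a minimal submission sketch in Python using the requests library (the file name and mode values are illustrative; note that requests encodes file parts after plain form fields, which keeps “audio” as the last part of the request body):

import requests

API = "https://runa.tilde.lv/v1"
headers = {"Authorization": "Bearer <jwt>"}

# Submit a file for offline recognition with a couple of optional features.
with open("recording.wav", "rb") as f:              # illustrative file name
    response = requests.post(
        f"{API}/recognize/file?system=LVASR",
        headers=headers,
        data={"mode": "speakers,ctm"},              # features from Table 2
        files={"audio": f},                         # file parts are encoded last
    )

print(response.json())          # {"status": "0", "request_id": "<unique identifier>"}
job_uri = response.headers.get("Location")          # URI of the created job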

Reply from the server

In case of a successful request, the “Location” response header will contain the URI of the created job and the following JSON message will be returned by the speech service:

{ "status" : "0", "request_id" : "<unique identifier>"  }

The list of possible status values can be seen in Table 3. In case of an error, the “status” field will contain a value other than 0, 3 or 4.

Table 3. Status codes

  Code   Description
  0      Success.
  1      No speech found in file.
  2      Unknown error during decoding. Recognition aborted.
  3      File is waiting in queue.
  4      File is being processed.
  5      File type is not recognized.
  6      Timeout during processing. Recognition aborted.
  9      The request cannot be processed at the moment. Try again later.
  10     Authorization error

 

The status of the recognition task can be checked by sending an HTTP GET request to the created job URI:

<url>/recognize/file/<job>/status

where “<url>” is the main endpoint address and “<job>” is the unique identifier of the request.

The reply will look like this:

{ "status" : "0", "request_id" : "<unique identifier>"  }
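A simple polling loop sketch, assuming the request_id returned when the job was created (status 3 means the file is queued and 4 means it is being processed, per Table 3):

import time
import requests

API = "https://runa.tilde.lv/v1"
headers = {"Authorization": "Bearer <jwt>"}
job = "<unique identifier>"          # request_id returned when the job was created

while True:
    status = requests.get(f"{API}/recognize/file/{job}/status",
                          headers=headers).json()["status"]
    if status not in ("3", "4"):     # 3 = waiting in queue, 4 = being processed
        break                        # "0" = success, anything else = error
    time.sleep(5)                    # the poll interval is an arbitrary choice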

 

Getting results

When speech recognition is completed, the transcription and other files can be retrieved using an HTTP GET request:

<url>/recognize/file/<job>/<file>

where

“<url>”: endpoint address.

“<job>”: the unique identifier of the request.

“<file>”: the name of the file to receive. Possible values are:

  • ctm – CTM file with word timing information
  • srt – Transcript in subtitle format
  • txt – Transcript in plain text format
  • summary – Recognition statistics 
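For illustration, downloading the transcript and CTM files of a finished job might look like this sketch (local file names are illustrative):

import requests

API = "https://runa.tilde.lv/v1"
headers = {"Authorization": "Bearer <jwt>"}
job = "<unique identifier>"

# Fetch the plain-text transcript and the CTM file for the finished job.
for name in ("txt", "ctm"):
    r = requests.get(f"{API}/recognize/file/{job}/{name}", headers=headers)
    with open(f"result.{name}", "wb") as out:       # illustrative local names
        out.write(r.content)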
CTM file format
 

Each line in the CTM file represents one word of the transcript; the columns are defined as follows:

CTM :== <U> <C> <BT> <DUR> <word_id> <CONF>

where

“<U>”: the ID of the utterance.

“<C>”: the waveform channel, always “1”.

“<BT>”: the begin time (seconds) of the word, measured from the start time of the utterance.

“<DUR>”: the duration (seconds) of the word.

“<word_id>”: the ID of the word.

“<CONF>”: the confidence score for the word.
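A small parsing sketch for this format, assuming whitespace-separated columns as defined above:

def parse_ctm_line(line):
    """Parse one '<U> <C> <BT> <DUR> <word_id> <CONF>' line into a dict."""
    utt, channel, begin, duration, word, confidence = line.split()
    return {
        "utterance": utt,
        "channel": channel,               # always "1"
        "begin": float(begin),            # seconds from the utterance start
        "duration": float(duration),      # seconds
        "word": word,
        "confidence": float(confidence),  # word confidence score
    }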

 

Summary file format

The “summary” file contains a summary of recognition statistics in JSON format.

Callbacks

If the “callback” parameter is supplied when submitting a file, then after the job is finished a notification will be sent by making a POST “multipart/form-data” request to the provided callback URI. On the receiving end, it looks like a usual multi-file upload request.

The request will contain a JSON status file (multipart section “status”) with the following content:

{ "status" : "<code>", "request_id" : "<job>"  }

Status codes are explained in Table 3.

If the “callback_type” parameter is provided and the job has completed successfully, then the callback request will also include the additional file specified by “callback_type”. For example, if “callback_type” is “txt”, then the request will include a multipart section “txt” containing the transcript in plain text.
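On the receiving side, a minimal callback-handler sketch; Flask and the route path are assumptions chosen for illustration, not part of the service:

import json
from flask import Flask, request

app = Flask(__name__)

@app.route("/asr-callback", methods=["POST"])   # the route name is illustrative
def asr_callback():
    # The "status" part is a JSON file: {"status": "<code>", "request_id": "<job>"}
    status = json.loads(request.files["status"].read())
    if status["status"] == "0" and "txt" in request.files:
        transcript = request.files["txt"].read().decode("utf-8")
        print(status["request_id"], transcript)
    return "", 204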

 

Online recognition API

The previously described offline recognition API is intended for cases where response speed is not essential, so it is not suitable for real-time use, where an answer is needed as quickly as possible. Traditional HTTP GET and POST methods are also ill-suited to a full-duplex scenario, where one side continuously sends audio and the other simultaneously responds with recognition results.

For situations where a fast answer is required, the online recognition API should be used. This API is based on the WebSocket protocol.

Opening a session

To open a session, connect to the specified server websocket address:

wss://<url>/recognize/stream/<system>

where

“<url>”: endpoint address.

“<system>”: optional ASR system string identifier; if omitted, the default ASR system is used.

The server assumes by default that incoming audio is sent in 16 kHz, mono, 16-bit little-endian format. This can be overridden using the “content-type” request parameter. The content type has to be specified using the GStreamer 1.0 caps format; e.g., to send 44100 Hz mono 16-bit data, use: "audio/x-raw, layout=(string)interleaved, rate=(int)44100, format=(string)S16LE, channels=(int)1". This string needs to be URL-encoded, so the actual request looks something like:

wss://<url>/recognize/stream/LVASR-ONLINE?content-type=audio/x-raw,+layout=(string)interleaved,+rate=(int)44100,+format=(string)S16LE,+channels=(int)1
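A short sketch of building such a URL in Python, assuming the main endpoint host runa.tilde.lv/v1 and the 44.1 kHz caps string above:

from urllib.parse import quote_plus

HOST = "runa.tilde.lv/v1"     # assumed WebSocket host/path for the main endpoint
caps = ("audio/x-raw, layout=(string)interleaved, rate=(int)44100, "
        "format=(string)S16LE, channels=(int)1")

# quote_plus turns spaces into '+' as in the example URL above.
url = (f"wss://{HOST}/recognize/stream/LVASR-ONLINE"
       "?content-type=" + quote_plus(caps, safe="(),="))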

Normally the server recognizes the container and codec automatically from the stream, so you do not have to specify the content type. For example, to send audio encoded with the Speex codec in an Ogg container, use the following URL to open the session (the server should automatically detect the codec):

wss://<url>/recognize/stream/<system>

Because the WebSocket protocol does not allow the “Authorization” header to be used for authentication, the client application shall send an authentication and initiation message immediately after the session is opened:

{
  "access_token": <JWT access token>
}

This message can also include instructions for recognition results postprocessing:

{
  "access_token": <JWT access token>,
  "enable-partial-postprocess": <postprocessors>,
  "enable-postprocess": <postprocessors>
}

where <postprocessors> can be:

  • A list of postprocessors to enable, e.g. ["numbers_all", "example2", …]
  • A dictionary, representing postprocessors and processing options.
    E.g. { "numbers_all": "escape", "commands2": { "commands": ["_$_ESCAPE_$_", "_$_NEW_LINE_$_", "_$_DELETE_LEFT_$_"] } }

The next step is optional. At the end of a recognition session, speaker adaptation data is collected and sent to the client application. At the start of a new session, the client application can choose to send the previously collected data in order to improve recognition quality. Adaptation data is sent as a JSON message:

{ "adaptation_state" : {
    "type": "<data encoding, e.g. string+gzip+base64>",
    "value": <adaptation data, e.g. iVector>
    }
}

 

Sending audio

Speech should be sent to the server in raw blocks of binary data, using the encoding specified when the session was opened. It is recommended that a new block is sent at least 4 times per second (less frequent blocks would increase the recognition lag). Blocks do not have to be of equal size.

After the last block of speech data, a special 3-byte ANSI-encoded string "EOS" ("end-of-stream") needs to be sent to the server. This tells the server that no more speech is coming and the recognition can be finalized.

After sending "EOS", the client has to keep the WebSocket open to receive recognition results from the server. The server closes the connection itself when all recognition results have been sent to the client. No more audio can be sent via the same WebSocket after an "EOS" has been sent; in order to process a new audio stream, the client has to create a new WebSocket connection.

Before closing the connection, the server sends a JSON message with speaker adaptation data:

{ "adaptation_state" : {
    "type": "<data encoding, e.g. string+gzip+base64>",
    "value": <adaptation data, e.g. iVector>
    }
}

This data can be stored and used in the next recognition session to improve recognition quality.

Reading results

The server sends recognition results and other information to the client in JSON format. The response can contain the following fields:

{
  "status": <response status (integer), see codes below>,
  "result": {
    "hypotheses": [{
      "utterance": <transcription>,
      "confidence": <(optional) confidence of the hypothesis (float, 0..1)>
    }],
    "final": <true when the hypothesis is final, i.e. it doesn't change any more>
  }
}

The following status codes are currently in use:

0 – Success. Usually used when recognition results are sent

2 – Aborted. Recognition was aborted for some reason.

1 – No speech. Sent when the incoming audio contains a large portion of silence or non-speech.

9 – Not available. Max load limit reached.

10 – Authentication failed.

11 – All recognition workers are currently in use and real-time recognition is not possible.

The WebSocket is always closed by the server after sending a non-zero status update (except for status code 11).

Examples of server responses:

{"status": 9}

 

{"status": 0, "result": {"hypotheses": [{"transcript": "see on"}], "final": false}}

 

{"status": 0, "result": {"hypotheses": [{"transcript": "see on teine lause."}], "final": true}}

The server segments incoming audio on the fly. For each segment, many non-final hypotheses are sent, followed by one final hypothesis. Non-final hypotheses are used to present partial recognition results to the client. A sequence of non-final hypotheses is always followed by a final hypothesis for that segment. After sending a final hypothesis for a segment, the server starts decoding the next segment, or closes the connection if all audio sent by the client has been processed.

The client is responsible for presenting the results to the user in a way suitable for the application.
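Putting the pieces together, a compact streaming-client sketch using the Python websockets library (the library choice, file name and block size are assumptions; the message sequence follows the protocol described above). Since the field description mentions “utterance” while the example responses use “transcript”, the sketch accepts either name:

import asyncio
import json
import websockets

URL = "wss://runa.tilde.lv/v1/recognize/stream/LVASR-ONLINE"   # assumed host/path

async def recognize(path, jwt):
    async with websockets.connect(URL) as ws:
        # 1. Authenticate immediately after the session is opened.
        await ws.send(json.dumps({"access_token": jwt}))

        # 2. Stream raw 16 kHz mono 16-bit little-endian audio in small blocks,
        #    roughly four blocks per second, then signal end-of-stream.
        async def send_audio():
            with open(path, "rb") as f:
                while chunk := f.read(8000):     # ~0.25 s of 16 kHz 16-bit mono
                    await ws.send(chunk)
                    await asyncio.sleep(0.25)
            await ws.send("EOS")

        sender = asyncio.create_task(send_audio())

        # 3. Read hypotheses until the server closes the connection.
        async for message in ws:
            msg = json.loads(message)
            if "adaptation_state" in msg:
                continue                         # could be stored for the next session
            if msg.get("status") == 0 and msg["result"].get("final"):
                hyp = msg["result"]["hypotheses"][0]
                print(hyp.get("transcript") or hyp.get("utterance"))

        await sender

asyncio.run(recognize("utterance.raw", "<jwt>"))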

 

Simple recognition API

The simple recognition API is intended for fast recognition of very short files (usually less than 10 seconds). It uses the same online ASR systems as the online recognition API, but provides a much simpler HTTP interface.

Use the HTTP POST method to send a “multipart/form-data” request for speech recognition.

The request should be sent to the address:

<url>/recognize/utterance?system=<system>

“<url>”: main endpoint address.

“<system>”: optional ASR system string identifier, e.g. “LTASR-ONLINE”. If omitted, the default ASR system is used.

Request form fields are described in Table 4.

Table 4. Simple recognition API POST request parameters

  Field   Description   Required
  audio   Audio file    Yes

 

NB: The “audio” field MUST be the last parameter in the request body!

If recognition is successful, the response will look like this:

{
  "status": "0",
  "request_id": "<unique identifier>",
  "hypotheses": [{ "utterance": <recognized text> }]
}

If an error occurred during recognition, the “hypotheses” field will not be available and the “status” field will contain one of the following values:

2 – Aborted. Recognition was aborted for some reason.

1 – No speech. Sent when the incoming audio contains a large portion of silence or non-speech.

9 – Not available. Max load limit reached.

10 – Authentication failed.

11 – All recognition workers are currently in use and real-time recognition is not possible.
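A minimal sketch of a simple-recognition call (the file name is illustrative):

import requests

API = "https://runa.tilde.lv/v1"
headers = {"Authorization": "Bearer <jwt>"}

# Send a short utterance for fast recognition.
with open("short_utterance.wav", "rb") as f:        # illustrative file name
    r = requests.post(f"{API}/recognize/utterance?system=LVASR-ONLINE",
                      headers=headers, files={"audio": f})

reply = r.json()
if reply["status"] == "0":
    print(reply["hypotheses"][0]["utterance"])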

Speech synthesis API
 
Synchronous API
 

Send an HTTP GET request to:

<url>/say?text=<text>&voice=<system>&pitch=<pitch>&tempo=<tempo>

where:

“<text>”: UTF-8 text that needs to be synthesized.

“<system>”: optional synthesis system ID (for example, “lt-regina”); if omitted, the default TTS system is used to process the request.

“<pitch>”: optional parameter, a value between 0.1 and 10 (1.0 by default) that changes the pitch of the synthesized speech.

“<tempo>”: optional parameter, a value between 0.1 and 10 (1.0 by default) that changes the tempo of the synthesized speech. Greater values mean slower tempo.

Response

An MP3 audio file with the synthesized speech.
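A synchronous synthesis sketch (the text and the output file name are illustrative; requests handles URL encoding of the query parameters):

import requests

API = "https://runa.tilde.lv/v1"
headers = {"Authorization": "Bearer <jwt>"}

r = requests.get(f"{API}/say", headers=headers, params={
    "text": "Labdien! Kā jums klājas?",   # UTF-8 text to synthesize (illustrative)
    "voice": "flite",                      # default Latvian female voice
    "tempo": 1.0,
})
with open("speech.mp3", "wb") as out:
    out.write(r.content)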

 

Asynchronous API

Submitting speech synthesis job

Send an HTTP POST request to:

<url>/say/?text=<text>&voice=<system>&pitch=<pitch>&tempo=<tempo>

where:

“<text>”: UTF-8 text that needs to be synthesized.

“<system>”: optional synthesis system ID (for example, “lt-regina”); if omitted, the default TTS system is used to process the request.

“<pitch>”: optional parameter, a value between 0.1 and 10 (1.0 by default) that changes the pitch of the synthesized speech.

“<tempo>”: optional parameter, a value between 0.1 and 10 (1.0 by default) that changes the tempo of the synthesized speech. Greater values mean slower tempo.

 

Response and monitoring job status

In case of a successful request, the speech service returns the following JSON message:

{ "status" : "0", "request_id" : "<unique identifier>"  }

The “Location” header of the response will contain the URI of the created job.

In case of an error, the “status” field will contain a value greater than 0. For asynchronous requests, the status of the synthesis job can be checked by sending an HTTP GET request to the URI of the created job:

<url>/say/<job>/status

where “<url>” is the speech service main address and “<job>” is the unique identifier of the request.

If the job is finished, the reply will look like this:

{ "status" : "0", "request_id" : "<unique identifier>"  }

The list of possible status values can be seen in Table 3. In case of an error, the “status” field will contain a value other than 0, 3 or 4.

 

Retrieving the result

When speech synthesis is completed, the audio and other files can be retrieved using an HTTP GET request:

<url>/say/<job>/<file>

where

“<url>”: endpoint address.

“<job>”: the unique identifier of the request.

“<file>”: the name of the file to receive. Possible values are:

  • ctm – CTM file with word timing information
  • audio – synthesized speech in MP3 format
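A sketch of the full asynchronous flow, reusing the polling approach from Table 3 (3 = queued, 4 = processing); the text and local file names are illustrative:

import time
import requests

API = "https://runa.tilde.lv/v1"
headers = {"Authorization": "Bearer <jwt>"}

# Submit the synthesis job.
r = requests.post(f"{API}/say/", headers=headers,
                  params={"text": "Sveiki, pasaule!", "voice": "flite"})
job = r.json()["request_id"]

# Wait until the job leaves the queued (3) / processing (4) states.
while requests.get(f"{API}/say/{job}/status",
                   headers=headers).json()["status"] in ("3", "4"):
    time.sleep(2)

# Download the synthesized audio and the CTM timing file.
for name, filename in (("audio", "speech.mp3"), ("ctm", "speech.ctm")):
    data = requests.get(f"{API}/say/{job}/{name}", headers=headers).content
    with open(filename, "wb") as out:
        out.write(data)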

Each line in the CTM file represents one word of the synthesized text; the columns are defined as follows:

CTM :== <U> <C> <BT> <DUR> <word>

where

“<U>”: the ID of the utterance.

“<C>”: the waveform channel, always “1”.

“<BT>”: the begin time (seconds) of the word, measured from the start time of the utterance.

“<DUR>”: the duration (seconds) of the word.

“<word>”: the synthesized word.