As a research company, Tilde actively takes part in EU-funded projects, fostering innovation and creating a research background for our business development. We cooperate closely with leading European universities and language technology companies on cutting-edge research. We have led multiple large-scale EU research activities and served as the coordinator of several EU Seventh Framework Programme (FP7) projects.
Research projects are related to our main research competences:
The main objective of the TaaS project is to establish cloud-based services for acquiring, processing, and reusing multilingual terminological data. The main result of the project is TaaS (“Terminology as a Service”), an innovative cloud-based platform for acquiring raw terminological data, cleaning it up, and then sharing and reusing it.
TaaS offers the following cloud-based terminology services:
- Search of terms in the TaaS Shared Term Repository and other online terminology resources, such as IATE and EuroTermBank;
- Import of files in different formats widely exploited by users, e.g., DOC(X), PDF, XML-based formats like XLIFF and others;
- Automated extraction of monolingual term candidates (from documents uploaded by users) using state-of-the-art linguistically and statistically motivated terminology extraction techniques;
- Automatic lookup of translation equivalent candidates (for monolingual term candidates automatically extracted from documents uploaded by users) from the largest publicly available terminology databases, such as IATE and EuroTermBank, as well as statistical terminological data acquired from publicly available parallel and comparable Web data by use of state-of-the-art linguistically and statistically motivated terminology extraction and bilingual terminology alignment techniques;
- Creation of monolingual and bilingual terminology collections in user-defined languages within the 25 project languages;
- Collaborative terminology clean-up, e.g., deletion of irrelevant or unreliable term candidates and “incorrect” extractions; definition of termhood and unithood; term variant identification; deduplication; bilingual checking of translation equivalents and deletion of irrelevant or unreliable translation equivalents; validation of term candidates in context, etc.;
- Sharing of resulting terminological data with major terminology databases and banks;
- Reuse of terminology collections in various applications within different human and machine usage scenarios via the TaaS application programming interface (API) and export of files in different formats widely used, e.g., TSV, CSV, and TBX.
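The clean-up step in the list above (dropping unreliable candidates and deduplicating) can be illustrated with a minimal Python sketch. It is not the TaaS implementation: the `TermCandidate` fields, the `score` value, and the `min_score` threshold are assumptions introduced here purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermCandidate:
    text: str     # surface form as extracted from a document
    lang: str     # language code, e.g. "en"
    score: float  # hypothetical extraction confidence in [0, 1]


def normalize(text: str) -> str:
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.casefold().split())


def clean_up(candidates, min_score=0.5):
    """Drop low-scoring candidates, then keep the best one per (lang, normalized text)."""
    best = {}
    for cand in candidates:
        if cand.score < min_score:
            continue  # treated as unreliable under the assumed threshold
        key = (cand.lang, normalize(cand.text))
        if key not in best or cand.score > best[key].score:
            best[key] = cand
    # Sort by language, then normalized text, for stable output.
    return sorted(best.values(), key=lambda c: (c.lang, normalize(c.text)))


# Toy data, invented for this example.
candidates = [
    TermCandidate("machine translation", "en", 0.92),
    TermCandidate("Machine  Translation", "en", 0.81),
    TermCandidate("the following", "en", 0.12),
    TermCandidate("term bank", "en", 0.67),
]
cleaned = clean_up(candidates)
print([c.text for c in cleaned])
```

In a collaborative setting, a human reviewer would still make the final decision; a filter like this would only shrink the list that needs reviewing.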
TaaS demonstrates the efficacy of its terminology services within the following usage scenarios:
- For language workers, to simplify the processing, storage, sharing, and reuse of task-specific multilingual terminology.
- For computer-assisted translation (CAT) tools, to provide instant access to term candidates and translation equivalent candidates via the TaaS API.
- For statistical machine translation (SMT) systems, to facilitate domain adaptation through dynamic integration with TaaS-provided terminological data via the TaaS API.
The research within the TaaS project leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013), Grant Agreement no 296312.
To fully exploit the huge potential of existing open SMT technologies we propose to build an innovative online collaborative platform for data sharing and MT building. This platform will support upload of public as well as proprietary MT training data and building of multiple MT systems, public or proprietary, by combining and prioritizing this data. The project will extend the use of existing state-of-the-art SMT methods that will be applied to data supplied by users in order to increase quality, scope and language coverage of machine translation.
LetsMT! services will be focused on two application scenarios — the free online translation of business and financial news and the application in the localization and translation industry. At the same time, it will be of interest for a variety of users: Web users in general, speakers of less-covered languages, academia, etc.
For the localization and translation industry, LetsMT! will provide facilities for training of SMT systems on their data and generating custom SMT solutions to be used by localization service providers, as well as enterprises and organizations with multilingual translation needs. Integration of SMT solutions in professional productivity environments will be provided.
For readers of business and financial news, LetsMT! will provide free and instant MT services with an emphasis on less-covered languages. Their quality will be ensured by applying a large pool of domain-specific resources and by subsequent evaluation cycles.
LetsMT! services will be accessible through the Web portal for free translation of texts, through a translation widget provided for inclusion in a Web page, through browser plug-ins for quick access to translation, and through integration in professional translation tools.
The lack of sufficient linguistic resources for many languages and domains is currently one of the major obstacles to further advances in automated translation. The main goal of the ACCURAT research is to find, analyze, and evaluate novel methods by which comparable corpora can compensate for this shortage of linguistic resources and significantly improve MT quality for under-resourced languages and narrow domains.
The ACCURAT project will provide researchers and developers with a novel methodology and a fully functional model for exploiting comparable corpora to increase the translation quality of existing and emerging MT systems. We will determine criteria to measure the comparability of texts in comparable corpora. Methods for automatic acquisition of a comparable corpus from the Web will be analyzed and evaluated. Advanced techniques will be elaborated to extract lexical, terminological, and other linguistic data from comparable corpora to provide training and customization data for MT. Improvements from applying the acquired data will be measured against baseline results from MT systems and validated in practical applications.
ACCURAT will provide novel approaches to achieve high-quality MT for a number of under-resourced EU languages (e.g. Estonian, Croatian) and to adapt existing MT technologies to narrow domains (e.g. automotive engineering), significantly increasing the language and domain coverage of MT. ACCURAT methods will be universal and adaptable to new languages and domains.
The project consortium has an optimum balance of world-class researchers in all key research areas and industry SME participants, ensuring maximum orientation to exploitation needs. ACCURAT will contribute to the expected impacts of the Call by providing methods for automatic acquisition and annotation of language resources, closing gaps in language coverage, increasing translation quality, and making automated translation more adaptive.
Baltic and Nordic Branch of the European Open Linguistic Infrastructure
The META-NORD project aims to establish an open linguistic infrastructure in the Baltic and Nordic countries (Denmark, Estonia, Finland, Iceland, Latvia, Lithuania, Norway, and Sweden). The focus of the project is to assemble, link across languages, and make widely available language resources of different types (the core META-NORD set comprises treebanks, wordnets, and multilingual terminologies) used by different categories of user communities in academia and industry for specific products and applications. The project will integrate tightly with META-NET and other related activities to create a pan-European open resource exchange platform.
Terminology Extraction, Translation Tools and Comparable Corpora
The TTC project aims at leveraging machine translation tools (MT tools), computer-assisted translation tools (CAT tools) and multilingual content management tools by automatically generating bilingual terminologies from comparable corpora in five European languages (English, French, German, Spanish and one under-resourced language, Latvian), as well as in Chinese and Russian.
Crosslingual and multimodal Search in a Portal for Support of Assisted Living
The project will support the social participation of disabled and elderly people, by providing crosslingual and multimodal support for accessing information bases on assistive tools and technology. Recent efforts have linked national assistive technology information bases into a European portal called EASTIN (www.eastin.info). This portal will be enhanced and made more accessible using language technology.
Multilingual technology will allow users to search the data in their native language. Multimodal technology will allow them to access the portal not just in written but also in spoken communication.
The result of the project will be a dedicated language server inside the EASTIN portal. Its task is to act as an interpreter, transforming voice into text and text into voice, and translating queries and documents from and into the users' native language. A by-product will be a multilingual terminological glossary of the assistive domain, covering all relevant search terms and their translations. This glossary will be made available for online access.
The primary objective of Tripod was to revolutionize access to the enormous body of visual media. Applying an innovative multidisciplinary approach, Tripod utilized largely untapped but vast, accurate, and regularly updated sources of semantic information to create groundbreaking, intuitive search services, enabling users to effortlessly and accurately find the image they seek in this ever-expanding resource.
Multi modal Interaction Analysis and exploration of Users within a Controlled Environment
The MIAUCE project investigated and developed techniques to analyze the multi-modal behavior of users within the context of real applications. The multi-modal behavior takes the form of eye gaze/fixation, eye blinks, and body movement. We studied and developed techniques that capture and analyze multi-modal behavior in controlled environments. As a result of such analysis, information can be adapted to the user's needs and situation. The objective is to develop techniques for interaction between humans and controlled environments, rather than human-computer or human-human interaction.
Cross Language Information Retrieval and Organization of Text and Audio Documents
The CLARITY project developed cross-language information retrieval (CLIR) techniques that work with minimal translation resources, such as language models to reduce the ambiguity introduced during translation, methods to translate words for which no standard translation information exists, and a means of translating from one language to another via an intermediate language. It researched retrieval methods that handle mixed collections of spoken documents and documents in different languages. A number of techniques were investigated to help users interact with the system and to better present and organize cross-language and spoken documents. An online document organization tool was created based on concept hierarchies, a method for identifying document style, and a document gisting method for translating short summaries. The problem of CLIR for the Baltic languages was researched, and solutions were provided.
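Translation via an intermediate (pivot) language, mentioned above, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CLARITY method: the function name `pivot_translate` and the toy Latvian-English and English-German dictionaries are invented for this example.

```python
def pivot_translate(word, source_to_pivot, pivot_to_target):
    """Return the sorted target-language candidates reachable through the pivot language."""
    targets = set()
    for pivot_word in source_to_pivot.get(word, []):
        targets.update(pivot_to_target.get(pivot_word, []))
    return sorted(targets)


# Toy dictionaries (hypothetical data): Latvian -> English -> German.
lv_en = {"valoda": ["language", "tongue"]}
en_de = {"language": ["Sprache"], "tongue": ["Zunge", "Sprache"]}

print(pivot_translate("valoda", lv_en, en_de))  # ['Sprache', 'Zunge']
```

The spurious candidate "Zunge" shows how pivoting multiplies ambiguity, which is why disambiguation resources such as language models matter in this setting.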
The goal of the EuroTermBank project, administered under the European Commission eContent programme from 2004 to 2007, was to facilitate terminology data accessibility and exchange by collecting, consolidating, and disseminating existing dispersed terminology resources through an online terminology data bank. The initial focus of EuroTermBank was to help improve the terminology infrastructure in selected new European Union member countries (Latvia, Lithuania, Estonia, Poland, Hungary), but the project has expanded its activities to other EU states and beyond.
The objective of EuroTermBank is to integrate available terminology resources (not only from project partner countries) into the central EuroTermBank database or interlink them via EuroTermBank as a central gateway and a single point of service. The data bank works on a two-tier principle — as a central database and as an interlink node or gateway to other national and international terminology banks. Data exchange mechanisms have been developed to establish term import, export and exchange with other terminology databases.
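The two-tier principle described above (a central database plus a gateway to interlinked banks) can be pictured with a small Python sketch. It is an illustration only and assumes nothing about the real EuroTermBank interfaces: the `search` function, the bank name, and the toy entries are hypothetical.

```python
def search(term, central_db, external_banks):
    """Look up a term in the central database and in every interlinked bank."""
    key = term.casefold()
    # Tier 1: entries stored in the central database.
    results = [("central", entry) for entry in central_db.get(key, [])]
    # Tier 2: entries fetched through the gateway from each external bank.
    for bank_name, lookup in external_banks.items():
        results.extend((bank_name, entry) for entry in lookup(key))
    return results


# Toy data, invented for this example.
central_db = {"term": ["termins (lv)"]}


def national_bank_lookup(key):
    """Stand-in for a remote terminology bank; a real one would be a network call."""
    return {"term": ["terminas (lt)"]}.get(key, [])


hits = search("Term", central_db, {"national-bank": national_bank_lookup})
print(hits)
```

Labeling each hit with its origin lets a single search interface show users which bank an entry came from.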
A large number of terminology resources have been acquired and processed for inclusion in the EuroTermBank database. The methodology developed in the EuroTermBank project serves as the basis for content processing. The content passes through several stages before integration into the database, including selection, prioritization, modification, and digitization (for non-digital formats). The outcome is a reliable multilingual terminology resource, networked with other existing national and international resources and available to users online. The EuroTermBank portal enables searching within approximately 600,000 terminology entries containing over 1.5 million terms in various languages, coming from 100 terminology collections. As a single point of service, the EuroTermBank portal provides a consolidated search interface to its central database as well as to other national and international terminology banks. It can be easily expanded by importing or interlinking new terminology resources.
The objective of the SOLIM project is to improve context-aware information analysis by extending state-of-the-art ontology languages, and their support for automated reasoning, with a spatial dimension. This will enable semantic systems to venture beyond a static world and add the concepts of space and change.
Current technological tools for describing semantic knowledge are incapable of adequately supporting automated reasoning on the inherent spatial properties of concepts. Information with a spatial component can be described by using an ontology that treats locations as ordinary concepts. However, in doing so the temporal-spatial consequences of the described events (locations and movement) are lost in the formalization. This means that knowledge about the spatial aspects (such as orientation, dimension, scale, location and movement of a concept) cannot be efficiently described inside the ontology, even though it comprises valid and persistent knowledge about the domain. Spatial properties can only be dealt with in an ad-hoc manner while these are among the basic properties of physical concepts expressed in many ontologies.
The SOLIM project extends the Web Ontology Language (OWL) so that it can support effective storage of and reasoning on spatial information, and will demonstrate the power of such an extension with automatic processing of textual and graphical information.
Semantic Analysis-Based Multilanguage Document Management System
The goal of the SEMO project is to develop a new intelligent technology that retrieves metadata from documents in both paper and electronic format, regardless of their type, structure, and language. Paper documents are first digitized and then passed through OCR before being entered into the system for metadata retrieval. The metadata retrieval tool analyzes the entered structured file, recognizes the metadata type, classifies and then retrieves the document content. A special screening procedure estimates the probability of any recognition and retrieval errors and assesses the quality of each document processed.