Linguistic Analysis

Methods for automated linguistic analysis are the foundation of robust multilingual solutions that allow users to cross language barriers. Tilde’s researchers have deep knowledge in developing written language technology for complex, highly inflected languages.

Tilde’s long-term research activities on written language processing and linguistic analysis for highly inflected languages have resulted in exceptional proofing tools and natural language processing technologies (e.g. morphological analyzers and taggers, syntactic parsers, and named entity recognizers) for Baltic languages.

Linguistic analysis research

 

Linguistic analysis for better language technology

The quality of basic linguistic analysis tools for written text processing plays a crucial role in the development of high-level, cutting-edge language technology solutions. Therefore, Tilde's team of researchers is constantly looking for novel methods to improve linguistic analysis tools. The methods used for linguistic analysis include knowledge-based, data-driven, and hybrid approaches. Recently, our researchers started to investigate neural network models for three types of written text analysis tasks: syntactic analysis, assessment of grammaticality, and grammar correction.

Tilde’s research in syntactic analysis and grammar checking has been internationally acknowledged and received a best paper award (3rd place) in 2014.

PROJECTS

Ongoing projects

Quality Translation 21

The project aims to develop substantially improved statistical and machine-learning-based translation models for challenging languages and resource scenarios.

Open Data Incubator for Europe (ODINE)

As part of its ODINE incubator project, Tilde will gather, create, and contribute new multilingual open data sets for EU languages, which will enable the language technology community to develop key services such as machine translation systems.

European Language Resource Coordination

The objective of the project is to identify and gather language and translation data relevant to public administration across all 30 European countries.

Completed Projects

CLARITY (FP5 project) – Cross-Language Information Retrieval and Organisation of Text and Audio Documents

The aim of the CLARITY project was to develop cross-lingual information retrieval (CLIR) techniques from English to Finnish, Swedish, Latvian, and Lithuanian, i.e. low-density languages with minimal translation resources, and to investigate techniques for organising and presenting documents in concept hierarchies and by document genres and filters. CLARITY was a fully fledged retrieval system that supported the user throughout query formulation, text retrieval, and document browsing.

 

TTC (FP7 project) – Terminology Extraction, Translation Tools and Comparable Corpora

The TTC project aimed to leverage machine translation (MT) tools, computer-assisted translation (CAT) tools, and multilingual content management tools by automatically generating bilingual terminologies from comparable corpora in several European languages (i.e. English, French, German, and Latvian) as well as in Chinese and Russian. Terms in different languages are aligned based on the similarity of the words that appear next to them in the corpora (their immediate vicinity), an approach known as lexical context analysis. The system generates candidate translations for single- or multi-word terms. The approach relies on a one-to-one relation between terms and concepts.
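
To make the idea of context-based term alignment more concrete, the following is a minimal Python sketch, not the TTC implementation, whose internals are not described here. It assumes single-word terms, a fixed context window, a small hand-made seed dictionary for translating context words, and cosine similarity as the comparison measure. All function names, the toy corpora, and the dictionary entries are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def context_vector(tokens, term, window=2):
    """Count the words occurring within `window` positions of each `term` occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            # Skip position i itself so the term is not its own context.
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


def translate_vector(vec, seed_dict):
    """Map a source-language context vector into the target language.

    Context words missing from the (assumed) seed dictionary are dropped.
    """
    out = Counter()
    for word, n in vec.items():
        if word in seed_dict:
            out[seed_dict[word]] += n
    return out


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_candidates(src_tokens, tgt_tokens, src_term, candidates, seed_dict, window=2):
    """Rank target-language candidates by context similarity to `src_term`."""
    src_vec = translate_vector(context_vector(src_tokens, src_term, window), seed_dict)
    scored = [(c, cosine(src_vec, context_vector(tgt_tokens, c, window))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy comparable corpora (hypothetical; real corpora would be far larger).
src_tokens = "the turbine generates electricity . the turbine needs wind".split()
tgt_tokens = "la turbine produit électricité . le rotor tourne vite . la turbine utilise vent".split()

# Hypothetical seed dictionary covering a few context words.
seed_dict = {"the": "la", "generates": "produit", "electricity": "électricité", "wind": "vent"}

ranked = rank_candidates(src_tokens, tgt_tokens, "turbine", ["turbine", "rotor"], seed_dict)
print(ranked)
```

In this toy setting the French candidate whose neighbouring words best match the translated English context is ranked first. A realistic system would additionally need multi-word term handling, association weighting instead of raw counts, and a much larger seed dictionary.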

Publications

2020

Georg Rehm, Maria Berger, Ela Elsholz, Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Stelios Piperidis, Miltos Deligiannis, Dimitris Galanis, Katerina Gkirtzou, Penny Labropoulou, Kalina Bontcheva, David Jones, Ian Roberts, Jan Hajic, Jana Hamrlová, Lukáš Kačena, Khalid Choukri, Victoria Arranz, Andrejs Vasiļjevs (Tilde), Orians Anvari (Tilde), Andis Lagzdiņš (Tilde), Jūlija Meļņika (Tilde), Gerhard Backfried, Erinç Dikici, Miroslav Janosik, Katja Prinz, Christoph Prinz, Severin Stampler, Dorothea Thomas-Aniola, Jose Manuel Gomez-Perez, Andres Garcia Silva, Christian Berrío, Ulrich Germann, Steve Renals and Ondrej Klejch. 2020. European Language Grid: An Overview. Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020), 3359‑3373.

 

Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajic, Khalid Choukri, Andrejs Vasiļjevs (Tilde), Gerhard Backfried, Christoph Prinz, Jose Manuel Gomez-Perez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek, Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Albina Auksoriūtė, Núria Bel, António Branco, Gerhard Budin, Walter Daelemans, Koenraad De Smedt, Radovan Garabík, Maria Gavriilidou, Dagmar Gromann, Svetla Koeva, Simon Krek, Cvetana Krstev, Krister Lindén, Bernardo Magnini, Jan Odijk, Maciej Ogrodniczuk, Eiríkur Rögnvaldsson, Mike Rosner, Bolette Pedersen, Inguna Skadina, Marko Tadić, Dan Tufiș, Tamás Váradi, Kadri Vider, Andy Way and François Yvon. 2020. The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe. Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020), 3315‑3325.

2019