Tilde unveils massive new corpus of multilingual open data

Machine Translation
23.05.2017

Tilde unveiled a new corpus of multilingual open data for EU languages this week at the NoDaLiDa conference in Gothenburg, Sweden. Known as the TILDE MODEL (Multilingual Open Data for EU Languages) corpus, the resource contains over 16 million segments of parallel data collected in several key domains.

The TILDE MODEL corpus will now be available to the global language technology community for developing high-quality language technology services for EU languages. The data was collected from sites that allow free use and reuse of their content, as well as from public sector websites. The resources in the TILDE MODEL corpus have been cleaned, aligned, and formatted into the standard TMX format, so they can be used to develop new language technology products and services.
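
For readers unfamiliar with TMX, the following is a minimal sketch of how aligned segment pairs might be read from a TMX document using Python's standard library. The sample document, the language codes, and the function name `read_segment_pairs` are illustrative assumptions, not content or tooling from the TILDE MODEL corpus itself.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-language sample in TMX layout (not taken from the corpus).
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good afternoon.</seg></tuv>
      <tuv xml:lang="lv"><seg>Labdien.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_segment_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) text pairs for every translation unit
    that contains a segment in both requested languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        by_lang = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG, "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                by_lang[lang] = "".join(seg.itertext()).strip()
        if src_lang in by_lang and tgt_lang in by_lang:
            pairs.append((by_lang[src_lang], by_lang[tgt_lang]))
    return pairs


if __name__ == "__main__":
    for source, target in read_segment_pairs(SAMPLE_TMX, "en", "lv"):
        print(source, "|", target)
```

For files as large as a multi-million-segment corpus, a streaming parser such as `ET.iterparse` would likely be a better fit than loading the whole document into memory.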

The corpus can also be accessed in the META-SHARE repository, maintained by the Multilingual Europe Technology Alliance.

At the NoDaLiDa conference, representatives from Tilde will describe how data sources were selected for the new corpus, how the source data was handled, what tools were used, and what data the project produced.

Tilde collected and processed the new corpus as part of ODINE (Open Data Incubator for Europe), which aims to support the next generation of digital businesses and fast-track the development of new products and services.

The NoDaLiDa conference (21st Nordic Conference on Computational Linguistics) is one of the oldest and largest language technology conferences in the Nordic and Baltic countries. The conference addresses all aspects of natural language processing and computational linguistics, including work on applications, on linguistic resources, and in closely related neighbouring disciplines (such as linguistics or psychology) that is sufficiently formalized or applied to bear relevance to speech and language technologies.