Text Collection

Text collection for the development of a machine translation engine.

Contribute: Translations

By contributing texts to the training corpus, you will significantly aid the development of the Slovene machine translation engine for the ENG-SLO language pair.

Contribute: Gigafida

Do you have a large quantity of texts in Slovene? By contributing them to the Gigafida corpus you will greatly aid the development of Slovene language technologies.

In WP 4, we will develop a machine translation engine for the English-Slovene language pair, which will be available in several forms: on the DSDE portal as a web application, as part of a speech translation pipeline, and as open-source code that is also available for commercial use.

The development of language technologies, such as machine translation, is crucial for the survival of a language in the digital age. Without them, the language cannot be integrated into the new ways of communicating, working and spending leisure time that future technology will make available to us. This can already be observed with virtual assistants: we can communicate with them today, but only in a few languages (Amazon Alexa in 8 languages, Google Assistant in 13, and Apple Siri in 21). The DSDE project strives to make Slovene one of these languages.

In order to develop a high-quality machine translation tool, it is necessary to collect as many translated and aligned texts as possible. Compared to larger languages, Slovene is in a worse position, as fewer speakers also mean fewer translations. In addition, smaller languages are less attractive in the broader market for the development of commercial solutions. A rough estimate of the number of aligned segments for the ENG-SLO language pair is 36 million, with more than half of these segments coming from subtitles on the OpenSubtitles web portal. Subtitles are less useful for developing a machine translation engine that has to translate different text types successfully. Languages with a larger number of speakers also have translation corpora that are 3 to 4 times larger.

For this reason, one of the key goals of WP 4 is to collect aligned translations and original texts from different fields – we call on all those who produce translations to contribute their texts to the ENG-SLO translation corpus and thus help develop Slovene in the digital environment.

Potential contributors include public institutions, private companies, translation agencies and individual translators. We have compiled a list of frequently asked questions and answers regarding text collection.

Questions and answers for businesses and text owners

What texts can I contribute?

To develop a machine translation engine, we need source texts and their translations in segmented form. Typically, this means that they have been translated with a computer-aided translation (CAT) program and are available in a bilingual XML format (translation memories, XLIFF files). Most professional translators and agencies use CAT tools for their work. These tools store translations in a database, from which the texts can be quickly retrieved at a later time. Such files are the easiest to include in our corpus.
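For readers who would like to see what a segmented bilingual file contains, the following is a minimal sketch, assuming the translation memory is exported in the TMX format with English ("en") and Slovene ("sl") language codes. The sample data and the function name read_segment_pairs are illustrative assumptions and not part of the DSDE tooling.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# A tiny, simplified TMX document (hypothetical sample data).
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="sl"><seg>Dobro jutro.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="sl"><seg>Hvala.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>This unit has no translation.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_segment_pairs(tmx_text, src="en", tgt="sl"):
    """Return (source, target) pairs for every unit that has both languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Reduce codes such as "en-GB" to "en"; keep text inside inline tags.
                segments[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if segments.get(src) and segments.get(tgt):
            pairs.append((segments[src], segments[tgt]))
    return pairs


for source, target in read_segment_pairs(SAMPLE_TMX):
    print(source, "|", target)
```

Each translation unit already pairs a source segment with its translation, which is why such files need no further alignment before they are added to a corpus.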

The translation agency or translator did not provide me with bilingual files. What can I do?

It is common practice in the translation industry for translations to remain the property of the client unless a contract explicitly states the contrary. You can therefore ask whoever provided the translations to give you the bilingual files, which you can then contribute to the DSDE project.

What if I still cannot obtain bilingual files?

Larger quantities of original texts and their translations can also be aligned later. The latest automatic solutions achieve high accuracy. As part of the project, we will develop applications that allow machine alignment of texts in different languages, so you can also contribute translations and source texts that are not aligned.

I want to contribute texts, but they contain personal information. What can I do?

Personal data and other sensitive information can be anonymized through semi-automatic pseudonymization procedures. The results are very good, and you will also receive the texts for review before they are included in the corpus. The texts will be added to the corpus only after the contributor approves them.
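As an illustration of the automatic part of such a procedure, here is a minimal sketch, assuming that only e-mail addresses and phone numbers are replaced with placeholders. The patterns, the placeholder labels and the function name pseudonymize are hypothetical; the procedures used in the project may work differently.

```python
import re

# Illustrative patterns only; real data needs broader and more careful rules.
PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("PHONE", re.compile(r"\+?\d[\d\s/-]{6,}\d")),
]


def pseudonymize(text):
    """Replace matches with placeholders and list them for human review."""
    findings = []
    for label, pattern in PATTERNS:
        def replace(match, label=label):
            findings.append((label, match.group(0)))
            return f"[{label}]"
        text = pattern.sub(replace, text)
    return text, findings


clean, found = pseudonymize("Contact Ana at ana.novak@example.com or +386 1 234 5678.")
print(clean)
print(found)
```

The name "Ana" is left untouched in this sketch: simple patterns cannot recognize personal names, which is one reason the procedure is semi-automatic and ends with a review by the contributor.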

I want to contribute texts, but I do not want third parties to be able to access my documents.

The sentences in the corpus will be randomized, that is, mixed with each other in random order. We will also remove all metadata that could be used to reconstruct whole documents. Additionally, we can anonymize company names, product names, addresses and the like.
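The idea can be sketched in a few lines. This is a minimal illustration, assuming each contributed document is a dictionary with metadata fields and a list of segments; the field names and the function name shuffle_segments are hypothetical and do not describe the actual DSDE pipeline.

```python
import random


def shuffle_segments(documents, seed=None):
    """Flatten documents into bare (source, target) pairs in random order."""
    pairs = [
        (segment["source"], segment["target"])  # keep the text only, drop all metadata
        for document in documents
        for segment in document["segments"]
    ]
    random.Random(seed).shuffle(pairs)
    return pairs


# Hypothetical input: two documents with titles, client names and segment ids.
documents = [
    {"title": "Manual", "client": "Company A", "segments": [
        {"id": 1, "source": "Press the button.", "target": "Pritisnite gumb."},
        {"id": 2, "source": "Wait.", "target": "Počakajte."},
    ]},
    {"title": "Contract", "client": "Company B", "segments": [
        {"id": 1, "source": "Thank you.", "target": "Hvala."},
    ]},
]

for source, target in shuffle_segments(documents):
    print(source, "|", target)
```

After this step the output carries no titles, client names or segment positions, so the order of sentences within a document cannot be read back from the corpus.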

I want to contribute texts, but I am not satisfied with the quality of the translations.

This is a common concern among potential contributors and shows a high level of care for the mother tongue, but in most cases it is unfounded. Modern algorithms, which operate on the principle of deep neural networks, can very effectively identify low-quality or non-standard translations and eliminate them in the training process. Given the principle of corpus construction (randomized sentences, no metadata), there is also no need to fear that the origin of any poor translations in the corpus could easily be determined.

Is there any compensation for contributing translations to the corpus?

Within the DSDE project, monetary compensation is envisaged for the contribution of translations to the ENG-SLO translation corpus. The amount of the compensation depends on the amount of text, its field and its relevance. We suggest that you contact us, and we will prepare a customized offer for you.

Will the corpus be publicly available?

Yes, in accordance with the requirements of the project, the ENG-SLO translation corpus will be available on the CLARIN portal under the CC-BY-SA 4.0 license.

Questions and answers for translators and translation agencies

Why, as a translator, would I help with an activity that takes away my work and earnings?

Machine translation engines have certainly changed the translation market in the last 10 to 15 years. A decade ago, machine translations from English into Slovene were practically useless; today they are a prominent part of the usual translation process, mainly in the form of post-editing for foreign clients. Although it is undeniable that rates have fallen, it is also true that the global translation industry is growing from year to year. Just as computer-assisted translation tools came on the market less than 30 years ago and completely changed the way we work in the translation industry, a similar leap is now happening with machine translation. With the DSDE project, the Republic of Slovenia ensures that the development of the Slovene language will not take place only in closed circles within large (mostly American) multinationals. We would also like to share knowledge of machine translation and its sensible applications with the Slovene public and translation clients.

Can I contribute texts as a translator or translation agency?

To contribute texts to the corpus, you must be the owner of the translations and original texts. Translators are usually not the owners of the original texts. Please check the contents of the contract between the provider of translation services and the client.

Is there any other way I can help collect texts for the corpus of translations?

As a translator, you can inform your clients that the Republic of Slovenia is purchasing translations within the DSDE project and thus provide them with an additional source of income. Your clients will certainly appreciate such a proposal, which will further deepen the business relationship, and which may also be reflected in new translation orders in the future.

© 2020. All rights reserved