Developing Veʹrdd for Easy Editing of Apertium Machine Translation Dictionaries - a Google Summer of Code project

This summer I had the pleasure of participating in the Google Summer of Code (GSoC) program, which funds university students to work on an open source project over the summer. I spent my summer working for Apertium, an open-source machine translation platform. Apertium follows a rule-based tradition, which makes it usable even in severely low-resourced scenarios, as it relies on dictionaries and rule-based descriptions of language such as finite-state transducer (FST) morphologies.

Veʹrdd and its Background

The development of Veʹrdd started long before the Google Summer of Code project. It was initially developed for editing the upcoming Finnish-Skolt Sami dictionary together with Oulu University. The main requirement for Veʹrdd was the ability to bring data in from different sources such as Giella formatted XMLs developed by Jack Rueter and unstructured Excel files.

In addition, the Skolt Sami FST was integrated into Veʹrdd so that it can be used to automatically fill in missing information such as inflectional paradigms and to link words to each other based on compounding and derivation. This was done with UralicNLP, an open-source tool developed by Mika Hämäläinen.
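As a rough illustration of what filling in a paradigm from an FST can look like, here is a minimal sketch. It is not Veʹrdd's actual code: the function `fill_paradigm`, the tag list `NOUN_TAGS` and the toy generator are all hypothetical, and the Finnish forms only stand in for real FST output. In practice the `generate` argument could be a thin wrapper around a morphological generator such as the one UralicNLP provides.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical tag set: Giella-style tags for a handful of noun forms.
NOUN_TAGS = ("+N+Sg+Nom", "+N+Sg+Gen", "+N+Pl+Nom")


def fill_paradigm(
    lemma: str,
    generate: Callable[[str], List[str]],
    tags: Sequence[str] = NOUN_TAGS,
) -> Dict[str, List[str]]:
    """Ask the generator for every form in `tags` and keep the non-empty ones."""
    paradigm: Dict[str, List[str]] = {}
    for tag in tags:
        forms = generate(lemma + tag)
        if forms:  # the FST may not know the lemma or the form
            paradigm[tag] = forms
    return paradigm


# Stand-in for a real FST so that the sketch runs without any models installed.
TOY_FST = {
    "kissa+N+Sg+Nom": ["kissa"],
    "kissa+N+Sg+Gen": ["kissan"],
    "kissa+N+Pl+Nom": ["kissat"],
}


def toy_generate(analysis: str) -> List[str]:
    """Look the analysis string up in the toy table; unknown input gives no forms."""
    return TOY_FST.get(analysis, [])


if __name__ == "__main__":
    print(fill_paradigm("kissa", toy_generate))
```

Passing the generator in as a function keeps the dictionary logic independent of any particular FST backend, which matters when the same editor has to serve many languages.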

As Veʹrdd had already been developed with multiple import formats (XML and Excel), export formats (XML, LEXC and LaTeX) and multilinguality in mind, it was quite evident that it could be expanded to cover Apertium needs.

From Skolt Sami to Multilingual Apertium

The first step was to make sure that Apertium formatted XML could be imported into Veʹrdd and its data represented in such a way that information from Giella XMLs and LEXC files could be integrated with Apertium bidix and monodix files. This problem, which sounds relatively easy and straightforward, turned out to be quite a challenge, as Giella and Apertium use partially incompatible representations. Also, the Apertium format is not quite as machine readable as that of Giella, which made it necessary to write a long list of conversion rules.
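To give an idea of the kind of conversion involved, the following sketch reads translation pairs out of a small Apertium-style bidix fragment using only the Python standard library. It is a simplified illustration rather than the importer used in Veʹrdd: the sample entries and the function names are made up, and real bidix files contain further constructs (restrictions, paradigm references and so on) that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the usual bidix layout: <e> entry, <p> pair, <l>/<r> sides,
# <s n="..."/> for tags and <b/> for a space inside a multiword lemma.
BIDIX_SAMPLE = """
<dictionary>
  <section id="main" type="standard">
    <e><p><l>house<s n="n"/></l><r>talo<s n="n"/></r></p></e>
    <e><p><l>ice<b/>cream<s n="n"/></l><r>jäätelö<s n="n"/></r></p></e>
  </section>
</dictionary>
"""


def side_to_lemma_and_tags(side):
    """Flatten one <l> or <r> element into (lemma, [tags])."""
    parts = [side.text or ""]
    tags = []
    for child in side:
        if child.tag == "b":      # <b/> marks a blank between words
            parts.append(" ")
        elif child.tag == "s":    # <s n="..."/> carries a grammatical tag
            tags.append(child.get("n"))
        parts.append(child.tail or "")
    return "".join(parts).strip(), tags


def read_bidix(xml_text):
    """Return a list of ((left lemma, tags), (right lemma, tags)) tuples."""
    root = ET.fromstring(xml_text)
    entries = []
    for pair in root.iter("p"):
        left, right = pair.find("l"), pair.find("r")
        if left is None or right is None:
            continue  # skip malformed pairs instead of failing
        entries.append((side_to_lemma_and_tags(left), side_to_lemma_and_tags(right)))
    return entries


if __name__ == "__main__":
    for left, right in read_bidix(BIDIX_SAMPLE):
        print(left, "->", right)
```

Even in this small case the lemma has to be pieced together from text fragments scattered between the tag elements, which hints at why a long list of conversion rules was needed.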

Another source of problems was fitting the morphological information coming from LEXC files together with the information represented in the Apertium monodix. The continuation lexica in both file types were full of inconsistencies, which resulted in duplicate entries in Veʹrdd. Fixing this issue was a time-consuming task, given that we were dealing with dozens of languages at the same time.
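One way to picture the duplicate problem is the sketch below, which groups incoming entries by lemma and part of speech and flags the groups whose sources disagree on the continuation lexicon. This is only an assumed approach for illustration, not a description of how Veʹrdd resolves it; the field names and the lexicon labels (`N_KALA`, `kala__n`, `N_TALO`) are hypothetical.

```python
from collections import defaultdict


def group_by_lemma_and_pos(entries):
    """Collect continuation lexica and sources for every (lemma, pos) pair."""
    grouped = defaultdict(lambda: {"contlex": set(), "sources": set()})
    for entry in entries:
        key = (entry["lemma"], entry["pos"])
        grouped[key]["contlex"].add(entry["contlex"])
        grouped[key]["sources"].add(entry["source"])
    return dict(grouped)


def conflicts(grouped):
    """Return the keys whose sources name more than one continuation lexicon."""
    return sorted(key for key, info in grouped.items() if len(info["contlex"]) > 1)


# Hypothetical input: the same noun arrives from a LEXC file and from a monodix
# with differently named inflection classes.
SAMPLE_ENTRIES = [
    {"lemma": "kissa", "pos": "N", "contlex": "N_KALA", "source": "lexc"},
    {"lemma": "kissa", "pos": "N", "contlex": "kala__n", "source": "monodix"},
    {"lemma": "talo", "pos": "N", "contlex": "N_TALO", "source": "lexc"},
]

if __name__ == "__main__":
    grouped = group_by_lemma_and_pos(SAMPLE_ENTRIES)
    print(conflicts(grouped))
```

Grouping first and reporting conflicts separately leaves the decision about which class is correct to a human editor or a per-language mapping table, instead of silently creating a second entry.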

What was achieved with Veʹrdd

Veʹrdd has been under development on GitHub ever since it was moved there from BitBucket. I have been the main developer of the repository from the very beginning, and the work done during the Google Summer of Code can be seen from this commit onwards.

In its current state, Veʹrdd provides a graphical web interface for non-technical people to conduct dictionary work. For the first time ever, this can be done in such a way that both Giella and Apertium based tools can use the same system. In other words, there is no longer a need to develop resources for endangered languages separately in Giella and Apertium. Instead, all edits can be exported in Giella and Apertium formats whenever needed. With the scripts developed by Jack Rueter, it is even possible to use the exported Giella XMLs to build FST-based morphological analysers and generators.

Veʹrdd also makes community involvement possible, as speakers of small languages can edit the resources directly without having to deal with XML markup or the somewhat underdocumented representations of machine-readable dictionaries.

A future direction for improving Veʹrdd further is to introduce an additional layer of abstraction for linking words that mean the same thing to each other, namely concepts. Several multilingual online dictionaries have been created in the past, most notably Wiktionary, and in them the complexity increases as more languages are introduced to the system. This means that such dictionaries tend to have a lot of words in different languages that mean the same thing and would be translations of each other, but that are not linked to each other in any way. The feature of concepts in Veʹrdd would mean that words link to a concept and, via the concept, indirectly to each other. This would ensure that as more languages are introduced, they would automatically get translations for completely new language pairs. This would help bootstrap the development of different bilingual translation systems in Apertium.
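The idea of linking through concepts can be sketched with a small in-memory index. Since the feature is only planned, everything here is hypothetical: the class name `ConceptIndex`, its methods and the concept identifier `CAT` are illustrative choices and not part of Veʹrdd.

```python
from collections import defaultdict


class ConceptIndex:
    """Words point to concepts; translations are found by going through them."""

    def __init__(self):
        self._words_by_concept = defaultdict(set)
        self._concepts_by_word = defaultdict(set)

    def link(self, concept_id, lang, lemma):
        """Attach a word in some language to a concept."""
        self._words_by_concept[concept_id].add((lang, lemma))
        self._concepts_by_word[(lang, lemma)].add(concept_id)

    def translations(self, lang, lemma, target_lang):
        """Return the words in target_lang that share a concept with the given word."""
        found = set()
        for concept_id in self._concepts_by_word.get((lang, lemma), ()):
            for other_lang, other_lemma in self._words_by_concept[concept_id]:
                if other_lang == target_lang:
                    found.add(other_lemma)
        return sorted(found)


if __name__ == "__main__":
    index = ConceptIndex()
    index.link("CAT", "fin", "kissa")
    index.link("CAT", "eng", "cat")
    # Adding a third language needs one link, not one per existing language.
    index.link("CAT", "swe", "katt")
    print(index.translations("fin", "kissa", "swe"))
    print(index.translations("swe", "katt", "eng"))
```

With direct word-to-word links, every new language would need a link to each existing language, whereas here one link per word to its concept is enough for all pairs to become available.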

Another potential feature is semi-automatically acquiring standardizations and neologisms that are approved by official committee representatives of low-resourced languages, and released online. This feature would continuously extend Veʹrdd to include the latest terminologies and translations.

Veʹrdd can be accessed at https://akusanat.com/verdd/ and it has extensive documentation on its dedicated Wiki pages.