Processing languages of the Global South

What's this project about?

PLoGS is a project designed to support languages of the Global South through the development of computational tools. Most such languages are disadvantaged because of years of colonialism and linguistic imperialism. The so-called Linguistic Digital Divide leaves them further marginalized in comparison to languages like English, Spanish, and Chinese, with few materials accessible on the internet and few resources available for creating new materials. The Linguistic Digital Divide is not simply a technological problem, but technology may be able to play a role in supporting these under-resourced languages.

Existing software

All the software is free and available under a GNU General Public License, which allows you to use it for any purpose, change it to suit your needs, and share it with others.

Our work has focused on two types of tools: those for processing the morphology (structure of words) of particular languages and those for assisting in the translation of documents for particular pairs of languages.

Morphological processing (L3Morpho)

For languages whose words have complex structure, morphological processing is an essential component of many applications. "Morphological processing" refers to two distinct processes: analysis, which extracts the root and grammatical properties of a given word, and generation, which performs the reverse. For example, given the Spanish word cambies, a morphological analyzer would recognize that it is a verb with the infinitive cambiar, in the present subjunctive, with a second person singular subject. Conversely, given the infinitive cambiar and the properties suj=2p and tmp=subj_pres, a morphological generator would produce the word cambies.
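The cambies example can be made concrete with a small Python sketch. It is a toy illustration only and makes no claim about how L3Morpho is implemented or what its interface looks like. The feature names suj and tmp and the values 2p and subj_pres come from the example above; the labels 1p and 3p, the two-verb lexicon, and the function names analyze and generate are assumptions introduced for illustration. The sketch covers only singular present subjunctive forms of regular -ar verbs and ignores spelling changes.

```python
# Endings of the present subjunctive for regular Spanish -ar verbs (singular only).
# "2p" and "subj_pres" come from the example in the text; "1p" and "3p" are
# hypothetical labels added by analogy.
ENDINGS = {
    ("subj_pres", "1p"): "e",
    ("subj_pres", "2p"): "es",
    ("subj_pres", "3p"): "e",
}

# A tiny hypothetical lexicon of infinitives the toy system knows about.
LEXICON = {"cambiar", "hablar"}


def generate(infinitive, suj, tmp):
    """Return the inflected word for an infinitive and a set of features."""
    if infinitive not in LEXICON:
        raise KeyError(f"unknown verb: {infinitive}")
    stem = infinitive[:-2]  # strip the -ar of the infinitive
    return stem + ENDINGS[(tmp, suj)]


def analyze(word):
    """Return every (infinitive, features) pair that could yield the word."""
    analyses = []
    for (tmp, suj), ending in ENDINGS.items():
        if not word.endswith(ending):
            continue
        # Rebuild the candidate infinitive and keep it only if it is known.
        infinitive = word[: len(word) - len(ending)] + "ar"
        if infinitive in LEXICON:
            analyses.append((infinitive, {"suj": suj, "tmp": tmp}))
    return analyses


print(generate("cambiar", suj="2p", tmp="subj_pres"))
print(analyze("cambies"))
print(analyze("cambie"))
```

In this sketch the form cambie receives two analyses, because the first and third person singular share an ending. Returning a list of analyses rather than a single one is a design choice that lets the calling application resolve such ambiguity from context.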

The applications that we have created for morphological processing are not designed for naive users but instead for computer scientists who are familiar with the programming language Python and want to incorporate morphological processing in the computational systems they are developing.

Systems are currently available for the following languages:

  • HornMorpho, for three languages spoken in the Horn of Africa: Amharic, Oromo, and Tigrinya
  • AntiMorfo, for Southern Quechua and Spanish, spoken in the Andes ("Anti" in Quechua) of Peru and Bolivia
  • KaxilMorfo, for K'iche' and Spanish, spoken in Guatemala ("Paxil Kayala'" in K'iche')

Computer-assisted translation

One important way to increase the available materials in under-resourced languages is through the translation of materials from other languages. Although machine translation is not yet capable of producing publication-quality materials, computer-assisted systems can speed up the work of human translators. Most computer-assisted translation software relies on translation memories, large databases of translation examples. For under-resourced languages, these databases don't exist yet, so the framework we are developing relies on grammatical knowledge to suggest translations for the user and saves the user's translations in a growing translation memory.
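The interplay between a growing translation memory and a grammar-based fallback might be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the class TranslationMemory, the functions suggest_from_grammar and suggest_translation, the similarity cutoff of 0.8, and the choice to consult the memory before the grammar are all hypothetical. The grammar component is a stub, and the target-language strings are placeholders rather than real translations.

```python
import difflib


class TranslationMemory:
    """A growing store of (source sentence, translation) pairs."""

    def __init__(self):
        self.pairs = {}

    def add(self, source, target):
        """Save a translation that the user has confirmed."""
        self.pairs[source] = target

    def suggest(self, source, cutoff=0.8):
        """Return the stored translation of the most similar source, or None."""
        matches = difflib.get_close_matches(
            source, list(self.pairs), n=1, cutoff=cutoff
        )
        return self.pairs[matches[0]] if matches else None


def suggest_from_grammar(source):
    """Stub standing in for a grammar-based component (not implemented here)."""
    return f"<grammar-based suggestion for: {source}>"


def suggest_translation(memory, source):
    """Use the memory when it has a close match; otherwise fall back (assumed order)."""
    return memory.suggest(source) or suggest_from_grammar(source)


memory = TranslationMemory()
# The memory starts empty, so the first suggestion comes from the grammar stub.
print(suggest_translation(memory, "La casa es grande."))
# The user's confirmed translation (a placeholder here) is saved for later reuse.
memory.add("La casa es grande.", "<user translation>")
# A near-identical sentence now retrieves the saved translation.
print(suggest_translation(memory, "La casa es grande"))
```

Fuzzy matching with difflib is used here only because it is in the Python standard library; a production system would likely need matching that is aware of word structure.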

We are currently developing a web application called Mainumby ("hummingbird" in Guarani) that implements such a system for the language pair Spanish-Guarani, the two official languages of Paraguay.