Natural language processing (NLP) has produced a number of applications for languages with many speakers, such as machine translation, spell checkers, text classification, automatic summarization, and question answering, among others.
These tools and benefits have not materialized for languages with fewer resources, such as Mexican languages, and in general for languages with few speakers, mainly in developing countries.
One of the main reasons is precisely the lack of sufficient texts, but another equally important one is the lack of governmental and social support to organize existing resources and generate new ones.
NLP tools for low-resource languages
Py-Elotl
Py-Elotl is a Python package (Python >= 3) developed by Paul Aguilar and Robert Pugh from the Elotl community. It focuses on low-resource languages of Mexico. The current version has two main features: orthographic normalization and access to parallel corpora.
How to install Py-Elotl?
Assuming that we have a GNU/Linux working environment with Python 3 installed, we can create a virtual environment, which we will call tlahtolli (“language” in Nahuatl), to better organize the work:
python3 -m venv tlahtolli
We enter the created folder:
cd tlahtolli
And we activate the environment:
source ./bin/activate
To install the latest version of Py-Elotl:
git clone https://github.com/ElotlMX/py-elotl.git
pip install -e py-elotl/
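Alternatively, assuming the package is also published on PyPI under the name elotl (as the repository suggests), the latest stable release can be installed with:
pip install elotl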
What are they and how to use parallel corpora?
Parallel corpora are a linguistic resource consisting of texts with the same meaning in two different languages, usually aligned at the sentence level. They can be understood as a database of original text plus its translation, usually into a more widely spoken language.
Parallel corpora are widely used to build machine translation tools, or directly as searchable texts for learners of one of the languages involved.
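Before using the library, here is a toy illustration of the idea: a sentence-aligned parallel corpus can be represented as a list of text pairs (the pair below is taken from the axolotl corpus used later in this tutorial):
# Toy illustration (not part of py-Elotl): a parallel corpus as a list of
# (Spanish, Nahuatl) sentence pairs, aligned at the sentence level.
toy_corpus = [
    ("Mucha gente busca la felicidad en todas partes y nunca la encuentra.",
     "Miac tlacah canohuian pahpaquiliztli quitemoah ihuan ayic quinextiah."),
]
for spanish, nahuatl in toy_corpus:
    print(spanish, "<->", nahuatl)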
We start by importing the package:
import elotl.corpus
And we print the list of available corpora:
list_of_corpus = elotl.corpus.list_of_corpus()
for row in list_of_corpus:
    print(row)
The output in the current version is:
['axolotl', 'Is a nahuatl corpus']
['tsunkua', 'Is an otomí corpus']
Axolotl is a Nahuatl-Spanish corpus and Tsunkua is an Otomi-Spanish corpus. In py-Elotl, each corpus entry contains 4 fields:
- Text in the non-original language (non_original_language); for the current corpora this is Spanish.
- Text in the original language (original_language); currently Nahuatl or Otomi.
- Dialectal variant (variant); Mexican languages are very diverse and many have several dialectal variants.
- Name of the document from which the sentence comes (document_name).
To load a corpus, just use the load() method with the corpus name. For example, to load the corpus “axolotl”:
axolotl = elotl.corpus.load('axolotl')
Once loaded, we can make queries; for example, let's look at the first entry:
print(axolotl[0])
And the output is:
['Vino a iluminar el sol y allí fue a ver a su', 'tlaminako tonati uan noponi kiitato', '', 'Lo que relatan de antes (cuentos tének y nahuas de la huasteca)']
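Each entry can also be unpacked into the four fields described above; a minimal sketch (the variable names are only illustrative):
# Unpack the first entry: Spanish text, Nahuatl text, dialectal variant, document name.
spanish_text, nahuatl_text, variant, document_name = axolotl[0]
print(nahuatl_text)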
For example, to search for the word “happiness” (paquiliztli) in the corpus, we can do the following:
keyword = 'paquiliztli'
for entry in axolotl:
    if any(keyword in s for s in entry):
        print(entry)
And we will get many examples of the use of this word; for example, the last entry is:
['Mucha gente busca la felicidad en todas partes y nunca la encuentra.', 'Miac tlacah canohuian pahpaquiliztli quitemoah ihuan ayic quinextiah.', '', 'Método auto-didáctico náhuatl-español']
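If we only want to match the Nahuatl side of each entry (the second field, as seen above), we can restrict the search and count the matches; a minimal sketch:
# Search only in the Nahuatl text (second field) and count the matching entries.
matches = [entry for entry in axolotl if keyword in entry[1]]
print(len(matches))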
What is spelling normalization and how to use it?
py-Elotl includes a spelling normalizer for the Nahuatl language.
Nahuatl, like many other languages, has more than one spelling system, that is, more than one set of orthographic rules (Spanish itself has an interesting story with the Chilean orthography).
A disadvantage of this orthographic diversity is that it makes it difficult to process (compare, count, etc.) the same word written in different standards. This is where a normalizer comes in handy.
py-Elotl can convert between three standards: SEP, INALI and Ack.
To use it, we first import the normalizer:
import elotl.nahuatl.orthography
And we load the standard we want to work with:
n = elotl.nahuatl.orthography.Normalizer("sep")
Now we can try normalizing the word “tlahtoa”, which in the SEP standard is written “tlajtoa” (longer texts can also be normalized without any problem):
word = 'tlahtoa'
print(n.normalize(word))
And the output is the expected “tlajtoa”. It is even possible to see the conversion to phonemes:
print(n.to_phones(word))
And the output is:
ƛaʔtoa
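To compare the available standards, we can create a normalizer for each one and normalize the same word; a minimal sketch (assuming the standard names are passed in lowercase as with "sep"; the exact outputs depend on the installed version of the package):
# Normalize the same word under each of the three supported standards.
for standard in ('sep', 'inali', 'ack'):
    normalizer = elotl.nahuatl.orthography.Normalizer(standard)
    print(standard, normalizer.normalize(word))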
The example from start to finish is shown below:
import elotl.corpus
import elotl.nahuatl.orthography
print('List of corpus')
list_of_corpus = elotl.corpus.list_of_corpus()
for row in list_of_corpus:
    print(row)
axolotl = elotl.corpus.load('axolotl')
print('Print first entry')
print(axolotl[0])
keyword = 'paquiliztli'
print('Looking for the word: '+keyword)
for entry in axolotl:
    if any(keyword in s for s in entry):
        print(entry)
word = 'tlahtoa'
print('Normalizing the word: '+word)
n = elotl.nahuatl.orthography.Normalizer('sep')
print(n.normalize(word))
print(n.to_phones(word))