Knowledge-based word sense disambiguation for Setswana-English machine translation
Date
2024
Authors
Moape, Tebatso Gorgina
Abstract
There are several challenges that hinder the development of Setswana-to-English machine translation systems. A key obstacle is the absence of machine-readable knowledge resources. This has
prompted the use of the only accessible data, which originates from the government domain. While
training machine-translation systems using government-domain data can offer specialized language
knowledge, such training introduces obstacles such as limited vocabulary, style variation, bias, and
domain specificity. Furthermore, it is noted in the literature that the ongoing problem of polysemy in a
machine-translation system reduces the overall accuracy. Polysemy is a linguistic phenomenon in which
a single word or phrase has multiple senses, resulting in ambiguity. The task of resolving ambiguity in
natural language processing (NLP) is known as word sense disambiguation (WSD). The concept of
WSD serves as an intermediate task for enhancing text understanding in NLP applications, including
machine translation, information retrieval, and text summarization. Its cardinal role is to enhance the
effectiveness and efficiency of these applications by ensuring the accurate selection of the appropriate
sense for polysemous words in diverse contexts. This study addresses these challenges by proposing
three essential components: a diversity-aware machine-readable knowledge resource for Setswana-English, the Setswana universal knowledge core (SUKC); a WSD approach to resolving lexical
ambiguity; and a corresponding machine-translation model embedded with a WSD capability.
To this end, Setswana-English data was first collected from existing paper-based bilingual dictionaries.
Second, the study employed professional translators to translate space domain concepts
from English to Setswana. The collected lexicon was integrated into the universal knowledge core
(UKC). The Lesk algorithm, which researchers have adapted for different languages
over the years, was employed to address the inherent polysemy challenges. This study used a simplified
Lesk-based algorithm to resolve polysemy for Setswana: a bidirectional encoder
representations from transformers (BERT) model for Setswana embeds the Setswana glosses, and the
cosine similarity measure quantifies semantic similarity, thus determining the accurate sense. The study
employed a rule-based method embedded with the WSD algorithm for machine translation. The
translation accuracy of the machine-readable dictionary was assessed using the developed
machine-translation model and evaluated using the BLEU score. The proposed model was tested on a
combination of sentences with and without ambiguous words, and a higher
BLEU score of 34.89 was achieved.
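The gloss-based disambiguation step described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the `embed` function below is a toy bag-of-words stand-in for the Setswana BERT encoder, the sense inventory uses hypothetical English placeholder glosses rather than SUKC entries, and all function and sense names are assumptions made for illustration. Only the overall idea (embed each candidate gloss, compare it with the context by cosine similarity, keep the best-scoring sense) follows the description above.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder: a bag-of-words count vector.

    Assumption: the study uses a BERT model for Setswana here instead.
    """
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def disambiguate(context, glosses):
    """Return the sense whose gloss is most similar to the context."""
    context_vec = embed(context)
    return max(glosses, key=lambda sense: cosine(context_vec, embed(glosses[sense])))


# Hypothetical English placeholder glosses for the ambiguous word "bank".
GLOSSES = {
    "bank_finance": "an institution that keeps money and gives loans",
    "bank_river": "the sloping land along the side of a river",
}

sense = disambiguate("she sat on the bank of the river", GLOSSES)
```

In a rule-based translation pipeline of the kind the abstract mentions, the selected sense would then determine which target-language equivalent is emitted for the ambiguous word.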
Description
Submitted in Fulfilment of the requirements of the Degree of Doctor of Philosophy in Information Technology, Durban University of Technology, Durban, South Africa, 2024.
Keywords
Setswana-to-English, Machine-readable knowledge resources, Machine-translation system
DOI
https://doi.org/10.51415/10321/5570