Knowledge-based word sense disambiguation for Setswana-English machine translation

Date

2024

Authors

Moape, Tebatso Gorgina

Abstract

Several challenges hinder the development of Setswana-to-English machine translation systems. A key obstacle is the absence of machine-readable knowledge resources, which has prompted reliance on the only accessible data, which originates from the government domain. While training machine-translation systems on government-domain data can offer specialized language knowledge, it also introduces obstacles such as limited vocabulary, style variation, bias, and domain specificity. Furthermore, the literature notes that the persistent problem of polysemy reduces the overall accuracy of machine-translation systems. Polysemy is a linguistic phenomenon in which a single word or phrase has multiple senses, resulting in ambiguity. The task of resolving this ambiguity in natural language processing (NLP) is known as word sense disambiguation (WSD). WSD serves as an intermediate task for enhancing text understanding in NLP applications, including machine translation, information retrieval, and text summarization. Its principal role is to improve the effectiveness and efficiency of these applications by selecting the appropriate sense of polysemous words in diverse contexts. This study addresses these challenges by proposing three essential components: a diversity-aware machine-readable knowledge resource for Setswana-English, the Setswana universal knowledge core (SUKC); a WSD approach to resolving lexical ambiguity; and a corresponding machine-translation model with embedded WSD capability. To build the resource, Setswana-English data was first collected from existing paper-based bilingual dictionaries. Second, the study employed professional translators to translate space domain concepts from English to Setswana. The collected lexicon was then integrated into the universal knowledge core (UKC).
The Lesk algorithm, which researchers have adapted for different languages over the years, was employed to address the inherent polysemy challenges. This study used a simplified Lesk-based algorithm to resolve polysemy for Setswana, using the bidirectional encoder representations from transformers (BERT) model for Setswana to embed Setswana glosses and the cosine similarity measure to quantify semantic similarity, thus determining the appropriate sense. For machine translation, the study employed a rule-based method embedded with the WSD algorithm. The translation accuracy of the machine-readable dictionary was assessed using the developed machine-translation model and evaluated with the BLEU score. The proposed model was tested on a combination of sentences with and without ambiguous words, and achieved a higher BLEU score of 34.89.
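
As an illustration only, the gloss-matching idea behind a simplified Lesk approach can be sketched as follows. This is not the thesis implementation: the `embed` function is a toy bag-of-words stand-in for the Setswana BERT embeddings, and the `SENSES` inventory, its sense labels, and its English glosses are hypothetical placeholders for entries that would come from a resource such as the SUKC.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a BERT sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(context, senses):
    """Return the sense label whose gloss is most similar to the context."""
    context_vec = embed(context)
    return max(senses, key=lambda label: cosine(context_vec, embed(senses[label])))


# Hypothetical sense inventory (assumption): sense label -> gloss text.
SENSES = {
    "bank_river": "sloping land beside a river or stream",
    "bank_money": "institution that accepts deposits and lends money",
}

print(disambiguate("she sat on the bank of the river", SENSES))
print(disambiguate("he deposited money at the bank", SENSES))
```

Swapping `embed` for a function that returns dense vectors from a pretrained language model would leave the selection logic unchanged, provided `cosine` is adapted to dense vectors.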

Description

Submitted in fulfilment of the requirements for the degree of Doctor of Philosophy in Information Technology, Durban University of Technology, Durban, South Africa, 2024.

Keywords

Setswana-to-English, Machine-readable knowledge resources, Machine-translation system

DOI

https://doi.org/10.51415/10321/5570
