Theses and dissertations (Accounting and Informatics)

Permanent URI for this collection: http://ir-dev.dut.ac.za/handle/10321/4

Search Results

Now showing 1 - 4 of 4
  • Data mining and machine learning : a study of the CO2 emission trends in South Africa
    (2024) Mohamed, Ghulam Masudh; Patel, Sulaiman Saleem; Naicker, Nalindren
    This study addresses the pressing global issue of elevated carbon dioxide emissions (CO2E), with a particular focus on South Africa (SA), which ranks among the world's top emitters and is the largest emitter in Africa. By introducing a novel integration of Change-point Analysis (CPA) and Machine Learning (ML) techniques, this research addresses significant gaps in CO2E trend analysis. Unlike previous studies, this research applies CPA methodologies within the distinct context of SA, employing cumulative sum (CUSUM) and Bootstrap analysis to pinpoint crucial change-points in CO2E data specific to the country. The Bootstrap analysis determines the confidence level associated with each detected change. Additionally, this study sought to validate historical trends and predict future patterns using ML models, with a specific focus on the AdaBoost ensemble learning technique. Drawing on insights from a Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)-based systematic review, the research selects input variables from the factors identified as significant contributors to CO2E, ensuring the models capture the relevant variables. The results of the systematic review highlight energy production and economic growth as key drivers of CO2E, thus validating their selection as input data for constructing the CPA and ML models. To conduct this study, secondary data was obtained from the World Bank's Open Data repository, a common source for environmental research. This selection was justified by a literature review, which highlighted the reliability and applicability of this data source. The CPA results reveal significant change-points in electricity generation, economic growth, and CO2E, with an average confidence level of 94%, indicating the accuracy of this analytical approach. Moreover, the CPA results emphasise the relationship between economic growth, electricity production, and CO2E in SA.
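The CUSUM-plus-bootstrap procedure described above can be illustrated with a minimal sketch. It is not the study's implementation: the function names, the synthetic series, and the number of bootstrap resamples are assumptions chosen for illustration. The sketch computes the cumulative sum of deviations from the mean, takes the point of largest absolute CUSUM as the candidate change-point, and estimates a confidence level as the share of randomly reordered series whose CUSUM range is smaller than the observed one.

```python
import random
from itertools import accumulate


def cusum(values):
    """Cumulative sum of deviations from the mean, starting at 0."""
    mean = sum(values) / len(values)
    return [0.0] + list(accumulate(v - mean for v in values))


def cusum_range(values):
    """Spread of the CUSUM path; a large spread suggests a shift in level."""
    path = cusum(values)
    return max(path) - min(path)


def detect_change_point(values, n_bootstrap=1000, seed=0):
    """Return (index of last point before the shift, confidence in percent).

    n_bootstrap and seed are illustrative defaults, not values from the study.
    """
    path = cusum(values)
    # The CUSUM path is furthest from zero at the last point of the old regime.
    change_index = max(range(1, len(path)), key=lambda i: abs(path[i])) - 1
    observed = max(path) - min(path)

    rng = random.Random(seed)
    shuffled = list(values)
    smaller = 0
    for _ in range(n_bootstrap):
        # Reordering destroys any time structure while keeping the same values.
        rng.shuffle(shuffled)
        if cusum_range(shuffled) < observed:
            smaller += 1
    return change_index, 100.0 * smaller / n_bootstrap


# Synthetic example: the level shifts after index 9 (hypothetical data).
series = [100.0] * 10 + [150.0] * 10
index, confidence = detect_change_point(series)
print(index, confidence)
```

A confidence value close to 100% means that almost no random reordering of the same values produces a CUSUM range as large as the observed one, which is the sense in which a bootstrap attaches a confidence level to a detected change.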
Before forecasting future CO2E trends, the effectiveness of the AdaBoost regressor in enhancing model performance was benchmarked against traditional ML algorithms, including Linear regression, Polynomial regression, Bayesian Linear regression and K-Nearest Neighbors (KNN) regression, to determine the most effective technique for forecasting CO2E. The researcher evaluated model performance using key regression performance metrics, including Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), the coefficient of determination (R2) score, and an additional accuracy score introduced by the researcher. Notably, the AdaBoost models demonstrated superior performance, with an average RMSE score of 10,143.17 kilotons (kt), MAE score of 9,642.64 kt, R2 of 0.90, and accuracy of 96.74%. The study also revealed that, on average, models trained using the AdaBoost algorithm outperformed traditional ML models. They achieved a reduction in RMSE score of 6,417.29 kt, a decrease in MAE score of 4,358.09 kt, an increase in R2 score of 0.07 and an increase in accuracy of 0.60%. Additionally, a comparative analysis of the repeated holdout methods and cross-validation techniques was conducted, with results revealing that repeated holdout had a more significant impact on model performance. After excluding outliers, the average improvement in cross-validation results due to the repeated holdout method was a decrease of 783.32 kt for RMSE, a reduction of 1,289.39 kt for MAE, and an increase of 0.88% for accuracy. The extent to which the repeated holdout method improved the performance of ML models integrated with cross-validation techniques was correlated with the initial model performance. For ML models with RMSE and MAE scores equal to or exceeding 15,000 kt, the findings indicate that the repeated holdout methods studied should enhance performance by at least 2,000 kt.
Similarly, an improvement of nearly 3% or higher in accuracy was noted when the cross-validation value for this metric was 94% or lower. The AdaBoost model, integrated with repeated holdout, was selected as the optimal model, as evidenced by the results, for forecasting CO2E in SA from 2021 to 2027. The forecasted CO2E trends validate that energy production and economic growth are indeed the primary drivers of CO2E in SA, as previously highlighted by the CPA model. This underscores the importance of addressing these factors to effectively mitigate carbon emissions in the country. Moreover, the forecasted results indicate that SA is unlikely to meet the global temperature limit of 1.5 degrees Celsius by 2030, given the trajectory showing a shortfall in achieving the target level of 334 million tonnes (Mt) of CO2E, agreed upon in the Paris Agreement. However, the country did meet its CO2E commitments outlined in the 2030 National Development Plan, showing some progress towards environmental sustainability. Nonetheless, the failure to meet these targets at their lower ranges suggests the need for further efforts to reduce carbon emissions, which is crucial for aligning with the Paris Agreement objectives and achieving a zero net emission rate by 2050. This highlights the importance of ongoing initiatives to enhance environmental policies and practices in SA. Future research should focus on integrating load-shedding dynamics into the analysis to examine and confirm its effects on energy production, economic growth, and CO2E in SA. Additionally, future research should focus on forecasting future change-points for the socio-economic indicators or variables utilised in this study. This can help policymakers anticipate fluctuations and devise proactive strategies to address environmental and economic challenges effectively. It is also recommended that future research consider the output of renewable energy production when analysing CO2E trends.
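The regression metrics and the repeated holdout idea named in this abstract can be sketched as follows. This is a hedged illustration only: the model is a plain least-squares line standing in for AdaBoost, the researcher's additional accuracy score is omitted because the abstract does not define it, and the function names, split fraction, repeat count, and data are assumptions.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x (stand-in model, not AdaBoost)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def repeated_holdout(xs, ys, repeats=20, test_fraction=0.3, seed=0):
    """Average RMSE, MAE and R2 over several random train/test splits."""
    rng = random.Random(seed)
    n = len(xs)
    n_test = max(2, int(round(n * test_fraction)))
    scores = {"rmse": [], "mae": [], "r2": []}
    for _ in range(repeats):
        indices = list(range(n))
        rng.shuffle(indices)  # a fresh random split on every repeat
        test, train = indices[:n_test], indices[n_test:]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        y_true = [ys[i] for i in test]
        y_pred = [model(xs[i]) for i in test]
        scores["rmse"].append(rmse(y_true, y_pred))
        scores["mae"].append(mae(y_true, y_pred))
        scores["r2"].append(r2(y_true, y_pred))
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}


# Hypothetical, perfectly linear data: errors should be close to zero.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
print(repeated_holdout(xs, ys))
```

Averaging over many random splits reduces the dependence of the reported scores on any single partition of a small dataset, which is the usual motivation for repeating the holdout.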
  • Predicting serious crime trends in South Africa using data analytic techniques
    (2024) Falope, Olayemi Success; Thakur, Surendra Colin
    This dissertation aims to investigate the application of data analytics in forecasting serious crime trends in South Africa. The escalating rates of serious crimes, including homicide, robbery, and sexual assault, present significant challenges to the country's economic growth and the safety of its citizens. Recent South African crime statistics indicate a notable increase of roughly 9.6% in serious crimes, rising from 444,452 incidents in December 2021 to 486,960 in December 2022. This upward trajectory underscores the urgency of predicting future serious crimes, so that law enforcement agencies, policymakers, and community organizations can develop proactive strategies to prevent and mitigate criminal activities. To achieve this objective, this study employs a comprehensive dataset comprising historical crime records and spatial data to analyse serious crime trends across South Africa's nine provinces from 2005 to 2020. Data pre-processing techniques are applied to clean and normalize the data, ensuring its suitability for subsequent analysis. Exploratory data analysis is conducted using Python (Anaconda) and the Flourish studio environment to identify patterns, relationships, and potentially influential factors associated with serious crimes in South Africa. Various data analytics techniques, including machine learning algorithms, time series analysis, and spatial analysis, are utilized to construct models for predicting serious crime trends. These predictive models are trained using historical crime data and relevant contextual features, facilitating the identification of patterns and correlations that could inform future crime trends. The evaluation of these predictive models involves rigorous performance metrics and validation techniques to assess their predictive power, stability, and generalizability.
The results reveal an increase in serious crime across South Africa, with certain provinces emerging as hotspots for specific serious crimes, such as Gauteng with a 21% increase in sexual crimes, KwaZulu-Natal with a 23.1% increase in murders, and the Western Cape with a 38% increase in drug-related crimes. This dissertation contributes to the field of crime analysis by presenting a comprehensive approach to predicting serious crime trends in South Africa. The insights gained from this research can inform the development of proactive strategies and resource allocation by law enforcement agencies, policymakers, and community organizations to address serious crimes effectively. Furthermore, this study lays the groundwork for future research in crime prediction and prevention, highlighting the potential of data analytics techniques in tackling complex societal issues. Future research may explore advanced techniques such as ensemble learning and deep learning to enhance the accuracy and robustness of predictive models.
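The year-on-year increase quoted in this abstract (444,452 incidents in December 2021 and 486,960 in December 2022) can be checked with simple arithmetic. The helper name below is an assumption for illustration.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


# Figures taken from the abstract above.
increase = percent_change(444_452, 486_960)
print(round(increase, 2))  # 9.56
```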
  • Predicting at-risk students in a higher educational institution in Ghana for early intervention using machine learning
    (2023) Tahiru, Fati; Parbanath, Steven
    Learning analytics (LA) uses data and evidence to suggest a better learning approach that suits a particular student. This data and evidence are gathered from students’ online engagement with systems such as Blackboard, Moodle, Sakai, eLibrary platforms, and other e-learning platforms. LA continues to gain attention as digitization of the learning environment advances. It allows educators to analyze and interpret data correctly, setting in motion strategies that offer points of leverage and performance for and among students. The use of predictive systems and Early Warning Systems (EWS) in education addresses the issue of student dropout and suggests interventions for improving students’ performance. High dropout rates in education continue to be a global challenge; however, EWS provide a way to curb the problem in various developed nations, such as the United States, Australia, and the United Kingdom. Developing countries face similar dropout problems in the educational sector, but little LA research has been undertaken to identify the interventions needed to improve the situation. Some studies have designed models predicting student failure and success, student attrition, student performance and final grades. Most of these studies have focused only on virtual learning environment (VLE) datasets. In contrast, this study uses student “activity logs”, “student courses”, “demographics”, and “student assessments” to design a predictive model that identifies at-risk students (ARS), that is, students at risk of not graduating. The purpose of this study is to use LA and Machine Learning (ML) to analyse the characteristics and behaviours of students in order to identify those who may need support to improve their academic performance. The study adopted the systematic literature review (SLR) approach to determine which emerging ML tools and techniques have been applied successfully in designing predictive systems in education.
The SLR enabled the study to identify ML methods and the features that have been used in the domain of predictive systems in education. The study used an integrated 5-step LA process and ML workflow to predict which students are likely to drop out. Using the OULAD dataset, the findings indicated that non-graduated students tended not to revise the learning materials early, before the final exams. Although it was noted that both graduated and non-graduated students access the learning materials simultaneously, variations were recorded in the habits of assignment submission and revision patterns. Graduated students recorded higher clicks for accessing VLE activities than non-graduated students, which signifies that the graduated students interacted more with course activities than non-graduated students. The study also compared different ML algorithms and determined the method that achieved the best predictive accuracy and could be adapted in higher educational institutions. The evaluation of the models concluded that the ensemble machine-learning methods outperformed the traditional methods. The Random Forest ensemble learning algorithm outperformed GB, CatBoost, KNN, LG and NB on accuracy, precision, recall and F1 score. The study identified important features such as “date of-assignment-submission”, “sum_clicks-of-activities”, “score on the assessment”, “date-of registration”, “date-of-assignment-submission”, “studied-credits”, and “date-the-student unregistered” for predicting student dropout in higher educational institutions (HEI). The model was trained with the important features to predict ARS and achieved an accuracy of 92% in less time than training with all the features. The research indicated that implementing LA and ML techniques can effectively identify students at risk of withdrawing from higher education.
In view of this, the study concluded that targeted interventions can be developed to mitigate the risk of students dropping out of school through improved learning outcomes.
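The classification metrics used in this abstract to compare models (accuracy, precision, recall and F1 score) can be computed from predicted and true labels as in the sketch below. The function name, the label encoding (1 for an at-risk student, 0 otherwise) and the example labels are assumptions for illustration, not data from the study.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = at-risk student, 0 = not at risk.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(classification_metrics(y_true, y_pred))
```

For early-warning use, recall on the at-risk class is often the figure to watch, since a missed at-risk student receives no intervention, whereas a false alarm only costs an unnecessary check-in.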
  • Software reliability prediction of mobile applications using machine learning techniques
    (2021-04-30) Hoosen, Sumaya; Singh, Alveen
    Software reliability is an important aspect in evaluating the quality of a software product. In a growing global software industry of increasingly complex systems, reliability becomes crucial, urging software engineers to strive toward the development of failure-free software and to ensure high reliability before delivery. This positions software reliability as one of the key attributes required to achieve high-quality software products. In response, software companies invest considerable resources, boosting app development into a multi-billion Rand global industry. In recent times, smart devices have become established among the most used electronic devices, with apps being a popular medium for bringing a multitude of functionalities to a wide user base. However, current literature portrays a far from ideal reliability rate for apps. Despite the availability of a wide range of approaches focused on improved reliability, these mostly remain cumbersome and costly to implement from a software management perspective. Hence, there is a need to investigate approaches beyond the current dominant thinking that underpins reliability measurements in the mobile app development space. At the same time, Machine Learning (ML) has recently received much attention from researchers and practitioners, and offers a range of tools and techniques that, when applied correctly, could improve reliability prediction. In line with the above, the overall aim of this study is to provide an ML modelling approach to assist with the reliability prediction of mobile apps. It is hoped that the findings of this study may provide a useful ML modelling approach to help developers increase the reliability rates of apps. For this study, ML techniques were applied to three feature sets of data extracted from the Eclipse JDT core dataset.
These feature sets, based on software systems and their histories, comprise the source code metric set, the process metric set, and a combination of both metric sets. All metric sets went through stages of data cleaning and pre-processing before they were modelled using five machine learning algorithms, namely Random Forest, Support Vector Machine, Naïve Bayes, Decision Trees and Neural Networks. During the modelling process, all the results were evaluated using ML evaluation scores to determine which ML modelling approach is most useful for reliability prediction. The results indicate that Random Forest generated better results in all cases and can be used for predicting app reliability, since it predicted reliability more accurately and precisely than the other ML algorithms. Random Forest also achieved the highest evaluation score when it was applied to the combined metric set. This means that the modelling approach of applying Random Forest to a combination of source code and process metrics generated the highest prediction performance. This further implies that developers should consider the selected features within the combined metric set, as they could serve as useful indicators for predicting the reliability of apps.