Early prediction of students at risk in a virtual learning environment using ensemble machine learning techniques

Singh, Alveen
Soobramoney, Ranjin
2022-06-15
2022-06-15
2021-12-13
https://hdl.handle.net/10321/4072

Submitted in fulfillment of the requirements for the Degree of Masters of Information and Communication Technology, Durban University of Technology, Durban, South Africa, 2021.

Students at risk (SAR) are those students who are considered to have a higher probability of failing academically or dropping out of an academic programme. The literature reveals that SAR is a global problem at Higher Education Institutions (HEIs). A high failure rate can not only harm the reputation of HEIs but, if left unchecked, can be detrimental to them. The problem of identifying SAR is a pervasive and persistent one. However, early identification of SAR will allow for timely and focused interventions, thereby reducing the problem. Various techniques have been used by HEIs to identify SAR. The traditional statistical approach is one such technique. One of the key challenges with this technique, however, is that it often requires a large amount of manual analysis of the data to predict SAR, which in turn makes early prediction of SAR more computationally challenging. To overcome some of the challenges of the traditional statistical approach, machine learning-based techniques have been proffered to predict SAR. Since machine learning (ML) models are based on the input data rather than the underlying problem, they are expected to have better predictive capabilities than traditional statistical models. Several ML-based techniques have been applied to predict SAR with varying degrees of success. This study proposes the use of ensemble ML techniques for early and accurate prediction of SAR using students’ demographic and weekly online Virtual Learning Environment (VLE) data. Aggregating the predictions of a group of ML classifiers is expected to provide better generalization performance than any of the individual classifiers on its own.
The use of ensemble ML techniques for this study will provide an improved solution to the problem of predicting SAR. To this end, this study focused on training forty different ML predictive models, one for each week of the semester, using twenty-five different ML classifiers. Each model was trained using students’ demographic data combined with data from their weekly interactions with a VLE. Based on the training results, four classifiers, namely AdaBoostClassifier, LGBMClassifier, RandomForestClassifier, and XGBClassifier, were selected as base learners for the ensemble classifier. Hyperparameter optimization was performed using Random Search on each of the four classifiers. These classifiers were then used to create a voting classifier ensemble for each of the forty weeks, with 10-fold cross-validation being used to evaluate the predictive models. The results show that the voting classifier ensemble method outperformed the individual classifiers overall across the forty weeks and can thus provide an improved solution to the problem of predicting SAR.

126 p
en
Students at Risk
Ensemble learning
Lazypredict
Machine Learning Algorithms
Virtual Learning Environment
Computer-assisted instruction--South Africa
Academic achievement
Underprepared college students--South Africa
Web-based instruction
Thesis
https://doi.org/10.51415/10321/4072