IMPROVING THE ACCURACY OF IMBALANCED DATASET USING K-MEANS CLUSTERING


Article Information

Title: IMPROVING THE ACCURACY OF IMBALANCED DATASET USING K-MEANS CLUSTERING

Authors: Adnan Saeed, Dr. Anwar Ali Sanjrani, Syed Khalid Shah Bukhari, Shabeer Ahmad

Journal: Spectrum of Engineering Sciences

HEC Recognition History
Category  From        To
Y         2024-10-01  2025-12-31

Publisher: Sociology Educational Nexus Research Institute

Country: Pakistan

Year: 2025

Volume: 3

Issue: 9

Language: en

Keywords: IMPROVING THE ACCURACY OF IMBALANCED DATASET USING K-MEANS CLUSTERING

Abstract

Class imbalance is a significant obstacle in predictive modelling, frequently producing biased models that perform poorly on minority classes. Conventional classification methods tend to favour the majority class, leading to suboptimal recall and precision for the critical minority outcomes. To address this issue, this paper proposes prediction approaches that combine unsupervised K-Means clustering with supervised classification algorithms. The key idea is to capture underlying group-level behavioural patterns through clustering and then supply the resulting cluster labels as auxiliary features in the classification pipeline. This hybrid approach aims to augment the feature space, enhance model sensitivity to the minority class, and ultimately improve overall predictive power. Experiments on two real-world customer churn datasets showed that models trained with cluster labels consistently performed better than models trained without them on all key performance metrics. Most notably, combining K-Means clustering with the K-Nearest Neighbours (KNN) classifier yielded a significant performance gain over the baseline models in which either was used alone. The proposed framework is an effective and practical way to address data imbalance in customer churn prediction and similar classification tasks.
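
The pipeline described in the abstract (cluster first, then pass the cluster labels to a classifier as extra features) can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' implementation. The toy data, the function names (`fit_kmeans`, `add_cluster_features`, `knn_predict`), the one-hot encoding of the cluster label, and the choices k=2 clusters and 3 neighbours are all assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))


def fit_kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: returns k centroids fitted to X."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(X, k)]
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in X]
        for c in range(k):
            members = [p for p, lab in zip(X, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def add_cluster_features(X, centroids):
    """Append a one-hot cluster label to every row (encoding is an assumption)."""
    augmented = []
    for row in X:
        c = nearest(row, centroids)
        one_hot = [1.0 if i == c else 0.0 for i in range(len(centroids))]
        augmented.append(list(row) + one_hot)
    return augmented


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training rows closest to `query`."""
    ranked = sorted(zip(train_X, train_y), key=lambda t: math.dist(t[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical imbalanced toy data: 8 majority rows (class 0), 3 minority rows (class 1).
X_train = [[0.0, 0.5], [1.0, 0.0], [0.5, 1.0], [1.5, 0.5],
           [0.0, 1.5], [1.0, 1.0], [0.5, 0.0], [1.5, 1.5],
           [9.0, 9.0], [10.0, 11.0], [11.0, 10.0]]
y_train = [0] * 8 + [1] * 3

centroids = fit_kmeans(X_train, k=2)
X_train_aug = add_cluster_features(X_train, centroids)
X_query_aug = add_cluster_features([[10.0, 10.0], [0.5, 0.5]], centroids)

predictions = [knn_predict(X_train_aug, y_train, q) for q in X_query_aug]
print(predictions)
```

The centroids are fitted on the training rows only and then reused to label the query rows, so the classifier sees cluster features that were derived without access to the data it is asked to predict. On real churn data, the features would normally be scaled before clustering, and minority-class recall and precision would be compared with and without the cluster features.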

