
ADVANCED MULTIVARIATE STATISTICAL METHODS FOR HIGH-DIMENSIONAL DATA MODELING, PREDICTION, AND INTERPRETATION


Article Information

Title: ADVANCED MULTIVARIATE STATISTICAL METHODS FOR HIGH-DIMENSIONAL DATA MODELING, PREDICTION, AND INTERPRETATION

Authors: Asjad Ali, Kashifa Basheer, Muhammad Nadeem, Umme Habiba, Waqas Arif, Hadia Tabassum, Muhammad Anas Waqar, Muhammad Ibrar Ali, Shahzaib Khan

Journal: Spectrum of Engineering Sciences

HEC Recognition History
Category: Y (from 2024-10-01 to 2025-12-31)

Publisher: Sociology Educational Nexus Research Institute

Country: Pakistan

Year: 2025

Volume: 3

Issue: 9

Language: en

Keywords: Tax evasion, Classification, Data Mining, Transparency, Financial Crimes

Abstract

Tax evasion and financial crimes persistently threaten economic stability and erode public trust in fiscal systems, yet current detection techniques often lack the analytical depth to uncover hidden, convoluted patterns in large-scale financial data. Despite global progress in forensic accounting and regulatory practice, a marked gap remains in integrating sophisticated data mining methods to identify anomalies across diverse datasets. This paper aimed to close that gap by developing and implementing a multi-layered analytical model that combines survival analysis, penalized regression, machine learning classification, and multivariate diagnostics to identify tax evasion and other financial crimes in the United States. Three heterogeneous datasets were analyzed: genomic-style high-dimensional financial records (n=200, p=5000), institutional transaction data (n=5200, p=300), and survey-based socioeconomic indicators (n=2500, p=220). Missing data were addressed with multiple imputation, multicollinearity was handled with variance inflation factor thresholds and principal component reduction, and robust statistical analyses were performed, including Cox regression, Elastic Net regression, Random Forest classification, and MANOVA. The genomic-style dataset yielded 12 significant predictors of fraudulent patterns (HR=1.51, 95% CI: 1.22-1.88, p=0.011), whereas on the financial dataset Elastic Net achieved a lower prediction error (RMSE=2.78%) than the baseline OLS (RMSE=3.45%). Survey-based modeling with Random Forest reached an AUC of 0.83, outperforming the other classifiers. Integrating clinical-style covariates verified the independent contributions of fraud-related variables (C-index=0.72).
These results underscore the ability of state-of-the-art data mining to make financial crime detection more timely, reduce false positives, and inform regulatory policy. The research contributes a repeatable framework that strengthens methodological rigor and enriches the literature on evidence-based detection of financial crimes.
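
The abstract reports three evaluation metrics (RMSE, AUC, and the C-index) without defining them. The following is a minimal pure-Python sketch of how each is commonly computed, included only to make the reported numbers easier to interpret. The function names and the toy inputs are illustrative assumptions and are not taken from the paper; the authors' actual implementation and software are not stated in the abstract.

```python
from itertools import combinations
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs in which the
    positive case receives the higher score; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def c_index(times, events, risks):
    """Harrell's concordance index for right-censored data: among
    comparable pairs, the share in which the subject with the shorter
    observed time also has the higher predicted risk."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that subject a has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Skip tied times and pairs whose earlier subject is censored.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5
    return concordant / comparable


# Hypothetical toy inputs, chosen only to exercise the functions.
toy_rmse = rmse([1, 2, 3, 4], [1, 2, 3, 6])
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
toy_c = c_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.3, 0.5, 0.1])
```

Under these definitions, a lower RMSE indicates smaller prediction error, while AUC and C-index values of 0.5 correspond to random ranking and 1.0 to perfect ranking.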

