Exploring Character-Based Stylometry Features Using Machine Learning for Intrinsic Plagiarism Detection in Urdu

Article Information

Title: Exploring Character-Based Stylometry Features Using Machine Learning for Intrinsic Plagiarism Detection in Urdu

Authors: Muhammad Faraz Manzoor, Muhammad Shoaib Farooq, Muntazir Mehdi, Adnan Abid

Journal: International Journal of Innovations in Science & Technology

HEC Recognition History

Category	From	To
Y	2024-10-01	2025-12-31
Y	2023-07-01	2024-09-30
Y	2021-07-01	2022-06-30

Publisher: 50SEA JOURNALS (SMC-PRIVATE) LIMITED

Country: Pakistan

Year: 2024

Volume: 6

Issue: 7

Language: English

Keywords: Intrinsic, Plagiarism, Urdu, Stylometry.

Abstract

Plagiarism detection in natural language processing (NLP) plays a crucial role in maintaining textual integrity across various domains, particularly for low-resource languages like Urdu. This study addresses the emerging challenge of intrinsic plagiarism detection in Urdu, an area with limited research due to the scarcity of datasets and model resources. To bridge this gap, our research investigates the use of character-based stylometric features in combination with machine learning (ML) and deep learning (DL) models specifically designed for Urdu text analysis. We conducted a series of experiments to evaluate the performance of several classifiers, including Random Forest, AdaBoost, K-Nearest Neighbor (KNN), Decision Tree, Gaussian Naive Bayes, and Long Short-Term Memory (LSTM) networks. Our results show that KNN and LSTM achieved the highest accuracy at 74%, with KNN outperforming the others in terms of F1-score (64.3%), highlighting its balanced performance across accuracy, precision, and recall. AdaBoost followed closely with an accuracy of 73% and a precision of 77.5%, although its F1-score was slightly lower at 63.6%. These findings emphasize the need for specialized approaches in NLP for Urdu, demonstrating that tailored ML and DL techniques can significantly improve intrinsic plagiarism detection in low-resource languages.

Disclaimer: The following sections are produced through an intensive multi-stage analytical pipeline developed by DefinePK, utilizing advanced natural-language processing, domain-specific knowledge models, and automated cross-referencing systems. Despite the rigor and computational effort involved, the generated summary may still contain inaccuracies or incomplete interpretations. Users are strongly advised to verify all critical information against the original paper and its authoritative sources.

Research Objective

To investigate the use of character-based stylometric features in combination with machine learning (ML) and deep learning (DL) models for intrinsic plagiarism detection in Urdu text, and to compare the performance of various classifiers for this task.

Methodology

The study used a dataset built specifically for sentence-level intrinsic plagiarism detection in Urdu, collected from various Urdu-language sources. Character-based stylometry features were extracted from the text. Six classifiers were trained: Random Forest, Decision Tree, K-Nearest Neighbors (KNN), Gaussian Naive Bayes, AdaBoost, and Long Short-Term Memory (LSTM). Principal Component Analysis (PCA) was applied for dimensionality reduction. Performance was evaluated using accuracy, precision, recall, and F1-score.
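To make "character-based stylometry features" concrete, here is a minimal sketch of what such features could look like for a single sentence. The summary does not list the paper's actual feature set, so the features below (length, whitespace, punctuation and digit ratios, average word length, character n-gram counts) and the function names `char_features` and `char_ngrams` are illustrative assumptions, not the authors' method.

```python
import unicodedata
from collections import Counter


def char_features(sentence: str) -> dict:
    """Illustrative character-level style features for one sentence.

    Assumption: these particular features are examples only; the paper's
    exact feature set is not given in this summary.
    """
    n = len(sentence)
    if n == 0:
        return {"length": 0.0, "space_ratio": 0.0, "punct_ratio": 0.0,
                "digit_ratio": 0.0, "avg_word_len": 0.0}
    words = sentence.split()
    spaces = sum(ch.isspace() for ch in sentence)
    # Unicode category "P*" covers Latin and Urdu punctuation alike.
    puncts = sum(unicodedata.category(ch).startswith("P") for ch in sentence)
    # str.isdigit() also recognises Extended Arabic-Indic digits.
    digits = sum(ch.isdigit() for ch in sentence)
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "length": float(n),
        "space_ratio": spaces / n,
        "punct_ratio": puncts / n,
        "digit_ratio": digits / n,
        "avg_word_len": avg_word_len,
    }


def char_ngrams(sentence: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams (bigrams by default)."""
    return Counter(sentence[i:i + n] for i in range(len(sentence) - n + 1))


if __name__ == "__main__":
    example = "یہ ایک مثال ہے۔"  # "This is an example."
    print(char_features(example))
    print(char_ngrams(example).most_common(3))
```

Because the features operate on Unicode characters rather than tokens, a sketch like this needs no Urdu tokenizer or stemmer, which is one plausible reason character-level features suit a low-resource language.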

Methodology Flowchart

graph TD;
    A[Data Collection & Dataset Creation] --> B[Stylometry Feature Extraction];
    B --> C[Data Pre-processing];
    C --> D[Model Training: RF, DT, KNN, NB, AdaBoost, LSTM];
    D --> E[Performance Evaluation: Accuracy, Precision, Recall, F1-score];
    E --> F[Comparison and Conclusion];

Discussion

The research highlights the effectiveness of character-based stylometry features and machine learning models for intrinsic plagiarism detection in Urdu, a low-resource language. The superior performance of KNN is attributed to its non-parametric nature and flexibility in handling complex patterns in high-dimensional data. The findings suggest that specialized approaches are necessary for Urdu NLP tasks, and the developed methods show promise compared to existing techniques in other languages.
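The "non-parametric" point can be illustrated with a small sketch: KNN fits no model parameters, it simply stores the training vectors and lets the nearest ones vote. The function name `knn_predict`, the Euclidean distance, the value `k=3`, and the toy two-dimensional vectors are all assumptions for illustration; the summary does not state the paper's distance metric or value of k.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest training points.

    Assumption: Euclidean distance and k=3 are illustrative choices only.
    """
    # Pair each training point's distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy feature vectors: label 0 = original style, 1 = deviating style.
    X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
    y = [0, 0, 0, 1, 1, 1]
    print(knn_predict(X, y, (0.05, 0.1)))
    print(knn_predict(X, y, (1.0, 0.95)))
```

Since the decision depends only on local neighbourhoods, the classifier makes no assumption about the overall shape of the class boundary, which is the flexibility the discussion refers to.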

Key Findings

KNN and LSTM achieved the highest accuracy at 74%. KNN outperformed other classifiers in F1-score (64.3%), indicating balanced performance. AdaBoost showed high precision (77.5%) but a slightly lower F1-score (63.6%). The study demonstrated that tailored ML and DL techniques can significantly improve intrinsic plagiarism detection in low-resource languages like Urdu.
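The relationship between the reported metrics can be sketched in a few lines. The helper names `precision_recall_f1` and `recall_from_precision_f1` are hypothetical, and the second helper assumes the reported F1-score is the plain harmonic mean of a single precision and recall value for the same class (not a macro or weighted average), which the summary does not confirm.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def recall_from_precision_f1(precision: float, f1: float) -> float:
    """Invert F1 = 2PR / (P + R) to get R = F1 * P / (2P - F1).

    Assumption: F1 is the harmonic mean of this precision and one recall.
    """
    return f1 * precision / (2 * precision - f1)


if __name__ == "__main__":
    # Toy counts, not taken from the paper.
    print(precision_recall_f1(tp=8, fp=2, fn=4))
    # AdaBoost figures reported above: precision 77.5%, F1-score 63.6%.
    print(round(recall_from_precision_f1(0.775, 0.636), 3))
```

Under that assumption, AdaBoost's precision of 77.5% and F1-score of 63.6% would imply a recall of roughly 54%, which would explain how a high-precision model ends up with a lower F1-score. This is a back-of-the-envelope inference, not a figure reported by the study.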

Conclusion

This study developed and evaluated methods for intrinsic plagiarism detection in Urdu using character-based stylometry features and several machine learning classifiers. KNN emerged as the most effective classifier overall, tying LSTM for the highest accuracy (74%) and leading on F1-score (64.3%). The research helps close the gap in plagiarism detection tools for Urdu and lays the foundation for future work, including the exploration of transfer learning and dataset expansion.

Fact Check

1. Accuracy of KNN and LSTM: The study reports that KNN and LSTM achieved the highest accuracy at 74%.
2. KNN's F1-score: KNN achieved the highest F1-score of 64.3%.
3. Dataset Size: The dataset consists of 2,520 documents, evenly split into plagiarized and non-plagiarized categories.
