Title: Exploring Character-Based Stylometry Features Using Machine Learning for Intrinsic Plagiarism Detection in Urdu
Authors: Muhammad Faraz Manzoor, Muhammad Shoaib Farooq, Muntazir Mehdi, Adnan Abid
Journal: International Journal of Innovations in Science & Technology
Publisher: 50SEA JOURNALS (SMC-PRIVATE) LIMITED
Country: Pakistan
Year: 2024
Volume: 6
Issue: 7
Language: English
Keywords: Intrinsic, Plagiarism, Urdu, Stylometry.
Plagiarism detection in natural language processing (NLP) plays a crucial role in maintaining textual integrity across various domains, particularly for low-resource languages like Urdu. This study addresses the emerging challenge of intrinsic plagiarism detection in Urdu, an area with limited research due to the scarcity of datasets and model resources. To bridge this gap, our research investigates the use of character-based stylometric features in combination with machine learning (ML) and deep learning (DL) models specifically designed for Urdu text analysis. We conducted a series of experiments to evaluate the performance of several classifiers, including Random Forest, AdaBoost, K-Nearest Neighbor (KNN), Decision Tree, Gaussian Naive Bayes, and Long Short-Term Memory (LSTM) networks. Our results show that KNN and LSTM achieved the highest accuracy at 74%, with KNN outperforming the others in terms of F1-score (64.3%), highlighting its balanced performance across accuracy, precision, and recall. AdaBoost followed closely with an accuracy of 73% and a precision of 77.5%, although its F1-score was slightly lower at 63.6%. These findings emphasize the need for specialized approaches in NLP for Urdu, demonstrating that tailored ML and DL techniques can significantly improve intrinsic plagiarism detection in low-resource languages.
The study aims to investigate the use of character-based stylometric features in combination with machine learning (ML) and deep learning (DL) models for intrinsic plagiarism detection in Urdu text, and to compare the performance of various classifiers on this task.
The study employed a dataset specifically designed for sentence-level intrinsic plagiarism detection in Urdu, collected from various Urdu-language sources. Character-based stylometry features were extracted. Six classifiers were used: Random Forest, Decision Tree, K-Nearest Neighbors (KNN), Naive Bayes, AdaBoost, and Long Short-Term Memory (LSTM). Principal Component Analysis (PCA) was used for dimensionality reduction. Performance was evaluated using accuracy, precision, recall, and F1-score.
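To make "character-based stylometry features" concrete, here is a minimal sketch of how one sentence might be turned into a numeric feature vector. The feature set and the function name `char_stylometry_features` are illustrative assumptions; the summary above does not list which character features the authors actually extracted. Vectors like these would then go through dimensionality reduction and into the classifiers.

```python
import unicodedata
from collections import Counter


def char_stylometry_features(sentence):
    """Return a few illustrative character-level style features for one sentence.

    Hypothetical feature set: the source does not enumerate the features it used.
    """
    chars = list(sentence)
    n = len(chars) or 1  # avoid division by zero on empty input
    words = sentence.split()
    # First letter of the Unicode general category:
    # L = letter, N = number, P = punctuation, Z = separator (spaces)
    cats = Counter(unicodedata.category(c)[0] for c in chars)
    return {
        "char_count": float(len(chars)),
        "word_count": float(len(words)),
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        "letter_ratio": cats["L"] / n,
        "digit_ratio": cats["N"] / n,
        "punct_ratio": cats["P"] / n,
        "space_ratio": cats["Z"] / n,
        "unique_char_ratio": len(set(chars)) / n,
    }


if __name__ == "__main__":
    # A short Urdu sentence written with escapes so the example is safe to copy;
    # \u06d4 is the Urdu full stop.
    sample = "\u06cc\u06c1 \u0627\u06cc\u06a9 \u062c\u0645\u0644\u06c1 \u06c1\u06d2\u06d4"
    for name, value in char_stylometry_features(sample).items():
        print(name, round(value, 3))
```

Using Unicode general categories rather than ASCII tests matters here, because Urdu letters and punctuation fall outside the ASCII range.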
graph TD;
A[Data Collection & Dataset Creation] --> B[Stylometry Feature Extraction];
B --> C[Data Pre-processing];
C --> D[Model Training: RF, DT, KNN, NB, AdaBoost, LSTM];
D --> E[Performance Evaluation: Accuracy, Precision, Recall, F1-score];
E --> F[Comparison and Conclusion];
The research highlights the effectiveness of character-based stylometry features and machine learning models for intrinsic plagiarism detection in Urdu, a low-resource language. The superior performance of KNN is attributed to its non-parametric nature and flexibility in handling complex patterns in high-dimensional data. The findings suggest that specialized approaches are necessary for Urdu NLP tasks, and the developed methods show promise compared to existing techniques in other languages.
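The "non-parametric nature" of KNN mentioned above can be seen in a small sketch: the classifier fits no parameters and simply stores the training vectors, then labels a new vector by majority vote among its nearest neighbours. The function name `knn_predict`, the Euclidean distance, the value of `k`, and the toy data are assumptions for illustration, not the study's configuration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training vector
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Count the labels of the k closest points and return the most common one
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy 2-D feature vectors: label 0 = original, 1 = plagiarized (hypothetical)
    X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
    y = [0, 0, 0, 1, 1, 1]
    print(knn_predict(X, y, (0.15, 0.1)))
    print(knn_predict(X, y, (1.0, 0.95)))
```

Because every prediction scans the stored training set, the decision boundary can follow irregular patterns in the data, at the cost of slower prediction on large datasets.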
KNN and LSTM achieved the highest accuracy at 74%. KNN outperformed other classifiers in F1-score (64.3%), indicating balanced performance. AdaBoost showed high precision (77.5%) but a slightly lower F1-score (63.6%). The study demonstrated that tailored ML and DL techniques can significantly improve intrinsic plagiarism detection in low-resource languages like Urdu.
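The following sketch shows how precision, recall, and F1-score relate, and why a classifier with high precision can still have a lower F1-score. The confusion counts in the usage example are hypothetical. The helper `implied_recall` inverts the F1 formula; applying it to the AdaBoost figures is only valid under the assumption that the reported precision and F1-score are plain single-class values rather than macro or weighted averages, which the summary does not state.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def implied_recall(precision, f1):
    """Solve F1 = 2PR / (P + R) for R, given P and F1."""
    return f1 * precision / (2 * precision - f1)


if __name__ == "__main__":
    # Hypothetical counts, not taken from the study
    print(precision_recall_f1(tp=30, fp=10, fn=20))
    # AdaBoost figures from the text; meaningful only under the assumption
    # stated above about how the metrics were computed
    print(round(implied_recall(0.775, 0.636), 3))
```

Under that assumption, a precision of 77.5% paired with an F1-score of 63.6% would correspond to a recall of roughly 54%, which is the kind of imbalance the harmonic mean penalizes.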
This study successfully developed and evaluated methods for intrinsic plagiarism detection in Urdu using character-based stylometry features and various machine learning classifiers. KNN emerged as the most effective classifier. The research contributes to addressing the gap in plagiarism detection tools for Urdu and lays the foundation for future work, including the exploration of transfer learning and dataset expansion.
1. Accuracy of KNN and LSTM: The study reports that KNN and LSTM achieved the highest accuracy at 74%.
2. KNN's F1-score: KNN achieved the highest F1-score of 64.3%.
3. Dataset Size: The dataset consists of 2,520 documents, evenly split into plagiarized and non-plagiarized categories.