Title: Enhancing OCR: A Novel Segmentation Approach for Pashto Text Images into Characters
Journal: Journal of Engineering and Applied Sciences
Publisher: University of Engineering & Technology, Peshawar
Country: Pakistan
Year: 2024
Volume: 43
Issue: 1
Language: en
Abstract: This paper presents a novel approach to segmenting typed Pashto text images into individual characters, addressing a critical challenge in Optical Character Recognition (OCR) for this language. Pashto is written right to left in a highly cursive script similar to that of Arabic and Urdu, and it poses unique segmentation difficulties because the shape of each character varies with its position in a word. The segmentation of Pashto characters remains an underdeveloped area in language processing, which significantly hinders OCR performance. To tackle this, an image database of isolated Pashto characters was created. Pashto text samples were generated in Microsoft Word and saved as Bitmap (BMP) images for processing. The images were preprocessed by converting them to binary form and removing noise, and were then segmented into their constituent characters by the proposed algorithm, which measures pixel strength to split words into characters. The algorithm achieved a segmentation accuracy of 84.6%, verified through manual analysis, although it also produced some new, unwanted characters (garbage). This work is a significant step toward improving OCR for the Pashto language: it offers a method for character segmentation, which is fundamental to the development of an accurate Pashto OCR system.
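The abstract does not spell out how pixel strength is measured, so the following is only an illustrative sketch of one common reading: binarize the image, count ink pixels per column (a vertical projection profile), and cut wherever the count drops to a thin connecting stroke or a gap. It is not the paper's algorithm; the function names, the `threshold` of 128, and the `max_strength` cut-off are all assumptions made for this example.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 1 = ink, 0 = background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def column_strength(binary):
    """Count ink pixels in each column (a vertical projection profile)."""
    return [sum(col) for col in zip(*binary)]


def split_points(strength, max_strength=1):
    """Return (start, end) column ranges whose strength exceeds max_strength.

    Columns at or below max_strength are treated as candidate cuts,
    for example blank gaps or a thin baseline stroke joining two letters.
    """
    segments, start = [], None
    for x, s in enumerate(strength):
        if s > max_strength and start is None:
            start = x                      # a character body begins
        elif s <= max_strength and start is not None:
            segments.append((start, x))    # the body ends before column x
            start = None
    if start is not None:
        segments.append((start, len(strength)))
    return segments


def segment_right_to_left(gray, threshold=128, max_strength=1):
    """Segment a word image and order the pieces right to left."""
    binary = binarize(gray, threshold)
    segments = split_points(column_strength(binary), max_strength)
    return segments[::-1]  # reverse column order to follow reading direction


# Toy 5x9 image: two dark blobs joined by a one-pixel baseline (0 = black).
W, B = 255, 0
toy = [
    [W, W, W, W, W, W, W, W, W],
    [W, B, B, W, W, W, B, B, W],
    [W, B, B, W, W, W, B, B, W],
    [W, B, B, B, B, B, B, B, W],
    [W, W, W, W, W, W, W, W, W],
]
print(segment_right_to_left(toy))
```

A fixed cut-off like this cannot tell a thin connecting stroke from a thin part of a letter, so it can split one character into several pieces. That is one plausible source of the unwanted segments the abstract reports, although the abstract itself does not state the cause.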