Title: Towards Sindhi Corpus Construction
Authors: Mutee U Rahman
Journal: Linguistics and Literature Review
Publisher: University of Management & Technology
Country: Pakistan
Year: 2015
Volume: 1
Issue: 1
Language: English
DOI: 10.32350/llr/11/04
Keywords: script, corpus construction, unigram, bigram, trigram frequencies, orthography
The paper discusses the current state of Sindhi corpus construction. Sindhi corpus development issues, including corpus acquisition, preprocessing, and tokenization, are discussed in detail. Preliminary results and observations are presented, including letter unigram, bigram, and trigram frequencies, word frequencies, and word bigram frequencies. The current state of the Sindhi corpus, with its limitations and future work, is also discussed. The paper also explores the orthography and script of the Sindhi language with reference to corpus development.
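As a rough illustration of the kinds of counts the abstract mentions, the following minimal sketch computes letter n-gram and word n-gram frequencies in Python. It is not the paper's method: the function names `letter_ngrams` and `word_ngrams` are hypothetical, tokenization is assumed to be a plain whitespace split (the paper treats Sindhi tokenization as a nontrivial issue), and each Unicode code point is assumed to count as one letter, with no normalization of diacritics. The sample string is a placeholder, not data from the paper.

```python
from collections import Counter


def letter_ngrams(text, n):
    """Count letter n-grams within each whitespace-separated token.

    Assumption: one Unicode code point equals one letter.
    """
    counts = Counter()
    for token in text.split():
        # Slide a window of width n across the token.
        for i in range(len(token) - n + 1):
            counts[token[i:i + n]] += 1
    return counts


def word_ngrams(text, n):
    """Count word n-grams over a naive whitespace tokenization."""
    tokens = text.split()
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


# Placeholder sample text (three tokens, one of them repeated).
sample = "سنڌي ٻولي سنڌي"

letter_unigrams = letter_ngrams(sample, 1)
letter_bigrams = letter_ngrams(sample, 2)
letter_trigrams = letter_ngrams(sample, 3)
word_unigrams = word_ngrams(sample, 1)
word_bigrams = word_ngrams(sample, 2)

# Most frequent items first.
print(word_unigrams.most_common(3))
print(letter_bigrams.most_common(3))
```

Counting letter n-grams inside tokens, as done here, keeps word boundaries out of the letter statistics; a real pipeline would replace `text.split()` with a script-aware tokenizer and run over the full corpus.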