Title: Focused crawling of online business Web pages using latent semantic indexing approach
Authors: Thamer Salah, Sabrina Tiun
Journal: ARPN Journal of Engineering and Applied Sciences
Publisher: Khyber Medical College, Peshawar
Country: Pakistan
Year: 2016
Volume: 11
Issue: 15
Language: English
With the exponential growth of textual information available on the Internet, there is an emerging need to find relevant, timely, and in-depth knowledge about business topics. The sheer size of this data makes retrieving, analyzing, and using the valuable information in such texts manually a very difficult task. In this paper, we address a challenging task: crawling business-specific knowledge on the Web. The main goal of the paper is to describe a new method of focused crawling with latent semantic indexing for online business web pages. We describe a new model for online business text crawling that seeks, acquires, maintains, and filters business pages. The model consists of two main modules: a crawling system and a text filtering system. The crawler collects as many web pages as possible from news websites. This focused crawler is guided by a latent semantic index and information from WordNet (the business filter), which learns to recognize the relevance of a web page with respect to the business topic; it also uses a set of domain-specific keywords. Results on real-world online data show that the focused crawler is very effective for building high-quality collections of business Web documents.
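The two-module design described in the abstract (a crawler plus a text filter that steers it) can be illustrated with a minimal sketch. This is not the authors' implementation. The keyword list, the toy in-memory "web", the threshold, and the function names `relevance` and `focused_crawl` are all assumptions made for illustration, and the scoring function is a plain keyword-overlap stand-in for the LSI and WordNet based relevance score that the paper uses.

```python
import heapq
import re

# Hypothetical domain keywords; the paper's actual keyword set is not given here.
BUSINESS_KEYWORDS = {"market", "stock", "profit", "investor", "revenue", "shares"}

# Toy in-memory "web": URL -> (page text, outgoing links).
# It stands in for real HTTP fetches from news websites.
TOY_WEB = {
    "news/home": ("Latest market and sports headlines", ["news/biz1", "news/sport1"]),
    "news/biz1": ("Stock market rally lifts investor profit and revenue", ["news/biz2"]),
    "news/biz2": ("Company shares rise as profit beats forecast", []),
    "news/sport1": ("Team wins the cup final after extra time", ["news/sport2"]),
    "news/sport2": ("Coach praises players after the match", []),
}


def relevance(text, keywords=BUSINESS_KEYWORDS):
    """Fraction of tokens that are domain keywords (stand-in for an LSI score)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in keywords for t in tokens) / len(tokens)


def focused_crawl(seed, web=TOY_WEB, threshold=0.25, max_pages=10):
    """Best-first crawl: links found on more relevant pages are visited first."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    collected = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        fetched += 1
        score = relevance(text)
        # Text filtering module: keep only pages judged on-topic.
        if score >= threshold:
            collected.append(url)
        # Crawling module: outgoing links inherit the parent page's score.
        for link in links:
            if link not in seen and link in web:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return collected


print(focused_crawl("news/home", max_pages=3))
```

With a budget of three fetches, the sketch spends them on the home page and the two business pages and never reaches the sports pages. That is the intended effect of a focused crawler: the relevance score decides where the limited crawl budget goes. Swapping `relevance` for a score computed in a latent semantic space would bring the sketch closer to the approach the abstract describes.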