Title: Finding Related Web Pages in Parallel by Using Grouped Link Structures
Authors: Shen Xiaoyan, Chen Junliang, Meng Xiangwu, Zhang Yujie
Journal: Information Technology Journal
Publisher: Asian Network for Scientific Information (ANSInet)
Country: Pakistan
Year: 2009
Volume: 8
Issue: 4
Language: English
Keywords: Parallel, Scalable, Related pages, co-citation algorithm, HTML segmentation
In this study, a block co-citation algorithm is proposed to find related pages for a given web page in two steps. First, all hyperlinks in a web page are segmented into several blocks according to the HTML structure and text style information. Second, for each page, the similarity between every pair of hyperlinks in the same block is computed. The total similarity from one page to another is obtained once all web pages have been processed. For a given page u, the pages with the highest total similarity to u are selected as the related pages of u. The block co-citation algorithm was implemented in parallel and used to analyze a corpus of 37,482,913 pages sampled from a commercial search engine, demonstrating its feasibility and efficiency. Experimental results for 28 pages pertaining to 7 topics indicated that the block co-citation algorithm outperforms the traditional co-citation algorithm. The method is well suited to application in commercial search engines.
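The two steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the similarity formula or the segmentation rules, so the sketch assumes that blocks have already been extracted and that each co-occurrence of two links in the same block adds 1 to their total similarity. The names `block_cocitation` and `related_pages`, and the sample data, are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def block_cocitation(pages):
    """Accumulate pairwise similarity between links that share a block.

    `pages` is an iterable of pages; each page is a list of blocks and
    each block is a list of link targets (already segmented).
    Assumption: every co-occurrence in a block contributes 1.0.
    """
    total = defaultdict(lambda: defaultdict(float))
    for blocks in pages:
        for block in blocks:
            # Deduplicate links within a block before pairing them.
            for a, b in combinations(sorted(set(block)), 2):
                total[a][b] += 1.0
                total[b][a] += 1.0
    return total


def related_pages(total, u, k=3):
    """Return up to k pages with the highest total similarity to u."""
    scores = total.get(u, {})
    # Sort by descending score, breaking ties by name for determinism.
    return sorted(scores, key=lambda v: (-scores[v], v))[:k]


# Hypothetical corpus: three pages, each already split into link blocks.
pages = [
    [["a", "b", "c"], ["x", "y"]],
    [["a", "b"], ["c", "x"]],
    [["a", "b", "y"]],
]
total = block_cocitation(pages)
print(related_pages(total, "a", k=2))  # ['b', 'c']
```

Under this assumed scoring, each page's contribution is simply added to the totals, so pages could be partitioned across workers and the partial totals merged afterwards. That is one plausible way to parallelize the computation, not a description of the setup used in the paper.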