Finding Related Web Pages in Parallel by Using Grouped Link Structures


Article Information

Title: Finding Related Web Pages in Parallel by Using Grouped Link Structures

Authors: Shen Xiaoyan, Chen Junliang, Meng Xiangwu, Zhang Yujie

Journal: Information Technology Journal

Publisher: Asian Network for Scientific Information (ANSInet)

Country: Pakistan

Year: 2009

Volume: 8

Issue: 4

Language: English

DOI: 10.3923/itj.2009.427.440

Keywords: Parallel, Scalable, Related pages, co-citation algorithm, HTML segmentation

Abstract

In this study, a block co-citation algorithm is proposed to find related pages for a given web page in two steps. First, all hyperlinks in a web page are segmented into several blocks according to the HTML structure and text style information. Second, for each page, the similarity between every pair of hyperlinks in the same block is computed. The total similarity from one page to another is then obtained after all web pages are processed. For a given page u, the pages with the highest total similarity to u are selected as the related pages of u. The block co-citation algorithm was implemented in parallel to analyze a corpus of 37,482,913 pages sampled from a commercial search engine, demonstrating its feasibility and efficiency. Experimental results for 28 pages pertaining to 7 topics indicated that the block co-citation algorithm outperforms the traditional co-citation algorithm. The method is well suited to application in commercial search engines.
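
The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the HTML segmentation has already been done, so each page arrives as a list of blocks and each block as a list of link targets. It also assumes that every co-occurrence of two links in the same block adds 1 to their similarity; the abstract does not state the actual weighting. The function names `block_cocitation` and `related_pages` are hypothetical, and the sketch runs serially rather than in parallel.

```python
from collections import defaultdict
from itertools import combinations


def block_cocitation(pages):
    """Accumulate pairwise similarity over all pages.

    pages: iterable of pages; each page is a list of blocks and each
    block is a list of link targets (URLs). Segmentation into blocks
    is assumed to have happened upstream.
    """
    total = defaultdict(lambda: defaultdict(float))
    for blocks in pages:
        for block in blocks:
            # Count each distinct link once per block.
            links = sorted(set(block))
            for a, b in combinations(links, 2):
                # Assumption: one co-occurrence in a block adds 1.0.
                total[a][b] += 1.0
                total[b][a] += 1.0
    return total


def related_pages(total, u, k=5):
    """Return up to k pages with the highest total similarity to u."""
    scores = total.get(u, {})
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda v: (-scores[v], v))[:k]


if __name__ == "__main__":
    # Toy corpus: three pages, each already split into link blocks.
    sample = [
        [["a", "b", "c"], ["x", "y"]],
        [["a", "b"], ["c", "x"]],
        [["a", "b", "y"]],
    ]
    sims = block_cocitation(sample)
    print(related_pages(sims, "a", k=2))
```

Because each page contributes to the totals independently, the per-page loop could in principle be distributed across workers and the partial totals summed afterwards. The abstract does not say how the authors parallelized their implementation.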

