Using XPaths of Inbound Links to Cluster Template-Generated Web Pages
Department of Information Systems, Vilnius Gediminas Technical University
Sauletekio av. 11, LT–10223 Vilnius, Lithuania
{tomas.grigalis, antanas.cenys}@vgtu.lt
Abstract
Template-generated Web pages contain most of the structured data on the Web. Clustering these pages according to their template structure is an important problem in wrapper-based structured data extraction systems. These systems extract structured data using wrappers, each of which must be matched only to pages of a particular template. Selecting pages of a single template type from all crawled Web pages is a time-consuming task. Although there are methods to cluster Web pages according to their structural similarity, in most cases they are too computationally expensive to be applicable at Web scale. We propose a novel, highly scalable approach to structurally cluster Web pages by employing XPath addresses of inbound inner-site links. We demonstrate the effectiveness of our method by clustering more than one million Web pages from many real-world Websites in a few minutes and achieving more than 90% accuracy.
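The core idea, grouping pages by the XPath address of the links that point to them, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that a simplified XPath (the chain of tag names from the document root to the `<a>` element, without positional indices or attributes) is enough to characterize a link, and that "inner-site" means the link target has the same host as the linking page. The names `LinkPathParser` and `cluster_by_inbound_xpath` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# HTML elements that never have a closing tag, so they must not be pushed
# onto the open-tag stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LinkPathParser(HTMLParser):
    """Collects (tag path, href) pairs for every <a href> in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.links = []   # (simplified XPath of the link, raw href)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(("/" + "/".join(self.stack), href))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; stray end tags
        # that match nothing on the stack are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def cluster_by_inbound_xpath(pages):
    """Group link targets by the simplified XPath of their inbound links.

    pages: dict mapping page URL -> HTML source (assumed input format).
    Returns a dict mapping XPath -> set of absolute target URLs.
    """
    clusters = defaultdict(set)
    for url, html in pages.items():
        parser = LinkPathParser()
        parser.feed(html)
        site = urlparse(url).netloc
        for xpath, href in parser.links:
            target = urljoin(url, href)
            if urlparse(target).netloc == site:   # keep inner-site links only
                clusters[xpath].add(target)
    return dict(clusters)
```

In this sketch, links that sit at the same position in a listing template (for example, every item link inside `/html/body/ul/li/a`) place their target pages in the same cluster, while navigation links at other positions fall into separate clusters. Only the linking pages are parsed, so the target pages themselves never need to be compared pairwise.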
Key words
Web data extraction, structural clustering, template-generated pages, wrapper induction
Digital Object Identifier (DOI)
https://doi.org/10.2298/CSIS130416020G
Publication information
Volume 11, Issue 1 (January 2014)
Year of Publication: 2014
ISSN: 2406-1018 (Online)
Publisher: ComSIS Consortium
Full text
Available in PDF
How to cite
Grigalis, T., Čenys, A.: Using XPaths of Inbound Links to Cluster Template-Generated Web Pages. Computer Science and Information Systems, Vol. 11, No. 1, 111-132. (2014), https://doi.org/10.2298/CSIS130416020G