Towards Clustering of Web-based Document Structures

Matthias Dehmer, Frank Emmert-Streib, Juergen Kilian, Andreas Zulauf

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, Scientific, peer-review

Abstract

Methods for organizing web data into groups, in order to analyze web-based hypertext data and facilitate data availability, are very important given the number of documents available online. In this context, the task of clustering web-based document structures has many applications, e.g., improving information retrieval on the web, better understanding user navigation behavior, improving the servicing of web user requests, and increasing web information accessibility. In this paper we investigate a new approach for clustering web-based hypertexts on the basis of their graph structures. The hypertexts are represented as so-called generalized trees, which are more general than the usual directed rooted trees, e.g., DOM trees. As an important preprocessing step, we measure the structural similarity between the generalized trees on the basis of a similarity measure d. Then, we apply agglomerative clustering to the resulting similarity matrix in order to create clusters of hypertext graph patterns representing navigation structures. In the present paper we run our approach on a data set of hypertext structures and obtain good results in Web Structure Mining. Furthermore, we outline the application of our approach in Web Usage Mining as future work.
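The two-stage pipeline described in the abstract (pairwise similarities first, agglomerative clustering second) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the similarity measure d on generalized trees is not reproduced here, the matrix `toy_similarity` holds made-up values, and the average-linkage rule, the `threshold` stopping criterion, and the function names are all assumptions chosen for illustration.

```python
from itertools import combinations


def average_similarity(sim, a, b):
    """Mean pairwise similarity between two clusters of item indices."""
    return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))


def agglomerative_clustering(sim, threshold):
    """Average-linkage clustering on a symmetric similarity matrix.

    Repeatedly merges the two most similar clusters and stops once no
    pair has an average similarity of at least `threshold`.
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        # Find the most similar pair among the current clusters.
        (x, y), best = max(
            (((p, q), average_similarity(sim, clusters[p], clusters[q]))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical similarity values in [0, 1] for five hypertext graphs.
# Illustrative only; these numbers are not taken from the paper.
toy_similarity = [
    [1.0, 0.9, 0.8, 0.2, 0.1],
    [0.9, 1.0, 0.7, 0.3, 0.2],
    [0.8, 0.7, 1.0, 0.2, 0.3],
    [0.2, 0.3, 0.2, 1.0, 0.9],
    [0.1, 0.2, 0.3, 0.9, 1.0],
]

clusters = agglomerative_clustering(toy_similarity, threshold=0.5)
print(clusters)
```

With these toy values, graphs 0, 1 and 2 end up in one cluster and graphs 3 and 4 in another. In a real setting the matrix entries would come from the structural similarity measure applied to each pair of generalized trees, and the linkage rule and stopping criterion would need to be chosen to match the data.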

Original language: English
Title of host publication: Proceedings Of World Academy Of Science, Engineering And Technology, Vol 10
Editors: C Ardil
Publisher: WORLD ACAD SCI, ENG & TECH-WASET
Pages: 289-294
Number of pages: 6
Publication status: Published - 2005
Externally published: Yes
Publication type: A4 Article in conference proceedings
Event: Conference of the World-Academy-of-Science-Engineering-and-Technology - Cracow, Poland
Duration: 16 Dec 2005 to 18 Dec 2005

Publication series

Name: Proceedings of World Academy of Science Engineering and Technology
Publisher: WORLD ACAD SCI, ENG & TECH-WASET
Volume: 10
ISSN (Print): 1307-6884

Conference

Conference: Conference of the World-Academy-of-Science-Engineering-and-Technology
Country/Territory: Poland
Period: 16/12/05 to 18/12/05

Keywords

  • Clustering methods
  • graph-based patterns
  • graph similarity
  • hypertext structures
  • web structure mining
