XML CLUSTERING FRAMEWORK BASED ON DOCUMENT CONTENT AND STRUCTURE IN A HETEROGENEOUS DIGITAL LIBRARY

Nafisse Samadi; Sri Devi Ravana

doi:10.22452/mjcs.vol36no2.2

Authors

Nafisse Samadi, Department of Information Systems, Faculty of Computer Science and Information Technology, Universiti Malaya, 50603 Kuala Lumpur, Malaysia
Sri Devi Ravana (Corresponding Author), Department of Information Systems, Faculty of Computer Science and Information Technology, Universiti Malaya, 50603 Kuala Lumpur, Malaysia

DOI:

https://doi.org/10.22452/mjcs.vol36no2.2

Keywords:

Information retrieval, Document clustering, Focused retrieval, XML Document clustering, Digital library

Abstract

As the volume of textual information published in digital libraries grows, efficient retrieval methods are required. Textual documents in a digital library vary in both structure and content. When these documents are organized in an XML structure, they can be represented at hierarchical levels of granularity, which allows focused retrieval to improve precision. In this way, the contextual elements of each document can be retrieved from a known structure. One way to retrieve these elements is to cluster them using a combination of Content and Structural similarities. To this end, a novel two-level clustering framework based on Content and Structure is proposed. The framework decomposes a document into meaningful structural units and analyzes all of its rich text within its own structure. The framework was evaluated on a heterogeneous XML document collection with a variety of data sources, structures, and content, built to serve as a sample of a real digital library. The collection was designed so that all of the study's objectives could be tested. The clustering results were evaluated using the Entropy criterion. Finally, Content and Structure clustering was compared with conventional Content Only clustering to demonstrate the efficacy of considering structural features over existing Content Only methods in the retrieval process. The total Entropy results of the two-level Content and Structural clustering are almost twice as good as those of the Content Only clustering approach. Consequently, the proposed framework can improve Information Retrieval systems in two ways: i) by considering the structural aspect of text-rich documents in the retrieval process, and ii) by replacing document-level retrieval with element-level retrieval.
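
The abstract does not state which form of the Entropy criterion was used. As an illustration only, the sketch below assumes the common class-based definition: each cluster's entropy is computed over the reference classes of its members, and the total is the size-weighted average across clusters, so lower values indicate purer clusters. The function name `total_entropy` and the toy labels are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def total_entropy(cluster_ids, class_labels):
    """Size-weighted average of per-cluster entropy (lower is better).

    Assumed definition, not confirmed by the paper:
      E_j = -sum_i p_ij * log2(p_ij), where p_ij is the fraction of
      cluster j whose members belong to reference class i;
      total = sum_j (n_j / n) * E_j.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("inputs must have the same length")

    # Group the reference class labels by assigned cluster.
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, class_labels):
        members[cid].append(label)

    n = len(cluster_ids)
    total = 0.0
    for labels in members.values():
        size = len(labels)
        counts = Counter(labels)
        # Entropy of the class distribution inside this cluster.
        h = -sum((c / size) * math.log2(c / size) for c in counts.values())
        total += (size / n) * h
    return total


# Hypothetical toy data: four elements, two reference classes.
pure = total_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = total_entropy([0, 0, 1, 1], ["a", "b", "a", "b"])
```

Under this assumed definition, a clustering whose clusters each contain a single class scores zero, while clusters that mix the classes evenly score the maximum for that number of classes.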
