TF-IDF formula for hierarchical categorisation. MDS (Multimedia Database Systems)
Location: Perpustakaan Fakultas Ilmu Komputer (Faculty of Computer Science Library)
Online collections, such as world-wide web databases, newswire services, and digital libraries, are growing so rapidly that good management is necessary to ensure that they remain useful resources. Storing, querying, retrieving, routing, filtering, and categorising documents are some of the management tasks for online collections. Categorisation of online documents, the focus of this research, is the assignment of category labels to documents. The category labels are useful for managing the storage and retrieval of these documents. Some collections label their documents with unstructured, independent category labels, while others label them hierarchically. If category labels are organised hierarchically, this structure has the potential to improve the identification of the correct category by allowing a document to be assigned first to a more general category and then to progressively more specific ones. The category hierarchy is also useful in an automatic categorisation system, because categorisation can be analysed at different levels of the hierarchy and that information used to achieve a better category assignment. This thesis addresses the hierarchical categorisation problem, that is, the problem of assigning one or more identifiers from a list of hierarchically ordered class labels to a document. We focus on the analysis of categorisation behaviour in a hierarchical class environment, elaborate on approaches to improving the performance of general categorisation problems using hierarchical models, and solve problems of category assignment into hierarchical categories. We show that, in a hierarchical environment, higher and lower levels of the hierarchy exhibit different behaviour. Categorisation into a higher level of the hierarchy tends to produce better results than categorisation into a lower level. This is especially true for a hierarchical system in which the upper-level categories subsume all the properties of their lower-level children.
Choosing upper-level categories results in higher performance, while choosing lower-level categories results in sharper, more specific categories. In this thesis, we propose novel techniques for categorising into hierarchical categories. Our techniques can improve categorisation accuracy by shifting assignment to higher-level categories when assignment into lower-level categories tends to produce errors. We also propose an extension of the well-known TF.IDF formula for hierarchical categorisation. We show that, by employing this new formula, we can significantly improve the overall effectiveness of categorisation compared with flat categorisation. The improvement is obtained by assigning mostly the correct and most detailed class labels, while to some extent assigning class labels from higher levels of the hierarchy. We also propose a novel feature selection strategy that improves categorisation performance while reducing the size of the feature space. Our feature selection technique is suitable for hierarchical categorisation because it works equally well when applied to categories at different hierarchical levels.
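The abstract does not state the extended TF.IDF formula, so the following is only a rough illustration of the general idea of shifting an assignment to a higher-level category when the lower-level choice looks unreliable. It uses the standard TF.IDF weighting, tf(t, d) * log(N / df(t)), with cosine similarity, not the thesis's extension. The hierarchy `PARENT`, the toy data `TRAIN`, the function names, and the `margin` back-off rule are all assumptions made for this sketch.

```python
import math
from collections import Counter

# Hypothetical two-level hierarchy (leaf -> parent); not taken from the thesis.
PARENT = {"football": "sport", "tennis": "sport",
          "stocks": "finance", "banking": "finance"}

# Toy training text, one pseudo-document per leaf category (illustrative only).
TRAIN = {
    "football": "goal striker league match goal",
    "tennis": "serve racket match set serve",
    "stocks": "shares market index shares trader",
    "banking": "loan interest deposit market loan",
}


def build_model(train):
    """Build standard TF.IDF vectors: weight = tf * log(N / df)."""
    docs = {label: Counter(text.split()) for label, text in train.items()}
    n = len(docs)
    # Iterating a Counter yields each distinct term once, giving document frequency.
    df = Counter(t for counts in docs.values() for t in counts)
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = {label: {t: c * idf[t] for t, c in counts.items()}
               for label, counts in docs.items()}
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def categorise(text, vectors, idf, margin=0.1):
    """Return a leaf label, or its parent when sibling leaves are too close.

    Assumed back-off rule: if the best leaf does not beat its best sibling
    by at least `margin`, assign the shared parent category instead.
    """
    query = {t: c * idf.get(t, 0.0) for t, c in Counter(text.split()).items()}
    scores = {label: cosine(query, vec) for label, vec in vectors.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0.0:
        return None  # no known term matched any category
    siblings = [s for s in scores if s != best and PARENT[s] == PARENT[best]]
    runner_up = max((scores[s] for s in siblings), default=0.0)
    return best if scores[best] - runner_up >= margin else PARENT[best]


vectors, idf = build_model(TRAIN)
print(categorise("goal league striker", vectors, idf))
print(categorise("match", vectors, idf))
```

With this toy data, the first query contains terms specific to one leaf and is assigned to `football`, while the second contains only a term shared by both children of `sport`, so the sketch backs off to the parent label. The value of `margin` trades specificity against accuracy, which mirrors the trade-off between upper-level and lower-level categories described in the abstract.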