A novel framework for termset selection and weighting in binary text classification

Publisher

Pergamon-Elsevier Science Ltd

Access Rights

info:eu-repo/semantics/closedAccess

Abstract

This study presents a new framework for termset selection and weighting based on the joint occurrence statistics of pairs of terms. Each termset is evaluated by taking into account both the simultaneous and the individual occurrences of its terms. Based on the idea that the occurrence of one term but not the other may also convey valuable information for discrimination, conventional term selection schemes are adapted for termset selection. Similarly, the weight of a selected termset is computed as a function of the terms that occur in the document under consideration: a termset is assigned a nonzero weight if either or both of its terms appear in the document. This weight estimation scheme evaluates the individual occurrences of the terms and their co-occurrences separately to compute the document-specific weight of each termset. The proposed termset-based representation is concatenated with the bag-of-words representation to construct the document vectors. Experiments conducted on three widely used datasets have verified the effectiveness of the proposed framework. © 2014 Elsevier Ltd. All rights reserved.
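The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract gives no formulas, so the chi-square criterion, the per-state weights in `STATE_WEIGHTS`, and all function names below are assumptions chosen only for illustration. The sketch treats each term pair as having four occurrence states in a document (both terms, only the first, only the second, neither), scores the pair by how strongly those states depend on the class label, and appends the termset weights to a bag-of-words vector.

```python
from collections import Counter

# Hypothetical per-state weights; the abstract does not give the actual values.
# A termset gets a nonzero weight if either or both of its terms occur.
STATE_WEIGHTS = {"both": 1.0, "first": 0.5, "second": 0.5, "none": 0.0}


def occurrence_state(doc_terms, pair):
    """Return which terms of the pair occur in the document's term set."""
    first, second = pair
    has_first, has_second = first in doc_terms, second in doc_terms
    if has_first and has_second:
        return "both"
    if has_first:
        return "first"
    if has_second:
        return "second"
    return "none"


def chi_square_score(docs, labels, pair):
    """Chi-square statistic of the pair's occurrence state against the class.

    Chi-square is an assumed stand-in for the adapted selection schemes.
    """
    observed = Counter(
        (occurrence_state(set(doc), pair), label)
        for doc, label in zip(docs, labels)
    )
    state_totals, class_totals = Counter(), Counter()
    for (state, label), count in observed.items():
        state_totals[state] += count
        class_totals[label] += count
    n = len(docs)
    score = 0.0
    for state in state_totals:
        for label in class_totals:
            expected = state_totals[state] * class_totals[label] / n
            score += (observed[(state, label)] - expected) ** 2 / expected
    return score


def termset_weight(doc_terms, pair):
    """Document-specific weight of a termset, looked up by occurrence state."""
    return STATE_WEIGHTS[occurrence_state(doc_terms, pair)]


def document_vector(doc, vocabulary, termsets):
    """Concatenate bag-of-words counts with termset weights."""
    counts = Counter(doc)
    bag_of_words = [counts[term] for term in vocabulary]
    termset_part = [termset_weight(set(doc), pair) for pair in termsets]
    return bag_of_words + termset_part
```

In this sketch a document containing only one term of a pair still contributes a nonzero feature value, which mirrors the abstract's point that the occurrence of one term without the other can carry discriminative information.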

Keywords

Co-occurrence features, Termset selection, Termset weighting, Document representation, Text categorization

Journal or Series

Engineering Applications of Artificial Intelligence

Volume

35
