A Comparative Study of Statistical Models for Feature Selection Methods in Text Categorization

dc.contributor.advisor	Ergün, Cem
dc.contributor.author	Sarıhan, Tansel
dc.date.accessioned	2021-08-16T11:18:19Z
dc.date.available	2021-08-16T11:18:19Z
dc.date.issued	2019
dc.date.submitted	2019
dc.identifier.citation	Sarıhan, Tansel. (2019). A Comparative Study of Statistical Models for Feature Selection Methods in Text Categorization. Thesis (M.S.), Eastern Mediterranean University, Institute of Graduate Studies and Research, Dept. of Computer Engineering, Famagusta: North Cyprus.	en_US
dc.identifier.uri	http://hdl.handle.net/11129/5026
dc.description	Master of Science in Computer Engineering. Thesis (M.S.)--Eastern Mediterranean University, Faculty of Engineering, Dept. of Computer Engineering, 2019. Supervisor: Assist. Prof. Dr. Cem Ergün.	en_US
dc.description.abstract	As technology has advanced, data have moved rapidly into digital form, and categorizing digital documents has become difficult and complicated. Researchers have therefore turned to machine learning for solutions that are more efficient in resources and time. A major problem in text categorization is the high dimensionality of the feature space, and feature selection methods have been widely used in recent decades to choose a subset of features. To maximize text classification efficiency, several machine learning algorithms and feature selection methods are studied comparatively. The experiments are conducted on the Reuters-21578 "ApteMod" version, The 4-Universities, and the 20-Newsgroups "bydate" version datasets. The study covers the pipeline from gathering and organizing data, through different preprocessing and term weighting approaches, to testing with the feature selection methods and a range of classification algorithms. The idea behind feature selection is to determine the importance of words that are discriminative for the categorization task and to remove non-informative terms. To this end, the Chi-Square, Mutual Information, Galavotti-Sebastiani-Simi Coefficient, and Document Frequency metrics are studied for feature selection. TF-IDF and probability-based term weighting approaches are used to prepare the texts for classification. Finally, to identify the best-performing classifiers and feature selection methods, the effectiveness of the system is evaluated with performance metrics such as accuracy, precision, recall, and F-measure.	en_US
dc.description.abstract	ÖZ: As a result of changes and developments in the world of technology, data have begun to be transferred rapidly to the digital environment, and classifying digital documents has consequently become difficult and complex. For this reason, researchers have focused on further research in machine learning to provide a solution to this problem that is more efficient in time and resource use. The main problem of text classification is the high dimensionality of the feature space. Feature selection methods have been widely used in recent years to select a subset of features. To maximize text classification efficiency, several machine learning algorithms and feature selection methods are examined comparatively. The experiments were carried out with the Reuters-21578 "ApteMod", The 4-Universities, and 20-Newsgroups "bydate" datasets. Many topics are discussed, from different text preprocessing and term weighting approaches to the use of the feature selection methods and many classification algorithms. The idea behind feature selection is to determine the importance of words that can distinguish the categories of texts and to remove non-informative terms. In this context, the Chi-Square, Mutual Information, Galavotti-Sebastiani-Simi Coefficient, and Document Frequency measures were examined for feature selection. TF-IDF and probability-based term weighting approaches were used to prepare the texts for classification. Then, to obtain the best classifiers and feature selection metrics, the effectiveness of the system was evaluated with performance evaluation measures such as accuracy, precision, recall, and F-measure.	en_US
dc.language.iso	eng	en_US
dc.publisher	Eastern Mediterranean University (EMU) - Doğu Akdeniz Üniversitesi (DAÜ)	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Computer Engineering	en_US
dc.subject	Text processing (Computer science)	en_US
dc.subject	Computational linguistics--Statistical methods	en_US
dc.subject	Natural language processing (Computer science)	en_US
dc.subject	Feature Selection Methods	en_US
dc.subject	Text Categorization	en_US
dc.subject	Term Weighting	en_US
dc.subject	Performance Evaluation	en_US
dc.title	A Comparative Study of Statistical Models for Feature Selection Methods in Text Categorization	en_US
dc.type	masterThesis	en_US
dc.contributor.department	Eastern Mediterranean University, Faculty of Engineering, Dept. of Computer Engineering	en_US
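
The abstract names four feature selection metrics but this record does not give their formulas. The sketch below shows one common way to score a single term against a single category from document counts, assuming the standard formulations from the text categorization literature (pointwise mutual information, and the GSS coefficient as a simplified chi-square). The function name `selection_scores` and the count names `a`, `b`, `c`, `d` are hypothetical and are not taken from the thesis.

```python
import math


def selection_scores(a, b, c, d):
    """Score one term against one category from document counts.

    Assumed (hypothetical) count names:
      a: documents in the category that contain the term
      b: documents outside the category that contain the term
      c: documents in the category that lack the term
      d: documents outside the category that lack the term
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("at least one document is required")

    # Chi-square: N * (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    chi_square = n * (a * d - c * b) ** 2 / denom if denom else 0.0

    # Pointwise mutual information: log( A*N / ((A+C)(A+B)) )
    # A term that never occurs in the category gets -inf.
    if a > 0:
        mutual_information = math.log(a * n / ((a + c) * (a + b)))
    else:
        mutual_information = float("-inf")

    # GSS coefficient: P(t,c)P(~t,~c) - P(t,~c)P(~t,c) = (AD - BC) / N^2
    gss = (a * d - b * c) / (n * n)

    # Document frequency: number of documents that contain the term
    document_frequency = a + b

    return {
        "chi_square": chi_square,
        "mutual_information": mutual_information,
        "gss": gss,
        "document_frequency": document_frequency,
    }


if __name__ == "__main__":
    # A term seen in 40 of 50 in-category documents and 10 of 50 others.
    for name, value in selection_scores(40, 10, 10, 40).items():
        print(f"{name}: {value}")
```

In a setup like this, terms would be ranked by one of the scores (often taking the maximum or average over categories) and only the top-ranked terms kept as features; how the thesis aggregates scores across categories is not stated in this record.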