Feature Selection in High Dimensional Spaces

dc.contributor.advisor	Altınçay, Hakan
dc.contributor.author	Sheikhi, Ghazaal
dc.date.accessioned	2022-03-11T06:13:02Z
dc.date.available	2022-03-11T06:13:02Z
dc.date.issued	2020
dc.date.submitted	2020-09
dc.identifier.citation	Sheikhi, Ghazaal. (2020). Feature Selection in High Dimensional Spaces. Thesis (Ph.D.), Eastern Mediterranean University, Institute of Graduate Studies and Research, Dept. of Computer Engineering, Famagusta: North Cyprus.	en_US
dc.identifier.uri	http://hdl.handle.net/11129/5307
dc.description	Doctor of Philosophy in Computer Engineering. Institute of Graduate Studies and Research. Thesis (Ph.D.) - Eastern Mediterranean University, Faculty of Engineering, Dept. of Computer Engineering, 2020. Supervisor: Prof. Dr. Hakan Altınçay.	en_US
dc.description.abstract	In this study, two novel filter feature selection approaches are proposed as alternatives to state-of-the-art methods. The first is a greedy feature selection method in which redundancy is replaced by diversity to quantify the complementarity of a candidate feature with respect to the already selected subset. Both relevance and diversity are computed in terms of the ranks of positive instances, analogous to the computation of the area under the receiver operating characteristic curve (AUC). In the second approach, a novel dissimilarity metric based on Feature-to-Feature (F2F) scatter frequencies is proposed for clustering-based filter feature selection. The metric is computed by obtaining feature-dependent ranks of samples and identifying the features that assign close ranks to each sample. Samples are represented as sets of affinity sets, each containing features whose rank differences fall within a predefined proximity window size. The F2F dissimilarity of a pair of features is computed from the frequency of their appearance in different affinity sets. Features are then clustered into distinct groups using the F2F dissimilarity metric, and the feature with the highest relevance score is selected from each cluster. Experiments conducted on 10 UCI and microarray gene expression data sets confirm that the proposed feature selection approaches provide better performance scores than competing methods. The proposed method outperforms the widely used mutual information-based schemes in terms of classification accuracy, AUC and stability. Keywords: feature selection, ranks of instances, relevance, diversity, dissimilarity, scatter frequency, representative feature.	en_US
dc.description.abstract	ÖZ (Turkish abstract, in English translation): In this study, two new filter feature selection approaches are proposed as alternatives to the state of the art. The first proposed approach is a greedy feature selection method that replaces redundancy with diversity in order to measure the complementarity of a candidate feature with respect to the already selected subset. Both relevance and diversity are computed from the ranks of positive instances, similar to the computation of the area under the receiver operating characteristic curve (AUC). In the second approach, a new dissimilarity metric based on Feature-to-Feature (F2F) scatter frequencies is proposed for clustering-based filter feature selection. The proposed metric is computed by obtaining feature-dependent ranks of samples and identifying the features that assign close ranks to each sample. Samples are represented as affinity sets containing features whose rank differences lie within a predefined proximity window size. The F2F dissimilarity of a pair of features is computed using the frequency of their appearance in different affinity sets. Features are then clustered into distinct groups using the F2F dissimilarity metric. From each cluster, the feature with the highest relevance score is selected. Experiments conducted on 10 UCI and microarray gene expression data sets have shown that the proposed feature selection approaches provide better performance scores than other competing methods. The proposed method performs better than the widely used mutual information-based techniques in terms of classification accuracy, AUC and stability. Keywords: feature selection, ranks of instances, relevance, diversity, dissimilarity, scatter frequency, representative feature.	en_US
dc.language.iso	eng	en_US
dc.publisher	Eastern Mediterranean University (EMU) - Doğu Akdeniz Üniversitesi (DAÜ)	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	Computer Engineering	en_US
dc.subject	Data mining--Data Acquisition and Storage	en_US
dc.subject	Information storage and retrieval systems	en_US
dc.subject	Text Categorization--Text Classification	en_US
dc.subject	Text processing (Computer science)	en_US
dc.subject	Data in computer systems	en_US
dc.subject	Feature selection	en_US
dc.subject	ranks of instances	en_US
dc.subject	relevance	en_US
dc.subject	diversity	en_US
dc.subject	dissimilarity	en_US
dc.subject	scatter frequency	en_US
dc.subject	representative feature	en_US
dc.title	Feature Selection in High Dimensional Spaces	en_US
dc.type	doctoralThesis	en_US
dc.contributor.department	Eastern Mediterranean University, Faculty of Engineering, Dept. of Computer Engineering	en_US
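
Illustrative sketch (not part of the repository record). The abstract describes two rank-based quantities: a relevance score computed from the ranks of positive instances, analogous to AUC, and an F2F dissimilarity built from features that assign close ranks to the same sample. The Python below is a minimal sketch of both ideas under stated assumptions. The relevance function uses the standard rank-sum (Mann-Whitney) form of AUC. The dissimilarity function is a simplified pairwise interpretation, assumed here for illustration: it counts the fraction of samples on which two features' ranks differ by more than the window. The thesis's exact affinity-set construction is not given in this record, and all function and parameter names are hypothetical.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks of values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_relevance(feature, labels):
    """AUC of a single feature, computed from the ranks of positive instances.

    Standard rank-sum identity: AUC = (R_pos - n_pos (n_pos + 1) / 2) / (n_pos n_neg),
    where R_pos is the sum of the ranks of the positive (label 1) samples.
    """
    ranks = average_ranks(feature)
    pos_ranks = [r for r, y in zip(ranks, labels) if y == 1]
    n_pos = len(pos_ranks)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def f2f_dissimilarity(columns, window):
    """Pairwise dissimilarity between feature columns (assumed simplification).

    Two features count as 'close' on a sample when the ranks they assign to
    that sample differ by at most `window`. The dissimilarity of a pair is the
    fraction of samples on which they are NOT close.
    """
    n_features = len(columns)
    n_samples = len(columns[0])
    ranks = [average_ranks(col) for col in columns]
    dis = [[0.0] * n_features for _ in range(n_features)]
    for a, b in combinations(range(n_features), 2):
        close = sum(
            1 for s in range(n_samples) if abs(ranks[a][s] - ranks[b][s]) <= window
        )
        dis[a][b] = dis[b][a] = 1.0 - close / n_samples
    return dis
```

For example, rank_relevance([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]) gives 0.75, since the positive samples hold ranks 2 and 4. A dissimilarity matrix of this kind could be passed to any clustering routine that accepts precomputed distances, after which the feature with the highest relevance in each cluster would be kept, as the abstract describes.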