Evaluation of Similarity Coefficients for Binary Data: A Simulation Study

İsmet DOĞAN^a, Nurhan DOĞAN^a, Taylan DOĞAN^b
^aAfyonkarahisar Health Sciences University Faculty of Medicine, Department of Biostatistics and Medical Informatics, Afyonkarahisar, TÜRKİYE
^bKPN B.V. Mobile Telecommunications Company, Amsterdam, THE NETHERLANDS

Turkiye Klinikleri J Biostat. 2021;13(3):282-92

doi: 10.5336/biostatic.2021-84763

Article Language: TR


ABSTRACT
Objective: The aim of this study is to introduce 72 binary similarity coefficients and to evaluate their properties, using generated data sets that cover different values of n, a, b, c and d.
Material and Methods: This study considers the similarity coefficients proposed for binary data. Data were generated with the Python random library for 35 different values of n in the range 10≤n≤1000. To generate each data set, the cell to receive a value (one of the cells denoted a, b, c and d) was selected first, and then the value to be assigned to that cell was determined. The study used 288 different data sets for n=10, 817 for n=15, and 1000 for each n≥20.
Results: Although all similarity coefficients for binary data are expected to range from 0 (no similarity) to 1 (complete similarity), this range does not hold for every coefficient. Of the 72 coefficients considered, 29 take values in this range. According to hierarchical cluster analysis, most of the similarity coefficients resemble one another.
Conclusion: In general, the values of almost all coefficients increase from a fixed minimum to a fixed maximum as the samples become more similar. However, the coefficients proposed by Hamann and Sokal-Michener increase smoothly and linearly with similarity for all values of n. The Sokal-Michener coefficient stands out among all the coefficients because its value range is 0-1 and it increases in parallel with similarity. The Eyraud, Fager-McGowan, Fossum, Gower, Harris-Lahey, Pearson I and Sokal-Sneath IV similarity coefficients are affected by n, whereas the other coefficients are not. Therefore, a substantial portion of the similarity coefficients were found to be independent of sample size.

Keywords: Similarity coefficient; hierarchical clustering; binary data
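
The two-step generation described in the methods, together with the two coefficients singled out in the conclusion, can be illustrated with a minimal Python sketch. The abstract does not give the exact sampling procedure, so random_table below is an assumption: it visits the four cells in random order, gives each of the first three a random share of what is left, and puts the remainder in the last cell. It also does not enforce that the generated tables are all different. The cell meanings (a: both present, b and c: present in one only, d: both absent) follow the usual convention, and the function names are hypothetical. The formulas used are the standard definitions: Sokal-Michener is (a+d)/n and Hamann is ((a+d)-(b+c))/n.

```python
import random


def random_table(n, rng):
    """Draw one 2x2 table {a, b, c, d} whose counts sum to n.

    Assumed procedure (not taken from the paper): choose the order in
    which cells are filled, then choose a value for each cell in turn.
    """
    cells = ["a", "b", "c", "d"]
    rng.shuffle(cells)                      # step 1: which cell is filled next
    table, remaining = {}, n
    for name in cells[:3]:
        table[name] = rng.randint(0, remaining)  # step 2: value for that cell
        remaining -= table[name]
    table[cells[3]] = remaining             # last cell takes what is left
    return table


def sokal_michener(t):
    """Simple matching coefficient (a + d) / n, which lies in [0, 1]."""
    n = t["a"] + t["b"] + t["c"] + t["d"]
    return (t["a"] + t["d"]) / n


def hamann(t):
    """Hamann coefficient ((a + d) - (b + c)) / n, which lies in [-1, 1]."""
    n = t["a"] + t["b"] + t["c"] + t["d"]
    return (t["a"] + t["d"] - t["b"] - t["c"]) / n


rng = random.Random(0)                      # fixed seed for repeatability
tables = [random_table(20, rng) for _ in range(1000)]
scores = [(sokal_michener(t), hamann(t)) for t in tables]
```

Because Hamann equals 2 × Sokal-Michener − 1, both coefficients are linear in the proportion of matches (a+d)/n, which is consistent with the linear behaviour reported in the conclusion; only Sokal-Michener stays within the 0-1 range.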

