Monday 7 July 2014

Similarity Measures on Categorical Data Part 1

Last Saturday, I started my research work on Cluster Analysis. The book I was reading discussed a common problem in Cluster Analysis: handling categorical data. The method I usually adopt is to create a set of dummy variables, one for each level of the categorical variable. For example, if the categorical variable has 4 values, 4 dummy variables would be created for it.
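To make the dummy-variable idea concrete, here is a minimal sketch in plain Python. The `colours` data and the `one_hot` helper are made up for illustration; they are not from the book or the paper.

```python
def one_hot(values):
    """Return the sorted levels and one 0/1 row per input value."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical categorical variable with 4 levels.
colours = ["red", "green", "blue", "yellow", "red"]
levels, encoded = one_hot(colours)

print(levels)
for colour, row in zip(colours, encoded):
    print(colour, row)
```

Each observation ends up with exactly one 1 and three 0s, which is where the trouble described next comes from.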

Apparently, this method can introduce more 'similarity' into the model than is really there, because a '0-0' pairing is counted as a match in the same way as a '1-1' pairing. Thus there are other measures for calculating the 'similarity' of categorical data, such as Goodall, Gambaryan and so on, that take into account the frequency and occurrence of attributes.
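A small sketch of how I understand the problem, again with a made-up 4-level colour variable. It compares a simple matching score on the dummy-coded vectors with a plain overlap score on the raw categories. The function names are my own, and this is not one of the frequency-based measures such as Goodall or Gambaryan.

```python
LEVELS = ["blue", "green", "red", "yellow"]  # hypothetical levels

def encode(value):
    """Dummy-code a single value against LEVELS."""
    return [1 if value == level else 0 for level in LEVELS]

def simple_matching(a, b):
    """Share of positions where the two 0/1 vectors agree (0-0 counts too)."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def overlap(x, y):
    """1 if the raw categories are equal, otherwise 0."""
    return 1.0 if x == y else 0.0

red, blue = encode("red"), encode("blue")

smc = simple_matching(red, blue)   # the two 0-0 positions count as matches
raw = overlap("red", "blue")       # the categories themselves differ

print("simple matching on dummies:", smc)
print("overlap on raw categories:", raw)
```

Here 'red' and 'blue' share nothing, yet the dummy-coded vectors agree in two of the four positions purely because of '0-0' pairs, so the simple matching score is 0.5 while the overlap score is 0.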

Let's see if I can digest the papers that I have found.

http://www-users.cs.umn.edu/~sboriah/PDFs/BoriahBCK2008.pdf