- Frequency tree for clustering categorical data with similarity measure based on items' weights
- Publisher: Hamad bin Khalifa University Press (HBKU Press)
- Source: Qatar Foundation Annual Research Forum Proceedings, Volume 2012, Issue 1, Oct 2012, AESNP4
Abstract
Clustering is an important data mining technique that groups similar data records. In this research, the problem of clustering categorical (or transactional) data is studied. An algorithm for clustering transactional data, called F-Tree (frequent tree), is proposed, and a new similarity measure function, called PWO (probability of the weights of overlapped items), is introduced. A framework to detect the best similarity value for different datasets is also developed. Based on the items' frequencies in transactions, the F-Tree algorithm first generates small but highly pure clusters. It then merges similar clusters using the PWO similarity measure, which depends on the probability of the weights of overlapped items. One advantage of PWO is that it outputs the percentage of similarity between items, which is a clearer and more accurate measure of similarity than other similarity measures. Our experimental evaluation on real categorical datasets (Mushrooms, KrVskp, Congressional Voting, Soybean-Large, Soybean-Small, Hepatitis, Zoo, Lenses, and Adult-Stretch) shows that: firstly, F-Tree is effective in finding rare or interesting clusters and produces clusters with higher purity than LargeItem, WCD, and CLOPE; secondly, PWO is more effective at measuring the similarity between categorical data than LargeItem and CLOPE; thirdly, clustering with the PWO similarity measure and a pre-defined number of clusters yields separate classes with good purity, covering on average 80% of the real classes; fourthly, the overlap estimator accurately estimates the value of the overlap threshold from a small sample of around 5% of the dataset; and finally, merging small, pure clusters increases the purity of the resulting clusters and reduces clustering time more than generating clusters directly from the dataset and then refining them.
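The abstract does not give the PWO formula, so the following is only a minimal illustrative sketch in Python. It assumes PWO is computed as the combined weight (item frequency) of the items shared by two clusters divided by the total item weight of both clusters, reported as a percentage, in line with the description above; the function names and the cluster representation are hypothetical, not the authors' implementation.

```python
# Hypothetical sketch of a PWO-style similarity between two transactional clusters.
# Assumption: PWO = weight of overlapped items / total item weight, as a percentage.
from collections import Counter

def cluster_weights(transactions):
    """Item weights of a cluster: frequency of each item across its transactions."""
    weights = Counter()
    for t in transactions:
        weights.update(t)
    return weights

def pwo_similarity(cluster_a, cluster_b):
    """Assumed PWO: share of total weight contributed by items present in both clusters."""
    wa, wb = cluster_weights(cluster_a), cluster_weights(cluster_b)
    overlap = set(wa) & set(wb)
    overlap_weight = sum(wa[i] + wb[i] for i in overlap)
    total_weight = sum(wa.values()) + sum(wb.values())
    return 100.0 * overlap_weight / total_weight if total_weight else 0.0

# Example: two small clusters of categorical transactions (attribute=value items).
c1 = [{"cap=flat", "odor=none"}, {"cap=flat", "odor=foul"}]
c2 = [{"cap=flat", "odor=none"}, {"cap=bell", "odor=none"}]
print(pwo_similarity(c1, c2))  # 75.0 -> merge the clusters if above the overlap threshold
```

In the merging phase described above, such a percentage score would be compared against an overlap threshold (estimated from a small sample of the dataset) to decide whether two small, pure clusters should be merged.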