1887

Abstract

Clustering is an important data mining technique that groups similar data records. In this research, the problem of clustering categorical (or transactional) data is studied. An algorithm for clustering transactional data, called F-Tree (frequent tree), is proposed, and a new similarity measure function, called PWO (probability of the weights of overlapped items), is introduced. Then, a framework to detect the best similarity value for different datasets is developed. Based on the items' frequencies in transactions, the F-Tree algorithm first generates small but highly pure clusters. It then merges similar clusters using the PWO similarity measure function, which depends on the probability of weights of overlapped items. One advantage of PWO is that it outputs the percentage of similarity between items; this is a clearer and more accurate measure of similarity than other similarity measures. Our experimental evaluation on real categorical datasets such as (Mushrooms, KrVskp, Congressional Voting, Soybean-Large, Soybean-Small, Hepatitis, Zoo, Lenses, and Adult-Stretch) shows that: Firstly, F-Tree is effective in finding rare or interesting clusters, and produces clusters with higher purity than LargeItem, WCD and CLOPE. Secondly, PWO is more effective in measuring the similarity between categorical data than LargeItem and CLOPE. Thirdly, clustering using the similarity measure PWO with pre-defined number of clusters results in separate classes with a good purity of average 80% coverage of real classes. Fourthly, the overlap estimator perfectly estimates the value of the overlap threshold using a small sample of dataset of around 5%. Finally, it is seen that the process of merging pure and small clusters increases the purity of resulted clusters as well as reduces time of clustering better than the process of generating clusters directly from the dataset then refining clusters. Based on the items' frequencies in transactions, The F-Tree algorithm first generates small but highly pure clusters. It then merges similar clusters using the PWO similarity measure function, which depends on the probability of weights of overlapped items. One advantage of PWO is that it outputs the percentage of similarity between items; this is a clearer and more accurate measure of similarity than other similarity measures. Our experimental evaluation on real categorical datasets such as "Mushrooms, KrVskp, Congressional Voting, Soybean-Large, Soybean-Small, Hepatitis, Zoo, Lenses, and Adult-Stretch" shows that: Firstly, F-Tree is effective in finding rare or interesting clusters, and produces clusters with higher purity than LargeItem, WCD and CLOPE. Secondly, PWO is more effective in measuring the similarity between categorical data than LargeItem and CLOPE. Thirdly, Clustering using the similarity measure PWO with pre-defined number of clusters results in separate classes with a good purity of average 80% coverage of real classes. Fourthly, the overlap estimator perfectly estimates the value of the overlap threshold using a small sample of dataset of around 5 percent. Finally, It is seen that the process of merging pure and small clusters increases the purity of resulted clusters as well as reduces time of clustering better than the process of generating clusters directly from the dataset then refining clusters.

Loading

Article metrics loading...

/content/papers/10.5339/qfarf.2012.AESNP4
2012-10-01
2024-12-22
Loading full text...

Full text loading...

/content/papers/10.5339/qfarf.2012.AESNP4
Loading
This is a required field
Please enter a valid email address
Approval was a Success
Invalid data
An Error Occurred
Approval was partially successful, following selected items could not be processed due to error