Detecting Chronic Kidney Disease Using Machine Learning
- Publisher: Hamad bin Khalifa University Press (HBKU Press)
- Source: Qatar Foundation Annual Research Conference Proceedings, Volume 2016, Issue 1, March 2016, ICTSP1534
Abstract
Motivation Chronic kidney disease (CKD) refers to the gradual loss of kidney function over time, the kidneys' primary role being to filter the blood. Based on its severity, it can be classified into various stages, with the later ones requiring regular dialysis or a kidney transplant. CKD mostly affects patients suffering from complications of diabetes or high blood pressure and hinders their ability to carry out day-to-day activities. In Qatar, the rapidly changing lifestyle has led to an increase in the number of patients suffering from CKD. According to Hamad Medical Corporation [2], about 13% of Qatar's population suffers from CKD, whereas the global prevalence is estimated to be around 8–16% [3]. CKD can be detected at an early stage by simple tests that measure blood pressure, serum creatinine and urine albumin, which can help protect at-risk patients from complete kidney failure [1]. Our goal is to use machine learning techniques to build a classification model that predicts whether an individual has CKD based on parameters that capture health-related metrics such as age, blood pressure and specific gravity. By doing so, we can understand the different signals that indicate whether a patient is at risk of CKD and help them by referring them to preventive measures.
Dataset Our dataset was obtained from the UCI Machine Learning Repository [4] and contains records of about 400 individuals, of whom 250 had CKD and 150 did not. The data were collected at a hospital in southern India over a period of two months. In total there are 24 fields, of which 11 are numeric and 13 are nominal, i.e., they take one of a fixed set of categorical values. Some of the numeric fields are blood pressure, random blood glucose level, serum creatinine level, and sodium and potassium in mEq/L. Examples of nominal fields are answers to yes/no questions such as whether the patient suffers from hypertension, diabetes mellitus or coronary artery disease. Missing values in a few rows were addressed by imputing them with the mean value of the respective column. This ensures that the information in the entire dataset is leveraged to generate a model that best explains the data.
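As a concrete illustration, the sketch below shows how the mean-imputation step could be carried out with pandas, assuming the UCI data has been exported to a CSV file; the file name, the missing-value marker and the use of the column mode for nominal fields are our assumptions rather than details from the paper.

```python
import pandas as pd

# Load the UCI CKD data; the file name "ckd.csv" is illustrative, and "?" is a
# common missing-value marker in exports of this dataset.
df = pd.read_csv("ckd.csv", na_values=["?", "\t?"])

# Impute numeric columns with the column mean, as described above; nominal
# columns are filled with the most frequent value (our assumption, since a
# mean is undefined for categorical fields).
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
```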
Approach We approach this problem with two different machine learning tasks, namely classification and clustering. In classification, we build a model that can accurately predict whether a patient has CKD based on their health parameters. In order to understand whether people can be grouped together based on the presence of CKD, we also perform clustering on the dataset. Both approaches provide good insights into the patterns present in the underlying data.
Classification This problem can be modeled as a classification task in machine learning with two classes, CKD and not CKD, representing whether or not a person suffers from chronic kidney disease. Each person is represented by the set of features in the dataset described earlier. We also have ground-truth labels indicating whether a patient has CKD, which are used to train a model that learns to distinguish between the two classes. Our training set consists of 75% of the data and the remaining 25% is used for testing. The ratio of CKD to non-CKD persons in the test set was kept approximately the same as in the entire dataset to avoid problems of class skew. Various classification algorithms were employed, namely logistic regression, Support Vector Machines (SVM) with various kernels, decision trees and AdaBoost, so as to compare their performance. While training the models, stratified K-fold cross-validation was adopted, which ensures that each fold has the same proportion of class labels. Each classifier has a different methodology for learning. Some classifiers assign weights to each input feature along with a threshold that determines the output, and update them based on the training data. In the case of SVM, kernels map the input features into a different space in which the classes might be linearly separable. Decision tree classifiers have the advantage that they can be easily visualized, since a tree is analogous to a set of rules applied to an input feature vector. Each classifier has a different generalization capability, and its effectiveness depends on the underlying training and test data. Our aim is to discover the performance of each classifier on this type of medical information.
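The sketch below outlines this setup with scikit-learn: a 75/25 stratified split followed by stratified K-fold cross-validation over the classifiers listed above. The file name, label encoding, one-hot encoding of the nominal fields and the hyperparameters are illustrative assumptions, not details taken from the paper.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# Imputed data from the previous sketch; the file name and the "class" label
# column are assumptions about how the UCI dataset was exported.
df = pd.read_csv("ckd_imputed.csv")
y = (df["class"].str.strip() == "ckd").astype(int)      # 1 = CKD, 0 = not CKD
X = pd.get_dummies(df.drop(columns=["class"]))          # one-hot encode nominal fields

# 75/25 stratified split so the CKD / non-CKD ratio matches the full dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# Stratified K-fold cross-validation on the training split, as described above.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X_train, y_train, cv=cv)
    print(f"{name}: mean CV accuracy = {scores.mean():.2f}")
```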
Clustering Clustering involves organizing a set of items into groups based on a pre-defined similarity measure. It is an unsupervised learning method that does not use the label information. Among the various popular clustering algorithms, we use k-means and hierarchical clustering to analyze our data. K-means requires specifying the number of clusters and the initial cluster means, which are set to random points in the data. We vary the number of groups from 2 to 5 to determine which maximizes the quality of the clustering. Clustering with more than 2 groups might also allow us to quantify the severity of chronic kidney disease (CKD) for each patient instead of the binary notion of simply having CKD or not. In each iteration of k-means, each person is assigned to the nearest group mean according to the distance metric, and the mean of each group is then recalculated based on the updated assignments; the algorithm stops once the means converge. Hierarchical clustering follows another approach: initially each data point is a cluster by itself, and at every step the two closest clusters are merged into a bigger cluster. The distance metric used in both clustering methods is Euclidean distance. Hierarchical clustering does not require any assumption about the number of clusters, since the resulting output is a tree-like structure recording the clusters that were merged at every step; the clusters for a given number of groups can be obtained by cutting the tree at the desired level. We evaluate the quality of the clustering using a well-known criterion called purity, which measures the fraction of data points that fall into the majority ground-truth class of their cluster [5].
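A minimal sketch of both clustering procedures and the purity computation is shown below, reusing the X and y built in the classification sketch (our own variable names); scikit-learn's AgglomerativeClustering with Ward linkage, which uses Euclidean distance, stands in for the hierarchical clustering described.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def purity(labels_true, labels_pred):
    """Fraction of points that fall in the majority true class of their cluster."""
    total = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        total += np.bincount(members).max()
    return total / len(labels_true)

# X and y are assumed to be the one-hot encoded feature matrix and the 0/1
# labels built in the classification sketch above.
for k in range(2, 6):                                   # vary the number of groups from 2 to 5
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    hc = AgglomerativeClustering(n_clusters=k, linkage="ward").fit(X)  # Euclidean distance
    print(f"k={k}  purity(k-means)={purity(y.values, km.labels_):.2f}  "
          f"purity(hierarchical)={purity(y.values, hc.labels_):.2f}")
```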
Principal Component Analysis Principal Component Analysis (PCA) is a popular tool for dimensionality reduction. It reduces the number of dimensions of a vector by projecting the data onto the eigenvectors of the covariance matrix with the largest eigenvalues, i.e., the directions of maximum variance. We carry out PCA before K-means and hierarchical clustering so as to reduce the dimensionality of the data as well as to make it easier to visualize the cluster differences in a 2D plot.
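The reduction to two dimensions could look as follows; standardizing the features before PCA is our own addition, since PCA is sensitive to feature scale, and X is assumed to be the feature matrix from the earlier sketches.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize, then project onto the two leading principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)          # 2-D projection for plotting / clustering
print(pca.explained_variance_ratio_)        # variance captured by the two components
```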
Results Classification In total, six different classification algorithms were compared: logistic regression, decision tree, SVM with a linear kernel, SVM with an RBF kernel, random forest and AdaBoost. The last two fall under the category of ensemble methods, whose benefit is that they aggregate multiple learning algorithms to produce one that performs more robustly. The two types of ensemble learning methods used are averaging methods and boosting methods [6].
An averaging method typically outputs the average of several learning algorithms; the random forest classifier we used is one such method. A boosting method, on the other hand, “combines several weak models to produce a powerful ensemble” [6]; AdaBoost is the boosting method we used.
We found that the SVM with a linear kernel performed best, with 98% accuracy in predicting the labels of the test data. The next best performers were the two ensemble methods, the random forest classifier with 96% and AdaBoost with 95% accuracy, followed by logistic regression with 91% and the decision tree with 90%. The classifier with the lowest accuracy was the SVM with an RBF kernel, at about 60%. We believe the RBF kernel performed worse because the input features are already high dimensional and do not need to be mapped into a higher-dimensional space by RBF or other non-linear kernels. A Receiver Operating Characteristic (ROC) curve can also be plotted to compare the true positive rate and the false positive rate. We also plan to compute other evaluation metrics such as precision, recall and F-score. The results are promising, as the majority of the classifiers achieve a classification accuracy above 90%.
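For completeness, a sketch of how these evaluation metrics could be computed for the best-performing model is given below, reusing the train/test split from the earlier sketch; the variable names and the use of the SVM margin as the ROC score are our choices.

```python
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score, roc_curve)
from sklearn.svm import SVC

# Fit the linear-kernel SVM on the training split from the earlier sketch and
# evaluate it on the held-out 25% (labels assumed encoded as 1 = CKD, 0 = not CKD).
clf = SVC(kernel="linear").fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.decision_function(X_test)         # signed margin, used as the ROC score

print("accuracy :", accuracy_score(y_test, y_pred))
prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="binary")
print("precision:", prec, " recall:", rec, " F-score:", f1)

fpr, tpr, _ = roc_curve(y_test, y_score)        # points of the ROC curve
print("ROC AUC  :", roc_auc_score(y_test, y_score))
```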
After classifying the test dataset, feature analysis was performed to compare the importance of each feature. The most important features across the classifiers were albumin level and serum creatinine. The logistic regression classifier also included the ‘pedal edema’ feature in addition to these two, while the red blood cell feature was identified as important by the decision tree and AdaBoost classifiers.
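One way such a comparison could be made is sketched below: coefficient magnitudes for logistic regression and impurity-based importances for the random forest, again reusing names from the earlier sketches (ideally the features would be standardized before comparing coefficient magnitudes).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Fit two of the classifiers on the training split from the earlier sketch.
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

features = X.columns
# Largest absolute coefficients for logistic regression ...
top_lr = features[np.argsort(np.abs(logreg.coef_[0]))[::-1][:5]]
# ... and highest impurity-based importances for the random forest.
top_rf = features[np.argsort(forest.feature_importances_)[::-1][:5]]
print("logistic regression:", list(top_lr))
print("random forest      :", list(top_rf))
```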
Clustering After performing K-means clustering on the entire dataset, we plotted the results on a 2D graph, since PCA had been used to reduce the data to two dimensions. The purity score of our clustering is 0.62; a higher purity score (with a maximum value of 1.0) indicates better clustering quality. The hierarchical clustering plot provides the flexibility to view more than 2 clusters, since there might be gradations in the severity of CKD among patients rather than the simple binary representation of having CKD or not. Multiple clusters can be obtained by cutting the hierarchical tree at the desired level.
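Cutting the tree at different levels could be done with SciPy as sketched below, applied here to the 2-D PCA projection from the earlier sketch; the choice of Ward linkage and the range of cluster counts are our assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Build the hierarchical tree on the 2-D PCA projection and cut it at
# different levels to obtain 2 to 5 clusters.
Z = linkage(X_2d, method="ward")                     # Ward linkage, Euclidean distance
for k in range(2, 6):
    labels = fcluster(Z, t=k, criterion="maxclust")  # cut the tree into k clusters
    print(k, np.bincount(labels)[1:])                # cluster sizes
```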
Conclusions We currently live in the big data era. There is an enormous amount of data being generated from various sources across all domains. Some of them include DNA sequence data, ubiquitous sensors, MRI/CAT scans, astronomical images etc. The challenge now is being able to extract useful information and create knowledge using innovative techniques to efficiently process the data. Due to this data deluge phenomenon, machine learning and data mining have gained strong interest among the research community. Statistical analysis on healthcare data has been gaining momentum since it has the potential to provide insights that are not obvious and can foster breakthroughs in this area.
This work combines the fields of computer science and health by applying techniques from statistical machine learning to healthcare data. Chronic kidney disease (CKD) affects a sizable percentage of the world's population. If detected early, its adverse effects can be avoided, saving precious lives and reducing costs. We have built a model from labeled data that accurately predicts whether a patient suffers from chronic kidney disease based on their personal characteristics.
Our future work would be to include a larger dataset consisting of thousands of patients and a richer set of features, which would improve the model by capturing greater variation. We also aim to use topic models such as Latent Dirichlet Allocation to group medical features into topics so as to understand the interactions between them. Greater encouragement of such interdisciplinary work is needed in order to tackle grand challenges and, in this case, realize the vision of evidence-based healthcare and personalized medicine.
References
[1] https://www.kidney.org/kidneydisease/aboutckd
[3] http://www.ncbi.nlm.nih.gov/pubmed/23727169
[4] https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease
[5] http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html