🧑🏻‍⚕️ 003. Chronic Kidney Disease Classification with Decision Models

TL;DR

  • Analysis and cleaning of a CKD ARFF dataset in four versions using imputation, feature selection, PCA, and SMOTE.
  • Training and hyperparameter tuning for eight classifiers (k-NN, decision trees, Random Forest, Gradient Boosting, logistic regression, SVM, MLP, and XGBoost).
  • Comprehensive comparison of metrics (accuracy, precision, recall, F1-score, AUC) with automatic confusion matrix and ROC/Precision-Recall curve generation.

Ever wondered how to turn a raw medical dataset into a useful machine learning project? In this post, we’ll work with the well-known Chronic Kidney Disease (CKD) dataset, which includes clinical records of Indian patients with and without chronic kidney disease. I’ll guide you step by step so you understand not only the “what” but also the “why” behind every preprocessing and modeling decision.


We’ll start by fixing the ARFF format: removing extra commas, replacing question marks with NaN, and loading the data into a pandas DataFrame. Then we’ll explore missing value counts and variable types (numeric vs. categorical) to plan our cleaning strategies.
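As a minimal sketch of that cleaning step, here is how the trailing commas and `?` placeholders can be handled while loading. The three sample rows and column names below are illustrative stand-ins, not the real file (the actual CKD dataset has 400 rows and 25 columns):

```python
import io
import pandas as pd

# Hypothetical 3-row sample mimicking the ARFF data section's quirks:
# a stray trailing comma on each row and '?' for missing values.
raw = """48,80,1.020,?,normal,ckd,
7,50,?,4.0,normal,ckd,
62,80,1.010,2.0,?,notckd,
"""

# Strip the stray trailing comma on each data row, then let pandas
# treat '?' as a missing value while parsing.
cleaned = "\n".join(line.rstrip(",") for line in raw.splitlines())
cols = ["age", "bp", "sg", "su", "rbc", "class"]  # illustrative names
df = pd.read_csv(io.StringIO(cleaned), header=None, names=cols, na_values="?")

print(df.isna().sum().sum())  # total missing values
print(df.dtypes)              # numeric vs. categorical columns
```

Inspecting `df.isna().sum()` and `df.dtypes` at this point gives exactly the missing-value counts and type split that the cleaning strategies below depend on.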


Next, we’ll create four different versions of the same dataset. Version one drops all rows with any missing value. Version two imputes numeric features with the median and categorical ones with the mode. Version three removes columns with over 30% missing data before imputing and encoding. Version four applies PCA for dimensionality reduction and SMOTE for class balancing.
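The first three versions can be sketched in a few lines of pandas. The toy DataFrame below is a made-up stand-in for the loaded CKD data, chosen so that one column exceeds the 30% missing-data threshold; version four is only outlined in a comment, since PCA and SMOTE (from `imbalanced-learn`) need encoded, scaled features first:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the loaded CKD DataFrame.
df = pd.DataFrame({
    "age": [48, np.nan, 62, 51],
    "bp":  [80, 50, np.nan, 70],
    "rbc": ["normal", np.nan, np.nan, "normal"],  # 50% missing -> dropped in v3
    "class": ["ckd", "ckd", "notckd", "notckd"],
})

def impute(frame):
    # Median for numeric columns, mode for categorical ones.
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Version 1: drop every row containing any missing value.
v1 = df.dropna()

# Version 2: keep all rows, impute instead of dropping.
v2 = impute(df)

# Version 3: drop columns with >30% missing data, then impute the rest.
keep = df.columns[df.isna().mean() <= 0.30]
v3 = impute(df[keep])

# Version 4 (not shown): encode and scale the features, then apply PCA
# for dimensionality reduction and SMOTE to balance the two classes.
print(v1.shape, int(v2.isna().sum().sum()), list(v3.columns))
```

Each version trades information loss against bias differently: version one is the strictest (fewest rows, no invented values), while versions two and three preserve sample size at the cost of imputed estimates.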


With each dataset prepared, we’ll train eight classifiers: k-NN, Decision Tree, Random Forest, Gradient Boosting, logistic regression, SVM, Multilayer Perceptron, and XGBoost. We’ll use GridSearchCV to fine-tune hyperparameters and validate using cross-validation to ensure each model performs at its best.
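The tuning loop looks the same for each of the eight models; here is one instance of the pattern using Random Forest. The synthetic data and the parameter grid are illustrative assumptions, not the post's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for one of the cleaned dataset versions.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Illustrative grid; the real search space would be larger.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,           # 5-fold cross-validation on the training split
    scoring="f1",
    n_jobs=-1,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated F1
```

Swapping in `KNeighborsClassifier`, `SVC`, `MLPClassifier`, and so on with their own grids repeats the same fit/score cycle, so the whole comparison can be driven by one loop over (estimator, grid) pairs.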


During training, we’ll extract key metrics (accuracy, precision, recall, F1-score, AUC) and generate confusion matrices, ROC curves, and precision-recall curves. All reports and plots will be saved automatically in an organized folder structure for easy comparison.


Finally, we’ll compare model and dataset version performance to identify which preprocessing and algorithm combination offers the best balance between accuracy and generalization. By the end, you’ll have a clear view of best practices for handling real clinical data and selecting the most robust model for classification.
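One convenient way to do that comparison is to collect every (dataset version, model) result into a single DataFrame and pivot it. The scores below are hypothetical placeholders, not results from the actual runs:

```python
import pandas as pd

# Hypothetical results; real values would come from the training loop.
results = pd.DataFrame([
    {"dataset": "v1", "model": "RandomForest", "f1": 0.97, "auc": 0.99},
    {"dataset": "v2", "model": "RandomForest", "f1": 0.98, "auc": 0.99},
    {"dataset": "v2", "model": "k-NN",         "f1": 0.94, "auc": 0.96},
    {"dataset": "v4", "model": "XGBoost",      "f1": 0.99, "auc": 1.00},
])

# Pivot to see every model/dataset-version combination at a glance,
# then rank the combinations by F1 to pick the best pairing.
pivot = results.pivot(index="model", columns="dataset", values="f1")
ranked = results.sort_values("f1", ascending=False)
print(pivot)
print(ranked.head(1)[["dataset", "model"]].to_dict("records"))
```

The pivot table exposes which gains come from the preprocessing version versus the algorithm itself, which is exactly the accuracy-versus-generalization trade-off discussed above.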

📄 Full Document

Download detailed PDF