Explainable Clinical Risk Prediction from EHR Tabular Data using Monotonic Constraints and Calibrated Probabilities
DOI: https://doi.org/10.55606/jurrsendem.v2i1.9197
Keywords: Clinical Risk Prediction, Monotonic Constraints, Probability Calibration, SHAP, XGBoost
Abstract
Clinical risk prediction models built on tabular data are widely used in medical decision support systems; however, two problems often reduce their reliability: predictions that contradict basic clinical logic, and poorly calibrated probability outputs that weaken threshold-based decision making. This study investigates explainable binary risk prediction using the processed Cleveland subset of the UCI Heart Disease dataset as a public clinical benchmark. A lightweight, CPU-efficient pipeline is proposed: an XGBoost classifier with monotonic constraints on clinically relevant features, followed by post-hoc probability calibration (Platt scaling, temperature scaling, and isotonic regression) fitted on a separate validation set. Discrimination is assessed using AUROC, AUPRC, F1-score, sensitivity, and specificity, while probability reliability is evaluated using expected calibration error (ECE) and the Brier score. A monotonicity audit based on counterfactual feature sweeps measures violation rates. The model is also used for risk stratification into low-, medium-, and high-risk categories, with the observed event rate reported for each. The findings show that isotonic regression improves probability reliability without degrading discrimination, and the monotonicity audit finds no violations for the constrained features. Overall, combining monotonic constraints with probability calibration produces more decision-ready risk estimates for threshold-based clinical decision support, while SHAP-based analysis maintains transparency.
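To make the evaluation terms in the abstract concrete, the following is a minimal Python sketch of the Brier score, an equal-width-bin ECE, and a counterfactual feature sweep that reports a monotonicity violation rate. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice of 10 bins, the tolerance, and the toy logistic scorer are all hypothetical, and the scorer merely stands in for a trained XGBoost model.

```python
import numpy as np


def brier_score(y_true, p):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.mean((p - y_true) ** 2))


def expected_calibration_error(y_true, p, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |event rate - mean prediction|."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins on [0, 1];
    # p == 1.0 is folded into the last bin.
    bin_ids = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - p[mask].mean())
            ece += mask.mean() * gap  # weight by the share of samples in the bin
    return float(ece)


def monotonic_violation_rate(predict, X, feature, grid, direction=1, tol=1e-9):
    """Fraction of rows whose prediction moves against `direction` at any
    step while column `feature` is swept over the increasing `grid`."""
    X = np.asarray(X, dtype=float)
    preds = []
    for value in grid:
        X_cf = X.copy()
        X_cf[:, feature] = value  # counterfactual: overwrite one feature only
        preds.append(predict(X_cf))
    preds = np.stack(preds, axis=1)  # shape (n_rows, len(grid))
    steps = direction * np.diff(preds, axis=1)
    return float(np.mean((steps < -tol).any(axis=1)))


# Hypothetical stand-in for a fitted model: risk rises with column 0
# and falls with column 1.
def toy_predict(X):
    return 1.0 / (1.0 + np.exp(-(0.8 * X[:, 0] - 0.3 * X[:, 1])))


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
p_demo = toy_predict(X_demo)
y_demo = (rng.random(200) < p_demo).astype(int)
grid = np.linspace(-2.0, 2.0, 9)

print("Brier:", brier_score(y_demo, p_demo))
print("ECE:", expected_calibration_error(y_demo, p_demo))
print("Violations, feature 0:", monotonic_violation_rate(toy_predict, X_demo, 0, grid))
```

In a real pipeline, `predict` would wrap the calibrated model's positive-class probability, and `direction` would be set to +1 or -1 per feature to match the sign of the constraint imposed during training.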
License
Copyright (c) 2023 Danang Danang, Toni Wijanarko Adi Putra

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.






