Results to date

Publications

Journal Publications

Lung cancer is the most common cause of cancer deaths in the UK, emphasizing the critical need for early diagnosis. Survival rates vary significantly according to the stage at diagnosis. This study aims to develop machine learning models to distinguish between lung cancer and non-lung cancer cases using data from the Clinical Practice Research Datalink (CPRD), which contains UK primary care records. Both interpretable and post hoc explainable approaches are explored, including RuleFit, a rule-based method; decision tree, an inherently interpretable model; and random forest and eXtreme Gradient Boosting, tree-based ensemble models. Model performance is assessed using metrics such as accuracy, Area Under the Receiver Operating Characteristic Curve, sensitivity, and specificity. The models performed similarly across all measures. Additionally, SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME) are employed to enhance model interpretability. These insights contribute to a better understanding of the leading risk factors for lung cancer. Using SHAP, it is found that age and smoking status play a crucial role in lung cancer prediction for all tree-based models. LIME is then used to evaluate individual-level explanations and identify any discrepancies between the explanations produced by different models. Our study combines robust evaluation with prominent interpretability techniques to gain valuable insights into lung cancer prediction.

https://doi.org/10.1007/978-3-031-91379-2_6
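As background for the SHAP results above: SHAP attributions are grounded in Shapley values from cooperative game theory. The sketch below (not from the paper; the toy `predict` model and the zero baseline are hypothetical) computes exact Shapley values by enumerating coalitions, which is feasible only for a handful of features:

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley values for a single prediction. Features in a
    coalition take their value from `instance`; absent features are
    filled from `baseline` (a crude stand-in for marginalising over
    the data distribution, as SHAP does)."""
    n = len(instance)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for coalition in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [instance[j] if j in coalition or j == i else baseline[j]
                          for j in range(n)]
                without_i = [instance[j] if j in coalition else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi

# Toy additive "risk model": contributions sum exactly, so the Shapley
# values recover each term (the efficiency property).
predict = lambda x: 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[2]
phi = shapley_values(predict, instance=[1, 1, 1], baseline=[0, 0, 0])
print(phi)  # ≈ [2.0, 1.0, 0.5]
```

Libraries such as `shap` approximate this computation efficiently for tree ensembles; the enumeration above is exponential in the number of features and is meant only to show what is being estimated.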

Introduction: The need for eXplainable Artificial Intelligence (XAI) in healthcare is more critical than ever, especially as regulatory frameworks such as the European Union Artificial Intelligence (EU AI) Act mandate transparency in clinical decision support systems. Post hoc XAI techniques such as Local Interpretable Model-Agnostic Explanations (LIME), SHapley Additive exPlanations (SHAP) and Partial Dependence Plots (PDPs) are widely used to interpret Machine Learning (ML) models for disease risk prediction, particularly in tabular Electronic Health Record (EHR) data. However, their reliability under real-world scenarios is not fully understood. Class imbalance is a common challenge in many real-world datasets, but it is rarely accounted for when evaluating the reliability and consistency of XAI techniques.

Methods: In this study, we design a comparative evaluation framework to assess the impact of class imbalance on the consistency of model explanations generated by LIME, SHAP, and PDPs. Using UK primary care data from the Clinical Practice Research Datalink (CPRD), we train three ML models: XGBoost (XGB), Random Forest (RF), and Multi-layer Perceptron (MLP), to predict lung cancer risk and evaluate how interpretability is affected under class imbalance when compared against a balanced dataset. To our knowledge, this is the first study to evaluate explanation consistency under class imbalance across multiple models and interpretation methods using real-world clinical data.

Results: Our main finding is that class imbalance in the training data can significantly affect the reliability and consistency of LIME and SHAP explanations when evaluated against models trained on balanced data. To explain these empirical findings, we also present a theoretical analysis of LIME and SHAP to understand why explanations change under different class distributions. It is also found that PDPs exhibit noticeable variation between models trained on imbalanced and balanced datasets with respect to clinically relevant features for predicting lung cancer risk.

Discussion: These findings highlight a critical vulnerability in current XAI techniques: their explanations can change significantly under skewed class distributions, which are common in medical data. This emphasises the importance of consistent model explanations for trustworthy ML deployment in healthcare.

https://doi.org/10.3389/frai.2025.1682919
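For reference, a Partial Dependence Plot summarises a model by averaging its predictions over the data while one feature is held at fixed grid values; class imbalance enters through the fitted model, not this averaging step. A minimal stand-alone sketch (the toy model and data are illustrative, not the study's):

```python
def partial_dependence(predict, data, feature, grid):
    """One-dimensional partial dependence: for each grid value v, set
    `feature` to v in every row and average the model's predictions."""
    curve = []
    for v in grid:
        preds = [predict(row[:feature] + [v] + row[feature + 1:]) for row in data]
        curve.append(sum(preds) / len(preds))
    return curve

# Toy model with an interaction term; the PDP for feature 0 averages
# the interaction out because feature 1 is symmetric around zero.
predict = lambda x: x[0] + x[0] * x[1]
data = [[0.0, -1.0], [0.0, 1.0]]
pdp = partial_dependence(predict, data, feature=0, grid=[0.0, 1.0, 2.0])
print(pdp)  # [0.0, 1.0, 2.0]
```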

Background: Federated learning (FL) is a rapidly advancing technique that enables collaborative model training while preserving data privacy. This approach is particularly relevant in healthcare, where privacy concerns and regulatory restrictions often prevent centralized data sharing. FL has shown promise in tasks such as disease detection, achieving performance levels comparable to centralized systems. However, its practical usability in real-world applications remains underexplored.
Methods: We evaluate the practical effectiveness of FL in predicting whether patients suspected of prostate cancer require invasive biopsy procedures. The study uses 14 publicly available prostate cancer datasets from 10 countries. We propose and benchmark a novel FL evaluation strategy, Leave-Silo-Out (LSO), which quantifies the performance gap between federated training and free-riding (utilizing the federated model without contributing data). Additionally, we investigate whether locally trained models can outperform multi-hospital FL models. The results are assessed with a focus on improving the diagnosis of local patients.
Results: Our findings reveal that the benefits of FL vary with the amount of locally available annotated data. Hospitals with very small datasets see negligible improvements from FL compared to free-riding. Institutions with moderate datasets may achieve some gains through FL training. However, hospitals with extensive datasets often experience little to no advantage from FL and, in some cases, observe reduced performance compared to local training.
Conclusion: Federated learning shows potential in scenarios with limited data availability. However, its practical applicability is highly context-dependent, influenced by factors such as data availability and specific task requirements.

https://doi.org/10.1016/j.ijmedinf.2025.106046
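The Leave-Silo-Out idea described above can be sketched in a few lines. This is a simplified stand-in, not the paper's implementation: federated training is proxied by pooling the other silos' data, and the `train`/`evaluate` toy functions are hypothetical:

```python
def leave_silo_out(silos, train, evaluate):
    """Leave-Silo-Out: for each silo, compare (a) a model trained on the
    other silos' pooled data and evaluated on the held-out silo (a proxy
    for free-riding on the federation) with (b) a model trained only on
    the silo's own data."""
    report = {}
    for name, local_data in silos.items():
        others = [row for s, d in silos.items() if s != name for row in d]
        report[name] = {
            "free_ride": evaluate(train(others), local_data),
            "local": evaluate(train(local_data), local_data),
        }
    return report

# Toy stand-ins: rows are (x, label) pairs and a "model" is simply the
# majority label of its training data.
def train(rows):
    labels = [y for _, y in rows]
    return max(set(labels), key=labels.count)

def evaluate(model, rows):
    return sum(1 for _, y in rows if y == model) / len(rows)

silos = {"A": [(0, 1), (1, 1), (2, 0)], "B": [(3, 0), (4, 0), (5, 0)]}
print(leave_silo_out(silos, train, evaluate))
```

In a real deployment the "federated" arm would be an actual FL training run rather than pooled data, but the accounting — held-out-silo performance versus local-only performance — is the same.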

We thank you for the opportunity to respond to the commentary letter by Dehaene et al on our recent article, “Does differentially private synthetic data lead to synthetic discoveries?”[1] published in Methods of Information in Medicine. We appreciate the commentators’ interest in our work and their contribution to an important and ongoing discussion on the utility of synthetic data and its implications for statistical inference.

The letter from Dehaene et al raises a concern about two possible interpretations of the results in our article, namely that the risk of unacceptably high false-positive findings from synthetic data can be countered simply by increasing the amount of original data enough, or by stepping away from differentially private (DP) synthetization methods. Referring to simulation results in Decruyenaere et al,[2] they note that even for non-DP methods and large original sample sizes, this risk can remain high, especially when using deep learning-based generation methods. We find that Dehaene et al raise an important point, and their observations are also compatible with our results. While reducing the amount of DP noise and increasing the original sample size are positively correlated with the utility of generated synthetic data, these alone are not enough if the generator is a misspecified parametric model or suffers from what Decruyenaere et al[2] refer to as the regularization bias.

As the authors note, citing Chen et al: “synthetic data are artificial data that (attempt to) mimic the original data in terms of statistical properties, without revealing individual records.”[3] Obviously, if privacy were not a concern and no reliable prior information on the true distribution of the data were available, this would be achieved simply by using the original data. Indeed, some DP data release methods reconstruct the original data in the limit of epsilon approaching infinity. In our experiments, the DP perturbed and DP smoothed histograms have such properties. Accordingly, these methods demonstrate a clear trade-off between similarity to the original data, privacy level, and the amount of original data, with the inferential utility of the synthetic data typically increasing both with respect to the original sample size and inversely with respect to the privacy level. On the other hand, the synthetic data generated by the Multiplicative Weights Exponential Mechanism (MWEM) and Private-PGM (Private-Probabilistic Graphical Model) may diverge from the distribution of the original data in the limit, due to approximating higher-dimensional data with low-dimensional marginals. Hence, the trade-off may be less clear if the statistical property of interest changes not only due to the privacy level but also due to the approximation. In some of our results, this is reflected by the utility increasing as a function of decreasing privacy level only up to a certain limit, without achieving the utility of the original data. A similar effect can take place if the synthetization methods make incorrect parametric assumptions. At the other extreme of this continuum of methods, there are synthesizers whose regularization bias serves purposes other than reproducing the original data. For example, in our experiments, the DP GAN method behaved very differently from the other methods, and the risk of false discoveries even increased as a function of decreasing privacy level.

Accordingly, we agree with the main message of Dehaene et al that the inferential utility level of the original data is not necessarily achieved simply by decreasing the privacy level or with larger amounts of original data, but is very method-dependent. Hence, caution is certainly always warranted when performing statistical inference on synthetic data, with different methods having different trade-offs and some demonstrating systematic biases that are not easy to counter.

https://doi.org/10.1055/a-2540-8284

Explainable artificial intelligence (XAI) has gained much interest in recent years for its ability to explain the complex decision-making process of machine learning (ML) and deep learning (DL) models. The Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) frameworks have become popular interpretive tools for ML and DL models. This article provides a systematic review of the application of LIME and SHAP in interpreting the detection of Alzheimer’s disease (AD). Adhering to PRISMA and Kitchenham’s guidelines, we identified 23 relevant articles and investigated these frameworks’ prospective capabilities, benefits, and challenges in depth. The results emphasise XAI’s crucial role in strengthening the trustworthiness of AI-based AD predictions. This review aims to present the fundamental capabilities of the LIME and SHAP XAI frameworks in enhancing fidelity within clinical decision support systems for AD prognosis.

https://doi.org/10.1186/s40708-024-00222-1

Differentially private (DP) synthetic data has emerged as a potential solution for sharing sensitive individual-level biomedical data. DP generative models offer a promising approach for generating realistic synthetic data that aims to maintain the original data’s central statistical properties while ensuring privacy by limiting the risk of disclosing sensitive information about individuals. However, how to assess the expected real-world prediction performance of machine learning models trained on synthetic data remains an open question. In this study, we experimentally evaluate two different model evaluation protocols for classifiers trained on synthetic data. The first protocol employs solely synthetic data for downstream model evaluation, whereas the second protocol assumes limited DP access to a private test set consisting of real data managed by a data curator. We also propose a metric for assessing how well the evaluation results of the proposed protocols match the real-world prediction performance of the models. The assessment measures both the systematic error component, indicating how optimistic or pessimistic the protocol is on average, and the random error component, indicating the variability of the protocol’s error. The results of our study suggest that employing the second protocol is advantageous, particularly in biomedical health studies where the precision of the research is of utmost importance. Our comprehensive empirical study offers new insights into the practical feasibility and usefulness of different evaluation protocols for classifiers trained on DP-synthetic data.

https://doi.org/10.1109/ACCESS.2024.3446913
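A metric of the kind described above, separating an evaluation protocol's systematic and random error components, might look as follows. This is an illustrative sketch under our own formulation, not necessarily the paper's exact definition, and the numbers in the example are made up:

```python
def protocol_error(estimates, real_performances):
    """Decompose a protocol's evaluation error across repeated experiments
    into a systematic component (mean signed error: positive means the
    protocol is optimistic on average) and a random component (standard
    deviation of the error)."""
    errors = [e - r for e, r in zip(estimates, real_performances)]
    bias = sum(errors) / len(errors)
    spread = (sum((d - bias) ** 2 for d in errors) / len(errors)) ** 0.5
    return bias, spread

# Hypothetical accuracies: what the protocol reported vs. what the
# classifier actually achieved on real held-out data.
bias, spread = protocol_error([0.82, 0.85, 0.80], [0.78, 0.80, 0.79])
print(round(bias, 3), round(spread, 3))
```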

Background Synthetic data have been proposed as a solution for sharing anonymized versions of sensitive biomedical datasets. Ideally, synthetic data should preserve the structure and statistical properties of the original data, while protecting the privacy of the individual subjects. Differential Privacy (DP) is currently considered the gold standard approach for balancing this trade-off.

Objectives The aim of this study is to investigate how trustworthy the group differences discovered by independent sample tests from DP-synthetic data are. The evaluation is carried out in terms of the tests’ Type I and Type II errors. With the former, we can quantify the tests’ validity, i.e., whether the probability of false discoveries is indeed below the significance level, while the latter indicates the tests’ power in making real discoveries.

Methods We evaluate the Mann–Whitney U test, Student’s t-test, chi-squared test, and median test on DP-synthetic data. The private synthetic datasets are generated from real-world data, including a prostate cancer dataset (n = 500) and a cardiovascular dataset (n = 70,000), as well as from bivariate and multivariate simulated data. Five different DP-synthetic data generation methods are evaluated, including two basic DP histogram release methods and the MWEM, Private-PGM, and DP GAN algorithms.

Conclusion A large portion of the evaluation results exhibited dramatically inflated Type I errors, especially at levels of ϵ ≤ 1. This result calls for caution when releasing and analyzing DP-synthetic data: low p-values may be obtained in statistical tests simply as a byproduct of the noise added to protect privacy. A DP Smoothed Histogram-based synthetic data generation method was shown to produce valid Type I error for all privacy levels tested but required a large original dataset size and a modest privacy budget (ϵ ≥ 5) in order to have reasonable Type II error levels.

https://doi.org/10.1055/a-2385-1355
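The DP perturbed histogram mechanism referenced above can be sketched with the Laplace mechanism. A minimal illustration for a single categorical attribute (the binning and post-processing choices here are our own simplifications, not the paper's exact procedure):

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) via the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * (1.0 if u >= 0 else -1.0) * math.log(1.0 - 2.0 * abs(u))

def dp_synthetic_counts(values, n_categories, epsilon, rng):
    """DP perturbed histogram: a histogram has L1 sensitivity 1 under
    add/remove neighbouring datasets, so Laplace(1/epsilon) noise per bin
    gives epsilon-DP; negative bins are clipped and synthetic records are
    resampled from the normalised noisy histogram."""
    counts = [0] * n_categories
    for v in values:
        counts[v] += 1
    noisy = [max(0.0, c + laplace_noise(1.0 / epsilon, rng)) for c in counts]
    total = sum(noisy)
    return rng.choices(range(n_categories),
                       weights=[c / total for c in noisy], k=len(values))

rng = random.Random(0)
original = [0] * 80 + [1] * 20          # a skewed binary attribute
synthetic = dp_synthetic_counts(original, n_categories=2, epsilon=1.0, rng=rng)
print(len(synthetic), sorted(set(synthetic)))
```

Even this simple mechanism illustrates the paper's warning: at small ϵ the noisy counts can drift far from the true proportions, so tests run on the resampled data can "discover" differences that are pure noise.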

Methylation is considered one of the most important post-translational modifications (PTMs) of proteins. Plasticity and cellular dynamics are among the many traits regulated by methylation. Currently, methylation sites are identified using experimental approaches. However, these methods are time-consuming and expensive. With the use of computational modelling, methylation sites can be identified quickly and accurately, providing valuable information for further experimental investigation. In this study, we propose a new machine-learning model called MeSEP to predict methylation sites that incorporates both evolutionary and structural information. To build this model, we first extract evolutionary and structural features from the PSSM and SPD2 profiles, respectively. We then employ Extreme Gradient Boosting (XGBoost) as the classification model to predict methylation sites. To address the issue of imbalanced data and bias towards negative samples, we use the SMOTETomek-based hybrid sampling method. MeSEP was validated on an independent test set (ITS) and with 10-fold cross-validation (TCV) using lysine methylation sites. The method achieved: an accuracy of 82.9% in ITS and 84.6% in TCV; a precision of 0.92 in ITS and 0.94 in TCV; area under the curve values of 0.90 in ITS and 0.92 in TCV; an F1 score of 0.81 in ITS and 0.83 in TCV; and an MCC of 0.67 in ITS and 0.70 in TCV. MeSEP significantly outperformed previous studies found in the literature. MeSEP as a standalone toolkit and all its source codes are publicly available at https://github.com/arafatro/MeSEP.

https://doi.org/10.1007/s12559-024-10268-2
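The SMOTETomek hybrid sampling used above combines SMOTE oversampling with Tomek-link cleaning. The core SMOTE interpolation step can be sketched as follows (the Tomek cleaning step is omitted, and this is an illustrative toy rather than the MeSEP implementation):

```python
import math
import random

def smote_oversample(minority, n_new, k, rng):
    """SMOTE core idea: each synthetic sample lies on the line segment
    between a random minority point and one of its k nearest minority
    neighbours, at a random interpolation fraction."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # 0 -> stay at base, 1 -> reach the neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, nb)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_oversample(minority, n_new=5, k=2, rng=random.Random(1))
print(new_points)  # all lie within the convex hull of the minority points
```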

Transformers have dominated the landscape of Natural Language Processing (NLP) and revolutionized generative AI applications. Vision Transformers (VTs) have recently become the new state of the art for computer vision applications. Motivated by the success of VTs in capturing short- and long-range dependencies and their ability to handle class imbalance, this paper proposes an ensemble framework of VTs for the efficient classification of Alzheimer’s Disease (AD). The framework consists of four vanilla VTs and ensembles formed using hard- and soft-voting approaches. The proposed model was tested using two popular AD datasets: OASIS and ADNI. The ADNI dataset was employed to assess the models’ efficacy under imbalanced and data-scarce conditions. The ensembles of VTs saw an improvement of around 2% compared to the individual models. Furthermore, the results are compared with state-of-the-art and custom-built Convolutional Neural Network (CNN) architectures and Machine Learning (ML) models under varying data conditions. The experimental results demonstrated an overall performance gain of 4.14% and 4.72% accuracy over the ML and CNN algorithms, respectively. The study has also identified specific limitations and proposes avenues for future research. The codes used in the study are made publicly available.

https://doi.org/10.1186/s40708-024-00238-7

In many countries around the world, the healthcare sector is facing difficult problems: the aging population needs more care while the workforce is not growing, the cost of treatments is going up, and increasingly technical medical products pose serious challenges to the expertise of healthcare professionals. At the same time, the field of artificial intelligence (AI) is making big leaps, and naturally, AI has been suggested as a remedy to these problems. In this article, we discuss some of the ethical and legal problems facing AI in the healthcare field, with a case study of European Union (EU) regulations and the local laws in one EU member state, Finland. We also look at some of the directions in which AI research in medicine will develop in the next 3–10 years. In particular, Large Language Models (LLMs) and image analysis are used as examples. The potential of AI is huge, and in many fields that potential has already become a reality, but in medicine, obstacles remain. We discuss both technical and regulatory questions related to the expansion of AI techniques used in the clinical environment.

https://doi.org/10.5772/intechopen.1007443

Conference Publications

Chest CT scans are essential in diagnosing lung abnormalities, including lung cancer, but their utility in training deep learning models is often limited by data availability, high labeling costs, and privacy concerns. To address these challenges, this study explores the use of score-based diffusion models for the conditional generation of lung CT scan slices. Two generation scenarios are explored: one limited to lung segmentation masks and another incorporating both lung and nodule segmentation mappings to guide the synthesis process. The proposed methods are custom U-Net architecture models trained to predict the scores in Variance Preserving (VP) and Variance Exploding (VE) Stochastic Differential Equations (SDEs), forming the primary basis for comparison in conditional sample generation. The results demonstrate the VP SDE model’s superiority in generating high-fidelity images, as evidenced by high SSIM (0.894) and PSNR (28.6) values, as well as low domain-specific FID (173.4), MMD (0.0133) and ECS (0.78) scores. The generated images consistently followed the conditional mapping guidance during the generation process, effectively producing realistic lung and nodule structures and highlighting their potential for data augmentation in medical imaging tasks. While the models achieved notable success in generating accurate 2D lung CT scan slices from simple conditional image region mappings, future work will extend these methods to 3D conditional generation and to richer conditional mappings that account for broader anatomical variations. Nevertheless, this study holds promise for improving computer-aided systems by supporting deep learning model training for lung disease diagnosis and classification.

https://doi.org/10.1109/EMBC58623.2025.11254813
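For intuition about the Variance Preserving SDE mentioned above, its discretised forward (noising) process can be sketched in a few lines; the score network, conditioning masks, and reverse-time sampling of the actual method are omitted, and the schedule below is an arbitrary illustration:

```python
import math
import random

def vp_forward_noising(x0, betas, rng):
    """Discretised Variance Preserving forward process: at each step
    x <- sqrt(1 - beta) * x + sqrt(beta) * z with z ~ N(0, 1), so the
    signal decays while the total variance stays near 1 (hence
    "variance preserving")."""
    x = list(x0)
    for beta in betas:
        x = [math.sqrt(1.0 - beta) * xi + math.sqrt(beta) * rng.gauss(0.0, 1.0)
             for xi in x]
    return x

# Start from a constant "image" of ones; after enough steps the samples
# are close to standard Gaussian noise.
rng = random.Random(0)
x_T = vp_forward_noising([1.0] * 2000, betas=[0.02] * 200, rng=rng)
mean = sum(x_T) / len(x_T)
var = sum((v - mean) ** 2 for v in x_T) / len(x_T)
print(round(mean, 2), round(var, 2))  # signal mostly gone, variance near 1
```

Training a score model amounts to learning to undo these steps; the VE variant instead lets the variance grow along the trajectory rather than holding it near one.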

Prostate cancer (PCa) diagnosis often relies on biopsies, which can lead to unnecessary procedures and complications. Federated learning (FL) offers a privacy-preserving approach for training predictive models across hospitals without sharing sensitive patient data. In this study, we evaluate the feasibility of FL for PCa risk prediction by benchmarking different training strategies, including local models, federated models, and free-riding (FR) on federated models. Using real-world heterogeneous datasets from 19 hospitals, we analyze the impact of data diversity and consortium size on predictive performance. Our results show that while FL improves model generalizability, local models often perform comparably, making direct participation in FL less beneficial for large hospitals. However, a small consortium of high-data-quality institutions could collaboratively develop robust models for broader clinical use. We discuss the practical implications of FL in healthcare and propose strategies for sustainable deployment in real-world hospital networks.

https://doi.org/10.1109/EMBC58623.2025.11252903

We propose a new framework for Bayesian estimation of differential privacy, incorporating evidence from multiple membership inference attacks (MIAs). Bayesian estimation is carried out via a Markov Chain Monte Carlo (MCMC) algorithm, named MCMC-DP-Est, which provides an estimate of the full posterior distribution of the privacy parameter (rather than just credible intervals). Critically, the proposed method does not assume that privacy auditing is performed with the most powerful attack on the worst-case (dataset, challenge point) pair, which is typically unrealistic. Instead, MCMC-DP-Est jointly estimates the strengths of the MIAs used and the privacy of the training algorithm, yielding a more cautious privacy analysis. We also present an economical way to generate measurements of an MIA’s performance for use by the MCMC method in estimating privacy. We demonstrate the use of the methods with numerical examples on both artificial and real data.

https://doi.org/10.1007/978-3-032-06096-9_23
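To illustrate the MCMC machinery underlying MCMC-DP-Est, the sketch below runs a random-walk Metropolis sampler for the posterior of a single attack's success rate. This is a toy stand-in of our own: the actual method jointly estimates attack strengths and the privacy parameter, which this does not attempt:

```python
import math
import random

def mh_posterior(successes, trials, n_samples, rng, step=0.05):
    """Random-walk Metropolis sampler for the posterior of an attack's
    success rate p, with a uniform prior on (0, 1) and a binomial
    likelihood; returns the full chain of posterior draws."""
    def log_post(p):
        if not 0.0 < p < 1.0:
            return -math.inf  # outside the prior's support
        return successes * math.log(p) + (trials - successes) * math.log(1.0 - p)

    p, draws = 0.5, []
    for _ in range(n_samples):
        cand = p + rng.gauss(0.0, step)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, log_post(cand) - log_post(p))):
            p = cand
        draws.append(p)
    return draws

# Hypothetical audit: the attack succeeded on 70 of 100 challenge points.
draws = mh_posterior(successes=70, trials=100, n_samples=5000, rng=random.Random(0))
posterior_mean = sum(draws[1000:]) / len(draws[1000:])  # discard burn-in
print(round(posterior_mean, 2))
```

Having the full chain of draws, rather than a single interval, is what lets a method like MCMC-DP-Est propagate attack-strength uncertainty into the privacy estimate.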

Clinical predictive models have played an important role in healthcare. An important task in lung cancer healthcare is to identify, within a selected population, the screening-program participants at higher lung cancer risk. More interestingly, Electronic Healthcare Record (EHR) data can be acquired from primary care and have been used to emulate a screening program. An example of such an EHR dataset is the Clinical Practice Research Datalink (CPRD), which covers 4.5% of the UK population. In this paper, we provide a worked example of such a task, employing the Explainable Boosting Machine (EBM) as the predictive model and using the CPRD dataset as the EHRs.

EBM is a prominent example of an inherently interpretable model (IIM). IIMs produce predictions and model explanations simultaneously. More importantly, EBMs represent a family of non-linear IIMs. This kind of generalisation presents a significant extension of logistic regression. EBMs have been developed as an end-to-end system at Microsoft Research. The system provides powerful visualisation tools for evaluating both model predictions and explanations. On the other hand, EBM users may wish to know more technical details about EBM itself. Thus, we provide a brief introduction to Generalised Additive Models, Gradient Boosting, Boosted Trees, and Bagging Ensembles. Finally, we further provide two EBM-based use cases in the healthcare domain as well as an illustrative example of lung cancer prediction and explanation.

https://doi.org/10.1007/978-3-032-04657-4_13

Federated learning (FL) is a machine learning technique that allows multiple distributed clients to collaboratively train an ML model without sharing their private data with any of the parties involved. However, ensuring the privacy of client data during the FL process remains an ongoing concern. In this study, we propose a homomorphic-encryption-based privacy-preserving FL protocol for multilayer perceptrons, which is shown to be secure in the presence of colluding honest-but-curious clients. The possibility of client collusion attacks is eliminated by exploiting the inherent permutability of neural networks. Our results indicate that our protocol does not incur any considerable loss in accuracy during the training process. Furthermore, it offers minimal computation costs by utilizing the batching technique of homomorphic operations and employing only the inexpensive homomorphic addition operation for the aggregation process.

https://doi.org/10.1007/978-3-032-04657-4_34
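The aggregation step that the homomorphic-encryption protocol above protects is, in plaintext, a data-size-weighted average of client parameters (the FedAvg rule). A minimal sketch; the encryption, batching, and permutation machinery of the protocol are deliberately not shown:

```python
def federated_average(client_weights, client_sizes):
    """FedAvg aggregation: the server forms a data-size-weighted average
    of the clients' model parameters. In an HE-based protocol, the
    weighted sum would be computed over encrypted parameter vectors using
    only homomorphic addition, so the server never sees individual
    client updates in the clear."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Two hypothetical clients: the second holds three times as much data,
# so its parameters dominate the average.
avg = federated_average([[1.0, 0.0], [3.0, 2.0]], client_sizes=[1, 3])
print(avg)  # [2.5, 1.5]
```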

Machine learning models have increasingly played an important role in medicine and healthcare. They can be readily adapted for clinical prognostic tasks. A prominent task in lung cancer healthcare is to select people at higher lung cancer risk from a given population. The task can be undertaken using clinical predictive models along with real-world Electronic Healthcare Records (EHRs). In this paper, we provide a worked example of such a task, using Logistic Regression as the model and the CPRD dataset, which covers 4.5% of the UK population, as the EHRs [9].

Further, the use of clinical predictive models in cancer care has gone beyond cancer screening programmes. That is, such models can also be employed to perform a variety of cancer healthcare management tasks. In this paper, we provide six “lung cancer”-related use cases to illustrate this task diversity. It is also demonstrated that each of the six use cases has its own appropriate set of prognostic predictors chosen to perform its task optimally. Last, their task performance is also critically evaluated.

Domains such as medicine and healthcare require trustworthiness and accountability. To meet this challenge, Explainable Artificial Intelligence (XAI) techniques have been developed. In this paper, we introduced impurity-, permutation-, LIME-, and SHAP-based importance measures. These XAI techniques were applied to the six use cases for variable importance analysis. We then used domain-specific knowledge to critically interpret their XAI results. We also briefly reviewed a model-specific XAI application that relies on knowledge-based constraints.

https://doi.org/10.1007/978-981-96-6588-4_11
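Of the importance measures listed above, permutation importance is the simplest to sketch from scratch: shuffle one feature at a time and record the drop in the model's score. The toy model and data below are illustrative, not drawn from the use cases:

```python
import random

def permutation_importance(predict, X, y, metric, rng, n_repeats=10):
    """Permutation importance: shuffle one feature column at a time and
    measure how much the model's score drops relative to the baseline.
    A near-zero drop means the model does not rely on that feature."""
    baseline = metric(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(baseline - metric(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy setup: the "model" uses only feature 0, so only feature 0 should
# receive a non-zero importance.
rng = random.Random(0)
X = [[rng.randint(0, 1), rng.randint(0, 1)] for _ in range(200)]
y = [row[0] for row in X]
accuracy = lambda t, p: sum(a == b for a, b in zip(t, p)) / len(t)
imp = permutation_importance(lambda x: x[0], X, y, accuracy, rng)
print(imp)  # feature 0 matters, feature 1 does not
```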

The use of machine learning (ML) models in healthcare is increasing, but the lack of interpretability of these models makes them unsuitable for use in clinical practice. In the medical field, it is vital to clarify to clinicians and patients the rationale behind a model’s high-probability prediction of a specific disease in an individual patient. This transparency fosters trust, facilitates informed decision-making, and empowers both clinicians and patients to understand the underlying factors driving the model’s output. This paper aims to incorporate explainability into ML models such as Random Forest (RF), eXtreme Gradient Boosting (XGBoost) and Multilayer Perceptron (MLP) for use with Clinical Practice Research Datalink (CPRD) data, and to interpret them in terms of feature importance so as to identify the most important features when distinguishing between lung cancer and non-lung cancer cases. The SHapley Additive exPlanations (SHAP) method has been used in this work to interpret the models. We use SHAP to gain insights into explaining individual predictions as well as interpreting them globally. The feature importance from SHAP is compared with the default feature importance of the models to identify any discrepancies between the results. Based on the experimental findings, the default feature importance from the tree-based models and SHAP agree that ‘age’ and ‘smoking status’ serve as the top features for predicting lung cancer among patients. Additionally, this work pinpoints that feature importance for a single patient may vary across models, leading to different predictions depending on the model employed. Finally, the work concludes that individual-level explanation of feature importance is crucial in mission-critical applications like healthcare to better understand personal health and lifestyle factors in the early prediction of diseases that may lead to terminal illness.

https://doi.org/10.1109/IJCNN60899.2024.10650819

In recent times, the Vision Transformer (VT) has emerged as a powerful alternative to conventional Convolutional Neural Networks (CNNs) thanks to its superior attention mechanism and pattern-recognition abilities. Within a short time, the VT paradigm has given rise to many variants, each showcasing enhanced accuracy and optimized performance for various computer vision applications. Our study introduces a multi-transformer pipeline for optimal VT architecture exploration in Alzheimer’s Disease (AD) detection and classification. Through a comparative evaluation of the VT variants, this study also aims to contribute valuable insights into the applicability of VTs to AD classification using the OASIS and ADNI datasets. Furthermore, VT performance is systematically compared with CNNs to determine the basic capabilities of the models and their limitations in capturing the intricate patterns indicative of early AD stages under both data-rich and data-scarce conditions. The results confirm that the attention mechanism of VTs is of pivotal importance for achieving superior performance in AD diagnosis. The codes used in the study are made publicly available.

https://doi.org/10.1109/IJCNN60899.2024.10650975

This study delves into the characterization of synthetic lung nodules using latent diffusion models applied to chest CT scans. Our experiments involve guiding the diffusion process by means of a binary mask for localization and various nodule attributes. In particular, the mask indicates the approximate position of the nodule in the shape of a bounding box, while the other scalar attributes are encoded in an embedding vector. The diffusion model operates in 2D, producing a single synthetic CT slice during inference. The architecture comprises a VQ-VAE encoder to convert between the image and latent spaces, and a U-Net responsible for the denoising process. Our primary objective is to assess the quality of synthesized images as a function of the conditional attributes. We discuss possible biases and whether the model adequately positions and characterizes synthetic nodules. Our findings on the capabilities and limitations of the proposed approach may be of interest for downstream tasks involving limited datasets with non-uniform observations, as is often the case in medical imaging.

https://ebooks.iospress.nl/pdf/doi/10.3233/FAIA240408