ID: 2427030381

Predicting Ovarian Cancer from Clinical Biomarkers and Medical Images Using AI-ML for Early Detection

"Early detection of ovarian cancer can be significantly improved by integrating clinical biomarker data (such as CA-125 levels, age, family history, and hormonal factors) with medical imaging features (ultrasound/CT/MRI) using AI-ML models. A multimodal machine learning approach that learns patterns jointly from structured clinical parameters and image-based tumor characteristics will achieve higher prediction accuracy, sensitivity, and earlier risk identification compared to models relying on a single data source."

Problem Statement

Ovarian cancer is a major global health challenge, with over 324,000 new cases and more than 200,000 deaths reported each year. India ranks second worldwide in annual new cases, contributing nearly 15% of the global burden. The disease is often diagnosed at a late stage because early symptoms are vague and difficult to detect, making it the third most lethal cancer in women. Although clinical biomarkers and medical imaging are both used for diagnosis, each method alone has limitations in accuracy and early-stage identification.

Although significant research already exists on ovarian cancer detection using AI and medical data, many studies focus on a single type of input, such as only biomarkers or only imaging. This project aims to build a multimodal AI-ML model that combines both clinical biomarker data and medical images to improve prediction accuracy and support earlier, more reliable detection.

Literature Review / Market Research

Liu et al., 2019: Highlighted the importance of key biomarkers like CA-125, HE4, and other clinical factors in improving ML-based prediction models.
Zhang et al., 2020: Compared machine learning models such as Random Forest and XGBoost using biomarker data for ovarian cancer prediction and found improved accuracy over traditional statistical methods.
Nunes et al., 2021: Integrated clinical and imaging data using deep learning techniques to support diagnosis, showing that combining multiple data types improves performance.
Boehm et al., 2022: Developed a multimodal ML approach for risk stratification and used feature importance and image heatmaps to improve interpretability of predictions.

Research Gap / Innovation

Research Gap

Most existing approaches rely on highly complex datasets and expensive imaging techniques
Research-level deep learning systems are difficult to implement in real clinical settings
Many models are trained on limited or specialized populations
Current solutions are not designed for simple, scalable deployment

Innovation

Practical and accessible multimodal AI-ML model using commonly available biomarkers
Emphasizes simplicity, interpretability, and real-world usability
Compares multiple ML techniques for comprehensive evaluation
Highlights important features for transparent clinical decision-making

System Methodology

Dataset / Input

This project uses publicly available datasets from Kaggle:

Ovarian Cancer Biomarker Dataset (clinical parameters)
Ovarian Cancer Histopathology Image Dataset

Data Files Used:

Supplementary Data 1: Original raw dataset
Supplementary Data 2: List of biomarkers with abbreviations and descriptions
Supplementary Data 3: Imputed training data (without the CA72-4 biomarker)
Supplementary Data 4: Raw training data
Supplementary Data 5: Raw test data

Model / Architecture

A multimodal AI-ML workflow is used to combine clinical data and image-based inputs.

For Biomarker Data

Algorithms: Random Forest, XGBoost, and Logistic Regression

Purpose: Predict cancer risk based on clinical parameters
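
As a rough illustration of how the three biomarker models might be compared, the sketch below runs 5-fold cross-validation with scikit-learn. It is not the project's code: the data is synthetic (generated with `make_classification`), the helper name `compare_models` is hypothetical, and scikit-learn's `GradientBoostingClassifier` is used as a stand-in for XGBoost so the example needs only one library. If the `xgboost` package is installed, `xgboost.XGBClassifier` could be swapped in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the biomarker table (NOT the project's data):
# 300 samples, 10 numeric features, binary label.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

models = {
    # Logistic regression is scale-sensitive, so standardize features first.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    # Assumption: gradient boosting from scikit-learn stands in for XGBoost.
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}


def compare_models(models, X, y, folds=5):
    """Return the mean cross-validated accuracy for each named model."""
    return {
        name: cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
        for name, model in models.items()
    }


scores = compare_models(models, X, y)
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

Cross-validation is used here rather than a single split because biomarker datasets of this kind tend to be small, and a single split can give an unstable accuracy estimate.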

System Workflow:

Input clinical biomarker values and histopathology images
Preprocess data (cleaning, normalization, resizing images)
Train ML models on biomarker data
Train CNN on image data
Combine outputs for final prediction
Classify result as OC (ovarian cancer, label 0) or BOT (benign ovarian tumor, label 1)
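
The last two workflow steps can be sketched as a simple late fusion. The source does not state how the two model outputs are combined, so a weighted average of the predicted probabilities is assumed here purely for illustration. The names `fuse`, `FusedPrediction`, `w_biomarker`, and `threshold` are hypothetical, and the input probabilities are made-up values, not model outputs from the project.

```python
from dataclasses import dataclass

# Label convention taken from the workflow: 0 = OC, 1 = BOT.
LABELS = {0: "OC", 1: "BOT"}


@dataclass
class FusedPrediction:
    p_bot: float  # fused probability of label 1 (BOT)
    label: int    # 0 or 1
    name: str     # "OC" or "BOT"


def fuse(p_bot_biomarker, p_bot_image, w_biomarker=0.5, threshold=0.5):
    """Average two P(label == 1) estimates with a weight, then threshold.

    Assumption: both models output the probability of label 1 (BOT).
    """
    for p in (p_bot_biomarker, p_bot_image):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    if not 0.0 <= w_biomarker <= 1.0:
        raise ValueError("w_biomarker must lie in [0, 1]")

    p_bot = w_biomarker * p_bot_biomarker + (1.0 - w_biomarker) * p_bot_image
    label = 1 if p_bot >= threshold else 0
    return FusedPrediction(p_bot=p_bot, label=label, name=LABELS[label])


# Hypothetical outputs from the biomarker model and the CNN for one patient.
result = fuse(p_bot_biomarker=0.2, p_bot_image=0.4)
print(result.name, round(result.p_bot, 2))
```

In a screening setting, the threshold could be raised above 0.5 so that fewer cancers are labelled BOT, at the cost of more false alarms; the right trade-off would have to be chosen on validation data.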


Results & Analysis

Accuracy / Performance: 83.09%
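
Because the hypothesis targets sensitivity as well as accuracy, it may help to show how both are derived from the same predictions. The sketch below is a minimal, dependency-free example. The label lists are invented for illustration and are not the project's results, the function names are hypothetical, and treating OC (label 0) as the positive class is an assumption.

```python
def confusion_counts(y_true, y_pred, positive=0):
    """Count TP, FP, TN, FN with `positive` as the class of interest."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def summarize(y_true, y_pred, positive=0):
    """Return accuracy, sensitivity, and specificity as a dict."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Sensitivity: share of true OC cases that were flagged as OC.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: share of true BOT cases that were labelled BOT.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Made-up labels (0 = OC, 1 = BOT), NOT taken from the project's test set.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]
metrics = summarize(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Reporting sensitivity next to accuracy matters for early detection, since a model can reach a high accuracy on an imbalanced test set while still missing many cancer cases.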


Academic Credits

Project Guide

Dr. Jay Prakash Singh

Team Member

Ananya Kawatra

2427030381