Student Capstone Projects


The capstone is the culminating project for each student in a SACS Master of Science program. The comprehensive, real-life industry-type projects are oriented toward the student’s domain of interest.

Each project includes: formulation of a question to be answered by the data; collection, cleaning and processing of data; choosing and applying a suitable model and/or analytic method to the problem; and communicating the results to a non-technical audience.

Kristen Oguno

M.S. Data Science

Building a Machine Learning Model to Evaluate Risk Factors Associated with Polycystic Ovarian Syndrome

Polycystic Ovarian Syndrome (PCOS) is a common, yet often undiagnosed, health condition affecting 8-13% of women globally. Its effects center on hormonal imbalance and metabolism, causing problems with the ovaries. The exact cause remains unknown, but PCOS is associated with an increased risk of diabetes, heart disease, and other complications. Early diagnosis is crucial for effective management and prevention of these issues. Leveraging machine learning (ML) and data science, our study focuses on developing a robust diagnostic model for PCOS that does not require ultrasonography. Feature selection and modeling methods such as Recursive Feature Elimination (RFE), Logistic Regression, and Random Forest were used to identify key predictors of PCOS diagnosis. Notably, results revealed that women with fewer than 5 cycle days per month were more likely to develop PCOS, contradicting the assumption that PCOS causes excessive bleeding. Cystic acne, skin discoloration, and excess hair growth were identified as notable precursors to PCOS. Anti-Müllerian hormone was a significant biomarker for PCOS development. To address disparities in access to diagnostic tools, we propose integrating Anti-Müllerian hormone testing into routine blood work for all women to enable earlier PCOS detection. Implementing these recommendations could improve PCOS management by facilitating early intervention and mitigating downstream health complications. Further research is needed to fully understand the mechanisms underlying PCOS development.

Jessica Owens

M.S. Data Science

Design and Evaluation of a Multi-Source Movie Recommendation System

Bio:

Jessica Owens is a data analyst and graduate student at Meharry Medical College, where she is pursuing a Master of Science in Data Science. She holds an undergraduate degree in Data Analysis from Lipscomb University, where she developed skills in data visualization, analytical problem solving, and working with data tools to interpret complex information. Her work focuses on analyzing large datasets to identify insights that support decision making, and she is actively exploring machine learning and neural networks through her graduate studies. As a United States Air Force veteran, Jessica brings discipline, adaptability, and a results-driven mindset to her academic and professional pursuits, and she is currently completing her capstone project on a multi-source recommendation system.

Abstract:

This capstone project focuses on improving traditional collaborative filtering models, which often rely only on user-item interactions and miss important contextual information. The project develops a hybrid recommendation system that combines collaborative filtering with content-based features from MovieLens, IMDb, and Movie Plot Synopses (MPST). User ratings are integrated with additional data such as genre, personnel, and text features to improve prediction accuracy. The study compares a baseline collaborative filtering model with hybrid models that incorporate additional data sources, using Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) to evaluate improvements. This approach is expected to show that incorporating multiple data sources can improve recommendation performance and support more informed, data-driven decision making.
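As a rough illustration of the two evaluation metrics named in the abstract, the sketch below computes RMSE and MAE for two sets of predictions. The ratings and predictions are invented toy values, not results from the project, and the names `baseline_pred` and `hybrid_pred` are hypothetical.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root Mean Square Error: penalizes large errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean Absolute Error: the average size of the error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy ratings on a 1-5 scale, invented purely for illustration.
actual = [4.0, 3.0, 5.0, 2.0]
baseline_pred = [3.0, 3.0, 4.0, 4.0]  # hypothetical collaborative-filtering output
hybrid_pred = [4.0, 3.0, 4.0, 3.0]    # hypothetical hybrid-model output

for name, pred in [("baseline", baseline_pred), ("hybrid", hybrid_pred)]:
    print(name, round(rmse(actual, pred), 3), round(mae(actual, pred), 3))
```

Lower values are better for both metrics. Because RMSE squares each error, a model that makes a few large mistakes can look worse on RMSE than on MAE.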

Advisor: Graham West, Ph.D.

Mian Pan

M.S. Data Science

Predicting Unmet Healthcare Needs Among Children with Autism Using Machine Learning

Bio:

Mian Pan is a graduate student in Data Science at Meharry Medical College’s School of Applied Computational Sciences. She holds advanced degrees in Communication and Information Systems and Computer Science and is transitioning into a career in data science with a focus on healthcare applications. Her academic interests include machine learning, statistical modeling, and health data analytics, particularly in addressing disparities in underserved populations. Mian has conducted research in protein function prediction and has been involved in the AIM-AHEAD health data science training program, where she applies AI and machine learning techniques to real-world clinical and population health data. Her current work includes projects involving electronic health records and autism-related healthcare access. She is passionate about using data-driven approaches to improve clinical decision-making and public health outcomes.

Abstract:

This capstone project investigates unmet healthcare needs among children with autism using data from the National Survey of Children’s Health (NSCH). The study applies machine learning models, including logistic regression, Random Forest, Gradient Boosting (XGBoost), and Support Vector Machines (SVM), to model and identify key predictors of disparities in healthcare access. Model performance is evaluated through cross-validation and comparative analysis. To enhance interpretability, SHAP (SHapley Additive exPlanations) is employed to quantify feature contributions and identify key socioeconomic and demographic determinants. The project also incorporates state-level choropleth maps to describe geographic disparities. By combining predictive modeling with interpretable analytics, this work aims to provide evidence on targeted healthcare needs for children with autism.
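SHAP attributes a prediction to individual features using Shapley values. The sketch below computes exact Shapley values for a tiny hypothetical linear model, to show what "feature contribution" means. It is not the SHAP library and not the project's model: the weights, the instance, and the all-zeros baseline are assumptions made for illustration, and practical SHAP implementations use approximations rather than enumerating every feature subset.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance x, relative to a baseline instance."""
    n = len(x)

    def value(subset):
        # Features in `subset` keep their real value; the rest use the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical linear model with three features (weights are invented).
weights = [0.5, -1.0, 2.0]


def predict(features):
    return sum(w * f for w, f in zip(weights, features))


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(predict, x, baseline)
print(phi)
```

For a linear model, each contribution reduces to the weight times the feature's difference from the baseline, and the contributions sum to the difference between the prediction for `x` and the prediction for the baseline.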

Advisor: Aize Cao, Ph.D.

Bradford Patton

M.S. Data Science

Applying Link Prediction on Knowledge Graph for Biomedical Knowledge Discovery

Understanding complex biological processes, diseases, and drug discovery necessitates deciphering the intricate interactions among diverse biological entities such as proteins, drugs, metabolites, and enzymes. Knowledge graphs (KGs), representing interconnected multi-relational entities through nodes and edges, offer a promising approach to model heterogeneous biological data comprehensibly. However, many crucial links within KGs remain hidden, presenting a challenge for comprehensive understanding and analysis. In this capstone project, we address the task of link prediction within a biomedical knowledge graph to facilitate biomedical knowledge discovery. Our objectives encompass curating a comprehensive biomedical knowledge graph, constructing a general link prediction pipeline employing various knowledge graph embedding (KGE) models, and applying link prediction techniques to tackle two challenging biomedical problems: drug repurposing and protein function prediction. This project not only contributes a reusable pipeline for biomedical knowledge discovery but also lays the groundwork for future advancements in link discovery using knowledge graphs, applicable to a wide range of biomedical research tasks.
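The abstract does not say which knowledge graph embedding models were used. As one hedged illustration, the sketch below scores candidate links with TransE, a widely used KGE model in which a triple (head, relation, tail) is plausible when head + relation lies close to tail. The entity names and the two-dimensional embeddings are invented; in practice the embeddings are learned from the graph.

```python
from math import sqrt


def transe_score(head, relation, tail):
    """TransE plausibility: negative L2 distance between head + relation and tail."""
    return -sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def rank_tails(head, relation, candidates):
    """Return candidate tail names ordered from most to least plausible."""
    return sorted(
        candidates,
        key=lambda name: transe_score(head, relation, candidates[name]),
        reverse=True,
    )


# Hypothetical embeddings, invented for illustration only.
entities = {
    "drug_a": [0.0, 0.0],
    "disease_x": [1.0, 1.0],
    "disease_y": [3.0, -2.0],
}
relations = {"treats": [1.0, 0.9]}

candidates = {k: v for k, v in entities.items() if k.startswith("disease")}
ranked = rank_tails(entities["drug_a"], relations["treats"], candidates)
print(ranked)
```

Ranking every candidate tail for a query such as (drug, treats, ?) is how a link prediction model can suggest drug repurposing hypotheses for further review.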

Dyani Peterson

M.S. Biomedical Data Science

Machine Learning for Predicting Autism: Utilizing EHR and Clinical Data to Enhance Early Diagnosis

Autism Spectrum Disorder (ASD) is a common neurodevelopmental condition affecting approximately 1 in 36 children in the United States. Despite the importance of early intervention, diagnosis often occurs after critical developmental windows due to reliance on specialist-driven behavioral assessments, which are not scalable, particularly in underserved populations. This study aims to develop a fair machine learning (ML) framework for early ASD risk prediction using electronic health record (EHR) and clinical data. This study will demonstrate the potential of machine learning approaches to support earlier and more scalable ASD detection. By leveraging EHR and contextual data, the proposed framework can aid clinicians in identifying high-risk children and may help reduce disparities in access to timely diagnosis and intervention.

Destiny Pounds

M.S. Biomedical Data Science

Forecasting Minute-by-Minute Stress, Anxiety, and Affective States Using Time-Series Analysis of Wearable Sensor Data

This capstone project explores the potential of using wearable sensor data to predict psychological states in real time. The study focuses on features extracted from electrodermal activity (EDA) and heart rate variability (HRV) signals to predict self-reported stress, anxiety, and affect using the WESAD dataset. Two modeling approaches were developed: static models that predict self-reports from single-minute data snapshots and time-series models that use ten-minute sliding windows of physiological data and labels to forecast the next minute’s self-reported data. Multiple machine learning algorithms, including Random Forest, XGBoost, LightGBM, and Support Vector Machine, were trained and evaluated using subject-aware cross-validation to ensure generalizability across unseen participants. Results show that time-series models consistently outperform static models in predicting stress, anxiety, and affective states, highlighting the value of incorporating temporal physiological context. Feature importance analysis further identified key physiological markers associated with psychological states. These findings support the potential of wearable technology and machine learning for real-time, personalized mental health monitoring, offering a foundation for future research and clinical applications in remote or continuous mental health assessment.
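The two data-handling ideas in the abstract, ten-minute sliding windows and subject-aware cross-validation, can be sketched as follows. This is a simplified illustration, not the project's code: it uses a single per-minute series instead of multiple EDA and HRV features, and the function names and the round-robin fold assignment are assumptions.

```python
def sliding_windows(series, window=10):
    """Pair each run of `window` consecutive minutes with the next minute's value."""
    pairs = []
    for start in range(len(series) - window):
        history = series[start:start + window]
        target = series[start + window]
        pairs.append((history, target))
    return pairs


def subject_folds(subject_ids, n_folds):
    """Yield (train_idx, test_idx) so that no subject appears on both sides."""
    subjects = sorted(set(subject_ids))
    for fold in range(n_folds):
        held_out = set(subjects[fold::n_folds])  # round-robin assignment
        test_idx = [i for i, s in enumerate(subject_ids) if s in held_out]
        train_idx = [i for i, s in enumerate(subject_ids) if s not in held_out]
        yield train_idx, test_idx


# Toy per-minute values and subject labels, invented for illustration.
minutes = list(range(15))
pairs = sliding_windows(minutes, window=10)
print(len(pairs), pairs[0])

subject_ids = ["s1", "s1", "s2", "s2", "s3", "s3"]
folds = list(subject_folds(subject_ids, n_folds=3))
print(folds[0])
```

Holding out whole subjects, rather than random rows, keeps a participant's data from appearing in both training and test sets, so the score reflects performance on people the model has not seen.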

Cameron Pykiet

M.S. Data Science

Developing a Novel Protein Function Prediction Pipeline using Structurally Aware Tokenization (SAT)

Proteins are the biomolecules that form the building blocks of life. The genetic information encoded in deoxyribonucleic acid (DNA) is transcribed into ribonucleic acid (RNA), which is then translated into proteins. Proteins perform a multitude of functions, including but not limited to catalyzing reactions as enzymes, participating in the body’s defense mechanisms as antibodies, forming structures, and transporting important chemicals. Therefore, functional characterization of proteins is crucial to understanding life and disease and to developing novel therapeutics. In this research, we propose a protein function prediction (PFP) pipeline that uses a new tokenization method called Structurally Aware Tokenization (SAT). SAT works by decomposing a protein sequence into its constituent domains, which are the evolutionarily conserved structural parts of a protein sequence. Typically, given a protein sequence as input, the objective of protein function prediction or annotation is to assign appropriate Gene Ontology (GO) terms to describe the functional characteristics of the query protein. Tokenization, the splitting of the input sequence into smaller subunits (tokens), is crucial in language models. Typically, these models take protein sequences of amino acids and use common tokenization techniques such as individual amino acids (AA), k-mers, or Byte Pair Encoding (BPE) to build protein language models. However, we hypothesize that including structural constructs of proteins, such as ordered lists of motifs and domains, as tokens will provide additional valuable structural information to the model, as they link directly to function. In this work, we have explored SAT for protein function prediction in comparison with AA, k-mers, and BPE under various machine learning and natural language processing techniques, including Word2Vec and large language models. Additionally, we have implemented a GPT-2-style transformer architecture to train a PFP model using the proposed SAT tokenization. The experimental results are promising, although further exploration is needed to confirm the method’s efficacy.
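To make the tokenization schemes concrete, the sketch below contrasts amino-acid and k-mer tokens with a domain-based token list in the spirit of SAT. The abstract does not describe how domains are detected, so the sketch assumes they arrive as precomputed (start, end, identifier) annotations from some external tool. The sequence and the identifiers `DOMAIN_A` and `DOMAIN_B` are hypothetical.

```python
def amino_acid_tokens(sequence):
    """One token per residue."""
    return list(sequence)


def kmer_tokens(sequence, k=3):
    """Overlapping substrings of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def domain_tokens(annotations):
    """Domain identifiers in the order they occur along the sequence.

    `annotations` is a list of (start, end, identifier) tuples, assumed to
    come from an external domain-annotation step.
    """
    return [identifier for _start, _end, identifier in sorted(annotations)]


# Toy sequence and annotations, invented for illustration.
sequence = "MKTAY"
annotations = [(120, 200, "DOMAIN_B"), (5, 90, "DOMAIN_A")]

print(amino_acid_tokens(sequence))
print(kmer_tokens(sequence, k=3))
print(domain_tokens(annotations))
```

The domain-based list is much shorter than the residue-level ones, and each token stands for a conserved structural unit rather than a fragment of raw sequence.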

Gina Robinson

M.S. Data Science

DEFIN’D: Examining the Efficacy of Data-Driven Digital Recruitment Strategies for Clinical Trials in Attracting Candidates from Diverse Backgrounds

Clinical trials play a vital role in advancing medical research and enhancing patient outcomes. Nevertheless, the recruitment of diverse participants for these trials remains a significant challenge. The objective of this study is to assess the efficacy of digital recruitment strategies in attracting candidates from diverse backgrounds for clinical trials. To achieve this, the study will conduct a comprehensive review of existing literature on digital recruitment efforts in clinical trials, with a particular emphasis on diversity considerations. Additionally, a pre-collected dataset consisting of diverse digital recruitment campaigns and their outcomes will be utilized, supplemented with data from the US census. The analysis will primarily focus on key metrics such as the number of recruited participants, demographic information, recruitment channels and outreach, participant engagement, and participant retention rates. By examining these data points, the study aims to identify trends and patterns pertaining to the effectiveness of digital recruitment strategies in enrolling participants from diverse backgrounds. Preliminary findings indicate that digital recruitment strategies have the potential to reach a broader audience and attract participants from diverse backgrounds compared to traditional recruitment methods. However, several factors were found to influence the effectiveness of these strategies, including the selection of appropriate digital platforms, targeted messaging, and cultural sensitivities. By identifying the strengths and limitations of digital recruitment strategies, this study aims to provide valuable insights and recommendations for optimizing future clinical trial recruitment efforts. The findings will inform researchers, pharmaceutical companies, and clinical trial coordinators on the best practices for designing inclusive digital recruitment campaigns that effectively engage candidates from diverse backgrounds.