Student Capstone Projects

The capstone is the culminating project for each student in the M.S. Data Science and M.S. Biomedical Data Science programs. The comprehensive, real-life industry-type projects are oriented toward the student’s domain of interest.

Each project includes: formulating a question to be answered by the data; collecting, cleaning, and processing the data; choosing and applying a suitable model and/or analytic method to the problem; and communicating the results to a non-technical audience.

Brown, Chris

M.S. Data Science

Antiphospholipid Syndrome: Unraveling Adverse Outcomes in Pregnancy

Antiphospholipid syndrome (APS) refers to the clinical association between antiphospholipid antibodies and a hypercoagulable state, which increases the risk of blood clot formation within blood vessels. APS is more prevalent in women than in men. Research shows that women with APS face an elevated risk of adverse pregnancy outcomes, particularly during the fetal period (ten or more weeks of gestation). These outcomes include preeclampsia, characterized by high blood pressure and proteinuria (excess protein in urine), recurrent early pregnancy loss, fetal demise, and intrauterine growth restriction. APS-related pregnancy losses tend to occur later in pregnancy than sporadic or recurrent miscarriages, which typically happen earlier, in the pre-embryonic or embryonic period. Factors such as placental insufficiency, hypertensive disorders of pregnancy, thrombophilia, and underlying autoimmune conditions play a role in these outcomes. This research aims to study the complex interplay of these factors to improve outcomes for affected women. Notably, APS is more prevalent among underserved communities.

Oguno, Kristen

M.S. Data Science

Building a Machine Learning Model to Evaluate Risk Factors Associated with Poly-cystic Ovarian Syndrome

Poly-cystic Ovarian Syndrome (PCOS) is a common, yet often undiagnosed, health condition affecting 8-13% of women globally. Its effects center on hormonal imbalances and metabolism, causing problems with the ovaries. The exact cause remains unknown, but PCOS is associated with an increased risk of diabetes, heart disease, and other complications. Early diagnosis is crucial for effective management and prevention of these issues. Leveraging machine learning (ML) and data science, our study focuses on developing a robust diagnostic model for PCOS that does not require ultrasonography. Methods such as Recursive Feature Elimination (RFE), Logistic Regression, and Random Forest were used to identify key predictors of PCOS diagnosis. Notably, results revealed that women with fewer than 5 cycle days per month were more likely to develop PCOS, contradicting the assumption that PCOS causes excessive bleeding. Cystic acne, skin discoloration, and excess hair growth were identified as notable precursors to PCOS. Anti-Müllerian hormone was a significant biomarker for PCOS development. To address disparities in access to diagnostic tools, we propose integrating Anti-Müllerian hormone testing into routine blood work for all women to enable earlier PCOS detection. Implementing these recommendations could revolutionize PCOS management by facilitating early intervention and mitigating downstream health complications. Further research is needed to fully understand the mechanisms underlying PCOS development.

Xin, Fuxue

M.S. Biomedical Data Science

The Impact of Housing Condition on AMA among Pregnant and Postpartum Women with SUDs

Leaving treatment against medical advice (AMA) among pregnant and postpartum women with substance use disorder (SUD) is influenced by various factors, including housing and other social determinants of health (age, insurance, SUDs and mental state). Housing instability can be a significant barrier for pregnant and postpartum women with SUD seeking treatment. Lack of stable housing can lead to difficulties in accessing care, completing treatment programs, and maintaining recovery. This study aims to explore the feasibility and effectiveness of using natural language processing to extract housing information from clinical notes and to validate model performance. The dataset comes from clinical charts of patients in the Rainbow/Mending Rainbow program at Elam Mental Health Center (EMHC) at Meharry.

Starling, Jacquese

M.S. Data Science

Unraveling the Complexities of Employee Retention: A Comparative Analysis of Logistic Regression and Random Forest Models

Employee attrition is a critical challenge facing organizations, with significant impacts on productivity, morale, and the bottom line. This study employed a rigorous, data-driven approach to uncover the key drivers of employee turnover within a specific organizational context. Using a combination of Logistic Regression and Random Forest models, we analyzed a comprehensive dataset of employee records and demographic information. The findings revealed a multifaceted set of factors contributing to attrition, including work-life balance, compensation, career development opportunities, and manager-employee relationships. By leveraging these data-driven insights, the study provides a roadmap for targeted interventions and talent management strategies. The results underscore the power of integrating people analytics and business acumen to foster work environments that prioritize employee engagement, well-being, and long-term retention. This research offers valuable guidance for organizational leaders seeking to transform their approach to talent management, ultimately driving sustainable success and positively impacting the lives of their workforce. The study’s methodology and findings contribute to the growing body of knowledge on evidence-based human capital management practices.

Patton, Bradford

M.S. Data Science

Applying Link Prediction on Knowledge Graph for Biomedical Knowledge Discovery

Understanding complex biological processes, diseases, and drug discovery necessitates deciphering the intricate interactions among diverse biological entities such as proteins, drugs, metabolites, and enzymes. Knowledge graphs (KGs), representing interconnected multi-relational entities through nodes and edges, offer a promising approach to model heterogeneous biological data comprehensibly. However, many crucial links within KGs remain hidden, presenting a challenge for comprehensive understanding and analysis. In this capstone project, we address the task of link prediction within a biomedical knowledge graph to facilitate biomedical knowledge discovery. Our objectives encompass curating a comprehensive biomedical knowledge graph, constructing a general link prediction pipeline employing various knowledge graph embedding (KGE) models, and applying link prediction techniques to tackle two challenging biomedical problems: drug repurposing and protein function prediction. This project not only contributes a reusable pipeline for biomedical knowledge discovery but also lays the groundwork for future advancements in link discovery using knowledge graphs, applicable to a wide range of biomedical research tasks.

Lynch, Lexius

M.S. Data Science

Exploratory Analysis of Alzheimer’s Disease: Unraveling the complexities of single cell RNA sequence Data

Single-cell biology is a field that focuses on understanding human health and diseases at the cellular level, with a particular emphasis on precision medicine. Identifying specific cell types in major brain disorders is a critical area of research. However, the complex cellular architecture of the brain, which consists of a diverse set of cell types, makes it challenging to determine the primary pathological cell type for a particular disease. Recent studies have used single-cell RNA sequencing (scRNA-seq) and expression-weighted cell type enrichment to identify specific neuronal cell types associated with brain disorders, such as Alzheimer’s disease. scRNA-seq is a powerful technology that allows the analysis of a large number of individual cells. These studies have revealed statistically significant enrichment of certain neuronal cell types in the context of these disorders, providing valuable insights into the differentially expressed genes as well as the cell signaling pathways critical to understanding variants associated with brain diseases.

Lee, Charleston

M.S. Data Science

Investigating Criminal References in Rap Lyrics: A Data Science Approach

Rap music has long served as a platform for artists to express their realities, often exploring themes of crime and street life. This study employs data science methodologies to identify and analyze potential references to criminal activities within rap lyrics. Leveraging natural language processing techniques, this study analyzes a curated dataset of rap lyrics spanning the years 2000-2023, sourced from a diverse range of music labels, certifications, and artists, encompassing modern slang associated with criminal behavior while ensuring ethical data collection practices. Through sentiment analysis, topic modeling, and named entity recognition, the aim of this study is to quantify and contextualize the prevalence of criminal references in rap songs during this time period. Additionally, this study investigates the relationship between lyrical content and commercial success by examining the impact of identified themes on the performance of songs, measured through RIAA certifications from 2000-2023. Furthermore, based on these linguistic insights, a chatbot is developed that is equipped with a comprehensive understanding of contemporary slang terminologies related to crime. This chatbot enables interactive engagement and discourse on pertinent subjects within the rap genre, facilitating broader discussions about cultural representations in music and industry influences. This interdisciplinary approach not only advances data science methodologies but also provides valuable insights into the portrayal of societal realities within artistic expression and its reception in the music industry across different labels and time periods.

Tsurgeon, Cyruss

M.S. Biomedical Data Science

Cell Census: Unlocking the Power of Artificial Intelligence for Accurate Cell Quantification

Cell counting is a fundamental task in various biological and medical research fields, providing crucial information about cellular populations in a variety of contexts. The addition of fluorescence microscopy has revolutionized cell imaging by enabling visualization of specific cell types and components with high precision and sensitivity, thus providing advanced techniques for distinguishing individual cells or cellular features, segregating clusters of cells by type, and even labeling distinct cells in culture or tissue sections. However, the manual counting of cells in situ is a time-consuming and subjective process that is prone to human error and cannot be performed from microscope images themselves. To overcome these limitations, researchers have turned to deep learning techniques, leveraging their ability to learn intricate patterns and relationships in large datasets. In this paper, we present a comprehensive approach for automated cell counting using deep learning algorithms applied to fluorescence microscopy images. We propose a novel framework that combines convolutional neural networks (CNNs) with advanced image processing techniques and statistical methods, enabling accurate and efficient cell quantification. Our method uses annotated data to train the network, which is then applied to automated cell counting in unseen microscopy images. We demonstrate the effectiveness and robustness of our approach through extensive experiments on diverse datasets, showcasing improved performance compared to existing methods. The proposed deep learning-based automated cell counting technique holds immense potential for accelerating research and advancing our understanding of various biological processes, while also serving as a valuable tool for diagnostic and therapeutic applications in clinical settings. In addition, we demonstrate the application of our model in various contexts including medical diagnosis, drug discovery, biological research, and environmental monitoring. With this research, we provide a foundation for future investigations in biomedical image analysis, offering new insights into the applications of deep learning in computer vision for medicine and healthcare.
