Visualization Challenge

10:10 to 11:05 a.m. - Evans Room

Interactive Visualizations

Temporal Coauthorship Network of Scialog Conference Attendees

Anastasiya Salova, Postdoc, McCormick School of Engineering and Applied Science

Abstract:  Scientific conferences are considered essential to establishing new research directions and promoting collaboration. However, explicitly demonstrating the causal connection between conference attendance and scientific outcomes—e.g., journal publications—requires longitudinal data and careful analysis. The data we use come from multi-year Scialog (Science Dialog) initiatives that aim to ignite new collaborations between early-career professors and researchers. Specifically, we focus on the Advanced Energy Storage conferences (held in 2017, 2018, and 2019), each attended by around 60 participants. Attendees coauthored publications before, during, and after the conference. We aim to learn how conference coattendance causally affected the attendee collaboration network. For instance, we would like to understand whether the conference resulted in new collaborations or strengthened the existing ones and how conference coattendance changed the role of different researchers within the attendee network.

We built an interactive visualization tool for our data using Shiny for Python. The visualization showcases publication data as a network, allowing access to complex coauthorship patterns beyond the aggregate numbers. The interactivity of our visualization allows the user to explore temporal aspects of the data (e.g., the structure of the collaboration network before and after the conference) and collaboration patterns between attendees with different coattendance patterns (only 35 attendees attended all three conference years). In addition to serving as a supplement to our manuscript titled "The effect of prescribed interactions on patterns of scientific collaboration," the visualization will be useful for presenting our data at conferences and inspiring testable hypotheses within our collaboration.
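As a rough illustration of the before-and-after comparison described in this abstract, the sketch below builds coauthorship edge counts from publication records and separates pairs that first coauthored after a conference year from pairs that had already coauthored before it. The records, attendee names, and function names are hypothetical and not taken from the authors' tool, and the comparison is purely descriptive: it does not by itself establish a causal effect of coattendance.

```python
from collections import Counter
from itertools import combinations

# Hypothetical publication records: (publication year, set of attendee coauthors).
publications = [
    (2016, {"A", "B"}),
    (2018, {"A", "B"}),
    (2018, {"B", "C"}),
    (2019, {"A", "C", "D"}),
]


def coauthor_edges(pubs):
    """Count coauthored papers for each unordered pair of attendees."""
    edges = Counter()
    for _, authors in pubs:
        for pair in combinations(sorted(authors), 2):
            edges[pair] += 1
    return edges


def split_by_year(pubs, conference_year):
    """Return edge counts before the conference year and from that year onward."""
    before = [p for p in pubs if p[0] < conference_year]
    after = [p for p in pubs if p[0] >= conference_year]
    return coauthor_edges(before), coauthor_edges(after)


before, after = split_by_year(publications, conference_year=2017)

# Pairs seen only after the conference year versus pairs that continued.
new_pairs = sorted(set(after) - set(before))
continued_pairs = sorted(set(after) & set(before))

print("new:", new_pairs)
print("continued:", continued_pairs)
```

The two edge dictionaries could feed a network layout, with edge weights showing how many papers each pair shares in each period.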

2022 Storm Events Throughout the U.S.

Corbin Diaz, Undergraduate Student, Weinberg College of Arts and Sciences

Abstract:  As the effects of climate change intensify, analyzing climate and weather data has grown increasingly important. Understanding weather distributions can provide valuable insights into dangerous weather events throughout the year, helping to raise awareness and enhance public safety and knowledge. While storm events are well documented, no widely accessible tool exists to make them easily digestible, especially for the general public. This interactive visualization enables users to explore the distribution of significant weather events in the United States. The data, sourced from NOAA, includes major weather reports across U.S. states, territories, and marine regions for 2022. However, this visualization can be adapted to view data from any year or multiple years.

This tool is a proof of concept for an easily digestible yet highly detailed interactive visual of NOAA data. It lets the public see any desired level of detail, from high-level summary statistics to more detailed views of selected parts. Users can also see damage statistics and filter by any metric and weather type to better understand the impact these storms can have. Because the tool is user-friendly, users can explore the data on their own and gain a better understanding of local and national weather distributions across the seasons. One may even use it to find new or significant weather patterns that would otherwise have been hard to notice.

This visualization tool is built with Shiny in R and incorporates packages such as ggplot2 and leaflet, along with interactive tables.

Clinical Courses of ICU Patients with Severe Pneumonia

Nikolay Markov, PhD Student, Feinberg School of Medicine

Abstract: This is a demo data browser accompanying the paper “Machine learning links unresolving secondary pneumonia to mortality in patients with severe pneumonia, including COVID-19” (doi: 10.1172/JCI170682).

This project visualizes ICU stays of patients with severe pneumonia. Clinical courses of patients with severe pneumonia are very heterogeneous: some patients stay just a few days in the ICU, while others stay several weeks. We used clustering and visualization to explore the data and try to identify the patterns underlying the dynamics of these complex clinical courses. We represented each patient stay as a sequence of patient-days and aggregated all electronic health record data for each day into a single datapoint. Clustering these datapoints and assessing transitions between the clusters allows us to find granular patterns within patients' ICU stays. We needed a visualization tool to explore both the global structure of the data and individual patients' clinical courses.
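To make the idea of assessing transitions between clusters more concrete, here is a minimal sketch that counts day-to-day transitions between cluster labels and converts them to empirical transition probabilities. It assumes each patient-day has already been assigned a cluster label; the patient IDs, labels, and function names are hypothetical and are not the authors' data or code.

```python
from collections import Counter

# Hypothetical cluster labels, one per ICU day, listed in day order for each patient.
patient_days = {
    "patient_1": [0, 0, 1, 2],
    "patient_2": [1, 1, 1],
    "patient_3": [2, 1, 0, 0, 0],
}


def transition_counts(stays):
    """Count transitions between the clusters of consecutive patient-days."""
    counts = Counter()
    for labels in stays.values():
        for today, tomorrow in zip(labels, labels[1:]):
            counts[(today, tomorrow)] += 1
    return counts


def transition_probabilities(counts):
    """Normalize counts so that transitions out of each cluster sum to 1."""
    totals = Counter()
    for (source, _), n in counts.items():
        totals[source] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}


counts = transition_counts(patient_days)
probs = transition_probabilities(counts)

for (source, target), p in sorted(probs.items()):
    print(f"cluster {source} -> cluster {target}: {p:.2f}")
```

Self-transitions such as cluster 0 to cluster 0 indicate days on which a patient's state stays in the same cluster, while off-diagonal entries capture movement between clinical states.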

National Park and Air Quality Visualizer

Sarah Abara, Undergraduate Student, Weinberg College of Arts and Sciences
Tosin Okoh, Undergraduate Student, Weinberg College of Arts and Sciences

Abstract: Our data visualization, created using Tableau, explores the relationship between air quality and visit counts in U.S. National Parks. We hypothesized that as air quality worsens (indicated by rising PM2.5 levels), the number of park visitors would increase.

While we observed slight trends supporting this idea, the data alone is inconclusive. Our analysis does not account for economic factors, which may play a significant role in visitation trends. In general, U.S. National Park visitation patterns are susceptible to a range of socio-economic influences. The fluctuating impact of social media promotion, economic downturns such as the 2008 recession, and public health crises like the COVID-19 pandemic demonstrate this complexity.

Consequently, further research is needed to fully understand the extent to which air quality impacts visitation, especially in comparison to these external factors. Our visualization is designed to be clear, intuitive, and accessible to a broad audience, including individuals unfamiliar with our field.

We included a purpose statement to ensure that viewers understand the goals of our analysis. Interactive elements, such as “Read Me” pop-ups, were carefully designed to be user-friendly and to guide viewers through the dataset. The intended audience includes policymakers, environmental researchers, and the general public—anyone interested in understanding how environmental factors influence National Park visitation. Air quality is a crucial issue as climate change leads to worsening conditions worldwide, affecting ecosystems, wildlife, and human experiences in National Parks.

By visualizing this data dynamically rather than through static charts, we allow users to explore variations across different states and localities. Our use of Tableau enables engagement with the data, giving viewers the ability to filter and interact with specific insights. We also prioritized accessibility in our design.

To accommodate individuals with color vision deficiency, we used color schemes with strong contrast. Additionally, we enabled zoom features for those with visual impairments. By incorporating these considerations, we ensured that our visualization is both informative and inclusive.

While our approach is not entirely novel, our use of multiple tabs enhances exploration and engagement, allowing users to analyze various aspects of the data. Overall, our visualization serves as a foundation for further investigation into the complex relationship between environmental conditions and National Park visitation trends.

Heart Isoform Database

Timothy Pan, PhD Student, Feinberg School of Medicine

Abstract:  Through alternative splicing, a single gene can produce several isoforms, each with unique structural and functional properties. Maintaining the delicate balance of these isoforms is essential for normal biological processes, as disruptions are known to contribute to the development of various diseases. In my research, I applied a novel RNA-sequencing platform that utilizes long-read sequencing technology to capture the isoform landscape of the normal and diseased human adult left ventricle for the first time. My visualization tool facilitates the easy navigation of our data in the Heart Isoform Database, a Shiny web-based platform. Users can easily explore the isoform composition of any gene without the burden of downloading large data archives and manually processing results. The webserver provides customizable figures that illustrate isoform compositions according to study parameters, including cardiac cell types, donor IDs, and disease conditions. Furthermore, the raw count numbers are made visible for additional transparency in genomics studies. This webserver will offer easy accessibility and navigation of the results of our study and will serve as a valuable resource in advancing cardiovascular research.

Static Visualizations

Access to National Parks is Racially Divided

Alice Kang, PhD Student, Weinberg College of Arts and Sciences

Abstract: My visualization highlights racial disparities in National Park access. Utilizing 2018 data from the National Park Service and U.S. Census Bureau, I map park locations against Black and White population distributions at the census tract level, controlling for population density. Park visitation is represented by marker size and racial composition by monochromatic choropleth maps, avoiding bias and ensuring accessibility for viewers with color-vision deficiency. Created using R, Tableau, and PowerPoint, this tool informs park visitors, researchers, and policymakers about the relationship between park locations and surrounding demographics.

Despite the National Parks' mission to preserve and share extraordinary natural resources with the public, park visitation is considerably skewed. White individuals with high socioeconomic status are far more likely to visit National Parks than other demographic groups. Prior studies cite institutional discrimination, cultural norms, and socioeconomic marginalization to explain the underrepresentation of minority visitors at National Parks. But a more recent line of research argues that the geographic distribution of National Parks plays a critical role in determining access. My visualization builds on this argument, namely that park location matters.

Most National Parks are remotely located, far from public transit. This makes them challenging to access—especially if you live farther away. Who is more likely to live farther away? My visualization reveals a clear pattern. National Parks are disproportionately located closer to areas with high White populations and farther from areas with high Black populations. This pattern mirrors broader racial disparities in access to public green spaces.

Clustering Challenges in Mental Health NLP: Evaluating Domain-Specific vs. General-Purpose Embeddings

Juwon Park, Undergraduate Student, Weinberg College of Arts and Sciences

Background: Increasing use of large language models (LLMs) in medicine necessitates responsible implementation, particularly when leveraging specialized clinical text embedders. This study evaluates embeddings generated by a domain-specific model (MentalBERT) and a general-purpose model (all-MiniLM-L6-v2) on mental health-related text from Reddit. Using dimensionality reduction and clustering, we assess which model better captures nuances indicated by subreddit labels, while exploring the implications for clinical application integration.
Methods:  A Hugging Face dataset containing 54,412 Reddit posts labeled with mental health-related subreddits (e.g., r/SuicideWatch, r/Depression, r/Anxiety) was utilized in this study. Fixed-length embeddings were generated using MentalBERT (a domain-specific model) and all-MiniLM-L6-v2 (a general-purpose model) and visualized via t-SNE for clustering. K-Means clustering (k=5) was applied to the t-SNE embeddings, and performance was evaluated using metrics such as Silhouette Score, Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and Purity Score.
Results:  The dataset exhibits significant variability in post lengths, with an average of 178 words (SD = 237) and a median of 108. Clustering performance metrics and visualizations show that all-MiniLM-L6-v2 outperformed MentalBERT across all measures, though both models struggled to align clusters with true subreddit labels as evidenced by low evaluation metrics such as Silhouette Score, NMI, ARI, and Purity Score. While all-MiniLM-L6-v2 demonstrated better generalization, both models faced challenges in representing the nuances of informal, mental health-related text for unsupervised clustering tasks.
Conclusion:  Label imbalance, variability in post lengths, and the models' focus on short-sentence tasks likely contributed to the underperformance, raising concerns about the use of specialized embedders in fine-tuning LLMs for clinical applications.
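Of the metrics named in the Methods, the Purity Score is the one that scikit-learn does not provide directly (Silhouette Score, NMI, and ARI are available in `sklearn.metrics`). The sketch below shows one common definition of purity in plain Python: each cluster is credited with its most frequent true label, and the credited counts are divided by the total number of posts. The labels and cluster assignments are made-up toy values, not the study's data, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def purity_score(true_labels, cluster_ids):
    """Fraction of items whose true label is the majority label of their cluster."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have the same length")
    if not true_labels:
        raise ValueError("at least one labeled item is required")

    # For each cluster, count how often each true label occurs in it.
    members = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        members[cluster][label] += 1

    # Credit each cluster with the size of its majority label.
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(true_labels)


# Toy example: six posts, three subreddit labels, two clusters.
subreddits = ["Anxiety", "Anxiety", "Depression",
              "Depression", "Depression", "SuicideWatch"]
clusters = [0, 0, 0, 1, 1, 1]

purity = purity_score(subreddits, clusters)
print(f"purity = {purity:.3f}")
```

Purity rises as the number of clusters grows, so it is best read alongside chance-corrected measures such as ARI, as the abstract does.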