2025-12-10 – Abigail Adams
Rivers have long been storytellers of human history. From the Nile to the Yangtze, they have shaped trade, migration, settlement, and the rise of civilizations. They reveal the traces of human ambition... and its costs. Today, from the Charles to the Golden Gate, US rivers continue to tell stories, especially through data.
Over the past several decades, extensive water quality monitoring efforts have generated vast public datasets: millions of measurements of pH, dissolved oxygen, temperature, and conductivity collected across the country. These records are more than environmental snapshots; they are archives of political priorities, regulatory choices, and ecological disruptions. Ultimately, they are evidence of how societies interact with their environments, often unevenly.
In this talk, I’ll explore how Python and modern data workflows can help us "listen" to these stories at scale. Using the United States Geological Survey (USGS) Water Data APIs and Remote SSH in Positron, I’ll process terabytes of sensor data spanning several years and regions. I’ll show how Parquet and DuckDB enable scalable exploration of historical records, and why Remote SSH is essential for analysis at this scale. Along the way, I’ll pose analytical questions that surface patterns linked to industrial growth, regulatory shifts, and climate change.
By treating rivers as both ecological systems and social mirrors, we can begin to see how environmental data encodes histories of inequality, resilience, and transformation.
Whether your interest lies in data engineering, environmental analytics, or the human dimensions of climate and infrastructure, this talk will offer both technical methods and sociological lenses for understanding the stories rivers continue to tell.
Context
Rivers are not just ecological systems... they are historical witnesses. Over decades, millions of water quality measurements have been recorded at monitoring sites across the United States. This talk shows how data scientists can tap into this living archive using the USGS Water Quality API, which provides open access to high-resolution, continuous monitoring data.
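To make this concrete, here is a minimal sketch of pulling a slice of that data from the USGS Instantaneous Values REST service with plain `requests`; the site number, date range, and parameter codes below are illustrative, and the talk's pipeline may use a different endpoint or client library:

```python
# Minimal sketch: fetch continuous readings from the USGS Instantaneous
# Values service (waterservices.usgs.gov). Site and dates are illustrative.
import requests

# Common USGS parameter codes: 00010 = water temperature (deg C),
# 00400 = pH, 00095 = specific conductance, 00300 = dissolved oxygen
params = {
    "format": "json",
    "sites": "01646500",  # example site: Potomac River near Washington, DC
    "parameterCd": "00010,00400,00095,00300",
    "startDT": "2024-01-01",
    "endDT": "2024-01-31",
}
resp = requests.get("https://waterservices.usgs.gov/nwis/iv/",
                    params=params, timeout=60)
resp.raise_for_status()

# The response nests one time series per site/parameter pair
for ts in resp.json()["value"]["timeSeries"]:
    name = ts["variable"]["variableName"]
    readings = ts["values"][0]["value"]
    print(f"{name}: {len(readings)} readings")
```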
Working with such massive datasets requires scalable approaches. This talk will demonstrate how Remote SSH can be used to process terabytes of time series data without overwhelming a local machine. Attendees will see how to download and structure water quality data, convert it to efficient columnar formats like Parquet, and query it interactively with DuckDB, all using familiar Python workflows and free, open-source tools.
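As a taste of that step, here is a minimal sketch of storing readings as Parquet and querying them with DuckDB; the flattened schema (`site`, `ts`, `parameter`, `value`) is a hypothetical one, and writing Parquet from pandas assumes `pyarrow` is installed:

```python
# Sketch: flatten readings into a DataFrame, store as Parquet, query with DuckDB
import duckdb
import pandas as pd

# Hypothetical flattened schema; in practice this comes from the API response
df = pd.DataFrame({
    "site": ["01646500"] * 3,
    "ts": pd.to_datetime(["2024-06-01 00:00", "2024-06-01 00:15",
                          "2024-06-01 00:30"]),
    "parameter": ["pH", "pH", "pH"],
    "value": [7.4, 7.5, 7.5],
})
df.to_parquet("readings.parquet", index=False)  # columnar, compressed on disk

# DuckDB queries the Parquet file in place: no load step, no database server
con = duckdb.connect()
print(con.execute("""
    SELECT site, parameter, avg(value) AS mean_value, count(*) AS n
    FROM 'readings.parquet'
    GROUP BY site, parameter
""").df())
```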
More than a technical walkthrough, this talk frames water data as a sociological artifact and as a record of how communities, industries, and ecosystems have interacted over time. By analyzing trends in key parameters like pH, temperature, and conductivity, we can surface narratives of environmental change, industrialization, and regulation. This blend of large-scale data processing and social interpretation makes the talk relevant to anyone interested in data engineering, environmental science, or applied analytics.
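As one hypothetical example of such a trend analysis, yearly averages of pH per site fall out of a single DuckDB query over the Parquet data produced above (the file path and column names are assumptions carried over from the earlier sketch):

```python
# Sketch: a long-term trend query -- yearly mean pH per monitoring site,
# the kind of series one might line up against regulatory or industrial history
import duckdb

con = duckdb.connect()
trend = con.execute("""
    SELECT site,
           year(ts) AS yr,
           avg(value) AS mean_ph
    FROM 'readings.parquet'
    WHERE parameter = 'pH'
    GROUP BY site, yr
    ORDER BY site, yr
""").df()
print(trend)
```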
Prior Knowledge Expected
Basic familiarity with Python, data analysis workflows, and working with APIs. Some knowledge of tabular data formats (e.g., Parquet, CSV) and SQL-style querying (e.g., DuckDB or pandas) will be helpful but not required.
Outline with Time Estimates
- 0–5 min: Why rivers matter and data as historical memory
  - Rivers as sociological and environmental archives
  - Scope and scale of USGS continuous monitoring
- 5–15 min: Getting the data
  - Overview of the USGS Water Quality API
  - Continuous vs. discrete measurements
  - Estimating data scale across states
- 15–25 min: Processing at scale
  - Using Remote SSH to handle large datasets
  - Partitioning data with Parquet and querying with DuckDB (see the sketch after this outline)
  - Building reproducible Python pipelines
- 25–35 min: Listening to the data
  - Identifying long-term trends: pH, conductivity, dissolved oxygen
  - Connecting data to environmental and regulatory history
  - Mapping change over space and time
- 35–40 min: Takeaways and discussion
  - Lessons from working with massive public datasets
  - How sociological framing enriches technical analysis
  - Q&A
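As referenced in the outline, here is a minimal sketch of the partitioning pattern, assuming a hive-style directory layout and the same hypothetical schema as above (pandas writes partitioned Parquet through pyarrow; site numbers are illustrative):

```python
# Sketch: partition readings by state and year so queries can skip whole
# directories instead of scanning every file
import duckdb
import pandas as pd

df = pd.DataFrame({
    "state": ["MA", "MA", "CA"],
    "year": [2023, 2024, 2024],
    "site": ["01104500", "01104500", "11447650"],  # illustrative site numbers
    "parameter": ["pH", "pH", "pH"],
    "value": [7.1, 7.2, 7.8],
})

# Writes readings_partitioned/state=MA/year=2024/... via pyarrow's dataset writer
df.to_parquet("readings_partitioned", partition_cols=["state", "year"])

# DuckDB recovers the partition columns from the paths and prunes partitions
con = duckdb.connect()
print(con.execute("""
    SELECT *
    FROM read_parquet('readings_partitioned/**/*.parquet',
                      hive_partitioning = true)
    WHERE state = 'MA' AND year = 2024
""").df())
```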
Audience Takeaways
- How to estimate and retrieve massive environmental datasets using the USGS API
- How to use Remote SSH workflows in Positron to handle large-scale data processing
- How to structure and query historical time series efficiently with Parquet and DuckDB
- How sociological framing can shape richer and more meaningful data analysis
- Practical patterns for working with other large open datasets beyond USGS
Audience Level
Beginner to Intermediate
Talk Length
40 minutes (including Q&A)
Speaker Bio
Rodrigo Silva Ferreira is a QA Engineer at Posit, where he contributes to the quality and usability of open-source tools that empower data scientists working in R and Python. He focuses on both manual and automated testing strategies to ensure reliability, performance, and an excellent user experience.
Rodrigo holds a BSc. in Chemistry with minors in Applied Math and Arabic from NYU Abu Dhabi and an MSc. in Analytical Chemistry from the University of Pittsburgh. Multilingual and globally minded, he enjoys working at the intersection of data, science, and technology, especially when it means building tools that help people better understand and navigate the world through its increasingly complex data.