Skip to content

yvnnhong/ebird-sql-duckdb-analytics

Repository files navigation

eBird SQL DuckDB Analytics

Focus: Species Ecology, Conflict, and Conservation in California

Project Overview

Comprehensive analysis of bird population dynamics, breeding patterns, and conservation outcomes in California using large-scale eBird citizen science data (2005-2025). I requested a custom download of 47GB of bird observation data in California from Cornell's eBird api from 2005-2025 to examine three distinct ecological niches across multiple species, providing insights into urban adaptation, conservation effectiveness, and climate resilience in California's diverse avian communities.

Research Scope

Dataset: 47GB eBird observation data from Cornell Lab eBird API
Geographic Focus: California statewide
Temporal Scope: ~21 years (2005-2025)
Analysis Platform: DuckDB SQL queries Species Coverage: Urban adapters, endangered species, and endemic specialists

Study Components

1. Urban Adaptation Comparative Study

Folder: house_finch_vs_house_sparrow/

Research Question: How do two historically successful urban-adapted species respond differently to California's changing urban landscapes?

Species Analyzed:

  • House Finch (Haemorhous mexicanus) - Native California species
  • House Sparrow (Passer domesticus) - Introduced European species

Key Findings:

  • House Finch dominance: 87.1% of observations (134.4M birds vs 19.9M sparrows)
  • Urban adaptation reversal: House Sparrows declining (-31.7%) while House Finches increase (+133.6%)
  • Geographic displacement: Sparrows show range contraction in major metropolitan areas
  • Breeding success differential: Finches maintain 44.6% success vs Sparrows' 41.0% in urban environments

Conservation Implications: Native species may outcompete introduced species in mature urban environments, suggesting urban habitat management can favor native biodiversity.

2. Endangered Species Conservation Assessment

Folder: ridgways_rail/

Research Question: How effective have large-scale habitat restoration programs been for endangered species recovery?

Species Analyzed:

  • Ridgway's Rail (Rallus obsoletus) - Endangered California marsh endemic

Key Findings:

  • Dramatic population recovery: 4,456% increase (390 to 17,758 birds observed annually)
  • Geographic expansion: 3,880% increase in occupied localities (58 to 2,261 sites)
  • Conservation success by region: San Francisco Bay Area shows coordinated improvement across multiple counties
  • Habitat restoration effectiveness: Counties with active salt pond restoration programs show highest recovery rates

Conservation Implications: Large-scale, coordinated habitat restoration programs can achieve remarkable endangered species recovery when sustained over multi-decade timeframes.

3. Endemic Species Climate Resilience Study

Folder: yellow_billed_magpie/

Research Question: How do endemic species adapt to climate variability, and what factors drive range expansion?

Species Analyzed:

  • Yellow-billed Magpie (Pica nuttalli) - California Central Valley endemic

Key Findings:

  • Active range expansion: 53.8% of breeding counties represent recent colonizations (2020+)
  • Climate stability: Breeding success remains stable (19-22%) across all climate conditions
  • Year-round breeding: Unique extended breeding season unlike most temperate species
  • Observer behavior insights: Climate affects citizen science data collection more than actual bird biology

Conservation Implications: Endemic species can demonstrate remarkable climate resilience and adaptive capacity when suitable habitat corridors exist for range expansion.

Methodological Innovations

Large-Scale Data Processing

  • DuckDB optimization: Custom SQL queries handling 47GB datasets efficiently
  • Quality control protocols: Systematic filtering for data accuracy and biological relevance
  • Multi-temporal analysis: 20+ year trend detection with seasonal and annual resolution

Geographic Analysis Techniques

  • Custom grid systems: 1km² coordinate-based cells for fine-scale range analysis
  • Locality tracking: Colonization event detection using unique site identifiers
  • Administrative boundary analysis: County-level conservation impact assessment

Citizen Science Data Validation

  • Observer effort standardization: Accounting for participation bias in trend analysis
  • Detection probability modeling: Handling variable observer skill and effort
  • Climate bias correction: Identifying weather effects on data collection patterns

Breeding Success Metrics

  • Standardized breeding codes: Consistent classification across species and studies
  • Success rate calculations: Proportion-based metrics with confidence intervals
  • Seasonal pattern analysis: Monthly and seasonal breeding activity quantification

Key Technological Components

Database Architecture

  • DuckDB analytical platform: High-performance analytical queries on large datasets
  • CSV processing optimization: Efficient handling of 47GB flat files
  • Memory management: Optimized for large-scale ecological data analysis

SQL Query Development

  • Modular analysis framework: Reusable query components across species
  • Complex temporal analysis: Multi-year trend detection with statistical rigor
  • Geographic computation: Coordinate-based range and expansion calculations

Data Quality Assurance

  • Systematic validation protocols: Multi-stage filtering for observation quality
  • Outlier detection: Statistical methods for identifying data entry errors
  • Biological relevance filtering: Species-specific behavior and count validation

Conservation Science Contributions

Population Monitoring Methodologies

  • Long-term trend analysis: 20+ year population trajectories using citizen science data
  • Conservation program evaluation: Quantitative assessment of restoration effectiveness
  • Range dynamics modeling: Understanding factors driving geographic distribution changes

Urban Ecology Insights

  • Native vs introduced species competition: Competitive dynamics in urban environments
  • Habitat quality assessment: Urban environment suitability for different species groups
  • Conservation prioritization: Data-driven urban habitat management recommendations

Climate Adaptation Research

  • Species resilience assessment: Climate vulnerability evaluation using breeding success metrics
  • Behavioral adaptation analysis: Extended breeding seasons and climate response
  • Observer bias quantification: Citizen science data quality under variable conditions

Research Applications

Academic Research

  • Avian ecology: Population dynamics, breeding ecology, and geographic distribution
  • Conservation biology: Restoration effectiveness and endangered species recovery
  • Urban ecology: Species adaptation to anthropogenic environments
  • Climate science: Species responses to climate variability and change

Conservation Management

  • Resource allocation: Data-driven prioritization of conservation investments
  • Program evaluation: Quantitative metrics for restoration and management assessment
  • Adaptive management: Geographic and temporal patterns informing future strategies
  • Policy development: Evidence-based recommendations for species and habitat protection

Citizen Science Research

  • Data quality assessment: Validation methods for large-scale citizen science datasets
  • Observer engagement: Understanding factors driving participation and data quality
  • Bias correction: Methods for accounting for observer and temporal sampling bias
  • Integration protocols: Combining citizen science with professional monitoring data

Technical Specifications

Data Sources

  • Primary dataset: Cornell Lab eBird API (47GB California observations)
  • Temporal coverage: 2005-2025 (20+ years)
  • Geographic scope: California statewide (58 counties)
  • Species coverage: 3 focal species with contrasting ecological strategies

Analytical Platform

  • Database system: DuckDB with SQL query optimization
  • Processing capacity: 47GB dataset analysis with complex multi-table joins
  • Output formats: Structured text results with statistical summaries
  • Documentation: Comprehensive SQL query libraries with embedded methodology

Quality Control Standards

  • Observation validation: Multi-stage filtering for biological accuracy
  • Temporal consistency: Standardized date and seasonal classifications
  • Geographic precision: Coordinate validation and administrative boundary matching
  • Breeding behavior classification: Standardized codes across species and studies

Dataset Statistics

Study Component Observations Species Years Counties Key Metric
Urban Adaptation 3.2M records 2 species 2005-2025 58 counties 154.3M birds analyzed
Conservation Assessment 44,248 records 1 species 2005-2025 16 counties 4,456% population increase
Climate Resilience 1,792 records 1 species 2013-2025 26 counties 53.8% range expansion

Summary and Impact

This project demonstrates the power of large-scale citizen science data for addressing fundamental questions in avian ecology and conservation biology. By analyzing three ecologically distinct species across multiple decades, the research provides insights into urban adaptation, conservation effectiveness, and climate resilience that inform both scientific understanding and practical conservation management.

The methodological framework developed for this project, including advanced SQL query optimization for ecological data and systematic approaches to citizen science data validation, contributes to the growing field of computational ecology and big data applications in conservation science.

About

Analyzing 47GB of eBird data (2005-2025) with DuckDB SQL to study urban adaptation in House Finches vs. House Sparrows, endangered Ridgway’s Rail recovery, and Yellow-billed Magpie climate resilience in California. Provides data-driven insights for conservation and urban ecology.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors