Data Collection Methodology

Overview

We estimate Python package usage by continuously sampling and analyzing public GitHub repositories. Our approach provides insights into real-world package adoption patterns across the Python ecosystem.

1. Repository Sampling

We use a temporal randomization strategy to ensure diverse repository coverage:

  • Random Time Selection: We randomly select hours across the past 10 years
  • GitHub API Query: For each random hour, we query the GitHub API for repositories updated during that window (see the sketch after this list)
  • Quality Filters:
    • Minimum 5 stars (configurable)
    • Minimum 100KB repository size
    • Primary language: Python
  • Sample Size: Currently processing 50 repositories every 6 hours via GitHub Actions
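As an illustration, the temporal sampling step might look roughly like the sketch below. The function names, the use of the requests library, and the exact search qualifiers are assumptions made for this example; only the filters themselves (Python language, star and size minimums, a one-hour update window) come from the description above.

```python
import datetime
import random
import requests

def random_hour_window(years_back=10):
    """Pick a random one-hour window within the past `years_back` years."""
    now = datetime.datetime.now(datetime.timezone.utc)
    start = now - datetime.timedelta(days=365 * years_back)
    offset = random.randint(0, int((now - start).total_seconds()) - 3600)
    begin = start + datetime.timedelta(seconds=offset)
    return begin, begin + datetime.timedelta(hours=1)

def sample_repositories(token, per_page=50):
    """Query the GitHub search API for Python repos pushed within a random hour."""
    begin, end = random_hour_window()
    query = (
        "language:python stars:>=5 size:>=100 "
        f"pushed:{begin:%Y-%m-%dT%H:%M:%SZ}..{end:%Y-%m-%dT%H:%M:%SZ}"
    )
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": query, "per_page": per_page},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["items"]
```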

2. Import Analysis

For each sampled repository, we perform comprehensive import analysis:

  • Repository Cloning: We perform shallow clones (truncated history) to keep download time and disk usage low
  • File Discovery: Identify Python files (.py extension) throughout the repository
  • AST Parsing: Use Python's Abstract Syntax Tree (AST) parser for accurate import extraction (see the sketch after this list)
  • Fallback Parsing: Regex-based extraction for files with syntax errors
  • Resource Limits:
    • Maximum 10 Python files per repository
    • Skip files larger than 1MB
    • 5-minute timeout per repository
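A minimal sketch of the per-file extraction step is shown below. The function name and the fallback regex are illustrative rather than the project's actual code, but the AST-first, regex-fallback structure follows the description above.

```python
import ast
import re

# Fallback pattern for files the AST parser rejects (syntax errors).
IMPORT_RE = re.compile(r"^\s*(?:from\s+([\w.]+)\s+import|import\s+([\w.]+))",
                       re.MULTILINE)

def extract_imports(source: str) -> set[str]:
    """Return the dotted module paths imported by one Python source file."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return {m.group(1) or m.group(2) for m in IMPORT_RE.finditer(source)}
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Skip relative imports (node.level > 0): those are repository-internal.
            modules.add(node.module)
    return modules
```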

3. Data Processing

We filter and aggregate the collected import data:

  • Standard Library Filtering: Exclude Python standard library imports using sys.stdlib_module_names
  • Package Name Extraction: Extract top-level package names from import statements
  • Deduplication: Track processed repositories to avoid counting the same repository multiple times
  • Aggregation: Count occurrences of each package across all analyzed files (sketched below)
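The filtering and aggregation steps could be tied together roughly as follows. This sketch assumes dotted module paths (as produced by the extraction step) as input; note that sys.stdlib_module_names requires Python 3.10 or later.

```python
import sys
from collections import Counter

def top_level(module: str) -> str:
    """'pandas.core.frame' -> 'pandas'."""
    return module.split(".")[0]

def count_packages(imported_modules):
    """Count third-party top-level packages, excluding the standard library."""
    counts = Counter()
    for module in imported_modules:
        name = top_level(module)
        if name in sys.stdlib_module_names:  # Python 3.10+
            continue
        counts[name] += 1
    return counts
```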

4. Statistical Considerations

Strengths:

  • Random sampling reduces selection bias
  • Continuous updates provide fresh data
  • Quality filters screen out trivial or inactive repositories
  • AST parsing gives accurate import detection

Limitations:

  • Sample Size: Current sample may not be statistically representative of all ~18M Python repositories
  • Temporal Bias: Recently updated repositories may be overrepresented
  • Private Repositories: Analysis limited to public repositories only
  • Internal Modules: Some counted "packages" may be repository-specific modules (e.g., "utils")
  • Import vs Usage: An import only shows that a package is referenced, not how heavily it is used

5. Automation & Updates

Our data collection is fully automated using GitHub Actions:

  • Schedule: Runs every 6 hours
  • Workflow:
    1. Find new random repositories
    2. Analyze imports in unprocessed repositories
    3. Aggregate and count package usage
    4. Update statistics and dashboard
    5. Commit results back to repository
  • Incremental Processing: Only analyzes new repositories to avoid duplication (see the sketch below)
  • Data Persistence: All raw data stored in JSONL format for transparency
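The sketch below shows how the incremental-processing guard and the JSONL persistence might fit together. The file names match those listed in section 6, but the helper functions and record fields are illustrative assumptions.

```python
import json
from pathlib import Path

PROCESSED = Path("processed_repos.txt")  # one repository full name per line
IMPORTS = Path("imports.jsonl")          # one JSON record per analyzed repository

def already_processed() -> set[str]:
    """Repositories that earlier runs have already analyzed."""
    return set(PROCESSED.read_text().split()) if PROCESSED.exists() else set()

def record_result(full_name: str, imports: list[str]) -> None:
    """Append one repository's imports and mark the repository as processed."""
    with IMPORTS.open("a") as f:
        f.write(json.dumps({"repo": full_name, "imports": imports}) + "\n")
    with PROCESSED.open("a") as f:
        f.write(full_name + "\n")

# Typical per-run loop (sampled_repos and analyze are placeholders):
# seen = already_processed()
# for repo in sampled_repos:
#     if repo["full_name"] not in seen:
#         record_result(repo["full_name"], analyze(repo))
```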

6. Data Files

All data is publicly available in our repository:

File                  Description                            Format
repos.jsonl           Sampled repository information         JSON Lines
imports.jsonl         Raw import statements with context     JSON Lines
library_counts.csv    Aggregated package usage statistics    CSV
processed_repos.txt   List of analyzed repositories          Plain text
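For example, the JSON Lines files can be read line by line with the standard library; the field names in each record are defined by the files themselves, so this snippet only shows the loading pattern.

```python
import csv
import json

# Each line of a .jsonl file is an independent JSON object.
with open("imports.jsonl") as f:
    import_records = [json.loads(line) for line in f]

# The aggregated counts are a plain CSV file.
with open("library_counts.csv") as f:
    package_counts = list(csv.DictReader(f))
```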

7. Future Improvements

We're continuously working to improve our methodology:

  • Increase the sample size to reduce sampling error
  • Add PyPI validation to filter out internal modules
  • Track package versions and Python version requirements
  • Implement stratified sampling by repository characteristics
  • Add confidence intervals and error margins
  • Distinguish between production and test code imports
  • Track temporal trends in package adoption

8. Contributing

This project is open source and welcomes contributions!

  • Source Code: GitHub Repository
  • Report Issues: Issue Tracker
  • Suggest Improvements: Open a pull request or issue
  • Use the Data: All data is freely available under open license