Data Collection Methodology
Overview
We estimate Python package usage by continuously sampling and analyzing public GitHub repositories. Our approach provides insights into real-world package adoption patterns across the Python ecosystem.
1. Repository Sampling
We use a temporal randomization strategy to ensure diverse repository coverage (a sketch of the sampling query follows this list):
- Random Time Selection: We randomly select hours across the past 10 years
- GitHub API Query: For each random hour, we query repositories updated during that time
- Quality Filters:
  - Minimum 5 stars (configurable)
  - Minimum 100KB repository size
  - Primary language: Python
- Sample Size: Currently processing 50 repositories every 6 hours via GitHub Actions
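The sketch below illustrates how a random one-hour window could be turned into a GitHub search query. The function names and the exact query string are illustrative assumptions, not the project's actual code.

```python
# Illustrative sketch of the sampling step (not the project's actual code).
# Assumes a GitHub personal access token and the REST search API.
import datetime
import random

import requests

SEARCH_URL = "https://api.github.com/search/repositories"
YEARS_BACK = 10
MIN_STARS = 5       # configurable, per the quality filters above
MIN_SIZE_KB = 100   # GitHub's `size` qualifier is measured in KB


def random_hour_window():
    """Pick a random one-hour window within the past ~10 years."""
    now = datetime.datetime.now(datetime.timezone.utc)
    offset = random.randint(0, YEARS_BACK * 365 * 24)
    start = (now - datetime.timedelta(hours=offset)).replace(minute=0, second=0, microsecond=0)
    return start, start + datetime.timedelta(hours=1)


def sample_repositories(token, per_page=50):
    """Query repositories pushed during a random hour that pass the quality filters."""
    start, end = random_hour_window()
    query = (
        f"language:python stars:>={MIN_STARS} size:>={MIN_SIZE_KB} "
        f"pushed:{start.isoformat()}..{end.isoformat()}"
    )
    resp = requests.get(
        SEARCH_URL,
        params={"q": query, "per_page": per_page},
        headers={"Authorization": f"token {token}", "Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])
```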
2. Import Analysis
For each sampled repository, we perform a comprehensive import analysis (an extraction sketch follows this list):
- Repository Cloning: We clone repositories with minimal depth for efficiency
- File Discovery: Identify Python files (.py extension) throughout the repository
- AST Parsing: Use Python's Abstract Syntax Tree (AST) parser for accurate import extraction
- Fallback Parsing: Regex-based extraction for files with syntax errors
- Resource Limits:
  - Maximum 10 Python files per repository
  - Skip files larger than 1MB
  - 5-minute timeout per repository
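To make the AST step concrete, here is a minimal sketch of import extraction with a regex fallback; the helper names are assumptions, and the real analyzer may be structured differently.

```python
# Minimal sketch of import extraction; helper names are illustrative.
import ast
import re

# Fallback pattern: grabs the module path after `import` or `from ... import`.
IMPORT_RE = re.compile(r"^\s*(?:from\s+([\w.]+)\s+import|import\s+([\w.]+))", re.MULTILINE)


def extract_imports(source):
    """Return the set of module paths imported in a Python source string."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Regex fallback for files that do not parse (e.g. Python 2 syntax).
        return {m.group(1) or m.group(2) for m in IMPORT_RE.finditer(source)}

    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Relative imports (`from . import x`) have module=None and are skipped here.
            modules.add(node.module)
    return modules
```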
3. Data Processing
We filter and aggregate the collected import data, as sketched after this list:
- Standard Library Filtering: Exclude Python standard library imports using sys.stdlib_module_names
- Package Name Extraction: Extract top-level package names from import statements
- Deduplication: Track processed repositories to avoid counting the same repository multiple times
- Aggregation: Count occurrences of each package across all analyzed files
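A sketch of the filtering and counting logic is shown below. It relies on sys.stdlib_module_names, which is available from Python 3.10 onward; the function names are illustrative.

```python
# Sketch of the filtering and aggregation step; names are illustrative.
import sys
from collections import Counter


def top_level_package(module_path):
    """'pandas.core.frame' -> 'pandas'."""
    return module_path.split(".", 1)[0]


def count_packages(imported_modules):
    """Count third-party top-level packages, skipping the standard library."""
    counts = Counter()
    for module_path in imported_modules:
        package = top_level_package(module_path)
        if package in sys.stdlib_module_names:  # e.g. `os`, `json`, `re`
            continue
        counts[package] += 1
    return counts
```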
4. Statistical Considerations
Strengths:
- Random sampling reduces selection bias
- Continuous updates provide fresh data
- Quality filters ensure meaningful repositories
- AST parsing ensures accurate import detection
Limitations:
- Sample Size: The current sample may not be statistically representative of the ~18M public Python repositories on GitHub
- Temporal Bias: Recently updated repositories may be overrepresented
- Private Repositories: Analysis limited to public repositories only
- Internal Modules: Some counted "packages" may be repository-specific modules (e.g., "utils")
- Import vs Usage: Import presence doesn't indicate usage intensity
5. Automation & Updates
Our data collection is fully automated using GitHub Actions:
- Schedule: Runs every 6 hours
- Workflow:
  - Find new random repositories
  - Analyze imports in unprocessed repositories
  - Aggregate and count package usage
  - Update statistics and dashboard
  - Commit results back to repository
- Incremental Processing: Only analyzes new repositories to avoid duplication (see the sketch below)
- Data Persistence: All raw data stored in JSONL format for transparency
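A minimal sketch of the deduplication step, assuming processed_repos.txt holds one repository name per line; the actual workflow may track repositories differently.

```python
# Sketch of incremental processing; file layout and field names are assumptions.
from pathlib import Path

PROCESSED_FILE = Path("processed_repos.txt")


def load_processed():
    """Return the set of repository names analyzed in previous runs."""
    if not PROCESSED_FILE.exists():
        return set()
    return set(PROCESSED_FILE.read_text().splitlines())


def select_unprocessed(sampled_repos):
    """Keep only repositories that have not been analyzed before."""
    seen = load_processed()
    return [repo for repo in sampled_repos if repo["full_name"] not in seen]


def mark_processed(repo_full_name):
    """Append a newly analyzed repository so later runs skip it."""
    with PROCESSED_FILE.open("a") as f:
        f.write(repo_full_name + "\n")
```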
6. Data Files
All data is publicly available in our repository (a short loading example follows the table):
| File | Description | Format |
|---|---|---|
| `repos.jsonl` | Sampled repository information | JSON Lines |
| `imports.jsonl` | Raw import statements with context | JSON Lines |
| `library_counts.csv` | Aggregated package usage statistics | CSV |
| `processed_repos.txt` | List of analyzed repositories | Plain text |
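For example, the JSON Lines files can be read one record per line; the field names inside each record are not documented here, so treat them as a schema to inspect rather than a contract.

```python
# Example of loading the published data files.
import csv
import json


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


repos = list(read_jsonl("repos.jsonl"))
imports = list(read_jsonl("imports.jsonl"))

with open("library_counts.csv", newline="", encoding="utf-8") as f:
    counts = list(csv.DictReader(f))
```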
7. Future Improvements
We're continuously working to improve our methodology:
- Increase the sample size to improve statistical power
- Add PyPI validation to filter out internal modules (see the sketch after this list)
- Track package versions and Python version requirements
- Implement stratified sampling by repository characteristics
- Add confidence intervals and error margins
- Distinguish between production and test code imports
- Track temporal trends in package adoption
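One way the planned PyPI validation could work is to keep only names that PyPI's JSON API recognizes; this is a hypothetical sketch, not something the pipeline currently does.

```python
# Hypothetical sketch of PyPI validation (a planned improvement, not current behavior).
import requests


def exists_on_pypi(package_name):
    """Return True if the name is a published PyPI project."""
    resp = requests.get(f"https://pypi.org/pypi/{package_name}/json", timeout=10)
    return resp.status_code == 200
```

Results would need caching to respect PyPI's rate limits, and some generic names are published on PyPI, so this check would reduce but not eliminate the internal-module problem.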
8. Contributing
This project is open source and welcomes contributions!
- Source Code: GitHub Repository
- Report Issues: Issue Tracker
- Suggest Improvements: Open a pull request or issue
- Use the Data: All data is freely available under open license