Data Collection Methodology
Overview
We estimate Python package usage by continuously sampling and analyzing public GitHub repositories. Our approach provides insights into real-world package adoption patterns across the Python ecosystem.
1. Repository Sampling
We use a temporal randomization strategy to ensure diverse repository coverage (a sketch of the sampling query follows this list):
- Random Time Selection: We randomly select hours across the past 10 years
- GitHub API Query: For each random hour, we query repositories updated during that time
- Quality Filters:
  - Minimum 5 stars (configurable)
  - Minimum 100KB repository size
  - Primary language: Python
- Sample Size: Currently processing 50 repositories every 6 hours via GitHub Actions
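The sketch below illustrates how a random one-hour window could be turned into a GitHub search query. The function names and the exact query string are illustrative assumptions, not the project's actual code.

```python
# Illustrative sketch of the sampling step (not the project's actual code).
# Assumes a GitHub personal access token and the REST search API.
import datetime
import random

import requests

SEARCH_URL = "https://api.github.com/search/repositories"
YEARS_BACK = 10
MIN_STARS = 5       # configurable, per the quality filters above
MIN_SIZE_KB = 100   # GitHub's `size` qualifier is measured in KB


def random_hour_window():
    """Pick a random one-hour window within the past ~10 years."""
    now = datetime.datetime.now(datetime.timezone.utc)
    offset = random.randint(0, YEARS_BACK * 365 * 24)
    start = (now - datetime.timedelta(hours=offset)).replace(minute=0, second=0, microsecond=0)
    return start, start + datetime.timedelta(hours=1)


def sample_repositories(token, per_page=50):
    """Query repositories pushed during a random hour that pass the quality filters."""
    start, end = random_hour_window()
    query = (
        f"language:python stars:>={MIN_STARS} size:>={MIN_SIZE_KB} "
        f"pushed:{start.isoformat()}..{end.isoformat()}"
    )
    resp = requests.get(
        SEARCH_URL,
        params={"q": query, "per_page": per_page},
        headers={"Authorization": f"token {token}", "Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])
```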
2. Import Analysis
For each sampled repository, we perform a comprehensive import analysis (an extraction sketch follows this list):
- Repository Cloning: We clone repositories with minimal depth for efficiency
- File Discovery: Identify Python files (.py extension) throughout the repository
- AST Parsing: Use Python's Abstract Syntax Tree (AST) parser for accurate import extraction
- Fallback Parsing: Regex-based extraction for files with syntax errors
- Resource Limits:
  - Maximum 10 Python files per repository
  - Skip files larger than 1MB
  - 5-minute timeout per repository
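To make the AST step concrete, here is a minimal sketch of import extraction with a regex fallback; the helper names are assumptions, and the real analyzer may be structured differently.

```python
# Minimal sketch of import extraction; helper names are illustrative.
import ast
import re

# Fallback pattern: grabs the module path after `import` or `from ... import`.
IMPORT_RE = re.compile(r"^\s*(?:from\s+([\w.]+)\s+import|import\s+([\w.]+))", re.MULTILINE)


def extract_imports(source):
    """Return the set of module paths imported in a Python source string."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Regex fallback for files that do not parse (e.g. Python 2 syntax).
        return {m.group(1) or m.group(2) for m in IMPORT_RE.finditer(source)}

    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Relative imports (`from . import x`) have module=None and are skipped here.
            modules.add(node.module)
    return modules
```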
3. Data Processing
We filter and aggregate the collected import data, as sketched after this list:
- Standard Library Filtering: Exclude Python standard library imports using sys.stdlib_module_names
- Package Name Extraction: Extract top-level package names from import statements
- Deduplication: Track processed repositories to avoid counting the same repository multiple times
- Aggregation: Count occurrences of each package across all analyzed files
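A sketch of the filtering and counting logic is shown below. It relies on sys.stdlib_module_names, which is available from Python 3.10 onward; the function names are illustrative.

```python
# Sketch of the filtering and aggregation step; names are illustrative.
import sys
from collections import Counter


def top_level_package(module_path):
    """'pandas.core.frame' -> 'pandas'."""
    return module_path.split(".", 1)[0]


def count_packages(imported_modules):
    """Count third-party top-level packages, skipping the standard library."""
    counts = Counter()
    for module_path in imported_modules:
        package = top_level_package(module_path)
        if package in sys.stdlib_module_names:  # e.g. `os`, `json`, `re`
            continue
        counts[package] += 1
    return counts
```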
4. Statistical Considerations
Strengths:
- Random sampling reduces selection bias
- Continuous updates provide fresh data
- Quality filters ensure meaningful repositories
- AST parsing ensures accurate import detection
Limitations:
- Sample Size: The current sample may not be statistically representative of the ~18M public Python repositories on GitHub
- Temporal Bias: Recently updated repositories may be overrepresented
- Private Repositories: Analysis limited to public repositories only
- Internal Modules: Some counted "packages" may be repository-specific modules (e.g., "utils")
- Import vs Usage: Import presence doesn't indicate usage intensity
5. Automation & Updates
Our data collection is fully automated using GitHub Actions:
- Schedule: Runs every 6 hours
- Workflow:
  - Find new random repositories
  - Analyze imports in unprocessed repositories
  - Aggregate and count package usage
  - Update statistics and dashboard
  - Commit results back to repository
- Incremental Processing: Only analyzes new repositories to avoid duplication (see the sketch below)
- Data Persistence: All raw data stored in JSONL format for transparency
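A minimal sketch of the deduplication step, assuming processed_repos.txt holds one repository name per line; the actual workflow may track repositories differently.

```python
# Sketch of incremental processing; file layout and field names are assumptions.
from pathlib import Path

PROCESSED_FILE = Path("processed_repos.txt")


def load_processed():
    """Return the set of repository names analyzed in previous runs."""
    if not PROCESSED_FILE.exists():
        return set()
    return set(PROCESSED_FILE.read_text().splitlines())


def select_unprocessed(sampled_repos):
    """Keep only repositories that have not been analyzed before."""
    seen = load_processed()
    return [repo for repo in sampled_repos if repo["full_name"] not in seen]


def mark_processed(repo_full_name):
    """Append a newly analyzed repository so later runs skip it."""
    with PROCESSED_FILE.open("a") as f:
        f.write(repo_full_name + "\n")
```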
6. Data Files
All data is publicly available in our repository (a short loading example follows the table):
| File | Description | Format |
|---|---|---|
| `repos.jsonl` | Sampled repository information | JSON Lines |
| `imports.jsonl` | Raw import statements with context | JSON Lines |
| `library_counts.csv` | Aggregated package usage statistics | CSV |
| `processed_repos.txt` | List of analyzed repositories | Plain text |
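For example, the JSON Lines files can be read one record per line; the field names inside each record are not documented here, so treat them as a schema to inspect rather than a contract.

```python
# Example of loading the published data files.
import csv
import json


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


repos = list(read_jsonl("repos.jsonl"))
imports = list(read_jsonl("imports.jsonl"))

with open("library_counts.csv", newline="", encoding="utf-8") as f:
    counts = list(csv.DictReader(f))
```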
7. Future Improvements
We're continuously working to improve our methodology:
- Increase the sample size to improve statistical power
- Add PyPI validation to filter out internal modules (see the sketch after this list)
- Track package versions and Python version requirements
- Implement stratified sampling by repository characteristics
- Add confidence intervals and error margins
- Distinguish between production and test code imports
- Track temporal trends in package adoption
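One way the planned PyPI validation could work is to keep only names that PyPI's JSON API recognizes; this is a hypothetical sketch, not something the pipeline currently does.

```python
# Hypothetical sketch of PyPI validation (a planned improvement, not current behavior).
import requests


def exists_on_pypi(package_name):
    """Return True if the name is a published PyPI project."""
    resp = requests.get(f"https://pypi.org/pypi/{package_name}/json", timeout=10)
    return resp.status_code == 200
```

Results would need caching to respect PyPI's rate limits, and some generic names are published on PyPI, so this check would reduce but not eliminate the internal-module problem.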
8. Contributing
This project is open source and welcomes contributions!
- Source Code: GitHub Repository
- Report Issues: Issue Tracker
- Suggest Improvements: Open a pull request or issue
- Use the Data: All data is freely available under open license