Quick Start • Features • Architecture • Monitoring • Docs
Multi-stage intelligent web crawler that discovers, analyzes, and summarizes web content at scale.
```mermaid
graph LR
    A[🌐 URLs] --> B[🕵️ Scout Spider]
    B --> C[📊 Analysis]
    C --> D[🤖 Summarization]
    D --> E[💾 Delta Lake]
    style A fill:#e1f5ff
    style B fill:#fff9e6
    style C fill:#ffe6f0
    style D fill:#e6f7ff
    style E fill:#f0ffe6
```
```bash
python start.py
```

That's it! 🎉 The pipeline starts with:
- ✅ All services running
- ✅ Monitoring enabled
- ✅ Sample URLs loaded
Open http://localhost:3000 (login: admin / admin)
| Service | URL | Purpose |
|---|---|---|
| 📊 Grafana | localhost:3000 | Visual dashboards |
| 🔥 Prometheus | localhost:9091 | Metrics database |
| 🕷️ Spider Metrics | localhost:9410 | Spider stats |
| 📮 Redis Metrics | localhost:9090 | Queue depth |
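To verify everything is listening after startup, a quick poll of these endpoints is enough. This is a minimal sketch (not part of the repo), and it assumes the two exporters serve their stats under `/metrics`:

```python
# Minimal reachability check for the services listed above (illustrative, not project code).
import urllib.request

ENDPOINTS = {
    "Grafana": "http://localhost:3000/",
    "Prometheus": "http://localhost:9091/",
    "Spider metrics": "http://localhost:9410/metrics",  # assumed /metrics path
    "Redis metrics": "http://localhost:9090/metrics",   # assumed /metrics path
}

for name, url in ENDPOINTS.items():
    try:
        # Any HTTP response (even a redirect to the Grafana login page) means the port is live.
        status = urllib.request.urlopen(url, timeout=3).status
        print(f"{name}: up (HTTP {status})")
    except Exception as exc:  # broad catch is fine for a sketch
        print(f"{name}: down ({exc})")
```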
```mermaid
graph TB
subgraph "🎛️ Configuration"
CM[ConfigManager]
end
subgraph "💾 Storage Layer"
SM[StorageManager]
DL[Delta Lake]
PG[PostgreSQL]
RD[Redis]
SM --> DL
SM --> PG
SM --> RD
end
subgraph "🔗 URL Processing"
UP[URLProcessor]
EX[Extractor]
AS[Assessor]
UP --> EX
UP --> AS
end
subgraph "🕷️ Crawling Pipeline"
S1[Scout Spider]
S2[Deep Dive Spider]
S3[JS Spider]
end
CM --> SM
CM --> UP
SM --> S1
SM --> S2
SM --> S3
UP --> S1
UP --> S2
style CM fill:#667eea
style SM fill:#f093fb
style UP fill:#4facfe
style S1 fill:#43e97b
style S2 fill:#fa709a
style S3 fill:#fee140
```
```
📦 Scraping Pipeline
├── 🎛️ src/common/ # Core managers (Config, Storage, URL)
├── 🕷️ src/stage1/ # Discovery spiders (Scout, DeepDive, JS)
├── 📊 src/stage2/ # Page analysis workers
├── 🤖 src/stage3/ # Summarization workers
├── 📈 monitoring/ # Prometheus + Grafana configs
├── 🦀 kafka-delta-ingest/ # Rust ingestion service
├── ☸️ k8s/ # Kubernetes deployments
└── 🧪 tests/ # Comprehensive test suite
```
| Spider | Purpose |
|---|---|
| Scout Spider | Fast discovery |
| Deep Dive Spider | Hidden URLs |
| JS Spider | Dynamic content |
```python
from src.common.config_manager import ConfigManager
from src.common.storage_manager import StorageManager
from src.common.url_processor import URLProcessor
# Single source of truth
config = ConfigManager.get_instance()
# Unified storage
storage = StorageManager.get_instance()
storage.delta.write_batch('table', records)
# Smart URL processing
processor = URLProcessor(base_url, domains)
urls = processor.discover_and_assess(response, min_value_score=40)
```

| Dashboard | Metrics | Update Frequency |
|---|---|---|
| Spider Overview | URLs/min, Success rate, Queue depth | Real-time |
| Storage Health | Write throughput, Table sizes, Errors | 10s |
| System Resources | CPU, Memory, Disk I/O | 5s |
| Quality Metrics | Content scores, Dedup rate, JS confidence | Real-time |
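These dashboards are fed by the metrics endpoints listed earlier. The exact metric names live in the spider code; as a hedged illustration of how such an endpoint is typically exported with `prometheus_client` (the names and port below are assumptions):

```python
# Illustrative metric export; metric names are made up, not the project's actual ones.
from prometheus_client import Counter, Gauge, start_http_server

urls_crawled = Counter("spider_urls_crawled_total", "URLs fetched by the spider")
queue_depth = Gauge("spider_queue_depth", "Items waiting in the Redis queue")

start_http_server(9410)   # serve /metrics for Prometheus to scrape
urls_crawled.inc()        # call on every successful response
queue_depth.set(1250)     # refresh periodically from the Redis queue length
```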
```bash
# View all services
docker-compose ps
# Follow spider logs
docker-compose logs -f scrapy-app
# Check system health
./scripts/diagnose_issues.sh
# Reset everything
python start.py --reset-delta
```

```mermaid
graph TD
A[🌍 Environment Variables] --> B[📝 YAML Config]
B --> C[⚙️ Code Defaults]
style A fill:#48bb78,color:#fff
style B fill:#4299e1,color:#fff
style C fill:#9f7aea,color:#fff
```

Highest Priority ➜ Lowest Priority
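A sketch of what that resolution order means in practice (illustrative only; the real logic lives in `ConfigManager`, and the `SECTION_KEY` env-var naming is an assumption):

```python
# Illustrative precedence: environment variable > YAML value > code default.
import os

def resolve(section: str, key: str, yaml_cfg: dict, default):
    env_name = f"{section}_{key}".upper()     # e.g. REDIS_HOST (assumed convention)
    if env_name in os.environ:                # 1. environment variable wins
        return os.environ[env_name]
    if key in yaml_cfg.get(section, {}):      # 2. then the YAML file
        return yaml_cfg[section][key]
    return default                            # 3. finally the hard-coded default

yaml_cfg = {"redis": {"host": "localhost", "port": 6379}}
print(resolve("redis", "host", yaml_cfg, default="127.0.0.1"))  # -> 'localhost'
```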
```yaml
# config.yml
redis:
  host: localhost
  port: 6379

stage1:
  batch_size: 50
  js_confidence_threshold: 0.7

stage2:
  max_workers: 100
  min_word_count: 50
```

Override with environment variables:
```bash
export REDIS_HOST=production-redis
export DB_PASSWORD=secret123
```

Access in code:
```python
config = ConfigManager.get_instance()
redis_host = config.redis.host         # Type-safe!
batch_size = config.stage1.batch_size  # IDE autocomplete
```

```python
storage = StorageManager.get_instance()
# Delta Lake - Raw data
storage.delta.write('stage1_discovery', records)
data = storage.delta.read('stage1_discovery')
# PostgreSQL - Metrics
storage.postgres.log_error('spider_name', error)
metrics = storage.postgres.get_performance_metrics()
# Redis - Queues
storage.redis.mark_url_seen('https://example.com')
storage.redis.enqueue('queue_name', item)
# Health checks
health = storage.health_check()
# {'delta': True, 'postgres': True, 'redis': True}
```

```python
# Context manager automatically closes connections
with StorageManager() as storage:
    storage.delta.write_batch('table', data)
# Connections closed on exit
```

```python
processor = URLProcessor('https://example.com', ['example.com'])
# Discover + assess in one call
urls = processor.discover_and_assess(
    response,
    min_value_score=40  # Filter low-value URLs
)
# Each URL includes:
# - value_score (0-100)
# - recommended_spider ('scout'/'depth'/'js')
# - reasons (why this score)
```
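Those fields make dispatch straightforward. A hedged sketch of routing assessed URLs to per-spider Redis queues (the queue names and dict keys below are assumptions, not the project's actual ones):

```python
# Hypothetical fan-out based on the assessor's recommendation.
QUEUES = {"scout": "scout_queue", "depth": "depth_queue", "js": "js_queue"}

for item in urls:
    spider = item["recommended_spider"]          # 'scout' / 'depth' / 'js'
    storage.redis.enqueue(QUEUES[spider], item)  # queue names are made up here
```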
Normalization

```python
# Removes tracking, lowercases
url = processor.normalize_url(
    'https://Example.com?utm_source=test'
)
# → 'https://example.com'
```

Validation

```python
# Filters unwanted URLs
should_follow = processor.should_follow_url(
    'https://example.com/login'
)
# → False
```

Deduplication

```python
# Removes duplicates
unique = processor.deduplicate_urls([
    'url1', 'url2', 'url1'
])
# → ['url1', 'url2']
```

Prioritization

```python
# Calculates crawl priority
priority = processor.calculate_priority(
    url, value_score=85, depth=2
)
# → 75 (0-100)
```
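The real scoring lives in `URLProcessor`; one plausible formula consistent with the example above (a flat 5-point penalty per depth level, clamped to 0-100) would be:

```python
# Illustrative only - reproduces 85 at depth 2 -> 75, but is not the project's actual formula.
def calculate_priority(value_score: int, depth: int, depth_penalty: int = 5) -> int:
    return max(0, min(100, value_score - depth * depth_penalty))

print(calculate_priority(value_score=85, depth=2))  # 75
```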
```bash
# Deploy full pipeline
python start.py --env k8s --stage pipeline
# Deploy individual stages
python start.py --env k8s --stage stage1
# Scaled deployment
python start.py --env k8s --stage all-stages \
  --release-prefix prod \
  --namespace-prefix scraping
```

- Horizontal pod autoscaling enabled
- Resource limits enforced
- Rolling updates supported
- Health checks configured
| Component | Coverage | Tests |
|---|---|---|
| ConfigManager | 95%+ | 20+ |
| StorageManager | 90%+ | 30+ |
| URLProcessor | 95%+ | 40+ |
| Spiders | 85%+ | 50+ |
```bash
# All tests
pytest
# Specific component
pytest tests/unit/common/test_config_manager.py -v
# With coverage
pytest --cov=src --cov-report=html
# Fast tests only
pytest -m "not slow"
```
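The `-m "not slow"` filter relies on long-running tests carrying a `slow` marker. If you add such a test, tag it like this (assuming the marker is registered in the pytest config):

```python
# Example of a test excluded by `pytest -m "not slow"`.
import pytest

@pytest.mark.slow
def test_full_pipeline_end_to_end():
    ...  # long-running integration path
```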
```bash
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # or `.venv\Scripts\activate` on Windows
# Install dependencies
pip install -r requirements.txt
pip install -r dev-requirements.txt
# Run tests
pytest
# Code quality
ruff check .
mypy src/
```

```bash
# Install hooks
pre-commit install
# Run manually
pre-commit run --all-files
```

```bash
# Reseed data
python reseed.py
# Load new URLs
python cli.py load_seeds data/urls.csv
# Reset Delta tables
python start.py --reset-delta
# View logs
docker-compose logs -f scrapy-app
# Enter container
docker-compose exec scrapy-app bash
```

This repository uses a single root-level `.gitignore` file for all ignore rules. All nested `.gitignore` files have been consolidated into `/.gitignore` for easier maintenance and consistency.
The root `.gitignore` covers:

- Python artifacts: bytecode, wheels, eggs, build outputs
- Virtual environments: `.venv/`, `venv/`, `ENV/`, `env/`
- IDE files: `.idea/`, `.vscode/`, `*.iml`, swap files
- Data & logs: `data/**`, `logs/**`, `*.log`, `*.db`
- Test artifacts: `.pytest_cache/`, `.coverage`, `htmlcov/`
- Secrets: `.env*`, `*.pem`, `*.key`, `credentials.json`
- Database files: `*.sqlite`, `*.db-shm`, `*.db-wal`
- Delta Lake: `data/delta_lake/`, `_delta_log/`, checkpoints
- Kafka/Streaming: `kafka-logs/`, `zookeeper/`
- Docker overrides: `docker-compose.override.yml`
- Monitoring data: `prometheus-data/`, `grafana-data/`
- Temp files: `tmp/`, `temp/`, `*.tmp`, `*.bak`
- macOS artifacts: `.DS_Store`, `._*`
- Rust/Cargo: `target/`, `.cargo/`, `*.rs.bk`
Important project files are explicitly whitelisted:
- `package.json`, `package-lock.json` (Node dependencies)
- `tsconfig.json` (TypeScript config)
- `Cargo.toml`, `Cargo.lock` (Rust dependencies)
View the complete ignore rules in `.gitignore`.
```bash
# System health check
./scripts/diagnose_issues.sh
# View all services
docker-compose ps
# Check specific service
docker-compose logs kafka-delta-ingestor
# Verify storage health
python -c "from src.common.storage_manager import StorageManager; \
print(StorageManager.get_instance().health_check())"🔴 Spiders not starting
Check seed URLs are loaded:
```bash
docker-compose exec scrapy-app python cli.py list_seeds
```

Reload if needed:

```bash
python start.py --reset-delta
```

🔴 Grafana dashboard empty
Reset Grafana:
```bash
./scripts/reset_grafana_complete.sh
```

Wait 30s, then reload the dashboard.
🔴 High memory usage
Adjust batch sizes in `config.yml`:

```yaml
stage1:
  batch_size: 25  # Reduce from 50
```

Restart services:

```bash
python shutdown.py && python start.py
```

| Metric | Scout Spider | Deep Dive | JS Spider |
|---|---|---|---|
| Throughput | 1000+ URLs/min | 100+ URLs/min | 20+ URLs/min |
| Concurrent Requests | 1024 | 32 | 20 |
| Memory Usage | ~2GB | ~1GB | ~4GB |
| Discovery Rate | 95%+ | 85%+ | 100% |
- 🎯 Use `min_value_score` to filter low-value URLs early
- 🔄 Enable Redis queue for distributed crawling
- 📊 Monitor queue depth to prevent backpressure
- ⚡ Adjust `batch_size` based on available memory
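Per-spider concurrency (the Concurrent Requests row in the table above) is ultimately a Scrapy setting. A hedged sketch of how those limits could be expressed with `custom_settings`; the actual values and class names live in `src/stage1/`, so treat this as an assumption:

```python
# Illustrative Scrapy overrides mirroring the table above (not the repo's actual spider code).
import scrapy

class ScoutSpider(scrapy.Spider):
    name = "scout"
    custom_settings = {
        "CONCURRENT_REQUESTS": 1024,  # breadth-first discovery
        "DOWNLOAD_TIMEOUT": 15,
    }

class DeepDiveSpider(scrapy.Spider):
    name = "deep_dive"
    custom_settings = {
        "CONCURRENT_REQUESTS": 32,    # slower, link-chasing crawl
    }
```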
We welcome contributions! Here's how:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing`)
- Add tests for new functionality
- Ensure all tests pass (`pytest`)
- Commit with clear messages
- Push to your fork
- Open a Pull Request
- ✅ Type hints required
- ✅ Tests required (90%+ coverage)
- ✅ Documentation required
- ✅ Ruff linting passes
- ✅ Pre-commit hooks pass
MIT License - see LICENSE for details.
Built with these amazing tools:
| Tool | Purpose |
|---|---|
| 🕷️ Scrapy | Web crawling framework |
| 🦀 Delta Lake | Data lake storage |
| 📊 Grafana | Visualization |
| 🔥 Prometheus | Metrics collection |
| 🎭 Playwright | Browser automation |
| 🐘 PostgreSQL | Relational database |
| 🔴 Redis | In-memory store |
```bash
python start.py
```

Questions? Open an issue • Star ⭐ if you find this useful!
Made with ❤️ and lots of ☕