
🕷️ Web Scraping Pipeline

Intelligent, scalable web crawling with real-time monitoring

CI Python 3.12+ Scrapy License

Quick Start · Features · Architecture · Monitoring · Docs


🎯 What It Does

Multi-stage intelligent web crawler that discovers, analyzes, and summarizes web content at scale.

graph LR
    A[🌐 URLs] --> B[🕵️ Scout Spider]
    B --> C[📊 Analysis]
    C --> D[🤖 Summarization]
    D --> E[💾 Delta Lake]

    style A fill:#e1f5ff
    style B fill:#fff9e6
    style C fill:#ffe6f0
    style D fill:#e6f7ff
    style E fill:#f0ffe6

✨ Features

🚀 Intelligent Crawling

  • Smart URL prioritization
  • JS-heavy page detection
  • Adaptive rate limiting (settings sketch below)
  • Domain-aware routing
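
Adaptive rate limiting of this kind is what Scrapy's built-in AutoThrottle extension provides. A minimal settings sketch; the values are illustrative assumptions, not this project's actual configuration:

# settings.py — illustrative AutoThrottle sketch, not this repo's real settings
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.5         # initial download delay (seconds)
AUTOTHROTTLE_MAX_DELAY = 10.0          # ceiling when servers respond slowly
AUTOTHROTTLE_TARGET_CONCURRENCY = 8.0  # average parallel requests per remote server
CONCURRENT_REQUESTS_PER_DOMAIN = 16    # hard cap per domain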

📊 Real-Time Monitoring

  • Live Grafana dashboards
  • Prometheus metrics
  • Circuit breaker patterns
  • Health checks

💾 Robust Storage

  • Delta Lake for raw data
  • PostgreSQL for metrics
  • Redis for queues
  • Unified interface

🧪 Production Ready

  • 90%+ test coverage
  • Type-safe configuration
  • Docker + Kubernetes
  • Auto-scaling support

🚀 Quick Start

One Command Setup

python start.py

That's it! 🎉 The pipeline starts with:

  • ✅ All services running
  • ✅ Monitoring enabled
  • ✅ Sample URLs loaded

View Your Dashboard

Open http://localhost:3000 (login: admin / admin)

Service | URL | Purpose
📊 Grafana | localhost:3000 | Visual dashboards
🔥 Prometheus | localhost:9091 | Metrics database
🕷️ Spider Metrics | localhost:9410 | Spider stats
📮 Redis Metrics | localhost:9090 | Queue depth

🏗️ Architecture

Three-Tier Manager System

graph TB
    subgraph "🎛️ Configuration"
        CM[ConfigManager]
    end

    subgraph "💾 Storage Layer"
        SM[StorageManager]
        DL[Delta Lake]
        PG[PostgreSQL]
        RD[Redis]
        SM --> DL
        SM --> PG
        SM --> RD
    end

    subgraph "🔗 URL Processing"
        UP[URLProcessor]
        EX[Extractor]
        AS[Assessor]
        UP --> EX
        UP --> AS
    end

    subgraph "🕷️ Crawling Pipeline"
        S1[Scout Spider]
        S2[Deep Dive Spider]
        S3[JS Spider]
    end

    CM --> SM
    CM --> UP
    SM --> S1
    SM --> S2
    SM --> S3
    UP --> S1
    UP --> S2

    style CM fill:#667eea
    style SM fill:#f093fb
    style UP fill:#4facfe
    style S1 fill:#43e97b
    style S2 fill:#fa709a
    style S3 fill:#fee140

📁 Project Structure

📦 Scraping Pipeline
├── 🎛️  src/common/          # Core managers (Config, Storage, URL)
├── 🕷️  src/stage1/          # Discovery spiders (Scout, DeepDive, JS)
├── 📊 src/stage2/          # Page analysis workers
├── 🤖 src/stage3/          # Summarization workers
├── 📈 monitoring/          # Prometheus + Grafana configs
├── 🦀 kafka-delta-ingest/  # Rust ingestion service
├── ☸️  k8s/                # Kubernetes deployments
└── 🧪 tests/              # Comprehensive test suite

🕷️ Stage 1: Discovery

🔍 Scout Spider

Fast discovery

  • Aggressive crawling
  • Broad URL discovery
  • Queue population
  • 1000+ req/min

🎯 Deep Dive

Hidden URLs

  • Data attributes
  • JSON-LD extraction (sketch below)
  • API endpoints
  • Value assessment
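
JSON-LD extraction (flagged above) generally means parsing <script type="application/ld+json"> blocks out of the HTML. A minimal sketch using standard Scrapy selectors, shown only to illustrate the idea rather than the Deep Dive spider's actual code:

import json

def extract_json_ld(response):
    # Collect structured data embedded in JSON-LD script tags
    items = []
    for raw in response.xpath('//script[@type="application/ld+json"]/text()').getall():
        try:
            items.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # skip malformed blocks
    return items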

⚡ JS Spider

Dynamic content

  • Playwright rendering (sketch below)
  • SPA handling
  • Lazy loading
  • Network intercept
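
The Playwright rendering referenced above boils down to driving a headless browser and handing the final DOM to the parser. A standalone sketch of that idea, not the JS Spider's actual implementation:

from playwright.sync_api import sync_playwright

def render_page(url: str) -> str:
    # Render a JS-heavy page and return the fully loaded HTML
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # waits out lazy-loaded content
        html = page.content()
        browser.close()
    return html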

Usage

from src.common.config_manager import ConfigManager
from src.common.storage_manager import StorageManager
from src.common.url_processor import URLProcessor

# Single source of truth for configuration
config = ConfigManager.get_instance()

# Unified storage across Delta Lake, PostgreSQL, and Redis
storage = StorageManager.get_instance()
storage.delta.write_batch('table', records)  # records: items produced by a crawl batch

# Smart URL processing (response: a Scrapy response from a spider callback)
processor = URLProcessor('https://example.com', ['example.com'])
urls = processor.discover_and_assess(response, min_value_score=40)

📊 Monitoring

Live Dashboards

Dashboard | Metrics | Update Frequency
Spider Overview | URLs/min, success rate, queue depth | Real-time
Storage Health | Write throughput, table sizes, errors | 10s
System Resources | CPU, memory, disk I/O | 5s
Quality Metrics | Content scores, dedup rate, JS confidence | Real-time
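
The spider metrics endpoint on localhost:9410 is what Prometheus scrapes to feed these dashboards. A sketch of how such an exporter could look with the prometheus_client library; the metric names below are assumptions, not the project's actual ones:

from prometheus_client import Counter, Gauge, start_http_server

# Hypothetical metric names, for illustration only
URLS_CRAWLED = Counter('spider_urls_crawled_total', 'URLs fetched by the spider')
QUEUE_DEPTH = Gauge('spider_queue_depth', 'Pending URLs in the crawl queue')

start_http_server(9410)   # expose /metrics for Prometheus to scrape
URLS_CRAWLED.inc()        # call on each successful response
QUEUE_DEPTH.set(1250)     # refresh periodically from the queue backend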

Quick Commands

# View all services
docker-compose ps

# Follow spider logs
docker-compose logs -f scrapy-app

# Check system health
./scripts/diagnose_issues.sh

# Reset everything
python start.py --reset-delta

🛠️ Configuration

Three-Level Hierarchy

graph TD
    A[🌍 Environment Variables] --> B[📝 YAML Config]
    B --> C[⚙️ Code Defaults]

    style A fill:#48bb78,color:#fff
    style B fill:#4299e1,color:#fff
    style C fill:#9f7aea,color:#fff

Highest priority → Lowest priority

Example Configuration

# config.yml
redis:
  host: localhost
  port: 6379

stage1:
  batch_size: 50
  js_confidence_threshold: 0.7

stage2:
  max_workers: 100
  min_word_count: 50

Override with environment variables:

export REDIS_HOST=production-redis
export DB_PASSWORD=secret123

Access in code:

config = ConfigManager.get_instance()
redis_host = config.redis.host          # Type-safe!
batch_size = config.stage1.batch_size   # IDE autocomplete
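
The three-level hierarchy boils down to "environment variable, else YAML value, else code default". A minimal sketch of that precedence for a single key; this is illustrative only, not ConfigManager's actual implementation:

import os
import yaml  # PyYAML

def resolve_redis_host(yaml_path: str = 'config.yml') -> str:
    # 1. Environment variable wins (export REDIS_HOST=production-redis)
    env_value = os.environ.get('REDIS_HOST')
    if env_value:
        return env_value
    # 2. Then the YAML file (redis.host)
    with open(yaml_path) as f:
        cfg = yaml.safe_load(f) or {}
    yaml_value = cfg.get('redis', {}).get('host')
    if yaml_value:
        return yaml_value
    # 3. Finally the code default
    return 'localhost'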

📚 Full Configuration Guide →


💾 Storage

Unified Interface

storage = StorageManager.get_instance()

# Delta Lake - Raw data
storage.delta.write('stage1_discovery', records)
data = storage.delta.read('stage1_discovery')

# PostgreSQL - Metrics
storage.postgres.log_error('spider_name', error)
metrics = storage.postgres.get_performance_metrics()

# Redis - Queues
storage.redis.mark_url_seen('https://example.com')
storage.redis.enqueue('queue_name', item)

# Health checks
health = storage.health_check()
# {'delta': True, 'postgres': True, 'redis': True}

Auto-cleanup

# Context manager automatically closes connections
with StorageManager() as storage:
    storage.delta.write_batch('table', data)
    # Connections closed on exit
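
Combining the documented calls above, a write can also be gated on backend health; records stands in for a crawl batch in this illustrative snippet:

from src.common.storage_manager import StorageManager

with StorageManager() as storage:
    if storage.health_check().get('delta'):
        storage.delta.write_batch('stage1_discovery', records)
    # connections are still closed automatically on exit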

🔗 URL Processing

All-in-One

processor = URLProcessor('https://example.com', ['example.com'])

# Discover + assess in one call
urls = processor.discover_and_assess(
    response,
    min_value_score=40  # Filter low-value URLs
)

# Each URL includes:
# - value_score (0-100)
# - recommended_spider ('scout'/'depth'/'js')
# - reasons (why this score)

Smart Operations

Normalization

# Removes tracking, lowercases
url = processor.normalize_url(
    'https://Example.com?utm_source=test'
)
# → 'https://example.com'

Validation

# Filters unwanted URLs
should_follow = processor.should_follow_url(
    'https://example.com/login'
)
# → False

Deduplication

# Removes duplicates
unique = processor.deduplicate_urls([
    'url1', 'url2', 'url1'
])
# → ['url1', 'url2']

Prioritization

# Calculates crawl priority
priority = processor.calculate_priority(
    url, value_score=85, depth=2
)
# → 75 (0-100)
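
Putting the four operations above together in a single pass (illustrative; only the documented URLProcessor calls are used):

from src.common.url_processor import URLProcessor

processor = URLProcessor('https://example.com', ['example.com'])

raw_urls = [
    'https://Example.com/page?utm_source=feed',
    'https://example.com/page',
    'https://example.com/login',
]

cleaned = [processor.normalize_url(u) for u in raw_urls]             # strip tracking, lowercase
followable = [u for u in cleaned if processor.should_follow_url(u)]  # drop login pages, etc.
unique = processor.deduplicate_urls(followable)                      # remove duplicates
ranked = sorted(
    unique,
    key=lambda u: processor.calculate_priority(u, value_score=50, depth=1),
    reverse=True,
)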

☸️ Kubernetes Deployment

Production Ready

# Deploy full pipeline
python start.py --env k8s --stage pipeline

# Deploy individual stages
python start.py --env k8s --stage stage1

# Scaled deployment
python start.py --env k8s --stage all-stages \
  --release-prefix prod \
  --namespace-prefix scraping

Auto-Scaling

  • Horizontal pod autoscaling enabled
  • Resource limits enforced
  • Rolling updates supported
  • Health checks configured

📚 Kubernetes Guide →


🧪 Testing

Comprehensive Coverage

Component | Coverage | Tests
ConfigManager | 95%+ | 20+
StorageManager | 90%+ | 30+
URLProcessor | 95%+ | 40+
Spiders | 85%+ | 50+

Run Tests

# All tests
pytest

# Specific component
pytest tests/unit/common/test_config_manager.py -v

# With coverage
pytest --cov=src --cov-report=html

# Fast tests only
pytest -m "not slow"
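
For a feel of what these tests look like, here is a minimal illustrative unit test built on the documented URLProcessor behavior (not an actual file from tests/):

# illustrative sketch, not part of the repository's test suite
from src.common.url_processor import URLProcessor

def test_deduplicate_urls_keeps_first_occurrence():
    processor = URLProcessor('https://example.com', ['example.com'])
    assert processor.deduplicate_urls(['url1', 'url2', 'url1']) == ['url1', 'url2']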

🎓 Learning Resources

📖 Documentation

🎯 Examples


🔧 Development

Setup

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # or `.venv\Scripts\activate` on Windows

# Install dependencies
pip install -r requirements.txt
pip install -r dev-requirements.txt

# Run tests
pytest

# Code quality
ruff check .
mypy src/

Pre-commit Hooks

# Install hooks
pre-commit install

# Run manually
pre-commit run --all-files

Common Tasks

# Reseed data
python reseed.py

# Load new URLs
python cli.py load_seeds data/urls.csv

# Reset Delta tables
python start.py --reset-delta

# View logs
docker-compose logs -f scrapy-app

# Enter container
docker-compose exec scrapy-app bash

📂 Repository Layout & Ignore Policy

Single Authoritative .gitignore

This repository uses a single root-level .gitignore file for all ignore rules. All nested .gitignore files have been consolidated into /.gitignore for easier maintenance and consistency.

What's Ignored

The root .gitignore covers:

  • Python artifacts: bytecode, wheels, eggs, build outputs
  • Virtual environments: .venv/, venv/, ENV/, env/
  • IDE files: .idea/, .vscode/, *.iml, swap files
  • Data & logs: data/**, logs/**, *.log, *.db
  • Test artifacts: .pytest_cache/, .coverage, htmlcov/
  • Secrets: .env*, *.pem, *.key, credentials.json
  • Database files: *.sqlite, *.db-shm, *.db-wal
  • Delta Lake: data/delta_lake/, _delta_log/, checkpoints
  • Kafka/Streaming: kafka-logs/, zookeeper/
  • Docker overrides: docker-compose.override.yml
  • Monitoring data: prometheus-data/, grafana-data/
  • Temp files: tmp/, temp/, *.tmp, *.bak
  • macOS artifacts: .DS_Store, ._*
  • Rust/Cargo: target/, .cargo/, *.rs.bk

What's Tracked (Whitelisted)

Important project files are explicitly whitelisted:

  • package.json, package-lock.json (Node dependencies)
  • tsconfig.json (TypeScript config)
  • Cargo.toml, Cargo.lock (Rust dependencies)

View the complete ignore rules in .gitignore.


🐛 Troubleshooting

Quick Diagnostics

# System health check
./scripts/diagnose_issues.sh

# View all services
docker-compose ps

# Check specific service
docker-compose logs kafka-delta-ingestor

# Verify storage health
python -c "from src.common.storage_manager import StorageManager; \
           print(StorageManager.get_instance().health_check())"

Common Issues

🔴 Spiders not starting

Check seed URLs are loaded:

docker-compose exec scrapy-app python cli.py list_seeds

Reload if needed:

python start.py --reset-delta

🔴 Grafana dashboard empty

Reset Grafana:

./scripts/reset_grafana_complete.sh

Wait 30s, then reload dashboard.

🔴 High memory usage

Adjust batch sizes in config.yml:

stage1:
  batch_size: 25  # Reduce from 50

Restart services:

python shutdown.py && python start.py

📊 Performance

Benchmarks

Metric | Scout Spider | Deep Dive | JS Spider
Throughput | 1000+ URLs/min | 100+ URLs/min | 20+ URLs/min
Concurrent requests | 1024 | 32 | 20
Memory usage | ~2 GB | ~1 GB | ~4 GB
Discovery rate | 95%+ | 85%+ | 100%

Optimization Tips

  • 🎯 Use min_value_score to filter low-value URLs early
  • 🔄 Enable Redis queue for distributed crawling
  • 📊 Monitor queue depth to prevent backpressure (see the sketch below)
  • ⚡ Adjust batch_size based on available memory
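
For the queue-depth check suggested above, a redis-py one-liner is usually enough; the key name below is an assumption, not necessarily how this project names its queues:

import redis

r = redis.Redis(host='localhost', port=6379)
# 'stage1:queue' is a hypothetical key; inspect Redis (SCAN) to find the real one
depth = r.llen('stage1:queue')
print(f'queue depth: {depth}')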

🤝 Contributing

We welcome contributions! Here's how:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing)
  3. Add tests for new functionality
  4. Ensure all tests pass (pytest)
  5. Commit with clear messages
  6. Push to your fork
  7. Open a Pull Request

Code Standards

  • ✅ Type hints required
  • ✅ Tests required (90%+ coverage)
  • ✅ Documentation required
  • ✅ Ruff linting passes
  • ✅ Pre-commit hooks pass

📝 License

MIT License - see LICENSE for details.


🙏 Acknowledgments

Built with these amazing tools:

Tool | Purpose
🕷️ Scrapy | Web crawling framework
🦀 Delta Lake | Data lake storage
📊 Grafana | Visualization
🔥 Prometheus | Metrics collection
🎭 Playwright | Browser automation
🐘 PostgreSQL | Relational database
🔴 Redis | In-memory store

🚀 Start Crawling Now!

python start.py

Questions? Open an issue · Star ⭐ if you find this useful!

Made with ❤️ and lots of ☕
