# Crawl API: Headless Browser REST API

A full-stack Rust recreation of [crawlapi.dev](https://crawlapi.dev).

## Stack

- **Backend**: Axum (Rust)
- **Database**: PostgreSQL + sqlx
- **Queue**: Redis (jobs + caching + rate limiting)
- **Browser Automation**: Playwright (Node.js) with Browser Pool
- **File Storage**: MinIO (S3-compatible)
- **Frontend**: Next.js 14
- **Observability**: Prometheus + Grafana + Sentry
- **Security**: Rate limiting + IP blocking + input validation
- **AI**: OpenAI GPT-4o-mini extraction
- **Auth**: Email/password + Google OAuth
- **Billing**: Stripe Checkout + Webhooks
- **CI/CD**: GitHub Actions (test, build, deploy)
- **Infra**: Docker Compose + Kubernetes + HPA + cert-manager
- **Load Testing**: k6 (smoke, load, stress, screenshot)

## Structure

```
crawlapi/
├── crates/
│   ├── api/            # REST server (Axum) + seed script
│   ├── worker/         # Distributed worker with Redis queue
│   ├── shared/         # Shared types and config
│   └── db/             # Database layer + migrations
├── playwright/         # Node.js script with Browser Pool + Stealth + CAPTCHA
├── frontend/           # Landing + Playground + Billing + Dashboard + Docs
├── e2e/                # E2E tests with Playwright
├── load-tests/         # k6 load testing scripts
├── k8s/                # Kubernetes manifests + cert-manager
├── legal/              # Terms, Privacy, DPA
├── .github/            # GitHub Actions workflows
├── docker-compose.yml
└── prometheus.yml
```

## Endpoints

### Crawl/Scrape/AI

| Endpoint | Description |
|----------|-------------|
| `POST /api/crawl` | Full JS-rendered page crawl |
| `POST /api/content` | Raw HTML |
| `POST /api/screenshot` | PNG screenshot (uploaded to S3) |
| `POST /api/pdf` | PDF export (uploaded to S3) |
| `POST /api/markdown` | Markdown extraction |
| `POST /api/snapshot` | HTML + screenshot combined |
| `POST /api/scrape` | CSS selector extraction |
| `POST /api/json` | Structured JSON |
| `POST /api/links` | Extract all links |
| `POST /api/extract` | AI-powered extraction with OpenAI |

### Auth

| Endpoint | Description |
|----------|-------------|
| `POST /api/auth/register` | Create an account |
| `POST /api/auth/login` | Log in (returns a JWT) |
| `GET /api/auth/google` | Google OAuth URL |
| `GET /api/auth/google/callback` | OAuth callback (performs the real token exchange) |
| `POST /api/auth/api-keys` | Create an API key (requires JWT) |
| `GET /api/auth/api-keys` | List API keys (requires JWT) |
| `DELETE /api/auth/api-keys/{id}` | Delete an API key (requires JWT) |

### Billing

| Endpoint | Description |
|----------|-------------|
| `POST /api/stripe/checkout` | Create a working Checkout session |
| `POST /api/stripe/webhook` | Stripe webhook (processes real events) |

### Teams

| Endpoint | Description |
|----------|-------------|
| `POST /api/teams` | Create a team |
| `GET /api/teams/{slug}` | View a team and its members |
| `POST /api/teams/{slug}/members` | Add a member |

### Observability

| Endpoint | Description |
|----------|-------------|
| `GET /metrics` | Prometheus metrics |
| `GET /ws/logs` | WebSocket live logs |

## Quick Start (Docker)

```bash
# 1. Start the whole stack
cd crawlapi
docker-compose up --build

# 2. Create seed data (in another terminal)
export DATABASE_URL="postgres://crawlapi:crawlapi@localhost:5432/crawlapi"
source "$HOME/.cargo/env"
cargo run -p api --bin seed

# 3. Available services:
#    API:        http://localhost:3000
#    Frontend:   http://localhost
#    MinIO:      http://localhost:9001 (minioadmin/minioadmin)
#    Prometheus: http://localhost:9090
#    Grafana:    http://localhost:3001 (admin/admin)
```

## CI/CD (GitHub Actions)

```bash
# On every push to main:
# 1. cargo fmt --check
# 2. cargo clippy -- -D warnings
# 3. cargo test --workspace
# 4. cargo audit
# 5. Docker build + push to registry
# 6. Deploy to staging
# 7. Smoke tests
# 8. Deploy to production (only on v* tags)
```

**Workflows:**

- `.github/workflows/ci.yml`: Test, build, push images
- `.github/workflows/deploy.yml`: Deploy to staging and production

## Kubernetes

```bash
# Install cert-manager first
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.yaml

# Deploy everything
cd k8s
kubectl apply -f namespace.yaml
kubectl apply -f cert-manager.yaml
kubectl apply -f secrets.yaml
kubectl apply -f postgres.yaml
kubectl apply -f redis.yaml
kubectl apply -f minio.yaml
kubectl apply -f api.yaml
kubectl apply -f worker.yaml
kubectl apply -f frontend.yaml

# Workers auto-scale with HPA (3-20 replicas)
kubectl get hpa -n crawlapi
```

## Load Testing (k6)

```bash
cd load-tests

# Smoke test (1 VU, 1 min)
k6 run smoke.js

# Load test (ramp up to 20 VUs, 14 min)
k6 run load.js

# Stress test (up to 200 VUs)
k6 run stress.js

# Screenshot test (5 concurrent VUs)
k6 run screenshot.js
```

## Tests

```bash
# Unit tests
cargo test

# E2E tests
cd e2e && npm install && npx playwright test
```

## Implemented Features

### Core

- ✅ **10 REST endpoints** (9 crawl + 1 AI)
- ✅ **Browser Pool**: 5 Chromium browsers, 10 pages each
- ✅ **Session/Cookie Persistence**: Stores cookies per `session_id`
- ✅ **Mobile Emulation**: iPhone 14 viewport
- ✅ **Infinite Scroll**: Auto-scrolls to the bottom of the page
- ✅ **Custom Headers**: Arbitrary headers per request

### Workers

- ✅ **Distributed Queue**: Redis LPUSH/BLPOP
- ✅ **Retry with Backoff**: 3 retries with exponential backoff (2s, 4s, 8s)
- ✅ **Dead Letter Queue**: Failed jobs kept for 24h
- ✅ **Caching**: Results cached in Redis with a 5 min TTL

### Advanced Scraping

- ✅ **Stealth Mode**: Evades bot detection (webdriver, plugins, canvas)
- ✅ **Proxy Rotation**: Multiple proxies via `PROXY_URL`
- ✅ **CAPTCHA Solving**: Integration with CapSolver/2captcha

### Auth & Billing

- ✅ **Email/Password**: Bcrypt + JWT
- ✅ **Google OAuth**: Real code → token → user info exchange
- ✅ **Stripe**: Working checkout + real webhooks
- ✅ **Teams**: Owner/member roles

### Observability

- ✅ **Prometheus**: `/metrics` with counters and histograms
- ✅ **Grafana**: Dashboard included
- ✅ **Sentry**: Error tracking in API and Worker
- ✅ **Structured Logging**: JSON logs with correlation IDs
- ✅ **WebSocket Logs**: `/ws/logs`

### Security

- ✅ **Input Validation**: URLs, webhooks, sizes (SSRF protection)
- ✅ **Rate Limiting**: Per API key (60/min) + per IP (100/min)
- ✅ **IP Blocking**: Automatic block for 1 hour

### Infrastructure

- ✅ **Docker Compose**: Everything in one command
- ✅ **Kubernetes**: Full manifests with TLS ingress + cert-manager
- ✅ **HPA**: Auto-scaling 3-20 workers
- ✅ **Health Checks**: Liveness, readiness, startup probes
- ✅ **SSL/TLS**: Automatic Let's Encrypt certificates via cert-manager

### Secrets Management

- ✅ **Multi-provider**: Env vars → Vault → AWS Secrets Manager
- ✅ **Fallback chain**: Tries each provider in order

### Frontend

- ✅ **Landing Page**
- ✅ **API Documentation**
- ✅ **Interactive Playground**: Try endpoints with code snippets
- ✅ **Billing Page**: Plans + usage bar
- ✅ **Dashboard**: Login, API keys, tester

### CI/CD

- ✅ **GitHub Actions**: CI with test, clippy, audit
- ✅ **Docker Build & Push**: Multi-stage builds
- ✅ **Deploy Staging**: Auto-deploy on push to main
- ✅ **Deploy Production**: Only on v* tags
- ✅ **Smoke Tests**: Post-deploy verification

### Legal

- ✅ **Terms of Service**
- ✅ **Privacy Policy**
- ✅ **Data Processing Agreement**

### Load Testing

- ✅ **k6 Smoke Test**: 1 VU
- ✅ **k6 Load Test**: Ramp up to 20 VUs
- ✅ **k6 Stress Test**: Up to 200 VUs
- ✅ **k6 Screenshot Test**: 5 concurrent VUs

## Environment Variables

```bash
# Core
DATABASE_URL="postgres://..."
REDIS_URL="redis://..."
JWT_SECRET="..."

# Storage
S3_ENDPOINT="..."
S3_BUCKET="..."
S3_ACCESS_KEY="..."
S3_SECRET_KEY="..."

# Auth
GOOGLE_CLIENT_ID="..."
GOOGLE_CLIENT_SECRET="..."

# Billing
STRIPE_SECRET_KEY="..."
STRIPE_WEBHOOK_SECRET="..."

# AI
OPENAI_API_KEY="..."

# Scraping (comma-separated list of proxies)
PROXY_URL="http://proxy1:8080,http://proxy2:8080"
CAPTCHA_API_KEY="..."

# Error Tracking
SENTRY_DSN="https://..."

# Logging
JSON_LOGGING="true"  # Enable structured JSON logs

# Secrets Management
VAULT_ADDR="https://vault.example.com"
VAULT_TOKEN="..."

# Browser Pool
BROWSER_POOL_SIZE=5
MAX_PAGES_PER_BROWSER=10
```

## API Usage

### AI Extraction

```bash
curl -X POST http://localhost:3000/api/extract \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "url": "https://example.com/products",
    "schema": {"products": [{"name": "string", "price": "number"}]}
  }'
```

### Screenshot with stealth + proxy + CAPTCHA

```bash
curl -X POST http://localhost:3000/api/screenshot \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "url": "https://protected-site.com",
    "options": {
      "stealth": true,
      "use_proxy": true,
      "solve_captcha": true,
      "session_id": "user_123"
    }
  }'
```

### Mobile emulation

```bash
# Content-Type is set explicitly; without it curl sends form-encoded data
curl -X POST http://localhost:3000/api/screenshot \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{"url": "https://example.com", "options": {"mobile": true}}'
```

## License

MIT
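
## Appendix: Retry Schedule Sketch

The worker's retry policy listed under Implemented Features (3 retries, waiting 2s, 4s, 8s) can be illustrated with a small client-side shell helper. This is only a sketch of the same schedule, not the worker's actual code: `MAX_RETRIES`, `SLEEP_CMD`, `backoff_delay`, and `retry_with_backoff` are hypothetical names introduced here for illustration.

```bash
# Sketch only: reproduces the documented schedule (3 retries: 2s, 4s, 8s).
# All names below are hypothetical and not taken from the crawlapi source.

MAX_RETRIES=3
SLEEP_CMD="${SLEEP_CMD:-sleep}"   # set SLEEP_CMD=: to skip real waits in tests

# Delay in seconds before retry number $1 (1-based): 2, 4, 8, ...
backoff_delay() {
  echo $(( 1 << $1 ))
}

# Run a command; on failure, retry up to MAX_RETRIES times with backoff.
retry_with_backoff() {
  local attempt=0
  until "$@"; do
    attempt=$(( attempt + 1 ))
    if [ "$attempt" -gt "$MAX_RETRIES" ]; then
      echo "giving up after $MAX_RETRIES retries" >&2
      return 1
    fi
    "$SLEEP_CMD" "$(backoff_delay "$attempt")"
  done
}

# Example usage (not executed here), reusing the mobile emulation request:
#   retry_with_backoff curl --fail -X POST http://localhost:3000/api/screenshot \
#     -H "Content-Type: application/json" -H "x-api-key: YOUR_API_KEY" \
#     -d '{"url": "https://example.com", "options": {"mobile": true}}'
```

With `curl --fail`, HTTP error responses make curl exit with a non-zero status, so the helper retries them. A command that keeps failing is attempted four times in total (the initial call plus three retries) before the helper gives up.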