Developer 62994d4f3d
Initial commit: Full Crawl API implementation
2026-04-29 07:03:48 +00:00

Crawl API — Headless Browser REST API

A full-stack Rust recreation of crawlapi.dev.

Stack

  • Backend: Axum (Rust)
  • Database: PostgreSQL + sqlx
  • Queue: Redis (jobs + caching + rate limiting)
  • Browser Automation: Playwright (Node.js) with Browser Pool
  • File Storage: MinIO (S3-compatible)
  • Frontend: Next.js 14
  • Observability: Prometheus + Grafana + Sentry
  • Security: Rate limiting + IP blocking + input validation
  • AI: OpenAI GPT-4o-mini extraction
  • Auth: Email/password + Google OAuth
  • Billing: Stripe Checkout + Webhooks
  • CI/CD: GitHub Actions (test, build, deploy)
  • Infra: Docker Compose + Kubernetes + HPA + cert-manager
  • Load Testing: k6 (smoke, load, stress, screenshot)

Structure

crawlapi/
├── crates/
│   ├── api/          # REST server (Axum) + seed script
│   ├── worker/       # Distributed worker backed by a Redis queue
│   ├── shared/       # Shared types and config
│   └── db/           # Database layer + migrations
├── playwright/       # Node.js script with Browser Pool + Stealth + CAPTCHA
├── frontend/         # Landing + Playground + Billing + Dashboard + Docs
├── e2e/              # E2E tests with Playwright
├── load-tests/       # k6 load testing scripts
├── k8s/              # Kubernetes manifests + cert-manager
├── legal/            # Terms, Privacy, DPA
├── .github/          # GitHub Actions workflows
├── docker-compose.yml
└── prometheus.yml

Endpoints

Crawl/Scrape/AI

Endpoint               Description
POST /api/crawl        Full JS-rendered page crawl
POST /api/content      Raw HTML
POST /api/screenshot   PNG screenshot (uploaded to S3)
POST /api/pdf          PDF export (uploaded to S3)
POST /api/markdown     Markdown extraction
POST /api/snapshot     HTML + screenshot combined
POST /api/scrape       CSS selector extraction
POST /api/json         Structured JSON
POST /api/links        Extract all links
POST /api/extract      AI-powered extraction with OpenAI

Auth

Endpoint                         Description
POST /api/auth/register          Create an account
POST /api/auth/login             Log in (returns a JWT)
GET /api/auth/google             Google OAuth URL
GET /api/auth/google/callback    OAuth callback (performs the real token exchange)
POST /api/auth/api-keys          Create an API key (requires JWT)
GET /api/auth/api-keys           List API keys (requires JWT)
DELETE /api/auth/api-keys/{id}   Delete an API key (requires JWT)

Billing

Endpoint                    Description
POST /api/stripe/checkout   Create a working Checkout session
POST /api/stripe/webhook    Stripe webhook (processes real events)

Teams

Endpoint                        Description
POST /api/teams                 Create a team
GET /api/teams/{slug}           View a team and its members
POST /api/teams/{slug}/members  Add a member

Observability

Endpoint       Description
GET /metrics   Prometheus metrics
GET /ws/logs   WebSocket live logs

Quick Start (Docker)

# 1. Start the whole stack
cd crawlapi
docker-compose up --build

# 2. Create seed data (in another terminal)
export DATABASE_URL="postgres://crawlapi:crawlapi@localhost:5432/crawlapi"
source "$HOME/.cargo/env"
cargo run -p api --bin seed

# 3. Available services:
# API:        http://localhost:3000
# Frontend:   http://localhost
# MinIO:      http://localhost:9001 (minioadmin/minioadmin)
# Prometheus: http://localhost:9090
# Grafana:    http://localhost:3001 (admin/admin)

CI/CD (GitHub Actions)

# On every push to main:
# 1. cargo fmt --check
# 2. cargo clippy -- -D warnings
# 3. cargo test --workspace
# 4. cargo audit
# 5. Docker build + push to the registry
# 6. Deploy to staging
# 7. Smoke tests
# 8. Deploy to production (only on v* tags)

Workflows:

  • .github/workflows/ci.yml: Test, build, push images
  • .github/workflows/deploy.yml: Deploy to staging and production

Kubernetes

# Install cert-manager first
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.yaml

# Deploy everything
cd k8s
kubectl apply -f namespace.yaml
kubectl apply -f cert-manager.yaml
kubectl apply -f secrets.yaml
kubectl apply -f postgres.yaml
kubectl apply -f redis.yaml
kubectl apply -f minio.yaml
kubectl apply -f api.yaml
kubectl apply -f worker.yaml
kubectl apply -f frontend.yaml

# Workers auto-scale via HPA (3-20 replicas)
kubectl get hpa -n crawlapi

Load Testing (k6)

cd load-tests

# Smoke test (1 VU, 1 min)
k6 run smoke.js

# Load test (ramp up to 20 VUs, 14 min)
k6 run load.js

# Stress test (up to 200 VUs)
k6 run stress.js

# Screenshot test (5 concurrent VUs)
k6 run screenshot.js

Tests

# Unit tests
cargo test

# E2E tests
cd e2e && npm install && npx playwright test

Implemented features

Core

  • 10 REST endpoints (9 crawl + 1 AI)
  • Browser Pool: 5 Chromium browsers, 10 pages each
  • Session/Cookie Persistence: stores cookies per session_id
  • Mobile Emulation: iPhone 14 viewport
  • Infinite Scroll: auto-scrolls to the bottom of the page
  • Custom Headers: arbitrary headers per request
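
The Browser Pool figures above imply a ceiling of 5 × 10 = 50 pages open at once. The actual pool lives in the Node.js Playwright script and is not shown in this README, so the following is only a sketch, written in Rust for illustration, of how least-loaded page allocation under those limits might look. The `BrowserPool` type and its methods are hypothetical names.

```rust
/// Tracks how many pages are open on each browser in the pool.
/// Hypothetical type: the real pool is implemented in the Playwright script.
struct BrowserPool {
    open_pages: Vec<usize>,
    max_pages_per_browser: usize,
}

impl BrowserPool {
    fn new(pool_size: usize, max_pages_per_browser: usize) -> Self {
        Self { open_pages: vec![0; pool_size], max_pages_per_browser }
    }

    /// Upper bound on pages open at the same time across the pool.
    fn capacity(&self) -> usize {
        self.open_pages.len() * self.max_pages_per_browser
    }

    /// Reserves a page on the least-loaded browser and returns its index,
    /// or `None` when every browser is at its page limit.
    fn acquire(&mut self) -> Option<usize> {
        let max = self.max_pages_per_browser;
        let idx = self
            .open_pages
            .iter()
            .enumerate()
            .filter(|&(_, &n)| n < max)
            .min_by_key(|&(_, &n)| n)
            .map(|(i, _)| i)?;
        self.open_pages[idx] += 1;
        Some(idx)
    }

    /// Releases a page previously reserved on browser `idx`.
    fn release(&mut self, idx: usize) {
        if let Some(n) = self.open_pages.get_mut(idx) {
            *n = n.saturating_sub(1);
        }
    }
}
```

With `BrowserPool::new(5, 10)`, `capacity()` is 50 and the 51st `acquire()` returns `None` until some page is released. Picking the least-loaded browser is one possible policy; the project may use a different one.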

Workers

  • Distributed Queue: Redis LPUSH/BLPOP
  • Retry with Backoff: 3 retries with exponential backoff (2s, 4s, 8s)
  • Dead Letter Queue: failed jobs kept for 24h
  • Caching: results stored in Redis with a 5 min TTL
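
The retry schedule listed above (2s, 4s, 8s) corresponds to a delay of 2^attempt seconds for attempts 1 to 3. A minimal sketch of that calculation follows; the function and constant names are assumptions, not taken from the worker crate.

```rust
use std::time::Duration;

/// Retry budget before a job would be handed to the dead letter queue.
/// Hypothetical constant mirroring the "3 retries" stated in the README.
const MAX_RETRIES: u32 = 3;

/// Delay before retry number `attempt` (1-based): 2s, 4s, 8s.
/// Returns `None` once the retry budget is exhausted.
fn backoff_delay(attempt: u32) -> Option<Duration> {
    if attempt == 0 || attempt > MAX_RETRIES {
        return None;
    }
    // 1 << attempt is 2^attempt, giving 2, 4 and 8 seconds.
    Some(Duration::from_secs(1u64 << attempt))
}
```

A worker loop could sleep for `backoff_delay(n)` before the n-th retry and treat `None` as the signal to move the job to the dead letter queue.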

Scraping Avanzado

  • Stealth Mode: evades bot detection (webdriver, plugins, canvas)
  • Proxy Rotation: multiple proxies via PROXY_URL
  • CAPTCHA Solving: integration with CapSolver/2captcha
Auth & Billing

  • Email/Password: Bcrypt + JWT
  • Google OAuth: real exchange of code → token → user info
  • Stripe: working Checkout + real webhooks
  • Teams: Owner/member roles

Observability

  • Prometheus: /metrics with counters and histograms
  • Grafana: dashboard included
  • Sentry: error tracking in API and Worker
  • Structured Logging: JSON logs with correlation IDs
  • WebSocket Logs: /ws/logs

Security

  • Input Validation: URLs, webhooks, sizes (SSRF protection)
  • Rate Limiting: per API key (60/min) + per IP (100/min)
  • IP Blocking: automatic 1-hour block
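
The README gives the limits (60/min per API key, 100/min per IP) but not the algorithm or where the counters are stored; the Stack section suggests Redis. As an illustration only, here is an in-memory fixed-window limiter. The `RateLimiter` type is a hypothetical name, and the time is passed in explicitly so the behaviour is easy to test.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Fixed-window rate limiter keyed by an arbitrary string (API key or IP).
/// In-memory stand-in; the project presumably keeps these counters in Redis.
struct RateLimiter {
    limit: u32,
    window: Duration,
    // key -> (window start, requests seen in that window)
    counters: HashMap<String, (Instant, u32)>,
}

impl RateLimiter {
    fn new(limit: u32, window: Duration) -> Self {
        Self { limit, window, counters: HashMap::new() }
    }

    /// Records one request at time `now` and reports whether it is allowed.
    fn allow(&mut self, key: &str, now: Instant) -> bool {
        let entry = self.counters.entry(key.to_string()).or_insert((now, 0));
        // Start a fresh window once the previous one has elapsed.
        if now.duration_since(entry.0) >= self.window {
            *entry = (now, 0);
        }
        if entry.1 < self.limit {
            entry.1 += 1;
            true
        } else {
            false
        }
    }
}
```

Two instances, `RateLimiter::new(60, Duration::from_secs(60))` for API keys and `RateLimiter::new(100, Duration::from_secs(60))` for IPs, would match the stated limits. A fixed window is the simplest choice; a sliding window or token bucket smooths bursts at window boundaries.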

Infrastructure

  • Docker Compose: everything in one command
  • Kubernetes: full manifests with TLS ingress + cert-manager
  • HPA: auto-scaling 3-20 workers
  • Health Checks: liveness, readiness, startup probes
  • SSL/TLS: automatic Let's Encrypt via cert-manager

Secrets Management

  • Multi-provider: Env vars → Vault → AWS Secrets Manager
  • Fallback chain: tries each provider in order
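
The fallback chain described above can be sketched as an ordered list of providers where the first one holding the key wins. The trait and type names below are assumptions; `StaticProvider` is an in-memory stand-in for a remote store such as Vault or AWS Secrets Manager, whose real clients are not shown here.

```rust
use std::collections::HashMap;

/// A source of secrets; returns `None` when it does not hold the key.
/// Hypothetical trait, not taken from the project's code.
trait SecretProvider {
    fn name(&self) -> &str;
    fn get(&self, key: &str) -> Option<String>;
}

/// Reads secrets from process environment variables.
struct EnvProvider;

impl SecretProvider for EnvProvider {
    fn name(&self) -> &str {
        "env"
    }
    fn get(&self, key: &str) -> Option<String> {
        std::env::var(key).ok()
    }
}

/// In-memory stand-in for a remote secret store.
struct StaticProvider {
    label: String,
    values: HashMap<String, String>,
}

impl SecretProvider for StaticProvider {
    fn name(&self) -> &str {
        &self.label
    }
    fn get(&self, key: &str) -> Option<String> {
        self.values.get(key).cloned()
    }
}

/// Tries each provider in order and returns the first hit as
/// (provider name, secret value).
fn resolve_secret(chain: &[Box<dyn SecretProvider>], key: &str) -> Option<(String, String)> {
    chain
        .iter()
        .find_map(|p| p.get(key).map(|v| (p.name().to_string(), v)))
}
```

Returning the provider name alongside the value makes it easy to log where a secret came from without logging the secret itself.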

Frontend

  • Landing Page
  • API Documentation
  • Interactive Playground: try endpoints with code snippets
  • Billing Page: plans + usage bar
  • Dashboard: login, API keys, tester

CI/CD

  • GitHub Actions: CI with test, clippy, audit
  • Docker Build & Push: multi-stage builds
  • Deploy Staging: auto-deploy on push to main
  • Deploy Production: only on v* tags
  • Smoke Tests: post-deploy verification

Legal

  • Terms of Service
  • Privacy Policy
  • Data Processing Agreement

Load Testing

  • k6 Smoke Test: 1 VU
  • k6 Load Test: ramp up to 20 VUs
  • k6 Stress Test: up to 200 VUs
  • k6 Screenshot Test: 5 concurrent VUs

Environment variables

# Core
DATABASE_URL="postgres://..."
REDIS_URL="redis://..."
JWT_SECRET="..."

# Storage
S3_ENDPOINT="..."
S3_BUCKET="..."
S3_ACCESS_KEY="..."
S3_SECRET_KEY="..."

# Auth
GOOGLE_CLIENT_ID="..."
GOOGLE_CLIENT_SECRET="..."

# Billing
STRIPE_SECRET_KEY="..."
STRIPE_WEBHOOK_SECRET="..."

# AI
OPENAI_API_KEY="..."

# Scraping
PROXY_URL="http://proxy1:8080,http://proxy2:8080"
CAPTCHA_API_KEY="..."

# Error Tracking
SENTRY_DSN="https://..."

# Logging
JSON_LOGGING="true"  # Enable structured JSON logs

# Secrets Management
VAULT_ADDR="https://vault.example.com"
VAULT_TOKEN="..."

# Browser Pool
BROWSER_POOL_SIZE=5
MAX_PAGES_PER_BROWSER=10
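
As an illustration of how the two Browser Pool variables might be read, the sketch below parses them with fallbacks. The defaults of 5 and 10 are assumptions taken from the example values above, and `PoolConfig` is a hypothetical type; the project may read these variables in the Node.js script instead.

```rust
/// Browser pool settings mirroring BROWSER_POOL_SIZE and MAX_PAGES_PER_BROWSER.
#[derive(Debug, PartialEq)]
struct PoolConfig {
    pool_size: usize,
    max_pages_per_browser: usize,
}

/// Parses a positive integer, falling back to `default` when the value is
/// missing, malformed, or zero.
fn parse_or(raw: Option<&str>, default: usize) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(default)
}

impl PoolConfig {
    /// Builds the config from a lookup function so it can be tested without
    /// touching the real process environment.
    fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Self {
        Self {
            pool_size: parse_or(lookup("BROWSER_POOL_SIZE").as_deref(), 5),
            max_pages_per_browser: parse_or(lookup("MAX_PAGES_PER_BROWSER").as_deref(), 10),
        }
    }

    /// Reads the same settings from the process environment.
    fn from_env() -> Self {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```

Rejecting zero avoids starting a pool that can never serve a request.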

API usage

AI Extraction

curl -X POST http://localhost:3000/api/extract \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "url": "https://example.com/products",
    "schema": {"products": [{"name": "string", "price": "number"}]}
  }'

Screenshot with stealth + proxy + CAPTCHA

curl -X POST http://localhost:3000/api/screenshot \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "url": "https://protected-site.com",
    "options": {
      "stealth": true,
      "use_proxy": true,
      "solve_captcha": true,
      "session_id": "user_123"
    }
  }'

Mobile emulation

curl -X POST http://localhost:3000/api/screenshot \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{"url": "https://example.com", "options": {"mobile": true}}'

License

MIT
