Crawl API — Headless Browser REST API
A full-stack Rust recreation of crawlapi.dev.
Stack
- Backend: Axum (Rust)
- Database: PostgreSQL + sqlx
- Queue: Redis (jobs + caching + rate limiting)
- Browser Automation: Playwright (Node.js) with Browser Pool
- File Storage: MinIO (S3-compatible)
- Frontend: Next.js 14
- Observability: Prometheus + Grafana + Sentry
- Security: Rate limiting + IP blocking + input validation
- AI: OpenAI GPT-4o-mini extraction
- Auth: Email/password + Google OAuth
- Billing: Stripe Checkout + Webhooks
- CI/CD: GitHub Actions (test, build, deploy)
- Infra: Docker Compose + Kubernetes + HPA + cert-manager
- Load Testing: k6 (smoke, load, stress, screenshot)
Structure
crawlapi/
├── crates/
│   ├── api/ # REST server (Axum) + seed script
│   ├── worker/ # Distributed worker with Redis queue
│   ├── shared/ # Shared types and config
│   └── db/ # Database layer + migrations
├── playwright/ # Node.js script with Browser Pool + Stealth + CAPTCHA
├── frontend/ # Landing + Playground + Billing + Dashboard + Docs
├── e2e/ # E2E tests with Playwright
├── load-tests/ # k6 load testing scripts
├── k8s/ # Kubernetes manifests + cert-manager
├── legal/ # Terms, Privacy, DPA
├── .github/ # GitHub Actions workflows
├── docker-compose.yml
└── prometheus.yml
Endpoints
Crawl/Scrape/AI
| Endpoint | Description |
|---|---|
| POST /api/crawl | Full JS-rendered page crawl |
| POST /api/content | Raw HTML |
| POST /api/screenshot | PNG screenshot (uploaded to S3) |
| POST /api/pdf | PDF export (uploaded to S3) |
| POST /api/markdown | Markdown extraction |
| POST /api/snapshot | HTML + screenshot combined |
| POST /api/scrape | CSS selector extraction |
| POST /api/json | Structured JSON |
| POST /api/links | Extract all links |
| POST /api/extract | AI-powered extraction with OpenAI |
Auth
| Endpoint | Description |
|---|---|
| POST /api/auth/register | Create an account |
| POST /api/auth/login | Log in (returns a JWT) |
| GET /api/auth/google | Google OAuth URL |
| GET /api/auth/google/callback | OAuth callback (performs a real token exchange) |
| POST /api/auth/api-keys | Create an API key (requires JWT) |
| GET /api/auth/api-keys | List API keys (requires JWT) |
| DELETE /api/auth/api-keys/{id} | Delete an API key (requires JWT) |
Billing
| Endpoint | Description |
|---|---|
| POST /api/stripe/checkout | Create a working checkout session |
| POST /api/stripe/webhook | Stripe webhook (processes real events) |
Teams
| Endpoint | Description |
|---|---|
| POST /api/teams | Create a team |
| GET /api/teams/{slug} | View a team and its members |
| POST /api/teams/{slug}/members | Add a member |
Observability
| Endpoint | Description |
|---|---|
| GET /metrics | Prometheus metrics |
| GET /ws/logs | WebSocket live logs |
Quick Start (Docker)
# 1. Start the whole stack
cd crawlapi
docker-compose up --build
# 2. Create seed data (in another terminal)
export DATABASE_URL="postgres://crawlapi:crawlapi@localhost:5432/crawlapi"
source "$HOME/.cargo/env"
cargo run -p api --bin seed
# 3. Available services:
# API: http://localhost:3000
# Frontend: http://localhost
# MinIO: http://localhost:9001 (minioadmin/minioadmin)
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3001 (admin/admin)
CI/CD (GitHub Actions)
# On every push to main:
# 1. cargo fmt --check
# 2. cargo clippy -- -D warnings
# 3. cargo test --workspace
# 4. cargo audit
# 5. Docker build + push to the registry
# 6. Deploy to staging
# 7. Smoke tests
# 8. Deploy to production (only on v* tags)
Workflows:
- .github/workflows/ci.yml: test, build, push images
- .github/workflows/deploy.yml: deploy to staging and production
Kubernetes
# Install cert-manager first
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.yaml
# Deploy everything
cd k8s
kubectl apply -f namespace.yaml
kubectl apply -f cert-manager.yaml
kubectl apply -f secrets.yaml
kubectl apply -f postgres.yaml
kubectl apply -f redis.yaml
kubectl apply -f minio.yaml
kubectl apply -f api.yaml
kubectl apply -f worker.yaml
kubectl apply -f frontend.yaml
# Workers auto-scale with HPA (3-20 replicas)
kubectl get hpa -n crawlapi
Load Testing (k6)
cd load-tests
# Smoke test (1 VU, 1 min)
k6 run smoke.js
# Load test (ramp-up to 20 VUs, 14 min)
k6 run load.js
# Stress test (up to 200 VUs)
k6 run stress.js
# Screenshot test (5 concurrent VUs)
k6 run screenshot.js
Tests
# Unit tests
cargo test
# E2E tests
cd e2e && npm install && npx playwright test
Implemented features
Core
- ✅ 10 REST endpoints (9 crawl + 1 AI)
- ✅ Browser Pool: 5 Chromium browsers, 10 pages each
- ✅ Session/Cookie Persistence: stores cookies per session_id
- ✅ Mobile Emulation: iPhone 14 viewport
- ✅ Infinite Scroll: auto-scrolls to the end of the page
- ✅ Custom Headers: arbitrary headers per request
Workers
- ✅ Distributed Queue: Redis LPUSH/BLPOP
- ✅ Retry with Backoff: 3 retries with exponential waits (2s, 4s, 8s)
- ✅ Dead Letter Queue: failed jobs are kept for 24h
- ✅ Caching: results cached in Redis with a 5 min TTL
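The retry schedule listed above (2s, 4s, 8s, then give up) can be sketched in plain Rust. This is only an illustration of the arithmetic, not the worker's actual code: `backoff_delay` and `run_with_retries` are hypothetical names, and the sleep function is injected so the loop can be exercised without real waiting.

```rust
use std::time::Duration;

/// Retry budget taken from the feature list: 3 retries.
const MAX_RETRIES: u32 = 3;

/// Delay before retry number `attempt` (1-based): 2s, 4s, 8s.
/// Returns `None` once the retry budget is exhausted.
fn backoff_delay(attempt: u32) -> Option<Duration> {
    if attempt == 0 || attempt > MAX_RETRIES {
        return None;
    }
    Some(Duration::from_secs(2u64.pow(attempt)))
}

/// Runs `job` until it succeeds or the retries run out.
/// `sleep` is a parameter (hypothetical design) so callers choose how to wait.
fn run_with_retries<T, E>(
    mut job: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match job() {
            Ok(value) => return Ok(value),
            Err(err) => {
                attempt += 1;
                match backoff_delay(attempt) {
                    Some(delay) => sleep(delay),
                    // Out of retries: the caller would move the job to the dead letter queue.
                    None => return Err(err),
                }
            }
        }
    }
}
```

A job that always fails is therefore executed four times in total (one initial attempt plus three retries), with `std::thread::sleep` being a natural choice for the `sleep` argument in a blocking worker.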
Advanced scraping
- ✅ Stealth Mode: evades bot detection (webdriver, plugins, canvas)
- ✅ Proxy Rotation: multiple proxies via PROXY_URL
- ✅ CAPTCHA Solving: integration with CapSolver/2captcha
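As a rough illustration of proxy rotation, the sketch below parses a comma-separated `PROXY_URL` value and hands out proxies in turn. The round-robin strategy and the `ProxyRotator` type are assumptions made for this example; the README does not say how the project picks a proxy.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical helper: cycles through the proxies listed in PROXY_URL.
struct ProxyRotator {
    proxies: Vec<String>,
    next: AtomicUsize,
}

impl ProxyRotator {
    /// Parses a comma-separated list, ignoring blanks and surrounding whitespace.
    fn from_list(raw: &str) -> Self {
        let proxies: Vec<String> = raw
            .split(',')
            .map(str::trim)
            .filter(|p| !p.is_empty())
            .map(String::from)
            .collect();
        Self { proxies, next: AtomicUsize::new(0) }
    }

    /// Returns the next proxy in round-robin order, or `None` if none are configured.
    fn next_proxy(&self) -> Option<&str> {
        if self.proxies.is_empty() {
            return None;
        }
        let index = self.next.fetch_add(1, Ordering::Relaxed) % self.proxies.len();
        Some(self.proxies[index].as_str())
    }
}
```

The atomic counter lets several worker threads share one rotator behind a shared reference without a lock.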
Auth & Billing
- ✅ Email/Password: Bcrypt + JWT
- ✅ Google OAuth: real exchange of code → token → user info
- ✅ Stripe: working Checkout + real webhooks
- ✅ Teams: Owner/member roles
Observability
- ✅ Prometheus: /metrics with counters and histograms
- ✅ Grafana: dashboard included
- ✅ Sentry: error tracking in the API and Worker
- ✅ Structured Logging: JSON logs with correlation IDs
- ✅ WebSocket Logs: /ws/logs
Security
- ✅ Input Validation: URLs, webhooks, sizes (SSRF protection)
- ✅ Rate Limiting: per API key (60/min) + per IP (100/min)
- ✅ IP Blocking: automatic 1-hour block
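To make the 60/min and 100/min limits concrete, here is a minimal in-memory fixed-window counter. It is a sketch under assumptions: the Stack section says rate limiting lives in Redis, the windowing algorithm is not stated, and `FixedWindowLimiter` is a hypothetical name. The current time is passed in explicitly so the behaviour is easy to check.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical fixed-window limiter keyed by API key or IP address.
struct FixedWindowLimiter {
    limit: u32,
    window: Duration,
    /// Per key: (start of the current window, requests seen in it).
    counters: HashMap<String, (Instant, u32)>,
}

impl FixedWindowLimiter {
    fn new(limit: u32, window: Duration) -> Self {
        Self { limit, window, counters: HashMap::new() }
    }

    /// Records one request for `key` at `now`; returns `false` when the key is over its limit.
    fn allow(&mut self, key: &str, now: Instant) -> bool {
        let entry = self.counters.entry(key.to_string()).or_insert((now, 0));
        // Start a fresh window once the previous one has fully elapsed.
        if now.duration_since(entry.0) >= self.window {
            *entry = (now, 0);
        }
        if entry.1 < self.limit {
            entry.1 += 1;
            true
        } else {
            false
        }
    }
}
```

Two instances would cover the limits above: `FixedWindowLimiter::new(60, Duration::from_secs(60))` keyed by API key and `FixedWindowLimiter::new(100, Duration::from_secs(60))` keyed by client IP.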
Infrastructure
- ✅ Docker Compose: everything in one command
- ✅ Kubernetes: full manifests with TLS ingress + cert-manager
- ✅ HPA: auto-scaling from 3 to 20 workers
- ✅ Health Checks: liveness, readiness, startup probes
- ✅ SSL/TLS: automatic Let's Encrypt certificates via cert-manager
Secrets Management
- ✅ Multi-provider: env vars → Vault → AWS Secrets Manager
- ✅ Fallback chain: tries each provider in order
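The fallback chain can be pictured as a list of providers queried in order until one returns a value. The trait and type names below (`SecretProvider`, `EnvProvider`, `StaticProvider`, `resolve_secret`) are hypothetical, and `StaticProvider` is only an in-memory stand-in for Vault or AWS Secrets Manager, whose real clients are not shown here.

```rust
use std::collections::HashMap;

/// Hypothetical abstraction over one source of secrets.
trait SecretProvider {
    fn name(&self) -> &str;
    fn get(&self, key: &str) -> Option<String>;
}

/// Reads secrets from process environment variables.
struct EnvProvider;

impl SecretProvider for EnvProvider {
    fn name(&self) -> &str {
        "env"
    }
    fn get(&self, key: &str) -> Option<String> {
        std::env::var(key).ok()
    }
}

/// In-memory stand-in for a remote provider (Vault, AWS Secrets Manager).
struct StaticProvider {
    name: &'static str,
    values: HashMap<String, String>,
}

impl SecretProvider for StaticProvider {
    fn name(&self) -> &str {
        self.name
    }
    fn get(&self, key: &str) -> Option<String> {
        self.values.get(key).cloned()
    }
}

/// Tries each provider in order; returns the first hit and the provider that served it.
fn resolve_secret<'a>(
    providers: &'a [Box<dyn SecretProvider>],
    key: &str,
) -> Option<(&'a str, String)> {
    providers
        .iter()
        .find_map(|provider| provider.get(key).map(|value| (provider.name(), value)))
}
```

Returning the provider name alongside the value makes it easy to log where a secret came from without logging the secret itself.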
Frontend
- ✅ Landing Page
- ✅ API Documentation
- ✅ Interactive Playground: try endpoints with code snippets
- ✅ Billing Page: Plans + usage bar
- ✅ Dashboard: Login, API keys, tester
CI/CD
- ✅ GitHub Actions: CI with test, clippy, audit
- ✅ Docker Build & Push: multi-stage builds
- ✅ Deploy Staging: auto-deploy on push to main
- ✅ Deploy Production: only on v* tags
- ✅ Smoke Tests: post-deploy verification
Legal
- ✅ Terms of Service
- ✅ Privacy Policy
- ✅ Data Processing Agreement
Load Testing
- ✅ k6 Smoke Test: 1 VU
- ✅ k6 Load Test: ramp-up to 20 VUs
- ✅ k6 Stress Test: up to 200 VUs
- ✅ k6 Screenshot Test: 5 concurrent VUs
Environment variables
# Core
DATABASE_URL="postgres://..."
REDIS_URL="redis://..."
JWT_SECRET="..."
# Storage
S3_ENDPOINT, S3_BUCKET, S3_ACCESS_KEY, S3_SECRET_KEY
# Auth
GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET
# Billing
STRIPE_SECRET_KEY, STRIPE_WEBHOOK_SECRET
# AI
OPENAI_API_KEY
# Scraping
PROXY_URL="http://proxy1:8080,http://proxy2:8080"
CAPTCHA_API_KEY="..."
# Error Tracking
SENTRY_DSN="https://..."
# Logging
JSON_LOGGING="true" # Enable structured JSON logs
# Secrets Management
VAULT_ADDR="https://vault.example.com"
VAULT_TOKEN="..."
# Browser Pool
BROWSER_POOL_SIZE=5
MAX_PAGES_PER_BROWSER=10
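The two browser pool variables multiply into a ceiling on concurrent pages: 5 browsers × 10 pages each gives 50. The sketch below reads them with the defaults shown above. It assumes a Rust-side reader purely for illustration (the pool itself lives in the Node.js Playwright script), and `PoolConfig` and `parse_or` are hypothetical names.

```rust
/// Parses a positive integer, falling back to `default` on missing or invalid input.
fn parse_or(raw: Option<&str>, default: usize) -> usize {
    raw.and_then(|value| value.trim().parse::<usize>().ok())
        .filter(|n| *n > 0)
        .unwrap_or(default)
}

/// Hypothetical view of the browser pool sizing.
#[derive(Debug, PartialEq)]
struct PoolConfig {
    browsers: usize,
    pages_per_browser: usize,
}

impl PoolConfig {
    /// Builds a config from raw strings; defaults mirror the values listed above.
    fn from_values(pool_size: Option<&str>, max_pages: Option<&str>) -> Self {
        Self {
            browsers: parse_or(pool_size, 5),
            pages_per_browser: parse_or(max_pages, 10),
        }
    }

    /// Reads BROWSER_POOL_SIZE and MAX_PAGES_PER_BROWSER from the environment.
    fn from_env() -> Self {
        let pool_size = std::env::var("BROWSER_POOL_SIZE").ok();
        let max_pages = std::env::var("MAX_PAGES_PER_BROWSER").ok();
        Self::from_values(pool_size.as_deref(), max_pages.as_deref())
    }

    /// Upper bound on pages open at the same time across the whole pool.
    fn max_concurrent_pages(&self) -> usize {
        self.browsers * self.pages_per_browser
    }
}
```

Falling back to the default on an unparsable value keeps a typo in the environment from disabling the pool; a stricter service might prefer to fail at startup instead.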
API usage
AI Extraction
curl -X POST http://localhost:3000/api/extract \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
-d '{
"url": "https://example.com/products",
"schema": {"products": [{"name": "string", "price": "number"}]}
}'
Screenshot with stealth + proxy + CAPTCHA
curl -X POST http://localhost:3000/api/screenshot \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
-d '{
"url": "https://protected-site.com",
"options": {
"stealth": true,
"use_proxy": true,
"solve_captcha": true,
"session_id": "user_123"
}
}'
Mobile emulation
curl -X POST http://localhost:3000/api/screenshot \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
-d '{"url": "https://example.com", "options": {"mobile": true}}'
License
MIT