Initial commit: Full Crawl API implementation

README.md (new file, 328 lines)

# Crawl API — Headless Browser REST API

A recreation of [crawlapi.dev](https://crawlapi.dev) in full-stack Rust.

## Stack

- **Backend**: Axum (Rust)
- **Database**: PostgreSQL + sqlx
- **Queue**: Redis (jobs + caching + rate limiting)
- **Browser Automation**: Playwright (Node.js) with a browser pool
- **File Storage**: MinIO (S3-compatible)
- **Frontend**: Next.js 14
- **Observability**: Prometheus + Grafana + Sentry
- **Security**: Rate limiting + IP blocking + input validation
- **AI**: OpenAI GPT-4o-mini extraction
- **Auth**: Email/password + Google OAuth
- **Billing**: Stripe Checkout + Webhooks
- **CI/CD**: GitHub Actions (test, build, deploy)
- **Infra**: Docker Compose + Kubernetes + HPA + cert-manager
- **Load Testing**: k6 (smoke, load, stress, screenshot)

## Structure

```
crawlapi/
├── crates/
│   ├── api/          # REST server (Axum) + seed script
│   ├── worker/       # Distributed worker with Redis queue
│   ├── shared/       # Shared types and config
│   └── db/           # Database layer + migrations
├── playwright/       # Node.js script with Browser Pool + Stealth + CAPTCHA
├── frontend/         # Landing + Playground + Billing + Dashboard + Docs
├── e2e/              # E2E tests with Playwright
├── load-tests/       # k6 load testing scripts
├── k8s/              # Kubernetes manifests + cert-manager
├── legal/            # Terms, Privacy, DPA
├── .github/          # GitHub Actions workflows
├── docker-compose.yml
└── prometheus.yml
```

## Endpoints

### Crawl/Scrape/AI
| Endpoint | Description |
|----------|-------------|
| `POST /api/crawl` | Full JS-rendered page crawl |
| `POST /api/content` | Raw HTML |
| `POST /api/screenshot` | PNG screenshot (uploaded to S3) |
| `POST /api/pdf` | PDF export (uploaded to S3) |
| `POST /api/markdown` | Markdown extraction |
| `POST /api/snapshot` | HTML + screenshot combined |
| `POST /api/scrape` | CSS selector extraction |
| `POST /api/json` | Structured JSON |
| `POST /api/links` | Extract all links |
| `POST /api/extract` | AI-powered extraction with OpenAI |

### Auth
| Endpoint | Description |
|----------|-------------|
| `POST /api/auth/register` | Create an account |
| `POST /api/auth/login` | Log in (returns a JWT) |
| `GET /api/auth/google` | Google OAuth URL |
| `GET /api/auth/google/callback` | OAuth callback (performs a real token exchange) |
| `POST /api/auth/api-keys` | Create an API key (requires JWT) |
| `GET /api/auth/api-keys` | List API keys (requires JWT) |
| `DELETE /api/auth/api-keys/{id}` | Delete an API key (requires JWT) |

### Billing
| Endpoint | Description |
|----------|-------------|
| `POST /api/stripe/checkout` | Create a working Checkout session |
| `POST /api/stripe/webhook` | Stripe webhook (processes real events) |

### Teams
| Endpoint | Description |
|----------|-------------|
| `POST /api/teams` | Create a team |
| `GET /api/teams/{slug}` | View a team and its members |
| `POST /api/teams/{slug}/members` | Add a member |

### Observability
| Endpoint | Description |
|----------|-------------|
| `GET /metrics` | Prometheus metrics |
| `GET /ws/logs` | Live logs over WebSocket |

## Quick Start (Docker)

```bash
# 1. Start the whole stack
cd crawlapi
docker-compose up --build

# 2. Create seed data (in another terminal)
export DATABASE_URL="postgres://crawlapi:crawlapi@localhost:5432/crawlapi"
source "$HOME/.cargo/env"
cargo run -p api --bin seed

# 3. Available services:
# API:        http://localhost:3000
# Frontend:   http://localhost
# MinIO:      http://localhost:9001 (minioadmin/minioadmin)
# Prometheus: http://localhost:9090
# Grafana:    http://localhost:3001 (admin/admin)
```

## CI/CD (GitHub Actions)

```bash
# On every push to main:
# 1. cargo fmt --check
# 2. cargo clippy -- -D warnings
# 3. cargo test --workspace
# 4. cargo audit
# 5. Docker build + push to the registry
# 6. Deploy to staging
# 7. Smoke tests
# 8. Deploy to production (only on v* tags)
```

**Workflows:**
- `.github/workflows/ci.yml`: test, build, push images
- `.github/workflows/deploy.yml`: deploy to staging and production

## Kubernetes

```bash
# Install cert-manager first
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.yaml

# Deploy everything
cd k8s
kubectl apply -f namespace.yaml
kubectl apply -f cert-manager.yaml
kubectl apply -f secrets.yaml
kubectl apply -f postgres.yaml
kubectl apply -f redis.yaml
kubectl apply -f minio.yaml
kubectl apply -f api.yaml
kubectl apply -f worker.yaml
kubectl apply -f frontend.yaml

# Workers auto-scale via HPA (3-20 replicas)
kubectl get hpa -n crawlapi
```

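To make the 3-20 replica range concrete, here is a minimal sketch of the scaling rule Kubernetes documents for the HPA: `ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the configured bounds. The function name `desired_replicas` is hypothetical, and the metric the worker HPA actually targets is not stated in this README, so the numbers in the usage note are assumptions. The sketch also ignores the HPA's tolerance band and stabilization windows.

```rust
/// Desired replica count following the documented HPA formula:
/// ceil(current_replicas * current_metric / target_metric),
/// clamped to [min, max]. Simplified: no tolerance, no stabilization.
pub fn desired_replicas(
    current_replicas: u32,
    current_metric: f64,
    target_metric: f64,
    min: u32,
    max: u32,
) -> u32 {
    let raw = (current_replicas as f64 * current_metric / target_metric).ceil();
    // `as u32` saturates, so huge or infinite values end up at `max`.
    (raw as u32).clamp(min, max)
}
```

With bounds of 3 and 20, three workers running at 90% of some metric against a 60% target would scale to five, and no load level can push the count above 20 or below 3.
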
## Load Testing (k6)

```bash
cd load-tests

# Smoke test (1 VU, 1 min)
k6 run smoke.js

# Load test (ramp up to 20 VUs, 14 min)
k6 run load.js

# Stress test (up to 200 VUs)
k6 run stress.js

# Screenshot test (5 concurrent VUs)
k6 run screenshot.js
```

## Tests

```bash
# Unit tests
cargo test

# E2E tests
cd e2e && npm install && npx playwright test
```

## Implemented features

### Core
- ✅ **10 REST endpoints** (9 crawl + 1 AI)
- ✅ **Browser Pool**: 5 Chromium browsers, 10 pages each
- ✅ **Session/Cookie Persistence**: stores cookies per `session_id`
- ✅ **Mobile Emulation**: iPhone 14 viewport
- ✅ **Infinite Scroll**: auto-scrolls to the bottom of the page
- ✅ **Custom Headers**: arbitrary headers per request

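The pool sizing above (5 browsers with 10 pages each, so at most 50 concurrent pages) can be illustrated with a small slot allocator. This is only a sketch: the real pool lives in the Node.js Playwright script, `PagePool` and its methods are hypothetical names, and the least-loaded selection policy is an assumption rather than the project's documented behavior.

```rust
/// Hypothetical bookkeeping for a browser pool: tracks how many pages
/// are open in each browser and hands out the least-loaded one.
pub struct PagePool {
    pages_in_use: Vec<usize>, // one counter per browser
    max_pages_per_browser: usize,
}

impl PagePool {
    pub fn new(browsers: usize, max_pages_per_browser: usize) -> Self {
        Self { pages_in_use: vec![0; browsers], max_pages_per_browser }
    }

    /// Total number of pages the pool can serve at once.
    pub fn capacity(&self) -> usize {
        self.pages_in_use.len() * self.max_pages_per_browser
    }

    /// Reserve a page slot on the least-loaded browser.
    /// Returns the browser index, or `None` when every browser is full.
    pub fn acquire(&mut self) -> Option<usize> {
        let (idx, used) = self
            .pages_in_use
            .iter()
            .copied()
            .enumerate()
            .min_by_key(|&(_, used)| used)?;
        if used >= self.max_pages_per_browser {
            return None;
        }
        self.pages_in_use[idx] += 1;
        Some(idx)
    }

    /// Give a page slot back to the browser it came from.
    pub fn release(&mut self, browser: usize) {
        if let Some(used) = self.pages_in_use.get_mut(browser) {
            *used = used.saturating_sub(1);
        }
    }
}
```

`PagePool::new(5, 10)` reports a capacity of 50; once all slots are taken, `acquire` returns `None` until a caller releases a page.
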
### Workers
- ✅ **Distributed Queue**: Redis LPUSH/BLPOP
- ✅ **Retry with Backoff**: 3 retries with exponential wait (2s, 4s, 8s)
- ✅ **Dead Letter Queue**: failed jobs are kept for 24h
- ✅ **Caching**: results cached in Redis with a 5 min TTL

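The retry schedule above (2s, 4s, 8s, then give up) can be sketched as follows. This is an illustration under assumptions, not the worker's actual code: `backoff_delay` and `run_with_retry` are hypothetical names, and the sleep function is injected so the schedule can be exercised without waiting.

```rust
use std::time::Duration;

/// Maximum number of retries after the first failed attempt.
const MAX_RETRIES: u32 = 3;

/// Delay before retry number `retry` (1-based): 2s, 4s, 8s.
/// Returns `None` once the retry budget is exhausted, at which point
/// the job would go to the dead letter queue.
pub fn backoff_delay(retry: u32) -> Option<Duration> {
    if retry == 0 || retry > MAX_RETRIES {
        return None;
    }
    Some(Duration::from_secs(1u64 << retry))
}

/// Run `job` until it succeeds or the retries run out.
/// `sleep` is injected so callers choose how to wait.
pub fn run_with_retry<T, E>(
    mut job: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut retry = 0;
    loop {
        match job() {
            Ok(value) => return Ok(value),
            Err(err) => {
                retry += 1;
                match backoff_delay(retry) {
                    Some(delay) => sleep(delay),
                    None => return Err(err), // caller moves the job to the DLQ
                }
            }
        }
    }
}
```

A job that always fails is attempted four times in total (one initial attempt plus three retries). In a blocking worker, `std::thread::sleep` can be passed as the `sleep` argument.
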
### Advanced scraping
- ✅ **Stealth Mode**: evades bot detection (webdriver, plugins, canvas)
- ✅ **Proxy Rotation**: multiple proxies via `PROXY_URL`
- ✅ **CAPTCHA Solving**: integration with CapSolver/2captcha

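Since `PROXY_URL` holds a comma-separated list, rotation can be as simple as a round-robin counter. The sketch below assumes round-robin selection; the README does not say which strategy the project uses, and `ProxyRotator` is a hypothetical type.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Round-robin rotation over a comma-separated proxy list.
pub struct ProxyRotator {
    proxies: Vec<String>,
    next: AtomicUsize,
}

impl ProxyRotator {
    /// Parse a value such as "http://proxy1:8080,http://proxy2:8080".
    /// Empty entries and surrounding whitespace are ignored.
    pub fn from_list(raw: &str) -> Self {
        let proxies = raw
            .split(',')
            .map(str::trim)
            .filter(|p| !p.is_empty())
            .map(String::from)
            .collect();
        Self { proxies, next: AtomicUsize::new(0) }
    }

    /// Read the list from the `PROXY_URL` environment variable.
    pub fn from_env() -> Self {
        Self::from_list(&std::env::var("PROXY_URL").unwrap_or_default())
    }

    /// Next proxy in rotation, or `None` when no proxy is configured.
    pub fn next_proxy(&self) -> Option<&str> {
        if self.proxies.is_empty() {
            return None;
        }
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.proxies.len();
        Some(self.proxies[i].as_str())
    }
}
```

The atomic counter lets several worker threads share one rotator behind a plain `&` reference.
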
### Auth & Billing
- ✅ **Email/Password**: Bcrypt + JWT
- ✅ **Google OAuth**: real exchange of code → token → user info
- ✅ **Stripe**: working Checkout + real webhooks
- ✅ **Teams**: owner/member roles

### Observability
- ✅ **Prometheus**: `/metrics` with counters and histograms
- ✅ **Grafana**: dashboard included
- ✅ **Sentry**: error tracking in the API and the worker
- ✅ **Structured Logging**: JSON logs with correlation IDs
- ✅ **WebSocket Logs**: `/ws/logs`

### Security
- ✅ **Input Validation**: URLs, webhooks, sizes (SSRF protection)
- ✅ **Rate Limiting**: per API key (60/min) + per IP (100/min)
- ✅ **IP Blocking**: automatic block for 1 hour

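The limits above can be illustrated with an in-memory fixed-window counter. This is a sketch under assumptions: the stack lists Redis for rate limiting, so the real counters presumably live there; `RateLimiter` is a hypothetical type; and the rule that an IP is blocked for an hour as soon as it exceeds its limit is a guess at what triggers the auto-block.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Fixed-window request counter keyed by API key or IP address.
pub struct RateLimiter {
    limit: u32,
    window: Duration,
    block_for: Option<Duration>, // assumed: block keys that exceed the limit
    counters: HashMap<String, (Instant, u32)>, // window start, hits
    blocked_until: HashMap<String, Instant>,
}

impl RateLimiter {
    pub fn new(limit: u32, window: Duration, block_for: Option<Duration>) -> Self {
        Self {
            limit,
            window,
            block_for,
            counters: HashMap::new(),
            blocked_until: HashMap::new(),
        }
    }

    /// Returns `true` if a request from `key` is allowed at time `now`.
    pub fn check(&mut self, key: &str, now: Instant) -> bool {
        if let Some(&until) = self.blocked_until.get(key) {
            if now < until {
                return false;
            }
            self.blocked_until.remove(key);
        }
        let entry = self.counters.entry(key.to_string()).or_insert((now, 0));
        if now.duration_since(entry.0) >= self.window {
            *entry = (now, 0); // start a new window
        }
        entry.1 += 1;
        if entry.1 <= self.limit {
            return true;
        }
        if let Some(block) = self.block_for {
            self.blocked_until.insert(key.to_string(), now + block);
        }
        false
    }
}
```

Under these assumptions the per-key limiter would be `RateLimiter::new(60, Duration::from_secs(60), None)` and the per-IP limiter `RateLimiter::new(100, Duration::from_secs(60), Some(Duration::from_secs(3600)))`.
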
### Infrastructure
- ✅ **Docker Compose**: everything in one command
- ✅ **Kubernetes**: full manifests with TLS ingress + cert-manager
- ✅ **HPA**: auto-scaling from 3 to 20 workers
- ✅ **Health Checks**: liveness, readiness, and startup probes
- ✅ **SSL/TLS**: automatic Let's Encrypt certificates via cert-manager

### Secrets Management
- ✅ **Multi-provider**: env vars → Vault → AWS Secrets Manager
- ✅ **Fallback chain**: tries each provider in order

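The fallback chain can be pictured as a list of providers queried in order until one returns a value. The sketch below is an assumption about the shape of such a chain, not the project's code: `SecretProvider`, `SecretChain`, and `StaticProvider` are hypothetical names, and `StaticProvider` is an in-memory stand-in for the Vault and AWS Secrets Manager clients.

```rust
use std::collections::HashMap;

/// A source of secrets. Real providers would call Vault or AWS Secrets
/// Manager; those clients are out of scope for this sketch.
pub trait SecretProvider {
    fn name(&self) -> &str;
    fn get(&self, key: &str) -> Option<String>;
}

/// Reads secrets from process environment variables.
pub struct EnvProvider;

impl SecretProvider for EnvProvider {
    fn name(&self) -> &str {
        "env"
    }
    fn get(&self, key: &str) -> Option<String> {
        std::env::var(key).ok()
    }
}

/// Stand-in for a remote provider, backed by an in-memory map.
pub struct StaticProvider {
    label: &'static str,
    values: HashMap<String, String>,
}

impl StaticProvider {
    pub fn new(label: &'static str, pairs: &[(&str, &str)]) -> Self {
        let values = pairs
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        Self { label, values }
    }
}

impl SecretProvider for StaticProvider {
    fn name(&self) -> &str {
        self.label
    }
    fn get(&self, key: &str) -> Option<String> {
        self.values.get(key).cloned()
    }
}

/// Ordered list of providers; the first one that has the key wins.
pub struct SecretChain {
    providers: Vec<Box<dyn SecretProvider>>,
}

impl SecretChain {
    pub fn new() -> Self {
        Self { providers: Vec::new() }
    }

    /// Append a provider to the end of the chain.
    pub fn with(mut self, provider: impl SecretProvider + 'static) -> Self {
        self.providers.push(Box::new(provider));
        self
    }

    /// Returns the value and the name of the provider that supplied it.
    pub fn resolve(&self, key: &str) -> Option<(String, &str)> {
        for provider in &self.providers {
            if let Some(value) = provider.get(key) {
                return Some((value, provider.name()));
            }
        }
        None
    }
}
```

A chain matching the order above would start with `SecretChain::new().with(EnvProvider)` and then append the Vault and AWS providers.
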
### Frontend
- ✅ **Landing Page**
- ✅ **API Documentation**
- ✅ **Interactive Playground**: try endpoints with code snippets
- ✅ **Billing Page**: plans + usage bar
- ✅ **Dashboard**: login, API keys, tester

### CI/CD
- ✅ **GitHub Actions**: CI with test, clippy, audit
- ✅ **Docker Build & Push**: multi-stage builds
- ✅ **Deploy Staging**: auto-deploy on push to main
- ✅ **Deploy Production**: only on v* tags
- ✅ **Smoke Tests**: post-deploy verification

### Legal
- ✅ **Terms of Service**
- ✅ **Privacy Policy**
- ✅ **Data Processing Agreement**

### Load Testing
- ✅ **k6 Smoke Test**: 1 VU
- ✅ **k6 Load Test**: ramp up to 20 VUs
- ✅ **k6 Stress Test**: up to 200 VUs
- ✅ **k6 Screenshot Test**: 5 concurrent VUs

## Environment variables

```bash
# Core
DATABASE_URL="postgres://..."
REDIS_URL="redis://..."
JWT_SECRET="..."

# Storage
S3_ENDPOINT="..."
S3_BUCKET="..."
S3_ACCESS_KEY="..."
S3_SECRET_KEY="..."

# Auth
GOOGLE_CLIENT_ID="..."
GOOGLE_CLIENT_SECRET="..."

# Billing
STRIPE_SECRET_KEY="..."
STRIPE_WEBHOOK_SECRET="..."

# AI
OPENAI_API_KEY="..."

# Scraping
PROXY_URL="http://proxy1:8080,http://proxy2:8080"
CAPTCHA_API_KEY="..."

# Error Tracking
SENTRY_DSN="https://..."

# Logging
JSON_LOGGING="true"  # Enable structured JSON logs

# Secrets Management
VAULT_ADDR="https://vault.example.com"
VAULT_TOKEN="..."

# Browser Pool
BROWSER_POOL_SIZE=5
MAX_PAGES_PER_BROWSER=10
```

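As an illustration of how a Rust service might read a few of these variables, here is a minimal std-only sketch. `RuntimeConfig` is a hypothetical struct, and treating 5 and 10 as fallback defaults is an assumption: the README shows them as example values without stating what happens when the variables are unset.

```rust
use std::env;

/// Hypothetical subset of the runtime configuration.
#[derive(Debug, PartialEq)]
pub struct RuntimeConfig {
    pub browser_pool_size: usize,
    pub max_pages_per_browser: usize,
    pub json_logging: bool,
}

/// Parse an optional raw value, falling back to `default` when the
/// variable is unset or not a valid number.
fn parse_or(raw: Option<String>, default: usize) -> usize {
    raw.and_then(|v| v.trim().parse().ok()).unwrap_or(default)
}

impl RuntimeConfig {
    /// Build the config from any lookup function, which keeps the
    /// parsing logic testable without touching the real environment.
    pub fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Self {
        Self {
            browser_pool_size: parse_or(lookup("BROWSER_POOL_SIZE"), 5),
            max_pages_per_browser: parse_or(lookup("MAX_PAGES_PER_BROWSER"), 10),
            json_logging: lookup("JSON_LOGGING").as_deref() == Some("true"),
        }
    }

    /// Read the values from the process environment.
    pub fn from_env() -> Self {
        Self::from_lookup(|key| env::var(key).ok())
    }
}
```

Calling `RuntimeConfig::from_env()` at startup yields one typed value that the rest of the service can share instead of re-reading the environment.
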
## API usage

### AI Extraction
```bash
curl -X POST http://localhost:3000/api/extract \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "url": "https://example.com/products",
    "schema": {"products": [{"name": "string", "price": "number"}]}
  }'
```

### Screenshot with stealth + proxy + CAPTCHA
```bash
curl -X POST http://localhost:3000/api/screenshot \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "url": "https://protected-site.com",
    "options": {
      "stealth": true,
      "use_proxy": true,
      "solve_captcha": true,
      "session_id": "user_123"
    }
  }'
```

### Mobile emulation
```bash
curl -X POST http://localhost:3000/api/screenshot \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{"url": "https://example.com", "options": {"mobile": true}}'
```

## License

MIT