How I Built a Secure Multi-Model AI Gateway with Cloudflare Workers (Grok + GPT/Codex) | 观者终端

Why I Built This

I wanted a single AI entry point where users can switch between Grok and GPT/Codex without login friction and unstable relay layers.

The target was simple:

secure by default
zero server maintenance
low latency global access
streaming UX that feels native

Live entry: Open AI Chat Terminal

Architecture Overview

Browser (Astro frontend)
        ↓  POST /v1/chat
Cloudflare Worker (security gateway)
        ↓  forward request
Upstream AI providers (Grok / GPT / Codex)
        ↓  SSE stream
Worker passthrough
        ↓
Browser incremental rendering

This design removes traditional server ops completely. No VPS, no container orchestration, no long-running backend.

Security Hardening Strategy

1) Origin Gate (CORS + allowlist)

Never trust direct client requests.
Only requests from approved origins are accepted:

const ALLOWED_ORIGINS = ["https://your-domain.com", "http://localhost:4321"];
if (!ALLOWED_ORIGINS.includes(origin)) {
  return new Response("Forbidden", { status: 403 });
}

This blocks cross-site abuse before touching expensive upstream APIs.

2) Prompt Hardening (Token Cost Control)

Before forwarding messages, the gateway injects a hidden instruction that enforces concise, content-first responses and text-only constraints.
This reduces token waste from repetitive filler text and model drift.

3) Abuse Intercept for Image-Bait Prompts

Some users try to bypass image quotas through chat mode.
The worker pre-checks short image-generation intents with regex and can return a synthetic SSE denial response without calling upstream.

Result: no token burn, better quota protection.

4) Dual-Layer Rate Limiting (KV)

Cloudflare KV stores daily counters:

per-IP chat limit
per-IP image limit
global daily circuit breaker

async function checkRateLimit(kv, ip, type, max) {
  const today = new Date().toISOString().split("T")[0];
  const key = `limit:${ip}:${type}:${today}`;
  const count = parseInt((await kv.get(key)) ?? "0", 10);
  if (count >= max) return false;
  await kv.put(key, String(count + 1), { expirationTtl: 86400 });
  return true;
}

This protects both against single-IP abuse and high-volume proxy pool attacks.

SSE Streaming in Worker: The Key Detail

To keep the typewriter UX, do not buffer full upstream response.
Pass through upstream.body directly:

return new Response(upstream.body, {
  headers: {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",
  },
});

On the frontend, parse SSE chunks incrementally with ReadableStream.getReader() and render progressively.

Frontend Notes (Astro + Lightweight JS)

The /chat page intentionally avoids heavy frameworks for fast cold start and resilient mobile UX.

Key points:

fixed viewport strategy for mobile browser UI-bar quirks
streaming heartbeat timeout via AbortController
safe markdown rendering + XSS sanitation
graceful UX copy for blocked/throttled responses

Deployment Notes

Configure Worker and wrangler.toml
Create KV namespace for limiter state
Store secrets via wrangler secret put API_KEY
Deploy Worker and bind your domain

This gives a practical AI gateway with strong security controls and near-zero ops cost.

Recent Updates (2026-03)

After launch, I shipped another round of practical fixes worth documenting:

Strict bilingual content split: Chinese UI now excludes English posts, and English UI only shows English posts
Completed EN route set: /en, /en/chat, /en/posts, plus language switch entry points
Locale-aware gateway prompt injection: Worker injects EN/ZH guard prompts based on locale, keeping English chat output consistently English
Streaming race-condition fix: mode/model switching is locked while a stream is active, preventing false blocked fallbacks during in-flight responses

These are not cosmetic tweaks—they are stability and consistency fixes discovered under real usage.

Closing

If you are building an AI terminal for public traffic, security and cost control are the real product.

CORS allowlists
prompt hardening
abuse intercept
layered rate limiting

These are not optional—they are the foundation.

→ Try the English AI Chat Terminal