Build a service block · gonkablocks guides

What a service is

A service block is one long-running container, shared by every caller. The author deploys it; the platform starts it; it stays running; everyone hits the same instance. Output of a service is whatever your HTTP routes return — JSON, HTML, SSE, redirects, anything.

author runs deploy
       │
       ▼
 platform starts container (one)
       │
       ▼
 container binds 0.0.0.0:<expose_port>
       │
       ▼
 anyone hits  https://blocks.gonka.gg/services/<author>/<slug>/<path>
       │
       ▼
 every request → session-proxy → 127.0.0.1:<hostPort>/<path>
       │
       ▼
 (until the author stops it, or it crashes, or it hits its spend cap)

The URL is stable. New versions can be deployed without changing it (a redeploy swaps the container atomically). Public services are reachable without auth; private ones require an API key.
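The URL shape in the diagram can be sketched as a small parser. This is only an illustration of how /services/<author>/<slug>/<path> splits into the part that identifies the service and the part forwarded to the container; split_service_path and ServiceRoute are hypothetical names, not the session-proxy's actual code.

```python
from typing import NamedTuple


class ServiceRoute(NamedTuple):
    author: str
    slug: str
    upstream_path: str  # the path the container would see


def split_service_path(public_path: str) -> ServiceRoute:
    """Split /services/<author>/<slug>/<path> into its parts (hypothetical helper)."""
    prefix = "/services/"
    if not public_path.startswith(prefix):
        raise ValueError(f"not a service path: {public_path!r}")
    # At most three pieces: author, slug, and everything after the slug.
    parts = public_path[len(prefix):].split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"missing author or slug: {public_path!r}")
    rest = parts[2] if len(parts) == 3 else ""
    return ServiceRoute(parts[0], parts[1], "/" + rest)
```

For example, split_service_path("/services/admin/e2e-service/echo") yields author admin, slug e2e-service, and upstream path /echo.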

Service vs. session

Both spin up a container and proxy HTTP through to it. The key difference is scope:

	session	service
Containers	one per user	one total, shared
Lifetime	idle TTL → reaped	runs until you stop it
State	per-user, ephemeral	shared (global)
Public URL?	only for the user's session	yes, if isPublic
Who pays?	caller's spend cap	author's wallet
Use for	chats, notebooks, apps	APIs, webhooks, integrations

When to pick service

You want a REST endpoint other systems call.
State should be shared across callers (or there is no state).
You'd otherwise stand up a tiny container on Fly / Railway / Render and forget about it. A service block is the same shape, with inference baked in.
Examples: an MCP server backed by your data; a webhook receiver that triggers an LLM action; a public "classify this URL" API; a status page that summarises runs in natural language.

When NOT to pick service

You want per-user state — pick session.
You only need a one-shot result — pick job. Services are always-on; jobs are pay-per-run.
The thing should run on a schedule — pick worker. (You can combine: a worker that POSTs to a service is a perfectly good pattern.)

What you'll build

A service called e2e-service with three routes:

GET / → tiny JSON health check
GET /echo?q=… → echoes back the query string
POST /haiku → JSON {topic: "..."} → JSON {haiku: "..."} via Qwen3-235B

The full HTTP loop — including the chunked-body bug that I had to work around the first time — is documented below.

manifest.yaml

name: e2e-service
version: 0.1.2
type: service
description: Tiny FastAPI echo server with one extra route that calls the model.
category: misc

runtime:
  build: dockerfile
  entrypoint: uvicorn server:app --host 0.0.0.0 --port 8080
  expose_port: 8080
  env:
    MODEL: qwen3-235b

resources:
  cpu: 1
  memory_mb: 512
  timeout_seconds: 7200
  network: deny

pricing:
  type: per_run
  base_price_cents: 0
  rate_cents_per_minute: 0
  inference_pass_through: true
  inference_markup_pct: 0

Field-by-field, what's new vs. session

type: service — required. Tells the platform this is a single shared container, not per-user.
runtime.entrypoint — must start a long-running HTTP server (see the pitfalls).
runtime.expose_port — TCP port your server binds inside the container. Must match the port uvicorn / express / etc. listens on.
resources.timeout_seconds — for services, this is the absolute lifetime cap. 7200 (two hours) is a common minimum; many authors set it to one week (3600 * 24 * 7 = 604800) and rely on spend caps to bound cost.
No idle_ttl_seconds for services — services don't auto-reap on idle. They run until stopped or until the spend cap is hit.
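A pre-deploy sanity check for the fields above might look like the following. It is a minimal sketch, assuming the manifest has already been parsed into a dict and that the entrypoint passes its port as --port; check_service_manifest and entrypoint_port are hypothetical helpers, not part of the gonkablocks CLI.

```python
from __future__ import annotations

import shlex

ONE_WEEK_SECONDS = 3600 * 24 * 7  # 604800, the "one week" lifetime cap


def entrypoint_port(entrypoint: str) -> int | None:
    """Return the value of --port in an entrypoint string, or None if absent."""
    args = shlex.split(entrypoint)
    for i, arg in enumerate(args):
        if arg == "--port" and i + 1 < len(args):
            return int(args[i + 1])
        if arg.startswith("--port="):
            return int(arg.split("=", 1)[1])
    return None


def check_service_manifest(manifest: dict) -> list[str]:
    """Collect human-readable problems; an empty list means nothing was found."""
    problems: list[str] = []
    if manifest.get("type") != "service":
        problems.append("type must be 'service'")
    runtime = manifest.get("runtime", {})
    expose = runtime.get("expose_port")
    port = entrypoint_port(runtime.get("entrypoint", ""))
    if port is not None and port != expose:
        problems.append(f"entrypoint listens on {port} but expose_port is {expose}")
    return problems
```

Run against the manifest above (port 8080 in both places) it returns an empty list; a mismatch between the entrypoint port and expose_port is reported as one problem.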

Dockerfile

FROM python:3.12-slim
WORKDIR /workspace
RUN pip install --no-cache-dir 'openai==1.55.0' 'httpx<0.28' \
                              'fastapi==0.115.5' 'uvicorn[standard]==0.32.1'
COPY . /workspace
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8080"]

server.py

"""FastAPI service for e2e-service.

Routes:
  GET  /            -> {"ok": true, "service": "e2e-service"}
  GET  /echo?q=…    -> {"echo": "<q>"}
  POST /haiku       -> body: {"topic": "..."}, returns {"haiku": "..."}

We use FastAPI/uvicorn because the platform's session-proxy strips
content-length and forwards the body chunked. Frameworks that decode
chunked transfer-encoding (FastAPI, Flask, Starlette, etc.) work
transparently; Python's stdlib http.server does not.
"""
from __future__ import annotations
import os

from fastapi import FastAPI
from pydantic import BaseModel
from openai import OpenAI

MODEL = os.environ.get("MODEL", "qwen3-235b")
_client: OpenAI | None = None

def oai() -> OpenAI:
    global _client
    if _client is None:
        _client = OpenAI(
            base_url=os.environ["OPENAI_BASE_URL"],
            api_key=os.environ["OPENAI_API_KEY"],
        )
    return _client

class HaikuReq(BaseModel):
    topic: str | None = None

app = FastAPI()

@app.get("/")
def root() -> dict:
    return {"ok": True, "service": "e2e-service"}

@app.get("/echo")
def echo(q: str = "") -> dict:
    return {"echo": q}

@app.post("/haiku")
def haiku(body: HaikuReq) -> dict:
    topic = (body.topic or "").strip() or "the universe"
    resp = oai().chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Reply with one 3-line haiku, plain text only."},
            {"role": "user", "content": topic},
        ],
        temperature=0.6,
    )
    return {"topic": topic, "haiku": (resp.choices[0].message.content or "").strip()}

Things worth noting:

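Lazy client construction. oai() only reads the environment when it is actually called. That means your service starts even if some env var is briefly missing: the error surfaces on the first request, not on boot, which is much easier to debug.
One process is fine. Don't reach for gunicorn / multiple workers unless you actually measure contention. A single uvicorn process handles thousands of concurrent in-flight requests via asyncio when routes are async def. Plain def routes like the ones above run in a threadpool instead (40 threads by default), so a slow blocking call such as the model request in /haiku is what limits concurrency first.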
No CORS by default. If you'll call this service from a browser, add fastapi.middleware.cors.CORSMiddleware with whatever origin allowlist makes sense.
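The lazy-construction note can also be expressed with functools.lru_cache instead of a module-level global. This is a sketch of an equivalent pattern, not the code the service uses; FakeClient is a stand-in so the example runs without the openai package installed.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class FakeClient:
    """Stand-in for openai.OpenAI (illustration only)."""
    base_url: str
    api_key: str


@lru_cache(maxsize=1)
def client() -> FakeClient:
    # os.environ[...] raises KeyError here, at first use, not at import time.
    # lru_cache does not cache exceptions, so a later call retries the lookup.
    return FakeClient(
        base_url=os.environ["OPENAI_BASE_URL"],
        api_key=os.environ["OPENAI_API_KEY"],
    )
```

Importing the module never touches the environment; a missing variable fails the first call only, and once both variables are present every later call returns the same cached instance.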

Deploy the block

Same as a job — package the source, the platform builds and publishes it. The container isn't running yet; that's a separate step.

$ gonkablocks deploy
packaging /private/tmp/gbk-e2e/e2e-service → e2e-service@0.1.2…
uploading 2 files…
building…
 • ready
publishing…
✓ published. View at https://blocks.gonka.gg/blocks/admin/e2e-service

Start the service

The container is started by the author (you), once. From the block page, click Deploy service. Or via the API:

# 1. resolve block id
BLOCKID=$(curl -s -H "Authorization: Bearer $GONKA_API_KEY" \
  "https://blocks.gonka.gg/api/blocks/resolve?author=admin&slug=e2e-service" \
  | jq -r .blockId)

# 2. start the service container
curl -X POST https://blocks.gonka.gg/api/services/$BLOCKID/deploy \
  -H "Authorization: Bearer $GONKA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "isPublic": true,
    "spendCapCents": 100
  }'
# {"id":"cmoogu85e002v4a1dtekc2w9k","url":"/services/admin/e2e-service/"}

The response's id is the BlockSession id (services and sessions share the same underlying table). The URL is the public-facing path. Check readiness:

curl -H "Authorization: Bearer $GONKA_API_KEY" \
  "https://blocks.gonka.gg/api/services/$BLOCKID"
# {"service":{"status":"running", "hostPort":32778, ...}}

Wait until status: running before sending real traffic. The first request after start also waits internally for the bind, but with a short timeout — checking once explicitly avoids cold-start 503s.
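The readiness check can be scripted as a small polling loop. This is a sketch using only the standard library; the endpoint and the service.status field come from the curl example above, while the intermediate status names, the timeout, and the polling interval are assumptions.

```python
from __future__ import annotations

import json
import time
import urllib.request
from typing import Callable


def fetch_status(block_id: str, api_key: str) -> str:
    """GET /api/services/<blockId> and return the service.status field."""
    req = urllib.request.Request(
        f"https://blocks.gonka.gg/api/services/{block_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["service"]["status"]


def wait_until_running(
    get_status: Callable[[], str],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll get_status until it reports 'running'; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "running":
            return
        if status == "failed":
            raise RuntimeError("service failed to start")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"service still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

Typical use would be wait_until_running(lambda: fetch_status(block_id, api_key)) right after the deploy call, before sending real traffic.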

Calling it

The public URL pattern is /services/<author>/<slug>/<path>. If isPublic: true, anyone can hit it. If false, you need a bearer token from the author's account.

# health
curl https://blocks.gonka.gg/services/admin/e2e-service
# {"ok":true,"service":"e2e-service"}

# query string
curl "https://blocks.gonka.gg/services/admin/e2e-service/echo?q=hello%20world"
# {"echo":"hello world"}

# POST with body
curl -X POST -H "Content-Type: application/json" \
  https://blocks.gonka.gg/services/admin/e2e-service/haiku \
  -d '{"topic":"a quiet harbour at dusk"}'
# {"topic":"a quiet harbour at dusk","haiku":"Crimson sun descends, ..."}

From a browser fetch:

const res = await fetch(
  "https://blocks.gonka.gg/services/admin/e2e-service/haiku",
  {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ topic: "a quiet harbour at dusk" }),
  },
);
const { haiku } = await res.json();
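The same POST from Python, for server-side callers. This is a minimal standard-library sketch; build_haiku_request and haiku are hypothetical helper names, and the optional api_key argument only matters when the service was deployed with isPublic: false.

```python
from __future__ import annotations

import json
import urllib.request

BASE = "https://blocks.gonka.gg/services/admin/e2e-service"


def build_haiku_request(topic: str, api_key: str | None = None) -> urllib.request.Request:
    """Build the POST /haiku request without sending it."""
    headers = {"Content-Type": "application/json"}
    if api_key:  # private services need a bearer token from the author's account
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{BASE}/haiku",
        data=json.dumps({"topic": topic}).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def haiku(topic: str, api_key: str | None = None) -> str:
    """Send the request and return the haiku text from the JSON response."""
    with urllib.request.urlopen(build_haiku_request(topic, api_key), timeout=60) as resp:
        return json.load(resp)["haiku"]
```

Keeping request construction separate from sending makes the call easy to inspect or unit-test without network access.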

Redeploy a new version

Two-step: bump the version + deploy, then redeploy the container.

# manifest.yaml
- version: 0.1.2
+ version: 0.1.3

$ gonkablocks deploy
…
publishing…
✓ published. View at https://blocks.gonka.gg/blocks/admin/e2e-service

# swap the live container atomically
curl -X POST https://blocks.gonka.gg/api/services/$BLOCKID/deploy \
  -H "Authorization: Bearer $GONKA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"redeploy": true, "isPublic": true, "spendCapCents": 100}'
# {"id":"cmoogxba3003f4a1d96c0d6x3","url":"/services/admin/e2e-service/"}

redeploy: true stops the current container only after the new one has started and bound its port, so the public URL keeps responding throughout. Drain in-flight requests gracefully by hooking SIGTERM (see the session guide).
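Draining on SIGTERM can be sketched with a small in-flight counter. This is a framework-agnostic illustration, not the approach from the session guide; InFlight and install_sigterm_handler are hypothetical names. Under uvicorn, which installs its own SIGTERM handling and already waits for open requests on shutdown, you would normally rely on that rather than replace its handler.

```python
import signal
import threading


class InFlight:
    """Counts active requests and lets shutdown wait for them to finish."""

    def __init__(self) -> None:
        self._cond = threading.Condition()  # backed by a re-entrant lock
        self._active = 0
        self.shutting_down = False

    def __enter__(self) -> "InFlight":
        with self._cond:
            if self.shutting_down:
                raise RuntimeError("shutting down, request refused")
            self._active += 1
        return self

    def __exit__(self, *exc) -> None:
        with self._cond:
            self._active -= 1
            self._cond.notify_all()

    def begin_shutdown(self) -> None:
        with self._cond:
            self.shutting_down = True

    def wait_idle(self, timeout: float) -> bool:
        """Block until no request is active; False if the timeout expires first."""
        with self._cond:
            return self._cond.wait_for(lambda: self._active == 0, timeout)


def install_sigterm_handler(tracker: InFlight) -> None:
    # Only flips the flag; the serving loop then calls wait_idle() and exits.
    signal.signal(signal.SIGTERM, lambda signum, frame: tracker.begin_shutdown())
```

Each request handler wraps its work in "with tracker:". The wait_idle budget should stay below the roughly 10 s the platform allows between SIGTERM and SIGKILL.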

Auth & visibility

isPublic: true — anyone can hit the service's public URL. No auth required. Use this for genuine public APIs.
isPublic: false — public URL still works, but every request needs an Authorization: Bearer gk-live-… from the author's account. Use this for internal tools / paid services / staging environments.
Billing. By default the author pays for inference run inside the container, which is why spendCapCents matters. To pass cost on to callers, deploy with pricing.inference_pass_through: false + a markup, or wrap the service in a paid plan via the dashboard.
Inside the container, requests carry x-gonkablocks-user-id (when authenticated) and x-gonkablocks-session-id. You can use these for app-level rate limiting or tenanting, even though the container is shared.
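App-level rate limiting keyed on the caller header could look like this. It is a sketch under assumptions: the limiter is a plain in-memory sliding window, header names are taken to arrive lowercased, and unauthenticated callers all share one "anonymous" bucket because the user id header is only present when the request is authenticated.

```python
from __future__ import annotations

import time
from collections import defaultdict, deque
from typing import Callable, Mapping


class SlidingWindowLimiter:
    """Allow at most `limit` calls per key within any `window_s` seconds."""

    def __init__(self, limit: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._hits: dict[str, deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        # Forget timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


def caller_key(headers: Mapping[str, str]) -> str:
    """Bucket by authenticated user id, else lump the caller into 'anonymous'."""
    return headers.get("x-gonkablocks-user-id") or "anonymous"
```

In a route you would call limiter.allow(caller_key(request.headers)) and return HTTP 429 when it is False. Because the container is shared and single-process, the in-memory state covers every caller, but it resets on each redeploy.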

Stop the service

# SESSION_ID is the "id" returned by your most recent deploy (or redeploy) call
curl -X DELETE https://blocks.gonka.gg/api/sessions/$SESSION_ID \
  -H "Authorization: Bearer $GONKA_API_KEY"
# {"ok":true}

DELETE is graceful — SIGTERM, then SIGKILL after ~10 s. The block stays published; future callers will see {"error":"service not deployed"} until you start it again.

Note the URL: services use the same /api/sessions/<id> endpoint as session blocks for stop/inspect. The /api/services/<blockId>/deploy endpoint is only for starting and redeploying.

Pitfalls I actually hit

Stdlib http.server drops POST bodies. Hardest one to debug: the platform's session-proxy treats Content-Length as a hop-by-hop header and forwards the body chunked. A typical stdlib BaseHTTPRequestHandler handler reads int(self.headers.get("content-length", 0)) bytes from self.rfile, and a missing header means 0, so your handler sees an empty body. Fix: use any real framework. FastAPI, Flask, Starlette, aiohttp, Express, Fastify: they all decode chunked encoding correctly.
Entrypoint that doesn't actually start a server. If you set entrypoint: python server.py and server.py just defines app = FastAPI() without starting uvicorn, the container exits cleanly with code 0. The service shows up as failed with the (cryptic) message container exited (code=0). Fix: use a CLI launcher in entrypoint (uvicorn server:app --host 0.0.0.0 --port 8080), or guard with if __name__ == "__main__": uvicorn.run(app, host="0.0.0.0", port=8080) + entrypoint: python server.py. A bare uvicorn.run(app) binds 127.0.0.1:8000, which matches neither 0.0.0.0 nor expose_port.
Trailing-slash 308 on the root path. GET /services/<you>/<slug>/ redirects to the no-slash version with HTTP 308. Browsers and most HTTP libraries follow it automatically; curl without --location and some scrapers stop at the redirect and may flag it. Cosmetic; tell callers to use the no-slash URL.
WebSocket upgrades aren't supported yet. Same as for sessions — the proxy is HTTP-only at the moment. Use SSE / streamed HTTP responses for live updates.
Spend cap silently breaks the service. Once the container hits its cap, inference calls 502 and the service may keep accepting requests but return errors. Set a spendCapCents with enough headroom for a day of expected traffic, or wire up a worker that bumps it via the API.
Cold starts on first hit. Even with the container in running, the very first request can be slower (cold filesystem caches, Python modules imported lazily on first use). For latency-sensitive endpoints, ship a tiny in-app warmup step that hits a no-op route a few times after deploy.
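To make the first pitfall concrete, here is what a chunked body looks like on the wire and how it decodes. The sketch is only meant to show why a handler that trusts Content-Length sees nothing; in practice the fix remains "use a real framework". read_chunked and read_body are hypothetical helpers, and headers is assumed to be a mapping like the one BaseHTTPRequestHandler exposes.

```python
import io
from typing import BinaryIO, Mapping


def read_chunked(rfile: BinaryIO) -> bytes:
    """Decode an HTTP/1.1 chunked body from a binary stream."""
    body = bytearray()
    while True:
        size_line = rfile.readline()
        if not size_line:
            raise ValueError("stream ended before the terminating chunk")
        # The size is hexadecimal; anything after ';' is a chunk extension.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            break
        body += rfile.read(size)
        rfile.readline()  # consume the CRLF that follows the chunk data
    # Skip optional trailer fields up to the blank line (or end of stream).
    while rfile.readline() not in (b"\r\n", b"\n", b""):
        pass
    return bytes(body)


def read_body(headers: Mapping[str, str], rfile: BinaryIO) -> bytes:
    """Read a request body, honouring Transfer-Encoding: chunked if present."""
    if "chunked" in headers.get("Transfer-Encoding", "").lower():
        return read_chunked(rfile)
    # The naive path: with no Content-Length this reads 0 bytes.
    return rfile.read(int(headers.get("Content-Length", 0)))


wire = io.BytesIO(b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n")
decoded = read_body({"Transfer-Encoding": "chunked"}, wire)
```

With neither header present, read_body returns an empty bytes object, which is exactly the "empty body" symptom described above.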

Next steps

Make this service a paid product? See pricing in the manifest reference.
Trigger it on a schedule from the platform? Combine with a worker that curls an endpoint every N minutes.
Compose this service into a larger pipeline? Workflow blocks.