bend — Lumbda's GPU dispatch primitive

bend is a Lumbda primitive that decides per call whether to evaluate locally or ship to a CUDA worker over our wire protocol. Tiny inputs stay local; heavy inputs bend to a worker that holds a warm CUDA context across requests. The decision uses a cost estimator on the argument shape, not the operation name.

Start a GPU worker

# On any host with nvcc + a CUDA-capable GPU:
make gpu-worker
# → builds examples/cuda-fanout/shake256-fanout
# → builds the C tier (~10× faster wire orchestration than Python)
# → launches gpu-worker.lsp on TWO ports:
#     8320  — wire-TCP protocol (native callers: bend.lsp, asm clients)
#     8321  — HTTP/1.1 + CORS  (browser callers: the playground/REPL tabs)

# Port 8320 = BEND mnemonic:
#   8 ~= B (implied infinity B flattened; bake a cake; baby & me)
#   3 ~= E (backward)
#   2 ~= N (pivoted 90 degrees)
#   0 ~= D (flattened)
# Port 8321 = 8320 + 1 (HTTP is always wire-port + 1, override via --http-port).

# Override tier (default port stays 8320 / BEND):
make gpu-worker LUMBDA=python             # easier debugging
make gpu-worker LUMBDA=asm                # smallest footprint
# Override port too (only when running a second worker on the same host):
make gpu-worker LUMBDA=python PORT=9320   # second worker (wire 9320, http 9321)

Run your own bend → use it from a browser tab

Browsers cannot open raw TCP sockets, but they can fetch / XHR to http://localhost:<port> even from an HTTPS page (a permissive "potentially trustworthy" exception every major browser ships). So the playground field accepts a URL that points at a bend worker running on your own machine. No fox-owned ingress, no certificate, no proxy hop.

# 1. Start a worker on your laptop / desktop / homelab
git clone https://lumbda.com/lumbda.git && cd lumbda
make gpu-worker LUMBDA=python   # CPU works; CUDA path lights up if nvcc is present
# → "gpu-worker listening on port 8320"
# → "gpu-worker HTTP listening on port 8321"

# 2. In a separate shell, smoke-test the HTTP path
curl -sS -X POST -H 'Content-Type: text/plain' \
    --data-binary '(ping)' http://localhost:8321/
# → (ok pong)

# 3. Open https://lumbda.com/playground/ and paste into the ⚡ bend field:
#      http://localhost:8321/
# Save, then in the REPL:
#      λ> (bend!-call '(ping))

Both ports share the same handle-request dispatcher: whatever an op-head (ping, echo, cuda-shake-fanout, cuda-sim-ops-bin, …) returns over wire is what HTTP returns. Adding a new op-head exposes it on both transports automatically.

HTTP scope today is the S-expression text path — for browser-friendly recipes the server understands today, with no binary blobs crossing the wire. Binary modes (BSHK / BCGB / BSCP / BSRT / BSB3) stay wire-only because they exist for native callers who already hold the binary locally.

Phase 2 (deferred until authentication lands): upload an .lsp recipe describing a champion circuit, dispatch into a server-side factory that compiles the .bin on the worker's filesystem and bends it. Same playground field, same single endpoint. Auth gates write access; today a worker on the public internet would let any caller occupy your GPU.

Call it from any tier

;; bend works on every tier — Python, C, asm — through the same
;; tcp-* + portal primitives lumbda already ships.
(load "examples/cuda-fanout/wire.lsp")
(load "examples/cuda-fanout/bend.lsp")
(load "examples/cuda-fanout/bend-macros.lsp")  ; Python/C only — asm uses bend-call

;; Tiny — cost below threshold, evaluates locally
(bend (cuda-shake-fanout '("00" "01" "deadbeef") 32))

;; Heavy — cost above threshold, ships to the GPU worker
(bend (cuda-shake-fanout one-million-inputs 32))

Wire protocol

Two modes: S-expression text (the default) and binary (magic BSHK header + raw bytes). Binary mode bypasses S-expression parsing entirely.

workload	Py S-exp	Py binary	C S-exp	C binary
100 × 16 B	3.43 ms	0.74 ms	0.40 ms	0.15 ms
1k × 16 B	23.24 ms	0.76 ms	2.77 ms	0.22 ms
10k × 16 B	218.82 ms	1.27 ms	CLIFF	0.88 ms
100k × 16 B	2,219 ms	10.18 ms	CLIFF	10.35 ms
1M × 16 B	23,811 ms	159 ms	CLIFF	157 ms

Binary mode wins by 30–200× over S-expression at scale; at 1 M × 16 B inputs C tier binary is 157 ms vs 23,811 ms for S-exp, and bend beats host hashlib by ~12×. The CUDA toolchain stays isolated to the leaf binary the worker spawns — no tier links libcudart; asm tier hosts workers through hand-written pipe2 + fork + execve syscalls.

Fleet

Production default: single host. Multi-host fan-out is built (round-robin in bend.lsp via *bend-workers* + BEND_WORKERS env), available on demand for long-running parallel workloads — not used routinely.

host	GPU	arch	port	status
`3090-ai.foxhop.net`	RTX 3090 (24 GB)	sm_86	8320	active — production worker
`ai.foxhop.net`	RTX 4090 (24 GB)	sm_89	8320	active — 3-node mesh
`cammy.foxhop.net`	Tesla P40 (24 GB)	sm_61	8320	active — 3-node mesh (Pascal)

3-node mesh active. 3090 + 4090 + P40 all serve bend workloads on port 8320 (BEND mnemonic). Round-robin or explicit selection via (bend-set-workers! '(("3090-ai.foxhop.net" . 8320) ("ai.foxhop.net" . 8320) ("cammy.foxhop.net" . 8320))). Cost-routed coordinator pattern: greedy bin-pack candidates by predicted cost; per-host wall factor 1.00 / 0.51 / 1.84 (3090 / 4090 / P40 — Pascal lags but contributes a third concurrent stream); one worker thread per node fires concurrently against our wire protocol. Each node serializes its own queue (workers do not multiplex requests cleanly inside one CUDA context). Full hardware table, bench numbers, & LLM-coexistence detail at foxhop gpu-mesh.

Worker health heartbeat

Every dispatch goes through a lazy-refresh health probe so bend never wastes a payload on a worker that is down, overloaded, or VRAM-starved. The worker exposes a (health) op returning measured numbers; the client caches results & ranks healthy peers by free VRAM descending.

Worker side — `(health)`

; over the same wire as any other op:
(health)
→ (ok (load-avg 0.42)
         (vram-free-mb 22777)
         (uptime-ms 1780752288371))

load-avg — 1-minute load average from /proc/loadavg
vram-free-mb — free VRAM from nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits; 0 on a host without an NVIDIA GPU
uptime-ms — current wallclock (millisecond), so a worker that hangs & restarts between probes shows a uptime jump

Backward compatible: older workers without a (health) handler return (error (unknown-op health)) — the client treats that as ok with vram-free-mb=0 so a pre-heartbeat build still ranks as available.

Client side — `worker-health` cache

An alist keyed by "host:port" maps each known peer to a record (last-checked-ms status vram-mb) where status ∈ {ok, down}.

Cache TTL on ok: 5 s. A fresh probe runs only when a cache entry ages past this; idle bursts never re-probe.
Cooldown on down: 30 s. A worker that fails an (health) call or a regular dispatch gets skipped for half a minute before a re-probe, so a transient failure costs at most one call.
VRAM-ranked pick. When more than one peer caches as ok, bend-pick-worker sorts by vram-free-mb descending & returns the top entry. Long-running jobs route to whichever box carries the most spare GPU memory.
Round-robin fallback. When every worker is in down-cooldown, the client falls back to plain round-robin so a recovering box still gets a real attempt rather than refusing dispatch outright.

Failure handling

Mid-flight failures in bend-dispatch-to-gpu — tcp-connect raising on a connect-refused, the worker dropping a connection mid-reply, a malformed reply — flip the worker to down. The next pick skips it for the cooldown window. A single failed call never aborts a multi-worker iteration: probes wrap in with-exception-handler so a dead peer surfaces as 'transport-fail rather than propagating up.

Form catalog

A form earns a slot here only after we have published a benchmark or measured one on our hardware. "I think this would be fast" does not earn a slot — the form-status column says planned until numbers exist.

Status legend

live — binary built, worker dispatches it, numbers recorded on our hardware
surveyed — published benchmark cited, prototype binary not yet wrapped

Live forms

form	hardware	throughput	wire shape
`cuda-shake-fanout`	RTX 3090	12× host hashlib at 1 M × 16 B inputs	`(cuda-shake-fanout '(hex ...) out-bytes)` + `BSHK` binary
`cuda-sim-ops-bin`	RTX 3090	1.27× at 141 batches (9024 shots); crossover ~115 batches; SHAKE-RNG mode matches upstream Pareto bit-for-bit (Σ Toffoli = 15,999,651,264, avg Toffoli = 1,773,011.000 on stock `ops.bin`). Kernel wall scales linearly with Σ Toffoli across circuit variants — foxhop secp256k1 lever sweep at p=251 (n+1=9 secp256k1-toy) over 6 lumbda-emitted ops.bin variants ran 9.83× wall reduction Fermat-schoolbook → refined-Solinas, 261.4 ms → 26.6 ms / batch at 1024 shots; matches 6.95× Σ Toffoli ratio + parallel Clifford drop. No scheduler surprise: pick any QECCOPS1 circuit, predict GPU wall from a cheap CPU op-counter.	`(cuda-sim-ops-bin "path/to/ops.bin" n-batches)` — `demo_ops` defaults `--rng-mode shake` (Fiat-Shamir over op stream); `--rng-mode lfsr` keeps legacy xorshift for debug parity
`cuda-sim-axis-flip`	RTX 3090	217 Mops/s @ K=32 M=4 (many-candidates × few-shots)	`(cuda-sim-axis (variant-paths ...) n-shots)`
`cuda-bignum-cgbn`	RTX 3090	1.28 Gops/s kernel mod-mul @ n=1M (256-bit, ~256× GMP CPU); 9 ops	`BCGB` binary: op_id + bitwidth + n + modulus + a + b
`cuda-secp256k1-batched-mul`	RTX 3090	13.83 Mkeys/s @ n=1M (~309× coincurve CPU; windowed-G ladder w=4)	`BSCP` binary: scalars + base-point → `BSCR` points
`cuda-radix-sort`	RTX 3090	5.50 Gkeys/s @ n=10M (kernel 1.82 ms); 3.77 Gkeys/s @ n=1M; CUB DeviceRadixSort u64 ascending	`BSRT` binary: op_id + n + u64[n] → `BSRR` sorted u64[n]
`cuda-blake3-tree`	RTX 3090	32.5 GB/s @ 1M × 64 B (kernel 1.97 ms); 20.4 GB/s @ 1k × 1 MB (kernel 51.5 ms); byte-identical to BLAKE3 reference spec across single-chunk + multi-chunk paths	`BSB3` binary: out_bytes + n + (u32 len + bytes) per input → `BSR3` n + out_bytes + digests

Surveyed forms (Wave 1)

form	throughput	hardware	ref
`cuda-secp256k1-batched-mul`	6.5 Gkeys/s VanitySearch RTX 4090; 8.6 Gkeys/s RTX 5090; 2.65 Gkeys/s RTX 3080	(promoted — see Live)	gECC · Bitcrack
`cuda-bignum-cgbn`	100×+ on dense mul vs Xeon-20c + GMP + OpenMP	V100	NVlabs CGBN · midsize-int
`cuda-rho-pollard-walk`	87.7 M ops/sec for ECCp79	RTX 2070 Super	atlomak · oritwoen
`cuda-clifford-stabilizer`	186× over Stim (CPU SOTA) on equivalence-checking	STABSim	STABSim · Qimax
`cuda-bernstein-yang-inv`	3–10× per inversion over Fermat on CPU; novel territory on GPU	—	safegcd · Jumping
`cuda-ntt-poly`	up to 123× over CPU; 21× on RTX 3070	RTX 3070	NTTSuite · FHE NTT
`cuda-radix-sort`	1.4 G keys/sec; 20–50× over CPU merge sort; 257× vs Xeon Phi for scan	(promoted — see Live)	CUB · Onesweep

Surveyed forms (Wave 2)

Sorted by reported speedup descending.

form	throughput	hardware	ref
`cuda-minhash-weighted`	600–1000× vs numpy+MKL	Titan X vs Xeon E5-1650	src-d/minhashcuda
`cuda-cuckoo-filter`	378× insert, 258× delete	A100	arXiv:2603.15486
`cuda-aes-ctr-chacha20`	211–400 GB/s	single GPU	AsyncGBP
`cuda-suffix-array-skew`	30–242× vs CPU SA-IS	Tesla K20	Liu/Luo
`cuda-kdtree-build`	30–242× build, 1.6–200× kNN	RTX (RT cores)	Zhou et al.
`cuda-sat-paraFROST-elim`	93× peak, 48× avg variable elim	NVIDIA + Kissat baseline	ParaFROST
`cuda-aho-corasick-pfac`	~50–100× IDS pkt-inspect	GTX-class	PFAC
`cuda-dilithium-pqsig`	57.7× keygen+sign+verify vs single CPU thread	RTX 3090 Ti	IACR 2024/1365
`cuda-mc-options-pricing`	25–152×	Tesla C1060 / modern	GPU Gems Ch.45
`cuda-cuFFT-batched-1D`	8–32× vs MKL; tcFFT 1.1–3.2× vs cuFFT	V100 / A100	tcFFT
`cuda-blake3-tree`	32.5 GB/s @ 1M × 64 B; 20.4 GB/s @ 1k × 1 MB on 3090	(promoted — see Live)	Blaze-3
`cuda-bloom-filter-modern`	~6× CPU; 3.4 B inserts/s	B200 / Perlmutter	arXiv:2512.15595
`cuda-gemm-batched-FP8`	4.8× FP8 vs A100; 716 TFLOPS H100	H100 SXM	cuBLAS 12.0
`cuda-batched-matrix-inverse`	4.3–16.8× vs MAGMA	P100	Superfri 2018
`cuda-hash-join-radix`	4 B tuples/s single; 1.8 T tuples/s on 1024 A100	A100 cluster	ADMS-21
`cuda-kmer-count`	4–6× vs KMC2	RapidGKC, Gerbil	RapidGKC
`cuda-cgraph-traversal`	38 B TEPS	DGX2	cuGraph
`cuda-triangle-count-TRUST`	~1 T TEPS	multi-A100	TRUST
`cuda-ldpc-bp-decoder`	10 Gbps with early-termination	GPGPU	MDPI Electronics 2022
`cuda-nvcomp-zstd`	2.2× zstd; 1.4× LZ4; 1.9× snappy	H100 / A100	nvCOMP

Surveyed forms (Wave 3)

form	throughput	hardware	ref
`cuda-fluidx3d-lbm`	100–200× vs ANSYS Fluent; 8,799 MLUPS single A100	A100	FluidX3D
`cuda-mfcc-spectral`	~97× CPU MFCC; STFT ~75× via cuSignal	GTX 580 / RTX 30-series	cuSignal
`cuda-batched-lp-simplex`	95× over CPLEX; 5× over GLPK	GTX 980-class	arXiv 1802.08557
`cuda-betweenness-centrality-weighted`	30–150× warp-centric weighted BC	GTX onwards	arXiv 1701.05975
`cuda-cudasift-orb-ransac`	~60× SIFT CPU→GPU; ORB 11.3×	GTX 1060+	CudaSift
`cuda-cudasw-gasal2`	CUDASW++4.0 16.2×; 5.71 TCUPS on H100	H100	CUDASW++4.0
`cuda-loopy-bp-mrf`	45× over CPU LBP for stereo MRF	GTX 280+	arXiv 2509.22337
`cuda-sgm-stereo`	42 fps @ 640×480, 128 disparities	Tegra X1 / discrete	arXiv 1610.04121
`cuda-hungarian-lap`	10–50×; 400 M-variable LAP in ~13 s	NVIDIA GPU	ScienceDirect
`cuda-msm-bls12-381`	27.86× over Pippenger AVX baseline	A100 / RTX 4090	SimdMSM TCHES
`cuda-pdwt-lifting`	15.9× over best optimized CPU DWT	GTX / Tesla	PDWT
`cuda-tensornet-contract`	8–20× vs CuPy; tensor QR ~100× vs Xeon	A100	cuTensorNet
`cuda-ega-gpu-aggregation`	6.45–29.12× multi-pass; group-by 19.4×	NVIDIA GPU	VLDB Top-k EGA
`cuda-bicgstab-ilu-spmv`	SpTRSV 10.7×; BiCGSTAB 3.2× vs cuSPARSE	V100 / MI210	arXiv 2508.04917
`cuda-icicle-snark-groth16`	fastest Groth16 today; NTT 91% of prover	RTX 4090 / A100	ICICLE-Snark

Surveyed forms (Wave 4)

form	throughput	hardware	ref
`cuda-kalman-batched`	1386× for 5000-component measurements	various	CUDAkalmanFilter
`cuda-g6k-tensor-sieve`	1230× vs G6K CPU sieve at dim 120; SVP record dim 180 on 4 Turing GPUs	4 Turing	Ducas/Stevens/van Woerden EC 2021
`cuda-zkspeed-sumcheck-hyperplonk`	801× geomean over CPU; sumcheck 8.4 s → 9.5 ms	full-chip accelerator	zkSpeed HPCA 2025
`cuda-kyber-batched-ntt`	~451× batched (B=65k); HI-Kyber 6.47× over prior GPU SOTA	RTX 3080	HI-Kyber
`cuda-ironman-ote`	237× OT throughput vs full-thread CPU	near-memory variant	Ironman arXiv 2507.16391
`cuda-cudss-cholesky`	>100× vs QDLDL; 20× vs CHOLMOD factor	NVIDIA	NVIDIA cuDSS
`cuda-particle-filter`	~150× absolute (5000 particles @ 170 Hz)	GPGPU	EURASIP J ASP 2013
`cuda-rk-stiff-chemkin`	126× vs single-core; 25× vs 6-core for hydrogen RKCK 524k ODEs	GPGPU	Niemeyer & Sung
`cuda-fem-assembly-jit`	87× assembly vs serial CPU; 126× peak numerical integration	GPGPU	Mironov et al.
`cuda-cufalcon-sign`	201k sig/s Falcon-512 on A100; verify 2.72M sig/s, 29.5× vs AVX2	A100	cuFalcon eprint 2025/249
`cuda-cudahull-3d`	30–40× over Qhull CPU	NVIDIA	CudaHull CAG 2012
`cuda-fastplay-garbled`	35–40× over serial garbling on GPU cluster	GPU cluster	Fastplay eprint 2011/097
`cuda-air-fri`	~22.8× avg end-to-end ZK speedup; FRI commitment	GPGPU	Air-FRI SAC 2025
`cuda-rabin-fingerprint`	16× over single-thread CPU; 40 Gbps absolute	GTX 780 (HARENS)	HARENS CloudCom 2016
`cuda-gdel3d`	10× over CGAL 3D Delaunay; 70× Voronoi/jump-flood at 10M points	NVIDIA	gDel3D I3D 2014
`cuda-perasure-crs`	10× vs multithread Jerasure; 10 GB/s on GTX780 absolute	GTX 780	PErasure IEEE Cluster 2015
`cuda-piranha-mpc`	4× vs CryptGPU on VGG16 private inference; full 3/4-party stacks single-GPU	single GPU	Piranha USENIX Sec 2022
`cuda-scamp-matrix-profile`	quintillion pairwise comparisons / day (absolute)	GPGPU	SCAMP
`cuda-cudtw-subseq`	2–3 orders of magnitude over UCR-Suite CPU; soft-DTW up to 5000×	Volta	cuDTW++ Euro-Par 2020
`cuda-terachem-dft`	1–2 orders of magnitude over CPU; 8–50× vs GAMESS on 256-core cluster	4× Tesla	TeraChem

Wave 4 filter-outs: MAFFT MSA (11–20×, surpassed), RAxML likelihood (32× kernel only, ~3–10× end-to-end), AmgX CG (3–4× vs AmgX baseline), BVH Karras LBVH (2–3× over prior GPU LBVH), LDPC decode (40–160 Mbps, not ≥10× over modern SIMD CPU), mesh decimation (application-dependent), discrete Gaussian sampler (single-digit % gains; fold into Kyber NTT). Revisit when the published number changes.

Why a form earns its slot

A form is GPU-worth-it when at least one of:

Embarrassingly parallel. N independent items, no cross-item dependency.
Dense, branch-free inner loop. Same operation on every element.
Reduction-friendly. Tree-reduce / prefix-sum / parallel-scan patterns.
Big batch amortizes fixed kernel overhead.

When none of these hold, find a different decomposition: parallelize on a different axis, or stay on CPU & fan out across fleet hosts.

Source & specs

examples/cuda-fanout/ — wire contract, daemon protocol, bench data, per-tier integration.
CATALOG.md — canonical source for form metadata.