bend is a Lumbda primitive that decides per call whether to evaluate locally or ship to a CUDA worker over our wire protocol. Tiny inputs stay local; heavy inputs bend to a worker that holds a warm CUDA context across requests. The decision uses a cost estimator on the argument shape, not the operation name.
Start a GPU worker
# On any host with nvcc + a CUDA-capable GPU:
make gpu-worker
# → builds examples/cuda-fanout/shake256-fanout
# → builds the C tier (~10× faster wire orchestration than Python)
# → launches gpu-worker.lsp on port 9091
# Override tier or port:
make gpu-worker LUMBDA=python PORT=9001 # easier debugging
make gpu-worker LUMBDA=asm # smallest footprint
Call it from any tier
;; bend works on every tier — Python, C, asm — through the same
;; tcp-* + portal primitives lumbda already ships.
(load "examples/cuda-fanout/wire.lsp")
(load "examples/cuda-fanout/bend.lsp")
(load "examples/cuda-fanout/bend-macros.lsp") ; Python/C only — asm uses bend-call
;; Tiny — cost below threshold, evaluates locally
(bend (cuda-shake-fanout '("00" "01" "deadbeef") 32))
;; Heavy — cost above threshold, ships to the GPU worker
(bend (cuda-shake-fanout one-million-inputs 32))
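The local-versus-worker decision described above can be sketched in Python. This is only an illustration of deciding from argument shape rather than operation name: `estimate_cost`, `should_bend`, and the `THRESHOLD_BYTES` value are hypothetical, and the real estimator and threshold may weigh inputs differently.

```python
# Hypothetical sketch of shape-based dispatch; not the actual Lumbda estimator.
THRESHOLD_BYTES = 64 * 1024  # assumed cutoff, chosen only for illustration


def estimate_cost(hex_inputs, out_bytes):
    """Estimate work from argument shape: input bytes plus output bytes."""
    input_bytes = sum(len(h) // 2 for h in hex_inputs)  # two hex chars per byte
    return input_bytes + len(hex_inputs) * out_bytes


def should_bend(hex_inputs, out_bytes, threshold=THRESHOLD_BYTES):
    """Return True when the estimated cost justifies shipping to a worker."""
    return estimate_cost(hex_inputs, out_bytes) > threshold


tiny = ["00", "01", "deadbeef"]
heavy = ["00" * 16] * 1_000_000  # one million 16-byte inputs

print(should_bend(tiny, 32))   # small shape stays local
print(should_bend(heavy, 32))  # large shape goes to the worker
```

The function never inspects which operation is being called, only how much data it would move, which mirrors the "argument shape, not the operation name" rule.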
Wire protocol
Two modes: S-expression text (the default) and binary (magic BSHK header + raw bytes). Binary mode bypasses S-expression parsing entirely.
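To show why binary mode can skip S-expression parsing, here is a minimal Python sketch of a length-prefixed frame. Only the BSHK magic comes from the text. The field order (u32 out_bytes, u32 count, then a u32 length plus raw bytes per input) and the little-endian encoding are assumptions, loosely modeled on the BSB3 shape listed in the form catalog; the real layout is defined in examples/cuda-fanout/.

```python
import struct

MAGIC = b"BSHK"  # magic named in the text; everything after it is an assumed layout


def encode_frame(inputs, out_bytes):
    """Pack inputs as: magic, u32 out_bytes, u32 count, then (u32 len + bytes) each."""
    parts = [MAGIC, struct.pack("<II", out_bytes, len(inputs))]
    for item in inputs:
        parts.append(struct.pack("<I", len(item)))
        parts.append(item)
    return b"".join(parts)


def decode_frame(frame):
    """Inverse of encode_frame; raises ValueError on a bad magic."""
    if frame[:4] != MAGIC:
        raise ValueError("not a BSHK frame")
    out_bytes, count = struct.unpack_from("<II", frame, 4)
    offset = 12  # 4 bytes of magic plus two u32 fields
    inputs = []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", frame, offset)
        offset += 4
        inputs.append(frame[offset:offset + length])
        offset += length
    return inputs, out_bytes


frame = encode_frame([bytes.fromhex("00"), bytes.fromhex("deadbeef")], 32)
print(len(frame), decode_frame(frame))
```

Decoding is a fixed sequence of offset reads with no tokenizing or hex conversion, which is the kind of work the S-expression path has to do for every input.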
| workload | Py S-exp | Py binary | C S-exp | C binary |
|---|---|---|---|---|
| 100 × 16 B | 3.43 ms | 0.74 ms | 0.40 ms | 0.15 ms |
| 1k × 16 B | 23.24 ms | 0.76 ms | 2.77 ms | 0.22 ms |
| 10k × 16 B | 218.82 ms | 1.27 ms | CLIFF | 0.88 ms |
| 100k × 16 B | 2,219 ms | 10.18 ms | CLIFF | 10.35 ms |
| 1M × 16 B | 23,811 ms | 159 ms | CLIFF | 157 ms |
Binary mode beats S-expression mode by roughly 30–200× at scale: on the Python tier, 1 M × 16 B inputs take 159 ms in binary versus 23,811 ms as S-expressions, and the C tier's binary mode handles the same workload in 157 ms. At that size bend beats host hashlib by ~12×. The CUDA toolchain stays isolated to the leaf binary the worker spawns: no tier links libcudart, and the asm tier hosts workers through hand-written pipe2 + fork + execve syscalls.
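The speedup range can be rechecked from the table. The short Python sketch below divides the Python-tier S-expression timings by the Python-tier binary timings; the numbers are copied from the table and the variable names are illustrative only.

```python
# Timings in milliseconds, copied from the Python columns of the table above.
py_sexp = {"100": 3.43, "1k": 23.24, "10k": 218.82, "100k": 2219.0, "1M": 23811.0}
py_binary = {"100": 0.74, "1k": 0.76, "10k": 1.27, "100k": 10.18, "1M": 159.0}

# Ratio of S-expression time to binary time for each workload size.
speedups = {n: py_sexp[n] / py_binary[n] for n in py_sexp}

for n, ratio in speedups.items():
    print(f"{n:>4} x 16 B: {ratio:6.1f}x")
```

The ratio is under 5× at 100 inputs, passes 30× at 1k inputs, and stays in the hundreds from 10k inputs upward, so the advantage only appears once per-input parsing dominates the fixed round-trip cost.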
Fleet
Production default: single host. Multi-host fan-out is implemented (round-robin in bend.lsp, configured via *bend-workers* and the BEND_WORKERS environment variable) and is available on demand for long-running parallel workloads, but it is not used routinely.
| host | GPU | arch | port | status |
|---|---|---|---|---|
| 3090-ai.foxhop.net | RTX 3090 (24 GB) | sm_86 | 9091 | active (production worker) |
| ai.foxhop.net | RTX 4090 (24 GB) | sm_89 | 9092 | reserved for qwen LLM (llama.cpp); bend worker enabled per workload |
Multi-host fan-out was validated at 1.6× aggregate throughput on small workloads, but routine round-robin against the 4090 would take VRAM away from qwen. Callers therefore opt in explicitly when a workload justifies fan-out: (bend-set-workers! '(("3090-ai.foxhop.net" . 9091) ("ai.foxhop.net" . 9092))) brings the 4090 online for that call.
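Round-robin selection over a worker list can be sketched in a few lines of Python. The `host:port,host:port` format assumed here for BEND_WORKERS is a guess, and `parse_workers` and `round_robin` are hypothetical names; the actual parsing and rotation live in bend.lsp.

```python
import itertools
import os


def parse_workers(spec):
    """Parse 'host:port,host:port' into (host, port) tuples. Format is assumed."""
    workers = []
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.rpartition(":")
        workers.append((host, int(port)))
    return workers


def round_robin(workers):
    """Return an endless iterator that hands out workers in rotation."""
    return itertools.cycle(workers)


# Fall back to the single production worker when BEND_WORKERS is unset or empty.
spec = os.environ.get("BEND_WORKERS") or "3090-ai.foxhop.net:9091"
rotation = round_robin(parse_workers(spec))
print(next(rotation))
```

With a single entry the rotation always returns the same worker, which matches the single-host production default; adding a second entry alternates calls between the two hosts.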
Form catalog
A form earns a slot here only once a published benchmark can be cited for it or we have measured it on our own hardware. "I think this would be fast" does not earn a slot; the form-status column says planned until numbers exist.
Status legend
- live: binary built, worker dispatches it, numbers recorded on our hardware
- surveyed: published benchmark cited, prototype binary not yet wrapped
Live forms
| form | hardware | throughput | wire shape |
|---|---|---|---|
| cuda-shake-fanout | RTX 3090 | 12× host hashlib at 1 M × 16 B inputs | (cuda-shake-fanout '(hex ...) out-bytes) + BSHK binary |
| cuda-sim-ops-bin | RTX 3090 | 1.07× at 128 batches; crossover ~115 batches | (cuda-sim-ops-bin "path/to/ops.bin" n-batches) |
| cuda-sim-axis-flip | RTX 3090 | 217 Mops/s @ K=32 M=4 (many-candidates × few-shots) | (cuda-sim-axis (variant-paths ...) n-shots) |
| cuda-bignum-cgbn | RTX 3090 | 1.28 Gops/s kernel mod-mul @ n=1M (256-bit, ~256× GMP CPU); 9 ops | BCGB binary: op_id + bitwidth + n + modulus + a + b |
| cuda-secp256k1-batched-mul | RTX 3090 | 13.83 Mkeys/s @ n=1M (~309× coincurve CPU; windowed-G ladder w=4) | BSCP binary: scalars + base-point → BSCR points |
| cuda-radix-sort | RTX 3090 | 5.50 Gkeys/s @ n=10M (kernel 1.82 ms); 3.77 Gkeys/s @ n=1M; CUB DeviceRadixSort u64 ascending | BSRT binary: op_id + n + u64[n] → BSRR sorted u64[n] |
| cuda-blake3-tree | RTX 3090 | 32.5 GB/s @ 1M × 64 B (kernel 1.97 ms); 20.4 GB/s @ 1k × 1 MB (kernel 51.5 ms); byte-identical to BLAKE3 reference spec across single-chunk + multi-chunk paths | BSB3 binary: out_bytes + n + (u32 len + bytes) per input → BSR3 n + out_bytes + digests |
Surveyed forms (Wave 1)
| form | throughput | hardware | ref |
|---|---|---|---|
| cuda-secp256k1-batched-mul | 6.5 Gkeys/s VanitySearch RTX 4090; 8.6 Gkeys/s RTX 5090; 2.65 Gkeys/s RTX 3080 | (promoted; see Live) | gECC · Bitcrack |
| cuda-bignum-cgbn | 100×+ on dense mul vs Xeon-20c + GMP + OpenMP | V100 | NVlabs CGBN · midsize-int |
| cuda-rho-pollard-walk | 87.7 M ops/sec for ECCp79 | RTX 2070 Super | atlomak · oritwoen |
| cuda-clifford-stabilizer | 186× over Stim (CPU SOTA) on equivalence-checking | STABSim | STABSim · Qimax |
| cuda-bernstein-yang-inv | 3–10× per inversion over Fermat on CPU; novel territory on GPU | n/a | safegcd · Jumping |
| cuda-ntt-poly | up to 123× over CPU; 21× on RTX 3070 | RTX 3070 | NTTSuite · FHE NTT |
| cuda-radix-sort | 1.4 G keys/sec; 20–50× over CPU merge sort; 257× vs Xeon Phi for scan | (promoted; see Live) | CUB · Onesweep |
Surveyed forms (Wave 2)
Sorted by reported speedup descending.
| form | throughput | hardware | ref |
|---|---|---|---|
| cuda-minhash-weighted | 600–1000× vs numpy+MKL | Titan X vs Xeon E5-1650 | src-d/minhashcuda |
| cuda-cuckoo-filter | 378× insert, 258× delete | A100 | arXiv:2603.15486 |
| cuda-aes-ctr-chacha20 | 211–400 GB/s | single GPU | AsyncGBP |
| cuda-suffix-array-skew | 30–242× vs CPU SA-IS | Tesla K20 | Liu/Luo |
| cuda-kdtree-build | 30–242× build, 1.6–200× kNN | RTX (RT cores) | Zhou et al. |
| cuda-sat-paraFROST-elim | 93× peak, 48× avg variable elim | NVIDIA + Kissat baseline | ParaFROST |
| cuda-aho-corasick-pfac | ~50–100× IDS pkt-inspect | GTX-class | PFAC |
| cuda-dilithium-pqsig | 57.7× keygen+sign+verify vs single CPU thread | RTX 3090 Ti | IACR 2024/1365 |
| cuda-mc-options-pricing | 25–152× | Tesla C1060 / modern | GPU Gems Ch.45 |
| cuda-cuFFT-batched-1D | 8–32× vs MKL; tcFFT 1.1–3.2× vs cuFFT | V100 / A100 | tcFFT |
| cuda-blake3-tree | 32.5 GB/s @ 1M × 64 B; 20.4 GB/s @ 1k × 1 MB on 3090 | (promoted; see Live) | Blaze-3 |
| cuda-bloom-filter-modern | ~6× CPU; 3.4 B inserts/s | B200 / Perlmutter | arXiv:2512.15595 |
| cuda-gemm-batched-FP8 | 4.8× FP8 vs A100; 716 TFLOPS H100 | H100 SXM | cuBLAS 12.0 |
| cuda-batched-matrix-inverse | 4.3–16.8× vs MAGMA | P100 | Superfri 2018 |
| cuda-hash-join-radix | 4 B tuples/s single; 1.8 T tuples/s on 1024 A100 | A100 cluster | ADMS-21 |
| cuda-kmer-count | 4–6× vs KMC2 | RapidGKC, Gerbil | RapidGKC |
| cuda-cgraph-traversal | 38 B TEPS | DGX2 | cuGraph |
| cuda-triangle-count-TRUST | ~1 T TEPS | multi-A100 | TRUST |
| cuda-ldpc-bp-decoder | 10 Gbps with early-termination | GPGPU | MDPI Electronics 2022 |
| cuda-nvcomp-zstd | 2.2× zstd; 1.4× LZ4; 1.9× snappy | H100 / A100 | nvCOMP |
Surveyed forms (Wave 3)
| form | throughput | hardware | ref |
|---|---|---|---|
| cuda-fluidx3d-lbm | 100–200× vs ANSYS Fluent; 8,799 MLUPS single A100 | A100 | FluidX3D |
| cuda-mfcc-spectral | ~97× CPU MFCC; STFT ~75× via cuSignal | GTX 580 / RTX 30-series | cuSignal |
| cuda-batched-lp-simplex | 95× over CPLEX; 5× over GLPK | GTX 980-class | arXiv 1802.08557 |
| cuda-betweenness-centrality-weighted | 30–150× warp-centric weighted BC | GTX onwards | arXiv 1701.05975 |
| cuda-cudasift-orb-ransac | ~60× SIFT CPU→GPU; ORB 11.3× | GTX 1060+ | CudaSift |
| cuda-cudasw-gasal2 | CUDASW++4.0 16.2×; 5.71 TCUPS on H100 | H100 | CUDASW++4.0 |
| cuda-loopy-bp-mrf | 45× over CPU LBP for stereo MRF | GTX 280+ | arXiv 2509.22337 |
| cuda-sgm-stereo | 42 fps @ 640×480, 128 disparities | Tegra X1 / discrete | arXiv 1610.04121 |
| cuda-hungarian-lap | 10–50×; 400 M-variable LAP in ~13 s | NVIDIA GPU | ScienceDirect |
| cuda-msm-bls12-381 | 27.86× over Pippenger AVX baseline | A100 / RTX 4090 | SimdMSM TCHES |
| cuda-pdwt-lifting | 15.9× over best optimized CPU DWT | GTX / Tesla | PDWT |
| cuda-tensornet-contract | 8–20× vs CuPy; tensor QR ~100× vs Xeon | A100 | cuTensorNet |
| cuda-ega-gpu-aggregation | 6.45–29.12× multi-pass; group-by 19.4× | NVIDIA GPU | VLDB Top-k EGA |
| cuda-bicgstab-ilu-spmv | SpTRSV 10.7×; BiCGSTAB 3.2× vs cuSPARSE | V100 / MI210 | arXiv 2508.04917 |
| cuda-icicle-snark-groth16 | fastest Groth16 today; NTT 91% of prover | RTX 4090 / A100 | ICICLE-Snark |
Surveyed forms (Wave 4)
| form | throughput | hardware | ref |
|---|---|---|---|
| cuda-kalman-batched | 1386× for 5000-component measurements | various | CUDAkalmanFilter |
| cuda-g6k-tensor-sieve | 1230× vs G6K CPU sieve at dim 120; SVP record dim 180 on 4 Turing GPUs | 4 Turing | Ducas/Stevens/van Woerden EC 2021 |
| cuda-zkspeed-sumcheck-hyperplonk | 801× geomean over CPU; sumcheck 8.4 s → 9.5 ms | full-chip accelerator | zkSpeed HPCA 2025 |
| cuda-kyber-batched-ntt | ~451× batched (B=65k); HI-Kyber 6.47× over prior GPU SOTA | RTX 3080 | HI-Kyber |
| cuda-ironman-ote | 237× OT throughput vs full-thread CPU | near-memory variant | Ironman arXiv 2507.16391 |
| cuda-cudss-cholesky | >100× vs QDLDL; 20× vs CHOLMOD factor | NVIDIA | NVIDIA cuDSS |
| cuda-particle-filter | ~150× absolute (5000 particles @ 170 Hz) | GPGPU | EURASIP J ASP 2013 |
| cuda-rk-stiff-chemkin | 126× vs single-core; 25× vs 6-core for hydrogen RKCK 524k ODEs | GPGPU | Niemeyer & Sung |
| cuda-fem-assembly-jit | 87× assembly vs serial CPU; 126× peak numerical integration | GPGPU | Mironov et al. |
| cuda-cufalcon-sign | 201k sig/s Falcon-512 on A100; verify 2.72M sig/s, 29.5× vs AVX2 | A100 | cuFalcon eprint 2025/249 |
| cuda-cudahull-3d | 30–40× over Qhull CPU | NVIDIA | CudaHull CAG 2012 |
| cuda-fastplay-garbled | 35–40× over serial garbling on GPU cluster | GPU cluster | Fastplay eprint 2011/097 |
| cuda-air-fri | ~22.8× avg end-to-end ZK speedup; FRI commitment | GPGPU | Air-FRI SAC 2025 |
| cuda-rabin-fingerprint | 16× over single-thread CPU; 40 Gbps absolute | GTX 780 (HARENS) | HARENS CloudCom 2016 |
| cuda-gdel3d | 10× over CGAL 3D Delaunay; 70× Voronoi/jump-flood at 10M points | NVIDIA | gDel3D I3D 2014 |
| cuda-perasure-crs | 10× vs multithread Jerasure; 10 GB/s on GTX780 absolute | GTX 780 | PErasure IEEE Cluster 2015 |
| cuda-piranha-mpc | 4× vs CryptGPU on VGG16 private inference; full 3/4-party stacks single-GPU | single GPU | Piranha USENIX Sec 2022 |
| cuda-scamp-matrix-profile | quintillion pairwise comparisons / day (absolute) | GPGPU | SCAMP |
| cuda-cudtw-subseq | 2–3 orders of magnitude over UCR-Suite CPU; soft-DTW up to 5000× | Volta | cuDTW++ Euro-Par 2020 |
| cuda-terachem-dft | 1–2 orders of magnitude over CPU; 8–50× vs GAMESS on 256-core cluster | 4× Tesla | TeraChem |
Wave 4 filter-outs: MAFFT MSA (11–20×, surpassed), RAxML likelihood (32× kernel only, ~3–10× end-to-end), AmgX CG (3–4× vs AmgX baseline), BVH Karras LBVH (2–3× over prior GPU LBVH), LDPC decode (40–160 Mbps, not ≥10× over modern SIMD CPU), mesh decimation (application-dependent), discrete Gaussian sampler (single-digit % gains; fold into Kyber NTT). Revisit when the published number changes.
Why a form earns its slot
A form is worth running on the GPU when at least one of the following holds:
- Embarrassingly parallel. N independent items, no cross-item dependency.
- Dense, branch-free inner loop. Same operation on every element.
- Reduction-friendly. Tree-reduce / prefix-sum / parallel-scan patterns.
- Big batch amortizes fixed kernel overhead.
When none of these hold, find a different decomposition: parallelize on a different axis, or stay on the CPU and fan out across fleet hosts.
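The batch-amortization criterion can be made concrete with a simple linear cost model: CPU time grows as n × per-item CPU cost, while GPU time is a fixed overhead plus n × per-item GPU cost. The Python sketch below solves for the smallest batch at which the GPU path wins. The model, the function name `break_even_batch`, and the example numbers are all assumptions for illustration, not measurements from our hardware.

```python
import math


def break_even_batch(fixed_overhead_us, cpu_us_per_item, gpu_us_per_item):
    """Smallest batch size at which the GPU path is strictly faster than the CPU path.

    Model (an assumption): cpu_time = n * cpu_us_per_item,
    gpu_time = fixed_overhead_us + n * gpu_us_per_item.
    Returns None when the GPU is never faster per item.
    """
    gain_per_item = cpu_us_per_item - gpu_us_per_item
    if gain_per_item <= 0:
        return None
    return math.floor(fixed_overhead_us / gain_per_item) + 1


# Hypothetical numbers: 150 us launch and transfer overhead,
# 1.0 us per item on CPU, 0.05 us per item on GPU.
print(break_even_batch(150.0, 1.0, 0.05))
```

A form whose realistic batch sizes sit below this break-even point is better left on the CPU, even if its per-item GPU cost looks attractive.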
Source & specs
examples/cuda-fanout/ — wire contract, daemon protocol, bench data, per-tier integration.
CATALOG.md — canonical source for form metadata.