lumbda

bend — dispatch to a GPU without rewriting your code

bend is a Lumbda primitive that decides per call whether to evaluate locally or ship to a CUDA worker over our wire protocol. Tiny inputs stay local; heavy inputs bend to a worker that holds a warm CUDA context across requests. The decision uses a cost estimator on the argument shape, not the operation name.
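
The dispatch decision can be pictured with a short Python sketch. It is an illustration only: the cost function (total argument bytes), the threshold value, and the names `estimate_cost`, `bend`, `run_local`, and `run_remote` are assumptions made for this example, not the estimator that ships in bend.lsp.

```python
from typing import Callable, Sequence

# Hypothetical threshold in bytes; the real cutoff is not stated in this document.
COST_THRESHOLD = 64 * 1024


def estimate_cost(inputs: Sequence[bytes]) -> int:
    """Estimate cost from the argument shape alone: total payload bytes."""
    return sum(len(item) for item in inputs)


def bend(inputs: Sequence[bytes], run_local: Callable, run_remote: Callable,
         threshold: int = COST_THRESHOLD):
    """Return (where, result): evaluate locally when cheap, else ship to a worker."""
    if estimate_cost(inputs) < threshold:
        return "local", run_local(inputs)
    return "remote", run_remote(inputs)


tiny = [bytes(16)] * 3
heavy = [bytes(16)] * 1_000_000
print(bend(tiny, len, len))    # → ('local', 3)
print(bend(heavy, len, len))   # → ('remote', 1000000)
```

Note that the operation name never enters the decision: the same call goes local or remote purely on the size of its arguments.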

Start a GPU worker

# On any host with nvcc + a CUDA-capable GPU:
make gpu-worker
# → builds examples/cuda-fanout/shake256-fanout
# → builds the C tier (~10× faster wire orchestration than Python)
# → launches gpu-worker.lsp on port 9091

# Override tier or port:
make gpu-worker LUMBDA=python PORT=9001   # easier debugging
make gpu-worker LUMBDA=asm                # smallest footprint

Call it from any tier

;; bend works on every tier — Python, C, asm — through the same
;; tcp-* + portal primitives lumbda already ships.
(load "examples/cuda-fanout/wire.lsp")
(load "examples/cuda-fanout/bend.lsp")
(load "examples/cuda-fanout/bend-macros.lsp")  ; Python/C only — asm uses bend-call

;; Tiny — cost below threshold, evaluates locally
(bend (cuda-shake-fanout '("00" "01" "deadbeef") 32))

;; Heavy — cost above threshold, ships to the GPU worker
(bend (cuda-shake-fanout one-million-inputs 32))

Wire protocol

Two modes: S-expression text (the default) and binary (magic BSHK header + raw bytes). Binary mode bypasses S-expression parsing entirely.

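| workload | Py S-exp | Py binary | C S-exp | C binary |
| --- | --- | --- | --- | --- |
| 100 × 16 B | 3.43 ms | 0.74 ms | 0.40 ms | 0.15 ms |
| 1k × 16 B | 23.24 ms | 0.76 ms | 2.77 ms | 0.22 ms |
| 10k × 16 B | 218.82 ms | 1.27 ms | CLIFF | 0.88 ms |
| 100k × 16 B | 2,219 ms | 10.18 ms | CLIFF | 10.35 ms |
| 1M × 16 B | 23,811 ms | 159 ms | CLIFF | 157 ms |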

Binary mode wins by 30–200× over S-expression at scale. At 1 M × 16 B inputs, C-tier binary takes 157 ms against 23,811 ms for Python S-expression (the C S-expression column reads CLIFF from 10k inputs up), and bend beats host hashlib by ~12×. The CUDA toolchain stays isolated to the leaf binary the worker spawns: no tier links libcudart, and the asm tier hosts workers through hand-written pipe2 + fork + execve syscalls.
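
The speedup range can be recomputed from the table above. The following Python check uses only the two Python-tier columns, because the C S-expression column has no timings past 1k inputs; the dictionary names are chosen for this example.

```python
# Timings in milliseconds, copied from the Python-tier columns of the table.
py_sexp = {"100": 3.43, "1k": 23.24, "10k": 218.82, "100k": 2219.0, "1M": 23811.0}
py_binary = {"100": 0.74, "1k": 0.76, "10k": 1.27, "100k": 10.18, "1M": 159.0}

# Speedup of binary mode over S-expression mode for each workload size.
speedups = {n: py_sexp[n] / py_binary[n] for n in py_sexp}

for n, ratio in speedups.items():
    print(f"{n:>4} x 16 B: binary is {ratio:6.1f}x faster than S-exp")
```

On these numbers the Python-tier ratio is about 4.6× at 100 inputs, crosses 30× at 1k inputs, and peaks near 218× at 100k inputs.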

Fleet

Production default: single host. Multi-host fan-out is built (round-robin in bend.lsp via *bend-workers* + BEND_WORKERS env) and available on demand for long-running parallel workloads, but it is not used routinely.

| host | GPU | arch | port | status |
| --- | --- | --- | --- | --- |
| 3090-ai.foxhop.net | RTX 3090 (24 GB) | sm_86 | 9091 | active; production worker |
| ai.foxhop.net | RTX 4090 (24 GB) | sm_89 | 9092 | reserved for qwen LLM (llama.cpp); bend worker enabled per workload |

Multi-host fan-out was validated at 1.6× aggregate throughput on small workloads, but routine round-robin against the 4090 would steal VRAM from qwen. The caller therefore opts in explicitly when a workload justifies fan-out: (bend-set-workers! '(("3090-ai.foxhop.net" . 9091) ("ai.foxhop.net" . 9092))) brings the 4090 online for that call.
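
Round-robin worker selection can be sketched in Python as follows. The `host:port,host:port` syntax for BEND_WORKERS is an assumption made for this example (the real format lives in bend.lsp), and `parse_workers`, `workers_from_env`, and `round_robin` are hypothetical helper names.

```python
import itertools
import os


def parse_workers(spec: str) -> list[tuple[str, int]]:
    """Parse 'host:port,host:port' (an assumed format) into (host, port) pairs."""
    workers = []
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.rpartition(":")
        workers.append((host, int(port)))
    return workers


def workers_from_env(default: str = "3090-ai.foxhop.net:9091") -> list[tuple[str, int]]:
    """Read BEND_WORKERS, falling back to the single production host."""
    return parse_workers(os.environ.get("BEND_WORKERS", default))


def round_robin(workers):
    """Cycle through the workers forever so consecutive calls alternate hosts."""
    return itertools.cycle(workers)


# Explicit opt-in to both hosts, mirroring the bend-set-workers! call above.
fleet = parse_workers("3090-ai.foxhop.net:9091,ai.foxhop.net:9092")
picker = round_robin(fleet)
for _ in range(3):
    print(next(picker))
```

With a single worker in the list the cycle degenerates to the production default, so the same code path serves both modes.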

Form catalog

A form earns a slot here only after we have published a benchmark or measured one on our hardware. "I think this would be fast" does not earn a slot — the form-status column says planned until numbers exist.

Live forms

| form | hardware | throughput | wire shape |
| --- | --- | --- | --- |
| cuda-shake-fanout | RTX 3090 | 12× host hashlib at 1 M × 16 B inputs | (cuda-shake-fanout '(hex ...) out-bytes) + BSHK binary |
| cuda-sim-ops-bin | RTX 3090 | 1.07× at 128 batches; crossover ~115 batches | (cuda-sim-ops-bin "path/to/ops.bin" n-batches) |
| cuda-sim-axis-flip | RTX 3090 | 217 Mops/s @ K=32 M=4 (many-candidates × few-shots) | (cuda-sim-axis (variant-paths ...) n-shots) |
| cuda-bignum-cgbn | RTX 3090 | 1.28 Gops/s kernel mod-mul @ n=1M (256-bit, ~256× GMP CPU); 9 ops | BCGB binary: op_id + bitwidth + n + modulus + a + b |
| cuda-secp256k1-batched-mul | RTX 3090 | 13.83 Mkeys/s @ n=1M (~309× coincurve CPU; windowed-G ladder w=4) | BSCP binary: scalars + base-point → BSCR points |
| cuda-radix-sort | RTX 3090 | 5.50 Gkeys/s @ n=10M (kernel 1.82 ms); 3.77 Gkeys/s @ n=1M; CUB DeviceRadixSort u64 ascending | BSRT binary: op_id + n + u64[n] → BSRR sorted u64[n] |
| cuda-blake3-tree | RTX 3090 | 32.5 GB/s @ 1M × 64 B (kernel 1.97 ms); 20.4 GB/s @ 1k × 1 MB (kernel 51.5 ms); byte-identical to BLAKE3 reference spec across single-chunk + multi-chunk paths | BSB3 binary: out_bytes + n + (u32 len + bytes) per input → BSR3 n + out_bytes + digests |
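
To make the wire-shape column concrete, here is a Python sketch of the cuda-blake3-tree framing listed above. The field order follows the table entry; everything else is an assumption for illustration: a 4-byte ASCII magic, `out_bytes` and `n` as unsigned 32-bit integers, and little-endian byte order. The authoritative layout is the wire contract in examples/cuda-fanout/.

```python
import struct


def pack_bsb3_request(inputs, out_bytes):
    """Build: magic, out_bytes, n, then (u32 len + bytes) for each input."""
    frame = bytearray(b"BSB3")
    frame += struct.pack("<II", out_bytes, len(inputs))  # assumed u32 fields
    for item in inputs:
        frame += struct.pack("<I", len(item)) + item
    return bytes(frame)


def unpack_bsr3_response(frame):
    """Parse: magic, n, out_bytes, then n digests of out_bytes each."""
    if frame[:4] != b"BSR3":
        raise ValueError("not a BSR3 frame")
    n, out_bytes = struct.unpack_from("<II", frame, 4)
    body = frame[12:]
    if len(body) != n * out_bytes:
        raise ValueError("digest section has the wrong length")
    return [body[i * out_bytes:(i + 1) * out_bytes] for i in range(n)]


request = pack_bsb3_request([b"\x00", b"\x01", bytes.fromhex("deadbeef")], 32)
print(len(request))  # → 30 (4 magic + 8 header + 5 + 5 + 8)
```

Because every field is length-prefixed, the receiver can slice the payload directly instead of tokenizing text, which is the property the binary mode relies on.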

Surveyed forms (Wave 1)

| form | throughput | hardware | ref |
| --- | --- | --- | --- |
| cuda-secp256k1-batched-mul | 6.5 Gkeys/s VanitySearch RTX 4090; 8.6 Gkeys/s RTX 5090; 2.65 Gkeys/s RTX 3080 | (promoted; see Live) | gECC · Bitcrack |
| cuda-bignum-cgbn | 100×+ on dense mul vs Xeon-20c + GMP + OpenMP | V100 | NVlabs CGBN · midsize-int |
| cuda-rho-pollard-walk | 87.7 M ops/sec for ECCp79 | RTX 2070 Super | atlomak · oritwoen |
| cuda-clifford-stabilizer | 186× over Stim (CPU SOTA) on equivalence-checking | STABSim | STABSim · Qimax |
| cuda-bernstein-yang-inv | 3–10× per inversion over Fermat on CPU; novel territory on GPU | | safegcd · Jumping |
| cuda-ntt-poly | up to 123× over CPU; 21× on RTX 3070 | RTX 3070 | NTTSuite · FHE NTT |
| cuda-radix-sort | 1.4 G keys/sec; 20–50× over CPU merge sort; 257× vs Xeon Phi for scan | (promoted; see Live) | CUB · Onesweep |

Surveyed forms (Wave 2)

Sorted by reported speedup descending.

| form | throughput | hardware | ref |
| --- | --- | --- | --- |
| cuda-minhash-weighted | 600–1000× vs numpy+MKL | Titan X vs Xeon E5-1650 | src-d/minhashcuda |
| cuda-cuckoo-filter | 378× insert, 258× delete | A100 | arXiv:2603.15486 |
| cuda-aes-ctr-chacha20 | 211–400 GB/s | single GPU | AsyncGBP |
| cuda-suffix-array-skew | 30–242× vs CPU SA-IS | Tesla K20 | Liu/Luo |
| cuda-kdtree-build | 30–242× build, 1.6–200× kNN | RTX (RT cores) | Zhou et al. |
| cuda-sat-paraFROST-elim | 93× peak, 48× avg variable elim | NVIDIA + Kissat baseline | ParaFROST |
| cuda-aho-corasick-pfac | ~50–100× IDS pkt-inspect | GTX-class | PFAC |
| cuda-dilithium-pqsig | 57.7× keygen+sign+verify vs single CPU thread | RTX 3090 Ti | IACR 2024/1365 |
| cuda-mc-options-pricing | 25–152× | Tesla C1060 / modern | GPU Gems Ch.45 |
| cuda-cuFFT-batched-1D | 8–32× vs MKL; tcFFT 1.1–3.2× vs cuFFT | V100 / A100 | tcFFT |
| cuda-blake3-tree | 32.5 GB/s @ 1M × 64 B; 20.4 GB/s @ 1k × 1 MB on 3090 | (promoted; see Live) | Blaze-3 |
| cuda-bloom-filter-modern | ~6× CPU; 3.4 B inserts/s | B200 / Perlmutter | arXiv:2512.15595 |
| cuda-gemm-batched-FP8 | 4.8× FP8 vs A100; 716 TFLOPS H100 | H100 SXM | cuBLAS 12.0 |
| cuda-batched-matrix-inverse | 4.3–16.8× vs MAGMA | P100 | Superfri 2018 |
| cuda-hash-join-radix | 4 B tuples/s single; 1.8 T tuples/s on 1024 A100 | A100 cluster | ADMS-21 |
| cuda-kmer-count | 4–6× vs KMC2 | RapidGKC, Gerbil | RapidGKC |
| cuda-cgraph-traversal | 38 B TEPS | DGX2 | cuGraph |
| cuda-triangle-count-TRUST | ~1 T TEPS | multi-A100 | TRUST |
| cuda-ldpc-bp-decoder | 10 Gbps with early-termination | GPGPU | MDPI Electronics 2022 |
| cuda-nvcomp-zstd | 2.2× zstd; 1.4× LZ4; 1.9× snappy | H100 / A100 | nvCOMP |

Surveyed forms (Wave 3)

| form | throughput | hardware | ref |
| --- | --- | --- | --- |
| cuda-fluidx3d-lbm | 100–200× vs ANSYS Fluent; 8,799 MLUPS single A100 | A100 | FluidX3D |
| cuda-mfcc-spectral | ~97× CPU MFCC; STFT ~75× via cuSignal | GTX 580 / RTX 30-series | cuSignal |
| cuda-batched-lp-simplex | 95× over CPLEX; 5× over GLPK | GTX 980-class | arXiv 1802.08557 |
| cuda-betweenness-centrality-weighted | 30–150× warp-centric weighted BC | GTX onwards | arXiv 1701.05975 |
| cuda-cudasift-orb-ransac | ~60× SIFT CPU→GPU; ORB 11.3× | GTX 1060+ | CudaSift |
| cuda-cudasw-gasal2 | CUDASW++4.0 16.2×; 5.71 TCUPS on H100 | H100 | CUDASW++4.0 |
| cuda-loopy-bp-mrf | 45× over CPU LBP for stereo MRF | GTX 280+ | arXiv 2509.22337 |
| cuda-sgm-stereo | 42 fps @ 640×480, 128 disparities | Tegra X1 / discrete | arXiv 1610.04121 |
| cuda-hungarian-lap | 10–50×; 400 M-variable LAP in ~13 s | NVIDIA GPU | ScienceDirect |
| cuda-msm-bls12-381 | 27.86× over Pippenger AVX baseline | A100 / RTX 4090 | SimdMSM TCHES |
| cuda-pdwt-lifting | 15.9× over best optimized CPU DWT | GTX / Tesla | PDWT |
| cuda-tensornet-contract | 8–20× vs CuPy; tensor QR ~100× vs Xeon | A100 | cuTensorNet |
| cuda-ega-gpu-aggregation | 6.45–29.12× multi-pass; group-by 19.4× | NVIDIA GPU | VLDB Top-k EGA |
| cuda-bicgstab-ilu-spmv | SpTRSV 10.7×; BiCGSTAB 3.2× vs cuSPARSE | V100 / MI210 | arXiv 2508.04917 |
| cuda-icicle-snark-groth16 | fastest Groth16 today; NTT 91% of prover | RTX 4090 / A100 | ICICLE-Snark |

Surveyed forms (Wave 4)

| form | throughput | hardware | ref |
| --- | --- | --- | --- |
| cuda-kalman-batched | 1386× for 5000-component measurements | various | CUDAkalmanFilter |
| cuda-g6k-tensor-sieve | 1230× vs G6K CPU sieve at dim 120; SVP record dim 180 on 4 Turing GPUs | 4 Turing | Ducas/Stevens/van Woerden EC 2021 |
| cuda-zkspeed-sumcheck-hyperplonk | 801× geomean over CPU; sumcheck 8.4 s → 9.5 ms | full-chip accelerator | zkSpeed HPCA 2025 |
| cuda-kyber-batched-ntt | ~451× batched (B=65k); HI-Kyber 6.47× over prior GPU SOTA | RTX 3080 | HI-Kyber |
| cuda-ironman-ote | 237× OT throughput vs full-thread CPU | near-memory variant | Ironman arXiv 2507.16391 |
| cuda-cudss-cholesky | >100× vs QDLDL; 20× vs CHOLMOD factor | NVIDIA | NVIDIA cuDSS |
| cuda-particle-filter | ~150× absolute (5000 particles @ 170 Hz) | GPGPU | EURASIP J ASP 2013 |
| cuda-rk-stiff-chemkin | 126× vs single-core; 25× vs 6-core for hydrogen RKCK 524k ODEs | GPGPU | Niemeyer & Sung |
| cuda-fem-assembly-jit | 87× assembly vs serial CPU; 126× peak numerical integration | GPGPU | Mironov et al. |
| cuda-cufalcon-sign | 201k sig/s Falcon-512 on A100; verify 2.72M sig/s, 29.5× vs AVX2 | A100 | cuFalcon eprint 2025/249 |
| cuda-cudahull-3d | 30–40× over Qhull CPU | NVIDIA | CudaHull CAG 2012 |
| cuda-fastplay-garbled | 35–40× over serial garbling on GPU cluster | GPU cluster | Fastplay eprint 2011/097 |
| cuda-air-fri | ~22.8× avg end-to-end ZK speedup; FRI commitment | GPGPU | Air-FRI SAC 2025 |
| cuda-rabin-fingerprint | 16× over single-thread CPU; 40 Gbps absolute | GTX 780 (HARENS) | HARENS CloudCom 2016 |
| cuda-gdel3d | 10× over CGAL 3D Delaunay; 70× Voronoi/jump-flood at 10M points | NVIDIA | gDel3D I3D 2014 |
| cuda-perasure-crs | 10× vs multithread Jerasure; 10 GB/s on GTX780 absolute | GTX 780 | PErasure IEEE Cluster 2015 |
| cuda-piranha-mpc | 4× vs CryptGPU on VGG16 private inference; full 3/4-party stacks single-GPU | single GPU | Piranha USENIX Sec 2022 |
| cuda-scamp-matrix-profile | quintillion pairwise comparisons / day (absolute) | GPGPU | SCAMP |
| cuda-cudtw-subseq | 2–3 orders of magnitude over UCR-Suite CPU; soft-DTW up to 5000× | Volta | cuDTW++ Euro-Par 2020 |
| cuda-terachem-dft | 1–2 orders of magnitude over CPU; 8–50× vs GAMESS on 256-core cluster | 4× Tesla | TeraChem |

Wave 4 filter-outs: MAFFT MSA (11–20×, surpassed), RAxML likelihood (32× kernel only, ~3–10× end-to-end), AmgX CG (3–4× vs AmgX baseline), BVH Karras LBVH (2–3× over prior GPU LBVH), LDPC decode (40–160 Mbps, not ≥10× over modern SIMD CPU), mesh decimation (application-dependent), discrete Gaussian sampler (single-digit % gains; fold into Kyber NTT). Revisit when the published number changes.

Why a form earns its slot

A form is GPU-worth-it when at least one of the following holds:

  1. Embarrassingly parallel. N independent items, no cross-item dependency.
  2. Dense, branch-free inner loop. Same operation on every element.
  3. Reduction-friendly. Tree-reduce / prefix-sum / parallel-scan patterns.
  4. Big batch amortizes fixed kernel overhead.

When none of these hold, find a different decomposition: parallelize on a different axis, or stay on CPU and fan out across fleet hosts.
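
Criterion 4 can be made concrete with a simple linear cost model, sketched here in Python. The model (a fixed per-call overhead plus a per-item cost) and all the numbers are assumptions chosen for illustration, not measurements from this catalog.

```python
import math


def crossover_batch(fixed_overhead_us, cpu_us_per_item, gpu_us_per_item):
    """Smallest batch size at which the GPU path is at least as fast as the CPU.

    Model: cpu_time(n) = n * cpu_us_per_item
           gpu_time(n) = fixed_overhead_us + n * gpu_us_per_item
    """
    saving = cpu_us_per_item - gpu_us_per_item
    if saving <= 0:
        return None  # the GPU never catches up; keep the work on the CPU
    return math.ceil(fixed_overhead_us / saving)


# Hypothetical figures: 200 us launch + transfer overhead,
# 2 us per item on the CPU, 0.1 us per item on the GPU.
print(crossover_batch(200.0, 2.0, 0.1))  # → 106
```

Below the crossover the fixed overhead dominates and the call is cheaper on the CPU; above it each extra item widens the GPU's lead.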

Source & specs

examples/cuda-fanout/ — wire contract, daemon protocol, bench data, per-tier integration.
CATALOG.md — canonical source for form metadata.