
ML inference.

Browser-side ML where the heavy linear algebra runs through Hadron's dispatch.

The buyer's problem

Teams want to push ML to the browser for cost, privacy, and offline reasons. But the post-embedding compute (similarity matmul, ranking, clustering) often ends up stuck on the main thread or hard-coded to WASM, and that is exactly where per-operation dispatch matters most.

What Hadron replaces

The custom-rolled matmul and ranking layer that sits between Transformers.js / ONNX and the UI. The compute engine handles the K×K similarity matrix; ayoob-sort handles the top-K ranking; each picks GPU or CPU per call.
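To make the replaced layer concrete, here is a minimal sketch of the kind of hand-rolled similarity and ranking code that typically sits between the embedding model and the UI. It is an illustration only: `normalise`, `similarityMatrix`, and `topNeighbours` are hypothetical names, not Hadron or ayoob-sort APIs, and the built-in `Array.prototype.sort` stands in for the ranking step.

```typescript
// Hypothetical helper names for illustration; none of this is Hadron's API.

/** L2-normalise a vector so that a dot product equals cosine similarity. */
function normalise(v: Float32Array): Float32Array {
  let sumSq = 0;
  for (let d = 0; d < v.length; d++) sumSq += v[d] * v[d];
  const norm = Math.sqrt(sumSq) || 1; // guard against an all-zero vector
  return v.map((x) => x / norm);
}

/** Dense K×K cosine-similarity matrix, stored row-major in one Float32Array. */
function similarityMatrix(embeddings: Float32Array[]): Float32Array {
  const k = embeddings.length;
  const unit = embeddings.map(normalise);
  const out = new Float32Array(k * k);
  for (let i = 0; i < k; i++) {
    for (let j = i; j < k; j++) {
      let dot = 0;
      for (let d = 0; d < unit[i].length; d++) dot += unit[i][d] * unit[j][d];
      out[i * k + j] = dot; // the matrix is symmetric,
      out[j * k + i] = dot; // so fill both halves at once
    }
  }
  return out;
}

/** Indices of the n items most similar to `row`, excluding `row` itself. */
function topNeighbours(sim: Float32Array, k: number, row: number, n: number): number[] {
  const candidates: number[] = [];
  for (let j = 0; j < k; j++) if (j !== row) candidates.push(j);
  // Plain comparison sort as a stand-in for a dedicated top-K ranker.
  candidates.sort((a, b) => sim[row * k + b] - sim[row * k + a]);
  return candidates.slice(0, n);
}
```

Everything here runs inline on the calling thread, which is the situation described above: fine for a handful of items, but a main-thread stall once K grows.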

What the buyer gets

Working translation, sentiment classification, and semantic-search demos running on real Hugging Face models - on whatever hardware the user happens to have.

Patents demonstrated: P2, P5

Reference customers: browser-based ML inference frameworks

How to tell Hadron is working

Hadron does not claim the GPU is always fastest. It picks the right backend per operation. A small attention matmul on a quantised model might lose on GPU because of transfer cost; the same matmul at sequence length 512 wins on GPU by 5×. Watch the Dispatch evidence panel beside each app: every model invocation prints which path was chosen, the score, and a one-line reason.

GPU: Hardware-accelerated - chosen when batch size, shape and precision all favour it.

WORKER: Background thread - chosen when GPU transfer cost would dominate.

CPU: Inline - chosen for small workloads where dispatch overhead would dominate.

REFUSED: GPU categorically refused (patent-protected) - precision-risk or shape-incompatible operations.
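The four labels above can be read as the output of a per-call decision function. The sketch below is one hedged way to picture it, not Hadron's actual dispatch logic: the `MatmulOp` shape, the FLOP-count score, and both thresholds are assumptions invented for illustration. It also leaves open where a refused operation ends up running, since the labels above do not say.

```typescript
// Illustrative only: names, score formula, and thresholds are assumptions,
// not Hadron's real heuristics.
type PathLabel = "GPU" | "WORKER" | "CPU" | "REFUSED";

interface MatmulOp {
  rows: number;  // rows of the left matrix
  inner: number; // shared inner dimension
  cols: number;  // columns of the right matrix
  precisionSensitive: boolean; // hypothetical flag for precision-risk operations
}

interface DispatchDecision {
  label: PathLabel;
  score: number;  // here: estimated floating-point operations
  reason: string; // one-line explanation, as shown in an evidence panel
}

const INLINE_LIMIT = 1e5; // hypothetical: below this, dispatch overhead dominates
const WORKER_LIMIT = 1e7; // hypothetical: below this, GPU transfer cost dominates

function decide(op: MatmulOp): DispatchDecision {
  // A matmul costs roughly 2 * rows * inner * cols floating-point operations.
  const score = 2 * op.rows * op.inner * op.cols;
  if (op.precisionSensitive) {
    return { label: "REFUSED", score, reason: "precision risk: GPU refused for this operation" };
  }
  if (score < INLINE_LIMIT) {
    return { label: "CPU", score, reason: "small workload: dispatch overhead would dominate" };
  }
  if (score < WORKER_LIMIT) {
    return { label: "WORKER", score, reason: "mid-size workload: GPU transfer cost would dominate" };
  }
  return { label: "GPU", score, reason: "large workload: size and precision favour the GPU" };
}
```

With these made-up thresholds, an attention-style matmul with head dimension 64 lands on CPU at sequence length 16, on WORKER at 128, and on GPU at 512, which mirrors the small-versus-large contrast described above.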


Note: Models stream from the Hugging Face CDN on first use (~25-30 MB each) and cache in the browser. The first load is slower; refresh and try again to see steady-state behaviour.