Proof-of-concept demos of core AI technologies, running entirely in the browser.

Updated 2026-04-01T04:22:16.023Z

§1 — Tokenization

Tokenization is how large language models break text into discrete units before processing. This demo uses the GPT-2 BPE (Byte Pair Encoding) tokenizer via Transformers.js. The tokenizer vocabulary (~800 KB) is downloaded from HuggingFace on first use and cached by the browser.
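The heart of BPE is a simple loop: repeatedly merge the most frequent adjacent pair of symbols into one new symbol. A minimal sketch of a single merge step in plain JavaScript (the real GPT-2 tokenizer applies thousands of pre-learned merges in a fixed order; the input string here is just illustrative):

```javascript
// Toy illustration of one Byte Pair Encoding (BPE) merge step.
// Count adjacent symbol pairs in a sequence of symbols.
function countPairs(symbols) {
  const counts = new Map();
  for (let i = 0; i < symbols.length - 1; i++) {
    const pair = symbols[i] + "\u0000" + symbols[i + 1];
    counts.set(pair, (counts.get(pair) || 0) + 1);
  }
  return counts;
}

// Merge every occurrence of the most frequent pair into one symbol.
function mergeMostFrequent(symbols) {
  const counts = countPairs(symbols);
  let best = null, bestCount = 0;
  for (const [pair, n] of counts) {
    if (n > bestCount) { best = pair; bestCount = n; }
  }
  if (!best) return symbols;
  const [a, b] = best.split("\u0000");
  const out = [];
  for (let i = 0; i < symbols.length; i++) {
    if (i < symbols.length - 1 && symbols[i] === a && symbols[i + 1] === b) {
      out.push(a + b); // fuse the pair into one symbol
      i++;
    } else {
      out.push(symbols[i]);
    }
  }
  return out;
}

// "lowlower" starts as single characters; "lo" (2 occurrences) merges first.
console.log(mergeMostFrequent([..."lowlower"]));
// → [ "lo", "w", "lo", "w", "e", "r" ]
```

A trained tokenizer simply replays merges like this one from its vocabulary file until no more apply, which is why frequent words end up as single tokens while rare words split into pieces.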


§2 — Next-Token Prediction

At each step, a language model produces a probability distribution over every token in its vocabulary — not a single answer. This demo runs GPT-2 (quantized, ~81 MB, cached after first download) fully in-browser and shows that raw distribution for the token that would follow your input. Click any row to append that token and recalculate.

Temperature < 1 sharpens the distribution; > 1 flattens it.
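Temperature works by dividing the model's raw logits before the softmax. A minimal sketch with made-up logit values (the demo applies the same operation to GPT-2's real output):

```javascript
// Softmax with temperature: converts raw logits into probabilities.
function softmax(logits, temperature = 1.0) {
  const scaled = logits.map((x) => x / temperature);
  const max = Math.max(...scaled); // subtract max for numerical stability
  const exps = scaled.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

const logits = [2.0, 1.0, 0.1]; // illustrative values, not real model output
console.log(softmax(logits, 1.0)); // baseline distribution
console.log(softmax(logits, 0.5)); // sharper: the top token gains mass
console.log(softmax(logits, 2.0)); // flatter: probabilities even out
```

Because division happens before exponentiation, a small temperature exaggerates logit gaps and a large one shrinks them, which is exactly the sharpening/flattening described above.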

§3 — Sampling Strategies

Autoregressive generation is a loop: run the model, get a distribution, pick a token, append it, repeat. How you pick determines the character of the output. The four strategies below all operate on the same probability distribution from §2.

Greedy — always take the highest-probability token. Deterministic; tends toward repetitive, "safe" text.
Temperature — sample from the full distribution. Higher temperature = more random.
Top-k — restrict sampling to the k most probable tokens, then sample.
Top-p (nucleus) — restrict sampling to the smallest set of tokens whose cumulative probability ≥ p, then sample.
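The four strategies above can be sketched over one toy distribution (token strings and probabilities below are made up; temperature sampling is the same `sample` call after rescaling logits as in §2):

```javascript
// A toy next-token distribution, assumed sorted by probability, descending.
const dist = [
  { token: " the", p: 0.40 },
  { token: " a",   p: 0.25 },
  { token: " an",  p: 0.15 },
  { token: " one", p: 0.12 },
  { token: " its", p: 0.08 },
];

// Greedy: always take the highest-probability token.
const greedy = (d) => d[0].token;

// Sample proportionally from a (re-normalized) candidate list.
// `rand` is injectable so results can be reproduced.
function sample(candidates, rand = Math.random) {
  const total = candidates.reduce((s, c) => s + c.p, 0);
  let r = rand() * total;
  for (const c of candidates) {
    if ((r -= c.p) <= 0) return c.token;
  }
  return candidates[candidates.length - 1].token;
}

// Top-k: keep only the k most probable tokens, then sample.
const topK = (d, k, rand) => sample(d.slice(0, k), rand);

// Top-p (nucleus): keep the smallest prefix whose cumulative
// probability reaches p, then sample.
function topP(d, p, rand) {
  const kept = [];
  let cum = 0;
  for (const c of d) {
    kept.push(c);
    cum += c.p;
    if (cum >= p) break;
  }
  return sample(kept, rand);
}

console.log(greedy(dist));                 // → " the"
console.log(topK(dist, 2, () => 0.9));     // fixed rand for reproducibility
console.log(topP(dist, 0.8, () => 0.0));
```

Note how top-k uses a fixed candidate count while top-p adapts: a confident distribution shrinks the nucleus, an uncertain one widens it.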

§4 — Embeddings

An embedding encodes text as a point in high-dimensional space: semantically similar texts land near each other regardless of surface wording. This demo uses all-MiniLM-L6-v2 (~23 MB quantized) to produce 384-dimensional sentence vectors, then measures their cosine similarity — the cosine of the angle between the two vectors in that space.

Each bar below is the full 384-dimensional vector visualized left-to-right: blue = positive, red = negative. Similar sentences produce similar-looking patterns.
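Cosine similarity itself is a few lines of arithmetic. A sketch with 3-dimensional toy vectors (the demo's real vectors have 384 dimensions, but the formula is identical):

```javascript
// Cosine similarity: dot product divided by the product of the norms.
// 1 = same direction, 0 = orthogonal, -1 = opposite.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

console.log(cosineSimilarity([1, 0, 0], [1, 0, 0])); // → 1 (same direction)
console.log(cosineSimilarity([1, 0, 0], [0, 1, 0])); // → 0 (orthogonal)
```

Dividing by the norms is what makes the measure length-independent: only the direction of the vectors matters, which is why it suits comparing sentence embeddings of different "magnitudes".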

§5 — Semantic Search

Embeddings make it possible to search by meaning rather than by matching keywords. The corpus here is 40 one-sentence descriptions of AI/ML concepts — the same ideas demonstrated in §1–§4. The index is built once with the same all-MiniLM-L6-v2 model from §4 and stored in IndexedDB so subsequent visits skip the embedding step entirely.
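Once the index exists, search is just "embed the query, score every passage, sort". A sketch with tiny made-up vectors standing in for the model's real embeddings (passage texts and vector values are illustrative):

```javascript
// Rank an embedded corpus by cosine similarity to a query vector.
function cosine(a, b) {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

function search(queryVec, index, topK = 2) {
  return index
    .map(({ text, vec }) => ({ text, score: cosine(queryVec, vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy index: two passages with hypothetical 3-d embeddings.
const corpus = [
  { text: "Tokenization splits text into units", vec: [0.9, 0.1, 0.0] },
  { text: "Embeddings map text to vectors",      vec: [0.1, 0.9, 0.2] },
];

// A query vector pointing "toward" the first passage wins.
console.log(search([0.8, 0.2, 0.1], corpus, 1)[0].text);
```

No keyword ever has to match: a query and a passage score high whenever the embedding model maps them to nearby directions.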

Try queries that don't share words with any passage: "breaking text into pieces", "making output more creative", "how the model focuses on relevant words", "preventing wrong answers".

§6 — Retrieval-Augmented Generation

RAG combines §5 and §3 into one pipeline: embed a query, retrieve the most relevant passages from a corpus, prepend them to the prompt, then generate. The retrieved context steers GPT-2 toward topic-relevant continuations — even though GPT-2 isn't instruction-tuned and won't produce clean "answers". The demo makes each step visible: which passages were retrieved, exactly what string is fed to the model, and how the output shifts when context is present versus absent.
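The "prepend them to the prompt" step is plain string assembly. A minimal sketch; the prompt shape below is an assumption for illustration, not necessarily the exact string the demo builds:

```javascript
// RAG prompt assembly: retrieved passages become a context header
// placed in front of the user's query before generation.
function buildRagPrompt(passages, query) {
  const context = passages.map((p) => "- " + p).join("\n");
  return `Context:\n${context}\n\nQuestion: ${query}\nAnswer:`;
}

const prompt = buildRagPrompt(
  ["Attention weighs how much each token attends to the others."],
  "how attention works"
);
console.log(prompt);
```

The model never "knows" retrieval happened — it just continues a longer prompt, which is why even a non-instruction-tuned model like GPT-2 drifts toward the retrieved topic.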

Try "how attention works", "training a language model", "searching by meaning", or "why models make things up".

§7 — Memory

An agent's "memory" is a writable version of the retrieval index from §5–§6. Each fact is embedded once and stored in IndexedDB — the same technique used to build the static corpus in §6, but now you control what goes in. When you query the agent in the next section, these memories will be retrieved by similarity and injected into the prompt as context, closing the loop from storage to generation.

Add a few facts about yourself or anything you want the agent to remember: "I prefer concise explanations", "I am a machine learning engineer", "My favorite programming language is JavaScript".
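The write path is small: embed once at storage time, keep text and vector together. A sketch using a plain array as a stand-in for IndexedDB so the idea is visible outside a browser; `embed` is a slot where the demo would pass its all-MiniLM-L6-v2 pipeline, and the keyword-based `toyEmbed` below is purely illustrative:

```javascript
// Writable memory store: each fact is embedded once, at write time.
function createMemoryStore(embed) {
  const memories = [];
  return {
    add(text) { memories.push({ text, vec: embed(text) }); },
    all() { return memories.slice(); },
  };
}

// Toy embedding: presence of a few keywords (NOT a real model).
const toyEmbed = (t) => ["concise", "engineer", "javascript"]
  .map((w) => (t.toLowerCase().includes(w) ? 1 : 0));

const store = createMemoryStore(toyEmbed);
store.add("I prefer concise explanations");
store.add("I am a machine learning engineer");
console.log(store.all().length); // → 2
```

Embedding at write time is the design choice that makes later queries cheap: retrieval only needs to embed the query, never the stored facts again.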

§8 — Memory-Grounded Generation

The memories stored in §7 are put to work here. When you send a message it is embedded with the same model that stored the memories, and the most similar memories are retrieved by cosine similarity — exactly like §5 semantic search, but over your personal store. Those memories are then substituted into a prompt template before GPT-2 generates a continuation. This is the core loop behind most memory-enabled agents: embed → retrieve → inject → generate.

[[memories]] is replaced by the top-k retrieved memories as a comma-separated list.
[[message]] is replaced verbatim by your message below.
Both placeholders are optional — leave one out to skip that step.
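The substitution itself can be sketched in a few lines (the template text below is an example, not the demo's exact default; `replaceAll` is a no-op when a placeholder is absent, which is what makes both optional):

```javascript
// Fill [[memories]] and [[message]] placeholders in a prompt template.
function fillTemplate(template, memories, message) {
  return template
    .replaceAll("[[memories]]", memories.join(", "))
    .replaceAll("[[message]]", message);
}

const template = "Facts about the user: [[memories]]\nUser says: [[message]]\n";
console.log(fillTemplate(
  template,
  ["I prefer concise explanations", "I like JavaScript"],
  "Explain embeddings"
));
```

The filled string is what GPT-2 actually sees — the "memory" is nothing more than retrieved text spliced into the prompt.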
