Proof-of-concept demos of core AI technologies, running entirely in the browser.

Updated 2026-04-01T04:22:16.023Z

§1 — Tokenization

Tokenization is how large language models break text into discrete units before processing. This demo uses the GPT-2 BPE (Byte Pair Encoding) tokenizer via Transformers.js. The tokenizer vocabulary (~800 KB) is downloaded from HuggingFace on first use and cached by the browser.
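The heart of BPE is a simple loop: repeatedly merge the most frequent adjacent pair of symbols into one new symbol. A minimal sketch of a single merge step in plain JavaScript (the real GPT-2 tokenizer applies thousands of pre-learned merges in a fixed order; the input string here is just illustrative):

```javascript
// Toy illustration of one Byte Pair Encoding (BPE) merge step.
// Count adjacent symbol pairs in a sequence of symbols.
function countPairs(symbols) {
  const counts = new Map();
  for (let i = 0; i < symbols.length - 1; i++) {
    const pair = symbols[i] + "\u0000" + symbols[i + 1];
    counts.set(pair, (counts.get(pair) || 0) + 1);
  }
  return counts;
}

// Merge every occurrence of the most frequent pair into one symbol.
function mergeMostFrequent(symbols) {
  const counts = countPairs(symbols);
  let best = null, bestCount = 0;
  for (const [pair, n] of counts) {
    if (n > bestCount) { best = pair; bestCount = n; }
  }
  if (!best) return symbols;
  const [a, b] = best.split("\u0000");
  const out = [];
  for (let i = 0; i < symbols.length; i++) {
    if (i < symbols.length - 1 && symbols[i] === a && symbols[i + 1] === b) {
      out.push(a + b); // fuse the pair into one symbol
      i++;
    } else {
      out.push(symbols[i]);
    }
  }
  return out;
}

// "lowlower" starts as single characters; "lo" (2 occurrences) merges first.
console.log(mergeMostFrequent([..."lowlower"]));
// → [ "lo", "w", "lo", "w", "e", "r" ]
```

A trained tokenizer simply replays merges like this one from its vocabulary file until no more apply, which is why frequent words end up as single tokens while rare words split into pieces.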


§2 — Next-Token Prediction

At each step, a language model produces a probability distribution over every token in its vocabulary — not a single answer. This demo runs GPT-2 (quantized, ~81 MB, cached after first download) fully in-browser and shows that raw distribution for the token that would follow your input. Click any row to append that token and recalculate.

Temperature < 1 sharpens the distribution; > 1 flattens it.
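Temperature works by dividing the model's raw logits before the softmax. A minimal sketch with made-up logit values (the demo applies the same operation to GPT-2's real output):

```javascript
// Softmax with temperature: converts raw logits into probabilities.
function softmax(logits, temperature = 1.0) {
  const scaled = logits.map((x) => x / temperature);
  const max = Math.max(...scaled); // subtract max for numerical stability
  const exps = scaled.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

const logits = [2.0, 1.0, 0.1]; // illustrative values, not real model output
console.log(softmax(logits, 1.0)); // baseline distribution
console.log(softmax(logits, 0.5)); // sharper: the top token gains mass
console.log(softmax(logits, 2.0)); // flatter: probabilities even out
```

Because division happens before exponentiation, a small temperature exaggerates logit gaps and a large one shrinks them, which is exactly the sharpening/flattening described above.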

§3 — Sampling Strategies

Autoregressive generation is a loop: run the model, get a distribution, pick a token, append it, repeat. How you pick determines the character of the output. The four strategies below all operate on the same probability distribution from §2.

Greedy — always take the highest-probability token. Deterministic; tends toward repetitive, "safe" text.
Temperature — sample from the full distribution. Higher temperature = more random.
Top-k — restrict sampling to the k most probable tokens, then sample.
Top-p (nucleus) — restrict sampling to the smallest set of tokens whose cumulative probability ≥ p, then sample.
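The four strategies above can be sketched over one toy distribution (token strings and probabilities below are made up; temperature sampling is the same `sample` call after rescaling logits as in §2):

```javascript
// A toy next-token distribution, assumed sorted by probability, descending.
const dist = [
  { token: " the", p: 0.40 },
  { token: " a",   p: 0.25 },
  { token: " an",  p: 0.15 },
  { token: " one", p: 0.12 },
  { token: " its", p: 0.08 },
];

// Greedy: always take the highest-probability token.
const greedy = (d) => d[0].token;

// Sample proportionally from a (re-normalized) candidate list.
// `rand` is injectable so results can be reproduced.
function sample(candidates, rand = Math.random) {
  const total = candidates.reduce((s, c) => s + c.p, 0);
  let r = rand() * total;
  for (const c of candidates) {
    if ((r -= c.p) <= 0) return c.token;
  }
  return candidates[candidates.length - 1].token;
}

// Top-k: keep only the k most probable tokens, then sample.
const topK = (d, k, rand) => sample(d.slice(0, k), rand);

// Top-p (nucleus): keep the smallest prefix whose cumulative
// probability reaches p, then sample.
function topP(d, p, rand) {
  const kept = [];
  let cum = 0;
  for (const c of d) {
    kept.push(c);
    cum += c.p;
    if (cum >= p) break;
  }
  return sample(kept, rand);
}

console.log(greedy(dist));                 // → " the"
console.log(topK(dist, 2, () => 0.9));     // fixed rand for reproducibility
console.log(topP(dist, 0.8, () => 0.0));
```

Note how top-k uses a fixed candidate count while top-p adapts: a confident distribution shrinks the nucleus, an uncertain one widens it.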

§4 — Embeddings

An embedding encodes text as a point in high-dimensional space: semantically similar texts land near each other regardless of surface wording. This demo uses all-MiniLM-L6-v2 (~23 MB quantized) to produce 384-dimensional sentence vectors, then measures their cosine similarity — the cosine of the angle between the two vectors in that space.

Each bar below is the full 384-dimensional vector visualized left-to-right: blue = positive, red = negative. Similar sentences produce similar-looking patterns.
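Cosine similarity itself is a few lines of arithmetic. A sketch with 3-dimensional toy vectors (the demo's real vectors have 384 dimensions, but the formula is identical):

```javascript
// Cosine similarity: dot product divided by the product of the norms.
// 1 = same direction, 0 = orthogonal, -1 = opposite.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

console.log(cosineSimilarity([1, 0, 0], [1, 0, 0])); // → 1 (same direction)
console.log(cosineSimilarity([1, 0, 0], [0, 1, 0])); // → 0 (orthogonal)
```

Dividing by the norms is what makes the measure length-independent: only the direction of the vectors matters, which is why it suits comparing sentence embeddings of different "magnitudes".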

§5 — Semantic Search

Embeddings make it possible to search by meaning rather than by matching keywords. The corpus here is 40 one-sentence descriptions of AI/ML concepts — the same ideas demonstrated in §1–§4. The index is built once with the same all-MiniLM-L6-v2 model from §4 and stored in IndexedDB so subsequent visits skip the embedding step entirely.
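Once the index exists, search is just "embed the query, score every passage, sort". A sketch with tiny made-up vectors standing in for the model's real embeddings (passage texts and vector values are illustrative):

```javascript
// Rank an embedded corpus by cosine similarity to a query vector.
function cosine(a, b) {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

function search(queryVec, index, topK = 2) {
  return index
    .map(({ text, vec }) => ({ text, score: cosine(queryVec, vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Toy index: two passages with hypothetical 3-d embeddings.
const corpus = [
  { text: "Tokenization splits text into units", vec: [0.9, 0.1, 0.0] },
  { text: "Embeddings map text to vectors",      vec: [0.1, 0.9, 0.2] },
];

// A query vector pointing "toward" the first passage wins.
console.log(search([0.8, 0.2, 0.1], corpus, 1)[0].text);
```

No keyword ever has to match: a query and a passage score high whenever the embedding model maps them to nearby directions.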

Try queries that don't share words with any passage: "breaking text into pieces", "making output more creative", "how the model focuses on relevant words", "preventing wrong answers".

§6 — Retrieval-Augmented Generation

RAG combines §5 and §3 into one pipeline: embed a query, retrieve the most relevant passages from a corpus, prepend them to the prompt, then generate. The retrieved context steers GPT-2 toward topic-relevant continuations — even though GPT-2 isn't instruction-tuned and won't produce clean "answers". The demo makes each step visible: which passages were retrieved, exactly what string is fed to the model, and how the output shifts when context is present versus absent.
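The "prepend them to the prompt" step is plain string assembly. A minimal sketch; the prompt shape below is an assumption for illustration, not necessarily the exact string the demo builds:

```javascript
// RAG prompt assembly: retrieved passages become a context header
// placed in front of the user's query before generation.
function buildRagPrompt(passages, query) {
  const context = passages.map((p) => "- " + p).join("\n");
  return `Context:\n${context}\n\nQuestion: ${query}\nAnswer:`;
}

const prompt = buildRagPrompt(
  ["Attention weighs how much each token attends to the others."],
  "how attention works"
);
console.log(prompt);
```

The model never "knows" retrieval happened — it just continues a longer prompt, which is why even a non-instruction-tuned model like GPT-2 drifts toward the retrieved topic.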

Try "how attention works", "training a language model", "searching by meaning", or "why models make things up".

§7 — Memory

An agent's "memory" is a writable version of the retrieval index from §5–§6. Each fact is embedded once and stored in IndexedDB — the same technique used to build the static corpus in §6, but now you control what goes in. When you query the agent in the next section, these memories will be retrieved by similarity and injected into the prompt as context, closing the loop from storage to generation.

Add a few facts about yourself or anything you want the agent to remember: "I prefer concise explanations", "I am a machine learning engineer", "My favorite programming language is JavaScript".
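The write path is small: embed once at storage time, keep text and vector together. A sketch using a plain array as a stand-in for IndexedDB so the idea is visible outside a browser; `embed` is a slot where the demo would pass its all-MiniLM-L6-v2 pipeline, and the keyword-based `toyEmbed` below is purely illustrative:

```javascript
// Writable memory store: each fact is embedded once, at write time.
function createMemoryStore(embed) {
  const memories = [];
  return {
    add(text) { memories.push({ text, vec: embed(text) }); },
    all() { return memories.slice(); },
  };
}

// Toy embedding: presence of a few keywords (NOT a real model).
const toyEmbed = (t) => ["concise", "engineer", "javascript"]
  .map((w) => (t.toLowerCase().includes(w) ? 1 : 0));

const store = createMemoryStore(toyEmbed);
store.add("I prefer concise explanations");
store.add("I am a machine learning engineer");
console.log(store.all().length); // → 2
```

Embedding at write time is the design choice that makes later queries cheap: retrieval only needs to embed the query, never the stored facts again.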

§8 — Memory-Grounded Generation

The memories stored in §7 are put to work here. When you send a message it is embedded with the same model that stored the memories, and the most similar memories are retrieved by cosine similarity — exactly like §5 semantic search, but over your personal store. Those memories are then substituted into a prompt template before GPT-2 generates a continuation. This is the core loop behind most memory-enabled agents: embed → retrieve → inject → generate.

[[memories]] is replaced by the top-k retrieved memories as a comma-separated list.
[[message]] is replaced verbatim by your message below.
Both placeholders are optional — leave one out to skip that step.
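The substitution itself can be sketched in a few lines (the template text below is an example, not the demo's exact default; `replaceAll` is a no-op when a placeholder is absent, which is what makes both optional):

```javascript
// Fill [[memories]] and [[message]] placeholders in a prompt template.
function fillTemplate(template, memories, message) {
  return template
    .replaceAll("[[memories]]", memories.join(", "))
    .replaceAll("[[message]]", message);
}

const template = "Facts about the user: [[memories]]\nUser says: [[message]]\n";
console.log(fillTemplate(
  template,
  ["I prefer concise explanations", "I like JavaScript"],
  "Explain embeddings"
));
```

The filled string is what GPT-2 actually sees — the "memory" is nothing more than retrieved text spliced into the prompt.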
