Three controls, one word
Every codebase has one. “Auth” means authentication in one file, authorization in another, and “the auth team” is whichever of the two is closer to the speaker.
This is fine in code. Everyone has context. It’s not fine in an audit.
SOC 2 CC6.1 covers logical access — authentication. CC6.3 covers role-based permissions — authorization. An auditor asking for “all authorization policy decisions from the last quarter” wants CC6.3 evidence. Not login flow redesigns. Not password reset improvements. Not OAuth provider migrations.
A vector store can’t give them that answer. Not because the embeddings are bad, but because “auth” is the loudest token in the neighborhood. Authn and authz collapse into the same cluster. Everything near either gets pulled in. The auditor gets a pile.
What the literature already knew
Requirements engineering has studied this problem for a decade under a different name: glossary term extraction.
Arora et al. (2017) built REGICE, which extracts candidate glossary terms from requirements documents and clusters them. In their industrial case studies, syntactic similarity (SoftTFIDF, which scores token overlap with soft string matching) outperformed WordNet-based semantic measures on clustering accuracy.
Hasso et al. (2022) went further on abbreviation-expansion pairs — “JWT” ↔ “JSON Web Token”, “authz” ↔ “authorization”. FastText embeddings scored F1 between 0.06 and 0.29. Feature-based matching on initial letters, length, and character order scored 0.90 to 0.94.
Why the gap? Abbreviations are inherently ambiguous. A single embedding vector averages across every expansion the model has seen. “Authz” in a security paper and “authz” in an OAuth provider’s docs produce one confused vector. Feature matching doesn’t care what the word means — it just checks whether the letters line up.
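The feature check is simple enough to sketch. This is an illustrative reconstruction, not the authors' implementation: the function name is hypothetical, and real matchers weight several features rather than applying hard rules.

```python
def is_abbreviation(short: str, long: str) -> bool:
    """Does `short` plausibly abbreviate `long`? Surface features only:
    the abbreviation must be shorter, share the initial letter, and its
    characters must appear in the expansion in the same order."""
    s, l = short.lower(), long.lower()
    if not s or len(s) >= len(l):
        return False
    if s[0] != l[0]:            # initial letters must line up
        return False
    it = iter(l)                # membership test consumes the iterator,
    return all(ch in it for ch in s)  # so this checks order, not just presence

print(is_abbreviation("authz", "authorization"))  # True
print(is_abbreviation("jwt", "json web token"))   # True
```

No embeddings, no training data: the check never asks what either word means, which is exactly why it sidesteps the averaged-vector problem.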
Bhatia et al. (2020) added a coverage metric: the fraction of source documents touched by at least one glossary term. A glossary with poor coverage is incomplete. A term with declining coverage is dying.
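The metric itself is a one-liner over the corpus. A minimal sketch, with naive whitespace tokenization standing in for whatever preprocessing the paper actually used:

```python
def coverage(documents: list[str], glossary: set[str]) -> float:
    """Fraction of documents containing at least one glossary term
    (naive tokenization: lowercase, split on whitespace, strip punctuation)."""
    if not documents:
        return 0.0
    hits = sum(
        1 for doc in documents
        if glossary & {tok.strip(".,;:").lower() for tok in doc.split()}
    )
    return hits / len(documents)

print(coverage(["the authz policy changed", "login flow redesign"], {"authz"}))  # 0.5
```

Tracked per glossary term over time, the same computation yields the "dying term" signal: a term whose per-document hit rate declines quarter over quarter.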
None of the three papers addressed what happens when the glossary itself changes.
What changes in an append-only corpus
Organizations don’t stabilize their vocabulary. They drift.
A company splits “auth” into “authn” and “authz” on the day an auditor asks for SOC 2 CC6.1 evidence. That’s a glossary event. Every decision before that day used “auth” for both. Every decision after distinguishes them.
A retrieval system that treats the corpus as stable loses this. A system that ranks terms by usage over time sees the split — “authn” and “authz” start appearing in March 2025; “auth” stops covering authorization decisions by April.
luplo records these events. The term #authz carries provenance: the decision that introduced it, the canonical expansion, the alternate forms, the date its activation overtook #auth for authorization queries. An auditor asking for CC6.3 evidence gets three rows, not twelve. Each row cites the glossary version it was filed under.
The original literature assumed either that requirements stabilize before a glossary is built, or that a human librarian maintains one in parallel. Both assumptions depended on the same fact: glossary curation was expensive. LLM-assisted curation changed that fact. The method is the same. The regime is new.