Curiosity - Abbreviation Detection

Curiosity Workspace automatically learns the abbreviations that appear in your content. When a document writes out a term in full and then puts its short form in parentheses, the workspace records the pair so that a search for the short form can match the long form and vice versa.

… measured the level of Adenosine Triphosphate (ATP) in …

From that one sentence the workspace learns ATP = "Adenosine Triphosphate" and stores it on the graph. The detection is unsupervised: there is no built-in dictionary of abbreviations. Everything is mined from the structure of your own text, so domain-specific abbreviations (an internal product code, a team acronym, a chemical name) are picked up the same way the common ones are.

This page explains the algorithm, the heuristics that keep it precise, and how an admin reviews and overrides what it learns. For where this sits among the other text-processing capabilities, see the NLP overview.

What it produces

Each learned abbreviation becomes an _Abbreviation node on the graph with three parts:

| Field | Meaning |
| --- | --- |
| `Description` | The expanded long form, normalized (e.g. `Adenosine Triphosphate`). |
| `AbbreviatedAs` | The set of short forms seen for that description (e.g. `{ ATP }`). |
| `Context` | The surrounding content words, counted, used to disambiguate ambiguous forms. |

These nodes feed search and entity linking: the same query expansion that lets USA find "United States" is driven by what abbreviation detection has learned from the corpus.
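As a rough mental model, the three fields and the two-way lookup they enable can be sketched as below. This is illustrative only: `AbbreviationNode` and `expand_query` are hypothetical names, not the workspace's graph schema or search API.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AbbreviationNode:
    """Illustrative stand-in for an _Abbreviation node (not the real schema)."""
    description: str                                       # Description
    abbreviated_as: set = field(default_factory=set)       # AbbreviatedAs
    context: Counter = field(default_factory=Counter)      # Context


def expand_query(term: str, nodes: list) -> set:
    """Return the term plus every long or short form linked to it."""
    expanded = {term}
    for node in nodes:
        if term in node.abbreviated_as:
            # Short form typed: add the long form.
            expanded.add(node.description)
        elif term.lower() == node.description.lower():
            # Long form typed: add every known short form.
            expanded |= node.abbreviated_as
    return expanded


nodes = [AbbreviationNode("Adenosine Triphosphate", {"ATP"})]
print(expand_query("ATP", nodes))
```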

The core algorithm

Detection is a variant of the Schwartz and Hearst algorithm, the classic approach for mining long form (SHORT FORM) definitions out of running text. The implementation lives in the open-source Catalyst NLP library (AbbreviationCapturer), and Curiosity Workspace wraps it with the graph storage and review layer described further down.

The whole flow runs per sentence:

flowchart TD
    A["Find a parenthetical: ( 1–5 letter/digit tokens )"] --> B[Pick the inner token with the most capitals]
    B --> C{Cheap reject? too long / all-lowercase / common word / symbol}
    C -->|reject| X[Skip]
    C -->|keep| D{Looks like an abbreviation? ≥ 2 capitals AND capitals > 50% of length}
    D -->|no| X
    D -->|yes| E[Walk backwards up to 5 tokens per capital letter]
    E --> F{Do the capital letters appear in order in the preceding words?}
    F -->|no| X
    F -->|yes| G{Final sanity checks pass?}
    G -->|no| X
    G -->|yes| H[Record long form ↔ short form]

1. Find a candidate in parentheses

The detector scans for an opening parenthesis followed by one to five letter-or-digit tokens and a closing parenthesis. Nested parentheses break the match. If several tokens sit inside the parentheses, it keeps the one with the most uppercase letters, on the assumption that the most-capitalized token is the abbreviation (so ( ATP, mM ) resolves to ATP).
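A minimal sketch of this step, assuming a "token" is simply a run of letters or digits and using regular expressions in place of Catalyst's tokenizer. `find_candidates` and both patterns are hypothetical names introduced for illustration.

```python
import re

# Assumption: the inner text may not contain another bracket, so the outer
# pair of a nested parenthetical never matches.
PAREN = re.compile(r"\(([^()\[\]]*)\)")
# Assumption: a token is a run of letters or digits (underscore excluded).
TOKEN = re.compile(r"[^\W_]+")


def find_candidates(sentence: str) -> list:
    """Return (offset of '(', chosen inner token) for each usable parenthetical."""
    found = []
    for match in PAREN.finditer(sentence):
        tokens = TOKEN.findall(match.group(1))
        if 1 <= len(tokens) <= 5:
            # Keep the token with the most uppercase letters; the first wins ties.
            best = max(tokens, key=lambda t: sum(c.isupper() for c in t))
            found.append((match.start(), best))
    return found


print(find_candidates("buffer ( ATP, mM ) added"))
```

The returned offset marks where the backward walk for the long form would start.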

2. Reject obvious non-abbreviations

The candidate is discarded immediately if it is:

longer than 100 characters,
entirely lower-case,
a common word (a built-in list of short function words such as prepositions, conjunctions and question words: is, and, for, with, which, …), or
not made of letters or digits (a stray symbol).
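The four rejects amount to one cheap predicate. The sketch below assumes a tiny stop list holding only the words quoted above (the built-in list is longer), and `cheap_reject` is a hypothetical name.

```python
# Assumption: only the words quoted in the text; the real list is longer.
COMMON_WORDS = {"is", "and", "for", "with", "which"}
MAX_LENGTH = 100


def cheap_reject(candidate: str) -> bool:
    """Return True when the candidate should be discarded immediately."""
    return (
        len(candidate) > MAX_LENGTH               # too long
        or candidate.islower()                    # entirely lower-case
        or candidate.lower() in COMMON_WORDS      # common word
        or not candidate.isalnum()                # stray symbol
    )


for word in ["ATP", "about", "With", "%"]:
    print(word, cheap_reject(word))
```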

3. Require it to look like an abbreviation

The detector extracts the uppercase letters of the candidate and keeps going only if:

there are at least 2 of them, and
they make up more than 50% of the token's length.

So ATP (3 of 3 capitals) and mRNA (3 of 4) qualify, but about (0 capitals) and Mr (1 capital, and only 50% of its length) do not. This is what lets the algorithm accept abbreviations that carry a few lower-case letters while still rejecting ordinary words that merely happen to sit in parentheses.
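The two conditions translate directly into a short test. In this sketch `looks_like_abbreviation` is a hypothetical name, and "more than 50%" is checked with integer arithmetic to avoid rounding.

```python
def looks_like_abbreviation(token: str) -> bool:
    """At least 2 capitals, and capitals are more than half of the token."""
    capitals = sum(1 for c in token if c.isupper())
    return capitals >= 2 and capitals * 2 > len(token)


for token in ["ATP", "mRNA", "about", "Mr"]:
    print(token, looks_like_abbreviation(token))
```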

4. Walk backwards to find the long form

Starting just before the opening parenthesis, the detector walks backwards through the preceding tokens, up to five tokens for every uppercase letter in the abbreviation (so 15 tokens for a form with three capitals, such as ATP or mRNA). It stops early if it hits another bracket or parenthesis. As it goes, it marks which of the abbreviation's letters it has seen in the preceding text.

Once every letter has been seen, the span from that point up to the opening parenthesis is the candidate long form, and it must pass the order check below.
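The walk can be sketched as follows. How a letter gets "marked" is an assumption here: a letter counts as seen when a token contains it (ignoring case), and, borrowing the Schwartz and Hearst convention, the first letter must begin a token. Catalyst's actual bookkeeping may differ, and `backtrack` is a hypothetical name.

```python
from typing import Optional

BRACKETS = set("()[]{}")


def backtrack(tokens: list, paren_index: int, abbreviation: str,
              tokens_per_capital: int = 5) -> Optional[list]:
    """Walk left from the '(' token; return the candidate long-form tokens."""
    capitals = [c.lower() for c in abbreviation if c.isupper()]
    if not capitals:
        return None
    first, rest = capitals[0], set(capitals[1:])
    found_first, found_rest = False, set()
    limit = tokens_per_capital * len(capitals)   # e.g. 15 for ATP
    start = paren_index
    while start > 0 and paren_index - start < limit:
        token = tokens[start - 1]
        if token in BRACKETS:                    # another bracket ends the walk
            break
        start -= 1
        lowered = token.lower()
        found_rest |= rest & set(lowered)
        found_first = found_first or lowered.startswith(first)
        if found_first and found_rest == rest:
            return tokens[start:paren_index]     # span up to the '('
    return None


tokens = "The cell stores energy as Adenosine Triphosphate ( ATP ) .".split()
print(backtrack(tokens, tokens.index("("), "ATP"))
```

On the sample sentence the walk stops at Adenosine, because Triphosphate alone does not supply a word-initial A.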

5. Confirm the match in order

Two checks decide whether the candidate long form is accepted:

Subsequence in order. The abbreviation's capital letters must appear left-to-right, in sequence within the long form. For ATP against "Adenosine Triphosphate", A → T → P must occur in that order. This is the heart of Schwartz and Hearst, and it is what prevents accidental matches where the right letters are present but scrambled.
Capital coverage. Every uppercase letter in the long form must be accounted for by the abbreviation. A long form with a stray capital the abbreviation does not cover is rejected, which keeps the short form aligned to the significant (capitalized) words of the definition.
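Both checks fit in a few lines. This sketch assumes the subsequence comparison ignores case (the P of ATP is matched by the lower-case p in "Triphosphate"); the function names are hypothetical.

```python
def is_subsequence(letters: str, text: str) -> bool:
    """True if `letters` occur in `text` left to right (case-insensitive)."""
    remaining = iter(text.lower())
    # `in` consumes the iterator, so each letter must follow the previous match.
    return all(letter in remaining for letter in letters.lower())


def capitals_covered(long_form: str, abbreviation: str) -> bool:
    """True if every uppercase letter of the long form is in the abbreviation."""
    return all(c in abbreviation for c in long_form if c.isupper())


def confirm(long_form: str, abbreviation: str) -> bool:
    capitals = "".join(c for c in abbreviation if c.isupper())
    return is_subsequence(capitals, long_form) and capitals_covered(long_form, abbreviation)


print(confirm("Adenosine Triphosphate", "ATP"))
print(confirm("Triphosphate Adenosine", "ATP"))
```

The second call shows the scrambled case: all three letters are present, but not in A, T, P order.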

6. Final sanity checks

Before the pair is recorded, it is dropped if:

the long form is all uppercase (it is itself an acronym, not a definition),
the long form is too short: length < abbreviation length + 5 (an abbreviation should be meaningfully shorter than what it stands for), or
the abbreviation contains the long form (a circular ABB (ABB) capture).
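A sketch of the three checks, taking the plain character length of both strings as the measure. `passes_sanity_checks` is a hypothetical name.

```python
def passes_sanity_checks(long_form: str, abbreviation: str) -> bool:
    """Return False when the pair should be dropped before recording."""
    if long_form.isupper():
        return False          # the "long form" is itself an acronym
    if len(long_form) < len(abbreviation) + 5:
        return False          # not meaningfully longer than the short form
    if long_form in abbreviation:
        return False          # circular capture such as ABB (ABB)
    return True


print(passes_sanity_checks("Adenosine Triphosphate", "ATP"))
print(passes_sanity_checks("At Par", "ATP"))
```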

7. Normalize the long form

The accepted long form is run through a normalizer (GetStandardForm) that collapses runs of whitespace to single spaces, keeps apostrophes inside words (normalizing a typographic ’ to a plain '), and trims the ends. "Adenosine Triphosphate " becomes "Adenosine Triphosphate".
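The effect of the normalizer can be approximated in a few lines. This is not the code of GetStandardForm, only a sketch of the three behaviors listed above; `standard_form` is a hypothetical name.

```python
import re


def standard_form(text: str) -> str:
    """Approximate the described normalization of a long form."""
    text = text.replace("\u2019", "'")   # typographic apostrophe -> plain '
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    return text.strip()                  # trim the ends


print(standard_form("Adenosine   Triphosphate "))
```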

Worked example

The cell stores energy as Adenosine Triphosphate (ATP).

Parenthetical found: (ATP). Single inner token, ATP.
Cheap rejects: not too long, not all-lowercase, not a common word, all letters. Kept.
Looks like an abbreviation: capitals A T P = 3, which is ≥ 2 and > 50% of length 3. Kept.
Backtrack: up to 15 tokens before (. The words Triphosphate, Adenosine supply T, P and A. All letters seen.
In order: A → T → P is a subsequence of "Adenosine Triphosphate". The only capitals in the long form, A and T, are both covered. Pass.
Sanity: long form is not all-caps, length 22 ≥ 3 + 5, abbreviation does not contain it. Pass.
Normalize: Adenosine Triphosphate.

Result: _Abbreviation { Description = "Adenosine Triphosphate", AbbreviatedAs = { "ATP" } }.

Disambiguation by context

The same short form can mean different things (Dr. is "Doctor" in one document and "Drive" in another). To tell them apart, detection also captures a small bag of context words around each occurrence: the nouns, adjectives and proper nouns within roughly 250 characters on either side, excluding stop words, short tokens, numbers, and already-known entities.

Those words are counted per description and stored in the Context field. At indexing time, when an ambiguous short form needs to be resolved, the workspace compares the surrounding text to each candidate's stored context and picks the description whose context overlaps most. This is a frequency vote, not a model: the description that has historically appeared in the most similar surroundings wins.
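The vote can be sketched as a sum of stored counts. The scoring rule below (add up the stored count of each distinct surrounding word) is one plausible reading of "overlaps most", not a confirmed formula, and `resolve` and the sample counts are made up for illustration.

```python
from collections import Counter


def resolve(surrounding_words: list, candidates: dict) -> str:
    """Pick the description whose stored context best matches the surroundings."""
    words = {w.lower() for w in surrounding_words}

    def score(description: str) -> int:
        context = candidates[description]
        return sum(context[w] for w in words)   # missing words count as 0

    # max() returns the first candidate on a tie, i.e. insertion order.
    return max(candidates, key=score)


candidates = {
    "Doctor": Counter(patient=5, clinic=3),
    "Drive": Counter(street=4, house=2),
}
print(resolve(["patient", "visited", "clinic"], candidates))
```

With no overlapping words every score is zero, so this sketch falls back to the first candidate in the dictionary.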

Reconciling competing long forms

Across a corpus the same abbreviation often shows up with slightly different long forms, usually because tokenization split or merged words differently (United States vs. UnitedStates). The workspace canonicalizes descriptions by stripping whitespace to dedupe them, and when two genuinely different spellings collide it chooses the better one with these preferences:

When both candidates have only "real" words (every word longer than 3 characters), prefer the one with more words, unless the extra split was an erroneous break on a lower-case boundary.
Otherwise prefer the spelling that correctly split on an upper-case boundary (United States) over the one that merged across it (UnitedStates).
When the word counts tie, prefer the spelling whose shortest word is longer.

The losing spelling's short forms are merged into the surviving node so no abbreviation variant is lost.
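A simplified sketch of the reconciliation, under stated assumptions: it keys descriptions by their whitespace-stripped text, keeps the spelling with more words only when its extra splits fall on an upper-case boundary, and breaks word-count ties by the longer shortest word. It does not model the "real words" distinction, and all names are hypothetical.

```python
def split_offsets(description: str) -> set:
    """Offsets in the whitespace-stripped text where a new word starts."""
    offsets, position = set(), 0
    for word in description.split()[:-1]:
        position += len(word)
        offsets.add(position)
    return offsets


def prefer(a: str, b: str) -> str:
    """Choose between two spellings that share the same stripped text."""
    key = "".join(a.split())
    if key != "".join(b.split()):
        raise ValueError("spellings do not collide")
    words_a, words_b = a.split(), b.split()
    if len(words_a) == len(words_b):
        # Tie on word count: the longer shortest word wins.
        return a if min(map(len, words_a)) >= min(map(len, words_b)) else b
    more, fewer = (a, b) if len(words_a) > len(words_b) else (b, a)
    extra = split_offsets(more) - split_offsets(fewer)
    # An extra split is trusted only if the next word starts with a capital.
    return more if all(key[i].isupper() for i in extra) else fewer


def reconcile(entries: list) -> dict:
    """Merge (description, short forms) pairs that collide on the stripped key."""
    merged = {}
    for description, short_forms in entries:
        key = "".join(description.split())
        if key in merged:
            kept, forms = merged[key]
            merged[key] = (prefer(kept, description), forms | set(short_forms))
        else:
            merged[key] = (description, set(short_forms))
    return dict(merged.values())


print(reconcile([("UnitedStates", {"US"}), ("United States", {"USA"})]))
```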

Admin review and overrides

Detection is automatic, but admins stay in control of the result.

Automatic cleanup. A maintenance pass removes low-quality captures that slipped through: descriptions that are all uppercase, shorter than 5 characters, or where the short form equals the description. These are moved to the disallow list rather than silently deleted, so they do not get re-learned on the next pass.
Manual disallow. An admin can disallow a specific abbreviation. Once disallowed, the detector skips that pair on every future document via a shouldSkip filter, so a bad or unwanted abbreviation stays gone even as new content arrives. The action is reversible (undo the disallow to let it be learned again).
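The interplay between cleanup and the disallow list can be sketched like this. The real shouldSkip filter's signature is not shown on this page, so `DisallowList`, `is_low_quality` and `cleanup` are hypothetical stand-ins.

```python
def is_low_quality(description: str, short_form: str) -> bool:
    """The three cleanup conditions described above."""
    return (
        description.isupper()
        or len(description) < 5
        or description == short_form
    )


class DisallowList:
    """Remembers rejected pairs so they are not learned again."""

    def __init__(self):
        self._pairs = set()

    def disallow(self, description: str, short_form: str) -> None:
        self._pairs.add((description, short_form))

    def undo(self, description: str, short_form: str) -> None:
        self._pairs.discard((description, short_form))

    def should_skip(self, description: str, short_form: str) -> bool:
        return (description, short_form) in self._pairs


def cleanup(learned: dict, disallowed: DisallowList) -> dict:
    """Move low-quality pairs to the disallow list and return what survives."""
    kept = {}
    for description, short_forms in learned.items():
        good = set()
        for short_form in short_forms:
            if is_low_quality(description, short_form):
                disallowed.disallow(description, short_form)
            else:
                good.add(short_form)
        if good:
            kept[description] = good
    return kept


disallowed = DisallowList()
learned = {"Adenosine Triphosphate": {"ATP"}, "NATO": {"NA"}, "Unit": {"UN"}}
print(cleanup(learned, disallowed))
```

Because removed pairs land in the disallow list, a detector that consults `should_skip` before recording will not re-learn them on the next pass.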

Limitations and pitfalls

Parentheticals only. Detection learns from the long form (SHORT) pattern. An abbreviation that is never defined that way in your corpus will not be learned automatically; add it as a dictionary spotter alias instead.
Capital-letter driven. The "looks like an abbreviation" test leans on uppercase letters. All-lower-case or punctuation-heavy short forms (such as "e.g." or "etc.") are intentionally not captured as abbreviations.
Sentence-local. The long form and its parenthetical must sit in the same sentence and close together: the backtrack window is bounded at five tokens per capital letter. A definition spread across a sentence boundary will not match.
Ambiguity needs signal. Context disambiguation only works once a short form has been seen enough times in distinguishable surroundings. Rare ambiguous forms may resolve to whichever description was learned first.

Where to go next

Entity Extraction — dictionary, pattern and ML spotters, including using aliases for abbreviations that are never spelled out in text.
NLP Overview — how detection fits with embeddings, extraction, and linking.
Catalyst — the open-source NLP library that implements the core abbreviation capturer.