Hierarchically Abstracted rePeat unit of PolYmer — a compact, human-readable, and machine-learnable string representation for polymer repeat units.
HAPPY decomposes monomer SMILES at single bonds (excluding ring systems) into reusable subgroups, each encoded as a single HAPPY token. This enables language-model-based property prediction, generation, and chemical-space exploration.
Real polymer repeat units decomposed into HAPPY representation.
FORGE (Frequency-Oriented Re-Grouping of Entities) iteratively merges frequently co-occurring subgroups, much as byte-pair encoding (BPE) in NLP merges token pairs into subword tokens.
BPE (NLP): Frequent token pairs merge into subword tokens, reducing sequence length while preserving meaning.
FORGE (polymers): Frequent subgroup pairs merge into higher-level chemical motifs, compressing the polymer representation.
Watch how subgroups merge iteratively
Scan all polymer sequences for adjacent subgroup pairs and count frequencies.
Select the most frequently occurring adjacent pair across the entire corpus.
Replace every occurrence of that pair with a single new subgroup token.
Iterate until the desired vocabulary size or compression ratio is reached.
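The four steps above can be sketched as a BPE-style merge loop. This is a minimal illustration rather than the actual FORGE implementation. It assumes each polymer is already a flat list of subgroup tokens (real repeat units may be branched), it uses placeholder token names, and it stands in a fixed `max_merges` budget plus a `min_count` threshold for the vocabulary-size or compression-ratio stopping rule. The function names are hypothetical.

```python
from collections import Counter


def count_adjacent_pairs(sequences):
    """Step 1: count every adjacent token pair across the corpus."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts


def merge_pair(seq, pair, new_token):
    """Step 3: replace each left-to-right occurrence of `pair` with `new_token`."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(new_token)
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def forge_like_merges(sequences, max_merges, min_count=2):
    """Repeat count -> select -> merge until the budget is spent (step 4).

    `max_merges` and `min_count` are illustrative stopping criteria,
    assumed here in place of a target vocabulary size or compression ratio.
    """
    sequences = [list(s) for s in sequences]
    merges = []
    for _ in range(max_merges):
        counts = count_adjacent_pairs(sequences)
        if not counts:
            break
        # Step 2: most frequent pair; ties broken by the pair itself for determinism.
        pair, freq = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_count:
            break
        new_token = pair[0] + "+" + pair[1]  # hypothetical naming scheme
        sequences = [merge_pair(s, pair, new_token) for s in sequences]
        merges.append((pair, new_token))
    return sequences, merges


# Placeholder tokens, not real HAPPY subgroups.
corpus = [["A", "B", "C"], ["A", "B", "D"], ["A", "B", "C", "C"]]
merged_corpus, merges = forge_like_merges(corpus, max_merges=2)
```

In this toy corpus the pair (A, B) occurs three times and is merged first; the pair formed by the new token and C is merged second.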
Each FORGE iteration merges the most frequent subgroup pairs. Explore the new subgroups created at each step.
768 subgroups
62 subgroups
29 subgroups
7 subgroups
2 subgroups
Merged subgroups generated by the FORGE algorithm (88 FORGE subgroups). Each token represents a frequently co-occurring chemical motif.
How HAPPY compresses polymer SMILES — comparing string lengths before and after encoding, with and without FORGE merging.
Each dot = one polymer. X = SMILES length (chars), Y = HAPPY tokens. Blue = Base HAPPY, Purple = FORGE HAPPY.
Ratio = HAPPY tokens / SMILES length. Lower = more compression.
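As a small worked example of the ratio defined above, the sketch below divides a token count by the character length of a SMILES string. The helper name `compression_ratio` and the three placeholder tokens are assumptions made for illustration; they are not the output of a real HAPPY encoder.

```python
def compression_ratio(smiles, happy_tokens):
    """Ratio = number of HAPPY tokens / SMILES length in characters."""
    if not smiles:
        raise ValueError("SMILES string must be non-empty")
    return len(happy_tokens) / len(smiles)


# Polystyrene repeat unit written as a 14-character SMILES string.
smiles = "*CC(*)c1ccccc1"
# Placeholder tokens standing in for an encoder's output (hypothetical).
tokens = ["T1", "T2", "T3"]

ratio = compression_ratio(smiles, tokens)  # 3 / 14
```

A lower value means fewer tokens per SMILES character, which is the sense in which lower means more compression.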
Convert between SMILES and HAPPY representations in both directions.