SM² Lab, DGIST — Polymer Informatics

HAPPY Representation

Hierarchically Abstracted rePeat unit of PolYmer — a compact, human-readable, and machine-learnable string representation for polymer repeat units.

HAPPY decomposes monomer SMILES at single bonds (excluding ring systems) into reusable subgroups, each encoded as a single HAPPY token. This enables language-model-based property prediction, generation, and chemical-space exploration.

Example Conversion
SMILES (Polymer Repeat Unit): *Oc1ccc(C(=O)c2ccc(NC(=O)c3cccc(C(=O)Nc4ccc(C(=O)c5ccc(*)cc5)cc4)c3)cc2)cc1
HAPPY Representation: Haa00ab00ac00ab00ad00ac00ae00ac00ad00ab00ac00ab00T
Subgroup legend: aa00 = –O–, ab00 = phenyl, ac00 = C=O
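
The example above suggests that a HAPPY string consists of a leading H, a run of fixed-width subgroup tokens, and a trailing T. The following is a minimal Python sketch for splitting such a string into tokens. It assumes, based only on the examples on this page and not on a published specification, that every token is exactly four characters wide; the function name `split_happy` is hypothetical.

```python
def split_happy(happy: str, width: int = 4) -> list[str]:
    """Split a HAPPY string into subgroup tokens.

    Assumption (inferred from the examples, not a specification):
    the string starts with 'H', ends with 'T', and every token in
    between is `width` characters wide.
    """
    if len(happy) < 2 or not (happy.startswith("H") and happy.endswith("T")):
        raise ValueError("expected a string of the form H<tokens>T")
    body = happy[1:-1]  # drop the leading H and trailing T
    if len(body) % width != 0:
        raise ValueError("token body is not a multiple of the token width")
    return [body[i:i + width] for i in range(0, len(body), width)]


tokens = split_happy("Haa00ab00ac00ab00ad00ac00ae00ac00ad00ab00ac00ab00T")
print(len(tokens), tokens[:3])
```

Under these assumptions the example polymer splits into 12 tokens, which agrees with the token count reported for the same polymer in the FORGE walkthrough.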

Example Conversions

Real polymer repeat units decomposed into the HAPPY representation.

SMILES: *CC(*)c1ccccc1
HAPPY: Haa00ab00ag00T
SMILES 14 tokens, HAPPY 3 tokens

SMILES: *c1ccc2c(c1)Cc1cc(*)ccc1-2
HAPPY: HcZ00T
SMILES 25 tokens, HAPPY 1 token

SMILES: *c1cc2c(s1)-c1sc(*)cc1C2=O
HAPPY: Hjg00T
SMILES 26 tokens, HAPPY 1 token

SMILES: *n1nc(-n2c(=O)c3cc4c(=O)n(*)c(=O)c4cc3c2=O)c2ccccc…
HAPPY: HfQ00cu00T
SMILES 55 tokens, HAPPY 2 tokens

SMILES: *c1cccc2c(-n3c(=O)c4cc5c(=O)n(*)c(=O)c5cc4c3=O)ccc…
HAPPY: Hcr00cu00T
SMILES 53 tokens, HAPPY 2 tokens

SMILES: *Oc1ccc2nc3oc4cc5oc6c7c(cn6ccc5on4cnc3c12)nc1c(*)c…
HAPPY: Hao00lj00T
SMILES 55 tokens, HAPPY 2 tokens

SMILES: *c1ccc(-c2ccc3nc4oc5cc6[nH]c7nc8cc(*)ccc8nc7nc-6[n…
HAPPY: Hat00ll00T
SMILES 62 tokens, HAPPY 2 tokens

SMILES: *c1ccc(-c2ccn3sc4nc5nc6c(cc5oc-4cc-3nc2)Nc2nc3ccc(…
HAPPY: Hat00lk00T
SMILES 63 tokens, HAPPY 2 tokens

Advanced

The FORGE Algorithm

FORGE (Frequency-Oriented Re-Grouping of Entities) iteratively merges frequently co-occurring subgroups, much as byte-pair encoding (BPE) in NLP merges frequent token pairs into subword tokens.

BPE in NLP
chars: l o w e r → low er

Frequent token pairs merge into subword tokens, reducing sequence length while preserving meaning.

FORGE in HAPPY
subgroups: ab00 ac00 → XX00

Frequent subgroup pairs merge into higher-level chemical motifs, compressing polymer representation.

FORGE Step-by-Step

Tracing a real polymer through FORGE (corpus of 10,647 polymers)
SMILES: *Oc1ccc(C(=O)c2ccc(NC(=O)c3cccc(C(=O)Nc4ccc(C(=O)c5ccc(*)cc5)cc4)c3)cc2)cc1
Initial state (step 0 of 3): 12 tokens for this polymer, vocabulary size 768, compression 0%.

1. Count Pairs: Scan all polymer sequences for adjacent subgroup pairs and count their frequencies.
2. Find Top Pair: Select the most frequent adjacent pair across the entire corpus.
3. Merge → New Token: Replace every occurrence of that pair with a single new subgroup token.
4. Repeat: Iterate until the desired vocabulary size or compression ratio is reached.
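
The four steps above can be sketched in Python as a BPE-style merge loop over token sequences. This is an illustrative sketch, not the authors' FORGE implementation: it treats each polymer as a flat list of subgroup tokens, ignores branching and attachment points, merges one pair per step, and names merged tokens with a hypothetical `M000`, `M001`, ... scheme rather than real HAPPY token names.

```python
from collections import Counter


def count_pairs(corpus):
    """Step 1: count adjacent token pairs over all sequences."""
    counts = Counter()
    for seq in corpus:
        counts.update(zip(seq, seq[1:]))
    return counts


def merge_pair(seq, pair, new_token):
    """Step 3: replace each non-overlapping occurrence of `pair`, left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def forge_like_merges(corpus, num_merges):
    """Steps 1-4: repeat count, select, merge for up to `num_merges` steps."""
    corpus = [list(seq) for seq in corpus]
    merges = []
    for step in range(num_merges):
        counts = count_pairs(corpus)
        if not counts:
            break  # no adjacent pairs left to merge
        # Step 2: most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        pair, freq = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        new_token = f"M{step:03d}"  # hypothetical naming scheme
        corpus = [merge_pair(seq, pair, new_token) for seq in corpus]
        merges.append((pair, new_token, freq))
    return corpus, merges


toy_corpus = [
    ["aa00", "ab00", "ac00", "ab00"],
    ["ab00", "ac00", "ad00"],
    ["aa00", "ab00", "ac00"],
]
merged, merges = forge_like_merges(toy_corpus, num_merges=2)
print(merges)
print(merged)
```

The stopping rule here is a fixed number of merges; a target vocabulary size or compression ratio, as in step 4, could be checked inside the loop instead.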

FORGE Details

Iteration Breakdown

Each FORGE iteration merges the most frequent subgroup pairs. The subgroups created at each step are listed below with their most frequent members.

Iteration 0 (Base Library): 768 subgroups
Top subgroups (by frequency): aa00 (3760), ao00 (3757), ac00 (2575), at00 (2442), aq00 (1728), ab00 (1681)

Iteration 1: 62 subgroups
Top subgroups (by frequency): lR00 (1426), lT00 (1026), lV00 (954), mf00 (954), lW01 (786), lT01 (659)

Iteration 2: 29 subgroups
Top subgroups (by frequency): mR00 (485), mX00 (392), mW00 (355), mT00 (334), mX01 (297), nb00 (295)

Iteration 3: 7 subgroups
Top subgroups (by frequency): nv00 (600), nt00 (265), nu00 (262), nw00 (245), ns00 (215), ny00 (200)

Iteration 4: 2 subgroups
Top subgroups (by frequency): nA00 (195), nz00 (172)

Subgroups per stage: Base 768; Iteration 1: 62; Iteration 2: 29; Iteration 3: 7; Iteration 4: 2

FORGE Subgroup Library

Merged subgroups generated by the FORGE algorithm. Each token represents a frequently co-occurring chemical motif. The library contains 88 FORGE subgroups.

lR00  [1*]CC[2*]
lT00  [1*]Oc1ccc([2*])cc1
lV00  [2*]OC([1*])=O
mf00  [1*]C([2*])(F)F
lY00  [1*]CC
md00  [1*]C(F)(F)F
lW00  [1*]NC([2*])=O
mb00  [1*]c1ccc(Oc2ccc([2*])cc2)cc1
lS00  [1*]CCC[2*]
ml00  [1*]OC
mm00  [1*]C([2*])(C)C
lU00  [1*]CO[2*]
mu00  [1*]C([2*])C
mj00  [2*]C(=O)Oc1ccc([1*])cc1
mt00  [1*][Si]([2*])(C)C
mA00  [1*]C(C)(C)C
mz00  [1*]Oc1ccc2c(c1)C(=O)N([2*])C2=O
mv00  [1*]Oc1cccc([2*])c1
mr00  [1*]C(=O)c1ccc2c(c1)C(=O)N([2*])C2=O
mp00  [1*]c1ccc(-c2ccc([2*])cc2)cc1
mn00  [1*]Oc1ccc(C([2*])=O)cc1
mx00  [1*]Oc1ccc(N2C(=O)c3ccc([2*])cc3C2=O)cc1
mc00  [1*]COC([2*])=O
mo00  [1*]c1ccc(N2C(=O)c3ccc([2*])cc3C2=O)cc1
lX00  [1*]CCO[2*]
mE00  [1*][N+](=O)[O-]
mF00  [1*]C(C)C
me00  [1*]CCC([2*])=O
mC00  [1*]c1ccc(Cc2ccc([2*])cc2)cc1
mw01  [1*]NC(=O)O[2*]
mH00  [1*]c1ccc(N=Nc2ccc([2*])cc2)cc1
mB00  [1*]Oc1ccc(-c2ccc([2*])cc2)cc1
ma00  [1*]C(=O)c1ccc([2*])cc1
mQ00  [1*]N([2*])C
mg00  [1*]Nc1ccc([2*])cc1
mJ00  [2*]c1cc(C)c([1*])c(C)c1
mG00  [1*]c1cccc(N2C(=O)c3ccc([2*])cc3C2=O)c1
mk00  [1*]OC(=O)c1ccc([2*])cc1
mh00  [2*]C(=O)Nc1ccc([1*])cc1
my00  [1*]C(=O)c1cccc(C([2*])=O)c1
Showing 40 of 88 subgroups
Analysis

Compression Ratio

How HAPPY compresses polymer SMILES — comparing string lengths before and after encoding, with and without FORGE merging.

SMILES Length vs HAPPY Length

Each dot = one polymer. X = SMILES length (chars), Y = HAPPY tokens. Blue = Base HAPPY, Purple = FORGE HAPPY.

Compression Ratio Distribution

Ratio = HAPPY tokens / SMILES length. Lower = more compression.
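
The ratio defined above can be computed directly from a SMILES string and its HAPPY encoding. The sketch below is a hedged illustration: it takes SMILES length as the character count (as in the scatter-plot description) and assumes, from the examples on this page, that a HAPPY string is H, then four-character tokens, then T. The helper names `happy_token_count` and `compression_ratio` are hypothetical.

```python
def happy_token_count(happy: str, width: int = 4) -> int:
    """Count tokens in a HAPPY string, assuming the form H<fixed-width tokens>T."""
    if len(happy) < 2 or not (happy.startswith("H") and happy.endswith("T")):
        raise ValueError("expected a string of the form H<tokens>T")
    body = happy[1:-1]
    if len(body) % width != 0:
        raise ValueError("token body is not a multiple of the token width")
    return len(body) // width


def compression_ratio(smiles: str, happy: str) -> float:
    """Ratio = HAPPY tokens / SMILES length in characters; lower means more compression."""
    if not smiles:
        raise ValueError("SMILES string must be non-empty")
    return happy_token_count(happy) / len(smiles)


# First card from the example conversions: 3 HAPPY tokens over 14 SMILES characters.
ratio = compression_ratio("*CC(*)c1ccccc1", "Haa00ab00ag00T")
print(f"{ratio:.3f}")
```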

Interactive

Try HAPPY Conversion

Convert between SMILES and HAPPY representations in both directions.
