A mixed Huffman codebook that combines character-level coding with explicit entries for the 300 most frequent English words (which cover roughly 65% of written English) achieves an average compression ratio of 52% across 4,825 sentences of 4–15 words, 7 percentage points better than a character-only alphabet. This directly increases the number of covert bits available per output word.
From 2016-safaka-matryoshka — Matryoshka: Hiding Secret Communication in Plain Sight, §3.1, Free and Open Communications on the Internet, 2016.
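The mixed-codebook idea can be sketched as follows: build one Huffman code over an alphabet containing both single characters and whole-word entries, tokenize a sentence greedily (whole-word symbol when available, characters otherwise), and compare the encoded length against a character-only code. The `TOP_WORDS` set and training corpus here are tiny illustrative stand-ins for the paper's 300-word list and real training text, not the authors' actual data.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(freqs):
    """Build a prefix-free Huffman code {symbol: bitstring} from symbol counts."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(f, next(tie), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def tokenize(text, word_entries):
    """Emit a whole-word symbol when the word has a codebook entry, else characters."""
    symbols = []
    for tok in text.split(" "):
        symbols.extend([tok] if tok in word_entries else tok)
        symbols.append(" ")
    return symbols[:-1]  # drop the trailing space

# Toy stand-ins for the 300-word list and the training corpus (illustrative only).
TOP_WORDS = {"the", "of", "lazy", "dog"}
CORPUS = ("the quick brown fox jumps over the lazy dog "
          "the dog sleeps in the sun of the lazy afternoon")

mixed = huffman_code(Counter(tokenize(CORPUS, TOP_WORDS)))
char_only = huffman_code(Counter(CORPUS))

sentence = "the lazy dog sleeps"
mixed_bits = sum(len(mixed[s]) for s in tokenize(sentence, TOP_WORDS))
char_bits = sum(len(char_only[c]) for c in sentence)
print(f"mixed: {mixed_bits} bits, char-only: {char_bits} bits")
```

Even on this toy corpus the mixed alphabet wins, because frequent words collapse to one short symbol instead of several per-character codes; the gap widens as the word list grows toward the paper's 300 entries.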
Implications
Include the top ~300 domain-specific words for the expected cover corpus (e.g., sports, finance) in the Huffman codebook to maximize compression and shorten the stegotext; shorter stegotext in turn reduces detectability through length inflation.
Apply topic-specific corpus shaping at both the codebook and the word-bin level to produce stegotexts that are thematically coherent rather than generically English, defeating topic-model-based steganalysis.
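One simple way to derive such a domain word list is a frequency count over a sample of the expected cover corpus. The `top_words` helper and the sports-flavored corpus below are illustrative assumptions, not part of the paper.

```python
import re
from collections import Counter

def top_words(corpus, n=300):
    """Return the n most frequent words, to seed a domain-tuned Huffman codebook."""
    words = re.findall(r"[a-z']+", corpus.lower())
    return [w for w, _ in Counter(words).most_common(n)]

# Illustrative sports-flavored sample; real use would ingest the expected cover corpus.
sports = ("the striker scored twice and the keeper saved the penalty "
          "as the match ended with the home side on top")
print(top_words(sports, n=5))
```

The resulting list replaces (or augments) the generic English top-300 list when building the codebook, so both the compression gain and the generated word bins stay thematically consistent with the cover topic.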