A mixed Huffman codebook that combines character-level coding with explicit entries for the 300 most frequent English words (which cover roughly 65% of written English) achieves an average compression ratio of 52% across 4,825 sentences of 4–15 words, 7 percentage points better than a character-only alphabet. This directly increases the number of covert bits available per output word.
From 2016-safaka-matryoshka — Matryoshka: Hiding Secret Communication in Plain Sight, §3.1, Free and Open Communications on the Internet, 2016.
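The mixed-codebook idea can be sketched as follows: build one Huffman code over an alphabet containing both single characters and whole-word entries, tokenize a sentence greedily (whole-word symbol when available, characters otherwise), and compare the encoded length against a character-only code. The `TOP_WORDS` set and training corpus here are tiny illustrative stand-ins for the paper's 300-word list and real training text, not the authors' actual data.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(freqs):
    """Build a prefix-free Huffman code {symbol: bitstring} from symbol counts."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(f, next(tie), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def tokenize(text, word_entries):
    """Emit a whole-word symbol when the word has a codebook entry, else characters."""
    symbols = []
    for tok in text.split(" "):
        symbols.extend([tok] if tok in word_entries else tok)
        symbols.append(" ")
    return symbols[:-1]  # drop the trailing space

# Toy stand-ins for the 300-word list and the training corpus (illustrative only).
TOP_WORDS = {"the", "of", "lazy", "dog"}
CORPUS = ("the quick brown fox jumps over the lazy dog "
          "the dog sleeps in the sun of the lazy afternoon")

mixed = huffman_code(Counter(tokenize(CORPUS, TOP_WORDS)))
char_only = huffman_code(Counter(CORPUS))

sentence = "the lazy dog sleeps"
mixed_bits = sum(len(mixed[s]) for s in tokenize(sentence, TOP_WORDS))
char_bits = sum(len(char_only[c]) for c in sentence)
print(f"mixed: {mixed_bits} bits, char-only: {char_bits} bits")
```

Even on this toy corpus the mixed alphabet wins, because frequent words collapse to one short symbol instead of several per-character codes; the gap widens as the word list grows toward the paper's 300 entries.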
Implications
Include the top ~300 domain-specific words for the expected cover corpus (e.g., sports, finance) in the Huffman codebook to maximize compression and shorten the stegotext; shorter stegotext in turn reduces detectability through length inflation.
Apply topic-specific corpus shaping at both the codebook and the word-bin level to produce stegotexts that are thematically coherent rather than generically English, defeating topic-model-based steganalysis.
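One simple way to derive such a domain word list is a frequency count over a sample of the expected cover corpus. The `top_words` helper and the sports-flavored corpus below are illustrative assumptions, not part of the paper.

```python
import re
from collections import Counter

def top_words(corpus, n=300):
    """Return the n most frequent words, to seed a domain-tuned Huffman codebook."""
    words = re.findall(r"[a-z']+", corpus.lower())
    return [w for w, _ in Counter(words).most_common(n)]

# Illustrative sports-flavored sample; real use would ingest the expected cover corpus.
sports = ("the striker scored twice and the keeper saved the penalty "
          "as the match ended with the home side on top")
print(top_words(sports, n=5))
```

The resulting list replaces (or augments) the generic English top-300 list when building the codebook, so both the compression gain and the generated word bins stay thematically consistent with the cover topic.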