google-research
Implementation of Penn & Choma's correlation measure.
The script run.sh gives examples of usage. The results of one run of that script, averaged over the 5 runs for each case are as follows:
Summed absolute correlations: Document is 1 chapter Document is 6 chapters Chinese 5828 14859 3-gram English 6386 15116 Korean 8508 18200
Mean number of characters per document: Document is 1 chapter Document is 6 chapters Chinese 782 4674 3-gram English 1110 6635 Korean 1118 6686
Korean has higher overall numbers but it also has fewer distinct characters, meaning that more characters have a better chance of cooccurring in any document:
Number of distinct characters:
Chinese 3177 3-gram English 3194 Korean 1249
That is, 3-gram English and Chinese are pretty well matched for character types, whereas Korean has about a third, yielding the result we see.
Between document size and the number of characters, all of Penn & Choma's differences can be explained.