google-research

Форк
0

README

Implementation of Penn & Choma's correlation measure.

The script run.sh gives examples of usage. The results of one run of that script, averaged over the 5 runs for each case are as follows:

Summed absolute correlations: Document is 1 chapter Document is 6 chapters Chinese 5828 14859 3-gram English 6386 15116 Korean 8508 18200

Mean number of characters per document: Document is 1 chapter Document is 6 chapters Chinese 782 4674 3-gram English 1110 6635 Korean 1118 6686

Korean has higher overall numbers but it also has fewer distinct characters, meaning that more characters have a better chance of cooccurring in any document:

Number of distinct characters:

Chinese 3177 3-gram English 3194 Korean 1249

That is, 3-gram English and Chinese are pretty well matched for character types, whereas Korean has about a third, yielding the result we see.

Between document size and the number of characters, all of Penn & Choma's differences can be explained.

Использование cookies

Мы используем файлы cookie в соответствии с Политикой конфиденциальности и Политикой использования cookies.

Нажимая кнопку «Принимаю», Вы даете АО «СберТех» согласие на обработку Ваших персональных данных в целях совершенствования нашего веб-сайта и Сервиса GitVerse, а также повышения удобства их использования.

Запретить использование cookies Вы можете самостоятельно в настройках Вашего браузера.