On a metric for guessing language and charset by content.

Maxime Zakharov

In the paper Cavnar, W. B. and J. M. Trenkle, "N-Gram-Based Text Categorization", Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994, a method for detecting a document's language from its content was proposed. The method is based on counting the frequencies of N-grams (substrings of length at most N) in documents whose language is known, and on the observation that the roughly 300 most frequent N-grams depend strongly on the language.
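As a minimal sketch of the N-gram counting step described above (my own illustration, not code from the paper), one can count every substring of length 1 to N in a document:

```python
from collections import Counter

def ngram_counts(text, n=5):
    """Count every substring of length 1..n in `text`."""
    counts = Counter()
    for length in range(1, n + 1):
        for i in range(len(text) - length + 1):
            counts[text[i:i + length]] += 1
    return counts

# "hello" with n=2 yields the unigrams h, e, l, l, o
# and the bigrams he, el, ll, lo.
profile = ngram_counts("hello", n=2)
```

In practice one would build such a profile per training language from a large corpus, and a separate profile for each document to classify.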

The method's algorithm is:

  1. Compute N-gram frequencies for all training documents whose language is known, and for each document whose language is to be determined.
  2. Among the training documents, find the one whose N-gram statistics are at the minimal distance from the statistics of the tested document.
  3. Assign the document with unknown language the language of the training document selected in step 2.

The distance between statistics is calculated as follows: the N-grams of each document are sorted in order of decreasing frequency, and for every N-gram the difference between its positions in the two lists is computed. The distance between the statistics is the sum of these differences over all N-grams.

The suggested value of N is 5.

A Perl-based implementation: TextCat.

The MnogoSearch search engine, since version 3.2.4, uses a different metric for the distance between the N-gram lists. The distance between statistics is defined by the following information gain function:

           M
          SUM  log^2( Pi / Qi )
          i=1

where Qi is the frequency of the i-th N-gram in the document with unknown language, and Pi is its frequency in the document with known language.

The value of N used is likewise 5.
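The information gain metric above can be sketched like this. It is an illustration under my own assumptions, not MnogoSearch code: raw counts are normalised to relative frequencies, and N-grams absent from either document are skipped to avoid division by zero.

```python
import math

def log2_distance(p_counts, q_counts):
    """Sum of log^2(Pi/Qi) over N-grams present in both documents."""
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    dist = 0.0
    for gram, q in q_counts.items():
        p = p_counts.get(gram)
        if p:  # assumed: skip N-grams absent from the known profile
            dist += math.log((p / p_total) / (q / q_total)) ** 2
    return dist
```

When the two frequency distributions coincide, every ratio Pi/Qi is 1 and the distance is 0; no sorting of the N-gram lists is needed, which matches the remark below.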

Unlike the previous one, this method does not require sorting the N-grams by frequency.

Modified: 19.03.2011 06:45 MSK