In the paper Cavnar, W. B. and J. M. Trenkle, "N-Gram-Based Text Categorization", in Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April 1994, a method for detecting a document's language from its content was proposed. The method is based on counting the frequencies of N-grams (substrings of length at most N) in documents whose language is known, and on the assumption that the roughly 300 most frequently used N-grams depend strongly on the language.
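As an illustration of the counting step, here is a minimal Python sketch. The function names ngram_counts and top_ngrams are hypothetical, and the sketch counts substrings of the raw text without any word splitting or padding, which is a simplifying assumption rather than a description of any particular implementation.

```python
from collections import Counter


def ngram_counts(text, max_n=5):
    """Count every substring of length 1..max_n that occurs in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def top_ngrams(text, max_n=5, limit=300):
    """Return the `limit` most frequent N-grams, most frequent first."""
    counts = ngram_counts(text, max_n)
    # Sort by descending count; break ties alphabetically so the
    # result is deterministic (the tie-breaking rule is an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:limit]]
```

The default limit of 300 mirrors the number of N-grams mentioned above; it is exposed as a parameter so that it can be tuned.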
The algorithm is as follows:
1. Compute N-gram frequencies for every reference document whose language is known, and for the document whose language is to be determined.
2. Among the reference documents, find the one whose N-gram statistics are at the smallest distance from the statistics of the document being tested.
3. Assign to the document with unknown language the language of the reference document selected in step 2.
The distance between statistics is calculated as follows: the N-grams of each document are sorted in order of decreasing frequency, then for every N-gram the difference between its positions in the two sorted lists is calculated. The distance between the statistics is defined as the sum of these differences over all N-grams.
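The steps and the rank-based distance described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names profile, rank_distance and detect_language are hypothetical, and the penalty charged for an N-gram that is absent from the reference list is not specified in the text, so the sketch simply uses the length of the reference list.

```python
from collections import Counter


def profile(text, max_n=5, limit=300):
    """Map each of the most frequent N-grams to its rank (0 = most frequent)."""
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    # Descending frequency; ties broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:limit])}


def rank_distance(unknown, known, missing_penalty=None):
    """Sum of rank differences between two profiles."""
    if missing_penalty is None:
        # Assumption: an N-gram missing from the known profile costs
        # as much as the length of that profile.
        missing_penalty = len(known)
    return sum(
        abs(rank - known[gram]) if gram in known else missing_penalty
        for gram, rank in unknown.items()
    )


def detect_language(text, samples):
    """Return the language key of the sample closest to text.

    `samples` maps a language name to a reference text in that language.
    """
    unknown = profile(text)
    return min(
        samples,
        key=lambda lang: rank_distance(unknown, profile(samples[lang])),
    )
```

In practice the reference profiles would be computed once and stored rather than rebuilt on every call; the sketch recomputes them only to stay short.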
The suggested value of N is 5.
A Perl-based implementation: TextCat.
The mnoGoSearch search engine, since version 3.2.4, uses a different metric for the distance between N-gram lists. The following information gain function is used as the distance between statistics:
$$\sum_{i=1}^{M} \log^{2}\left(\frac{P_i}{Q_i}\right)$$
where $Q_i$ is the frequency of the i-th N-gram in the document with unknown language, and $P_i$ is the frequency of the same N-gram in the document with known language.
The value of N used is 5.
Unlike the previous method, this one does not require sorting the N-grams by frequency.
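A rough Python sketch of this metric is given below, reading the formula as a sum of squared logarithms of Pi/Qi. Several points are assumptions and not taken from mnoGoSearch itself: the function names, the use of relative frequencies, summing over the union of N-grams seen in either document, and the small floor value substituted for an N-gram that is missing from one side (needed to avoid division by zero).

```python
import math
from collections import Counter


def relative_frequencies(text, max_n=5):
    """Relative frequency of every substring of length 1..max_n."""
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def log_ratio_distance(known, unknown, floor=1e-6):
    """Sum over N-grams of log(P_i / Q_i) squared.

    `known` holds the P_i values and `unknown` the Q_i values.
    `floor` is a hypothetical smoothing value for absent N-grams.
    """
    grams = known.keys() | unknown.keys()
    return sum(
        math.log(known.get(g, floor) / unknown.get(g, floor)) ** 2
        for g in grams
    )
```

No sorting appears anywhere in the sketch: the frequencies are looked up directly by N-gram.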