chared: Character Encoding Detection with a Known Language


Authors

POMIKÁLEK Jan, SUCHOMEL Vít

Year of publication 2011
Type Article in Proceedings
Conference RASLAN 2011
MU Faculty or unit

Faculty of Informatics

Citation
web https://nlp.fi.muni.cz/raslan/2011/paper16.pdf
Field Informatics
Keywords character encoding; character encoding detection; charset; Unicode
Description chared is a system that detects the character encoding of a text document, provided that the language of the document is known. The system supports a wide range of languages and the most commonly used character encodings. We explain the details of the algorithm, describe the process of creating models for various languages, and present the results of an evaluation on a collection of Web pages.
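To illustrate how knowing the language can help with encoding detection, here is a minimal Python sketch. It is not taken from the paper and is not chared's API: the function names (`byte_trigrams`, `build_models`, `detect_encoding`), the use of byte trigrams, and the simple overlap score are all assumptions made for illustration. The idea is to encode a sample text in the known language into each candidate encoding, record which byte sequences occur, and then pick the encoding whose model best matches the bytes of the unknown document.

```python
from collections import Counter


def byte_trigrams(data: bytes) -> Counter:
    """Count overlapping 3-byte sequences in a byte string."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))


def build_models(sample_text: str, encodings: list) -> dict:
    """Build one trigram model per candidate encoding (hypothetical helper).

    sample_text is text in the known language; encodings that cannot
    represent the sample are skipped.
    """
    models = {}
    for enc in encodings:
        try:
            encoded = sample_text.encode(enc)
        except UnicodeEncodeError:
            continue
        models[enc] = byte_trigrams(encoded)
    return models


def detect_encoding(data: bytes, models: dict) -> str:
    """Return the encoding whose model overlaps most with the input bytes."""
    observed = byte_trigrams(data)

    def score(model: Counter) -> int:
        # Sum of products of trigram counts; unseen trigrams contribute 0.
        return sum(count * model[tri] for tri, count in observed.items())

    return max(models, key=lambda enc: score(models[enc]))


if __name__ == "__main__":
    # Illustrative Czech sample; a real model would use a large corpus.
    sample = "Příliš žluťoučký kůň úpěl ďábelské ódy."
    models = build_models(sample, ["utf-8", "iso-8859-2", "cp1250"])
    print(detect_encoding("žluťoučký kůň".encode("cp1250"), models))
```

With a one-sentence sample the models are far too sparse for real use; the sketch only shows why the language matters: ISO-8859-2 and Windows-1250 differ for Czech in just a few letters (such as š, ť, ž), and a language-specific model is what makes those bytes informative.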