Jul 2011
uchardet is a C language binding of Mozilla's original C++ universal charset detection library. It is an encoding detector: it takes a sequence of bytes in an unknown character encoding, with no additional information, and attempts to determine the encoding of the text.
When I was developing the graphical user interface of OpenCC, I looked for a library that could guess the encoding of plain text. I found Mozilla universalchardet, which is part of Firefox and SeaMonkey and is used to detect the encodings of web pages. Unfortunately, it is written in C++ and cannot be used directly from C. Interestingly, there are many ports to other languages:
- python-chardet Python port
- ruby-rchardet Ruby port
- juniversalchardet Java port of universalchardet
- jchardet Java port of chardet
- nuniversalchardet C# port of universalchardet
- nchardet C# port of chardet
However, there was no C version among these ports. So I did the packaging work, separated the code from Mozilla, and published it as a stand-alone library (libuchardet). libuchardet has since been accepted into the Debian package system.
Last modified on 2017-05-17