Charset detection for fun and non-profit…

FreeDB [www.freedb.org] is a free online CD-information database that applications can query over the internet for disc and track titles. The database was built from user-submitted information, and each entry was submitted in whatever default character set the submitting user happened to have. The problem arises when someone with a different default character set retrieves an entry and finds it completely scrambled.
The problem of character sets is not limited to FreeDB; it applies to any system where text from one character set meets another, internet email for example. Unicode was created for this reason: a single character set that aims to encompass all existing written languages.
I wrote a simple application to process entries from the FreeDB database: it attempts to ‘detect’ each entry’s original character set and then converts the entry to Unicode (UTF-8) before writing it out.
First it strips the FreeDB format from the entry, because the format itself is plain US-ASCII and would bias the character set detection:
– Any lines beginning with “#” are skipped
– Only the VALUE part of the NAME=VALUE format is kept
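The two stripping rules above can be sketched in a few lines of Python. This is an illustration only, not the code from my tool: the function name `strip_freedb_format` is made up for this example, and dropping empty values is an extra assumption on my part. It works on raw bytes, since the character set is not yet known at this point.

```python
def strip_freedb_format(raw: bytes) -> bytes:
    """Return only the VALUE parts of a FreeDB entry, one per line."""
    values = []
    for line in raw.splitlines():
        if line.startswith(b"#"):
            continue  # comment lines are skipped entirely
        # NAME is plain ASCII, so splitting at the first "=" is safe
        _name, sep, value = line.partition(b"=")
        if sep and value:  # assumption: empty values carry no signal
            values.append(value)
    return b"\n".join(values)


if __name__ == "__main__":
    sample = b"# xmcd\nDTITLE=Bj\xf6rk / Debut\nTTITLE0=Human Behaviour\nEXTD=\n"
    print(strip_freedb_format(sample))
```

Only the bytes that the submitting user actually typed are left, which is what the detector should be looking at.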
It then passes the remaining text through the character set detection provided by cpdetector.
Out of the 1,865,309 entries in the latest FreeDB database (freedb-complete-20051104.tar.bz2), cpdetector failed to ‘detect’ only 1,106. Below is a breakdown of the detected character sets.
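To give a feel for the detect-then-convert step, here is a deliberately naive stand-in in Python. It is not how cpdetector works: it simply tries strict decoding against a short, hypothetical candidate list (`CANDIDATE_CHARSETS`) and takes the first charset that decodes without error. Real detectors use statistical and structural heuristics and cover far more character sets.

```python
from typing import Optional

# Hypothetical candidate list, ordered from most to least restrictive.
CANDIDATE_CHARSETS = ("ascii", "utf-8", "cp1252")


def guess_charset(data: bytes) -> Optional[str]:
    """Return the first candidate that decodes the bytes strictly, or None."""
    for name in CANDIDATE_CHARSETS:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


def convert_to_utf8(data: bytes) -> Optional[bytes]:
    """Re-encode the bytes as UTF-8, or return None if no candidate fits."""
    charset = guess_charset(data)
    if charset is None:
        return None
    return data.decode(charset).encode("utf-8")


if __name__ == "__main__":
    print(guess_charset(b"Bj\xf6rk"), convert_to_utf8(b"Bj\xf6rk"))
```

The ordering matters in a scheme like this: a permissive single-byte charset accepts almost any byte sequence, so it has to come last or it would swallow everything.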
[code]UTF-8 1007658
US-ASCII 444896
UTF-16LE 31442
UTF-16BE 14899
windows-1252 214906
GB18030 76914
Big5 56379
x-EUC-CN 6778
EUC-KR 5823
Shift_JIS 4042
x-EUC-TW 438
EUC-JP 28
(unknown) 1106[/code]
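To put the counts in proportion, the short Python sketch below turns them into percentage shares. The numbers are copied from the breakdown above; the names `COUNTS` and `shares` are just for this example.

```python
COUNTS = {
    "UTF-8": 1007658, "US-ASCII": 444896, "UTF-16LE": 31442,
    "UTF-16BE": 14899, "windows-1252": 214906, "GB18030": 76914,
    "Big5": 56379, "x-EUC-CN": 6778, "EUC-KR": 5823,
    "Shift_JIS": 4042, "x-EUC-TW": 438, "EUC-JP": 28,
    "(unknown)": 1106,
}


def shares(counts: dict) -> dict:
    """Map each charset to its percentage of all entries."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    # Print the largest groups first
    for name, pct in sorted(shares(COUNTS).items(), key=lambda kv: -kv[1]):
        print(f"{name:<14}{COUNTS[name]:>9}{pct:>8.2f}%")
```

The counts add up to the 1,865,309 entries in the database, so every entry is accounted for, including the undetected ones.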
The entries that failed turned out to be a mix of more obscure character sets such as ‘IBM-866’, ‘windows-1251’, ‘koi8-r’ and ‘x-mac-cyrillic’, which cpdetector can’t detect at the moment.
Over the next couple of posts I plan to release the tool to do the conversion as well as a fully converted copy of the database.
Related Links
Unicode? Character Sets? UTF-what?
SourceForge.net – cpdetector
SourceForge.net – jchardet
Mozilla – Charset Detectors

