Monday, 25 January 2016

Qt QTextCodec and the Euro Character

The Problem

This one is, in the end, simple to understand. It started with a Qt application that imported a text file produced with Notepad on a Windows machine and then exported the same data as a UTF-8 file. In the process, the Euro character (byte 0x80 in the original file) was turned into the byte sequence 0xC280.

Now this displays as a € sign in a Qt application on Linux, but not on Windows.  It also does not show in gedit on Linux, but does work in LibreOffice.


It turns out that the correct UTF-8 byte sequence for the Euro is actually 0xE282AC, not 0xC280. In fact, if you look at the definitions on the Unicode website, character 0080 (which is what 0xC280 encodes) is a control code with no printable symbol assigned, whereas character 20AC is defined as the Euro sign.
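
To make the two byte sequences concrete, here is a minimal sketch in plain C++ (no Qt) that encodes a code point as UTF-8. The function names encodeUtf8 and checkEuroBytes are made up for this illustration, and the encoder only handles code points up to U+FFFF, which is enough for the two characters discussed here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8.
// Assumption: cp <= 0xFFFF and is not a surrogate; that is all this example needs.
std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // One byte: plain ASCII.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Self-check of the two sequences from the text.
void checkEuroBytes() {
    assert(encodeUtf8(0x0080) == "\xC2\x80");      // control code 0080
    assert(encodeUtf8(0x20AC) == "\xE2\x82\xAC");  // the real Euro sign
}
```

So the 0xC280 seen in the exported file is simply what you get when the byte 0x80 has been read as character 0080 and then written out as UTF-8.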

The cause of the problem is that the wrong codec was used to load the file in the first place.

A common mistake is to assume that ISO8859-1 includes the Euro sign. ISO8859-1 does not actually define the Euro character (€) at all, but because this mistake has been made so many times, a lot of applications (including Qt and LibreOffice) display the resulting character as a Euro anyway.

A downside of this helpfulness is that a text file can perfectly well contain two characters that both look like the €, where one is Unicode character 0080 and the other is character 20AC. If you searched for a Euro, you would only find one of them!
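
The search problem can be illustrated with a small sketch in plain C++. The sample text and the names sampleText, countCodePoint and checkSearch are hypothetical; the string is held as Unicode code points so that the two look-alike characters can be told apart.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical sample: the first "Euro" is character 0080 (from a bad import),
// the second is the real Euro sign, character 20AC.
std::u32string sampleText() {
    std::u32string text = U"10";
    text += static_cast<char32_t>(0x0080);
    text += U" and 20";
    text += static_cast<char32_t>(0x20AC);
    return text;
}

// Count how often one specific code point occurs in the text.
std::size_t countCodePoint(const std::u32string &text, char32_t cp) {
    return static_cast<std::size_t>(std::count(text.begin(), text.end(), cp));
}

// Searching for the real Euro finds only one of the two look-alikes.
void checkSearch() {
    assert(countCodePoint(sampleText(), 0x20AC) == 1);
    assert(countCodePoint(sampleText(), 0x0080) == 1);
}
```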

The Solution

Codecs that do correctly include the Euro character include Windows-1250, Windows-1252 and ISO8859-15.  Qt also has a special codec called 'System', which is whatever the system is using (on Windows this should be the same encoding that Notepad uses). It can be obtained with QTextCodec::codecForName("System") or QTextCodec::codecForLocale().
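
As a rough illustration of why the choice of codec matters, the sketch below (plain C++, no Qt) decodes a single byte under the codecs mentioned so far. The enum Codec and the function decodeEuroByte are invented for this example, and only the Euro-related byte positions are handled; it is not a complete decoder for any of these encodings.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical enum: ISO8859-1, ISO8859-15 and the two Windows code pages.
enum class Codec { Iso8859_1, Iso8859_15, Windows1250, Windows1252 };

// Decode one byte to a Unicode code point.
// Only the Euro positions are special-cased; every other byte is passed
// through unchanged, which is only correct for ISO8859-1.
std::uint32_t decodeEuroByte(Codec codec, std::uint8_t byte) {
    switch (codec) {
    case Codec::Windows1250:
    case Codec::Windows1252:
        if (byte == 0x80) return 0x20AC;  // Euro lives at 0x80
        break;
    case Codec::Iso8859_15:
        if (byte == 0xA4) return 0x20AC;  // Euro lives at 0xA4
        break;
    case Codec::Iso8859_1:
        break;                            // no Euro at all
    }
    return byte;
}

// The same byte 0x80 gives a Euro or a control code depending on the codec.
void checkDecoding() {
    assert(decodeEuroByte(Codec::Windows1252, 0x80) == 0x20AC);
    assert(decodeEuroByte(Codec::Iso8859_1, 0x80) == 0x0080);
    assert(decodeEuroByte(Codec::Iso8859_15, 0xA4) == 0x20AC);
}
```

Note that ISO8859-15 keeps the Euro at a different byte (0xA4) than the Windows code pages (0x80), so the codecs listed above are not interchangeable with each other either.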

In Qt applications, when loading text files, it is possible to first assume that the file is UTF-8 and, if that fails, fall back to the preferred alternative or the default system codec:
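QString text ;
QFile file(QString("file.txt")) ;

if (file.open(QIODevice::ReadOnly | QIODevice::Text)) {
  QByteArray byteArray = file.readAll() ;
  file.close() ;
  // Try UTF-8 first; invalidChars counts the byte sequences that failed to decode.
  QTextCodec::ConverterState state ;
  QTextCodec *codec = QTextCodec::codecForName("UTF-8");
  text = codec->toUnicode(byteArray.constData(), byteArray.size(), &state);
  if (state.invalidChars > 0) {
    // Not valid UTF-8: retry with the system codec, using a fresh state
    // so that the error count from the UTF-8 attempt is not carried over.
    QTextCodec::ConverterState fallbackState ;
    codec = QTextCodec::codecForLocale() ;
    text=codec->toUnicode(byteArray.constData(), byteArray.size(), &fallbackState) ;
    if (fallbackState.invalidChars > 0) {
      qDebug() << "Invalid File Format\n" ;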