The Problem
The problem is simple to understand in the end. It started with a Qt application that imported a text file produced with Notepad on a Windows machine and then exported the same data as a UTF-8 file. In the process, the Euro character (byte 0x80 in the original file) was turned into the byte sequence 0xC280.
This displays as a € sign in a Qt application on Linux, but not on Windows. It also does not show in gedit on Linux, but does work in LibreOffice.
Background
It turns out that the correct UTF-8 byte sequence for the Euro is actually 0xE282AC, not 0xC280. The sequence 0xC280 is the UTF-8 encoding of Unicode character 0080 and, if you look at the definition on the Unicode website, that character is not a printable character at all (it is an unnamed control code in the C1 Controls block), whereas character 20AC is defined as the Euro.
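The two byte sequences can be checked by hand. The following is a minimal sketch in plain C++ (no Qt) of the standard UTF-8 encoding rules; the function name encodeUtf8 is made up for this illustration, and the sketch does not reject surrogate code points.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode a single Unicode code point as UTF-8 bytes.
// Hypothetical helper for illustration; surrogates are not rejected.
std::string encodeUtf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // outside this range there is nothing to encode
    std::string out;
    if (cp < 0x80) {                // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {      // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling encodeUtf8(0x0080) gives the two bytes C2 80, while encodeUtf8(0x20AC) gives the three bytes E2 82 AC, which matches what was seen in the exported file versus what a real Euro should look like.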
The cause of the problem is that the wrong codec was used to load the file in the first place.
A common mistake is to assume that ISO 8859-1 includes the Euro sign. It does not define the Euro character (€) at all, but because this mistake has been made so many times, a lot of applications (including Qt and LibreOffice) display character 0080 as a Euro.
A downside of this helpfulness is that a text file can contain two characters that both look like €, where one is Unicode character 0080 and the other is character 20AC. If you searched for a Euro, you would only find one of them!
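To make the search problem concrete, here is a small sketch in plain C++ that counts how often a given code point occurs in a string of code points. The function name countCodePoint is hypothetical and only exists for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Count how many times one exact code point occurs in the text.
// Characters that merely look the same on screen are not matched.
std::size_t countCodePoint(const std::u32string &text, char32_t cp) {
    return static_cast<std::size_t>(std::count(text.begin(), text.end(), cp));
}
```

For a string holding one character 0080 and one character 20AC, countCodePoint(text, 0x20AC) reports a single match, even though an application that renders both as € would appear to show two Euros.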
The Solution
Codecs that do correctly include the Euro character include Windows-1250, Windows-1252, and ISO 8859-15 (note that ISO 8859-15 places it at 0xA4 rather than 0x80, so it will not help with a file written by Notepad). Qt has a special codec called 'System', which is whatever the system is using (on Windows this should be the same as Notepad). It can be referred to by codecForName("System") or codecForLocale().
In Qt applications, a reasonable approach when loading text files is to try UTF-8 first and, if the decoder reports invalid characters, fall back to the preferred alternative or the default system codec.
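The difference between the codecs comes down to how byte 0x80 is mapped. The sketch below, in plain C++ without Qt, contrasts an ISO 8859-1 style decode (each byte maps to the code point of the same value) with a Windows-1252 decode. The function names are invented for this example, and mapping the five bytes that Windows-1252 leaves unassigned to U+FFFD is an assumption of this sketch, not something every decoder does.

```cpp
#include <cstdint>

// ISO 8859-1 style: every byte maps to the code point with the same value,
// so 0x80 becomes the control character U+0080.
std::uint32_t decodeLatin1(std::uint8_t byte) {
    return byte;
}

// Windows-1252: identical to ISO 8859-1 except for bytes 0x80..0x9F.
// Unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) map to U+FFFD here (assumption).
std::uint32_t decodeWindows1252(std::uint8_t byte) {
    static const std::uint16_t high[32] = {
        0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
        0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
    };
    if (byte >= 0x80 && byte <= 0x9F) {
        return high[byte - 0x80];
    }
    return byte;
}
```

With these, decodeLatin1(0x80) yields 0080 and decodeWindows1252(0x80) yields 20AC, which is exactly the difference between the broken and the correct import.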
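#include <QByteArray>
#include <QDebug>
#include <QFile>
#include <QString>
#include <QTextCodec>

// The following belongs inside a function.
QString text;
QFile file(QString("file.txt"));
if (file.open(QIODevice::ReadOnly | QIODevice::Text)) {
    QByteArray filedata = file.readAll();
    file.close();

    // First attempt: assume the file is UTF-8.
    QTextCodec::ConverterState state;
    QTextCodec *codec = QTextCodec::codecForName("UTF-8");
    text = codec->toUnicode(filedata.constData(), filedata.size(), &state);
    if (state.invalidChars > 0) {
        // Not valid UTF-8: fall back to the system codec, with a fresh state.
        QTextCodec::ConverterState fallbackState;
        codec = QTextCodec::codecForLocale();
        text = codec->toUnicode(filedata.constData(), filedata.size(), &fallbackState);
        if (fallbackState.invalidChars > 0) {
            qDebug() << "Invalid File Format";
        }
    }
}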