How to write Unicode to console?
-
I have found (by myself) an explanation and a solution.
Explanation
The console has default codepage 866 (inherited from DOS times) on my Russian Windows.
The char strings ("Text") in the program are in codepage 1251.
The wchar_t strings (L"Text") in the program are in UTF-16 encoding.
The simplest code, cout << "Text", should not work! It works only by great fortune: the characters 'T', 'e', 'x' and 't' are in the same places in the 866 and 1251 codepages! This explains why cout << "[Russian text]" produces mojibake: the Russian letters are in different places in the 866 and 1251 codepages.
Let us speak about wcout. With its default settings it does not support Unicode! The only thing it does is convert all Unicode characters to 7-bit ASCII characters. If wcout encounters a character with a code > 127, its behaviour is "implementation dependent". In my case wcout becomes unusable after that. Even if wcout could support Unicode, it would not work, since the console can't support it!
Since cout can support character codes 0..255, and wcout can only support 0..127, wcout is almost useless and should never be used!
Now the Solution (Microsoft specific :)).
I have found the great WinAPI function CharToOemA and its counterpart CharToOemW. What do they do?
CharToOemA converts a string of char from the default program codepage to the default console codepage. If I compile my program with char s[] = "[Russian text]" in it, then the program codepage will be 1251 (stored in the executable file resources), and [Russian text] will be stored in this codepage. Now I can convert and output it: CharToOemA(s, buf); cout<<buf<<endl;. This will output Russian text on every system that has Russian letters in its console codepage (not only 866)! If a system has no Russian symbols in the console codepage, then the symbols will become question marks (?). At least the user will know that some text was meant to be output, but that the console codepage has no support for it.
Since my default system codepage is 1251, I cannot use this: char s[] = "αβ". I get a compile warning: "warning C4566: character represented by universal-character-name '\u03B1' cannot be represented in the current code page (1251)". How to fix it? Read on!
CharToOemW is more interesting. It always treats the input encoding as UTF-16, the same as in L"Text here". Thus, it does not depend on the system default codepage (1251 in my case). So I can do it this way: wchar_t s[] = L"αβ"; CharToOemW(s, buf); cout<<buf<<endl;. Notice that I use cout here, not wcout. Now everyone who runs this program and has Greek letters in their console codepage will get Greek characters in the console! If the codepage has no Greek characters, then underscores (_) are output instead of these letters.
I have overloaded the minus operator (the << operator is already defined :() to do all possible conversions automatically for me:
#include <tchar.h>
#include <windows.h>
#include <iostream>
#include <string>
#include <sstream>
#include <cstring>
#include <cwchar>   // std::wcslen
#include <vector>

#ifdef _UNICODE
#define tString std::wstring
#define tStringStream std::wstringstream
#else
#define tString std::string
#define tStringStream std::stringstream
#endif

// The output stream is always a char stream (not a wchar_t stream).

// UTF-16 -> console (OEM) codepage
std::ostream &operator-(std::ostream &stream, wchar_t const *const s)
{
    // Not more than one char per wchar_t plus the zero terminator
    // (true for single-byte console codepages such as 866).
    std::vector<char> buf(std::wcslen(s) + 1);
    CharToOemW(s, &buf[0]);
    return stream << &buf[0];
}

std::ostream &operator-(std::ostream &stream, std::wstring const &s)
{
    return stream - s.c_str();
}

// Program codepage -> console (OEM) codepage
std::ostream &operator-(std::ostream &stream, char const *const s)
{
    std::vector<char> buf(std::strlen(s) + 1);
    CharToOemA(s, &buf[0]);
    return stream << &buf[0];
}

std::ostream &operator-(std::ostream &stream, std::string const &s)
{
    return stream - s.c_str();
}

// Other types
template<class C>
std::ostream &operator-(std::ostream &stream, C const &c)
{
    tStringStream s;
    s << c;                   // First convert to string (or wstring, depending on _UNICODE)
    return stream - s.str();  // Output the string with conversion
}
Now I can write as follows:
cout - "[Russian Text]" << endl;      //1251 -> 866 conversion here
cout - L"[Russian Text]" << endl;     //UTF-16 -> 866 conversion here
cout << 10 << endl;                   //Wrong. The symbols '0' and '1' can be in different places in the program codepage and in the console codepage.
cout - 10 << endl;                    //Correct.
cout - _T("[Russian text]") << endl;  //1251 -> 866 or UTF-16 -> 866 conversion depending on the _UNICODE project setting
The above code works great both with _UNICODE defined and with _UNICODE undefined.
The following code works only with _UNICODE defined:
cout - _T("αβ") << endl;
As it should be on my system.
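For anyone who wants to see what the 1251 -> 866 step amounts to without a Windows machine, here is a minimal portable sketch. It is not how CharToOemA is implemented; it only remaps ASCII and the Russian alphabet, using the letter positions of the two codepages (1251: 0xC0..0xFF plus 0xA8/0xB8 for Yo; 866: 0x80..0xAF and 0xE0..0xEF plus 0xF0/0xF1). The function names cp1251_byte_to_cp866 and cp1251_to_cp866 are made up for this illustration, and replacing every other byte above 127 with '?' is a simplification.

```cpp
#include <string>

// Hypothetical helper (not a WinAPI function): map one CP1251 byte to CP866.
// Only ASCII and the Russian alphabet are handled; anything else becomes '?'.
unsigned char cp1251_byte_to_cp866(unsigned char c)
{
    if (c < 0x80)
        return c;                                   // ASCII is identical in both codepages
    if (c >= 0xC0 && c <= 0xEF)
        return static_cast<unsigned char>(c - 0x40); // 0xC0..0xEF -> 0x80..0xAF
    if (c >= 0xF0)
        return static_cast<unsigned char>(c - 0x10); // 0xF0..0xFF -> 0xE0..0xEF
    if (c == 0xA8)
        return 0xF0;                                // capital Yo
    if (c == 0xB8)
        return 0xF1;                                // small yo
    return '?';                                     // simplification: no mapping attempted
}

// Hypothetical helper: convert a whole CP1251 string byte by byte.
std::string cp1251_to_cp866(const std::string &text)
{
    std::string out;
    out.reserve(text.size());
    for (std::string::size_type i = 0; i < text.size(); ++i)
        out += static_cast<char>(
            cp1251_byte_to_cp866(static_cast<unsigned char>(text[i])));
    return out;
}
```

The shifts show why plain cout << "[Russian text]" turns into mojibake: the console interprets each unconverted 1251 byte at its 866 position, which holds a different character, while "Text" survives because bytes below 0x80 are not moved at all.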
Does anybody know how to make my own cout object so that I can write << instead of -?
P.S.: Why can't I write Russian text in the forum? This is unfair. For example, in most Russian forums visitors can write in English (and in German too).
-
I have noticed the LaTeX tag...
What am I doing wrong?
-
AFAIK it is broken at the moment.
-
Hey, this is really cool, and CharToOem seems to be really genius! Thank you for this little article, SAn. To be honest, I don't really like the minus workaround for cout, but I can't think of a better solution myself (maybe plain functions for output?).
To create your own stream object, AFAIK you should derive from basic_ostream<charT,traits>. Maybe you could take a look here.
-
Where can I find a full description of facets (std::codecvt)?