How to write Unicode to console?
-
I have found that the console uses codepage 866 on my Russian Windows Vista. This codepage contains Russian and English letters. I decided to test it.
I created files with different names (Russian, English, Greek, Chinese) and ran the DIR command to see how these names would be displayed. Interestingly, the English and Russian symbols were displayed, but the other letters were replaced by question marks (?). So I don't need MY program to do better than DIR here. I just want to display all symbols that can be displayed in the current codepage. If the user has files named in Russian, then he is (probably) Russian and has the appropriate codepage set.
I also read in MSDN that changing the console codepage is not good practice, because the console does not belong to the application and may be shared between applications.
Now, the questions are:
- How do I convert a wide string from UTF-16 to the codepage used by the console?
- Why is this conversion not performed automatically by wcout? Why does the output stop at the first non-English character?
-
I know it doesn't directly fit your needs, but at least I can give you a hint on how to do it: Boost provides a facet that performs a similar task. You should look for something like that.
-
This thread was moved by moderator HumeSikkins from the C++ forum to the DOS und Win32-Konsole forum.
If in doubt, please also see the following notes:
C/C++ Forum :: FAQ - Sonstiges :: Wohin mit meiner Frage?
This posting was generated automatically.
-
I have found an explanation and a solution by myself.
Explanation
The console in my Russian Windows uses codepage 866 by default (a legacy of DOS times).
The char strings ("Text") in the program are in codepage 1251.
The wchar_t strings (L"Text") in the program are in UTF-16.
The simplest code, cout << "Text", works only by good fortune: the characters 'T', 'e', 'x' and 't' sit at the same positions in codepages 866 and 1251! This also explains why cout << "[Russian text]" produces mojibake: the Russian letters sit at different positions in 866 and 1251.
Now about wcout. It does not support Unicode! The only thing it does is convert Unicode characters to 7-bit ASCII characters. If wcout encounters a character with a code above 127, its behaviour is "implementation dependent". In my case wcout becomes unusable after that. Even if wcout supported Unicode, it would not help, since the console can't display it!
Since cout can carry character codes 0..255 and wcout only 0..127, wcout is almost useless for console output and should not be used for it!
Now the Solution (Microsoft specific :)).
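To make the 866 versus 1251 layout difference concrete before turning to the WinAPI functions, here is a minimal portable sketch of what such a conversion has to do. The function names are hypothetical, and the table covers only ASCII and the Russian letters; every other byte is mapped to '?'.

```cpp
#include <string>

// Map one CP1251 byte to its CP866 position.
// Only ASCII and Russian letters are handled; anything else becomes '?'.
unsigned char cp1251_to_cp866_byte(unsigned char c) {
    if (c < 0x80) return c;                       // ASCII: same position in both codepages
    if (c >= 0xC0 && c <= 0xEF)                   // 1251 A..YA and a..p -> 866 0x80..0xAF
        return static_cast<unsigned char>(c - 0x40);
    if (c >= 0xF0)                                // 1251 r..ya -> 866 0xE0..0xEF
        return static_cast<unsigned char>(c - 0x10);
    if (c == 0xA8) return 0xF0;                   // capital YO
    if (c == 0xB8) return 0xF1;                   // small yo
    return '?';                                   // no mapping in this sketch
}

// Convert a whole CP1251 string byte by byte.
std::string cp1251_to_cp866_text(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        out += static_cast<char>(
            cp1251_to_cp866_byte(static_cast<unsigned char>(in[i])));
    }
    return out;
}
```

ASCII bytes pass through unchanged, which is why cout << "Text" looks correct, while every Russian letter moves to a different byte value.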
I have found the great WinAPI function CharToOemA and its counterpart CharToOemW. What do they do?
CharToOemA converts a string of char from the default program (ANSI) codepage to the default console (OEM) codepage. If I compile my program with char s[] = "[Russian text]" in it, the compiler stores [Russian text] in codepage 1251, the default codepage of my system. Now I can convert and output it: CharToOemA(s, buf); cout << buf << endl;. This will output Russian text on every system that has Russian letters in its console codepage (not only 866)! If the console codepage has no Russian symbols, they become question marks (?). At least the user will know that some text was meant to be output, but that his console codepage does not support it.
Because my default system codepage is 1251, I cannot use this: char s[] = "[e]alpha[/e][e]beta[/e]". I get a compile warning: "warning C4566: character represented by universal-character-name '\u03B1' cannot be represented in the current code page (1251)". How to fix it? Read on!
CharToOemW is more interesting. It always treats the input as UTF-16, the same encoding as in L"Text here". Thus, it does not depend on the system default codepage (1251 in my case). So I can write: wchar_t s[] = L"[e]alpha[/e][e]beta[/e]"; CharToOemW(s, buf); cout << buf << endl;. Notice that I use cout here, not wcout. Now everyone who runs this program and has Greek letters in his codepage will get Greek characters in the console! If the codepage has no Greek characters, then underscores (_) are output instead of these letters.
I have overloaded the minus operator (the << operator is already defined :() to do all possible conversions automatically for me:
    #include <tchar.h>
    #include <windows.h>
    #include <iostream>
    #include <string>
    #include <sstream>
    #include <vector>
    #include <cstring>
    #include <cwchar>

    #ifdef _UNICODE
    #define tString std::wstring
    #define tStringStream std::wstringstream
    #else
    #define tString std::string
    #define tStringStream std::stringstream
    #endif

    // The output stream is always a char stream (not a wchar_t stream).
    std::ostream &operator-(std::ostream &stream, wchar_t const *const s)
    {
        // Up to two bytes per wchar_t (double-byte OEM codepages) plus the zero terminator.
        std::vector<char> buf(2 * std::wcslen(s) + 1);
        CharToOemW(s, &buf[0]);
        return stream << &buf[0];
    }

    std::ostream &operator-(std::ostream &stream, std::wstring const &s)
    {
        return stream - s.c_str();
    }

    std::ostream &operator-(std::ostream &stream, char const *const s)
    {
        // The converted string has the same length, plus the zero terminator.
        std::vector<char> buf(std::strlen(s) + 1);
        CharToOemA(s, &buf[0]);
        return stream << &buf[0];
    }

    std::ostream &operator-(std::ostream &stream, std::string const &s)
    {
        return stream - s.c_str();
    }

    // Other types: first convert to a string (or wstring, depending on _UNICODE),
    // then output that string with conversion.
    template<class C>
    std::ostream &operator-(std::ostream &stream, C const &c)
    {
        tStringStream s;
        s << c;
        return stream - s.str();
    }
Now I can write as follows:
    cout - "[Russian Text]" << endl;      // 1251 -> 866 conversion here
    cout - L"[Russian Text]" << endl;     // UTF-16 -> 866 conversion here
    cout << 10 << endl;                   // Wrong. The symbols '0' and '1' can be in different places in the program codepage and in the console codepage.
    cout - 10 << endl;                    // Correct.
    cout - _T("[Russian text]") << endl;  // 1251 -> 866 or UTF-16 -> 866 conversion depending on the _UNICODE project setting
The above code works great both with _UNICODE defined and with _UNICODE undefined.
The following code works only with _UNICODE defined:
    cout - _T("[e]alpha[/e][e]beta[/e]") << endl;
As it should be on my system.
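One caveat, offered as a hedged side note: as far as the documentation describes them, the CharToOem functions target the system OEM codepage, which can differ from the codepage a particular console currently uses (for example after chcp). The sketch below instead queries the console with GetConsoleOutputCP and converts with WideCharToMultiByte. The function name to_console_bytes is hypothetical, and the non-Windows branch exists only so the sketch compiles elsewhere.

```cpp
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Convert UTF-16 text to the codepage the console is currently using.
// Characters missing from that codepage are replaced by its default character.
std::string to_console_bytes(const std::wstring& text) {
    if (text.empty()) return std::string();
#ifdef _WIN32
    UINT cp = GetConsoleOutputCP();          // returns 0 on failure (e.g. no console)
    if (cp == 0) cp = GetOEMCP();            // fall back to the system OEM codepage
    const int in_len = static_cast<int>(text.size());
    // First call: ask how many bytes the converted text needs.
    const int needed = WideCharToMultiByte(cp, 0, text.data(), in_len,
                                           NULL, 0, NULL, NULL);
    if (needed <= 0) return std::string();
    std::string out(static_cast<std::size_t>(needed), '\0');
    // Second call: perform the conversion into the sized buffer.
    WideCharToMultiByte(cp, 0, text.data(), in_len, &out[0], needed, NULL, NULL);
    return out;
#else
    // Assumption for non-Windows builds only: keep ASCII, replace the rest with '?'.
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const unsigned long code = static_cast<unsigned long>(text[i]);
        out += (code < 0x80) ? static_cast<char>(code) : '?';
    }
    return out;
#endif
}
```

Usage would look like cout << to_console_bytes(L"Text") << endl;. Because the required size is queried first, double-byte console codepages do not overflow the buffer.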
Does anybody know how to make my own cout object, so that I can write << instead of -?
P.S.: Why can't I write Russian text in the Forum? This is unfair. For example, in most Russian forums visitors can write in English (and in German too).
-
I have noticed the Latex tag...
What am I doing wrong?
-
AFAIK it is broken at the moment.
-
Hey, this is really cool, and CharToOem seems to be really genius! Thank you for this little article, SAn. To be honest, I don't really like the minus workaround for cout, but I can't think of a better solution myself (maybe plain functions for output?).
To create your own stream object, AFAIK you should derive from basic_ostream<charT,traits>. Maybe you could take a look here.
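A hedged alternative to deriving from basic_ostream itself is to derive from std::streambuf and hand that buffer to an ordinary std::ostream, so the usual << operators keep working. The sketch below is only an illustration: ConvertingBuf, ByteConverter, demo_upper and demo are hypothetical names, and demo_upper is a stand-in converter (it upper-cases ASCII) so the effect is visible on any platform. On Windows the converter could wrap a CharToOem-style call instead. It converts byte by byte, so it assumes a single-byte codepage.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical per-byte converter type (single-byte codepages only).
typedef unsigned char (*ByteConverter)(unsigned char);

// Stream buffer that converts every byte and forwards it to another buffer.
class ConvertingBuf : public std::streambuf {
public:
    ConvertingBuf(std::streambuf* sink, ByteConverter convert)
        : sink_(sink), convert_(convert) {
        assert(sink != nullptr);
    }

protected:
    // No put area is set, so every written character arrives here.
    int_type overflow(int_type ch) override {
        if (traits_type::eq_int_type(ch, traits_type::eof()))
            return traits_type::not_eof(ch);
        const unsigned char in =
            static_cast<unsigned char>(traits_type::to_char_type(ch));
        return sink_->sputc(static_cast<char>(convert_(in)));
    }

    int sync() override { return sink_->pubsync(); }

private:
    std::streambuf* sink_;
    ByteConverter convert_;
};

// Stand-in converter for the demo: upper-case ASCII letters.
unsigned char demo_upper(unsigned char c) {
    return (c >= 'a' && c <= 'z') ? static_cast<unsigned char>(c - 32) : c;
}

// Push text and a number through the converting stream into a string sink.
std::string demo(const std::string& text) {
    std::ostringstream sink;
    ConvertingBuf buf(sink.rdbuf(), demo_upper);
    std::ostream out(&buf);          // ordinary ostream, so << works for all types
    out << text << ' ' << 10;
    out.flush();
    return sink.str();
}
```

To write to the console, the sink would be std::cout.rdbuf() instead of the string stream; numbers and other types then go through the same conversion without a special operator.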
-
Where can I find a full description of facets (std::codecvt)?
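As a starting point rather than a full description, here is a minimal sketch of calling the standard std::codecvt<wchar_t, char, std::mbstate_t> facet directly. The helper name narrow_with_facet is hypothetical. Which characters can be converted depends entirely on the locale passed in; with the classic "C" locale only ASCII can be relied on.

```cpp
#include <algorithm>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

// Narrow a wide string using the codecvt facet of the given locale.
// Returns true only if every character was converted.
bool narrow_with_facet(const std::wstring& in, const std::locale& loc,
                       std::string& out) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    const Facet& facet = std::use_facet<Facet>(loc);

    out.clear();
    if (in.empty()) return true;

    // Worst case: max_length() bytes per wide character.
    const std::size_t per_char =
        static_cast<std::size_t>(std::max<int>(1, facet.max_length()));
    std::vector<char> buf(in.size() * per_char + 1);

    std::mbstate_t state = std::mbstate_t();
    const wchar_t* from_next = 0;
    char* to_next = buf.data();
    const Facet::result r =
        facet.out(state, in.data(), in.data() + in.size(), from_next,
                  buf.data(), buf.data() + buf.size(), to_next);

    out.assign(buf.data(), to_next);   // keep whatever was converted
    return r == Facet::ok;
}
```

A call such as narrow_with_facet(L"Text", std::locale::classic(), s) fills s with "Text". Whether a locale matching the console codepage exists, and what it is called, is implementation-specific and should be checked against the compiler's documentation.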