How to write Unicode to console?

SAn

I have found that the console on my Russian-localized Windows Vista is set to codepage 866.
This codepage contains Russian and English letters.

I decided to test it.
I created files with different names (Russian, English, Greek, Chinese) and ran the DIR command to see how these names would be displayed. Interestingly, the English and Russian characters were displayed correctly, but all other letters were replaced by question marks (?).

So I don't need MY program to do better than DIR in this case. I just want to display every character that can be displayed in the current codepage: if a user names his files in Russian, then he is (probably) Russian and has the appropriate codepage set.

I also read on MSDN that changing the console codepage is not good practice, because the console does not belong to the application and may be shared between applications.

Now, the questions are:

How do I convert a wide string from UTF-16 to the codepage used by the console?
Why is this conversion not performed automatically by wcout? Why does the output stop at the first non-English character?

I know this doesn't directly fit your needs, but I can at least give you a hint on how to do it: Boost provides a facet that performs a similar task. You should look for something like that.

SAn

I have found an explanation and a solution by myself.

Explanation
The console on my Russian Windows uses codepage 866 by default (the OEM codepage, inherited from DOS times).
The char strings ( "Text" ) in the program are encoded in codepage 1251 (the ANSI codepage).
The wchar_t strings ( L"Text" ) in the program are encoded in UTF-16.

The simplest code, cout << "Text" , performs no conversion at all. It works only because the characters 'T' , 'e' , 'x' and 't' are in the ASCII range (0..127), which occupies the same positions in codepages 866 and 1251.

This explains why cout << "[Russian text]" produces mojibake: the Russian letters are at different positions in codepages 866 and 1251.
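To make the "different positions" concrete, here is a small portable sketch of the byte mapping between the two codepages for the Cyrillic letters. The function names are made up for illustration, and only ASCII plus the Russian alphabet is handled; every other byte is replaced by a question mark.

```cpp
#include <cstddef>
#include <string>

// Map one Windows-1251 byte to its CP866 counterpart (hypothetical helper).
// Covers ASCII and the Russian letters only; anything else becomes '?'.
unsigned char cp1251_byte_to_cp866(unsigned char b)
{
	if (b < 0x80) return b;                                     // ASCII: same in both
	if (b >= 0xC0 && b <= 0xEF)                                 // capital A..YA, small a..pe
		return static_cast<unsigned char>(b - 0x40);        // 0xC0..0xEF -> 0x80..0xAF
	if (b >= 0xF0)                                              // small er..ya
		return static_cast<unsigned char>(b - 0x10);        // 0xF0..0xFF -> 0xE0..0xEF
	if (b == 0xA8) return 0xF0;                                 // capital IO
	if (b == 0xB8) return 0xF1;                                 // small io
	return '?';
}

// Convert a whole string byte by byte.
std::string cp1251_to_cp866(const std::string &in)
{
	std::string out;
	out.reserve(in.size());
	for (std::size_t i = 0; i < in.size(); ++i)
		out += static_cast<char>(cp1251_byte_to_cp866(static_cast<unsigned char>(in[i])));
	return out;
}
```

ASCII text passes through unchanged, which is exactly why cout << "Text" appears to work, while every Russian letter has to be moved.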

Now let us talk about wcout . With its default locale it does not support Unicode! All it does is narrow each wide character to a char, which only succeeds for 7-bit ASCII. If wcout encounters a character with a code > 127, the behaviour is implementation-dependent. In my case the stream goes into an error state and wcout becomes unusable after that, which is why the output stops at the first non-English character. Even if wcout converted correctly, the console could still display only the characters of its codepage.

Since cout passes character codes 0..255 through and wcout handles only 0..127, wcout is almost useless for this purpose and I do not use it!
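To see where the conversion gives up, one can ask the conversion facet of the classic "C" locale directly whether it can narrow a given character. This is only a sketch: the helper name converts_in_classic_locale is made up for illustration, and what a particular wcout does after a failed conversion remains implementation-dependent.

```cpp
#include <cwchar>
#include <locale>

// Hypothetical helper: true if the "C" locale's codecvt facet can turn
// the wide character wc into bytes.
bool converts_in_classic_locale(wchar_t wc)
{
	typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
	const Facet &facet = std::use_facet<Facet>(std::locale::classic());

	std::mbstate_t state = std::mbstate_t();
	char out[16];
	const wchar_t *from_next = 0;
	char *to_next = 0;

	// Try to convert exactly one wide character.
	const std::codecvt_base::result r =
		facet.out(state, &wc, &wc + 1, from_next, out, out + sizeof out, to_next);
	return r == std::codecvt_base::ok;
}
```

An ASCII letter converts fine, while a Greek alpha (U+03B1) is expected to be rejected, and a stream that relies on this facet has nothing sensible left to write.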

Now the Solution (Microsoft specific :)).
I have found the great WinAPI function CharToOemA and its counterpart CharToOemW . What do they do?

CharToOemA converts a string of char from the system ANSI codepage to the OEM codepage, which the console uses by default. If I compile my program with char s[] = "[Russian text]" in it, the literal is stored in the executable in codepage 1251, the ANSI codepage of my system. Now I can convert and output it: CharToOemA(s, buf); cout<<buf<<endl; . This outputs Russian text on every system that has Russian letters in its OEM codepage (not only 866), assuming its ANSI codepage is 1251 as well, since CharToOemA interprets the bytes in the ANSI codepage of the system it runs on. If a system has no Russian symbols in its console codepage, then the symbols become question marks (?). At least the user will know that some text was supposed to be output, but his console codepage does not support it.

Since my default system codepage is 1251, I cannot use this: char s[] = "αβ" . I get a compile warning: "warning C4566: character represented by universal-character-name '\u03B1' cannot be represented in the current code page (1251)". How to fix it? Read on!

CharToOemW is more interesting. It always treats the input as UTF-16, the same encoding as in L"Text here" . Thus it does not depend on the system default codepage (1251 in my case). So I can write: wchar_t s[] = L"αβ"; CharToOemW(s, buf); cout<<buf<<endl; . Notice that I use cout here, not wcout . Now everyone who runs this program and has Greek letters in his console codepage will get Greek characters in the console! If the codepage has no Greek characters, then underscores (_) are output instead of these letters.
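CharToOem targets the system OEM codepage, which is only the console's default. If the console has been switched with chcp, a sketch like the following converts to the codepage the console is actually using. The helper name to_console_bytes and the non-Windows fallback branch are assumptions added so that the example compiles everywhere; GetConsoleOutputCP, GetOEMCP and WideCharToMultiByte are the real WinAPI calls.

```cpp
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: encode a wide string in the console's current output codepage.
std::string to_console_bytes(const std::wstring &s)
{
	if (s.empty()) return std::string();
#ifdef _WIN32
	UINT cp = GetConsoleOutputCP();      // returns 0 on failure, e.g. no console attached
	if (cp == 0) cp = GetOEMCP();        // fall back to the system OEM codepage

	// First call: ask how many bytes are needed.
	const int n = WideCharToMultiByte(cp, 0, s.data(), static_cast<int>(s.size()),
	                                  NULL, 0, NULL, NULL);
	if (n <= 0) return std::string();

	// Second call: do the conversion; unmappable characters get the default character.
	std::string out(static_cast<std::size_t>(n), '\0');
	WideCharToMultiByte(cp, 0, s.data(), static_cast<int>(s.size()),
	                    &out[0], n, NULL, NULL);
	return out;
#else
	// Stand-in for other platforms (assumption): keep ASCII, replace the rest with '?'.
	std::string out;
	for (std::size_t i = 0; i < s.size(); ++i)
		out += (static_cast<unsigned long>(s[i]) < 0x80) ? static_cast<char>(s[i]) : '?';
	return out;
#endif
}
```

The result can be written with cout just like the buffer filled by CharToOemW, for example cout << to_console_bytes(L"Text") << endl; .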

I have overloaded the minus operator (the << operator is already defined :() to do all possible conversions automatically for me:

#include <tchar.h>
#include <windows.h>

#include <iostream>
#include <string>
#include <sstream>
#include <cstring>
#include <cwchar>
#include <vector>

#ifdef _UNICODE
	#define tString std::wstring
	#define tStringStream std::wstringstream
#else
	#define tString std::string
	#define tStringStream std::stringstream
#endif

//The output stream is always a char stream (not a wchar_t stream)
std::ostream &operator-(std::ostream &stream, wchar_t const *const s)
{
	//Reserve two bytes per wchar_t in case the OEM codepage is double-byte, plus zero terminator
	std::vector<char> buf( 2*std::wcslen(s)+1 );
	CharToOemW(s, &buf[0]);
	return stream << &buf[0]; //The vector frees the buffer even if the output throws
}

std::ostream &operator-(std::ostream &stream, std::wstring const &s)
{
	return stream - s.c_str(); //Same conversion as for wchar_t pointers
}

std::ostream &operator-(std::ostream &stream, char const *const s)
{
	//CharToOemA can translate in place, so the output is never longer than the input
	std::vector<char> buf( std::strlen(s)+1 );
	CharToOemA(s, &buf[0]);
	return stream << &buf[0];
}

std::ostream &operator-(std::ostream &stream, std::string const &s)
{
	return stream - s.c_str(); //Same conversion as for char pointers
}

template<class C> std::ostream &operator-(std::ostream &stream, C const &c)
{ //Other types
	tStringStream s; s << c; //First convert to string (or wstring, depending on _UNICODE)
	return stream - s.str(); //Output the string with conversion
}

Now I can write as follows:

cout - "[Russian Text]" << endl; //1251 -> 866 conversion here
cout - L"[Russian Text]" << endl; //UTF-16 -> 866 conversion here
cout << 10 << endl; //No conversion. Works only because digits are ASCII, which is the same in both codepages.
cout - 10 << endl; //Always goes through the conversion.
cout - _T("[Russian text]") << endl; //1251 -> 866 or UTF-16 -> 866 conversion depending on _UNICODE project setting

The above code works both with _UNICODE defined and with _UNICODE undefined.

The following code works only with _UNICODE defined:

cout - _T("αβ") << endl;

As it should on my system.

Does anybody know how to make my own cout-like object, so that I can write << instead of - ?

P.S.: Why can't I write Russian text in the forum? This is unfair. For example, in most Russian forums visitors can write in English (and in German too).

SAn

I have noticed the LaTeX tag...
$\left|\frac{1}{\zeta-z-h}-\frac{1}{\zeta-z}\right|=\left|\frac{(\zeta-z)-(\zeta-z-h)}{(\zeta-z-h)(\zeta-z)}\right|=\left|\frac{h}{(\zeta-z-h)(\zeta-z)}\right|\leq\frac{2|h|}{|\zeta-z|^2}$
What am I doing wrong?

AFAIK it is broken at the moment.

Badestrand

Hey, this is really cool, and CharToOem seems to be really ingenious! Thank you for this little article, SAn. To be honest, I don't really like the minus workaround for cout, but I can't think of a better solution myself (maybe plain functions for output?).

To create your own stream object, AFAIK you should derive from basic_ostream<charT,traits> . Maybe you could take a look here.
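One possible way to get the << syntax back, as a minimal sketch: instead of deriving from basic_ostream itself, put the conversion into a custom std::streambuf and wrap it in a plain std::ostream. The class name ConvertingBuf and the helpers to_upper_ascii and demo are made up for illustration. The converter here is a harmless stand-in that uppercases ASCII; on Windows a per-byte ANSI to OEM mapping would take its place, and this byte-wise approach assumes a single-byte codepage.

```cpp
#include <functional>
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>
#include <utility>

// Stream buffer that converts every byte before forwarding it to another buffer.
class ConvertingBuf : public std::streambuf
{
public:
	ConvertingBuf(std::streambuf *sink, std::function<char(char)> convert)
		: sink_(sink), convert_(std::move(convert)) {}

protected:
	// No internal buffer is set up, so every written character arrives here.
	int_type overflow(int_type ch) override
	{
		if (traits_type::eq_int_type(ch, traits_type::eof()))
			return traits_type::not_eof(ch);
		return sink_->sputc(convert_(traits_type::to_char_type(ch)));
	}

	int sync() override { return sink_->pubsync(); }

private:
	std::streambuf *sink_;
	std::function<char(char)> convert_;
};

// Stand-in converter (assumption): uppercase ASCII letters, leave the rest alone.
char to_upper_ascii(char c)
{
	return (c >= 'a' && c <= 'z') ? static_cast<char>(c - 'a' + 'A') : c;
}

// Usage: everything written with << passes through the converter.
std::string demo()
{
	std::ostringstream sink;                        // std::cout.rdbuf() in a real program
	ConvertingBuf buf(sink.rdbuf(), to_upper_ascii);
	std::ostream out(&buf);
	out << "abc" << 12 << std::endl;
	return sink.str();
}
```

Because the conversion sits below the formatting layer, numbers and other types written with << are converted too, with no extra overloads.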

SAn

Where can I find a full description of facets (std::codecvt)?
