What are TCHAR, WCHAR, LPSTR, LPWSTR, LPCTSTR (etc.)?

Many Windows C++ programmers get confused over what bizarre data type identifiers like TCHAR and LPCTSTR are. Here, in brief, I will try to clear out the fog.

In general, a character can be represented in 1 byte or 2 bytes. Let's say a 1-byte character is an ANSI character; all English characters can be represented in this encoding.
And let's say a 2-byte character is Unicode, which can represent the characters of practically all languages in the world.

The Visual C++ compiler supports char and wchar_t as native data types for ANSI and Unicode characters, respectively.
There is more to Unicode than the 2-byte character representation described here, but for the purpose of understanding, treat it as the two-byte character that Windows uses for multiple language support. Strictly speaking, Microsoft Windows uses the UTF-16 character encoding.

What if you want your C/C++ code to be independent of character encoding/mode used?

Suggestion: Use generic data types and names to represent characters and strings.

The Character Set option on the General page of the project settings determines which character set is used for compilation (General -> Character Set).

This way, if you declare your characters as TCHAR, then when your project is compiled as Unicode, TCHAR translates to wchar_t. If it is compiled as ANSI/MBCS,
it translates to char. You are free to use char and wchar_t directly, and the project settings do not affect any direct use of these keywords.

TCHAR is defined as:

#ifdef _UNICODE
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif

The macro _UNICODE is defined when you set Character Set to "Use Unicode Character Set", and therefore TCHAR
means wchar_t. When Character Set is set to "Use Multi-Byte Character Set", TCHAR means char.

Likewise, to support multiple character sets from a single code base, and possibly multiple languages, use the generic function names (macros). Instead of using
strcpy, strlen, strcat (including the secure versions suffixed with _s), or wcscpy, wcslen,
wcscat (including the secure versions), you are better off using _tcscpy, _tcslen, _tcscat.

As you know, strlen is prototyped as:

size_t strlen(const char*);

And wcslen is prototyped as:

size_t wcslen(const wchar_t*);

You are better off using _tcslen, which is logically prototyped as:

size_t _tcslen(const TCHAR*);

WC stands for Wide Character. Therefore, wcs stands for wide-character string. In the same way, _tcs means
_T Character String. And you know _T may be char or wchar_t, logically.

But, in reality, _tcslen (and the other _tcs functions) are not functions at all, but macros. Each one simply expands to the wcs routine when _UNICODE is defined, and to the str routine otherwise.
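
As a rough illustration of that mapping, here is a minimal sketch that mimics the mechanism in portable C++. The names MY_UNICODE, my_tchar, my_tcslen and NameLength are hypothetical stand-ins invented for this example; the real TCHAR and _tcslen come from the Windows headers and are keyed on _UNICODE.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>   // std::strlen
#include <cwchar>    // std::wcslen

// Hypothetical switch standing in for _UNICODE.
// Comment it out to get the narrow (ANSI-style) build.
#define MY_UNICODE

#ifdef MY_UNICODE
typedef wchar_t my_tchar;            // stand-in for TCHAR
#define my_tcslen std::wcslen        // stand-in for _tcslen
#else
typedef char my_tchar;
#define my_tcslen std::strlen
#endif

// Written once against the generic names; compiles in either mode.
std::size_t NameLength(const my_tchar* name)
{
    assert(name != nullptr);
    return my_tcslen(name);          // expands to wcslen or strlen
}
```

Because my_tcslen is only a textual substitution, no third function exists at run time; the call binds directly to wcslen or strlen.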

You might ask why they are defined as macros, and not implemented as functions instead. The reason is simple: a library or DLL can export a given name only once,
with a single prototype (ignore the overloading concept of C++). For instance, when you export a function as:

void _TPrintChar(char);

how is the client supposed to call it as:

void _TPrintChar(wchar_t);

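_TPrintChar cannot be magically converted into a function taking a 2-byte character. There have to be two separate functions, one taking char and one taking wchar_t, with a macro that selects between them according to the build.

With that in place, both TCHAR and _TPrintChar map to either Unicode or ANSI together, and therefore a TCHAR variable (say, cChar)
and the argument to the function are either both char or both wchar_t.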
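
The pattern can be sketched as follows. This is only an illustration under my own assumptions: PrintCharA, PrintCharW, PrintChar, PrintGeneric, my_tchar and MY_UNICODE are hypothetical names (I avoid the leading underscore here because such identifiers are reserved), and the functions print through printf purely to keep the sketch portable.

```cpp
#include <cstdio>
#include <cwchar>   // wint_t

// Two real functions: an exported symbol has exactly one signature.
int PrintCharA(char c)    { return std::printf("%c", c); }
int PrintCharW(wchar_t c) { return std::printf("%lc", static_cast<wint_t>(c)); }

// Hypothetical switch standing in for _UNICODE.
#define MY_UNICODE

#ifdef MY_UNICODE
typedef wchar_t my_tchar;        // stand-in for TCHAR
#define PrintChar PrintCharW     // the generic name is only a macro
#else
typedef char my_tchar;
#define PrintChar PrintCharA
#endif

// Client code uses the generic type and the generic name together,
// so the variable and the parameter always agree.
int PrintGeneric(my_tchar cChar)
{
    return PrintChar(cChar);
}
```

The generic type and the generic name switch as a pair, which is why the client never has to pick the A or W variant by hand.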

Macros avoid these complications, and allow us to use either the ANSI or the Unicode function for characters and strings. Most Windows functions
that take a string or a character are implemented this way, and for the programmer's convenience only one name (a macro!) needs to be remembered. SetWindowText is one example: it is a macro that resolves to SetWindowTextW in a Unicode build and to SetWindowTextA otherwise.

There are very few functions that do not have macros, and are available only with a W or A suffix.
One example is ReadDirectoryChangesW, which doesn't have an ANSI equivalent.

You all know that we use double quotation marks to represent strings. A string represented in this manner is an ANSI string, with 1 byte per character. Example:

"This is ANSI String. Each letter takes 1 byte."

The string given above is not Unicode, and would not be suitable for multi-language support. To represent a Unicode string,
you need to use the prefix L. An example:

L"This is Unicode string. Each letter would take 2 bytes, including spaces."

Note the L at the beginning of the string, which makes it a Unicode string. All characters (I repeat, all characters)
take two bytes, including all English letters, spaces, digits, and the null character (characters outside the Basic Multilingual Plane take two such 2-byte units in UTF-16). Therefore, the size of a Unicode string is always a multiple
of 2 bytes. A Unicode string of 7 characters needs 14 bytes, and so on. A Unicode string taking 15 bytes, for example, would not be valid in any context.
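
The sizes can be checked at compile time. The following is a small sketch, not part of any Windows header; LiteralBytes is a hypothetical helper I made up for the purpose. Note that wchar_t is 2 bytes with Visual C++ on Windows, but its size differs on other platforms, so the sketch expresses the wide size in terms of sizeof(wchar_t).

```cpp
#include <cstddef>

// Bytes occupied by a string literal's array, terminating null included.
template <typename CharT, std::size_t N>
constexpr std::size_t LiteralBytes(const CharT (&)[N])
{
    return N * sizeof(CharT);
}

// "Saturn" is 6 letters plus the null: 7 elements.
static_assert(LiteralBytes("Saturn") == 7,
              "narrow literal: 1 byte per character");

// 7 wide elements; this is 14 bytes wherever wchar_t is 2 bytes (Windows).
static_assert(LiteralBytes(L"Saturn") == 7 * sizeof(wchar_t),
              "wide literal: sizeof(wchar_t) bytes per character");

// The size is always a whole multiple of the character size.
static_assert(LiteralBytes(L"Saturn") % sizeof(wchar_t) == 0,
              "never a fractional character");
```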

The non-prefixed string is an ANSI string, the L-prefixed string is Unicode, and a string wrapped in _T or TEXT
is either, depending on compilation. Again, _T and TEXT are nothing but macros: in a Unicode build they paste an L prefix onto their argument, and in an ANSI build they leave it untouched.

The ## symbol is the token pasting operator,
which turns _T("Unicode") into L"Unicode", where the string passed is the argument to the macro, if _UNICODE
is defined. If _UNICODE is not defined, _T("Unicode") simply means "Unicode". The token pasting operator
exists in the C language as well, and is not specific to VC++ or character encoding.

Note that these macros can be used for strings as well as characters.
_T('R') would turn into L'R' or simply 'R'; the former is a Unicode character, the latter an ANSI character.
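
To see the token pasting at work outside the Windows headers, here is a minimal sketch. MY_T and MY_UNICODE are hypothetical names chosen for this example; they only imitate what _T and _UNICODE do.

```cpp
#include <type_traits>

// Hypothetical switch standing in for _UNICODE.
#define MY_UNICODE

#ifdef MY_UNICODE
// Paste the L prefix onto the literal: MY_T("abc") -> L"abc", MY_T('R') -> L'R'
#define MY_T(x) L##x

static_assert(std::is_same<decltype(MY_T('R')), wchar_t>::value,
              "character literal became a wide character");
static_assert(sizeof(MY_T("abc")) == 4 * sizeof(wchar_t),
              "string literal became a wide string (3 letters + null)");
#else
// Leave the literal untouched: MY_T("abc") -> "abc", MY_T('R') -> 'R'
#define MY_T(x) x

static_assert(std::is_same<decltype(MY_T('R')), char>::value,
              "character literal stayed narrow");
static_assert(sizeof(MY_T("abc")) == 4,
              "string literal stayed narrow (3 letters + null)");
#endif
```

The pasting happens in the preprocessor, before the compiler sees any types, which is why it only works on literals written directly inside the macro call.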

No, you cannot use these macros to convert variables (string or character) into Unicode/non-Unicode text. The following is not valid:

char c = 'C';
char str[16] = "CodeProject";
_T(c);
_T(str);

The last two lines compile successfully in an ANSI (Multi-Byte) build, since _T(x) is simply x, and therefore
_T(c) and _T(str) come out as c and str, respectively.
But when you build with the Unicode character set, they fail to compile: the macro pastes the L prefix onto the variable names, producing Lc and Lstr, which the compiler reports as undeclared identifiers.

There exists a set of conversion routines to convert MBCS to Unicode and vice versa, which I will explain soon.

It is important to note that almost all functions that take a string (or character), primarily in the Windows API, have a generalized prototype
in MSDN and elsewhere. The function SetWindowTextA/W, for instance, is documented as:

BOOL SetWindowText(HWND, const TCHAR*);

But, as you know, SetWindowText is just a macro, and depending on your build settings, it means either SetWindowTextA (taking const char*) or SetWindowTextW (taking const wchar_t*).

All of the functions that have ANSI and Unicode versions have their actual implementation only in the Unicode version. That means, when you call SetWindowTextA
from your code, passing an ANSI string, it converts the ANSI string to Unicode text and then calls SetWindowTextW. The actual work (setting the window
text/title/caption) is performed by the Unicode version only!

Take another example, which retrieves the window text using GetWindowText. You call GetWindowTextA, passing an ANSI buffer
as the target buffer. GetWindowTextA first calls GetWindowTextW, probably allocating a Unicode string (a wchar_t array)
for it. Then it converts that Unicode text, for you, into an ANSI string.
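
The forwarding idea can be sketched in portable C++. Everything here is an assumption for illustration: SetTitleA, SetTitleW and g_title are hypothetical, and std::mbstowcs stands in for whatever conversion Windows performs internally with the ANSI code page.

```cpp
#include <cstddef>
#include <cstdlib>   // std::mbstowcs
#include <cstring>   // std::strlen
#include <string>
#include <vector>

// Pretend window state, kept as a wide string only.
static std::wstring g_title;

// The "real" implementation: works on wide text.
bool SetTitleW(const wchar_t* text)
{
    if (text == nullptr) return false;
    g_title = text;
    return true;
}

// The narrow variant does no real work: it converts and forwards.
bool SetTitleA(const char* text)
{
    if (text == nullptr) return false;

    // A wide string never needs more elements than the narrow one has bytes.
    std::vector<wchar_t> wide(std::strlen(text) + 1, L'\0');
    std::size_t converted = std::mbstowcs(wide.data(), text, wide.size());
    if (converted == static_cast<std::size_t>(-1)) return false;  // invalid input

    return SetTitleW(wide.data());
}
```

Every call through the narrow variant pays for an allocation and a conversion, which is the cost the text above refers to.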

This ANSI-to-Unicode (and vice versa) conversion is not limited to GUI functions; it applies to the entire set of Windows API functions that take strings and have two variants. A few examples:

CreateProcess

GetUserName

OpenDesktop

DeleteFile

etc

It is therefore very much recommended to call the Unicode version directly. In turn, it means you should always target Unicode builds,
and not ANSI builds just because you have been accustomed to ANSI strings for years. Yes, you may still save and retrieve ANSI strings, for example in a file,
or send them as chat messages in your messenger application. The conversion routines exist for such needs.

Note: There exists another typedef: WCHAR, which is equivalent to wchar_t.

TCHAR is for a single character. You can definitely declare an array of TCHAR. But what if you would like to express
a character pointer, or a const character pointer: should it be spelled with char, with wchar_t, or with TCHAR?

After reading about TCHAR, you would definitely select the TCHAR-based form (TCHAR* or const TCHAR*) as your choice. There are better alternatives available
to represent strings. For that, you just need to include Windows.h. Note: if your project implicitly or explicitly includes
Windows.h, you need not include TCHAR.H for these type names (the _T macro and the _tcs routines still come from TCHAR.H).

Take strcpy as an example. With these type names it can be prototyped as LPSTR strcpy(LPSTR szTarget, LPCSTR szSource). The type of szTarget is LPSTR, without C in the type name. It is defined as:

typedef char* LPSTR;

Note that szSource is LPCSTR (a const char*), since strcpy does not modify the source buffer, hence the const attribute.
The return type is a non-constant string: LPSTR.

Alright, these str-functions are for ANSI string manipulation. But we want routines for 2-byte Unicode strings. For that, the equivalent wide-character
functions are provided. For example, to calculate the length of a wide-character (Unicode) string, you would use wcslen:

size_t nLength;
nLength = wcslen(L"Unicode");

The prototype of wcslen is:

size_t wcslen(const wchar_t* szString); // Or WCHAR*

And that can be represented as:

size_t wcslen(LPCWSTR szString);

Where the symbol LPCWSTR is defined as:

typedef const WCHAR* LPCWSTR;
// const wchar_t*

Which can be broken down as:

LP - Pointer

C - Constant

WSTR - Wide character String

Similarly, the strcpy equivalent for Unicode strings is wcscpy:

wchar_t* wcscpy(wchar_t* szTarget, const wchar_t* szSource);

Which can be represented as:

LPWSTR wcscpy(LPWSTR szTarget, LPCWSTR szSource);

Where the target is a non-constant wide string (LPWSTR), and the source is a constant wide string (LPCWSTR).
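
To make the naming scheme concrete without pulling in Windows.h, here is a sketch that recreates the typedef family under hypothetical MY_ names. The real LPSTR, LPCSTR, LPWSTR and LPCWSTR are provided by the Windows headers; CopyWide is likewise a made-up wrapper.

```cpp
#include <cwchar>        // std::wcscpy
#include <type_traits>

// Hypothetical re-creations of the Windows type names.
typedef char            MY_CHAR;
typedef wchar_t         MY_WCHAR;
typedef MY_CHAR*        MY_LPSTR;    // pointer to string
typedef const MY_CHAR*  MY_LPCSTR;   // pointer to constant string
typedef MY_WCHAR*       MY_LPWSTR;   // pointer to wide string
typedef const MY_WCHAR* MY_LPCWSTR;  // pointer to constant wide string

// The typedefs are pure aliases; no new types are created.
static_assert(std::is_same<MY_LPCSTR,  const char*>::value,    "LPCSTR shape");
static_assert(std::is_same<MY_LPWSTR,  wchar_t*>::value,       "LPWSTR shape");
static_assert(std::is_same<MY_LPCWSTR, const wchar_t*>::value, "LPCWSTR shape");

// wcscpy written with the alias names: writable target, constant source.
MY_LPWSTR CopyWide(MY_LPWSTR szTarget, MY_LPCWSTR szSource)
{
    return std::wcscpy(szTarget, szSource);
}
```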

There exists a set of equivalent wcs-functions for the str-functions. The str-functions are used for plain ANSI strings, and the wcs-functions are used for Unicode strings.

I have already advised using the native Unicode functions instead of the ANSI-only or TCHAR-synthesized functions. The reason is simple: your application should be Unicode only, and you should not even care about code portability for ANSI builds. But for the sake of completeness, I am mentioning these generic mappings.

To calculate the length of a string, you may use the _tcslen function (a macro). In general, it is prototyped as:

size_t _tcslen(const TCHAR* szString);

Or, as:

size_t _tcslen(LPCTSTR szString);

Where the type name LPCTSTR can be broken down as:

LP - Pointer

C - Constant

T - TCHAR

STR - String

Depending on the project settings, LPCTSTR would be mapped to either LPCSTR (ANSI) or LPCWSTR (Unicode).

Note: strlen, wcslen and _tcslen return the number of characters in the string, not the number of bytes.
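
The difference matters whenever a byte count is needed, for instance when writing a string to a file. A minimal sketch, where ByteLength is a hypothetical helper of my own:

```cpp
#include <cstddef>
#include <cwchar>   // std::wcslen

// Number of bytes occupied by the characters of a wide string,
// not counting the terminating null.
std::size_t ByteLength(const wchar_t* text)
{
    // wcslen counts characters; multiply by the character size for bytes.
    return std::wcslen(text) * sizeof(wchar_t);
}
```

For L"Saturn", wcslen reports 6 characters, while ByteLength reports 6 * sizeof(wchar_t) bytes, which is 12 on Windows where wchar_t is 2 bytes.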

Passing an ANSI string to a wide-character function, as in wcslen("Saturn"), is a compiler error. Unfortunately (or fortunately), this error can be incorrectly corrected by a simple C-style typecast:

nLen = wcslen((const wchar_t*)"Saturn");

And you'd think you've attained one more experience level in pointers! You are wrong: the code gives an incorrect result, and in most cases simply causes an access violation. Typecasting this way is like passing a float variable where a structure of 80 bytes is expected (logically).

The string "Saturn" is a sequence of 7 bytes:

'S' (83)

'a' (97)

't' (116)

'u' (117)

'r' (114)

'n' (110)

'\0' (0)

But when you pass the same set of bytes to wcslen, it treats each 2 bytes as a single character. Therefore the first two bytes [83, 97] are treated (in little-endian order) as one
character having the value 24915 (97<<8 | 83). That is the Unicode character U+6153, a CJK ideograph. The next character
is represented by [116, 117], and so on.
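
The arithmetic can be verified with a few lines of code. This sketch only reproduces the byte pairing by hand; PairAsUnit is a hypothetical helper, and it assumes the little-endian byte order used by Windows on x86/x64.

```cpp
#include <cstdint>

// Combine two consecutive bytes the way a little-endian machine
// reads one 16-bit unit: the first byte is the low half.
constexpr std::uint16_t PairAsUnit(unsigned char first, unsigned char second)
{
    return static_cast<std::uint16_t>((second << 8) | first);
}

// 'S' = 83, 'a' = 97  ->  97 * 256 + 83 = 24915 (0x6153)
static_assert(PairAsUnit('S', 'a') == 24915, "first pair of \"Saturn\"");
static_assert(PairAsUnit('S', 'a') == 0x6153, "same value in hexadecimal");

// 't' = 116, 'u' = 117  ->  117 * 256 + 116 = 30068
static_assert(PairAsUnit('t', 'u') == 30068, "second pair of \"Saturn\"");
```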

For sure, you didn't pass that set of Chinese characters, but improper typecasting has done it! Therefore it is essential to know that typecasting
will not work! So, to initialize a TCHAR string, you must write:

TCHAR name[] = _T("Saturn");

Which would translate to 7 bytes or 14 bytes, depending on compilation. The call to wcslen should be:

wcslen(L"Saturn");

Now consider the opposite mistake: calling strlen on the TCHAR array name, which causes an error when building in Unicode. The non-working solution is a C-style typecast:

nLen = strlen((const char*)name);

On a Unicode build, name is 14 bytes (7 Unicode characters, including the null). Since the string "Saturn" contains only English letters,
which can be represented in original ASCII, the Unicode letter 'S' is represented as [83, 0]. The other ASCII characters are likewise represented
with a zero next to them. Note that 'S' is now represented as the 2-byte value 83. The end of the string is represented
by two bytes having the value 0.

So, when you pass such a string to strlen, the first character (i.e. the first byte) would be correct ('S' in the case of "Saturn").
But the second character/byte would indicate the end of the string. Therefore, strlen would return the incorrect value 1 as the length of the string.
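
This one is easy to reproduce. The sketch below uses a char16_t literal so that each character is a 16-bit unit regardless of platform (with Visual C++ on Windows, a wchar_t string has the same layout). It assumes a little-endian machine; NarrowLengthOfWide is a hypothetical name.

```cpp
#include <cstddef>
#include <cstring>   // std::strlen

// Treat the bytes of a 16-bit string as if they were a narrow string.
std::size_t NarrowLengthOfWide()
{
    // In memory (little-endian): 83 0 97 0 116 0 117 0 114 0 110 0 0 0
    static const char16_t name[] = u"Saturn";

    // strlen stops at the first zero byte, which is the high half of 'S'.
    return std::strlen(reinterpret_cast<const char*>(name));
}
```

On a little-endian machine the function returns 1, matching the explanation above; the remaining five letters are never seen.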

Since a Unicode string may contain non-English characters, the result of strlen would be even less predictable.

In short, typecasting will not work. You need to either represent strings in the correct form in the first place, or use the ANSI-to-Unicode (and vice versa) conversion routines.

Continuing. You must have seen some functions/methods asking you to pass a number of characters, or returning a number of characters.
GetCurrentDirectory is one of them: you need to pass the number of characters the buffer can hold, not the number of bytes.
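
A portable sketch of that calling convention follows. CopyFolderName is a hypothetical function written in the style of such APIs (it is not GetCurrentDirectory itself), the folder name is made up, and CountOf is a small helper I added to count array elements.

```cpp
#include <cstddef>
#include <cwchar>   // std::wcslen, std::wmemcpy

// Number of elements (characters) in an array, not its size in bytes.
template <typename T, std::size_t N>
constexpr std::size_t CountOf(const T (&)[N])
{
    return N;
}

// Hypothetical API: takes the buffer capacity in characters and returns
// the number of characters written, excluding the null (0 on failure).
std::size_t CopyFolderName(std::size_t cchBuffer, wchar_t* buffer)
{
    static const wchar_t kName[] = L"C:\\Temp";   // made-up value
    const std::size_t length = std::wcslen(kName);

    if (buffer == nullptr || cchBuffer <= length) return 0;
    std::wmemcpy(buffer, kName, length + 1);      // copy including the null
    return length;
}
```

A caller would declare wchar_t path[260] and pass CountOf(path), which is 260. Passing sizeof(path) instead would claim room for more characters than the buffer really has whenever a character is wider than one byte.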
