Character Set Considerations

From Crypto++ Wiki

This page discusses character set design considerations for users of the library. For expediency, the simplest solution is to define neither _UNICODE nor UNICODE. Then everything uses narrow characters.

Data is Neutral

Crypto++ is generally a neutral library. That is, when the Crypto++ Library operates on data (even when the data is housed in a string), the data is interpreted as a byte[]. A better (but less portable) abstraction is a Rope. Consider the following fragment, assuming the file stores binary data:

string sink;
FileSource( filename, true, new StringSink( sink ) );

There is no regard to wide or narrow - data is data. Next, suppose it is desired to hash the data:

MD5 hash;
byte digest[ MD5::DIGESTSIZE ];
hash.Update( (const byte*)sink.data(), sink.size() );
hash.Final( digest );
...

The hash operates on a stream of bytes - the stream could be binary data, narrow characters (which the hash regards as byte[]), or wide characters (which the hash regards as byte[]). The programmer only needs to specify the number of bytes (size) to hash. The hash is indifferent.
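The point that a byte-oriented routine does not care about character width can be sketched without Crypto++ at all. In the following, ToyChecksum, ByteLength and ChecksumOf are hypothetical names invented for illustration; ToyChecksum is a deliberately trivial stand-in for a real hash and is not cryptographic.

```cpp
#include <cstddef>
#include <string>

typedef unsigned char byte;

// Hypothetical stand-in for a real hash: folds every byte into a 32-bit value.
// It is NOT cryptographic and is not part of Crypto++.
unsigned long ToyChecksum( const byte* data, std::size_t size ) {
    unsigned long sum = 0;
    for ( std::size_t i = 0; i < size; ++i )
        sum = ( sum * 31 + data[i] ) & 0xFFFFFFFFUL;
    return sum;
}

// Number of bytes occupied by the characters of a string (no terminator).
template <class Ch>
std::size_t ByteLength( const std::basic_string<Ch>& s ) {
    return s.size() * sizeof( Ch );
}

// Feed a narrow or wide string to the byte-oriented routine.
// Only the byte count changes; the routine itself is unaware of the width.
template <class Ch>
unsigned long ChecksumOf( const std::basic_string<Ch>& s ) {
    return ToyChecksum( reinterpret_cast<const byte*>( s.data() ), ByteLength( s ) );
}
```

Note that the same text stored as narrow and as wide characters is a different byte sequence, so the two generally produce different results. The routine is indifferent to the width, but the caller still decides which bytes are fed in.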

Finally, suppose the previous example hex encoded the hash before storing it in a narrow string. The program itself could be Unicode, SBCS, or MBCS. The next sections discuss this issue.

Crypto++ is Narrow

There are times when one must pass a string to Crypto++, such as for Named Parameters and filenames. In these cases, one of two situations arises.

Narrow to Wide

Narrow to Wide conversion can further be decomposed into two cases:

  • using the Standard C++ Library
  • using the Win32 API

Using the Standard C++ Library

Users of Visual Studio 6.0 and earlier are at a disadvantage. Bjarne Stroustrup devoted Appendix D: Locales of The C++ Programming Language (3rd Edition) to issues similar to these (complete with sample code). However, that code does not compile with VS 6.0. The following will work for the reader.

// Courtesy of Tom Widmer (VC++ MVP)
std::wstring StringWiden( const std::string& narrow ) { 

    std::wstring wide;
    wide.resize( narrow.length() );

    typedef std::ctype<wchar_t> CT;
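    // std::_USE is a VC++ 6.0 macro; standard compilers spell this
    // std::use_facet<CT>( std::locale() ) instead.
    CT const& ct = std::_USE(std::locale(), CT);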

    // Non Portable
    //   Iterators should not be used as pointers (works in VC++ 6.0)
    //   ct.widen( narrow.begin(), narrow.end(), wide.begin() );

    // Portable
    // ct.widen(&narrow[0], &narrow[0] + narrow.size(), &wide[0]);

    // Portable (wide.data() is const before C++17, so write through &wide[0])
    ct.widen(narrow.data(), narrow.data() + narrow.size(), &wide[0]);

    return wide;
}
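For compilers newer than VC++ 6.0, the same idea can be written with the standard std::use_facet in place of the _USE macro. The following is a minimal sketch; the name StringWidenStd is hypothetical and chosen only to keep it apart from the function above. It uses the global locale, as the original does.

```cpp
#include <locale>
#include <string>

// Widen a narrow string with the ctype facet of the global locale.
// StringWidenStd is a hypothetical name used for illustration.
std::wstring StringWidenStd( const std::string& narrow ) {
    // Size the destination to exactly the source length (no + 1).
    std::wstring wide( narrow.size(), L'\0' );
    if ( narrow.empty() )
        return wide;

    typedef std::ctype<wchar_t> CT;
    const CT& ct = std::use_facet<CT>( std::locale() );

    // widen() takes raw pointers: [low, high) as input, dest as output.
    ct.widen( narrow.data(), narrow.data() + narrow.size(), &wide[0] );
    return wide;
}
```

Note that widen() converts character by character, so it suits single-byte text such as hex digests; it does not decode multibyte encodings.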

Using the Win32 API

See MSDN for examples of using MultiByteToWideChar.
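As a rough sketch of the usual two-call pattern with MultiByteToWideChar (first ask for the required size, then convert), consider the following. The helper name WidenWin32 and the choice of CP_ACP (the system ANSI code page) are assumptions for illustration. The #else branch is only a stand-in built on std::mbstowcs so that the sketch compiles on other platforms; it is not the Win32 technique.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#else
#include <cstdlib>   // std::mbstowcs, non-Windows stand-in only
#endif

// WidenWin32 is a hypothetical helper name. CP_ACP is an assumption;
// pass a different code page if the narrow text uses another encoding.
std::wstring WidenWin32( const std::string& narrow ) {
    if ( narrow.empty() )
        return std::wstring();
#ifdef _WIN32
    const int srcLen = static_cast<int>( narrow.size() );
    // First call: a zero-sized output buffer asks for the required length.
    const int needed = ::MultiByteToWideChar( CP_ACP, 0, narrow.data(), srcLen, NULL, 0 );
    if ( needed <= 0 )
        return std::wstring();   // conversion failed
    std::wstring wide( needed, L'\0' );
    // Second call: convert into the correctly sized buffer.
    ::MultiByteToWideChar( CP_ACP, 0, narrow.data(), srcLen, &wide[0], needed );
    return wide;
#else
    // Stand-in for other platforms; uses the current C locale.
    std::wstring wide( narrow.size(), L'\0' );
    const std::size_t n = std::mbstowcs( &wide[0], narrow.c_str(), wide.size() );
    if ( n == static_cast<std::size_t>( -1 ) )
        return std::wstring();   // invalid multibyte sequence
    wide.resize( n );
    return wide;
#endif
}
```

Because an explicit source length is passed rather than -1, the Win32 branch does not count or write a terminating null, which keeps the result the same length as the converted text.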

Wide to Narrow

Wide to Narrow conversion can further be decomposed into two cases:

  • using the Standard C++ Library
  • using the Win32 API

Using the Standard C++ Library

// Courtesy of Tom Widmer (VC++ MVP)
std::string StringNarrow( const std::wstring& wide ) {

    typedef std::ctype<wchar_t> CT;

    std::string narrow;
    narrow.resize( wide.length() );

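    // std::_USE is a VC++ 6.0 macro; standard compilers spell this
    // std::use_facet<CT>( std::locale() ) instead.
    CT const& ct = std::_USE(std::locale(), CT);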

    // Non-Portable
    // ct.narrow( wide.begin(), wide.end(), '_', narrow.begin() );

    // Portable
    ct.narrow( &wide[0], &wide[0] + wide.length(), '_', &narrow[0] );

    return narrow;
}
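On compilers newer than VC++ 6.0, a standard-conforming variant might look like the following sketch. The name StringNarrowStd and the fallback parameter are hypothetical; the fallback defaults to '_', matching the original, and is written wherever the locale cannot represent a wide character.

```cpp
#include <locale>
#include <string>

// Narrow a wide string with the ctype facet of the global locale.
// StringNarrowStd is a hypothetical name used for illustration.
std::string StringNarrowStd( const std::wstring& wide, char fallback = '_' ) {
    // Size the destination to exactly the source length (no + 1).
    std::string narrow( wide.size(), '\0' );
    if ( wide.empty() )
        return narrow;

    typedef std::ctype<wchar_t> CT;
    const CT& ct = std::use_facet<CT>( std::locale() );

    // narrow() takes raw pointers: [low, high) as input, then the
    // replacement character, then the output buffer.
    ct.narrow( wide.data(), wide.data() + wide.size(), fallback, &narrow[0] );
    return narrow;
}
```

Since unrepresentable characters are silently replaced, this conversion is lossy. For a filename that may contain such characters, the result should be checked before it is used.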

Using the Win32 API

See MSDN for examples of using WideCharToMultiByte.
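A rough sketch of the usual two-call pattern with the Win32 function WideCharToMultiByte follows. The helper name NarrowWin32 and the choice of CP_ACP are assumptions for illustration. The #else branch is only a stand-in built on std::wcstombs so the sketch compiles on other platforms; it is not the Win32 technique.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#else
#include <cstdlib>   // std::wcstombs, MB_CUR_MAX; non-Windows stand-in only
#endif

// NarrowWin32 is a hypothetical helper name. CP_ACP is an assumption;
// pass a different code page if another narrow encoding is wanted.
std::string NarrowWin32( const std::wstring& wide ) {
    if ( wide.empty() )
        return std::string();
#ifdef _WIN32
    const int srcLen = static_cast<int>( wide.size() );
    // First call: a zero-sized output buffer asks for the required byte count.
    const int needed = ::WideCharToMultiByte( CP_ACP, 0, wide.data(), srcLen,
                                              NULL, 0, NULL, NULL );
    if ( needed <= 0 )
        return std::string();   // conversion failed
    std::string narrow( needed, '\0' );
    // Second call: convert into the correctly sized buffer.
    ::WideCharToMultiByte( CP_ACP, 0, wide.data(), srcLen,
                           &narrow[0], needed, NULL, NULL );
    return narrow;
#else
    // Stand-in for other platforms; uses the current C locale.
    std::string narrow( wide.size() * MB_CUR_MAX, '\0' );
    const std::size_t n = std::wcstombs( &narrow[0], wide.c_str(), narrow.size() );
    if ( n == static_cast<std::size_t>( -1 ) )
        return std::string();   // character not representable
    narrow.resize( n );
    return narrow;
#endif
}
```

The last two arguments of WideCharToMultiByte (a default character and a flag reporting whether it was used) are left as NULL here, so the system default replacement applies.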

Application is Wide

Due to the predominance of Windows NT and its family, the author uses the Unicode character set exclusively. With that in mind, the following is a typical design overview. Notice that anything data related is omitted: a byte[] is a byte[].

[Figure: CharcterSetDesign.png]

The Windows API ⇔ Application conversion is fairly generic. The Application will use L"" rather than the _T("") macro. This means conversions occur frequently if UNICODE and _UNICODE are not defined.

Generally, the Crypto ⇔ Application conversion is StringWiden(...) for items such as digests. An exception is the occasional need for narrowing a filename.
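To illustrate the digest case, the sketch below hex encodes raw digest bytes into a narrow string and then widens the result for a wide application. It uses only the Standard C++ Library; HexEncode and DigestToWide are hypothetical names, and the uppercase output is an arbitrary choice. The iterator-range widening is acceptable here only because hex digits are plain ASCII; general text needs one of the conversions described earlier.

```cpp
#include <cstddef>
#include <string>

// Hex encode raw bytes into a narrow string (uppercase digits).
// HexEncode is a hypothetical helper, not the Crypto++ HexEncoder.
std::string HexEncode( const unsigned char* data, std::size_t size ) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve( size * 2 );
    for ( std::size_t i = 0; i < size; ++i ) {
        out += digits[ data[i] >> 4 ];     // high nibble
        out += digits[ data[i] & 0x0F ];   // low nibble
    }
    return out;
}

// Produce a wide string suitable for a wide application.
// Each ASCII hex digit converts directly to the same wchar_t value.
std::wstring DigestToWide( const unsigned char* digest, std::size_t size ) {
    const std::string narrow = HexEncode( digest, size );
    return std::wstring( narrow.begin(), narrow.end() );
}
```

The digest itself never changes representation; only its printable encoding crosses the narrow/wide boundary.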

Caveats

The Win32 API switches between the narrow and wide character sets based on UNICODE. The C run-time library's generic-text mappings (tchar.h) switch based on _UNICODE. The Standard C++ Library itself does not switch, so the program must choose cout or wcout explicitly. A mismatch will rear its head when one outputs using cout: one may receive memory addresses rather than strings on the console (in Visual C++ 6.0). Either #define both, or #define neither (and use cout or wcout accordingly). A similar behavior used to occur in database code.
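One way to enforce the "both or neither" rule is a compile-time check placed in a common header. The following is a minimal sketch; the names tchar_t, tstring, tcout and kWideBuild are hypothetical and are not the ones provided by tchar.h.

```cpp
#include <iostream>
#include <string>

// Refuse to build when only one of the two macros is defined.
#if defined(UNICODE) && !defined(_UNICODE)
#  error "UNICODE is defined but _UNICODE is not; define both or neither"
#elif defined(_UNICODE) && !defined(UNICODE)
#  error "_UNICODE is defined but UNICODE is not; define both or neither"
#endif

// Pick the character type and console stream to match the build.
// All four names below are hypothetical and used for illustration only.
#if defined(UNICODE)
typedef wchar_t tchar_t;
static const bool kWideBuild = true;
#  define tcout std::wcout
#else
typedef char tchar_t;
static const bool kWideBuild = false;
#  define tcout std::cout
#endif

typedef std::basic_string<tchar_t> tstring;
```

With such a header, a tstring is always written to tcout, so a wide string is never handed to the narrow stream by accident.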

When using wide.resize( narrow.length() ) (and the narrow version), do not use length() + 1: the resulting string will contain an additional null character. This will break some substring and most string matching code.
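The effect of the extra character is easy to reproduce. In the sketch below, WidenExact and WidenOneTooMany are hypothetical helpers; the second one is wrong on purpose. The iterator-range widening is an ASCII-only shortcut used to keep the example short.

```cpp
#include <string>

// Correct: the destination has exactly as many characters as the source.
std::wstring WidenExact( const std::string& narrow ) {
    return std::wstring( narrow.begin(), narrow.end() );   // ASCII-only shortcut
}

// Wrong on purpose: sizing with length() + 1 leaves an embedded L'\0'
// as the last character of the string.
std::wstring WidenOneTooMany( const std::string& narrow ) {
    std::wstring wide = WidenExact( narrow );
    wide.resize( narrow.length() + 1 );   // appends one L'\0'
    return wide;
}
```

The two results may look alike on the console, but the second is one character longer, so an equality test against the expected text fails.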

Sample

Please visit The Code Project and download A File Checksum Shell Menu Extension Dll.

Downloads

AppendixDLocales.zip - The C++ Programming Language (3rd Edition), Appendix D: Locales by Stroustrup - 232 kB