Gnash
0.8.10
Utilities to convert between std::string and std::wstring.
Enumerations | |
enum | TextEncoding { encUNSPECIFIED, encUTF8, encUTF16BE, encUTF16LE, encUTF32BE, encUTF32LE, encSCSU, encUTF7, encUTFEBCDIC, encBOCU1 } |
enum | EncodingGuess { ENCGUESS_UNICODE = 0, ENCGUESS_JIS = 1, ENCGUESS_OTHER = 2 } |
Functions | |
std::wstring | decodeCanonicalString (const std::string &str, int version) |
Converts a std::string with multibyte characters into a std::wstring. | |
std::string | encodeCanonicalString (const std::wstring &wstr, int version) |
Converts a std::wstring into canonical std::string. | |
std::string | encodeLatin1Character (boost::uint32_t ucsCharacter) |
Encodes the given wide character into an at least 8-bit character. | |
boost::uint32_t | decodeNextUnicodeCharacter (std::string::const_iterator &it, const std::string::const_iterator &e) |
Return the next Unicode character in the UTF-8 encoded string. | |
std::string | encodeUnicodeCharacter (boost::uint32_t ucs_character) |
Encodes the given wide character into a canonical string, theoretically up to 6 chars in length. | |
char * | stripBOM (char *in, size_t &size, TextEncoding &encoding) |
Interpret (and skip) Byte Order Mark in input stream. | |
const char * | textEncodingName (TextEncoding enc) |
Return name of a text encoding. | |
EncodingGuess | guessEncoding (const std::string &s, int &length, std::vector< int > &offsets) |
Common code for guessing the encoding of arbitrary text, distinguishing between Unicode, JIS, and other (see EncodingGuess). |
Strings in Gnash are generally stored as std::string. We have to deal, however, with characters outside the 7-bit ASCII range (code points 128 and above), which can be encoded in two different ways.
SWF6 and later use UTF-8, a multibyte encoding that allows many thousands of unique codes. Multibyte characters are difficult to handle because the length of a string, which many string operations depend on, cannot be known without parsing it. Converting the string to a std::wstring facilitates string operations, as the length of the string then equals the number of valid characters. (A wchar_t is generally a uint32_t, and two bytes is its minimum size; the proprietary player (pp) seems to handle only characters up to 65535.)
SWF5 and earlier, however, used the ISO-8859 specification, allowing the standard 128 ASCII characters plus 128 extra characters that depend on the particular subset of ISO-8859. Characters are 8 bits, not the ASCII standard 7. SWF5 cannot handle multi-byte characters without special functions.
It is important that SWF5 can distinguish between the two encodings, so we cannot convert all strings to UTF-8. Please note that, although this namespace is called utf8, what the Adobe player uses is only loosely related to real Unicode, so the encoding support here is correspondingly non-standard.
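To make the length problem concrete, here is a minimal sketch showing that the size of a std::string holding UTF-8 counts bytes, not characters. The helper name countCodePoints is hypothetical and not part of Gnash; it assumes well-formed input and simply skips continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a Gnash function): counts UTF-8 code points by
// counting every byte that is not a continuation byte (10xxxxxx).
// Assumes the input is well-formed UTF-8.
std::size_t countCodePoints(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For the string "caf\xC3\xA9" (the word "café" in UTF-8), size() reports 5 bytes while the helper reports 4 characters, which is the mismatch that converting to std::wstring avoids.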
DSOEXPORT std::wstring gnash::utf8::decodeCanonicalString (const std::string &str, int version)
Converts a std::string with multibyte characters into a std::wstring.
str | the canonical string to convert. |
version | the SWF version, used to decide how to decode the string. For SWF5, UTF-8 (or any other) multibyte encoded characters are converted char by char, mangling the string. |
References gnash::key::e, and decodeNextUnicodeCharacter().
Referenced by gnash::TextField::TextField(), gnash::TextField::replaceSelection(), and gnash::TextField::updateText().
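The "char by char" conversion described for SWF5 can be sketched as follows. This illustrates that branch only, under the assumption that each byte is widened independently; widenBytewise is a hypothetical name, not the Gnash implementation.

```cpp
#include <cassert>
#include <string>

// Hypothetical illustration of the SWF5 path: every byte becomes one wide
// character, so a multibyte UTF-8 sequence ends up as several characters.
std::wstring widenBytewise(const std::string& str)
{
    std::wstring out;
    out.reserve(str.size());
    for (unsigned char c : str) {
        out.push_back(static_cast<wchar_t>(c));
    }
    return out;
}
```

Feeding the two-byte UTF-8 sequence for U+00E9 ("\xC3\xA9") through this function yields a wide string of length 2 rather than 1, which is the mangling the parameter description refers to.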
DSOEXPORT boost::uint32_t gnash::utf8::decodeNextUnicodeCharacter (std::string::const_iterator &it, const std::string::const_iterator &e)
Return the next Unicode character in the UTF-8 encoded string.
Invalid UTF-8 sequences produce a U+FFFD character as output. Advances string iterator past the character returned, unless the returned character is '\0', in which case the iterator does not advance.
References FIRST_BYTE, and NEXT_BYTE.
Referenced by decodeCanonicalString(), and guessEncoding().
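The iterator contract described above (advance past the decoded character, return U+FFFD for invalid input, do not advance on '\0') can be sketched like this. decodeNext is a hypothetical stand-in and not the Gnash code: it handles only sequences of one to four bytes, does not reject overlong encodings, and treating the end of input like '\0' is an assumption made here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for illustration; not the Gnash implementation.
std::uint32_t decodeNext(std::string::const_iterator& it,
                         const std::string::const_iterator& end)
{
    const std::uint32_t invalid = 0xFFFD;

    // Assumption: end of input is treated like '\0' (iterator not advanced).
    if (it == end || *it == '\0') return 0;

    const unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) return lead;  // plain ASCII

    int extra = 0;          // continuation bytes expected
    std::uint32_t cp = 0;   // code point accumulated so far
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else return invalid;    // stray continuation byte or unsupported lead

    for (int i = 0; i < extra; ++i) {
        if (it == end) return invalid;  // truncated sequence
        const unsigned char c = static_cast<unsigned char>(*it);
        if ((c & 0xC0) != 0x80) return invalid;  // leave 'it' on the bad byte
        cp = (cp << 6) | (c & 0x3F);
        ++it;
    }
    return cp;
}
```

A caller would typically loop until the function returns 0, appending each result to a std::wstring.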
DSOEXPORT std::string gnash::utf8::encodeCanonicalString (const std::wstring &wstr, int version)
Converts a std::wstring into canonical std::string.
wstr | the wide string to convert. |
version | the SWF version, used to decide how to encode the string. |
For SWF 5, each character is stored as an 8-bit (at least) char, rather than converting it to a canonical UTF-8 byte sequence. Gnash can then distinguish between 8-bit characters, which it handles correctly, and multi-byte characters, which are regarded as multiple characters for string methods.
References encodeUnicodeCharacter(), and encodeLatin1Character().
Referenced by gnash::TextField::setTextValue(), gnash::TextField::get_text_value(), and gnash::TextField::get_htmltext_value().
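The version-dependent behaviour can be sketched as below. Both function names are hypothetical, the UTF-8 helper follows the standard encoding up to four bytes, and truncating to the low 8 bits for SWF5 is an assumption about how "an 8-bit (at least) char" is produced; this is not the Gnash source.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: standard UTF-8 encoding for code points up to U+10FFFF.
void appendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | ((cp >> 18) & 0x07)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Hypothetical dispatch on the SWF version, mirroring the description above.
std::string encodeForVersion(const std::wstring& wstr, int version)
{
    std::string out;
    for (wchar_t wc : wstr) {
        const std::uint32_t cp = static_cast<std::uint32_t>(wc);
        if (version > 5) {
            appendUtf8(out, cp);                   // SWF6+: UTF-8 bytes
        } else {
            out.push_back(static_cast<char>(cp));  // SWF5: one 8-bit char
        }
    }
    return out;
}
```

With the single character U+00E9 as input, the sketch produces the two bytes C3 A9 for version 6 and the single byte E9 for version 5.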
DSOEXPORT std::string gnash::utf8::encodeLatin1Character (boost::uint32_t ucsCharacter)
Encodes the given wide character into an at least 8-bit character.
Allows storage of Latin1 (ISO-8859-1) characters. This is the format of SWF5 and below.
Referenced by encodeCanonicalString().
DSOEXPORT std::string gnash::utf8::encodeUnicodeCharacter (boost::uint32_t ucs_character)
Encodes the given wide character into a canonical string, theoretically up to 6 chars in length.
Referenced by encodeCanonicalString().
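The "up to 6 chars" figure matches the original UTF-8 scheme (RFC 2279), which allowed sequences of one to six bytes for values up to 0x7FFFFFFF. The sketch below shows that scheme; encodeUpTo6 is a hypothetical name, and whether Gnash uses exactly these ranges is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical sketch of the original (RFC 2279) UTF-8 scheme, which uses
// between 1 and 6 bytes. Returns an empty string for values above 0x7FFFFFFF.
std::string encodeUpTo6(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
        return out;
    }
    if (cp > 0x7FFFFFFF) return out;  // not representable in this scheme

    int extra = 0;               // number of continuation bytes
    unsigned char leadMark = 0;  // high bits of the lead byte
    if (cp < 0x800)          { extra = 1; leadMark = 0xC0; }
    else if (cp < 0x10000)   { extra = 2; leadMark = 0xE0; }
    else if (cp < 0x200000)  { extra = 3; leadMark = 0xF0; }
    else if (cp < 0x4000000) { extra = 4; leadMark = 0xF8; }
    else                     { extra = 5; leadMark = 0xFC; }

    // Lead byte carries the most significant payload bits.
    out.push_back(static_cast<char>(leadMark | (cp >> (6 * extra))));
    // Each continuation byte carries 6 payload bits, most significant first.
    for (int i = extra - 1; i >= 0; --i) {
        out.push_back(static_cast<char>(0x80 | ((cp >> (6 * i)) & 0x3F)));
    }
    return out;
}
```

Modern UTF-8 (RFC 3629) stops at four bytes and U+10FFFF, so the five- and six-byte forms are of theoretical interest only.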
DSOEXPORT EncodingGuess gnash::utf8::guessEncoding (const std::string &s, int &length, std::vector< int > &offsets)
Common code for guessing the encoding of arbitrary text, distinguishing between Unicode, JIS, and other (see EncodingGuess).
TODO: It's doubtful if this even works, and it may not be useful at all.
References width, gnash::key::e, length, gnash::key::c, decodeNextUnicodeCharacter(), ENCGUESS_UNICODE, ENCGUESS_JIS, and ENCGUESS_OTHER.
DSOEXPORT char * gnash::utf8::stripBOM (char *in, size_t &size, TextEncoding &encoding)
Interpret (and skip) Byte Order Mark in input stream.
This function takes a pointer to a buffer and returns the start of the actual data, after the BOM if one is present. No conversion is performed and no bytes are copied: the BOM is simply skipped, and its interpretation is returned through the encoding output parameter.
See http://en.wikipedia.org/wiki/Byte-order_mark
in | The input buffer. |
size | Size of the input buffer, will be decremented by the size of the BOM, if any. |
encoding | Output parameter, will always be set. encUNSPECIFIED if no BOM is found. |
References encUNSPECIFIED, encUTF16LE, encUTF16BE, encUTF8, encUTF32BE, and encUTF32LE.
Referenced by gnash::movie_root::LoadCallback::processLoad().
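A minimal sketch of this kind of BOM detection follows. The Bom enum and skipBom function are hypothetical stand-ins for TextEncoding and stripBOM, covering only the five encodings listed in the References line; the order of the checks is an assumption, not taken from the Gnash source.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for TextEncoding, limited to BOM-detectable values.
enum class Bom { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Hypothetical sketch: returns a pointer past the BOM (if any), shrinks
// 'size' accordingly, and always sets 'bom'. Nothing is copied or converted.
char* skipBom(char* in, std::size_t& size, Bom& bom)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(in);
    std::size_t skip = 0;
    bom = Bom::None;

    // 4-byte marks are tested first because the UTF-32LE mark (FF FE 00 00)
    // begins with the UTF-16LE mark (FF FE).
    if (size >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF) {
        bom = Bom::Utf32BE; skip = 4;
    } else if (size >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00) {
        bom = Bom::Utf32LE; skip = 4;
    } else if (size >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) {
        bom = Bom::Utf8; skip = 3;
    } else if (size >= 2 && p[0] == 0xFE && p[1] == 0xFF) {
        bom = Bom::Utf16BE; skip = 2;
    } else if (size >= 2 && p[0] == 0xFF && p[1] == 0xFE) {
        bom = Bom::Utf16LE; skip = 2;
    }

    size -= skip;
    return in + skip;
}
```

Note that FF FE 00 00 is inherently ambiguous: it could also be a UTF-16LE mark followed by a NUL character, and this sketch resolves it in favour of UTF-32LE.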
DSOEXPORT const char * gnash::utf8::textEncodingName (TextEncoding enc)
Return name of a text encoding.
References encUNSPECIFIED, encUTF8, encUTF16BE, encUTF16LE, encUTF32BE, encUTF32LE, encSCSU, encUTF7, encUTFEBCDIC, and encBOCU1.
Referenced by gnash::movie_root::LoadCallback::processLoad().