Utilities to convert between std::string and std::wstring.

Enumerations
enum	TextEncoding { encUNSPECIFIED, encUTF8, encUTF16BE, encUTF16LE, encUTF32BE, encUTF32LE, encSCSU, encUTF7, encUTFEBCDIC, encBOCU1 }
enum	EncodingGuess { ENCGUESS_UNICODE = 0, ENCGUESS_JIS = 1, ENCGUESS_OTHER = 2 }
Functions
std::wstring	decodeCanonicalString (const std::string &str, int version)
	Converts a std::string with multibyte characters into a std::wstring.
std::string	encodeCanonicalString (const std::wstring &wstr, int version)
	Converts a std::wstring into canonical std::string.
std::string	encodeLatin1Character (boost::uint32_t ucsCharacter)
	Encodes the given wide character into an at least 8-bit character.
boost::uint32_t	decodeNextUnicodeCharacter (std::string::const_iterator &it, const std::string::const_iterator &e)
	Return the next Unicode character in the UTF-8 encoded string.
std::string	encodeUnicodeCharacter (boost::uint32_t ucs_character)
	Encodes the given wide character into a canonical string, theoretically up to 6 chars in length.
char *	stripBOM (char *in, size_t &size, TextEncoding &encoding)
	Interpret (and skip) Byte Order Mark in input stream.
const char *	textEncodingName (TextEncoding enc)
	Return name of a text encoding.
EncodingGuess	guessEncoding (const std::string &s, int &length, std::vector< int > &offsets)
	Guesses the encoding of arbitrary text, choosing between Unicode, JIS, and other (see EncodingGuess).

Detailed Description

Utilities to convert between std::string and std::wstring.

Strings in Gnash are generally stored as std::strings. We have to deal, however, with characters outside the standard ASCII range (the first 128 codes), which can be encoded in two different ways.

SWF6 and later use UTF-8, which encodes characters as multibyte sequences and allows many thousands of unique codes. Multibyte characters are difficult to handle because the character count of a string, which many string operations need, cannot be known without parsing the string. Converting the string to a std::wstring simplifies string operations, as the length of the wstring equals the number of valid characters. A wide character is generally a uint32_t, and two bytes is the minimum size of a wchar; the proprietary player (pp) seems to handle only characters up to 65535.
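To illustrate why the byte length of a UTF-8 string is not its character count, here is a minimal sketch. The function name countCodePoints is hypothetical and not part of Gnash, and the code assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of gnash::utf8.
// Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumes the input is well-formed UTF-8.
std::size_t countCodePoints(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;  // every non-continuation byte starts a character
    }
    return n;
}
```

For the five-character word "naïve", std::string::size() reports six bytes because the ï occupies two, while the count above is five.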

SWF5 and earlier, however, used the ISO-8859 specification, allowing the standard 128 ASCII characters plus 128 extra characters that depend on the particular subset of ISO-8859. Characters are 8 bits, not the ASCII standard 7. SWF5 cannot handle multi-byte characters without special functions.

It is important that SWF5 can distinguish between the two encodings, so we cannot convert all strings to UTF-8. Please note that, although this namespace is called utf8, what the Adobe player uses is only loosely related to real Unicode, so the encoding support here is correspondingly non-standard.

Enumeration Type Documentation

enum gnash::utf8::EncodingGuess

Enumerator:

ENCGUESS_UNICODE
ENCGUESS_JIS
ENCGUESS_OTHER

enum gnash::utf8::TextEncoding

Enumerator:

encUNSPECIFIED
encUTF8
encUTF16BE
encUTF16LE
encUTF32BE
encUTF32LE
encSCSU
encUTF7
encUTFEBCDIC
encBOCU1

Function Documentation

DSOEXPORT std::wstring gnash::utf8::decodeCanonicalString	(	const std::string &	str,
		int	version
	)

Converts a std::string with multibyte characters into a std::wstring.

Returns: a version-dependent wstring.

Parameters:

str	the canonical string to convert.
version	the SWF version, used to decide how to decode the string. For SWF5, UTF-8 (or any other) multibyte encoded characters are converted char by char, mangling the string.
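The char-by-char conversion described for SWF5 can be pictured with the following sketch. The name widenBytewise is hypothetical, and the code only illustrates the idea; it is not taken from the Gnash implementation.

```cpp
#include <string>

// Hypothetical illustration of a char-by-char conversion:
// every byte of the input becomes exactly one wide character.
std::wstring widenBytewise(const std::string& str)
{
    std::wstring out;
    out.reserve(str.size());
    for (char c : str) {
        // Go through unsigned char so bytes >= 0x80 do not become negative.
        out.push_back(static_cast<wchar_t>(static_cast<unsigned char>(c)));
    }
    return out;
}
```

Applied to the UTF-8 bytes C3 A9 (the single character é), this yields two wide characters, 0xC3 and 0xA9, which is the mangling the parameter description refers to. A Latin-1 string, by contrast, survives unchanged.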

References gnash::key::e, and decodeNextUnicodeCharacter().

Referenced by gnash::TextField::TextField(), gnash::TextField::replaceSelection(), and gnash::TextField::updateText().

DSOEXPORT boost::uint32_t gnash::utf8::decodeNextUnicodeCharacter	(	std::string::const_iterator &	it,
		const std::string::const_iterator &	e
	)

Return the next Unicode character in the UTF-8 encoded string.

Invalid UTF-8 sequences produce a U+FFFD character as output. Advances string iterator past the character returned, unless the returned character is '\0', in which case the iterator does not advance.
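The iterator contract above can be sketched with a simplified decoder. This is an assumption-laden illustration, not the Gnash code: decodeNext is a hypothetical name, only 1- to 4-byte sequences are handled, and overlong forms and surrogates are not rejected.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for decodeNextUnicodeCharacter().
// Simplified: handles 1- to 4-byte sequences only.
std::uint32_t decodeNext(std::string::const_iterator& it,
                         const std::string::const_iterator& end)
{
    const std::uint32_t invalid = 0xFFFD;
    if (it == end || *it == '\0') return 0;   // do not advance at the end

    const unsigned char lead = static_cast<unsigned char>(*it++);
    if (lead < 0x80) return lead;             // plain ASCII

    int extra;                                // continuation bytes expected
    std::uint32_t cp;                         // payload bits collected so far
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else return invalid;                      // stray continuation or bad lead byte

    for (; extra > 0; --extra) {
        if (it == end) return invalid;        // truncated sequence
        const unsigned char c = static_cast<unsigned char>(*it);
        if ((c & 0xC0) != 0x80) return invalid;  // leave 'it' on the offending byte
        ++it;
        cp = (cp << 6) | (c & 0x3F);
    }
    return cp;
}
```

A caller would typically loop until the function returns 0, treating 0xFFFD as a replacement for any damaged sequence.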

References FIRST_BYTE, and NEXT_BYTE.

Referenced by decodeCanonicalString(), and guessEncoding().

DSOEXPORT std::string gnash::utf8::encodeCanonicalString	(	const std::wstring &	wstr,
		int	version
	)

Converts a std::wstring into canonical std::string.

Returns: a version-dependent encoded std::string.

Parameters:

wstr	the wide string to convert.
version	the SWF version, used to decide how to encode the string.

For SWF 5, each character is stored as an 8-bit (at least) char, rather than converting it to a canonical UTF-8 byte sequence. Gnash can then distinguish between 8-bit characters, which it handles correctly, and multi-byte characters, which are regarded as multiple characters for string methods.

References encodeUnicodeCharacter(), and encodeLatin1Character().

Referenced by gnash::TextField::setTextValue(), gnash::TextField::get_text_value(), and gnash::TextField::get_htmltext_value().

DSOEXPORT std::string gnash::utf8::encodeLatin1Character ( boost::uint32_t ucsCharacter )

Encodes the given wide character into an at least 8-bit character.

Allows storage of Latin1 (ISO-8859-1) characters. This is the format of SWF5 and below.

Referenced by encodeCanonicalString().

DSOEXPORT std::string gnash::utf8::encodeUnicodeCharacter ( boost::uint32_t ucs_character )

Encodes the given wide character into a canonical string, theoretically up to 6 chars in length.
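The "up to 6 chars" figure matches the original UTF-8 scheme, which defined sequences of one to six bytes; the current standard stops at four. A minimal sketch of such an encoder follows. The name encodeUtf8 is hypothetical, the code is not the Gnash implementation, and it assumes the input is at most 0x7FFFFFFF.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for encodeUnicodeCharacter().
// Assumes cp <= 0x7FFFFFFF (the largest value a 6-byte form can hold).
std::string encodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {                          // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
        return out;
    }
    // First values that no longer fit in the 2-, 3-, 4- and 5-byte forms.
    static const std::uint32_t limits[] = { 0x800, 0x10000, 0x200000, 0x4000000 };
    int extra = 1;                            // number of continuation bytes
    while (extra < 5 && cp >= limits[extra - 1]) ++extra;

    // Lead byte: (extra + 1) high bits set, followed by the top payload bits.
    const unsigned char prefix = static_cast<unsigned char>(0xFF00 >> (extra + 1));
    out += static_cast<char>(prefix | (cp >> (6 * extra)));
    // Continuation bytes: 10xxxxxx, six payload bits each, most significant first.
    for (int i = extra - 1; i >= 0; --i) {
        out += static_cast<char>(0x80 | ((cp >> (6 * i)) & 0x3F));
    }
    return out;
}
```

With this sketch, U+20AC (the euro sign) becomes the three bytes E2 82 AC, and only values of 0x4000000 and above need the full six bytes.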

Referenced by encodeCanonicalString().

DSOEXPORT EncodingGuess gnash::utf8::guessEncoding	(	const std::string &	s,
		int &	length,
		std::vector< int > &	offsets
	)

Guesses the encoding of arbitrary text, choosing between Unicode, JIS, and other (see EncodingGuess).

TODO: It's doubtful if this even works, and it may not be useful at all.

References width, gnash::key::e, length, gnash::key::c, decodeNextUnicodeCharacter(), ENCGUESS_UNICODE, ENCGUESS_JIS, and ENCGUESS_OTHER.

DSOEXPORT char * gnash::utf8::stripBOM	(	char *	in,
		size_t &	size,
		TextEncoding &	encoding
	)

Interpret (and skip) Byte Order Mark in input stream.

This function takes a pointer to a buffer and returns a pointer to the start of the actual data, after the BOM if one is present. No conversion is performed and no bytes are copied; the BOM is simply skipped, and the encoding it indicates is reported through the encoding output parameter.

See http://en.wikipedia.org/wiki/Byte-order_mark

Parameters:

in	The input buffer.
size	Size of the input buffer, will be decremented by the size of the BOM, if any.
encoding	Output parameter, will always be set. encUNSPECIFIED if no BOM is found.

Returns: A pointer either equal to 'in' or some bytes inside it.
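The skip-and-report behaviour can be sketched as follows. The enum Encoding and the function skipBom are hypothetical stand-ins rather than the Gnash declarations, and the sketch covers only the five BOMs for UTF-8, UTF-16 and UTF-32.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical, reduced counterpart of TextEncoding.
enum Encoding { UNSPECIFIED, UTF8, UTF16BE, UTF16LE, UTF32BE, UTF32LE };

// Hypothetical stand-in for stripBOM(): returns a pointer past any BOM,
// shrinks 'size' by the BOM length, and always sets 'enc'.
char* skipBom(char* in, std::size_t& size, Encoding& enc)
{
    struct Mark { const char* bytes; std::size_t len; Encoding enc; };
    // UTF-32LE is tested before UTF-16LE because its BOM also starts with FF FE.
    static const Mark marks[] = {
        { "\x00\x00\xFE\xFF", 4, UTF32BE },
        { "\xFF\xFE\x00\x00", 4, UTF32LE },
        { "\xEF\xBB\xBF",     3, UTF8    },
        { "\xFE\xFF",         2, UTF16BE },
        { "\xFF\xFE",         2, UTF16LE },
    };
    enc = UNSPECIFIED;
    for (const Mark& m : marks) {
        if (size >= m.len && std::memcmp(in, m.bytes, m.len) == 0) {
            enc = m.enc;
            size -= m.len;
            return in + m.len;   // no copy: just point past the BOM
        }
    }
    return in;                   // no BOM found, buffer untouched
}
```

The order of the table matters in this sketch: a buffer starting with FF FE 00 00 is reported as UTF-32LE, although it could also be UTF-16LE text whose first character is NUL.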

References encUNSPECIFIED, encUTF16LE, encUTF16BE, encUTF8, encUTF32BE, and encUTF32LE.

Referenced by gnash::movie_root::LoadCallback::processLoad().

DSOEXPORT const char * gnash::utf8::textEncodingName ( TextEncoding enc )

Return name of a text encoding.

References encUNSPECIFIED, encUTF8, encUTF16BE, encUTF16LE, encUTF32BE, encUTF32LE, encSCSU, encUTF7, encUTFEBCDIC, and encBOCU1.

Referenced by gnash::movie_root::LoadCallback::processLoad().
