Functions operating on Unicode characters and UTF-8 strings.
This section describes a number of functions for dealing with Unicode characters and strings. There are analogues of the traditional ctype.h character classification and case conversion functions, UTF-8 analogues of some string utility functions, functions to perform normalization, case conversion and collation on UTF-8 strings and finally functions to convert between the UTF-8, UTF-16 and UCS-4 encodings of Unicode.
The implementations of the Unicode functions in GLib are based on the Unicode Character Data tables, which are available from www.unicode.org. GLib 2.8 supports Unicode 4.0, GLib 2.10 supports Unicode 4.1, GLib 2.12 supports Unicode 5.0.
unsigned-int32
) ⇒ (ret bool
)Checks whether ch is a valid Unicode character. Some possible integer values of ch will not be valid. 0 is considered a valid character, though it's normally a string terminator.
- ch
- a Unicode character
- ret
- ‘
#t
’ if ch is a valid Unicode character
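As a rough illustration of why some integer values are rejected, the following Python sketch applies a simplified validity rule: the value must lie inside the Unicode code space and must not be a surrogate. This is an assumption made for illustration, not the GLib implementation, which may apply further rules; the name `is_valid_code_point` is hypothetical.

```python
def is_valid_code_point(ch: int) -> bool:
    """Simplified validity test (assumption: in range and not a surrogate)."""
    if ch < 0 or ch > 0x10FFFF:
        return False
    # U+D800..U+DFFF are reserved for UTF-16 surrogate pairs.
    return not (0xD800 <= ch <= 0xDFFF)


if __name__ == "__main__":
    for value in (0, 0x41, 0xD800, 0x10FFFF, 0x110000):
        print(hex(value), is_valid_code_point(value))
```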
unsigned-int32
) ⇒ (ret bool
)Determines whether a character is alphanumeric. Given some UTF-8 text, obtain a character value with
g-utf8-get-char
.
- c
- a Unicode character
- ret
- ‘
#t
’ if c is an alphanumeric character
unsigned-int32
) ⇒ (ret bool
)Determines whether a character is alphabetic (i.e. a letter). Given some UTF-8 text, obtain a character value with
g-utf8-get-char
.
- c
- a Unicode character
- ret
- ‘
#t
’ if c is an alphabetic character
unsigned-int32
) ⇒ (ret bool
)Determines whether a character is a control character. Given some UTF-8 text, obtain a character value with
g-utf8-get-char
.
- c
- a Unicode character
- ret
- ‘
#t
’ if c is a control character
unsigned-int32
) ⇒ (ret bool
)Determines whether a character is numeric (i.e. a digit). This covers ASCII 0-9 and also digits in other languages/scripts. Given some UTF-8 text, obtain a character value with
g-utf8-get-char
.
- c
- a Unicode character
- ret
- ‘
#t
’ if c is a digit
unsigned-int32
) ⇒ (ret bool
)Determines whether a character is printable and not a space (returns ‘#f’ for control characters, format characters, and spaces). g-unichar-isprint is similar, but returns ‘#t’ for spaces. Given some UTF-8 text, obtain a character value with g-utf8-get-char.
- c
- a Unicode character
- ret
- ‘
#t
’ if c is printable unless it's a space
unsigned-int32
) ⇒ (ret bool
)Determines whether a character is a lowercase letter. Given some UTF-8 text, obtain a character value with
g-utf8-get-char
.
- c
- a Unicode character
- ret
- ‘
#t
’ if c is a lowercase letter
unsigned-int32
) ⇒ (ret bool
)Determines whether a character is printable. Unlike g-unichar-isgraph, returns ‘#t’ for spaces. Given some UTF-8 text, obtain a character value with g-utf8-get-char.
- c
- a Unicode character
- ret
- ‘
#t
’ if c is printable
unsigned-int32
) ⇒ (ret bool
)Determines whether a character is punctuation or a symbol. Given some UTF-8 text, obtain a character value with
g-utf8-get-char
.
- c
- a Unicode character
- ret
- ‘
#t
’ if c is a punctuation or symbol character
unsigned-int32
) ⇒ (ret bool
)Determines whether a character is a space, tab, or line separator (newline, carriage return, etc.). Given some UTF-8 text, obtain a character value with g-utf8-get-char.
(Note: don't use this to do word breaking; use Pango or an equivalent to get word breaking right, because the algorithm is fairly complex.)
- c
- a Unicode character
- ret
- ‘
#t
’ if c is a space character
unsigned-int32
) ⇒ (ret bool
)Determines if a character is uppercase.
- c
- a Unicode character
- ret
- ‘
#t
’ if c is an uppercase character
unsigned-int32
) ⇒ (ret bool
)Determines if a character is a hexadecimal digit.
- c
- a Unicode character.
- ret
- ‘
#t
’ if the character is a hexadecimal digit
unsigned-int32
) ⇒ (ret bool
)Determines if a character is titlecase. Some characters in Unicode which are composites, such as the DZ digraph, have three case variants instead of just two. The titlecase form is used at the beginning of a word where only the first letter is capitalized. The titlecase form of the DZ digraph is U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z.
- c
- a Unicode character
- ret
- ‘
#t
’ if the character is titlecase
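The three case variants of the DZ digraph can be inspected with Python's built-in string methods, used here only as a stand-in for the GLib case functions. The helper name `case_variants` is hypothetical.

```python
import unicodedata


def case_variants(c: str) -> tuple:
    """Return the (lowercase, titlecase, uppercase) forms of one character."""
    return (c.lower(), c.title(), c.upper())


if __name__ == "__main__":
    lower, title, upper = case_variants("\u01f3")  # LATIN SMALL LETTER DZ
    for form in (lower, title, upper):
        # Print the code point, its general category and its name.
        print(f"U+{ord(form):04X}", unicodedata.category(form), unicodedata.name(form))
```

The titlecase form reports the general category `Lt`, which is distinct from `Lu` (uppercase) and `Ll` (lowercase).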
unsigned-int32
) ⇒ (ret bool
)Determines if a given character is assigned in the Unicode standard.
- c
- a Unicode character
- ret
- ‘
#t
’ if the character has an assigned value
unsigned-int32
) ⇒ (ret bool
)Determines if a character is typically rendered in a double-width cell.
- c
- a Unicode character
- ret
- ‘
#t
’ if the character is wide
unsigned-int32
) ⇒ (ret bool
)Determines if a character is typically rendered in a double-width cell under legacy East Asian locales. If a character is wide according to g-unichar-iswide, then it is also reported wide with this function, but the converse is not necessarily true. See Unicode Standard Annex #11 for details.
- c
- a Unicode character
- ret
- ‘
#t
’ if the character is wide in legacy East Asian locales
Since 2.12
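The relationship between the two width predicates can be sketched with the East Asian Width property exposed by Python's `unicodedata` module. The sketch assumes that "wide" corresponds to the property values W and F, and that the legacy East Asian variant additionally accepts the ambiguous value A; the function names are hypothetical and the GLib tables may differ in detail.

```python
import unicodedata


def is_wide(c: str) -> bool:
    """Assumption: wide means East Asian Width W (wide) or F (fullwidth)."""
    return unicodedata.east_asian_width(c) in ("W", "F")


def is_wide_cjk(c: str) -> bool:
    """Assumption: legacy CJK contexts also render ambiguous (A) as wide."""
    return unicodedata.east_asian_width(c) in ("W", "F", "A")


if __name__ == "__main__":
    for c in ("A", "\u4e2d", "\u03b1"):  # Latin A, a CJK ideograph, Greek alpha
        print(f"U+{ord(c):04X}", is_wide(c), is_wide_cjk(c))
```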
unsigned-int32
) ⇒ (ret unsigned-int32
)Converts a character to uppercase.
- c
- a Unicode character
- ret
- the result of converting c to uppercase. If c is not a lowercase or titlecase character, or has no uppercase equivalent, c is returned unchanged.
unsigned-int32
) ⇒ (ret unsigned-int32
)Converts a character to lowercase.
- c
- a Unicode character.
- ret
- the result of converting c to lowercase. If c is not an uppercase or titlecase character, or has no lowercase equivalent, c is returned unchanged.
unsigned-int32
) ⇒ (ret unsigned-int32
)Converts a character to titlecase.
- c
- a Unicode character
- ret
- the result of converting c to titlecase. If c is not an uppercase or lowercase character, c is returned unchanged.
unsigned-int32
) ⇒ (ret int
)Determines the numeric value of a character as a decimal digit.
- c
- a Unicode character
- ret
- If c is a decimal digit (according to
g-unichar-isdigit
), its numeric value. Otherwise, -1.
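Decimal digits exist in many scripts, not only ASCII. The sketch below uses Python's `unicodedata.decimal` as a stand-in to show the "value or -1" convention described above; `digit_value` is a hypothetical name, and Python's notion of a decimal digit is assumed to match the one used here.

```python
import unicodedata


def digit_value(c: str) -> int:
    """Return the decimal digit value of c, or -1 if c is not a decimal digit."""
    return unicodedata.decimal(c, -1)


if __name__ == "__main__":
    for c in ("7", "\u0663", "x"):  # ASCII 7, ARABIC-INDIC DIGIT THREE, a letter
        print(f"U+{ord(c):04X}", digit_value(c))
```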
unsigned-int32
) ⇒ (ret int
)Determines the numeric value of a character as a hexadecimal digit.
- c
- a Unicode character
- ret
- If c is a hex digit (according to
g-unichar-isxdigit
), its numeric value. Otherwise, -1.
unsigned-int32
) ⇒ (ret <g-unicode-type>
)Classifies a Unicode character by type.
- c
- a Unicode character
- ret
- the type of the character.
unsigned-int32
) ⇒ (ret <g-unicode-break-type>
)Determines the break type of c. c should be a Unicode character (to derive a character from UTF-8 encoded text, use g-utf8-get-char). The break type is used to find word and line breaks ("text boundaries"). Pango implements the Unicode boundary resolution algorithms, and normally you would use a function such as pango-break instead of caring about break types yourself.
- c
- a Unicode character
- ret
- the break type of c
unsigned-int32
) ⇒ (ret bool
) (mirrored_ch unsigned-int32
)In Unicode, some characters are mirrored. This means that their images are mirrored horizontally in text that is laid out from right to left. For instance, "(" would become its mirror image, ")", in right-to-left text.
If ch has the Unicode mirrored property, there is another Unicode character whose glyph is typically the mirror image of ch's glyph, and mirrored-ch is set, that character is stored in the location given by mirrored-ch. Otherwise the original character is stored.
- ch
- a Unicode character
- mirrored-ch
- location to store the mirrored character
- ret
- ‘#t’ if ch has a mirrored character, ‘#f’ otherwise
Since 2.4
mchars
) ⇒ (ret unsigned-int32
)Converts a sequence of bytes encoded as UTF-8 to a Unicode character. If p does not point to a valid UTF-8 encoded character, results are undefined. If you are not sure that the bytes are complete valid Unicode characters, you should use
g-utf8-get-char-validated
instead.
- p
- a pointer to Unicode character encoded as UTF-8
- ret
- the resulting character
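To make the byte-to-character conversion concrete, here is a minimal Python sketch of UTF-8 decoding for a single character. Like the function described above, it assumes the input is valid UTF-8 and does not check continuation bytes; `utf8_get_char` is a hypothetical helper, not the binding itself.

```python
def utf8_get_char(data: bytes, pos: int = 0) -> int:
    """Decode the code point that starts at data[pos]; assumes valid UTF-8."""
    first = data[pos]
    if first < 0x80:              # 0xxxxxxx: single-byte (ASCII) character
        return first
    if first >> 5 == 0b110:       # 110xxxxx: two-byte sequence
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        length, value = 4, first & 0x07
    else:
        raise ValueError("not the first byte of a UTF-8 sequence")
    # Each continuation byte (10xxxxxx) contributes six more bits.
    for byte in data[pos + 1 : pos + length]:
        value = (value << 6) | (byte & 0x3F)
    return value


if __name__ == "__main__":
    for text in ("a", "\u00e9", "\u20ac", "\U0001f600"):
        print(text.encode("utf-8").hex(), hex(utf8_get_char(text.encode("utf-8"))))
```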
mchars
) ⇒ (ret mchars
)Finds the start of the next UTF-8 character in the string after p.
p does not have to be at the beginning of a UTF-8 character. No check is made to see if the character found is actually valid other than it starts with an appropriate byte.
- p
- a pointer to a position within a UTF-8 encoded string
- end
- a pointer to the end of the string, or ‘#f’ to indicate that the string is nul-terminated, in which case the returned value will be
- ret
- a pointer to the found character or ‘#f’
mchars
) ⇒ (ret long
)Returns the length of the string in characters.
- p
- pointer to the start of a UTF-8 encoded string.
- max
- the maximum number of bytes to examine. If max is less than 0, then the string is assumed to be nul-terminated. If max is 0, p will not be examined and may be ‘#f’.
- ret
- the length of the string in characters
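The difference between length in bytes and length in characters can be sketched in a few lines of Python. The sketch counts every byte that is not a UTF-8 continuation byte and assumes valid input; `utf8_strlen` is a hypothetical name and the max parameter is not modelled.

```python
def utf8_strlen(data: bytes) -> int:
    """Count characters by counting bytes that are not continuation bytes."""
    # Continuation bytes have the bit pattern 10xxxxxx.
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


if __name__ == "__main__":
    encoded = "na\u00efve".encode("utf-8")
    print(len(encoded), "bytes,", utf8_strlen(encoded), "characters")
```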
mchars
) (c unsigned-int32
) ⇒ (ret mchars
)Finds the leftmost occurrence of the given Unicode character in a UTF-8 encoded string, while limiting the search to len bytes. If len is -1, allow unbounded search.
- p
- a nul-terminated UTF-8 encoded string
- len
- the maximum length of p
- c
- a Unicode character
- ret
- ‘
#f
’ if the string does not contain the character, otherwise, a pointer to the start of the leftmost occurrence of the character in the string.
mchars
) (c unsigned-int32
) ⇒ (ret mchars
)Finds the rightmost occurrence of the given Unicode character in a UTF-8 encoded string, while limiting the search to len bytes. If len is -1, allow unbounded search.
- p
- a nul-terminated UTF-8 encoded string
- len
- the maximum length of p
- c
- a Unicode character
- ret
- ‘
#f
’ if the string does not contain the character, otherwise, a pointer to the start of the rightmost occurrence of the character in the string.
mchars
) ⇒ (ret mchars
)Reverses a UTF-8 string. str must be valid UTF-8 encoded text. (Use g-utf8-validate on all text before trying to use UTF-8 utility functions with it.)
Note that unlike g-strreverse, this function returns newly-allocated memory, which should be freed with g-free when no longer needed.
- str
- a UTF-8 encoded string
- len
- the maximum length of str to use. If len < 0, then the string is nul-terminated.
- ret
- a newly-allocated string which is the reverse of str.
Since 2.2
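Reversing UTF-8 text has to work on whole characters rather than bytes, otherwise multi-byte sequences are corrupted. The following Python sketch shows the idea on a byte string; `utf8_strreverse` is a hypothetical helper, and it raises an error on invalid input instead of leaving the result undefined.

```python
def utf8_strreverse(data: bytes) -> bytes:
    """Reverse UTF-8 text character by character and return new bytes."""
    # Decode first so that multi-byte sequences are reversed as units.
    return data.decode("utf-8")[::-1].encode("utf-8")


if __name__ == "__main__":
    original = "a\u00f1b".encode("utf-8")
    print(original.hex(), "->", utf8_strreverse(original).hex())
```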
mchars
) ⇒ (ret bool
)Validates UTF-8 encoded text. str is the text to validate; if str is nul-terminated, then max-len can be -1, otherwise max-len should be the number of bytes to validate. If end is non-‘#f’, then the end of the valid range will be stored there (i.e. the start of the first invalid character if some bytes were invalid, or the end of the text being validated otherwise).
Note that g-utf8-validate returns ‘#f’ if max-len is positive and NUL is met before max-len bytes have been read.
Returns ‘#t’ if all of str was valid. Many GLib and GTK+ routines require valid UTF-8 as input, so data read from a file or the network should be checked with g-utf8-validate before doing anything else with it.
- str
- a pointer to character data
- max-len
- max bytes to validate, or -1 to go until NUL
- end
- return location for end of valid data
- ret
- ‘
#t
’ if the text was valid UTF-8
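The "valid flag plus end of valid range" result described above can be sketched in Python with the strict UTF-8 decoder. The helper `utf8_validate` is hypothetical; it returns a pair instead of using an output location, and it does not model the max-len or NUL handling.

```python
def utf8_validate(data: bytes) -> tuple:
    """Return (is_valid, end), where end is the byte offset of the valid prefix."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that could not be decoded.
        return False, err.start
    return True, len(data)


if __name__ == "__main__":
    for sample in (b"abc", b"ab\xffcd", b"a\xc3"):
        print(sample, utf8_validate(sample))
```

Checking untrusted input this way before passing it on mirrors the advice above about data read from a file or the network.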
mchars
) ⇒ (ret mchars
)Converts all Unicode characters in the string that have a case to uppercase. The exact manner in which this is done depends on the current locale, and may result in the number of characters in the string increasing. (For instance, the German ess-zet (ß) will be changed to SS.)
- str
- a UTF-8 encoded string
- len
- length of str, in bytes, or -1 if str is nul-terminated.
- ret
- a newly allocated string, with all characters converted to uppercase.
mchars
) ⇒ (ret mchars
)Converts all Unicode characters in the string that have a case to lowercase. The exact manner in which this is done depends on the current locale, and may result in the number of characters in the string changing.
- str
- a UTF-8 encoded string
- len
- length of str, in bytes, or -1 if str is nul-terminated.
- ret
- a newly allocated string, with all characters converted to lowercase.
mchars
) ⇒ (ret mchars
)Converts a string into a form that is independent of case. The result will not correspond to any particular case, but can be compared for equality or ordered with the results of calling g-utf8-casefold on other strings.
Note that calling g-utf8-casefold followed by g-utf8-collate is only an approximation to the correct linguistic case insensitive ordering, though it is a fairly good one. Getting this exactly right would require a more sophisticated collation function that takes case sensitivity into account. GLib does not currently provide such a function.
- str
- a UTF-8 encoded string
- len
- length of str, in bytes, or -1 if str is nul-terminated.
- ret
- a newly allocated string, that is a case independent form of str.
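Case folding differs from lowercasing in that it is designed purely for comparison. The sketch below uses Python's `str.casefold` as a stand-in to show a case where lowercasing alone is not enough; `same_ignoring_case` is a hypothetical helper, and Python's folding is locale-independent, which may not match GLib in every locale.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings for equality using their case-folded forms."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    first, second = "Stra\u00dfe", "STRASSE"
    print("lowercase equal:", first.lower() == second.lower())
    print("casefold equal: ", same_ignoring_case(first, second))
```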
mchars
) (mode <g-normalize-mode>
) ⇒ (ret mchars
)Converts a string into canonical form, standardizing such issues as whether a character with an accent is represented as a base character and combining accent or as a single precomposed character. You should generally call g-utf8-normalize before comparing two Unicode strings.
The normalization mode ‘G_NORMALIZE_DEFAULT’ only standardizes differences that do not affect the text content, such as the above-mentioned accent representation. ‘G_NORMALIZE_ALL’ also standardizes the "compatibility" characters in Unicode, such as SUPERSCRIPT THREE to the standard forms (in this case DIGIT THREE). Formatting information may be lost but for most text operations such characters should be considered the same. For example, g-utf8-collate normalizes with ‘G_NORMALIZE_ALL’ as its first step.
‘G_NORMALIZE_DEFAULT_COMPOSE’ and ‘G_NORMALIZE_ALL_COMPOSE’ are like ‘G_NORMALIZE_DEFAULT’ and ‘G_NORMALIZE_ALL’, but return a result with composed forms rather than a maximally decomposed form. This is often useful if you intend to convert the string to a legacy encoding or pass it to a system with less capable Unicode handling.
- str
- a UTF-8 encoded string.
- len
- length of str, in bytes, or -1 if str is nul-terminated.
- mode
- the type of normalization to perform.
- ret
- a newly allocated string, that is the normalized form of str.
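The four modes correspond to the standard Unicode normalization forms NFD, NFC, NFKD and NFKC. The Python sketch below uses `unicodedata.normalize` to show their effect on an accented letter and on SUPERSCRIPT THREE; the mode keys in `MODES` and the function name `normalize` are hypothetical labels chosen for this illustration.

```python
import unicodedata

# Hypothetical labels mapped to the standard normalization forms.
MODES = {
    "default": "NFD",           # canonical decomposition
    "default-compose": "NFC",   # canonical decomposition, then composition
    "all": "NFKD",              # compatibility decomposition
    "all-compose": "NFKC",      # compatibility decomposition, then composition
}


def normalize(text: str, mode: str = "default") -> str:
    """Normalize text using the form associated with the given mode label."""
    return unicodedata.normalize(MODES[mode], text)


if __name__ == "__main__":
    for mode in MODES:
        result = normalize("\u00e9\u00b3", mode)  # e with acute, superscript three
        print(mode, [f"U+{ord(c):04X}" for c in result])
```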
mchars
) (str2 mchars
) ⇒ (ret int
)Compares two strings for ordering using the linguistically correct rules for the current locale. When sorting a large number of strings, it will be significantly faster to obtain collation keys with g-utf8-collate-key and compare the keys with strcmp when sorting instead of sorting the original strings.
- str1
- a UTF-8 encoded string
- str2
- a UTF-8 encoded string
- ret
- < 0 if str1 compares before str2, 0 if they compare equal, > 0 if str1 compares after str2.
mchars
) ⇒ (ret mchars
)Converts a string into a collation key that can be compared with other collation keys produced by the same function using strcmp. The results of comparing the collation keys of two strings with strcmp will always be the same as comparing the two original strings with g-utf8-collate.
- str
- a UTF-8 encoded string.
- len
- length of str, in bytes, or -1 if str is nul-terminated.
- ret
- a newly allocated string. This string should be freed with
g-free
when you are done with it.
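The trade-off between comparing strings directly and comparing precomputed keys can be sketched with Python's `locale` module, whose `strcoll` and `strxfrm` play roles analogous to the collate and collate-key functions. This is an analogy rather than the GLib API; the sketch leaves the collation locale at the process default, and a real program would normally call `locale.setlocale` first.

```python
import functools
import locale

words = ["pear", "Apple", "banana", "apple"]

# Pairwise comparison: the collation rules are applied on every comparison.
by_compare = sorted(words, key=functools.cmp_to_key(locale.strcoll))

# Key-based sorting: each string is transformed once, then keys are compared.
by_key = sorted(words, key=locale.strxfrm)

if __name__ == "__main__":
    print(by_compare)
    print(by_key)
```

Both approaches yield the same order; the key-based one avoids repeating the collation work for every comparison made by the sort.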
mchars
) ⇒ (ret mchars
)Converts a string into a collation key that can be compared with other collation keys produced by the same function using strcmp.
In order to sort filenames correctly, this function treats the dot '.' as a special case. Most dictionary orderings seem to consider it insignificant, thus producing the ordering "event.c" "eventgenerator.c" "event.h" instead of "event.c" "event.h" "eventgenerator.c". Numbers are also treated intelligently, so that "file1" "file10" "file5" is sorted as "file1" "file5" "file10".
- str
- a UTF-8 encoded string.
- len
- length of str, in bytes, or -1 if str is nul-terminated.
- ret
- a newly allocated string. This string should be freed with g-free when you are done with it.
Since 2.8
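The numeric part of this ordering can be illustrated with a small "natural sort" key in Python. This is only a sketch of the idea, not the GLib algorithm: it compares non-digit runs by code point rather than by locale rules, and `filename_key` is a hypothetical name.

```python
import re


def filename_key(name: str) -> list:
    """Split a name into text and number runs so numbers compare by value."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capturing group puts the digit runs at odd indexes.
    return [int(part) if index % 2 else part for index, part in enumerate(parts)]


if __name__ == "__main__":
    print(sorted(["file10", "file1", "file5"], key=filename_key))
    print(sorted(["eventgenerator.c", "event.h", "event.c"], key=filename_key))
```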
unsigned-int32
) ⇒ (ret mchars
)Converts a single character to UTF-8.
- c
- a Unicode character code
- outbuf
- output buffer, must have at least 6 bytes of space. If ‘#f’, the length will be computed and returned and nothing will be written to outbuf.
- ret
- number of bytes written
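The encoding step can be sketched in Python as follows. The sketch covers the one- to four-byte forms that suffice for code points up to U+10FFFF and rejects larger values; it does not check for surrogates, and `unichar_to_utf8` is a hypothetical helper that returns the bytes instead of filling a buffer.

```python
def unichar_to_utf8(c: int) -> bytes:
    """Encode one code point as UTF-8 (one to four bytes)."""
    if c < 0 or c > 0x10FFFF:
        raise ValueError("code point out of range")
    if c < 0x80:
        return bytes([c])
    if c < 0x800:
        return bytes([0xC0 | (c >> 6), 0x80 | (c & 0x3F)])
    if c < 0x10000:
        return bytes([0xE0 | (c >> 12), 0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F)])
    return bytes([
        0xF0 | (c >> 18),
        0x80 | ((c >> 12) & 0x3F),
        0x80 | ((c >> 6) & 0x3F),
        0x80 | (c & 0x3F),
    ])


if __name__ == "__main__":
    for code_point in (0x41, 0xE9, 0x20AC, 0x1F600):
        encoded = unichar_to_utf8(code_point)
        print(hex(code_point), encoded.hex(), len(encoded), "bytes")
```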