grapheme (API documentation)

Main module for the grapheme package.

grapheme.UNICODE_VERSION

The currently supported Unicode version

grapheme.graphemes(string)

Returns an iterator of all graphemes of the given string.

>>> rainbow_flag = "🏳️‍🌈"
>>> [codepoint for codepoint in rainbow_flag]
['🏳', '️', '‍', '🌈']
>>> list(grapheme.graphemes("multi codepoint grapheme: " + rainbow_flag))
['m', 'u', 'l', 't', 'i', ' ', 'c', 'o', 'd', 'e', 'p', 'o', 'i', 'n', 't', ' ', 'g', 'r', 'a', 'p', 'h', 'e', 'm', 'e', ':', ' ', '🏳️‍🌈']
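
To see why the flag splits into four items under plain iteration, it can help to name each code point. The following is a small sketch using only the standard library unicodedata module; it does not call grapheme at all, and the escape-sequence spelling of the flag is just a convenience for readability.

```python
import unicodedata

# The rainbow flag written with escapes so each code point is visible.
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"

# One entry per code point: hex value and official Unicode name.
code_points = [(f"U+{ord(cp):04X}", unicodedata.name(cp)) for cp in rainbow_flag]
for hex_value, name in code_points:
    print(hex_value, name)
```

The flag, a variation selector, a zero width joiner and the rainbow are separate code points that are rendered as a single grapheme.
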
grapheme.length(string, until=None)

Returns the number of graphemes in the string.

Note that, unlike len(string), this function needs to traverse the string to calculate the length, so its running time is linear in the length of the string (up to the until value).

Only counts up to the until argument, if given. This is useful when testing the length of a string against some limit and the excess length is not interesting.

>>> rainbow_flag = "🏳️‍🌈"
>>> len(rainbow_flag)
4
>>> grapheme.length(rainbow_flag)
1
>>> grapheme.length("".join(str(i) for i in range(100)), 30)
30
grapheme.grapheme_lengths(string)

Returns an iterator of the number of code points in each grapheme of the string.

grapheme.slice(string, start=None, end=None)

Returns a substring of the given string, counting graphemes instead of codepoints.

Negative indices are currently not supported.

>>> string = "tamil நி (ni)"
>>> string[:7]
'tamil ந'
>>> grapheme.slice(string, end=7)
'tamil நி'
>>> string[7:]
'ி (ni)'
>>> grapheme.slice(string, 7)
' (ni)'
grapheme.contains(string, substring)

Returns true if the sequence of graphemes in substring is also present in string.

This differs from the normal Python `in` operator, which returns true whenever the sequence of codepoints occurs within the other string, without considering grapheme boundaries.

Performance notes: Very fast if substring is not in string, since that also means that the same graphemes cannot be in the two strings. Otherwise this function has linear time complexity in relation to the string length. It will traverse the sequence of graphemes until a match is found, so it will generally perform better for grapheme sequences that match early.

>>> "🇸🇪" in "🇪🇸🇪🇪"
True
>>> grapheme.contains("🇪🇸🇪🇪", "🇸🇪")
False
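
The flag example can be reasoned through by hand. The sketch below is a simplified illustration using only the standard library, not the package's implementation: it assumes the text consists solely of Regional Indicator code points, which pair up left to right into flags. The helper names pair_flags and to_letters are hypothetical.

```python
def pair_flags(text):
    """Group a run of Regional Indicators into pairs, left to right."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]


def to_letters(flag):
    """Map Regional Indicators to ASCII letters (U+1F1E6 is 'A')."""
    return "".join(chr(ord(cp) - 0x1F1E6 + ord("A")) for cp in flag)


haystack = "\U0001F1EA\U0001F1F8\U0001F1EA\U0001F1EA"  # indicators E S E E
needle = "\U0001F1F8\U0001F1EA"  # indicators S E

flags = [to_letters(flag) for flag in pair_flags(haystack)]
print(flags)
print(needle in haystack)
print(to_letters(needle) in flags)
```

The codepoints for S and E do appear next to each other in the longer string, but they straddle the boundary between its two flags, so no grapheme in the string equals the SE flag.
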
grapheme.safe_split_index(string, max_len)

Returns the highest index up to max_len at which the given string can be sliced, without breaking a grapheme.

This is useful for when you want to split or take a substring from a string, and don’t really care about the exact grapheme length, but don’t want to risk breaking existing graphemes.

This function normally does not traverse the full grapheme sequence up to the given length, so it can be used for arbitrarily long strings and high `max_len`s. However, some grapheme boundaries depend on the previous state, so the worst case performance is O(n). In practice, it’s only very long non-broken sequences of country flags (represented as Regional Indicators) that will perform badly.

The return value will always be between 0 and len(string).

>>> string = "tamil நி (ni)"
>>> i = grapheme.safe_split_index(string, 7)
>>> i
6
>>> string[:i]
'tamil '
>>> string[i:]
'நி (ni)'
grapheme.startswith(string, prefix)

Like str.startswith, but also checks that the string starts with the given prefix's sequence of graphemes.

str.startswith may return true for a prefix that is not visually represented as a prefix if a grapheme cluster is continued after the prefix ends.

>>> grapheme.startswith("✊🏾", "✊")
False
>>> "✊🏾".startswith("✊")
True
grapheme.endswith(string, suffix)

Like str.endswith, but also checks that the string ends with the given suffix's sequence of graphemes.

str.endswith may return true for a suffix that is not visually represented as a suffix if a grapheme cluster is initiated before the suffix starts.

>>> grapheme.endswith("🏳️‍🌈", "🌈")
False
>>> "🏳️‍🌈".endswith("🌈")
True