String Length in Elixir

In Elixir, you can get the number of characters in a string with the String.length/1 function:

String.length("Myzigyasa") # 9
String.length("")          # 0
String.length("résumé")    # 6

Discussion

In the case of a string like "Myzigyasa", there isn’t much to discuss: String.length("Myzigyasa") returns the number of characters in the string.

The string clearly has 9 characters, and since each character in this particular string can be represented with a single byte, its raw representation is 9 bytes as well.

Things become more interesting when a string contains characters outside the ASCII range, such as accented letters.

Consider the string "résumé". In this case, String.length/1 returns 6. You can think of this as the number of “visible” or “user-perceived” characters in the string. However, let’s investigate its raw representation:

iex(2)> i "résumé"
Term
  "résumé"
Data type
  BitString
Byte size
  8
Description
  This is a string: a UTF-8 encoded binary. It's printed surrounded by
  "double quotes" because all UTF-8 encoded code points in it are printable.
Raw representation
  <<114, 195, 169, 115, 117, 109, 195, 169>>
Reference modules
  String, :binary
Implemented protocols
  Collectable, IEx.Info, Inspect, List.Chars, String.Chars

You’ll notice that the string’s actual data type is BitString. Specifically, this binary is made up of 8 bytes, even though there are only 6 user-perceived characters. This is because each of its two accented é characters takes 2 bytes to represent, while the other four letters take 1 byte each.
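To see which characters account for the extra bytes, one option is to pair each grapheme with its byte size. This is only a small sketch; the string is written with \u escapes (a choice made here, not part of the session above) so that each é is unambiguously the single code point U+00E9.

```elixir
# "r\u00E9sum\u00E9" is "résumé" with each é as the single code point U+00E9.
string = "r\u00E9sum\u00E9"

# Pair every grapheme with the number of bytes its UTF-8 encoding takes.
sizes = for grapheme <- String.graphemes(string), do: {grapheme, byte_size(grapheme)}

IO.inspect(sizes)
# [{"r", 1}, {"é", 2}, {"s", 1}, {"u", 1}, {"m", 1}, {"é", 2}]

# The per-grapheme sizes add up to the size of the whole binary.
total = sizes |> Enum.map(fn {_grapheme, size} -> size end) |> Enum.sum()

IO.inspect(total)
# 8
```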

If you are interested in the number of bytes within a string, instead of its length, you can use the Kernel.byte_size/1 function:

iex(3)> string = "résumé"
"résumé"

iex(4)> String.length(string)
6

iex(5)> byte_size(string)
8

From a performance standpoint, it’s worth noting that Kernel.byte_size/1 is more efficient than String.length/1. String.length/1 has to walk the entire string to count its characters, so it takes longer as the string grows, whereas Kernel.byte_size/1 returns in constant time.
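If you want to observe the difference yourself, a rough timing sketch along these lines may help. The string sizes are arbitrary, the variable names are made up for this example, and the measured times will vary from machine to machine, so treat the numbers as an illustration rather than a benchmark.

```elixir
# Build one small and one large ASCII string (sizes chosen arbitrarily).
small = String.duplicate("a", 1_000)
large = String.duplicate("a", 1_000_000)

# :timer.tc/1 runs the function and returns {elapsed_microseconds, result}.
{small_len_us, small_len} = :timer.tc(fn -> String.length(small) end)
{large_len_us, large_len} = :timer.tc(fn -> String.length(large) end)
{large_size_us, large_size} = :timer.tc(fn -> byte_size(large) end)

IO.puts("String.length/1 on #{small_len} characters: #{small_len_us} us")
IO.puts("String.length/1 on #{large_len} characters: #{large_len_us} us")
IO.puts("byte_size/1 on #{large_size} bytes: #{large_size_us} us")
```

On most machines you would expect the second line to report a noticeably larger time than the first, while the byte_size/1 call stays negligible regardless of the string's size.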

Strings in Elixir are UTF-8 encoded binaries. You can think of them as collections of code points. A code point is a Unicode character, whose underlying representation might require one or more bytes. For example, the é characters in the string "résumé" are code points whose representation requires two bytes each.

The Unicode standard also defines some special characters as the combination of other characters. In other words, even though they appear to the reader as a single character, they are in fact a combination of two or more code points. These are known as grapheme clusters.

For example, the e-acute letter can be represented as a single code point as we’ve done so far (this is also known as a precomposed character) or as a combination of two code points (the letter e and a combining acute accent). These look the same but they are technically two different characters as far as Elixir is concerned; so the two strings below end up representing two different binaries:

iex(6)> "é" == "é"
false
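Because the two forms look identical on screen, it can help to spell them out with \u escapes. The sketch below (variable names are chosen just for this example) also shows two standard-library functions that are useful here: String.equivalent?/2, which compares strings after canonical normalization, and String.normalize/2, which converts between the composed (:nfc) and decomposed (:nfd) forms.

```elixir
precomposed = "\u00E9"  # é as a single code point
decomposed = "e\u0301"  # the letter e followed by a combining acute accent

# The binaries differ, so a plain comparison fails.
IO.inspect(precomposed == decomposed)
# false

# Canonical equivalence ignores the difference in representation.
IO.inspect(String.equivalent?(precomposed, decomposed))
# true

# Normalizing converts one representation into the other.
IO.inspect(String.normalize(decomposed, :nfc) == precomposed)
# true
IO.inspect(String.normalize(precomposed, :nfd) == decomposed)
# true
```

If your application compares user-supplied text, normalizing both sides to the same form before comparing is one way to avoid surprises like the one above.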

When working with strings, you’ll often want to consider them in terms of the user-perceived characters, rather than their code points or the binary they actually represent.

To help us out, the Elixir String module provides String.graphemes/1, which returns a list of graphemes without splitting grapheme clusters into their underlying code points. If you need the code points, you can always use String.codepoints/1.

To see the distinction between code points and graphemes in action, consider the following string (using the grapheme cluster built from two code points):

iex(7)> String.codepoints("cliché")
["c", "l", "i", "c", "h", "e", "́"]

iex(8)> String.graphemes("cliché")
["c", "l", "i", "c", "h", "é"]

As you can see, String.codepoints/1 shows us a list of code points in the string, and the special e-acute gets split into two code points. If you look closely, you’ll notice the accent as the last code point in the list.

String.graphemes/1 simply returns the list of graphemes and is the closest thing that we have to a function that provides a list of user-perceived characters.

Now, consider its length and byte size:

iex(9)> String.length("cliché")
6

iex(10)> byte_size("cliché")
8

String.length/1 returns 6, the number of user-perceived characters in the string. Kernel.byte_size/1 returns 8 because the first five letters take 1 byte each, and it takes 3 bytes to represent the final grapheme (1 for the letter e, and 2 for the combining acute accent).

We used the expression “visible” or “user-perceived” characters to give us an intuitive understanding. Now that you know more about bytes, code points, and graphemes, we can be a little more precise and say that String.length/1 returns the number of graphemes in a string.

Finally, if you want to know the number of code points, you can compose the String.codepoints/1 and Kernel.length/1 functions:

"cliché"
  |> String.codepoints()
  |> length()

Note that if you are trying this in IEx, you’ll need to use the backslash character to continue on new lines:

iex(11)> "cliché" \
...(11)> |> String.codepoints() \
...(11)> |> length()
7

So in summary, our string contains 6 graphemes, 7 code points, and 8 bytes.
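If you find yourself needing all three numbers at once, you could wrap the calls in a small helper. The StringStats module and its counts/1 function below are hypothetical names invented for this sketch; they are not part of Elixir's standard library.

```elixir
defmodule StringStats do
  # Hypothetical helper, not part of Elixir's standard library.
  # Returns the grapheme, code point, and byte counts of a string.
  def counts(string) when is_binary(string) do
    %{
      graphemes: String.length(string),
      codepoints: length(String.codepoints(string)),
      bytes: byte_size(string)
    }
  end
end

# "cliché" ending in e + combining acute accent (two code points).
decomposed = StringStats.counts("cliche\u0301")
IO.puts("graphemes: #{decomposed.graphemes}, code points: #{decomposed.codepoints}, bytes: #{decomposed.bytes}")
# graphemes: 6, code points: 7, bytes: 8

# "cliché" ending in the precomposed é (one code point).
precomposed = StringStats.counts("clich\u00E9")
IO.puts("graphemes: #{precomposed.graphemes}, code points: #{precomposed.codepoints}, bytes: #{precomposed.bytes}")
# graphemes: 6, code points: 6, bytes: 7
```

Both strings report the same number of graphemes, which is why String.length/1 is usually the right choice when you care about what the reader sees.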
