Skip to content

2020-2: Character Building

UTF-8 encodes Unicode characters using 1-4 integers for each character. Dyalog APL includes a system function, ⎕UCS, that can convert characters into integers and integers into characters. The expression 'UTF-8'∘⎕UCS converts between characters and UTF-8.

Consider the following:

      'UTF-8'∘⎕UCS 'D¥⍺⌊○9'
68 194 165 226 141 186 226 140 138 226 151 139 57
      'UTF-8'∘⎕UCS 68 194 165 226 141 186 226 140 138 226 151 139 57
D¥⍺⌊○9

How many integers does each character use?

      'UTF-8'∘⎕UCS¨ 'D¥⍺⌊○9' ⍝ using ]Boxing on
┌──┬───────┬───────────┬───────────┬───────────┬──┐
│68│194 165│226 141 186│226 140 138│226 151 139│57│
└──┴───────┴───────────┴───────────┴───────────┴──┘ 

The rule is that an integer in the range 128 to 191 (inclusive) continues the character of the previous integer (which may itself be a continuation). With that in mind, write a function that, given a right argument which is a simple integer vector representing valid UTF-8 text, encloses each sequence of integers that represent a single character, like the result of 'UTF-8'∘⎕UCS¨'UTF-8'∘⎕UCS but does not use any system functions (names beginning with ).

💡 Hint: Use ⎕UCS to verify your solution.

Examples:

      (your_function) 68 194 165 226 141 186 226 140 138 240 159 148 178 57   ⍝ using ]Boxing on
┌──┬───────┬───────────┬───────────┬───────────────┬──┐
│68│194 165│226 141 186│226 140 138│240 159 148 178│57│
└──┴───────┴───────────┴───────────┴───────────────┴──┘

      (your_function) 68 121 97 108 111 103   ⍝ 'Dyalog'
┌──┬───┬──┬───┬───┬───┐
│68│121│97│108│111│103│
└──┴───┴──┴───┴───┴───┘

      (your_function) ⍬   ⍝ '' (any empty vector result is acceptable here)
your_function ←

Solutions

Video Thumbnail YouTube

Chat transcript Code on GitHub