UTF-16/Unicode to byte array in JavaScript

Rushing to the solution (in case you need it now)

The following JavaScript builds a byte array from a string. It also handles exotic characters (such as the newer Unicode emoji), which is more than some of the other implementations you may find online can do.
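As a rough illustration of the idea, here is a minimal sketch of such a function. It is not the minified original: it assumes the byte array should hold UTF-8 bytes, and the parameter name useNonCharacter is made up for this example.

```javascript
// Sketch only: assumes UTF-8 output. The parameter name is hypothetical.
String.prototype.toByteArray = function (useNonCharacter) {
  // Code point substituted for invalid data: U+FFFD by default, U+FFFE on request.
  var substitute = useNonCharacter ? 0xFFFE : 0xFFFD;
  var bytes = [];
  for (var i = 0; i < this.length; i++) {
    var cp = this.charCodeAt(i);
    if (cp >= 0xD800 && cp <= 0xDBFF) {
      // High surrogate: it must be followed by a low surrogate.
      var next = this.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xDC00 && next <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (next - 0xDC00);
        i++; // the pair consumed two UTF-16 code units
      } else {
        cp = substitute;
      }
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      // Low surrogate without a preceding high surrogate.
      cp = substitute;
    }
    // Emit the code point as 1 to 4 UTF-8 bytes.
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      bytes.push(0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    } else {
      bytes.push(0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    }
  }
  return bytes;
};

console.log("A\u20AC".toByteArray());      // 65, 226, 130, 172
console.log("\uD83D\uDE00".toByteArray()); // 240, 159, 152, 128 (an emoji, U+1F600)
console.log("\uD83D".toByteArray());       // 239, 191, 189 (U+FFFD)
console.log("\uD83D".toByteArray(true));   // 239, 191, 190 (U+FFFE)
```

Extending String.prototype keeps the call short, but a standalone function taking the string as an argument would work just as well if you prefer not to touch built-in prototypes.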

Once that code has run, the function is assigned to the String prototype. This means you can call it on any string, for example "Hello world".toByteArray().

The optional boolean argument is only relevant if your string may contain invalidly encoded data. By default such data is replaced with the U+FFFD replacement character (�). If you pass true as the argument, the U+FFFE noncharacter is used instead, explicitly marking the data as invalid.

What did I need this for?

In SharePoint 2013 you can create files from JavaScript, but you need to supply the file content as a byte array. So if you want to write a log file to a library, for instance, you first need to convert your string content to a byte array. The examples that I found online did not fully support UTF-16 and Unicode, so I tried to understand what was going on and built a solution that worked for me.

The following example function is an illustration of how to write a text file to a SharePoint document library (you can find plenty of other examples for this online):

I’ve written the above function as an example. It works but I’ve omitted things to focus on the subject.

If you have included both functions (toByteArray and createSPFile), the following example shows how to create a hello_world.txt file in the Documents library:
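A minimal sketch of how createSPFile and the hello_world.txt call could look. It assumes the SharePoint JavaScript object model (sp.js) is already loaded on the page, so that the global SP namespace exists. The parameter list is hypothetical and not taken from the original function.

```javascript
// Sketch only: relies on the global SP namespace from sp.js.
// The parameters (libraryTitle, fileName, byteArray, callbacks) are assumptions.
function createSPFile(libraryTitle, fileName, byteArray, onSuccess, onError) {
  var context = SP.ClientContext.get_current();
  var library = context.get_web().get_lists().getByTitle(libraryTitle);

  // Describe the new file and copy the bytes into its content.
  var info = new SP.FileCreationInformation();
  info.set_url(fileName);
  info.set_overwrite(true);
  info.set_content(new SP.Base64EncodedByteArray());
  for (var i = 0; i < byteArray.length; i++) {
    info.get_content().append(byteArray[i]);
  }

  // Queue the upload to the library's root folder and send the request.
  var file = library.get_rootFolder().get_files().add(info);
  context.load(file);
  context.executeQueryAsync(onSuccess, onError);
}

// Usage inside a SharePoint page, with toByteArray already defined:
// createSPFile("Documents", "hello_world.txt", "Hello world".toByteArray(),
//   function () { console.log("File created"); },
//   function (sender, args) { console.log(args.get_message()); });
```

The sketch leaves out folder selection, metadata and retry handling, and it always overwrites an existing file with the same name.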

The comprehensive implementation of the toByteArray function

The toByteArray implementation that I shared at the start of this blog was a minified version (using Google’s Closure Compiler). I will also share the long version here, with comments, so you can check what is going on in my implementation.

What’s up with the optional boolean argument?

The string input data can have invalid encoding in the following ways:

  1. The charcode is out of range even for Unicode
  2. The charcode represents a part of a UTF-16 surrogate pair, but the other part of the pair is missing or defined incorrectly.
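The second case can be checked directly on the UTF-16 code units. Below is a small sketch, with a made-up helper name (findLoneSurrogates), that reports where a string contains a surrogate without a valid partner. It does not cover the first case.

```javascript
// Hypothetical helper: returns the indexes of UTF-16 code units that are
// surrogates without a valid partner.
function findLoneSurrogates(str) {
  var invalid = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: the next unit must be a low surrogate.
      var next = str.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // valid pair, skip the low surrogate
      } else {
        invalid.push(i);
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate that was not preceded by a high surrogate.
      invalid.push(i);
    }
  }
  return invalid;
}

console.log(findLoneSurrogates("a\uD83D\uDE00b")); // no invalid units
console.log(findLoneSurrogates("a\uD83Db"));       // index 1 is a lone high surrogate
```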

I’ve looked at a few other implementations to produce a byte array from a string (not only JavaScript implementations), and noticed that different libraries lead to different results in this scenario. It seems most common to just replace the invalid encoding with the U+FFFD replacement character. This basically means that there was data, but it has been lost while processing.
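For comparison, the TextEncoder built into modern browsers and Node.js follows this common approach. The sketch below is only a cross-check and not part of the toByteArray implementation. TextEncoder always produces UTF-8 and encodes a lone surrogate as U+FFFD.

```javascript
// TextEncoder is a standard API in modern browsers and Node.js; it emits UTF-8.
var encoder = new TextEncoder();
var loneSurrogate = "a\uD83Db"; // high surrogate without its low partner
var bytes = Array.from(encoder.encode(loneSurrogate));

// The lone surrogate comes out as U+FFFD, which is EF BF BD in UTF-8.
console.log(bytes.map(function (b) { return b.toString(16); }).join(" ")); // 61 ef bf bd 62
```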

An alternative that seems just as valid is to use the U+FFFE noncharacter in the byte array. If you pass true as the argument to the toByteArray function, this char code is used in the byte array for invalid encoding instead of U+FFFD.

I’m not sure about all the nuances between U+FFFE and U+FFFD, but it seemed relevant to leave U+FFFE in as an option. To me U+FFFD seems more forgiving, more common and more user-friendly (it is usually rendered as a question mark symbol, �), so I’ve chosen that value as the default.

 

