Storing .NET objects in cookies part 2 – compact bytes to string conversion

As I mentioned in part 1, Forms authentication cookies can get quite big when they carry data in the UserData field. The main problem is that every 8-bit character in the user data occupies four characters in the cookie: the string is stored UTF-16 encoded (1 character – 2 bytes, an extra zero byte is added), and the encrypted ticket is then converted to a hexadecimal string (1 byte – 2 characters). For example, the “X” character (U+0058) ends up as the four-character string “5800”. On top of that comes a little overhead from the ticket itself, plus the 33% overhead of Base64 encoding. Here’s how you can do a lot better.
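To see the blow-up in isolation, here is a standalone illustration of the conversion described above (it is not part of the cookie code, it just reproduces the UTF-16 plus hex encoding by hand):

// requires: using System; using System.Linq; using System.Text;
// "X" -> UTF-16 LE bytes { 0x58, 0x00 } -> the hexadecimal string "5800"
var hex = string.Concat(Encoding.Unicode.GetBytes("X").Select(b => b.ToString("X2")));
Console.WriteLine(hex); // prints 5800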

The problem

In the previous post, we had 128 bytes of binary data after the serialization, let’s call it data. This is how it’s converted to Base64 (172 characters), then into a forms cookie:

// data: the 128 bytes of binary serialized user data from part 1
var str = Convert.ToBase64String(data); // 172 Base64 characters
var ticket = new FormsAuthenticationTicket(2, "username", DateTime.Now, DateTime.Now.AddMinutes(30), false, str);
var cookie = FormsAuthentication.Encrypt(ticket); // hex string that becomes the cookie value

The cookie is 960 characters long. 4*172=688 characters of that is our user data, and 272 is the overhead of the ticket. (Note: this is not an exact calculation, because an encrypted ticket also uses padding, so without the 688 characters of user data the ticket may not be exactly 688 characters shorter.)
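If you want to verify these numbers yourself, something along these lines works (the exact figures depend on the machineKey configuration and the encryption padding, so treat them as approximate):

Console.WriteLine(str.Length);    // 172 Base64 characters for the 128 bytes
Console.WriteLine(cookie.Length); // 960 in this example, may vary with padding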

One idea that comes to mind is compression, of course. Let’s consider that cheating for now, though it could be added on top of what follows. Then there is Ascii85, which is a little better than Base64, but not by much.

The solution

There is no way we can change the hexadecimal string conversion, but we can get rid of the Base64 encoding and the wasted extra zero bytes of the UTF-16 encoding. However, you can’t just treat an arbitrary byte array as a UTF-16 encoded byte array, because of surrogate pairs (see UTF-16): the 0xD800-0xDFFF double-byte range is reserved for encoding code points above U+FFFF, which also makes the U+D800-U+DFFF code point range invalid on its own.

Then my question was: are there other byte ranges that are invalid? There are some weird things between U+E000 and U+FFFF as well, but I hoped that .NET wouldn’t mind them. I tested my theory with this code:

// requires: using System; using System.Linq; using System.Text;
for (var bh = 0; bh < 256; bh++)
    for (var bl = 0; bl < 256; bl++)
    {
        // .NET strings are little-endian UTF-16, so bl is the low byte and bh the high byte
        var bytes = new[] { (byte)bl, (byte)bh };
        var bytes2 = Encoding.Unicode.GetBytes(Encoding.Unicode.GetString(bytes));
        if (!bytes.SequenceEqual(bytes2))
            Console.WriteLine("0x{0:X2}{1:X2} -> 0x{2:X2}{3:X2}", bytes[1], bytes[0], bytes2[1], bytes2[0]);
    }

It tries to encode every possible 2-byte combination as UTF-16 and then decode it (note that .NET uses little-endian UTF-16). As it turns out, everything from 0xD800 to 0xDFFF ends up as 0xFFFD, the standard replacement character for invalid input. The good news is that everything else round-trips just fine.

Now I just needed an algorithm that guarantees I never end up with those invalid bytes. I used the very simple idea of an extra zero byte as an escape mechanism. Whenever I am about to output a 0xD8-0xDF byte at an odd position, I insert a zero byte before it. I also insert a zero byte before every actual zero byte in the input, and finally append a zero byte if needed to make the length of the output even. You can see the source code here: BinaryStringConverter.cs.
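To make the scheme concrete, here is a minimal sketch of both directions. The class and method names are mine, and the linked BinaryStringConverter.cs is the authoritative implementation; this version just illustrates the escaping rules described above:

using System;
using System.Collections.Generic;
using System.Text;

// Sketch only – see the linked BinaryStringConverter.cs for the real code.
public static class BinaryStringConverterSketch
{
    public static string BytesToString(byte[] data)
    {
        var output = new List<byte>(data.Length + 2);
        foreach (var b in data)
        {
            // Escape every zero byte, so that on decode a 0x00 always means
            // the next byte is literal data.
            if (b == 0x00)
                output.Add(0x00);
            // A 0xD8-0xDF byte at an odd (high byte) position would form an
            // invalid lone surrogate; the inserted zero shifts it to an even position.
            else if (b >= 0xD8 && b <= 0xDF && output.Count % 2 == 1)
                output.Add(0x00);
            output.Add(b);
        }
        // Pad to an even length so the bytes form whole UTF-16 code units.
        if (output.Count % 2 == 1)
            output.Add(0x00);
        return Encoding.Unicode.GetString(output.ToArray());
    }

    public static byte[] StringToBytes(string str)
    {
        var bytes = Encoding.Unicode.GetBytes(str);
        var output = new List<byte>(bytes.Length);
        for (var i = 0; i < bytes.Length; i++)
        {
            if (bytes[i] != 0x00)
                output.Add(bytes[i]);   // ordinary data byte
            else if (i < bytes.Length - 1)
                output.Add(bytes[++i]); // escape: the next byte is literal
            // a trailing lone zero is the padding byte and is simply dropped
        }
        return output.ToArray();
    }
}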

The end result with this method: 544 characters! In our example, we only add two extra zero bytes, so the user data becomes 130 bytes, which is 65 Unicode characters (very weird, mostly Asian, but completely valid :) ), as opposed to 172 with Base64. Now the ticket overhead adds up to 284 characters (like I said, padding is variable length). Together with the binary XML serialization, these two methods yield a 6.6 times improvement in raw user data size (431/65 characters) and a 3.7 times improvement in total ticket size (2016/544 characters) over the conventional methods, very sweet :) .
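For completeness, this is how the snippet from the beginning of the post would look with the converter plugged in (still using the hypothetical sketch class from above):

var str = BinaryStringConverterSketch.BytesToString(data); // 65 characters instead of 172
var ticket = new FormsAuthenticationTicket(2, "username", DateTime.Now, DateTime.Now.AddMinutes(30), false, str);
var cookie = FormsAuthentication.Encrypt(ticket);          // 544 characters here instead of 960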

Here is a chart that shows the results:

[Chart: Cookie serialization 2]
