Problem deserializing documents with certain characters in specific positions

Hi guys,

We’ve been dealing with an issue where calls to GetAsync repeatedly fail for certain documents. After some digging, I was able to pin down the exact situation that causes it. There appears to be a bug in Utf8MemoryReader when it calls Decoder.Convert, on .NET Framework 4.7.2 with Couchbase .NET SDK 3.4.12.

If the character that would fill the last slot of the output buffer is the high half of a surrogate pair, Decoder.Convert stops one character short of filling the buffer. The output buffer is then not expanded properly, and the next call to Utf8MemoryReader.Read for a single character throws an exception.

Exception: The output char buffer is too small to contain the decoded characters, encoding 'Unicode (UTF-8)' fallback 'System.Text.DecoderReplacementFallback'.
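
For reference, the decoder behavior can be shown in isolation, without the SDK. This is a minimal standalone sketch (the string content and class name are mine, just for illustration):

using System;
using System.Text;

internal class DecoderDemo
{
	static void Main()
	{
		// "aa" followed by one surrogate pair (U+1F93A) is 4 UTF-16 chars / 6 UTF-8 bytes.
		var bytes = Encoding.UTF8.GetBytes("aa\ud83e\udd3a");
		var decoder = Encoding.UTF8.GetDecoder();

		// Room for 3 chars: after the two 'a's only one slot remains,
		// which cannot hold the two-char surrogate pair.
		var chars = new char[3];
		decoder.Convert(bytes, 0, bytes.Length, chars, 0, chars.Length,
			false, out var bytesUsed, out var charsUsed, out var completed);

		// Prints charsUsed=2, completed=False: the pair is withheld.
		Console.WriteLine($"charsUsed={charsUsed}, completed={completed}");

		try
		{
			// Asking for exactly one more char now throws the same
			// "output char buffer is too small" ArgumentException.
			decoder.Convert(bytes, bytesUsed, bytes.Length - bytesUsed,
				chars, 0, 1, true, out _, out _, out _);
		}
		catch (ArgumentException ex)
		{
			Console.WriteLine(ex.Message);
		}
	}
}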

Example Code:

using Couchbase;
using Newtonsoft.Json;
using System;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApp4
{
	internal class Program
	{
		static void Main(string[] args)
		{
			var couchbaseConnection = "couchbase://localhost";
			var couchbaseUser = "<User>";
			var couchbasePassword = "<Password>";
			var couchbaseBucket = "<Bucket>";

			var keyName = "test_key";

			var sb = new StringBuilder();
			sb.AppendLine("{\"Text\": \"");
			for (var i = 0; i < 1005; i++)
			{
				// Fill with a bunch of a's
				sb.Append("a");
			}

			// Add surrogate characters towards the end of the string.
			//
			// Hex code D83E must show up in the character position 1022 in the output buffer. (Output buffer length 1024)
			// This causes Utf8MemoryReader's _decoder.Convert to return a character read length of 1022 instead of 1023 since D83E is a high surrogate half.
			//
			// Newtonsoft.Json will attempt to read the last character, but it won't resize the output buffer before making another Utf8MemoryReader.Read call.
			//
			// This results in the following exception:
			// The output char buffer is too small to contain the decoded characters, encoding 'Unicode (UTF-8)' fallback 'System.Text.DecoderReplacementFallback'.
			sb.Append("\ud83e\udd3a\ud83e\udd3a\ud83e\udd3a\ud83e\udd3a\ud83e\udd3a\ud83e\udd3a\ud83e\udd3a\" }");
			var testItem = JsonConvert.DeserializeObject<TestPayload>(sb.ToString());

			Task.Run(async () => {
				var cluster = await Cluster.ConnectAsync(couchbaseConnection, couchbaseUser, couchbasePassword);
				await cluster.WaitUntilReadyAsync(TimeSpan.FromSeconds(60));
				var bucket = await cluster.BucketAsync(couchbaseBucket).ConfigureAwait(false);
				var defaultCollection = await bucket.DefaultCollectionAsync();

				try
				{
					await defaultCollection.InsertAsync(keyName, testItem);

					// This will throw an error.
					var result = await defaultCollection.GetAsync(keyName);
					var payload = result.ContentAs<TestPayload>();
				}
				catch (Exception ex)
				{
					Console.WriteLine(ex);
				}

				// delete the key
				await defaultCollection.RemoveAsync(keyName);
				await cluster.DisposeAsync();
			}).Wait();

			Console.ReadLine();
		}

		public class TestPayload
		{
			public string Text { get; set; }
		}
	}
}

Thanks for bringing this to our attention. I opened an issue - NCBC-3543
.NET 6 is recommended for the 3.x SDKs.


Thanks for the quick reply @mreiche. We are actually working on migrating to .NET 7, but we have some projects still running on 4.7.2.

I’ve found a workaround for now, so it’s not a complete showstopper.

@dredmond would you mind sharing the workaround for anyone else who runs into the same issue?


Sure, I’m just using the raw string transcoder and then deserializing the result myself. The same could be done with the raw binary transcoder, but that requires the extra step of calling Encoding.UTF8.GetString(), which the raw string transcoder already handles for you; a sketch of the binary variant follows the snippet below.

var stringTranscoderOptions = new GetOptions().Transcoder(new RawStringTranscoder());
var stringResult = await defaultCollection.GetAsync(keyName, stringTranscoderOptions);
var result = JsonConvert.DeserializeObject<TestPayload>(stringResult.ContentAs<string>());
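
For completeness, the binary-transcoder variant would look roughly like this (a sketch, assuming the same defaultCollection and keyName as in the repro, plus using System.Text for Encoding):

var binaryTranscoderOptions = new GetOptions().Transcoder(new RawBinaryTranscoder());
var binaryResult = await defaultCollection.GetAsync(keyName, binaryTranscoderOptions);
// This manual GetString call is the step RawStringTranscoder does for you.
var json = Encoding.UTF8.GetString(binaryResult.ContentAs<byte[]>());
var resultFromBinary = JsonConvert.DeserializeObject<TestPayload>(json);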

I’ve managed to reproduce the problem with a simpler unit test.

[Fact]
public void Test()
{
    // Arrange

    var sb = new StringBuilder(1100);
    sb.Append('"');
    sb.Append('a', 1019);

    // Hex code D83E must show up in the character position 1022 in the output buffer. (Output buffer length 1024)
    // This causes Utf8MemoryReader's _decoder.Convert to return a character read length of 1022 instead of 1023 since D83E is a high surrogate half.
    for (var i = 0; i < 5; i++)
    {
        sb.Append("\ud83e\udd3a");
    }

    sb.Append('"');

    var bytes = new UTF8Encoding(false).GetBytes(sb.ToString());

    // Act (failure is a thrown exception)

    DefaultSerializer.Instance.Deserialize<string>(bytes);
}

At this point I can confirm that the problem is an interaction between Newtonsoft.Json, the new Utf8MemoryReader class I added for performance improvements, and the UTF-8 decoder. More importantly, it also affects .NET 6; it isn’t just limited to .NET 4. Utf8MemoryReader was based on similar code within the .NET framework that serves a similar purpose, but that code was paired with System.Text.Json, so it never hit this particular interaction, where Newtonsoft.Json doesn’t seem to realize there’s such a thing as a surrogate pair.
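
To make the shape of that interaction concrete, here’s a simplified sketch of the pattern: a TextReader over UTF-8 bytes that delegates to Decoder.Convert. To be clear, this is an illustration, not the actual Utf8MemoryReader source, and it ignores end-of-data flushing:

using System.IO;
using System.Text;

public sealed class Utf8BytesReaderSketch : TextReader
{
    private readonly byte[] _utf8;
    private int _position;
    private readonly Decoder _decoder = Encoding.UTF8.GetDecoder();

    public Utf8BytesReaderSketch(byte[] utf8) => _utf8 = utf8;

    public override int Read(char[] buffer, int index, int count)
    {
        if (_position >= _utf8.Length)
        {
            return 0;
        }

        // When count == 1 and the next code point decodes to a surrogate
        // pair, Convert cannot make progress and throws the "output char
        // buffer is too small" exception. Newtonsoft.Json issues exactly
        // such a one-char read after a 1023-char read stops short at a
        // high surrogate.
        _decoder.Convert(_utf8, _position, _utf8.Length - _position,
            buffer, index, count, false,
            out var bytesUsed, out var charsUsed, out _);
        _position += bytesUsed;
        return charsUsed;
    }
}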

That said, it is a pretty unlikely corner case. So far as I can tell it can only occur in the following set of circumstances:

  • Using the DefaultSerializer
  • Document has a string greater than 1023 characters in length
  • The string includes Unicode surrogate pairs
  • A surrogate pair falls precisely on a boundary multiple of 1023 characters (which is the read size used by Newtonsoft.Json); a quick check of this condition is sketched below
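
As a simplified illustration of that last condition (this assumes reads aligned to the start of the string, glossing over Newtonsoft.Json’s internal buffer offsets), the unit test’s string puts a high surrogate exactly at index 1022, the last slot of a 1023-char read:

using System;
using System.Linq;

// '"' + 1019 'a' chars + five surrogate pairs + '"', as in the test above.
var s = "\"" + new string('a', 1019)
    + string.Concat(Enumerable.Repeat("\ud83e\udd3a", 5)) + "\"";

// Index 1022 is the 1023rd character and holds the high half of a pair,
// so a 1023-char read stops at 1022 chars and leaves the pair pending.
Console.WriteLine(char.IsHighSurrogate(s[1022])); // True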

@dredmond

The fix has been merged and should be included in the 3.4.13 release. Thanks for the excellent detail in the report, it was a great help in resolving the issue.


@btburnett3 No problem. I try to be as detailed as I can to make things easier. It honestly took me a while to figure out what was actually happening, so I’m glad the time spent was worth it.

Thanks for the quick turnaround.

