Problem deserializing documents with certain characters in specific positions

Hi guys,

We’ve been dealing with an issue where calls to GetAsync repeatedly fail for certain documents. After some digging, I was able to pin down the exact situation that causes it. There appears to be a bug in Utf8MemoryReader when it calls Decoder.Convert, on .NET Framework 4.7.2 with Couchbase .NET SDK 3.4.12.

If the character that would fill the last slot of the output buffer is the high half of a surrogate pair, Decoder.Convert stops one character short of filling the buffer. The output buffer is then not expanded properly, and the next call to Utf8MemoryReader.Read for a single character throws an exception.

Exception: The output char buffer is too small to contain the decoded characters, encoding 'Unicode (UTF-8)' fallback 'System.Text.DecoderReplacementFallback'.
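
For reference, the decoder behavior can be shown in isolation, without the SDK. This is a minimal standalone sketch (the string content and class name are mine, just for illustration):

using System;
using System.Text;

internal class DecoderDemo
{
	static void Main()
	{
		// "aa" followed by one surrogate pair (U+1F93A) is 4 UTF-16 chars / 6 UTF-8 bytes.
		var bytes = Encoding.UTF8.GetBytes("aa\ud83e\udd3a");
		var decoder = Encoding.UTF8.GetDecoder();

		// Room for 3 chars: after the two 'a's only one slot remains,
		// which cannot hold the two-char surrogate pair.
		var chars = new char[3];
		decoder.Convert(bytes, 0, bytes.Length, chars, 0, chars.Length,
			false, out var bytesUsed, out var charsUsed, out var completed);

		// Prints charsUsed=2, completed=False: the pair is withheld.
		Console.WriteLine($"charsUsed={charsUsed}, completed={completed}");

		try
		{
			// Asking for exactly one more char now throws the same
			// "output char buffer is too small" ArgumentException.
			decoder.Convert(bytes, bytesUsed, bytes.Length - bytesUsed,
				chars, 0, 1, true, out _, out _, out _);
		}
		catch (ArgumentException ex)
		{
			Console.WriteLine(ex.Message);
		}
	}
}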

Example Code:

using Couchbase;
using Newtonsoft.Json;
using System;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApp4
{
	internal class Program
	{
		static void Main(string[] args)
		{
			var couchbaseConnection = "couchbase://localhost";
			var couchbaseUser = "<User>";
			var couchbasePassword = "<Password>";
			var couchbaseBucket = "<Bucket>";

			var keyName = "test_key";

			var sb = new StringBuilder();
			sb.AppendLine("{\"Text\": \"");
			for (var i = 0; i < 1005; i++)
			{
				// Fill with a bunch of a's
				sb.Append("a");
			}

			// Add surrogate characters towards the end of the string.
			//
			// Hex code D83E must show up in the character position 1022 in the output buffer. (Output buffer length 1024)
			// This causes Utf8MemoryReader's _decoder.Convert to return a character read length of 1022 instead of 1023 since D83E is a high surrogate half.
			//
			// Newtonsoft.Json will attempt to read the last character, but it won't resize the output buffer before making another Utf8MemoryReader.Read call.
			//
			// This results in the following exception:
			// The output char buffer is too small to contain the decoded characters, encoding 'Unicode (UTF-8)' fallback 'System.Text.DecoderReplacementFallback'.
			sb.Append("\ud83e\udd3a\ud83e\udd3a\ud83e\udd3a\ud83e\udd3a\ud83e\udd3a\ud83e\udd3a\ud83e\udd3a\" }");
			var testItem = JsonConvert.DeserializeObject<TestPayload>(sb.ToString());

			Task.Run(async () => {
				var cluster = await Cluster.ConnectAsync(couchbaseConnection, couchbaseUser, couchbasePassword);
				await cluster.WaitUntilReadyAsync(TimeSpan.FromSeconds(60));
				var bucket = await cluster.BucketAsync(couchbaseBucket).ConfigureAwait(false);
				var defaultCollection = await bucket.DefaultCollectionAsync();

				try
				{
					await defaultCollection.InsertAsync(keyName, testItem);

					// This will throw an error.
					var result = await defaultCollection.GetAsync(keyName);
					var payload = result.ContentAs<TestPayload>();
				}
				catch (Exception ex)
				{
					Console.WriteLine(ex);
				}

				// delete the key
				await defaultCollection.RemoveAsync(keyName);
				await cluster.DisposeAsync();
			}).Wait();

			Console.ReadLine();
		}

		public class TestPayload
		{
			public string Text { get; set; }
		}
	}
}

Thanks for bringing this to our attention. I opened an issue - NCBC-3543
.NET 6 is recommended for the 3.x SDKs.


Thanks for the quick reply @mreiche. We are actually working on migrating to .NET 7, but we have some projects still running on 4.7.2.

I’ve found a workaround for now, so it’s not a complete showstopper.

@dredmond would you mind sharing the workaround for anyone else who runs into the same issue?


Sure, I’m just using the raw string transcoder and then deserializing the result myself. The same could be done with the raw binary transcoder, but that requires the extra step of calling Encoding.UTF8.GetString(), which the raw string transcoder already handles for you; a sketch of the binary variant follows the snippet below.

var stringTranscoderOptions = new GetOptions().Transcoder(new RawStringTranscoder());
var stringResult = await defaultCollection.GetAsync(keyName, stringTranscoderOptions);
var result = JsonConvert.DeserializeObject<TestPayload>(stringResult.ContentAs<string>());
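
For completeness, the binary-transcoder variant would look roughly like this (a sketch, assuming the same defaultCollection and keyName as in the repro, plus using System.Text for Encoding):

var binaryTranscoderOptions = new GetOptions().Transcoder(new RawBinaryTranscoder());
var binaryResult = await defaultCollection.GetAsync(keyName, binaryTranscoderOptions);
// This manual GetString call is the step RawStringTranscoder does for you.
var json = Encoding.UTF8.GetString(binaryResult.ContentAs<byte[]>());
var resultFromBinary = JsonConvert.DeserializeObject<TestPayload>(json);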

I’ve managed to reproduce the problem with a simpler unit test.

[Fact]
public void Test()
{
    // Arrange

    var sb = new StringBuilder(1100);
    sb.Append('"');
    sb.Append('a', 1019);

    // Hex code D83E must show up in the character position 1022 in the output buffer. (Output buffer length 1024)
    // This causes Utf8MemoryReader's _decoder.Convert to return a character read length of 1022 instead of 1023 since D83E is a high surrogate half.
    for (var i = 0; i < 5; i++)
    {
        sb.Append("\ud83e\udd3a");
    }

    sb.Append('"');

    var bytes = new UTF8Encoding(false).GetBytes(sb.ToString());

    // Act (failure is a thrown exception)

    DefaultSerializer.Instance.Deserialize<string>(bytes);
}

At this point I can confirm that the problem is an interaction between Newtonsoft.Json, the new Utf8MemoryReader class I added for performance improvements, and the UTF-8 decoder. More importantly, it also affects .NET 6; it isn’t just limited to .NET 4. Utf8MemoryReader was based on similar code within the .NET framework that serves a similar purpose, but that code was paired with System.Text.Json, so it never hit this particular interaction, where Newtonsoft.Json doesn’t seem to realize there’s such a thing as a surrogate pair.
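
To make the shape of that interaction concrete, here’s a simplified sketch of the pattern: a TextReader over UTF-8 bytes that delegates to Decoder.Convert. To be clear, this is an illustration, not the actual Utf8MemoryReader source, and it ignores end-of-data flushing:

using System.IO;
using System.Text;

public sealed class Utf8BytesReaderSketch : TextReader
{
    private readonly byte[] _utf8;
    private int _position;
    private readonly Decoder _decoder = Encoding.UTF8.GetDecoder();

    public Utf8BytesReaderSketch(byte[] utf8) => _utf8 = utf8;

    public override int Read(char[] buffer, int index, int count)
    {
        if (_position >= _utf8.Length)
        {
            return 0;
        }

        // When count == 1 and the next code point decodes to a surrogate
        // pair, Convert cannot make progress and throws the "output char
        // buffer is too small" exception. Newtonsoft.Json issues exactly
        // such a one-char read after a 1023-char read stops short at a
        // high surrogate.
        _decoder.Convert(_utf8, _position, _utf8.Length - _position,
            buffer, index, count, false,
            out var bytesUsed, out var charsUsed, out _);
        _position += bytesUsed;
        return charsUsed;
    }
}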

That said, it is a pretty unlikely corner case. So far as I can tell it can only occur in the following set of circumstances:

  • Using the DefaultSerializer
  • Document has a string greater than 1023 characters in length
  • The string includes Unicode surrogate pairs
  • A surrogate pair falls precisely on a boundary multiple of 1023 characters (which is the read size used by Newtonsoft.Json); a quick check of this condition is sketched below
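
As a simplified illustration of that last condition (this assumes reads aligned to the start of the string, glossing over Newtonsoft.Json’s internal buffer offsets), the unit test’s string puts a high surrogate exactly at index 1022, the last slot of a 1023-char read:

using System;
using System.Linq;

// '"' + 1019 'a' chars + five surrogate pairs + '"', as in the test above.
var s = "\"" + new string('a', 1019)
    + string.Concat(Enumerable.Repeat("\ud83e\udd3a", 5)) + "\"";

// Index 1022 is the 1023rd character and holds the high half of a pair,
// so a 1023-char read stops at 1022 chars and leaves the pair pending.
Console.WriteLine(char.IsHighSurrogate(s[1022])); // True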

@dredmond

The fix has been merged and should be included in the 3.4.13 release. Thanks for the excellent detail in the report, it was a great help in resolving the issue.


@btburnett3 No problem. I try to be as detailed as I can to make things easier. It honestly took me a while to figure out what was actually happening, so I’m glad the time spent was worth it.

Thanks for the quick turnaround.

