We’ve been dealing with an issue where calls to GetAsync would repeatedly fail for certain documents. After some digging, I was able to find the exact situation that causes it. There appears to be a bug in Utf8MemoryReader's call to Decoder.Convert on .NET Framework 4.7.2 with Couchbase .NET SDK 3.4.12.
If the character that would fill the last slot of the output buffer is a high surrogate half, the output buffer is not expanded properly, and the next call to Utf8MemoryReader.Read for a single character throws an exception.
Exception: The output char buffer is too small to contain the decoded characters, encoding 'Unicode (UTF-8)' fallback 'System.Text.DecoderReplacementFallback'.
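For context, a quick standalone illustration of what a high surrogate half is (plain .NET, nothing Couchbase-specific; U+1F93A is the same code point used in the repro below):

using System;
using System.Text;

// U+1F93A (🤺) is a single code point, but UTF-16 stores it as two chars:
// a high surrogate (0xD83E) followed by a low surrogate (0xDD3A).
var fencer = "\ud83e\udd3a";
Console.WriteLine(fencer.Length);                   // 2
Console.WriteLine(char.IsHighSurrogate(fencer[0])); // True

// In UTF-8 the same code point is one indivisible 4-byte sequence, so a
// decoder has to emit both UTF-16 halves together.
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(fencer))); // F0-9F-A4-BA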
Example Code:
using Couchbase;
using Newtonsoft.Json;
using System;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApp4
{
    internal class Program
    {
        static void Main(string[] args)
        {
            var couchbaseConnection = "couchbase://localhost";
            var couchbaseUser = "<User>";
            var couchbasePassword = "<Password>";
            var couchbaseBucket = "<Bucket>";
            var keyName = "test_key";

            var sb = new StringBuilder();
            sb.AppendLine("{\"Text\": \"");
            for (var i = 0; i < 1005; i++)
            {
                // Fill with a bunch of a's
                sb.Append("a");
            }

            // Add surrogate characters towards the end of the string.
            //
            // Hex code D83E must show up in character position 1022 in the output buffer (output buffer length 1024).
            // This causes Utf8MemoryReader's _decoder.Convert to return a character read length of 1022 instead of
            // 1023, since D83E is a high surrogate half.
            //
            // Newtonsoft.Json will attempt to read the last character, but it won't resize the output buffer before
            // making another Utf8MemoryReader.Read call.
            //
            // This results in the following exception:
            // The output char buffer is too small to contain the decoded characters, encoding 'Unicode (UTF-8)' fallback 'System.Text.DecoderReplacementFallback'.
            sb.Append("\ud83e\udd3a\ud83e\udd3a\ud83e\udd3a\ud83e\udd3a\ud83e\udd3a\ud83e\udd3a\ud83e\udd3a\" }");
            var testItem = JsonConvert.DeserializeObject<TestPayload>(sb.ToString());

            Task.Run(async () =>
            {
                var cluster = await Cluster.ConnectAsync(couchbaseConnection, couchbaseUser, couchbasePassword);
                await cluster.WaitUntilReadyAsync(TimeSpan.FromSeconds(60));
                var bucket = await cluster.BucketAsync(couchbaseBucket).ConfigureAwait(false);
                var defaultCollection = await bucket.DefaultCollectionAsync();

                try
                {
                    await defaultCollection.InsertAsync(keyName, testItem);

                    // This will throw an error.
                    var result = await defaultCollection.GetAsync(keyName);
                    var payload = result.ContentAs<TestPayload>();
                }
                catch (Exception ex)
                {
                    Console.WriteLine(ex);
                }

                // Delete the key.
                await defaultCollection.RemoveAsync(keyName);
                await cluster.DisposeAsync();
            }).Wait();

            Console.ReadLine();
        }

        public class TestPayload
        {
            public string Text { get; set; }
        }
    }
}
Sure, I’m just using the raw string transcoder and then deserializing those results. The same could be done with the raw binary transcoder, but that requires the extra step of calling Encoding.UTF8.GetString(), which the raw string transcoder already handles.
var stringTranscoderOptions = new GetOptions().Transcoder(new RawStringTranscoder());
var stringResult = await defaultCollection.GetAsync(keyName, stringTranscoderOptions);
var result = JsonConvert.DeserializeObject<TestPayload>(stringResult.ContentAs<string>());
I have successfully made a simpler unit test reproduction of the problem.
[Fact]
public void Test()
{
    // Arrange
    var sb = new StringBuilder(1100);
    sb.Append('"');
    sb.Append('a', 1019);

    // Hex code D83E must show up in character position 1022 in the output buffer (output buffer length 1024).
    // This causes Utf8MemoryReader's _decoder.Convert to return a character read length of 1022 instead of 1023,
    // since D83E is a high surrogate half.
    for (var i = 0; i < 5; i++)
    {
        sb.Append("\ud83e\udd3a");
    }
    sb.Append('"');

    var bytes = new UTF8Encoding(false).GetBytes(sb.ToString());

    // Act (failure is a thrown exception)
    DefaultSerializer.Instance.Deserialize<string>(bytes);
}
At this point I can confirm that the problem is an interaction between Newtonsoft.Json, the new Utf8MemoryReader class I added for performance improvements, and the UTF-8 decoder. More importantly, it also affects .NET 6; it isn't just limited to .NET Framework 4. Utf8MemoryReader was based on similar code within the .NET framework that serves a similar purpose, but that code is used with System.Text.Json, so it never ran into this particular interaction, in which Newtonsoft.Json doesn't seem to realize there's such a thing as a surrogate pair.
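To make that interaction concrete, here's a minimal standalone sketch of the underlying Decoder.Convert behavior (my own illustration using plain System.Text, not the SDK's actual Utf8MemoryReader code; the 1023-char buffer mirrors Newtonsoft.Json's read size):

using System;
using System.Text;

var encoding = new UTF8Encoding(false);
var decoder = encoding.GetDecoder();

// 1022 filler chars followed by a surrogate pair, so the pair's high half
// would land in the last slot of a 1023-char output buffer.
var bytes = encoding.GetBytes(new string('a', 1022) + "\ud83e\udd3a");

var output = new char[1023];
decoder.Convert(bytes, 0, bytes.Length, output, 0, output.Length,
    false, out var bytesUsed, out var charsUsed, out _);

// The decoder refuses to split the pair across buffers, so it stops one
// char early rather than writing a lone high surrogate.
Console.WriteLine(charsUsed); // 1022, not 1023

// Asking for just the one "remaining" char then throws, because both halves
// of the pair must be written in the same call:
// "The output char buffer is too small to contain the decoded characters..."
var one = new char[1];
decoder.Convert(bytes, bytesUsed, bytes.Length - bytesUsed,
    one, 0, one.Length, false, out _, out _, out _);

Newtonsoft.Json sees the first call succeed with 1022 characters, then asks for the single character it believes is still missing without growing the buffer, and that second call is the one that throws.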
That said, it is a pretty unlikely corner case. So far as I can tell it can only occur in the following set of circumstances:
Using the DefaultSerializer
The document contains a string longer than 1023 characters
The string includes Unicode surrogate pairs
A surrogate pair falls precisely on a multiple-of-1023-character boundary (1023 being the read size used by Newtonsoft.Json)
The fix has been merged and should be included in the 3.4.13 release. Thanks for the excellent detail in the report; it was a great help in resolving the issue.
@btburnett3 No problem, I try to be as detailed as I can to make things easier. It honestly took me a while to figure out what was actually happening, so I’m glad the time spent was worth it.