How to Customize Character Encoding with System.Text.Json

System.Text.Json is the built-in .NET library for serializing and deserializing JSON. By default, the serializer escapes all non-ASCII characters, replacing each one with a \uxxxx sequence, where xxxx is the character's Unicode code. It also escapes HTML-sensitive characters within the ASCII range. In some scenarios you may want to relax or customize this escaping, and this article shows how to do that with System.Text.Json.
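To make the default behavior concrete, here is a minimal sketch that serializes a short Cyrillic string with no custom options. The sample string is an assumption chosen for illustration; any non-ASCII text should behave the same way.

```csharp
using System;
using System.Text.Json;

// Serialize a string containing Cyrillic text with the default options.
string json = JsonSerializer.Serialize("жарко");

// The output contains \uXXXX escape sequences instead of the literal letters.
Console.WriteLine(json);

// Escaping is lossless: deserializing restores the original text.
string roundTrip = JsonSerializer.Deserialize<string>(json)!;
Console.WriteLine(roundTrip == "жарко"); // True
```

The escaped form is still valid JSON and decodes to the same string, so customizing the encoder mainly affects payload size and readability, not the data itself.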

Serialize Language Character Sets

By default, the serializer escapes all non-ASCII characters. However, you can specify Unicode ranges to serialize the character sets of one or more languages without escaping. To do this, you need to create an instance of System.Text.Encodings.Web.JavaScriptEncoder and pass the desired Unicode ranges.

```csharp
using System.Text.Encodings.Web;
using System.Text.Json;
using System.Text.Unicode;

var options = new JsonSerializerOptions
{
    // Allow Basic Latin and Cyrillic characters through without escaping.
    Encoder = JavaScriptEncoder.Create(UnicodeRanges.BasicLatin, UnicodeRanges.Cyrillic),
    WriteIndented = true
};

// weatherForecast is the object to serialize, defined elsewhere.
var jsonString = JsonSerializer.Serialize(weatherForecast, options);
```

This code snippet serializes characters in the Basic Latin and Cyrillic Unicode ranges without escaping them. The Encoder property of JsonSerializerOptions is set to a JavaScriptEncoder instance created with those ranges, so Cyrillic text appears literally in the resulting JSON, while characters outside both ranges are still escaped.
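The effect of the range-based encoder is easiest to see side by side with the default. The following is a minimal sketch; the anonymous weatherForecast object and its property names are hypothetical stand-ins for whatever type you actually serialize.

```csharp
using System;
using System.Text.Encodings.Web;
using System.Text.Json;
using System.Text.Unicode;

// Hypothetical sample data; any object with a Cyrillic string works.
var weatherForecast = new { TemperatureCelsius = 25, Summary = "жарко" };

var options = new JsonSerializerOptions
{
    Encoder = JavaScriptEncoder.Create(UnicodeRanges.BasicLatin, UnicodeRanges.Cyrillic)
};

// Default options: the Cyrillic summary is written as \uXXXX sequences.
string escaped = JsonSerializer.Serialize(weatherForecast);

// Custom encoder: the Cyrillic summary is written literally.
string unescaped = JsonSerializer.Serialize(weatherForecast, options);

Console.WriteLine(escaped);
Console.WriteLine(unescaped);
```

Both strings deserialize to the same data; only the textual representation of the Summary value differs.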

Serialize Specific Characters

Alternatively, you can specify individual characters that you want to allow through without being escaped. This can be done by creating a TextEncoderSettings instance and using the AllowCharacters method to specify the characters you want to allow.

```csharp
using System.Text.Encodings.Web;
using System.Text.Json;
using System.Text.Unicode;

var encoderSettings = new TextEncoderSettings();
// Allow the individual characters 'ж' (U+0436) and 'а' (U+0430).
encoderSettings.AllowCharacters('\u0436', '\u0430');
// Also allow the whole Basic Latin range.
encoderSettings.AllowRange(UnicodeRanges.BasicLatin);

var options = new JsonSerializerOptions
{
    Encoder = JavaScriptEncoder.Create(encoderSettings),
    WriteIndented = true
};

// weatherForecast is the object to serialize, defined elsewhere.
var jsonString = JsonSerializer.Serialize(weatherForecast, options);
```

In this example, the AllowCharacters method is used to allow the characters ‘ж’ and ‘а’ without escaping. The AllowRange method is used to allow the Basic Latin range. The resulting JSON will only escape characters that are not explicitly allowed.
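As a rough illustration of what per-character allowing does, the sketch below serializes one Cyrillic word in which only two of the five letters are allowed. The sample word is an assumption chosen so that both allowed and non-allowed letters appear.

```csharp
using System;
using System.Text.Encodings.Web;
using System.Text.Json;
using System.Text.Unicode;

var encoderSettings = new TextEncoderSettings();
encoderSettings.AllowCharacters('\u0436', '\u0430'); // 'ж' and 'а'
encoderSettings.AllowRange(UnicodeRanges.BasicLatin);

var options = new JsonSerializerOptions
{
    Encoder = JavaScriptEncoder.Create(encoderSettings)
};

// "жарко" contains 'ж' and 'а' (allowed) plus 'р', 'к', 'о' (not allowed).
string json = JsonSerializer.Serialize("жарко", options);

// The first two letters appear literally; the remaining three are escaped.
Console.WriteLine(json);
```

Allowing single characters like this is useful when only a handful of symbols matter; for whole alphabets, allowing a Unicode range is usually less error-prone.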

Block Lists

In addition to allow lists, there are also block lists that can override certain code points. Code points in a block list are always escaped, even if they are included in an allow list.

Global Block List

The global block list includes private-use characters, control characters, undefined code points, and certain Unicode categories, such as Space_Separator (excluding U+0020 SPACE). For example, the IDEOGRAPHIC SPACE character (U+3000) is escaped even if the Unicode range CJK Symbols and Punctuation is specified in the allow list. The global block list is an implementation detail that can change between .NET versions, so don't depend on a particular character being on or off it.
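The U+3000 case can be checked directly against an encoder, without going through the serializer. This is a minimal sketch; because the global block list is an implementation detail, treat the behavior shown for any specific character as an observation rather than a guarantee. U+3001 (IDEOGRAPHIC COMMA) is used here as an assumed example of a character in the same range that is not blocked.

```csharp
using System;
using System.Text.Encodings.Web;
using System.Text.Unicode;

// Allow Basic Latin plus the CJK Symbols and Punctuation range (U+3000-U+303F).
JavaScriptEncoder encoder = JavaScriptEncoder.Create(
    UnicodeRanges.BasicLatin,
    UnicodeRanges.CjkSymbolsandPunctuation);

// IDEOGRAPHIC SPACE is a space separator, so the global block list wins.
string space = encoder.Encode("\u3000");

// IDEOGRAPHIC COMMA is ordinary punctuation in the same allowed range.
string comma = encoder.Encode("\u3001");

Console.WriteLine(space);            // \u3000
Console.WriteLine(comma == "\u3001"); // True (passed through unescaped)
```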

Encoder-Specific Block Lists

Each encoder can have its own block list that specifies code points to be escaped. For example, the HTML encoder always escapes the ampersand ('&'), even though that character is in the BasicLatin range. Other encoders have their own blocked code points.
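A quick way to observe encoder-specific blocking is to give two encoders the same allow list and the same input. The sketch below assumes a sample string containing an ampersand; both encoders allow BasicLatin, yet neither lets the ampersand through, and each escapes it in its own syntax.

```csharp
using System;
using System.Text.Encodings.Web;
using System.Text.Unicode;

const string input = "Tom & Jerry";

// Both encoders are created with the same allowed range.
HtmlEncoder htmlEncoder = HtmlEncoder.Create(UnicodeRanges.BasicLatin);
JavaScriptEncoder jsEncoder = JavaScriptEncoder.Create(UnicodeRanges.BasicLatin);

string html = htmlEncoder.Encode(input);
string js = jsEncoder.Encode(input);

Console.WriteLine(html); // Tom &amp; Jerry
Console.WriteLine(js);   // Tom \u0026 Jerry
```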

Serialize All Characters

If you want to minimize escaping, you can use JavaScriptEncoder.UnsafeRelaxedJsonEscaping. Compared to the default encoder, it is more permissive about which characters pass through unescaped: it does not escape HTML-sensitive characters such as <, >, &, and ', and it offers no additional defense-in-depth protection against cross-site scripting (XSS) or information-disclosure attacks. Characters that JSON itself requires to be escaped, such as double quotes and backslashes, are still escaped.

```csharp
using System.Text.Encodings.Web;
using System.Text.Json;

var options = new JsonSerializerOptions
{
    // Escape only what the JSON format requires; HTML-sensitive characters pass through.
    Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping,
    WriteIndented = true
};

// weatherForecast is the object to serialize, defined elsewhere.
var jsonString = JsonSerializer.Serialize(weatherForecast, options);
```

The UnsafeRelaxedJsonEscaping encoder lets most characters pass through unescaped, including non-ASCII text and HTML-sensitive characters. Use it only when you know the client will interpret the payload as UTF-8 encoded JSON, and never write its raw output into an HTML page or a script element.
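To compare the relaxed encoder with the default one, the following sketch serializes the same strings both ways. The sample strings are assumptions chosen to include HTML-sensitive characters, Cyrillic text, and a double quote.

```csharp
using System;
using System.Text.Encodings.Web;
using System.Text.Json;

var relaxedOptions = new JsonSerializerOptions
{
    Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping
};

const string text = "5 < 7 & жарко";

// Default: '<', '&', and the Cyrillic letters are all escaped.
string defaultJson = JsonSerializer.Serialize(text);

// Relaxed: the same characters are written literally.
string relaxedJson = JsonSerializer.Serialize(text, relaxedOptions);

// Even the relaxed encoder escapes what JSON requires, such as double quotes.
string quotedJson = JsonSerializer.Serialize("say \"hi\"", relaxedOptions);

Console.WriteLine(defaultJson);
Console.WriteLine(relaxedJson); // "5 < 7 & жарко"
Console.WriteLine(quotedJson);
```

The relaxed output is shorter and easier to read in logs, which is the usual reason to choose it, as long as the consumer is a JSON parser rather than a browser rendering the text as markup.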

Conclusion

In this article, we explored how to customize character encoding with System.Text.Json. We learned how to serialize specific language character sets without escaping, specify individual characters to allow without escaping, and use block lists to override certain code points. We also discussed how to minimize escaping by using the UnsafeRelaxedJsonEscaping encoder. By customizing the character encoding, you can have more control over how your JSON data is serialized and ensure that it meets your specific requirements.

