One of the main attractions of document databases is the flexibility of the document structure or schema. For the most part, this flexibility is beneficial. However, there are situations where enforcing some structure on the documents is helpful. This article shows you how to validate your JSON documents against a specified schema using the popular Python library pydantic.
When do you need to validate documents?
A common misconception about using NoSQL databases is that no structures or document schemas are required. In most cases, applications have some constraints on their data even if they do not explicitly validate it. For example, there might be fields in the document that the application depends on for functionality. An application might simply not operate correctly when some of these fields are missing.
One real-world example of this problem could be an application that reads data from another, unreliable application that periodically sends bad data. It would be prudent to highlight any documents that could break the application in such cases. Developers can do this either in the application or at the document level.
In this approach, we specify a pydantic schema for the documents to help identify, at the document level, those that do not match the application's specifications.
Generate JSON testing data
We use the open-source library Faker to generate fake user profiles for this tutorial.
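Faker can produce most of this structure directly. Here is a minimal sketch of the idea; the fields argument and the hand-built phone sub-document are assumptions made to match the structure shown next, and the actual generation script lives in the GitHub repo linked at the end.

from faker import Faker

fake = Faker()

# Generate only the profile fields used in this tutorial...
profile = fake.profile(
    fields=["job", "company", "residence", "website", "username", "name", "address", "mail"]
)
# ...and add a nested phone sub-document of our own.
profile["phone"] = {
    "home": fake.phone_number(),
    "mobile": fake.phone_number(),
}
print(profile)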
This is the structure of a single document:
{ "job": "Radiographer, diagnostic", "company": "Klein, Dillon and Neal", "residence": "2232 Jackson Forks\nLake Teresa, CO 46959", "website": [ "http://bishop-torres.net/" ], "username": "aabbott", "name": "Madison Mitchell", "address": "1782 Moore Hill Apt. 717\nWest Stephaniestad, NM 75293", "mail": "amberrodriguez@hotmail.com", "phone": { "home": "677-197-4239x889", "mobile": "001-594-038-9255x99138" } } |
To simulate broken documents, I modify a small portion of the user profiles by deleting some of the mobile phone and mail fields. Our aim is to identify these records, which in the real world would be stored in a document database like Couchbase.
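The exact corruption logic is in the repo; a rough sketch of the idea, with a hypothetical profiles list and an arbitrary 10% drop rate, might look like this:

import random

# Hypothetical corruption step: randomly drop fields from some profiles
# so that validation later has something to catch.
for profile in profiles:
    if random.random() < 0.1:
        profile["phone"].pop("mobile", None)
    if random.random() < 0.1:
        profile.pop("mail", None)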
For our testing, I load the generated JSON data into a bucket on our hosted Couchbase Capella cluster using the import functionality in the built-in web console UI. I specify the username field as the key to uniquely identify each document.
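If you prefer loading the documents programmatically instead of through the console, a sketch along these lines should work with the Couchbase Python SDK (the connection constants and the profiles list are placeholders):

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

# Connect to the cluster and upsert each profile with its username as
# the document key, mirroring the console import settings.
cluster = Cluster(CONN_STR, ClusterOptions(PasswordAuthenticator(DB_USER, DB_PWD)))
collection = cluster.bucket(BUCKET).scope(SCOPE).collection(COLLECTION)

for profile in profiles:
    collection.upsert(profile["username"], profile)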
How do you specify a schema for JSON documents?
In the user profile data, I expect the documents to conform to the following structure in my applications:
- Mandatory fields:
- username
- name
- phone – with JSON elements for “home” & “mobile”
- mail
- website – an array
- Optional fields:
- company
- residence
- job
- address
Pydantic is one of the most popular libraries in Python for data validation. The syntax for specifying the schema is similar to using type hints for functions in Python. Developers specify the schema by defining a model. Pydantic has a rich set of features for a variety of JSON validations. We will walk through how to represent the user profile document specifications above.
One thing to note about pydantic is that, by default, it tries to coerce data types by performing type conversions when possible; for example, converting the string '1' into the numeric 1. However, there is an option to enable strict type checking that skips these conversions.
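A minimal sketch of the difference, using pydantic v1's StrictInt type (the model names here are made up):

from pydantic import BaseModel, StrictInt, ValidationError

class Lenient(BaseModel):
    count: int  # the string "1" is coerced to the integer 1

class Strict(BaseModel):
    count: StrictInt  # the string "1" is rejected

print(Lenient(count="1").count)  # -> 1

try:
    Strict(count="1")
except ValidationError as e:
    print(e)  # value is not a valid integer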
In this code example, you see a basic configuration for the UserProfile schema using pydantic syntax:
from typing import List, Optional
from pydantic import BaseModel, HttpUrl

class UserProfile(BaseModel):
    """Schema for User Profiles"""

    username: str
    name: str
    phone: Phone  # custom model, defined in the next snippet
    mail: str
    company: Optional[str]
    residence: Optional[str]
    website: List[HttpUrl]
    job: Optional[str]
    address: Optional[str]
Each field is specified along with its expected data type. Optional fields are marked with Optional, and an array is specified with the List type from Python's typing module, parameterized with the desired element type.
The username field needs to be a string, while the company field is an optional string. Looking at the other lines in the snippet, the website field is a list of type HttpUrl, one of the many custom types that pydantic provides out of the box. HttpUrl validates that the value is a well-formed URL rather than a random string. Similarly, pydantic offers types like EmailStr that we could use to ensure that email fields are in a valid form.
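As a quick illustration of what HttpUrl rejects (the Links model here is hypothetical):

from typing import List
from pydantic import BaseModel, HttpUrl, ValidationError

class Links(BaseModel):
    website: List[HttpUrl]

Links(website=["http://bishop-torres.net/"])  # passes

try:
    Links(website=["not-a-url"])
except ValidationError as e:
    print(e)  # invalid or missing URL scheme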
If you look at the phone field, it is marked as a Phone type which is a custom type that we will define in the next code snippet:
from pydantic import BaseModel, PydanticValueError, validator

class ExtensionError(PydanticValueError):
    # One plausible definition of the custom error, consistent with the
    # validation output shown later; the repo's version may differ.
    code = "extension"
    msg_template = 'value is an extension, got "{wrong_value}"'

class Phone(BaseModel):
    """Schema for Phone numbers"""

    home: str
    mobile: str

    @validator("mobile", "home")
    def does_not_contain_extension(cls, v):
        """Check if the phone numbers contain extensions"""
        if "x" in v:
            raise ExtensionError(wrong_value=v)
        return v
Here we specify that Phone is composed of two fields, home and mobile, both strings. This model is checked as part of the UserProfile model: the phone field of a user profile must contain both home and mobile fields.
Pydantic also allows us to validate the contents of the data, not just its type and presence. You can do this by defining validator functions for specific fields. In the above example, we validate the mobile and home fields to check for extensions. If a number contains an extension, which we do not support, we raise a custom error. These errors are then shown to the users performing the validation.
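For example, passing the Phone model a number with an extension triggers the custom error (numbers adapted from the sample document above):

from pydantic import ValidationError

try:
    Phone(home="677-197-4239", mobile="001-594-038-9255x99138")
except ValidationError as e:
    print(e.json())  # reports a value_error.extension for "mobile"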
You can view the generated schema definition by calling the model's schema_json() method (for example, UserProfile.schema_json(indent=2)), as shown here:
{ "title": "UserProfile", "description": "Schema for User Profiles", "type": "object", "properties": { "username": { "title": "Username", "type": "string" }, "name": { "title": "Name", "type": "string" }, "phone": { "$ref": "#/definitions/Phone" }, "mail": { "title": "Mail", "type": "string" }, "company": { "title": "Company", "type": "string" }, "residence": { "title": "Residence", "type": "string" }, "website": { "title": "Website", "type": "array", "items": { "type": "string", "minLength": 1, "maxLength": 2083, "format": "uri" } }, "job": { "title": "Job", "type": "string" }, "address": { "title": "Address", "type": "string" } }, "required": [ "username", "name", "phone", "mail", "website" ], "definitions": { "Phone": { "title": "Phone", "description": "Schema for Phone numbers", "type": "object", "properties": { "home": { "title": "Home", "type": "string" }, "mobile": { "title": "Mobile", "type": "string" } }, "required": [ "home", "mobile" ] } } } |
Validating JSON documents against the pydantic schema
Now that we have defined the schema, let's explore how we can validate the documents against it.
Validation is done with the model's parse_obj method: you pass the document in as a dictionary and check for validation exceptions. In this case, we fetch all the documents (up to the specified limit) using a Couchbase query, test them one by one, and report any errors.
# DB_USER, DB_PWD, CONN_STR, BUCKET, SCOPE, COLLECTION and DOCUMENT_LIMIT
# are assumed to be defined elsewhere (e.g., in a settings module). The
# imports follow the Couchbase Python SDK 4.x layout; adjust for your
# SDK version if needed.
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
from pydantic import ValidationError

password_authenticator = PasswordAuthenticator(DB_USER, DB_PWD)
cluster = Cluster(
    CONN_STR,
    ClusterOptions(password_authenticator),
)

validation_error_count = 0
query = f"select profile.* from `{BUCKET}`.{SCOPE}.{COLLECTION} profile LIMIT {DOCUMENT_LIMIT}"

try:
    results = cluster.query(query)
    for row in results:
        try:
            # Raises ValidationError if the document does not conform
            UserProfile.parse_obj(row)
        except ValidationError as e:
            validation_error_count += 1
            print(f"Error found in document: {row['username']}\n", e.json())
except Exception as e:
    print(e)

print(f"Validation Error Count: {validation_error_count}")
The validation output highlights some of the errors found in the documents:
Error found in document: aarias
[
  {
    "loc": [
      "phone",
      "home"
    ],
    "msg": "value is an extension, got \"(932)532-4001x319\"",
    "type": "value_error.extension",
    "ctx": {
      "wrong_value": "(932)532-4001x319"
    }
  },
  {
    "loc": [
      "phone",
      "mobile"
    ],
    "msg": "field required",
    "type": "value_error.missing"
  }
]
The document with the ID aarias has an extension in the home phone number, and the mobile field inside phone is missing.
Summary
This user profile example shows how we can easily create custom schemas for our JSON documents and validate them using Python and the pydantic module. It is just the tip of the iceberg, though: there are many more types and constraints that we can validate.
Other approaches are also possible; for example, you can validate data at write time. This can be done quite easily by integrating the schema we defined here into the application and verifying the data before inserting or upserting it into Couchbase.
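A sketch of what that could look like, assuming the UserProfile model from above and an already-opened Couchbase collection (safe_upsert is a hypothetical helper, not an SDK method):

from pydantic import ValidationError

def safe_upsert(collection, doc: dict) -> bool:
    """Upsert the document only if it conforms to the schema."""
    try:
        UserProfile.parse_obj(doc)
    except ValidationError as e:
        print("Rejected document:", e.json())
        return False
    collection.upsert(doc["username"], doc)
    return True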
The entire code for this demo can be found on GitHub, along with instructions to generate the data and run the scripts.
Resources
- JSON schema validation – GitHub repository for code from this post
- Couchbase Capella – fully hosted NoSQL DBaaS
- Python pydantic module
- Python Faker module