A Basic Guide to the Methodology of Data Modeling for MongoDB

Thabo Ambrose
9 min read · Sep 25, 2021

MongoDB offers many solutions to the same problem

Photo by Jan Antonin Kolar — Unsplash

This article is merely a quick reference guide to key considerations when modeling a database for MongoDB. It can also be useful for modeling any other non-relational or document database. It does not delve deep into what MongoDB is.

The Document Model

Data as Key/Value pairs and Polymorphism

In MongoDB, records are represented as documents in a collection (the equivalent of a table in a relational DB). To the user, these documents are presented as JSON objects, while to the machine they are stored in BSON (Binary JSON) format. Because of this, MongoDB benefits from the polymorphism of JSON objects, i.e. documents in the same collection can have varying fields (the equivalent of columns in a relational DB).

Entity linking or embedding

When it comes to mapping entities and the relationships between them, document modeling makes this a breeze with its ability to nest documents inside other documents, or even to represent a one-to-many relationship with a nested array of documents inside another document. With document modeling you decide, depending on your requirements, whether to link collections using the traditional approach of IDs or keys that tie the collections together, or to nest documents/arrays inside other documents, which is referred to as embedding.

What’s used together or closely related may be stored together in the database

The Methodology phases

Data modeling in non-relational databases is iterative: the data model is not rigid and is affected by changes in the application’s requirements, i.e. the data that the application needs and its read, write and operational load.

Image source: https://www.youtube.com/watch?v=3GHZd0zv170

Phase 1: Describe the workload

In this phase, assumptions are made about the data size and the operations the database will handle. A person with business domain knowledge is required here because they understand the business rules, know how the entities are related, and can make assumptions about current and future use cases. Because the modeling process is iterative, having production logs and metrics will help with evaluating the workload; this metadata about your application’s data will give you an idea of the data size, the database queries and indexes, and the current operations.

Phase 2: Create the entity relationships

In this phase, decisions on whether entities should be linked or embedded are made. You can link entities in MongoDB much like you would in an RDBMS, but the advantage with MongoDB is that compact, query-efficient relationships can be created with the help of the modeling patterns that the document model offers. I will demonstrate the idea of embedding documents with a plain JSON object.

Let’s assume we have an AUTHOR and a BOOKS entity relationship in our system. How we model it depends on the requirements; for this case, let’s say that most of the reads and writes for the author and book entities are not that complex. Using the ODM Mongoose, the schema and data in the database might look something like the following.

// Schema
const { Schema } = require('mongoose');

const authorSchema = new Schema({
  name: String,
  books: [{ title: String, datePublished: Date, introduction: String }] // embedded books
})

// Example data
const author = {
  _id: 'jhvs32478ghjvjhad',
  name: 'Tom Tom',
  books: [
    {
      _id: 'iu3443478ghjvjhad',
      title: 'Racing with a bee',
      datePublished: '2019-03-10T23:44:56.289Z',
      introduction: 'In a beautiful morning, a bee was racing with a person 🤣'
    },
    {
      _id: 'cbdfbgf78ghjvjhad',
      title: 'Dark themes',
      datePublished: '2019-05-11T23:44:56.289Z',
      introduction: 'When the light is lost, it is found'
    }
  ]
}

The above snippet shows a one-to-many embedded relationship between an author and the books they have written and published. We could even embed only the book IDs instead of the entire book data, but then again, this will depend on how the data will be queried and used.

For a linked one-to-many relationship, every book document in the books collection would have its author’s ID. With MongoDB, we could also keep an array of the IDs of the books belonging to an author in each author document.
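To make the linked version concrete, below is a minimal Mongoose sketch of that referenced one-to-many relationship; the model names and the optional array of book IDs on the author are assumptions made for illustration.

// A sketch of the linked (referenced) one-to-many relationship between authors and books
const mongoose = require('mongoose');
const { Schema } = mongoose;

const authorSchema = new Schema({
  name: String,
  books: [{ type: Schema.Types.ObjectId, ref: 'Book' }] // optional: IDs of the author's books
});

const bookSchema = new Schema({
  title: String,
  datePublished: Date,
  introduction: String,
  author: { type: Schema.Types.ObjectId, ref: 'Author' } // each book stores its author's ID
});

const Author = mongoose.model('Author', authorSchema);
const Book = mongoose.model('Book', bookSchema);

// Reading a book together with its author then takes a second fetch,
// which Mongoose can do for us (inside an async function):
// const book = await Book.findOne({ title: 'Dark themes' }).populate('author');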

If an author can only publish one book, a one-to-one embedded relationship would look something like the following.

// Schema
const bookSchema = new Schema({
  title: String,
  datePublished: Date,
  introduction: String,
  author: { name: String } // embedded author
})

// Example data
const book = {
  _id: 'cbdfbgf78ghjvjhad',
  title: 'Dark themes',
  datePublished: '2019-05-11T23:44:56.289Z',
  introduction: 'When the light is lost, it is found',
  author: {
    _id: 'jhvs32478ghjvjhad',
    name: 'Tom Tom'
  }
}

A one-to-one linked relationship would keep the books and authors collections separate and have the linking IDs in both; that is, every book document has its author’s ID and every author document has its book’s ID.

We can also have a many-to-many embedded relationship, where a list of books is embedded in an author’s document and a list of authors is embedded in a book’s document.

MongoDB offers many solutions to the same problem because with MongoDB you have the option to link and/or embed to create relations between entities.

How do you decide whether to link or embed?

Source: https://www.youtube.com/watch?v=3GHZd0zv170

Below are the questions you need to ask to make a decision.

  • How often does the embedded data get accessed?
  • Is the data queried using the embedded information?
  • Does the embedded information change often?

Key notes

  • If you decide to embed, depending on your scenario, that will mean a single read with no joins, which is an advantage, but it could lead to duplication of returned data.
  • If you decide to link (use references), that means more than one database read across the linked collections, but with smaller chunks of data returned; the read trade-off is sketched after this list.
  • If a data set related to an entity has a limited lifespan, that suggests the data might have to live in a separate collection.
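Here is a hedged sketch of that read trade-off using the MongoDB Node.js driver and the author/book collections from the earlier examples; the collection and field names are assumptions for illustration.

// Comparing the embedded and linked models on the read side
async function readAuthorWithBooks(db) {
  // Embedded model: a single read returns the author and all embedded books.
  const embedded = await db.collection('authors').findOne({ name: 'Tom Tom' });

  // Linked model: either two round trips...
  const author = await db.collection('authors').findOne({ name: 'Tom Tom' });
  const books = await db.collection('books').find({ author: author._id }).toArray();

  // ...or one aggregation that joins the collections with $lookup.
  const joined = await db.collection('authors').aggregate([
    { $match: { name: 'Tom Tom' } },
    { $lookup: { from: 'books', localField: '_id', foreignField: 'author', as: 'books' } }
  ]).toArray();

  return { embedded, author, books, joined };
}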

See the table below.

Image source: https://www.youtube.com/watch?v=DUCvYbcgGsQ

Entity relationships are still important to get right in MongoDB.

Phase 3: Apply patterns

Schema versioning

MongoDB schemas can be versioned: you will see each document in a collection carrying a _v: 0 field which denotes the schema version, and because of the polymorphic nature of MongoDB collections, documents in a collection need not have the same shape. The schema versioning pattern helps when a migration needs to happen due to a change in requirements or for optimization reasons: the schema version field can be used to progressively migrate each document to the latest schema version. For example, if you started with _v: 0 and now have _v: 1 of your schema, you can use this field when writing data to the database to check whether a particular document still needs to be migrated.

With this pattern, no downtime will be experienced because each document can have a different shape, and it will not break the application as long as the back-end code does the schema version checks when interacting with the DB.
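Below is a minimal sketch of such a version check done lazily when a document is read, using the Node.js driver; the schemaVersion field name and the hypothetical migration (splitting name into firstName/lastName in version 1) are assumptions made purely for illustration.

// Lazily migrate an author document to the latest schema version
async function loadAuthor(db, authorId) {
  const author = await db.collection('authors').findOne({ _id: authorId });

  if (!author.schemaVersion) {
    // Old shape detected: migrate this document on the fly and persist the new version.
    const [firstName, ...otherNames] = author.name.split(' ');
    const migrated = { firstName, lastName: otherNames.join(' '), schemaVersion: 1 };

    await db.collection('authors').updateOne(
      { _id: authorId },
      { $set: migrated, $unset: { name: '' } }
    );

    const { name, ...unchanged } = author;
    return { ...unchanged, ...migrated };
  }

  return author; // already on the latest schema version
}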

Computed pattern

Just storing data and having it available isn’t, typically, all that useful. The usefulness of data becomes much more apparent when we can compute values from it.

The Computed Pattern is utilized when we have data that needs to be computed repeatedly in our application. The Computed Pattern is also utilized when the data access pattern is read intensive; for example, if you have 1,000,000 reads per hour but only 1,000 writes per hour, doing the computation at the time of a write would divide the number of calculations by a factor of 1,000.

-Source: https://www.mongodb.com/blog/post/building-with-patterns-the-computed-pattern

Example 1.

We have a collection of articles, and each article has likes. Instead of calculating the number of likes every time we fetch an article, we can accumulate the total when an article is liked and store the running total in the article document itself.
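A minimal sketch of that idea, assuming a hypothetical totalLikes counter field on the article document, might look like this.

// Computed pattern: maintain the like count at write time
async function likeArticle(db, articleId) {
  // Increment the stored counter instead of counting likes on every read.
  await db.collection('articles').updateOne(
    { _id: articleId },
    { $inc: { totalLikes: 1 } }
  );
}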

Example 2.

The articles can each have one or more comments, and the comments live in a separate collection. We need to know the total number of comments for an article. Depending on the schema pattern, every time a comment is made on an article we can increment the total number of comments and store that total in the article document itself, instead of going to the comments collection every time to sum up the total, which can be resource intensive.
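A similar sketch for this example, assuming a comments collection and a totalComments counter field on the article (both names are illustrative), could look like the following.

// Computed pattern across two collections: store the comment and keep the total up to date
async function addComment(db, articleId, comment) {
  await db.collection('comments').insertOne({ articleId, ...comment, createdAt: new Date() });

  // Keep the pre-computed total on the article so reads never have to re-count comments.
  await db.collection('articles').updateOne(
    { _id: articleId },
    { $inc: { totalComments: 1 } }
  );
}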

Subset pattern

To avoid many disk reads, MongoDB keeps the data that is frequently accessed in memory (called the working set); this data is allocated a limited space in RAM. If the working set exceeds the allocated space, MongoDB has to read from disk. To avoid this, we can use the subset pattern: large documents, for example, can be broken up across more than one collection, leaving the frequently accessed data in the main collection/document, which will live in memory.

Example 1.

Have an article document with a nested array of the 10 most recent comments reside in memory, and keep a separate collection for the older comments of that article on disk; if the old comments happen to be requested, an extra query can be made to fetch the rest of them.
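A minimal sketch of that split, assuming a recentComments array on the article and a hypothetical comments_archive collection for the full history, might look like this.

// Subset pattern: keep only the 10 most recent comments embedded in the article
async function addCommentWithSubset(db, articleId, comment) {
  // Push the new comment and trim the embedded array to the 10 most recent entries.
  await db.collection('articles').updateOne(
    { _id: articleId },
    { $push: { recentComments: { $each: [comment], $slice: -10 } } }
  );

  // The complete comment history still lives in its own collection on disk.
  await db.collection('comments_archive').insertOne({ articleId, ...comment });
}

// If older comments are requested, one extra query fetches the rest:
// const older = await db.collection('comments_archive').find({ articleId }).toArray();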

When considering where to split your data, the most used part of the document should go into the “main” collection and the less frequently used data into another.
-Source: https://www.mongodb.com/blog/post/building-with-patterns-the-subset-pattern


Bucket pattern

When working with time-series data, using the Bucket Pattern in MongoDB is a great option. It reduces the overall number of documents in a collection, improves index performance, and by leveraging pre-aggregation, it can simplify data access.
-Source: https://www.mongodb.com/blog/post/building-with-patterns-the-bucket-pattern
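As a hedged illustration, a per-hour bucket for hypothetical sensor readings could be maintained with an upsert like the one below; the sensor_readings collection and its fields are assumptions, not part of the article’s examples.

// Bucket pattern: group time-series readings into one document per sensor per hour
async function recordReading(db, sensorId, value, timestamp) {
  const bucketStart = new Date(timestamp);
  bucketStart.setMinutes(0, 0, 0); // bucket boundary: the start of the hour

  await db.collection('sensor_readings').updateOne(
    { sensorId, bucketStart },
    {
      $push: { readings: { value, timestamp } }, // append to this hour's bucket
      $inc: { count: 1, sum: value }             // pre-aggregated values for quick reads
    },
    { upsert: true } // create the bucket document on the first reading of the hour
  );
}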

Attribute pattern

The Attribute Pattern is particularly well suited when:

- We have big documents with many similar fields but there is a subset of fields that share common characteristics and we want to sort or query on that subset of fields, or
- The fields we need to sort on are only found in a small subset of documents, or
- Both of the above conditions are met within the documents.

For performance reasons, to optimize our search we’d likely need many indexes to account for all of the subsets. Creating all of these indexes could reduce performance. The Attribute Pattern provides a good solution for these cases.

If your application and data access patterns rely on searching through many of these different fields at once, the Attribute Pattern provides a good structure for the data.
….

The Attribute Pattern provides for easier indexing of documents, targeting many similar fields per document. By moving this subset of data into a key-value sub-document, we can use non-deterministic field names, add additional qualifiers to the information, and more clearly state the relationship of the original field and value.

When we use the Attribute Pattern, we need fewer indexes, our queries become simpler to write, and our queries become faster.
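As a hedged illustration of the pattern, assume a book document with many per-country release-date fields; the subset of similar fields can be moved into a single key/value array like this (the field names are assumptions).

// Attribute pattern: many similar fields become one indexable key/value array
const bookWithAttributes = {
  title: 'Dark themes',
  // Instead of separate releaseUS, releaseUK, releaseZA, ... fields:
  releases: [
    { k: 'releaseUS', v: new Date('2019-05-11') },
    { k: 'releaseUK', v: new Date('2019-06-01') }
  ]
};

// One compound index now covers queries on any of the former fields:
// db.books.createIndex({ 'releases.k': 1, 'releases.v': 1 })
// db.books.find({ releases: { $elemMatch: { k: 'releaseUK', v: new Date('2019-06-01') } } })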

There are other useful patterns that I will not cover in this article; I have linked them below.

Outlier pattern

Extended reference pattern

Approximation pattern

Tree pattern

Pre-Allocation pattern

Document versioning pattern

Patterns summary

Conclusion

In conclusion, your data model will change as your application evolves with requirements. The architecture will also change as the team grows in proportion to the application size, for example when the DB is sharded.

The modeling patterns we have discussed are also subject to your unique requirements and the setup of the architecture. For example, for a small application worked on by a proportionately small team, embedding data might be viable, in contrast to when the application and the team are big and a more performant data model is a requirement. The table below sums up the decision-making factors.

Image source: https://www.youtube.com/watch?v=3GHZd0zv170

With all that said, MongoDB proves to be a go-to when an application processes real-time requests and big data, which is prevalent in this digital era. Many applications ought to be efficient, scalable and able to respond quickly to requests, and not only should an application’s code be well implemented and efficient, but so should its database.
