MongoDB Core Concepts Part 2

Ok, so now that we’ve covered the fact that relational technologies were (in large part) created with a primary goal of maximizing efficiency of disk space by leveraging a system of references – multiple tables to store data only once and refer to it multiple times, let’s take a look at another system of storage that provides a different set of efficiencies.

JSON Document Structure

MongoDB is NOT a JSON database… I like to say that right out of the gate.  Sometimes people will inaccurately report that MongoDB is a JSON database or that it stores data in JSON.  It does not.  It does, however, support JSON fully.

MongoDB stores data in BSON. There’s a full published specification if you’re interested in the gory details. What’s the difference, you might be asking? Hang on… we’ll get there.

Let’s start with a view of the difference between how we store data in the relational world, vs. how we store data in JSON/BSON.

First, a bit of terminology to make sure we’re all on the same verbal page.

Relational term | MongoDB term
Secondary Index | Secondary Index
Joins | Embedded documents, linking, $lookup and $graphLookup
GROUP BY | Aggregation Pipeline

Now, if you’re like me and have developed applications designed to run with a relational database backend, you’ll naturally begin to think about the data elements you’ll manage in your applications and break them into distinct types… maybe even calling them tables… defining the columns for each different piece of data you’ll store and manage.  Further, you’re likely to start thinking about multiple tables for very different pieces of information or data.  For example, People and Cars.  If we’re developing an application that will manage people and the cars they own, you’ll likely end up with something that looks like the following:

Now this is quite logical, especially in light of the fact that you’ve likely been devising these relational schemas for quite some time.

Now to create this structure, we need to develop some DDL, or Data Definition Language. This, in relational parlance, is how we create a schema. In an RDBMS, the schema lives separately from the data.

View SQL Schema

This is part of the problem associated with relational technologies.  All of that definition language above is not needed if we don’t have a schema… if we don’t care about establishing constraints and column definitions ahead of time.

Instead, we can immediately concentrate on creating documents right in our code.  Let’s look at a simple example using NodeJS.

This simple example will insert one document into a collection called peoplecars in a database also called peoplecars.

The document looks like this:
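A minimal sketch of both the document and the insert follows. It assumes the official Node.js driver, whose collections expose an insertOne method. The field names and values are hypothetical, and a tiny in-memory stand-in replaces the real collection so the sketch runs without a server.

```javascript
// Hypothetical document: one person with the cars they own embedded.
const person = {
  first: 'Jane',
  last: 'Example',
  cars: [
    { make: 'Ford', model: 'Mustang', year: 2018 },
    { make: 'Honda', model: 'Civic', year: 2021 },
  ],
};

// Inserts one document. `collection` is assumed to behave like the object
// returned by db.collection('peoplecars') in the official Node.js driver.
async function insertPerson(collection, doc) {
  const result = await collection.insertOne(doc);
  return result.insertedId;
}

// In-memory stand-in (an assumption for this sketch, not part of MongoDB)
// that mimics only the small part of insertOne used above.
function makeFakeCollection() {
  const docs = [];
  return {
    docs,
    async insertOne(doc) {
      const stored = { _id: docs.length + 1, ...doc };
      docs.push(stored);
      return { acknowledged: true, insertedId: stored._id };
    },
  };
}

const peoplecars = makeFakeCollection();
insertPerson(peoplecars, person).then((id) => {
  console.log('inserted document with _id', id);
});
```

Against a real deployment, the collection would come from the driver instead, along the lines of client.db('peoplecars').collection('peoplecars') on a connected MongoClient. Notice that no table, column, or constraint had to be declared first.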

This simple example was written in NodeJS, but know that there are official drivers for most modern languages.

I hope you found this introduction useful.  If you have questions or want to learn more, reach out!  Use the comment box or contact me on Twitter.




MongoDB Core Concepts

Maybe you’re a technical professional who’s done work with only relational databases… Oracle, SQL Server, MySQL, etc. Maybe you’ve just heard of NoSQL databases but haven’t had the chance to dive in and understand what, exactly, these modern data storage mechanisms are all about.

The purpose of this article is to provide a high level understanding of exactly what MongoDB is, and why solutions like MongoDB exist.

To understand why MongoDB exists, we need to go back in time to the 1970s and 1980s, when relational technology was developed.

SQL was initially developed at IBM by Donald D. Chamberlin and Raymond F. Boyce in the early 1970s. This version, initially called SEQUEL (Structured English Query Language), was designed to manipulate and retrieve data stored in IBM’s original quasi-relational database management system, System R, which a group at IBM San Jose Research Laboratory had developed during the 1970s. The acronym SEQUEL was later changed to SQL because “SEQUEL” was a trademark of the UK-based Hawker Siddeley aircraft company.

In the late 1970s, Relational Software, Inc. (now Oracle Corporation) saw the potential of the concepts described by Codd, Chamberlin, and Boyce, and developed their own SQL-based RDBMS with aspirations of selling it to the U.S. Navy, Central Intelligence Agency, and other U.S. government agencies. In June 1979, Relational Software, Inc. introduced the first commercially available implementation of SQL, Oracle V2 (Version2) for VAX computers.

After testing SQL at customer test sites to determine the usefulness and practicality of the system, IBM began developing commercial products based on their System R prototype including System/38, SQL/DS, and DB2, which were commercially available in 1979, 1981, and 1983, respectively.

If you think about the 1970s from a financial perspective, how much did the elements of an application cost? Well, there’s the computer: disk, CPU, and memory. Each of these elements was much more expensive back then.

In fact, let’s look at the cost of hard disk specifically.

Price of a Gigabyte by Year
















And then there’s the developer or database administrator: the money paid to these individuals to design, develop, or maintain the database. This variable of the equation was much cheaper than it is today. Let’s dig into this a bit. To understand how differently we (computer programmers, developers, and DBAs) were compensated in the 1980s compared with today, let’s look at two key factors: the rate of pay (then and now) and the U.S. rate of inflation. First, finding the rate of pay for a computer programmer from the 1980s proved difficult, but I did find one source which listed the average weekly earnings for a computer programmer at $472 per week, which works out to roughly $24k per year ($472 × 52 weeks ≈ $24,500).


Now, if we calculate the impact of inflation on this number, we get to roughly $71k per year.


This may not be the most scientific method – but let’s assume I’m within a few thousand dollars.

Even if we’re at the 25th Percentile ($60k) today, we’re still earning more than 27% more for doing the same work.  That’s a sizable increase.  At the high end, we’re earning more than 82% more for the same jobs.

So, why go into this detail?

We, as DBAs and developers, are earning more than ever before, and the cost of having us work on applications represents a larger slice of the overall cost pie. Therefore, it only makes sense that we should be leveraging systems that maximize the efficiency of our most expensive resource… us, rather than the infrastructure.

It just doesn’t make sense to use a system that’s focused on reducing the number of bits and bytes stored at the cost of developer and DBA time.


Moving from Tables to Documents with MongoDB

I’m going to ask you to set aside your concept of “proper data modeling” and “3rd normal form.” Going forward in this article, those concepts will hold you back. Some DBAs and data modelers become angry at this suggestion. If that’s you… welcome, but please hold judgement until you’ve read this complete post.

Data normalization focuses on organization of the data for the purpose of eliminating duplication.  It’s a data-focused approach.  Not an application-focused approach.

With document data modeling, we flip the script and we turn to our application to answer the questions about how we organize the data.  Before we get into exactly how to do this, let’s delve a bit into the possible structures of a document.

MongoDB is document-based. That is to say that it stores data in documents. JSON-like documents, to be more specific. Most people might think of Microsoft Word documents or PDF documents when I mention the word document, and while it is true that MongoDB can store these types of documents, what I’m really talking about is JSON documents.


JavaScript Object Notation (JSON)  is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate.  It is based on a subset of the JavaScript Programming Language.

When you store data in JSON documents, you approach data modeling quite differently than you do with relational technologies. You see, relational technologies were developed in the 1970s and 1980s when disk space was extremely expensive. Thousands and even tens of thousands of dollars per gigabyte of disk was not unusual early on. So to preserve this most valuable resource, relational technologies developed the concept of data normalization with a set of rules.


Normalization is the systematic method of deconstructing data into tables to eliminate redundancy and undesirable characteristics like Insertion, Update and Deletion Anomalies.  It is a multi-step process that puts data into tables: rows and columns, by removing duplicated data from the tables.

Normalization is used for mainly two purposes:

  • Eliminate redundant data.
  • Ensure data dependencies make sense, i.e., data is logically stored in line with the core objectives stated above.

Normalization techniques and the rules associated with it are all well and good if you intend to leverage a relational database technology.  However, as discussed, MongoDB is document-based… i.e. non-relational.

That is not to say that you cannot define and maintain relationships between data elements in your document model.  However, this is not a primary constraint when building a document-based data model.

Rich Data Structures

JSON documents are relatively simple structures.  They begin with a curly-brace and end with a curly-brace.  In between these braces, you have a set of key value pairs delimited by commas.  Here’s an example:
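As a minimal sketch (the keys and values are made up for illustration), here is a small JSON document held as text and parsed with JavaScript’s built-in JSON.parse:

```javascript
// A hypothetical, minimal JSON document held as text.
// Braces on the outside, comma-separated key/value pairs inside.
const text = `{
  "first": "Jane",
  "last": "Example",
  "age": 36
}`;

// JSON.parse turns the text into an object with one property per key.
const doc = JSON.parse(text);

console.log(Object.keys(doc)); // [ 'first', 'last', 'age' ]
console.log(doc.last);         // Example
```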

In my example, I’ve tidied things up using indents (spaces before the keys) but this is not necessary.  The above example is extremely simple.  These structures can get quite complex and rich.  The above example includes keys, and values.  The keys in all cases with JSON are strings.  The values however, can be string, numeric, decimal, Dates, Arrays, Objects, Arrays of Embedded Objects, and so forth.  Let’s look at a more complex example:
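A richer document might look something like the sketch below. It is written as a JavaScript object rather than raw JSON text so that it can carry a real date value, and every name and value in it is hypothetical.

```javascript
// Hypothetical document mixing the value types described above.
const author = {
  name: 'Jane Example',                        // string
  born: new Date('1970-01-15T00:00:00Z'),      // date
  rating: 4.8,                                 // number
  languages: ['English', 'French'],            // array of strings
  address: { city: 'London', country: 'UK' },  // embedded object
  books: [                                     // array of embedded objects
    { title: 'First Book', year: 2001 },
    { title: 'Second Book', year: 2005 },
  ],
};

console.log(author.address.city);         // London
console.log(author.books[1].title);       // Second Book
console.log(author.born instanceof Date); // true
```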

As you can see, the values don’t have to be simple strings or numbers. They can be quite complex. Now, if you’re aware of JSON, you might be saying something like: wait a minute, JSON only supports strings, numbers, booleans, null, arrays, and objects; it has no date or decimal type… and you’d be correct.

Earlier in this article, I stated specifically that MongoDB stores data in JSON-like documents. We actually store the data in BSON documents. BSON is a binary representation of JSON-like documents that adds types JSON lacks. You can read all about this standard in the BSON specification.

We use BSON so that we can honor the types not supported by JSON… to make it easier for developers to store rich data types and not have to marshal them back from their non-native forms. When you write a decimal in MongoDB and then read it back, it comes to you via the driver in decimal form.
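To see the marshaling problem that a typed format avoids, here is a small sketch that uses a date rather than a decimal and relies only on JavaScript’s built-in JSON functions. It does not touch MongoDB at all, and the order document is hypothetical.

```javascript
// A hypothetical order with a date value.
const original = { item: 'book', orderedAt: new Date('2020-01-15T10:30:00Z') };

// Round-trip through plain JSON text, which has no date type.
const roundTripped = JSON.parse(JSON.stringify(original));

console.log(original.orderedAt instanceof Date); // true
console.log(typeof roundTripped.orderedAt);      // string
console.log(roundTripped.orderedAt);             // 2020-01-15T10:30:00.000Z

// The application has to convert the string back into a Date by hand.
const restored = new Date(roundTripped.orderedAt);
console.log(restored.getTime() === original.orderedAt.getTime()); // true
```

With plain JSON the date comes back as a string and the application has to know which fields to convert. Because BSON has a date type, that bookkeeping is left to the driver.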

Now that we understand a bit about how MongoDB stores and organizes data in document structures, let’s address migrating data from a relational structure to a document-based data model with MongoDB.

Let’s use the obligatory Books and Authors example, not because it’s stunningly brilliant, no – because I’m lazy and it happens to be something to which we can all relate.

Consider the following ERD.

In this simple example, we have two tables.  Authors, and Books.  There is a relationship expressed between these two tables in that Books have an Author.  Rather than storing this data together, we’re storing it separately and expressing the relationship through LINKING.

With MongoDB, we can store this very same information, but instead of linking between two disparate, separate locations, we can EMBED the same data. Consider the following:
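A minimal sketch of the idea, using made-up rows and field names: the same data is first held relational-style in two linked lists, then reshaped so that each author document embeds its own books.

```javascript
// Relational-style data: two "tables" linked by author_id (hypothetical rows).
const authors = [
  { author_id: 1, name: 'Jane Example' },
  { author_id: 2, name: 'John Sample' },
];
const books = [
  { book_id: 10, author_id: 1, title: 'First Book' },
  { book_id: 11, author_id: 1, title: 'Second Book' },
  { book_id: 12, author_id: 2, title: 'Another Book' },
];

// Embed each author's books inside the author document,
// so the link column is no longer needed.
function embedBooks(authorRows, bookRows) {
  return authorRows.map((a) => ({
    name: a.name,
    books: bookRows
      .filter((b) => b.author_id === a.author_id)
      .map((b) => ({ title: b.title })),
  }));
}

const authorDocs = embedBooks(authors, books);
console.log(JSON.stringify(authorDocs[0], null, 2));
```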

In this example, we’ve designed a document structure by creating an actual document.  Notice in the previous, relational example, we created an ERD, or Entity Relationship diagram.  This same ERD may be useful for us as we model our data in documents… but the difference is that with MongoDB, there is no separate, distinct schema.  The schema does not live separately from the actual documents.

In atomic and molecular physics, there’s a concept known as the observer effect. This applies here to the concept of a schema with MongoDB. If you don’t look at the data, the schema does not exist. It’s not until you observe the data that you see that a schema defining what keys and values you have truly exists.
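One way to picture this “observation” is a small sketch that walks over some documents and reports which keys and value types it finds. The documents and the helper names are hypothetical, and this is plain JavaScript rather than a MongoDB feature.

```javascript
// Hypothetical documents from one collection; note they differ in shape.
const docs = [
  { name: 'Jane Example', books: [{ title: 'First Book' }] },
  { name: 'John Sample', email: 'john@example.com' },
];

// Describe a value's type in a slightly more useful way than typeof.
function typeOf(value) {
  if (Array.isArray(value)) return 'array';
  if (value === null) return 'null';
  if (value instanceof Date) return 'date';
  return typeof value;
}

// "Observe" the documents: collect every key and the types seen for it.
function observeSchema(documents) {
  const schema = {};
  for (const doc of documents) {
    for (const [key, value] of Object.entries(doc)) {
      if (!schema[key]) schema[key] = new Set();
      schema[key].add(typeOf(value));
    }
  }
  // Convert the sets to sorted arrays for easy printing.
  return Object.fromEntries(
    Object.entries(schema).map(([k, types]) => [k, [...types].sort()])
  );
}

console.log(observeSchema(docs));
```

The schema here is a by-product of the data: add a document with a new key and the observed schema changes with it.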

Now, you may begin to wonder something along the lines of what if a data element in the subordinate changes?  What if a subdocument element such as book title changes?  Unlikely, I suppose but possible.  And since we’re storing book titles inside of an Author Record, and possibly even storing the very same information such as book title, description, etc. in another collection specific to these data elements, how will we address this change?  ARE YOU SAYING WE MAY HAVE TO UPDATE THE DATA MORE THAN ONCE!?!


Calm down.  We’re not under the same constraints as relational developers.  We own the destiny of our document structures as well as the content.

We can change data multiple times, in multiple locations.

But… but, but that’s wrong. I feel your terror. It’s not wrong, because we don’t adhere to data normalization rules. Who cares? Who cares if we store data in multiple locations? We are the czar of our data and we control when it’s written with our code. We are no longer beholden to a schema-wielding DBA. We are that DBA. If this feels wrong, you’re not alone. But trust me, the benefits of this approach far outweigh the drawbacks.
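As a rough, in-memory sketch of what “updating in multiple locations” means in application code, assume a book title is stored both inside an author document and in a separate list of books. All names and values are hypothetical.

```javascript
// The same book title stored in two places (hypothetical data):
// embedded in an author document and in a separate books collection.
const authors = [
  { name: 'Jane Example', books: [{ isbn: '111', title: 'Frist Book' }] },
];
const books = [{ isbn: '111', title: 'Frist Book', pages: 200 }];

// Fix the title everywhere it is stored. The application, not a schema,
// decides which copies exist and updates each of them.
function renameBook(isbn, newTitle) {
  let updated = 0;
  for (const author of authors) {
    for (const book of author.books) {
      if (book.isbn === isbn) { book.title = newTitle; updated += 1; }
    }
  }
  for (const book of books) {
    if (book.isbn === isbn) { book.title = newTitle; updated += 1; }
  }
  return updated;
}

console.log(renameBook('111', 'First Book')); // 2
```

Against a real database this would be one update per collection issued from your code. The point is only that the application owns the list of places to touch.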

Benefits of the Document Model Approach

Benefit One: Data Locality – Data that’s accessed together is stored together

When we toss out normalization we gain the notion of “Data that’s accessed together is stored together”, otherwise known as data locality.  An Author document contains all relevant details about an author including the books he or she has written.  When my application needs this data, it issues a read and in most cases, a single read fetches ALL of the data needed.  In relational, or normalized data, a single read gets me perhaps a single row in a single table and then I need to issue another read to get the related data from an additional table.  Multiple reads equals multiple disk seeks equals slower performance for my application.

Benefit Two: Readability

When all the data that’s accessed together is stored together, it’s just logical – it makes sense – you can see it all at once in a document.  Whereas with relational technologies, you must issue SQL commands with JOIN clauses to pull data from multiple locations.  Much less readable.

Benefit Three: Flexibility and Agility

When we store data in documents, adding, removing, or modifying the data structures is much easier. By default, there is no governing schema. We simply modify the code we use to update the data in the database. We have no external schema to modify. Therefore, we gain the flexibility to make these changes without stopping the database… without issuing an “alter table” command.


In this first of a series of articles on migrating from relational to documents, we’ve looked at how data is stored in MongoDB, what documents are, the structure of JSON and BSON and explored just a few of the benefits.  While the examples are basic, I hope these have illustrated the power and flexibility of this modern approach to data storage.

In my next article, I’ll tackle a bit more challenging relational schema and convert that to documents and incorporate the code used to maintain the data.

If you’re interested in this topic but need a more structured approach to enablement and learning, MongoDB has amazing resources to help you wrap your mind around the document model.  I highly recommend MongoDB University  if you’re new – or trying to improve your knowledge of MongoDB.

Please leave a comment, ask a question or reach out to me on LinkedIn with feedback.



Deploying MongoDB Enterprise with Ansible

I’ve been asked about this subject several times, so I thought it might be best to put some thoughts into a blog post and share it.

For the purposes of this article, I’m going to assume you have Ansible installed. If you need help with that specifically, refer to the Ansible site for documentation on installation.

Question: Why does your image refer to Ops Manager? I thought we were going to cover Ansible.

Answer: Ops Manager can accomplish many things over and above what Ansible covers: monitoring, automating, optimizing, and backing up your MongoDB installation. I won’t cover Ops Manager in this article, but if you’re running MongoDB in production, I highly recommend looking into it.

Ansible is an incredible tool. It’s also been referred to as “SSH Configuration Management on Steroids.” Explaining what Ansible is and how it works is beyond the scope of this article. I will, however, provide some basic details specific to the application of Ansible around the problem of deploying MongoDB.

Ansible leverages SSH to enable you to manage, and automate the process of configuring and maintaining configuration on a number of servers.


The first thing to know about Ansible is that it requires knowledge of the servers you’ll be managing using the tool.  This knowledge is maintained using an inventory file.

Ansible works against multiple systems in your infrastructure at the same time. It does this by selecting portions of systems listed in Ansible’s inventory file, which defaults to being saved in the location /etc/ansible/hosts. You can specify a different inventory file using the -i <path> option on the command line.

Not only is this inventory configurable, but you can also use multiple inventory files at the same time (explained below) and also pull inventory from dynamic or cloud sources, as described in Dynamic Inventory.  This is a mind-blowing concept for some.  A dynamic inventory is one that can change… it’s not a static file, it’s a script that returns a list of servers.
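To make the “script that returns a list of servers” idea concrete, here is a minimal sketch of a dynamic inventory script written in Node.js. Ansible runs such a script with --list and reads JSON describing groups and hosts from its output. The hostnames and the discoverMongoHosts function are placeholders I made up; a real script would look the servers up somewhere (a cloud API, for example).

```javascript
// Sketch of a dynamic inventory script. Ansible runs such a script with
// --list and expects JSON describing groups and their hosts on stdout.
// The hostnames and the lookup function are placeholders; a real script
// would query a cloud API or similar instead of returning fixed values.
function discoverMongoHosts() {
  return ['mongo1.example.com', 'mongo2.example.com', 'mongo3.example.com'];
}

function buildInventory() {
  return {
    mongodb: { hosts: discoverMongoHosts() },
    // Empty per-host variables; providing _meta spares Ansible from
    // calling the script once per host with --host.
    _meta: { hostvars: {} },
  };
}

function main(args) {
  if (args.includes('--list')) {
    return JSON.stringify(buildInventory(), null, 2);
  }
  // --host <name> (or anything else): no host-specific variables.
  return JSON.stringify({});
}

console.log(main(process.argv.slice(2)));
```

Saved as an executable file with a Node shebang line, a script like this can be passed to the -i option in place of a static inventory file.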

For the time being, let’s leave the dynamic capability to the side… and let’s create a static file containing the names of the servers on which we’ll install MongoDB.

Where you see “[mongodb]” – this is a group indicator.  It tells Ansible that the next lines will be servers that should be a part of the group indicated… in this case “mongodb”.  The string mongodb is arbitrary and could be anything… “MyServers” would work just as well.  It’s later when we write some Ansible commands that we’ll refer to these servers as a group – and the group name will be important.

The lines beneath the group header are the fully qualified domain names of the servers on which you’ll be installing MongoDB. If you’re installing a replica set, you’ll most likely have three.
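To illustrate how a group header collects the host lines that follow it, here is a small sketch that holds a hypothetical inventory as a string and groups the hosts in JavaScript. The hostnames are placeholders and the parser is deliberately simplified; Ansible’s own inventory parser also handles variables, host ranges, and child groups.

```javascript
// A hypothetical static inventory in the INI-style format described above.
const inventoryText = `
[mongodb]
mongo1.example.com
mongo2.example.com
mongo3.example.com
`;

// Group each host line under the most recent [group] header.
// Simplified on purpose: it ignores variables, ranges, and child groups.
function parseInventory(text) {
  const groups = {};
  let current = null;
  for (const raw of text.split('\n')) {
    const line = raw.trim();
    if (line === '' || line.startsWith('#')) continue;
    const header = line.match(/^\[(.+)\]$/);
    if (header) {
      current = header[1];
      groups[current] = [];
    } else if (current !== null) {
      groups[current].push(line);
    }
  }
  return groups;
}

console.log(parseInventory(inventoryText));
```

Renaming the header to [MyServers] changes only the key under which the same three hosts are collected, which is why the group name is arbitrary until a command or playbook refers to it.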

As I’m writing this article, I have 3 servers deployed in AWS/EC2 that I’ll be using.  So – here’s what my inventory file looks like:

Server Access and Connectivity

Ok, so now we’ve defined our universe of servers, let’s talk about how we’re going to leverage Ansible to effect change on these servers.  Ansible uses SSH to connect and manage servers.  In order for this to happen, you need to give Ansible the appropriate credentials.

Ansible will assume you have SSH access available to your servers, usually based on SSH-Key.  Because Ansible uses SSH, the server it’s on needs to be able to SSH into the inventory servers. It will attempt to connect as the current user it is being run as.

Getting Busy

Ok, now we understand the servers on which we’ll install MongoDB, as well as the mechanism with which we’ll access and effect change on those servers, so let’s get busy and do something.

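In its most basic form, Ansible can be used from the command line to do things with your server inventory. The most basic module you can try right now is ping. Despite the name, Ansible’s ping is not a network-level ICMP ping; it connects to each server over SSH, checks that a usable Python is available there, and returns “pong” on success, which verifies that Ansible can actually manage those servers.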

In this simple example, I’m using the ansible command, specifying the location of my inventory file and calling an Ansible module called ping against a group called “ReplicaSet.” Ansible responds with the output and results of the command I ran.


Ansible’s nature is to automate things.  So, naturally, you can automate the process of telling Ansible things about your configuration or your environment.  The inventory file, for example – that can be set in your env so you don’t have to use the -i switch each and every time you run a command.

In this way, I’ve now set my inventory file in my environment and I no longer have to use the -i switch.  So I can simply type the following to achieve the same output as previously.

Additionally, where I’m leveraging the -m switch to specify the module I want to use, I can instead use another command and move the actual work I want accomplished to another file called a playbook. Ansible playbooks are like scripts that describe the work you want Ansible to accomplish.

Playbooks leverage YAML, which originally stood for Yet Another Markup Language and now officially stands for YAML Ain’t Markup Language. This is a straightforward, easy-to-read configuration language. You’ll get the hang of it quickly. Here’s an example of the previous ping command represented in playbook YAML format:

And here’s what that looks like when we execute it:

If you’re playing along at home, place the YAML code for the ping command into a file called ping.yml.  Then execute the command ansible-playbook ping.yml.

So ping is awesome but what about mongodb?

Yea – we’re getting there.  I get it, you’re impatient… so am I.  Alright – so where are we?  We know our inventory of mongodb servers… we understand how we’re going to access them via SSH and we just learned about the fact that we can create these awesome script-like things called playbooks.

In the ping example playbook, I used a section called tasks.  Tasks are where we leverage commands that ansible understands to carry out the things we want to accomplish on our inventory of servers… ok – so how then do we install MongoDB?

Ansible does for you what you are not able, or don’t want to do for yourself.  However, it does it in the same manner.  With MongoDB, specifically in the Linux world… the easiest way to install it is by installing a REPO, or an RPM repository and leveraging the YUM command to install the packages for you.

YUM satisfies all dependencies during the installation process so it makes managing your software installations a lot easier.  We could, technically, use Ansible to download the binaries, and perform a manual compile and install… sure.  But let’s leverage the power of package management to do all that for us.

In order to use YUM, we first need to define a repository.  The following is the repo definition for MongoDB Enterprise.

To use this repo without ansible, you’d copy the repo file to each of your servers, then execute yum update, and yum install mongodb-enterprise, etc.

We’re not going to do that; we’re going to let Ansible do it for us. So, step 1 will be to create a file called mongodb-enterprise.repo and copy in the contents from above. Place this file in a directory called files (for neatness, of course).

Next, let’s create the playbook we’ll use that will refer to this repo.  Here’s what mine looks like:

Let’s break this down.

Line 001: YAML document start.
Line 002: Hosts designation – lets Ansible know what hosts we’re acting on.
Line 003: Remote user designation – who we are impersonating when we execute these changes on the remote host.
Line 004: Become – this is the same as sudo, essentially. We need to make these changes as a super user.
Line 005: Tasks designator – begin block.
Line 006: What file are we using? We are going to send this file to our remote hosts, one at a time.
Line 007: Once there, what commands will we be executing? First, let’s do a generic update. The equivalent of this is “yum update”.
Line 008: Next, we want to target a specific package whose state we want to change… specifically, we want the state of the package named mongodb-enterprise to be installed at the latest version.
Line 009: Now we want to install the mongodb shell commands.
Line 010: We also need gpg installed.
Line 011: Lastly, we’re going to run MongoDB in a specific directory – namely “/data”.

Let’s give it a shot.  First, place these commands in a file called playbook-replicaset-enterprise-prerequisites.yml in a directory called playbooks.

Here’s what this looks like when it runs:

If all goes according to plan, you should end up with MongoDB Enterprise installed on your hosts.

Let’s go interrogate one of the servers to make sure.

Sure enough, MongoDB is installed and ready to go.

So – now that we’ve installed it using a package manager, what about starting, stopping, restarting, etc.?  Ansible can easily accomplish these from command line – but let’s create playbooks so we have them in our arsenal.

Let’s create a new playbook called playbook-replicaset-start.yml and fill it with the following content.  Notice we’re calling mongod directly, and not relying on the service commands… you will want to examine this should you further your deployment into a production environment.

Here’s what our new mongodb start playbook looks like in action:

And now let’s verify that we’ve actually effected the expected change on our servers:

So that’s it folks, we’ve gone from zero to hero with Ansible and MongoDB.  To recap, we learned about Ansible’s inventory.  We learned how to run ansible from the command line as well as how to create scripts called playbooks.  We created a repo file, distributed that to our servers and leveraged that via ansible to install mongodb.

If this topic interests you and you’re looking to go to the next step, check out my repository on GitHub that contains some great playbook content, some of which I wrote and a lot of which my colleague Torsten Spindler wrote. This repo enables you to automate the process of installing MongoDB and leverages Ops Manager, registering newly installed hosts with the Ops Manager console. This will better prepare you to manage your production implementation of MongoDB.

This is just a beginning, but I hope you can see the incredible power and flexibility that Ansible can bring you.  Feel free to leave a comment or question should you have either.

You may also want to review the scripts I’ve created as a part of this post.  They can be found and freely downloaded at: