MongoDB Core Concepts Part 2

Ok, so now that we’ve covered the fact that relational technologies were (in large part) created with the primary goal of maximizing disk-space efficiency by leveraging a system of references – multiple tables that store data only once and refer to it multiple times – let’s take a look at another system of storage that provides a different set of efficiencies.

JSON Document Structure

MongoDB is NOT a JSON database… I like to say that right out of the gate.  Sometimes people will inaccurately report that MongoDB is a JSON database or that it stores data in JSON.  It does not.  It does, however, support JSON fully.

MongoDB stores data in BSON.  There’s a full specification over at BSONspec.org if you’re interested in the gory details.  What’s the difference, you might be asking?  Hang on… we’ll get there.

Let’s start with a view of the difference between how we store data in the relational world, vs. how we store data in JSON/BSON.

First, a bit of terminology to make sure we’re all on the same verbal page.

RDBMS              MongoDB
Table              Collection
Row                Document
Column             Field
Secondary Index    Secondary Index
Joins              Embedded documents, linking, $lookup & $graphLookup
GROUP_BY           Aggregation Pipeline

Now, if you’re like me and have developed applications designed to run against a relational database backend, you’ll naturally begin to think about the data elements you’ll manage in your applications and break them into distinct types… maybe even calling them tables… defining the columns for each different piece of data you’ll store and manage.  Further, you’re likely to start thinking about multiple tables for very different types of information.  For example, People and Cars.  If we’re developing an application that will manage people and the cars they own, you’ll likely end up with something that looks like the following:

Now this is quite logical, especially in light of the fact that you’ve likely been devising these relational schemas for quite some time.

Now, to create this structure, we need to write some DDL, or Data Definition Language.  This, in relational parlance, is how we create a schema.  In an RDBMS, the schema lives separately from the data.

View SQL Schema

This is part of the problem associated with relational technologies.  All of that definition language above is not needed if we don’t have a schema… if we don’t care about establishing constraints and column definitions ahead of time.

Instead, we can immediately concentrate on creating documents right in our code.  Let’s look at a simple example using NodeJS.

This simple example will insert one document into a collection called peoplecars in a database also called peoplecars.
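Here’s a minimal sketch of what that might look like with the official NodeJS driver – keep in mind that the connection string, field names, and values below are my own illustrative assumptions, not a prescription:

// A sketch only – the connection string, names, and values are illustrative assumptions
const { MongoClient } = require('mongodb');

async function main() {
  const client = new MongoClient('mongodb://localhost:27017'); // assumes a local mongod
  try {
    await client.connect();

    // Database and collection are both named "peoplecars"
    const peoplecars = client.db('peoplecars').collection('peoplecars');

    // One document describing a person and the cars they own
    const result = await peoplecars.insertOne({
      name: 'Jane Doe',
      cars: [
        { make: 'Honda', model: 'Civic', year: 2016 },
        { make: 'Ford', model: 'F-150', year: 2012 }
      ]
    });

    console.log('Inserted document with _id: ' + result.insertedId);
  } finally {
    await client.close();
  }
}

main().catch(console.error);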

The document looks like this:
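Something along these lines – an illustrative document of my own making (the _id is generated automatically by the driver):

{
  "_id": ObjectId("5c8e8e8e8e8e8e8e8e8e8e8e"),
  "name": "Jane Doe",
  "cars": [
    { "make": "Honda", "model": "Civic", "year": 2016 },
    { "make": "Ford", "model": "F-150", "year": 2012 }
  ]
}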

This simple example was written in NodeJS, but know that there are drivers for virtually every modern language.  Here are links to just a few:

I hope you found this introduction useful.  If you have questions or want to learn more, reach out!  Use the comment box or contact me on Twitter.


MongoDB Core Concepts

Maybe you’re a technical professional who’s done work with only relational databases… Oracle, SQL Server, MySQL, etc.  Maybe you’ve just heard of NoSQL databases but haven’t had the chance to dive in and understand what, exactly, these modern data storage mechanisms are all about.

The purpose of this article is to provide a high-level understanding of exactly what MongoDB is, and why solutions like MongoDB exist.

To understand why MongoDB exists, we need to go back in time to the 1970s and 1980s, when relational technology was developed.

SQL was initially developed at IBM by Donald D. Chamberlin and Raymond F. Boyce in the early 1970s. This version, initially called SEQUEL (Structured English Query Language), was designed to manipulate and retrieve data stored in IBM’s original quasi-relational database management system, System R, which a group at IBM San Jose Research Laboratory had developed during the 1970s. The acronym SEQUEL was later changed to SQL because “SEQUEL” was a trademark of the UK-based Hawker Siddeley aircraft company.

In the late 1970s, Relational Software, Inc. (now Oracle Corporation) saw the potential of the concepts described by Codd, Chamberlin, and Boyce, and developed their own SQL-based RDBMS with aspirations of selling it to the U.S. Navy, Central Intelligence Agency, and other U.S. government agencies. In June 1979, Relational Software, Inc. introduced the first commercially available implementation of SQL, Oracle V2 (Version 2) for VAX computers.

After testing SQL at customer test sites to determine the usefulness and practicality of the system, IBM began developing commercial products based on their System R prototype including System/38, SQL/DS, and DB2, which were commercially available in 1979, 1981, and 1983, respectively.

If you think about the 1970s from a financial perspective – how much did the elements of an application cost?  Well, there’s the computer – disk, CPU, and memory.  Each of these elements was much more expensive back then.

In fact, let’s look at the cost of hard disk specifically.

Price of a Gigabyte by Year

Year    Price per GB
1981    $300,000
1987    $50,000
1990    $10,000
1994    $1,000
1997    $100
2000    $10
2004    $1
2010    $0.10

And then there’s the developer or database administrator – the amount of money paid to these individuals to design, develop, or maintain the database.  This part of the equation was much cheaper then than it is today.  Let’s dig into this a bit.  To understand how differently we (computer programmers, developers, and DBAs) are compensated between the 1980s and today, let’s look at two key factors – the rate of pay (then, and now) as well as the U.S. rate of inflation.  First – finding the rate of pay for a computer programmer from the 1980s proved difficult – but I did find one source which listed the average weekly earnings for a computer programmer at $472 per week, which works out to roughly $24k per year.

Source: https://www.bls.gov/opub/mlr/1985/01/rpt1full.pdf

Now, if we calculate the impact of inflation on this number, we get to roughly $71k per year.

Source: http://www.in2013dollars.com/1980-dollars-in-2017?amount=24000

This may not be the most scientific method – but let’s assume I’m within a few thousand dollars.

Even if we’re at the 25th Percentile ($60k) today, we’re still earning more than 27% more for doing the same work.  That’s a sizable increase.  At the high end, we’re earning more than 82% more for the same jobs.

So, why go into this detail?

We, as DBAs and developers, are earning more than ever before, and the costs incurred by having us work on applications represent a larger slice of the overall cost pie.  Therefore, it only makes sense that we should be leveraging systems that maximize the efficiency of our most expensive resource… us, rather than the infrastructure.

It just doesn’t make sense to use a system that’s focused on reducing the number of bits and bytes stored at the cost of developer and DBA time.


Sizing MongoDB: An exercise in knowing the unknowable

Into the Unknown: How many servers do I need?  How many CPUs, how much memory?  How fast should the storage be?  Do I need to shard?

As part of my job as a Solutions Architect, I’m asked to help provide guidance and recommendations for sizing infrastructure that will run MongoDB databases.  In nearly every case, I feel like Nostradamus.  That is to say, I feel like I’m being asked to predict the future.

In this article, I’ll talk about the process I use to get as close to comfortable with a prediction as possible – essentially, to know the unknowable.

Charting the Unknown

Let’s start out with some basic MongoDB sizing theory.  In order to adequately size a MongoDB deployment, you need to understand the following:

  1. Total Amount of Data that will be stored.
  2. Frequently accessed documents
  3. Read / Write Profile of the application
  4. Indexes that will be leveraged by the application to read data efficiently

These four key elements will help you build what is known as the Working Set.  The Working Set is the frequently accessed data, plus the indexes, that you will try to fit into RAM.

Wait a minute… how can it be unknowable?  How is it possible that I’m not able to know my performance requirements?

Ok – this may be an exaggeration, or at least a bit of hyperbole, but if you’ve ever completed an exercise in MongoDB sizing for a live production application, you’ll completely agree, or at least understand.

The reason I chose the word “unknowable” is that it’s literally impossible to know every possible data point required to ensure that your server resources meet or exceed the requirements 100% of the time.  This is because most application environments are not closed.  They are changeable, and in many cases we are at the mercy of an unpredictable user population.

The best we can hope for is close.  The rest, we will leave up to the flexible, scalable architecture that MongoDB brings to the table.

When it comes right down to it, there are a lot of things we know… or at least can predict with pretty good accuracy when it comes to an application running in production.  Let’s start with the data.  Here’s where we employ good discovery techniques.

To understand how MongoDB will perform, you must understand the following elements:

  • Data Volume – How much data will our application manage and store?
  • Application Read Profile – How will the application access this data?
  • Application Write Profile – How will the data be updated or written?

Data Volume

How much data will you be storing in MongoDB?  How many databases?  How many collections in each database?  How many documents in each collection?  What size will the average document be in each of these collections?

This requires a knowledge of your applications, of course.  What data will the applications be managing?  Let’s start with an example.  People and Cars are elements of data to which everyone can relate.  Let’s imagine we’re writing an application that helps us keep track of a group of people (our users) and their inventory of cars.

To start, let’s look at the projected number of users of our application: how many users’ car inventories will we be managing with our application and database?  Assume we’re going large scale and we expect to take on approximately 1 billion users.  Each user will own and manage approximately 2-3 automobiles and a few service records for each car.

The better we understand our app as data modelers, the better chance we have of deploying resources in support of the database that match the application’s requirements.  Therefore, let’s dig a bit deeper to understand the data volumes, and let’s look at the documents.  What do the People documents and the Cars documents actually look like?  In its most simple form, our document model may look something like the following.

PeopleCars: Avg Doc Size: 1024 bytes; # of Docs: 1b

In this example, I’m expressing the relationship between people and their cars through embedding.  This leaves us with a single collection for People and their Cars.  In reality, you may require a more diverse mix – so let’s include a linked example collection for service records.  Imagine for our purposes that each person will have, on average, 10 service records.

Service: Avg Doc Size: 350 bytes; # of Docs: 10b

Here’s what our architecture might look like visually:

Let’s do the math:  (1b users * 1024bytes) + (10b * 350bytes) =

1024000000000 + 3500000000000 = 4524000000000 = 4.524TB

Given the estimated users, their cars and their service records, we’ll be storing approximately 4.5TB of data in our database.   As stated previously, the goal in sizing MongoDB is to fit the working set into RAM.  So, to get an accurate assessment of the working set – we need to know how much of that 4.5TB will be accessed on a regular basis.

In many sizing exercises, we’re asked to estimate.  Let’s assume that at any given time during any given day, approximately 30-40% of our user population is actively logged in and reviewing their data.  That would mean that we would have 35% * 1b users * 1024 bytes (user documents) plus 35% * 10b service docs * 350 bytes, or…

(.35 * 1b * 1024) = 358400000000
plus
(.35 * 10b * 350) = 1225000000000
———————————————-
equals
1583400000000 or 1.6TB

The last bit of information we need to understand is the indexes we’ll maintain so that the application can swiftly and efficiently access the data.  Let’s assume that for each People document and each Service Record document we’ll maintain approximately 100 bytes of index… or 11b * 100 bytes = 1.1TB.

So our total working set will consist of 1.6TB of frequently accessed data and 1.1TB of index for a total working set size of 2.7TB.
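If you like to sanity-check the arithmetic in code, here’s a rough sketch of the working-set math using the assumptions above (the constants are just the same estimates we made, nothing more):

// Working-set estimate – a rough sketch using the assumptions from this article
const TB = 1e12;

const activeFraction = 0.35;    // ~30-40% of users active at any time
const userDocs       = 1e9;     // 1 billion People documents
const userDocSize    = 1024;    // bytes
const serviceDocs    = 10e9;    // 10 billion service records
const serviceDocSize = 350;     // bytes
const indexPerDoc    = 100;     // assumed bytes of index per document

const hotData = activeFraction * (userDocs * userDocSize + serviceDocs * serviceDocSize);
const indexes = (userDocs + serviceDocs) * indexPerDoc;

console.log((hotData / TB).toFixed(2) + 'TB of frequently accessed data');        // 1.58TB
console.log((indexes / TB).toFixed(2) + 'TB of indexes');                         // 1.10TB
console.log(((hotData + indexes) / TB).toFixed(2) + 'TB total working set');      // 2.68TB (~2.7TB)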

Application Read / Write Profile

Your application is going to be writing, updating, reading and deleting your MongoDB data.  Each of these activities is going to consume resources from the servers on which MongoDB is running.  Therefore, to ensure that the performance of MongoDB is acceptable, we should really understand the nature of these actions.

How many reads?  How frequent?

Understanding how many reads you’ll issue, what data you’ll be accessing, and how frequently you’ll be reading it is critical to ensuring that your databases have enough memory to hold these frequently accessed documents.

In many cases, knowing this will require estimation.  Here is where we’ll attempt to know the unknown.

In our example application, assume we’ll have an active user population of anywhere between 30% and 40% of the total number of users in our database – call it 35% of 1 billion users, which is 350,000,000 users.  Let’s finish out the math.  With 350m active users, MongoDB will be regularly accessing 350m user documents, each approximately 1k in size.  Additionally, each user will likely be accessing their service records – so assume each of those 350m users, each having 3 cars with at least 5 service records, accesses the system, causing the application to fetch their People document (350m of them) and all of their Service Record documents (3 cars * 5 service records, at 350 bytes each).

350m users * 3 cars * 5 service records * 350 bytes = 1,837,500,000,000 bytes, or roughly 1.84TB

How many writes?

Just as important as the reads is understanding how many writes there will be, and what their size and frequency will be.  This will probably be the most important factor in determining the disk IOPS rating you will need to support your use case.

If we continue our imaginary example, you can probably guess that the application, as I’ve described it, will not generate a great deal of write workload.  People looking at their car inventories and reviewing their service records doesn’t exactly sound like a high-bandwidth, low-latency requirement.

However, it will be in our best interest to do the math to ensure our infrastructure can support our workload.

Let’s ask some questions.  Regardless of the actual details of your application, the questions are always the same.  What is the data?  How often will it change?  How does this change impact the total data stored?

In our example case the questions will be as follows:

  • How often will users be added?
    • 1m users per day
    • With 1m user additions, we’ll be looking at a daily incremental storage requirement of 1m * 1024 bytes, or 1GB.  This incremental value is likely negligible for most disk subsystems.
  • How often will service records be added?
    • 10m service updates per day
    • With 10m service updates, we’ll need to support a daily incremental storage requirement of 10m * 350 bytes, or 3.5GB.  Again – not monumental.

With both people and service records, we’re going to need to ensure that our infrastructure can support a write profile of at least 5GB per day.  The next logical question to ask is WHEN are these updates completed?  Based on what we know about our data and our application, the users will most likely come in at random periods – but let’s say we don’t want to make any assumptions and we want to understand what kind of load this will place on our disks.

We typically measure write performance in terms of IOPS – Input/Output Operations Per Second.  To understand how much data we’ll be able to move at a given IOPS rating, consider the following:

IOPS * TransferSizeInBytes = BytesPerSec
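As a hedged sketch, we can turn our roughly 5GB-per-day write estimate into an average throughput and IOPS figure.  The 4KB transfer size and the assumption that writes are spread evenly across the day are mine – real workloads will have peaks:

// Rough write-throughput sketch – assumes writes are spread evenly over the day
const bytesPerDay   = 5e9;     // ~1GB of new People docs + ~3.5GB of service records, rounded up
const secondsPerDay = 86400;
const transferSize  = 4096;    // assumed average transfer size in bytes

const bytesPerSec = bytesPerDay / secondsPerDay;   // ~57,870 bytes/sec
const iops        = bytesPerSec / transferSize;    // BytesPerSec / TransferSize = IOPS

console.log(Math.round(bytesPerSec) + ' bytes/sec, roughly ' + Math.ceil(iops) + ' IOPS on average');

Even allowing a generous multiple for peak periods, that’s a very modest write load compared to the IOPS figures below.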

Let’s take a look at what modern disk subsystems can accomplish in terms of IOPs.

  • HDDs: Small reads – 175 IOPS, Small writes – 280 IOPS
  • Flash SSDs: Small reads – 1075 IOPS (6x), Small writes – 21 IOPS (0.1x)
  • DRAM SSDs: Small reads – 4091 IOPS (23x), Small writes – 4184 IOPS (14x)

For this exercise, we’ll assume the fastest disks available for random, small write workloads: DRAM SSDs.

To Shard or Not to Shard

In order to determine whether or not we will need to shard, or partition, our database, we need to figure out whether we’ll be able to provision a server with enough RAM to support our working set.

Do you have servers with RAM in excess of 2.7TB?  Probably not.  Then let’s take a look at sharding.

What is sharding?
Sharding is the process of storing data records across multiple machines and is MongoDB’s approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide an acceptable read and write throughput.

The most common goal of sharding is to store and manipulate a larger amount of data at a greater throughput than that which a single server can manage.  (You may also shard, or partition your data to accomplish data locality or residency using zone-based sharding… but we’re going to leave that for another article.)

To determine the total number of partitions, we’ll roughly divide our total working-set size by the amount of memory available in each server we’ll use for a partition.

If you’re fortunate enough to be ordering server hardware prior to deploying your application, spec the order so that each server has the most RAM you can afford.  This will limit the number of shards and enable you to scale in the future should it be required.

For the sake of this exercise, let’s assume our standard server profile is equipped with 256GB of RAM.  In order to safely fit our working set into memory, we would want to partition the data in such a way that we create (2.7TB / 256GB), or 11 partitions (rounded up, of course).
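The back-of-the-envelope version of that calculation looks like this (256GB per server is just our assumed standard profile):

// Shard-count sketch – divide the working set by per-server RAM and round up
const workingSetBytes = 2.7e12;   // ~2.7TB working set estimated above
const ramPerServer    = 256e9;    // assumed 256GB standard server profile

const shards = Math.ceil(workingSetBytes / ramPerServer);
console.log(shards + ' shards');  // 11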

In future articles, we’ll discuss in further detail the process of determining exactly how to partition or shard your data.

Conclusion

In summary, we’ve answered the question, how do we go about sizing for a MongoDB deployment – or – how do I go about coming to know the unknowable?  We looked at the data, and the access patterns of that data.  We worked through an example and found that there are really no shortcuts – we must understand the data and how it will be manipulated and managed.

Lastly, we came to a conclusion – an educated guess about the number of servers, and the amount of RAM required for each.  I want to stress that sizing MongoDB is part art and part science.  You can rarely, if ever, get all of the facts, so to bridge the gap of uncertainty we use educated guesses and then we test… we search for empirical data to support our hypotheses, and we test again.  You will do yourself a great disservice if you try to size your MongoDB deployment and neglect this fact.  You must test your sizing predictions and adjust where you see deviations in the patterns associated with your application – or your test harness.

If you have a challenge or a project in front of you where you need to deploy server resource for a new MongoDB deployment, let me know.  Reach out via the contact page, or hit me up on LinkedIn and let me know how you’re making out with your project.


Moving from Tables to Documents with MongoDB

I’m going to ask you to set aside your concept of “proper data modeling” and “3rd normal form.”  Going forward in this article, those concepts will hold you back.  Some DBAs and data modelers become angry at this suggestion.  If that’s you… welcome – but please hold judgement until you’ve read this complete post.

Data normalization focuses on organization of the data for the purpose of eliminating duplication.  It’s a data-focused approach.  Not an application-focused approach.

With document data modeling, we flip the script and we turn to our application to answer the questions about how we organize the data.  Before we get into exactly how to do this, let’s delve a bit into the possible structures of a document.

MongoDB is document-based.  That is to say, it stores data in documents.  JSON-like documents, to be more specific.  Most people might think of Microsoft Word documents or PDF documents when I mention the word document – and while it is true that MongoDB can store these types of documents, what I’m really talking about is JSON documents.

JSON

JavaScript Object Notation (JSON)  is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate.  It is based on a subset of the JavaScript Programming Language.

When you store data in JSON documents, you approach data modeling quite differently than with relational technologies.  You see, relational technologies were developed in the 1970s and 1980s, when disk space was extremely expensive.  Thousands, and even tens of thousands, of dollars per gigabyte of disk was not unusual early on.  So to preserve this most valuable resource, relational technologies developed the concept of data normalization, with a set of rules.

Normalization

Normalization is the systematic method of deconstructing data into tables to eliminate redundancy and undesirable characteristics like Insertion, Update and Deletion Anomalies.  It is a multi-step process that puts data into tables: rows and columns, by removing duplicated data from the tables.

Normalization is used for mainly two purposes:

  • Eliminate redundant data.
  • Ensure data dependencies make sense, i.e., data is logically stored in line with the core objectives stated above.

Normalization techniques and the rules associated with them are all well and good if you intend to leverage a relational database technology.  However, as discussed, MongoDB is document-based… i.e., non-relational.

That is not to say that you cannot define and maintain relationships between data elements in your document model.  However, this is not a primary constraint when building a document-based data model.

Rich Data Structures

JSON documents are relatively simple structures.  They begin with a curly brace and end with a curly brace.  In between these braces, you have a set of key/value pairs delimited by commas.  Here’s an example:
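Here’s a made-up example of a simple document – the field names and values are mine, purely for illustration:

{
  "name": "Jane Doe",
  "title": "Solutions Architect",
  "yearsOfExperience": 12
}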

In my example, I’ve tidied things up using indents (spaces before the keys), but this is not necessary.  The above example is extremely simple.  These structures can get quite complex and rich.  The above example includes keys and values.  The keys, in all cases with JSON, are strings.  The values, however, can be strings, numbers, decimals, dates, arrays, objects, arrays of embedded objects, and so forth.  Let’s look at a more complex example:
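Here’s an illustrative example, written the way the mongo shell would display it – again, the fields and values are invented:

{
  "_id": ObjectId("5c8f1a2b3c4d5e6f70818283"),
  "name": "Jane Doe",
  "memberSince": ISODate("2015-06-01T00:00:00Z"),
  "accountBalance": NumberDecimal("1249.50"),
  "addresses": [
    { "type": "home", "city": "Buffalo", "state": "NY" },
    { "type": "work", "city": "New York", "state": "NY" }
  ],
  "favoriteColors": [ "blue", "green" ]
}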

As you can see, the values don’t have to be simple strings or numbers.  They can be quite complex.  Now, if you’re aware of JSON, you might be saying something like – wait a minute, JSON only supports strings, numbers, booleans, nulls, arrays, and objects… there’s no date or decimal type… and you’d be correct.

At the beginning of this article, I stated specifically that MongoDB stores data in JSON-like documents.  We actually store the data in BSON documents.  BSON is a binary representation of the JSON document.  You can read all about this standard at bsonspec.org.

We use BSON so that we can honor the types not supported by JSON… to make it easier for developers to store rich data types and not have to marshal them back from non-native forms.  When you write a decimal in MongoDB and then read it back, it comes to you via the driver in decimal form.
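As a hedged sketch with the NodeJS driver (the database, collection, and field names are made up), a decimal survives the round trip intact:

// A sketch only – database, collection, and field names are made up
const { MongoClient, Decimal128 } = require('mongodb');

async function run() {
  const client = new MongoClient('mongodb://localhost:27017'); // assumes a local mongod
  try {
    await client.connect();
    const prices = client.db('store').collection('prices');

    // Write a true decimal value – no floating-point rounding on the way in
    await prices.insertOne({ sku: 'ABC-123', price: Decimal128.fromString('19.99') });

    // Read it back – the driver hands you a Decimal128, not a lossy double
    const doc = await prices.findOne({ sku: 'ABC-123' });
    console.log(doc.price.toString()); // "19.99"
  } finally {
    await client.close();
  }
}

run().catch(console.error);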

Now that we understand a bit about how MongoDB stores and organizes data in document structures, let’s address migrating data from a relational structure to a document-based data model with MongoDB.

Let’s use the obligatory Books and Authors example, not because it’s stunningly brilliant, no – because I’m lazy and it happens to be something to which we can all relate.

Consider the following ERD.

In this simple example, we have two tables.  Authors, and Books.  There is a relationship expressed between these two tables in that Books have an Author.  Rather than storing this data together, we’re storing it separately and expressing the relationship through LINKING.

With MongoDB, we can store this very same information, but instead of linking between two separate locations, we can EMBED the same data.  Consider the following:
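Here’s a sketch of what that embedded structure might look like – the specific fields are my own illustration:

{
  "_id": ObjectId("5c8f1a2b3c4d5e6f70818284"),
  "firstName": "Ernest",
  "lastName": "Hemingway",
  "books": [
    { "title": "The Old Man and the Sea", "published": 1952 },
    { "title": "A Farewell to Arms", "published": 1929 }
  ]
}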

In this example, we’ve designed a document structure by creating an actual document.  Notice in the previous, relational example, we created an ERD, or Entity Relationship diagram.  This same ERD may be useful for us as we model our data in documents… but the difference is that with MongoDB, there is no separate, distinct schema.  The schema does not live separately from the actual documents.

In atomic and molecular physics, there’s a concept known as the observer effect.  It applies here to the concept of a schema with MongoDB.  If you don’t look at the data, the schema does not exist.  It’s not until you observe the data that you see a schema – defining which keys and values you have – truly exists.

Now, you may begin to wonder something along the lines of: what if a data element in an embedded subdocument changes?  What if an element such as book title changes?  Unlikely, I suppose, but possible.  And since we’re storing book titles inside of an Author record, and possibly even storing the very same information – book title, description, etc. – in another collection specific to those data elements, how will we address this change?  ARE YOU SAYING WE MAY HAVE TO UPDATE THE DATA MORE THAN ONCE!?!

Yes.

Calm down.  We’re not under the same constraints as relational developers.  We own the destiny of our document structures as well as the content.

We can change data multiple times, in multiple locations.

But… but, but, that’s wrong.  I feel your terror.  It’s not wrong because we don’t adhere to data normalization rules.  Who cares?  Who cares if we store data in multiple locations – we are the czars of our data and we control when it’s written with our code.  We are no longer beholden to a schema-wielding DBA.  We are that DBA.  If this feels wrong, you’re not alone.  But trust me, the benefits of this approach far outweigh the drawbacks.

Benefits of the Document Model Approach

Benefit One: Data Locality – Data that’s accessed together is stored together

When we toss out normalization, we gain the notion of “data that’s accessed together is stored together”, otherwise known as data locality.  An Author document contains all relevant details about an author, including the books he or she has written.  When my application needs this data, it issues a read, and in most cases a single read fetches ALL of the data needed.  With relational, or normalized, data, a single read gets me perhaps a single row in a single table, and then I need to issue another read to get the related data from an additional table.  Multiple reads equals multiple disk seeks equals slower performance for my application.
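As a rough sketch (assuming db is an already-connected Db handle from the NodeJS driver, and an authors collection shaped like the embedded example above):

// Assumes `db` is an already-connected Db handle from the NodeJS driver,
// and an authors collection shaped like the embedded example above.
const author = await db.collection('authors').findOne({ lastName: 'Hemingway' });

// Everything the page needs arrives in one read – no JOIN, no second query
console.log(author.books.map(book => book.title));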

Benefit Two: Readability

When all the data that’s accessed together is stored together, it’s just logical – it makes sense – you can see it all at once in a document.  Whereas with relational technologies, you must issue SQL commands with JOIN clauses to pull data from multiple locations.  Much less readable.

Benefit Three: Flexibility and Agility

When we store data in documents, adding, removing, or modifying the data structures is much easier.  There literally is no governing schema.  We simply modify the code we use to update the data in the database.  We have no external schema to modify.  Therefore, we gain the flexibility to make these changes without stopping the database… without issuing an “alter table” command.
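For example – a hedged sketch, again assuming the same connected db handle and authors collection as above – adding a brand-new field is just another write:

// No ALTER TABLE, no migration – just write the new field.
// Assumes `db` is an already-connected Db handle from the NodeJS driver.
await db.collection('authors').updateOne(
  { lastName: 'Hemingway' },
  { $set: { awards: ['Pulitzer Prize', 'Nobel Prize in Literature'] } }
);
// Documents that never receive this update simply won't have an awards field.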

Conclusion

In this first of a series of articles on migrating from relational to documents, we’ve looked at how data is stored in MongoDB, what documents are, and the structure of JSON and BSON, and we’ve explored just a few of the benefits.  While the examples are basic, I hope they’ve illustrated the power and flexibility of this modern approach to data storage.

In my next article, I’ll tackle a bit more challenging relational schema and convert that to documents and incorporate the code used to maintain the data.

If you’re interested in this topic but need a more structured approach to enablement and learning, MongoDB has amazing resources to help you wrap your mind around the document model.  I highly recommend MongoDB University  if you’re new – or trying to improve your knowledge of MongoDB.

Please leave a comment, ask a question or reach out to me on LinkedIn with feedback.