-Hi everyone, my name is Daniel Cooper. I am a member of the education team at MongoDB. Welcome to this talk on a complete methodology for data modeling with MongoDB. A few years ago, when I worked with relational databases, my team had a process: we would start by getting the stakeholders together and asking questions to help us understand the domain, then go back to our desks, draw the ER diagram, write some code, and apply some denormalizations. Fast forward to today: we have these modern and agile databases that let you start developing quickly. However, some of us still want a methodology to help design the schema. That is what this talk is about: introducing a simple and flexible methodology to help you model your data. This presentation is divided into four sections. In the first section, we'll highlight the differences between a document database and a tabular database. Then we'll introduce our modeling methodology, which is at the core of this presentation. We will illustrate the methodology with a complete example, and finally we'll spend a little bit of time on the last phase of the methodology, which is to apply schema design patterns. Let's talk about
the differences between a document database
and a tabular database. These are very important to understand so that we build a good mental model when modeling for MongoDB. Note that I use "tabular" to refer to what many call relational databases. I prefer "tabular" because it's a better description of traditional relational databases, where the data is stored in tables. As for relationships, they're not exclusive to tabular databases. MongoDB is very good at expressing relationships. As a matter of fact, it may even be better than traditional relational databases, as it offers more ways to express relationships. Document databases are based on the document model. We can summarize the document model by the following five attributes: data is represented as key-value pairs; polymorphism allows documents with different shapes to coexist side by side; sub-documents and arrays support relationships between entities; the representation is easy to process and read; and the format, called BSON, is very similar to JSON. You can liken a document to a row
in a table. To go from the row to the document, you would start by extracting the values in the row. These values become the values in the document. The names of the columns become the names of the fields. Putting this together, we have a simple document. Because each document carries its own list of fields, documents can have different lists of fields; we refer to that as documents having different shapes.
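To make that concrete, here is a minimal sketch, in Python with PyMongo, of two differently shaped documents coexisting in one collection; the connection string, the "vehicles" collection, and all field names are assumptions for illustration.

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").test  # assumed local server
vehicles = db.vehicles  # hypothetical collection

# Two documents with different shapes live side by side.
vehicles.insert_one({"_id": 7, "brand": "Kessel",
                     "engine": {"cylinders": 4, "fuel": "gas"}})
vehicles.insert_one({"_id": 8, "brand": "Kessel",
                     "engine": {"kw": 150, "battery_kwh": 75}})
```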
An interesting thing with a table is that the columns usually have a one-to-one relationship between them. An alternate model is to have two tables with an explicit one-to-one relationship between them. If you model the top table with a document, it easily translates to this simple representation. Translating the two-table model will still result in a single document; however, we can use a sub-document to group the information for the engine entity. Note that we could also have used this for the model at the top, if we had seen that there was an engine within those fields.
One cool thing about using a sub-document is that you can return the whole sub-object without knowing how it's structured. For example, here we're selecting, or projecting, only the engine from the document. The engine could have a different shape, with different fields; for example, it could be an electric engine, and it would still be returned by the same query.
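Continuing the sketch above, a projection on the engine field returns the sub-document whatever its shape happens to be:

```python
# Project only the engine sub-document; both engine shapes come back
# from the very same query.
for doc in vehicles.find({}, {"engine": 1, "_id": 0}):
    print(doc)
# {'engine': {'cylinders': 4, 'fuel': 'gas'}}
# {'engine': {'kw': 150, 'battery_kwh': 75}}
```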
Another type of relationship is the one-to-many, which is often modeled with two tables and a foreign key in the tabular world. For example, here the primary key for the car, 007, is also the primary key for the MongoDB document, and that's all we need, because we'll embed the one-to-many relationship between the car and the wheels inside the document. We don't need a foreign key in the document; the wheels are simply embedded in the car. An analogy would be that storing a car in a traditional relational database is like breaking it up and putting the parts on different shelves. With MongoDB, the car remains whole. This is an example of the SQL query to retrieve the car, where we have a bunch of joins that go to the different shelves, assemble all the parts, and return them to the application. With MongoDB, it's very simple: we just need something to identify the car, and the whole car comes out at once. With MongoDB, what is used together in the application is stored together in the database.
Let's look at a first example: we're going to model a blog. We have users who write articles; articles get comments, which are also written by users; and articles can be categorized and receive tags. If we were to model that in the tabular world, there would be only one solution: the one that respects the third normal form. With MongoDB and the document model, there are different solutions for the same problem. Let's say we want a solution that is oriented toward making queries on users. We would have a main collection of users, in which we put the articles for each user and then all the information about the articles themselves. This is pretty simple. Another solution may be, let's say, for a system that's making most of its queries on articles: its main entity could be the articles, in which we put information about the user and also all the information about the articles. In this case, we get duplication of the user information. If this is not something we want to have, a third solution is again oriented toward articles, where we keep all the information about the articles within this collection, but there is a separate collection for the users; in this case, we do avoid duplicating our users. Which approach is better here? These solutions can all be valid; the question is: how are we going to use the data? What are the queries?
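As a rough illustration, with invented field names, the user-oriented model and the article-oriented model with a separate users collection could look like this:

```python
# Solution oriented toward user queries: articles embedded in users.
user_centric = {"_id": "eve", "name": "Eve",
                "articles": [{"title": "Roma", "tags": ["travel"],
                              "comments": [{"by": "oscar", "text": "Nice!"}]}]}

# Solution oriented toward article queries, with users kept separate
# to avoid duplicating user information.
article = {"_id": 1, "title": "Roma", "author_id": "eve", "tags": ["travel"]}
author = {"_id": "eve", "name": "Eve"}
```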
Let's look at another example: we're going to model a social network. In the first solution, we keep all the images for our social network inside one large collection, in which all submitters put their images. We have Eve here, who's coming back from her trip to Italy; she's posting some pictures. We have Oscar, who likes to stay in Hawaii; his pictures of the island go into the same collection. Then one of their friends comes along and reads all the pictures from the people she's following. That's one solution. We could have a second solution where we think it's better to keep the images per submitter. We create a first document for Eve and put all her pictures in it when she comes back from her trip, and we do the same thing for Oscar. When their friend logs in and wants to see all the pictures, we go and read from both profiles. We could have a third solution, where we orient our documents around the followers. In this case, the first friend gets the pictures from Eve as she comes back from her trip, and Oscar also puts his pictures there; but Oscar has another follower, so we have to copy his images into that other friend's document, too. When their friends come in, they can read the images right from their own profiles. The interesting thing about this third solution is that the data is pre-aggregated, which means we get slower writes: in the case of Oscar, we need to write the images in two places. This takes more storage, and we get duplication, so those are obviously not very attractive attributes for the system. However, we get faster reads when the users come in and want to see the pictures on their home page. If the most important attribute for the system is a good user experience, so that it is really fast for a user who comes to the system and logs in, then obviously solution C would be the best. Again, like the previous example, in order to tell which one is the best model, we need to understand what the workload of our system is.
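A minimal sketch of that third, pre-aggregated solution, assuming a "feeds" collection keyed by follower and a list of followers passed in by the caller:

```python
feeds = db.feeds  # hypothetical collection, db as in the earlier sketch

def post_image(submitter, image_url, followers):
    """Fan out on write: copy the image into every follower's feed."""
    for follower in followers:
        feeds.update_one(
            {"_id": follower},
            {"$push": {"images": {"by": submitter, "url": image_url}}},
            upsert=True,
        )

# Oscar has two followers, so his picture is written in two places
# (slower writes, duplication), but each follower reads one document.
post_image("oscar", "hawaii.jpg", ["alice", "bob"])
```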
If you were going to create a schema for a tabular database, the first thing you would do is probably get in a room, define your schema, and then go write the app and write the queries. With MongoDB, as we've seen, it's important to really understand the workload first, so we identify the queries, then define the schema. Basically, we're inverting those two main steps. For the initial schema in a tabular database, you usually try to get something that respects the third normal form, and that gives you the one possible solution. With MongoDB, as we've seen, there can be many possible solutions; this is why it's important to understand the workload. The final schema is likely to be denormalized with a tabular database: every time you want to improve performance, you usually go that route. MongoDB is likely to need fewer changes, because we care about performance right at the beginning. As for schema evolution, it's, as you know, very difficult to make complex modifications on tables; it usually requires downtime. With MongoDB, you can make nearly any kind of modification to the schema without downtime. In the end, for query performance, you don't have a choice: if you make a query that goes and touches seven tables, it's going to take time; there's no magic there. It's going to take a lot more time compared to something like MongoDB, where, if all the data is aggregated together in one document, you have to do only one read to retrieve it.
That brings us to the second section of our talk: the methodology. Before we look at the different phases of the methodology, let's just look at something really important. It's pretty much the main trade-off when you model anything, not just for databases: you usually have to choose between simplicity and performance. You would choose simplicity if you have a very small project, an effort that is done by one or two people. On the other hand, if you have a large system with a big team, you probably want to spend a little bit more time up front to design your system, and performance will probably be a lot more important, because it has a bigger impact on larger systems. Obviously, you can land anywhere in between on that spectrum. Now, if you don't know where you are on this axis for your project, I would say err on the side of simplicity. It's always easier to fix things and improve the performance of your code, your schema, or the database than it is to remove complexity from a project. Our system will have different inputs. In order to design it, maybe we will get some scenarios from a requirements document. We may have production logs, if we're migrating the system from a relational database to MongoDB, or we have a prototype that generated logs we can use. We may have a business domain expert, someone who really knows how things should be working, that we can interview and get some information from, and hopefully we also have a data modeling expert who is going to help us put everything together. We feed some of these inputs into our first phase. As you may have guessed, the first phase is describing the workload; as we said earlier, we need to understand the workload in order to be able to model correctly. So, we're going to use
the opportunity in that phase to look at things like sizing the resources and the machines, but the most important thing will be to list all the operations, and then to quantify and qualify those operations. One more thing that's very interesting to get out of that first phase is the assumptions. The assumptions are things you're not sure about, but for which you're still going to try to put a number down. Let's say we assume that the relationship between an entity and something else will be at most 1,000. Well, if it happens, in the course of the development of the project or later in operation, that 1,000 ends up being the wrong number, and it should really be a million, that's a flag: we should go back to the board, see the impact of that change on the overall schema, and see if a lot of changes need to be made.
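Purely as an illustration of what the output of this phase could look like, here is the workload of the upcoming example captured as a simple list; the exact structure is an assumption, and the numbers come from later in this talk.

```python
# A sketch of a workload description: operations quantified and qualified.
workload = [
    {"op": "shelf weight event", "kind": "write", "rate_per_sec": 1,
     "qualify": ["critical"]},
    {"op": "coffee-cup event", "kind": "write", "peak_per_sec": 833,
     "qualify": ["non-critical"]},
    {"op": "shipping report", "kind": "read", "max_latency_sec": 60,
     "qualify": ["needs latest data"]},
]
assumptions = ["relationship cardinality capped at 1,000"]  # revisit if wrong
```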
Once we have that, we're going to go and model the relationships. We're going to identify the relationships, as in an ER diagram if you want, and we're going to quantify them. The quantify part is interesting here because, as I said, you may have a relationship that goes from one to a million, or one to ten million. In the world of big data, this is frequent, and it has a major impact on your design. If you identify those early, you're going to make better decisions. Then you're going to answer the most important question in the modeling phase, which is: should you embed or link the data? For every single relationship between two pieces of data, you can keep them together in the same document, or you can put them into different collections, and that's what you're going to resolve in that phase.
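As a hedged illustration of that choice for a one-to-many relationship, with invented store and shelf fields:

```python
# Embed: the shelves live inside the store document -- one read gets all.
store_embedded = {"_id": 42, "city": "Austin",
                  "shelves": [{"shelf": 1, "bags": 12},
                              {"shelf": 2, "bags": 8}]}

# Link: shelves in their own collection, referencing the store -- smaller
# documents, but reading a store and its shelves takes several reads.
store = {"_id": 42, "city": "Austin"}
shelf = {"_id": 1, "store_id": 42, "bags": 12}
```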
Coming out of that phase, you're going to have the model; you're going to have the number of collections. It's not going to match your number of tables; hopefully, you're going to have many fewer collections than you would have had tables or entities. But this still feels pretty relational, and this is where the third phase, which is applying patterns, comes into the picture. A pattern is a transformation that helps you improve performance and make your data easier to access. What we need to do there is to be able to recognize the situations in which we should apply a pattern, and then apply it.
As we said earlier, we want this to be a flexible methodology. It's very easy to use MongoDB: you don't really need a process, you don't need to create tables or anything; you can just go and start writing code and put stuff in the database. You can apply this methodology and still keep things very simple. If you are really striving for simplicity, then out of your first phase you may not have to identify all the operations; as long as you identify the most frequent ones, that's going to be really helpful. You could embed nearly everything and use very few references; embedding tends to simplify things. You have larger objects with more information; it just may take a bit more resources, more memory, a little bit more time to be sent back to the client, things like that, but it's going to be simpler and much easier to manage. Then, as for the patterns: as I said, these are transformations, usually for performance, and they do bring some aspects of, maybe, duplication, so you're probably not going to use many of them. On the other hand, if what's important for you is really performance, then you should do all the phases with all the steps: you should really identify all the operations, quantify all of them, and qualify them. You'll probably get a mix of embedding and linking, in order not to use too many resources, and apply more patterns. As you may guess, there's everything in between, where you don't need to do everything; you do whatever is right for the level of your project.
That brings us to the third section of our presentation. What we're going to do is start a franchise of coffee shops. We're going to name our coffee shop Beyond the Star Coffee. We're going to start with 10,000 stores in North America; if that works well, we're going to expand to the rest of the world. Our keys to success will be having the best coffee in the world and the best technology. How do you get the best coffee in the world? Well, here is an example from a little coffee shop in Sydney; they have this on their website. It's a pretty strict and precise process for making coffee: everything is measured correctly, perfectly. We're going to be inspired by that. But we can also have very intelligent coffee machines. We'll be able to measure everything we saw on the previous page and send this information to our servers to do some analysis. We'll also have intelligent shelves in all the stores: every time we deliver coffee and put it on a shelf, or remove a bag to put it in a machine, the shelf will sense that something has changed and send a signal back to our system, so we'll always have a real measure of the inventory. Then, we're going to use the best intelligent data storage out there, which, as you all know, is MongoDB.
Okay, our first phase is to describe the workload. We will first get the list of operations. For example, every time the weight on a shelf changes because we added or removed coffee, we're going to send a write to our system, and we're going to use the data we receive to run some queries once in a while, to see how much coffee we have to ship to all the stores in the coming days. We'll do some analytics on the data. I'm not sure what we can find, but there's probably something interesting; we're going to get the best data scientists, and I'm sure they'll find very interesting stuff. Then, we also need to send writes to our system every time we make a cup of coffee, with all the data we want to capture about temperature, weight, and all that. And for the same data analysis, we're going to crunch some numbers on our coffee cups, just to be sure that we're on top, that we always have the best coffee, and that we always try to improve it. We also need to be able to read some of our data to help our franchisees. Those are, in this little example, some of the important operations that we need to care about for our system. Now, the next part is to quantify and qualify those.
Going back to our first query: if we look at the number of times an event happens for a shelf, the number of shelves we have per store, and the number of stores, and calculate all that, it gives us about one write per second, which is obviously not much. We can say that as long as this write is done within one second, the hardware, the shelf, will be fine, and we will also label this write as critical. This is what I mean by qualification: some aspects of the queries, of the operations themselves. I'm saying this is a critical write because we don't want to lose it, whether there's a problem with the system going down or a problem with the network. If we don't get the right number of coffee bags per shelf, we may end up in a bad situation. It's really important to keep this information safe all the time, and for those of you who have a little bit of experience with MongoDB, you know that this would basically translate to ensuring that this write gets done with a majority write concern.
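With PyMongo, for instance, that qualification could translate into something like this; the "weighings" collection and its fields are assumptions:

```python
from pymongo import MongoClient, WriteConcern

db = MongoClient("mongodb://localhost:27017").test  # assumed local server

# Critical write: acknowledged only once a majority of replica-set
# members have it, so a single node failure cannot lose it.
weighings = db.get_collection("weighings",
                              write_concern=WriteConcern(w="majority"))
weighings.insert_one({"shelf": 1, "store_id": 42, "weight_g": 18200})
```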
For the second query here, we're going to be running something on our side, and it may take a lot of time; up to 60 seconds will be fine. It's a report that tells us where we have to ship coffee. What's important here is that we want to be sure we get the latest information. I don't want something that is a few seconds or a few minutes stale, because maybe there's a bag of coffee that was just removed, and I want to be as close as possible to reality when I decide how much coffee to ship. As for running the analytics, they don't really need the information to be absolutely up to date. The data may be a little stale; I'm not sure exactly what staleness is acceptable here, it could be a few seconds or a few hours, depending on what they're trying to do. But they will also be doing a lot of collection scans on the data to crunch all of it, and that is a very good indicator that maybe you should not be running those queries on your main server, where your write workload is. This is really an indicator that you may want an additional node just to run the analytics. And probably the most important query, as most of you would have guessed, is this one.
e. This is obviously the one that will
generate the most traffic is if we send the write event for every
single cup of coffee we're making. Now, one thing to notice here
is that this is a noncritical write because the only thing I'm going
to be doing with that data is some analysis and even if I lose one
or two writes because there's no way network failure, well, it's
not going to affect me that much. I could label those
as noncritical write, meaning that maybe my coffee machine doesn't have to
have an acknowledgment that the write would
be protected forever. This just the data can
be treated differently, it's important to really
understand that as you're doing your initial design,
this decision may help you a lot. One thing to note here
is that the previous one, line four is the one
that generate the most write but it's not okay
to just look at the load and expect a load that's
going to be constant. Usually, if you want to
size something correctly you really have to think
about the pe
ak period. If you're in e-commerce
If you're in e-commerce, the peak could be a certain time of day, but it may also be the few days before Christmas or Thanksgiving, Black Friday or Cyber Monday, whatever they call them. Here, what we're going to say is that 30% of the cups of coffee will be made within one hour of the day: we're going to have a rush hour where 30% of them are made. Using that as the new maximum value, we see that the peak is going to be 833 writes per second.
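The arithmetic behind a peak like that can be checked as follows; the per-store volume is an invented number chosen so the figures line up with the ones in this talk.

```python
stores = 10_000
cups_per_store_per_day = 1_000   # assumed volume, not stated in the talk
rush_hour_share = 0.30           # 30% of cups made in one rush hour

cups_per_day = stores * cups_per_store_per_day        # 10,000,000
peak_per_sec = cups_per_day * rush_hour_share / 3600  # ~833 writes/second
```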
Again, this is very small: MongoDB can handle tens of thousands of writes per second on a single replica set. That gives us a good indication that we don't need sharding for this system, at least to sustain the writes we're going to be making on our disks. Taking that query, we may want to dig a little bit more into it, especially if we were trying to really optimize everything; we may have other qualifications or quantifications to add to it. The other thing I want to do while I'm there is size the resources I'm going to need for the system.
We have two writes coming in, which means we're going to have two types of data taking up space: the coffee cups, and the events the shelves send every time their weight changes. Again, we crunch the numbers. Let's say we want to keep one year of data; multiplying by the number of bytes per coffee cup and per weighing, we end up with 370 gigabytes of space for the coffee cups and 3.7 gigabytes for the weighings. And this, like the queries, is a pattern you're going to see very often: there's usually one part of the data that completely overshadows the rest, and that's the one to focus on. If you want to optimize your space, you basically just have to look at the coffee cups; the weighings are not as important. Another interesting thing here is that I said one year of data, and one year of data gives us 370 gigabytes, which is, again, something that a single replica set should be able to handle pretty well.
i
s if you don't have a terabyte of data, you don't really need
to shard for performance usually. This would be under that,
but this is under that also because
I have one year of data. Let's say we would have said that it's
important to have 10 years of data, and we have been 3.7
terabyte and in that case,
we probably have to shard a system. Here you can make also
a very important decision if you size your things
correctly at the beginning is like,
well, is it really important for the competitive
project
to have a sharded cluster? Do we want to pay also
the price of handling all those machines
for 10 years of data. In this case, making a hard
decision at the beginning, saying I'm not going
to keep everything, that is enough for what I need to do,
will probably save you a lot
of money on the long run. Once we get that we have a better understanding of what
Once we have that, we have a better understanding of what our system is about, and we're going to go to the next phase, which is identifying the relationships. It's not because you're using MongoDB that relationships are not important. We may call it NoSQL, but you always have relationships between pieces of data.
Let's look at an example where we have actors playing in movies, and we have reviews for those movies. First, we have an actor's name and date of birth. These two have a one-to-one relationship; it's often taken for granted, because fields with a one-to-one relationship end up in the same entity, but the relationship is still there, it still exists. As for the relationship between the movie title and the actor name, that is many-to-many: those get put in different entities, and we see the many-to-many relationship between the two. The reviews apply to movies, so we have a one-to-many. If we model this with MongoDB, we still need to understand those relationships, because of what they translate to: here we still have a many-to-many relationship between actors and movies, but the one-to-many relationship between the movies and the reviews gets implemented as an array, as we've seen earlier in the first section of this talk.
I mentioned that the most important question you have to answer is whether to embed or reference your relationships. There are three types of relationships: one-to-one, one-to-many, and many-to-many. What are the implications of embedding versus referencing? Well, when you embed, for all of them, you basically end up doing just one read and you get all your data. This is much faster; it's also simpler, as it avoids doing joins. The one case to be careful with is the many-to-many relationship: if you embed the information, for example, if we embed the actor names inside the movie, we duplicate the names of the actors. But it doesn't really matter here: usually, once a movie is released with its cast, the actor information is not going to change. An actor may change their name later, but you may want to keep the name they had when they made that movie. On the other hand, when you reference, you end up doing more reads, because now you have to read both parts, the two entities; but you end up doing smaller reads. You may not need to read everything on both sides of the connection, because not everything is grouped; the part you don't need may not be accessed. You may save on the amount of data you're reading, but you're going to do more IOPS, basically.
Looking at the different entities that come out of our first phase, we have something like the coffee cups, the stores, the coffee machines, the shelves, the weighings, and the coffee bags. Those are basically the things we made queries on. An easy model could be something like this, where we keep everything in the store: the store has coffee machines and shelves, on which we have coffee bags. I'm going to keep two other collections for the coffee cups and the weighings. The reason those are separate here is that they also have a different life cycle: we said this data is going to expire after one year. Data that expires is usually a good sign that you want a separate collection for it.
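As a side note, that one-year expiration is exactly what a TTL index handles; a hedged sketch, assuming a "created_at" field on each coffee-cup document (db as in the earlier sketches):

```python
# TTL index: the server deletes each document roughly one year after
# its created_at timestamp.
db.coffee_cups.create_index("created_at",
                            expireAfterSeconds=365 * 24 * 3600)
```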
At this point, this is still pretty relational. This is where we're going to go and apply patterns to make things easier to access and faster. That brings us to the last section of the presentation: the patterns. There isn't time here to cover all the schema design patterns, so I'm going to go very quickly over some of them. We do have a lot
of other references. The first one is a series of blog posts that Ken Alger and I wrote last year, one blog post per pattern. The patterns are also covered in depth in our online class M320 on data modeling at MongoDB University. There's also another talk at this conference, MongoDB Live 2020, called Advanced Schema Design Patterns, that Justin LaBreck and I are doing. Justin is one of our best consulting engineers out there; he goes to customers and implements solutions for them. He has used these patterns many, many times. He's going to walk through a very down-to-earth example from a specific customer, applying patterns in the different sections. It's pretty cool, so please go and look for that talk; it's basically complementary to the one we're doing right now.
The first pattern I want to go over is the Schema Versioning pattern. Let's say we want to migrate a schema from version one to version two. For example, let's say we have people, and we keep track of their favorite restaurant. Then we decide the system is very successful and we should keep track of more than that: we're going to keep track of a bunch of favorite things our users have, favorite restaurants and favorite dishes. We're basically going from a scalar field to adding two tables, each with a one-to-many relationship to our people. That's a pretty heavy transformation. If you were to do that with a relational database, it would be a little bit complicated. The migration would look a little bit like this: we have a first version of the schema; we bring down the system; we reformat the first table to have less information in it, migrating the rest to our new tables; and when everything has been migrated to the second version of the schema, we can reopen the service.
Now, doing the same with MongoDB using the Schema Versioning pattern, we would go from documents in version one on the left to version two. In terms of changes to the document, the first thing is that we track the version. We could try to infer the version just from the shape, but it's easier if we simply label the version inside the document. Then we make the change from a favorite restaurant to lists of favorite restaurants and dishes. The key thing here is that with MongoDB, because you have polymorphism, you can have documents with different shapes at any given time. We're not restricted to one schema version for the whole database; every document can carry its own version. The impact is that if you do a migration with MongoDB, you start with a bunch of documents in your collection, and you want to end with all your documents in version two, while the system stays functional. We migrate the first document, then the next one, then the next. The key here is to have the application able to read and process both shapes, which is very easy: it's polymorphism, and we already know MongoDB can read documents of different shapes. You just have to have your application understand the logic of the two different shapes and do whatever is needed.
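A minimal sketch of application code handling both shapes, assuming a "schema_version" field and the favorite-restaurant example; all names are invented:

```python
def favorite_restaurants(person):
    """Read a person document regardless of its schema version."""
    if person.get("schema_version", 1) == 1:
        return [person["favorite_restaurant"]]   # v1: single scalar field
    return person["favorites"]["restaurants"]    # v2: lists of favorites

v1 = {"name": "Eve", "favorite_restaurant": "Luigi's"}
v2 = {"name": "Oscar", "schema_version": 2,
      "favorites": {"restaurants": ["Aloha Cafe"], "dishes": ["poke"]}}
assert favorite_restaurants(v1) == ["Luigi's"]
assert favorite_restaurants(v2) == ["Aloha Cafe"]
```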
That basically gives us a no-downtime situation, and this is a pretty complex migration if you look at it: we went from having a scalar field to adding two new one-to-many relationships in the schema. You could also have created a new collection; it can be much more complex than that if you want. The key thing here is being able to do that migration while staying in control. You can decide when you're going to do it; you can take the time you want to do it. You can migrate the data in batches, or you can have the application migrate on write: every time it has to make a change, it reads the previous document, makes the change, and stores back the document in the new version. You're really in control, and you can do basically any kind of migration you want without downtime. To me, this alone should be a sufficient reason to use MongoDB on a project.
Another pattern we have is the Computed pattern. We see it apply when we have many more read operations than write operations. For people who come from the relational world, this is a bit like a view, but better; you'll see why. Let's say that for each write I'm doing, I'm doing a thousand reads, and the reads need to sum some information. If you look at this diagram, what we're doing is a little bit silly, because the summation in every one of those read operations computes exactly the same thing; we may be doing the same sum a thousand times on average. That is a lot of wasted resources. The other way you can do it is that on the write operation, you add the new piece of data to your collection, you do the sum, and then you take the result, which could have been the view, and you store it in the right place. Not just in a view: if there's a document that already exists that should receive that sum, we put it there. For example, it could be a document that characterizes a movie, and we could put the total revenue of that movie directly in it. We don't need to go to another source; we now have one document that has everything in it. You're going to save a lot of resources here, because every time we have a read, we don't have to redo the sum.
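A hedged sketch of the Computed pattern, assuming hypothetical "screenings" and "movies" collections (db as in the earlier sketches):

```python
def record_screening(movie_id, revenue):
    """On the (rare) write, keep the pre-computed total up to date."""
    db.screenings.insert_one({"movie_id": movie_id, "revenue": revenue})
    db.movies.update_one({"_id": movie_id},
                         {"$inc": {"total_revenue": revenue}})

# The (frequent) reads no longer re-sum thousands of screenings:
movie = db.movies.find_one({"_id": 42}, {"total_revenue": 1})
```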
Another pattern we have is called the Subset pattern. It has to do with the management of your RAM, basically. We use the term "working set" to represent all the data and indexes you want to keep in memory all the time, because you access them very often. For any kind of system, you want to keep those things in RAM; you don't want to go to disk if you can avoid it. Well, the issue may come from the fact that, for example, here we have four large documents that we need to keep in memory, but there's only space for three. Every time we need to access the fourth one, we need to drop one from memory, go to disk, and bring that one in. That's not very effective. A better way is to look at the big documents and see which part of them the application really needs all the time. We're going to break each document into what's accessed frequently and what is not. Now, if we look at the new diagram with that, the four documents we need to keep in memory, because only a small section of each remains, fit in memory.
Then we have additional memory left to cycle in the pages we don't read that often. That's going to perform much better. That breakout can be done in two different ways. It can be done on a one-to-one relationship: I take a document, remove some fields at the root, put them somewhere else, and keep the same ID, the primary key, on both collections. Or I can use a one-to-many relationship: I have a document with an array, and I take this array, put it somewhere else, but for performance reasons keep part of it, the subset, in the first object. An example would be a movie document in which we have all the actors: we take the actors and put them in a different collection, but still keep, say, the top 10 actors in the main movie document. If you think about it, that's usually the right fit for the use case: people who look at information about a movie are interested in the top actors, not in the thousand actors and cast members who were in the movie. And if they are, you can make a second query. This is what the Subset pattern is about: breaking a one-to-one or one-to-many relationship into two collections.
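A sketch of that movie-and-actors subset, again with invented names:

```python
# The main document keeps only the subset that is read all the time.
movie = {"_id": 42, "title": "Roma",
         "top_cast": ["Actor A", "Actor B"]}  # top 10 in practice

# The full one-to-many relationship lives in its own collection.
cast_member = {"movie_id": 42, "name": "Actor Z", "role": "extra #212"}

# A second query fetches the full cast only when someone asks for it:
full_cast = db.cast.find({"movie_id": 42})
```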
The Bucket pattern is usually used by Internet of Things solutions. There, you could have one big document that keeps all the information for one given device, or you could create a little document for every single measurement your system receives. The Bucket pattern sits in between: I take not all the data, but not just one measurement either, and put that together in what we call a bucket. It looks a little bit like this. For example, this one here is a bucket per day: I have one document per device per day, and if you look, there's an array of the temperature measurements for that day. You could also do the same per hour; if you look at the date, it now includes the hour, so I get one document per device per hour. What's the right granularity? What should the size of your bucket be? One thing I usually look for: if you do a lot of aggregations on that data, and the first thing you always end up doing is an $unwind operation to take the arrays and extract the values, you probably went a little bit too far with the bucket, and you may want a smaller one. The other thing is, if you're going to do computations using the Computed pattern and put this data somewhere, you probably want whatever you're computing to match the documents. If you're interested in keeping the average temperature per hour, then having documents that are buckets per hour could be the right granularity.
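A hedged sketch of an hourly bucket for the temperature example, with assumed collection and field names:

```python
from datetime import datetime, timezone

def record_temperature(device_id, temp_c):
    """Append a measurement to this device's bucket for the current hour."""
    hour = datetime.now(timezone.utc).replace(minute=0, second=0,
                                              microsecond=0)
    db.readings.update_one(                  # db as in the earlier sketches
        {"device_id": device_id, "hour": hour},
        {"$push": {"temps": temp_c}, "$inc": {"count": 1}},
        upsert=True,
    )
```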
If we apply these patterns to our solution here: for Schema Versioning, the number of things to do is pretty simple; you just add one field as part of the procedure you follow to do your migrations. For the Computed pattern, we may want to keep the last weighing we received from a given store, and we may want to keep the total number of cups of coffee made by a given machine; this may be important for maintenance. Instead of recalculating it all the time, we update it once in a while. For the Subset pattern, we may want to keep the last 30 days of coffee cups with the store. Remember, we have a separate collection where we keep all the cups of coffee, but maybe it's interesting to keep that recent subset in the store document, so we don't have to go to the other collection and can quickly show something. And I may want to use a Bucket: in our design, we decided to send single writes for every single cup of coffee and keep separate documents, which keeps things a little simpler, but that could be something we change. We could decide that, "Hey, no, the coffee cups should be in a bucket; that would be more efficient for what we're trying to do."
Looking at all the different patterns out there and listing them across the different use cases we often see at MongoDB, we come up with this matrix. This is not exact science: if you're doing Internet of Things, for example, and there's no checkmark for a pattern, it doesn't mean that pattern would never be used there. It just shows the patterns you're most likely to see for a given use case. You should be extremely familiar with those patterns; in reality, it's not that difficult to get familiar with all the different patterns we have listed there. That brings us to the conclusion. The things I want you to
remember from this talk are these. The first is to really understand the differences between a document database and a tabular database, so you have a good mental model when you're working on your schema. The second is the main steps for modeling: it's very important to understand your workload at the beginning; you still need to understand the relationships and model them; and then you apply patterns to get better performance. Before we leave, I just want to remind everyone that we offer classes on everything MongoDB, from aggregation to replication, sharding, and data modeling, at university.mongodb.com. The classes are free; this is the best resource to learn about MongoDB. Again, thanks for listening. See you around.