
A Complete Methodology of Data Modeling for MongoDB

Are you new to schema design for MongoDB, or are you looking for a more complete or agile methodology than the one you and your team are currently following? In this talk, we will guide you through the phases of a flexible data modeling methodology that you can apply to small projects as well as large projects with very demanding requirements.


-Hi everyone, my name is Daniel Coupal. I am a member of the education team at MongoDB. Welcome to this talk on a complete methodology of data modeling for MongoDB. A few years ago, when I worked with relational databases, my team had a process: we would start by getting the stakeholders together and asking questions to help us understand the domain, then go back to our desks, draw the ER diagram, write some code, and apply some denormalizations. Fast forward to today: we have these modern and agile databases that let you start developing quickly. However, some of us still want a methodology to help design the schema. That is what this talk is about: introducing a simple and flexible methodology to help you model your data.

This presentation is divided into four sections. In the first section we'll highlight the differences between a document database and a tabular database. Then we'll introduce our modeling methodology, which is at the core of this presentation. We will illustrate the methodology with a complete example, and finally we'll spend a little bit of time on the last phase of the methodology, which is to apply schema design patterns.

Let's talk about the differences between a document database and a tabular database. These are very important to understand so we build a good mental model when modeling for MongoDB. Note that I use "tabular" to refer to what many call a relational database. I prefer "tabular" because it's a better description of traditional relational databases, where the data is stored in tables.
As for relationships, they're not exclusive to tabular databases. MongoDB is very good at expressing relationships; as a matter of fact, it may even be better than a traditional relational database, because it offers more ways to express them.

Document databases are based on the document model. We can summarize the document model with the following five attributes: data is represented as key-value pairs; polymorphism allows documents with different shapes to coexist side by side; sub-documents and arrays support relationships between entities; the representation is easy to process and read; and the format, called BSON, is very similar to JSON.

You can liken a document to a row in a table. To go from the row to the document, you would start by extracting the values in the row; these values become the values in the document. The names of the columns become the names of the fields. Putting this together, we have a simple document. Because each document carries its own list of fields, documents can have different lists of fields; we refer to that as documents having different shapes.

An interesting thing about a table is that its columns have an implicit one-to-one relationship with each other. An alternative model is to have two tables with an explicit one-to-one relationship between them. If you model the top table with a document, it easily translates to this simple representation; translating the two-table model still results in a single document. However, we can use a sub-document to group the information for the engine entity. Note that we could also have used a sub-document to model the single table at the top, if we had seen that there was an engine entity within those fields.

One cool thing about grouping information in a sub-document is that you can return the whole sub-object without knowing how it's structured. For example, if you're selecting or projecting only the engine from the document, the engine could have a different shape with different fields. It could be an electric engine, for example, and it will still be returned by the same query.
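
As a rough sketch of that idea (collection and field names are invented for illustration), here is what the car documents and the engine projection might look like in Python with PyMongo:

    from pymongo import MongoClient

    db = MongoClient()["dealership"]   # hypothetical database name

    # Two cars whose embedded "engine" sub-documents have different shapes.
    db.cars.insert_many([
        {"_id": 1, "model": "GT", "engine": {"cylinders": 6, "fuel": "gas"}},
        {"_id": 2, "model": "EV", "engine": {"kwh": 75, "type": "electric"}},
    ])

    # Project only the engine; both documents come back from the same query
    # even though their engines have different fields.
    for car in db.cars.find({}, {"engine": 1, "_id": 0}):
        print(car["engine"])
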
Another type of relationship is the one-to-many, which is often modeled with two tables and a foreign key in the tabular world. For example, the primary key for a car, here 007, is also the primary key of the MongoDB document, and that is all we need, because we'll embed the one-to-many relationship between the car and the wheels inside the document. We don't need a foreign key in the document; the wheels are simply embedded in the car.

An analogy would be that storing a car in a traditional relational database is like breaking it up and putting the parts on different shelves. With MongoDB, the car remains whole. This is an example of the SQL query to retrieve the car, where we have a bunch of joins that go to the different shelves and assemble all the parts to return the car to the application. With MongoDB it's very simple: we just need something to identify the car, and the whole car will come out at once.
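
Here is a minimal sketch of that contrast, with made-up collection and column names: in SQL the car is reassembled with joins, while in MongoDB a single find_one() returns the whole embedded document:

    # Tabular world (illustrative only):
    #   SELECT * FROM cars
    #     JOIN engines ON engines.car_id = cars.id
    #     JOIN wheels  ON wheels.car_id  = cars.id
    #   WHERE cars.id = 7;

    from pymongo import MongoClient

    db = MongoClient()["dealership"]   # hypothetical database name

    db.cars.insert_one({
        "_id": 7,
        "model": "GT",
        "engine": {"cylinders": 6},
        "wheels": [                      # one-to-many relationship embedded as an array
            {"position": "FL"}, {"position": "FR"},
            {"position": "RL"}, {"position": "RR"},
        ],
    })

    # One read brings back the whole car, wheels included.
    car = db.cars.find_one({"_id": 7})
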
With MongoDB, what is used together in the application is stored together in the database.

Let's look at a first example. We're going to model a blog here. We have users who write articles; articles get comments that are also written by users; and articles can be categorized and receive tags. If we were going to model that in the tabular world, there would be only one solution: the one that respects the third normal form. With MongoDB and the document model, there are different solutions for the same problem.

Let's say we want a solution that is oriented toward making queries about users: we would have a main collection of users, in which we put the articles for each user and then all the information about the articles themselves. This is pretty simple. Another solution may be — let's say we have a system that mostly makes queries on articles — to make the articles the main entity, in which we put information about the user and also all the information about the articles. In this case we get duplication of user information, which may not be something we want to have. A third solution is again oriented toward articles, where we keep all the information about the articles within this collection but keep a separate collection for the users; in this case we do avoid duplicating our users.

Which approach is better here? All of these solutions can be valid. The question is: how are we going to use the data? What are the queries?
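
To make the alternatives concrete, here is a hedged sketch of two of those candidate shapes (field names are invented for illustration): a user-centric collection with the articles embedded, and an article-centric collection that references users to avoid duplication:

    # Solution A: oriented toward user queries - articles embedded in users.
    user_doc = {
        "_id": "alice",
        "name": "Alice",
        "articles": [
            {"title": "Intro to BSON", "tags": ["mongodb"], "comments": []},
        ],
    }

    # Solution C: oriented toward article queries - users kept in their own
    # collection and referenced by _id, so user data is not duplicated.
    article_doc = {
        "_id": 101,
        "title": "Intro to BSON",
        "author_id": "alice",          # reference into the users collection
        "tags": ["mongodb"],
        "comments": [{"author_id": "bob", "text": "Nice post"}],
    }
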
Let's look at another example: we're going to model a social network. In the first solution, we keep all the images for our social network inside one large collection in which all submitters put their images. We have Eve here, who is coming back from her trip to Italy and posting some pictures. We have Oscar, who likes to stay in Hawaii and posts pictures of the islands; those go into the same collection. Then one of their friends comes along and reads all the pictures from the people he's following. That's one solution.

We could have a second solution where we think it's better to keep the images per submitter. We create a document per submitter and put all the pictures in it when Eve comes back from her trip, and we do the same thing for Oscar; when their friend logs in and wants to see all the pictures, we go and read from both profiles.

We could have a third solution where we orient the documents around the followers. In this case, the first friend gets the pictures from Eve as she comes back from her trip, and Oscar puts his pictures there too; but Oscar also has another follower, so we have to copy his image into that other friend's document as well. When the friends come in, they can read their images directly from their own profiles. The interesting thing about this third solution is that the data is pre-aggregated, which means we're going to get slower writes: in the case of Oscar, we need to write the images in two places. This is going to take more storage and we're going to get duplication, so those are obviously not very attractive attributes for the system. However, we're going to get faster reads when the users come in and want to see the pictures on their home page. If the most important attribute for the system is a good user experience — a user who comes to the system and logs in has to get their page really fast — then obviously solution C would be the best.
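
A rough sketch of that third, follower-oriented solution (fan-out on write; all names below are hypothetical): each new image is copied into the feed document of every follower, which duplicates data and slows the write, but makes the home-page read a single lookup:

    from pymongo import MongoClient

    db = MongoClient()["social"]       # hypothetical database name

    def post_image(author, image_url):
        # Assumes each user document carries a "followers" array.
        user = db.users.find_one({"_id": author}, {"followers": 1})
        # Copy the image into each follower's pre-aggregated feed document.
        db.feeds.update_many(
            {"_id": {"$in": user["followers"]}},
            {"$push": {"images": {"author": author, "url": image_url}}},
        )

    # Reading the home page is now one document fetch per user.
    feed = db.feeds.find_one({"_id": "frank"})
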
Again, like the previous example, in order to tell which model is best, we need to understand what the workload of our system is.

If you were going to create a schema for a tabular database, the first thing you would do is probably get in a room and define your schema, then go write the app and write the queries. With MongoDB, as we've seen, it's important to really understand the workload first, so we identify the queries and then define the schema. Basically, we're inverting those two main steps. The initial schema for a tabular database is usually one that respects the third normal form, which gives you one possible solution; with MongoDB, as we've seen, there can be many possible solutions, and this is why it's important to understand the workload. The final schema is likely to be denormalized with a tabular database — every time you want to improve performance, you usually go that route — while the MongoDB schema is likely to need fewer changes, because we care about performance right from the beginning. As for schema evolution, it's, as you know, very difficult to make complex modifications on tables; it usually requires downtime. With MongoDB, you can make nearly any kind of modification to the schema without downtime. And in the end, for query performance, you don't have a choice: if you make a query that touches seven tables, it's going to take time; there's no magic there. It's going to take a lot more time than with something like MongoDB, where all the data is aggregated together in one document and you only have to do one read to retrieve it.
That brings us to the second section of our talk: the methodology.

Before we look at the different phases of the methodology, let's look at something really important. The main trade-off when you model something — not just for databases, but for a lot of things — is that you usually have to choose between simplicity and performance. You would choose simplicity if you have a very small project, an effort done by one or two people. On the other hand, if you have a large system with a big team, you probably want to spend a little bit more time up front to design your system, and performance will probably be a lot more important because it has a bigger impact on larger systems. Obviously, you can land anywhere in between on this spectrum. Now, if you don't know where you are on this axis for your project, I would say err on the side of simplicity: it's always easier to fix things and improve the performance of your code or your database than it is to remove complexity from a project.

Our system will have different inputs. In order to design it, maybe we will get some scenarios from a requirements document. We may have production logs if we're migrating the system from a relational database to MongoDB, or a prototype that generated logs we can use. We may have a business domain expert, someone who really knows how things should work, whom we can interview to get information. And hopefully we also have a data modeling expert who will help us put everything together.

Feeding some of these inputs into our first phase — as you may have guessed, the first phase is describing the workload. As we said earlier, we need to understand the workload in order to model correctly. We'll use this phase as an opportunity to look at things like sizing the resources and the machines, but the most important thing will be to list all the operations, and for those operations, we're going to quantify them and qualify them. One more thing that is very interesting to get out of that first phase is the assumptions. Assumptions are things you're not sure about, but for which you still try to put a number down. Let's say we state that the relationship between one entity and another will max out at 1,000. Well, if it turns out, during the course of the development of the project or later in operation, that 1,000 is the wrong number — it should really be a million — that's a flag that we should go back to the drawing board, look at the impact of that change on the overall schema, and see whether a lot of changes need to be made.

Once we have that, we're going to go and model the relationships. We're going to identify the relationships, like in an ER diagram if you want, and we're going to quantify them.
The quantifying part is interesting here because, as I said, you may have a relationship that goes from one to a million, or one to ten million. In the world of big data, this is frequent, and it has a major impact on your design. If you identify those early, you're going to make better decisions. Then you're going to answer the most important question in the modeling phase, which is: should you embed or link the data? For every single relationship between two pieces of data, you can keep them together in the same document, or you can put them into different collections, and that's what you resolve in this phase. Coming out of this phase, you'll have the model and the number of collections. It's not going to match your number of tables — hopefully, you'll have many fewer collections than you would have had tables or entities — but this still feels pretty much relational, and this is where the third phase, applying patterns, comes into the picture. A pattern is a transformation that helps you improve performance and make your data easier to access. What we need to do there is recognize the situations in which we should apply a pattern, and then apply it.

As we said earlier, we want this to be a flexible methodology. It's very easy to use MongoDB — you don't really need a process, you don't need to create tables or anything; you can just go and start writing code and putting stuff in the database. You can apply this methodology and still keep things very simple. If you are really striving for simplicity, then out of the first phase you may not have to identify all the operations; as long as you identify the most frequent ones, that's going to be really helpful. You could embed nearly everything and use very few references. Embedding tends to simplify things: you have larger objects with more information, which may take a bit more resources — more memory, a little more time to be sent back to the client, things like that — but it's going to be simpler and much easier to manage. Then, as for the patterns: as I said, these are transformations, usually for performance, and they do bring some aspects like duplication, so you're probably not going to use many of them. On the other hand, if what's important for you is really performance, then you should do all the phases with all the steps: you should really identify all the operations, quantify all of them and qualify them. You'll probably get a mix of embedding and linking in order not to use too many resources, and you'll apply more patterns. As you may guess, there is everything in between, where you don't need to do everything; you do whatever is right for the scale of your project.

That brings us to the third section of our presentation. What we're going to do is start a franchise of coffee shops. We're going to name our coffee shop Beyond the Star Coffee.
We're going to start with 10,000 stores in North America, and if that works well, we'll expand to the rest of the world. The keys to success will be having the best coffee in the world and the best technology.

How do you get the best coffee in the world? Well, here is an example from a little coffee shop in Sydney; they have this on their website. It's a very strict and precise process for making coffee — everything is measured correctly, perfectly — so we're going to be inspired by that. But we can also have very intelligent coffee machines: they'll be able to measure everything we saw on the previous page and send this information to our servers for analysis. We'll also have intelligent shelves in all the stores, so every time we deliver coffee and put it on the shelf, or remove a bag to put it in a machine, the shelf will sense that something has changed and send a signal back to our system; that way we always have a real measure of the inventory. Then we're going to use the best intelligent data storage out there, which, as you all know, is MongoDB.

Okay, our first phase is to describe the workload. We first get the list of operations. For example, every time the weight on a shelf changes because we added or removed coffee, we send a write to our system, and we use the data we receive to run some queries once in a while to see how much coffee we need to ship to all the stores in the coming days. We'll do some analytics on the data — I'm not sure what we can find, but there's probably something interesting; we're going to hire the best data scientists, and I'm sure they'll find very interesting stuff. Then we also need to send writes to our system every time we make a cup of coffee, with all the data we want to capture about temperature, weight and so on, and the same goes for data analysis: we'll crunch some numbers on our coffee cups just to be sure we stay on top, always have the best coffee, and keep improving it. And we need to be able to read some of our data to help our franchisees. That is, basically, in this little example, the set of important operations we need to care about for our system.

Now, the next part is to quantify and qualify those. Going back to our first operation: if we look at the number of times an event happens for a shelf, the number of shelves per store, and the number of stores, and calculate all that, it gives us about one write per second, which is obviously not much. We can say that as long as this write is done within one second, the hardware on the shelf will be fine, and we will also label this write as critical. This is what I mean by qualification: noting some aspect of the operation itself. I'm saying this is a critical write because we don't want to lose it, whether there's a problem with the system going down or a problem with the network. If we don't get the right number of coffee bags per shelf, we may end up in a bad situation. It's really important to keep this information safe at all times, and for those of you who have a little bit of experience with MongoDB, you know this basically translates to making sure this write gets done with a "majority" write concern.
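
In PyMongo, that intent might be expressed something like this (the collection name is invented for illustration); the acknowledgment only comes back once a majority of the replica set has the write:

    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    db = MongoClient()["beyond_the_star"]   # hypothetical database name

    # Critical shelf events: wait for a majority of nodes to acknowledge.
    shelf_events = db.get_collection(
        "shelf_events", write_concern=WriteConcern(w="majority")
    )
    shelf_events.insert_one({"store_id": 42, "shelf_id": 7, "weight_kg": 12.4})
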
For the second query, we'll be running something on our side that may take a lot of time — up to 60 seconds is fine. It's a report that just tells us how much coffee we have to ship. What's important here is that we want to be sure we get the latest information. I don't want something that is a few seconds or a few minutes late, because maybe a bag of coffee was just removed, and I want to be as up to date as possible when I decide how much coffee I'm going to ship.

As for running analytics, the analysts don't really need the information to be absolutely up to date. The data may be a little stale — I'm not sure exactly what staleness is acceptable here; it could be a few seconds or a few hours, depending on what they're trying to do — but they will also be doing a lot of collection scans on the data to crunch all of it, and that is a very good indicator that maybe you should not be running those queries on your main server where your write workload is. This is really an indicator that you may want an additional node just to run analytics.
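
One hedged way to express that separation with PyMongo is to route the analytics reads to a secondary node, keeping the collection scans away from the primary's write workload (collection and field names below are illustrative):

    from pymongo import MongoClient
    from pymongo.read_preferences import SecondaryPreferred

    db = MongoClient()["beyond_the_star"]    # hypothetical database name

    # Analytics can tolerate slightly stale data, so read from a secondary.
    # (A dedicated analytics node could also be targeted with replica-set tags.)
    analytics_cups = db.get_collection(
        "coffee_cups", read_preference=SecondaryPreferred()
    )
    pipeline = [{"$group": {"_id": "$store_id", "cups": {"$sum": 1}}}]
    for row in analytics_cups.aggregate(pipeline):
        print(row)
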
And probably the most important query — most of you would have guessed it — is this one. This is obviously the one that will generate the most traffic, because we send a write event for every single cup of coffee we make. Now, one thing to notice is that this is a non-critical write, because the only thing I'm going to do with that data is some analysis, and even if I lose one or two writes — because there's, say, a network failure — it's not going to affect me that much. I can label those as non-critical writes, meaning that maybe my coffee machine doesn't have to wait for an acknowledgment that the write will be protected forever. This is just to say that different data can be treated differently; it's important to really understand that as you're doing your initial design, because this decision may help you a lot.
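
A sketch of that different treatment (again with invented names): the per-cup events can be sent with a relaxed write concern, so the coffee machine doesn't wait for a majority acknowledgment the way the shelf events do:

    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    db = MongoClient()["beyond_the_star"]    # hypothetical database name

    # Non-critical cup events: acknowledge from one node only (w=0 would be
    # fire-and-forget); losing the odd event is acceptable here.
    cup_events = db.get_collection(
        "coffee_cups", write_concern=WriteConcern(w=1)
    )
    cup_events.insert_one({"machine_id": 9, "temp_c": 92.5, "weight_g": 18.2})
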
One thing to note here is that this operation, line 4, is the one that generates the most writes, but it's not enough to just look at the load and expect it to be constant. Usually, if you want to size something correctly, you really have to think about the peak period. If you're in e-commerce, the peak may recur every day, but it may also be the few days before Christmas or Thanksgiving — Black Monday, Black Friday, whatever they call them. Here, what we're going to say is that 30% of the cups of coffee will be made within one hour of the day; we're going to have a rush hour where 30% of them are made. Using that as the new maximum, we see that the peak is going to be 833 writes per second. Again, this is very small: MongoDB can handle tens of thousands of writes per second within a single replica set. That gives us a good indication that we don't need sharding for this system, at least to sustain the writes we're going to be making on our disks. Taking that operation, we may want to dig a little bit more into it, especially if we were trying to really optimize everything, so we may have other qualifications or quantifications to add to it.
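
As a back-of-envelope check (the daily cup volume is assumed here, not stated explicitly), the 833 writes per second works out roughly like this:

    # Rough reconstruction of the peak-rate estimate.
    cups_per_day = 10_000_000          # assumed: ~1,000 cups per store per day across 10,000 stores
    rush_hour_share = 0.30             # 30% of the cups made in one rush hour
    peak_writes_per_sec = cups_per_day * rush_hour_share / 3600
    print(round(peak_writes_per_sec))  # -> 833
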
The other thing I want to size while I'm there is the resources I'm going to need for the system. We have two kinds of writes coming in, which means we have two types of data that will take space: we have the coffee cups, and every time the shelves have an event they also send a signal. Again, we crunch the numbers: let's say we want to keep one year of data; we multiply by the number of bytes per coffee cup or weighing, and we end up with 370 gigabytes of space for the coffee cups and 3.7 gigabytes for the weighings. And this, like the queries, is a pattern you're going to see very often: there's usually one part of the data that really overshadows the rest, and that's the one to focus on. If you want to optimize your space, you basically just have to look at the coffee cups; the weighings don't matter as much.

Another interesting thing here is that I said one year of data, and one year of data gives us 370 gigabytes, which is, again, probably something a single replica set will handle pretty well. One of the rules of thumb for MongoDB is that if you don't have a terabyte of data, you usually don't need to shard for performance. This would be under that — but it is under that because I keep only one year of data. Let's say we had decided it was important to keep 10 years of data: we would have been at 3.7 terabytes, and in that case we would probably have to shard the system. Here you can also make a very important decision if you size things correctly at the beginning: is it really important for the project to have a sharded cluster? Do we want to pay the price of handling all those machines for 10 years of data? In this case, making a hard decision at the beginning — saying I'm not going to keep everything, one year is enough for what I need to do — will probably save you a lot of money in the long run.
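
As another hedged back-of-envelope check (the per-document size is assumed; only the 370 GB and 3.7 TB figures come from the example), the storage numbers are consistent with roughly 100 bytes per cup event:

    # Rough reconstruction of the storage estimate.
    cups_per_year = 10_000_000 * 365          # using the assumed daily volume
    bytes_per_cup_event = 100                 # assumed average document size
    one_year_gb = cups_per_year * bytes_per_cup_event / 1e9
    print(round(one_year_gb))                 # ~365 GB, in line with the 370 GB figure
    print(round(one_year_gb * 10 / 1000, 1))  # ~3.7 TB for 10 years of data
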
Once we have that, we have a better understanding of what our system is about, and we can go to the next phase, which is identifying the relationships. It's not because you're using MongoDB that relationships are not important, and it's not because we call it NoSQL either: you always have relationships between pieces of data.

Let's look at an example where we have actors playing in movies, and we have reviews for those movies. The first two fields are an actor name and a date of birth. These two have a one-to-one relationship, and it's often taken for granted because one-to-one relationships end up in the same entity, but the relationship is still there; it still exists. As for the relationship between the movie title and the actor name, that is many-to-many; those get put in different entities, and we see the many-to-many relationship between the two. The reviews apply to movies, so there we have a one-to-many. If we do that with MongoDB, we still need to understand those relationships, because of what they translate to: here we still have a many-to-many relationship between actors and movies, but the one-to-many relationship between the movies and the reviews gets implemented as an array, as we saw earlier in the first section of this talk.

I mentioned that this is probably the most important question you have to answer: are you going to embed or reference your relationships? There are three types of relationships: one-to-one, one-to-many, and many-to-many. What are the implications of embedding versus referencing? Well, when you embed, for all of them, you basically end up doing just one read and you get all your data. This is much faster, it's also simpler, and it avoids doing joins. The one case to be careful with is the many-to-many relationship: if you embed the information — for example, if we embed actor names inside the movie — we duplicate the names of the actors. But it doesn't really matter: usually, when a movie is released with a bunch of actors, the actor information isn't going to change. If you keep the name, an actor may later change their name, but you may want to keep the old name they had when they made that movie anyway. On the other hand, when you reference, you end up having to do many reads, because now you have to read the two entities, but you end up doing smaller reads. You may not need to read everything on both sides of the collections, because not everything is grouped together and the parts you don't need may not be accessed. You may save on the amount of data you're reading, but you're going to do more IOPS, basically.
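
As a small illustration of that trade-off (names are invented), the one-to-many reviews can be embedded in the movie, while the many-to-many cast can either be duplicated inside the movie or referenced from a separate collection:

    # Embedding: one read returns the movie with its reviews.
    movie_embedded = {
        "_id": 1,
        "title": "The Heist",
        "reviews": [{"stars": 5, "text": "Great"}, {"stars": 3, "text": "OK"}],
        # Duplicating actor names here is usually fine: the cast of a
        # released movie doesn't change.
        "cast": ["A. Smith", "B. Jones"],
    }

    # Referencing: smaller documents, but reading a movie plus its actors
    # now takes more than one read (more IOPS).
    movie_referenced = {"_id": 1, "title": "The Heist", "actor_ids": [11, 12]}
    actor = {"_id": 11, "name": "A. Smith", "date_of_birth": "1980-05-01"}
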
Looking at the different entities that come out of our first phase, we have something like the coffee cups, the stores, the coffee machines, the shelves, the weighings, and the coffee bags. Those are basically the things we run queries on. An easy model could be something like this, where we keep everything in the store: the store has coffee machines and shelves, on which we have coffee bags. I'll keep two other collections for the coffee cups and the weighings. The reason those are separate is that they also have a different life cycle: we said this data is going to expire after one year, and data that expires is usually a good sign that you want a separate collection for it.
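
That expiry is one reason the cups and weighings sit in their own collections: MongoDB can remove old documents automatically with a TTL index. A minimal sketch with invented field and collection names:

    from datetime import datetime, timezone
    from pymongo import MongoClient

    db = MongoClient()["beyond_the_star"]    # hypothetical database name

    # Documents are removed automatically ~one year after their "created_at".
    one_year = 365 * 24 * 3600
    db.coffee_cups.create_index("created_at", expireAfterSeconds=one_year)

    db.coffee_cups.insert_one({
        "machine_id": 9,
        "created_at": datetime.now(timezone.utc),
        "temp_c": 92.5,
    })
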
At this point, this is still pretty relational. This is where we go and apply the patterns to make things easier to access and faster.

That brings us to the last section of the presentation: the patterns. This is not a deep dive into schema design patterns; I'm going to go over some of them very quickly. We do have a lot of other references. The first one is a series of blogs that Ken Alger and I wrote last year, one blog per pattern. The patterns are also covered in depth in our online class M320 from MongoDB. There's also another talk happening at this conference, MongoDB Live 2020, which is Advanced Schema Design Patterns, that Justin LaBreck and I are doing. Justin is one of our best consulting engineers out there: he goes to customers and implements solutions for them, and he has used these patterns many, many times. He's going to walk through a very down-to-earth example from a specific customer, applying patterns in the different sections. It's pretty cool, so please go and look for that talk; it's basically complementary to the one we're doing right now.

The first pattern I want to go over is the schema versioning pattern. Let's say we wanted to migrate a schema from version one to version two in a relational database. For example, we have people and we keep track of their favorite restaurant. Then we decide that the system is very successful and we should keep track of more than that: a bunch of favorite things that our users have, so favorite restaurants and favorite dishes. We're basically going from a scalar field to adding two tables, each with a one-to-many relationship to our people. That's a pretty heavy transformation. If you were going to do that with a relational database, it would be a little bit complicated. The migration would look a little bit like this: we have the first version of the schema; we bring down the system; we reformat the first table to hold less information and migrate the rest to our new tables; and when everything has been migrated to version two of the schema, we can reopen the service.

Now, doing the same with MongoDB using the schema versioning pattern, we would go from documents in version one on the left to version two. In terms of changes to the document, the first thing is that we track the version. We could try to infer the version just from the shape, but it's easier to label the version inside the document. Then we make the change from a favorite restaurant to a list of favorite restaurants and dishes. The key thing here is that with MongoDB, because you have polymorphism, you can have documents with different shapes at any given time; we're not restricted to one schema version for the whole database. Every document can carry its own version. The impact is that if you do a migration with MongoDB, you start with a bunch of documents in your collection, and what you want to end up with is all your documents in version two, while the system stays functional. We migrate a first document, then another document, then another. The key is to have the application able to read and process both shapes, which is very easy: it's polymorphism, and we already know MongoDB can read documents of different shapes. You just have to have your application understand the logic of the two different shapes and do whatever is needed. That basically gives us a no-downtime situation, and this is a pretty complex migration if you look at it: we went from a scalar field to two new one-to-many relationships being added to the schema. You could also have created new collections; it can be much more complex than that if you want. The key thing is being able to do the migration while staying in control: you can decide when to do it, you can take the time you want, you can migrate a batch of data at a time, or you can have the application do it on writes — every time it has to make a change, it reads the previous document, makes the change, and stores the document back. You're really in control, and you can do basically any kind of migration you want without downtime. This, to me, should be a sufficient reason on its own to justify using MongoDB on a project.
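
A hedged sketch of what that looks like in practice (field names are illustrative): both shapes live in the same collection, a schema_version field tells them apart, and the application handles either one:

    # Version 1 and version 2 documents coexist in the same collection.
    v1 = {"_id": 1, "name": "Anna", "favorite_restaurant": "Luigi's"}
    v2 = {
        "_id": 2,
        "schema_version": 2,
        "name": "Ben",
        "favorites": {"restaurants": ["Luigi's", "Sakura"], "dishes": ["Ramen"]},
    }

    def favorite_restaurants(person):
        # The application reads both shapes, so no downtime is needed.
        if person.get("schema_version", 1) == 1:
            return [person["favorite_restaurant"]]
        return person["favorites"]["restaurants"]
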
Another pattern we have is the computed pattern. We see this one applied when we have a lot more read operations than write operations. For people who come from the relational world, this is a bit like a view, but better — you'll see why. Let's say that for each write I do, I do a thousand reads, and the reads need to sum some information. If you look at this diagram, what we're doing is a little bit silly, because the sum computed by all those read operations is exactly the same; we may be computing the same sum a thousand times on average. That is a lot of wasted resources. The other way to do it is, on the write operation, to add the new piece of data to the collection, do the sum, and then take the result — which could have been the view — and store it in the right place. Not just in a view: if there's a document that already exists that should receive that sum, we put it there. For example, it could be the document that characterizes a movie, and we could put the total revenue of that movie directly in it. We don't need to go to another source; we now have a document that has everything in it. You're going to save a lot of resources here, because every time we have a read we don't have to redo the sum.
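
A minimal sketch of the computed pattern with PyMongo (names are invented): on each write we increment a pre-computed total on the parent document, so the thousand reads never have to redo the sum:

    from pymongo import MongoClient

    db = MongoClient()["movies_db"]          # hypothetical database name

    def record_screening(movie_id, revenue):
        # Store the raw event...
        db.screenings.insert_one({"movie_id": movie_id, "revenue": revenue})
        # ...and keep the running total directly on the movie document.
        db.movies.update_one({"_id": movie_id}, {"$inc": {"total_revenue": revenue}})

    # Reads now get the total for free, with no aggregation.
    movie = db.movies.find_one({"_id": 1}, {"title": 1, "total_revenue": 1})
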
Another pattern we have is called the subset pattern. That has to do with the management of your RAM, basically. We use the term "working set" to represent all the data and indexes you need to keep in memory because you access them very often. For any kind of system, you want to keep those things in RAM; you don't want to go to disk if you can avoid it. Well, the issue may be that here, for example, we have four large documents that we need to keep in memory, but there is only space for three. Every time we need to access the fourth one, we have to drop one from memory, go to disk, and bring that one in. That's not very effective. A better way is to look at the big documents, see which parts the application really needs all the time, and break each document into what's accessed frequently and what is not. Now, if we look at the new diagram, the four documents we need to keep in memory fit, because we keep only a small section of each, and we have additional memory left to cycle in the pieces we don't read that often. That's going to perform much better.

That breakout can be done in two different ways. It can be done on a one-to-one relationship: I take a document, remove some fields at the root, put them somewhere else, and keep the same ID, or primary key, in both collections. Or I can break up a one-to-many relationship: I have a document with an array, I take the array and put it somewhere else, but for performance reasons I keep part of it — the subset — in the first object. An example would be a movie document that contains all the actors: we take the actors and put them into a different collection, but still keep, say, the top 10 actors in the main movie document. If you think about it, that's usually the right call: people who look up information about a movie are interested in the top actors, not in the thousands of actors and other cast members who were in it. And if they are, they can make a second query. That's what the subset pattern is about: breaking a one-to-one or one-to-many relationship into two collections.
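
A rough sketch of that movie example (names are invented): the top of the cast stays in the movie document for the frequent reads, and the full cast moves to its own collection for the rare second query:

    # Frequently accessed part: small movie document with only the top cast.
    movie = {
        "_id": 1,
        "title": "The Heist",
        "top_cast": ["A. Smith", "B. Jones"],   # subset kept for fast reads
    }

    # Rarely accessed part: the full cast, keyed by the movie's _id and
    # fetched with a second query only when someone asks for it.
    full_cast = {
        "movie_id": 1,
        "cast": ["A. Smith", "B. Jones", "C. Lee"],  # plus hundreds more
    }
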
The bucket pattern is usually — almost always — used in Internet of Things solutions. You could have one big document that keeps all the information for a given device, or you could create a little document for every single measurement your system receives. The bucket pattern is an in-between solution, where I take not all the data, but not just one measurement either, and put that together in what we call a bucket. It looks a little bit like this: for example, this one here is a bucket per day. I get one document per device per day, and if you look, there's an array holding the temperature measurements for that day. You could also do the same per hour: if you look at my date field, it now includes the hour, and I get one document per device per hour.

What's the right granularity? What should the size of your bucket be? One thing I usually look for is whether, when you do a lot of aggregations on that data, the first thing you end up doing is an $unwind operation to take the arrays apart and extract the values. If so, you may have gone a little too far with the bucket, and you may want a smaller one; that's one thing. The other thing is that if you're going to do computations, using the computed pattern, and put that data somewhere, you probably want whatever you're trying to compute to match the documents. If you're interested in keeping an average temperature per hour, then documents that are buckets per hour could also be the right granularity.
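
A hedged sketch of an hourly bucket for the coffee-machine readings (field names are invented): one document per device per hour, with an array of measurements that grows via $push:

    from datetime import datetime, timezone
    from pymongo import MongoClient

    db = MongoClient()["beyond_the_star"]        # hypothetical database name

    def record_temperature(device_id, temp_c):
        now = datetime.now(timezone.utc)
        hour = now.replace(minute=0, second=0, microsecond=0)
        # One bucket document per device per hour; upsert creates it on first use.
        db.readings.update_one(
            {"device_id": device_id, "hour": hour},
            {"$push": {"temperatures": {"t": now, "value": temp_c}},
             "$inc": {"count": 1}},
            upsert=True,
        )
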
If we apply these patterns to our solution here: for schema versioning, what we have to do is pretty simple — we just add one field, as part of the procedure we'll follow to do our migrations. For the computed pattern, I may want to keep the last weighing received from a given store, and I may want to keep the total number of cups of coffee made by a given machine; that may be important for maintenance. Instead of recalculating it all the time, once in a while we'll go update that value. For the subset pattern, I may want to keep the last 30 days of coffee cups with the store. Remember, we have a separate collection where we keep all the coffee cups, but it may be interesting to keep that recent information right in the store document, so we don't have to go to the other collection and can quickly show something. And I may want to use a bucket: in this case, we wanted to generate a lot of data, so we decided to send single writes for every single cup of coffee and keep separate documents. That keeps things a little simpler, but it's something we could change — we could decide, "Hey, no, the coffee cups should be in a bucket; that would be more efficient for what we're trying to do."

Looking at all the different patterns we have out there and listing them across the different use cases we often see at MongoDB, we come up with this matrix. This is not exact science: it doesn't mean that if you're working on Internet of Things, for example, and there's no checkmark, a given pattern would never be used there. It just shows the patterns you're most likely to see for a given use case. You don't have to know this matrix by heart — in reality, it's not that difficult — but you should be familiar with all the different patterns we have listed there.

That brings us to the conclusion. The things I want you to remember from this talk: the first is to really understand the differences between a document database and a tabular database, so you have a good mental model when you're working on your schema. The second is the main steps for doing the modeling: it's very important to understand your workload at the beginning, you still need to understand the relationships and model them, and then you apply patterns to get better performance. Before we leave, I just want to remind everyone that we do offer classes on everything MongoDB — from aggregation to replication, sharding, and data modeling — at university.mongodb.com. The classes are free, and this is the best resource to learn about MongoDB. Again, thanks for listening. See you around.
