1. DocumentDB overview
This section is about DocumentDB. Amazon DocumentDB provides a MongoDB-compatible database engine in the AWS cloud. It’s a fully managed, non-relational document database for MongoDB workloads. When we talk about a document database, it means that you can store JSON documents in your database tables, or in other words, nested key-value pairs. The tables are called collections in MongoDB and in DocumentDB, and DocumentDB is compatible with the majority of MongoDB applications, drivers, and tools. It’s a high-performance database that supports scalability as well as high availability.
It also supports flexible indexing, powerful ad hoc queries, and analytics capabilities. Storage and compute in DocumentDB can scale independently, and it also supports Multi-AZ read replicas. You can have up to 15 such read replicas, just like you have with Amazon Aurora. Storage scales in 10 GB increments from 10 GB to about 64 TB, and the storage layer is fault tolerant and self-healing. Apart from this, DocumentDB offers automatic, continuous, and incremental backups along with PITR, or point-in-time recovery. That was a quick overview of DocumentDB, and we’re going to dive deeper into it in this section. So let’s continue.
2. What and why about document databases
So what exactly is a document database? A document database stores JSON documents; in other words, semi-structured data. The sample records shown on the slide are the kind of thing you can store in a document database: nested JSON structures, meaning key-value pairs nested inside other key-value pairs. Comparing a document database with a typical relational or SQL database: in SQL you have tables, whereas in MongoDB we call them collections, that is, collections of documents. Rows in relational databases are documents in MongoDB, columns are fields, primary keys correspond to object IDs, and nested tables or objects correspond to embedded documents.
And just like relational databases, you have indexes, views, and arrays in DocumentDB. All right, so why do we need a document database? JSON is very popular; it’s the de facto format for data exchange. DocumentDB makes it easy to insert, query, index, and perform aggregation operations over JSON data. You can store JSON output from your APIs straight into your DocumentDB database and start analyzing it. This is a very important benefit that document databases offer.
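To make the terminology above concrete, here is a minimal Python sketch of a nested document as it might be stored in a collection. The field names (`_id`, `name`, `address`, `orders`) and the `get_path` helper are invented for illustration; they are not taken from the lecture slides.

```python
import json

# A hypothetical document: one "row" in a collection (the "table").
# All field names and values here are illustrative assumptions.
customer = {
    "_id": "c-1001",          # plays the role of a primary key (object ID)
    "name": "Jane Doe",       # a field (a "column")
    "address": {              # an embedded document (nested key-value pairs)
        "city": "Seattle",
        "zip": "98101",
    },
    "orders": [               # an array of embedded documents
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}


def get_path(doc, dotted_path):
    """Follow a dotted path such as 'address.city' through nested dicts."""
    value = doc
    for key in dotted_path.split("."):
        value = value[key]
    return value


# The document round-trips through JSON unchanged, which is why JSON output
# from an API can be stored as-is.
as_json = json.dumps(customer)
restored = json.loads(as_json)
```

For example, `get_path(customer, "address.city")` walks into the embedded document the same way a dotted field path does in a document query.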
DocumentDB offers a flexible document model, data types, and indexing. You can add and remove indexes easily, and you can also run ad hoc queries for analytics workloads. DocumentDB is quite similar to key-value stores like DynamoDB, but DynamoDB is well suited when you know your access patterns beforehand. If you do not know your access patterns, then you can consider using DocumentDB instead. So that’s about this lecture. Let’s continue to the next lecture, where we’ll explore the DocumentDB architecture.
3. DocumentDB architecture
Now let’s talk about the DocumentDB architecture. The DocumentDB architecture is the same as the Aurora architecture, which we discussed earlier. You have six copies of your data across three AZs, which provides a distributed design. DocumentDB uses a lock-free optimistic algorithm, or quorum model. Just like Aurora, four copies out of six are needed for writes (a four-of-six write quorum) and three copies out of six are needed for reads (a three-of-six read quorum). Your writes are considered durable when at least four of the six copies acknowledge the write. And just like Aurora, the storage is shared across three AZs and is self-healing with peer-to-peer replication.
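The quorum rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how the storage layer is implemented; the function names are made up, and the assumption of two copies per AZ follows from "six copies across three AZs".

```python
COPIES = 6        # six copies of the data across three AZs
WRITE_QUORUM = 4  # four-of-six write quorum
READ_QUORUM = 3   # three-of-six read quorum
COPIES_PER_AZ = 2 # assumption: copies are spread evenly over three AZs


def write_is_durable(acks):
    """A write is durable once at least four copies acknowledge it."""
    return acks >= WRITE_QUORUM


def read_can_proceed(responses):
    """A quorum read needs at least three copies to respond."""
    return responses >= READ_QUORUM


def quorums_overlap(copies=COPIES, write_q=WRITE_QUORUM, read_q=READ_QUORUM):
    """Every read set shares a copy with every write set if R + W > N."""
    return read_q + write_q > copies


def copies_after_failure(azs_lost=0, extra_copies_lost=0):
    """Copies still reachable after losing whole AZs plus individual copies."""
    return COPIES - azs_lost * COPIES_PER_AZ - extra_copies_lost
```

With these numbers, losing one AZ leaves four copies, so writes can still reach quorum; losing one AZ plus one more copy leaves three, which is enough for reads but not for writes.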
The storage is striped across hundreds of volumes and provides an auto-expanding, auto-scaling storage layer. Just like in Aurora, one DocumentDB instance acts as the writer, or master node, and you can have up to 15 read replicas. Remember that the compute nodes on replicas do not need to write or replicate, because we have a shared storage volume, and this provides improved read performance. The storage layer uses what is called log-structured distributed storage, in which the compute layer passes incremental log records to the storage layer. Since it uses incremental log records, the process is fairly fast.
So in all, you have a master and up to 15 read replicas to serve your reads, and data is continuously backed up to S3 in real time using the storage nodes. Backup happens on the storage nodes, so compute node performance is not affected by the backup process. All right, so that’s the DocumentDB architecture. Now let’s look at the DocumentDB cluster in a little more detail. You have a storage layer, one master, and multiple replicas. You talk to the master instance using the cluster endpoint, or writer endpoint, and you use the reader endpoint to talk to the replicas. The reader endpoint provides connection load balancing, and it’s always recommended to connect using the cluster endpoint in replica set mode.
This enables your SDK to auto-discover the cluster arrangement as instances get added or removed from the cluster. So always prefer the cluster endpoint and reader endpoint over individual instance endpoints. All right, now let’s talk about DocumentDB replication. Replication again is the same as Aurora. You have up to 15 read replicas. Replication is asynchronous, and replicas share the same underlying storage as the master instance. The shared storage spans multiple AZs. Because this is asynchronous replication, there is some lag: a few tens of milliseconds of replica lag is expected when you use DocumentDB replicas, and this is true with Aurora as well.
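As a rough sketch of the recommendation to connect through the cluster endpoint in replica set mode, the helper below assembles a MongoDB connection string. The endpoint host name is a placeholder, `build_uri` is a made-up helper, and the option values (`replicaSet=rs0`, `readPreference=secondaryPreferred`, `retryWrites=false`) follow the pattern commonly seen in AWS sample connection strings; treat them as assumptions to check against your own cluster.

```python
from urllib.parse import quote_plus


def build_uri(user, password, cluster_endpoint, port=27017, read_from_replicas=True):
    """Build a replica-set-mode connection string for the cluster endpoint."""
    options = ["replicaSet=rs0"]  # replica set mode lets the driver discover instances
    if read_from_replicas:
        # Route reads to replicas when available; they may lag by some milliseconds.
        options.append("readPreference=secondaryPreferred")
    options.append("retryWrites=false")
    return "mongodb://{}:{}@{}:{}/?{}".format(
        quote_plus(user), quote_plus(password), cluster_endpoint, port, "&".join(options)
    )


# Placeholder endpoint; a real one is shown on the cluster's configuration page.
uri = build_uri("demo", "p@ss", "example-cluster-endpoint")
```

A driver such as PyMongo would then take this string directly, for example `MongoClient(uri)`, and route writes to the primary and reads according to the read preference.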
There is minimal performance impact on the primary due to the replication process. Replicas also double up as failover targets, so we don’t have the concept of a standby instance as we have with RDS. Now let’s look at failovers in DocumentDB. Failovers are automatic: a replica is automatically promoted to be the new primary during a disaster recovery operation. DocumentDB flips the CNAME of the database instance to point to the replica and then promotes it. Failover times are minimal, about 30 seconds or so. And remember, if you don’t have a replica, then creating a new instance after a failure takes about eight to ten minutes; this happens on a best-effort basis and can take longer.
4. DocumentDB backup and restore
Now let’s look at the backup and restore options in DocumentDB. DocumentDB supports automatic backups. Just like in RDS, DocumentDB continuously backs up your data to S3 for PITR purposes, and the maximum retention period is 35 days. The latest restorable time for PITR, or point-in-time recovery, can be about 5 minutes in the past, because DocumentDB uploads the logs to S3 every 5 minutes. The first backup is a full backup and subsequent backups are incremental, so the backup process is faster after the first backup. Automatic backups can be retained for a maximum of 35 days.
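The restorable window described above (at most 35 days back, and no later than roughly 5 minutes ago) can be expressed as a small check. This is a sketch under the lecture's numbers; the function names are hypothetical, and the real latest restorable time is reported by the service rather than computed like this.

```python
from datetime import datetime, timedelta, timezone

MAX_RETENTION_DAYS = 35                 # maximum automatic backup retention
LOG_UPLOAD_LAG = timedelta(minutes=5)   # logs reach S3 about every 5 minutes


def restorable_window(now, retention_days=MAX_RETENTION_DAYS):
    """Return (earliest, latest) timestamps that PITR could target."""
    earliest = now - timedelta(days=retention_days)
    latest = now - LOG_UPLOAD_LAG
    return earliest, latest


def can_restore_to(target, now, retention_days=MAX_RETENTION_DAYS):
    """True if the target time falls inside the restorable window."""
    earliest, latest = restorable_window(now, retention_days)
    return earliest <= target <= latest


now = datetime(2024, 1, 31, 12, 0, tzinfo=timezone.utc)
```

A target ten minutes in the past falls inside the window, while one minute ago is too recent and 36 days ago is beyond retention.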
If you want to retain backups beyond 35 days, you can use manual snapshots. Manual snapshots are retained until you delete them. Remember that the backup process does not impact your cluster performance, and because these are incremental backups, they are faster than taking full backups. Just like RDS, remember that you can only restore to a new cluster. You can restore an unencrypted snapshot to an encrypted cluster, but not the other way around. So you take a manual snapshot, enable encryption at rest, and then restore it to your new DocumentDB cluster.
To restore a cluster from an encrypted snapshot, you must have access to the appropriate KMS key. And remember, you can only share manual snapshots. To share an automated snapshot, you first copy it to a manual snapshot and then share the copy. You also can’t share a snapshot that has been encrypted with the default KMS key of your account. Snapshots can be shared across accounts, but only within the same region.
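The sharing rules above can be summarized as a small decision helper. It only encodes the rules as stated in this lecture; `sharing_steps` and its parameters are hypothetical names, and it does not call any AWS API.

```python
def sharing_steps(snapshot_type, encrypted=False, uses_default_kms_key=False, same_region=True):
    """Return the ordered steps to share a snapshot, or raise ValueError if not allowed."""
    if snapshot_type not in ("manual", "automated"):
        raise ValueError("snapshot_type must be 'manual' or 'automated'")
    if not same_region:
        raise ValueError("snapshots can be shared across accounts only within the same region")
    if encrypted and uses_default_kms_key:
        raise ValueError("snapshots encrypted with the account's default KMS key cannot be shared")

    steps = []
    if snapshot_type == "automated":
        # Automated snapshots cannot be shared directly.
        steps.append("copy the automated snapshot to a manual snapshot")
    steps.append("share the manual snapshot with the target account")
    if encrypted:
        # The target account needs access to the key to restore from it.
        steps.append("grant the target account access to the KMS key")
    return steps
```

For an automated, unencrypted snapshot this returns two steps: copy first, then share.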
5. DocumentDB scaling
Now let’s talk about scaling options in DocumentDB. DocumentDB does not support sharding like a typical MongoDB database does. But since DocumentDB uses the Aurora architecture, it instead offers read replicas, vertical scaling, and automatic storage scaling. For vertical scaling, you simply resize your instances, scaling them up or down. For horizontal scaling, you add or remove read replicas from your database cluster: you scale out by adding more replicas and scale in by removing some of them.
You can also scale up a replica independently of the other replicas, so one of the replicas can be a different size. This is typically useful if you want to use one of the replicas as a dedicated replica for analytical workloads, for example. Scaling of storage is automatic, just like Aurora: the storage scales on demand from 10 GB to 64 TB in increments of 10 GB, with no manual intervention. All right, so that’s about it. Let’s continue to the next lecture.
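The storage scaling rule (10 GB steps, from 10 GB up to 64 TB) is simple arithmetic, sketched below. For simplicity the upper limit is assumed to be 64,000 GB, and `allocated_storage_gb` is a made-up name; the service does this automatically, so the function is only a way to reason about the steps.

```python
import math

STEP_GB = 10       # storage grows in 10 GB increments
MIN_GB = 10        # smallest volume size
MAX_GB = 64_000    # assumption: 64 TB treated as 64,000 GB for this sketch


def allocated_storage_gb(data_gb):
    """Round the data size up to the next 10 GB step within the allowed range."""
    if data_gb < 0 or data_gb > MAX_GB:
        raise ValueError("data size is outside the supported range")
    return max(MIN_GB, math.ceil(data_gb / STEP_GB) * STEP_GB)
```

So a cluster holding 95 GB of data would sit on a 100 GB volume, and one holding 10.5 GB would already have stepped up to 20 GB.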
6. DocumentDB security
Now, let’s talk about DocumentDB security. First, IAM and networking. You use IAM to manage DocumentDB resources, and DocumentDB supports MongoDB’s default auth, SCRAM, which is the Salted Challenge Response Authentication Mechanism. You use this for database authentication. It also supports built-in roles for DB users with RBAC, or role-based access control. Speaking of network security, DocumentDB clusters are VPC-only, so you typically use private subnets within your VPC to set up your DocumentDB clusters. Clients like the MongoDB shell can run on an EC2 instance placed in private subnets within the VPC. And of course, if you want to connect from your on-premises infrastructure, you can connect using a VPN.
So that’s about IAM and network security. Let’s talk about encryption. DocumentDB supports encryption at rest using the AES-256 algorithm with KMS, and this is applied to the cluster data, replicas, indexes, logs, backups, snapshots, and so on. Encryption in transit is provided using TLS. To enable TLS, you set the tls parameter in the cluster parameter group. To connect over TLS, you download the public key certificate from AWS and then pass the certificate file while connecting to your cluster. So that’s about the security options in DocumentDB. Let’s continue to the next lecture.
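On the client side, passing the certificate usually means adding TLS options to the connection string. The sketch below adds them to an existing MongoDB URI using only the standard library. `with_tls` is a hypothetical helper, the host name is a placeholder, and `global-bundle.pem` stands in for whatever certificate file you downloaded from AWS.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_tls(uri, ca_file):
    """Return the URI with tls=true and tlsCAFile set, replacing earlier values."""
    parts = urlsplit(uri)
    # Keep existing options except any earlier TLS settings.
    options = [(k, v) for k, v in parse_qsl(parts.query) if k not in ("tls", "tlsCAFile")]
    options += [("tls", "true"), ("tlsCAFile", ca_file)]
    return urlunsplit((parts.scheme, parts.netloc, parts.path or "/", urlencode(options), ""))


# Placeholder host and credentials, for illustration only.
secure_uri = with_tls(
    "mongodb://demo:secret@example-cluster-endpoint:27017/?replicaSet=rs0",
    "global-bundle.pem",
)
```

The same two settings can also be given to a driver as keyword options instead of being embedded in the string; which form to use is a matter of preference.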
7. DocumentDB pricing
Now let’s discuss DocumentDB pricing. Firstly, you pay for on-demand instances; pricing is per second with a ten-minute minimum. You also pay for I/O, per million I/O requests. Each database page read from the storage volume counts as one I/O, and one page is 8 KB in DocumentDB. Similarly, write I/Os are counted in 4 KB units. Then you also pay for database storage per GB per month and for backups per GB per month; backup storage up to 100% of your cluster’s data storage is free. Just like any other AWS service, you pay data transfer charges as well. And just like RDS instances, you can temporarily stop compute instances for up to seven days. If you don’t restart them within seven days, they will be started automatically. All right, so that’s about DocumentDB pricing. Let’s continue to the next lecture.
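The billing units mentioned above can be turned into a few helper functions for back-of-the-envelope estimates. This is a sketch only: the function names are invented, the price passed to `io_cost` is a made-up number rather than an actual AWS rate, and real bills depend on region and instance class.

```python
import math

READ_PAGE_BYTES = 8 * 1024    # one page read from storage = one I/O (8 KB)
WRITE_UNIT_BYTES = 4 * 1024   # write I/Os are counted in 4 KB units
MIN_BILLED_SECONDS = 10 * 60  # per-second billing with a ten-minute minimum


def read_ios(bytes_read_from_storage):
    """Approximate read I/Os as the number of 8 KB pages fetched from storage."""
    return math.ceil(bytes_read_from_storage / READ_PAGE_BYTES)


def write_ios(bytes_written):
    """Write I/Os counted in 4 KB units."""
    return math.ceil(bytes_written / WRITE_UNIT_BYTES)


def io_cost(total_ios, price_per_million):
    """Cost of I/O requests at a given (hypothetical) price per million."""
    return total_ios / 1_000_000 * price_per_million


def billable_backup_gb(storage_gb, backup_gb):
    """Backup storage is free up to 100% of the cluster's data storage."""
    return max(0, backup_gb - storage_gb)


def billed_instance_seconds(run_seconds):
    """Apply the ten-minute minimum to an instance's running time."""
    return max(run_seconds, MIN_BILLED_SECONDS)
```

For example, a cluster with 100 GB of data and 130 GB of backups would be billed for 30 GB of backup storage, and an instance that ran for 90 seconds would still be billed for 600.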
8. DocumentDB monitoring
Now let’s discuss the monitoring aspects of DocumentDB. As usual, API calls are logged with CloudTrail, and common metrics are provided by CloudWatch. You can use CloudWatch to monitor CPU or RAM utilization, IOPS metrics, DB connections, network traffic, storage volume consumption, and so on. DocumentDB supports two types of logs, profiler logs and audit logs, and you can export these logs to CloudWatch Logs as well. Let’s look at the DocumentDB profiler first. The profiler logs the details of operations performed on your cluster. It helps you identify slow operations and improve your query performance.
You can access the profiler logs from CloudWatch Logs if you have enabled the export to CloudWatch Logs. Enabling the profiler is a two-step process. First, you set a few parameters in your cluster parameter group: profiler, profiler_threshold_ms, and profiler_sampling_rate. Second, you enable log exports for profiler logs by modifying your cluster. Both of these steps are mandatory to enable the profiler in your DocumentDB cluster. All right, now let’s look at the second type of logs, the audit logs. Audit logs are used to record your database activity. They record DDL statements, authentication, authorization, and user management events to CloudWatch Logs.
The events in these categories that occur on your DocumentDB cluster are logged in the audit logs. This exports your cluster’s auditing records to CloudWatch Logs, and these records are JSON documents. You can access the audit logs from CloudWatch Logs if you have enabled the export. Just like profiler logs, enabling audit logs is a two-step process. To enable auditing on your database cluster, you first set the audit_logs parameter to enabled and then enable log exports for audit logs by modifying your DocumentDB cluster. Both of these steps are required in order to enable audit logs on your cluster. All right, so that’s about it. Let’s continue to the next lecture.
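The two-step process for both log types can be sketched as the request parameters you might pass to the AWS SDK. The helpers below only build dictionaries; the names `parameter_changes` and `log_export_change` are invented, the default threshold and sampling values are illustrative, and `pending-reboot` is chosen as a conservative apply method. The keys mirror the shapes used by the boto3 `docdb` client's `modify_db_cluster_parameter_group` and `modify_db_cluster` calls, but verify them against the SDK documentation before relying on them.

```python
def parameter_changes(profiler=True, audit=True, threshold_ms=100, sampling_rate=1.0,
                      apply_method="pending-reboot"):
    """Step 1: parameters to set in the cluster parameter group."""
    values = {}
    if profiler:
        values["profiler"] = "enabled"
        values["profiler_threshold_ms"] = str(threshold_ms)      # log operations slower than this
        values["profiler_sampling_rate"] = str(sampling_rate)    # fraction of slow operations to log
    if audit:
        values["audit_logs"] = "enabled"
    return [
        {"ParameterName": name, "ParameterValue": value, "ApplyMethod": apply_method}
        for name, value in values.items()
    ]


def log_export_change(cluster_id, profiler=True, audit=True):
    """Step 2: enable CloudWatch Logs exports on the cluster itself."""
    log_types = []
    if audit:
        log_types.append("audit")
    if profiler:
        log_types.append("profiler")
    return {
        "DBClusterIdentifier": cluster_id,
        "CloudwatchLogsExportConfiguration": {"EnableLogTypes": log_types},
    }


def logging_is_effective(parameter_enabled, export_enabled):
    """Both steps are required; either one alone produces no logs in CloudWatch."""
    return parameter_enabled and export_enabled
```

With a client created as `boto3.client("docdb")`, the first result would go into the `Parameters` argument of `modify_db_cluster_parameter_group`, and the second would be unpacked into `modify_db_cluster`.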
9. DocumentDB performance management
Now, let’s discuss some of the performance management tasks that you can perform in your DocumentDB cluster. For example, if you have slow queries on your database cluster, you can use the explain command to analyze them. You can also use db.adminCommand to find and terminate queries. For example, if you want to terminate long-running or blocked queries, you can run the admin command with killOp set to 1 and pass the op ID of the query. So these were some of the tips for performance management. And with that said, let’s go ahead and create a DocumentDB cluster.
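Since the commands themselves were only shown on the slide, here is a hedged sketch of what the find-and-terminate flow might look like from Python. The helpers only build the command documents; the `$currentOp` pipeline and the `secs_running` and `opid` fields follow the pattern used in AWS's DocumentDB examples, while the function names and the ten-second threshold are assumptions.

```python
def current_op_command(min_seconds_running=10):
    """Admin command that lists operations running longer than the threshold."""
    return {
        "aggregate": 1,
        "pipeline": [
            {"$currentOp": {}},
            {"$match": {"secs_running": {"$gt": min_seconds_running}}},
            {"$project": {"_id": 0, "opid": 1, "secs_running": 1}},
        ],
        "cursor": {},
    }


def kill_op_command(op_id):
    """Admin command that terminates one operation: killOp set to 1 plus the op ID."""
    return {"killOp": 1, "op": op_id}


def long_running_op_ids(current_ops, min_seconds_running=10):
    """Pick op IDs out of a list of current-operation documents."""
    return [
        op["opid"]
        for op in current_ops
        if op.get("secs_running", 0) > min_seconds_running
    ]


# Made-up sample of what the current-operations output might contain.
sample_ops = [
    {"opid": 101, "secs_running": 42},
    {"opid": 102, "secs_running": 1},
]
```

With PyMongo, each document would be sent with something like `client.admin.command(kill_op_command(101))`. For the explain step, PyMongo cursors expose an `explain()` method that returns the query plan.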
10. Creating a DocumentDB cluster – Hands on
In this demo, let’s create a DocumentDB cluster. Here I am in the DocumentDB console, so let’s launch a DocumentDB cluster. Here you can specify the cluster identifier, which is just the name of your cluster. Then choose an instance class. Since this is just a demo, I’m going to choose the smallest one available, so let’s go with t3.medium. We can keep the number of instances at three. Specify a master username. Remember, you cannot use admin as the username here, as it is a reserved keyword, so I’m going to use my name, for example, and specify a password. Further, we can enable this option to show advanced settings, which allows us to configure the network, backups, encryption, and other options.
You can choose the VPC that you want to place your DocumentDB cluster in, then choose a subnet group and a VPC security group. Just make sure that the security group allows inbound connections on port 27017, which is the default port for DocumentDB. All right, then you can set up encryption here, and configure backups and log exports. DocumentDB supports two types of logs, audit logs and profiler logs, and if you want to publish these logs to CloudWatch Logs, you can select these checkboxes here. Then you can choose the maintenance window, tags, and so on. I do not want to enable deletion protection, so I’m going to disable it. That allows us to delete the cluster immediately after our demo.
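The console choices above map roughly onto the parameters of the SDK's create calls. The sketch below builds those parameters as plain dictionaries so it runs without AWS access. The helper names, identifiers, and IDs are placeholders; the keys follow the boto3 `docdb` client's `create_db_cluster` and `create_db_instance` parameters as I understand them, so check them against the SDK reference before use.

```python
RESERVED_USERNAMES = {"admin"}  # the demo notes that "admin" is reserved
DEFAULT_PORT = 27017            # the security group must allow inbound traffic here


def create_cluster_request(cluster_id, username, password, security_group_ids,
                           subnet_group, deletion_protection=False):
    """Parameters for creating the cluster (storage, network, logging)."""
    if username.lower() in RESERVED_USERNAMES:  # case-insensitive check is an assumption
        raise ValueError("'admin' is reserved and cannot be the master username")
    return {
        "DBClusterIdentifier": cluster_id,
        "Engine": "docdb",
        "MasterUsername": username,
        "MasterUserPassword": password,
        "Port": DEFAULT_PORT,
        "VpcSecurityGroupIds": list(security_group_ids),
        "DBSubnetGroupName": subnet_group,
        "StorageEncrypted": True,
        "EnableCloudwatchLogsExports": ["audit", "profiler"],
        "DeletionProtection": deletion_protection,
    }


def create_instance_requests(cluster_id, instance_class="db.t3.medium", count=3):
    """Parameters for each instance; the first one created becomes the writer."""
    return [
        {
            "DBInstanceIdentifier": "{}-{}".format(cluster_id, number),
            "DBInstanceClass": instance_class,
            "Engine": "docdb",
            "DBClusterIdentifier": cluster_id,
        }
        for number in range(1, count + 1)
    ]
```

Each dictionary would be unpacked into the matching client call, for example `client.create_db_cluster(**create_cluster_request(...))`.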
All right, so let’s go ahead and create the cluster. The cluster is now creating, and it’s going to take a while for it to be ready, so I’m going to pause the video here and come back once it’s available. All right, the cluster is available. Now let’s look at what it has here. You can see the connectivity and security options here: the security group and the connectivity information. To connect to this cluster, you need the RDS security certificate, or SSL certificate, which you can download using this command. If you want to connect to this cluster using the mongo shell, you can use this command. Within this command, you pass the SSL certificate that you downloaded along with your username and password.
Similarly, if you are connecting from an application, you can use this command, and it also has a reference to your certificate file. Right, then let’s look at the configuration. You can see the cluster endpoint and the reader endpoint here, as well as the parameter groups and other configuration options. You can see that we have one writer instance and two reader instances. If we go to the monitoring tab, you can see different CloudWatch metrics here. We have just created this cluster, so there is nothing to see yet, but when you start using the cluster, this section will be populated.
All right, so that’s about it. Before we close, let’s delete this cluster. If you try to delete the cluster from here, it’s going to ask you to delete the underlying instances first. So let’s go back to instances and delete these instances one by one. When you delete the last instance in the cluster, it’s going to ask whether you want to create a final snapshot. We don’t need it, so I’m going to set it to No, and then we have to enter this phrase, and that’s about it. This is going to delete our entire cluster. All right, so now we can see that the cluster is deleting. So that’s about this demo. Let’s continue.
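The deletion order from the demo (instances first, then the cluster, with or without a final snapshot) can be sketched as an ordered plan of SDK calls. The `deletion_plan` and `apply_plan` helpers are hypothetical; the operation and parameter names follow the boto3 `docdb` client as I understand it, and a real script would probably need to wait for each instance to finish deleting before moving on.

```python
def deletion_plan(cluster_id, instance_ids, final_snapshot_id=None):
    """Return (operation, kwargs) pairs in the order they should run."""
    plan = [
        ("delete_db_instance", {"DBInstanceIdentifier": instance_id})
        for instance_id in instance_ids
    ]
    cluster_kwargs = {"DBClusterIdentifier": cluster_id}
    if final_snapshot_id is None:
        # Equivalent to answering "No" to the final snapshot prompt.
        cluster_kwargs["SkipFinalSnapshot"] = True
    else:
        cluster_kwargs["SkipFinalSnapshot"] = False
        cluster_kwargs["FinalDBSnapshotIdentifier"] = final_snapshot_id
    plan.append(("delete_db_cluster", cluster_kwargs))
    return plan


def apply_plan(client, plan):
    """Run each step against a client object that exposes the named operations."""
    for operation, kwargs in plan:
        getattr(client, operation)(**kwargs)
```

Passing a client created with `boto3.client("docdb")` to `apply_plan` would issue the calls for real, so it is worth printing the plan and reviewing it first.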