1. Neptune overview
In this section, let’s explore Neptune, the graph database service provided by AWS. Neptune is a fully managed graph database service. It’s a nonrelational database, and in a graph database, relationships are always first-class citizens. You can use Neptune to quickly navigate relationships and retrieve complex relations between highly connected data sets. Neptune allows you to query billions of relationships with millisecond latency. Neptune is also ACID compliant with immediate consistency. In addition to this, it supports transaction semantics for highly concurrent OLTP workloads. And by transaction semantics, I am referring to the ACID support that Neptune provides.
Neptune supports two graph query languages: Apache TinkerPop Gremlin and RDF SPARQL. And Neptune uses the same architecture as Aurora. It supports 15 low-latency replicas, which you can place in a Multi-AZ setup. And here are some of the use cases you can use Neptune for. You can use it to create a social graph or a knowledge graph. You can use it for fraud detection, for real-time big data mining, or even for customer interests and recommendation engines. So, what exactly is a graph database? A graph database models relationships between your data. So you typically have a combination of subject, predicate, object, and graph.
Or you can call it a quad. So, let’s say Joe likes pizza. This is a relationship between Joe and pizza. Joe is the subject, pizza is the object, and likes is the predicate. So this is how you model relationships using graph databases. Sara is friends with Joe. Then Sara likes pizza too. Joe is a student. He lives in London. London is a city. So these are different relationships that you can store using a graph database. And it lets you ask different questions, like identify Londoners who like pizza, or identify friends of Londoners who like pizza. Now, this is a very simple example, but you definitely can create complex relationships and uncover answers to different questions that relate to the relationships within your data set.
So you have nodes and edges, or you can also call them vertices and actions, and you use them to describe the data and the relationships between them. For example, Joe, Sara, and pizza are nodes or vertices, and likes, is friends with, lives in, and is are the edges or the actions that describe the relationships between these nodes. So the database typically stores person, action, and object, along with an ID to identify the node and edge. So person, action, object is the same as subject, predicate, object, right? And you have a graph ID or an edge ID to identify the nodes and edges uniquely.
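As a rough illustration of the subject, predicate, object idea, here is a minimal Python sketch that stores the relationships from the lecture as plain triples and answers the two example questions. This is not how Neptune stores data internally. The predicate spellings (likes, is_friends_with, lives_in, is_a) and the helper names are assumptions made for this sketch, and "friends of Londoners who like pizza" is read here as "people who are friends with a Londoner and who like pizza".

```python
# Each relationship is a (subject, predicate, object) triple, as in the lecture.
# Predicate spellings are illustrative choices, not Neptune conventions.
TRIPLES = [
    ("Joe", "likes", "pizza"),
    ("Sara", "is_friends_with", "Joe"),
    ("Sara", "likes", "pizza"),
    ("Joe", "is_a", "student"),
    ("Joe", "lives_in", "London"),
    ("London", "is_a", "city"),
]


def subjects(predicate, obj, triples=TRIPLES):
    """Return every subject that has `predicate` pointing at `obj`."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def londoners_who_like(food):
    """Identify Londoners who like the given food."""
    return subjects("lives_in", "London") & subjects("likes", food)


def friends_of_londoners_who_like(food):
    """Identify people who are friends with a Londoner and like the food."""
    londoners = subjects("lives_in", "London")
    friends = {
        s for s, p, o in TRIPLES if p == "is_friends_with" and o in londoners
    }
    return friends & subjects("likes", food)


print(londoners_who_like("pizza"))
print(friends_of_londoners_who_like("pizza"))
```

With only six relationships a set comprehension is enough. A graph database earns its keep when the same kind of question has to hop across billions of edges.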
And in addition, you can also filter or discover data based on the strength, weight, or quality of these relationships. Now, there are different query languages that you can use to query your graph data. Neptune supports two popular modeling frameworks: Apache TinkerPop and RDF. TinkerPop uses Gremlin as its traversal language, and RDF uses SPARQL. Now, SPARQL is great for multiple data sources and has a large variety of data sets available. And we can use Gremlin or SPARQL to load data into Neptune and then to query it. You can store both the Gremlin as well as the SPARQL graph data on the same Neptune cluster.
But remember that they will be stored separately on the Neptune cluster. And the graph data that you insert using one of these query languages can only be queried with that language and not with the other. So if you write Gremlin data using Gremlin, then you can query it with Gremlin only. And if you write SPARQL data using SPARQL, you can query it only using SPARQL. And that kind of makes sense, right? So that’s about it. Let’s continue to the next lecture, where we’ll study the Neptune architecture.
2. Neptune architecture
Neptune architecture. So this is how it looks. You are probably familiar with this architecture. It’s the same as the architecture of Amazon Aurora. So Neptune uses the same cloud-native architecture as the Aurora database. You have six copies of your data spread across three AZs, so it gives you a distributed design. It uses a lock-free optimistic algorithm, or a quorum model. And by now you must know what a quorum model is. Four copies out of six are needed for writes to be considered durable, and three of the six copies are needed for reads. It’s called a three-of-six read quorum and a four-of-six write quorum.
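The quorum arithmetic can be checked with a few lines of Python. This is only a sketch of the numbers mentioned above, not of the actual storage protocol. The function names are made up for illustration, and the two-copies-per-AZ figure is an assumption that follows from spreading six copies evenly across three AZs.

```python
TOTAL_COPIES = 6    # six copies of the data
WRITE_QUORUM = 4    # four of six acknowledgements make a write durable
READ_QUORUM = 3     # three of six copies are needed for a read
COPIES_PER_AZ = 2   # assumption: six copies spread evenly over three AZs


def write_is_durable(acks):
    """A write counts as durable once enough copies acknowledge it."""
    return acks >= WRITE_QUORUM


def read_is_possible(reachable_copies):
    """A read needs at least a read quorum of reachable copies."""
    return reachable_copies >= READ_QUORUM


def quorums_overlap():
    """Any read set shares at least one copy with any write set."""
    return WRITE_QUORUM + READ_QUORUM > TOTAL_COPIES


# Losing one whole AZ leaves four copies: writes and reads still work.
after_az_loss = TOTAL_COPIES - COPIES_PER_AZ
print(write_is_durable(after_az_loss), read_is_possible(after_az_loss))

# Losing one AZ plus one more copy leaves three: reads only.
after_az_plus_one = after_az_loss - 1
print(write_is_durable(after_az_plus_one), read_is_possible(after_az_plus_one))
```

The overlap check is the reason the numbers are four and three: because 4 + 3 is greater than 6, every read quorum includes at least one copy that saw the latest durable write.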
And the storage is a self-healing storage that uses peer-to-peer replication. The storage is striped across hundreds of volumes, and it’s auto scaling as well. One Neptune instance acts as the writer or the master node, and the other instances, up to 15 of them, are replicas. And compute nodes on replicas do not need to write or replicate because we have a shared storage volume, and this results in improved read performance. The storage is a log-structured distributed storage layer, and it passes incremental log records from the compute layer to the storage layer. And since it uses incremental log records and not full log records, this process is typically faster.
And as mentioned earlier, Neptune uses one master instance with up to 15 read replicas, and the data is continuously backed up to S3 in real time using storage nodes, so your compute node performance is not affected. And you have a cluster endpoint that you use for writing data to your cluster and a reader endpoint for reading data from your cluster. The reader endpoint also provides connection load balancing between the different replicas. And as replicas get added or removed, the reader endpoint is kept up to date with that information. Now, in addition to these two endpoints, Neptune also has a couple more endpoints.
So we have a loader endpoint that you use to load data into your Neptune cluster, let’s say from S3, for example. And the format is simple. Your cluster endpoint and your port, followed by loader, is your loader endpoint. Similarly, you have a Gremlin endpoint that you use for Gremlin queries and a SPARQL endpoint that you use for SPARQL queries. So you simply use slash gremlin for the Gremlin endpoint and slash sparql for the SPARQL endpoint. And this will become clear as we go into our demo. So let’s go ahead and create a Neptune cluster so we get a better understanding of how Neptune works.
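To make the endpoint format concrete, here is a small Python sketch that assembles the three URLs from a cluster endpoint and a port. The host name below is a made-up placeholder, and the helper name is an assumption for this sketch. Port 8182 and the https scheme are the ones used later in the demo.

```python
NEPTUNE_PORT = 8182  # port used in the demo


def neptune_endpoints(cluster_endpoint, port=NEPTUNE_PORT):
    """Build the loader, Gremlin, and SPARQL URLs for a cluster endpoint."""
    base = f"https://{cluster_endpoint}:{port}"
    return {
        "loader": f"{base}/loader",
        "gremlin": f"{base}/gremlin",
        "sparql": f"{base}/sparql",
    }


# Hypothetical cluster endpoint, for illustration only.
endpoints = neptune_endpoints("my-neptune.cluster-example.us-west-2.neptune.amazonaws.com")
for name, url in endpoints.items():
    print(name, url)
```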
3. Creating a Neptune cluster – Hands on
In this lecture, let’s create a Neptune cluster. Okay, so Neptune clusters can be accessed only from an EC2 instance within the same VPC, or using Jupyter notebooks running on SageMaker. All right, so when we create a Neptune instance, we also need to create an EC2 instance in the same VPC with an appropriate security group. So in order to connect to Neptune, we need to create an EC2 instance as well. You can definitely do all this manually, but AWS also provides us with a CloudFormation template that we can use to speed up this process. We’re going to use the CloudFormation template here.
So here I am in the Neptune console, and if you scroll down, you can see a Neptune quick start link here. I’m going to open it in a new window. And here, under Getting Started, let’s look for creating a DB cluster and scroll down to locate the CloudFormation templates. You can choose one of the templates here depending on your region. I’m going to use the Oregon region, so I’m going to launch this stack. So click on the stack to launch it. We’re going to use the same template, so click on next. You can give your stack a name. I’m going to stick to the default one.
The Neptune port is 8182. And if you’re creating this manually, make sure that you set up your security group to allow inbound connections on this port, 8182. All right, then we choose the DB instance type. I’m going to choose the smallest available here, db.t3.medium. Then the EC2 client instance type. We don’t need a 2xlarge, so we can probably go with t3.medium. Okay, then here you have to provide an SSH key pair name. If you don’t have a key pair, then you can create one from the EC2 console. I already have a key pair here, so I’m going to use it. And the environment can be test. We don’t need IAM authentication.
We also don’t need to enable audit logs. Then here I’m going to enable the notebook instance type. Okay, so I’m going to choose a t2.medium instance. And what this is going to do is create a new EC2 instance and set up a Jupyter notebook on it. And this Jupyter notebook is managed by SageMaker. We’re going to do a quick demo on notebooks as well, so let’s park that for later. Then for the Gremlin console, I’m going to select true, and the same for the SPARQL or RDF4J console. So what this is going to do is set up a Gremlin and an RDF4J console on the EC2 instance that you’re going to use to connect to your Neptune cluster.
All right, so click on next. We can leave these values at their defaults and continue. So let’s quickly review these stack details. The cluster port is 8182. The DB instance is t3.medium. The EC2 client instance is t3.medium. Then this is the key pair that we’re going to use. The notebook instance type is ml.t2.medium. All right, so that looks good. So we simply acknowledge here and continue. And this is going to set up all the resources that we need to use this Neptune cluster. All right? And this is going to take a while, so I’m going to pause here and come back once this stack is created. All right, now we can see that the CloudFormation stack is complete.
So let’s go back to the Neptune console. And if you click on databases, we can see that our Neptune cluster is available, right? And if you navigate to the notebooks option from the left side menu, you can see that the notebook is available too. And this CloudFormation template also created an EC2 instance. Let’s go over to the EC2 console and click on running instances. And here we can see that an instance is available. So we can use this instance to connect to our Neptune cluster, and we can also use this notebook to connect to this cluster. But we’ll do that in another demo. All right, so let’s connect to this cluster using our EC2 instance.
So click on the Connect link here. And of course, you can connect from your computer using the key pair that you provided when you created the EC2 instance. But I’m going to use the EC2 Instance Connect option here to connect to this instance. Okay, so just click on Connect. This instance is going to act as a client to help us talk to our Neptune cluster. And this instance has been configured for Gremlin as well as for SPARQL. Let’s do an ls, and we can see that both the Gremlin as well as the RDF4J, or the SPARQL, interface is available here. Okay, so for this demo, I’m going to use Gremlin. So let’s cd into it. And to start Gremlin, type in bin/gremlin.sh. All right, now we have to set up a remote connection to our Neptune cluster.
So we do that with :remote connect tinkerpop.server, and we provide a configuration file, neptune-remote.yaml, and hit Enter. And now we can connect to our Neptune cluster. So type :remote console to get to the console. All right, so now we can fire any of the Gremlin queries and they will be executed on our Neptune cluster. Let’s look at what data we have there. Of course, this is a new cluster, so there is no data in there. So let’s add something. Let’s say I add a person. We do that using addV, the add-vertex step. This is a label, and we add a property to it. The ID property is mandatory, and if you don’t provide it, then Gremlin is going to add an arbitrary value there. And so I’m going to add my name here.
So I’m adding a person with ID Riyaz, and I’m going to add a custom property called name, and the name is Riyaz. All right, there we go. Similarly, we can add one more person, let’s say Stefan. All right? And now we can add a relationship between these two persons. So what we can do is reference the first person, that is Riyaz, and add an edge. Okay? Friend, to. So Riyaz is a friend to Stefan, and we also add the ID to this edge, okay? So it’s easy for us to read. And remember, when you use id, there are no quotes. And the ID could be, let’s say, edge one. Looks like I made some mistake. Yeah, I missed the quotes.
Okay, all right, so let’s try that again. So this is the vertex for Riyaz, and I’m going to add an edge called friend, to Stefan. Okay, so I’m referencing Riyaz, and I’m adding an edge called friend from Riyaz to Stefan. And then I’m going to add an ID property to this edge, okay? And the ID could be edge one, for example. So this looks good. Enter. So this edge is created. And similarly, I can create another edge from Stefan to Riyaz. I’m simply going to change the names. I’m going to add an edge from Stefan to Riyaz, and the ID should be different, so let’s say edge two, and Enter. So we have the second edge created. Now, if we want to look at all the vertices, we can say g.V(). It’s going to show us all the vertices in the database.
And if we want to look at all the edges, then we can do g.E(), and these are all the edges. And now, if I want to look at the friends of Riyaz, I can do g.V for Riyaz, then out, and it’s going to follow all the edges labeled friend going out from Riyaz. Okay? And it gives us Stefan, which is correct. And similarly, we can do the same thing if we want to find out Stefan’s friends, and it’s going to give us Riyaz. Now, we just added two records here, so this is all we can do with that. So what I’m going to do is show you how to bulk load data from an S3 bucket into your Neptune cluster, and then we can do a little more with it. Okay? So let’s do that in the next lecture. All right?
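For reference, the console steps in this demo can be written down as Gremlin query strings. The Python sketch below only collects those strings and wraps each one in a JSON body of the form {"gremlin": "..."}, which is the shape generally used when a query is posted to the slash gremlin endpoint instead of typed into the console. Nothing here talks to a cluster. The vertex and edge IDs ('riyaz', 'stefan', 'edge1', 'edge2') are assumptions, since the recording does not show the exact values typed, and the use of `__.V(...)` inside `to(...)` is one common way to point at the target vertex rather than a transcription of the demo.

```python
import json

# Reconstruction of the demo steps as Gremlin strings. IDs are assumed values.
DEMO_QUERIES = [
    # Add two person vertices; `id` is unquoted, as noted in the lecture.
    "g.addV('person').property(id, 'riyaz').property('name', 'Riyaz')",
    "g.addV('person').property(id, 'stefan').property('name', 'Stefan')",
    # Add a 'friend' edge in each direction, each with its own edge ID.
    "g.V('riyaz').addE('friend').to(__.V('stefan')).property(id, 'edge1')",
    "g.V('stefan').addE('friend').to(__.V('riyaz')).property(id, 'edge2')",
    # List all vertices, then all edges.
    "g.V()",
    "g.E()",
    # Friends of Riyaz: follow outgoing 'friend' edges.
    "g.V('riyaz').out('friend')",
]


def gremlin_payload(query):
    """Wrap a Gremlin query string in a JSON request body."""
    return json.dumps({"gremlin": query})


for query in DEMO_QUERIES:
    print(gremlin_payload(query))
```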
4. Bulk loading graph data into Neptune
Now, bulk loading data into Neptune. For this purpose, we use the loader endpoint. We simply perform an HTTP POST request to the loader endpoint to load data into Neptune. So if you want to load data from S3, you simply do an HTTP POST request to the loader endpoint, that is, the slash loader endpoint, and provide your request data in JSON format. And this JSON data contains the path to your data file sitting in S3. S3 data can be accessed using an S3 VPC endpoint, which allows access to S3 resources from your VPC. And your Neptune cluster must have an IAM role that allows it to read data from S3. All right, so you simply use the S3 VPC endpoint to allow communication between your Neptune cluster and S3.
And you can create this S3 VPC endpoint using the VPC management console. And remember that your S3 bucket must be in the same region as your Neptune cluster for you to be able to load data from that S3 bucket into your Neptune cluster. Different load data formats are supported. For example, if you’re using Gremlin, then it supports the CSV data format. And for SPARQL, we have different formats supported, like N-Triples, N-Quads, RDF/XML, and Turtle. And all these files have to be UTF-8 encoded. And you can definitely load multiple files in a single job. We’re going to do just that in the next demo. So let’s go ahead into a demo and see how to load graph data into Neptune from your S3 bucket.
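As a sketch of what the JSON request data for the loader endpoint can look like, the Python below builds a request body with the source path, the format, the IAM role, and the region. The field names (source, format, iamRoleArn, region, failOnError) and the format values follow the loader request as I understand it from the Neptune documentation, so treat them as something to verify. The bucket name, account ID, and role name are placeholders, and the helper name is made up for this sketch.

```python
import json

# Format values for the load formats mentioned in the lecture.
SUPPORTED_FORMATS = {"csv", "ntriples", "nquads", "rdfxml", "turtle"}


def loader_request(source, iam_role_arn, region, data_format="csv"):
    """Build the JSON body for an HTTP POST to the /loader endpoint."""
    if data_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported load format: {data_format}")
    if not source.startswith("s3://"):
        raise ValueError("source must be an S3 URI such as s3://bucket/folder")
    body = {
        "source": source,            # a single file or a folder prefix in S3
        "format": data_format,
        "iamRoleArn": iam_role_arn,  # role that lets Neptune read from S3
        "region": region,            # must match the cluster's region
        "failOnError": "FALSE",
    }
    return json.dumps(body, indent=2)


# Placeholder values, for illustration only.
print(loader_request(
    source="s3://example-bucket/sample-data",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleNeptuneLoadFromS3",
    region="us-west-2",
))
```

Pointing source at a folder rather than a single file is what lets one job pick up several files at once.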
5. Bulk loading graph data into Neptune from S3 – Hands on
Now let’s find out how to load data from S3 into your Neptune cluster. We’re going to load Gremlin data into Neptune. And I have prepared simple CSV files that we can upload to S3 and load into our Neptune cluster. These are the two files. So when you load Gremlin data from S3, you need to create two files: one file with the vertices and one file with the edges. I have dummy data over here, so let’s look at that. I’m going to add Riyaz as a person. So this is the ID, this is the label, and this is the property. So I’m going to add Riyaz as a person, Stefan as a person, Mark as a person, Ram as a person, and Tracy as a person.
It looks like there is a typo here. Tracy as a person, then coffee as a food, tea as a food, cookies as a food, and pizza as a food as well. So I’m going to save this, and remember that the ID is a mandatory attribute. And similarly, in the edges we can see edge one to edge ten. So we’re adding ten edges, and in the edges we provide from, label, and to. So we’re creating an edge from Stefan to Riyaz, and the label is friend. So Stefan is friends with Riyaz, Stefan is friends with Mark, and he’s also friends with Tracy. Now Riyaz is friends with Stefan, and Riyaz is also friends with Ram. Stefan likes coffee, Riyaz likes tea, Ram likes cookies, Mark likes pizza, and Tracy likes tea.
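To give a feel for what the two files can look like, here is a Python sketch that writes the same dummy data as CSV text. The system column headers (~id, ~label, ~from, ~to) follow the Gremlin load format as I understand it from the Neptune documentation. The lowercase vertex IDs, the name:String property column, and the mapping of edge1 to edge10 onto specific relationships are assumptions, since the recording does not show the exact file contents.

```python
import csv
import io

# (id, label, name) for each vertex; IDs are assumed values.
VERTICES = [
    ("riyaz", "person", "Riyaz"),
    ("stefan", "person", "Stefan"),
    ("mark", "person", "Mark"),
    ("ram", "person", "Ram"),
    ("tracy", "person", "Tracy"),
    ("coffee", "food", "Coffee"),
    ("tea", "food", "Tea"),
    ("cookies", "food", "Cookies"),
    ("pizza", "food", "Pizza"),
]

# (from, label, to) for each of the ten relationships in the lecture.
RELATIONSHIPS = [
    ("stefan", "friend", "riyaz"),
    ("stefan", "friend", "mark"),
    ("stefan", "friend", "tracy"),
    ("riyaz", "friend", "stefan"),
    ("riyaz", "friend", "ram"),
    ("stefan", "likes", "coffee"),
    ("riyaz", "likes", "tea"),
    ("ram", "likes", "cookies"),
    ("mark", "likes", "pizza"),
    ("tracy", "likes", "tea"),
]


def to_csv(header, rows):
    """Render a header and rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


vertices_csv = to_csv(["~id", "~label", "name:String"], VERTICES)
edges_csv = to_csv(
    ["~id", "~from", "~to", "~label"],
    [(f"edge{i}", src, dst, label)
     for i, (src, label, dst) in enumerate(RELATIONSHIPS, start=1)],
)

print(vertices_csv)
print(edges_csv)
```

Python strings written to a file with encoding="utf-8" satisfy the UTF-8 requirement for load files.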
All right, so it’s quite intuitive to understand. Now let’s find out how to load this data into our Neptune cluster. So let’s go over to our EC2 instance and connect to it. So ls, and cd into Gremlin. I’m going into the Gremlin interface to delete the existing data so we can start clean. We had added some data in the previous lecture, so I’m going to clean it up before we import data from S3. All right, so let’s go into Gremlin. We first provide the configuration file, and then we can connect to the console. Now, if I look at the available vertices, we can see there are two vertices. So what I’m going to do is drop them.
And now if we look at the vertices, we don’t have any vertices, and if we look at the edges, they are gone as well. So now that we have a clean slate, we can go ahead and load data from S3. All right, so I’m going to clear this. Loading data into Neptune from S3 is very simple. We simply have to do an HTTP POST to the loader endpoint and provide the path to our S3 bucket, and it will load the necessary data for us. All right, let’s quickly go over to the Neptune documentation, and in this Getting Started guide, look for loading Neptune data and the Neptune bulk loader. And we have a loading example here, all right? And if you scroll down, you’ll find a curl command here.
So what I’m going to do is copy it over, and let’s open a notepad, for example, and we can simply use this command. We don’t need this information, all right? So we can update this command. First is the endpoint and port. So let’s go to the Neptune cluster and grab the cluster endpoint. All right, so I’m going to paste it in here, retain the https part, and the Neptune port is 8182. And you can also verify here that the port is 8182. All right, so this is the loader endpoint: your https, followed by your cluster endpoint, port, and loader. And then -d denotes data. And everything after this, from here till here, is JSON data, or metadata for that matter, that specifies the location of our file.
We’re going to create an S3 bucket and add that information here. For now, let’s populate the other properties. The format of our file is CSV, and we also need an IAM role, so we can go to our CloudFormation stack and look for the IAM role. If you go into the resources of the Neptune stack, look for resources here, and here you should find an IAM role. And here it is, the Neptune load from S3 role. Okay, so let’s navigate to this role and copy the role ARN, and back in the notepad, I’m going to paste it in here. All right, the region we are using is us-west-2. So this is the region where we have our cluster, right? And we should have our S3 bucket in the same region as well.
All right, now let’s go ahead to the S3 console, and I’m going to create a bucket. Okay, so let’s say bulk loader. And let me put my name in so it’s unique. All right, so I’m simply going to accept the default settings here and create the bucket. All right. And then I’m simply going to drag the two data files in here, so edges.csv and vertices.csv. All right, so we have these two files here. If you want, you can also put them in a folder. So for example, I can create a folder, let’s say Sample Data, and I can add the files in there instead. So whichever option you prefer. All right. And then in our notepad, we have to specify the path to this bucket.
So the bucket name is Neptune Bulk Loader Riyaz, and the folder name is Sample Data. And if you provide a folder path, then it will take all the files in this folder. Or if you want to specify a single file, then you can specify the entire source, like edges.csv. In that case, you will have to run this command twice. So what I’m going to do is simply pass the folder, so it’s going to take all the files from this folder. All right, this looks good. So we can simply copy this, go to our EC2 instance, paste in the command, and hit Enter. And it gives us a load ID. Okay, so let’s copy that back here. I’m just going to paste it.
And you take this loader URL here, followed by a slash and the load ID. You can get the status of your load by querying this particular URL. So the way you do it is curl -G followed by this URL. Okay? And what I’m going to do now is paste it in and hit Enter, and it shows load completed, so our load is successful. Let’s go into Gremlin and see if we have the data there. All right, so let’s clear this. We are setting up a remote connection to the server by specifying the Neptune configuration file, or the remote configuration file. And once that’s done, we can access the remote console like this. Now, if we look at the vertices, we should get all the vertices that we added.
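The status check can be sketched in Python as well: build the status URL from the loader URL and the load ID, then look for a completed status in the JSON that comes back. No request is sent here. The response shape (payload, then overallStatus, then status, with the value LOAD_COMPLETED) reflects my understanding of the loader's status output and should be verified against a real response. The host, the load ID, and the abbreviated sample response are made-up placeholders.

```python
import json


def load_status_url(loader_url, load_id):
    """Append the load ID to the loader URL, separated by a slash."""
    return f"{loader_url.rstrip('/')}/{load_id}"


def load_completed(status_response_text):
    """Return True if the status JSON reports a completed load."""
    payload = json.loads(status_response_text).get("payload", {})
    return payload.get("overallStatus", {}).get("status") == "LOAD_COMPLETED"


# Placeholder loader URL and load ID, for illustration only.
url = load_status_url("https://host.example:8182/loader", "example-load-id")
print(url)

# Abbreviated, hand-written sample of an assumed response shape.
sample_response = json.dumps({
    "status": "200 OK",
    "payload": {"overallStatus": {"status": "LOAD_COMPLETED"}},
})
print(load_completed(sample_response))
```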
And similarly, here are all the edges that we added. And now we can run different commands here to query data. So this should give us the friends of Riyaz: Stefan and Ram. And now if we query what Riyaz likes, Riyaz likes tea. All right, now let’s see who the friends of Stefan are. So Stefan’s friends are Riyaz, Mark, and Tracy. All right, and what does Stefan like? He likes coffee. Okay, and now let’s find out what Stefan’s friends like. So we first find out who Stefan’s friends are using the out edge, out friend. And then from those friends, we look for the likes. The friends like tea, pizza, and tea. So tea is repeated here. So we can do deduplication, something like this, dedup, and it should give us unique values.
Now, if we wanted to find out which of Stefan’s friends like tea, then we can do something like this. Let’s say, who are Stefan’s friends? These are Stefan’s friends. We capture that in a variable, let’s say friends. So friends is going to give us a list of Stefan’s friends, and then we look at what they like. So this is going to give us what Stefan’s friends like. And then we can look for the likes edge that goes out to the vertex called tea, because we want to find out which of Stefan’s friends like tea. So, has ID tea. This should give us Stefan’s friends who like tea. And we want to select friends. Okay, so we are selecting the variable friends that we created.
And now we can see that Riyaz and Tracy are the ones who like tea. And you can see here, Tracy likes tea and Riyaz likes tea. And the third friend of Stefan is Mark. Mark likes pizza, so he doesn’t like tea. So our answers are correct. So this is how you can query the data. This is how you can build different graphs with interconnected relationships. This is an interesting subject, and you can explore it if you like. From the exam perspective, you really don’t need to know Gremlin or SPARQL. Just knowing that these are the two languages that we use with Neptune is good enough for you from the exam perspective. All right, so with that, let’s close this demo and continue to the next lecture.
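As a recap of the traversals in this demo, here is a small in-memory Python sketch that walks the same ten relationships. It imitates the out, dedup, and select-the-friends steps with plain lists, so it shows the logic of the queries rather than Gremlin itself. The lowercase IDs and the helper names are assumptions. A Gremlin traversal along the lines of g.V('stefan').out('friend').as('friends').out('likes').hasId('tea').select('friends') would be the rough counterpart of the last function.

```python
# (from, label, to) for the ten relationships loaded in the demo.
EDGES = [
    ("stefan", "friend", "riyaz"),
    ("stefan", "friend", "mark"),
    ("stefan", "friend", "tracy"),
    ("riyaz", "friend", "stefan"),
    ("riyaz", "friend", "ram"),
    ("stefan", "likes", "coffee"),
    ("riyaz", "likes", "tea"),
    ("ram", "likes", "cookies"),
    ("mark", "likes", "pizza"),
    ("tracy", "likes", "tea"),
]


def out(sources, label, edges=EDGES):
    """Follow outgoing edges with the given label from each source vertex."""
    return [dst for src in sources
            for (start, lbl, dst) in edges if start == src and lbl == label]


def dedup(values):
    """Drop repeated values while keeping first-seen order."""
    return list(dict.fromkeys(values))


def friends_who_like(person, food):
    """Friends of `person` that have a 'likes' edge pointing at `food`."""
    return [friend for friend in out([person], "friend")
            if food in out([friend], "likes")]


stefans_friends = out(["stefan"], "friend")
print(stefans_friends)
print(out(stefans_friends, "likes"))
print(dedup(out(stefans_friends, "likes")))
print(friends_who_like("stefan", "tea"))
```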