7. Redshift Spectrum
Now, we know that when you load data using the COPY command or any of the other approaches we've discussed so far, the data ends up inside your Redshift cluster. So this data is persistently stored in Redshift, and that's going to incur costs, storage costs to be specific. But there are also ways to query data with Redshift without actually storing it in Redshift. So there are a couple of ways you can query external data with Redshift, for example data stored in S3. You can query the data sitting in S3 without actually storing or copying it into Redshift. So let's look at what options we have. We have two options here.
First is what is called Redshift Spectrum, and the other is Redshift federated query. So let's look at Redshift Spectrum first. Spectrum allows you to query exabytes of data in S3 without even loading it into Redshift. You do still need a Redshift cluster to run Spectrum queries, so you must have a Redshift cluster, and you talk to Redshift, not to Spectrum directly. All your queries still go through Redshift, but Redshift passes them on to Redshift Spectrum, Spectrum processes them, and Redshift returns the results to you. The way it works is you create an external table or an external schema in Athena, Glue, or an Apache Hive metastore, for example.
So you create an external table. The way you do it is with simple SQL statements that you fire through the SQL endpoint of your Redshift cluster, so you don't have to log into Athena, Glue, or the Hive metastore; you do it directly through Redshift. But you do have to create these external schemas, and these schemas are then used by the Spectrum engine to perform your query operations, or your analytical queries, to be specific. These external tables are read-only, so you cannot perform insert, update, or delete operations on them. And that's fine for analytical workloads, because analytical queries are generally read-only queries anyway.
So you simply run your analytical queries just like you would on a local Redshift table, and Redshift Spectrum returns the appropriate data. You can also perform complex join operations across the Spectrum data and your local Redshift tables at the same time. And remember that your S3 bucket and your Redshift cluster must be in the same region in order for you to use Redshift Spectrum on that bucket. All right, so why don't we go into a hands-on session and see how to use Redshift Spectrum? So let's do that.
8. Querying S3 data with Redshift Spectrum – Hands on
In this lecture, let's look at how to query S3 data with Redshift Spectrum. What we're going to do is run a join operation on data from S3 and data sitting in our Redshift cluster. AWS provides us with a sample Redshift data set, and we're going to make use of that for this demo. So let's head over to Google and search for "Spectrum getting started". The first result should take us to the documentation. Here, go to step number three. We already have an IAM role, we have associated the role with our cluster, and this role has access to S3, Athena, and Glue, so we're good to go. So let's head over to step number three. What I'm going to do is copy this command from here into Aginity Pro.
This is our Redshift cluster and the dev database. I'm going to paste the command in, and here I'm going to change this IAM role to our own role: from the IAM console, copy our role ARN and paste it in here. What this command is going to do is create an external schema named spectrum from the data catalog, along with a database named spectrumdb. It's going to create this spectrumdb database in your Athena data catalog and in the Glue catalog. Before we run this query, let's go into the Athena and Glue dashboards in the AWS console. So let's click on Get Started, and here we don't see any databases. Let's go over to Glue, and here as well, we don't have any tables. All right, so now we're going to run this command to create the external schema and the database named spectrumdb.
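The command from the getting-started guide looks roughly like this; the IAM role ARN below is a placeholder, so substitute the ARN of your own role:

    create external schema spectrum
    from data catalog
    database 'spectrumdb'
    iam_role 'arn:aws:iam::123456789012:role/mySpectrumRole' -- placeholder, use your own role ARN
    create external database if not exists;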
All right, so let's run it. And we can see that the external database spectrumdb has been created. Let's go back to the AWS console. If you go to Athena and refresh, we should see our spectrum database, and here we go: our external database spectrumdb is showing up. Now that we have our external database, let's go back to the documentation. Next we have to create an external table within the spectrum schema. So let's do that back in Aginity Pro. I'm going to paste in this command. What it does is create an external table in the spectrum schema with this particular column definition, and it's going to read data from this S3 bucket.
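The command is along these lines; the column list and S3 location follow the AWS TICKIT sample data set, so use the exact path shown in the guide you're following:

    create external table spectrum.sales(
        salesid integer,
        listid integer,
        sellerid integer,
        buyerid integer,
        eventid integer,
        dateid smallint,
        qtysold smallint,
        pricepaid decimal(8,2),
        commission decimal(8,2),
        saletime timestamp)
    row format delimited
    fields terminated by '\t'
    stored as textfile
    location 's3://redshift-downloads/tickit/spectrum/sales/'; -- sample bucket referenced in the AWS guide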
So this is the S3 bucket provided to us by AWS, and it contains the data for the sales table. All right, let's run this query, and the query is successful. Let's go back to AWS Glue and refresh, and here we see our sales table in the spectrumdb database, reading its data from S3. So this is the data catalog entry in Glue, and what you're seeing here is the metadata for the table. And if you go to Athena, here as well we can see the sales table that we just created. All right, so now we are ready to run our queries on the S3 data using Redshift Spectrum. Let's go back to the documentation and go to step number four.
And here we can see some sample queries. We can go ahead and run the first one. It's simply selecting a count from the sales table, so it's going to tell us the number of rows in spectrum.sales. Let's copy that into a new tab and run it, and we can see that this table contains 172,456 records. All right, back to the documentation. It asks us to create the event table, but we already created that table earlier when we loaded data from S3, so we can skip ahead and copy over this particular query into Aginity Pro. What this query does is select the top ten event IDs from spectrum.sales by total sales. So this query gathers data from the spectrum schema as well as from the event table.
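For reference, the two queries look roughly like this (names follow the TICKIT sample schema):

    -- number of rows in the external sales table
    select count(*) from spectrum.sales;

    -- top 10 events by total sales: joins the external spectrum.sales table with the local event table
    select top 10 spectrum.sales.eventid, sum(spectrum.sales.pricepaid)
    from spectrum.sales, event
    where spectrum.sales.eventid = event.eventid
    and spectrum.sales.pricepaid > 30
    group by spectrum.sales.eventid
    order by 2 desc;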
So you can see that this event table is a local table within our Redshift cluster; here you can see we have this event table, and spectrum.sales is the external table we created that fetches data from S3. So let's run this query, and there we go: we have data from the event table and the sum coming in from S3. That's how we run complex queries across different data sets; we just ran an aggregation query on data sitting in S3 joined with data sitting in our Redshift cluster. I hope you found this demo useful. And before we close, let's go over to the Redshift console and delete the cluster so we are not billed for it. From the Actions menu, choose Delete; we don't need a snapshot, and that's about it. So that's about this demo. Let's continue to the next lecture.
9. Redshift federated query
Redshift federated query: this is another way to query data through Redshift without actually loading that data into Redshift. You use federated query to query and analyze data across different databases, data warehouses, and data lakes. Currently this works with Redshift, PostgreSQL on RDS, Aurora PostgreSQL, and S3. And here is an example query referencing S3, Redshift, and PostgreSQL: you create a view that selects different columns from S3 and combines that with a table that's locally available in Redshift and another table from an Aurora PostgreSQL database. So you can run these aggregation queries across different sources of data using Redshift federated query.
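As a rough sketch of how this is set up (every name, endpoint, and ARN below is illustrative, not from the slide): you first register the remote PostgreSQL database as an external schema, and then a view can combine it with Spectrum data and local tables.

    -- register an Aurora PostgreSQL database as an external schema (federated query)
    create external schema apg
    from postgres
    database 'salesdb' schema 'public'
    uri 'my-aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com'
    iam_role 'arn:aws:iam::123456789012:role/myFederatedQueryRole'
    secret_arn 'arn:aws:secretsmanager:us-east-1:123456789012:secret:my-rds-secret';

    -- a view combining S3 data (via Spectrum), a local Redshift table, and the Aurora PostgreSQL table
    create view all_sales as
        select eventid, pricepaid from spectrum.sales                -- S3 via Spectrum
        union all select eventid, pricepaid from public.sales_local  -- local Redshift table
        union all select eventid, pricepaid from apg.sales           -- Aurora PostgreSQL
    with no schema binding;

All right, so that's about it. Let's continue to the next lecture.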
10. Star schema in data warehouses
Now let's talk about the star schema. This is a typical way of storing and organizing data in data warehouses. The star schema has two types of tables: fact tables and dimension tables. Generally you have one fact table and multiple dimension tables, and this creates a kind of star pattern: the fact table at the center, with the different dimension tables connecting to it. The fact table is connected to the dimension tables using foreign key relationships, just like the foreign key relationships you have in relational databases. The dimension tables are generally small, with fewer records but often many fields, while the fact table is a large one. It's recommended that you keep the smaller dimension tables as local Redshift tables, while the larger fact tables can be stored in S3, for example.
And you can use Spectrum to query these larger fact tables. This kind of approach helps you save on data storage costs. So let's look at an example of a star schema. Say we have a sales fact table that contains customer ID, date ID, product ID, and employee ID, which are typically the keys of the different dimension tables. You have a customer dimension table that's connected through the foreign key relationship between the customer ID in the sales fact table and the customer ID in the customer dimension table. Similarly, we can have a date dimension table, a product dimension table, an employee dimension table, and so on. The sales fact table is the central table that holds the foreign keys of the different dimension tables. This is how a star schema is created.
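As a rough illustration (table and column names are made up for this example), the corresponding DDL could look like this; the fact table could just as well live in S3 and be defined as an external table queried through Spectrum:

    -- small dimension table, kept as a local Redshift table
    create table customer_dim (
        customer_id integer primary key,   -- informational only, Redshift does not enforce it
        customer_name varchar(100),
        city varchar(50)
    );

    -- large central fact table holding the foreign keys of the dimension tables
    create table sales_fact (
        customer_id integer references customer_dim(customer_id), -- informational only, not enforced
        date_id integer,
        product_id integer,
        employee_id integer,
        amount decimal(10,2)
    );

All right, let's continue.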
11. Redshift fundamentals
Now let's look at Redshift key constraints. It's important to remember that Redshift does not support indexing; there are no indexes in Redshift, clustered or non-clustered, no indexes at all. What Redshift has instead is the concept of sort keys, so the data is stored on disk in a sorted fashion. Also, primary key, unique key, and foreign key constraints are not enforced; they are for informational purposes only, and the query optimizer uses these constraints to generate a more efficient query plan. Of course, you can always enforce primary key and foreign key relationships through your application, but Redshift by itself does not enforce any of these constraints. All right, now let's look at Redshift table design.
There are a few factors that contribute to an efficient table design: data distribution, sort strategies, and compression. Data distribution defines how the data is spread across the different Redshift nodes. Sort strategies define how the data within a table is sorted, while compression is used to reduce the storage and I/O requirements of your Redshift cluster. Let's look at these three factors one by one. First, Redshift data distribution. As we have already discussed, Redshift data is stored in columns, not in rows like in typical OLTP databases. In other words, the data corresponding to a given field across rows is stored together, and this can be queried efficiently.
When data is stored in columns, it's easy for Redshift to run aggregation operations like summing all the values in a column or averaging the values in a particular column. All these operations are faster and more efficient because the data is stored in columns. And data distribution refers to how this data is distributed across the different nodes in your Redshift cluster. When you create your Redshift tables, you specify a distribution style, and there are a few distribution styles you can use: EVEN, ALL, KEY, and AUTO. When you use EVEN, the data is spread evenly across all the nodes; this is the default option and gives you decent performance.
Then we have the ALL distribution style; if you use it, a full copy of the table's data is placed on every node. This is good for smaller dimension tables or frequently used tables that are heavily joined, but for larger tables it's not a good option. Then there is the KEY distribution style; with it, rows with the same DISTKEY column value are placed on the same node (DISTKEY is the distribution key). And if you use the AUTO distribution style, Redshift initially assigns ALL to a small table, and as the table grows in size, it spreads the data evenly across all the nodes using the EVEN distribution style.
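A quick sketch of how these look in practice (the table names are just for illustration):

    -- KEY distribution: rows with the same customer_id land on the same node
    create table orders (
        order_id integer,
        customer_id integer,
        amount decimal(10,2)
    )
    diststyle key
    distkey (customer_id);

    -- ALL distribution: a full copy of the table on every node, good for small, heavily joined tables
    create table regions (
        region_id integer,
        region_name varchar(50)
    )
    diststyle all;

    -- EVEN spreads rows round-robin across the nodes; AUTO lets Redshift choose and adjust over time
    create table clickstream (event_id integer, payload varchar(256)) diststyle even;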
So this is how Redshift manages your data distribution. Now let's look at sort styles; these define how your data is sorted. You can have, for example, a single-column sort key, like the department column in this example. You can also have a compound sort key, which means more than one column in the sort key: for example, department and location together, and this becomes a compound sort key. A compound sort key is hierarchical, so the order of the columns in the sort key matters; department comes first and location comes after that, and this order is always maintained in a compound sort key. And then we have what is called an interleaved sort key.
This is similar to the compound sort key, but it gives equal weight to each column, or subset of columns, in your sort key. In effect, you can have multiple sort key combinations: if we have an interleaved sort key on department plus location, then you can also use location plus department. Each column gets the same weight, so there is no hierarchy; you can query by department and location, or by location and department. And remember, whichever sort style you use, it must be defined at the time you create your table.
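For example (column names are illustrative, and the sort key is part of the CREATE TABLE statement):

    -- single-column sort key
    create table emp_single (employee_id integer, department varchar(50), location varchar(50))
    sortkey (department);

    -- compound sort key: hierarchical, the column order matters
    create table emp_compound (employee_id integer, department varchar(50), location varchar(50))
    compound sortkey (department, location);

    -- interleaved sort key: each column gets equal weight, no hierarchy
    create table emp_interleaved (employee_id integer, department varchar(50), location varchar(50))
    interleaved sortkey (department, location);

All right, let's continue.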
Now let's look at the vacuum operation on your tables. So what exactly is this vacuum operation? We already know that data in Redshift is stored in a sorted fashion. So when you delete rows from a table or add new rows to it, performance can degrade over time, because Redshift has to spend time sorting the newly added data, since the data is stored sorted. And this is where the vacuum operation comes into play. So let's say you have this Redshift table: some of the rows are marked for deletion, and what you see in green and yellow are the unsorted records that got added to the table.
What vacuum does is re-sort the rows in one or all of the tables in your database. When a vacuum is performed, the unsorted records get sorted and the records marked for deletion are actually removed. So, in effect, you reclaim the space occupied by the deleted records, and the data gets sorted. Redshift automatically runs these vacuum operations in the background, so you don't have to do it yourself, but you can definitely run one manually if you need to.
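If you do want to run it manually, the command looks like this (the table name is just an example):

    vacuum full sales;        -- re-sort the rows and reclaim space from deleted rows (the default behavior)
    vacuum sort only sales;   -- only re-sort, without reclaiming space
    vacuum delete only sales; -- only reclaim space from deleted rows, without re-sorting
    vacuum reindex sales;     -- re-analyze and re-sort tables that use interleaved sort keys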
All right, now let's look at the third factor, Redshift compression. We already know that Redshift stores data in columns, and compression is a column-level operation that helps reduce the size of your stored data. So let's say we have three columns: in a typical relational database you might have the data laid out like this, but in Redshift it might look like this, and different columns can use different compression to reduce the data size. Compression helps reduce disk I/O and also improves query performance because the data is smaller. Compression, in effect, is simply the encoding you use to store your data, and each column can be compressed separately with a different encoding. This can be done manually or automatically, and when you use the COPY command, it applies compression by default. There are around eight to ten different encodings you can use with Redshift.
For example, we have RAW encoding, which means no compression at all, so the data has the largest size; this is the default for sort key columns. And then there's LZO, which provides high compression with good performance and is the default encoding for columns that are not sort keys. And, as I mentioned, there are around eight to ten encodings you can use with Redshift. Lastly, it's important to note that you cannot change the encodings after creating the table; you have to specify the encodings when you create your Redshift tables.
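As a sketch (column names and encoding choices are illustrative), encodings are specified per column when the table is created, and you can also ask Redshift to recommend encodings for existing data:

    create table page_views (
        view_id integer encode raw,        -- no compression (the default for sort key columns)
        page_url varchar(256) encode lzo,  -- high compression with good performance
        view_count integer encode az64     -- one of the other available encodings
    )
    sortkey (view_id);

    -- ask Redshift to analyze a sample of the data and recommend encodings
    analyze compression page_views;

All right, let's continue.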
12. Redshift Workload Management (WLM)
Now let's talk about WLM, or Redshift workload management. WLM is used for prioritizing your workloads. When you have long-running queries as well as short-running ones, the long-running queries might impact the performance of the short-running ones. This is where WLM comes into the picture: it allows you to run your long-running and short-running queries separately, which helps improve the performance of your short-running queries, or in other words, it prevents long-running queries from impacting the short-running ones. There are two modes in WLM: automatic WLM and manual WLM. What WLM does is create different queues for different types of queries.
So you have a queue for short-running queries and a queue for long-running queries. You can also have a queue for the superuser and a queue for your analytical users: analysts get a queue, the superuser or administrator gets a queue, long-running queries get a queue, and short-running queries get a queue. This allows you to prioritize your workloads so one doesn't affect the other. In addition, the automatic WLM mode also supports what are called queue priorities. The queue priorities supported with automatic WLM are Lowest, Low, Normal, High, Highest, and Critical.
Now, Critical is reserved for superusers, and the rest of the queue priorities are available for everyone to use with automatic WLM. In addition, we also have something called SQA, or short query acceleration, which allows you to prioritize selected short queries without having to create a dedicated queue for them. So you can use short query acceleration if you do not want to create a dedicated queue for your short-running queries. All right, let's continue. Now, modifying the WLM configuration: the WLM configuration is part of your cluster's parameter group, and switching between the manual and automatic WLM modes requires a cluster reboot.
So if you have manual mode enabled and you switch to automatic mode, or vice versa, you have to reboot your cluster. And as I mentioned, WLM uses parameter groups for its configuration, and the configuration properties are either dynamic or static. As usual, dynamic properties do not require a cluster reboot when you change them, but static property changes do require a cluster reboot. So if you change a static property, you must reboot the cluster for that change to take effect. All right, let's continue.