By Roshan Jayalath
Abstract
Our client is a global leader in cloud-based digital marketing technology and one of the giants in Ad-Tech. Being in the ad-tech field means gathering large amounts of data from various sources. To handle this volume our client uses various technologies such as SQL Server, PostgreSQL, DataStax Cassandra, Pivotal Greenplum, Aerospike, MySQL and Amazon Redshift. Each of these database technologies has its own pros and cons – so in this article let’s find out what we do with Amazon Redshift.
Introduction
According to the AWS documentation, Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. It is a column-store database with the ability to do large amounts of parallel processing (called MPP, or massively parallel processing). Launched in February 2013, Amazon Redshift has a large corporate clientele including NTT DOCOMO, Johnson & Johnson, Nokia and even NASDAQ.
Our environment
In production we use 8 Redshift clusters of various scales for different business units. Each Redshift cluster consists of multiple nodes, which are EC2 instances in AWS. The processing power, disk space and I/O performance of a Redshift cluster depend on the type and number of nodes that make up the cluster.
Redshift offers two node families: Dense Storage (ds) node types, optimized for large storage capacity, and Dense Compute (dc) node types, optimized for compute-intensive workloads.
The older node type used at our client is ds1.xlarge, which provides 2TB of storage per node. If you need 32TB of storage, you would need a minimum of 16 ds1.xlarge nodes.
What Redshift should not be used for
- Small data sets – Anything less than 100GB is not worth moving to Amazon Redshift.
- OLTP – Even though Redshift is fantastic for data warehousing workloads, it is not the tool for transactional data.
- Unstructured data – Even though Redshift is a columnar database, it is not an unstructured data store. You need to predefine table schemas and store data accordingly.
- BLOB data – The best place to keep BLOB data is AWS S3 object storage, with the database storing pointers to the files in S3.
Advantages of using Redshift
- Being a column-oriented storage system
Unlike a traditional RDBMS, Redshift groups data by columns rather than by rows. This makes Redshift more I/O efficient for read-only queries, because it only has to read the columns a query accesses. Columnar storage also compresses better than row-based storage: every column is packed into its own set of blocks, so every physical block contains a single data type, and when all the data in a block shares a data type the database can use extremely efficient compression algorithms.
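As a sketch of how this looks in practice, compression encodings can be declared per column at table creation time. The table and column names below are hypothetical; the encodings are standard Redshift options:

```sql
-- Hypothetical events table: each column gets a compression
-- encoding suited to its data type and value distribution.
CREATE TABLE events (
    event_id    BIGINT        ENCODE delta,     -- monotonically increasing IDs
    event_time  TIMESTAMP     ENCODE delta32k,  -- small deltas between rows
    country     CHAR(2)       ENCODE bytedict,  -- low-cardinality codes
    payload     VARCHAR(1000) ENCODE lzo        -- general-purpose text
);
```

Redshift can also suggest encodings for an existing table via ANALYZE COMPRESSION, or choose them automatically during the first COPY into an empty table.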
- Massively Parallel Processing architecture
Amazon Redshift has a massively parallel processing (MPP) data warehouse architecture, parallelizing and distributing SQL operations to take advantage of all available resources. The underlying AWS hardware is designed for high-performance data processing, using locally attached storage to maximize throughput between the CPUs and drives, and a high-speed mesh network to maximize throughput between nodes. Data in Redshift is partitioned between the nodes using a predefined distribution style.
- Even distribution: data is written to Redshift nodes in round-robin fashion. This is the default distribution style.
- Key distribution: rows are distributed between the nodes based on the values in the key column. Rows with matching key values are stored on the same node. This can cause hotspots if the key is not chosen carefully.
- All distribution: each row is stored on every node. This increases storage requirements and is usually a good idea only for slow-moving tables such as lookup tables.
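These three styles map directly onto the DISTSTYLE clause in Redshift DDL. The table and column names below are hypothetical:

```sql
-- EVEN: rows are spread round-robin across the nodes (the default).
CREATE TABLE web_logs (
    log_id BIGINT,
    url    VARCHAR(500)
) DISTSTYLE EVEN;

-- KEY: rows with the same customer_id land on the same node, so
-- joins on customer_id avoid cross-node data movement.
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT
) DISTSTYLE KEY DISTKEY (customer_id);

-- ALL: the full table is copied to every node - suited to small,
-- slow-moving lookup tables.
CREATE TABLE country_codes (
    code CHAR(2),
    name VARCHAR(100)
) DISTSTYLE ALL;
```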
Because of its MPP architecture, Redshift can:
- Query all nodes in parallel – the leader node coordinates the query, while the compute nodes do the individual querying and calculations on each node.
- Load data from sources like raw text files at amazingly fast speeds by processing multiple files in parallel. Files stored in Amazon S3, DynamoDB tables and even any SSH connection can be used as a source. Data load speed scales linearly with the number of nodes.
- Back up and restore quickly and in parallel, since each node can send full snapshots or incremental backups to S3.
- Resize a cluster quickly, since data can move between multiple old and new nodes in parallel.
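A parallel load from S3 is a single COPY statement; Redshift splits the input files across all compute nodes. The bucket, prefix, table name and IAM role below are hypothetical:

```sql
-- COPY distributes the load across every node: pointing it at a
-- prefix containing many gzipped files lets the nodes ingest the
-- files in parallel.
COPY sales
FROM 's3://my-bucket/exports/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP
DELIMITER '|';
```

For best parallelism, AWS recommends splitting the input into roughly a multiple of the number of slices in the cluster rather than one large file.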
However, there are certain limitations with Redshift.
- Functionality and features – Redshift is based on the PostgreSQL database engine and the syntax is similar. However, many features available in the latest versions of PostgreSQL are not available in Redshift.
- Issues with vacuuming – When large amounts of data are deleted in Redshift, you need to vacuum the table to reclaim the free space. The vacuum process slows other operations and can take a long time on very large tables. Also, only one vacuum can run on a Redshift database at a time.
- Concurrent usage – Redshift uses queues to manage concurrency, which limits how many queries it can handle at the same time. Queries in the same queue may wait a long time to start if long-running queries are already in progress.
- Scaling large clusters is time consuming – While a Redshift cluster is being resized, it only supports SELECTs and no data changes are accepted. When a node is added to a very large cluster, the resize can take a long time, keeping the cluster in read-only mode for that duration.
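The vacuuming limitation above typically surfaces in maintenance scripts like the following sketch (the table name is hypothetical):

```sql
-- Reclaim space from deleted rows and re-sort the table. Since
-- only one vacuum can run at a time, large tables are usually
-- vacuumed off-peak, one by one.
VACUUM FULL sales;

-- Refresh planner statistics after the reorganization.
ANALYZE sales;
```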
However, these limitations are not reasons to avoid Redshift. Amazon has been improving the service quickly, as you can see in the diagram below. They have released over 50 new features since its inception.
How we use Redshift at our project
We transfer data from SQL Server into Redshift using the flow below. The reverse is done if we need to run a query on Redshift and feed the results into our main RDBMS, SQL Server.
Microsoft SQL Server > SSIS > Text files compressed as Tar > AWS CLI > S3 > Copy into Redshift
In cases where we need to join SQL Server data with Redshift data and the number of records to be linked is low, we use a linked server made to an ODBC data source using the PostgreSQL psqlODBC driver (9.02.01.00), which works fine with Redshift. However, there is some fine-tuning you have to do in the ODBC driver setup for it to work correctly with Redshift. It is advisable to use the ODBC driver only when you need to read or modify a few records. Redshift’s born talent is ingesting data from S3 files – and it is very good at it.
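On the SQL Server side, such a linked-server join looks roughly like the following T-SQL sketch. The linked server name and both tables are hypothetical:

```sql
-- OPENQUERY pushes the inner aggregate down to Redshift over the
-- psqlODBC linked server, then SQL Server joins the small result
-- set locally against its own table.
SELECT c.customer_name, r.total_spend
FROM dbo.customers AS c
JOIN OPENQUERY(REDSHIFT_LINK,
    'SELECT customer_id, SUM(amount) AS total_spend
     FROM sales
     GROUP BY customer_id'
) AS r
    ON r.customer_id = c.customer_id;
```

Keeping the aggregation inside the OPENQUERY string matters: it ensures only the summarized rows travel over ODBC, which is the slow path in this setup.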