Designing a multi-tenant data warehousing solution with AWS Redshift
I was recently tasked with presenting a multi-tenant data warehousing solution for a personal project, and the research I did helped improve my understanding of several AWS services. I decided to document my proposed ETL process and solution in this blog.
First, there are three multi-tenant data storage models to choose from:
- pool: shared cluster, shared database, shared schema, with row-level access based on the client_id (see the sketch after this list)
- bridge: shared cluster, shared database, one schema per client
- silo: shared cluster, one database per client
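To make the pool model concrete, here is a minimal sketch of how row-level isolation could look: one shared table keyed by client_id, with a filtered view per client that is the only object each client's users can query. The table, view and group names are hypothetical.

```sql
-- Pool model sketch: one shared table, per-client access via filtered views.
-- analytics.orders, orders_client_1 and client_1_readers are made-up names.
CREATE TABLE analytics.orders (
    client_id   VARCHAR(36) NOT NULL,
    order_id    BIGINT,
    order_total DECIMAL(12,2),
    created_at  TIMESTAMP
);

-- Each client only gets SELECT on a view restricted to their own rows.
CREATE VIEW analytics.orders_client_1 AS
SELECT order_id, order_total, created_at
FROM analytics.orders
WHERE client_id = 'client-1';

GRANT SELECT ON analytics.orders_client_1 TO GROUP client_1_readers;
```

Amazon Redshift also supports row-level security policies, which can achieve similar isolation without maintaining one view per client.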
The silo strategy is more expensive to set up and maintain, as you have to replicate a lot of ETL code for each database, and it can also lead to inefficient use of resources if a client's data volume or technical complexity is not large enough to justify a dedicated database. This approach should be reserved for more complex client requirements where the cost, effort and technical complexity are warranted.
I came across an AWS blog (linked below) that demonstrates how you can use Amazon Redshift data shares to implement multi-tenant patterns. This is an interesting approach because, without data shares, you would have to handle data sharing in your ETL workloads; for example, during the load stage you would have to write each client's data to a different schema, which adds complexity.
Redshift data sharing enables data access across clusters without the need to copy or move data, and at no additional cost. This unlocks several use cases, such as sharing data across developer environments, as well as sharing data beyond organisational borders to external clients' clusters.
This means you only have to maintain one ETL workload to deliver data to the producer Redshift cluster, and then create data shares that grant other clusters access to specific schemas and views. For example, client account 1 has access to data share 1, client account 2 has access to data share 2, and so on.
The data shares can be created using DDL commands, as shown in this AWS blog: https://aws.amazon.com/blogs/big-data/implementing-multi-tenant-patterns-in-amazon-redshift-using-data-sharing/
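As a rough sketch (not lifted from the blog above), the producer-side and consumer-side DDL could look something like the following; the share name, schema name and namespace GUIDs are placeholders.

```sql
-- Producer cluster: create a data share scoped to one client's schema.
CREATE DATASHARE client1_share;

ALTER DATASHARE client1_share ADD SCHEMA client1;
ALTER DATASHARE client1_share ADD ALL TABLES IN SCHEMA client1;

-- Grant the share to the consumer cluster's namespace (GUID is a placeholder).
GRANT USAGE ON DATASHARE client1_share
TO NAMESPACE '00000000-0000-0000-0000-000000000000';

-- Consumer cluster: expose the shared objects as a local database.
CREATE DATABASE client1_shared
FROM DATASHARE client1_share
OF NAMESPACE '11111111-1111-1111-1111-111111111111';
```

Repeating this per client gives each consumer cluster read access to only its own schema, while the ETL keeps writing to a single producer cluster.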
Another reason to use Redshift for multi-tenant data warehousing is the ability to combine Redshift Spectrum with S3, which lets you scale the storage and compute capacity of your solution independently. For example, in your ETL you may choose to store some data in S3 to save on storage costs and take advantage of S3 features like object expiration via lifecycle rules. These files can be partitioned by client id to speed up queries, and queried directly through Redshift Spectrum to create views that are added to data shares and accessed directly by clients. This blog demonstrates how to use Redshift Spectrum to query data in S3.
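Below is a hedged sketch of what that could look like: an external schema backed by the AWS Glue Data Catalog, an external table over S3 partitioned by client_id, and a late-binding view (views over external tables must use WITH NO SCHEMA BINDING) that could then be added to a client's data share. The bucket, IAM role, schema and object names are illustrative only.

```sql
-- External schema backed by the AWS Glue Data Catalog (role ARN is a placeholder).
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'etl_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table over Parquet files in S3, partitioned by client_id.
CREATE EXTERNAL TABLE spectrum.events (
    event_id   BIGINT,
    event_type VARCHAR(64),
    event_ts   TIMESTAMP
)
PARTITIONED BY (client_id VARCHAR(36))
STORED AS PARQUET
LOCATION 's3://my-data-lake/events/';

-- Register each client's partition so queries only scan that client's files.
ALTER TABLE spectrum.events
ADD IF NOT EXISTS PARTITION (client_id = 'client-1')
LOCATION 's3://my-data-lake/events/client_id=client-1/';

-- Late-binding view over the external table; this is the object that gets shared.
CREATE VIEW client1.events_v AS
SELECT event_id, event_type, event_ts
FROM spectrum.events
WHERE client_id = 'client-1'
WITH NO SCHEMA BINDING;
```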
I hope this blog sparks some thoughts on how to implement multi-tenant data warehousing solutions.