How to cut DWH and Data Lake costs on Amazon Web Services? AWS currently provides about 175 services.
By choosing a splitting-and-specialization approach, Amazon has achieved two things:
- maximum efficiency in solving the customer's problems, at a low price per individual service;
- a high total cost of ownership of the infrastructure, because so many services end up involved. It is practically impossible to estimate accurately what a project will cost when dozens of different services are in play.
Based on 10 years of experience with DWH and BI projects, we have put together an architecture that, in our opinion, is the most efficient and at the same time inexpensive.
What functions should the system perform:
- collecting data from various sources;
- cleansing data;
- enriching data;
- uploading data to the Data Lake;
- loading and building the DWH;
- data mapping;
- machine learning and predictive analytics.
All infrastructure must run in AWS.
To build an analytical pipeline, Amazon suggests using about 30 services. Experience shows you can get by with five. If your goal is not to build a spaceship and impress Elon Musk, the following will be enough:
- on the Amazon side: EC2, ECS, S3, RDS;
- open source solutions: Python, PostgreSQL, Hive, Presto, Apache Superset.
To deploy the open source solutions, we use Amazon's EC2 and ECS services.
ETL: Python, SQL, AWS Glue.
ML: Python.
1) All company data is uploaded to an S3-based Data Lake.
2) From the Data Lake, data is transformed and loaded into the DWH (PostgreSQL) running on AWS RDS.
3) Work with the Data Lake is organized through Hive.
4) The Data Lake and the DWH are united through Presto.
5) BI: Apache Superset, Power BI.
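Steps 1 and 2 above can be sketched in Python. The bucket, path layout, and table names below are hypothetical illustrations, not part of the original architecture; the actual upload would go through boto3 and the DWH load through a PostgreSQL client such as psycopg2.

```python
import datetime


def s3_key(source: str, table: str, day: datetime.date) -> str:
    """Build a date-partitioned Data Lake key for a raw extract (layout is an assumption)."""
    return f"raw/{source}/{table}/dt={day.isoformat()}/part-0.parquet"


def load_to_dwh_sql(schema: str, table: str) -> str:
    """SQL to bulk-load a cleaned CSV extract into the PostgreSQL DWH via COPY."""
    return f"COPY {schema}.{table} FROM STDIN WITH (FORMAT csv, HEADER true)"


# Step 1, uploading to the S3 Data Lake, would then look like:
#   import boto3
#   boto3.client("s3").upload_file(
#       "orders.parquet", "my-datalake",
#       s3_key("crm", "orders", datetime.date.today()))
# Step 2, loading into RDS PostgreSQL, would pass load_to_dwh_sql(...)
# to psycopg2's cursor.copy_expert together with the cleaned file.
```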
PrestoSQL is a distributed SQL query engine with support for multiple connectors. With Presto you can combine different data sources, from classic relational databases to modern HDFS-style repositories. Its engine automatically optimizes queries to reduce load and processing time. Presto can also be queried from Python applications, removing the need to connect to the PostgreSQL databases directly.
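A minimal sketch of querying Presto from Python, joining the DWH (PostgreSQL catalog) with the Data Lake (Hive catalog) in a single statement. The host, user, and table names are hypothetical, and the connection part assumes the presto-python-client package is installed.

```python
# One federated query across both stores: "postgresql" and "hive" are Presto
# catalog names; the schemas and tables are made-up examples.
QUERY = """
SELECT d.customer_id, d.total_spent, l.last_event_ts
FROM postgresql.public.customers AS d
JOIN hive.events.clickstream AS l
  ON d.customer_id = l.customer_id
"""


def run_query(host: str = "presto.internal", port: int = 8080):
    """Execute QUERY against a Presto coordinator (hypothetical host)."""
    import prestodb  # pip install presto-python-client

    conn = prestodb.dbapi.connect(
        host=host, port=port, user="etl", catalog="hive", schema="default"
    )
    cur = conn.cursor()
    cur.execute(QUERY)
    return cur.fetchall()
```

The point of the design is that BI tools and Python jobs talk only to Presto, which routes each half of the join to the right backend.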
Hive Metastore is a technology for creating databases whose tables live on a file system. In particular, it lets you build the DWH and Data Lake on top of S3, which in turn provides effectively unlimited disk space for data storage and quick access to it.
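This is how a Hive table backed by S3 might be declared. The helper below just renders the DDL string; the column list, bucket, and path are hypothetical examples, and `s3a://` assumes the usual Hadoop S3 connector.

```python
def external_table_ddl(db: str, table: str, bucket: str, prefix: str) -> str:
    """Hive DDL for an external Parquet table whose files live in S3
    (columns and partitioning are illustrative assumptions)."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {db}.{table} (\n"
        "  customer_id BIGINT,\n"
        "  event_ts    TIMESTAMP,\n"
        "  payload     STRING\n"
        ")\n"
        "PARTITIONED BY (dt STRING)\n"
        "STORED AS PARQUET\n"
        f"LOCATION 's3a://{bucket}/{prefix}'"
    )
```

Because the table is EXTERNAL, dropping it in Hive leaves the underlying S3 files untouched; only the metadata in the Metastore is removed.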
To ensure security, all communication with the outside world can go through the AWS IAM and KMS services. Thus, the operations that are most expensive in terms of cost run on open source solutions, while Amazon services are responsible for speed and security.
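One concrete way KMS shows up in this pipeline is server-side encryption of Data Lake objects. The helper below builds the keyword arguments for boto3's `put_object`; the bucket and key alias are hypothetical.

```python
def put_object_params(bucket: str, key: str, kms_key_id: str) -> dict:
    """Keyword arguments for boto3 S3 put_object that enforce
    KMS-managed encryption at rest (names are illustrative)."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_id,
    }


# Usage sketch:
#   import boto3
#   boto3.client("s3").put_object(
#       Body=data,
#       **put_object_params("my-datalake", "raw/crm/orders.parquet",
#                           "alias/datalake-key"))
```

Access to both the bucket and the KMS key is then scoped with IAM policies, so the EC2/ECS hosts running the open source stack never hold long-lived credentials.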
This architecture covers most of the tasks of a typical customer.
So, using no more than five Amazon services plus proven open source solutions, you can cut the cost of an analytical pipeline several times over without losing performance or security.
According to our average estimates, the cost of owning and running the AWS infrastructure for a mid-sized company should not exceed $1,000-3,000 per month.
Maintenance and modernization: $2,000-3,500 per month.
Migrating the entire infrastructure takes from 2 to 6 months. We can help you pay less for your DWH on Amazon Web Services. Find out more now!