Thursday, April 27, 2023

AMAZON REDSHIFT

  1. Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and existing Business Intelligence (BI) tools.
  2. Amazon Redshift is a clustered, petabyte-scale, SQL-based data warehouse used for analytics applications.
  3. Amazon Redshift is an Online Analytical Processing (OLAP) database that can run complex analytic queries against petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution.
  4. Amazon Redshift is also ideal for processing large amounts of data for business intelligence.
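For example, here is a minimal analytic query of the kind Redshift is built for; the sales table and its columns are illustrative assumptions, not from the source:

-- Hypothetical monthly-revenue rollup over a large fact table.
SELECT DATE_TRUNC('month', sale_date) AS sale_month,
       SUM(amount)                    AS total_revenue
FROM sales
GROUP BY 1
ORDER BY 1;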

The benefits of Amazon Redshift are as follows:

Amazon Redshift is extremely cost-effective compared with other on-premises data warehouse platforms.

Amazon Redshift is PostgreSQL compatible, with JDBC and ODBC drivers available; it is compatible with most Business Intelligence tools out of the box.

Features parallel processing and columnar data stores, which are optimized for complex queries.

Option to query directly from data files on S3 via Amazon Redshift Spectrum.

Amazon Redshift is up to 10x faster than a traditional row-based SQL database.

Amazon Redshift can store huge amounts of data but cannot ingest huge amounts of data in real time.

Amazon Redshift uses columnar data storage:

Data is stored sequentially in columns instead of rows.
Columnar databases are ideal for data warehousing and analytics.
Columnar storage requires fewer I/Os, which greatly enhances performance.

Amazon Redshift provides advanced compression:

Data is stored sequentially in columns, which allows for much better compression and less storage space.

Amazon Redshift automatically selects the compression scheme.

Amazon Redshift provides good query performance and compression.
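As a hedged sketch of this in practice, the ANALYZE COMPRESSION command reports a recommended encoding per column; the table name is an illustrative assumption:

-- Samples the table's data and prints a suggested compression
-- encoding for each column; it does not change the table.
ANALYZE COMPRESSION sales;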

Amazon Redshift provides Massively Parallel Processing (MPP) by distributing data and queries across all nodes.
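A minimal sketch of how distribution and sorting are declared at table creation; the table and key choices are illustrative assumptions, not prescriptive:

CREATE TABLE sales (
  sale_id     BIGINT        NOT NULL,
  customer_id BIGINT        NOT NULL,
  sale_date   DATE          NOT NULL,
  amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)  -- co-locates rows with the same customer on one slice
SORTKEY (sale_date);   -- range filters on date can skip unneeded blocks

With DISTSTYLE KEY, rows are distributed across compute nodes by the DISTKEY column, which is what lets the MPP engine work on a query in parallel.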

Availability and Durability

Amazon Redshift uses replication and continuous backups to enhance availability and improve durability, and can automatically recover from component and node failures.

Only available in one AZ, but you can restore snapshots into another AZ.

Alternatively, you can run data warehouse clusters in multiple AZs by loading data into two Amazon Redshift clusters in separate AZs from the same set of Amazon S3 input files.

Amazon Redshift replicates your data within your data warehouse cluster and continuously backs up your data to Amazon S3.

Amazon Redshift always keeps three copies of your data:

The original.
A replica on compute nodes (within the cluster).
A backup copy on S3.
Amazon Redshift provides continuous/incremental backups:

Multiple copies within a cluster.
Continuous and incremental backups to S3.
Continuous and incremental backups across regions.
Streaming restore.

Amazon Redshift provides fault tolerance for the following failures:

Disk failures.
Node failures.
Network failures.
AZ/region-level disasters.

For node failures, the data warehouse cluster is unavailable for queries and updates until a replacement node is provisioned and added to the cluster.

High availability for Amazon Redshift:

Currently, Amazon Redshift does not support Multi-AZ deployments.
The best HA option is to use a multi-node cluster, which supports data replication and node recovery.
A single-node Amazon Redshift cluster does not support data replication, and you’ll have to restore from a snapshot on S3 if a drive fails.
Amazon Redshift can asynchronously replicate your snapshots to S3 in another region for DR.

Scaling requires a period of unavailability of a few minutes (typically during the maintenance window).

During scaling operations, Amazon Redshift moves data in parallel from the compute nodes in your existing data warehouse cluster to the compute nodes in your new cluster.

By default, Amazon Redshift retains backups for 1 day. You can configure this up to 35 days.

If you delete the cluster, you can choose to have a final snapshot taken and retained.

Manual backups are not automatically deleted when you delete a cluster.


Security
You can load encrypted data from S3.

Supports SSL encryption in transit between client applications and the Amazon Redshift data warehouse cluster.

VPC for network isolation.

Encryption for data at rest (AES-256).

Audit logging and AWS CloudTrail integration.

Amazon Redshift takes care of key management, or you can manage your own keys through AWS CloudHSM or AWS KMS.

Charges

Charged for compute node hours: 1 unit per node per hour (compute nodes only; the leader node is not charged).

Backup storage – storage on S3.

Data transfer – no charge for data transfer between Amazon Redshift and S3 within a region, but you may pay charges in other scenarios.

HDD and SSD storage options.

A single node can store up to 160 GB, and clusters can scale up to a petabyte or more.

A multi-node cluster consists of:

Leader node:

Manages client connections and receives queries.
Simple SQL endpoint.
Stores metadata.
Optimizes query plan.
Coordinates query execution.
Compute nodes:

Stores data and performs queries and computations.
Local columnar storage.
Parallel/distributed execution of all queries, loads, backups, restores, resizes.
Up to 128 compute nodes.
Amazon Redshift Spectrum is a feature of Amazon Redshift that enables you to run queries against exabytes of data in Amazon S3, with no loading or ETL required.


Use Cases of Amazon Redshift

A data warehouse for enterprise operations: Many organizations work with data from multiple sources, such as advertising, customer relationship management, and customer support.

As a centralized repository, Redshift can be used to store data from multiple sources in a unified schema and structure. This can then feed enterprise-wide reporting and analytics.

In business intelligence and analytics, Redshift’s fast query execution against terabyte-scale data makes it an excellent selection. BI tools such as Tableau often use Redshift as the underlying database, since general-purpose databases would otherwise struggle with queries and joins over large datasets.

Organizations may choose to monetize their data by exposing it to their customers through embedded analytics and analytics as a service. In these scenarios, Redshift’s data sharing, search, and aggregation capabilities make it ideal, as it allows customers to access only relevant subsets of data while keeping other databases, tables, or rows confidential.

As long as the cluster is adequately resourced, Redshift’s performance is consistent and predictable. It is therefore a popular choice for data-driven applications, such as reporting and calculations.

Database migration and change data capture: AWS Database Migration Service (DMS) can be used to replicate changes in an operational data store into Amazon Redshift. This is typically done to provide more flexibility in analysis, or when migrating from legacy data warehouses.

AutoScaling in Databases

Autoscaling in OCI is about adjusting the number of nodes (instances) in an instance pool.

AutoScaling Features 

  • Uses performance metrics and their configured thresholds to decide whether or not to scale a pool;
  • When the metrics exceed the threshold, a scale-out occurs, that is, the number of instances in the pool increases;
  • If your application is behind a Load Balancer, the new instance can be attached automatically to the Load Balancer(s) configured in OCI;
  • The AutoScaling configuration can be based on metrics (as mentioned above) or on a schedule, where you create policies defining when the pool will grow and when it will shrink;
  • The instance pool must use an Instance Configuration, which guarantees standardization in the creation of instances, always using the same settings.

That said, AutoScaling for the Autonomous Database is quite different, though it shares the common objective of increasing the computational capacity of the resource being used, in this case the Autonomous Database.

Some features about AutoScaling of Autonomous Database:

  • Allows you to use up to three times the number of OCPUs originally allocated; the originally allocated amount is called the base number of OCPUs. For example, a database with a base of 4 OCPUs can scale up to 12 OCPUs;
  • If the workload increases, autoscaling increases the number of OCPU cores at that time;
  • If the workload decreases, autoscaling decreases the number of OCPU cores at that time;
  • You are billed for the average number of OCPUs used per hour;
  • It is enabled by default and can be enabled or disabled at any time;
  • The status of AutoScaling (whether it is enabled or not) can be seen on the Autonomous Database details page;
  • The base number of OCPUs allocated to an Autonomous Database is always guaranteed. For Autonomous Databases running on dedicated Exadata infrastructure, the maximum number of OCPUs available to a database depends on the number of OCPUs available on the Exadata infrastructure, and is further limited by the number of free OCPUs not in use by other databases' autoscaling to meet high workload demands.

More details about the maximum amount of available OCPUs can be seen here: 

CPU Scaling Oracle Autonomous Database

Oracle Cloud provides a set of data management services built on self-driving Oracle Autonomous Database technology to deliver automated patching, upgrades, and tuning, including performing all routine database maintenance tasks while the system is running, without human intervention.

When you create an Autonomous Database, you can deploy it to one of two kinds of Exadata infrastructure:

  • Shared, a simple and elastic choice. Oracle autonomously operates all aspects of the database life cycle from database placement to backup and updates.
  • Dedicated on Public Cloud, a private cloud in public cloud choice. A completely dedicated compute, storage, network and database service for only a single tenant, providing for the highest levels of security isolation and governance.
  • Dedicated on Cloud@Customer, Autonomous Database on dedicated infrastructure running on an Exadata Database Machine system in your data center, together with the networking configuration that connects it to Oracle Cloud.

Sunday, April 23, 2023

Script to check the row count of partitions for a table


count_tab.sql

set serverout on size 1000000
set verify off
declare
  sql_stmt  varchar2(1024);
  row_count number;
  -- All partitions of the requested table
  cursor get_tab is
    select table_name, partition_name
    from   dba_tab_partitions
    where  table_owner = upper('&&TABLE_OWNER')
    and    table_name  = upper('&&TABLE_NAME');
begin
  dbms_output.put_line('Checking record counts for table &&TABLE_OWNER..&&TABLE_NAME');
  dbms_output.put_line('....');
  for get_tab_rec in get_tab loop
    begin
      -- Count the rows in each partition via dynamic SQL
      sql_stmt := 'select count(*) from &&TABLE_OWNER..'||get_tab_rec.table_name
               ||' partition ( '||get_tab_rec.partition_name||' )';
      execute immediate sql_stmt into row_count;
      dbms_output.put_line('Table '||rpad(get_tab_rec.table_name
               ||'('||get_tab_rec.partition_name||')',50)
               ||' '||to_char(row_count)||' rows.');
    exception when others then
      dbms_output.put_line('Error counting rows for table '||get_tab_rec.table_name);
    end;
  end loop;
end;
/
set verify on
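A brief usage sketch, assuming the script above is saved as count_tab.sql and the session can read dba_tab_partitions; SCOTT and EMP are illustrative substitution values:

SQL> @count_tab.sql
Enter value for table_owner: SCOTT
Enter value for table_name: EMP
Checking record counts for table SCOTT..EMP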


Monday, April 17, 2023

Querying external data with Redshift

 




Two ways:

  • Redshift Spectrum
  • Redshift Federated Query

Redshift Spectrum
  • Query exabytes of data from S3 without loading it into Redshift
  • Must have a Redshift cluster available to start the query
  • The query is then submitted to thousands of Redshift Spectrum nodes
  • External table structures/schemas can be created in an external data catalog such as Athena, AWS Glue, or an Apache Hive metastore (EMR) – see the sketch after this list
  • These external tables are read-only (insert / update / delete operations are not possible)
  • Redshift cluster and S3 bucket must be in the same region
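As a hedged sketch, once an external schema exists, an external table can be declared directly over files in S3; the schema name, columns, and S3 path are illustrative assumptions:

-- Registers metadata only; the data files stay in S3.
CREATE EXTERNAL TABLE s3.page_views (
  view_time TIMESTAMP,
  user_id   BIGINT,
  page_url  VARCHAR(256)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://my-bucket/page-views/';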

Redshift Federated Query
  • Query and analyze data across different DBs, DWs, and data lakes 
  • Currently works with Redshift, PostgreSQL on RDS, Aurora PostgreSQL and S3

Examples of using a federated query


https://docs.aws.amazon.com/redshift/latest/dg/federated_query_example.html

Example of using a federated query with PostgreSQL

The following example shows how to set up a federated query that references an Amazon Redshift database, an Aurora PostgreSQL database, and Amazon S3. It illustrates how federated queries work; to run it, adapt it to your own environment. For prerequisites, see Getting started with using federated queries to PostgreSQL.

Create an external schema that references an Aurora PostgreSQL database.

CREATE EXTERNAL SCHEMA apg
FROM POSTGRES
DATABASE 'database-1' SCHEMA 'myschema'
URI 'endpoint to aurora hostname'
IAM_ROLE 'arn:aws:iam::123456789012:role/Redshift-SecretsManager-RO'
SECRET_ARN 'arn:aws:secretsmanager:us-west-2:123456789012:secret:federation/test/dataplane-apg-creds-YbVKQw';

Create another external schema that references Amazon S3, which uses Amazon Redshift Spectrum. Also, grant permission to use the schema to public.

CREATE EXTERNAL SCHEMA s3
FROM DATA CATALOG
DATABASE 'default' REGION 'us-west-2'
IAM_ROLE 'arn:aws:iam::123456789012:role/Redshift-S3';

GRANT USAGE ON SCHEMA s3 TO public;

Show the count of rows in the Amazon Redshift table.

SELECT count(*) FROM public.lineitem;

  count
----------
 25075099

Show the count of rows in the Aurora PostgreSQL table.

SELECT count(*) FROM apg.lineitem;

 count
-------
 11760

Show the count of rows in Amazon S3.

SELECT count(*) FROM s3.lineitem_1t_part;

   count
------------
 6144008876

Create a view of the tables from Amazon Redshift, Aurora PostgreSQL, and Amazon S3. This view is used to run your federated query.

CREATE VIEW lineitem_all AS
  SELECT l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity,
         l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus,
         l_shipdate::date, l_commitdate::date, l_receiptdate::date,
         l_shipinstruct, l_shipmode, l_comment
  FROM s3.lineitem_1t_part
  UNION ALL SELECT * FROM public.lineitem
  UNION ALL SELECT * FROM apg.lineitem
WITH NO SCHEMA BINDING;

Show the count of rows in the view lineitem_all with a predicate to limit the results.

SELECT count(*) FROM lineitem_all WHERE l_quantity = 10;

   count
-----------
 123373836

Find out how many sales of one item there were in January of each year.

SELECT extract(year from l_shipdate) AS year,
       extract(month from l_shipdate) AS month,
       count(*) AS orders
FROM   lineitem_all
WHERE  extract(month from l_shipdate) = 1
AND    l_quantity < 2
GROUP BY 1, 2
ORDER BY 1, 2;

 year | month | orders
------+-------+---------
 1992 |     1 |  196019
 1993 |     1 | 1582034
 1994 |     1 | 1583181
 1995 |     1 | 1583919
 1996 |     1 | 1583622
 1997 |     1 | 1586541
 1998 |     1 | 1583198
 2016 |     1 |   15542
 2017 |     1 |   15414
 2018 |     1 |   15527
 2019 |     1 |     151

Amazon Redshift – Loading Data into Redshift



Typically, data from OLTP systems is loaded into Redshift for analytics and BI purposes.

  • Data from OLTP systems can be loaded into S3 and data from S3 can then be loaded into Redshift.
  • Data from Kinesis Firehose can also be loaded in the same way.


COPY command

  • Loads data from files stored in S3 into Redshift
  • Data is stored locally in the Redshift cluster (persistent storage = cost)
  • DynamoDB table data and EMR data can also be loaded using the COPY command (see the sketch after this list)
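A hedged sketch of a COPY from DynamoDB; the table, role, and READRATIO value are illustrative assumptions:

copy users
from 'dynamodb://my-users-table'
iam_role 'arn:aws:iam::0123456789:role/MyRedshiftRole'
readratio 50;  -- use at most 50% of the table's provisioned read throughput

READRATIO caps how much of the DynamoDB table's provisioned read capacity the load consumes, so it does not starve production readers.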


Loading data from S3 with the COPY command



copy users
from 's3://my_bucket/tickit/allusers_pipe.txt'
credentials 'aws_iam_role=arn:aws:iam::0123456789:role/MyRedshiftRole'
delimiter '|'
region 'us-west-2';

  • Create an IAM role.
  • Create your Redshift cluster.
  • Attach the IAM role to the cluster.
  • The cluster can then temporarily assume the IAM role on your behalf.
  • Load data from S3 using the COPY command.


More ways to load data into Redshift


 


  • Use AWS Glue – a fully managed ETL service
  • ETL = Extract, Transform, and Load
  • Use ETL tools from APN partners
  • Use AWS Data Pipeline
  • For migration from on-premises, use:
  1. AWS Import/Export service (AWS Snowball).
  2. AWS Direct Connect (private connection between your data center and AWS).






Amazon Redshift Overview and Architecture

Amazon Redshift – Overview


  • OLAP database (data warehousing solution) based on PostgreSQL
  • OLAP = Online Analytical Processing
  • Can query petabytes of structured and semi-structured data across your data warehouse and your data lake using standard SQL
  • Up to 10x the performance of other data warehouses
  • Columnar storage of data (instead of row-based)
  • Massively Parallel Query Execution (MPP), highly available
  • Has a SQL interface for performing the queries
  • BI tools such as Amazon QuickSight or Tableau integrate with it
  • Data is loaded from S3, Kinesis Firehose, DynamoDB, DMS, ...
  • Can contain from 1 to 128 compute nodes, up to 160 GB per node
  • Can provision multiple nodes, but it’s not Multi-AZ
  • Leader node: for query planning and results aggregation
  • Compute nodes: for performing the queries and sending results to the leader
  • Backup & restore; security via VPC / IAM / KMS; monitoring
  • Redshift Enhanced VPC Routing: COPY / UNLOAD goes through the VPC (see the UNLOAD sketch after this list)
  • Redshift is provisioned, so it’s worth it when you have sustained usage (use Athena instead if the queries are sporadic)
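As a hedged sketch of the UNLOAD side (the bucket prefix and role are illustrative assumptions; with Enhanced VPC Routing enabled this traffic stays inside your VPC):

unload ('select * from users where state = ''CA''')
to 's3://my_bucket/unload/users_ca_'
iam_role 'arn:aws:iam::0123456789:role/MyRedshiftRole'
parallel on;  -- the default: each slice writes its own file(s) to the S3 prefix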

Redshift Architecture

  • Massively parallel columnar database, runs within a VPC
  • Single leader node and multiple compute nodes
  • You can connect to Redshift using any application that supports a JDBC or ODBC driver for PostgreSQL
  • Clients query the leader node using a SQL endpoint
  • A job is distributed across compute nodes
  • Compute nodes partition the job into slices
  • The leader node then aggregates the results and returns them to the client (see the EXPLAIN sketch below)
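A minimal sketch of inspecting that distributed plan; public.lineitem is the table used in the federated query examples above:

-- EXPLAIN prints the leader node's plan; the XN steps it lists
-- are the operations executed in parallel on the compute nodes.
EXPLAIN
SELECT l_returnflag, count(*)
FROM public.lineitem
GROUP BY l_returnflag;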

Redshift node types

  • Dense compute nodes (DC2)
      For compute-intensive DW workloads with local SSD storage.

  • Dense storage nodes (DS2)
      For large DWs, uses hard disk drives (HDDs).

  • RA3 nodes with managed storage
      For large DWs, uses large local SSDs.
      Recommended over DS2.
      Automatically offloads data to S3 if it grows beyond the local SSD capacity.
      Compute and managed storage are billed independently.