Redshift Best Practices & Commands
Redshift Best Practices Presentations
http://www.slideshare.net/AmazonWebServices/amazon-redshift-best-practices
- Use S3 to store the processed data to be loaded. That way, the data can be retained and reloaded easily.
- Use the proper Distribution Key
- If a table is frequently joined or grouped on certain columns, use one of those columns as the dist key. Rows with the same key value are then stored on the same node, so joins can run without redistributing data across the cluster.
- Use the proper Sort Key.
- A timestamp field makes a good sort key if many of your queries filter on time ranges.
- Use two tables: a Staging table to load and verify incoming data, and a Target table, aka Production table. Data can be verified and corrected in staging without impacting the production table.
- Consider the retention policy and use UNLOAD command to archive unused data
- Unload: http://docs.aws.amazon.com/redshift/latest/dg/t_Unloading_tables.html
- Normalization rules:
- For non-colocated joins, use de-normalization
- For slowly changing dimensions, keep normalization and match the dist key with the fact table
- Copy Command:
- Good for loading big data sets
- Some useful options: EMPTYASNULL, BLANKSASNULL, GZIP
- DATEFORMAT/TIMEFORMAT: there must be a space between the date part and the time part.
- Space for Load and Vacuum.
- A load or vacuum needs free space of roughly 2.5 times the table's size to operate.
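Putting the distribution- and sort-key advice above together, a table definition might look like the following sketch (the table and column names are hypothetical):

```sql
-- Hypothetical fact table: distributed on the column used in joins,
-- sorted on the timestamp used by most time-range queries.
create table page_views_staging (
    user_id    bigint    not null,
    page_id    bigint    not null,
    viewed_at  timestamp not null,
    referrer   varchar(256)
)
distkey (user_id)      -- colocates rows that join on user_id
sortkey (viewed_at);   -- speeds up time-based range scans
```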
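The staging/target pattern combined with the COPY options above could be sketched like this (bucket path, IAM role ARN, and table names are placeholders):

```sql
-- Load from S3 into the staging table (path and role are placeholders).
copy page_views_staging
from 's3://my-bucket/processed/page_views/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
gzip
emptyasnull
blanksasnull
timeformat 'YYYY-MM-DD HH:MI:SS';  -- note the space between date and time parts

-- Verify the staged rows (row counts, nulls, value ranges) before promoting.
select count(*) from page_views_staging;

-- Promote verified rows to the production table.
insert into page_views (user_id, page_id, viewed_at, referrer)
select user_id, page_id, viewed_at, referrer
from page_views_staging;

truncate page_views_staging;
```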
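A retention policy using UNLOAD might look like the following sketch, assuming a one-year window and placeholder S3 path and role:

```sql
-- Archive rows older than the retention window to S3, then delete them.
unload ('select * from page_views
         where viewed_at < dateadd(year, -1, current_date)')
to 's3://my-bucket/archive/page_views_'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
gzip;

delete from page_views
where viewed_at < dateadd(year, -1, current_date);
```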
Useful Commands
Space Monitoring
The command to analyze space for a cluster:
select
trim(pgdb.datname) as Database,
trim(pgn.nspname) as Schema,
trim(a.name) as Table,
b.mbytes,
a.rows
from (
select db_id, id, name, sum(rows) as rows
from stv_tbl_perm a
group by db_id, id, name
) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (
select tbl, count(*) as mbytes
from stv_blocklist
group by tbl
) b on a.id = b.tbl
order by mbytes desc, a.db_id, a.name;
Analyze table data distribution across slices:
select slice, col, num_values, minvalue, maxvalue
from svv_diskusage
where name = '__INSERT__TABLE__NAME__HERE__' and col = 0
order by slice, col;
The command to analyze compression
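Redshift's ANALYZE COMPRESSION command samples a table's rows and reports a suggested column encoding for each column; a minimal form (table name is a placeholder):

```sql
analyze compression public.page_views;
```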