Redshift Best Practices & Commands
Redshift Best Practices Presentations
http://www.slideshare.net/AmazonWebServices/amazon-redshift-best-practices
- Use S3 to store the processed data to be loaded. That way, the data can be retained and reloaded easily.
- Use the proper Distribution Key
- If a table is frequently joined or grouped on certain columns, use one of those columns as the dist key. Rows with the same key value are then stored on the same node, so joins can run without redistributing data across the cluster.
- Use the proper Sort Key.
- A timestamp field makes a good sort key if many of your queries filter on time ranges.
- Use two tables: a Staging table to load and verify incoming data, and a Target table, aka Production table. Data can be verified and corrected in staging without impacting the production table.
- Consider the retention policy and use UNLOAD command to archive unused data
- Unload: http://docs.aws.amazon.com/redshift/latest/dg/t_Unloading_tables.html
- Normalization rules:
- For non-colocated joins, use de-normalization
- For slowly changing dimensions, keep normalization and match the dist key with the fact table
- Copy Command:
- Good for loading big data sets
- Some useful options: EMPTYASNULL, BLANKSASNULL, GZIP
- DATEFORMAT/TIMEFORMAT: there must be a space between the date part and the time part.
- Space for Load and Vacuum.
- A load or vacuum needs free space of roughly 2.5 times the table's size to operate.
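Putting the distribution- and sort-key advice above together, a table definition might look like the following sketch (the table and column names are hypothetical):

```sql
-- Hypothetical fact table: distributed on the column used in joins,
-- sorted on the timestamp used by most time-range queries.
create table page_views_staging (
    user_id    bigint    not null,
    page_id    bigint    not null,
    viewed_at  timestamp not null,
    referrer   varchar(256)
)
distkey (user_id)      -- colocates rows that join on user_id
sortkey (viewed_at);   -- speeds up time-based range scans
```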
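The staging/target pattern combined with the COPY options above could be sketched like this (bucket path, IAM role ARN, and table names are placeholders):

```sql
-- Load from S3 into the staging table (path and role are placeholders).
copy page_views_staging
from 's3://my-bucket/processed/page_views/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
gzip
emptyasnull
blanksasnull
timeformat 'YYYY-MM-DD HH:MI:SS';  -- note the space between date and time parts

-- Verify the staged rows (row counts, nulls, value ranges) before promoting.
select count(*) from page_views_staging;

-- Promote verified rows to the production table.
insert into page_views (user_id, page_id, viewed_at, referrer)
select user_id, page_id, viewed_at, referrer
from page_views_staging;

truncate page_views_staging;
```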
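A retention policy using UNLOAD might look like the following sketch, assuming a one-year window and placeholder S3 path and role:

```sql
-- Archive rows older than the retention window to S3, then delete them.
unload ('select * from page_views
         where viewed_at < dateadd(year, -1, current_date)')
to 's3://my-bucket/archive/page_views_'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
gzip;

delete from page_views
where viewed_at < dateadd(year, -1, current_date);
```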
Useful Commands
Space Monitoring
The command to analyze space for a cluster:
select
trim(pgdb.datname) as Database,
trim(pgn.nspname) as Schema,
trim(a.name) as Table,
b.mbytes,
a.rows
from (
select db_id, id, name, sum(rows) as rows
from stv_tbl_perm a
group by db_id, id, name
) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (
select tbl, count(*) as mbytes
from stv_blocklist
group by tbl
) b on a.id = b.tbl
order by mbytes desc, a.db_id, a.name;
Analyze table data distribution across slices:
select slice, col, num_values, minvalue, maxvalue
from svv_diskusage
where name = '__INSERT__TABLE__NAME__HERE__' and col = 0
order by slice, col;
The command to analyze compression
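Redshift's ANALYZE COMPRESSION command samples a table's rows and reports a suggested column encoding for each column; a minimal form (table name is a placeholder):

```sql
analyze compression public.page_views;
```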