Intro to Snowflake
- company founded in Aug 2012 — while I was at B.
- Cloud data platform
- Snowflake region maps 1:1 to AWS region
- available, scalable, self managed
- decoupled tiers
The details of the tiers:
- storage tier – blob storage of the hosting cloud provider
- multi-petabyte table sizes
- support for semi-structured data
- petabyte-scale tables can get sub-second access without explicit partitioning
- ACID operations with fast UPDATE/DELETE/MERGE
- multi-cluster compute layer
- dedicated compute clusters per workload
- compute size based on workload – charged per second
- unlimited scalability / elasticity without contention
- cloud services tier
- interface to external clients
- optimization
- management
- transactions
- security and governance
- metadata
- sharing and collaboration – without scale limitations – some installations run ~100 million queries per day
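The per-workload compute model above can be sketched as plain Snowflake SQL – the warehouse name and settings below are hypothetical, just to show the shape of a dedicated, per-second-billed cluster:

```sql
-- hypothetical warehouse dedicated to an ETL workload
create warehouse etl_wh
  warehouse_size = 'MEDIUM'
  auto_suspend = 60          -- suspend after 60s idle, stopping per-second billing
  auto_resume = true
  initially_suspended = true;

-- point the session at it; queries then run on this dedicated cluster,
-- with no contention against other workloads' warehouses
use warehouse etl_wh;
```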
All public clouds supported:
- 2020 GCP
- 2018 Azure
- 2015 AWS
- cloud-agnostic layer
- avoids cloud lock-in
The Snowflake cloud platform:
- snowflake regions can interact with each other
- global data mesh
- org admin – snowflake cloud ==> create account …
- database replication
global data sharing:
- local data sharing
- migrate account to different cloud regions
- business continuity – full failover of data and client connections
- create database … as replica of …
- alter database … refresh
- replication speed depends on the regions – ~20 TB/hour
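A sketch of the replication commands noted above – the organization, account, and database names are placeholders:

```sql
-- on the source account: allow replication to another account in the organization
alter database sales_db enable replication to accounts myorg.account2;

-- on the target account: create a local replica, then refresh it on demand
create database sales_db_replica as replica of myorg.account1.sales_db;
alter database sales_db_replica refresh;
```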
snowflake data marketplace:
- data listings
- provider and consumer can be in different cloud/regions
- across cloud and region boundaries
supports:
- data engineering
- data lake
- data warehouse
- data science
- data applications
- data exchange
snowsight:
- query database
- visualization
- share queries – granular permissions
performance:
- compilation time reduced
- Join performance
cloud query:
- transparent materialized view — snowflake automatically uses materialized views
- additional sizes of compute clusters
- massive concurrency
- 5XL and 6XL new compute cluster sizes – each double the size of the previous
- search optimization service – no tuning needed – can be enabled at the table level – for queries that would otherwise need full table scans (equality comparisons) – records metadata on where data lives in the table's micro-partitions
- point-lookup queries speed up by an order of magnitude – and need fewer compute resources
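Enabling the search optimization service is a one-line table-level command; the table and column names below are hypothetical:

```sql
-- enable search optimization on a table; no further tuning needed
alter table events add search optimization;

-- equality point lookups like this can then use the maintained
-- micro-partition metadata instead of a full table scan
select *
from events
where session_id = 'abc123';
```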
sql engine: expressive power –
- changes clause – access change history of table
- qualify clause
- explain – output as json
- like all
- like any
- like any <multi row>
- top <n>
- booland_agg
- mode
- kurtosis
- skew
- conditional_true_event
- conditional_change_event
- soundex
- array_intersection
- ratio_to_report
- get_ignore_case
- recursive views
- SQL stored procedures
- geospatial data in variant / geography data types
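One of the listed features, the qualify clause, filters on a window function without a wrapping subquery; the table and columns in this sketch are hypothetical:

```sql
-- keep only the latest order per customer:
-- qualify filters on the window function result directly
select customer_id, order_ts, amount
from orders
qualify row_number() over (partition by customer_id order by order_ts desc) = 1;
```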
data governance:
- know your data
- manage data
- know data
- who uses what data
- dependencies between data
- snowflake enforces policy
- dynamic data masking policy – unredacted / partial / full
- ingest data as tokenized – Protegrity (3rd party)
dynamic data masking:
- use role policyadmin;
- create or replace masking policy <abc> as …
- can use regex expressions
- newly cloned tables also have the policy
- can be applied to external tables (e.g., ip table)
- tasks can be scheduled
- partitioning expressions can be added to the COPY function – for data saved to a cloud environment
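The COPY-with-partitioning point can be sketched as an unload to a cloud stage; the stage, table, and partition expression here are hypothetical:

```sql
-- unload table data to a cloud stage, partitioned by a date expression
copy into @my_stage/exports/
from (select dt, payload from events)
partition by ('dt=' || to_varchar(dt))
file_format = (type = parquet);
```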
external functions:
- definition in Snowflake – implementation outside
- can be used in queries / data transformations
- user-defined functions can be created in Java – Python later
- DataFrame programming model – for Python and Spark users – later on
- need an API integration
- create a SQL external function using the URL of the endpoint
- can create a function calling a .jar
- can be built into data pipelines – with streams and tasks
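The "need an API integration" step looks roughly like this; every name, ARN, and URL below is a placeholder, and the full Protegrity example later in these notes shows a concrete external function:

```sql
-- hypothetical API integration pointing at an AWS API Gateway endpoint
create or replace api integration my_api
  api_provider = aws_api_gateway
  api_aws_role_arn = 'arn:aws:iam::123456789012:role/my-snowflake-role'
  api_allowed_prefixes = ('https://example.execute-api.us-west-1.amazonaws.com/')
  enabled = true;

-- external function: definition in Snowflake, implementation behind the endpoint
create or replace external function my_func(x string)
  returns string
  api_integration = my_api
  as 'https://example.execute-api.us-west-1.amazonaws.com/prod/my_func';
```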
salesforce partnership:
- works more easily with Salesforce
- Einstein Analytics – direct Snowflake query
- Einstein Analytics – outbound connector
data exchange:
- connect providers to consumers
- snowflake data marketplace – b2b
- can create org level data exchange
- listing can be standard or personalized
- invited Snowflake accounts can see what is available in the data exchange
- shared table is linked to a view that filters only the rows available to the user
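The filtered-view sharing pattern can be sketched as a secure view over a consumer-entitlement mapping table; all object names here are hypothetical:

```sql
-- secure view: each consumer account sees only the rows it is entitled to
create or replace secure view shared_orders as
select o.*
from orders o
join account_entitlements e
  on o.region = e.region
where e.consumer_account = current_account();

-- share the view, not the base table
create share orders_share;
grant usage on database sales_db to share orders_share;
grant usage on schema sales_db.public to share orders_share;
grant select on view sales_db.public.shared_orders to share orders_share;
```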
-- create an external function to call Protegrity for de-tokenizing data
create or replace secure external function masking_policies.tokenization_policies.protegrity_email_unprotect(email string)
returns string
api_integration = protegrity_api
headers = (
  'X-Protection-DCop-Rules' = '{\"payload_type\":\"JSON\",\"jsonpaths\":\"$ data[*][1]\",\"op_type\":\"unprotect\",\"data element\":\"tok_alpha\"}'
)
context_headers = (current_user, current_account)
as 'https://3npspdj0k5.execute-api.us-west-1.amazonaws.com/SF_CUSTOMER';
-- create a masking policy that detokenizes email for PII_ROLE; everyone else gets the tokenized value
create or replace masking policy masking_policies.tokenization_policies.email_detok as (val string)
returns string ->
case
  when invoker_share() is not null then val
  when current_role() in ('PII_ROLE') then masking_policies.tokenization_policies.protegrity_email_unprotect(val)
  else val
end;
-- attach the policy to the email column
alter table sf_tuts.public.emp_tokenized
  modify column email
  set masking policy masking_policies.tokenization_policies.email_detok;
====
show shares like '%emp%';
use role pii_role;
select * from emp_tokenized limit 10;
use role analyst;
=======
this stuff was in: snowflake.com/say-hello-to-the-data-cloud/