Intro to Snowflake
- company founded in Aug 2012 — while I was at B.
- Cloud data platform
- Snowflake region maps 1:1 to AWS region
- available, scalable, self managed
- decoupled tiers
The details of the tiers:
- storage tier – blob storage of the hosting cloud provider
- multi-petabyte table sizes
- support for semi-structured data
- petabyte-scale tables can get sub-second access without explicit partitioning
- ACID operations with fast UPDATE/DELETE/MERGE
- multi-cluster compute layer
- dedicated compute clusters per workload
- compute size based on workload – charged per second
- unlimited scalability / elasticity without contention
- cloud services tier
- interface to external clients
- optimization
- management
- transactions
- security and governance
- metadata
- sharing and collaboration – without scale limitations – some installations run ~100 million queries per day
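The per-workload compute model above can be sketched as plain Snowflake SQL – the warehouse name and settings below are hypothetical, just to show the shape of a dedicated, per-second-billed cluster:

```sql
-- hypothetical warehouse dedicated to an ETL workload
create warehouse etl_wh
  warehouse_size = 'MEDIUM'
  auto_suspend = 60          -- suspend after 60s idle, stopping per-second billing
  auto_resume = true
  initially_suspended = true;

-- point the session at it; queries then run on this dedicated cluster,
-- with no contention against other workloads' warehouses
use warehouse etl_wh;
```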
All public clouds supported:
- 2020 GCP
- 2018 Azure
- 2015 AWS
- cloud-agnostic layer
- avoids cloud lock-in
The Snowflake cloud platform:
- snowflake regions can interact with each other
- global data mesh
- org admin – snowflake cloud ==> create account …
- database replication
global data sharing:
- local data sharing
- migrate account to different cloud regions
- business continuity – full failover of data and client connections
- create database … as replica of …
- alter database … refresh
- replication speed depends on the regions – ~20 TB/hour
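A sketch of the replication commands noted above – the organization, account, and database names are placeholders:

```sql
-- on the source account: allow replication to another account in the organization
alter database sales_db enable replication to accounts myorg.account2;

-- on the target account: create a local replica, then refresh it on demand
create database sales_db_replica as replica of myorg.account1.sales_db;
alter database sales_db_replica refresh;
```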
snowflake data marketplace:
- data listings
- provider and consumer can be in different cloud/regions
- across cloud and region boundaries
supports:
- data engineering
- data lake
- data warehouse
- data science
- data applications
- data exchange
snowsight:
- query database
- visualization
- share queries – granular permissions
performance:
- compilation time reduced
- Join performance
cloud query:
- transparent materialized view — snowflake automatically uses materialized views
- additional sizes of compute clusters
- massive concurrency
- 5XL and 6XL new compute cluster sizes – each double the size of the previous
- search optimization service – no tuning needed – can be enabled at the table level – for queries that would otherwise need full table scans (equality comparisons) – records metadata on where data lives in the table's micro-partitions
- point-lookup queries speed up by an order of magnitude – and need fewer compute resources
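Enabling the search optimization service is a one-line table-level command; the table and column names below are hypothetical:

```sql
-- enable search optimization on a table; no further tuning needed
alter table events add search optimization;

-- equality point lookups like this can then use the maintained
-- micro-partition metadata instead of a full table scan
select *
from events
where session_id = 'abc123';
```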
sql engine: expressive power –
- changes clause – access change history of table
- qualify clause
- explain – output as json
- like all
- like any
- like any <multi row>
- top <n>
- booland_agg
- mode
- kurtosis
- skew
- conditional_true_event
- conditional_change_event
- soundex
- array_intersection
- ratio_to_report
- get_ignore_case
- recursive views
- SQL stored procedures
- geospatial data in variant / geography data types
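One of the listed features, the qualify clause, filters on a window function without a wrapping subquery; the table and columns in this sketch are hypothetical:

```sql
-- keep only the latest order per customer:
-- qualify filters on the window function result directly
select customer_id, order_ts, amount
from orders
qualify row_number() over (partition by customer_id order by order_ts desc) = 1;
```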
data governance:
- know your data
- manage data
- know data
- who uses what data
- dependencies between data
- snowflake enforces policy
- dynamic data masking policy – unredacted / partial / full
- ingest data as tokenized – Protegrity (3rd party)
dynamic data masking:
- use role policyadmin;
- create or replace masking policy <abc> as …
- can use regex expressions
- newly cloned tables also have the policy
- can be applied to external tables (e.g., ip table)
- tasks can be scheduled
- partitioning expressions can be added to the COPY function – for data saved to a cloud environment
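The COPY-with-partitioning point can be sketched as an unload to a cloud stage; the stage, table, and partition expression here are hypothetical:

```sql
-- unload table data to a cloud stage, partitioned by a date expression
copy into @my_stage/exports/
from (select dt, payload from events)
partition by ('dt=' || to_varchar(dt))
file_format = (type = parquet);
```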
external functions:
- definition in Snowflake – implementation outside
- can be used in queries / data transformations
- user-defined functions can be created in Java – Python later
- DataFrame programming model – for Python and Spark users – later on
- need an API integration
- create a SQL external function using the URL of the endpoint
- can create a function calling a .jar
- can be built into data pipelines – with streams and tasks
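The "need an API integration" step looks roughly like this; every name, ARN, and URL below is a placeholder, and the full Protegrity example later in these notes shows a concrete external function:

```sql
-- hypothetical API integration pointing at an AWS API Gateway endpoint
create or replace api integration my_api
  api_provider = aws_api_gateway
  api_aws_role_arn = 'arn:aws:iam::123456789012:role/my-snowflake-role'
  api_allowed_prefixes = ('https://example.execute-api.us-west-1.amazonaws.com/')
  enabled = true;

-- external function: definition in Snowflake, implementation behind the endpoint
create or replace external function my_func(x string)
  returns string
  api_integration = my_api
  as 'https://example.execute-api.us-west-1.amazonaws.com/prod/my_func';
```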
salesforce partnership:
- works more easily with Salesforce
- Einstein Analytics – direct Snowflake query
- Einstein Analytics – outbound connector
data exchange:
- connect providers to consumers
- snowflake data marketplace – b2b
- can create org level data exchange
- listing can be standard or personalized
- invited Snowflake accounts can see what is available in the data exchange
- shared table is linked to a view that filters only the rows available to the user
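The filtered-view sharing pattern can be sketched as a secure view over a consumer-entitlement mapping table; all object names here are hypothetical:

```sql
-- secure view: each consumer account sees only the rows it is entitled to
create or replace secure view shared_orders as
select o.*
from orders o
join account_entitlements e
  on o.region = e.region
where e.consumer_account = current_account();

-- share the view, not the base table
create share orders_share;
grant usage on database sales_db to share orders_share;
grant usage on schema sales_db.public to share orders_share;
grant select on view sales_db.public.shared_orders to share orders_share;
```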
-- create an external function to call Protegrity for de-tokenizing data
create or replace secure external function masking_policies.tokenization_policies.protegrity_email_unprotect(email string)
returns string
api_integration = protegrity_api
headers = (
  'X-Protection-DCop-Rules' = '{\"payload_type\":\"JSON\",\"jsonpaths\":\"$ data[*][1]\",\"op_type\":\"unprotect\",\"data element\":\"tok_alpha\"}'
)
context_headers = (current_user, current_account)
as 'https://3npspdj0k5.execute-api.us-west-1.amazonaws.com/SF_CUSTOMER';
-- create a masking policy that detokenizes email for PII_ROLE; everyone else gets the tokenized value
create or replace masking policy masking_policies.tokenization_policies.email_detok as (val string)
returns string ->
case
  when invoker_share() is not null then val
  when current_role() in ('PII_ROLE') then masking_policies.tokenization_policies.protegrity_email_unprotect(val)
  else val
end;
-- attach the policy to the email column
alter table sf_tuts.public.emp_tokenized
  modify column email
  set masking policy masking_policies.tokenization_policies.email_detok;
====
show shares like '%emp%';
use role pii_role;
select * from emp_tokenized limit 10;
use role analyst;
=======
this stuff was in: snowflake.com/say-hello-to-the-data-cloud/