Snowflake data cloud

Intro to snowflake

  • company founded in Aug 2012 — while I was at B.
  • Cloud data platform
  • Snowflake region maps 1:1 to AWS region
  • available, scalable, self-managed
  • decoupled tiers

The details of the tiers:

  • storage tier – blob storage of the hosting cloud
    • multi-petabyte table sizes
    • support for semi-structured data
    • some tables of petabyte size can have sub-second access without explicit partitioning
    • ACID operations with fast update/delete/merge operations
  • multi cluster compute layer
    • dedicated compute clusters for workload
    • compute size based on workload – charge per second
    • unlimited scalability / elasticity without contention
  • cloud services tier
    • interface to external clients
    • optimization
    • management
    • transactions
    • security and governance
    • metadata
    • sharing and collaboration – without scale limitations – some installations serve ~100 million queries every day

All public clouds supported:

  • 2020 GCP
  • 2018 Azure
  • 2015 AWS
  • cloud-agnostic layer
  • avoid cloud lock-in

The Snowflake cloud platform:

  • snowflake regions can interact with each other
  • global data mesh
  • org admin – snowflake cloud ==> create account …
  • database replication

global data sharing:

  • local data sharing
  • migrate account to different cloud regions
  • business continuity – full failover of data and client connections
  • create database … as replica of …
  • alter database … refresh
  • speed depends on regions – ~20 TB/hour
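The replication commands above can be sketched as follows (database, organization, and account names are hypothetical):

```sql
-- on the source account: enable replication of a database to another account (hypothetical names)
alter database sales_db enable replication to accounts myorg.account_eu;

-- on the target account: create a local replica of the primary database
create database sales_db_replica as replica of myorg.account_us.sales_db;

-- periodically refresh the replica with changes from the primary
alter database sales_db_replica refresh;
```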

snowflake data marketplace:

  • data listings
  • provider and consumer can be in different cloud/regions
  • across cloud and region boundaries

supports:

  • data engineering
  • data lake
  • data warehouse
  • data science
  • data applications
  • data exchange

snowsight:

  • query database
  • visualization
  • share queries – granular permissions

performance:

  • compilation time reduced
  • Join performance

cloud query:

  • transparent materialized view — snowflake automatically uses materialized views
  • additional sizes of compute clusters
  • massive concurrency
  • new 5XL and 6XL compute clusters – each double the size of the previous
  • search optimization service – no tuning needed – can be enabled at the table level – for queries that would otherwise need full table scans (equality comparisons) – records metadata about where data lives in the table's micro-partitions
  • point lookup queries – sped up by an order of magnitude – need fewer compute resources
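The search optimization service is enabled per table; a minimal sketch, assuming a hypothetical table and key column:

```sql
-- enable search optimization on a table (hypothetical name); no further tuning needed
alter table sales_db.public.orders add search optimization;

-- point lookups like this can then be served from the search access path
-- instead of a full table scan
select * from sales_db.public.orders where order_id = 'ORD-12345';
```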

sql engine: expressive power —

  • changes clause – access change history of table
  • qualify clause
  • explain, join json
  • like all
  • like any
  • like any <multi row>
  • top <n>
  • booland_agg
  • mode
  • kurtosis
  • skew
  • conditional_true_event
  • conditional_change_event
  • soundex
  • array_intersection
  • ratio_to_report
  • get_ignore_case
  • recursive views
  • SQL stored procedures
  • geospatial data in variant / geography data type
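As a sketch of two of the features above (table and column names are hypothetical): QUALIFY filters on window-function results without a subquery, and LIKE ANY matches a value against several patterns at once.

```sql
-- latest order per customer, filtered directly on a window function via QUALIFY
select customer_id, order_id, order_ts
from orders
qualify row_number() over (partition by customer_id order by order_ts desc) = 1;

-- LIKE ANY: keep rows matching any of several patterns
select email
from customers
where email like any ('%@gmail.com', '%@outlook.com');
```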

data governance:

  • know your data
  • manage data
  • who uses what data
  • dependencies between data
  • snowflake enforces policy
  • dynamic data masking policy – unredacted / partial / full
  • ingest data as tokenized — via Protegrity (3rd party)

dynamic data masking:

  • use role policyadmin;
  • create or replace masking policy <abc> as
  • can use regex expressions
  • newly cloned tables also inherit the policy
  • can be applied to external tables (e.g., an IP table)
  • tasks can be scheduled
  • a partitioning expression can be added to the copy command – for data saved to a cloud environment
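The partitioning expression mentioned above applies when unloading with COPY INTO a stage; a sketch with hypothetical stage, table, and column names:

```sql
-- unload a table to a cloud stage, partitioning output files by date (hypothetical names)
copy into @my_stage/orders/
from orders
partition by ('date=' || to_varchar(order_ts, 'YYYY-MM-DD'))
file_format = (type = parquet);
```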

external functions

  • definition in Snowflake – implementation outside
  • can be used in queries / data transformations
  • user-defined functions can be created in Java – Python later
  • data-frame programming model – for Python and Spark users – later on
  • need api integration
  • create a SQL external function using the URL of the endpoint
  • can create a function calling a .jar
  • can be built into data pipeline — with stream and task

salesforce partnership:

  • works easier with salesforce
  • Einstein Analytics – direct SF query
  • Einstein Analytics – outbound connector

data exchange:

  • connect providers to consumers
  • snowflake data marketplace – b2b
  • can create org level data exchange
  • listing can be standard or personalized
  • invited SF accounts can see what is available in the data exchange
  • shared table is linked to a view that filters only rows available to user
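The per-consumer row filtering above is commonly done with a secure view keyed on the consuming account; a sketch, assuming hypothetical tables and a mapping table:

```sql
-- secure view that only exposes rows mapped to the consuming account (hypothetical names)
create or replace secure view share_db.public.orders_filtered as
select o.*
from share_db.public.orders o
join share_db.public.account_map m
  on o.region = m.region
where m.consumer_account = current_account();
```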

-- create external function to call Protegrity for de-tokenizing data

create or replace secure external function masking_policies.tokenization_policies.protegrity_email_unprotect(email string)
returns string
api_integration = 'protegrity_api'
headers = (
  'X-Protection-DCop-Rules' = '{"payload_type":"JSON","jsonpaths":"$.data[*][1]","op_type":"unprotect","data_element":"tok_alpha"}'
)
context_headers = (current_user, current_account)
as 'https://3npspdj0k5.execute-api.us-west-1.amazonaws.com/SF_CUSTOMER';

-- create a masking policy that detokenizes email for PII_ROLE; everyone else gets the tokenized value

create or replace masking policy masking_policies.tokenization_policies.email_detok as (val string)
returns string ->
case
  when invoker_share() is not null then val
  when current_role() in ('PII_ROLE') then masking_policies.tokenization_policies.protegrity_email_unprotect(val)
  else val
end;

-- attach policy to the email column

alter table sf_tuts.public.emp_tokenized
modify column email
set masking policy masking_policies.tokenization_policies.email_detok;

====

show shares like '%emp%';

use role pii_role;

select * from emp_tokenized limit 10;

use role analyst;

=======

this stuff was in: snowflake.com/say-hello-to-the-data-cloud/
