Data Engineering Zoomcamp Part 4
Project and Scope of this Week
Goal of Project: Transforming the data loaded in our Data Warehouse (BigQuery) into Analytical Views by using a transformation tool called dbt.
Prerequisites
By this stage of the course you should already have:
- A running Data Warehouse (BigQuery)
- A set of running pipelines ingesting the project dataset:
- Yellow taxi data - Years 2019 and 2020 [x]
- Green taxi data - Years 2019 and 2020 [x]
- fhv data - Year 2019. [x]
What is Analytics Engineering? And how does it fit in?
The analytics engineer is the role that tries to fill the gap between data engineering and data analysis: it introduces good software engineering practices to the efforts of data analysts and data scientists. The analytics engineer may be exposed to the following tools:
- Data Loading (Stitch…)
- Data Storing (Data Warehouses)
- Data Modeling (dbt, Dataform…)
- Data Presentation (BI tools like Looker, Mode, Tableau…)
Data Modeling Concepts
In this lesson we will cover the T in ETL, which is transformation.
Dimensional Modeling
- Fact and Dimension tables (star schema)
Understanding the architecture of Dimensional Modeling:
- Staging area: contains the raw data, which is not meant to be exposed to everyone
- Processing area: from raw data to data models; focuses on efficiency and ensuring standards
- Presentation area: final presentation of the data, i.e. the exposure to the business stakeholders
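To make the fact/dimension split concrete, here is a hypothetical star-schema query; fact_trips and dim_zones are illustrative names, not tables we have built yet:

```sql
-- Hypothetical star schema: a fact table of trips joined to a zones dimension.
-- Revenue per zone = aggregate a fact measure, grouped by a dimension attribute.
select
    d.zone,
    sum(f.total_amount) as revenue
from fact_trips f
join dim_zones d
    on f.pickup_locationid = d.locationid
group by d.zone;
```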
Introduction to dbt
START OF WEEK 4!!!
Scope of Project (using dbt)
Our project will have trip data (which we’ve loaded into BigQuery). We’ll add a CSV file with the taxi lookup data. We’ll use dbt to transform our data in BigQuery, and then express our data within a dashboard.
dbt has 2 main components: dbt Core and dbt Cloud:
dbt Core: open-source project that allows data transformation. Builds and runs a dbt project (.sql and .yaml files). Includes SQL compilation logic, macros and database adapters. Includes a CLI interface to run dbt commands locally. Open-source and free to use.
dbt Cloud: SaaS application to develop and manage dbt projects. Web-based IDE to develop, run and test a dbt project. Jobs orchestration. Logging and alerting. Integrated documentation. Free for individuals (one developer seat).
For integration with BigQuery we will use the dbt Cloud IDE, so a local installation of dbt core isn’t required.
What is dbt?
dbt is a transformation tool: it transforms the data already sitting in our warehouse using SQL models.
Models in dbt
What are Models in dbt?
Models are snippets of SQL code that help shape data into a format that will be ready for reporting (Analytics/Business Intelligence)
- Models live in the Models directory of a dbt project
- Each Model has a One-to-One relationship with a table or view in the database
- You can run all the models with the command below; dbt will decide which models to build and in what order:
dbt run
At the top of our Model file we can include a Configuration Block that allows us to tell dbt to build either a Table or a View. An example of the code snippet block to config for a table is shown below:
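```sql
-- Configuration block at the top of a model file;
-- materialized can be 'table' or 'view' (among others)
{{ config(materialized='table') }}
```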
If you want to run a specific model you can do so by running:
dbt run -m specific_model
After running the Model you can check your Data Warehouse (BigQuery, Snowflake, etc.) and see that dbt built the model.
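For instance, a quick sanity check in BigQuery might look like this (the project and dataset names below are placeholders for your own dbt target schema):

```sql
-- Placeholder identifiers: swap in your GCP project and dbt development dataset
select *
from `your-project.dbt_dev.specific_model`
limit 10;
```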
Naming Conventions for Models
Modularity: build part-by-part and assemble together. Think of building a car.
Sources: the raw data that has been loaded (not Models); a way of referencing the raw tables.
Staging: clean and standardize the data (one-to-one with the source tables).
Intermediate Models: exist somewhere between staging and final models; they should always reference staging models (see the sketch below).
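For example, a downstream model builds on a staging model through dbt’s ref function rather than hardcoding the table name; a minimal sketch, using the staging model we create later in this project:

```sql
-- Hypothetical core/intermediate model: reference the staging model via ref()
select *
from {{ ref('stg_green_tripdata') }}
```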
Note: with the view docs button at the top left of the project, dbt will generate documentation for us.
Sources
What are Sources in dbt?
Sources in dbt allow you to document the raw tables that have been brought in by EL (Extract, Load) tools. You configure the source once in a .yml file and then reference it, a lot like the ref function we used with models.
For example, if the location or naming of raw tables changes, you would otherwise need to update references across multiple files.
Source Function Example: calling the source function creates a direct reference to the source defined in that .yml file. A minimal sketch (assuming a source named stripe with a payments table, consistent with the compiled output below):
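```sql
select * from {{ source('stripe', 'payments') }}
```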
When we compile our code, the source() call above compiles to the actual location of the raw table:

```
raw.stripe.payments
```
So Sources are useful because when the loading process changes you can update your dbt project in one place and everything is up and running again.
Tests in dbt
dbt tests are assertions you make about the correctness of your data.
Testing at scale is difficult; dbt lets you run all of your tests with one command:
dbt test
There are two types of tests in dbt: singular and generic.
Singular Tests
- Very specific; they may apply to only one or two models and assert something really specific about the logic in the model
Generic Tests
- Generic Tests are simple generic logic that can be used in your project
Four types come with dbt (a declaration sketch follows this list):
- unique: every value in the column is unique
- not_null: every value in the column is not null
- accepted_values: every value in the column is a value from a given list
- relationships: every value in the column exists in the column of another table
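A sketch of how these generic tests might be declared on our staging model in a schema.yml file (the column and accepted values here are assumptions for illustration):

```yaml
models:
  - name: stg_green_tripdata
    columns:
      - name: vendorid
        tests:
          - not_null
          - accepted_values:
              values: [1, 2]   # assumed valid vendor ids
              quote: false     # compare as integers, not strings
```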
Note: additional tests can be imported through packages, or you can write your own custom generic tests.
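For example, a custom generic test is just a parameterized query that should return zero rows; a minimal sketch (is_positive is a hypothetical test name):

```sql
-- Hypothetical custom generic test: fails if any value in the column is negative
{% test is_positive(model, column_name) %}

select *
from {{ model }}
where {{ column_name }} < 0

{% endtest %}
```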
Steps to use dbt in Our Project (START HERE)
Step 0: Prerequisites
Make sure that our BigQuery database has everything correctly set up for this part (see the Prerequisites at the top)!
Step 1: Set up dbt and dbt Cloud
The first step is to create a user account in dbt Cloud, create a project, and connect it to a database.
Since we will be using BigQuery as our warehouse for this project, we will use dbt Cloud instead of dbt Core, which is for developing on a local machine.
We will connect dbt to BigQuery using BigQuery OAuth.
Make sure that you have set up a GitHub repo for this part of the project. You should link GitHub and dbt together!
Click on init in dbt and it will create all the folders needed (e.g. Models, Seeds, tests, snapshots, etc.)
Change the name of the project in the dbt_project.yml file to name: 'taxi_ride_ny', and under the models: key change the project name to taxi_ride_ny as well.
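Putting both changes together, the relevant parts of dbt_project.yml look roughly like this (other default keys omitted):

```yaml
name: 'taxi_ride_ny'

models:
  taxi_ride_ny:
    # per-folder model configs (e.g. materializations) go under here
```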
dbt Core
- For running dbt locally on your machine.
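If you do want to run dbt locally, dbt Core with the BigQuery adapter can be installed via pip (one common approach):

```
pip install dbt-bigquery
```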
Developing with dbt
Define a Source and Create First Model
We will create two folders inside the Models folder in dbt: one called staging and the other called core.
Inside the staging folder we will create a model (file) named stg_green_tripdata.sql.
We’re going to create a schema.yml file inside the staging folder where we can define the source. Here is the schema.yml below:
Note:
- name: we can name this whatever we want; it is staging in this example
- database: this is from BigQuery (the GCP project)
- schema: this is also from BigQuery (the dataset containing the tables)
```yaml
version: 2

sources:
  - name: staging
    database: new-try-zoom-data
    schema: trips_data_all

    tables:
      - name: green_tripdata
      - name: yellow_tripdata
```
In the model below, called stg_green_tripdata.sql, we’re going to define our view:
```sql
{{ config(materialized='view') }}

-- source(name, table_name):
-- this is defined in our schema.yml file, and the name 'staging' can be anything we want
select * from {{ source('staging', 'green_tripdata') }}
limit 100
```
We can run our model in dbt by using the command:
dbt run
And it executed!
Let’s edit stg_green_tripdata.sql and add on to the model:
```sql
{{ config(materialized='view') }}

select
    -- identifiers
    cast(vendorid as integer) as vendorid,
    cast(ratecodeid as integer) as ratecodeid,
    cast(pulocationid as integer) as pickup_locationid,
    cast(dolocationid as integer) as dropoff_locationid,

    -- timestamps
    cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
    cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,

    -- trip info
    store_and_fwd_flag,
    cast(passenger_count as integer) as passenger_count,
    cast(trip_distance as numeric) as trip_distance,
    cast(trip_type as integer) as trip_type,

    -- payment info
    cast(fare_amount as numeric) as fare_amount,
    cast(extra as numeric) as extra,
    cast(mta_tax as numeric) as mta_tax,
    cast(tip_amount as numeric) as tip_amount,
    cast(tolls_amount as numeric) as tolls_amount,
    cast(ehail_fee as numeric) as ehail_fee,
    cast(improvement_surcharge as numeric) as improvement_surcharge,
    cast(total_amount as numeric) as total_amount,
    cast(payment_type as integer) as payment_type,
    cast(congestion_surcharge as numeric) as congestion_surcharge

from {{ source('staging', 'green_tripdata') }}
limit 100
```
We can run all models with dbt run, or run our model by itself with dbt run --select stg_green_tripdata.