Zoom Data Engineering Part 4

6 minute read

Project and Scope of this Week

Goal of Proejct: Transforming the data loaded in our Data Warehouse (BigQuery) to Analytical Views (Google) bying using a transformation tool called dbt.

Prerequisites

By this stage of the course you should have already:

  • A running Data Warehouse (BigQuery)
  • A set of running pipelines ingesting the project dataset:
    • Yellow taxi data - Years 2019 and 2020 [x]
    • Green taxi data - Years 2019 and 2020 [x]
    • fhv data - Year 2019. [x]

What is Analytics Engineering? And how do they fit in?

The analytics engineer is the role that tries to fill the gap: it introduces the good software engineering practices to the efforts of data analysts and data scientists. The analytics engineer may be exposed to the following tools:

  1. Data Loading (Stitch…)

  2. Data Storing (Data Warehouses)

  3. Data Modeling (dbt, Dataform…)

  4. Data Presentation (BI tools like Looker, Mode, Tableau…)

Data Modeling Concepts

In this lesson we will cover the T in ETL, which is transformation

Dimensional Modeling

  • Fact and Dimensional tables (STAR Method)

Understanding the architecture of Dimensional Modeling:

Stage Area:

  • Contains the raw data, which is not meant to be exposed to everyone

Processing area: From raw data to data models, focues on the efficiency and ensuring standards

Presentation area: Final presentation of the data, aka the exposure to the business stakeholder


Introduction to dbt

START OF WEEK 4!!!

Scope of Project (using dbt)

Our project will have trip data (which we’ve loaded into BigQuery). We’ll add a CSV file with the taxi lookup data. We’ll use DBT to transform our data in BigQuery, and then express our data within a dashboard.

dbt has 2 main components: dbt Core and dbt Cloud:

dbt Core: open-source project that allows the data transformation. Builds and runs a dbt project (.sql and .yaml files). Includes SQL compilation logic, macros and database adapters. Includes a CLI interface to run dbt commands locally. Open-source and free to use.

dbt Cloud: SaaS application to develop and manage dbt projects. Web-based IDE to develop, run and test a dbt project. Jobs orchestration. Logging and alerting. Intregrated documentation. Free for individuals (one developer seat).

For integration with BigQuery we will use the dbt Cloud IDE, so a local installation of dbt core isn’t required.

What is dbt?

dbt is used for transformation

  • Transformation tool

Models in dbt

What are Models in dbt?

Models are snippets of SQL code that help shape data into a format that will be ready for reporting (Analytics/Business Intelligence)

  • Models live in the Models directory of a dbt projecty
  • Each Model has a One-to-One relationship with a table or view in the database
  • You can run all the models, dbt will decide which/what models to build in-order by running the command dbt run

At the top of our Model file we can include a Configuration Block that allows us to tell dbt to build either a Table or a View. An example of the code snippet block to config for a table is shown below:


If you want to run a specific model you can do so by running:

dbt run -m specifc_model

After running the Model you can check into your Data Warehouse (BigQuery, Snowflake, etc) and see that dbt built the model


Naming Conventions for Models

Modularity: Build Part-by-Part and *assemble** together. Think of building a car.

Sources: Raw data that has been loaded. (Not Models), they a way of referencing. ()

Staging: Clean and standardize the data (One-to-One with the source tables).

Itermediate Models: Exist somewhere between staging and final models. They should always reference staging models

Note: dbt docs at the to left of the project we can click on view docs and dbt will generate documentation for us.

Sources

What are Sources in dbt?

Sources in dbt allow you to document those raw tables that have been brought in by EL tools (Extract, Load) you configure the source once in a .yml file. A lot like the ref function we used with models.

For Example if location or naming of raw tables change you need to update across multiple files

Source Function Example:

--  

This is creating a direct reference to that .yml file.

When we compile our code:

-- 

compiles to:

raw.stripe.payments

So Sources are useful becuase when the Loading process changes you can update your dbt project in one place and everything can be up running again.

Tests in dbt

dbt tests are assertion you have about your data that are correct.

Testing at scale is difficult, run in dbt:

dbt test

Two types of tests in dbt which are singular and generic.

Singular Tests

  • Very specific, maybe apply to 1 or 2 models and they assert something really specific about the logic in the model

Generic Tests

  • Generic Tests are simple generic logic that can be used in your project

4 Types that come with dbt

  • Unique: Every value in column is unique
  • not_null: Every value in the column is not null
  • accepted_values: Every value in the colmun is a value from a given list
  • relationships: Every value in the colmun exisits in the column of another table

Note: Additional testing can be imported through packages OR write your custom generic tests.

Steps to use dbt in Our Project (START HERE)

Step 0: Prereq

Setting/making sure that our BigQuery Database has everything ‘correctly’ set up for this part!!!

Step 1: Set up dbt and dbt Cloud

The first step is to create a user account in dbt and create a project and connect it to a database.

Since we will be using BigQuery as our warehouse for this project we will use dbt cloud instead of dbt core, which is for developing on a local machine.

We will connect dbt to BigQuery using BigQuery OAuth.

Make sure that you have set up a GitHub repo for this part of the project.

SHOULD LINK GITHUB AND DBT TOGETHER!!!!

Click on init in dbt and it will create all the folders needed (e.g. Models, Seeds, tests, snapshots, etc.)

Change the name of the project in the dbt_project.yml file to:

name: 'taxi_ride_ny'

and change:

models: to taxi_ride_ny

dbt Core

  • For running dbt locally on your machine.

Developing with dbt

Define a Source and Create First Model

We will create two folders inside the Models folder in dbt. One file will be called staging and the other will be called core.

Inside the staging folder we will create a (file) model named stg_green_tripdata.sql

Were going to create a schema.yml file inside the staging folder were we can define the source, here is the schema.yml below:

Note:

  • name we can name this whatever, just staging in this example
  • database This is from BigQuery
  • schema This is also from BigQuery

  • tables
version: 2

sources:
    - name: staging
      database: new-try-zoom-data
      schema: trips_data_all

      tables:
        - name: green_tripdata
        - name: yellow_tripdata

In the model below called stg_green_tripdata.sql were going to define our View:

-- 
-- source(name, table_name)
-- this is in our schema.yml file and the name staging can be anything we want
SELECT * FROM 

LIMIT 100

——–Stopped HERE–

We can run ouir model in dbt by using the command:

dbt run

And it executed!

Lets edit the stg_green_tripdata.sql and add on to the model:

-- 


select
    -- identifiers
    cast(vendorid as integer) as vendorid,
    cast(ratecodeid as integer) as ratecodeid,
    cast(pulocationid as integer) as  pickup_locationid,
    cast(dolocationid as integer) as dropoff_locationid,
    
    -- timestamps
    cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
    cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,
    
    -- trip info
    store_and_fwd_flag,
    cast(passenger_count as integer) as passenger_count,
    cast(trip_distance as numeric) as trip_distance,
    cast(trip_type as integer) as trip_type,
    
    -- payment info
    cast(fare_amount as numeric) as fare_amount,
    cast(extra as numeric) as extra,
    cast(mta_tax as numeric) as mta_tax,
    cast(tip_amount as numeric) as tip_amount,
    cast(tolls_amount as numeric) as tolls_amount,
    cast(ehail_fee as numeric) as ehail_fee,
    cast(improvement_surcharge as numeric) as improvement_surcharge,
    cast(total_amount as numeric) as total_amount,
    cast(payment_type as integer) as payment_type,
    cast(congestion_surcharge as numeric) as congestion_surcharge
from 
limit 100   

Can run with either dbt run or run our model by itself dbt run --select stg_green_tripdata