Intro
When handling large datasets in BigQuery from multiple sources like Google Analytics 4 (GA4) and ad platforms, things can quickly become chaotic with complex queries and workflows. Dataform steps in as a game-changing tool, simplifying data orchestration and turning raw data into actionable insights.
Challenges Working With Google Analytics 4 and Other Data Sources In BigQuery
One of the biggest advantages of GA4 compared to its predecessors is that even the free version allows you to export all raw data to BigQuery for advanced analysis. However, working with SQL on the raw Google Analytics 4 data, while promising, also brings new challenges.
What it might look like in a project with 3 raw source tables, 4 analysts writing SQL, and 4 result tables used for reporting.
SQL projects tend to grow quickly over time, bringing challenges with them. Analysts have different ways of writing their queries, leading to inconsistent logic. As more data sources and reports are added to the project, the logic becomes increasingly difficult to control. On top of that, the out-of-the box Google Analytics 4 dataset offers challenges that need to be addressed even for simple reports. Common issues when working with Google Analytics 4 data in BigQuery is:
- There is no concept of sessions. Many session-scoped dimensions are missing and need to be calculated.
- Traffic source data is incomplete and inaccurate. Even with the introduction of the new fields in the export schema that aim to address this, there are great challenges, especially if you want it to apply to historical data.
- Nested data structures for parameters are complex and challenging to work with.
- Data that has already been exported can still be retroactively updated for up to 72 hours after the initial export.
So, let’s explore a tool and approach to make working with GA4 data in BigQuery easier and more effective.
Introducing Dataform For Easier Data Orchestration
Dataforms interface with SQLX files and folders structure and code editor.
Dataform simplifies these challenges by offering a platform for managing data pipelines, reducing the complexity of SQL workflows, and providing better control over data quality and structure.
Key Benefits of Dataform
- Data Orchestration: Dataform helps orchestrate complex data workflows, reducing the manual effort required to manage dependencies between tables and ensuring smooth data transformations.
- Version Control: With built-in GitHub or GitLab integration, Dataform supports version control, allowing teams to manage multiple workspaces and development branches. This ensures that any changes are tracked, providing a reliable audit trail, and easy rollback of changes.
- Simplified SQL Workflows: Dataform introduces SQLX, which enhances standard SQL by combining it with JavaScript for dynamic queries. This makes complex SQL queries more flexible and easier to manage.
- Enhanced Data Quality: With assertions, Dataform allows users to set up automatic data quality checks, ensuring that any anomalies or inconsistencies are flagged early in the data pipeline.
- Customizable Execution: Dataform workflows can be triggered manually or automatically through Google Cloud Workflows or Cloud Composer (Apache Airflow), or REST API calls. This allows for robust scheduling and error handling, perfect for managing large-scale data pipelines.
- Incremental Tables: For large datasets, Dataform supports incremental table functionality, which minimizes unnecessary data processing by only updating the necessary portions of data. This optimizes performance and reduces costs.
An SQLX file, mixing JavaScript and SQL for easier scripting.
But what about BigQuery - should we still use it?
Dataform does not replace BigQuery but acts as a complement to help you organize and write your SQL code. Your data is still stored and processed in BigQuery. Dataform serves as a tool to design and execute data workflows.
What about dbt - how does it compare?
dbt (Data Build Tool) is a highly flexible and widely used tool in the data engineering community. It offers a similar solution to Dataform for managing data orchestration and helps you write SQL.
The main differences are that Dataform uses JavaScript, while dbt uses Jinja, which is based on Python. However, the most notable difference is that Dataform is integrated for free within Google Cloud and BigQuery, whereas dbt requires third-party hosting, which comes with additional costs, maintenance and access management.
Creating a Structured Data Library with Dataform
An example of a Structured Data Library for Google Analytics 4 and Google Ads. The cogwheels represent complex data transformations and logics.
At Ctrl Digital - for our clients, we use Dataform to create prepared tables that analysts can use as sources for their SQL queries. The raw data is transformed into a structured data library, ensuring that business logic is cohesive across queries. Complex processes like managing traffic source attribution models, unnesting parameters, retroactive updates of raw data, and more are handled within Dataform. This allows analysts to focus on the next steps without being bogged down by data preparation challenges.
How to Get Started with Dataform
Setting up Dataform is simple, it’s a free service on Google Cloud and works seamlessly with BigQuery and other GCP services. The steps are straightforward:
- Create a new Dataform repository in Google Cloud and connect it with GitHub/GitLab for version control.
- Give Dataforms service account access to the tables in BigQuery you want to query.
- Set up your workspace to begin writing SQLX code for your data transformations.
- Schedule workflows to execute your repository and its SQLX files, resulting in new tables. In an upcoming blog post, we will take a detailed look at how to set up a new Dataform repository and run our first SQLX queries with a workflow.
Conclusion
This is the first part in our blog series about Dataform, it is a powerful tool for teams working with complex data pipelines in BigQuery. It simplifies data orchestration, ensures data quality, and integrates seamlessly with BigQuery and Google Cloud. Whether you’re working with raw data from GA4 or other sources, Dataform makes it easier to manage and analyze data, leading to better insights and decision-making. If you’re ready to streamline your data processes in BigQuery, Dataform is the tool to explore next. If you need help setting up Dataform or want to discuss how to get started, reach out to us at [email protected]