Querying Safely in the (Google) Cloud
How to read this piece of storytooling
In a nutshell, we built a software tool and we’d like to share its story with the larger community. But given that our audience might be of varied backgrounds, we’d like to start you off with a flowchart, because flowcharts are fun!
The view from The Cloud™
Here at Thinking Machines, we do a lot of work with huge amounts of data. So huge, in fact, that it won’t fit on our local machines. How then do we get any analytics work done? Easy! Enter: Google Cloud Platform (GCP).
With GCP, a lot of our big data problems are easily solved, especially because it provides a managed data warehousing and analytics solution known as BigQuery. It's like a traditional relational database that speaks SQL, except it has been optimized to hell and back for complex queries over very large datasets. (It is not recommended for use as a transactional database, however. If you are expecting frequent modifications to existing data, please consider Cloud SQL instead.)
We use BigQuery as both a data warehouse and an analytics platform. Once all the data has been loaded, we can swim through it using SQL queries that can be saved as views. These views can reference other views as well, allowing us to write a lot of reusable logic in SQL. Lastly, these views can be consumed by external visualization tools such as Google Data Studio, Tableau, and PowerBI.
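To make the layering concrete, here's a hedged sketch of what a pair of stacked views might look like in BigQuery's standard SQL. The project, dataset, table, and column names are all made up for illustration:

```sql
-- A base view over a (hypothetical) raw table.
CREATE OR REPLACE VIEW `my-project.sales.daily_orders` AS
SELECT DATE(created_at) AS order_date, region, amount
FROM `my-project.sales.raw_orders`;

-- A second view that reuses the first, keeping the cleanup logic in one place.
CREATE OR REPLACE VIEW `my-project.sales.daily_totals` AS
SELECT order_date, region, SUM(amount) AS total_amount
FROM `my-project.sales.daily_orders`
GROUP BY order_date, region;
```

A dashboard in Data Studio, Tableau, or PowerBI can then read from `daily_totals` directly, without knowing anything about the raw table underneath.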
With BigQuery comes big responsibility
Views can grow into an accumulation of logic built up over a long period of time by multiple people. Unless we explicitly save the SQL source code elsewhere, however, that logic lives in BigQuery alone. This is dangerous: someone could accidentally delete or rewrite one or more views, breaking any other views that depend on them. In a traditional development workflow this wouldn't be a problem, since we can expect code to be committed to a version control system before going into production. That isn't the case when your team has a cloud-first workflow that involves a lot of experimentation to analyze and explore the shape and scale of the data you're working with. On the other hand, asking everyone who writes views in BigQuery to keep their own backups adds a significant amount of work and eats up a lot of valuable time, not to mention cognitive overhead, as your workflow turns from the one on the left into the one on the right.
A pinch of Version Cron-trol
We asked ourselves: What would allow us to have a safety net against losing massive amounts of work while at the same time avoiding a costly addition to the workflow of everyone who writes views on BigQuery?
Then we answered ourselves: Let’s build a tool that can get the source code for all the views in a particular project, and then write a script to do this regularly while backing it up to a version control repository!
And so we built bqup (pronounced back-wup)! It exports a BigQuery project's structure and source code into your file system, mirroring the hierarchy of datasets and tables, and can optionally export schemata too. We then wrote a template shell script for using bqup with a git repository, along with instructions on using cron to run it on a regular schedule. Granted, you may still lose a little work to an accidental deletion, but the damage is now capped by however often you run your backups. The best part is that we've found it so useful across our projects that we decided to open source it for everyone else to use! It's also published on PyPI, so Python users can just pip install away.
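To sketch the idea, here's roughly what such a cron-driven backup script could look like. Everything in it is illustrative: the paths are placeholders, and you should check the bqup README for its actual command-line flags before filling in the export step.

```shell
#!/bin/sh
# backup.sh -- export BigQuery views with bqup and commit them to git.
# All paths below are placeholders; adjust them for your own setup.
set -e

cd /path/to/bqup-backups        # a local clone of your backup repository

bqup                            # export the project's datasets and views here
                                # (see the bqup README for the exact flags)

git add -A
git commit -m "bqup backup $(date +%F)" || true   # no-op when nothing changed
git push
```

A crontab entry such as `0 2 * * * /path/to/backup.sh` would then run it nightly, giving you that daily ceiling on how much work an accidental deletion can cost.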
We hope you find it useful in your journeys swimming through big data across the cloud!