
Querying Safely in the (Google) Cloud

May 23, 2019 blog-post open-source bigquery gcp backup sql cloud python

How to read this piece of storytooling

In a nutshell, we built a software tool and we’d like to share its story with the larger community. But given that our audience might be of varied backgrounds, we’d like to start you off with a flowchart, because flowcharts are fun!

Flowcharts are fun! Or else.

The view from The Cloud™

Here at Thinking Machines, we do a lot of work with huge amounts of data. So huge, in fact, that it won’t fit on our local machines. How then do we get any analytics work done? Easy! Enter: Google Cloud Platform (GCP).

With GCP, a lot of our big data problems are easily solved, especially because it provides a managed data warehousing and analytics solution known as BigQuery. It's just like a traditional relational database that runs on SQL, except it has been optimized to hell and back for complex queries over very large datasets. (It is not recommended as a transactional database, however; if you expect frequent modifications to existing data, consider Cloud SQL instead.)

We use BigQuery as both a data warehouse and an analytics platform. Once all the data has been loaded, we can swim through it using SQL queries that can be saved as views. These views can reference other views as well, allowing us to write a lot of reusable logic in SQL. Lastly, these views can be consumed by external visualization tools such as Google Data Studio, Tableau, and Power BI.

With BigQuery comes big responsibility

Views can grow into an accumulation of logic built up over a long period of time by multiple people. However, that logic lives in BigQuery alone unless we explicitly save the SQL source code elsewhere ourselves. This is dangerous: if someone accidentally deletes or rewrites one or more views, the work is simply gone, which is especially painful when other views depend on them. With a traditional code development workflow this wouldn't be a problem, since we can expect code to be committed to a version control system before going into production. That isn't the case, however, when your team has a cloud-first workflow that involves a lot of experimentation to explore the shape and scale of the data you're working with. On the other hand, asking everyone who writes views in BigQuery to keep their own backups adds a significant amount of work and eats up a lot of valuable time, not to mention the cognitive overhead when your workflow turns from the one on the left into the one on the right.

Left: A tight feedback loop that facilitates a good workflow. Right: An interruptive nightmare that constantly drains your focus

A pinch of Version Cron-trol

We asked ourselves: What would allow us to have a safety net against losing massive amounts of work while at the same time avoiding a costly addition to the workflow of everyone who writes views on BigQuery?

Then we answered ourselves: Let’s build a tool that can get the source code for all the views in a particular project, and then write a script to do this regularly while backing it up to a version control repository!
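The core of that idea fits in a few lines of Python with the official google-cloud-bigquery client. The sketch below is illustrative rather than bqup's actual implementation; the directory layout (one .sql file per view, grouped by dataset) is our assumption here:

```python
from pathlib import Path


def write_view_sql(root: str, dataset_id: str, view_id: str, sql: str) -> Path:
    """Save one view's SQL as <root>/<dataset_id>/<view_id>.sql."""
    path = Path(root) / dataset_id / f"{view_id}.sql"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(sql)
    return path


def export_project_views(project_id: str, root: str) -> None:
    # Imported here so the file-writing helper above stays usable
    # (and testable) without GCP credentials.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    for dataset in client.list_datasets():
        for table in client.list_tables(dataset):
            # Only views carry SQL source; plain tables are skipped.
            if table.table_type == "VIEW":
                view = client.get_table(table.reference)
                write_view_sql(root, dataset.dataset_id,
                               table.table_id, view.view_query)
```

Pointing `export_project_views("my-project", "backups/")` at a project leaves you with a tree of .sql files that mirrors your datasets, which is exactly the kind of thing a version control system is good at diffing.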

And so we built bqup (pronounced back-wup)! This tool exports a BigQuery project's structure and source code into your file system, mimicking the hierarchy of datasets and tables, and can optionally export schemata too. Then we wrote a template shell script for using bqup with a git repository, plus instructions on using cron to run it on a regular basis. Granted, you may still lose a bit of work to an accidental deletion, but the damage is now capped by how often you run your backups. The best part is that we've found it so useful across our projects that we decided to open source it for everyone else to use! It's also published on PyPI, so Python users can just pip install away.
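The scheduled half of the setup boils down to "export, then commit whatever changed." Here is a minimal sketch of the commit step in Python via subprocess; the export step is a placeholder for however you invoke bqup (see its README for the exact usage), and you would wire a script like this into cron, e.g. `0 2 * * *` for a nightly run:

```python
import subprocess


def commit_backup(repo_dir: str,
                  message: str = "Automated BigQuery view backup") -> bool:
    """Stage and commit everything under repo_dir.

    Returns True if a commit was made, False if nothing changed
    since the last backup.
    """
    subprocess.run(["git", "add", "-A"], cwd=repo_dir, check=True)
    # "git diff --cached --quiet" exits 0 when nothing is staged,
    # so we can skip creating an empty commit.
    staged = subprocess.run(["git", "diff", "--cached", "--quiet"],
                            cwd=repo_dir)
    if staged.returncode == 0:
        return False
    subprocess.run(["git", "commit", "-m", message],
                   cwd=repo_dir, check=True)
    return True
```

A nice side effect of skipping empty commits is that your repository history becomes a log of exactly when views actually changed.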

We hope you find it useful in your journeys swimming through big data across the cloud!

