Back to Blog

BigQuery ML: Machine Learning in BigQuery

Do you know that SQL can be used for machine learning? Well, it’s true! BigQuery ML lets anyone do this using SQL. So, don’t worry if you can’t code in Python, R, or other programming languages for ML. With just SQL, you can still build and train a machine learning model.

Interested in learning more about BigQuery ML? This article covers machine learning capability in BigQuery, including what BigQuery ML is, the type of models it supports, a simple example of using it, and more.

What is BigQuery ML?

Google’s BigQuery is more than just a data warehouse. It can provide decision-making guidance through predictive analytics by using its machine learning tool, BigQuery ML. More excitingly, you create and train a model without ever exporting data out of BigQuery!

Simply put, Google BigQuery ML (BQML) is a set of SQL extensions to support machine learning. It was launched in 2018 with the following purposes:

  • To empower data analysts and scientists to use machine learning through existing SQL skills and tools. It’s more common to find data analysts with SQL expertise than ones with ML programming backgrounds. So, with BigQuery ML, analysts who are familiar with SQL but can’t code in R, Python, or Java can still build powerful ML models.
  • To allow data analysts and scientists to do ML locally inside BigQuery. This means they don’t need to move the data out of the platform, saving time on machine learning development.

So, before using BigQuery ML, you’ll need to bring your data into BigQuery if it isn’t already there. There are several ways to do this. For example, you can upload CSV files, run SQL commands to insert data, use third-party services, etc.

If you’re importing data from external sources, such as Shopify, QuickBooks, Google Sheets, or REST APIs, consider using Coupler.io.

It’s a solution to easily import data from external sources into BigQuery, Excel, or Google Sheets. You can also automate the data load on a schedule. Check out the complete list of BigQuery integrations supported by Coupler.io.

BigQuery ML supported model types and algorithms

The terms “model” and “algorithm” are frequently used interchangeably in machine learning, but they’re not the same thing. A machine learning algorithm is the step-by-step instructions run on data to create a machine learning model. So, you can say that a machine learning model is the output of an algorithm.

BigQuery ML supports supervised learning algorithms such as linear and logistic regressions. It also supports unsupervised learning algorithms in that you can use k-means to cluster your data based on similarity. 

As of this writing, BigQuery ML supports the following model types:

BigQuery ML model typeML problem type and additional note
Linear regressionTo solve regression problems. Use LINEAR_REG when the label is a number, for example, to forecast product sales on a certain day.
Logistic regressionYou can use LOGISTIC_REG for either BigQuery ML binary logistic or multiclass classification. Use for binary classification when the label is TRUE/FALSE, 1/0, or only two categories, for example, to determine whether a flight will be late or not. 
Use for multiclass classification when the label is in a fixed set of strings, for example, to classify whether an email is a primary, social, promotions, updates, or forums.
Deep Neural Network (DNN)To create TensorFlow-based DNN for solving regression and classification problems.
Use DNN_REGRESSOR for regression problems.
Use DNN_CLASSIFIER for binary and multiclass classification problems.
Boosted TreeFor creating “boosted” decision trees, which have better performance than decision trees on extensive datasets.
You can use boosted tree models for classification and regression problems. 
Use BOOSTED_TREE_REGRESSOR for regression.
Use BOOSTED_TREE_CLASSIFIER for binary and multiclass classification problems.
Matrix factorizationFor creating a recommendation system, for example, to recommend the “next” product to buy to a customer based on their past purchases, historical behavior, and product ratings.
K-meansUse it when labels are unavailable, for example, to perform customer segmentation.
AutoencoderYou can use it to detect anomalies in your data. 
Time seriesBigQuery ML for time series is popular for estimating future demands, such as retail sales or manufacturing production forecasts. It also automatically detects and corrects for anomalies, seasonality, and holiday effects.
AutoML tablesYou can use AutoML for any regression, classification, and time series forecasting problems. It will automatically search through various models and find the best one for you.
TensorFlow model importingUse it if you have previously trained TensorFlow models and want to import them to BigQuery ML to perform predictions on them.

BigQuery ML built-in vs. external model types

Based on where the models are trained, the model types listed above can be classified into two different categories:

  • Built-in models, which are built and trained within BigQuery. These include linear regression, logistic regression, k-means, matrix factorization, and time series models. 
  • External models (custom models), which are trained outside BigQuery. Examples include AutoML Table, DNN, and boosted tree models trained on Vertex AI.

Built-in models can be trained very quickly. You can get high-quality models with external models, but they’ll take longer to train—it can be hours or even days. Choosing a built-in vs. external model will also be different in pricing, which we’ll cover in the BQML pricing section.

BigQuery ML & other tools in GCP for machine learning

You can do machine learning in BigQuery using just SQL. However, being able to use frameworks such as TensorFlow can give you access to a much wider variety of machine learning models and components. Including BigQuery, here are several ways to create machine learning models on GCP:

  • BigQuery ML, which is the focus of this article. You can build and train a model quickly using SQL without moving data outside the data warehouse.
  • TensorFlow, an open-source library that helps you build, train, and deploy your own custom models. In TensorFlow, you can code using several programming languages such as Python, C++, and Go, but Python is the best supported. BigQuery ML is also compatible with TensorFlow models, so you can export models between the two.
  • AutoML, which provides UI to allow you automatically build and deploy ML models. With AutoML, you can apply ML to real-life problems without coding, so there is no need to worry if you’re not highly skilled in this field. It consists of a family of products, such as AutoML Vision (to classify your images), AutoML Video Intelligence, and AutoML Translation. 
  • Pretrained ML APIs. To solve common ML problems, GCP has many pre-built models that are ready to use, such as Speech-to-Text, Translation, and Vision OCR.

Other than those mentioned above, there are a few other ways to do machine learning on GCP. For example, you can also run PyTorch, xgboost, or scikit-learn on Vertex AI, Spark ML on Dataproc, etc.

BigQuery ML example

In this example, we’ll use the BigQuery public dataset chicago_taxi_trips to build a model to predict the taxi fares of Chicago cabs. The label we’re going to predict is the trip_total, which is the total cost of the trip. We’ll use a linear regression model, the simplest regression model that BigQuery ML supports. 

The prediction we’re going to make is based on these three attributes also called features:

  • trip_seconds (time of the trip in seconds) 
  • trip_miles (distance of the trip in miles)
  • company (the taxi company) 

Note: Before choosing the input features of our ML models, it’s best to do more analysis to verify that those features do in fact influence the label. Deciding which features to include in a machine learning model is called feature engineering, and it is often considered the most important part of building an accurate machine learning model. In this example, however, let’s just pick those three features for simplicity.

This is the step-by-step guide for this example:

  • Create a dataset
  • Create a linear regression model using the CREATE MODEL statement
  • Evaluate the model using the ML.EVALUATE function 
  • Make a prediction using the ML.PREDICT function
  • Generate batch predictions

Create a dataset

Now let’s first create a BigQuery dataset to store our ML model by following the steps below:

Step 1: Go to the BigQuery page.

Step 2: In the toolbar, select your project (or create a new one).

Step 3: In the Explorer, expand the View actions icon () next to the project, then select Create dataset.

Step 4: On the Create dataset pane, enter a unique Dataset ID, for example, bqml_example. You can leave the other options as default, then click CREATE DATASET

Create model using BigQuery ML

Next, let’s create a linear regression model and save it into the bqml_example dataset we created previously. Here are the steps to create the model using the CREATE MODEL statement:

Step 1: Click the COMPOSE NEW QUERY button.

Step 2: Write the following SQL query in the editor.

CREATE OR REPLACE MODEL bqml_example.chicago_taxi_trip_model
OPTIONS (input_label_cols=['trip_total'], model_type='linear_reg') AS
SELECT  trip_seconds, 
        trip_miles, 
        company,
        trip_total        
FROM    `bigquery-public-data`.chicago_taxi_trips.taxi_trips
WHERE   trip_seconds BETWEEN 60 AND 7200
        AND trip_miles > 0
        AND fare >= 3.25
        AND company IS NOT NULL;

Query explanation:

The above SQL creates and trains a model named chicago_taxi_trip_model and saves it in the bqml_example dataset. The model uses data from the taxi_trips table in the chicago_taxi_trips dataset under the bigquery-public-data project.

Notice that the input label column and model type are specified in OPTIONS. Because the label is numeric, we use linear_reg model type as already discussed in the previous section about BigQuery ML supported model types

In the WHERE clause, we can put filters to exclude certain data in our training. For example, it’s very rare for a person to be in a taxi for under 1 minute or more than 2 hours. In this case, we also exclude trips with fares under $3.25, zero mile distance, and no information about taxi company names.

Step 3: Click RUN.

The query takes about 4 minutes to complete. After that, the new model chicago_taxi_trip_model appears in the navigation panel. 

Evaluate BigQuery ML model

We can get the evaluation results by running the following SQL query:

SELECT * FROM ML.EVALUATE(MODEL bqml_example.chicago_taxi_trip_model)

Result:

As you can see from the above screenshot, the mean absolute error value is about $3.98. This means that you should expect to predict the taxi fare with an average error of about $3.98. 

Note: To see the evaluation results, alternatively, you can click chicago_taxi_trip_model in the left panel. You will see several tabs containing info about this model. One of them is the Evaluation tab, where you can view the evaluation summary.

Predict with the BigQuery ML model 

We can try out the prediction by passing in a row for which to predict. For example, to get the predicted Chicago Taxicab total fare for a trip with 1.5 miles and 10 minutes, we can use the following code:

SELECT * FROM ML.PREDICT(MODEL bqml_example.chicago_taxi_trip_model, 
  (SELECT 600 as trip_seconds, 1.5 AS trip_miles, 'Chicago Taxicab' as company) 
)

Result:

As you can see, the predicted total fare is about $13.49. Because we have a mean absolute error of $3.98, the actual fare is expected to be in the range of $9.38 to $17.47.

Generate batch prediction via Google BigQuery ML 

We can also predict many rows at once. For example, here we’re going to predict the cab fare for trips with a duration every 30 seconds between 300 to 600 seconds using array generation:

WITH time_seconds AS (
  SELECT GENERATE_ARRAY(300, 600, 30) AS seconds
)
SELECT * FROM ML.PREDICT(MODEL bqml_example.chicago_taxi_trip_model, 
  (SELECT t_seconds AS trip_seconds, 1.5 AS trip_miles, 'Chicago Taxicab' as company
  FROM time_seconds, UNNEST(seconds) as t_seconds) 
) 
ORDER BY trip_seconds DESC;

The query returns 11 rows, sorted from longest to shortest trip duration:

BigQuery ML Geography data types

As a data warehouse, BigQuery can store data of many types: numeric, date, text, geospatial data, etc. Learn more about Google BigQuery data types

When you want to provide geolocation for your model input, using the state or country name might not give you enough detail, so it doesn’t work well in some cases. It’s recommended to use precise locations. However, using a WKT representation (e.g., POINT(-122.35 47.62)) or GeoJSON (e.g., { "type": "Point", "coordinates": [-122.35, 47.62] }) is not recommended either.

A better choice is using a GeoHash representation, a hash string representation of a geographic location. The longer the shared prefix between two hashes, the closer together they are spatially. 

To get a GeoHash representation of a location, use ST_GeoHash function in BigQuery.

Example: Adding locations as inputs to our model

Now, let’s add the pick-up and drop-off locations and use those as additional inputs to our model. We’ll create a new model and name it chicago_taxi_trip_geo_model.

CREATE OR REPLACE MODEL bqml_example.chicago_taxi_trip_geo_model
OPTIONS (input_label_cols=['trip_total'], model_type='linear_reg') AS
SELECT  trip_seconds, 
        trip_miles, 
        company,
        ST_GeoHash(ST_GeogPoint(pickup_latitude,pickup_longitude), 5) as pickup_location, 
        ST_GeoHash(ST_GeogPoint(dropoff_latitude,dropoff_longitude), 5) as dropoff_location, 
        trip_total        
FROM    `bigquery-public-data`.chicago_taxi_trips.taxi_trips
WHERE   trip_seconds BETWEEN 60 AND 7200
        AND trip_miles > 0
        AND fare >= 3.25
        AND pickup_latitude IS NOT NULL AND pickup_longitude IS NOT NULL
        AND dropoff_latitude IS NOT NULL AND dropoff_longitude IS NOT NULL
        AND company IS NOT NULL;

After that, run the following query to get the evaluation results:

SELECT * FROM ML.EVALUATE(MODEL bqml_example.chicago_taxi_trip_geo_model)

As you can see, this model results in a mean absolute error of about $2.83, a significant improvement over the $3.98 we got from the previous model. We can conclude that pick-up and drop-off locations also affect the taxi fare.

BQML pricing

When you use BigQuery ML, your training data and ML models are stored inside BigQuery. You will be charged mainly for storage and queries. The pricing for queries will also be differentiated into queries for analysis and CREATE MODEL statements—see the following table:

ResourceDetail
Storage The first 10 GB BigQuery storage per month is free.  
After that, you’ll be charged $0.020 per GB of data stored in BigQuery. This price automatically drops by approximately 50% if you don’t make any modifications to your data for 90 consecutive days.
Queries (analysis)The first 1 TB of data processed by queries that use BigQuery ML functions (ML.PREDICT, ML.EVALUATE, etc.) is free.
After that, you will be charged $5.00 per TB.
Queries (CREATE MODEL)For built-in model types:
The first 10 GB of data processed by queries that contain CREATE MODEL statements per month is free. After that, the price is $250.00 per TB
For external model types:
The price is $5.00 per TB, plus Vertex AI training cost.

Note: The above pricing applies if you use BigQuery on-demand pricing. But if you are a flat-rate customer, BigQuery ML costs are included in your BigQuery monthly payment ($2,000 per month, or lower if you choose a longer-term commitment). 

Pros and Cons of Google BigQuery ML

Pros:

  • You can perform BigQuery machine learning by using just SQL. There’s no need to use programming languages such as R or Python to create and train models. 
  • You can save ML development time by removing the need to export data from BigQuery. 
  • You can get a quick baseline model because BigQuery ML builds and trains models very quickly.

Cons: 

  • Though BigQuery ML offers fast iteration capability, you can’t get a very high-quality model. So, depending on your skill set and the characteristics of the problem you need to solve, you may want to use other tools such as AutoML or TensorFlow to get a very accurate model.
  • The cost of using BigQuery ML can be high, because you usually need to train a model several times, and you might easily overuse the free quota.

Wrapping up

BigQuery ML is a powerful service that can be used to build and train machine learning models in BigQuery using SQL. It supports many types of models for different machine learning problems. You don’t need to have any prior experience with R, Python, or any other machine learning programming languages to use it.

Importing data from different sources into BigQuery for machine learning or exporting results to Excel or Google Sheets? Coupler.io is an excellent way to do that! With this tool, you can easily manage your import process to your data warehouse and export from it without coding.

Try Coupler.io and BigQuery ML today, and good luck with your analysis! 😉

Back to Blog

Comments are closed.

Access your data
in a simple format for free!

Start Free