Back to Blog

Google BigQuery Regexp: How to Apply Regular Expressions in BigQuery

Regular Expressions (RegExp for short) are maybe one of the most important skills in the field of data analytics and data manipulation. They are used in search engines, word processors, text editors, and other popular tools.

In this article, we’ll present the functions that BigQuery provides for the use of regular expressions and some useful tips and tricks on how to perform popular actions on your data.

BigQuery Regexp Functions

Regular expressions are a pattern or a sequence of characters that allows you to match, search and replace or validate a string input. There are a lot of cases where during analysis we want to extract certain parts of a string, validate whether a string has a specific format or not, and even replace or remove certain characters.

As BigQuery is one of the most popular cloud data warehouse tools used by millions of people for storage and analysis, RegEx functions could not be missing from the list of the provided default functions:

Function Syntax Description
REGEXP_CONTAINSREGEXP_CONTAINS(value, regexp)Verifies whether the input matches the given regular expression pattern
REGEXP_EXTRACTREGEXP_CONTAINS(value, regexp)Extracts the first matched sequence of characters that matches the regular expression pattern.
REGEXP_EXTRACT_ALLREGEXP_CONTAINS(value, regexp)Extracts every matched sequence of characters that matches the regular expression pattern.
REGEXP_INSTRREGEXP_INSTR(source_value, regexp, [position], [occurrence], [occurrence_position])Extracts the position where the first matched pattern is starting.
REGEXP_REPLACEREGEXP_REPLACE(value, regexp, replacement)Replaces the matched sequence of characters that matches the regular expression pattern with the given string.
REGEXP_SUBSTRREGEXP_SUBSTR(value, regexp, [position], [occurrence])It’s a synonym to the REGEXP_EXTRACT function, and it works the same way.

BigQuery RegExp Tips & Tricks

Before showing you great tips and tricks on applying regular expressions and BigQuery RegExp functions on your data to extract and manipulate your data, we will first import our example dataset into BigQuery. For this, we will use Coupler.io, a solution for importing data from multiple sources into BigQuery, or Google Sheets, or Excel. The key benefit here is that you can automate data load on a schedule. 

You need to:

  • Set up source (the app or file to get data from)
  • Set up destination (configure your BigQuery project and select the dataset and table to import data to)
  • Set up a schedule (customize the frequency for automatic data refresh)

Read our BigQuery Tutorial guide to check out how it works. And here is the dataset we imported from Excel to BigQuery for the sake of examples:

Here is the preview:

Check out the many BigQuery integrations available.

Now that we’ve loaded our sample dataset, it’s time to see all the tips and tricks on how to apply regular expressions to it and make the most out of our data.

BigQuery RegExp: How to check if a column is a valid phone number

The first tip we’ll show you is how to verify whether a column contains a valid phone number or not. For this case, we will use the REGEXP_CONTAINS function and the below regular expression, which is created to match any US formatted number:

^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$

So we can use it in the below query:

SELECT
  phone_number,
  REGEXP_CONTAINS(phone_number, r"^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$") AS is_valid
FROM
  `projectname.datasetname.tablename`

In case we use the same regular expression with the “email” column, we would get “false” as a result because it’s not a valid phone number:

SELECT
  email,
  REGEXP_CONTAINS(email, r"^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$") AS is_valid
FROM
  `projectname.datasetname.tablename`
3-invalid-phonenumber

BigQuery RegExp: How to find a specific word

On a similar approach, we will try to find a specific word in one of our columns. In detail, we will use the “message” column to search for the word “dolor” (random word). So, if we use the RegExp .*dolor.*, we can see the following results:

SELECT
  message,
  REGEXP_CONTAINS(message, r".*dolor.*") AS contains_word
FROM
  `projectname.datasetname.tablename`

BigQuery RegExp: How to check if a column does not contain a word

Now let’s see a trick where you can check if a column does not contain a word. For just a single word you can also use the [NOT] LIKE operator, but RegExp are also handy for more than one word or to make search case insensitive. 

Similar to the previous operation, we can try to find if a column contains a word and reverse the outcome by using the NOT operator. For example, using the same regular expression but adding NOT to the start of the expression, we can achieve the desired result:

SELECT
  message,
  NOT REGEXP_CONTAINS(message, r".*dolor.*") AS contains_word
FROM
  `projectname.datasetname.tablename`
5-not-contains-word

BigQuery RegExp: How to check if a column contains a word from another column

There are many cases where we want to evaluate whether a column contains a word from another column. For this case, we will again use the REGEXP_CONTAINS function. For the regular expression, we will use the CONCAT function to concatenate the information from the other column with the regular expression pattern

CONCAT(r'(?i)', word, r'')

Now, let’s construct the query:

SELECT
  message,
  word,
  REGEXP_CONTAINS(message, CONCAT(r'(?i)', word, r''))
FROM
  `projectname.datasetname.tablename`

BigQuery RegExp: How to replace a word

Another popular operation where regular expressions are used is to replace a specific word from an entry with another one. To do that, we’ll use the REGEXP_REPLACE function and we’ll replace the word “tellus” with the word “REPLACED”:

SELECT
  message,
  REGEXP_REPLACE(message, r"tellus", "REPLACED") AS replaced_word
FROM
  `projectname.datasetname.tablename`

BigQuery RegExp: How to escape a special character

There are some cases where we want to use a special character in our regular expression that’s also reserved by regular expressions. Below are some examples of popular special characters reserved for regular expressions:

  • Dot (.): Matches any possible character or symbol
  • Asterisk (*): Expects zero or more occurrences of preceding character
  • Question Mark (?): Expects zero or one occurrence of preceding character
  • Plus sign (+): Expects one or more occurrences of preceding character
  • Vertical Bar (|): Add to denote an alternative match
  • Circumflex (^): Used to denote the beginning of a buffer
  • Dollar Sign ($): Used to denote the end of a buffer

To review all reserved special characters, you can visit Mozilla’s Regular expression syntax cheat sheet.

To escape a special character like the above and use it by it’s literal value, we can use the backslash (\). If you use a backslash right before a special regular expression character, then this will be used as a literal value rather than as an expression. For example, if we want to find all messages that contain a dot (.) we can use the function REGEXP_CONTAINS and the escape character in our regular expression:

SELECT
  message
FROM
  `projectname.datasetname.tablename`
WHERE REGEXP_CONTAINS(message, r"\.")

As you can see, in this query, we returned only the messages that contain a dot using regular expressions.

BigQuery RegExp: How to split a string

A great example of how regular expressions can be useful in your analysis is when you want to split a string on a given delimiter (e.g., a space) and take the first or the second part. Using the function REGEXP_EXTRACT and this regular expression ^(.+?), , we can extract everything that appears before the first comma in a string:

SELECT
  message,
  REGEXP_EXTRACT(message, r"^(.+?),") AS extracted
FROM
  `projectname.datasetname.tablename`

BigQuery RegExp: How to lookahead

Lookaheads might not be that popular but can be a really useful approach in some cases. A positive lookahead will look for a matched pattern only if it’s being followed by another pattern but without making the latter one a part of the result. A negative lookahead is when we want to look for a matched pattern only if it’s not followed by a specific matched pattern.

Unfortunately, Google BigQuery is using the RE2 expression library, which is not currently supporting either positive or negative lookaheads.

BigQuery RegExp: How to replace special characters

Special characters can be really useful in a text or something that is not wanted. So there are cases where we either want to remove or replace all special characters in a text. A special character can be anything other than a letter or a number, including dots, commas, spaces, and others.

This can be achieved by regular expressions like this [^a-zA-Z0-9]+ and REGEXP_REPLACE function as shown below, where we remove all special characters from an email address:

SELECT
  message,
  REGEXP_REPLACE(message, r"[^a-zA-Z0-9]+", "") AS removed_special_characters
FROM
  `projectname.datasetname.tablename`

BigQuery RegExp: How to use extract groups

The extraction of groups is a very popular operation of regular expressions. Groups can be a set of matched patterns that exist in the same text of interest. 

The most common usage is to extract every available word from a given text. A word is defined as any set of alphabetical characters that do not contain a special character between them. To extract all words from a given text using regular expressions, we can use the REGEXP_EXTRACT_ALL function and this regular expression (\w+) to identify all words. This function will return an array of all identified words:

SELECT
  message,
  REGEXP_EXTRACT_ALL(message, r"(\w+)") AS words
FROM
  `projectname.datasetname.tablename`

BigQuery RegExp: How to extract a number from a string

Finally, the last trick is how to extract all numbers from a string. This can be useful if you want to find out all numbers that exist in a string and/or even replace them. To do that, we can use the REGEXP_EXTRACT_ALL function again along with this regular expression ([0-9]+) to extract all numbers that appear in a given address:

SELECT
  address,
  REGEXP_EXTRACT_ALL(address, r"([0-9]+)") AS numbers
FROM
  `projectname.datasetname.tablename`

When should I use BigQuery Regexp?

Regular expressions are great when it comes to string manipulation and can be an amazing asset in your analysis arsenal. With the above tips and tricks, we hope we gave you a strong head start on the topic and smooth the learning curve. 

Using these tricks you will be able to boost your analysis capabilities and manipulate your data more powerfully. Regular expressions are a great skill to learn, and you will not regret getting your hands dirty.

Back to Blog

Comments are closed.

Access your data
in a simple format for free!

Start Free