Source URL: https://cloud.google.com/blog/products/identity-security/get-started-with-built-in-tokenization-for-sensitive-data-protection/
Source: Cloud Blog
Title: Get started with Google Cloud’s built-in tokenization for sensitive data protection
Feedly Summary: In many industries including finance and healthcare, sensitive data such as payment card numbers and government identification numbers need to be secured before they can be used and shared. A common approach is applying tokenization to enhance security and manage risk.
A token is a substitute value that replaces sensitive data during its use or processing. Instead of directly working with the original, sensitive information (usually referred to as the “raw data”), a token acts as a stand-in. Unlike raw data, the token is a scrambled or encrypted value.
Using tokens reduces the real-world risk posed by using the raw data, while maintaining the ability to join or aggregate values across multiple datasets. This technique is known as preserving referential integrity.
Tokenization engineered into Google Cloud
While tokenization is often seen as a specialized technology that can be challenging and potentially expensive to integrate into existing systems and workflows, Google Cloud offers powerful, scalable tokenization capabilities as part of our Sensitive Data Protection service. With it, you can make calls into serverless API endpoints to tokenize data on the fly in your own applications and data pipelines.
This allows you to enable tokenization without needing to manage any third-party deployments, hardware, or virtual machines. Additionally, the service is fully regionalized, which means tokenization processing happens in the geographical region of your choice, helping you adhere to regulatory or compliance regimes. Pricing is based on data throughput with no upfront costs, so you can scale usage up or down to meet the needs of your business.
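As a rough sketch of what such an API call can look like, the snippet below uses the google-cloud-dlp Python client to replace detected email addresses with deterministic tokens. The project, region, KMS key, and surrogate info type names are placeholders rather than values from this post, and your configuration may differ.

```python
# Minimal sketch: tokenize email addresses with Sensitive Data Protection (DLP API).
# Assumes the google-cloud-dlp library and hypothetical project/region/key names.
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/us-central1"  # placeholder project and region

# Placeholder: a data encryption key that has been wrapped by Cloud KMS.
wrapped_key = b"..."

# Detect email addresses in the input.
inspect_config = {"info_types": [{"name": "EMAIL_ADDRESS"}]}

# Replace each detected email with a deterministic (two-way) token.
deidentify_config = {
    "info_type_transformations": {
        "transformations": [{
            "primitive_transformation": {
                "crypto_deterministic_config": {
                    "crypto_key": {
                        "kms_wrapped": {
                            "wrapped_key": wrapped_key,
                            "crypto_key_name": (
                                "projects/my-project/locations/us-central1/"
                                "keyRings/my-ring/cryptoKeys/my-key"
                            ),
                        }
                    },
                    # Tokens are emitted with this surrogate prefix.
                    "surrogate_info_type": {"name": "EMAIL_ADDRESS_TOKEN"},
                }
            }
        }]
    }
}

item = {"value": "Please contact jane.doe@example.com about the refund."}

response = dlp.deidentify_content(
    request={
        "parent": parent,
        "inspect_config": inspect_config,
        "deidentify_config": deidentify_config,
        "item": item,
    }
)
print(response.item.value)  # the sentence with the email replaced by a token
```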
Sensitive Data Protection takes things even further, offering in-line tokenization for unstructured, natural-language content. This allows you to tokenize data in the middle of a sentence, and if you pick two-way tokenization (and have the right access permissions), you can detokenize the data when necessary.
This opens up a whole new set of use cases, including runtime tokenization of logs and customer chats, or even tokenization as part of a generative AI serving framework. We’ve also built this technology directly into the Contact Center AI and Dialogflow services so that you can tokenize customer engagements on the fly.
The image above shows a raw input that contains an identifier (email address) along with a masked output that shows this email in tokenized form.
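For the two-way case, a matching re-identification call can reverse those surrogate tokens for callers with the right permissions. The sketch below is a hypothetical counterpart to the previous snippet, assuming the same placeholder KMS-wrapped key and surrogate info type.

```python
# Minimal sketch: detokenize (re-identify) surrogate tokens embedded in free text.
# Assumes the same placeholder key and surrogate info type as the previous sketch.
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/us-central1"  # placeholder project and region

wrapped_key = b"..."  # placeholder: the same Cloud KMS-wrapped key used to tokenize
surrogate_type = {"name": "EMAIL_ADDRESS_TOKEN"}

# Tell the inspector what the surrogate tokens look like so they can be found in text.
inspect_config = {
    "custom_info_types": [{"info_type": surrogate_type, "surrogate_type": {}}]
}

# Reverse the deterministic transformation that produced the tokens.
reidentify_config = {
    "info_type_transformations": {
        "transformations": [{
            "primitive_transformation": {
                "crypto_deterministic_config": {
                    "crypto_key": {
                        "kms_wrapped": {
                            "wrapped_key": wrapped_key,
                            "crypto_key_name": (
                                "projects/my-project/locations/us-central1/"
                                "keyRings/my-ring/cryptoKeys/my-key"
                            ),
                        }
                    },
                    "surrogate_info_type": surrogate_type,
                }
            }
        }]
    }
}

# Placeholder: a token string produced by a previous de-identification call.
token = "EMAIL_ADDRESS_TOKEN(52):AYCLw6BhB0QvauFE5ZPC86Jbn59VogYtTrE7w+rdArLr"
item = {"value": f"Escalate the case for {token} to a human agent."}

response = dlp.reidentify_content(
    request={
        "parent": parent,
        "inspect_config": inspect_config,
        "reidentify_config": reidentify_config,
        "item": item,
    }
)
print(response.item.value)  # the original email restored in place of the token
```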
Tokenization with BigQuery
In addition to serverless access through Sensitive Data Protection, we also offer tokenization directly in BigQuery. This gives you tokenization methods at your fingertips in BigQuery SQL queries, User Defined Functions (UDFs), views, and pipelines.
Tokenization technology is built directly into the BigQuery engine to work at high speed and high scale for structured data, such as tokenizing an entire column of values. The resulting tokens are compatible and interoperable with those generated through our Sensitive Data Protection engine. That means you can tokenize or detokenize in either system without incurring unnecessary latency or costs, all while maintaining the same referential integrity.
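As a hedged illustration of what this can look like from the Python BigQuery client, the query below applies BigQuery’s deterministic AEAD encryption function to an email column. The dataset, table, and keyset names are hypothetical; in practice you would typically use a Cloud KMS-wrapped keyset shared with Sensitive Data Protection so the resulting tokens interoperate across both systems.

```python
# Minimal sketch: tokenize a column in BigQuery with deterministic encryption.
# Assumes the google-cloud-bigquery library and hypothetical dataset/table names;
# the `my_dataset.keys` table is assumed to hold a single deterministic keyset,
# e.g. one created with KEYS.NEW_KEYSET('DETERMINISTIC_AEAD_AES_SIV_CMAC_256').
from google.cloud import bigquery

client = bigquery.Client()

# DETERMINISTIC_ENCRYPT returns the same ciphertext for the same key and input,
# so the token column preserves referential integrity for joins and aggregations.
query = """
SELECT
  t.transaction_id,
  DETERMINISTIC_ENCRYPT(k.keyset, t.email, '') AS email_token
FROM `my_dataset.transactions` AS t
CROSS JOIN `my_dataset.keys` AS k
"""

for row in client.query(query).result():
    print(row.transaction_id, row.email_token)
```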
Using tokens to solve real problems
While the token obscures the raw value and reduces risk, utility and value are still preserved. Consider the following table, which has four rows and three unique values: value1, value2, and value3.
<value1> → <token1>
<value2> → <token2>
<value1> → <token1>
<value3> → <token3>
Here you can see that each value is replaced with a token. Notice how “value1” gets “token1” consistently. If you run an aggregation and count unique tokens, you’ll get a count of three, just like on the original value. If you were to join on the tokenized values, you’d get the same type of joins as if joining on the original value.
This simple approach unlocks a lot of use cases.
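The toy sketch below (plain Python, not the algorithm the service uses) shows why deterministic tokenization preserves these properties: equal inputs always map to equal tokens, so distinct counts and joins behave the same as they would on the raw values.

```python
# Toy illustration of deterministic tokenization and referential integrity.
# Not the product's algorithm; just a keyed, deterministic mapping.
import base64
import hashlib
import hmac

SECRET_KEY = b"demo-key"  # placeholder key for the sketch

def tokenize(value: str) -> str:
    """Map equal inputs to equal tokens without revealing the raw value."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).digest()
    return base64.b64encode(digest).decode()

rows = ["value1", "value2", "value1", "value3"]
tokens = [tokenize(v) for v in rows]

# Same number of unique tokens as unique raw values, so aggregations match.
assert len(set(tokens)) == len(set(rows)) == 3

# A join key computed from the same value matches the same rows the raw value would.
assert tokenize("value1") == tokens[0] == tokens[2]
```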
Obfuscating real-world risk
Consider the use case of running fraud analysis across 10 million user accounts. In this case, let’s say that all of your transactions are linked to the end user’s email address. An email address is an identifier that poses several risks:
It can be used to contact the end-user who owns that email address.
It may link to data in other systems that are not supposed to be joined.
It may reveal someone’s real-world identity and risk exposing that identity’s connection to internal data.
It may leak other forms of identity, such as the name of the owner of the email account.
Let’s say that the token for that email is “EMAIL(44):AYCLw6BhB0QvauFE5ZPC86Jbn59VogYtTrE7w+rdArLr” and this token has been scoped only to the tables and datasets needed for fraud analysis. That token can now be used in place of the email address: you can tokenize the emails across all the transaction tables and then run fraud analysis.
During this analysis, any users or pipelines exposed to the data would see only the obfuscated emails, protecting your 10 million users while unblocking your business.
Next steps
Tokenization provides a powerful way to protect sensitive information while still allowing for essential data operations. By replacing sensitive data with non-sensitive substitutes, tokens can significantly reduce the risk of data breaches and simplify compliance efforts. Google Cloud simplifies tokenization by offering a readily available, scalable, and region-aware service, allowing you to focus on your core business rather than managing infrastructure.
To get started with tokenization on Google Cloud, see the following:
Sensitive Data Protection Tokenization and pseudonymization
Learn more about BigQuery Encrypt and Decrypt compatibility with Sensitive Data Protection
AI Summary and Description: Yes
Summary: The text discusses the implementation of tokenization as a security measure to protect sensitive data within industries such as finance and healthcare. It highlights Google Cloud’s offerings for tokenization, which simplify the integration process and ensure compliance with regulatory requirements while maintaining data utility.
Detailed Description:
The text presents an overview of tokenization as a robust technique for safeguarding sensitive data across various sectors, particularly finance and healthcare. Tokenization replaces sensitive data with a non-sensitive equivalent (token), thereby reducing risks associated with data breaches.
Key Points:
– **Token Definition and Functionality**:
– Tokens act as stand-ins for raw data and are often scrambled or encrypted.
– They allow users to perform data operations without exposing the sensitive data itself.
– **Google Cloud’s Tokenization Capabilities**:
– Google Cloud’s Sensitive Data Protection service offers scalable tokenization features that can be integrated into existing applications.
– Provides serverless API endpoints for on-the-fly tokenization.
– Fully regionalized processing, supporting compliance with various regulatory regimes.
– Pricing is based on data throughput rather than upfront costs.
– **Use-cases and Applications**:
– Tokenization can be applied to unstructured data, such as customer interactions, enabling sensitive information protection during real-time exchanges.
– Integration with Google Cloud services like Contact Center AI and Dialogflow for on-the-fly tokenization during customer engagement.
– **Tokenization with BigQuery**:
– Enables tokenization methods within BigQuery SQL queries, User Defined Functions (UDFs), views, and pipelines.
– Supports high-speed and high-scale tokenization, seamlessly interoperating with the Sensitive Data Protection engine.
– **Benefits of Tokenization**:
– Maintains referential integrity, allowing data to be aggregated and joined as if the original data were used.
– Effectively obfuscates sensitive values, reducing real-world risk while retaining the utility of data in complex analyses (e.g., fraud detection).
– **Practical Implications**:
– Tokenization facilitates the protection of user identities, such as email addresses, significantly lowering risks tied to personal data exposure.
– By preventing direct references to sensitive data during analyses, organizations can minimize the potential for data breaches.
– **Next Steps**:
– Encouraging users to explore Google Cloud’s offerings for tokenization services as part of their data protection strategies to streamline compliance and secure sensitive information.
This information is particularly valuable for security and compliance professionals, as it emphasizes practical applications of tokenization technology, regulatory considerations, and the associated benefits in enhancing data security.