Five Special-Use Features of Google Cloud Spanner
Diving Deep into Unique Database Functionalities: Spanner's TABLESAMPLE, Row TTL, Commit Timestamps, and More
Spanner is my database system of choice. It combines the features of SQL and highly distributed databases in an elegant way. I've written some in-depth articles on this topic, but for this piece, I want to quickly highlight five features which you might not need in your day-to-day database dealings but are intriguing for specific use cases. Let's dive in!
The Features
TABLESAMPLE Operator
With Spanner's TABLESAMPLE operator, you can execute a query on a random sample of a table rather than the entire table. You can specify the sample either as a fixed number of rows or as a percentage of the table's rows.
How to Use It
Retrieve a fixed-size sample from a table with the command:
SELECT *
FROM Table TABLESAMPLE RESERVOIR (100 ROWS);
If the table contains fewer rows than you requested, the query returns the entire table. Otherwise, exactly 100 rows are returned, where every possible set of 100 rows has an equal probability of being selected.
To get approximately a percentage of the table, use:
SELECT *
FROM Table TABLESAMPLE BERNOULLI (0.1 PERCENT);
Each row has a 0.1% chance of appearing in the sample, with the probability applying individually to each row.
When to Use It
This operator is particularly useful for quick assessments or when working with vast datasets, where pulling the entire set of data might be impractical or time-consuming. It enables a more efficient and manageable way to interact with your data, especially for preliminary analyses or tests.
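For example, to get a quick feel for a large table without scanning all of it, a sketch like the following works well (UserClicks is the example table I define in the next section):
-- Each row has a 1% chance of landing in the sample,
-- so the counts are rough estimates, not exact numbers.
SELECT ClickedLink, COUNT(*) AS sampled_clicks
FROM UserClicks TABLESAMPLE BERNOULLI (1 PERCENT)
GROUP BY ClickedLink
ORDER BY sampled_clicks DESC;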
Limitations
Using TABLESAMPLE BERNOULLI does not guarantee the size of the sample. The query returns approximately the requested percentage of the table's rows, and the actual count can vary between runs.
Row Deletion Policy
Spanner can automatically delete rows after a specified duration. This feature, known as Row TTL (Time to Live), requires a TIMESTAMP column in your table.
How to Use It
Implement a Row Deletion Policy with:
CREATE TABLE UserClicks(
  Key INT64,
  User INT64,
  ClickedLink STRING(MAX),
  CreatedAt TIMESTAMP,
) PRIMARY KEY (Key),
  ROW DELETION POLICY (OLDER_THAN(CreatedAt, INTERVAL 30 DAY));
To modify an existing table:
ALTER TABLE UserClicks
ADD ROW DELETION POLICY (OLDER_THAN(CreatedAt, INTERVAL 30 DAY));
ALTER TABLE UserClicks
DROP ROW DELETION POLICY;
When to Use It
This feature is especially useful for managing data lifecycle in a more automated and efficient way, ensuring that your database does not hold onto unnecessary data for longer than needed.
A common use case for this feature is an "insert-only log table" (comparable to a Kafka topic). Events are stored in such a table together with additional information, which allows detailed investigation of an event's origin and state at any given time. Since events are produced frequently, they can be set to be deleted automatically from the database after 30 days.
Other applications include complying with regulatory requirements that limit how long data may be retained.
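To audit which tables currently have a policy attached, you can query the information schema (via the ROW_DELETION_POLICY_EXPRESSION column on INFORMATION_SCHEMA.TABLES, if I remember the column name correctly):
-- Lists every table with a configured row deletion policy.
SELECT TABLE_NAME, ROW_DELETION_POLICY_EXPRESSION
FROM INFORMATION_SCHEMA.TABLES
WHERE ROW_DELETION_POLICY_EXPRESSION IS NOT NULL;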
Limitations
After a database is restored from a backup, all Row Deletion Policies are removed!
Only one policy per table is allowed.
Cascading deletes (for example, to interleaved child tables) may hit the per-transaction mutation limit.
The deletion process runs in the background and is not deterministic, meaning rows are deleted up to 72 hours after the specified duration has passed.
Commit Timestamp
Commit Timestamps offer precise timestamps that reflect the exact moment a transaction was committed, utilizing Google's TrueTime technology.
How to Use It
Define a column for Commit Timestamps:
CommitTimestamp TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true)
When writing a row, you don't supply a concrete value for this column; instead you use the placeholder PENDING_COMMIT_TIMESTAMP(), which Spanner replaces with the actual commit timestamp once the transaction commits. The GCP Spring API uses CommitTimestamp.of(Timestamp::class.java) for this purpose.
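In plain DML it looks like this; a minimal sketch that assumes the UserClicks table from above was created with CreatedAt declared as a commit-timestamp column:
-- Spanner replaces the placeholder with the real commit timestamp
-- once the transaction commits.
INSERT INTO UserClicks (Key, User, ClickedLink, CreatedAt)
VALUES (1, 42, 'https://example.com', PENDING_COMMIT_TIMESTAMP());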
When to Use It
Commit Timestamps are useful for maintaining accurate and consistent data change histories, aiding in tracking and auditing within applications. They are invaluable wherever precise data modification times are required, and they pair naturally with Row Deletion Policies, since the commit timestamp records exactly when a row was written.
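Combining both features, a self-expiring log table could be sketched like this (AuditLog and its columns are made up for illustration):
CREATE TABLE AuditLog(
  Key INT64,
  Payload STRING(MAX),
  -- Filled in by Spanner at commit time, see above.
  CommitTs TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true),
) PRIMARY KEY (Key),
  ROW DELETION POLICY (OLDER_THAN(CommitTs, INTERVAL 30 DAY));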
Query Statistics
Spanner provides extensive statistics for individual queries as well as aggregated data, accessible through the SPANNER_SYS system views, offering insights into various aspects of your database.
How to Use Them
Access detailed query statistics through system views for different time frames, such as the last minute, 10 minutes, and hour, with specific views for each. These include key data points like query text, execution count, average latency, and failed execution count.
SELECT * FROM SPANNER_SYS.QUERY_STATS_TOP_MINUTE;
SELECT * FROM SPANNER_SYS.QUERY_STATS_TOP_10MINUTE;
SELECT * FROM SPANNER_SYS.QUERY_STATS_TOP_HOUR;
Each table contains statistics for the stated time period, with no overlap between periods.
Key statistical data points include:
TEXT (truncated): the SQL query text.
EXECUTION_COUNT: how many times the query was executed.
AVG_LATENCY_SECONDS: the query's average latency in seconds.
ALL_FAILED_EXECUTION_COUNT: the count of all failed executions.
For a broader view, Spanner also aggregates query statistics over the same intervals:
SELECT * FROM SPANNER_SYS.QUERY_STATS_TOTAL_MINUTE;
SELECT * FROM SPANNER_SYS.QUERY_STATS_TOTAL_10MINUTE;
SELECT * FROM SPANNER_SYS.QUERY_STATS_TOTAL_HOUR;
When to Use Them
These statistics are invaluable tools for developers and database administrators looking to optimize query performance, identify slow queries, and ensure that their Spanner databases are running as efficiently as possible.
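For example, to surface the slowest queries of the most recent one-hour interval (INTERVAL_END marks the end of each interval in these views):
-- Slowest queries in the most recent one-hour interval.
SELECT TEXT, EXECUTION_COUNT, AVG_LATENCY_SECONDS
FROM SPANNER_SYS.QUERY_STATS_TOP_HOUR
WHERE INTERVAL_END = (
  SELECT MAX(INTERVAL_END) FROM SPANNER_SYS.QUERY_STATS_TOP_HOUR
)
ORDER BY AVG_LATENCY_SECONDS DESC
LIMIT 10;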
Limitations
Data for TOTAL_MINUTE is available for up to 6 hours.
Data for 10MINUTE goes back up to 4 days.
Data for HOUR is available for up to 30 days.
Export to CSV
Exporting data to CSV sits somewhere between a Dataflow feature and a Spanner feature, but it is incredibly useful for quick and seamless data extraction from Spanner.
How to Use It
Google has a Dataflow template ready to do the job for you. You can either use the Google Cloud Console or gcloud:
gcloud dataflow jobs run export-spanner \
  --gcs-location gs://dataflow-templates-europe-west1/latest/Spanner_to_GCS_Text \
  --region europe-west1 \
  --staging-location gs://temp-file-bucket/temp \
  --additional-user-labels {} \
  --parameters \
spannerTable=spanner-table,\
spannerProjectId=project-id-where-the-spanner-instance-is,\
spannerInstanceId=spanner-instance,\
spannerDatabaseId=spanner-database,\
spannerHost=https://batch-spanner.googleapis.com,\
textWritePrefix=gs://output-bucket
When to Use It
CSV exports are versatile for data analysis or importing into another database, supporting various use cases with the widely accepted CSV format.
Limitations
Exports are limited to complete tables; exporting query results directly is not supported.
Roundup
If you found this post informative, you might also be interested in my insights on Joins in Spanner. Drop a comment and share this article with peers who might find it enlightening!
Stay tuned for more!