Create Log-Based Metrics in Google Cloud and Gain Valuable Insights

Use Terraform to create your metrics and deepen the understanding of your system

Sep 16, 2023

You have to know the current state of your system. Is there an unusual increase in errors? Is the load normal, or do you experience traffic spikes? What’s the latency? Without that knowledge, you might overlook problems on the horizon. For example, a sudden increase in 401 or 403 errors might indicate that someone is trying to break into your system.

You get many things out of the box on the Google Cloud Platform. There are hundreds of predefined metrics. However, not even Google can foresee every metric you need. Because of this, you can create your own metrics.

Log-Based Metrics

Log-based metrics extract information from log messages. They can draw on all kinds of logs, even those of services you do not own. This allows you to create a variety of custom metrics.

Two types of log-based metrics exist:

Counters count the occurrences of matching log messages. They tell you how often something happens in your system, for example, how often your service publishes a message.
Distributions associate each log message with a value. For example, they match a log message that states how long a process ran. You can use this to find the average runtime, outliers, and similar statistics.

Log-based metrics only consider log messages that were logged after the metric was created.

We will use Terraform to create both types. Suppose we have an application that prints log messages like:

Bob logged in.
Eve logged in.
Alice logged in.

Init took 5 seconds.
Init took 3 seconds.
CleanUp took 5 seconds.
CleanUp took 13 seconds.

We will use the former logs to create a counter and the latter to create a distribution.

Log-Based Counter

We want to count how often users log in. Let us look at the log messages.

Bob logged in.
Eve logged in.
Alice logged in.

As stated above, a counter only counts the occurrences of log messages. Creating a separate counter for each user would not make sense.

The usual way of solving this problem is to create one metric and add labels to it. The labels help to see which data point belongs to which user login. You can do the same with log-based metrics. With this in mind, the message we want to count looks more like this:

$username logged in.

The following Terraform resource adds this metric to your project:

resource "google_logging_metric" "login" {
  project = var.project
  name    = "login"
  filter  = "resource.type:(run) AND textPayload:(\"logged in\")"
  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
    labels {
      key         = "username"
      value_type  = "STRING"
    }

  }
  label_extractors = {
    "username" = "REGEXP_EXTRACT(textPayload, \"(.+) logged in\")"
  }
}

How does it work?

On line 4, we define which log messages to consider. We only want messages logged by Cloud Run that contain the string “logged in.”
Lines 6 and 7 tell Google Cloud we want to create a counter.
Lines 9 and 10 define the username label I mentioned earlier. It is of type STRING. The other possible types are INT64 and BOOL. Interestingly, floating-point values are not possible.
Line 15 defines how to extract the value for the label username. The metric evaluates the regular expression against each log message, and the value of the capture group becomes the value of the label. Therefore, you can only define one capture group. Caution! The log message may be in another field, for example jsonPayload.message. This depends on how your application writes log messages.
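If your application writes structured logs, the message text sits in jsonPayload rather than textPayload. The following is a minimal sketch of the same counter under that assumption. The field name jsonPayload.message, the resource name login_json, and the metric name login-json are assumptions for illustration; check your own log entries in the Logs Explorer before copying it.

```hcl
# Sketch only: assumes the application logs JSON with a "message" field.
resource "google_logging_metric" "login_json" {
  project = var.project  # same variable as in the example above
  name    = "login-json" # hypothetical metric name
  filter  = "resource.type:(run) AND jsonPayload.message:(\"logged in\")"

  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
    labels {
      key        = "username"
      value_type = "STRING"
    }
  }

  label_extractors = {
    # Same regular expression as before, applied to the JSON field instead.
    "username" = "REGEXP_EXTRACT(jsonPayload.message, \"(.+) logged in\")"
  }
}
```

Only the filter and the extractor change; the metric descriptor stays the same as for the textPayload variant.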

That’s it for the counter! We use the Metrics Explorer to look at it. The metric’s name is logging/user/login.

Figure: The custom login metric displayed in the Metrics Explorer.

We selected our metric (1) and grouped it by the label username (2). Now we know Bob logged in 27 times in the last hour (3).

Log-Based Distribution

Counters come in handy in many situations. However, they are limited. If you want to gain information about things like job run time in seconds, they will not be of much help. You could add a label that captures the job run time, but you cannot calculate things like percentiles or the mean.

Distributions are the right fit for this kind of task.

Remember what the application logs were:

Init took 5 seconds.
Init took 3 seconds.
CleanUp took 5 seconds.
CleanUp took 13 seconds.

This time, we want to search for this pattern:

$operation took $value seconds.

Let us look at the Terraform resource for this metric:

resource "google_logging_metric" "ops_duration" {
  project = var.project
  name    = "operation-duration"
  filter  = "resource.type:(run) AND textPayload:(took seconds)"
  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "DISTRIBUTION"
    unit        = "s"
    labels {
      key        = "operation"
      value_type = "STRING"
    }
  }
  value_extractor  = "REGEXP_EXTRACT(textPayload, \"\\\\w+ took (\\\\d+) seconds\")"
  label_extractors = {
    "operation" = "REGEXP_EXTRACT(textPayload, \"(\\\\w+) took \\\\d+ seconds\")"
  }
  bucket_options {
    linear_buckets {
      num_finite_buckets = 10
      width              = 1
    }
  }
}

It is more complex than the counter, but we also see similarities.

Line 4 contains the filter again. This time, we want our messages to contain the words “took” and “seconds.”
Lines 6 to 8 tell Google Cloud to create a distribution metric where the values are in seconds.
Lines 9 to 12 define the label for the operation name, just as with the username. Lines 15 to 17 contain the corresponding extractor.
There is an interesting part in line 14. It looks like the label extractor below it and works similarly. The value_extractor gets the value (run time in seconds) from the log message.
Lines 18 to 23 define how to group the log entries. We define ten buckets, each one second wide. The first bucket contains all log entries where the job took 0 to 1 seconds; the fifth bucket contains those where it took 4 to 5 seconds. If needed, you can use a more sophisticated approach, like exponential buckets.
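As a rough illustration of the exponential alternative, here is the same metric with an exponential_buckets block instead of linear_buckets. This is a sketch, not a recommendation: the resource name, the metric name, and the three bucket parameters are assumptions chosen for the example, and you should pick values that match the run times you actually observe.

```hcl
# Sketch only: same metric as above, but with exponential buckets.
# Resource name, metric name, and bucket parameters are illustrative assumptions.
resource "google_logging_metric" "ops_duration_exp" {
  project = var.project # same variable as in the examples above
  name    = "operation-duration-exp"
  filter  = "resource.type:(run) AND textPayload:(took seconds)"

  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "DISTRIBUTION"
    unit        = "s"
    labels {
      key        = "operation"
      value_type = "STRING"
    }
  }

  value_extractor = "REGEXP_EXTRACT(textPayload, \"\\\\w+ took (\\\\d+) seconds\")"
  label_extractors = {
    "operation" = "REGEXP_EXTRACT(textPayload, \"(\\\\w+) took \\\\d+ seconds\")"
  }

  bucket_options {
    # Finite bucket i covers [scale * growth_factor^(i-1), scale * growth_factor^i).
    # With these values: 1-2, 2-4, 4-8, 8-16, 16-32, and 32-64 seconds.
    exponential_buckets {
      num_finite_buckets = 6
      growth_factor      = 2
      scale              = 1
    }
  }
}
```

Exponential buckets give fine resolution for short operations while still covering long ones with only a few buckets. Values below 1 second or at 64 seconds and above land in the underflow and overflow buckets.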

Open the Metrics Explorer to see the heat map for the operation CleanUp.

We selected our metric (1) and added a filter to see the heat map for one operation (2). Now we can see at (3) that the operation took 10 to 11 seconds for about 19% of all invocations.

Final Word

It is really easy to set up log-based metrics. They are especially useful if you need to monitor something you do not have direct control over. However, use them (and other types of metrics) with care. You cannot separate a signal from the noise if you measure too much.

verbosemode
