Teams that build metrics and do data analysis often wonder: How much data should I collect? How long should I retain it? How should I structure it?
In general, the concepts here are:

- Collect as much data as you can, at the lowest level, with the “join keys” you need to connect it to other data sources.
- Derive simpler, purpose-built tables from that raw data to answer specific questions.
- Make sure that related events can always be connected, and that workflows have clear start and end points.
For example, imagine that you are collecting data about your code review tool.
Ideally, you record every interaction with the tool, every important touchpoint, and so on into one data set that knows the exact time of each event, the type of event, all the context around that event, and the important “join keys” you might need to connect this data with other data sources (for example, the IDs of commits, the IDs of code review requests, the username of the person taking the action, the username of the author, and so on).
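As a concrete sketch of what one such event record might look like (the field names here are hypothetical, not from any particular tool):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ReviewToolEvent:
    """One row in the low-level "collect everything" event store."""
    event_time: datetime   # exact time of the event
    event_type: str        # e.g. "review_requested", "comment_posted"
    context: dict          # all the context around the event
    # Join keys for connecting this event to other data sources:
    commit_id: Optional[str] = None
    review_request_id: Optional[str] = None
    actor_username: Optional[str] = None   # the person taking the action
    author_username: Optional[str] = None  # the author of the change

# Example: a reviewer was asked to look at a change.
event = ReviewToolEvent(
    event_time=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    event_type="review_requested",
    context={"repo": "payments", "reviewer": "bob"},
    commit_id="abc123",
    review_request_id="cr-4567",
    actor_username="alice",
    author_username="alice",
)
```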
Then you figure out the specific business requirements you have around the data. For example, you want to know how long it takes reviewers to respond to requests for review. So you create a higher-level data source, derived from this “master” data source, which contains only the events relevant for understanding review responses, with the fields structured in a way that makes answering questions about code review response time really easy.
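Here is a minimal sketch of that derivation, assuming raw events shaped like the record above and a hypothetical “reviewer_responded” event type:

```python
def derive_review_responses(raw_events):
    """Derive a higher-level table: one row per review request, pairing
    the request with the reviewer's first response, so response-time
    questions become a simple subtraction rather than a complex query
    over the raw event store."""
    pending = {}  # review_request_id -> the "review_requested" event
    rows = []
    for e in sorted(raw_events, key=lambda ev: ev.event_time):
        if e.event_type == "review_requested":
            pending[e.review_request_id] = e
        elif e.event_type == "reviewer_responded":
            req = pending.pop(e.review_request_id, None)
            if req is not None:  # keep only the first response
                rows.append({
                    "review_request_id": e.review_request_id,
                    "author": req.author_username,
                    "reviewer": e.actor_username,
                    "requested_at": req.event_time,
                    "responded_at": e.event_time,
                    "response_seconds":
                        (e.event_time - req.event_time).total_seconds(),
                })
    return rows
```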
That’s a basic example, but there’s more to know about all of this.
It is impossible to go back in time and instrument systems to answer a question that you have in the present. It is also impossible to predict every question you will want to answer. You must already have the data.
As a result, you should strive to collect every piece of data you can about every system you are instrumenting. The only exceptions to this are:

- Data that is infeasible to store or search at the volume you would collect it.
- Data that creates unacceptable privacy or security risks.
As an extreme example, you could imagine a web-logging system that stored the entirety of every request and response. After all, that’s “everything!” But it would be impossible to search, impossible to store, and an extremely complex privacy nightmare.
The only other danger of “collecting everything” is storing the data in such a disorganized or complicated way that you can’t make any sense of it. You can solve that by remembering that no matter what you’re doing, you always want to produce insights from the data at some point. Hold a few questions in mind that you know people want to answer, and make sure that it’s at least theoretically possible to answer those questions with the data you’re collecting, with the fields you have, and with the format you’re storing the data in.
If your data layout is well-thought-out and provides sufficient coverage to answer almost any question that you could imagine about the system (even if it would take some future work to actually understand the answers to those questions) then you should be at least somewhat future-proof.
In general, you always want to have some idea of why you are collecting data. At the “lowest level” of your data system, the telemetry that “collects everything,” this is less important. But as you derive higher-level tables from that raw data, you want to ask yourself things like:

- What questions is this table supposed to answer?
- Who is going to consume this data, and what do they need from it?
Deriving purpose-built tables is where you take the underlying raw data and massage it into a format that is designed to solve specific problems. In general, you don’t want to expose the underlying complex “collect everything” data store to the world. You don’t even want to expose it directly to your dashboards. You want to have simpler tables, derived from the “everything” data store, that are designed for some specific purpose.
You can have a hierarchy of these tables. Taking our code review example:

1. The raw “collect everything” event store for the code review tool.
2. A table containing all code review events, with clean, structured fields.
3. A table designed specifically to power a dashboard about code review response times.
You’ll find that people rarely, if ever, want to query the table from Step 1 directly (because it’s hard to do so), and only sometimes want to query the table from Step 2. The table from Step 3, though, becomes a useful tool in and of itself, even beyond just the dashboard. That is, the act of creating a table specifically for the dashboard produces a useful data source that people sometimes want to query directly.
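As a sketch of what that hierarchy might look like (table and field names are made up for illustration), each level is just a query over the level below it:

```python
# Step 2: all code review events, pulled out of the raw "everything"
# store and given clean, structured fields. The LIKE filter is purely
# illustrative; a real pipeline would enumerate the event types.
STEP_2_REVIEW_EVENTS = """
CREATE TABLE review_events AS
SELECT event_time, event_type, review_request_id,
       actor_username, author_username
FROM raw_events
WHERE event_type LIKE 'review%'
"""

# Step 3: a table shaped specifically for the response-time dashboard,
# derived from Step 2 rather than from the raw store.
STEP_3_RESPONSE_TIMES = """
CREATE TABLE review_response_times AS
SELECT review_request_id,
       MIN(CASE WHEN event_type = 'review_requested'
                THEN event_time END) AS requested_at,
       MIN(CASE WHEN event_type = 'reviewer_responded'
                THEN event_time END) AS responded_at
FROM review_events
GROUP BY review_request_id
"""
```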
Sometimes, trying to build one of these purpose-built tables will also show you gaps in your data-collection systems. It can be a good idea to have one of these purpose-built tables or use cases in mind even when you’re designing your systems for “collecting everything,” because they can make you realize that you missed some important data.
In general, the better you know the requirements of your consumers, the better job you can do at designing these purpose-built tables. It’s important to understand the current and potential requirements of your consumers when you design data-gathering systems. This should be accomplished by actual research into requirements, not just by guessing.
It should be possible to know when a large workflow starts and when it ends. We should know that a developer intended something to happen, the steps involved in accomplishing that intention, when the whole workflow started, and when it ended. We should not have to develop a complex algorithm to determine these things by looking at the stored data. The stored data should contain enough information that answering these questions is easy.
We need to be able to connect every event within a workflow as being part of that workflow, and we need to know its boundaries—its start point and end point.
For example, imagine that we have a deployment system. Here’s a set of events that represent a bad data layout:

- 12:00 p.m.: User ran the command “Deploy It.”
- 12:01 p.m.: Binary “foo” deployed to production.
- 12:01 p.m.: Binary “bar” deployed to production.
- 12:05 p.m.: Health check passed on production machine 1.
- 12:05 p.m.: Health check passed on production machine 2.
We have no idea that “Deploy It” means to deploy those two binaries. What if there are a hundred simultaneous workflows going on? What if “Deploy It” has been run more than once in the last five minutes? We have no idea that those health checks signal the end of the deployment. In fact, do they signal the end of the deployment? Are there other actions that “Deploy It” is supposed to do? I’m sure the author of “Deploy It” knows the answers to that, but we, a central data team, have no way of knowing that, because it’s not recorded in the data store.
A better data layout would look like:

- 12:00 p.m.: User ran “Deploy It,” starting deployment workflow 1234, which intends to deploy the binaries “foo” and “bar.”
- 12:01 p.m.: Binary “foo” deployed to production, as part of workflow 1234.
- 12:01 p.m.: Binary “bar” deployed to production, as part of workflow 1234.
- 12:05 p.m.: Health check passed on production machine 1, as part of workflow 1234.
- 12:05 p.m.: Health check passed on production machine 2, as part of workflow 1234.
- 12:05 p.m.: Deployment workflow 1234 completed successfully.
You don’t have to figure out in advance every workflow that you might want to measure. When you know a workflow exists, record its start and end points. But even when you don’t know that a workflow exists, make sure that the data store always shows that two data points are related when they are related. For example, record that a merge is related to a particular PR. Record that a particular PR was part of a deployment. Record that an alert was fired against a binary that was part of a particular deployment. And so forth. It should be easy to connect any two related objects by querying your data store.
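To see the payoff, here’s a sketch of computing deployment durations when every event carries a hypothetical workflow_id join key and explicit start/end events, as in the better layout above:

```python
def workflow_durations(events):
    """With explicit start/end events and a workflow_id on every event,
    finding each workflow's boundaries is a lookup, not a heuristic."""
    starts, ends = {}, {}
    for e in events:  # e.g. {"time": ..., "type": ..., "workflow_id": ...}
        if e["type"] == "workflow_started":
            starts[e["workflow_id"]] = e["time"]
        elif e["type"] == "workflow_completed":
            ends[e["workflow_id"]] = e["time"]
    # Duration for every workflow whose start and end were both recorded.
    return {wid: ends[wid] - starts[wid]
            for wid in starts if wid in ends}
```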