Slang Labs’ Data Collection Strategy

The Oops Moment

Since joining Slang Labs, I have worked on the Android team. The job mostly involves feature additions, optimizations, and customer-specific requirements for the Android SDK.

“With great feature additions, comes the need for loads of Analytics.”

With each newly added feature came the need to add Analytics to gauge its performance and keep the feedback loop of constant SDK improvement running. Until recently, adding data points to track new features was a thoughtless task. The steps we followed could roughly be summed up as follows:

  1. Create a new Analytics event corresponding to the new feature
  2. Add a bunch of variables to track metrics associated with the new feature
  3. Dump the newly added event with all the metrics to the Analytics endpoint
  4. Pat yourself on the back for successfully shipping a new feature
  5. Never ever think of the pain our Data Science team has to suffer for any analysis requests

And the Analytics code in our SDK could be summed up as a collection of listeners on feature usage, each dumping events to our analytics endpoint whenever a callback was triggered.

Does this sound like something you do?

No? Great! On behalf of your Data Science team, I sincerely thank you.

Yes? You really should continue reading.

All was fine until a few months later, when I was asked to help the Data Science team with some work. Having always been interested in Data Science, I agreed; it was an excellent opportunity to gain first-hand experience of the inner workings of the product improvement lifecycle. I had no idea how hard my own work was about to come back and smack me in the face.

“It's called Karma. And it’s pronounced Ha-ha-ha.”

Up to this point, we had put zero thought into our data collection strategy, all in pursuit of the Android team’s target of “moving fast”. The side effect was that we left the Data Science team hopelessly ill-equipped to answer our PM’s question: “How well is this feature doing, and what can we do to improve it?”

We realized that although we collected every data point, the collection was so haphazard and disorganized that making sense of that data was a Herculean task.

Imagine standing in a warehouse with a list of items you need to build a chair. You have been told everything is present in the warehouse, but you don’t know where exactly. There is no catalogue of stored items, just a long line of shelves with raw materials kept in random order.

Your customer keeps calling you about the date of delivery of the chair while the warehouse owner keeps insisting that they did an amazing job of storing everything you need.

This was precisely what our Data Science team was facing. And I was brought in to help because I was the one who dumped the data (the raw materials, in our chair analogy) from the client SDK. So they needed me to make sense of “what went where”.

And each time I read through the Android code and reported the data collection format, I was met with exasperated sighs and told what a terrible approach it was, and how costly it would be to query the data and perform the analysis.

The Solution

There was no way around it; we had to redesign the Analytics module in our codebase. The data had to be organized, so the Data Science team could reliably provide insights without running after developers trying to make sense of the data. And having been down in the trenches with the Data Science team, I empathized with the scale of the task we had dumped on them.

We broke down the redesign into simpler parts.

  1. Design a State Machine for data collection

Rather than blindly dumping events to the server, there has to be an order to the data collection flow. We designed a finite state machine (FSM) that moves between well-defined sentinel states in response to analytics events, and data is dumped only when the FSM transitions between sentinel states. This was a departure from our earlier practice of firing an analytics event from every callback.

Let me explain with an example.

Imagine a login screen in your app. Instead of dumping a separate analytics event for each interaction, like user.clicked.email_input, user.clicked.password_input, user.clicked.login_button, and user.clicked.close_button, you should create a sentinel state like user.processed.login. The state machine enters the sentinel state only after the user logs in or closes the login screen. All events and data received between state transitions are saved, and once the transition completes, the accumulated data is dumped to Analytics.

The sentinel states should always be designed based on user journeys rather than control flows since user journeys remain unchanged, barring minor modifications. However, the control flow can change based on code architecture, bug fixes or new functionalities.

This ensures that with each analytics event, all relevant information about the user journey is present in one place, even as the codebase evolves with feature additions and optimizations.
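
To make this concrete, here is a minimal Kotlin sketch of the pattern. Everything in it (the state names, the AnalyticsEvent shape, the AnalyticsEndpoint interface) is hypothetical and simplified, not our actual SDK code:

```kotlin
// Hypothetical sketch; the state names, event shape and endpoint
// interface are illustrative, not Slang Labs' actual SDK code.

// Sentinel states model the user journey, not the code's control flow.
enum class LoginJourneyState { IDLE, LOGIN_IN_PROGRESS, LOGIN_PROCESSED }

data class AnalyticsEvent(val name: String, val properties: Map<String, String> = emptyMap())

interface AnalyticsEndpoint {
    fun send(events: List<AnalyticsEvent>)
}

class LoginAnalyticsStateMachine(private val endpoint: AnalyticsEndpoint) {
    private var state = LoginJourneyState.IDLE
    private val buffer = mutableListOf<AnalyticsEvent>()

    // Intermediate events are buffered, never sent individually.
    fun record(event: AnalyticsEvent) {
        if (state == LoginJourneyState.IDLE) state = LoginJourneyState.LOGIN_IN_PROGRESS
        buffer.add(event)
    }

    // The sentinel transition is the only place data leaves the device:
    // everything collected during the journey is dumped as one unit.
    fun onLoginProcessed(outcome: String) {
        buffer.add(AnalyticsEvent("user.processed.login", mapOf("outcome" to outcome)))
        endpoint.send(buffer.toList())
        buffer.clear()
        state = LoginJourneyState.LOGIN_PROCESSED
    }
}
```

With this shape, user.clicked.email_input and its siblings become buffered breadcrumbs inside a single user.processed.login dump instead of four separate events scattered across the warehouse.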

  2. Create a schema for your data and adhere to it

All the data you collect and dump should follow a predefined schema. This ensures three things:

  1. The Data Science team always knows what to expect from their queries. There are no surprises, no sudden additions of new fields.
  2. Neither the data scientists nor the developers have to waste time in meetings trying to figure out where a given metric lives in the data dump. The data scientists do not have to chase developers, and the developers do not have to read through the code to explain where data is stored.

Without a schema, a data scientist has to dig through the code (or have a developer do it for them), understand the raw data, how it is collected and what it means, clean it, and then figure out what they can reliably extract in terms of business metrics.

  3. If the data metrics are updated, the change can easily be tracked through schema version upgrades. A schema version makes each event self-describing: every event corresponds to exactly one schema version, so we can always interpret the data in a usable format.

Without a well-defined schema, data metrics could change arbitrarily without the knowledge of all the parties involved, leading to poor or inaccurate insights which ultimately guide business decisions.
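
A minimal sketch of what such a schema could look like in Kotlin, assuming a simple key-value property model (the field names, event names and allowed keys are all illustrative):

```kotlin
// Hypothetical versioned schema; field names and allowed keys are illustrative.
const val SCHEMA_VERSION = 3

data class AnalyticsPayload(
    val schemaVersion: Int = SCHEMA_VERSION,
    val eventName: String,
    val timestampMillis: Long,
    val properties: Map<String, String>
)

// Metrics cached once and stamped onto every event (see the next section).
private val commonKeys = setOf("user_age", "orders_placed")

// Each event declares exactly which keys it may carry.
private val eventKeys = mapOf(
    "user.processed.login" to setOf("outcome", "duration_ms"),
    "user.completed.search_session" to setOf("searches", "purchases", "conversion_rate")
)

// The gate: nothing that violates the schema is ever dumped, so the
// warehouse never sees surprise fields.
fun validate(payload: AnalyticsPayload): Boolean {
    val allowed = (eventKeys[payload.eventName] ?: return false) + commonKeys
    return payload.properties.keys.all { it in allowed }
}
```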

  3. Optimize for collection over computation

If you need to compute something, there are two ways to do it: at the time of data collection or at the time of data query. Whenever both options are available, prefer computing during data collection.

This is because storage is far cheaper than computation on the cloud.

This can apply to multiple situations:

  1. Suppose you want to track user conversion after a search operation. Instead of simply dumping the number of searches and the number of purchases to the server, compute the conversion on the client itself and include it in the data dump. When tracking user conversions later, all you need to do is pull the precomputed value, and you are ready to proceed.
  2. Suppose you have a metric that is useful across different analytics events (e.g. user age or the number of orders placed). Instead of dumping it in one event (e.g. user.login.success) and performing join operations to populate the metric for other events during analysis, it is far cheaper to cache the metric and dump it as a common metric with every event. Both situations are illustrated in the sketch below.
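
Continuing the hypothetical sketch (and reusing AnalyticsPayload from the schema example), this is roughly what both ideas look like in Kotlin: the conversion is computed on the client at collection time, and cached common metrics are stamped onto the event, so analysis needs neither extra computation nor joins:

```kotlin
// Hypothetical sketch; builds on AnalyticsPayload from the schema example.

// Common metrics cached once (e.g. after login) and attached to every dump,
// so no join is needed to populate them at analysis time.
object CommonMetrics {
    var userAge: Int? = null
    var ordersPlaced: Int? = null
}

fun buildSearchConversionEvent(searches: Int, purchases: Int): AnalyticsPayload {
    // Computed at collection time on the client; the warehouse stores only
    // the result instead of recomputing it for every query.
    val conversionRate = if (searches > 0) purchases.toDouble() / searches else 0.0
    return AnalyticsPayload(
        eventName = "user.completed.search_session",
        timestampMillis = System.currentTimeMillis(),
        properties = mapOf(
            "searches" to searches.toString(),
            "purchases" to purchases.toString(),
            "conversion_rate" to "%.3f".format(conversionRate),
            "user_age" to (CommonMetrics.userAge?.toString() ?: "unknown"),
            "orders_placed" to (CommonMetrics.ordersPlaced?.toString() ?: "unknown")
        )
    )
}
```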

  4. Test everything

Even after defining a schema and a strict state machine for the Analytics module, the whole exercise would be futile if there were no way to enforce them.

I cannot stress this enough: you need unit tests for the code responsible for collecting and dumping the data. Because Analytics is usually an afterthought to feature work, unit tests ensure that a new feature does not accidentally break the entire collection pipeline.

Client-side tests are the bare minimum, but beyond that, it is a great habit to also have integration tests which enforce the data collection rules at your server endpoint.
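
As a sketch of what those client-side tests might look like, here are two hypothetical JUnit tests against the earlier examples: a fake endpoint captures what the FSM dumps, and every payload is checked against the schema gate:

```kotlin
import org.junit.Assert.assertEquals
import org.junit.Assert.assertTrue
import org.junit.Test

class AnalyticsModuleTest {
    // Test double that records dumps instead of hitting the network.
    private class FakeEndpoint : AnalyticsEndpoint {
        val dumps = mutableListOf<List<AnalyticsEvent>>()
        override fun send(events: List<AnalyticsEvent>) { dumps.add(events) }
    }

    @Test
    fun `events are dumped only on the sentinel transition`() {
        val endpoint = FakeEndpoint()
        val fsm = LoginAnalyticsStateMachine(endpoint)

        fsm.record(AnalyticsEvent("user.clicked.email_input"))
        fsm.record(AnalyticsEvent("user.clicked.login_button"))
        assertTrue(endpoint.dumps.isEmpty())          // nothing leaves mid-journey

        fsm.onLoginProcessed(outcome = "success")
        assertEquals(1, endpoint.dumps.size)          // one dump per journey
        assertEquals(3, endpoint.dumps.first().size)  // buffered events + sentinel
    }

    @Test
    fun `payloads satisfy the schema before dumping`() {
        val payload = buildSearchConversionEvent(searches = 5, purchases = 1)
        assertTrue(validate(payload))
    }
}
```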

Aal Izz Well…

Once we completed the redesign, the entire data collection and analysis process became streamlined, and our Data Science team could process analysis requests reliably and consistently. The Android team, in turn, no longer received complaints about poorly formatted data and could ship feature additions and new data collection metrics with the assurance that everything would be stored in a properly consumable format.

So let us take a look at the key learnings from this experience:

  1. If the data team cannot make sense of the collected data, the data warehouse is simply a data swamp, and no one likes working in a swamp. Value your data; don’t throw it into a swamp.
  2. You need to think about your data collection strategy while building your product. If you put it off for later, you might not be able to extract valuable business insights from the collected data.
  3. While building your Analytics state machine, remember to assign sentinel states based on user journeys in your application and let control-flow events merely drive the transitions. Piggybacking on control-flow events to track Analytics will grow complex and buggy as your code flows evolve over time.
  4. Your Analytics schema should be sacrosanct: no event that does not adhere to the rules of the schema should ever be sent to your endpoint.
  5. Given a choice between spending resources on computation or storage, pick storage; computation is more expensive. Collect your data in a way that minimizes computation during analysis.
  6. Write tests for your analytics code to ensure you can reap the benefits of the entire exercise.