From Zero to A/B Testing

How we created our product experimentation framework and evolved our process around it. A possible roadmap for product organizations to kickstart scalable A/B testing.

Six months after I joined Handshake, we launched the first iteration of our very own A/B testing framework. We had a smooth launch and started rolling out experiments. Within six weeks of launching, we had five concurrent experiments with more in the pipeline. Analysis for each experiment took 1–2 days back then — we had to dig into our instrumentation and write scary-looking SQL queries to calculate conversion rates for cohorts. As experiments launched faster than we could analyze results, it became obvious that the analysis stage was the bottleneck. I often found myself explaining to stakeholders why it took a week to get results back on A/B tests. Making room for new analytics projects became considerably more difficult. How did we get here?

Fast-forward eight weeks and it’s a different story. Product teams can now use our internal self-service reporting tools to determine the outcome of experiments within hours of deploying. The multi-day analysis got squashed to minutes — no third-party A/B testing tool, no new external dependencies in our framework code.

In this post I talk about our choices and takeaways from building a product experimentation framework, how we iterated to make it scalable, and why we built it in-house to begin with. As it turns out, two of our company values were key to our success.

Setting The Stage

If you’re not familiar with Handshake, we’re a team of 80 people (and growing!) on a mission to democratize opportunity for students transitioning into the job market.

[Image: The Handshake team during our weekly all-hands]

Shortly after I came on board, we also brought in our first Growth PM. With his help, and with buy-in from the product organization, we decided to invest in A/B testing. Here’s our Growth PM’s take on our drive for experimentation:

“At Handshake, we are committed to be very measured in our approach and decision-making when it comes to building experiences, so we invested in creating our own growth experimentation platform in-house. This not only gave us the flexibility to adapt to a continuously evolving product, but also elevated the organization’s understanding and usage of data to inform product decisions.” 
— Andrew Yoon, Growth PM

That said, no company is born being great at product experimentation — integrating an A/B testing tool is just a piece of the puzzle. Knowing how and when to wield said tool is key, and so is having a clear process to deal with the results. I’ll expand on that soon.

The Tooling Decision

Let’s get right to it: there are plenty of tools out there to help you run and evaluate product experiments. Some are very thorough, others more bare-bones. However, after considering all our options, we decided to implement our own A/B testing framework internally. Ultimately, a couple of major factors influenced our choice:

  • Tool Reusability: we use LaunchDarkly at Handshake for feature management and quickly realized its traffic-splitting feature could be leveraged to split users for A/B testing purposes. Our familiarity with LaunchDarkly and its integration with our development processes made the implementation smoother. We also used Looker, our analytics tool, to calculate and publish metrics on experiment results.
  • Customizability: by designing and managing the framework ourselves, we could customize it for the more intricate metrics of strategic importance to Handshake. If you read a few blog posts on A/B testing or the documentation for popular third-party tools, you’ll notice a pattern of e-commerce-like scenarios being addressed. These aren’t always in tune with what we consider successful outcomes on the Handshake platform. E-commerce cares about user lifetime value and checkout rates; Handshake is interested in students’ career interests and the ultimate (elusive) job search outcome: did you find a job?

Cost was another minor contributor to our decision: no new third-party tool meant one less bill to pay. Yes, we spent extra people-hours building and managing the framework, but by our estimates the ramp-up, implementation, and management of a third-party tool would’ve taken equivalent effort.

Looking back, there was an additional benefit to our approach: it promoted data literacy internally. Data literacy is a concept we’ve been discussing within our team and is probably worth a post of its own. I understand it as a team’s ability to demonstrate a solid grasp of its dataset, its intricacies, and how to safeguard it. It took many of us to plan and implement the framework, and that exposure made everyone more knowledgeable about experimentation and user behavior.

The Groundwork: Instrumentation & Team

A functioning framework requires an efficient pipeline of impressions and conversions to evaluate. We use Segment to help us track user activity on our platform, plus other tools for data aggregation and transformation to simplify analytics and measurement. However, logging clicks and page views is not the same as explicitly tagging experiment conversions or impressions. Let me explain.

An experiment might be set up to measure a particular type of click, or a combination of actions, that results in a qualifying event: a conversion. When we started, we were not explicitly tracking events this way, which meant additional analysis effort post-experiment to detect qualifying events, e.g. a job application on our site. In other words, we knew how many people applied for the job, but it wasn’t obvious how many of them followed the special breadcrumb trail we laid out to get there. That makes for unreliable attribution of the conversion being tracked.
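To make the distinction concrete, here’s a minimal sketch using Segment’s Python library; the event and property names are made up for illustration and are not our actual instrumentation:

```python
import analytics  # Segment's analytics-python library, for illustration only

analytics.write_key = "YOUR_WRITE_KEY"  # placeholder

# Generic product instrumentation: tells us *that* a job application happened.
analytics.track("user_123", "Job Application Submitted", {"job_id": "456"})

# Experiment-aware instrumentation: the same action explicitly tagged with the
# experiment and variant, so attribution doesn't rely on reconstructing the
# funnel after the fact. (Names below are hypothetical.)
analytics.track("user_123", "Experiment Conversion", {
    "experiment": "profile-completion-prompt",
    "variant": "treatment",
    "source_event": "Job Application Submitted",
})
```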


The “framework” itself was a way to mitigate this problem: it generates experiment events tied to an impression (or conversion) we care about. Here’s a practical example: we encourage students on our platform to complete their user profiles so we can better match them to jobs. A “completed profile” is a measure of how much essential information a student has filled in. This is what the experiment logic might look like for an experiment to boost profile completion among seniors (see the sketch after the list):

  1. If a student doesn’t meet all profile information conditions, their profile is “incomplete”
  2. If a student is a senior, has an “incomplete” profile, and lands on the experiment page during the experiment window, track them
  3. Track relevant conversions downstream (retention, job applications, etc.)
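A rough sketch of that qualification logic, with illustrative field and model names rather than our actual schema:

```python
from datetime import datetime, timezone

# Field and attribute names below are illustrative, not our real models.
REQUIRED_PROFILE_FIELDS = ("major", "graduation_year", "work_experiences", "skills")

def profile_complete(student) -> bool:
    # Step 1: a profile is "complete" only if every essential field is filled in.
    return all(getattr(student, field, None) for field in REQUIRED_PROFILE_FIELDS)

def qualifies_for_impression(student, page, experiment) -> bool:
    # Step 2: only seniors with incomplete profiles who land on the experiment
    # page while the experiment is running count as impressions.
    now = datetime.now(timezone.utc)
    return (
        student.class_year == "senior"
        and not profile_complete(student)
        and page == experiment.target_page
        and experiment.starts_at <= now <= experiment.ends_at
    )

# Step 3: downstream conversions (profile completed, job applications, retention)
# are tracked as separate events and joined to these impressions at analysis time.
```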

Beyond formalizing the framework, having a nimble team to drive the instrumentation and experimentation was very much necessary. We created a cross-functional Growth Squad with members from the product, engineering, and data teams. The newly assembled squad handled the development of the framework and then transitioned to creating and managing growth experiments.

Eventually We Hit A Problem

As we iterated on our experimentation processes, we focused much of our design and development effort on the actual capability to track and store impressions and conversions for users who were subject to the experiment (a.k.a. the treatment group).

However, we could’ve spent more time thinking about what we needed in order to calculate changes in conversion rates. We could get conversion rates for treatment groups pretty easily — just divide conversions by impressions from our instrumentation — but ultimately we needed to compare that to a baseline (a.k.a. the control group) to establish the change in conversion rate, or uplift. That meant counting control users who would’ve qualified as potential impressions but never actually saw one.
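For the arithmetic itself, here’s a minimal sketch (not our production code) of a relative-uplift calculation paired with a two-proportion z-test for significance:

```python
from math import sqrt
from statistics import NormalDist

def uplift_and_significance(treat_conv, treat_impr, ctrl_conv, ctrl_impr):
    """Relative uplift and two-sided p-value from a two-proportion z-test."""
    p_t = treat_conv / treat_impr  # treatment conversion rate
    p_c = ctrl_conv / ctrl_impr    # control (baseline) conversion rate
    uplift = (p_t - p_c) / p_c     # relative change vs. the baseline
    # Pooled standard error under the null hypothesis of equal rates.
    p_pool = (treat_conv + ctrl_conv) / (treat_impr + ctrl_impr)
    se = sqrt(p_pool * (1 - p_pool) * (1 / treat_impr + 1 / ctrl_impr))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return uplift, p_value

# Example: uplift_and_significance(130, 1000, 100, 1000) -> (~0.30, ~0.04)
```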

 


To do this at the time, we had to establish an identical control cohort manually. Using the earlier example of users with incomplete profiles, we’d have to identify students in their senior year with incomplete profiles who landed on the dashboard page during the five-day period of the experiment AND weren’t part of the treatment group.
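Here’s roughly what that filtering looked like, sketched in pandas with made-up column names (in practice it lived in hand-written SQL):

```python
import pandas as pd

def control_cohort(page_views: pd.DataFrame, treatment_user_ids: set,
                   start: str, end: str) -> pd.DataFrame:
    """Seniors with incomplete profiles who saw the dashboard during the
    experiment window but were never assigned to the treatment group."""
    in_window = page_views["viewed_at"].between(start, end)
    mask = (
        (page_views["class_year"] == "senior")
        & ~page_views["profile_complete"]
        & (page_views["page"] == "dashboard")
        & in_window
        & ~page_views["user_id"].isin(treatment_user_ids)
    )
    return page_views[mask].drop_duplicates("user_id")
```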

We ran several experiments this way for a while, each time crafting custom SQL queries to manually calculate rates for control users. Each experiment required me to create charts from those custom queries and a dashboard to present the results. But before I could even start thinking about calculating statistical significance, two or three days had gone by and more experiments needed the same treatment. We had hit our ceiling.

In talks with my manager and our Growth PM, we agreed our process was due for an overhaul. We wanted to prioritize reliability, scalability, and velocity. There was talk of re-scoping some of the earlier instrumentation and revamping the process. There were also concerns about change and about shifting work onto others’ plates. This is where two of our company values came into play:

1. “Move quickly, but don’t rush”

At Handshake, we’re proud to work at a very fast operational speed. As a startup, velocity is very much our competitive advantage. That said, we shouldn’t move quickly at the expense of quality or efficiency. We felt empowered to take action and halt new experiments until we homed in on the problem and how we wanted to address it. The growth team decided unanimously to pause experimentation until we had revamped our process. That decision encouraged participation from the whole team, facilitating communication and a focus on improvement.

2. “Learn. Grow. Repeat”

A stumble is an opportunity to learn: a chance to get back up and do better. That’s how we tried to approach our situation. It was inspiring to see the growth team switch into “let’s get better” mode as soon as we had our first meeting about revamping the process. No finger-pointing, no staying down.

We mapped out our current process, called out pain points, and made sure everyone weighed in on alternatives. We all agreed to rethink experiment development so that the control group is explicitly tracked early on in every experiment.

How We Solved It

It was obvious from our internal discussions that we needed to be more thoughtful about impression tracking. We made changes to the way experiment toggles were set up in LaunchDarkly and to how our experiment event tracker wrapper behaved when used in front-end code.


With the new approach, control and treatment groups are equally instrumented at all times in terms of impressions. Even though control users see nothing different about their experience, we fire a “no-impression” event to mark the point at which they diverge from the treatment group.
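The gist of the change, sketched in Python with hypothetical event names (the real wrapper lives in our front-end code):

```python
import analytics  # Segment's analytics-python, for illustration only

def track_experiment_exposure(user_id: str, experiment_key: str, variant: str) -> None:
    # Both branches fire an event, so control users are counted the moment they
    # reach the point where their experience would have diverged.
    event = "Experiment Impression" if variant == "treatment" else "Experiment No-Impression"
    analytics.track(user_id, event, {
        "experiment": experiment_key,
        "variant": variant,  # "treatment" or "control"
    })
```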
However, a more valuable outcome of our iteration was the improvement to our process and structure. Since then, we’ve upgraded our documentation and formalized the process for setting up experiments. We shared the documentation internally, and the team gave multiple presentations on it to spread the word.

I mentioned at the beginning that having a clear process post-results is just as important as launching the experiment. If you don’t learn anything from an experiment, misinterpret the results, or don’t clean up afterward, you significantly hinder the potential value of the effort. We address this by reviewing the results of each new experiment during weekly update meetings, discussing our interpretations, and facilitating the creation, management, and deprecation of toggles in LaunchDarkly post-experiment. Lastly, always keep records of past and current experiments (description, goal, population targeted, and the engineer responsible). Those have come in handy time and time again.

On the reporting side, we focused on automation for self-service, something that wasn’t possible pre-revamp. We rely on Databricks, Looker, and Snowflake to provide metrics for any given experiment on demand in under a minute.


We now have a more solid foundation that will cover our needs in the short to medium term. We’re expanding on our current setup: creating data pipelines to simplify and enrich experiment data, adding cohort exclusion to avoid cross-experiment contamination, and so on. We’ve also since implemented multivariate testing support, which is working pretty well.

The newfound stability of the framework allows us to focus on churning out experiments that make our most important metrics go up. We’ve seen significant gains across the board and are eager to see where we go next. More importantly, we have a better grasp of what our needs are and the resources necessary to address them. And when we outgrow the current framework, we’ll be smarter about our needs and what path to take next.


Handshake Engineering!

Interested in working with rich datasets and state-of-the-art tools to empower all sides of a business? Passionate about helping millions of students find great careers?


Check out our open roles at Handshake and help us democratize opportunity.
