This post describes how to design an effective metrics portfolio, focusing on the distinction between durable output business metrics and input metrics. It explores the difference between business, health, and diagnostic metrics, and discusses three flavors of goals: output business goals, controllable input goals, and launch goals. It describes the standard form in which all goals should be expressed. Finally, it advocates for ruthless focus on a small set of goals.
Bad Goals are Worse than No Goals
Setting, measuring, and meeting goals is what we do as a business. – Kim Rachmeler
Amazon thrives on qualified autonomy: business leaders describe their key metrics and use goals to establish how much they will improve them, along with a plan to do so that makes the goals credible. The goals are real commitments: we stake our credibility and organizational success on achieving them. However, our plans are fluid. Leaders are free to use their judgment in deciding how to achieve those goals; we are free to make a new plan and do something entirely different, as long as we meet our goals.
Of course, goals are not a straitjacket. There are good reasons to miss goals, and bad reasons to meet them. If we learn something that leads us to conclude we have the wrong goals, we can seek to change them. However, doing so is, and should be, somewhat painful; when needed, it must nonetheless be undertaken. Sadly, there are many examples of teams aggressively building the wrong solutions that move their stated goals even when they know doing so won’t help customers. This is a failure of many Leadership Principles, but it can be prevented by establishing the right goals in the first place.
Hallmarks of Effective Goals
Inspection requires leaders to audit the output or results of the mechanism and to course correct and improve the mechanism.
At Amazon we think deeply about mechanisms, which focus on the relationships between the inputs to a process and the resulting outputs. A complete mechanism carefully describes its inputs and outputs, and establishes appropriate inspection to assess how the process is affecting the outputs. Outputs are the results of our work; they are the purpose of what we do. However, we can change them only indirectly. For a company, the most important output is typically profit (arguably free cash flow); for a group, the output is their contribution to that profit. However, a company cannot directly (or at least ethically) create profit; it instead performs services or provides goods which generate revenue. Similarly, a group cannot directly change their outputs; they take actions which influence the output. The activities that a group can directly control are the inputs: the activities or settings we can simply change. A process converts inputs to outputs; a mechanism inspects them.
Good metrics represent this relationship directly. They are clear about what outputs matter and describe how to measure them. Such output metrics should be durable – as durable as the mission of the team. While goals are re-established regularly, the output metrics themselves should persist across years. We then also measure and represent the inputs, to connect our work to the impact it has. Further, we must distinguish the impact we have had from environmental changes in the underlying situation. To do this, we disambiguate two flavors of inputs: inherent inputs represent the world as it is, and controllable inputs are the knobs we turn that allow us to affect the output. Excepting the laws of physics, few inputs are truly inherent; however, for any particular mechanism the ability to influence an input may be so difficult, or so far removed, that the input is best treated as inherent. For example, a team focused on reducing packaging costs might celebrate a 10% reduction, but if the industry price for materials had declined by 30% over the same time period, perhaps such celebration would be misplaced.
Establishing powerful goals requires first designing a comprehensive portfolio of metrics that clearly distinguish output metrics from inputs, while describing and measuring inherent inputs. This portfolio should not change much year to year. Effective organizations have a comprehensive, defensible, durable metrics portfolio. Creating it may take significant investment, but scrimping on this investment is an invitation to under-delivery, or worse, mis-delivery.
Our most important goals are the targets we set for our durable outputs, representing the impact we will have for the business. We must carefully distinguish these from goals we take on controllable inputs, representing the things we can change to deliver that promised impact. It is essential to assure we in fact have a measurable output, no matter how hard it is to measure. In practice outputs can be challenging to describe and measure; similarly, inherent inputs are sometimes harder to describe and measure than controllable inputs. This leads to a common trap in designing a metrics portfolio: we have a bias towards the more tangible, easier-to-measure metrics, and our portfolio may do little to measure the metrics that actually matter. Too often our goals are limited to what we can most easily measure, sometimes just a laundry list of launches that – absent output goals to measure their impact – may offer little real benefit to our customers.
An inexact measurement of a good metric is superior to faultless measurement of a weak one. Some argue that there are metrics that are too hard to measure to be useful. In his seminal book, “How to Measure Anything,” Douglas Hubbard makes a compelling case that indeed anything can be measured. His definition of measurement is incisive: “measurement is a quantitatively expressed reduction of uncertainty.” If we have identified the right metric, we must find a good-enough way to measure it, giving quantitative insight into our mechanism while acknowledging error.
When we are finished, our goals will come in two primary flavors:
- Output goals: the most important goals. Output goals are a non-negotiable requirement for any metrics portfolio. The output goals connect the work to the clients and customers; without them, there is no way to know if the investments have (or will have) any real impact. Examples: improve product page conversion; reduce customer viewed price errors; improve seller trust; etc.
- Controllable input goals: these represent factors that drive your output goals. Examples: cost of goods sold, inventory turns, page latency, infrastructure investments, operational costs. It is not uncommon for one team’s controllable input to become another team’s output goal, in a long, interrelated chain all the way down Amazon’s organizational dependency graph.
Many teams include a long list of delivery/launch goals, but these are rarely necessary. Launches drive progress towards output or controllable input goals, in which case there is no need to take additional explicit goals on the launches (see the discussion of “ramp” below). In some cases, however, a launch serves to create a new class of controllable input. Such launch goals are specified by a single date, and should always be accompanied by the input metric the launch will produce and a goal for that metric, or minimally a second date by which a baseline for it will be established. Since these launches serve to create controllable input metrics, they are best treated as exactly that: described not as a launch, but as the creation of a new metric, with goals expressed as the projected change to that metric.
It is not meaningful to take goals on inherent inputs; they describe the world as it is. As such, they are used to normalize the data, typically appearing in the denominator to assure the team does not attempt to take credit for the tide.
Ramp to Goal
For every goal it is important to describe the ramp or path to impact towards the declared goal. A goal that describes impact after a year should be supported by a description of the expected impact in each quarter, and commonly the impact expected over the next three months. This is an antidote to “watermelon goals” that are green on the outside but red on the inside: making clear, continuous, and quantitative progress towards a goal is the best way to reassure that we are on track. However, beware “straight line plans” where the ramp is smoothed over the goal period; this is a strong indicator that a plan is not concrete. Per above, progress towards goals comes from launches; those launches should be visible not as line items in a list, but as impact on metrics.
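The straight-line warning above can be sketched mechanically. This is an illustrative check, not a formula Amazon uses: the quarterly targets and the tolerance threshold below are hypothetical, and any real assessment of a plan remains a judgment call.

```python
# Flag a "straight line plan": a ramp whose quarterly deltas are all
# (nearly) equal, suggesting the targets were smoothed over the goal
# period rather than built up from concrete launches.

def is_straight_line(ramp: list[float], tolerance: float = 0.05) -> bool:
    """Return True if successive deltas vary by less than `tolerance`
    relative to the mean delta -- a hint that the ramp is not concrete."""
    deltas = [b - a for a, b in zip(ramp, ramp[1:])]
    mean = sum(deltas) / len(deltas)
    if mean == 0:
        return all(d == 0 for d in deltas)
    return all(abs(d - mean) <= tolerance * abs(mean) for d in deltas)

# Hypothetical quarterly targets for a metric ramping from 100 down to 80:
smoothed = [100, 95, 90, 85, 80]        # evenly spread: suspicious
launch_driven = [100, 99, 92, 91, 80]   # uneven steps tied to launches

print(is_straight_line(smoothed))        # True
print(is_straight_line(launch_driven))   # False
```

A smooth ramp is not proof of a bad plan, but it is exactly the signal that should prompt a reviewer to ask which launch produces each step.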
This suggests that our goals can only be considered complete if we are reviewing progress towards them regularly. Effective goals enjoy regular review; at Amazon, goals are scoped to the level at which the review will occur. The level of review correlates to the level of commitment we have made to the goal; a goal that will be reviewed by the s-team is more critical than one local to a specific two-pizza team.
A Rubric for Designing Effective Metrics
A critical control point is a single measurable element that covers a multitude of sins. – Temple Grandin
In her book “Animals in Translation,” Temple Grandin describes an approach to metrics that is perhaps counter-intuitive. Rather than an exhaustive set of metrics describing every aspect of a problem, she describes an approach focused on finding the smallest and simplest set of metrics that suffice to indicate a problem exists – not necessarily what the problem is, but that it exists. These are control point metrics: simple, concrete metrics that represent a host of ills.
[These ten details] are all you need to rate animal welfare. You don’t need to know if the floor is slippery, something regulators always want to measure. For some reason whenever you start talking about auditing everyone turns into an expert on flooring. I just need to know if any of the cattle fell down. If cattle are falling down, there’s a problem with the floor, and the plant fails the audit. It’s that simple. …
The audit is totally based on things an auditor can directly observe that have objective outcomes. A steer either moos during handling or he does not. … Another important feature of my audit: people can remember two sets of five items. That level of detail is what normal working memory is built to hold.
Most language-based thinkers find it difficult to believe that such a simple audit really works. They think ‘simple’ means wrong. They don’t see that each one of the five critical control points measures anywhere from three to ten others that all result in the same bad outcome.
It is not enough to have output goals; we should work hard to find minimal cover, the smallest number of goals that assure we will achieve the customer impact we have promised. Such critical outputs are sometimes described as “outcome” goals (related to but distinct from the “output metrics” described above). Decades of research suggest that people can retain around seven topics in operating memory at a time, which is what informs Dr. Grandin’s observation above about the level of detail for her audit metrics. Hence a good metrics portfolio should be composed of no more than seven collections, each containing no more than seven metrics – ideally fewer than five of each. The fewer there are, the more powerful they become.
Kim Rachmeler suggests that metrics should be organized into three tiers: business metrics, health metrics, and diagnostic metrics. Business metrics are aligned with the concept of outcome metrics; they represent the key indicators that our business is on track. We must track our business metrics scrupulously. Health metrics correlate with input metrics, describing both inherent and controllable inputs. Health metrics assure us that we understand the business landscape, and are on track to make the impact we plan. We track health metrics with similar vigor to business metrics. Diagnostic metrics are essentially input metrics for the input metrics; they are used to make sense of unexpected changes in our health metrics. We typically do not monitor these, but consult them as the situation demands.
You Already Do This
These tiers of metrics have clear analogs in our software runbooks. A system has business, health, and diagnostic metrics, distinguished by how directly they affect customers. In our software systems, some issues are defined as “Business Critical” and result in pages and immediate attention; others are defined as “impaired productivity” and result in tickets that can be prioritized against other work. This is self-evidently analogous to business and health metrics. When a system failure directly affects our customers, we alert and respond immediately: our monitoring creates a high-severity trouble ticket and a pager goes off. When a failure indicates a longer-term risk that does not immediately impact customers, but ultimately will, our monitoring creates a lower-severity ticket and we deal with it based on projected impact. While diagnosing a problem, we may inject additional logging or instrumentation to give us insight, which we often remove when the problem has been resolved. Business, health, diagnostic.
It is worth noting that we monitor our technical operating metrics religiously and continuously: we use automation to assure that our metrics stay within well-defined operational thresholds. Automation is a particularly effective way to monitor metrics and stay on track.
Kim offered a wonderful story to illuminate the distinction between business and health metrics, and a lovely example of a critical control point “outcome” metric: she claimed that the Frito-Lay company (purveyor of chips and other snack foods) runs their entire business on a single metric: stales. When a delivery is made to a retail outlet, the re-stocker notes the number of units remaining. The goal is that the number will be exactly one. If more than one unit remains, then the company has over-estimated demand by a quantifiable amount. If the shelf is empty, they have under-estimated by an uncertain amount. This is a wonderful metric: it scales gracefully from a store to a neighborhood to a city to a region to a country. It can be applied independently to specific products, to product categories, or to the entire product line. Best of all, it is simple, crisp, and trivially easy to measure: an elegant metric from a bygone era.
However, this metric can do little to tell us why the product is performing as it is. It cannot tell us if a competitor has sprung up (or vanished), or if perhaps the customer demographics have changed, or, alarmingly, if customer sentiment has changed – or something else. To understand that we need a different set of metrics, to establish the health of the business. We may need to perform taste tests or other experimentation. These are examples of health metrics, and we must create and monitor them, so we can adapt before they impact our business goals.
Kim was a relentless advocate for simple, powerful metrics, and championed one of Amazon’s best: expressed dissatisfaction rate, or “how’s my driving.” Kim was an early leader of Amazon’s customer service organization, and wanted a metrics portfolio that would live up to the company’s aspiration to be the “World’s most customer-centric company.” After much thought and experimentation, her team produced a metric that simply asked customers one question (“Did I solve your problem?”) and captured the “no” votes over total responses. Every aspect of this metric got detailed attention and rigorous experimentation, from the ease of use (a single button within the email) to the specific question asked. The result was a metric that has served the company well in various forms for decades. Good metrics are transformative.
The Maxwell House Effect
When defining metrics, we must guard against incremental degradation. All metrics must include ground truth; purely relative or comparative metrics are subject to incremental failure. The history of the Maxwell House company provides a useful cautionary tale to exemplify this.
In the early 1900s in the United States, Maxwell House was the peerless giant in home-brew coffee, a position they maintained through two world wars and into the 1960s. In the 1970s, however, customer sentiment about their product suddenly soured, and they began losing market share precipitously. This created an opportunity that Folgers seized with a new “instant” coffee, permanently reshaping the American coffee market.
Maxwell House had forewarning that customer sentiment was changing, before it became a crisis; they scrupulously tracked their demographics and sales, and performed regular taste tests. However, they could not explain the sudden change; consumers had apparently suddenly decided the coffee just plain didn’t taste good.
By the time they unearthed the underlying problem, it was too late. The Maxwell House company had an annual process for reducing costs. Each year they created several new, less-expensive formulations for their coffee and taste-tested them against the current formula. If consumers could not tell the difference, they would adopt the new formula. In desperation, the company decided to compare the modern formula to the original formula from the turn of the century. They discovered that the modern formula was comparatively awful. The process had allowed an imperceptible, incremental degradation to the quality of the product.
We must guard against this effect: again, all metrics must include ground truth, as purely relative or comparative metrics are subject to incremental failure.
Splintering and Outliers
When describing goals for our output metrics, it is useful to focus on typical behaviors to avoid investment with narrow gains. For example, we may set a TP90 performance goal (Time Percentile 90, representing the point at or below which 90% of the samples fall, with the remaining 10% excluded). However, there is a trap in this approach, a concern that was most clearly indicated for me in the early days of Amazon’s Wishlist (in the early 2000s). Wishlist was already a popular feature, and our TP90 metrics were all a lovely green. There were sporadic abnormalities in the data, suggesting that some wishlists were taking several minutes to load; but no data is completely clean, and the idea of a wishlist taking that long was somewhat absurd, so we ignored it. However, one day a motivated engineer decided to inspect those abnormalities, and discovered something astonishing: they were correlated to a handful of wishlists containing many thousands of items. But wishlists did not support pagination, so we knew those lists could not be loaded! We subsequently discovered that a handful of libraries were using the wishlist feature to organize their catalogs, and they knew of no better solution; these poor librarians would visit their wishlist, wait for the page to time out, and then reload again to get the cached result. Unsurprisingly, we introduced wishlist pagination shortly thereafter.
Hence, while we may focus on the core behaviors, it is important to rigorously inspect outliers: at the edges there be dragons. For critical metrics, it is worth measuring and often worth taking a goal on the worst-case behaviors. Fixing these can force us out of incremental thinking into more rigorous, durable solutions.
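A toy illustration of how a green TP90 can hide those dragons. The latency numbers are invented, and the nearest-rank percentile used below is one of several common definitions:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value at or below which p% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical page-load samples in milliseconds: 95 healthy loads and
# 5 pathological ones, like the multi-minute wishlist pages.
latencies = [120.0] * 95 + [180_000.0] * 5

print(percentile(latencies, 90))   # 120.0 -- the dashboard is green
print(max(latencies))              # 180000.0 -- the librarians are still waiting
```

The TP90 is untouched by the outliers, which is exactly why a worst-case measurement (here, the max) belongs alongside it for critical metrics.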
Consider that many program teams operate globally, and performance may vary significantly between regions or stores or other useful segments. A core program team may be critical for the success of a new or emerging product, so the team owning that product will often approach the core team to propose a “shared goal” that assures the small but growing area is not overlooked. However, such a core team may be expected to support hundreds of such efforts, and cannot afford to take independent goals on each. Instead, they should consider taking a generalized goal for the worst-case marketplace(s) to address these concerns. Indeed, addressing the worst-case sometimes results in a holistic redesign: as our metrics splinter into many small subdivisions, problems that occur in one misbehaving fragment can reveal architectural shortcomings that benefit many others.
A successful team at Amazon typically stands on three pillars: a mission that describes their purpose; tenets to describe the mental model for how to achieve the mission; and a rigorous and durable metrics portfolio. Of the three, it is my experience that establishing rigorous metrics is the most challenging and the most powerful. Too often these receive the least attention and investment. Such an oversight must be addressed. The most effective teams have a rigorous, well-designed metrics portfolio. We focus on a small number of durable, “critical control point” outcome goals, buttressed by measured inherent inputs. Our controllable input goals are connected directly to those outputs. We review those goals regularly and vigorously.
Doing so ensures we are relentlessly focused on delivering durable value to our customers.
Appendix A: Setting Effective Goals
Designing an effective metrics portfolio simplifies our approach to goals. This is a brief summary of some considerations powered by the ideas explored above.
Effective goals are few, simple, specific, and time-bound.
When setting goals, be rigorous and courageous: take only goals that matter. Taking too many goals obscures priority; ruthlessly excise goals that don’t directly deliver value to customers. As discussed above, HACCP (critical control point) metrics deliver focus. Taking an exhaustive list of goals feels comprehensive, but is counterproductive. It increases the burden on reviewers, forces attention on things that may not matter, and can allow important things to fester unnoticed longer than needed. That said, we must take a goal for every output business metric. If we can’t or won’t take such a goal… well, it isn’t really a business-level metric. For each business goal, we should in turn take goals on the fewest number of controllable inputs that directly influence our output.
As discussed above, good metrics use data that is simple to collect. The more abstract the data, the less likely it is to actually describe the world. Using a complicated formula or a trained Machine Learning model to make decisions can be empowering; using such signals as a success metric is dangerous.
Measurement must be specific and explicit. A goal that states an average must specify the time frame over which the average will be computed; a launch goal must be connected to the metric it will impact. Beware goals that cannot be described as “increasing” or “decreasing” something, or for which the units are unstated, invented, or mix unrelated concerns (dollars per square-kilowatt-gram).
Goals must have clear deadlines, and those deadlines must include both the start and end dates. Typically at Amazon we take annual goals, and the investment then covers a twelve-month period. When setting the deadlines, it is important to be aware of seasonal factors; for example, Amazon retail has traditionally had significantly different volumes at the end of the year than at the beginning. Be sure to use a meaningful duration over which to measure your goals.
To capture these concerns, at Amazon an effective goal is always stated in the following form:
Increase/Decrease [metric] from X [units] on D1 [date] to Y [units] on D2 [date], an improvement/reduction of Z [units] (Z’%).
This formulation provides several interrelated benefits. First, it is crisp about comparison: it removes ambiguity about how the metric is being measured and the specific time range over which it is being computed. Including the math about the change, expressing it both as the absolute magnitude Z and the relative percentage Z’, immediately helps the reader understand the impact and reduces cognitive load by not forcing the reader to do simple math.
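As a sketch, the standard form lends itself to templating. The function below is purely illustrative (the metric name, values, and dates are invented), and it assumes that increases are improvements and decreases are reductions, which is not true of every metric:

```python
def format_goal(metric: str, x: float, y: float, units: str,
                d1: str, d2: str) -> str:
    """Render the standard goal sentence, computing Z and Z' for the reader."""
    z = y - x
    verb, noun = ("Increase", "an improvement") if z > 0 else ("Decrease", "a reduction")
    z_pct = abs(z) / x * 100  # Z' as a percentage of the starting value
    return (f"{verb} {metric} from {x} {units} on {d1} to {y} {units} on {d2}, "
            f"{noun} of {abs(z)} {units} ({z_pct:.1f}%).")

print(format_goal("page latency", 250, 200, "ms", "Jan 1, 2025", "Dec 31, 2025"))
# Decrease page latency from 250 ms on Jan 1, 2025 to 200 ms on Dec 31, 2025,
# a reduction of 50 ms (20.0%).
```

Computing Z and Z’ mechanically also prevents the subtle mismatch where the stated delta no longer agrees with the stated endpoints after a goal is revised.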
Appendix B: Anti-patterns
There are some anti-patterns that serve as indicators of an incomplete or poorly designed metrics portfolio. Here is a sample:
- Output-free metrics. A metrics portfolio with no outputs is like a soccer game that doesn’t keep score. If you can’t quantify the impact you’re going to have, many leaders will assume you aren’t going to have any.
- Output-free goals. A year of delivery without output impact is a year spent piling up risk. The longer you wait to validate your ideas, the more likely it is that your ideas won’t work. (For pioneering efforts it may sometimes take more than a year to have impact, but for such efforts it is important to be unambiguous about the risks and what you are doing to “fail fast” – to either prove that the risks won’t take root, or prove that they cannot be avoided and the money is better spent elsewhere.)
- An “average” goal with a slow ramp. If our goal is stated as an average, the arithmetic of averaging requires that the final value be well beyond the averaged target. Hence the longer it takes to get below that average, the farther below it we will need to reach to have the impact we intend – and the less likely it is we will actually reach our goal. “Yearly Average” metrics are uncommon, but where they are appropriate, immediate progress should be expected and tracked.
- Dec 31 launch goals. Most American companies avoid meaningful launches between November and January. A launch goal at the end of a year typically suggests a lack of a real delivery plan. Worse still, such a goal is often unexamined until the end of the year approaches, so it becomes a fire drill for the engineering team during the part of the year during which it is hardest to make changes due to holiday traffic and vacations. Treat promises to launch in December with skepticism.
- Vacuum Launches. Launch goals are a means to an end; launches must be connected to a controllable input or output metric. Unless it is clear what metric the launch will impact, it cannot be clear whether the launch is worth completing.
- Lonely Metrics. A metric – or worse, metrics portfolio – that no one sees. Lonely metrics are no better than having no metrics at all, an investment disconnected from operational behavior.
- Unstable Metrics. Some metrics can be unduly influenced by inherent inputs. For example, a poorly designed metric for page views might be entirely dependent on the number of visitors to the website. It is important to use ratios and choose denominators to assure the metric is stable in the face of a change to the sample size or population mix.
- Myopic Metrics. Sometimes we have work that is germane to our charter and completely worth doing, but that would not impact any of our committed goals. Sadly, this often leads to teams de-prioritizing the work, but in fact it is a symptom of a metrics portfolio that isn’t focused on the right HACCP outputs. Fix the metrics, then do the right work.
- Metaphysics. A goal that promises to change an inherent input, sometimes one that would in fact require a violation of the laws of physics to achieve. A widget on a page cannot load faster than the page itself. A “real-time” model-based decision cannot be made in less than a millisecond. We cannot detect bad actors before we have collected the information that reveals them.
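The averaging arithmetic behind the “average goal with a slow ramp” anti-pattern above can be made concrete. In this hypothetical, a metric sits at 100 and the goal is a yearly average of 90; the helper computes what the final quarter must average for the year to hit the goal:

```python
def required_final_quarter(goal_avg: float, quarterly: list[float]) -> float:
    """Value the last quarter must average so the whole year averages goal_avg."""
    n_total = len(quarterly) + 1          # quarters so far, plus the final one
    return goal_avg * n_total - sum(quarterly)

# Steady ramp: progress every quarter leaves a reachable Q4 target.
print(required_final_quarter(90, [97, 93, 89]))     # 81 -- achievable

# Slow ramp: no progress until Q4 forces a drastic cut.
print(required_final_quarter(90, [100, 100, 100]))  # 60 -- a 40% drop in one quarter
```

This is why a yearly-average goal with flat early quarters is effectively already missed: the required exit value falls far faster than the average suggests.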
- “We take financial outputs seriously, but we believe that focusing our energy on the controllable inputs to our business is the most effective way to maximize financial outputs over time.” – Jeff Bezos, 2009 Letter to Shareholders ↑
- There’s a related organizational design concern. If your team doesn’t have any controllable inputs, or your controllable inputs are insufficiently correlated to your desired outputs, your org structure is likely wrong and you have been set up to fail. Agonizing over goal-setting in that situation is counterproductive and you should instead be having the organizational conversation. ↑
- Another trap is to take goals that attempt to change inherent inputs, which are doomed from the start. See “Metaphysics” in Appendix B, “Anti-patterns.” ↑
- “My most important contribution [to animal welfare] has been to take the idea behind HACCP (Hazard Analysis Critical Control Point Theory, pronounced hassip) and apply it to the field.” ↑
- Miller’s law, also known as “The Magical Number Seven (plus or minus two).” ↑
- Which leads to the inevitable curiosity about why teams rarely automate monitoring of business goals. We need to inspect and interpret changes to health goals, but one might expect failures in business goals to result in alarms going off. ↑
- I don’t know if this is true, because I haven’t researched it. In this case I would prefer to believe without confirmation than to investigate and be disappointed. ↑
- This story may also be apocryphal. I have fruitlessly sought the data to establish (or refute) it. ↑
Kim Rachmeler was an Amazon VP from 2001-2007. Among other significant achievements, Kim pioneered the use of “How’s My Driving” and expressed dissatisfaction as the key metric for delivering effective Customer Service; she also authored the crisp goal pattern described in Appendix A. Many of the key insights described here I learned from her. ↑