On Metrics and Measurement

June 13, 2022 · May 1, 2023 · dondo

This post describes how to design an effective metrics portfolio, focusing on the distinction between durable output business metrics and input metrics. It explores the difference between business, health, and diagnostic metrics, and discusses three flavors of goals: output business goals, controllable input goals, and launch goals. It discusses the standard form in which all goals should be expressed. Finally, it advocates for ruthless focus on a small set of goals.

Bad Goals are Worse than No Goals

Setting, measuring, and meeting goals is what we do as a business. – Kim Rachmeler

Amazon thrives on qualified autonomy: business leaders describe their key metrics and use goals to establish how much they will improve them, along with a plan to do so that makes the goals credible. The goals are real commitments: we stake our credibility and organizational success on achieving them. However, our plans are fluid. Leaders are free to use their judgement about how to achieve those goals; we are free to make a new plan and do something entirely different, as long as we meet our goals.

Of course, goals are not a straitjacket. There are good reasons to miss goals, and bad reasons to meet them. If we learn something that leads us to conclude we have the wrong goals, we can seek to change them. Doing so is and should be somewhat painful; nevertheless, when it is needed, it must be undertaken. Sadly, there are many examples of teams aggressively building the wrong solutions that move their stated goals even when they know doing so won’t help customers. This is a failure in many Leadership Principles, but it can be prevented by establishing the right goals in the first place.

Hallmarks of Effective Goals

Inspection requires leaders to audit the output or results of the mechanism and to course correct and improve the mechanism.

At Amazon we think deeply about mechanisms, which basically focus on the relationships between the inputs to a process and the resulting outputs. A complete mechanism carefully describes its inputs and outputs, and establishes appropriate inspection to assess how the process is affecting the outputs. Outputs are the results of our work; they are the purpose for what we do. However, we can change them only indirectly. For a company, the most important output is typically profit (arguably free cash flow); for a group, the output is their contribution to that profit. However, the company cannot (at least ethically) create profit directly; it instead performs services or provides goods which generate revenue. Similarly, a group cannot directly change their outputs; they take actions which influence the output. The activities that a group can directly affect are the inputs: the activities or settings we can simply change. A process converts inputs to outputs; a mechanism inspects them.

Good metrics represent this relationship directly. They are clear about what outputs matter and describe how to measure them. Such output metrics should be durable – as durable as the mission of the team. While goals are re-established regularly, the output metrics themselves should persist across years. We then also measure and represent the inputs, to connect our work to the impact it has. Further, we must distinguish the impact we have had from environmental changes in the underlying situation. To do this, we disambiguate two flavors of inputs: inherent inputs represent the world as it is, and controllable inputs^[1] are the knobs we turn that allow us to affect the output. Except for the laws of physics, few inputs are truly inherent; however, for any particular mechanism the ability to influence an input may be so difficult, or so far removed, that it is not worth describing as such. For example, a team focused on reducing packaging costs might celebrate a 10% reduction, but if the industry price for materials had declined by 30% over the same time period, perhaps such celebration would be misplaced.^[2]

Establishing powerful goals requires first designing a comprehensive portfolio of metrics that clearly distinguish output metrics from inputs, while describing and measuring inherent inputs. This portfolio should not change much year to year. Effective organizations have a comprehensive, defensible, durable metrics portfolio. Creating it may take significant investment, but scrimping on this investment is an invitation to under-delivery, or worse, mis-delivery.

Our most important goals are the targets we set for our durable outputs, representing the impact we will have for the business. We must carefully distinguish these from goals we take on controllable inputs, representing the things we can change to deliver that promised impact. It is essential to assure we in fact have a measurable output, no matter how hard. In practice outputs can be challenging to describe and measure; similarly inherent inputs are sometimes harder to describe and measure than controllable inputs. This leads to a common trap in designing a metrics portfolio: we have a bias towards the more tangible, easier to measure metrics, and our portfolio may do little to measure the metrics that actually matter. Too often our goals are limited to what we can most easily measure, sometimes just a laundry list of launches that – absent output goals to measure it – may offer little real benefit to our customers.^[3]

An inexact measurement of a good metric is superior to faultless measurement of a weak one. Some argue that there are metrics that are too hard to measure to be useful. In his seminal book, “How to Measure Anything,” Douglas Hubbard makes a compelling case that indeed anything can be measured. His definition of measurement is incisive: “measurement is a quantitatively expressed reduction of uncertainty.” If we have identified the right metric, we must find a good-enough way to measure it, giving quantitative insight into our mechanism while acknowledging error.

Effective Goals

When we are finished, our goals will come in two flavors:

Output goals: the most important goals. Output goals are a non-negotiable requirement for any metrics portfolio. The output goals connect the work to the clients and customers; without them, there is no way to know if the investments have (or will have) any real impact. Examples: improve product page conversion; reduce customer viewed price errors; improve seller trust; etc.
Controllable input goals: these represent factors that drive your output goals. Examples: cost of goods sold, inventory turns, page latency, infrastructure investments, operational costs. It is not uncommon for one team’s controllable input to become another team’s output goal, in a long, interrelated chain all the way down Amazon’s organizational dependency graph.

Many teams include a long list of delivery/launch goals, but these are rarely necessary. Launches drive progress towards output or controllable input goals, in which case there is no need to take additional explicit goals on the launches (see discussion of “ramp” below). In some cases, however, they serve to create a new class of controllable input. Launch goals are specified by a single date, and should always be accompanied by the input metric they will produce and a goal for that metric, or minimally a second date by which a baseline for it will be established. Hence, since these serve to create controllable input metrics, they are best considered just that. They should be described not as a launch, but rather as creating a new metric, and goals should be expressed as the projected change for that metric.

It is not meaningful to take goals on inherent inputs; they describe the world as it is. As such, they are used to normalize the data, typically appearing in the denominator to assure the team does not attempt to take credit for the tide.
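A minimal sketch of that normalization, reusing the packaging example from earlier. The function name and the index values are hypothetical, chosen only for illustration; the point is that dividing the team's metric by the inherent input (here, an assumed industry materials price index) separates the team's contribution from the tide.

```python
def normalized_change(cost_before, cost_after, index_before, index_after):
    """Return the fractional change in cost per unit of the inherent input.

    Negative values mean the team beat the market; positive values mean
    the market moved further than the team did.
    """
    ratio_before = cost_before / index_before
    ratio_after = cost_after / index_after
    return (ratio_after - ratio_before) / ratio_before


# Hypothetical numbers: packaging cost fell 10% while the industry
# materials price index fell 30% over the same period.
raw_change = (90.0 - 100.0) / 100.0
relative_change = normalized_change(100.0, 90.0, index_before=1.0, index_after=0.7)

print(f"raw change: {raw_change:+.1%}")
print(f"change relative to materials prices: {relative_change:+.1%}")
```

Under these assumed numbers the raw metric improves by 10%, while the normalized metric worsens by roughly 29%: the team gave back part of what the market handed them.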

Ramp to Goal

For every goal it is important to describe the ramp or path to impact towards the declared goal. A goal that describes impact after a year should be supported by a description of the expected impact in each quarter, and commonly the impact expected over the next three months. This is an antidote to “watermelon goals” that are green on the outside but red on the inside: making clear, continuous, and quantitative progress towards a goal is the best way to show that we are on track. However, beware “straight line plans” where the ramp is smoothed over the goal period; this is a strong indicator that a plan is not concrete. Per above, progress toward goals comes from launches; those launches should be visible not as line items in a list, but as impact on metrics.
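One way to make the ramp inspectable is to record it as data and check it mechanically. The sketch below is an illustration only: the function names, the quarterly granularity, and the conversion numbers are all assumptions, not a description of any Amazon tool. It flags a "straight line" ramp and compares reported actuals against the planned milestones.

```python
def is_straight_line(baseline, milestones, tolerance=1e-9):
    """True when every step between consecutive milestones is the same size."""
    points = [baseline] + list(milestones)
    steps = [later - earlier for earlier, later in zip(points, points[1:])]
    return all(abs(step - steps[0]) <= tolerance for step in steps)


def ramp_status(milestones, actuals, higher_is_better=True):
    """Compare each reported actual with its milestone; one color per reported period."""
    status = []
    for planned, actual in zip(milestones, actuals):
        on_track = actual >= planned if higher_is_better else actual <= planned
        status.append("green" if on_track else "red")
    return status


# Hypothetical goal: raise conversion from 4.0% to 5.0% over four quarters.
baseline = 4.0
smoothed_ramp = [4.25, 4.5, 4.75, 5.0]  # evenly spread: a warning sign
launch_ramp = [4.0, 4.4, 4.4, 5.0]      # steps line up with planned launches

print(is_straight_line(baseline, smoothed_ramp))
print(is_straight_line(baseline, launch_ramp))
print(ramp_status(launch_ramp, actuals=[4.0, 4.3]))
```

A ramp that passes the straight-line check is not necessarily wrong, but it is worth asking which launches are expected to produce each step.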

This suggests that our goals can only be considered complete if we are reviewing progress towards them regularly. Effective goals enjoy regular review; at Amazon, goals are scoped to the level at which the review will occur. The level of review correlates to the level of commitment we have made to the goal; a goal that will be reviewed by the S-team is more critical than one local to a specific pizza-team.

A Rubric for Designing Effective Metrics

A critical control point is a single measurable element that covers a multitude of sins. – Temple Grandin

In her book “Animals in Translation,”^[4] Temple Grandin describes an approach to metrics that is perhaps counter-intuitive. Rather than an exhaustive set of metrics describing every aspect of a problem, she describes an approach focused on finding the smallest and simplest set of metrics that suffice to indicate a problem exists – not necessarily what the problem is, but that it exists. These are control point metrics: simple, concrete metrics that represent a host of ills.

[These ten details] are all you need to rate animal welfare. You don’t need to know if the floor is slippery, something regulators always want to measure. For some reason whenever you start talking about auditing everyone turns into an expert on flooring. I just need to know if any of the cattle fell down. If cattle are falling down, there’s a problem with the floor, and the plant fails the audit. It’s that simple. …

The audit is totally based on things an auditor can directly observe that have objective outcomes. A steer either moos during handling or he does not. … Another important feature of my audit: people can remember two sets of five items. That level of detail is what normal working memory is built to hold.

Most language-based thinkers find it difficult to believe that such a simple audit really works. They think ‘simple’ means wrong. They don’t see that each one of the five critical control points measures anywhere from three to ten others that all result in the same bad outcome.

It is not enough to have output goals; we should work hard to find minimal cover, the smallest number of goals that assure we will achieve the customer impact we have promised. Such critical outputs are sometimes described as “outcome” goals (related to but distinct from the “output metrics” described above). Decades of research suggest that people can retain around seven topics in operating memory at a time,^[5] which is what informs Dr. Grandin’s observation above about the level of detail for her audit metrics. Hence a good metrics portfolio should be composed of no more than seven collections, each containing no more than seven metrics in total – ideally fewer than five of each. The fewer there are, the more powerful they become.

Kim Rachmeler suggests that metrics should be organized into three tiers: business metrics, health metrics, and diagnostic metrics. Business metrics are aligned with the concept of outcome metrics; they represent the key indicators that our business is on track. We must track our business metrics scrupulously. Health metrics correlate with input metrics, describing both inherent and controllable inputs. Health metrics assure us that we understand the business landscape, and are on track to make the impact we plan. We track health metrics with similar vigor to business metrics. Diagnostic metrics are essentially input metrics for the input metrics; they are used to make sense of unexpected changes in our health metrics. We typically do not monitor these, but consult them as the situation demands.

You Already Do This

These tiers of metrics have clear analogs in our software runbooks. A system has business, health, and diagnostic metrics, represented by how directly they affect customers. In our software systems, some issues are defined as “Business Critical” and result in pagers and immediate attention; others are defined as “impaired productivity” and result in tickets that can be prioritized against other work. This is self-evidently analogous to business and health metrics. When a system failure directly affects our customers, we alert and respond immediately: our monitoring creates a high-severity trouble ticket and a pager goes off. When a failure indicates a longer-term risk that does not immediately, but will ultimately impact customers, our monitoring creates a lower severity ticket and we deal with it based on projected impact. While diagnosing a problem, we may inject additional logging or instrumentation to give us insight, which we often remove when the problem has been resolved. Business, health, diagnostic.

It is worth noting that we monitor our technical operating metrics religiously and continuously: we use automation to assure that our metrics stay within well-defined operational thresholds. Automation is a particularly effective way to monitor metrics and stay on track.^[6]

Stales

Kim offered a wonderful story to illuminate the distinction between business and health metrics, and a lovely example of a critical control point “outcome” metric: she claimed that the Frito Lay company (purveyor of chips and other snack foods) runs their entire business on a single metric: stales.^[7] When a delivery is made to a retail outlet, the re-stocker notes the number of units remaining. The goal is that the number will be exactly one. If more than one unit remains, then the company has over-estimated demand by a quantifiable amount. If the shelf is empty, they have under-estimated by an uncertain amount. This is a wonderful metric: it scales gracefully from a store to a neighborhood to a city to a region to a country. It can be applied independently to specific products, to product categories, or to the entire product line. Best of all, it is simple, crisp, and trivially easy to measure: an elegant metric from a bygone era.

However, this metric can do little to tell us why the product is performing as it is. It cannot tell us if a competitor has sprung up (or vanished), or if perhaps the customer demographics have changed, or alarmingly if customer sentiment has changed – or something else. To understand that we need a different set of metrics, to establish health of the business. We may need to perform taste tests or other experimentation. These are examples of health metrics, and we must create and monitor these, so we can adapt before they impact our business goals.

Kim was a relentless advocate for simple, powerful metrics, and championed one of Amazon’s best: expressed dissatisfaction rate, or “how’s my driving.” Kim was an early leader of Amazon’s customer service organization, and wanted a metrics portfolio that would live up to the company’s aspiration to be the “World’s most customer-centric company.” After much thought and experimentation, her team produced a metric that simply asked customers one question (“Did I solve your problem?”) and produced a metric that captured the “no” votes over total responses. Every aspect of this metric got detailed attention and rigorous experimentation, from the ease of use (a single button within the email) to the specific question asked. The result was a metric that has served the company well in various forms for decades. Good metrics are transformative.

The Maxwell House Effect^[8]

When defining metrics, we must guard against incremental degradation. All metrics must include ground truth; purely relative or comparative metrics are subject to incremental failure. The history of the Maxwell House company provides a useful cautionary tale to exemplify this.

In the early 1900s in the United States, Maxwell House was the peerless giant in home-brew coffee, a position they maintained through two world wars and into the 1960s. In the 1970s, however, customer sentiment about their product suddenly soured, and they began losing market share precipitously. This created an opportunity that Folger’s seized with a new “instant” coffee, permanently reshaping the American coffee market.

Maxwell House had forewarning that customer sentiment was changing, before it became a crisis; they scrupulously tracked their demographics and sales, and performed regular taste tests. However, they could not explain the sudden change; consumers had apparently suddenly decided the coffee just plain didn’t taste good.

It was too late when they unearthed the underlying problem. The Maxwell House company had an annual process for reducing costs. Each year they created several new, less-expensive formulations for their coffee and taste-tested them against the current formula. If consumers could not tell the difference, they would adopt the new formula. In desperation, the company decided to compare the modern formula to the original formula from the turn of the century. They discovered that the modern formula was comparatively awful. The process had allowed an imperceptible, incremental degradation to quality of the product.

We must guard against this effect: again, all metrics must include ground truth, as purely relative or comparative metrics are subject to incremental failure.
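The mechanics of this failure are easy to reproduce. The following sketch is a toy simulation under stated assumptions: quality is reduced to a single number, each reformulation lowers it by a fixed amount, and tasters notice a difference only at or above a fixed threshold. None of these figures come from the Maxwell House story itself. It contrasts a test against the previous formula with a test against the original.

```python
def drift_audit(baseline_quality, step_loss, threshold, rounds):
    """Simulate repeated reformulation and compare two acceptance tests.

    Each round lowers quality by `step_loss`. The relative test fails only
    when the drop from the previous formula reaches `threshold`; the
    ground-truth test fails when the drop from the original reaches it.
    Returns (relative_failures, first_round_failing_ground_truth).
    """
    quality = baseline_quality
    relative_failures = 0
    first_ground_truth_failure = None
    for round_number in range(1, rounds + 1):
        previous = quality
        quality = previous - step_loss
        if previous - quality >= threshold:
            relative_failures += 1
        if first_ground_truth_failure is None and baseline_quality - quality >= threshold:
            first_ground_truth_failure = round_number
    return relative_failures, first_ground_truth_failure


# Hypothetical scale: quality starts at 100, each new formula loses 2 points,
# and tasters notice a difference of 5 points or more.
relative_failures, first_failure = drift_audit(100, step_loss=2, threshold=5, rounds=20)
print(relative_failures, first_failure)
```

With these assumed values the relative test never fails across twenty rounds, while a comparison against the original formula would have failed in the third round.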

Splintering and Outliers

When describing goals for our output metrics, it is useful to focus on typical behaviors to avoid investment with narrow gains. For example, we may set a TP90 performance goal (Time Percentile 90: the value at or below which 90% of samples fall, with the slowest 10% excluded). However, there is a trap in this approach, a concern that was most clearly indicated for me in the early days of Amazon’s Wishlist (in the early 2000s). Wishlist was already a popular feature, and our TP90 metrics were all a lovely green. There were sporadic abnormalities in the data, suggesting that some wishlists were taking several minutes to load; but no data is completely clean, and the idea of a wishlist taking that long was somewhat absurd, so we ignored it. However, one day a motivated engineer decided to inspect those abnormalities, and discovered something astonishing: they were correlated to a handful of wishlists containing many thousands of items. But wishlists did not support pagination, so we knew those lists could not be loaded! We subsequently discovered that a handful of libraries were using the wishlist feature to organize their catalogs, and they knew of no better solution; these poor librarians would visit their wishlist, wait for the page to time out, and then reload again to get the cached result. Unsurprisingly, we introduced wishlist pagination shortly thereafter.

Hence, while we may focus on the core behaviors, it is important to rigorously inspect outliers: at the edges there be dragons. For critical metrics, it is worth measuring and often worth taking a goal on the worst-case behaviors. Fixing these can force us out of incremental thinking into more rigorous, durable solutions.
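A small sketch of why a percentile goal can stay green while the worst case is broken. The load times below are invented for illustration, and the nearest-rank percentile is one common definition among several; it is an assumption here, not necessarily how any particular monitoring system computes TP90.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical page-load times in seconds: 997 ordinary loads and
# 3 loads of very large lists that run for minutes.
load_times = [0.4] * 900 + [0.9] * 97 + [180.0] * 3

tp90 = percentile(load_times, 90)
tp99 = percentile(load_times, 99)
worst = max(load_times)

print(f"TP90={tp90}s TP99={tp99}s worst={worst}s")
```

Both percentiles look healthy in this sample; only the worst-case value reveals that a few customers are waiting three minutes.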

Consider that many program teams operate globally, and performance may vary significantly between regions or stores or other useful segments. A core program team may be critical for the success of a new or emerging product, so the team owning that product will often approach the core team to propose a “shared goal” that assures the small but growing area is not overlooked. However, such a core team may be expected to support hundreds of such efforts, and cannot afford to take independent goals on each. Instead, they should consider taking a generalized goal for the worst-case marketplace(s) to address these concerns. Indeed, addressing the worst-case sometimes results in a holistic redesign: as our metrics splinter into many small subdivisions, problems that occur in one misbehaving fragment can reveal architectural shortcomings that benefit many others.

Conclusion

A successful team at Amazon typically stands on three pillars: a mission that describes their purpose; tenets to describe the mental model for how to achieve the mission; and a rigorous and durable metrics portfolio. Of the three, it is my experience that establishing rigorous metrics is the most challenging and the most powerful. Too often these receive the least attention and investment. Such an oversight must be addressed. The most effective teams have a rigorous, well designed metrics portfolio. We focus on a small number of durable, “critical control point” outcome goals, buttressed by measured inherent inputs. Our controllable input goals are connected directly to those outputs. We review those goals regularly and vigorously.

Doing so ensures we are relentlessly focused on delivering durable value to our customers.

Appendix A: Setting Effective Goals

Designing an effective metrics portfolio simplifies our approach to goals. This is a brief summary of some considerations powered by the ideas explored above.

Effective goals are:

Decisive

When setting goals be rigorous and courageous: take only goals that matter. Taking too many goals obscures priority; ruthlessly excise goals that don’t directly deliver value to customers. As discussed above HACCP (critical control point) metrics deliver focus. Taking an exhaustive list of goals feels comprehensive, but is counterproductive. It increases burden on reviewers, forces attention on things that may not matter, and can allow important things to fester unnoticed longer than needed. That said, we must take a goal for every output business metric. If we can’t or won’t take such a goal… well, it isn’t really a business-level metric. For each business goal, we should in turn take goals on the fewest number of controllable inputs that directly influence our output.

Observable

As discussed above, good metrics use data that is simple to collect. The more abstract the data, the less likely it is to actually describe the world. Using a complicated formula or a trained Machine Learning model to make decisions can be empowering; using such signals as a success metric is dangerous.

Concrete

Measurement must be specific and explicit. A goal that states an average must specify the time frame over which the average will be computed; a launch goal must be connected to the metric it will impact. Beware goals that cannot be described as “increasing” or “decreasing” something, or for which the units are unstated, invented, or mix unrelated concerns (dollars per square-kilowatt-gram).

Time-Bound

Goals must have clear deadlines, and those deadlines must include both the start and end dates. Typically at Amazon we take annual goals, and the investment then covers a twelve month period. When setting the deadlines, it is important to be aware of seasonal factors; for example Amazon retail has traditionally had significantly different volumes at the end of the year than at the beginning. Be sure to use a meaningful duration over which to measure your goals.

Crisp

To capture these concerns, at Amazon an effective goal is always stated in the following form:

Increase/Decrease [metric] from X [units] on D1 [date] to Y [units] on D2 [date], an improvement/reduction of Z [units] (Z’%).

This formulation provides several interrelated benefits. First, it is crisp about comparison: it removes ambiguity about how the metric is being measured and the specific time range over which it is being computed. Second, including the math about the change, expressed as both the magnitude Z and the relative change Z’%, immediately helps the reader understand the impact and reduces cognitive load by sparing the reader simple math.
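The standard form is regular enough to generate from data, which also forces every field to be filled in. This is a hedged sketch: the `Goal` class, its field names, and the example latency goal are hypothetical, and mapping "improvement" to increases and "reduction" to decreases is an assumption about how the template's alternatives pair up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Goal:
    metric: str
    units: str
    start_value: float  # X
    start_date: date    # D1
    end_value: float    # Y
    end_date: date      # D2

    def __post_init__(self):
        if self.end_date <= self.start_date:
            raise ValueError("D2 must come after D1")
        if self.start_value == 0 or self.end_value == self.start_value:
            raise ValueError("need a non-zero baseline and a real change")

    def statement(self):
        """Render the goal in the standard form, computing Z and Z'% for the reader."""
        delta = self.end_value - self.start_value
        verb, noun = ("Increase", "an improvement") if delta > 0 else ("Decrease", "a reduction")
        percent = abs(delta) / abs(self.start_value) * 100
        return (
            f"{verb} {self.metric} from {self.start_value:g} {self.units} "
            f"on {self.start_date.isoformat()} to {self.end_value:g} {self.units} "
            f"on {self.end_date.isoformat()}, {noun} of {abs(delta):g} {self.units} "
            f"({percent:.1f}%)."
        )


# Hypothetical goal used only to exercise the template.
goal = Goal("TP90 page latency", "ms", 800, date(2023, 1, 1), 600, date(2023, 12, 31))
print(goal.statement())
```

A goal that cannot be constructed this way (no baseline, no units, no end date) is not yet crisp.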

Appendix B: Anti-patterns

There are some anti-patterns that serve as indicators of an incomplete or poorly designed metrics portfolio. Here is a sample:

Output-free metrics. A metrics portfolio with no outputs is like a soccer game that doesn’t keep score. If you can’t quantify the impact you’re going to have, many leaders will assume you aren’t going to have any.
Output-free goals. A year of delivery without output impact is a year spent piling up risk. The longer you wait to validate your ideas, the more likely it is that your ideas won’t work. (For pioneering efforts it may sometimes take more than a year to have impact, but for such efforts it is important to be unambiguous about the risks and what you are doing to “fail fast” – to either prove that the risks won’t take root, or prove that they cannot be avoided and the money is better spent elsewhere.)
An “average” goal with a slow ramp. If we are moving an average, the arithmetic of computing an average requires that the final value will necessarily be well beyond the averaged target. Hence the longer it takes to get below that average, the farther below it we will need to reach to have the impact we intend – and the less likely it is we will actually reach our goal. “Yearly Average” metrics are uncommon, but where they are appropriate immediate progress should be expected and tracked.
Dec 31 launch goals. Most American companies avoid meaningful launches between November and January. A launch goal at the end of a year typically suggests a lack of a real delivery plan. Worse still, such a goal is often unexamined until the end of the year approaches, so it becomes a fire drill for the engineering team during the part of the year during which it is hardest to make changes due to holiday traffic and vacations. Treat promises to launch in December with skepticism.
Vacuum Launches. Launch goals are a means to an end; launches must be connected to a controllable input or output metric. Unless it is clear what metric the launch will impact, it cannot be clear whether the launch is worth completing.
Lonely Metrics. A metric – or worse, metrics portfolio – that no one sees. Lonely metrics are no better than having no metrics at all, an investment disconnected from operational behavior.
Unstable Metrics. Some metrics can be unduly influenced by inherent inputs. For example, a poorly designed metric for page views might be entirely dependent on the number of visitors to the website. It is important to use ratios and choose denominators to assure the metric is stable in the face of a change to the sample size or population mix.
Myopic Metrics. Sometimes we have work that is germane to our charter and completely worth doing, but that would not impact any of our committed goals. Sadly, this often leads to teams de-prioritizing the work, but in fact it is a symptom of a metrics portfolio that isn’t focused on the right HACCP outputs. Fix the metrics, then do the right work.
Metaphysics. A goal that promises to change an inherent input, sometimes that in fact requires a violation of the laws of physics to achieve. A widget on a page cannot load faster than the page itself. A “real-time” model-based decision cannot be made in less than a millisecond. We cannot detect bad actors before we have collected the information that reveals them.
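The arithmetic behind the “average” goal with a slow ramp can be made concrete. In this sketch the function name and the numbers are hypothetical, and it assumes the simplest possible ramp: the metric sits at its baseline for some number of months and then holds a constant level for the rest of the year. Solving (k·B + (12 − k)·x) / 12 = T for x gives the level the remaining months must hold.

```python
def required_level(baseline, target_average, months_at_baseline, total_months=12):
    """Level the remaining months must hold for the period average to hit the target."""
    remaining = total_months - months_at_baseline
    if remaining <= 0:
        raise ValueError("no months left in which to move the average")
    return (target_average * total_months - baseline * months_at_baseline) / remaining


# Hypothetical goal: bring a yearly average down from 100 to 90.
for months_waited in (0, 3, 6, 9):
    level = required_level(100, 90, months_waited)
    print(f"{months_waited} months at baseline -> must hold {level:.1f} for the rest of the year")
```

With no delay the team needs to hold 90; after six months at baseline it must hold 80, and after nine months it must hold 60. Each month of delay pushes the required level further from the stated target.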

Endnotes

[1] “We take financial outputs seriously, but we believe that focusing our energy on the controllable inputs to our business is the most effective way to maximize financial outputs over time.” – Jeff Bezos, 2009 Letter to Shareholders
[2] There’s a related organizational design concern. If your team doesn’t have any controllable inputs, or your controllable inputs are insufficiently correlated to your desired outputs, your org structure is likely wrong and you have been set up to fail. Agonizing over goal-setting in that situation is counterproductive and you should instead be having the organizational conversation.
[3] Another trap is to take goals that attempt to change inherent inputs, which are doomed from the start. See “Metaphysics” in Appendix B, “Anti-patterns.”
[4] “My most important contribution [to animal welfare] has been to take the idea behind HACCP (Hazard Analysis Critical Control Point Theory, pronounced hassip) and apply it to the field.”
[5] Miller’s law, also known as “The Magical Number Seven (plus or minus two).”
[6] Which leads to inevitable curiosity why teams rarely automate monitoring of business goals. We need to inspect and interpret changes to health goals, but one might expect failures in business goals to result in alarms going off.
[7] I don’t know if this is true, because I haven’t researched it. In this case I would prefer to believe without confirmation than to investigate and be disappointed.
[8] This story may also be apocryphal. I have fruitlessly sought the data to establish (or refute) it.

Credits

Kim Rachmeler was an Amazon VP from 2001-2007. Among other significant achievements, Kim pioneered the use of “How’s My Driving” and expressed dissatisfaction as the key metric for delivering effective Customer Service; she also authored the crisp goal pattern described in Appendix A. Many of the key insights described here, I learned from her.

5 thoughts on “On Metrics and Measurement”

Greg Linden says:

June 14, 2022 at 9:41 am

Hi, Don. Great post! I was wondering if I might get you to expand a bit on your thoughts on metrics that are wrong?

You mostly say that metrics should be right to begin with (“building the wrong solutions that move their stated goals even when they know doing so won’t help customers… can be prevented by establishing the right goals in the first place”) and emphasize that getting it right in the first place is important because changing them “is and should be somewhat painful.” Why? Why should changing metrics be painful, rare, or hard?

For example, let’s say an executive decides watch time should be the goal (YouTube in the past) or that engagement including emotional reactions should be the goal (Facebook in the past). Then the company discovers that the metric is causing a lot of harm as it gets overoptimized by teams (Goodhart’s Law). Why isn’t the solution here to accept that most metrics are flawed models and won’t be exactly hitting the target and, instead of expecting them to be perfect from the beginning, make the process to correct metrics easier as you learn more about what they actually should be over time?

I’d love to hear more of your thoughts on this if you don’t mind sharing. And if you want to get in touch directly, we could do it by e-mail instead.

dondo says:

June 14, 2022 at 2:05 pm

Hey, Greg! Long time, hope you’re doing well. 🙂

That’s a sophisticated and provocative question. My purpose in stressing the importance of investing up front in refining metrics and goals is that too often I find teams entirely under-invest in their metrics portfolio, caught up in the excitement of their ideas and charter. These teams often end up wasting time.

Some of my perspective here is informed by Amazon’s “hands off the wheel” approach, where senior leaders accept goals in lieu of continuous and direct management. I wanted to emphasize the power such goals provide to allow teams more freedom to adjust their plan without significant oversight and approval from their leadership – ideally, complete freedom to do so. In contrast, adjusting metrics imposes the need for a re-examination of the plan, which is comparatively quite painful.

Hence, the pressure here is not really intended to restrict investment in inspecting and adjusting metrics, but rather to emphasize the relative difference between how hard it is to adjust commitments to meeting goals versus the freedom that restraint affords in allowing teams to adjust their plans.

“[We should] accept that most metrics are flawed models and … [we should have a] process to correct them” – I think that’s entirely right, and very well articulated. We should! I have some ideas about what such a process might look like, and may one day write that as a similar post to this one – if you’re interested in helping me develop it, hit me up privately.

Thanks again!

Greg Linden says:

June 14, 2022 at 4:06 pm

That makes sense! And thanks for the reply.

I absolutely agree that teams and executives often under-invest in thinking about metrics — often just tossing out a hasty and ill-advised metric like engagement or immediate ad revenue — and that can, as you said, end up wasting a lot of time building wrong solutions that don’t actually help customers and the business.

I’d love to see more of your thoughts on changing and correcting metrics as it becomes obvious the current metrics no longer are incentivizing the right thing. If I can make that request, yes absolutely, please do write on that in the future! I’d love to see it!

Buggy says:

June 19, 2022 at 1:35 pm

This is more helpful than the analytics course I’m taking (so far). However, I see some inherent problems. One is that I suspect your flow of outputs to inputs is roughly isomorphic to your management structure. When management only sets a goal/implements a measure without caring HOW the sausage gets made, that’s when Wells Fargo starts opening phantom accounts for existing customers, or securities fraud happens, or unwelcome data gets buried … while the business retains plausible deniability and can claim that the few bad actors have been punished and the problem is therefore solved. Another issue is the immorality of (over)relying on quantifiable outcomes alone, at least at certain companies; e.g. if you constantly cull the lowest 10% of drivers that have staked their homes on Amazon delivery (while illegally breaking any attempts to unionize, and lobbying your own sellers, Congress, your PR audience, etc. against reforms, and doing other things that seem to change the ‘unchangeable environmental inputs/laws of physics’), eventually you end up with a process where people die speeding in unmaintained vehicles with packages and pee bottles on the windscreen (citations available) … in other words, the over-reliance on measures without weighing their societal costs can be indistinguishable from going into the street and shooting people; it’s just easier to justify. Related: if you make your I/O chain long enough, you can have the same slow bleed of morality as there was a slow bleed in taste in the Maxwell House example: everyone can be doing their job as described, with no obvious increment in “evil” along the way, but the journey to an outcome that no participant would ask for is still inevitable.
Another issue is that Amazon says how careful they are with these measurements, and yet I constantly see failing-in-action as a direct result of metrics: CSRs who are clearly meeting a “close or respond to X cases/hr” goal because that’s what can be measured (in the case of “How’s My Driving?” that translates into passing the buck to another (incorrect) department where the issue gets dead-ended: the seller won’t complain about the original rep, because they appear to have no responsibility … they also aren’t helped), or a clearly poor use of statistics (e.g. making decisions about account health, right down to taking sellers’ funds, based on an insufficient sample size). Also, the compartmentalization inherent in one org’s outputs becoming another’s inputs leads to excess siloing, and the inability to move problems up the food chain to their source. 1000 sellers can clearly transmit to support that there’s an engineering error, and the only response will be “my script says that’s not how the site works, so you’re wrong. did we help you today?” We can punish the individual reps without solving the problem, with the reps who exert the most effort (i.e. actually engage) taking the most flak. Finally, the comments about an org not launching in Q4 are clearly not seller-side. This is when we are *most* likely to hear that some customer-centric change is being made that is not to our benefit (extended returns window, refund at first scan, changes to FBA limits or fees, etc. etc. etc.).

Add it all up, and I’m left with this choice: if Amazon is as careful and expends as much energy with measurements as this post implies, do I conclude that they aren’t very good at it, that it’s not actually the way to go about doing business, or that these are actually the outcomes they’re gunning for? Is there a 4th answer? I don’t like Bush/Cheney problems, where the viewer is left to decide whether the action was evil (blame it on Cheney, or a Jeff) or stupid (George was in charge), because they always let all bad actors off the hook; so in the absence of evidence (“the outcome of this investigation is proprietary and will not be shared with the seller”), I’ll have to draw my own conclusion. You can claim it’s an underinformed conclusion, but I’m all ears.

dondo says:

June 19, 2022 at 5:22 pm

There’s a lot to unpack in there, Buggy. I will say that much of this comes across more as an attack on the company that I happen to work for, rather than an honest discussion of what I’ve written. “Given that [the company you mention is awful], doesn’t that mean that what you describe is just a way for evil to prevail?” Sure, I guess. But, you know, “Given that [potatoes are actually meat], isn’t any vegan who eats them a hypocrite?!”

¯\_(ツ)_/¯

There were a few concerns more directly related to the topic of metrics.

1. Using metrics in this way does not protect against the moral hazard of unethical behavior.
This is quite true. However, the implication that using metrics as described increases such moral hazard seems suspect. Companies that are comfortable with unethical behavior will find ways to be unethical. I’m unconvinced that using metrics as described here has an appreciable impact on that. (Also, per above, I’m not engaging on the question of Amazon’s ethics.)

2. Metrics can obscure the underlying reality: “What gets measured gets optimized.”
This is quite true. This is why it’s so important to choose the right metrics. Your example of customer service representatives who might be measured based on time to resolve cases (were it true) would be a great example of a cripplingly poor output metric. The “How’s My Driving” metric I described in detail is specifically designed to counteract that pressure! However, time to resolve is an important *input* metric; if we aren’t paying attention to how long it takes to resolve cases, we have little incentive to, e.g., build better tooling. Is it possible that such metrics will be applied to individual behavior, resulting in pressure for representatives to solve cases faster or face a penalty? Yes, it is. Is that wrong? At some level, perhaps not, in just the same way I’d rather hire a graphic designer for my book covers who works quickly rather than one who takes ages or never delivers. However, obviously that’s not the only or even most important thing I’d consider. Nor should time to resolve be the single or most important individual metric for evaluating representative performance.

3. Metrics can obscure the actual underlying reality: “What gets measured gets gamed.”
This is a real concern. There are a number of mechanisms that need to be applied to try to counteract it. For example, we pay attention to examples where anecdote conflicts with the metric – we don’t necessarily trust the anecdote, but we do a deep dive and root cause analysis to work out which one is right. Those aren’t a pure antidote, of course; metrics owners have strong incentive to look good, so those analyses aren’t necessarily conducted in good faith.

Which leads me to think that the real problem isn’t metrics; the problem is human nature. Yes, we can lie with statistics. We can also just plain lie. I’m not sure that using metrics as described here appreciably changes the underlying behavior. Using data at least means that what we claim is somewhat connected to some reality. If we leave unethical actors completely to their own devices, they’ll just make stuff up.

I’m not entirely sure how to improve human nature. I’ll try to do better! :p
