A Case Study on Quality: What Motorcycle Maintenance Can’t Teach You

For most of human history we measured quality with our senses. We smelled fish for spoilage, looked in a horse's mouth to verify age, and felt our produce for ripeness. Because of this, we have an innate sense of quality that is not easily defined:

Quality … you know what it is, yet you don’t know what it is. But that’s self-contradictory. But some things are better than others, that is, they have more quality.

— Phaedrus in “Zen and the Art of Motorcycle Maintenance” 

People have always expected quality, but how do you provide it when most of what we do is abstracted and can't be measured by our senses? Data can be a proxy for quality, but how do we bridge the gap between what we can measure and what we can sense in the physical universe? As companies scale up, they abstract the individual and the personal into the average and the aggregate. This transition is where many organizations get into trouble.

Bridging that gap means overcoming three key challenges. In this post I share a case study of a company that fell into this trap: in reaching for quality, it fell short of its customers' need for an image recognition product that required high-quality data inputs to be effective (I'll refer to the company as IR).

Challenge 1 – Defining quality 

The ability to recognize and categorize images was important for IR's customers. In order for the business to work, they needed 98% accuracy (i.e. only 2 mistakes per 100 images). In addition, each client had its own specific quality targets under the broader 98% accuracy target. If these targets weren't met, IR's customers would have to re-tag the images themselves, and IR would have to reimburse the cost of the service it was supposed to automate. Knowing that quality was important, IR had a dedicated team that would work with each new customer to define the quality metrics to be used once the specific solution was operated at scale (approximately 1,000 images per day). The approach was to audit the tagged data by sending a sample to human annotators, who would determine the quality of the algorithm. The human-annotated data would also be used to refine and train the model.
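
To make the target concrete, here is a minimal sketch of how an audited sample could be scored against the overall 98% bar and a per-client target. The function names and per-client numbers are illustrative assumptions on my part, not IR's actual system.

```python
# Minimal sketch: score an audited sample against the 98% bar and a per-client target.
# The per-client targets and function names are hypothetical, not IR's real configuration.

OVERALL_TARGET = 0.98                                   # at most 2 mistakes per 100 images
CLIENT_TARGETS = {"client_a": 0.99, "client_b": 0.985}  # hypothetical per-client bars

def audit_accuracy(audit_labels, algorithm_labels):
    """Fraction of audited images where the algorithm matched the human audit label."""
    assert len(audit_labels) == len(algorithm_labels)
    correct = sum(h == a for h, a in zip(audit_labels, algorithm_labels))
    return correct / len(audit_labels)

def meets_target(client, audit_labels, algorithm_labels):
    """True if the audited accuracy clears both the overall and the client-specific target."""
    acc = audit_accuracy(audit_labels, algorithm_labels)
    return acc >= max(OVERALL_TARGET, CLIENT_TARGETS.get(client, OVERALL_TARGET))
```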

Challenge 2 – Measurement and Accountability

IR built another team that was responsible for the data quality that came out of the auditing process once the customer was onboarded and the IR program was operationalized (they called this state "matured"). This team would monitor data quality against the metrics defined by the onboarding team and, if there were any issues, develop an action plan and coordinate with the operations and technology teams to mitigate them. Measurement and accountability are important for any operation, and IR met this need by chartering this team: it reviewed the quality reports and, when targets were not met, drove the action plan to ensure the operations teams were accountable to what the customer required.

Challenge 3 – Not skipping what is important, but harder to measure

This structure created a "chicken and egg" problem, which caused the issue with the delivery to the customer. There was an onboarding team that defined metrics and worked with the customer, and an operational team that managed to the metrics once a workflow was mature, but no team to ensure that the workflow ramped up and matured correctly. In computing this problem is called a "cold start"; in philosophy it is known as "Meno's Paradox," after a line in Plato's "Meno":

And how will you inquire into a thing when you are wholly ignorant of what it is? Even if you happen to bump right into it, how will you know it is the thing you didn’t know?

In order to have an accurate audit of the data, you had to know that the humans were annotating correctly. Traditionally this is done by having a data set annotated twice and checking for inter-annotator agreement. If the annotators agree, you can assume the label is correct. If they don't, you typically throw the data out or send it on to a third annotator to break the tie.
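
As a rough sketch of what that traditional approach looks like in practice (the function names are illustrative, not IR's pipeline):

```python
# Sketch of double annotation with a tie-breaker; names are illustrative, not IR's pipeline.

def inter_annotator_agreement(labels_1, labels_2):
    """Raw agreement rate between two annotators over the same items."""
    matches = sum(a == b for a, b in zip(labels_1, labels_2))
    return matches / len(labels_1)

def resolve_label(label_1, label_2, tie_breaker=None):
    """Return the accepted label, or None if the item should be discarded."""
    if label_1 == label_2:
        return label_1                      # annotators agree: accept
    if tie_breaker is not None and tie_breaker in (label_1, label_2):
        return tie_breaker                  # third annotator sides with one of the two
    return None                             # no agreement: throw the item out
```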

This is very expensive to do, and IR cut corners. Their approach: have humans annotate 10% of the images, then audit each annotator with a "verifier" who re-annotates 10% of that original 10%. This is much more cost-effective, and it still gets you a quality score per annotator.
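
The back-of-the-envelope math below shows why this looked attractive. The 1,000 images/day figure comes from the onboarding target above; the baseline of double-annotating everything is my own illustrative comparison, not IR's stated alternative.

```python
# Illustrative cost comparison; the double-annotation baseline is an assumption, not IR's plan.

DAILY_VOLUME = 1000                          # images per day at scale

double_annotation = DAILY_VOLUME * 2         # every image labeled twice: 2,000 labels/day
audit_sample = int(DAILY_VOLUME * 0.10)      # humans annotate 10%: 100 labels/day
verified_sample = int(audit_sample * 0.10)   # verifier re-labels 10% of that: 10 labels/day

ir_scheme = audit_sample + verified_sample   # 110 human labels/day
print(f"IR scheme: {ir_scheme} labels/day vs full double annotation: {double_annotation}")
```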

This didn’t create the right outcome, however. The assumption IR made was that the "verifier" is more accurate than the original annotator and can be relied on to score the annotator. In a mature workflow staffed by experienced annotators this is a reasonable assumption, but in an immature workflow it was a big mistake: because the workflow was new, no one was more experienced than anyone else. As a result, the "quality scores" that came out looked good and were assumed to reflect the underlying quality of the data. And because there was never a single set of accurate data (called a "golden" or "canonical" set), no one was effectively trained in the right way to do the workflow.
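
For contrast, a golden set makes annotator scoring independent of whoever happens to be verifying that day. A minimal sketch, with hypothetical image IDs and labels:

```python
# Sketch of scoring annotators against a golden (canonical) set.
# The image IDs and labels are hypothetical placeholders.

GOLDEN_SET = {"img_001": "cat", "img_002": "dog", "img_003": "bicycle"}

def score_against_golden(annotator_labels):
    """Accuracy of one annotator on the items that overlap with the golden set."""
    overlap = [(img, label) for img, label in annotator_labels.items() if img in GOLDEN_SET]
    if not overlap:
        return None                          # nothing to score against
    correct = sum(GOLDEN_SET[img] == label for img, label in overlap)
    return correct / len(overlap)
```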

The laziest kid always has to work the hardest in the end

– C.S. Lewis

Ultimately, it was bad press for the customers that brought this to IR's attention. The end users of the products that used the technology were turned off by the inaccurate image tagging, and a Reddit subforum (or "subreddit") with specific examples was picked up by mainstream media publications. Between the bad press and the disappointed customers, cutting corners ended up being a negative economic decision, and IR paid for it many times over. The solution that, from what I understand, still remains in place today is the best practice: two annotators tag a data set; if they agree, the label is "valid," and if they don't, the item is sent to a third annotator to "break the tie" or is discarded.