Finish it in 12 hours.  You need to read carefully or I will dispute.
Discussion question:
1. Obviously, in both papers (Requirements development in scenario-based design,
Scenario-Based Usability Engineering, Chapter 3), there seems to be a strong
preference for scenario-based design, and rightfully so as it seems to be a better
approach in almost all situations. However, when would a requirement-based approach
beat out a scenario-based approach, or what big ideas from the readings make the
scenario-based approach better? (bonus points for naming something not mentioned
already!)
2. Read Understanding the Effect of Accuracy on Trust in Machine Learning Models
Discuss: Have you ever used any ML systems in your daily life/work? Do you think the ML
systems are trustworthy?
Copyright  1999 by Mary Beth Rosson and John M. Carroll
DRAFT: PLEASE DO NOT CITE OR CIRCULATE WITHOUT PERMISSION
Scenario-Based Usability Engineering
Mary Beth Rosson and John M. Carroll
Department of Computer Science
Virginia Tech
Fall 1999
Chapter 3
Analyzing Requirements
Making work visible. The end goal of requirements analysis can be elusive when work is not
understood in the same way by all participants. Blomberg, Suchman, and Trigg describe this
problem in their exploration of image-processing services for a law firm. Initial studies of
attorneys produced a rich analysis of their document processing needs—for any legal proceeding,
documents often numbering in the thousands are identified as “responsive” (relevant to the case) by
junior attorneys, in order to be submitted for review by the opposing side. Each page of these
documents is given a unique number for subsequent retrieval. An online retrieval index is created
by litigation support workers; the index encodes document attributes such as date, sender,
recipient, and type. The attorneys assumed that their job (making the subjective relevance
decisions) would be facilitated by image processing that encodes a document’s objective attributes
(e.g., date, sender). However, studies of actual document processing revealed activities that were
not objective at all, but rather relied on the informed judgment of the support staff. Something as
simple as a document date was often ambiguous, because it might display the date it was written,
signed, and/or delivered; choosing which date to encode required understanding the document’s content and role
in a case. Even determining what constituted a document required judgment, as papers came with
attachments and no indication of beginning or end. Taking the perspective of the support staff
revealed knowledge-based activities that were invisible to the attorneys, but that had critical limiting
implications for the role of image-processing technologies (see Blomberg, 1995).
What is Requirements Analysis?
The purpose of requirements analysis is to expose the needs of the current situation with
respect to a proposed system or technology. The analysis begins with a mission statement or
orienting goals, and produces a rich description of current activities that will motivate and guide
subsequent development. In the legal office case described above, the orienting mission was
possible applications of image processing technology; the rich description included a view of case
processing from both the lawyers’ and the support staffs’ perspectives. Usability engineers
contribute to this process by analyzing how features of workers’ tasks and their work
situation contribute to problems or successes1. This analysis of the difficulties or
opportunities forms a central piece of the requirements for the system under development: at the
minimum, a project team expects to enhance existing work practices. Other requirements may arise
from issues unrelated to use, for example hardware cost, development schedule, or marketing
strategies. However these pragmatic issues are beyond the scope of this textbook. Our focus is on
analyzing the requirements of an existing work setting and of the workers who populate it.
Understanding Work
What is work? If you were to query a banker about her work, you would probably get a
list of things she does on a typical day, perhaps a description of relevant information or tools, and
maybe a summary of other individuals she answers to or makes requests of. At the least,
describing work means describing the activities, artifacts (data, documents, tools), and social
context (organization, roles, dependencies) of a workplace. No single observation or interview
technique will be sufficient to develop a complete analysis; different methods will be useful for
different purposes.
Tradeoff 3.1: Analyzing tasks into hierarchies of sub-tasks and decision rules brings order
to a problem domain, BUT tasks are meaningful only in light of organizational goals and
activities.
A popular approach to analyzing the complex activities that comprise work is to enumerate
and organize tasks and subtasks within a hierarchy (Johnson, 1995). A banker might indicate that
the task of “reviewing my accounts” consists of the subtasks “looking over the account list”,
“noting accounts with recent activity”, and “opening and reviewing active accounts”. Each of these
sub-tasks in turn can be decomposed more finely, perhaps to the level of individual actions such as
picking up or filing a particular document. Some of the tasks will include decision-making, such as when the banker decides whether or not to open up a specific account based on its level of activity.
1 In this discussion we use “work” to refer broadly to the goal-directed activities that take place in the problem domain. In some cases, this may involve leisure or educational activities, but in general the same methods can be applied to any situation with established practices.
A strength of task analysis is its step-by-step transformation of a complex space of
activities into an organized set of choices and actions. This allows a requirements analyst to
examine the task’s structure for completeness, complexity, inconsistencies, and so on. However
the goal of systematic decomposition can also be problematic, if analysts become consumed by
representing task elements, step sequences, and decision rules. Individual tasks must be
understood within the larger context of work; over-emphasizing the steps of a task can cause
analysts to miss the forest for the trees. To truly understand the task of reviewing accounts a
usability engineer must learn who is responsible for ensuring that accounts are up to date, how
account access is authorized and managed, and so on.
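The step-by-step decomposition discussed above can be made concrete with a small sketch. The function and data structure below are illustrative assumptions (they are not part of any established task-analysis tool), and the task names loosely follow the banker example:

```python
# A hedged sketch: representing a hierarchical task analysis as a nested
# (name, subtasks) tree and enumerating its leaf-level actions.

def leaf_actions(task):
    """Recursively collect the lowest-level actions of a task hierarchy."""
    name, subtasks = task
    if not subtasks:
        return [name]
    actions = []
    for sub in subtasks:
        actions.extend(leaf_actions(sub))
    return actions

# Hypothetical decomposition of the banker's "reviewing my accounts" task.
review_accounts = ("review my accounts", [
    ("look over the account list", []),
    ("note accounts with recent activity", []),
    ("open and review active accounts", [
        ("decide whether to open an account", []),
        ("review the opened account", []),
    ]),
])

print(leaf_actions(review_accounts))
```

Such a representation makes the tradeoff above tangible: the tree is easy to inspect for completeness and consistency, but nothing in it captures the organizational context that gives the tasks their meaning.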
The context of work includes the physical, organizational, social, and cultural relationships
that make up the work environment. Actions in a workplace do not take place in a vacuum;
individual tasks are motivated by goals, which in turn are part of larger activities motivated by the
organizations and cultures in which the work takes place (see Activities of a Health Care Center,
below). A banker may report that she is reviewing accounts, but from the perspective of the
banking organization she is “providing customer service” or perhaps “increasing return on
investment”. Many individuals — secretaries, data-entry personnel, database programmers,
executives — work with the banker to achieve these high-level objectives. They collaborate
through interactions with shared tools and information; this collaboration is shaped not only by the
tools that they use, but also by the participants’ shared understanding of the bank’s business
practice — its goals, policies, and procedures.
Tradeoff 3.2: Task information and procedures are externalized in artifacts, BUT the impact
of these artifacts on work is apparent only in studying their use.
A valuable source of information about work practices is the artifacts used to support task
goals (Carroll & Campbell, 1989). An artifact is simply a designed object — in an office setting, it
might be a paper form, a pencil, an in-basket, or a piece of computer software. It is simple and fun
to collect artifacts and analyze their characteristics (Norman, 1990). Consider the shape of a
pencil: it conveys a great deal about the size and grasping features of the humans who use it;
pencil designers will succeed to a great extent by giving their new designs the physical
characteristics of pencils that have been used for years. But artifacts are just part of the picture.
Even an object as simple as a pencil must be analyzed as part of a real world activity, an activity
that may introduce concerns such as erasability (elementary school use), sharpness (architecture
firm drawings), name-brands (pre-teen status brokering), cost (office supplies accounting), and so
on.
Usability engineers have adapted ethnographic techniques to analyze the diverse factors
influencing work. Ethnography refers to methods developed within anthropology for gaining
insights into the life experiences of individuals whose everyday reality is vastly different from the
analyst’s (Blomberg, 1990). Ethnographers typically become intensely involved in their study of a
group’s culture and activities, often to the point of becoming members themselves. As used by
HCI and system design communities, ethnography involves observations and interviews of work
groups in their natural setting, as well as collection and analysis of work artifacts (see Team Work
in Air Traffic Control, below). These studies are often carried out in an iterative fashion, where
the interpretation of one set of data raises questions or possibilities that may be pursued more
directly in follow-up observations and interviews.
Figure 3.1: Activity Theory Analysis of a Health Care Center
(after Kuutti and Arvonen, 1992)
Activities of a Health Care Center: Activity Theory (AT) offers a view of individual
work that grounds it in the goals and practices of the community within which the work takes
place. Engeström (1987) describes how an individual (the subject) works on a problem (the
object) to achieve a result (the outcome), but that the work on the problem is mediated by the tools
available (see Figure 3.1). An individual’s work is also mediated by the rules of practice shared
within her community; the object of her work is mediated by that same community’s division of
labor.
Kuutti and Arvonen (1992; see also Engeström 1990; 1991; 1993) applied this framework to their studies of a health care organization in Espoo, Finland.

[Figure 3.1 content — Subject involved in activity: one physician in a health care unit; Tools supporting activity: patient record, medicines, etc.; Community sponsoring activity: all personnel of the health care unit; Object of activity: the complex, multi-dimensional problem of a patient; Activity outcome: patient problem resolved; mediating relations: Rules of Practice, Division of Labor.]

This organization wished to evolve from a rather bureaucratic organization with strong separations between its various units (e.g.,
social work, clinics, hospital) to a more service-oriented organization. A key assumption in doing
this was that the different units shared a common general object of work—the “life processes” of
the town’s citizens. This high-level goal was acknowledged to be a complex problem requiring the
integrated services of complementary health care units.
The diagram in Figure 3.1 summarizes an AT analysis developed for one physician in a
clinic. The analysis records the shared object (the health conditions of a patient). At the same time
it shows this physician’s membership in a subcommunity, specifically the personnel at her clinic.
This clinic is both geographically and functionally separated from other health care units, such as
the hospital or the social work office. The tools that the physician uses in her work, the rules that
govern her actions, and her understanding of her goals are mediated by her clinic. As a result, she
has no way of analyzing or finding out about other dimensions of this patient’s problems, for
example the home life problems being followed by a social worker, or emotional problems under
treatment by psychiatric personnel. In AT such obstacles are identified as contradictions which
must be resolved before the activity can be successful.
In this case, a new view of community was developed for the activity. For each patient,
email or telephone was used to instantiate a new community, composed of the relevant individuals
from different health units. Of course, the creation of a more differentiated community required
negotiation concerning the division of labor (e.g. who will contact whom and for what purpose),
and rules of action (e.g., what should be done and in what order). Finally, new tools (composite
records, a “master plan”) were constructed that better supported the redefined activity.
Figure 3.2 will appear here, a copy of the figure provided by Hughes et al. in their
ethnographic report. Need to get copyright permission.
Team Work in Air Traffic Control: An ethnographic study of British air traffic
control rooms by Hughes, Randall and Shapiro (CSCW’92) highlighted the central role played by
the paper strips used to chart the progress of individual flights. In this study the field workers
immersed themselves in the work of air traffic controllers for several months. During this time
they observed the activity in the control rooms and talked to the staff; they also discussed with the
staff the observations they were collecting and their interpretation of these data.
The general goal of the ethnography was to analyze the social organization of the work in
the air traffic control rooms. In this the researchers showed how the flight progress strips
supported “individuation”, such that each controller knew what their job was in any given
situation, but also how their tasks were interdependent with the tasks of others. The resulting
division of labor was accomplished in a smooth fashion because the controllers had shared
knowledge of what the strips indicated, and were able to take on and hand off tasks as needed, and
to recognize and address problems that arose.
Each strip displays an airplane’s ID and aircraft type; its current level, heading, and
airspeed; its planned flight path, navigation points en route, estimated arrival at these points; and
departure and destination airports (see Figure 3.2). However a strip is more than an information
display. The strips are work sites, used to initiate and perform control tasks. Strips are printed
from the online database, but then annotated as flight events transpire. This creates a public
history; any controller can use a strip to reconstruct a “trajectory” of what the team has done with a
flight. The strips are used in conjunction with the overview offered by radar to spot exceptions or
problems in the standard ordering and arrangement of traffic. An individual strip gets “messy” to the
extent it has deviated from the norm, so a set of strips serves as a sort of proxy for the orderliness
of the skies.
The team interacts through the strips. Once a strip is printed and its initial data verified, it is
placed in a holder color-coded for its direction. It may then be marked up by different controllers,
each using a different ink color; problems or deviations are signaled by moving a strip out of
alignment, so that visual scanning detects problem flights. This has important social consequences
for the active controller responsible for a flight. She knows that other team members are aware of
the flight’s situation and can be consulted; she can see who, if anyone, has noted specific issues with the flight; and if a particularly difficult problem arises, it can be passed on to the team leader without much explanation; and so on.
The ethnographic analysis documented the complex tasks that revolved around the flight
control strips. At the same time it made clear the constraints of these manually-created and
maintained records. However a particularly compelling element of the situation was the
controllers’ trust in the information on the strips. This was due not to the strips’ physical
characteristics, but rather to the social process they enable—the strips are public, and staying on
top of each other’s problem flights, discussing them informally while working or during breaks, is
taken for granted. Any computerized replacement of the strips must support not just management
of flight information, but also the social fabric of the work that engenders confidence in the
information displayed.
User Involvement
Who are a system’s target users? Clearly this is a critical question for a user-centered
development process. It first comes up during requirements analysis, when the team is seeking to
identify a target population(s), so as to focus in on the activities that will suggest problems and
concerns. Managers or corporation executives are a good source of high-level needs statements
(e.g., reduce data-processing errors, integrate billing and accounting). Such individuals also have
a well-organized view of their subordinates’ responsibilities, and of the conditions under which
various tasks are completed. Because of the hierarchical nature of most organizations, such
individuals are usually easy to identify and comprise a relatively small set. Unfortunately, if a
requirements team accepts these requirements too readily, they may miss the more detailed and
situation-specific needs of the individuals who will use a new system in their daily work.
Tradeoff 3.3: Management understands the high-level requirements for a system, BUT is
often unaware of workers’ detailed needs and preferences.
Every system development situation includes multiple stakeholders (Checkland, 1981).
Individuals in management positions may have authorized a system’s purchase or development;
workers with a range of job responsibilities will actually use the system; others may benefit only
indirectly from the tasks a system supports. Each set of stakeholders has its own set of
motivations and problems that the new system might address (e.g., productivity, satisfaction, ease
of learning). What’s more, none of them can adequately communicate the perspectives of the
others — despite the best of intentions, many details of a subordinate’s work activities and
concerns are invisible to those in supervisory roles. Clearly what is needed in requirements
analysis is a broad-based approach that incorporates diverse stakeholder groups into the
observation and interviewing activities.
Tradeoff 3.4: Workers can describe their tasks, BUT work is full of exceptions, and the
knowledge for managing exceptions is often tacit and difficult to externalize.
But do users really understand their own work? We made the point above that a narrow
focus on the steps of a task might cause analysts to miss important workplace context factors. An
analogous point holds with respect to interviews or discussions with users. Humans are
remarkably good (and reliable) at “rationalizing” their behavior (Ericsson & Simon, 1992).
Reports of work practices are no exception — when asked, workers will usually first describe a
most-likely version of a task. If an established “procedures manual” or other policy document
exists, the activities described by experienced workers will mirror the official procedures and
policies. However this officially-blessed knowledge is only part of the picture. An experienced
worker will also have considerable “unofficial” knowledge acquired through years of encountering
and dealing with the specific needs of different situations, with exceptions, with particular
individuals who are part of the process, and so on. This expertise is often tacit, in that the
knowledgeable individuals often don’t even realize what they “know” until confronted with their
own behavior or interviewed with situation-specific probes (see Tacit Knowledge in Telephone
Trouble-Shooting, below). From the perspective of requirements analysis, however, tacit
knowledge about work can be critical, as it often contains the “fixes” or “enhancements” that have
developed informally to address the problems or opportunities of day-to-day work.
One effective technique for probing workers’ conscious and unconscious knowledge is
contextual inquiry (Beyer & Holtzblatt, 1994). This analysis method is similar to ethnography, in
that it involves the observation of individuals in the context of their normal work environment.
However it includes the prerogative to interrupt an observed activity at points that seem informative
(e.g., when a problematic situation arises) and to interview the affected individual(s) on the spot
concerning the events that have been observed, to better understand causal factors and options for
continuing the activity. For example, a usability engineer who saw a secretary stop working on a
memo to make a phone call to another secretary, might ask her afterwards to explain what had just
happened between her and her co-worker.
Tacit Knowledge in Telephone Trouble-Shooting: It is common for workers to
see their conversations and interactions with each other as a social aspect of work that is enjoyable
but unrelated to work goals. Sachs (199x) observed this in her case study of telephony workers in
a phone company. The study analyzed the work processes related to detecting, submitting, and
resolving problems on telephone lines; the focus of the study was the Trouble Ticketing System
(TTS), a large database used to record telephone line problems, assign problems (tickets) to
engineers for correction, and keep records of problems detected and resolved.
Sachs argues that TTS takes an organizational view of work, treating work tasks as
modular and well-defined: one worker finds a problem, submits it to the database, TTS assigns it
to the engineer at the relevant site, that engineer picks up the ticket, fixes the problem, and moves
on. The original worker is freed from the problem analysis task once the original ticket is submitted, and the
second worker can move on once the problem has been addressed. TTS replaced a manual system
in which workers contacted each other directly over the phone, often working together to resolve a
problem. TTS was designed to make work more efficient by eliminating unnecessary phone
conversations.
In her interviews with telephony veterans, Sachs discovered that the phone conversations
were far from unnecessary. The initiation, conduct, and consequences of these conversations
reflected a wealth of tacit knowledge on the part of the worker–selecting the right person to call
(one known to have relevant expertise for this apparent problem), the “filling in” on what the first
worker had or had not determined or tried to this point, sharing of hypotheses and testing methods,
iterating together through tests and results, and carrying the results of this informal analysis into
other possibly related problem areas. In fact, TTS had made work less efficient in many cases,
because in order to do a competent job, engineers developed “workarounds” wherein they used
phone conversations as they had in the past, then used TTS to document the process afterwards.
Of interest was that the telephony workers were not at first aware of how much knowledge
of trouble-shooting they were applying to their jobs. They described the tasks as they understood
them from company policy and procedures. Only after considerable data collection and discussion
did they recognize that their jobs included the skills to navigate and draw upon a rich organizational
network of colleagues. In further work Sachs helped the phone company to develop a fix for the
observed workarounds in the form of a new organizational role: a “turf coordinator”, a senior
engineer responsible for identifying and coordinating the temporary network of workers needed to
collaborate on trouble-shooting a problem. As a result of Sachs’s analysis, work that had been tacit
and informal was elevated to an explicit business responsibility.
Requirements Analysis with Scenarios
As introduced in Chapter 2, requirements refers to the first phase of SBUE. As we also
have emphasized, requirements cannot be analyzed all at once in waterfall fashion. However some
analysis must happen early on to get the ball rolling. User interaction scenarios play an important
role in these early analysis activities. When analysts are observing workers in the world, they are
collecting observed scenarios, episodes of actual interaction among workers that may or may not
involve technology. The analysis goal is to produce a summary that captures the critical aspects of
the observed activities. A central piece of this summary analysis is a set of requirements scenarios.
The development of requirements scenarios begins with determining who are the
stakeholders in a work situation — what their roles and motivations are, what characteristics they
possess that might influence reactions to new technology. A description of these stakeholders’
work practice is then created, through a combination of workplace observation and generation of
hypothetical situations. These sources of data are summarized and combined to generate the
requirements scenarios. A final step is to call out the most critical features of the scenarios, along
with hypotheses about the positive or negative consequences that these features seem to be having
on the work setting.
Introducing the Virtual Science Fair Example Case
The methods of SBUE will be introduced with reference to a single open-ended example
problem, the design of a virtual science fair (VSF). The high-level concept is to use computer-
mediated communication technology (e.g., email, online chat, discussion forums,
videoconferencing) and online archives (e.g., databases, digital libraries) to supplement the
traditional physical science fairs. Such fairs typically involve student creation of science projects
over a period of months. The projects are then exhibited and judged at the science fair event. We
begin with a very loose concept of what a virtual version of such a fair might be — not a
replacement of current fairs, but rather a supplement that expands the boundaries of what might
constitute participation, project construction, project exhibits, judging, and so on.
Stakeholder Analysis
Checkland (1981) offers a mnemonic for guiding development of an early shared vision of
a system’s goals — CATWOE analysis. CATWOE elements include Clients (those people who
will benefit or suffer from the system), Actors (those who interact with the system), a
Transformation (the basic purpose of the system), a Weltanschauung (the world view promoted by
the system), Owners (the individuals commissioning or authorizing the system), and the
Environment (physical constraints on the system). SBUE adapts Checklund’s technique as an aid
in identifying and organizing the concerns of various stakeholders during requirements
analysis.The SBUE adaptation of Checklund’s technique includes the development of thumbnail
scenarios for each element identified. The table includes just one example for each VSF element
called out in the analysis; for a complex situation multiple thumbnails might be needed. Each
scenario sketch is a usage-oriented elaboration of the element itself; the sketch points to a future
situation in which a possible benefit, interaction, environmental constraint, etc., is realized. Thus
the client thumbnails emphasize hoped-for benefits of the VSF; the actor thumbnails suggest a few
interaction variations anticipated for different stakeholders. The thumbnail scenarios generated in
this analysis are not yet design scenarios; they simply allow the analyst to begin to explore the
space of user groups, motivations, and pragmatic constraints.
The CATWOE thumbnail scenarios begin the iterative process of identifying and analyzing
the background, motivations, and preferences that different user groups will bring to the use of the
target system. This initial picture will be elaborated throughout the development process, through
analysis of both existing and envisioned usage situations.
CATWOE Element: Clients
  VSF Element: Students; Community members
  Thumbnail Scenarios:
    A high school student learns about road-bed coatings from a retired civil engineer.
    A busy housewife helps a middle school student organize her bibliographic information.

CATWOE Element: Actors
  VSF Element: Students; Teachers; Community members
  Thumbnail Scenarios:
    A …
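A CATWOE analysis like the table above can also be kept in a lightweight structured form during requirements work. In this sketch the class and field names are invented for illustration (they are not part of Checkland's method or SBUE):

```python
from dataclasses import dataclass, field

@dataclass
class CatwoeElement:
    # One row of a CATWOE analysis: the element type (Clients, Actors, ...),
    # its instantiation in the target system, and thumbnail scenario sketches.
    element: str
    vsf_element: list
    thumbnails: list = field(default_factory=list)

analysis = [
    CatwoeElement("Clients", ["Students", "Community members"],
                  ["A high school student learns about road-bed coatings "
                   "from a retired civil engineer."]),
    CatwoeElement("Actors", ["Students", "Teachers", "Community members"]),
]

# Elements still lacking a thumbnail scenario need further analysis.
incomplete = [e.element for e in analysis if not e.thumbnails]
print(incomplete)
```

Keeping the analysis in a structured form makes it easy to see which elements still lack the usage-oriented elaboration the method calls for.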
Understanding the Effect of Accuracy on Trust in
Machine Learning Models
Ming Yin
Purdue University
[email protected]
Jennifer Wortman Vaughan
Microsoft Research
[email protected]
Hanna Wallach
Microsoft Research
[email protected]
ABSTRACT
We address a relatively under-explored aspect of human–computer interaction: people’s abilities to understand the relationship between a machine learning model’s stated performance on held-out data and its expected performance post deployment. We conduct large-scale, randomized human-subject experiments to examine whether laypeople’s trust in a model, measured in terms of both the frequency with which they revise their predictions to match those of the model and their self-reported levels of trust in the model, varies depending on the model’s stated accuracy on held-out data and on its observed accuracy in practice. We find that people’s trust in a model is affected by both its stated accuracy and its observed accuracy, and that the effect of stated accuracy can change depending on the observed accuracy. Our work relates to recent research on interpretable machine learning, but moves beyond the typical focus on model internals, exploring a different component of the machine learning pipeline.
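The abstract's central distinction — a model's stated accuracy on held-out data versus its observed accuracy in practice — can be illustrated with a minimal sketch. The toy labels and predictions below are invented for illustration and do not come from the paper's experiments:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Stated accuracy: measured on a held-out set before deployment.
held_out_labels = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
held_out_preds  = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]  # 9/10 correct
stated = accuracy(held_out_preds, held_out_labels)

# Observed accuracy: what a user actually sees in practice, which may
# differ if deployment data drift away from the held-out distribution.
observed_labels = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
observed_preds  = [1, 0, 0, 1, 1, 0, 0, 1, 1, 0]  # 6/10 correct
observed = accuracy(observed_preds, observed_labels)

print(stated, observed)  # 0.9 0.6 — stated accuracy can overstate observed accuracy
```

The paper's experiments manipulate exactly these two quantities independently to see how each affects laypeople's trust in the model.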
CCS CONCEPTS
• Human-centered computing → Empirical studies in HCI; • Computing methodologies → Machine learning.
KEYWORDS
Machine learning, trust, human-subject experiments
ACM Reference Format:
Ming Yin, Jennifer Wortman Vaughan, and Hanna Wallach. 2019. Understanding the Effect of Accuracy on Trust in Machine Learning Models. In CHI Conference on Human Factors in Computing Systems Proceedings (CHI 2019), May 4–9, 2019, Glasgow, Scotland UK. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3290605.3300509

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CHI 2019, May 4–9, 2019, Glasgow, Scotland UK
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-5970-2/19/05...$15.00
https://doi.org/10.1145/3290605.3300509
1 INTRODUCTION
Machine learning (ML) is becoming increasingly ubiquitous
as a tool to aid human decision-making in diverse domains
ranging from medicine to public policy and law. For exam-
ple, researchers have trained deep neural networks to help
dermatologists identify skin cancer [8], while political strate-
gists regularly use ML-based forecasts to determine their
next move [21]. Police departments have used ML systems
to predict the location of human trafficking hotspots [28],
while child welfare workers have used predictive modeling
to strategically target services to the children most at risk [3].
This widespread applicability of ML has led to a move-
ment to “democratize machine learning” [12] by developing
off-the-shelf models and toolkits that make it possible for
anyone to incorporate ML into their own system or decision-
making pipeline, without the need for any formal training.
While this movement opens up endless possibilities for ML
to have real-world impact, it also creates new challenges.
Decision-makers may not be used to reasoning about the
explicit forms of uncertainty that are baked into ML pre-
dictions [27], or, because they do not need to understand
the inner workings of an ML model in order to use it, they
may misunderstand or mistrust its predictions [6, 16, 25].
Prompted by these challenges, as well as growing concerns
that ML systems may inadvertently reinforce or amplify so-
cietal biases [1, 2], researchers have turned their attention to
the ways that humans interact with ML, typically focusing
on people’s abilities and willingness to use, understand, and
trust ML systems. This body of work often falls under the
broad umbrella of interpretable machine learning [6, 16, 25].
To date, most work on interpretability has focused explic-
itly on ML models, asking questions about people’s abilities
to understand model internals or the ways that particular
models map inputs to outputs [20, 24], as well as questions
about the relationship between these abilities and people’s
willingness to trust a model. However, the model is just one
component of the ML pipeline, which spans data collection,
model selection, training algorithms and procedures, model
evaluation, and ultimately, deployment. It is therefore im-
portant to study people’s interactions with each of these
components—not just those that relate to model internals.
CHI 2019 Paper CHI 2019, May 4–9, 2019, Glasgow, Scotland, UK
Paper 279 Page 1
One particularly under-explored aspect of the evaluation
and deployment components of the pipeline is the inter-
pretability of performance metrics, such as accuracy, preci-
sion, or recall. The democratization of ML means that it is
increasingly common for a decision-maker to be presented
with a “black-box” model along with some measure of its
performance—most often accuracy—on held-out data. How-
ever, a model’s stated performance may not accurately reflect
its performance post deployment because the data on which
the model was trained and evaluated may look very differ-
ent from real-world use cases [15]. In deciding how much to
trust the model, the decision-maker has little to go on besides
this stated performance, her own limited observations of the
model’s predictions in practice, and her domain knowledge.
This scenario raises a number of questions. To what extent
do laypeople—who are increasingly often the end users of
systems built using ML models—understand the relationship
between a model’s stated performance on held-out data and
its expected performance post deployment? How does their
understanding influence their willingness to trust the model?
For example, do people trust a model more if they are told
that its accuracy on held-out data is 90% as compared with
70%? If so, will the model’s stated accuracy continue to influ-
ence their trust in the model even after they are given the op-
portunity to observe and interact with the model in practice?
In this paper, we describe the results of a sequence of
large-scale, randomized, pre-registered human-subject exper-
iments1 designed to investigate whether an ML model’s accu-
racy affects laypeople’s willingness to trust the model. Specif-
ically, we focus on the following three main questions:
• Does a model’s stated accuracy on held-out data affect
people’s trust in the model?
• If so, does it continue to do so after people have observed
the model’s accuracy in practice?
• How does a model’s observed accuracy in practice affect
people’s trust in the model?
In each of our experiments, subjects recruited on Amazon
Mechanical Turk were asked to make predictions about the
outcomes of speed dating events with the help of an ML
model. Subjects were first shown information about a speed
dating participant and his or her date, and then asked to
predict whether or not the participant would want to see his
or her date again. Finally, they were shown the model’s pre-
diction and given the option of revising their own prediction.
In our first experiment, we focus on the first two questions
above, investigating whether a model’s stated accuracy on
held-out data affects laypeople’s trust in the model and, if so,
whether it continues to do so after they have observed the
model’s accuracy in practice. Subjects were randomized into
one of ten treatments, which differed along two dimensions:
1All experiments were approved by the Microsoft Research IRB.
stated accuracy on held-out data and amount at stake. Some
subjects were given no information about the model’s accu-
racy on held-out data, while others were told that its accuracy
was 60%, 70%, 90%, or 95%. Halfway through the experiment,
each subject was given feedback on both their own accuracy
and the model’s accuracy on the first half of the prediction
tasks, which was 80% regardless of the treatment. Subjects in
all treatments saw exactly the same speed dating events and
exactly the same model predictions. This experimental design
allows us to isolate the effect of stated accuracy on people’s
trust, both before and after they observe the model’s accuracy
in practice. As a robustness check, some subjects received a
monetary bonus for each correct prediction, while others did
not, allowing us to test whether the effect of stated accuracy
on trust varies when people have more “skin in the game.”
We find that stated accuracy does have a significant effect
on people’s trust in a model, measured in terms of both the
frequency with which subjects adjust their predictions to
match those of the model and their self-reported levels of
trust in the model. We also find that the effect size is smaller
after people observe the model’s accuracy in practice. We do
not find that the amount at stake has a significant effect.
In our second experiment, we test whether these results
are robust to different levels of observed accuracy by running
two additional variations of our first experiment: one in
which the observed accuracy of the model was low and one
in which the observed accuracy of the model was high. We
find that a model’s stated accuracy still has a significant effect
on people’s trust even after observing a high accuracy (100%)
in practice. However, if a model’s observed accuracy is low
(55%), then after observing this accuracy, the stated accuracy
has at most a very small effect on people’s trust in the model.
In our third experiment, we investigate the final question
above—i.e., how does a model’s observed accuracy in prac-
tice affect people’s trust in the model? The experimental
design used in our first two experiments does not enable us
to directly compare people’s trust between treatments with
different levels of observed accuracy because the prediction
tasks (i.e., speed dating events) and the model predictions
differed between these treatments. Our third experiment was
therefore carefully designed to enable us to make such com-
parisons. We find that after observing a model’s accuracy in
practice, people’s trust in the model is significantly affected
by its observed accuracy regardless of its stated accuracy.
Finally, via an exploratory analysis, we dig more deeply
into the question of how people update their trust after re-
ceiving feedback on their own accuracy and the model’s
accuracy in practice. We analyze differences in individual
subjects’ trust in the model before and after receiving such
feedback. Our experimental data support the conjecture that
people compare their own accuracy to the model’s observed
accuracy, increasing their trust in the model if the model’s
observed accuracy is higher than their own accuracy—except
in the case where the model’s observed accuracy is substan-
tially lower than its stated accuracy on held-out data.
Taken together, our results show that laypeople’s trust
in an ML model is affected by both the model’s stated accu-
racy on held-out data and its observed accuracy in practice.
These results highlight the need for designers of ML systems
to clearly and responsibly communicate their expectations
about model performance, as this information shapes the
extent to which people trust a model, both before and after
they are able to observe and interact with it in practice. Our
results also reveal the importance of properly communicat-
ing the uncertainty that is baked into every ML prediction.
Of course, proper caution should be used when generalizing
our results to other settings. For example, although we do
not find that the amount at stake has a significant effect, it
is possible that there would be an effect when stakes are suf-
ficiently high (e.g., doctors making life-or-death decisions).
Related Work
Our research contributes to a growing body of experimental
work on trust in algorithmic systems. As a few examples,
Dzindolet et al. [7] and Dietvorst et al. [4] found that people
stop trusting an algorithm after witnessing it make a mistake,
even when the algorithm outperforms human predictions—
a phenomenon known as algorithm aversion. Dietvorst et
al. [5] found that people are more willing to rely on an algo-
rithm’s predictions when they are given the ability to make
minor adjustments to the predictions rather than accepting
them as is. Yeomans et al. [30] found that people distrust
automated recommender systems compared with human rec-
ommendations in the context of predicting which jokes peo-
ple will find funny—a highly subjective domain—even when
the recommender system outperforms human predictions. In
contrast, Logg et al. [17] found that people trust predictions
more when they believe that the predictions come from an
algorithm as opposed to a human expert when predicting mu-
sic popularity, romantic matches, and other outcomes. This
effect is diluted when people are given the choice between us-
ing an algorithm’s prediction and using their own prediction
(as opposed to a prediction from another human expert).
The relationship between interpretability and trust has
been discussed in several recent papers [16, 22, 25]. Most
related to our work, and an inspiration for our experimental
design, Poursabzi-Sangdeh et al. [24] ran a sequence of ran-
domized human-subject experiments and found no evidence
that either the number of features used in an ML model
or the model’s level of transparency (clear or black box)
has a significant impact on people’s willingness to trust the
model’s predictions, although these factors do affect people’s
abilities to detect when the model has made a mistake.
Kennedy et al. [14] touched on the relationship between
stated accuracy and trust in the context of criminal recidi-
vism prediction. They ran a conjoint experiment in which
they presented subjects with randomly generated pairs of
models and asked each subject which model they preferred.
The models varied in terms of their stated accuracy, the size
of the (fictitious) training data set, the number of features,
and several other properties. The authors estimated the ef-
fect of each property by fitting a hierarchical linear model
and found that people generally focus most on the size of the
training data set, the source of the algorithm, and the stated
accuracy, while less often taking into account the model’s
level of transparency or the relevance of the training data.
Finally, a few studies from the human–computer interac-
tion community have examined the relationship between sys-
tem performance and users’ trust in automated systems [31,
32], ubiquitous computing systems [13], recommender sys-
tems [23], and robots [26]. For example, in a simulated ex-
perimental environment in which users interacted with an
automated quality monitoring system to identify faulty items
in a fictional factory production line, Yu et al. [31, 32] ex-
plored how users’ trust in the system varies with its accuracy.
Unlike in our work, system accuracy was not explicitly com-
municated to users. Instead, users “perceived” the accuracy
by receiving feedback after interacting with the system. Yu et
al. found that users are able to correctly perceive the accuracy
and stabilize their trust to a level correlated with the accu-
racy [31], though system failures have a stronger impact on
trust than system successes [32]. In addition, Kay et al. [13]
developed a survey tool through which they revealed that, for
classifiers used in four hypothetical applications (e.g., elec-
tricity monitoring and location tracking), users tend to put
more weight on the classifiers’ recall than on their precision
when deciding whether the classifiers’ performance
is acceptable, with the weight varying across applications.
2 EXPERIMENT 1: DOES A MODEL’S STATED
ACCURACY AFFECT LAYPEOPLE’S TRUST?
Our first experiment was designed to answer our first two
main questions—i.e., does a model’s stated accuracy on held-
out data affect laypeople’s trust in the model, and if so, does
it continue to do so after they have observed the model’s ac-
curacy in practice? In our experiment, each subject observed
the model’s accuracy in practice via a feedback screen that
was presented halfway through the experiment with infor-
mation about the subject’s own accuracy and the model’s
accuracy thus far, as described below. Before running the ex-
periment, we posited and pre-registered two hypotheses de-
rived from our questions, which we state informally here:2
2The pre-registration document is at https://aspredicted.org/uq3hi.pdf.
• [H1] The stated accuracy of a model has a significant
effect on people’s trust in the model before seeing the
feedback screen.
• [H2] The stated accuracy of a model has a significant
effect on people’s trust in the model after seeing the
feedback screen.
As a robustness check to guard against the potential criti-
cism that any null results might be due to a lack of perfor-
mance incentives, we randomly selected some subjects to
receive a monetary bonus for each correct prediction. We also
posited and pre-registered two additional hypotheses:
• [H3] The amount at stake has a significant effect on peo-
ple’s trust in a model before seeing the feedback screen.
• [H4] The amount at stake has a significant effect on
people’s trust in a model after seeing the feedback screen.
Prediction Tasks
We asked subjects to make predictions about the outcomes of
forty speed dating events. The data came from real speed dat-
ing participants and their dates via the experimental study
of Fisman et al. [9]. Each speed dating participant indicated
whether or not he or she wanted to see his or her date again,
thereby giving us ground truth from which to compute accu-
racy. We chose this application for two reasons: First, predict-
ing romantic interest does not require specialized domain
expertise. Second, this setting is plausibly one in which ML
might be used given that many dating websites already rely
on ML models to predict potential romantic partners [18, 29].
For each prediction task (i.e., speed dating event), each
subject was first shown a screen of information about the
speed dating participant and his or her date, including:
• The participant’s basic information: the gender, age, field
of study, race, etc. of the participant.
• The date’s basic information: the gender, age, and race of
the participant’s date.
• The participant’s preferences: the participant’s reported
distribution of 100 points among six attributes (attrac-
tiveness, sincerity, intelligence, fun, ambition, and shared
interests), indicating how much he or she values each
attribute in a romantic partner.
• The participant’s impression of the date: the participant’s
rating of his or her date on the same six attributes us-
ing a scale of one to ten, as well as scores (also using a
scale of one to ten) indicating how happy the participant
expected to be with his or her date and how much the
participant liked his or her date.
The subject was then asked to follow a three-step procedure:
First, they were asked to carefully review the information
about the participant and his or her date and predict whether
or not the participant would want to see his or her date
Figure 1: Screenshot of the prediction task interface.
again. Next, they were shown the model’s (binary) prediction.
Finally, they were given the option of revising their own
prediction. A screenshot of the interface is shown in Figure 1.
Experimental Treatments
We randomized subjects into one of ten treatments arranged
in a 5×2 design. The treatments differed along two dimen-
sions: stated accuracy on held-out data and amount at stake.
Subjects were randomly assigned to one of five accuracy
levels: none (the baseline), 60%, 70%, 90%, or 95%. Subjects
assigned to an accuracy level of none were initially given
no information about the model’s accuracy on held-out data.
Subjects assigned to one of the other accuracy levels saw
the following sentence in the instructions: “We previously
evaluated this model on a large data set of speed dating
participants and its accuracy was x%, i.e., the model’s predic-
tions were correct on x% of the speed dating participants in
this data set.” Throughout the experiment, we also reminded
these subjects of the model’s stated accuracy on held-out data
each time they were shown one of the model’s predictions.
We note that our sentence about accuracy was not a decep-
tion. We developed four ML models (a rule-based classifier,
a support vector machine, a three-hidden-layer neural net-
work, and a random forest) and evaluated them on a held-out
data set of 500 speed dating participants, obtaining accuracies
of 60%, 70%, 90%, and 95%. To keep the treatments as similar
as possible, the models made exactly the same predictions for
the forty speed dating events that were shown to subjects.
Subjects were randomly assigned to either low or high
stakes. Subjects assigned to low stakes were paid a flat rate
of $1.50 for completing the experiment. Subjects assigned to
high stakes also received a monetary bonus of $0.10 for each
correct (final) prediction3 in addition to the flat rate of $1.50.
Experimental Design
We posted our experiment as a human intelligence task (HIT)
on Amazon Mechanical Turk. The experiment was only open
to workers in the U.S., and each worker could participate
only once. In total, 1,994 subjects completed the experiment.
Upon accepting the HIT, each subject was randomized
into one of the ten treatments described above. Each HIT
consisted of exactly the same forty prediction tasks, grouped
into two sets A and B of twenty tasks each. As described
above, subjects in all ten treatments saw exactly the same
model prediction for each task. The experiment was divided
into two phases. To minimize differences between the phases,
subjects were randomly assigned to see either the tasks in
set A during Phase 1 and the tasks in set B during Phase 2, or
vice versa; the order of the tasks was randomized within each
phase. We chose the tasks in sets A and B so that the observed
accuracy on the first twenty tasks would be 80% regardless of
the ordering of sets A and B. This experimental design mini-
mizes differences between treatments and allows us to draw
causal conclusions about the effect of stated accuracy on
people’s trust without worrying about confounding factors.
Each subject was asked to make initial and final predic-
tions for each task, following the three-step procedure de-
scribed above. The subjects were given no feedback on their
own prediction or the model’s prediction for any individual
task; however, after Phase 1, each subject was shown a feed-
back screen with information about their own accuracy and
the model’s accuracy (80% by design) on the tasks in Phase
1. A screenshot of the feedback screen is shown in Figure 2.
At the end of the HIT, each subject completed an exit sur-
vey in which they were asked to report their level of trust in
the model during each phase using a scale of one (“I didn’t
trust it at all”) to ten (“I fully trust it”). Specifically, we asked
subjects the following question: “How much did you trust
our machine learning algorithm’s predictions on the first
[last] twenty speed dating participants (that is, before [after]
3The highest possible bonus was 40×$0.10 = $4—i.e., substantially more
than the flat rate of $1.50, thereby making the bonus salient [11].
Figure 2: Screenshot of the feedback screen shown between
Phase 1 and Phase 2 (i.e., after the first twenty tasks).
you saw any feedback on your performance and the algo-
rithm’s performance)?” We also collected basic demographic
information (such as age and gender) about each subject.
To quantify a subject’s trust in a model, we defined two
metrics, calculated separately for each phase, that capture
how often the subject “followed” the model’s predictions:
• Agreement fraction: the number of tasks for which the
subject’s final prediction agreed with the model’s predic-
tion, divided by the total number of tasks.
• Switch fraction: the number of tasks for which the sub-
ject’s initial prediction disagreed with the model’s pre-
diction and the subject’s final prediction agreed with the
model’s prediction, divided by the total number of tasks
for which the subject’s initial prediction disagreed with
the model’s prediction.
We used these two metrics when formally stating all of our
pre-registered hypotheses, while additionally pre-registering
our intent to analyze subjects’ self-reported trust levels.
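The two trust metrics defined above can be computed directly from per-task prediction records. The sketch below is our own illustration (the function and field names are hypothetical, not from the authors' analysis code):

```python
def trust_metrics(initial, final, model):
    """Compute (agreement fraction, switch fraction) for one subject
    in one phase, given per-task binary predictions (0/1)."""
    n = len(model)
    # Agreement fraction: final predictions matching the model's,
    # over all tasks.
    agreement = sum(f == m for f, m in zip(final, model)) / n
    # Switch fraction: among tasks where the initial prediction
    # disagreed with the model's, the fraction where the final
    # prediction agreed with the model's.
    disagreed = [(f, m) for i, f, m in zip(initial, final, model) if i != m]
    if not disagreed:
        return agreement, float("nan")
    switch = sum(f == m for f, m in disagreed) / len(disagreed)
    return agreement, switch
```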
Analysis of Trust in Phase 1 (H1 and H3)
We start by analyzing data from Phase 1 to see if subjects’
trust in a model is affected by the model’s stated accuracy
and the amount at stake before they see the feedback screen.
Figures 3a and 3b show subjects’ average agreement fraction
and average switch fraction, respectively, in Phase 1, by treat-
ment. Visually, stated accuracy appears to have a substantial
effect on how often subjects follow the model’s predictions.
Subjects’ final predictions agree with the model’s predictions
more often when the model has a high stated accuracy. How-
ever, the effect of the amount at stake is less apparent. To
formally compare the treatments, we conduct a two-way
ANOVA on subjects’ agreement fractions and, respectively,
switch fractions in Phase 1. The results suggest a statistically
significant main effect of stated accuracy on how often sub-
jects follow the model’s predictions (effect size η² = 0.036,
p = 4.72 × 10^−15 for agreement fraction, and η² = 0.061,
p = 5.62 × 10^−26 for switch fraction), while the main effect of
the amount at stake is insignificant (p = 0.30 and p = 0.11
for agreement fraction and switch fraction, respectively).
We do not detect a significant interaction between the two
[Figure 3 panels: (a) Phase 1: Agreement fraction; (b) Phase 1: Switch fraction; (c) Phase 2: Agreement fraction; (d) Phase 2: Switch fraction. Each panel plots the average fraction against stated accuracy (None, 60%, 70%, 90%, 95%) for low-stakes and high-stakes subjects.]
Figure 3: Comparing how often subjects in different experimental treatments follow an ML model’s predictions (average agreement fraction and average switch fraction) during each phase of our first experiment. Error bars represent standard errors.
factors (p = 0.77 and p = 0.62 for agreement fraction and
switch fraction, respectively). In other words, hypothesis H1
is supported by our experimental data, while H3 is not.
An analysis of subjects’ self-reported levels of trust reveals
a similar pattern. We detect a statistically significant main
effect of stated accuracy on subjects’ self-reported levels of
trust during Phase 1 (η² = 0.049, p = 1.61 × 10^−20), while the
main effect of the amount at stake is insignificant (p = 0.92).
We also conduct a post-hoc Tukey’s HSD test to identify
pairs of treatments in which subjects exhibit distinct dif-
ferences in how often they follow the model’s predictions.
We find that treatments can be clustered into two groups—
treatments with an accuracy level of none, 60%, or 70%, and
treatments with an accuracy level of 90% or 95%—such that
almost all statistically significant differences are found between
across-group treatment pairs.4 These results confirm our visual
intuition from Figures 3a and 3b: when subjects have not yet
observed the model’s accuracy in practice, they tend to follow
the predictions of models …
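A post-hoc pairwise comparison of this kind can be sketched with statsmodels' Tukey HSD implementation. The data below are synthetic, with group means chosen so that, as in the paper's finding, significant differences cluster between the {none, 60%, 70%} and {90%, 95%} groups; this is an illustration, not the authors' analysis.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
levels = ["none", "60", "70", "90", "95"]
means = {"none": 0.78, "60": 0.77, "70": 0.78, "90": 0.84, "95": 0.85}

# 80 synthetic subjects per stated-accuracy level.
df = pd.DataFrame({"stated_accuracy": np.repeat(levels, 80)})
df["agreement"] = [rng.normal(means[g], 0.05) for g in df["stated_accuracy"]]

# Pairwise Tukey HSD over all stated-accuracy levels.
result = pairwise_tukeyhsd(df["agreement"], df["stated_accuracy"])
print(result.summary())
```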



