User Interface Development
SEI Curriculum Module SEI-CM-17-1.1*
November 1989+
Gary Perlman, Ohio State University
Carnegie Mellon University, Software Engineering Institute
This work was sponsored by the U.S. Department of Defense. Approved for public release. Distribution unlimited.
* A support materials package, SEI-SM-17, is available for this module.
+ Module Revision History
Version 1.1 (November 1989): Expanded coverage of general design, task analysis, and measurements; additional teaching considerations and references
Version 1.0 (April 1988): Draft for public review

{Contents}
Capsule Description
Scope
Philosophy
Objectives
Prerequisites
Using this Module
Acknowledgements
User Interface Life Cycle
1. User Interface Design
1.1. Task/Requirements Analysis for User Interface Design
1.2. Psychological Foundations for User Interfaces
1.3. Principles, Guidelines, Standards, and Rules for User Interface Design
1.4. Input/Output Devices for User Interfaces
1.5. Dialogue Types for User Interfaces
1.6. Enhanced/Adaptive Interaction
1.7. Prototyping User Interfaces
1.8. Specification of User Interface Design
2. User Interface Implementation
2.1. Fundamental Concepts of User Interface Implementation
2.2. Interaction Dialogue Types
2.3. Interaction Libraries
2.4. Dialogue-Control Structure Models
2.5. User Interface Management Systems
2.6. User Guidance Integrated into User Interfaces
3. User Interface Evaluation
3.1. Empirical Evaluation of User Interfaces
3.2. Theoretical Evaluation / Predictive Modeling of User Interfaces
Teaching Considerations
Sources of Information
Priorities
Schedules
Exercises and Projects
Bibliography

{Capsule Description}
This module covers the issues, information sources, and methods used in the design, implementation, and evaluation of user interfaces, the parts of software systems designed to interact with people. User interface design draws on the experiences of designers, current trends in input/output technology, cognitive psychology, human factors (ergonomics) research, guidelines and standards, and on the feedback from evaluating working systems. User interface implementation applies modern software development techniques to building user interfaces. User interface evaluation can be based on empirical evaluation of working (parts of) systems or on the predictive evaluation of system design specifications.

{Scope}
This module attempts to cover all aspects of the process of designing, implementing, and evaluating user interfaces. Inputs to the process include the results of requirements analysis for the system, although this module has some coverage of task analysis, the requirements analysis of the human components of systems. Outputs of the process include systems of software and user documentation; however, more detailed coverage of documentation should be sought elsewhere, and the module contains little coverage of field-testing of usability over the long-term maintenance of systems.

{Philosophy}
Despite the great effort involved in developing user interfaces and the large potential costs of bad ones, the design of a user interface is often left until late in the development of software systems. Many user interfaces, including those designed late, are not evaluated for their usability or acceptability to users, thus risking failure. User interfaces are important for the marketability of software products. Bad user interfaces can contribute to human error, possibly resulting in personal and financial damages.
In short, user interfaces are important for the success of software products, and software developers who are writing software to be used by people must know how to ensure that people can use that software.

This module is designed for teachers of software engineering. As such, it differs in form, content, and depth of presentation from a module designed for teaching other people involved in user interface development. For example, this module is not appropriate for human factors students because it goes into too much detail on software implementation and is weak, by human factors standards, in the area of evaluation. Nor is the module designed for teaching people interested in doing research in human-computer interaction; a practical application approach is taken here. Many software engineers have limited exposure to issues of software usability and little interest in a detailed study of the research on human-computer interaction, but they would rather produce usable software than software that is difficult to learn and use. Given those assumptions, and given that software engineers make most of the user interface decisions for systems under development (partly because of a lack of interaction with human factors experts), software engineers must gain an appreciation of the problems of user interface development and their solutions.

{Objectives}
A student who has worked through this module should have gained perspective, learned about methods and tools, and gained an appreciation of their limits.
Perspective
* the importance of the user interface
* the impact of good and bad user interfaces
* the diversity of users and applications
Methods and Tools
* the tradeoffs of design decisions involving different dialogue types and input/output devices
* the information resources available for design
* the benefits and costs of developing tools for user interface implementation
* the need to integrate training materials with the user interface
* the need to evaluate system usability
* information about some design and evaluation tools
Limits of Knowledge
* when and how to work with human factors engineers as consultants for design and evaluation
* when and how to work with technical writers for implementation of a system of user guidance
* when and how to work with a statistical consultant; the difficulty of measurement and the complexity of making decisions based on data

{Prerequisites}
This module assumes students have:
* familiarity with data structures such as stacks, trees, and graphs;
* a considerable degree of mathematical fluency, so they can quickly absorb concepts of experimental design and data analysis. A previous course in applied statistics would be useful, but its necessity as a prerequisite depends on the depth at which the evaluation section is taught.
A background in the following areas would be useful but is not required. SEI curriculum modules are available on each of these areas:
* Requirements Analysis, including areas such as task analysis and the importance of pleasing users.
* Specification, including the specification of the structure of user interfaces, which is used extensively in some user interface management systems.
* Testing, from the point of view of usability. This area receives some coverage in the section on user interface evaluation because it is typically omitted from validation instruction.
* Technical Writing, an integral part of a user interface because of its importance in user guidance, both online and in printed form.
{Using this Module}
The content of this module is divided into three large sections: design, implementation, and evaluation. Each of these sections contains topics with titles and annotations. The annotations contain brief descriptions of the subject matter for that topic, along with bibliographic citations. Watch for special notes in the annotations:
* There are comments throughout the module on teaching considerations that are tied to specific topics. TEACH: These give advice drawn from experience teaching the material.
* There are notes describing class demonstrations. DEMO: These suggest when demonstrations are appropriate or needed, and note demonstrations that have worked well in the past.

{Acknowledgements}
This module grew out of my experiences in teaching three offerings of an elective course on User Interface Design, Implementation, and Evaluation in the Master of Software Engineering program at the Wang Institute of Graduate Studies. During those offerings, my students helped improve the course in more ways than I can recall, and my fellow faculty members were often thoughtful enough to direct my attention to new developments in the field. I would like to thank Norm Gibbs, director of education at the SEI, who made it possible for me to encapsulate what I have learned about teaching user interface development to software engineers. Arvind Jain, and later Jim Rankin and Mark Schmick, provided tireless effort with the bibliography. Gary Ford provided many formatting aids to help me present the material the way I wanted. Albert Johnson, Lionel Deimel, and Allison Brunvand made sure my work environment was right. I would like to thank the reviewers of and contributors to several versions of this document: Len Bass (SEI), Deborah Boehm-Davis (George Mason University), Stu Card (Xerox PARC), Lionel Deimel (SEI), Jim Foley (George Washington University, now at Georgia Tech), Bob Glass (SEI), Paul Green (University of Michigan), Bruce Horn (CMU), Bonnie John (CMU), Bill Hefley (SEI), Clayton Lewis (University of Colorado, Boulder), Marilyn Mantei (University of Toronto), John Nestor (SEI), Linda Pesante (SEI), Steve Poltrock (Boeing), Judith Reitman Olson (University of Michigan), Robert Seacord (SEI), Ben Shneiderman (University of Maryland), and probably some others I have forgotten.

{User Interface Life Cycle}
User interface development involves design, implementation, and evaluation. Some software development projects have used a life cycle model in which a system is designed (over an extended period), implemented (with little interaction between designers and implementors), and then evaluated as it is about to go out the door. Such projects are destined to make the very largest of mistakes because of the lack of feedback until the end of the process. User interface development is complex and not well understood, hence the need for prototype implementation and evaluation. Although most of the time spent on design may come early in a project and the majority of evaluation time late, the design-implementation-evaluation cycle occurs hundreds of times as parts of user interfaces are built. See [Gould85, WilligesR86a, Hartson86a, Foley86b, Grudin87, Bennett87, Mantei88, Grudin89, Perlman89c] and [Salvendy87] (Ch. 1.2) for discussions of user interface development life cycle models.
[Figure: the user interface development cycle. Evaluation results, standards, guidelines, experience, and requirements all feed Design; Design produces a formal specification and models that feed Implementation, which also draws on guidelines, requirements, and standards; Implementation produces a prototype or built system that is the subject of Evaluation; Evaluation feeds back into Design.]

The knowledge in the field of user interface development is intertwined, with no clear beginning or end. Designers of user interfaces must draw from as many sources of information as can be obtained at reasonable cost. Some design information comes from evaluation data, which may, in turn, come from instrumentation software placed in implemented prototypes. Some design information may come from predictive models, which can only be evaluated for their applicability with an understanding of empirical evaluation methods.

{Requirements/Task Analysis for User Interface Design}
Before a user interface is designed, the constraints on design must be known, and the output of requirements analysis must be used as input to the process. One aspect of requirements analysis is task analysis, which studies the system requirements concerned with human interaction.

{Designing User Interfaces}
Given a set of requirements, there are many sources of information that can guide design. There is information about the two sides of the user interface: the human information processor, and the input/output devices designed to interact with the human. There is the existing body of knowledge about user interface design, derived from human factors studies or from the influence of popular systems. There are standards, either governmental or de facto industry standards, that can have overriding influences on design. All these sources of information must be available to designers. In the absence of enough information, prototyping with rapid evaluation must be an option available to designers.

{Design Specification for User Interfaces}
In some cases, a formal specification of a user interface is needed. Specification may be needed as executable input to a user interface management system, to satisfy contractual agreements, or to communicate the design between groups such as designers and implementors.

{Implementation of User Interfaces}
Implementation of user interfaces concludes with the existence of an artifact that is usable by people. The use of tools to develop systems that are cost-effective and maintainable is a key concept of user interface development. In addition to including software running on a hardware configuration, the artifact includes training materials, online help, and printed documentation, whose development must be coordinated with that of the user interface software.

{Evaluation of User Interfaces}
The complexity of both humans and computing systems makes their interaction less predictable than we would like. Even the best intentions can result in unusable systems or, more often, in systems with problems. These problems must be discovered as early as possible so that they can be addressed cost-effectively. Some evaluation methods are models that predict aspects of usability from design specifications. The best-known evaluation methods involve data collection and analysis, and an understanding of the cost/benefit tradeoffs of the different empirical methods is critical if software engineers are to make sound decisions about which analyses will be used.
{User Interface Life Cycle Meshed with Software Development Life Cycle}
Software engineers must learn how to incorporate user interface development concerns into all aspects of software development. User interface development does not happen at any particular stage of software development. It interacts with requirements analysis when customers are best satisfied by seeing mockups of screens. It interacts with system design when different system designs imply user interfaces of different complexity, especially when the implications are difficult to anticipate. It interacts with system testing when end users perceive functional errors in a system implemented according to its specification.

{1. User Interface Design}
User interface designers try to satisfy the human requirements of a system by applying knowledge from many areas: cognitive psychology, input and output devices, guidelines and standards, dialogue types, and (because design knowledge is inadequate) prototyping methods. Both [Salvendy87] and [Helander88] contain many chapters on design; see the topics table at the end of the bibliography. This section necessarily deals with mundane aspects of design and does not give much guidance about how to be the first to come up with big design ideas like VisiCalc or the Macintosh. Copying big and little design ideas can have copyright and other legal ramifications [Samuelson89a, Samuelson89b]. Material on the semantic/syntactic models of design is discussed in [Shneiderman87] and in many chapters in [Helander88] (notably Chapters 2, 3, 35, and 38). At the risk of oversimplifying a complex process: design begins at the conceptual level of mental models, proceeds to the semantic level of functionality, continues at the syntactic level of system structure, and ends at the lexical level of input and output considerations. The product of user interface design is the design specification, either in written form or in the form of a prototype.

{1.1. Task/Requirements Analysis for User Interface Design}
Before a user interface is designed, the constraints on design must be known, and the output of requirements analysis must be used as input to the process. One aspect of requirements analysis is task analysis, which studies the system requirements concerned with human interaction [Fleishman84, Drury87, Rubinstein84, Thomas89]. Many of the methods discussed under user interface evaluation apply well to the analysis of user interface requirements. One particularly influential source of design information is an analysis of existing systems, performed either by conducting evaluation surveys or by drawing from experiences with systems. There are many chapters on task analysis in [Salvendy87] (including 3.1, 3.2, 3.3, and 3.4) and in [Helander88] (including 1 and 9). [USC89] is a software tool for aiding task analysis by coding sequences of events. The sociological context of systems can be considered part of the requirements for a system. The formation of the conference on Computer-Supported Cooperative Work (CSCW) by members of ACM SIGCHI attests to this. See Chapters 48, 49, and 50 of [Helander88] and sections of [Baecker87].

{1.2. Psychological Foundations for User Interfaces}
User interfaces must adapt to the capabilities and limitations of people, the human half of the user interface.
The information processing capabilities of humans are essential input to the design of user interfaces [Lindsay77, Card83, Monk84, Salvendy87], although these should not be addressed to the exclusion of social and motivational factors [Rushinek86, Salvendy87, Helander88]. User interface developers need to know about how human skill is measured and, in so doing, learn that psychological variables (like brightness and loudness) are relevant, while physical variables (like intensity) can be misleading.
TEACH: After instruction on psychology, software engineers should have a better appreciation of the complexity and unintuitive nature of the human information processor.
TEACH: Examples of how psychology can be applied to design should be a class demonstration and perhaps an exercise. The Landauer chapter in [Carroll87b] is a good source of examples of applications. The human engineering data compendium is an attempt to make behavioral data useful to practitioners [Lincoln88].

{1.2.1. Model of the Human Information Processor}
The model of the human information processor is covered in [Lindsay77], Chapter 2 of [Card83], and Chapter 2.2 of [Salvendy87]. It has input channels, memory and processing systems, and output channels; it helps organize the material on psychology.

[Figure: the human information processor coupled to a computer. The computer's output devices drive the human's sensors, and the human's effectors drive the computer's input devices. Sensory input enters a short-term sensory store; attention moves information into short-term (working) memory, which is maintained by rehearsal; elaboration stores information in long-term memory, and retrieval brings it back into working memory; performance drives the effectors.]

{1.2.2. Sensation and Perception}
The input capabilities of the human information processor mold the types of computer output devices used. Humans receive information in different modalities, but most relevant to user interfaces are vision and audition. Different modalities can be characterized by their sensitivity, dynamic range, speed of response, and acuity [Lindsay77] and [Salvendy87] (Chapter 2.1). Gestalt grouping principles use higher level perceptual processes and include principles of proximity, similarity, continuation, common fate, closure, and simplicity.
DEMO: For vision, the following might be covered: the light frequencies (colors) to which the eye responds (assuming no color-blindness), the speed at which visual information is processed (including the critical flicker-fusion frequency), and areas of maximum resolution (the foveal region). Some physical characteristics of eye movement are relevant here. Measurement methods and some psychophysical results can be shown. Some implications of sensory and perceptual capabilities for design should be discussed.
TEACH: Particularly compelling to students are demonstrations showing the unintuitive nature of sensation and perception. Most will not have seen their blind spot, many will have seen few visual illusions (color after-images are particularly relevant to screen design), and many will not believe that they cannot see color in their periphery.

{1.2.3. Attention and Performance}
Attention is the process of filtering incoming signals and using them as information. Attention can be selective, affected by expectations.
People can process more information in multi-modal presentations than in uni-modal presentations [Lindsay77].
DEMO: A reaction time demonstration of sound versus light stimuli versus both combined is good, and adding a choice reaction time task is good to demonstrate the effects of expectations.
Performance in human-computer interaction deals primarily with motor control (for keyboard entry, mouse movement) and, less often, with speech. Tasks like visual search or typing can be performed at different rates of speed and accuracy; the efficiency depends on skill level and speed/accuracy tradeoffs such as those in Fitts' law [Fitts67, Card83]. As skills become more practiced, performance becomes automatic and the need for attention drops; consequently, issues of response compatibility become less important than during initial skill acquisition [Salvendy87]. Repetitive tasks can result in fatigue and lowered performance. Action slips [Norman81] provide a notation for some types of errors. Human error is also discussed in Chapter 2.8 of [Salvendy87].
DEMO: A demonstration of the Stroop task is good for showing a salient example of bad response compatibility.
DEMO: A demonstration and classification of errors in rapid typing will give students some intuitions about motor program errors. With untrained but well-practiced typists, it is easy to show transposition errors and doubling errors.

{1.2.4. Learning and Memory}
The major memory systems are short-term sensory store, short-term or working memory, and long-term memory. These systems can be described in terms of their capacity, duration, modality of encoding, and methods of storage [Lindsay77, Card83] and [Salvendy87] (Chapter 2.4).
DEMO: For each memory system, there can be a class demonstration of some well-known experiment: STSS - the Sperling task, STM - Peterson & Peterson, LTM - [MillerG56]. Each demonstration will require careful explanation, with interactive feedback from students, to make sure the logic of the experimental method has been conveyed. [Poltrock88] is an easy-to-use experimental control package that can be used for class demonstrations. [Schneider88] is more sophisticated, but it requires more time to learn.
No one knows why people forget: whether it is because facts and skills have dropped out of long-term memory or because they simply cannot be retrieved. In the area of user interfaces, interference effects are especially relevant to the effects of incompatible designs: when a new system is incompatible with another, it is harder to learn the new one because of what has been learned (proactive interference), and learning the new system makes it harder to return to the old (retroactive interference). Strategies for learning such as accretion, restructuring, and tuning [Rumelhart78], and analogy [Rumelhart81] can be discussed in terms of knowledge representation formalisms like semantic networks. User conceptual models of systems are organizing themes for design [Norman86, Carroll87b]. The GOMS model [Card83] of higher level tasks like planning postulates that the user's cognitive structure consists of four components: Goals, Operators, Methods, and Selection rules.
TEACH: These topics also have applications to the design of online help and training materials. Although of high interest in cognitive psychology and artificial intelligence, topics like mental models and conceptual models may be received with blank stares because of the lack of immediate applicability to building software.
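As a concrete illustration of how such predictive models can be turned into arithmetic, the sketch below (in C) estimates the execution time of a small editing task with keystroke-level operators. The operator times are the nominal values published with the keystroke-level model in [Card83]; the task breakdown and the Fitts' law coefficients are illustrative assumptions, not measurements.

/* A minimal sketch of keystroke-level (GOMS-family) time prediction.
 * Operator times are the nominal values from [Card83]; the task
 * breakdown and the Fitts' law coefficients are illustrative only. */
#include <stdio.h>
#include <math.h>

#define T_K 0.20   /* press a key (average skilled typist), seconds */
#define T_P 1.10   /* point with a mouse, seconds */
#define T_H 0.40   /* home hands between keyboard and mouse, seconds */
#define T_M 1.35   /* mental preparation, seconds */

/* Fitts' law: movement time grows with the log of distance over target width.
 * The coefficients a and b must be calibrated for a real device. */
static double fitts(double a, double b, double dist, double width)
{
    return a + b * (log(2.0 * dist / width) / log(2.0));
}

int main(void)
{
    /* Hypothetical task: delete a word by pointing at it with the mouse
     * and pressing a "delete word" function key.
     * Sequence: M (decide), H (reach for mouse), P (point), K (click),
     *           H (return to keyboard), M (verify), K (function key). */
    double t = T_M + T_H + T_P + T_K + T_H + T_M + T_K;
    printf("Predicted execution time: %.2f seconds\n", t);

    /* A Fitts' law estimate, once calibrated, could replace the nominal T_P. */
    printf("Illustrative Fitts' law time for a 10 cm move to a 1 cm target: %.2f s\n",
           fitts(0.1, 0.1, 10.0, 1.0));
    return 0;
}

Even this crude a calculation lets two candidate designs be compared before either is built, which is the practical appeal of predictive models for software engineers.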
{1.3. Principles, Guidelines, Standards, and Rules for User Interface Design}
Principles, guidelines, standards, and rules represent the accumulated knowledge about user interfaces.

{1.3.1. Principles of User Interface Design}
Principles are general goals that can be useful for organizing a design. However, since principles do not specify methods of achieving goals, they have limited practical use.
TEACH: Who can argue with principles like "Be Consistent" or "Simplicity"? But they are hard to define and harder to enforce.

{1.3.2. Guidelines for User Interface Design}
Guidelines are general rules to follow in design. They need to be general if they are to be applied in many contexts. Guidelines can be derived from basic psychology or human factors studies, standards, or hard-earned "common sense." They benefit software engineers, who are often looking for quick answers, but they can be misinterpreted or misapplied -- especially when guidelines contradict each other because of tradeoffs -- therefore, examples accompany most guidelines. Guidelines even help human factors experts (who have human cognitive limitations like limited memory) by serving as a design checklist. There are so many possible guidelines that there are problems managing them, and special tools and techniques may be needed [Perlman88a, Perlman88b, Perlman89a, Fox89]. Guidelines may be used extensively in the design of tools for building user interfaces, where extra design effort is repaid several times over. [WilligesB84, Smith86a, Brown88] are primary sources of guidelines. [Rubinstein84, Shneiderman87] are also good sources.

{1.3.3. Standards for User Interface Design}
Standards are principles, guidelines, or rules that must be followed because of mandates or industry pressures (de facto standards such as the Macintosh toolbox, Microsoft Windows, or the IBM SAA/CUA standard interface) [Smith86b]. Standards are designed to protect the well-being or efficiency of users, or the product lines of developers. Standards are sometimes forced out prematurely, and they can exhibit commercial and political motivations. [Smith86a] is sometimes cited as a standard because some government contracts require it, but the best-known U.S. standards in the area of user interfaces are MIL-STD-1472C (particularly section 5.15) [DOD83], the newer MIL-STD-1472D [DOD89], DOD-HDBK-761 [DOD85], and NASA-STD-3000 [NASA87]. Other standards agencies include ANSI (US) and DIN (Germany).

{1.3.4. Rules Defined for Designing Specific Systems}
Rules are guidelines or standards with free variables specified so they are customized for a particular system [Mosier86]. Rules can produce product family resemblances. Rules are defined by gathering information relevant to specific parts of a user interface (e.g., a window manager) and choosing specific parameters for their implementation (e.g., format of titles, color, etc.) [Perlman88b]. Documentation of rules can be done with a styleguide [McCormick85]. Enforcement of rules can be a difficult task and should be automated with software tools (e.g., runtime user interface libraries using object-oriented definitions) when possible.
DEMO: There should be a class demonstration of how information from guidelines and standards is used in defining rules for a system.
Rules can be defined for part of a system, such as the format of error messages or the contents of window titles. It is important to have consistency of rules across parts of a system (some rules about error messages may constrain others about window title format -- they must be distinguishable).
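To make machine-enforced rules concrete, here is a minimal sketch (in C; the rule, the function name, and the message text are hypothetical, not drawn from any cited styleguide) of a single library routine that imposes one rule -- a fixed error message format -- on every part of a system that calls it.

/* A minimal sketch of enforcing a design rule in software rather than prose.
 * The rule chosen here (messages read "program: severity: problem: remedy")
 * is hypothetical; a real project would take its format from its own styleguide. */
#include <stdio.h>

typedef enum { NOTE, WARNING, ERROR } severity;

static const char *severity_name[] = { "note", "warning", "error" };

/* Every error message in the system is produced by this one routine,
 * so the agreed format (and any later change to it) is applied uniformly. */
void report(const char *program, severity level,
            const char *problem, const char *remedy)
{
    fprintf(stderr, "%s: %s: %s: %s\n",
            program, severity_name[level], problem, remedy);
}

int main(void)
{
    report("mailer", ERROR,
           "cannot open mailbox 'inbox'",
           "check that the file exists and is readable");
    return 0;
}

Because the rule lives in one routine rather than in scattered format strings, it cannot drift as different programmers add messages.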
{1.4. Input/Output Devices for User Interfaces}
Input devices (e.g., keyboard, mouse, microphone) and output devices (e.g., CRT, printer, speaker) connect the physical effectors of humans (hands, vocal cords) to the input channels of computers. Task demands, such as the need for hands-free or silent operation, affect the choice of input and output devices. Besides the more common input and output devices, devices like the head-mounted display and the data-glove allow the creation of artificial realities [Fisher88].
DEMO: This is a good time to show some videos or demos of common and novel forms of interaction.

{1.4.1. Input Devices}
Input devices specify the objects and actions of interaction. The logical equivalence of devices (different devices such as a mouse or cursor keys can be used for the same input tasks) has been pointed out in [Foley82], which gives definitions of the following logical devices: locator, pick, valuator, keyboard, and button. [Card78, Buxton86] point out that logically equivalent devices can cause different fluency and accuracy of interaction.
DEMO: Many software engineers have little experience with a variety of input devices, and at least some exposure is necessary. This can be done by demonstrations with a variety of input devices, or by video tapes that show input devices in use.
Input devices can be categorized as follows:
{1. Voice/Speech} [Helander88] Chapter 14, [Salvendy87] Chapters 2.10 and 11.5
{2. Digital Keys} [Helander88] (Chapter 21), including:
* Buttons
* Keyboards (compare the Sholes (QWERTY), Dvorak, and alphabetic layouts)
* Function Keys
* Cursor Keys
{3. Pointing Devices} There are many pointing devices available, and the choice of one over another can be a function of taste, user sophistication, or the task to be performed. Some pointing devices include:
* light pen
* mouse
* touch screen
* track ball
* touch pad
* puck in rink
* pen and tablet
* joy stick
* swivel pad
* thumb wheels
{4. Multifunction Devices} Some devices, like multidimensional joysticks and the data-glove, allow a variety of interactions.

{1.4.2. Output Devices}
[Foley82] contains a good overview of the properties of different graphical output devices. Sources in [Salvendy87] (Chapters 2.10 and 11.5) cover speech devices. Different output devices for user interfaces include:
{1. Indicator Lights} which indicate state
{2. Character CRT} for which dot matrix characters and legibility [Gould87] are issues
{3. Color Screen}
{4. Hard Copy}
{5. Graphics Display} for which computer graphics techniques dealing with human factors problems (e.g., jaggies and antialiasing) must be considered and linked back to psychological foundations. Some graphic displays, like the stereoscopic head-mounted display, allow new levels of visualization reality.
{6. Video Tape and Disk}
{7. Tone Generator}
{8. Voice Synthesizer and Digital Voice Recordings} (see [Helander88] Chapter 15, [Salvendy87] Chapters 2.10, 5.2, and 11.5)

{1.5. Dialogue Types for User Interfaces}
Question and answer allows users to provide inputs to a series of questions and is best suited to simple interactions with untrained users.
Form filling allows users to provide several inputs at once and is well suited to tasks where many values must be seen at one time or when the order of input may vary.
Menu selection allows users to select from a list of alternatives and is well suited to interactions with untrained or occasional users [Kiger84, Perlman84, Perlman85b] and [Helander88] (Chapter 10).
Function keys allow selection by pressing a labeled button instead of choosing from a displayed list. Labels may be on the keys if they are dedicated, or on a template or a portion of a display device [Sherwin86].
Command language allows skilled users to compose commands. In command languages, there are issues of command names [Barnard88] and syntax [Perlman84, Radin84]. Some common command languages are the DOS command language and the UNIX shell, both of which have associated programming languages, and some versions of which have command line editing with a history of previous commands. See [Helander88] (Chapter 11).
Query language is a special case of command language that allows users to request information. Particular to query languages are issues of making Boolean expressions usable by people [Williams82]. See also [Helander88] (Chapter 12).
Natural language allows untrained users to speak or type inputs [Rich84]. Simplified forms of natural language, such as aids for constructing sentences, may be used because of the difficulty of implementation [Tennant83]. See [Helander88] (Chapters 13, 14, 44).
Graphic interaction allows users to provide inputs by direct manipulation of displayed objects such as icons [Shneiderman83, Hutchins86] and [Helander88] (Chapters 6 and 17).
DEMO: Examples of the different forms of interaction should be shown. Widespread interfaces like that of the Macintosh, Lotus 1-2-3 menus, forms interfaces like Query-by-Example, command languages like those on UNIX, and others should be demonstrated, showing their strong and weak points.

{1.6. Enhanced/Adaptive Interaction}
Methods of adaptive interaction do not fit well into any one area of design. Issues of customization (for compatibility across systems, or abbreviation within systems) and user programming should be addressed when considering the transition from novice to expert in parts of systems. See the discussion about different types and levels of users in the section on implementation of user guidance. Methods of customization include options set interactively or from startup files, programming-by-example, or saving sequences of keystrokes (macros). Recognition of abbreviations, spelling correction like DWIM (Do What I Mean), command and file name completion (type part of a name and the system supplies the rest), and command line editing (with a built-in history of previous commands) are all ways to improve throughput (a sketch of name completion appears at the end of this section). Expert user interfaces may be able to adapt to individual users by monitoring command usage and making frequent commands more easily executed.
TEACH: Adapting to individual differences and providing aids for the handicapped are other types of adaptation to users. See [Helander88] (Chapters 24, 25, and 26).
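The sketch below (in C; the command table and the function name are hypothetical) shows the core of the name-completion idea mentioned above: the user types a prefix, and if exactly one known command matches, the system can supply the rest.

/* A minimal sketch of command-name completion: given a typed prefix,
 * return the unique matching command, or NULL if the prefix matches
 * nothing or is ambiguous.  The command table is hypothetical. */
#include <stdio.h>
#include <string.h>

static const char *commands[] = { "copy", "compare", "delete", "directory", "print" };
static const int ncommands = sizeof commands / sizeof commands[0];

const char *complete(const char *prefix)
{
    const char *match = NULL;
    int i;
    for (i = 0; i < ncommands; i++) {
        if (strncmp(prefix, commands[i], strlen(prefix)) == 0) {
            if (match != NULL)
                return NULL;        /* ambiguous: more than one match */
            match = commands[i];
        }
    }
    return match;                   /* NULL if nothing matched */
}

int main(void)
{
    printf("del -> %s\n", complete("del"));                      /* delete */
    printf("co  -> %s\n", complete("co") ? complete("co") : "(ambiguous)");
    return 0;
}

In a real system the ambiguous case would usually prompt with the candidates rather than fail silently.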
{1.7. Prototyping User Interfaces}
The complexity of user interfaces and our lack of knowledge about how to build them force us to explore the design space by prototyping. The acknowledgement of this deficiency in our abilities is apparent in catchy phrases like design for redesign and iterative design. There are limited-functionality prototypes and limited-performance prototypes, and these allow us to design the front end of a system and evaluate it without building the whole system. Parts of user interfaces commonly prototyped are screen layouts, and command syntax and names (which show functionality). These are mocked up and evaluated using the methods in the evaluation section of this module. Care must be taken to avoid biasing the evaluation with the parts not being accurately prototyped. For example, a database front end might appear faster than it will be when connected to a real database system, or, in the other direction, slow screen-presentation software might affect ratings of system acceptability. For more complete treatments of rapid prototyping, see [Helander88] (Chapter 39) and [Melkus88].
There are many tools for prototyping screen layouts; a well-known one is [Bricklin86] (for DOS), a tool that can be used only for prototyping and not for implementation. Also useful is the Macintosh-based Prototyper [SmethersBarnes88], which gives non-programmers access to the Mac toolbox. HyperCard has many easy-to-use programming capabilities that make it a good prototyping tool [Goodman87, MillerD88]. More comprehensive development systems are mentioned in the section on user interface management systems. [Tullis87] reviews many screen building systems for the PC. See Chapter 18 of [Helander88] and Chapter 5.1 of [Salvendy87]. On the NeXT machine [Webster89], the program-by-example Interface Builder has been shown to be an effective tool for building highly graphical user interfaces [Thakkar90].
One of the most intuitive techniques for prototyping is called the Wizard of Oz method, which places a trained person in charge of user interface operations [GreenP85]. For example, to simulate a natural language voice-input dictation system, a user speaks into a microphone that is monitored by a skilled typist typing the sentences at a terminal in another room. The technique allows the evaluation of systems that have never been built, or ones that are beyond current technology.

{1.8. Specification of User Interface Design}
Most often, a user interface design is documented with a combination of pictures, prose, prototypes, and anything else that will reduce the need/opportunity for implementors to make design decisions. Informal design specifications may include a discussion of the philosophy of the design of the system and its parts (e.g., objects and actions) and may include more detailed descriptions of the semantics of functions. In some cases, a formal specification of a user interface is needed to disambiguate the many low-level design decisions encoded in design rules. Specification may be needed as executable input to a user interface management system, to satisfy contractual agreements, or to communicate the design between groups such as designers and implementors. See [Jacob83a, Jacob83b, Jacob86, Bass85, Chi85, GreenM85a, GreenM86, Ehrich86, Wasserman85] for examples of specification languages.
TEACH: This topic is closely related to user interface management systems, which are discussed under user interface implementation. A thorough treatment of specification and UIMS in a broad module like user interface development is not feasible, but it should be considered for an advanced course or seminar.

{2. User Interface Implementation}
There are many reasons for bad user interfaces [Perlman86]. Tools for building user interfaces address many of these problems but are lacking because they require more than the usual interest, time, effort, and expertise in user interface development.
The promise of user interface implementation tools is that ideas used in hand-crafting isolated user interfaces can be abstracted and reused. The extra effort put into tools will be rewarded by increased programmer productivity because the same tools can be used by many programmers. The proliferation of use of a tool has a positive impact on end users because reuse in several applications leads to increased consistency, and because extra human factors effort can be put into the design of tools. The rewards of adding functionality to or improving a widely used tool are also multiplied. See [Perlman88c, Foley88a, Fischer89, Myers89a].
User interface implementation deals primarily with software, but also with user guidance materials such as online and physical documentation. One fundamental concept of user interface software implementation is the independent treatment of the application software, input and output devices, and different types of users. Also critical is ease of use, not only for end users but also for developers. To adapt to different types of users, developers can implement different dialogue types using object-oriented or data-abstraction techniques, and reuse them in interaction libraries. The structure of dialogues can be implemented in several notations, and interleaving of dialogues can be managed by window system managers. User interface management systems (UIMS) combine all these concepts.

{2.1. Fundamental Concepts of User Interface Implementation}

{2.1.1. Independence and Reusability}
The concept of independence recurs in user interface implementation [Pfaff85, SIGGRAPH87, Hartson89]. One goal is to be able to use the same tools for the development of user interfaces to many applications (programs such as editors, database systems, bulletin boards), so the tools should promote application independence. Another goal is to be able to attach a variety of input and output devices to an application without modifying that application, so tools should promote device independence. A third goal is to be able to put different types of users in front of the same application by using different dialogue types and without having to build special interfaces for new and experienced users, so tools should promote user independence. These three types of independence all promote reuse of effort put into tools; they are keys to economical maintainability of software. If a new application, device, or user comes along, relatively minor effort can accommodate the tools to the change. Independence is especially desirable in a distributed environment, where users interact with applications on other machines.

{2.1.1.1. Application Independence} is the separation of the application program (the software that does a real task) from the user interface software (the software that lets people do the task). In its strongest form, application independence requires that user interface tools have no specific knowledge about specific applications; the application communicates with the user interface software through an application interface. The application interface may have the application as an active agent, controlling the user interface software, or the roles may be reversed. The application interface may consist of a set of function calls between the application and the user interface tool, or there may be a special-purpose language or some other method of control.
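One common way to realize such an application interface is for the application to hand the user interface layer a callback, so that neither side contains code specific to the other. The sketch below is a minimal illustration of that separation (in C; the function names and the toy "application" are hypothetical).

/* A minimal sketch of application independence: the "user interface" side
 * knows nothing about the application except a callback it was given, and
 * the application knows nothing about how the command text was gathered.
 * All names here are hypothetical. */
#include <stdio.h>
#include <string.h>

typedef void (*command_handler)(const char *command);

/* --- user interface side: gather a command, pass it on ----------------- */
static void ui_session(command_handler handle)
{
    char line[128];
    printf("command> ");
    if (fgets(line, sizeof line, stdin) != NULL) {
        line[strcspn(line, "\n")] = '\0';   /* strip the newline */
        handle(line);                       /* hand the text to the application */
    }
}

/* --- application side: interpret the command text ---------------------- */
static void mail_application(const char *command)
{
    if (strcmp(command, "send") == 0)
        printf("(application) sending message\n");
    else
        printf("(application) unknown command: %s\n", command);
}

int main(void)
{
    /* The same ui_session could drive a completely different application. */
    ui_session(mail_application);
    return 0;
}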
Data can be passed between the application and the user interface tool using procedure call parameters, symbol tables, shared databases, or some other form of IPC (interprocess communication). Both tool and application must be designed anticipating the form of the interface. The communication of control information and data can be accomplished through a command language, which is easy for a user interface manager to generate and easy to add as a front end to most applications.
DEMO: For example, many DBMSs can be controlled with a command language such as SQL, and many UIMSs (e.g., [Vo85]) can generate text in the form of a command language.
The user interacts with the user interface software, which takes inputs, validates them, provides help as needed, and constructs commands to pass on to the application. When application-independent tools are used, the application can be modified without changing the user interface tools, and the user interface tools can be modified without changing the application. In particular, the same user interface tools can be used to build interfaces to several applications, promoting ease of learning of additional systems [Wiecha89]. There are potentially great costs of application independence, both in terms of usability and in development costs for tools. It is possible that when tools are designed to be adaptable to any application, they are not well suited to any in particular. The cost of developing a reusable tool is higher than for custom code.

{2.1.1.2. Device Independence} is the isolation of the details of particular devices from the application and some user interface tools. Input and output are done through an abstract protocol that is not tied to any device-specific codes. The device-specific interface, called a device driver, encodes the protocols unique to a device, so that other tools can communicate with the device using a device-independent protocol. To add a new device to those supported, only a new device driver must be added. Often, this can be done by specifying a table of control codes or, at worst, writing a set of interface routines. Applications can then communicate with a device in abstract user interface terms, such as those in [Foley82], and read input events such as "select" or "point" and write outputs such as "alarm" or "display", without even knowing the modality of the interaction (it may be on a screen with a mouse, or on a phone with touch-tones). See [Myers89b] for advances in this area. Device independence permits the separation of the logic (structure) of a user interface from the presentation, and is sometimes called presentation independence. There may be performance penalties for device independence, and uncommon features of devices may not be used because the device interface may define the smallest common subset of features.
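A device driver layer of this kind is often implemented as a table of operations that every device must supply. The sketch below (in C; the operation set and the two "drivers" are hypothetical) shows the idea for output devices that can at least display text and sound an alarm.

/* A minimal sketch of device independence: applications call device_display()
 * and device_alarm(), and only the driver tables know device specifics.
 * The operation set and both "drivers" are hypothetical. */
#include <stdio.h>

struct output_driver {
    const char *name;
    void (*display)(const char *text);
    void (*alarm)(void);
};

/* A screen "driver" ... */
static void screen_display(const char *text) { printf("%s\n", text); }
static void screen_alarm(void)               { printf("\a"); }

/* ... and a line-printer "driver" with no audible alarm of its own. */
static void printer_display(const char *text) { printf("PRINT: %s\n", text); }
static void printer_alarm(void)               { printf("PRINT: *** ATTENTION ***\n"); }

static const struct output_driver screen  = { "screen",  screen_display,  screen_alarm  };
static const struct output_driver printer = { "printer", printer_display, printer_alarm };

/* Device-independent calls used by applications and user interface tools. */
static void device_display(const struct output_driver *d, const char *text) { d->display(text); }
static void device_alarm(const struct output_driver *d)                     { d->alarm(); }

int main(void)
{
    const struct output_driver *dev = &screen;   /* chosen at run time */
    device_display(dev, "File saved.");
    device_alarm(&printer);
    return 0;
}

Adding a new output device then means writing one more driver table, with no change to the applications or to the rest of the user interface software.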
{2.1.1.3. User Independence} is the ability to adapt to different types of users. The most commonly used distinction is that between novice and expert users, but that can be misleading. A novice may lack knowledge about the domain of the application, about the particular program being used, or about the use of computers in general. The deficiency may be broad, or in a specific new area, such as a new subject, an untried feature in a program, or a new user interface style. Thus, user interface tools that support user independence must do more than provide menu, command language, graphical, and natural language interface styles. They must support dynamic transitions between styles, to provide the level of assistance appropriate to the task and the user.

{2.1.2. Ease of Use}
Tools for user interface development must promote ease of use, both by end users and by programmers. User interface tools not only need to connect applications, devices, and users, but they must also be designed to be easy for programmers to use. A critical aspect is responsiveness of tools, both for programmers and end users [Shneiderman87, Rushinek86].

{2.1.2.1. Tool-End User Interface}
End users of an application do not want to know how the user interface to the application was constructed, but they would like to be able to assume that the user interface developer designed for their needs and included features such as input validation and editing, online help, menus on demand, and so on.

{2.1.2.2. Tool-Programmer Interface}
It should be easy for programmers to use tools to make good user interfaces. If it is difficult, then tools will be under-utilized, and programmers, users, or both will suffer. It is ironic that some tools for user interface implementation are not themselves easy to use. Examples of implementations would help most programmers. Most issues of user interface design, implementation, and evaluation apply to the development of usable tools. Like end-user interfaces, tools should provide efficient access to resources. Access involves knowing what resources are available and how to use them, so documentation and online help are important. Efficiency involves speed and accuracy, so issues of batch compilation versus interactive interpretation, response time, and error correction are important. Also critical for the success of tools is the level at which development is done. Tools can be built from:
{a function library} in a general-purpose programming language. There are low-level libraries, such as those for graphics or screen handling [Joy80, Arnold81]; higher-level libraries with menus and windows [Apple85, Scheifler86]; and very high-level libraries that provide guidance in user interface construction [Schmucker86a].
{a special-purpose programming language} (which should have many general programming language capabilities such as procedural and data abstraction, arithmetic, and variables) [Vo85, HayesP85, Shulert85, Sibert86, Wasserman86].
{an interactive editor} (to allow easy prototyping, especially by non-programmers) [Wasserman86, Henderson86b, Hill86, Hix86, Myers86, Goodman87, Litvin87].
{an application-specific user interface programming language or editor} that implements design rules such as in-house standards [Perlman86].

{2.2. Interaction Dialogue Types}
An interaction dialogue type (also called a style) is a method for prompting for and getting control input from users. To adapt to different types of users or tasks, many user interfaces should implement more than one dialogue type. A common pair is the command language (for experts) with the menu hierarchy (for novices). Another, becoming more popular, is the graphical interface (for most users) with function keys (for quick command execution). Some of the more commonly used dialogue types are listed below. The choice depends on the type of user and the task (see the section on Design). For each dialogue type, there is a need for "hooks" to connect user interface software with the application.
Much of this information can be abstracted, kept in a database from which data items are extracted, and filled into dialogue type data structures that are managed by interaction libraries. These data structures should allow reuse and thereby promote consistency. They can be defined by abstract data types or by object-oriented techniques [Cox86, Schmucker86a, Schmucker86b] and share definitions by inheritance. See [Vo85, HayesP85, Shulert85] for a discussion of user interface dialogue data types.
Form filling - A form consists of: an id, a title, help text, and a series of fields. A field consists of: an id (often a variable name), a prompt, help text, and information about the value to be entered into the field, such as type, range, default value, and current value. Typically, form filling is done on a screen (it could be done over the phone); and for screen-based systems, information about video attributes and location is included. In addition to these basic parts of a form, there can be entry and exit conditions and actions for the form and its fields. Some examples: A form or subform may not apply in some situations, and a failed entry condition will inhibit its activation. When a form is entered, the action of displaying all or some fields might be taken, and the current values for fields might be displayed. When a field is entered, the action of modifying its appearance might be taken to tell users it can be modified, and an input editing buffer from an interaction library would be activated; when the field is left, the appearance would revert. Validation for field values might be coded as exit conditions for individual fields, while inter-value validation might be coded as an exit condition for the whole form.
Menus - A menu is similar in structure to a form and consists of: an id, a title, help text, and a series of options. Each option might have a title, a selector string or method, and an action (e.g., function call, setting of a variable) to be taken if the option is selected. Often, the display of the menu and options is mixed with the logical definition, but that is not necessary [Vo85].
Function keys - A function key is similar to a menu option, except that selection is done by typing a key. Most general interaction libraries provide simple functions for defining and displaying function key settings.
Command language - Although the design of a command language is a complex task, its implementation is usually simple when there is access to language development tools such as lexical analyzers and parser generators.
Natural language - This is an area that has not fared well as a practical solution to user interface problems, partly because of the difficulty of implementing efficient systems. Implementation is often based on AI techniques such as augmented transition networks. See the methods used in [Tennant83] to provide substitutes for natural language interfaces.
Graphic interaction - Tools for the development of graphical or direct manipulation interfaces are part of many window toolkits discussed later. Their implementation is most often based on the object-oriented paradigm, where graphical objects represented by icons are acted on by events caused by the user, the application, or other objects.
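The form and menu structures just described map naturally onto abstract data types. The sketch below (in C; the particular members are a guess at a minimal useful set, not the definitions of any cited library) shows what such reusable dialogue-type structures might look like.

/* A minimal sketch of dialogue-type data structures for form filling and
 * menu selection, following the parts listed above.  The exact members are
 * illustrative, not taken from any particular interaction library. */
#include <stdio.h>

struct form;                                        /* forward declaration */
typedef int (*field_check)(const char *value);      /* nonzero if acceptable */
typedef int (*form_check)(const struct form *form); /* inter-field validation */
typedef void (*action)(void);

struct field {
    const char *id;              /* often a variable name */
    const char *prompt;
    const char *help;
    const char *default_value;
    char        current_value[64];
    field_check validate;        /* exit condition for the field */
};

struct form {
    const char   *id;
    const char   *title;
    const char   *help;
    struct field *fields;
    int           nfields;
    form_check    validate;      /* exit condition for the whole form */
};

struct option {
    const char *title;
    const char *selector;        /* string typed (or key pressed) to select */
    action      act;             /* taken when the option is selected */
};

struct menu {
    const char    *id;
    const char    *title;
    const char    *help;
    struct option *options;
    int            noptions;
};

/* Tiny usage example: a two-option menu whose logical definition carries
 * no display information; layout would come from the interaction library. */
static void do_save(void) { printf("saving\n"); }
static void do_quit(void) { printf("quitting\n"); }

static struct option file_options[] = {
    { "Save the document", "s", do_save },
    { "Quit the program",  "q", do_quit },
};

static struct menu file_menu = { "file", "File Menu", "Choose an action.",
                                 file_options, 2 };

int main(void)
{
    printf("%s has %d options\n", file_menu.title, file_menu.noptions);
    file_menu.options[0].act();
    return 0;
}

Keeping the logical definition separate from its presentation is what lets the same menu be rendered on a character terminal, in a window, or even over a phone.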
{2.3. Interaction Libraries}
Interaction libraries contain a variety of reusable tools for building user interfaces. These tools establish and enforce standards for programmers to follow throughout a system. Often tools are built using other tools, so derivative tools also follow conventions. A good example is an input editing buffer that may contain functions to modify the contents of the buffer and functions to display the contents. These input routines may be used to gather all input from users, whether it be input in a form, selection in a menu, or typing in a command line. Reusing the same tool allows concentrated effort in its development, with two benefits: consistency and increased functionality. Continuing the example, a help button could be tied into the input routines so that help could be requested uniformly from a variety of dialogue types. Furthermore, having a single input routine simplifies storing usage protocols and replaying previously stored sessions for evaluation, for user guidance demonstrations, and for macro definition. Interaction libraries are tools for which increased design and evaluation efforts are warranted and rewarded, so they are one area of user interface development where exhaustive application of user interface design guidelines is cost-effective. The following list contains categories of libraries that can contribute to the development of user interface software. Of particular note are window packages that contain many sublibraries.
TEACH: There is little in the way of standard structure across libraries. For example, some libraries will include special error routines, while others will provide programming methods for implementing them.
Input validation and conversion routines test type and range validity for common and specialized types such as integers, real numbers, dates, times, and file names. They enforce any conventions about acceptable formats. Conversion functions may provide the user with added flexibility (e.g., there are dozens of ways of specifying dates, many of which take into account the current date), or a strict-but-fair rule of one format may be enforced. Pattern matching, spelling correction, and automatic command completion capabilities may be added as part of a general library.
Multi-level help and error management software can provide a single interface for the application programmer to direct help information and a consistent method for the user to access it. The software may adapt to the types of output devices, perhaps using highlighting or different modalities to present critical information.
Command, query, and natural language parsers may be used to simplify and standardize the interface to line-oriented input. Editing and history may be built in.
Knowledge representation, symbol tables, or database systems may be used to store and retrieve dynamic information to allow the parts of a UIMS to communicate.
Window libraries are comprehensive toolkits that manage the presentation layer of user interfaces. Such libraries can be viewed as several sublibraries for mouse control, icons, cursors, dialogue control (radio buttons, check boxes, menus), windows (clipping, scroll bars), graphics (lines, shapes, painting, colors, including implementations of graphics standards such as GKS and CORE), audio generation, text, input edit buffers, fonts, terminal emulation, file management, and process control. Notable examples are the libraries on the Macintosh [Apple85, Rose89], Microsoft Windows on the IBM PC, the X Window System [Scheifler86], and the PostScript-based SUN NeWS system. More recent additions are discussed in [Greenberg89, HayesF89, Linton89, Petzold89, Seymour89].
Most routines in a window library apply to a single window or process, and provide a programmer with a virtual device. Issues of interleaved activities in a multiple-window or multiple-process system are transparent to the programmer at this level. Such issues are the domain of window system managers, discussed under models of dialogue-control structure. It is sometimes difficult to separate the window library from the window manager, especially as programming becomes more graphical [Webster89].

{2.4. Dialogue-Control Structure Models}
A dialogue structure model describes the possible sequences of control and the overall structure of a user interface. It can be used as a method for formally specifying a user interface, possibly for input to a UIMS that uses it to generate or control a user interface (see the Foley paper in [SIGGRAPH87]). The specification of the logical structure of a user interface (the software architecture) can be separated from the specification of the presentation; this is a fundamental concept used in user interface management systems [Vo85]. [GreenM86] surveys the following three dialogue models and concludes that the event model has the greatest descriptive power, although which supports the easiest specification or implementation is not known.
Event-based models organize the user interface around events caused by the user (e.g., input), by the application (e.g., arrival of mail), or by information inside part of the user interface. When an event is detected, the actions taken depend on the state of the information held in the user interface. This model is used in many highly interactive systems using an object-oriented approach (e.g., [Apple85, GreenM85b, Hill86], and many others).
State transition models organize the user interface around the set of all possible user interface states (such as input mode in a text editor) and transitions between states [Jacob83a, Jacob83b, Jacob86, Wasserman85, Wasserman86].
Language grammars, such as a context-free grammar, can describe the language (i.e., all possible syntactically correct inputs) used by the user to communicate with the computer [Reisner81].
Window managers [Apple85, Scheifler86] share limited resources, such as screen space and the CPU, among several asynchronous processes. To the user of a window system, there are multiple threads of interleaved activities. One important issue concerns switching the window that is connected to a shared resource such as the keyboard or, in some cases, multiple input devices [Hill86]. From the point of view of the developer, the multi-window aspects can be hidden and often ignored, and development can be done with one virtual device, with the window manager handling the client-server application-window relationship.
Different dialogue types can be used for controlling the sequence of actions in an interactive session or for data entry. At any point during interaction, there can be interruptions in the normal flow of control; these should be integrated into user interface software (i.e., hooks provided for the insertion of application information).
* HELP - Help can be requested about the user interface (e.g., how do I use the editing commands?) or about the task (e.g., what does this command do?).
* QUIT - There should be a clear way for the user to leave the system, and the procedure should guard against the loss of work.
* SUSPEND/RESUME/RESTART - There can be a way for a user to suspend activity and later resume it in the same state. There can be a way for the user to restart a session (e.g., re-initialize).
* CANCEL/UNDO - There should be a way for users to cancel a command. For potentially destructive commands, there should be a method to undo changes. Systems differ greatly in how much can be undone, but some undo is much better than none, even though unlimited undo is the goal.
* REDISPLAY/REVIEW/REDO - There should be a way for users to redisplay what may have been missed. There should be a way for users to review what has been done and redo previous actions, possibly after some editing.
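An event-based dialogue model of the kind surveyed above is usually realized as a central loop that reads events and dispatches on their type, and interruptions such as HELP and QUIT fold into the same dispatch. The sketch below is a minimal illustration (in C; the event set, the stand-in input function, and the handlers are all hypothetical).

/* A minimal sketch of event-based dialogue control.  The event set and the
 * handlers are hypothetical; a real system would receive events from a
 * window system or interaction library rather than from the keyboard. */
#include <stdio.h>

enum event_type { EV_KEY, EV_HELP, EV_QUIT };

struct event {
    enum event_type type;
    int key;                       /* meaningful only for EV_KEY */
};

/* Stand-in for the window system or library call that delivers events. */
static struct event next_event(void)
{
    struct event ev = { EV_KEY, 0 };
    int c = getchar();
    if (c == EOF || c == 'q')      ev.type = EV_QUIT;
    else if (c == '?')             ev.type = EV_HELP;
    else                           ev.key = c;
    return ev;
}

int main(void)
{
    for (;;) {                     /* the dialogue-control loop */
        struct event ev = next_event();
        switch (ev.type) {
        case EV_KEY:
            if (ev.key != '\n')
                printf("application handles key '%c'\n", ev.key);
            break;
        case EV_HELP:              /* interruption: uniform help access */
            printf("help: type characters; ? for help, q to quit\n");
            break;
        case EV_QUIT:              /* interruption: clean exit */
            printf("quitting\n");
            return 0;
        }
    }
}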
* CANCEL/UNDO - There should be a way for users to cancel a command. For potentially destructive commands, there should be a method to undo changes. Systems differ greatly in how much can be undone; but some undo is much better than none, even though unlimited undo is the goal.
* REDISPLAY/REVIEW/REDO - There should be a way for users to redisplay what may have been missed. There should be a way for users to review what has been done and redo previous actions, possibly after some editing.

{2.5. User Interface Management Systems}

A user interface management system (UIMS) is a collection of tools and techniques for the rapid development of user interfaces. Central to the UIMS is the specification of the user interface independently of the application. User interface management is somewhat of a misnomer, because no current systems come close to managing any aspects of the construction or runtime interaction of user interfaces; user interface development tool is a more appropriate term [Hill86]. Fundamental to the UIMS approach is the easy-to-use, reusable tool for developing easy-to-use user interfaces, independent of applications, devices, and users. A UIMS supports the development of assorted interactive dialogue types for data entry and sequence control, and it supports the specification of dialogue structure. Underlying any UIMS is an interaction library, a toolkit of routines for building user interfaces. Extensive and thoughtful consideration of the issues of the structure and capabilities of UIMSs is covered in [Molesworth86, Morris86], in several articles from the 1986 Seattle workshop on software tools for user interface management [SIGGRAPH87], and in the proceedings from the 1983 Seeheim workshop [Pfaff85]. Some papers on specific systems and experiences include [Bass89, DeSoi89, Foley88b, Foley89, Hartson89, Wiecha89].

[Figure: model of human information processing -- the computer's output and input devices connect to human sensors and effectors; information passes through a short-term sensory store (via attention) into short-term (working) memory (maintained by rehearsal) and moves to and from long-term memory through elaboration and retrieval.]

{2.6. User Guidance Integrated into User Interfaces}

User guidance is as much a part of user interface implementation as putting menus on screens. Any change to the user interface will affect the form and content of the instructions to be provided to users. Since these instructions are part of the user interface, this material must be covered in any documentation. Guidance includes documentation, online prompts and error messages, training materials, and any instructional aids. Research on different types of guidance can be found in [Wright83, WilligesR86b, Felker81, Houghton84], in chapters 8.2-8.6 and 11.7 of [Salvendy87], and in chapters 16, 27, and 28 of [Helander88]. The documentation effort should be organized to include the expertise of technical writers. Part of this organization involves physically separating the text in prompts, error messages, and longer help information such as procedures from the application and, as much as possible, from the user interface software.
This allows technical writers to work on system documentation while other people work on other system parts, and it makes consistency checking easier. Separation also facilitates activities such as translation to foreign languages. Technical writers should be involved early in development because they will detect complexity and inconsistency from a user's point of view rather than from a software engineer's, partly because they are providing or documenting much of the part of the user interface closest to users.

DEMO: Good and Bad Examples of Guidance:
    ?  --  a cryptic error message.
    No such file or directory  --  where is this message coming from? What file?
    Are you sure (y/n)?  --  automatic confirmation can be dangerous. Sure of what?

{2.6.1. Levels/Types of Users}

It is common to distinguish between two types of users: novices and experts. While it is true that some users know more than others, the distinction is flawed in two ways: first, there is a gradual increase in competence from novice to expert; and second, there may be parts of a system where a user is an expert and parts where the user has had no experience. Different users have different amounts of experience with computers in general, with the particular system being used, with the specific program, or with the domain of application of a program (e.g., they might be highly skilled accountants). Finally, although a user might at one time have been expert with a particular program, if he or she is an occasional user, some memory refreshing might be needed. Those complications aside, it is still convenient to refer to a user who needs guidance in a specific situation as a novice and to one who needs little or none as an expert. Types of users may include:

* Expert/novice computer users
* Specific system novices/experts
* Specific program novices/experts
* Domain/task experts/novices
* Occasional users

The type of guidance offered should depend on the needs and capabilities of users, and guidance design should be based on how people learn (e.g., by imitation, by asking questions, by seeing examples, by analogy). High proficiency should not be required, just as low proficiency should not be assumed or forced (such as in menu-only systems). Multiple dialogue types should be provided, and customization should be allowed. Early training of users might include browsing hierarchies of capabilities [Perlman84] or finding help by keywords with synonyms.

{2.6.2. Levels/Types/Formats of Help}

There are several types of help, such as prompts, error messages, tutorials, and reference materials. Different levels of detail can be provided for each type of help: none, little, extensive, etc. And for each combination of type of help and level of detail, several media formats can be used: online, paper, video, voice, or some combination. There is a distinction between help with completing a task and help with using the system (such as with user interface tool conventions). Tools must be designed to provide help with their own use and to provide application-specific hooks for application-specific help; these must be accessible using the same conventions. For sophisticated help systems, help on help must be provided.

{2.6.2.1. Levels of help}

For a given user in a given situation, there is some optimal level of detail of guidance to provide. An expert in a task may require very little, while a novice probably requires more.
User profiles, possibly utilizing records of command usage, can be used to predict the level of detail of help; however, users should be able to override these assumptions and, in most cases, request progressively more detailed explanations. It is important to avoid the problem of information overload and not try to provide too much information too soon.

{2.6.2.2. Types of help}

Help can be provided to tell the user what can be done next, to indicate the current system status, and to tell what has been done (wrong) before [Nievergelt80].

{Prompts} Consistent wording and format are important:
* What information is being requested?
* What form should the input take?
* What will be the effect of typing ENTER or clicking on OK?
* How can the user get help?

{Status information and feedback}
* General information - time and date, arriving mail, current directory/program/system, system/network load, print queue status
* Task/process information - current activity, prompt for input, warning of pauses, percent/amount done indication, priority of process, consumed resources

{Warnings and error messages}
* Content - Identify the source of the problem (e.g., command or program). Indicate the severity of the problem (e.g., comment, warning, fatal). Provide diagnostic information (allowing for more detail on request). Provide prescriptive information (including a method to edit or undo).
* Presentation - Be timely, brief, noticeable, and neutral (do not assign blame). Maintain context (don't clear the screen or cover the problem). Consider requiring acknowledgement (e.g., lock the keyboard). Allow review of messages. Allow incorporation into the task (e.g., syntax errors and spelling errors should be available for correction). [Felker81]

{2.6.2.3. Formats of help}

Many different forms of media can be used to provide user guidance. Help can be provided online, in printed form, or through non-traditional media such as video or audio. As the quality of computer displays improves, the distinctions among these forms blur, and newer systems combine help from many media.

{Printed materials} include information on getting started, function key templates, reference/cheat sheets, maps, examples, tutorials/demos, and reference manuals.

{Hypertext/hypermedia systems} show the uses of nonlinearly structured text in widespread commercial systems such as Guide and HyperCard [Goodman87] and in comprehensive online help systems such as Concordia [Walker88]. See [Conklin87] for a good overview of hypertext.

{Video} is necessary for highly interactive and visually oriented systems. Get a camera with a pause button and auto-focus, design a demo, and record the demo one action at a time. The computer program is the best of actors, always willing to do the same scene the same way. Even an amateur video will convey more useful "look and feel" information than the best of printed documents.

{Online tutorials and self-demos} can guide users through exercises that give them task-specific experience. Programs instrumented for record and playback options are particularly well suited to the economical creation of demonstrations.

{Expert Advice Givers}, which build a representation of the user and the task, show promise in being able to offer the right amount of the right advice at the right time [Carroll87a].

Phone lines and user support groups are among the most appreciated forms of help for commercial products.
Part of the process is to reproduce, as closely as possible, the user's environment so that the most relevant information can be provided. User support is expensive; it should be used to collect evaluation information to improve system design and to reduce the amount of support needed.

{2.6.3. Coordinating Help with User Interfaces}

All forms of help must be consistent in format and content:
* Prompts
* Error messages
* Online help
* Printed documentation
* Application functionality

They must be coordinated by using a single source of information to insert into the places in the software and documentation where the help will be used [Perlman85c, Perlman86, Perlman89b]. When text and design decisions are stored in a central location, such as a database, technical writers can access the text, human factors specialists can modify the design, and software engineers can work on application software, all working independently of one another.

{3. User Interface Evaluation}

User interfaces connect two complex systems: humans and computers. Mistakes will be made in designing and implementing user interfaces because the people developing them have limited knowledge of how humans operate and limited capacity to address all the known constraints on design. Evaluation of user interfaces is therefore a critical part of user interface development, just as testing is critical to software development in general. Evaluation should not be viewed as a final validation of a grand design; it should be integrated into design prototyping and into various stages of implementation. The sooner a problem in a part of a user interface is discovered, the more inexpensively it can be fixed. In the future, if user interfaces are not evaluated, questions of product liability may arise instead of the current practice of blaming human error.

User interfaces can be built and then evaluated empirically by collecting data on their use. More attractive is the possibility of evaluating a design specification using predictive models -- before a system is built and without the need to collect data. Both approaches are covered in this section of the module.

{3.1. Empirical Evaluation of User Interfaces}

Empirical evaluation of a user interface involves collecting data on the user interface (e.g., usage patterns, subjective ratings) and interpreting those data in terms of prioritized evaluation criteria. Criteria such as usability or attractiveness cannot be measured directly, so operational definitions of measures must be defined and evaluated for validity and reliability. When data are collected, it is usually not possible to measure all possible cases repeatedly, so samples must be taken -- with care -- to obtain unbiased and representative samples. After data are collected, they must be analyzed so that the results can be summarized (descriptive statistics) and so that the results cannot be attributed merely to sampling (inferential statistics). Finally, the conclusions based on empirical evaluation must be fed back into the design-implementation-evaluation loop.

TEACH: The material in this section draws heavily from applied statistics, particularly those methods used in experimental psychology. The methods are standard and covered in many textbooks, and I have found [Guilford78] to be an adequate text.
[Shneiderman87] covers many practical aspects of collecting user interface evaluation data, but more complete treatments can be found in chapter 10.1 of [Salvendy87] and in chapters 36 and 41 of [Helander88].

TEACH: This area is new to most software engineers, so it is important to avoid creating impractical expectations about what should be learned. This may be especially true for students who have taken mathematical rather than applied statistics. If they come out of the course realizing that their intuition is poor and that data can be enlightening, then there has been some success. For the rare software engineer who will do research on user interfaces, a much greater degree of understanding is required.

TEACH: Students will make many mistakes when trying to use empirical techniques. They will create poor measures, collect biased data, use improper analyses, and make wild conclusions. One of the most important lessons they can learn here is that their knowledge is limited in this area and that they should be conservative in the methods used and conclusions drawn. They should use the instructor as a statistical consultant and be warned that making mistakes while using a consultant is excusable, but becoming a quack is irresponsible.

{3.1.1. Measurement of User Interface Success}

Without measurement, success is undefined; so any empirical evaluation begins with the definition of useful measures [AERA85]. The criteria for defining the success of a user interface depend on what it is used for, who is using it, how often it is used, and so on. For example, in the operation of a nuclear power plant, certain types of errors must be minimized, even at the cost of throughput (number of operations per time unit), while in a video game, the opposite may be the case. When a user interface is being designed, the criteria for its evaluation must be decided.

TEACH: The input of evaluation to design is a prime reason why evaluation should be taught before design.

TEACH: Software engineers will have had little experience defining measures. Exercises should be done in which they define a measure, indicate its scale of measurement, describe how they would establish its validity and reliability, and discuss how they would draw conclusions based on it. It is likely that the first attempts will be failures, perhaps because they are trying to measure variables much fuzzier than CPU cycles, so some prodding may be needed to avoid student frustration. I have found it useful to have students write a critique of a peer's work; then I grade the original and the critique. This lets me see how well students are able to define reasonable measures and how well they can evaluate research that they may see in the future.

{3.1.1.1. Operational Definitions of Measures}

Psychological variables relevant to user interface design, like difficulty of learning and use, cannot be measured directly as can physical measures such as time or distance. Instead, we must make an operational definition (sometimes called a proxy) for a measure in terms of variables we can measure directly, such as time to complete a task. Operational definitions have been called a translation of reality into numbers. Measures can be categorized as performance measures, subjective measures, and/or composite measures.
Performance measures include: error rates on selected tasks or over specified periods of use, learning time to be able to answer selected questions or perform certain tasks without referring to a manual, task completion time for selected tasks, amount of work done in a fixed time, the location and duration of eye fixations, number of references to a manual, and so on.

TEACH: Students may argue about the meaning of performance measures, but they should understand that these measures are not as potentially biased as subjective measures.

Subjective impression measures usually are ratings and include: rated aesthetics, rated ease of learning, rated throughput, stated decision to purchase, and so on.

Composite measures are usually weighted averages of any of the above measures, some of which themselves are weighted averages. Care must be taken to avoid grouping too many measures of questionable validity or reliability because of the tendency for people to attach greater importance to complex composite measures.

DEMO: A simple example of a composite measure is to define efficiency as throughput divided by the number of errors. Throughput is the number of tasks divided by the total time. The ambiguity of what is a task and what is an error should be discussed until students have defined them.

{3.1.1.2. Scales of Measurement}

Measures can be qualitative or quantitative, although many qualitative measures become quantitative when aggregated through grouping and counting. Measures can be:

* Categorical/Nominal - in which data fall into one of several discrete categories.
* Rank/Ordinal - in which data are ordered but differences between data are not comparable.
* Interval - in which data are ordered and differences are comparable but ratios of numbers are not meaningful.
* Ratio - in which there is a meaningful zero point.

The higher the scale of measurement, the stronger the inferences that can be made. The goal is to capture as much information as is practical by using the highest scale of measurement.

DEMO: The following are different versions of questions that get at the same information. Each subsequent form gathers more information with roughly equal respondent effort.

    Is the amount of help good?  yes  no
    (gives no idea of the extent of goodness or badness)

    Is the amount of help good?  VG  G  ?  B  VB
    (gives no idea of why the amount of help is good or bad)

    Rate the amount of help provided:  Too little  1 2 3 4 5 6 7  Too Much

{3.1.1.3. Validity of Measures}

Validity cannot be established directly. A measure is valid if it makes sense -- or more formally, if the inferences based on it are sensible. There are several types of validity, each with its own method of establishment. A measure is said to have construct validity if it includes some latent construct (e.g., learnability or searchability). Measures with construct validity should correlate highly with other measures of the same construct (convergent validity) and should not correlate with (be independent of) measures of other constructs (discriminant validity). A measure has content validity if it covers, as decided by experts in an area, specific content areas thought to be important. For example, an evaluation might have content validity if it uses data collected from all parts of a system. The most commonly used method of establishing validity is by seeing if other people believe it, which is unfortunate but economical. It also stresses the need to favor simple over complex measures, because people can more easily understand simple measures.
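As a small illustration of convergent validity, the following sketch (hypothetical data and a hand-rolled Pearson correlation; not part of the module) correlates a performance measure with a subjective rating of the same sessions. A strong correlation -- negative here, since longer task times should go with lower ease ratings -- would support the claim that both measures tap the same construct.

    /* A minimal sketch of checking convergent validity: correlate one
       performance measure (task completion time) with a subjective rating
       of the same sessions.  The data values are hypothetical. */
    #include <stdio.h>
    #include <math.h>

    static double pearson(const double x[], const double y[], int n)
    {
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];  sy += y[i];
            sxx += x[i] * x[i];  syy += y[i] * y[i];  sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;     /* n times the covariance     */
        double vx  = sxx - sx * sx / n;     /* n times the variance of x  */
        double vy  = syy - sy * sy / n;     /* n times the variance of y  */
        return cov / sqrt(vx * vy);
    }

    int main(void)
    {
        /* Eight users: minutes to finish a benchmark task, and a 1-7 ease rating. */
        double minutes[] = { 12, 9, 15, 7, 11, 14, 8, 10 };
        double rating[]  = {  3, 5,  2, 6,  4,  2, 6,  5 };
        printf("r = %.2f\n", pearson(minutes, rating, 8));
        return 0;
    }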
TEACH: Examples abound in the area of measures. Avoid sports measures, like batting averages, because they are not of interest to everyone, especially foreign students. Give examples using measures in which all students have an interest, such as SAT or GRE scores, or with which most students will be familiar, such as the unemployment or inflation rate. Many students will -- and should -- become upset about the lack of rigor used in most statistics reported to them.

{3.1.1.4. Reliability of Measures}

A measure is reliable if repeated measures under the same conditions yield the same, or similar, results. Reliability differs from variability of scores throughout a population; it is normal for things to differ, but a measure is unreliable if measures of the same thing are variable. Reliability also differs from the accuracy or precision of measurement. Measures automated by computer usually are highly reliable (as well as accurate), while subjective ratings may not be reliable. Reliability is established by correlating repetitions of the same test or questions, perhaps in alternate forms. In tasks where measures are derived from expert raters observing performance, inter-rater reliability can be established.

{3.1.1.5. Sensitivity of Measures}

A measure is sensitive if it is able to detect meaningful changes in what is being indirectly measured. Measures that do not have a wide enough range (e.g., a rating scale with only two categories) may fail to represent meaningful differences. Measures that are not end-anchored (i.e., that lack anchor values outside the range of expected ratings) may exhibit floor or ceiling effects (i.e., all scores are at one end of the scale) and may not be treated as though on an interval scale.

{3.1.2. Collecting Data on User Interface Usage}

Too few user interfaces are evaluated by any method other than having some software engineers agree that the interface looks good. Software engineers must learn that they should define objectives for a system and collect data rather than make dangerous assumptions. The most common problem in empirical investigations is the lack of forethought about what will be done with the data after they are collected. Data collection begins with a method (how the data will be collected), a design (what conditions will be studied -- experimental design), and a plan (who will provide the data, how much data will be collected -- sampling issues). The method, design, and plan affect the eventual analysis of the data and the conclusions that may be drawn from it.

TEACH: It is difficult to learn about experimental design issues without having a good understanding of the methods for data collection. However, emphasizing the methods of data collection over the design can encourage people to collect the easiest data and not the most relevant.

TEACH: Certain types of data require instrumentation of the user interface software. This is a prime reason why evaluation should be taught before implementation.

There are several methods for collecting data on user interfaces: people may be interviewed, protocols (such as video) may be collected, survey questionnaires may be used, and programs may collect data on their own use. The methods affect the design and plan for data collection; some situations are not amenable to collecting controlled data or the large amounts of data required for certain designs. Different methods provide different types of information and have different costs; these tradeoffs are discussed below.
The cost of collecting data can be controlled with prudent planning. Pilot tests can find ambiguities in instructions, determine the need for practice by subjects, and help decide if the tasks are too easy or too hard.

{3.1.2.1. Expert Commentary, Customer Interviews}

Talking to people interested in the success of a system is one of the most obvious ways of gathering evaluation data. Expert commentary is too often a way of saying "We software engineers asked our peers." Even though peer comments are valuable, they should not be the only source of expert commentary (they are a biased sample), and should certainly not be the only source of evaluation information. Human factors specialists, graphic designers, and technical writers can provide qualitatively different insights, and they can be drawn into the evaluation process for major evaluations and minor suggestions. If experts are readily available, they should be consulted often, perhaps through electronic mail or bulletin boards. If experts must be brought into an organization, then time and funds must be allocated when planning a software development project.

One highly relevant source of information is the customer. Customers can make suggestions that will help adapt systems to their needs, and they can be very good at indicating what they do not like. Addressing negative reactions to the user interface is best done during development and worst done after delivery. When customers are asked for their opinions, they are more involved in the system design, and they enjoy seeing their suggestions incorporated in the delivered version. This is not to suggest that seeking customer input is gratuitous, but that the user interface developer must acknowledge and deal with personal and political issues.

Instead of -- or in addition to -- interviews, personal evaluation information can be obtained in a less personal way, using questionnaires. These can help reduce the interpersonal problems of interviewers leading interviewees and of interviewees trying to please interviewers. Questionnaires also can collect data from more people than interviews, and in most cases, they can collect more interpretable data per person. However, questionnaires require more planning and are less adaptive; a questionnaire can miss large problems, even when it ends with a question like "Are there any other comments?" Less structured, but perhaps even richer than interviews, are records of people using the software (e.g., video protocols).

{3.1.2.2. Video/Audio Protocols of Users}

User interface protocols are records of natural use of software. These protocols can be reviewed to find problems. Most software engineers have never seen their users use their software, and much of the motivation for user interface evaluation can be taught by having students sit and silently watch people use their own and others' software. Many software developers cannot restrain themselves when they see people making mistakes that nobody would ever make, and often claim that the users are deliberately trying to make them look bad. Watching users and trying to interpret what they are thinking is time consuming and often ineffective. Having users think aloud is a procedure to get users to explain what they are trying to do, and why, and what problems they are having doing it. [Ericsson84] is an excellent source on the analysis of thinking-aloud protocols. It takes practice to learn how to get users to think aloud.
If users stop verbalizing, the observer should prompt them by asking non-directive questions like "What are you thinking now?" and "What are you trying to do?" in a non-evaluative tone. Observers should avoid asking biased questions. Observers should not help users unless the users are hopelessly lost or becoming frustrated; the observers will not be with end users when the product is released [Knox89]. It is important for users to feel at ease and to understand that they are not being evaluated; the system is. Sometimes, observers will leave the room and observe from behind two-way mirrors, which provide some feeling of privacy, even though users know they are being watched. Often, video or audio recordings are made for later analysis or for demonstrations (videos often contain compelling anecdotes, critical incidents, that make good ammunition for convincing others of problems). See [Mackay89] for an extensive treatment of the use of video as data.

Analysis of protocols can involve transcription of the protocol into a word processor, segmentation of the protocol into units such as sentences, and encoding of the protocol into a reduced vocabulary [USC89]. Such detail is not always necessary, because in many cases the protocol is collected to document critical instances of behaviors like errors or complaints to be used to provide feedback. The analysis of protocols is done by trained raters, so it is important to try to avoid using potentially biased observers and to check for inter-rater reliability. Protocols may be quantified by identifying specific situations, such as positive or negative comments, various types of questions, references to user manuals, and so on. Often the "results" will be obvious -- users may have misconceptions about part of a system.

DEMO: A simple setup will make for a memorable class demonstration. A video camera is pointed at a screen, possibly from the side to show the user's face, and a microphone is put on the user. A self-confident, serious student should be used as the user. A system for which there are several problems should be selected, possibly a prototype developed by the class instructor, or there may be little to comment on. Students will get most of the point during the live action, so video recording equipment can be omitted if unavailable, or audio equipment may be used in its place. When replaying, coding should be demonstrated and saved for later analysis.

{3.1.2.3. Survey Questionnaires}

Survey questionnaires can be used to gather large amounts of data economically. Survey questionnaires require a lot of planning, or they require a lot of extra work later. In the worst case, they provide a lot of bad data. [Perlman85a] discusses the stages of conducting a survey in the context of online design, gathering, and analysis. In designing a questionnaire, the question formats used should gather as much quantitative information as is practical, given that respondents will not want to complete a long or complex questionnaire. There are several commonly used question types: Thurstone, Likert scale, semantic differential, multiple choice, True-False, numerical value, or free-form. After question selection, the format of the questionnaire should be decided. Issues of the perceived length of the questionnaire, the ease of providing answers, and the ease of transcribing answers should be considered. Preliminary encoding might include collation of free-form answers and tabulation of others, with a no-answer category for missing data.
A pilot survey should be run to detect problems. Questionnaire data can be collected and analyzed electronically [Chin88].

{3.1.2.4. Program Instrumentation}

It is relatively easy to instrument programs -- to put code in programs to monitor how they are used. This is especially true if such instrumentation is anticipated and all input goes through one routine, which can be monitored. The recorded information includes input like keystrokes and mouse movements, and it can include timestamps. Often, the information is called an audit trail and is stored in an audit file or "dribble" file (because the information dribbles in). Input can be monitored continuously or for special cases, such as requests for help or error conditions. The anonymity of users should be protected, and permission to monitor should be obtained (see Chapter 7 of [Solso84] or [APA82] for ethical issues). It is easy to collect too much information, so that data that are never examined take up space or adversely affect program response time. On the other hand, care must be taken to avoid filtering information and losing data that might be useful later. A special case of program instrumentation is to allow user comments, which can be saved in a file or mailed electronically to developers.

TEACH: It is remarkable that most developers have no data on what parts of a system are used. When trying to improve program speed, we monitor usage and look at a profile to see where to place effort. Similarly, in user interfaces, improvements should be made to the most important parts; but unlike the case of speeding up a program, lack of use of a part of a system should itself be looked into.

DEMO: A simple method of instrumentation is to attach a front end to a set of programs. This can be done by using command scripts (e.g., UNIX shell scripts or DOS batch files) that append the date and time to a file every time a command is run.

Audit information can be used in simple ways, such as to tell how often various commands are used, if at all. In looking for complex patterns of usage, there is no limit to the possible sophistication of analyses. Audit information can be used to play back sessions, possibly using timing information to give a realistic feel (even more so if accompanied by an audio protocol), or replayed at an observer-prompted pace. Such playbacks can also be used for tutorial purposes. Unless audit information is supplemented with other data, it is not easy to tell why an error occurred. Information from audit trails can be correlated with other measures, such as ratings, to get converging evidence to establish the validity of measures.

{3.1.2.5. Experiments and Experimental Design}

Experiments, in the formal sense, allow the inference of causality because we can observe the differential effects of controlled conditions on variables we wish to measure. Laboratory experiments, in which few variables change between conditions, can be costly and may have limited applicability because of artificiality. However, laboratory experiments have clearer results and have more of a chance of being applicable to a broad class of designs. Still, laboratory experiments require much more sophistication in experimental methods than is justifiable for software engineers and will not be covered in detail in this module.

TEACH: Some experimental control packages, which greatly reduce the time to conduct laboratory experiments, are available for inexpensive machines like the IBM PC.
APT is an easy-to-use experimental control package that can be used for class demonstrations [Poltrock88]. MEL, the Micro-Experimental Laboratory, is more sophisticated but requires more time to learn [Schneider88].

More naturalistic experiments, in which competing versions of systems or different interaction styles are compared, often have so many variables changing between conditions that it is difficult to infer which are relevant. Given the costs and benefits of experiments, it is unlikely that many software engineers will conduct laboratory-style experiments, but it is still important for them to understand experimental logic and practical methods of inferring causes, and to know what inferences are valid under which conditions, both in their own work and in evaluating the research of others. Part of learning about the logic of experiments is the terminology, most of which will be new to software engineers. There is a one-to-one relationship between the design space used in designing user interfaces and the factors and conditions that have been or might be investigated in experiments. Sometimes, experiments have been done in areas closely related to the design decisions that must be made, but often the user interface designer tries to predict the effects of complex interactions of factors.

In experiments, independent variables are variables that are controlled by the experimenter; in system design, making a design decision selects a level or condition of some variable or factor of interest. Dependent variables are ones that may depend on the independent variables; in user interfaces, they are the measures of success of the system. In experimental studies, subjects (test users) are assigned to conditions and data are collected. The logic of experiments is the inference that certain independent variables affect certain dependent variables (e.g., color coding affects search time and subjective ratings): variables not of interest are held constant among conditions, those of interest are systematically manipulated (controlled), and the effects of the manipulation on the dependent variable(s) are observed. If two conditions are identical except for one variable, then differences in measures of behavior in those conditions can be attributed to that variable. A common use of experimental logic is to devise benchmark tasks -- tasks exercising many or important parts of a system -- that are completed by comparable groups on different systems, or by different groups on comparable systems.

TEACH: Running experimental studies involving the systematic control of variables depends on experience, so the teacher must be a consulting tutor who gives continuing advice during any exercises. The following discussion about confounding (lack of control) and methods for avoiding confounding demonstrates some of the need for experience.

It is possible that an experimenter thinks that two conditions are comparable, except for a difference of independent variables, but that there are confounding variables contributing to observed differences between conditions. There are many possible sources of confounding in user interface experiments -- factors not of direct interest to the experimenter -- and several precautions are in common use.

* Order effects occur when it makes a difference whether some condition is done early or late in measurement. Typically, tasks done early in testing are slower and more prone to error. Tasks done late in testing may be affected by user fatigue.
* Carry-over effects occur when it makes a difference if one condition follows another. A prime example is learning text editor commands: the first system is learned without the interference of knowledge of a previous system.
* Experience factors affect results when the people in one condition have more or less relevant experience than those in others. This does not apply when experience is a factor being studied. Observed superior performance may mistakenly be attributed to experimental factors when group differences provide an adequate explanation. A special type of experience difference, a carry-over effect, occurs when a person is tested in one condition, learns information or skills relevant to other conditions during that testing, and is then tested in one of those other conditions.
* Experimenter/subject bias occurs when the experimenter systematically treats some subjects differently from others, or when subjects have different motivation levels in ways not relevant to the study.
* Other uncontrolled variables, such as time of day or system load, may have systematic (non-random) effects on the data.

Some commonly used precautions and remedies for confounding include the following:

* Randomization and counterbalancing are used to control for order effects; and to some extent, counterbalancing can be used to study carry-over effects. Random assignment to conditions is used to ensure that any effects due to unknown differences among users or conditions are random. Even when randomization is used, there are dangers of carry-over effects. A simple example of counterbalancing in a two-condition experiment is to test half the users in condition I first, and the other half in condition II first. For experiments with one independent variable, but possibly several conditions, different permutations of condition order can be used; and to systematically control the transition orders between pairs of conditions, Latin squares can be used.
* Repeated measures are taken of the same conditions and make order effects a systematically manipulated variable. In user interface studies, repeated measures allow the study of learning.
* Matched groups are used to remove unwanted differences between groups of subjects in different conditions. Matching can be done on a group-average basis or on a single-person basis, for example, by taking the N most experienced users and assigning them (randomly perhaps) to N different groups, then taking the next N, and so on. A special case of matched groups is when subjects are matched to themselves. These so-called within-subjects designs can be used instead of between-subjects designs only when carry-over effects are not anticipated.
* Control conditions/groups are used as a basis for comparing experimental groups. For example, when a new version of a system is designed, two prototypes might be evaluated, but their success should be measured against the baseline performance of the original system. The two "improved" systems may be worse than the original, and there is no way of determining that without a control condition.
* Experimenter/subject blind means that the experimenter or subject (the user being tested) does not know which condition is being tested. Experimenter blind helps remove the possibility that experimenters will treat subjects differently because of preconceived hypotheses. Subject blind, which can include lack of knowledge of the purpose of the study, can help remove the possibility that subjects will act differently because they know they are in specific conditions.
* Statistical control is used to control for differences after data are collected. Methods like analysis of covariance are more remedies than precautions. Quasi-experimental designs use matching of groups in observational data; but because assignment to conditions may not be random, there is still a danger of confounding due to other factors.

Experimental design involves choosing factors and levels of factors of interest, and devising a plan for collecting data to make conclusions about the factors. It is unfortunate that the area is called experimental design, because the methods apply equally well to observational methods of data collection. During data collection, it is possible to collect more than one measure (dependent variable) per condition; such studies are called multivariate.

DEMO: [Tullis86] gathered two measures per condition: search time and subjective rating.

It is also possible to systematically manipulate more than one independent variable or factor in the same study; such studies are called factorial designs.

DEMO: [Perlman85b] studied menu search and manipulated the number of items (5, 10, 15, 20), the type of items (numbers or words), and the order of items (random or sorted), making a total of 4x2x2 (16) different conditions.

When factors are crossed (all levels of one factor are combined with all levels of another), as opposed to nested, the number of distinct conditions is the product of the number of levels of all factors. The multiplication of levels of factors is the basis of the name factorial design. As more factors are added, there is more potential for a factorial explosion of conditions that would make it impossible to collect enough data, although the number of variables used can be reduced by pre-experimental analysis [Beaudet89]. Factorial designs provide us with interaction information about how the level of one factor affects the level of another.

DEMO: The menu item selection experiment in [Perlman85a] shows a cross-over interaction. Letter menu selectors can be the best or worst selectors, depending on how they are paired with what they select.

In multivariate and factorial experiments, design and analysis get complex, so statistical consultants should be used.

{3.1.2.6. Sampling Techniques and Issues}

We cannot observe every possible user under all possible conditions to look for problems in user interfaces. Instead, we sample, or more formally, draw a sample from a larger population to which we hope to generalize. A population is something we do not know about but want to draw conclusions about, such as users, system features, or test cases. Population, here, is used in the traditional statistical sense and can mean a group of users, a group of features in a system design, a set of input data to a system, and so on. We use statistical methods to help us draw valid conclusions about the data we have collected, and these methods make use of information about the sampling techniques. We must be careful to have a large enough sample, and one that is not biased.

* Unplanned samples are used when data are collected without any plan. They should be avoided because of the danger of sampling bias.
* Regular samples are drawn at regular intervals from a population, such as every 10th customer. There is usually little chance of bias here.
* Random samples are drawn by randomly selecting from a population. Although statistical procedures usually assume random sampling, they are robust against most violations, which is good because random samples are rare.
* Stratified/representative samples are drawn by first dividing the population into groups, or strata, and then sampling from those strata acknowledging their relative sizes in the population. This method, used extensively in political polling, can be used to sample different classes of users or to emphasize the most used parts of systems.
* Biased samples occur when a sample systematically under- or over-represents part of a population, so that the sample is not a true indicator that will allow valid conclusions to be drawn about the population. Bias can occur in user interface evaluation if a developer collects data on peers instead of on people more like eventual end users. Sampling bias is not experimenter bias, which occurs when the experimenter biases the conditions under which data are collected.
* Small samples occur when not enough data are collected to make any inferences about a population. Either there are not enough data to reach statistical significance or there are not enough to convince others.

{3.1.3. Analyzing User Interface Data}

There is nothing special about data collected on user interfaces, so ordinary statistical methods apply. Primarily, we want to summarize the data collected to make sense of them. Then, because of sampling issues, we should apply inferential statistics to check the generalizability of our results. Both descriptive and inferential statistics are automated by statistical computing.

TEACH: For people experienced in applied statistics, it may be hard to recall how foreign this subject is to the novice. Continuous "hand-holding" is required to allow students to use and interpret simple descriptive statistics. Students should understand the types of errors that can occur in hypothesis testing and the relationship between sampling and hypothesis testing. Unless students have had previous training in statistics, it is unreasonable to assume that they will understand the basis for significance testing, let alone its application. Students should be warned that this is a critical area in which to obtain consulting help.

{3.1.3.1. Descriptive Statistics and Graphs}

Data collected from samples from a population are used to estimate the true values in the population. Even in small studies, the amount of data can be overwhelming; descriptive statistics and graphs are used to summarize the trends in the data. Most statistics books like [Guilford78] cover descriptive statistics well. [Tufte83, Cleveland85] are two good sources on graphical presentations of data. The types of statistics and graphs depend on the scale of measurement. Categorical data are grouped and counted, while quantitative data may be averaged.

* Frequency distributions show, for any particular value or range of values, the number of times a value was collected. Frequency counts may be shown graphically as histograms. Data from different conditions may be compared by showing different histograms or by presenting frequency information in tables. For factorial designs, data can be compared in multiway cross-tabulations. The information in frequency distributions may be presented as proportions, to allow the comparison of trends in samples of different sizes.

DEMO: Examples of different shapes of distributions should be shown, including normal, uniform, bimodal, and skewed. Time measures are usually positively skewed.
* Measures of central tendency -- or more simply, averages -- show the location of a distribution of values. There are measures like the mean, median, midpoint, and mode, as well as more exotic measures; and there are cases where each is the best indicator of the center. The centers of distributions from different conditions are usually compared instead of comparing all the scores in each. Centers of different conditions can be shown graphically in bar plots and line plots and, for factorial designs, in factorial plots.

DEMO: The effects of outliers on different averages should be shown. The location of the mean, median, and mode should be shown in a skewed and a bimodal distribution.

* Measures of dispersion show the range of values in a distribution around its center. There are measures like the standard deviation, quartile deviation, and range, as well as more exotic measures, but the standard deviation is used most often because of its advantageous mathematical properties. Measures of dispersion give us an idea of the size of a difference between measures of central tendency, in standard units. Measures of dispersion are often used to indicate the precision of an estimate of central tendency. Dispersion is often shown in graphs as standard deviation or standard error marks around the mean.

DEMO: Examples of high and low dispersion, given the same mean, should be shown, along with equal measures of dispersion, given different means. The effects of outliers on different measures of dispersion should be shown.

* Measures of association show the degree of relationship between two or more variables. The most common measure is the Pearson product-moment correlation (usually just called the correlation), but there are also measures of association for ranks (e.g., the Spearman rho and the Kendall tau) and for joint-categorical distributions. Association is often best shown graphically in a scattergram.

DEMO: Examples of strong and weak, positive and negative correlations should be shown. There should be examples that demonstrate the problems with linear correlation, such as a zero correlation for a quadratic relationship and a high linear correlation for a power function. The use of transformations on nonlinear data, followed by linear correlation, gives an adequate introduction to the concepts of nonlinear correlation and functional modeling.

* Confidence intervals combine information about an estimate and its sample variability, usually a mean and its standard error. They provide a method for conveying the precision of any estimate and for comparing estimates.

TEACH: Students will have trouble understanding the logic behind a confidence interval formula if they do not have a background in probability distributions. For a first course in user interface development, of which evaluation is only a part, confidence intervals are best left out; students can be taught to report the mean, standard deviation, and sample size.

{3.1.3.2. Inferential Statistics}

When comparing statistics from different conditions, we usually want to conclude that there are differences between conditions. We expect that sample averages will differ, even when drawn from the same population, because of random sampling. For practical reasons, we draw samples of subsets from populations, and there are chances of unlucky samples even when care has been taken to avoid bias.
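The sampling variability just mentioned is easy to demonstrate. The sketch below (all numbers are arbitrary assumptions, and the code is not part of the module) simulates one population of task times and prints the means of several small random samples drawn from it; the scatter of those sample means around the true population mean is exactly the variability that inferential statistics must take into account.

    /* A minimal sketch of why sample averages differ even when drawn from the
       same population: repeatedly draw small random samples from one simulated
       population of task times and print each sample mean.  Population
       parameters, sample size, and seed are arbitrary assumptions. */
    #include <stdio.h>
    #include <stdlib.h>

    #define POP_SIZE    1000
    #define SAMPLE_SIZE 10
    #define N_SAMPLES   5

    int main(void)
    {
        double population[POP_SIZE];
        srand(17);                               /* fixed seed for repeatability */

        /* Simulated population: task times spread uniformly between 5 and 25. */
        for (int i = 0; i < POP_SIZE; i++)
            population[i] = 5.0 + 20.0 * rand() / (double) RAND_MAX;

        for (int s = 0; s < N_SAMPLES; s++) {
            double sum = 0.0;
            for (int i = 0; i < SAMPLE_SIZE; i++)
                sum += population[rand() % POP_SIZE];   /* random sampling */
            printf("sample %d: mean = %.2f\n", s + 1, sum / SAMPLE_SIZE);
        }
        /* The true population mean is about 15; the sample means scatter
           around it even though every sample came from the same population. */
        return 0;
    }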
In hypothesis testing, we assume that there will be no differences among conditions -- this is called the null hypothesis, sometimes denoted Ho -- and that any observed differences in sample statistics are due to chance. Mathematical statistics provides us with the tools to compute the probability, p, of an observed difference among statistics, assuming the null hypothesis (and "other" assumptions). The correct test to use depends on how the data were collected, on the scale of measurement of the data, and on the degree to which the data violate the "other" assumptions. If this probability, p, is low, it casts doubt on the null hypothesis. If p is lower than some level called alpha, then we reject the null hypothesis at the alpha significance level and conclude that there are real differences. If there really are no differences among conditions, and we reject the null hypothesis and conclude there are differences, then we have committed a Type I error. The probability of such a false alarm is alpha, and it is chosen by the hypothesis tester, who must weigh the cost of a Type I error. On the other hand, if we find that p exceeds alpha, then we do not reject the null hypothesis. That is not to say that we accept the null hypothesis, because we may not have enough power to reject it because of lack of data, variability of data, weakness of effect, or a low alpha level. If there is an effect of interest that has not been detected, then we have committed a Type II error, whose probability is called beta. In general, we cannot know the value of beta unless we make assumptions about the size of the effect we wish to detect.

                                  DECISION
                        Effect                 No Effect
                        (Reject Ho)            (Fail to reject Ho)
    TRUTH  Effect       correct decision       Type II error
           (Ho false)   power = 1 - beta       p = beta
           No Effect    Type I error           correct decision
           (Ho true)    p = alpha

Showing the decision matrix used in any statistics book for hypothesis testing will help students understand and remember. It is important for students to understand that the decisions they make about conditions are based on sample data and may be wrong. In the face of inconclusive results, they should have some intuitions about when to collect more data or use other methods to increase power.

{3.1.3.3. Statistical Computing}

Almost all statistics are computed using statistical packages, many of which run on affordable computers. Statistical packages allow fast analysis by people who do not know the details behind the statistics being computed, so care must be taken when using them. Generally, the packages have components for data management (including database access and data transformations), data analysis (including descriptive and inferential statistics for particular types of data), and data presentation (i.e., graphics and tables).

TEACH: The established packages like BMD/P, SPSS, SAS, and MINITAB all have versions running on PCs, although they can be expensive. A package called SYSTAT runs on both the Macintosh and the IBM PC and offers excellent accuracy; there is a free promotional subset of SYSTAT, called MYSTAT. |STAT [Perlman87b] is a package on UNIX and PCs that offers most of the analyses that would be used, with the advantage that it can be copied free for nonprofit use, so students can take it with them.

{3.1.4. Interpretation and Evaluation of Data Analysis}

TEACH: Students must learn how to interpret their own data and the empirical studies of others. Students should learn that data are facts that cannot be argued with, but there can be disagreements about the interpretation of the meaning of data.
There are many ways that empirical research can lead to faulty conclusions, and these are covered in detail in [Campbell74]. When empirical research is evaluated, all stages of empirical research must be scrutinized. There are different stages at which empirical studies can falter: data collection, data analysis, and interpretation. The later the stage of the mistake, the more that can be salvaged. Some questions that must be answered satisfactorily are:

* Are the measures valid? If the measures used are not good indicators of what the study purports to measure, then the whole study is suspect.
* Are the measures reliable? If the measures used are not reliable, then are the sources of unreliability a possible source of confounding?
* Have proper experimental controls been applied? If there are clear or possible confounding variables, can parts of the study be salvaged, or must a new study be done?
* Have representative samples been taken? If the samples chosen are biased, then how does this affect the generality of the conclusions that can be made?
* Have appropriate counterbalancing measures been taken? If counterbalancing is not used, some confounding order and carry-over effects may go unnoticed. It is also a sign of lack of experimental design skill.
* Have any data been lost or transformed? If data have been lost due to subject errors, then what percentage of the total data were they, and how do the analyses differ with and without them? If transformations on the data have been performed, are good reasons given?
* Are the statistics appropriate to the scale of measurement? If improper statistics were selected, how robust are they against violations of their assumptions? Can the data be reanalyzed using appropriate statistics?
* Have significance tests or confidence intervals been reported? If no significance tests are reported, then are the results so strong that tests are unwarranted?
* Are reasonable conclusions drawn from the data and analysis? Are there other, equally valid conclusions that can be drawn about the study? Do the conclusions drawn contradict those from similar studies?

{3.2. Theoretical Evaluation / Predictive Modeling of User Interfaces}

Models of user interface performance can predict the usability of systems based on a design specification. The main idea here is that it should be possible to evaluate a user interface without collecting usage data. There are serious limitations to current models, both in scope and in accuracy; but they hold the promise of aiding design by predicting evaluation and even by suggesting improvements using diagnostic information. Models implemented in software promise to be cheaper than human critics, more thorough for what is covered, discreet (for the designer's benefit only and not for a supervisor's), and easy (if built into a design system). Models may have questionable validity, and they can be "fooled" if their knowledge is incomplete. For a general treatment of models, see chapters 7, 41, and 42 of [Helander88].

{3.2.1. How Predictive Models are Built}

Models formalize design knowledge and can be described as functional/structural (these try to explain processes or relations) or as purely predictive (these only try to make predictions and make no attempt to provide diagnostic information). Purely predictive models are almost always based on regression analysis, which uses weighted averages to make predictions. Regression gives us powerful, almost too easy, methods for generating predictions.
Regression is based on the concept of least squares, which tries to minimize the squared deviations from predictions. The regression modeling process contains the following steps: choose the variables to predict; choose predictor variables; collect data and use a regression program; evaluate the statistical significance of the prediction; plot the predictions and residuals (to look for outliers and nonlinear trends); add, drop, or modify variables in the model; and iterate. An underused source of data for building models of any type is the existing corpus of data [John89].
TEACH: Regression can be introduced as an extension of correlation, beginning with one-variable linear regression, proceeding to multiple regression, and hinting at multiple nonlinear regression. The concept of residuals, including residual plots, may be introduced to help students look for nonlinear relations. The concepts of partial correlation and stepwise regression may be introduced if students are going to build models.

{3.2.2. Detailed Discussion of Some Models}
The following models show the range of tasks for which predictions can be generated. Each model has its strengths and weaknesses. Some models are explanatory of process or structure and have a particularly strong theoretical basis. Other models are purely predictive and are more useful as engineering approximations.
TEACH: Students should have an introduction to human information processing psychology before this material is presented.
* Predicting Task Completion Time - [Card80, Card83] describe the keystroke-level model of task performance, in which they predict the amount of time it will take experts to perform routine tasks. A large task is composed of a series of unit tasks, and the time to complete a unit task depends on the time to figure out what to do plus the time to do it. The keystroke-level model uses several operators, each with an average completion time: keystrokes (.2 sec), pointing (1.1 sec), homing to a device (.4 sec), line drawing (depends on the number and length of lines), mental preparation (1.35 sec), and system response time (depends on the complexity of the action). Several heuristics are used to estimate operator times and to decide when the operators apply. The keystroke-level model was empirically validated using a broad database. The authors' procedure makes good use of gathering performance baselines, practice trials, multimedia collection of data, and sensitivity analysis of model parameter values.
DEMO: Compare arrow keys to mousing in tasks requiring a range of movements. The mouse will come out ahead. See what happens to the arrow keys if a SHIFT key moves the cursor to the next tab stop or the next word. The cursor keys can be improved, but the mouse will still come out ahead. The exercise shows that the keystroke-level model can be used to improve designs without collecting data; a worked sketch of such a comparison follows this section.
* Predicting Learning Time - [Reisner81, Kieras85, Polson87] predict cognitive complexity by modeling the knowledge that is necessary to learn new concepts about a system. The major assumption is that complexity in the model reflects cognitive complexity. Reisner uses formal grammars, while Kieras and Polson use production systems. Reisner's model predicts complexity based on the number of terminal symbols, the length of terminal strings, and the number of rules in a grammar. In [Reisner81], two designs of a graphics system were compared by comparing their command grammars.
A drawback of the analysis is that there are many ways to define grammars for a system (for example, rules may be replaced by many terminal symbols), so building the grammar -- and consequently the evaluation -- is non-deterministic. Kieras and Polson assume user knowledge includes the job situation (what can be done), how-to-do-it information (the GOMS model of [Card83]), and how-it-works information. (The GOMS model includes the Goals, elementary Operators, Methods, and Selection rules for when a method applies.) How-to-do-it knowledge is encoded in a production system using condition-action rules that operate on working memory. Kieras and Polson studied people learning and using a text editor. They devised tasks with common subtasks so that the first time a task was encountered, new procedures would have to be learned, but after that there would be transfer of learning. They built a production system of methods (although they do not provide deterministic algorithms for doing so), timed users on tasks, and found that the time savings supported their model.
TEACH: Students should already know about formal grammars but may have no experience with production systems, so some introduction may be needed. For demonstrations, most commercial production systems (often part of expert systems) have a step mode in which the contents of working memory and the firing of productions are shown one step at a time.
* Predicting Display Layout Quality - [Tullis85] introduced the idea that an alphanumeric screen layout could be analyzed by a program to predict its quality. Here, quality means low average search times for items in the display, favorable subjective ratings of the layout, or some combination of the two. [Tullis86] documents a program that implements his model, in which he measured several screen properties (overall density, local density, number of groups, average group sizes, item uncertainty or alignment, and number of items) and predicted search time and subjective preferences using a regression model. Tullis obtained squared multiple-correlation coefficients of .51 for search time and .81 for subjective preferences. Tullis' program works from a line-by-column screen-dump matrix; it provides suggestions about extreme values of his measured properties and some qualitative views of the screen, some of which are similar to those of [Streveler85]. Tullis' model was empirically validated, but it illustrates some of the problems of regression models (lack of diagnostic information, inability to generalize without a full revalidation, and hiding of information in weighted averages). [Perlman87a] points out some of the limitations of Tullis' model (no use of highlighting, boxing, or hierarchical structure) and promotes the idea that any analysis of screen layout must make use of the designer's concept so that it can provide better diagnostic information. With a representation of a designer's concept, such as a semantic network, many of the measures in Tullis' model can be determined directly from the representation. Perlman distinguishes between perceptual and cognitive complexity, which can be analyzed independently, even before design. Because the fundamental goal of layout is to display information so that it reinforces the underlying structural relationships, a representation of the information can be used to generate display designs automatically, much as was done by [Mackinlay86] for displaying quantitative information.
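As promised in the demonstration above, here is a minimal sketch -- not part of the original module -- of a keystroke-level comparison, written in Python purely for illustration. The operator times are the averages quoted in the text; the task (moving the cursor to a word fifteen characters away and deleting it) and the operator sequences for each method are hypothetical guesses, not validated task analyses.

    # Minimal keystroke-level model sketch: estimate expert task times by
    # summing average operator times.
    K = 0.2    # keystroke, seconds
    P = 1.1    # point with the mouse
    H = 0.4    # home hands to a device
    M = 1.35   # mental preparation

    def klm_time(operators):
        """Sum the times of a sequence of keystroke-level operators."""
        return sum(operators)

    # Method 1 (hypothetical): mentally prepare, press the right-arrow key
    # 15 times, mentally prepare, then type a two-key delete-word command.
    arrow_keys = [M] + [K] * 15 + [M] + [K] * 2

    # Method 2 (hypothetical): home to the mouse, mentally prepare, point
    # at the word, home back to the keyboard, mentally prepare, then type
    # the same two-key delete-word command.
    mouse = [H, M, P, H, M] + [K] * 2

    print("arrow keys: %.2f sec" % klm_time(arrow_keys))   # about 6.1 sec
    print("mouse:      %.2f sec" % klm_time(mouse))        # about 5.0 sec

Varying the distance moved and the assumed operator sequences lets students explore when each method wins, which is the point of the demonstration.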
{Teaching Considerations}

{Sources of Information}
Information useful for teaching user interface development comes from many sources. There are the traditional sources, like books and journals, but the most timely information comes from conferences and new products. Keeping up to date on all of them is difficult, but the task of technology monitoring can be assigned to students.

{Books on User Interfaces}
No single book covers the design, implementation, and evaluation of user interfaces, although [Monk84] is a good approximation, and [Baecker87] contains a comprehensive set of readings. Instead, it is necessary to gather books about design and evaluation. There are no books on user interface implementation. Good source books on user interface design are [Shneiderman87, Card83, Bailey82, Rubinstein84, Nickerson86, Woodson87, Smith86a, Brown88, Salvendy87, Helander88]. The best single book for a course is [Shneiderman87]. [Card83] is the most scientific but too theoretical for software engineers. [Bailey82, Kantowitz83] are human factors texts, but they do not contain information solely about user interfaces. [Rubinstein84] is a good book for covering task analysis and is a general guide to the whole development process. [Smith86a, Brown88] are both excellent sources of guidelines. [Salvendy87, Helander88] contain many excellent chapters from which readings might be drawn; a table of the coverage of their chapters is at the end of the bibliography. [Lindsay77] is a good book for covering psychological foundations, as is chapter 2 of [Card83]. [Foley82] is good for covering input and output devices. Timely sources on user interface implementation can best be found in conference proceedings and selected journals (see [SIGGRAPH87]). [Pfaff85] is the proceedings of the Seeheim conference on user interface management systems. Some books, including [Shneiderman87], contain sections on evaluation; but to cover the ideas properly, it is best to turn to books on statistics for the behavioral sciences (the field whose methods are closest to those needed to evaluate user interfaces). [Guilford78] is a reasonable textbook on experimental design and analysis. Books on the logic behind empirical evaluation include [Huff54, Campbell74, Solso84]. [Cleveland85, Tufte83] are good sources of ideas on graphics, both for data display and for designing displays. There are several collections of chapters on human-computer interaction: [Ehrich86, Norman86, Baecker87, Carroll87b, Salvendy87, Helander88]. [Ehrich86] focuses on dialogue management and is most closely related to the topics in this module on implementation. [Norman86, Carroll87b] contain original contributed chapters by leading researchers in human-computer interaction. [Baecker87] is a collection of previously published papers, related by the editors' commentary; it is a good source of readings and would make a good single text because of its added commentary. [Salvendy87, Helander88] are handbooks with chapters by leading researchers and practitioners. [Salvendy87] is a broader text on human factors, while [Helander88] goes into more depth on human-computer interaction; both are excellent reference sources, and both should be consulted for source material on most of the topics in this module; see the topic coverage tables at the end of the bibliography.
{Summary of Recommendations}
* General books: [Baecker87, Shneiderman87]
* Design: [Rubinstein84, Smith86a]
* Evaluation: [Campbell74, Solso84]
* Reference: [Salvendy87, Helander88]

{Journals on User Interfaces}
There are several journals that are dedicated to or that feature many papers on user interfaces, as well as journals and magazines that are worth scanning for interesting papers, including many ACM and IEEE publications and magazines like MacUser, MacWorld, BYTE, PC Magazine, and PC World.
* Communications of the ACM
* ACM SIGCHI Bulletin
* ACM Transactions on Office Information Systems
* ACM Transactions on Graphics
* ACM Computer Graphics (SIGGRAPH Newsletter)
* HCI Abstracts
* Human Factors
* Human Factors - Computer Systems Group Newsletter
* Human Factors Bulletin - Tools of the Trade Section
* IEEE Computer
* IEEE Software (especially the Human Factors section)
* IEEE Transactions on Systems, Man, and Cybernetics
* Human Computer Interaction
* International Journal of Man-Machine Studies
* Ergonomics
* Behaviour and Information Technology
* Journal of Applied Psychology
* Cognitive Science

{Conferences on User Interfaces}
The following conferences have a user interface orientation (or have a user interface track) and take place every year or two. They all publish proceedings that should be available in libraries.
* ACM SIGCHI - Human Factors in Computing Systems
* Human Factors Society Annual Meeting
* Computer Supported Cooperative Work
* IFIP INTERACT
* International Conference on Human-Computer Interaction
* ACM SIGGRAPH - Conference on Computer Graphics

{Videotapes of User Interfaces}
Videotape is perhaps the only widely available medium rich enough to convey ideas about interactive systems effectively. Beginning in 1983, the ACM SIGCHI conference has featured videotapes of interesting systems. Excerpts from the best of these have been compiled into two-hour reviews and distributed by the ACM in New York as SIGGRAPH Videos. These are highly recommended for showing in unstructured periods. The SIGGRAPH Review videos take one hour per volume and are distributed in VHS and Umatic (3/4 inch) formats. The Umatic format provides better quality than VHS; but the machines are less common and more expensive, and the tapes can only hold one hour, so the Umatic format costs twice as much: about $100 per conference rather than $50. The regular SIGGRAPH videos are pretty, but they do not contain much material on user interfaces. Address: ACM Order Department, Box 64145, Baltimore, MD 21264. Phone: 800-342-6626 (credit cards) or 301-528-4261 (information).

{User Interface Hardware and Software}
It is important for software engineers to have some experience with a variety of hardware and software so that they can apply a breadth of knowledge to their work. As part of user interface course demonstrations, it is useful to have access to many input and output devices and many types of software. For someone teaching a course over several years, it is feasible to collect a zoo of inexpensive input devices and videos showing a variety of output devices. On shorter notice, with fewer resources, a field trip to a commercial computer store might be arranged. There are two categories of software that serve as sources of information. First, there is software designed specifically for the design, implementation, and evaluation of user interfaces, of which there is little.
Second, there is software with interesting user interface ideas, of which there is a lot. Many of the best ideas about user interface design are never written down but are incorporated in software. Suggestions have been restricted to inexpensive systems running on the IBM PC and compatibles and on the Apple Macintosh. One area not discussed is that of compilers for these machines, which now come with a host of interaction libraries and will eventually include libraries of dialogue types. Besides the economic advantages of these systems, they are also the most widely used, and people working on user interfaces should be familiar with them. Not surprisingly, given the size of the markets for software running on these machines, some of the best-crafted user interfaces are found on the least expensive machines.
PC Software:
* Dan Bricklin's Demo Program [Bricklin86] is a tool for screen layout and for creating slide shows with conditional branching. It is easy to learn and is good for exploring alternative layouts.
* Display Analysis Program [Tullis86] is a tool for analyzing the quality of screen layouts by estimating the average time it will take people to find data items and by estimating subjective preference.
* NaviText SAM [Perlman88b] is a tool for finding and managing sets of guidelines for system design and evaluation.
* |STAT [Perlman87b] is an inexpensive statistical package for simple analysis of data such as might be collected in an evaluation. One of its advantages is that students can take it with them for no fee.
Macintosh Software:
* For many software engineers, just using the graphical interface of the Macintosh will be an enlightening experience.
* HyperCard is a programmable hypermedia (primarily text and graphics) system. [Goodman87] is the standard reference source on HyperCard.
* Prototyper is a design and implementation tool that provides interactive access to the Macintosh toolbox [SmethersBarnes88].

{Human Technical Resources for User Interface Development}
People of many backgrounds can contribute to user interface development. In general, it is best to involve people early in the design stage to plan for the implementation and evaluation of systems. Consider having representatives from the following categories as guest speakers to tell students what they do and how they work with others on user interface development.
* Software Engineers - The primary contacts of software engineers are other software engineers. It is important for them to know how to get user interface input from their peers.
* Human Factors Specialists - Human factors specialists are trained to understand the problems of people using complex systems. They are aware of user interface issues and will direct the attention of a development team to those issues, adding suggestions based on a broad human factors background.
* Graphic Designers - Graphic designers are trained to communicate ideas in a motivating medium. Almost any user interface design can benefit from input from a graphic designer.
* Technical Writers - Technical writers are trained to organize and communicate complex bodies of knowledge, such as the procedures described in user manuals. Technical writers should be consulted during the design stage of development to suggest areas where there will be problems documenting complicated parts of a system.
* Statistical Consultants - Statistical consultants are trained to design studies in which data are collected and analyzed.
Often, human factors specialists will have adequate statistical knowledge for conducting empirical evaluations of systems. Statistical experts should be consulted during the design phase to help develop the criteria by which systems will be evaluated.
* Customers - Customers are the final judges of usability, so they must be pleased with the output of user interface development. A bad user interface can result in unused programs that benefit no one. Customers are often experts in the tasks they perform and can provide valuable input after task analysis. It is prudent to maintain ties with customers to gather more information and reactions to design decisions.

{Priorities}
Although this is a large module, which perhaps should be broken into two or three modules, there are good reasons for keeping it as one curriculum module:
* All the material is related.
* User interface development is a specialized area of software engineering.
* Having one module makes the interrelationships of the material clear.
However, there are cases for teaching sections of the module separately:
* A unit on implementation would be of immediate use to students mainly concerned with development because that material covers data structures and algorithms for user interface implementation. Also, the implementation material is the only material that requires that the student be a competent programmer; its inclusion would exclude many psychology and human factors students.
* A unit on evaluation would be of use to organizations that wish to evaluate existing systems for usability.
Much of the material presented here has been taught as an elective in the Wang Institute Master of Software Engineering degree program. It should probably not be required in a computer science degree program, but it should be recommended for (managers of) developers of end-user products. There is too much material in this module to cover in a quarter or semester course; it might all be covered in a year-long course, which would allow more time for projects. The instructor must extract topics according to priorities. Some advice is provided below about priorities, in particular about what material to cover in the three weeks allocated to the topic in a graduate software engineering curriculum. The priorities should delineate what every software engineer should know about user interfaces.

{Core Knowledge}
In a three-week unit, which might be part of a systems engineering course, there are several topics that should be covered in enough depth that students are aware of the issues in user interface development. These topics include:
* Process Life Cycle - The design-implement-evaluate cycle is one that should be applied for each user interface design decision. Prototyping methods using prototyping tools [Bricklin86, Goodman87, SmethersBarnes88] should be stressed.
* User Interface Guidelines - Software engineers should be aware of the existence of design guidelines and standards documents [Smith86a], and they should know how to define and document design rules for specific systems [Perlman88b].
* Input and Output Devices - Software engineers should have at least observed a variety of hardware systems and their input and output devices, including the Macintosh and systems using other media, for which the SIGGRAPH videotapes are useful.
* Dialogue Types and Interaction Libraries - There are many dialogue types that are used to adapt to different task and user demands, and interaction libraries are used in their implementation.
Software engineers should understand the tradeoffs of the different dialogue types and know about the structure of some libraries, such as the Macintosh toolbox [Apple85] and a window system [Scheifler86].
* User Guidance - User guidance includes online information (prompts, error messages, help), printed documentation (function-key templates, reference manuals and sheets, tutorials), and perhaps more advanced media (video). All must be coordinated with the software if a system is to be learnable and usable [Houghton84].
* Collecting Usage Data - The design-implementation-evaluation life cycle implies a need for collecting data at the low level of design decisions (possibly using prototypes) and at the high level of system evaluation. Software engineers should understand the need for evaluation and know about the costs and benefits of different data collection methods like instrumenting programs, video protocols, interviews, and questionnaires. Simple summary statistics and graphs should be covered.

{Advanced Topics}
Listed below is the material to which I would give less coverage in a course constrained by time. For some of the topics, the material is not amenable to a cursory exposition. For others, the material is not part of established practice and would therefore be of less immediate practical use. For all the topics, the core knowledge is prerequisite.
* Human information processing
* Specification of user interfaces
* Dialogue structure models
* Detailed coverage of the theory behind empirical evaluation (measurement, experimental design, sampling issues)
* Inferential statistics and statistical computing
* Predictive models of user interfaces

{Beyond this Module}
Advanced material for user interface development specialists may be taught in subsequent courses.
* UIMS and Design Specification - in which students would use or design a UIMS and a related specification language [Bass89].
* Evaluation Methods - in which students would conduct experiments or evaluate systems and possibly contribute to the scientific base [Sherwin86, Thakkar90].
* Task Analysis and Design - in which students would learn about methods of data collection and use the data as input to cognitive science conceptual models.
* Predictive Models - in which students would learn how to build models to explain or predict data.

{Schedules}
The material in this module has been taught at the Wang Institute in this order: design, implementation, evaluation; but also in this order: evaluation, design, implementation. The latter format may be preferred if students are going to implement working versions of the systems they design. It makes sense for evaluation to come first when the methods of evaluation must be taken into account in the design (task analysis, using predictive models) and/or the implementation (instrumenting programs to collect usage data).

{Exercises and Projects}
Practical experience is necessary to learn the lessons about design, implementation, and evaluation. Students cannot be expected to appreciate the complexity of design without going through evaluated exercises. To understand the motivations behind the development of software tools for user interface implementation, students must gain experience using both primitive and advanced tools. Students cannot be expected to appreciate the need for user interface evaluation without watching users use their systems. Exercises must be emphasized in user interface development.
It is important that the exercises are interesting and that they motivate the students, so personalized projects with sub-projects during the course are useful. If students are given artificially small exercises, they will do them just to get them out of the way. A term project with the goal of developing a usable system is useful for motivating students. The exercises can then be applied to different aspects of the project. While it may seem attractive to allow students to choose their own projects, most students would prefer to be given one or two interesting project areas, along with the option of choosing their own. This allows most students to follow the path of least resistance and be sure that they will choose a project that is both interesting and feasible. I have found this to be especially true for students without industrial experience; they seem to have the most problems making choices and then making them work [Perlman88d]. The following is a suggested sequence of project activities, paired with a term-long teaching schedule:

Project Schedule
    CLASS INSTRUCTION                    PROJECT ACTIVITY
    Introduction                         Project Area (1)
    Empirical Evaluation Foundations     Evaluation Design Criteria (2)
    Design                               Design (3)
    Implementation                       Implementation (4)
    Predictive Evaluation                Evaluation (5)
    Interpreting Data                    Re-Design (6)
                                         Re-Implementation (7)
                                         Re-Evaluation (8)

(1) Students should be allowed to choose their own project area, but many will prefer to be given one. If several groups work on systems to satisfy the same user interface functional requirements, they may find that they can learn from other projects, and there may be a sense of competition, which has good and bad points.
(2) After learning measurement concepts, sampling issues, and issues of validity and reliability, students are ready to define evaluation criteria, a task that will be novel and difficult for most.
(3) During design, students should draw from many sources of information and should be able to document how they made design decisions.
(4) During implementation, the students get an opportunity to develop and/or use tools for building user interfaces. They may learn to appreciate the benefits of tools for achieving design goals, and they may learn that some tools have bad user interfaces (here the programmer is the user of the tool).
(5) During evaluation, students learn that even thoughtful design and implementation result in problems in both design and implementation. They see the point of doing evaluations.
(6) During redesign, the important task is to incorporate into the design the empirical information gathered during evaluation.
(7) During re-implementation, when a previous implementation is modified, it is possible to gauge how well the first implementation isolated design decisions using concepts of device and application independence.
(8) After re-evaluation, it is possible to compare previous success (or failure) with the results of the reworking.
Students should be able to try many tools for design, implementation, and evaluation. These should include state-of-the-practice, commercially available (and affordable) tools, and can also include state-of-the-art or experimental systems.

{A User Interface Development Project}
In this project, students were provided with a skeleton of the working parts of a system, and they provided the user interface design, built it, and evaluated it. The goal was to offer students a project that would relieve them of the burden of finding an interesting project topic for the instructor to approve.
In a class of 13 students, all but one chose the suggested project, and students tended to work in groups of two. The topic of the project came out of an interest in hypertext systems (see [Conklin87] for a review of hypertext concepts), an interest in user interfaces, and the availability of an online version of [Smith86a], a corpus of 944 user interface design guidelines. The corpus is hierarchically structured and has many internal cross-references and references to outside sources, making it ideal for hypertext access. The goal of the project was to design, implement, and evaluate some sort of hypertext interface to the corpus. To set up the project, the corpus was reformatted (because it contained typesetting codes) into a generic hypertext file format, and access routines were provided. On top of the access routines, a set of utilities was provided for hypertext processing: converting strings to pointers to hypertext nodes, search utilities, structure traversal utilities, etc. On top of these, a windowing and input library was provided. The separation between the database, the database access functions, the hypertext functions, and the higher-level user interface tools was strict, so that the students could buy in at whatever level they wanted. This was done to allow students to do development on whatever machine they wanted.
* One team used all the tools provided and built a multiwindow environment on a Sun workstation.
* Two teams used the tools up to the hypertext functions and built multiwindow environments on VAXstations.
* Two teams used the tools up to the hypertext functions and built single-window systems on UNIX terminals with curses, a text-screen package [Arnold81].
* One student used the database access functions to convert the corpus to a format used by emacs, a programmable editor, and did his prototyping using emacs.
* Another team used the database access functions to build onto a file-viewing program running on UNIX.
* The whole set of tools provided on UNIX was ported to a PC multiwindow environment [Perlman88b].
The project was done during a 13-week term and progressed through design, implementation, and evaluation. There was not enough time for re-design, re-implementation, and re-evaluation. At the end of each phase, students made class presentations and gathered feedback from their peers. At the end of the project, students made a final presentation in which they demonstrated their systems and presented the results of their evaluations. All teams were required to instrument their programs so that they could answer some basic usage questions. The data collection methods used included questionnaires and video protocols. Toward the end of the project, some students tried to create benchmark tasks to use to compare the different systems, but those tasks never materialized. The process we used to come up with a task and a measure was instructive, but some students never understood it. The task was based on a signal detection model of finding guidelines relevant to a specific topic, compensating for the retrieval of irrelevant guidelines. The operational definition of relevance, which was validated by inter-rater reliability, along with the complexity of signal detection theory, caused enough confusion to convince the instructor not to conjure up such a complex measure again. Although the benchmark task was decided on, the tasks were never constructed because the students did not have time. Instead, they looked for problems in their systems. The results of the evaluations were compelling to the students.
Many were surprised that their intuitions were not perfect, and some found that they had designed bad systems. Most teams found that their systems could be modified to fix many of the problems found, and that motivated the students to evaluate in the future, although no follow-up tracking is planned. All the students needed constant hand-holding with even the simplest of data analyses, and they were instructed in how to get help from experts in the future. The results of the project were good in that students were exposed to and exercised many design methods, used and developed tools in their implementations, and got some experience with evaluation. During the term, there were assignments to use various tools or methods; and having a full-term project allowed students to work on some assignments and apply the work to their projects, which gave them extra motivation.