Int. J. Human-Computer Studies 64 (2006) 489–501 www.elsevier.com/locate/ijhcs
Methods for inclusion: Employing think aloud protocols in software usability studies with individuals who are deaf
Vera Louise Roberts (a), Deborah I. Fels (b)
(a) Adaptive Technology Resource Centre, University of Toronto, 130 St George Street, Toronto, Ont., Canada M5S 1A5
(b) Ryerson University, 350 Victoria St., Toronto, Canada M5B 2K3
Received 7 July 2004; received in revised form 20 September 2005; accepted 7 November 2005. Available online 20 December 2005. Communicated by J. Jacko.
Abstract

Usability evaluation is an important step in the software and product design cycle. A number of methodologies, such as the talk aloud protocol and the cognitive walkthrough, can be employed in usability evaluations. However, many of these methods are not designed to include users with disabilities. Legislation and good design practice should provide incentives for researchers in this field to consider more inclusive methodologies. We carried out two studies to explore the viability of collecting gestural protocols from sign language users who are deaf using the think aloud protocol (TAP) method. Results of our studies support the viability of gestural TAP as a usability evaluation method and provide additional evidence that the cognitive systems used to produce successful verbal protocols in people who are hearing seem to work similarly in people who speak with gestures. The challenges in adapting the TAP method for gestural language relate to how the data were collected and not to the data or their analysis.
© 2005 Elsevier Ltd. All rights reserved.

Keywords: Usability; Usability evaluation methods; Deaf; Think aloud protocol; Gestural think aloud protocol
1. Introduction

While additions to software such as preference settings, frequency-of-use menus and operating system accessibility options indicate that there is some movement in software design towards adherence to universal design principles, usability evaluation methods (UEMs) have not undergone a comparable shift. Consequently, a question emerges: how well do current UEMs evaluate the usability of software developed for multiple user profiles?

One common methodology used to study software and/or product usability is the talk aloud or think aloud protocol (TAP). This protocol requires the product user to speak his/her thoughts aloud while focusing on and manipulating the product in prescribed ways (Ericsson and Simon, 1984, 1998). The interaction between the user and the software is often videotaped so that the user's
actions along with his/her verbalizations are recorded for later analysis. Questions about inclusion arise from an examination of this popular and useful UEM: What if the intended user of a product does not speak through oral means? What equivalent UEM is available for individuals who, for example, communicate with a gestural language such as American Sign Language (ASL)? The think aloud method has not been widely used with individuals who are users of sign language, yet mandates for inclusive educational technologies require that research methods appropriate to this group be developed and tested. A gestural think aloud protocol (GTAP) that utilizes "self-sign" in the same way TAP utilizes "self-talk" may prove to be an important methodology in assessing the usability and quality of online educational materials for ASL users. Furthermore, there may be more to learn about expanding existing research protocols such as TAP to include gestural/visual languages. This research explores the use of TAP methodologies with ASL users and seeks to provide support for the use of this inclusive UEM.
The theoretical underpinning applicable to the TAP method and to this research is that information held or recently held in short-term memory (STM) and working memory is accurately revealed through verbalizations. What is important for GTAP to be successful is that there is a trace of recent thought in working memory that may be articulated via sign language. Research with ASL participants (Bellugi et al., 1974; Campbell and Wright, 1990; MacSweeney, 1998; Wilson and Emmorey, 1998, 2003) provides support for STM and working memory processes in individuals who are deaf that are similar in function to those of hearing individuals. These studies also indicate that participants are able to access and articulate items in working memory. Thus, the use of gestural TAP with individuals who are deaf and speak a sign language is further supported.

There have been two other studies with participants who were deaf that have been characterized as think aloud studies: the reading comprehension studies of Andrews and Mason (1991) and Schirmer (2003). The former asked participants to read a cloze passage and, upon reaching the omitted word, to think their thoughts aloud. At the same time, the researchers engaged the participant in conversation and asked probing questions. This was not a typical think aloud method, as it was more likely to draw introspective as well as retrospective responses and did not follow the standard procedure in which prompts to participants are limited to "keep talking" or "thoughts?" Schirmer (2003) asked participants to think aloud at natural breaks in the text, such as page breaks, rather than produce a concurrent thought stream. While it is clear that TAP is a very useful tool for understanding reading strategies, there is no body of research that explicitly verifies that gestural TAP and verbal TAP are cognitively equivalent or produce the same results.

The think aloud method or protocol (TAP) was first used in research on problem-solving behaviour (Van Someren et al., 1994). Pioneered by Newell and Simon (1972) and refined by Ericsson and Simon (1984), TAP requires that participants speak their thoughts aloud as they complete an assigned task. The TAP method is not associated with any one model of cognition or memory and may be best viewed as a method for testing hypotheses (Ericsson and Simon, 1984; Green, 1998). For example, the usability researcher tests hypotheses held by developers about how certain features will be used by the audience or how well the needs and expectations of the audience are met by the product. Furthermore, "one of the uses of think aloud protocol is to assist in forming hypotheses about areas where not much is known" (Wiedenbeck et al., 1989, p. 25). Hence, TAP is particularly useful for the iterative development cycle of software.

As an inclusive UEM, TAP has an advantage over structured elicitation techniques such as interviews and surveys: the think aloud method makes it easy for subjects because they are allowed to use their own language (Van Someren et al., 1994, p. 26) and own words. In this
way, the protocol method may be ideal for speakers of ASL, who are able to articulate their thoughts in their first language using their known words, grammar and syntax. For most native ASL speakers, oral language and its written form are a second language in which they are less fluent (Schirmer, 2000). For this reason, written questionnaires and responses or paper-based surveys are not always an appropriate means for equitable user testing. Gestural TAP enables spontaneous user feedback about how a task is being accomplished, without the additional and potentially interfering cognitive processes required for individuals to work in a language that is not their first. Also, research has shown that the concurrent verbal protocols obtained using TAP methods are superior to retrospective protocols obtained after the task is completed (Nisbett and Wilson, 1977; Ericsson and Simon, 1984; Kuusela and Paul, 2000). Furthermore, in a study that compared four usability methods (logged data, questionnaires, interviews and verbal protocol analysis), TAP was shown to be "the most effective single method at highlighting usability problems" (Henderson et al., 1995, p. 426). Indeed, in some cases, paired UEMs still failed to identify as many unique problems as were identified through TAP alone (Henderson et al., 1995).

Signed languages are complete, natural languages that are distinct from spoken languages (Stokoe, 2001) and have a distinct phonology, morphology, syntax and vocabulary (Messing, 1999). Thus, ASL has the same potential to provide rich protocols as spoken languages do. Although oral and gestural languages rely on different modes of communication, the underlying neural structures are actually very similar (MacSweeney, 1998). In hearing individuals, the left hemisphere of the brain is responsible for speech production and language comprehension (McNeill, 1992). Corina (1998) reviewed 16 studies of individuals who communicate with sign language and had suffered brain lesions, and found sufficient evidence to support the notion that the left hemisphere is also responsible for sign language comprehension and sign production. Indeed, the notion that language and gesture rely on similar brain regions and processes is supported by research with individuals with aphasia (McNeill, 1992). Furthermore, Bates and Dick (2002), in their review of language and gesture research, found that "compelling links have been observed, involving specific aspects of gesture that precede or accompany each of the major language milestones from 6 to 30 months" (p. 293), and that "there is research to support links between gesture and language development in both typical and atypical populations" (p. 295). The researchers conclude that "the division of labor in the brain does not seem to break down neatly into language versus nonlanguage" (p. 305). Thus, there is support for the notion that the collection of gestural protocols makes similar cognitive demands on the participant as the collection of verbal protocols.
1.1. Scope

This paper sets out a methodology for inclusive collection of verbal protocols and is not intended as further validation of the think aloud method itself. The goal is to develop a UEM that is inclusive. We present the results of two studies designed to explore the use of gestural protocol as a usability method. The first study addresses the mismatch between the movement towards barrier-free designs and the UEMs available to test them. In our second study, we implemented GTAP in a usability evaluation of new methods for presenting ASL translations of online videos. Thus, we were looking for similar results across the two studies and were not attempting to validate the method in each study. This research will advance our knowledge of how gestural and verbal protocols compare, as well as further our understanding of how to carry out a gestural TAP procedure. TAPs are used often and are widely accepted as a standard UEM; validation of the proposed GTAP method is still needed. We carried out two studies to explore the use of gestural TAP as a viable method of user testing for deaf individuals, and we intend to show that existing TAP UEMs may be effectively and reasonably adapted for use with individuals who communicate with gestural languages such as ASL.

2. Method

This research contained two components: (1) a simple game study and (2) an actual usability study carried out using the GTAP method. Each study was designed to examine different aspects of using the GTAP method. The first study enabled comparison between verbal protocols and GTAPs, and the latter enabled evaluation and testing of the GTAP method. This research addresses the following questions:

(1) What is the relationship of gestural language to TAP outcomes? The main hypothesis for this question is that there is a difference in TAP outcomes between ASL speakers and oral language speakers.
(2) Is gestural TAP viable as a usability evaluation method and, if not, how may it be made viable?
(3) What is the relationship of gestural language to the protocol data that are collected, and to how the data are collected?
2.1. Solitaire game study

2.1.1. Materials
For the game component, Microsoft Solitaire on a laptop computer was used. One video camera was used to record the screen and the spoken English interpretation simultaneously (see Fig. 1 for a schematic).
[Fig. 1. Lab configuration for the Solitaire game study: investigator, participant, camera (C) and, for deaf participants only, an interpreter.]
2.1.2. Subjects
Participants for each study were recruited at different times and were permitted to participate in both studies. For the game study, there were nine individuals (3 male, 6 female; ages 26–45) who were deaf and fluent in ASL and nine hearing individuals (5 male, 4 female; ages 26–55). Education levels ranged from some college to a graduate degree for deaf participants and from high school to graduate level for hearing participants. All subjects were familiar with the Solitaire card game. In both groups, one individual was younger than 30 years, six individuals were 31–40 years and two individuals were older than 40 years.

2.1.3. Procedure
2.1.3.1. Task. Following training, individuals were asked to play Solitaire until their participation in the study reached ten minutes. The session closed with a short questionnaire and discussion.

2.1.3.2. Think aloud protocol. Participants in the game study were asked to speak/sign their thoughts at any time, or at least before making a move/using the mouse. After short periods of silence (about 10 s), several moves without utterances, or indications of thought such as facial expressions or head nodding, the investigator would remind the participant to "keep talking."
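The 10-s silence rule is easy to operationalize in session-logging software. As a minimal sketch (in our studies the prompting was administered by the investigator; the code and all names below are hypothetical):

    import time

    PROMPT_AFTER_S = 10  # assumed threshold: prompt after about 10 s of silence

    class PromptTimer:
        """Tracks time since the last utterance and signals when the
        investigator should cue the participant to keep talking."""

        def __init__(self, threshold=PROMPT_AFTER_S):
            self.threshold = threshold
            self.last_utterance = time.monotonic()

        def utterance(self):
            # Call whenever the participant speaks or signs a thought.
            self.last_utterance = time.monotonic()

        def prompt_due(self):
            # True once silence has lasted at least the threshold.
            return time.monotonic() - self.last_utterance >= self.threshold

In use, the observation loop would poll prompt_due() and, when it returns True, the investigator would issue a neutral cue such as "keep talking" and call utterance() when the participant resumes.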
2.1.4. Outcome measures and data analysis
The game study is a between-subjects design. Two groups of subjects, ASL speakers (ASL group) and English speakers (Oral group), played Microsoft Solitaire. The grouping factor is language. Protocols collected for the game study were categorized as belonging to one of the following categories: (1) responses to the cards (e.g., expression of need); (2) play (e.g., card value); (3) strategy (e.g., plan for best play); (4) general comments about Solitaire; (5) error (e.g., having made a mistake); (6) TAP (e.g., effect of TAP on play); (7) technical (comments about hardware); and (8) usability (e.g., comments about the game software environment). The verbal protocols were transcribed from videotape, chunked into segments and then coded for analysis. The intraclass correlation (ICC) statistic was used to determine the reliability of the coding strategy between two separate coders. The ASL was translated and recorded simultaneously during the study.

Participants were asked to complete a pre-study questionnaire. The Solitaire pre-study questionnaire asked some background information questions as well as questions about the participant's experience with Solitaire and skill level. The participant had the option to have any written material translated into ASL and to respond in ASL.

2.2. Law video usability study

2.2.1. Materials
A software video player interface was developed and tested for the usability component. The interface allowed users to view an educational video about traffic court along with an ASL interpretation of the video. The original educational video was produced as a standard National Television Systems Committee (NTSC) video with traditional verbatim closed captions. The ASL interpretation was provided in two formats: an Acted ASL version, where actors in costume played the parts of the original actors, and a Standard ASL version, using a single translator for all of the parts. The Acted ASL version showed ASL actors in two separate video windows along with the original video. The translated version showed the translator in one separate window along with the original video. Video controls in the interface allowed the user to play, pause, move forward or rewind the video. Viewing preferences could also be set by the user and included adjustment of video and interpretation window positions, proximity of the windows, relative size of the windows and use of borders around the windows. A short training video was provided to assist participants in learning the interface as well as in practicing GTAP. The educational video was provided by Canadian Learning Television and is part of a full social studies curriculum.

For the usability study, subjects were provided with a PC that included a keyboard, mouse, speakers and 17-in colour monitor.
[Fig. 2. Lab configuration for the law video usability study: investigator, participant, two cameras (C) and interpreter.]
Each PC was further equipped with key-stroke logging software. Two video cameras were used to record the signs made by the participant and by the investigator so that they could be translated at the close of the study (see Fig. 2 for a schematic).

2.2.2. Subjects
For the usability study, there were 17 participants (9 female and 8 male) who were deaf and fluent in ASL; education levels ranged from high school to college/university. The Canadian Learning Television video on traffic court was aimed at the age and education level of the participants.

2.2.3. Procedure
2.2.3.1. Task. Individuals in the usability study were asked to watch a 10-min segment of a traffic court educational video. The first viewing of the video was followed by a comprehension test. Next, participants were allowed to adjust viewing preferences before viewing the same educational video segment, but with a different ASL interpretation format from the first viewing. The session closed with a short questionnaire and discussion.

2.2.3.2. Think aloud protocol. Usability participants were asked to sign their thoughts about the task as they were performing it. After short periods of silence (about 10 s), or
indications of thought such as facial expressions or head nodding, the investigator would remind the participant to "keep talking."

2.2.4. Outcome measures and data analysis
The law video usability study is a within-subjects design. The experimental factors of interest in this study are viewing order and interpretation type (Acted ASL and Standard ASL). The law video data are aggregated into four overarching categories that relate to the specific research questions of the usability study: (1) interpretation, (2) format, (3) technical issues and (4) content. All of the analyses are based on these macro-categories. To understand the quality of the comments in these macro-categories, further sub-categories of positive and negative are used.

A counterbalanced order for the levels of presentation technique was established, and participants received the order that was assigned to their subject number. Participants were assigned subject numbers consecutively, and the order of subjects was simply determined by their availability for the study. As in the game study, video data were transcribed and coded. The intraclass correlation statistic was used to verify the consistency and codability of the data. In this study, the ASL was translated after the study.

Participants were asked to complete a pre-study questionnaire. The usability pre-study questionnaire asked questions about the participant's experience with online video, computers and ASL interpretations of video material. The participant had the option to have any written material translated into ASL and to respond in ASL.

2.3. Statistical issues
Non-parametric tests were used to compare the means of the two groups in both studies, as the number of subjects is considered small. A higher alpha level for small groups will increase the power of the test and improve the chance of making a correct decision about the null hypothesis, so the level for tests of both sets of data was set to α = .10 (Stevens, 1996, pp. 6 and 175). However, because the data are multivariate, the alpha level for univariate analysis was set at .01 (roughly α/p, where p is the number of categories) in order to reduce the risk of a Type I (false rejection) error (Stevens, 1996, p. 160). Power is particularly important in the Solitaire game study, since the expectation is that there will be no difference between the two groups. It should be noted that no data were removed in order to increase power.
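To make the alpha adjustment and the group comparison concrete, here is a minimal sketch in Python (the per-participant counts are hypothetical; scipy's Mann-Whitney U test is equivalent to the Wilcoxon rank-sum comparison used for the between-subjects data):

    from scipy.stats import mannwhitneyu

    N_CATEGORIES = 10                       # play, strategy, cards, ..., conversation
    ALPHA_MULTI = 0.10                      # raised alpha for the small-sample test
    ALPHA_UNI = ALPHA_MULTI / N_CATEGORIES  # roughly alpha/p = .10/10 = .01

    # Hypothetical per-participant comment counts for one category (n = 9 per group).
    asl_counts = [4, 0, 2, 7, 3, 5, 1, 6, 11]
    oral_counts = [5, 2, 3, 9, 4, 6, 2, 7, 10]

    # Two-sided rank-based comparison of the two independent groups.
    stat, p = mannwhitneyu(asl_counts, oral_counts, alternative="two-sided")
    print(f"U = {stat:.1f}, p = {p:.3f}, significant at alpha = {ALPHA_UNI}: {p < ALPHA_UNI}")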
3. Results

3.1. Solitaire game study
A sample set of the transcripts for the protocols was divided into utterance "chunks" that represented a complete move, such as playing a card, and coded by two independent raters. Inter-rater reliability (ICC) of the chunking was high at 76% for 416 sample utterances, and ICC for the coding components was high, ranging from 68% to 100% across categories. Chunking and coding tasks were then continued by only one of the raters.

The average number of comments for the play category was nearly double for the Oral group compared with the ASL group (53 and 28, respectively). However, a 1-min snapshot of the data at the third minute of each session shows no difference in the number of play comments. This third minute was randomly selected to provide a normalized unit of time, eliminating differences in counts due to time elapsed during play, which varied from participant to participant. The mean number of comments for each category is shown in Tables 1 and 2. The standard deviation for all category counts is high for both full-session and minute-three protocols.
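For readers who want to reproduce the reliability check, a minimal sketch of the single-measure, two-way mixed, consistency form of the intraclass correlation, ICC(3,1), computed from its ANOVA decomposition (the two-rater rating matrix below is hypothetical):

    import numpy as np

    def icc_3_1(ratings):
        """ICC(3,1): two-way mixed model, single measure, consistency.
        `ratings` is an (n_segments x k_raters) array of numeric codes."""
        x = np.asarray(ratings, dtype=float)
        n, k = x.shape
        grand = x.mean()
        ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()    # between segments
        ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()    # between raters
        ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols  # residual
        ms_rows = ss_rows / (n - 1)
        ms_err = ss_err / ((n - 1) * (k - 1))
        return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

    # Hypothetical: two raters' numeric category codes for ten utterance chunks.
    two_raters = [[1, 1], [3, 3], [2, 2], [1, 2], [5, 5],
                  [4, 4], [2, 2], [3, 3], [1, 1], [4, 5]]
    print(f"ICC(3,1) = {icc_3_1(two_raters):.2f}")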
Table 1
Descriptive group data (n = 9 per group) for category counts for minute three of the session protocol

                    ASL                       Oral
Category            Mean    Count   SD        Mean    Count   SD
Play                4.33    39      4.33      4.33    39      2.74
Strategy            .22     2       .44       .33     3       .50
Cards               1.44    13      1.33      2.22    20      1.39
Solitaire           .22     2       .44       .22     2       .44
Error               .11     1       .33       .33     3       .50
TAP                 .11     1       .33       .11     1       .33
Distracted          .56     9       1.33      .00     0       .00
Technical           .11     1       .33       .00     0       .00
Usability           .00     0       .00       .00     0       .00
Conversation        .89     8       1.05      1.11    10      1.27
No. of utterances   7.44    76      3.00      7.11    78      2.37
Table 2
Descriptive group data (n = 9 per group) for category counts for the complete session protocol

                    ASL                Oral
Category            Mean     SD        Mean     SD
Play                27.67    18.76     52.89    26.90
Strategy            1.67     1.94      2.78     2.22
Cards               10.56    6.06      14.67    8.76
Solitaire           1.11     1.36      2.22     1.56
Error               1.11     1.45      2.00     1.73
TAP                 1.67     1.66      .56      1.01
Distracted          2.56     5.66      .00      .00
Technical           .33      .50       .33      .50
Usability           .56      .88       .22      .44
Conversation        11.44    6.19      5.78     5.19
No. of utterances   53.56    16.45     72.56    27.12
Multivariate analysis, with language as the grouping variable, showed no effect of group; however, the observed power was low at .33. Wilcoxon tests were carried out for each of the categories, for utterances made during the entire session and for utterances made during the third-minute slice only. The individual results are summarised in Tables 3 and 4. No significant difference (at the p < .01 level) in the number of comments produced in a category was found in either the whole-session or the minute-three data. Participants in both groups averaged 7 utterances during minute three of the verbal protocol. Category sums related to playing the game (play, strategy, cards, error) and categories not related to the game (solitaire, distracted, TAP, technical, usability, conversation) were aggregated and are shown in Table 5. The overall ratio of game comments to non-game comments was 4:1.
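This game/non-game aggregation is a simple relabelling of the coded utterances. A minimal sketch (category names from the coding scheme; the utterance list is hypothetical):

    from collections import Counter

    GAME = {"play", "strategy", "cards", "error"}
    NON_GAME = {"solitaire", "distracted", "tap", "technical",
                "usability", "conversation"}

    def aggregate(codes):
        """Collapse per-utterance category codes into game vs. non-game sums."""
        counts = Counter(codes)
        game = sum(v for c, v in counts.items() if c in GAME)
        non_game = sum(v for c, v in counts.items() if c in NON_GAME)
        return game, non_game

    # Hypothetical minute-three codes for one participant.
    codes = ["play", "play", "cards", "strategy", "conversation", "play", "error"]
    game, non_game = aggregate(codes)
    print(f"game = {game}, non-game = {non_game}")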
Table 3
Wilcoxon test results for whole-session data

Category            Test statistic   Asymptotic significance
Play                64               .06
Strategy            73.5             .28
Cards               75.5             .38
Solitaire           69.5             .14
Error               73.5             .26
TAP                 68.5             .11
Distracted          72               .07
Technical           85.5             1.00
Usability           79               .47
Conversation        63.5             .05
No. of utterances   68.5             .13
Table 4
Wilcoxon test results for minute-three data

Category            Test statistic   Asymptotic significance
Play                80.5             .66
Strategy            82.5             .78
Cards               81               .61
Solitaire           85.5             1.00
Error               76.5             .27
TAP                 85.5             1.00
Distracted          76.5             .15
Technical           81               .32
Usability           85.5             1.00
Conversation        73               .25
No. of utterances   83.5             .86
Table 5
Aggregate category sums for minute three of the verbal protocol

        Group total          Average              Ratio
        Game    Non-game     Game    Non-game
ASL     56      16           6.22    1.78         3.5:1
Oral    64      14           7.11    1.56         5.6:1
Total   120     30           13.33   3.33         4:1
A Wilcoxon test for the two aggregate categories showed no significant difference between the groups.

3.2. Law video usability study
This task is a case study of the adapted TAP method in a usability study. The grouping factors in this study were interpretation treatment (acted and standard interpretation) and viewing order of the video interpretation treatment. Due to technical difficulties, only 13 of the initial 17 participants have full data sets (26 ten-min video sessions), and only these full sets are included in the analysis. Only the verbal protocols for the law video viewing sessions are analysed, because these sessions constitute the GTAP usability portion of the study.

An initial coding scheme of nine categories (see Table 6) was developed. Seven of the categories were aggregated into four categories: the ASL interpretation (1 and 2), format (3, 4 and 7), technical issues (5) and content (6) of the law video. Categories 8 and 9 stood alone. Comments were coded as positive or negative for each aggregate category and simply counted in categories 8 and 9. The ICC results for the four main categories, which include all of the sub-categories, are high and are shown in Table 7.

Table 8 shows the mean split category counts for each interpretation type and each viewing order. T-tests were used to determine whether viewing order and interpretation type affected the quantity of positive and negative comments. There were significantly more positive comments during the second session (t(22) = 2.08, p < .05). No effect of interpretation type on number of comments was found. Multivariate analysis of the ten split category counts was conducted to determine whether order or interpretation type had an effect on the number of comments made. A significant effect of order was found with alpha at the .10 level (F(10,13) = 2.31, p < .10). The observed power of the effect is moderate at .81. The categories were also grouped by order only and by interpretation type only, and Wilcoxon tests were run. Table 9 shows the findings for the tests that were significant or approached significance.

Many issues were identified by the participants. For example, two participants identified seven different issues for the acted interpretation and nine different issues for the standard interpretation; their comments are shown in Table 10. Each of the 26 ten-min video sessions yielded at least one coded comment, with the average number of comments per session being 8.4. Table 11 provides a summary of the number of comments in each category. Examples of comments made by participants are shown in Fig. 3. No significant correlation between interpretation type or viewing order and total number of comments was found; however, correlations between these variables and three of the four coding categories were found (see Table 12).
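The order-effect analyses reported here are standard paired and point-biserial tests; a minimal sketch with hypothetical per-session counts (not the study data):

    import numpy as np
    from scipy.stats import pointbiserialr, ttest_rel

    # Hypothetical positive-comment counts for 12 participants,
    # first viewing vs. second viewing (within-subjects pairing).
    first = np.array([1, 0, 2, 1, 0, 3, 1, 2, 0, 1, 2, 1])
    second = np.array([2, 1, 3, 2, 1, 4, 2, 3, 1, 2, 3, 2])

    # Paired t-test for an effect of viewing order on positive comments.
    t, p = ttest_rel(second, first)
    print(f"t({len(first) - 1}) = {t:.2f}, p = {p:.3f}")

    # Point-biserial correlation between the dichotomous order factor
    # (0 = first viewing, 1 = second) and the per-session counts,
    # analogous to the r-values reported in Table 12.
    order = np.concatenate([np.zeros_like(first), np.ones_like(second)])
    counts = np.concatenate([first, second])
    r, p_r = pointbiserialr(order, counts)
    print(f"r = {r:.2f}, p = {p_r:.3f}")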
Table 6
Protocol analysis categories for the usability study and sample comments

1. Quality of interpreter/actors presence and expression
   +ve: Good facial expression.
   -ve: The character looks mad but the interpreter is not showing it.
2. Interpretation quality (speed, novel signs)
   +ve: The interpreter keeps up well.
   -ve: That is the wrong sign.
3. Ease of using video interface (video control buttons)
   +ve: The reverse button is useful.
   -ve: How do I stop this thing?
4. Viewing preferences (position, border, size, proximity)
   +ve: I like being able to move the ASL window.
   -ve: There are too many choices.
5. Technical issues with video (synchronicity, visibility)
   +ve: Good lighting.
   -ve: The interpretation is not in synch with the video.
6. Content of video
   +ve: It makes sense.
   -ve: Boring.
7. Acted vs interpreted ASL content (comparison, busyness)
   Acted +ve: Easier to keep track with costumes.
   Acted -ve: Harder to follow two windows than one.
   Interpreted +ve: It is less work with fewer windows to watch.
   Interpreted -ve: Confusing to track who is speaking.
8. Closed captioning requested: You should have captions too.
9. Recommendation: They signed "situation." Details leading to the "situation" should be explained.
Table 7
Mixed two-way, single-measure intraclass correlation coefficient statistics for ASL law video protocol coding categories, not split for positive and negative utterances

Category    ICC(3,1)   Confidence interval
ASL         .9077      .8545–.9420
Technical   .9261      .8829–.9538
Format      .9189      .8718–.9492
Content     .8622      .7858–.9127

p < .001.
4. Discussion

The analysis of the results of these two studies provides some insight into the use of GTAP as a usability testing method that includes people who are gestural or sign language users. Specifically, the Solitaire game study was designed to examine differences between TAP data collected from spoken language users and gestural language users. In answering the first research question posed in this paper, regarding the relationship of gestural language to TAP outcomes, we begin to gain some understanding of the use of gestural TAP as a viable and inclusive usability method. Analysing the usability data gathered in the law video usability study demonstrates the feasibility of gestural TAP in an actual usability study. Some interpretations of these data are presented here to illustrate the treatment of such data and to answer our second research question, regarding the viability of the GTAP method in usability testing. Finally, new insights gained from the use of GTAP inform our research question on gestural language and the method of collection and analysis of protocol
data, and allow us to put forward recommendations for including gestural language users in usability studies.

4.1. The relationship of gestural language to TAP outcomes
As shown by the results of the Solitaire game study, the verbal protocol outcomes of ASL speakers and oral speakers seem very similar in a controlled and known task environment such as Solitaire. While there were some difference trends between ASL and oral speakers in the verbal protocol outcomes, there were no significant differences in any Wilcoxon test comparison of the means in any category. Therefore, we cannot reject the null hypothesis that there is no difference in verbal protocol outcomes between ASL and oral language speakers for this particular task. Even though the mode of communication for the two groups was quite different, no corresponding significant difference in the number and kind of comments was found.

The descriptive analysis shows that, except for the play and distracted categories, the number of comments made by the two groups in each category is remarkably close. For the distracted category, one person in the ASL group was concerned about a work issue with a subordinate and contributed 17 of the 23 comments. This one individual's contribution skewed the results to indicate a trend where there may not actually be one. Further subjects are required to determine the extent of any trend of ASL speakers making comments that are not related to the study task. The difference trend found in the play category is more interesting, and its cause is less obvious. Although the statistical analysis does not show any differences between ASL and oral speakers in this category,
Table 8
Mean (SD) for split category counts divided by interpretation type and viewing order

First viewing
Category      Interpreted (n=8)   Acted (n=5)     Total (n=13)
ASL +         1.38 (1.51)         1.40 (2.19)     1.38 (1.71)
ASL -         3.00 (3.38)         1.80 (1.48)     2.54 (2.79)
Format +      .00 (.00)           .00 (.00)       .00 (.00)
Format -      .38 (1.06)          2.40 (1.67)     1.15 (1.63)
Technical +   .00 (.00)           .00 (.00)       .00 (.00)
Technical -   1.50 (1.41)         3.20 (2.86)     2.15 (2.15)
Content +     .13 (.35)           .20 (.45)       .15 (.38)
Content -     .75 (1.75)          .80 (1.10)      .77 (1.48)
Caption       .38 (.74)           .00 (.00)       .23 (.60)
Recommend     2.25 (5.20)         .60 (.89)       1.62 (4.09)

Second viewing
Category      Interpreted (n=5)   Acted (n=8)     Total (n=13)
ASL +         1.40 (1.95)         2.62 (4.27)     2.15 (3.51)
ASL -         .40 (.55)           .63 (.92)       .54 (.78)
Format +      2.40 (2.19)         1.50 (1.77)     1.85 (1.91)
Format -      1.40 (1.34)         2.38 (2.07)     2.00 (1.83)
Technical +   .00 (.00)           1.00 (1.07)     .62 (.96)
Technical -   1.80 (1.30)         1.13 (1.89)     1.38 (1.66)
Content +     .00 (.00)           .13 (.35)       .08 (.28)
Content -     .00 (.00)           .38 (.52)       .23 (.44)
Caption       .00 (.00)           .00 (.00)       .00 (.00)
Recommend     .40 (.89)           1.88 (2.17)     1.31 (1.89)

Both viewings combined
Category      Interpreted (n=13)  Acted (n=13)    Total (n=26)
ASL +         1.38 (1.61)         2.15 (3.56)     1.77 (2.73)
ASL -         2.00 (2.92)         1.08 (1.26)     1.54 (2.25)
Format +      .92 (1.75)          .92 (1.55)      .92 (1.62)
Format -      .77 (1.24)          2.38 (1.85)     1.58 (1.75)
Technical +   .00 (.00)           .62 (.96)       .31 (.74)
Technical -   1.62 (1.33)         1.92 (2.43)     1.77 (1.92)
Content +     .08 (.28)           .15 (.38)       .12 (.33)
Content -     .46 (1.39)          .54 (.78)       .50 (1.10)
Caption       .23 (.60)           .00 (.00)       .12 (.43)
Recommend     1.54 (4.12)         1.38 (1.85)     1.46 (3.13)
Table 9
Significant Wilcoxon test results for category means grouped by either viewing order or interpretation type

Grouping        Category             Test statistic   p
Viewing order   Negative ASL         W = 130.5        .016
                Positive format      W = 117 (a)      .000
                Positive technical   W = 143          .015
Video           Negative format      W = 131.5        .017
                Positive technical   W = 143          .015

(a) Meets the more stringent alpha level (p < .01) applied to repeated measures tests.
there is a noteworthy difference trend. Oral speakers seem to make more comments than ASL speakers in this category (a mean of 52.9 comments for oral speakers compared with 27.7 for ASL speakers). However, the majority of these comments occurred at the beginning of the session, as the means at the minute-three sampling point, after game play had normalized, are equal (4.33 comments for each group). From a usability standpoint, the play category represents a kind of background "noise" category: a demonstration that the participant has some knowledge of how to use this type of interface, or of the details of what is contained in short-term memory. Similar kinds of utterances are found in the Tower of Hanoi protocols described by Ericsson and Simon (1984) and in the reading comprehension coding scheme examples developed by Green (1998), where some utterances are descriptions of moves or readings of statements rather than goals or processes. Reduced utterances in this category may indicate
that the participant is giving effort to expressing thoughts of greater complexity. In this particular study, ASL speakers seemed to engage in more conversation comments, particularly with the translator. This is a common practice with translators at the beginning of a translation session, in order to establish context, cultural preferences and rapport (Jones and Pullen, 1992).

Researchers have shown that the same regions of the brain are used for communication in the oral mode as in the gestural mode (McNeill, 1992; Corina, 1998; Green, 1998; MacSweeney, 1998). In addition, the literature also suggests that ASL, like English, is a fully developed language with grammar and syntax. The failure to find differences in the number and kind of comments between the two groups is therefore not that surprising, especially given the ability of the signing participants to manipulate the mouse, use the game interface and sign their thoughts at the same time. Since physical formation of the utterance is the primary difference between oral and gestural language, it might have been expected that this physical difference between the two modes of communication would also cause a difference in the number of comments made by the two groups. Indeed, the trend in the play category data may reflect this difference; however, when controlled for session length differences, the total number of all category utterances for each group is within 1 point of the other. Essentially, the two groups made the same number of utterances. It is perhaps not surprising, then, that no other significant differences were found between these two groups.

One factor that could have caused these results is that too few subjects were used. However, several steps were taken to increase the power of the analysis. First, only individuals who knew how to play Solitaire and had played the computer version before were selected, to reduce the possibility that comments were related to learning how to play the game, as learning was not one of the factors measured in this study.
Table 10
Issues identified in two randomly selected protocols for each interpretation mode

Participant A
  Acted ASL issues identified:
  1. The interpreter is not clear here
  2. That sign is unfamiliar
  3. They are too fast, and both are speaking at the same time
  4. The camera is too far away
  Standard ASL issues identified:
  1. Too dark
  2. Interpreter is too slow
  3. Camera is too far from the interpreter
  4. This section is better with one interpreter
  5. This section is better with two interpreters
  6. This part is confusing

Participant B
  Acted ASL issues identified:
  1. The interpretation and movie are not synchronized
  2. Both are speaking at the same time
  3. The popping in is difficult to follow
  4. This is boring
  Standard ASL issues identified:
  1. Closed captioning would make this easier
  2. This is boring
  3. The popping in and out is awkward
Table 11
Count of coded comments for each category

Category                                                         Acted   Interpreted
1. Quality of interpreter/actors presence and expression   +ve   19      9
                                                           -ve   4       8
2. Interpretation quality (speed, novel signs)             +ve   9       9
                                                           -ve   10      18
3. Ease of using video interface (video control buttons)   +ve   0       0
                                                           -ve   0       0
4. Viewing preferences (position, border, size, proximity) +ve   0       0
                                                           -ve   2       3
5. Technical issues with video (synchronicity, visibility) +ve   8       0
                                                           -ve   25      21
6. Content of video                                        +ve   2       1
                                                           -ve   7       6
7. Acted vs interpreted ASL content: Acted                 +ve   1       11
                                                           -ve   9       2
   Acted vs interpreted ASL content: Interpreted           +ve   11      1
                                                           -ve   20      5
8. Closed captioning requested                                   0       3
9. Recommendations                                               18      20
Total                                                            145     117

Fig. 3. Sample participant comments from law video usability sessions:
- They should have deaf actors do the whole thing
- Boring/long
- They should have put window on bottom like close-captioning
- I liked the straight caption better than the acted one
- Dark background makes it hard to read the interpreter
- I lose concentration trying to look up at the interpreter
- I would like to change the border colour.
- It would be good to be able to set the span of the frames to own setting [not predetermined setting].
- Why does the actor disappear? They should just freeze the frame.
- The person is still talking, why aren't they signing?
Also, all deaf participants were culturally deaf, not hard of hearing or hearing impaired, and all used ASL for communication. Next, the alpha level was adjusted so that the probability of a Type II error decreased. These considerations ensure a powerful test given the small sample size. Indeed, the analysis indicates no difference between the groups despite statistically optimizing the chances of finding a difference.

In the Solitaire study there may be a linguistic preference for certain general types of comments (e.g., it may be easier for ASL speakers to converse with the translator than to produce game-type comments). To examine this, two overarching categories are considered in the protocol analysis: game comments and non-game comments. The game category is composed of those coding categories that specifically refer to an aspect of playing the Solitaire game, such as play,
Table 12
Correlation matrix for split categories by order and interpretation

              Order                       Interpretation
Category      r-value   Sig. (2-tailed)   r-value   Sig. (2-tailed)
ASL +         .14       .48               .14       .48
ASL -         .45*      .02               .21       .32
Format +      .58*      .00               .00       1.0
Format -      .25       .22               .47*      .02
Technical +   .43*      .03               .43*      .03
Technical -   .20       .32               .08       .69

n = 26. *Correlation is significant at the .05 level (2-tailed).
strategy, cards and error. The non-game category is composed of the coding categories that relate to the situation, such as solitaire, distracted, TAP, technical, usability and conversation. When controlled for session length differences, oral language participants made, on average, 1.11 more game comments and only a fraction fewer non-game comments per session than did gestural language participants; these differences are not significant. Thus, it is likely that both groups were able to concentrate on the game equally well while carrying out the think aloud method. Gesture does not seem to impede the ability to play the game or the ability to produce protocols.

4.2. The viability of GTAP for usability evaluation
We found that using GTAP in a usability study context, the law video usability study, produced usability results that are viable and useful. This study was a particularly interesting testing ground for GTAP as an inclusive UEM because it involved the complex management and presentation of multiple video windows (one video without ASL and a second video with the ASL interpretation) as well as a user interface for customization of the presentation. The study involved presentation of a novel ASL interpretation format (acted interpretation) as well as introduction of new viewing preference choices. The video interpretation and customization formats were specifically designed for individuals who communicate with ASL. The following analysis provides an interpretation of the GTAP usability data collected for this study.

The experimental factors of interest for the law video usability study were viewing order and interpretation type. The coding categories arose from the questions being asked by developers and researchers. The greatest concern for the researchers was the participants' response to the Acted ASL interpretation as well as to the available viewing preferences (size, position, proximity and outline of the video boxes). Developers were further concerned with the viewer interface controls/buttons. Potential confounds for participant responses were identified as production/technical issues, such as lighting, and video content that might not interest all participants. The inter-rater reliability for all
categories is good, indicating that the categorization process and the resulting data are reliable.

In the law video usability study, there were a total of 262 comments generated by 17 participants (approximately 15 comments per participant) across all categories and sub-categories. The majority of comments appeared in the negative sub-category of technical issues (25 for the Acted version and 21 for the Standard version). The fewest comments were made in the positive sub-category of the content of video category (2 for Acted and 1 for Standard). There were three sub-categories in which no comments were generated for either the Acted or Standard version (positive viewing preferences, and negative and positive ease of using the video interface).

As demonstrated in Table 10, participants were able to identify a large number of issues with the interpretation modes. For example, from the randomly selected protocols of just two participants, seven different issues in the acted interpretation and nine different issues in the standard interpretation were identified. These two participants identified the same issue only once. The protocols of these two participants exemplify the variability between participant comments and the way that this variability benefits the usability process.

Differences in the interpretation type and viewing order factors were found in four categories (these differences approached significance at the .01 level with a Wilcoxon test). Correlation coefficients in Table 12 also show these relationships. For viewing order, negative ASL and positive technical approached significance (p = .016 and .015, respectively), and for video, negative format and positive technical approached significance (p = .017 and .015, respectively). The results shown in Table 8 indicate that, for the positive technical category, all comments were made during the second viewing and only when the Acted ASL video was played (there were no positive technical comments for the Standard video version in either order, nor were there positive technical comments for the Acted ASL video when it was the first video played). The technical category definition included anything that might be controlled in production, such as synchronizing the acted interpretation to the main video, lighting, interpreter position and costumes/clothing. The Acted ASL video violated more user expectations with regard to costumes, interpreter
position and synchronicity. Indeed, producers of the Acted ASL video deliberately broke the standard presentation style of ASL interpretation video in order to explore new methods of presenting interpretation of existing videos and their possible benefit to viewers. When viewers were in a position to compare the acted interpretation to a standard interpretation, they made both positive and negative comments about the technical aspects of the acted video. During the first viewing, when direct comparison between the acted interpretation and a standard interpretation was not available, no positive technical comments were made about the acted interpretation. Another possible cause of this result is a difference in the production quality of the two versions: the Acted ASL version may have been perceived as higher quality than the standard interpretation and therefore drew more favourable comments once the comparison was available.

Interestingly, at no time were positive technical comments made for the standard interpretation. However, this lack of favourable technical comments is not necessarily indicative of a low-quality production. Rather, it may be that the standard interpretation met the normal expectations of the participants, as it is the traditional style of presenting ASL-interpreted video material. Thus, the standard interpretation video did little to surprise the viewer and little to elicit positive comments regarding production. It is unlikely that a practice effect resulting from repetition of the video dialogue caused an increase in positive technical comments, since this effect was not shown for both video categories. This discrepancy indicates that comparison between the two videos is a likely contributor to the finding that positive technical comments were made only for Acted ASL videos shown after a Standard ASL video.

Participants produced more negative ASL comments during the first video session than the second. This difference may be attributed to practice effects. Both videos have the same script, so a viewer would be more likely to have a better understanding of the interpretation in the second session, when seeing the script signed for the second time. This "practice" of the script means that the participant would be less likely to notice, or have focus drawn to, aspects of the interpretation that would otherwise be difficult to understand.

The objective of using TAP in a software usability study was to gather rich data that shed light on how an individual is responding to and using the user interface. The concurrent nature of the data helped the researcher make connections between specific actions or aspects of the interface and thoughts voiced by the participant. In the usability study reported here, participants reported their thoughts with an average frequency that approached one per minute (.84/min). In addition, each verbal protocol yielded at least one coded comment, and the average number of comments per session was 8.4. Participants were only required to watch the video segments and were not instructed to make
any specific use of the user interface during the playing of the video. Standards for the optimum number of comments per session do not exist for the TAP method, and establishing a standard is beyond the scope of this study. However, it seems reasonable that the number of comments produced would vary from task to task, and the objective of our study was not to compare the frequency of comments between tasks but to show that comments would be generated. For the law video sessions, the production of 8.4 coded comments per session was substantial given the more passive role of the participant.

4.3. The relationship between gestural language and method of data collection
The results of our studies also seem to indicate that the TAP method requires very little adaptation to allow for the collection of gestural protocols. The greatest modification to the conventional TAP protocol is the addition of an interpreter (i.e., there are now two observers/experimenters involved). However, these modifications are relatively minor and do not cause much disruption to the flow of the protocol as long as the interpreter is adequately briefed and prepared.

In the Solitaire game study, the interpreter provided real-time translation during the study. This translation was recorded along with the participant's actions as they occurred. In the usability study, a more conventional usability approach was taken to data collection and recording: the participant's actions and comments were recorded together, and translation/transcription occurred after the study was completed. We found that having simultaneous translation during the actual study simplified the analysis process because it did not involve a two-step process. However, simultaneous translation has the potential to introduce slight time delays, because the translator must wait until enough has been said by the signer to correctly translate the syntax and grammatical structures. This may cause difficulties in studies where accurate time data for the beginning and ending of thoughts are required.

4.3.1. Preliminary guidelines for using GTAP
There are some simple steps that we used and would recommend to optimize collection of a gestural protocol. First, it is useful to explain the rationale of the TAP method to the interpreter. This step enables the interpreter to better assist the researcher in obtaining the desired results, since interpretation from English to ASL (and vice versa) is not a direct one-to-one mapping. Second, it is important to discuss the handling of speaking prompts with the interpreter and participant before data collection begins. The interpreter should understand the importance of using brief, neutral cues such as "thoughts?" and "keep talking" when prompting a participant who has fallen silent. Another way to handle prompting is to tell the participant that they will be
physically tapped when they fall silent for too long. In this way, the researcher may be positioned behind the participant, an ideal location from which to operate the camera, to observe the user and screen without being a visual distraction for the participant, and to prompt the participant with a shoulder tap when necessary. Third, the researcher should consider the importance of text interactions with the participant and remove them whenever possible. For example, it is customary to have hearing participants read a passage aloud as a warm-up for thinking aloud. For deaf participants, it may make more sense to practice signing thoughts while performing a simple task on or off the computer. Participants may also be asked to sign rote phrases such as counting or the alphabet. Fourth, some deaf individuals have residual hearing; thus, when an interface has an audio component, speakers and sound should be on. Finally, when taping the actual gestures of the participants, have a lab coat available for participants to wear. On video, and even in person, the ease of viewing signs is hampered by patterned tops. A covering such as a lab coat will create a neutral background, making the signs more visible to the interpreter and to anyone viewing the video of the gestural protocol. Understanding and attending to these variables, and how they affect the gestural protocol, are keys to using the method optimally.

Gestural TAP is laid on the foundation of research that supports TAP as a valid and reliable UEM. The collection of gestural protocols requires only minimal change to the TAP method: an interpreter and some consideration of hand requirements during the task. Equipment, coding and analysis requirements are virtually unchanged when concurrent interpretation of the utterances is collected. This minor change means that the method may be readily applied by field practitioners, particularly as replications of this research further support the similarity between spoken and gestural protocols. Gestural TAP enables inclusive use of TAP, the most effective UEM available (Corina and McBurney, 2001). These methods are important to enable developers to meet inclusive technology mandates and to help foster an environment of universal design. There is ample evidence that retrofitting environments, whether physical or digital, to accommodate individuals with special needs is more costly than building inclusive environments in the first place. Developers, however, cannot effectively meet the needs of disabled users unless they include these users in their usability evaluations. This GTAP research, and research on other inclusive UEMs, is urgently needed so that developers and usability engineers are equipped with methods that have been tested and refined.

This study began with the research question: are the outcomes of ASL speakers comparable to those of English speakers? Certainly, the failure to find significant differences between the groups on most of the variables lends support to the idea that oral and gestural verbal protocols are similar, especially given the steps taken to increase statistical power and precision.
Acknowledgements Funding for this project, Creating Barrier-Free, Broadband Learning Environments Project #34, is provided by the E-Learning program of CANARIE. The authors wish to acknowledge gratefully Earl Woodruff and Richard Volpe of the University of Toronto for their assistance with this project, the Canadian Hearing Society for assistance in recruiting participants, and all of the participants who gave up their valuable time to participate in our studies.
References

Andrews, J.F., Mason, J.M., 1991. Strategy usage among deaf and hearing readers. Exceptional Children 57, 536–545.
Bates, E., Dick, F., 2002. Language, gesture, and the developing brain (Special issue: Converging method approach to the study of developmental science). Developmental Psychobiology 40 (3), 293–310.
Bellugi, U., Klima, E.S., Siple, P., 1974. Remembering in signs. Cognition 3 (2), 93–125.
Campbell, R., Wright, H., 1990. Deafness and immediate memory for pictures: dissociations between "inner speech" and the "inner ear"? Journal of Experimental Child Psychology 50 (2), 259–286.
Corina, D.P., 1998. Sign language aphasia. In: Coppens, P. (Ed.), Aphasia in Atypical Populations. Erlbaum, Hillsdale, NJ, pp. 261–309.
Corina, D.P., McBurney, S.L., 2001. The neural representation of language in users of American Sign Language. Journal of Communication Disorders 34 (6), 455–471.
Ericsson, K.A., Simon, H.A., 1984. Protocol Analysis: Verbal Reports as Data. MIT Press, Cambridge, MA.
Ericsson, K.A., Simon, H.A., 1998. How to study thinking in everyday life: contrasting think-aloud protocols with descriptions and explanations of thinking. Mind, Culture, & Activity 5 (3), 178–186.
Green, A., 1998. Verbal Protocol Analysis in Language Testing Research: A Handbook. Cambridge University Press, Cambridge, UK.
Henderson, R., Podd, J., Smith, M., Varela-Alvarez, H., 1995. An examination of four user-based software evaluation methods. Interacting with Computers 7 (4), 412–432.
Jones, L., Pullen, G., 1992. Cultural differences: deaf and hearing researchers working together. Disability & Society 7 (2), 189–196.
Kuusela, H., Paul, P., 2000. A comparison of concurrent and retrospective verbal protocol analysis. American Journal of Psychology 113 (3), 387–404.
MacSweeney, M., 1998. Cognition and deafness. In: Gregory, S. (Ed.), Issues in Deaf Education. D. Fulton Publishers, London.
McNeill, D., 1992. Hand and Mind: What Gestures Reveal About Thought. University of Chicago Press, Chicago.
Messing, L.S., 1999. An introduction to signed languages. In: Messing, L.S., Campbell, R. (Eds.), Gesture, Speech and Sign. Oxford University Press, Oxford, UK.
Newell, A., Simon, H.A., 1972. Human Problem Solving. Prentice-Hall, Englewood Cliffs, NJ.
Nisbett, R.E., Wilson, T.D., 1977. Telling more than we can know: verbal reports on mental processes. Psychological Review 84 (3), 231–259.
Schirmer, B.R., 2000. Language and Literacy Development in Children Who are Deaf, second ed. Allyn and Bacon, Boston.
Schirmer, B.R., 2003. Using verbal protocols to identify the reading strategies of students who are deaf. Journal of Deaf Studies & Deaf Education 8 (2), 157–170.
Stevens, J., 1996. Applied Multivariate Statistics for the Social Sciences, third ed. Lawrence Erlbaum Associates, Mahwah, NJ.
Stokoe, W.C., 2001. The study and use of sign language. Sign Language Studies 1 (4), 369–406.
Van Someren, M.W., Barnard, Y.F., Sandberg, J., 1994. The Think Aloud Method: A Practical Guide to Modelling Cognitive Processes. Academic Press, London.
Wiedenbeck, S., Lampert, R., Scholtz, J., 1989. Using protocol analysis to study the user interface. Bulletin of the American Society for Information Science, June/July, 25–26.
Wilson, M., Emmorey, K., 1998. A "word length effect" for sign language: further evidence for the role of language in structuring working memory. Memory & Cognition 26 (3), 584–590.
Wilson, M., Emmorey, K., 2003. The effect of irrelevant visual input on working memory for sign language. Journal of Deaf Studies & Deaf Education 8 (2), 97–103.