Embodied Programming
Erik Pukinskis <[email protected]>
UC San Diego, Cognitive Science
Abstract
Past research into the nature of programming has focused on programmers as information processors, but there has recently been a move in Cognitive Science towards an embodied view of cognition. This work presents results of a naturalistic observation of student pair-programmers which show that programming is, in fact, an embodied activity. Analysis shows that the body is used to enact code and to carry out complex pointing simultaneously in multiple semiotic fields, and that programmers use the different modes afforded by the mouse cursor and the selection to perform a wide variety of points.
Introduction
What is programming? We know that it is carried out primarily by highly educated or skilled programmers, and that it results in the software systems that run everything from the pacemakers that keep our hearts beating to the 17-mile-long Large Hadron Collider. But how exactly does programming happen? What are the cognitive systems at work in the construction of software? Studies of programming have been conducted, but they have typically treated programming as an information processing activity. Ko and Myers (2005) developed a model of software errors which breaks software error production down into four layers: the specifications, the programmer, the programming system, and the program. They call errors in the programmer layer "cognitive breakdowns," which can be attributed to problems with knowledge, attention, and strategies. The types of breakdowns they describe range from issues with "ambiguous signs" to programmers
choosing the wrong rule to apply or using a "faulty model." This way of describing cognition falls firmly within the "information processing" model, a rich model which certainly bears some fruit, and Ko and Myers arrive at many valuable insights about programming. However, in recent years there has been increasing interest within Cognitive Science in the body, and Ko and Myers ignore the body almost entirely. The basic embodiment hypothesis is that the body plays a critical role in cognition, but there are as many formulations of the "embodiment" hypothesis as there are researchers interested in the phenomenon. These range from the notion that our language and cognition are grounded in embodied metaphors (Lakoff and Johnson, 1980) to more radical claims that much of actual online cognition occurs outside the head (Hutchins, 1995; 2007). Many other cognitive scientists (Gibbs, 2006; Noë, 2004; Hurley, 1998; Lakoff & Núñez, 2000) have pushed the notion that the body is an important part of the cognitive apparatus.

In addition, Andy Clark (2007) has proposed that the embodiment hypothesis is particularly applicable to Human-Computer Interaction, because of what he calls "radical embodiment." Clark cites several recent pieces of research which support this notion. Maravita and Iriki (2004) performed a set of experiments recording from bimodal neurons that respond both to activity in a somatosensory receptive field (sRF) and to a visual receptive field (vRF) adjacent to the sRF, both of which remain anchored to the body as it moves. Trials showed two distinct types of bimodal cells, both of which show an expansion of the vRF following tool use. The vRF of the "distal" type initially included only the hand, but after the monkey learned to use a rake in a reaching task it expanded to include the length of the tool. The "proximal" type cells responded initially to the area within reach of the body, and expanded to include the area within reach of the tool. Further experiments involved training in a virtual reality situation, where the monkeys were prevented from viewing their hand or the rake directly, instead viewing them via a video monitor. In this situation the vRF shifted to include the on-screen representations of the hand and the tool. The explanation Maravita and Iriki offer is that the body schema is being reused for the tool and tool activity.
In addition, there is quite a bit of research supporting the idea that the body plays a critical role in cognitive activities. Hutchins (2007) describes, by way of Spivey (2007), a finding from Glucksberg (1964). Glucksberg's study required participants to find a clever solution for mounting a candle on a wall using a box of tacks. What is noteworthy about this study is that participants were often observed handling the box immediately before they realized that the solution was to use the box itself to hang the candle, suggesting that the physical handling of the box facilitated the insight. And Hutchins himself (2007) describes a situation in which a navy shipman's hands do cognitive work that allows him to discover a missing term in a calculation. Indeed, the notion that cognition is offloaded onto external representations is not new (Scaife and Rogers, 1996; Hutchins, 1995), and there is no reason why the body could not be used for such a purpose. It seems likely that the body is recruited in programming; the big question is how.
Method
The study presented here utilizes naturalistic observation techniques borrowed from the practice of cognitive ethnography. A programming class at UC San Diego was identified as a potential source of data. The class is specifically focused on debugging. Students are assigned readings from a software development text each week. In class they take a quiz on the readings and are given a short lecture. After the lecture, they form pairs and are given a small, self-contained debugging project: 15 source files and a brief description of the desired behavior of the application. The source files contain several bugs which must be fixed within the roughly 90 minutes remaining in the class period. At the end of the period, students turn in the fixed code, along with a log of their debugging activities. They are encouraged to use a scientific method, forming and testing hypotheses. Several instructors were on hand to answer questions, and would often interrupt students to give advice.
Generally, the students observed were novice programmers, this typically being only their second programming course. In addition, students find new partners each class period, so each session is a chance to work with someone they have never worked with before. Data was captured from 8 different pairs in 8 different sessions, each lasting 40-80 minutes. Three video streams were captured (Figure 1). Two were from digital camcorders: one with a wide-angle lens pointed at the programmers' bodies and faces from behind their monitor, and another pointed at the monitor to capture interaction between the participants' hands and the screen. The third video stream was a full-resolution digital capture of the desktop activity, including mouse activity. A directional microphone recorded high-quality audio, a necessity in a noisy classroom.

Results were analyzed using microanalysis. Two of the eight video sessions received full analysis; because of problems with the desktop capture, full analysis was not possible on the other data. The analysis process began with the author watching the videos and adding subtitles where the audio quality was too poor for easy listening. The videos were then watched through completely while creating an index of participants' goals, strategies, and the gestures and speech that seemed strongly correlated with the identification and solution of debugging problems. Once these indices were created, key moments showing examples of embodiment were analyzed in detail, frame by frame, in an attempt to identify the function of embodiment. To facilitate this kind of detailed analysis, a new video analysis tool called "3stream" (Figure 1) was built to allow all three video streams to be watched simultaneously. 3stream allows the analyst to loop over key segments, scrub back and forth over all three video streams at once, and step forward and backward through the streams. In addition, facilities are provided to adjust the synchronization of the streams. Such a tool is necessary because the story of a given embodied activity is not fully present in any one stream. Watching the desktop capture alone will not allow you to understand what is being gestured at with the mouse, because the analyst cannot tell what the programmers are saying.
Similarly, watching the video of the participants will not reveal what they are looking at, so it is often impossible to know what they are talking about. Speech, gesture, and mouse activity are tightly coupled, as will be shown below, which is why the frame-by-frame simultaneous analysis facilitated by 3stream is so critical.
Results
While much of the observed activity involved very limited movement, often just eye movements and scrolling, there were several instances of dramatic embodied activity. This paper will focus on two kinds of activity: code enactment and pointing.

Code Enactment

Figure 2 shows a series of movements made by a programmer who was debugging a Tetris game. The programmer and his partner were attempting to debug a faulty loop in the program, and there was some confusion about the matrix being operated on. The participant on the right asks at several points whether the problem is that the array indices appear in different orders through the program, sometimes in the form [x][y] and sometimes in the form [y][x]. Up until this point, their discussion never got past the suggestion that this might be a problem. At the beginning of the sequence, in Figure 2b, as the programmer says "going through the x position," he draws out a line on the table, creating a mapping between the x position and that space on the table. He then returns his hand to the mouse and begins reading code off of the screen: "less than width... x plus plus." He then gestures back and forth near the left side of the axis he drew with his hand while saying "cause it's within the border," suggesting he is still using the same mapping (Figure 2c). Then, with his hand formed into a claw shape, he retraces the same line he traced in Figure 2b, although drawing a somewhat shorter line (Figure 2d). He makes a little hop, and then stretches his hand
out into a broad stance along the line, with his pinky at his far right and his pointer at his far left (Figure 2f). Next he brings the middle finger of his opposite hand up next to his pointer (Figure 2g), and while saying "plus plus... plus plus..." moves his pointer finger towards his pinky in two small movements, timed with each "plus plus" (Figure 2h-i). At this point, the programmer is able to state more confidently: "So it starts with... probably starts off with zero, goes to width." (Figure 2j)

Discussion
The correlation between movement and speech seems to show clearly that the movements are meaningfully connected to the participant's cognitive activity, but there is an open question as to whether the programmer's body is doing cognitive work, or whether the movement is merely a side effect of his internal cognitive processes. Certainly, I am not the first to suggest that such bodily activities are doing meaningful work: the Glucksberg (1964) result reported earlier and Hutchins' (2007) analysis of the shipman's use of his hands are pre-existing examples. But a slightly deeper analysis is in order. Hutchins uses "the enactments of external representations habitually performed by practitioners who live and work in complex culturally constituted settings" to explain his navigator's "aha moment," and I think we see something similar in the example presented here. It is plainly obvious in any recording of programmers that they habitually read code, and it is shown here that they sometimes act it out. These are cultural practices, and the events described above constitute an example of these cultural practices coming together and resulting in the programmer being able to make a claim about the boundary conditions of the loop. It is possible that the enactment was not strictly necessary for the programmer to reach this conclusion, but the fact remains that he did not reach the conclusion until after he had performed the enactment.
And certainly there appear to be constraints that the programmer leans on. The importance of constraints in cognitive offloading is described well in Scaife and Rogers (1996): constraints limit the amount of work that must be done inside the head by creating impossibilities outside the head. Gibbs (2006) claims that a form of this happens in the body, suggesting that the body can be used to create stable multimodal enactions. In our case, the enactment is stable in many ways: the movement of his pointer finger is confined by the limits of his tendons. He cannot move it further to the left than its initial position, and he cannot move it further to the right than his pinky. This constraint neatly mirrors the constraint of the edges of a fixed-size array. He even goes to the length of bringing the middle finger of his other hand up next to his pointer to maintain the stability of this constraint.
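To make the mapping concrete, here is a minimal sketch of the kind of loop being enacted. The actual source the pair was debugging was not retained, so this is only a reconstruction from the speech and the caption of Figure 2, written in Java; the Board class, its fields, and the row-first [y][x] layout are assumptions, while the bounds ("starts off with zero, goes to width"), the "x plus plus" increments, and the name isLineFull come from the data.

    // Hypothetical reconstruction of the kind of loop the programmer enacts.
    // Only "x = 0", "x < width", "x++", and the name isLineFull come from the
    // data; the Board class, its fields, and the [row][column] layout are assumed.
    public class Board {
        private final int width;
        private final int height;
        private final boolean[][] filled; // indexed as filled[y][x]: row first, then column

        public Board(int width, int height) {
            this.width = width;
            this.height = height;
            this.filled = new boolean[height][width];
        }

        // The enacted loop: x "starts off with zero, goes to width". The
        // programmer's pointer finger marks x = 0, his pinky marks the far
        // edge of the row, and each "plus plus" is one step of x++.
        public boolean isLineFull(int y) {
            for (int x = 0; x < width; x++) {
                // Writing filled[x][y] here instead of filled[y][x] would be
                // exactly the [x][y] vs. [y][x] confusion the pair kept raising.
                if (!filled[y][x]) {
                    return false;
                }
            }
            return true;
        }
    }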
Pointing

A second example of embodied activity, one which was observed very frequently, was pointing. In one particular example, participants were debugging a loop. The faulty condition, in pseudocode, read: "if something AND NOT something else THEN return false." In the correct solution, however, it reads: "if something OR something else THEN return true." Figure 3 shows a few seconds of mousing activity coupled with speech from the pair. The details of the dialog are not nearly as important here as the structure of the mousing and the way it couples with speech. First, at a minimum four different kinds of points are shown: mouse hovering, clicking, selecting, and scrolling. Second, these different kinds of points are carefully co-timed with speech, suggesting real cognitive significance.
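The paper states only the pseudocode above; the following is a minimal sketch rendering that buggy condition and its correction in Java. The class and the two boolean fields are placeholders standing in for "something" and "something else" and are not the pair's actual code.

    // A minimal sketch of the structural change described above. The fields
    // are placeholders for the paper's "something" and "something else".
    class ConditionSketch {
        boolean something;
        boolean somethingElse;

        // Buggy form: "if something AND NOT something else THEN return false"
        boolean buggy() {
            if (something && !somethingElse) {
                return false;
            }
            return true;
        }

        // Corrected form: "if something OR something else THEN return true"
        boolean corrected() {
            if (something || somethingElse) {
                return true;
            }
            return false;
        }
    }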
Discussion

Again, this is not the first work to suggest that such gestures are doing real cognitive work. Casasanto (in press) provides experimental evidence that "motor programs, themselves, are the 'active ingredient' in the cognitive function of gesture." Carlson et al. (2007) provide evidence that the hands externalize cognitive work while being used to solve math problems. So it seems entirely possible that real cognitive work is being done here, but what exactly is happening? Two things are noteworthy about the kind of pointing observed in the present study. First, it appears to be closely related to what Chuck Goodwin describes in his 2003 paper on pointing. He describes the way we use various tools and body parts simultaneously on multiple semiotic fields to accomplish pointing. In his example, someone points with a trowel at a map, but combines that gesture with a head nod towards a nearby space to establish the target of the point. We seem to be seeing something similar here: the programmer indicates both the code and the output, using multiple semiotic fields to triangulate a target for pointing. Second, it is common knowledge that people point at things with the mouse cursor, but what this data reveals is that the picture is far more complex than just one kind of indicating. This user, in the span of perhaps two sentences, uses at least four different kinds of points: clicking, hovering, selecting (with double click and with drag), and scrolling. The present data is not rich enough to say what the different functions of these modes might be, but it is not hard to speculate. Clicking creates a sound which integrates tightly with speech. Selection can indicate a range, where clicking indicates a single point. A hover gesture can indicate motion and can create icons, where a selection can only indicate a range. These affordances are extremely varied.
Conclusion
We guessed that programming was embodied, and that has been borne out. It seems likely that these embodied activities are doing real cognitive work, and there are some unique properties to human-computer embodiment in particular. As humans we opportunistically use whatever
structures we can to carry out our enactments, and the computer provides a very specific set of structures. It has been shown here that these structures and their uses are more varied than we might have guessed. These insights can and should influence how we design programming systems, and indeed computer interfaces in general. I have shown that people opportunistically use digital structures like the cursor and the selection in creative ways, but these structures were not designed as multimodal tools for pointing across multiple semiotic fields. They were designed to activate buttons and to create selections. This mismatch may be one of the things making programming (and perhaps computer use in general) so difficult. We can imagine digital structures designed to scaffold embodied highlighting and enaction activities. For example, a selection that can more flexibly select structures at different levels of detail might support pointing better (a minimal sketch of such a selection follows below), and data structures that can be grabbed and animated like puppets would allow for richer enactments. By acknowledging in our design process the fact that programming is an embodied activity, we may be able to create a new kind of programming tool which provides a much more natural, easier kind of programming.
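To make the first of these proposals slightly more concrete, here is a minimal sketch, in Java, of what a structure-aware selection might look like. It is purely a design illustration under assumed names; the Node and StructuralSelection types do not correspond to any existing editor.

    // A sketch of a selection that can point at program structure at different
    // levels of detail. Each node records its text range and its parent, so a
    // single anchor can be "grown" from token to expression to statement to
    // method to class. All names here are assumptions for illustration.
    class StructuralSelection {
        // A node in a hypothetical syntax tree: [start, end) character offsets.
        record Node(String kind, int start, int end, Node parent) {}

        private Node selected;

        StructuralSelection(Node initial) {
            this.selected = initial;
        }

        // Expand the selection to the next enclosing structure, e.g. from the
        // token "width" to the expression "x < width" to the whole for loop.
        void expand() {
            if (selected.parent() != null) {
                selected = selected.parent();
            }
        }

        Node current() {
            return selected;
        }
    }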
References

Carlson, R. A., Avraamides, M. N., Cary, M., & Strasberg, S. (2007). What do the hands externalize in simple arithmetic? Journal of Experimental Psychology: Learning, Memory, and Cognition, 33(4), 747-756.
Casasanto, D. (in press). The cognitive function of metaphorical gestures.
Clark, A. (1999). An embodied cognitive science? Trends in Cognitive Sciences, 3(9), 345-351.
Clark, A. (2007). Reinventing ourselves: The plasticity of embodiment, sensing, and mind. Journal of Medicine and Philosophy, 32(3), 263-282.
Gibbs, R. W. (2006). Embodiment and Cognitive Science. New York: Cambridge University Press.
Glucksberg, S. (1964). Functional fixedness: Problem solution as a function of observing responses. Psychonomic Science, 1, 117-118.
Goodwin, C. (2003). Pointing as situated practice. In S. Kita (Ed.), Pointing: Where Language, Culture, and Cognition Meet (pp. 217-241). Hillsdale, NJ: Lawrence Erlbaum Associates.
Hurley, S. (1998). Consciousness in Action. Cambridge, MA: Harvard University Press.
Hutchins, E. (1995). Cognition in the Wild. Cambridge, MA: MIT Press.
Hutchins, E. (2007). Enaction, imagination, and insight. (In press.)
Ko, A. J., & Myers, B. A. (2005). A framework and methodology for studying the causes of software errors in programming systems. Journal of Visual Languages and Computing, 16(1-2), 41-84.
Lakoff, G., & Johnson, M. (1980). Metaphors We Live By. Chicago: University of Chicago Press.
Lakoff, G., & Núñez, R. (2000). Where Mathematics Comes From: How the Embodied Mind Brings Mathematics into Being. New York: Basic Books.
Noë, A. (2004). Action in Perception. Cambridge, MA: MIT Press.
Scaife, M., & Rogers, Y. (1996). External cognition: How do graphical representations work? International Journal of Human-Computer Studies, 45, 185-213.
Spivey, M. (2007). The Continuity of Mind. Oxford: Oxford University Press.
Figure 1: The 3stream video analysis tool. Multiple video streams can be watched simultaneously, with scrubbing and single-frame stepping features, the ability to control audio playback independently on each stream, and a built-in synchronization tool.
Figure 2: Hand movements of participant B while talking about the for loop in the isLineFull method. [Panels a-j: video stills with timestamps (22:58:20-23:20:20) and co-occurring speech, including "so…", "going through the x position", "uh… so… less than width… x plus plus… Y should be ok. Y is ok…", "cause it's within the border.", "uh… x… equals zero.", "returns false", "plus plus", "plus plus", and "So it starts off… probably starts off with zero, goes to width."]
Figure 3: Indicating with clicking, selecting, hovering, and scrolling. Numbered marks (1-13) indicate successive mouse actions co-timed with speech.
A: so both of these are either true or false so changing this (1) won't really make a difference. (2)
B: no, this should be true
A: oh, this right here? (3) (4) and then this should be an OR? (5)
B: let me think.
A: because this one doesn't really matter cuz this is false, right? (6)
B: well it can't be both inside and empty [5:12]
A: but if this is true and (7) (8) this is false then it (9) will stop all the way up here. (10) (11)
B: why
A: because it's false false (12) (13)
Figure 4: Images taken at the same moment of a programmer simultaneously pointing with his hand, by resting it on the display bezel next to a region of interest, while moving the mouse pointer over the same area. The typing caret is also close by.
Figure 5: Two instances (separated by many minutes) of programmers averting their eyes from the workspace while attempting to solve a puzzle. In both cases their partners appear not to be attending to their gaze, suggesting the aversion of the eyes aids thinking rather than communication.