Author Variation In Mobile Phone Texting

  • April 2020

Preliminary observations on author variation in mobile phone texting

by John Olsson, Director of the Forensic Linguistics Institute, UK
www.thetext.co.uk
[email protected]

Abstract

This non-statistical study considers the degree of individual author variation in mobile phone text authorship. Fifty-three texters donated 950 texts to a corpus. Each author was tested for the degree of variation displayed with regard to corpus-wide polymorphs such as ‘u’, ‘you’, ‘yew’. It was found that 34 of the 53 texters exhibited some degree of polymorphism, ranging from 2 per cent to 52 per cent (average 9 per cent). Although the number of words which occur as polymorphs is relatively low in terms of the total lexicon, about 5 per cent, polymorphs account for approximately 20 per cent of all tokens used. Although the degree of variation per texter is not high per se, it can, in some instances, be sufficient to cause problems in forensics. It is therefore crucial that forensic workers carry out prior variation testing on candidates in mobile phone text inquiries. The study found that, of all the socio-groups observed, young females exhibit more variation than any other group, and that variation decreases with age. It is also observed that author variation is often ignored by forensic linguists. Recent work which scarcely considers the topic is cited. The findings from this preliminary study will be tested against further, larger studies based on corpora currently being built.

Introduction

The purpose of this preliminary study is to consider the question of author variation in the context of mobile (or cell) phone texting. Specifically, the study examines the extent to which the individual mobile phone texter varies with regard to style. For a number of reasons (including the informality of the phone text medium, the generally private nature of the content, the fact that even the best laid out phone keypads are not always easy to use and, among some social groups, the motivation to use an in-group style) many texters use a variety of abbreviations and other non-conventional forms in place of standard language tokens.

An important question in the forensic context is the extent to which individual texters are consistent in their output. If it is discovered that texters are highly consistent with regard to the forms they use, this would give forensic authorship analysis of mobile phone texts great credence in courts. If, however, it turns out that texters are inconsistent with regard to output, then forensic authorship analysis could prove problematic in some circumstances.

It should be noted that the corpus is relatively small, with some contributors having donated a disproportionately large number of texts. For this reason some individual texters will in all likelihood be exerting an undue influence on the polymorph population and to some extent on variation. One solution to this might have been to restrict each contributor to a fixed number of texts. This, however, would have meant the sacrifice of much valuable data. The long-term answer will probably simply be the expansion of the corpus to the point where the capacity for any one individual to influence the corpus is restricted. That is why this is a ‘provisional’ analysis, a work in progress.

It should perhaps be emphasised at this point that the study is about author variation, not authorship testing per se. Questions relating specifically to authorship identification will be addressed in later studies. Finally, it is worth remarking that this is not a statistical study, although quantitative data are employed. However, interested readers may wish to take the data found in the table and perform their own calculations.

Mobile phone texting

In order to send text messages the mobile phone texter (hereafter ‘texter’) selects the message function on the handset, and then enters a message using the alphanumeric keypad. On most phones there are three or four characters for each numeric key.
For example, the number 2 key typically allows for entry of the letters ‘A’, ‘B’ and ‘C’. For the first letter in the sequence only one press is required to return the desired letter, for the second two presses are required, while for the third three presses are required. There is a space bar to enter spaces between words and there is usually a specific key which, if pressed in sequence, will yield a given punctuation mark, such as a comma, question mark and so on. Additionally, the texter can select the upper case option for proper names while, typically, sentence-initial letters automatically default to the upper case.

Most phones have the facility for dictionary building using the predictive text mode. In this mode a pre-entered dictionary item, such as ‘you are’, would be yielded if the texter entered a pre-selected sequence, such as ‘u r’. Given the relative difficulty of adding items to the inbuilt dictionary and/or the labour involved in learning in-built configurations, not to mention the wide range of items required to make such a dictionary effective, few users opt for the predictive texting mode. In fact, of the 53 participants in the present study none had selected predictive mode and, as far as the author is aware, no authorship cases involving predictive mode texting have been presented to the courts in the UK, though this may not be true of other jurisdictions.

For the reasons previously stated many users opt for abbreviated forms, such as ‘u’ for ‘you’, ‘l8’ for ‘late’, ‘b’ for ‘be’ and so on. For many language tokens several options are available. For example, in the present study four forms of the word ‘about’ were presented in addition to its standard form: ‘abt’, ‘bout’, ‘bowt’ and ‘bt’, which was also one of the forms for ‘but’. The word ‘and’ attracted four forms in addition to the regular form, and these were ‘&’, ‘n’ (which also occurred in place of ‘in’), ‘nd’ and ‘un’. ‘Because’ occurred in six different ways, never once in its conventional form: ‘cause’, ‘caz’, ‘cos’, ‘coz’, ‘cuz’ and ‘cz’. ‘You’ (including with its various contracted auxiliaries) occurs in exactly 20 different guises, including ‘u’, ‘ya’, ‘yew’, ‘yo’ (also a greeting), and ‘yhoo’.
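The keypress economy behind such forms can be illustrated with a small sketch. It assumes the common ITU-T E.161 letter layout (2 = abc through 9 = wxyz) and counts only letter presses; the names `KEYPAD`, `PRESSES` and `keypresses` are illustrative and not part of the study, and pauses between letters on the same key, digits and punctuation are ignored.

```python
# Assumption: the common ITU-T E.161 letter layout (2=abc ... 9=wxyz).
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}

# Presses needed for each letter: its 1-based position on its key.
PRESSES = {
    letter: position
    for letters in KEYPAD.values()
    for position, letter in enumerate(letters, start=1)
}


def keypresses(word: str) -> int:
    """Count multi-tap presses for the letters of `word`.

    Characters that are not letters (digits, punctuation) contribute 0.
    """
    return sum(PRESSES.get(ch, 0) for ch in word.lower())


# Compare the standard form of 'you' with some of the forms found in the corpus.
for form in ("you", "u", "ya", "yew", "yhoo"):
    print(f"{form:5} {keypresses(form)} presses")
```

On this count ‘u’ needs far fewer presses than ‘you’, whereas ‘yhoo’ needs more, which is consistent with the observation that not every non-standard form is an abbreviation.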
Some of the forms listed above are not actually abbreviations and may be occurring for in-group or stylistic reasons.

The study

Mobile phone texts were obtained over a period of two months from 53 texters in the vicinity of the author’s workplace and elsewhere. Participants were aware that a study of mobile phone texts was being undertaken for forensic purposes. A phone number was made available for people to forward their texts to, while some visited the author’s premises and allowed an upload of material onto the author’s computer, using various types of software. Several sets of texts were presented in hard copy format. Participants were requested to send or bring only texts which had been sent prior to the request for texts and which were less than twelve months old.

Above, the reader will have observed that I have titled this analysis ‘Preliminary observations on author variation in mobile phone texting’. My reason for this is simple: my colleagues and I are in the process of collecting a larger corpus of mobile phone texts. We have yet to collate and anonymise these. On completion of this work we will be able to determine to what extent any conclusions in this report can be more generally applied. At that stage a further study will take place with, it is intended, considerably more detail than that provided in the present instance.

The participants

The participants ranged in age from 11 (with parental permission) to 70 years of age. 16 participants were aged 25 or less, 17 were between 26 and 40, a further 12 were between 41 and 50 years of age, and the remaining 8 were between 51 and 70 (all ages inclusive). 33 of the participants were female and 20 were male. There was a variety of educational and social levels among the participants. All except two participants were English native speaker British Caucasians; of the remaining two, one was an Irish Caucasian (English native speaker) and the other a Chinese (Han) national (putong hua speaker). Participants were informed that they could anonymise their texts or, alternatively, that this would be done on their behalf.

The texts

Texters varied greatly in the number of texts they donated, with several donating just one text while one texter donated 104 texts. The average number of texts donated was 17.92. In all, 950 texts were received. Once received, those texts which had not been anonymised were carefully redacted to protect the identities and locations of the participants and their text interactants.
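A redaction pass of this kind, including the replacement of phone numbers by a fixed string of zeros, might be sketched as follows. This is only an illustration: the pattern assumes that a UK number is written as eleven digits beginning with 0, optionally broken up by single spaces, and the names `UK_NUMBER` and `redact_numbers` are hypothetical. Names and locations would still need to be redacted by hand.

```python
import re

# Assumption: a UK number is 11 digits starting with 0, optionally
# separated by single spaces (e.g. "07700 900123").
UK_NUMBER = re.compile(r"\b0(?:\s?\d){10}\b")


def redact_numbers(text: str) -> str:
    """Replace every matched phone number with eleven zeros."""
    return UK_NUMBER.sub("0" * 11, text)


print(redact_numbers("cal me on 07700 900123 l8r"))
```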
All phone numbers were replaced by eleven zeros (eleven being the number of digits in a UK telephone number, whether terrestrial or mobile).

The motivation for the present study

The author has been researching variation in written language for some years, but the motivation for the present study arose from a discussion in which the following words were used by a colleague with reference to the language of mobile phone texting: “It is not unusual to find variation in the same person’s language choices… but…for most items most people use only one form most of the time”. This comment struck me as being worthy of further analysis, especially as the colleague and I both have an interest in forensic matters.

‘For most items most people use only one form most of the time’

Firstly, of course it is the case that ‘most’ items will be realised in ‘only one form most of the time’. It would be far too labour and memory intensive for ‘most people’ to produce several forms of most of the language items they used. In all likelihood the extent of both producer and receiver processing in such a case would defeat the goal of ‘instant’ communication.

Secondly, it is a surprise that this phrase [1] occurred at all in the context of forensic matters. At a purely practical level, it is meaningless: ‘most’ can mean as little as ‘fifty one percent’. Taking the statement to its logical conclusion, we could quite reasonably interpret its meaning as ‘fifty one percent of people use only one form fifty one percent of the time’. As the reader will appreciate, this equates to something like twenty five percent by the multiplication rule. Even if, however, my colleague actually meant something like ‘ninety percent of people use only one form ninety percent of the time’, this is still not a forensically satisfactory statement, since (again, by the multiplication rule) this would equate, informally, to something approximating 80 percent consistency. Few courts would be prepared to convict if such a low percentage were claimed for DNA or other physical evidence and, as far as I know, nobody has yet built software which is able to (statistically) attribute mobile phone text authorship to even this level, if indeed such software exists at all.
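The arithmetic behind these two readings of ‘most’ can be checked with a minimal sketch. The function name `joint_share` is illustrative, and treating the two proportions as independent is the same informal assumption the multiplication rule makes in the text.

```python
def joint_share(share_of_people: float, share_of_time: float) -> float:
    """Multiply the two proportions, treating them as independent."""
    return share_of_people * share_of_time


# Weakest reading: 'most' means 51 percent in both places.
print(round(joint_share(0.51, 0.51), 2))  # 0.26

# Generous reading: 'most' means 90 percent in both places.
print(round(joint_share(0.90, 0.90), 2))  # 0.81
```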
So, what my colleague appeared to be saying was that somewhere between 25% and 80% of the time we will find authorial consistency. I submit that this is saying very little of forensic value. Nevertheless, I decided to test the extent to which texters are consistent in their output.

[1] In the context of the same case, another colleague was interviewed on the BBC. He also claimed that consistency was important in forensic authorship, citing a recent case in which the defendant allegedly forged texts from a victim's phone. He quoted the defendant's texts as using 'me' (as possessive pronoun) instead of 'my' "consistently". However, close examination of the defendant's texts reveals that the defendant uses 'my' just as frequently in this capacity as he uses 'me'. It seems that adherents of 'consistency' are sometimes unable even to observe variation.

The experiment

It was first decided to calculate how many words in the corpus occurred in more than one form. Aside from personal and place names and the combining of two words (e.g. ‘see you’ as ‘cu’, ‘cya’, etc), as well as the formation of clusters by punctuation, e.g. ‘cya.or’ in ‘cya.or cal me’, there were 295 types which occurred in more than one form, producing a total of 894 tokens. Some occurred in only two forms, others in several; examples of some of these were given earlier. Words which occur in more than one form in the corpus have been labelled ‘polymorphs’. As seen from the earlier examples, most polymorphs are extremely common words, about 80% being function words, mostly of medium to high frequency.

The next question was how many of the 53 texters used at least two forms of one polymorph in their texts. It will be appreciated that those with many texts will have had the opportunity to use more polymorphs than those with fewer texts. Of the 53 participants only 19 did not text more than one form of a polymorph. In other words 64 percent used more than one form of at least one word, with half of those producing between 10 and 50 percent of all the possible polymorphs (in their sub-corpus) in multiple forms.

To clarify this point: a texter can use a word which is found (in the corpus or elsewhere) in a variety of forms (e.g. ‘you’). Some texters use just one form (not necessarily the standard form) of a particular polymorph such as ‘you’; others use several different forms. What the experiment showed was that most texters use more than one form of at least one word, while a number use more than one form of a significant percentage of their possible polymorphs. The youngest texter, who contributed 72 texts, displayed the highest variation: of 61 possible polymorphs occurring in her texts, 32 occurred in more than one form.
This texter produced no fewer than five forms of ‘you’, four forms of ‘yes’, three forms of ‘tomorrow’, and a number of other words in multiple forms. A 16 year old male participant, who contributed a much lower 21 texts, had 87 possible polymorphs, of which 18 occurred in more than one form, including four forms of ‘chicken’ (in the phrase ‘chicken pox’). Nor was this degree of variation found exclusively in young texters. An older texter (F, 44) produced three forms of ‘and’ and four forms of ‘have’.

Some texters produced two forms of a polymorph in just one text: one participant (F, 62) contributed just one text, which contains both ‘u’ and ‘you’, while another texter (F, 26) also contributed just one text, in which she produces not only two forms of ‘you’ (‘u’ and ‘ya’) but also two forms of ‘good’ (‘good’ and ‘gud’).

The results are presented in numerical form in the table below. The percentage of variation for each author is calculated as the number of polymorphs the author used in more than one form, compared to the number of words in the author’s texts which are polymorphous. An alternative method of measurement is given below, after the table.
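The per-author calculation just described can be sketched in a few lines. This is an illustration rather than the procedure actually used in the study: the `VARIANTS` map is a small hypothetical sample of hand-identified forms, tokenisation is a plain whitespace split, and ambiguous forms (such as ‘n’ for both ‘and’ and ‘in’) are simply assigned to one word.

```python
from collections import defaultdict

# Hypothetical, hand-built map from observed forms to a standard word.
VARIANTS = {
    "u": "you", "ya": "you", "yew": "you", "you": "you",
    "gud": "good", "good": "good",
    "cos": "because", "coz": "because",
    "&": "and", "n": "and", "and": "and",
}


def variation(tokens):
    """Return (possible polymorphs, many-form polymorphs, ratio) for one texter."""
    forms = defaultdict(set)
    for token in tokens:
        word = VARIANTS.get(token.lower())
        if word is not None:  # the token is a form of a corpus-wide polymorph
            forms[word].add(token.lower())
    possible = len(forms)
    many = sum(1 for seen in forms.values() if len(seen) > 1)
    ratio = many / possible if possible else 0.0
    return possible, many, ratio


sample = "u ok? hope ya had a gud day and a good nite cos i did".split()
print(variation(sample))  # (4, 2, 0.5)
```

In the invented sample, four polymorphous words occur (‘you’, ‘good’, ‘and’, ‘because’) and two of them appear in more than one form, giving a ratio of 0.5.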

Table 1: Showing the degree of variation found in a participant group of 53 texters

Author  Age  Gender  Total No   Total No   Total No   1 Form       Many Form    One/Many
Code                 of Texts   of Tokens  of Types   Polymorphs   Polymorphs
1       44   F       104        1905       627        134          44           0.33
2       70   F       2          47         43         17           0            0.00
3       32   F       11         83         63         33           0            0.00
4       45   F       4          35         28         15           0            0.00
5       50   F       5          53         34         12           2            0.17
7       17   M       30         259        177        73           11           0.15
8       45   M       93         1125       393        124          9            0.07
9       16   M       21         428        231        87           18           0.21
10      25   M       4          52         44         18           0            0.00
11      32   M       9          236        116        47           3            0.06
13      25   F       2          32         30         16           0            0.00
15      35   F       1          58         54         26           1            0.04
22      42   F       26         202        147        56           6            0.11
24      46   F       85         934        339        109          25           0.23
26      25   F       39         766        367        116          22           0.19
27      22   F       2          66         55         22           1            0.05
28      37   F       13         245        149        54           4            0.07
32      40   F       42         460        270        82           13           0.16
33      35   F       6          109        73         26           0            0.00
35      18   F       9          191        118        54           4            0.07
36      45   F       3          31         26         6            1            0.17
37      26   F       1          32         30         9            3            0.33
38      35   F       1          25         21         7            1            0.14
39      38   F       6          130        97         37           0            0.00
40      50   F       1          18         14         4            0            0.00
41      20   F       22         309        172        67           7            0.10
42      25   F       5          92         65         25           2            0.08
44      23   M       7          78         64         35           4            0.11
45      32   M       6          99         69         37           1            0.03
46      32   F       41         905        330        134          3            0.02
48      45   M       1          24         23         10           0            0.00
49      56   M       27         230        133        42           4            0.10
50      11   F       72         627        285        61           32           0.52
51      15   F       3          35         30         15           2            0.13
52      26   F       8          119        72         32           4            0.13
60      17   M       1          34         30         17           0            0.00
61      40   M       32         249        152        58           2            0.03
63      39   M       2          15         10         5            0            0.00
65      36   M       3          38         32         14           0            0.00
66      58   F       1          43         38         25           0            0.00
67      25   M       48         749        307        105          11           0.10
69      70   M       2          17         16         6            0            0.00
70      45   M       2          51         42         21           0            0.00
71      16   F       94         2323       508        144          42           0.29
73      45   F       1          1          1          1            0            0.00
74      56   M       6          104        72         34           0            0.00
75      70   F       3          15         13         4            0            0.00
77      45   F       4          95         71         21           1            0.05
78      55   M       11         254        146        44           1            0.02
79      26   M       19         276        125        59           4            0.07
80      24   M       1          6          6          2            0            0.00
81      62   F       1          15         15         7            1            0.14
82      38   F       7          168        124        50           1            0.02

Results

The average percentage of multiple form polymorphs was 0.09 (nine percent). However, there was extensive variation (from 0 percent to 52 percent). Males appear to exhibit noticeably less variation than females (5 percent compared to 11 percent). Variation appears broadly to decrease with age: texters aged 18 or less exhibit 20% variation on average, those between 18 and 30 display 13%, those between 30 and 40 years of age a mere 4%, while those aged 40-50 exhibit 9%, the figure declining to 7% for those between 50 and 70 years of age. Hence young females show the highest variation in their texts of all the sub-groups. There is almost no correlation between age of texter and average length of text: in fact the correlation is slightly negative (-0.08). Similarly, there is little or no connection between the texter’s gender and average length of text. Of the 53 participants, 26 exhibit 5% or greater variation. Even those who contributed 10 or fewer texts exhibit, by this calculation, an average of 5% variation. It will be appreciated that the size of the corpus in the present study means these results should be considered as provisional, pending further research.

It was also found that, as expected, variation tends to increase with the number of texts in the author’s sub-corpus. There was thus a high correlation (p<0.005, n = 53) between the number of texts the author had sent and the degree of variation they displayed. This can be confirmed from the table above. This would mean that analysis of a corpus which contained a relatively small number of texts per author would very likely reveal variation to be relatively low. If the data in the above table are reliable indicators of variation, then corpora containing low numbers of texts per author would in all probability not represent the true state of author variation in the phone texting medium.

A more conservative measurement of individual variation than the one referred to above is the number of many form polymorphs used divided by the total number of types (as in types/tokens) used. Although this yields a much lower percentage of variation, it was found that 15 of the 53 texters exhibited variation of 5% or greater. To give a rough idea of what all this might mean in the forensic context, the results presented in this analysis suggest that if an expert were being very conservative, he/she would have to inform the court that the chances of a correct identification could, in some circumstances, be as low as 72% (i.e. 38/53), although it is possible (again, depending on the circumstances) for it to be as high as 90%.
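Readers who wish to perform their own calculations on Table 1 could start from a sketch such as the one below. It rests on several assumptions that are not stated in the study: the variation ratio is read as the ‘Many Form’ column divided by the ‘1 Form’ column (which is what the tabulated values imply), the first 5% count uses unrounded ratios, the conservative 5% count uses ratios rounded to two decimal places, and the correlation is a plain Pearson coefficient with no significance test. The helper names are illustrative.

```python
from math import sqrt


def ints(*chunks):
    """Turn space-separated number strings into one list of ints."""
    return [int(x) for chunk in chunks for x in chunk.split()]


# Columns copied from Table 1, in author-code order (53 texters).
GENDER = ("F F F F F M M M M M F "
          "F F F F F F F F F F F F F F F F M M F M M F F F M M M M F M "
          "M M F F M F F M M M F F").split()
TEXTS = ints("104 2 11 4 5 30 93 21 4 9 2",
             "1 26 85 39 2 13 42 6 9 3 1 1 6 1 22 5 7 6 41 1 27 72 3 8",
             "1 32 2 3 1 48 2 2 94 1 6 3 4 11 19 1 1 7")
TYPES = ints("627 43 63 28 34 177 393 231 44 116 30",
             "54 147 339 367 55 149 270 73 118 26 30 21 97 14 172 65 64 69",
             "330 23 133 285 30 72 30 152 10 32 38 307 16 42 508 1 72 13",
             "71 146 125 6 15 124")
ONE_FORM = ints("134 17 33 15 12 73 124 87 18 47 16",
                "26 56 109 116 22 54 82 26 54 6 9 7 37 4 67 25 35 37 134 10",
                "42 61 15 32 17 58 5 14 25 105 6 21 144 1 34 4 21 44 59 2 7 50")
MANY_FORM = ints("44 0 0 0 2 11 9 18 0 3 0",
                 "1 6 25 22 1 4 13 0 4 1 3 1 0 0 7 2 4 1 3 0 4 32 2 4 0 2",
                 "0 0 0 11 0 0 42 0 0 0 1 1 4 0 1 1")


def mean(xs):
    return sum(xs) / len(xs)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient (no significance test)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))


# Main measure: many-form polymorphs over the '1 Form' column.
ratio = [m / o for m, o in zip(MANY_FORM, ONE_FORM)]
# Conservative measure: many-form polymorphs over total types.
conservative = [m / t for m, t in zip(MANY_FORM, TYPES)]

overall = mean(ratio)
female = mean([r for r, g in zip(ratio, GENDER) if g == "F"])
male = mean([r for r, g in zip(ratio, GENDER) if g == "M"])
at_least_5 = sum(r >= 0.05 for r in ratio)
conservative_5 = sum(round(c, 2) >= 0.05 for c in conservative)
r_texts = pearson(TEXTS, ratio)

print(f"mean variation: {overall:.3f} (F {female:.2f}, M {male:.2f})")
print(f"texters at 5% or more: {at_least_5}; conservative measure: {conservative_5}")
print(f"share below 5% on the conservative measure: {(53 - conservative_5) / 53:.2f}")
print(f"Pearson r, texts vs variation: {r_texts:.2f}")
```

Under these assumptions the sketch gives 26 texters at 5% or more on the main measure and 15 on the conservative one, in line with the figures reported above; other rounding choices would shift those counts slightly.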
Clearly, the texts in each case will present specific levels of variation and some will present greater difficulties with regard to possible identification than others.

Conclusion

In this analysis I have attempted to illustrate that individual author variation in mobile phone texts may be greater than would be readily apparent. This is particularly important in the forensic context, where people’s lives, liberty or reputations are at issue. I am not suggesting that forensic authorship analysis of mobile phone texts cannot be carried out accurately and successfully, but I believe that very great caution is required when making identifications for court purposes. Nobody would disagree that this should be so in any forensic analysis.

More generally, I believe it is clear from this brief study that the question of author variation is becoming increasingly important in the forensic linguistic context. Manifestly, we need to know more about this phenomenon, so that we can be more sure, when stating our observations and conclusions in forensic work, that we are basing our judgements on the realities of language use. Up until now individual author variation has largely been ignored by many forensic linguists and I believe this is dangerous: dangerous for the justice system and dangerous for victims and defendants alike.

In case the reader might think I am exaggerating when referring to the lack of attention given to the topic of individual author variation, I would refer to a statement in a recent book on forensic linguistics:

    The linguist approaches the problem of questioned authorship from the theoretical position that every native speaker has their own distinct and individual version of the language they speak and write, their own idiolect, and the assumption that this idiolect will manifest itself through distinctive and idiosyncratic choices in speech and writing… This implies that it should be possible to devise a method of linguistic fingerprinting, in other words that the linguistic ‘impressions’ created by a given speaker should be usable, just like a signature, to identify them. So far, however, practice is a long way behind theory and no one has even begun to speculate about how much and what kind of data would be needed to uniquely characterize an idiolect. (Coulthard and Johnson 2007: 161)

In this passage Coulthard (the author of this section of the book) appears to be claiming that given sufficient data writers can be ‘linguistically fingerprinted’. All that is required to expose this idiolect is sufficient data and, apparently, a method. According to this claim each of us has an idiolect and, provided the above conditions can be met, this idiolect could be demonstrated. However, the notion of idiolect assumes an underlying consistency of data and, as William Labov noted over thirty years ago, the notion of “homogenous data” in idiolects is highly questionable (Labov, 1972: 192) and the problem of “the quantification of the dimension of style” (1972: 245) remains. Although Labov’s comments are in relation to speech, my view is that they are probably applicable to all language data.

As far as I am aware, Coulthard does not refer anywhere in his book to author variation specifically and shows no sign of having studied the subject in any depth. Yet he pronounces that we each have an idiolect, which is no more than a theoretical construct, based on the notion of linguistic variation at the group level which may be extended to linguistic variation at the individual level. It is of some concern that trainee forensic linguists may now also adopt the position that ‘idiolect’ is a given, rather than simply a notion, or a convenient way of describing what is no more than a theoretical possibility.

It is surely the case that in any inquiry we need to look systematically at the variation displayed by each author candidate and compare this (perhaps in some measurable way) with the degree to which candidates are individually distinctive in their choices. We would then also test each claim we make about distinctiveness by the use of general and specialist corpora and, where feasible, Internet searches.

Hence, while it is literally true that ‘for most items most people use only one form most of the time’, in itself this statement does not mean much in the forensic context. In the experiment reported here, while only about five per cent of items are actually polymorphs, they represent around 20% of all words used. At this level the variation displayed is more than capable of confounding an identification analysis.
Moreover, given that it is almost predictable which tokens will turn out to be polymorphs, and given the endless inventiveness of human beings when in contact with language, it is apparent from the results given above that not only are individuals subject to some kind of ‘internal’ instability of form when using polymorphs, but that polymorphs in the phone texting environment are themselves subject to various forms of social change, including of course diachronic change. It is suggested that Bakhtin’s notion of heteroglossia is particularly useful when considering variation in mobile phone texting: whereas ordered, standardised forms of language are evidence of centripetal social forces, centrifugal forces influence non-standard forms (see Morson and Emerson, 1990). This effectively means that variation itself may be unstable and hence unpredictable.

This makes it doubly critical that in any forensic analysis involving authorship of phone texts linguists look closely at the degree of variation displayed by individual authors before carrying out comparison tests. They will then be in a better position to assign a more accurate level of probability to their results. As a first step, it may be helpful to consider that forensic phoneticians now look at the issue of ‘consistency’ and balance this against ‘distinctiveness’. Hence, in a given case, if a suspect’s voice yields (relatively) consistent data, and is sufficiently distinctive as a voice, this greatly enhances the ability to make a valid comparison. How is it that forensic linguists are not thinking like this?

References

Coulthard, M. & Johnson, A. 2007. An Introduction to Forensic Linguistics: Language in Evidence. London: Routledge.
Labov, W. 1972. Sociolinguistic Patterns. Philadelphia, Pennsylvania: University of Pennsylvania Press.
Morson, G.S. & Emerson, C. 1990. Mikhail Bakhtin: Creation of a Prosaics. Palo Alto, California: Stanford University Press.
