rui sousa-silva

News plagiarism has perhaps been one of the most challenging areas of research into plagiarism. Unlike student plagiarism, text reuse by journalists with little or no attribution at all does not seem to be usually regarded as plagiarism (Angèlil-Carter 2000; Coulthard & Johnson 2007), not even when substantial amounts of text are reused. This is one of the problems reported by Angèlil-Carter (2000) in her discussion of the subject. As the borderline of plagiarism depends as much on its definition and on the author’s intention as on the text genre, the usage of large amounts of text by journalists with little or no attribution tends to be overlooked. This is a result of the underlying assumption that news pieces are expected to report on ‘real-world’ facts and events. And since, for reasons of faithfulness, these facts and events cannot be reported differently, the more faithfully a journalist reports them, the more professionally they act, and the more likely it is that substantial textual overlap is to be expected. Therefore, texts reporting those facts and events can hardly be charged with plagiarism. Another reason for this apparent leniency towards news text lifting is that news corporations frequently subscribe to paid newswire services whose contents they are allowed to reuse. Additionally, when faced with the need to acknowledge their sources, journalists seem to have a double standard. On the one hand, they do not hesitate to clearly cite their primary sources (keeping their identity confidential when necessary to protect them) in order to ensure the truthfulness of the news piece. In some extreme cases, they even resist pressure to identify these sources. On the other hand, they often reuse text from other (secondary) sources to write their articles, while not always citing them. This is the case of reusing text from other media organisations, or even from newswire services. Notwithstanding these underlying assumptions, journalists have been punished for plagiarising.
In February 2015, Jared Keller, the news director of the news site Mic, was fired after he was found to have lifted passages of text from other news sources. Keller reproduced the text literally or with minor changes, with little or no reference to the source. Where he provided a reference, this was made in passing. That same month, the columnist Tanveer Ahmed was dismissed by the Australian after a blogger accused him of plagiarising an American political website. Two years earlier, the New Yorker writer Jonah Lehrer was fired for recycling New Yorker blog posts, among other misdeeds. One of the most paradigmatic cases, however, is that of Jayson Blair, who in 2003 resigned from The New York Times after facing accusations of journalistic fraud, including plagiarism. In particular, he was accused of lifting material from newswire services and other newspapers, such as the Washington Post and The San Antonio Express-News. In 2007, a reader of the Portuguese quality newspaper Público found that the journalist Clara Barata plagiarised from other sources, including Wikipedia. This case is even more complex than the others, as the texts were not lifted from an original in the same language, but instead from an original in another language. A similar case is that of a reporter of the Telegraph-Journal in Canada, who was fired in 2009 for lifting a news piece from L’Acadie Nouvelle.

OSLa volume 7(1), 2015

a forensic linguistic analysis of news plagiarism


This paper investigates how a forensic linguistic analysis can assist the detection and/or provision of evidence of news plagiarism. It builds on the assumption that it is crucial to devise a method for identifying the textual elements that can be used to flag a text as a potential instance of plagiarism, not only to raise suspicion about its originality, but also to develop translingual plagiarism detection techniques (Sousa-Silva 2014). A method of this type is presented below.

[2] news, plagiarism, and lifting

Indeed, although a vast body of research into plagiarism has been published over the last decades (Anderson 1998; Angèlil-Carter 2000; Carroll 2001; Carroll & Appleton 2001; Jameson 1993; Lindey 1952; Pecorari 2008; Howard & Robillard 2008; Roig 2001; Scollon 1995; Howard 1995), it has focused mostly on academic plagiarism, to the detriment of other instances of text reuse. One of the reasons why academic plagiarism has attracted most research attention is that it is seen as an educational issue that needs to be identified during the student’s academic path (Carroll 2001; Carroll & Appleton 2001), and especially that students need to be taught how to adopt appropriate academic conduct (Howard 1995). By contrast, comparatively little research has been conducted into news text reuse. This is supported by the strong views, usually matching the infringing journalist’s argument, that writing news pieces is different from academic writing, and that citing all the secondary sources used is impractical if the readability of the article is to be preserved. Paradoxically, although the conventions and regulations applying to the use of newswire copy are not universal, they tend to be clear in this respect. Examples of such conventions and regulations abound. Agencies require that the source(s) be credited, and forbid the unacknowledged use of ‘authored articles’, i.e. news pieces signed by individual reporters, as opposed to plain news wires. The Reuters Handbook of Journalism (Reuters 2008), e.g., describes plagiarism as a ‘cardinal sin’. It strongly argues that, whereas ethical guiding principles contribute to better journalism, ‘rigid rules’ restrict and constrain the ability to operate. The Reuters Style Guide states in addition that, in accordance with the Reuters Code of Conduct, the company’s journalists are required to always “search for and report the truth, fairly, honestly and unfailingly” (Reuters 2008, pg. 1).
In addition to stating that plagiarism is a “cardinal sin”, this style guide considers fabrication and plagiarism two of the “10 Absolutes of Reuters Journalism”. Its journalists are, therefore, required to provide “proper attribution to the source of material that is not” theirs, and are instructed that “it is insufficient to label video or a photograph as ‘handout’”; on the contrary, it is a requirement that the source be clearly identified. This style guide further states that “it is essential for transparency that material we did not gather ourselves is clearly attributed in stories to the source, including when that source is a rival organisation” and concludes that “failure to do so may open us to charges of plagiarism” (Reuters 2008, pg. 5).

Likewise, the International Federation of Journalists[1] (IFJ) and the Portuguese journalists’ union (Sindicato dos Jornalistas[2]) consider plagiarism a ‘serious professional offense’. Similarly, the style guide of the main Portuguese quality newspaper, Público[3], establishes that plagiarism is forbidden by the newspaper, and adds that all relevant information collected from other media organisations or news agencies must be attributed. In cases where the news piece is based on news wires of different agencies, these should be cited in the text in order of their contribution to the news article. When the news wires are used as mere sources, and the article is mainly written by the journalist, the agencies should be cited in the body of the news article. But if the article is based mainly on news wires, then a reference to these should be included. In addition, the style guide explicitly states that texts translated from other languages should be clearly marked as translations and include the translator’s name. It is then unsurprising that, in accordance with its policy, Público published an apology, in 2006, for one of its journalists, Clara Barata, who published an article that was mainly translated from the New Scientist and Wikipedia. The suspicion was raised by a reader, who noticed that the text looked familiar to him when he first read it, and later identified the original sources. The newspaper initiated an investigation and later realised that the journalist plagiarised 13 significant extracts using translation. The case was compared to that of the famous New York Times journalist, Jayson Blair, who in 2003 was dismissed after the newspaper was challenged by other news organisations for accusations of plagiarism. Cases of news plagiarism have, however, long been reported.
In 1996, another news organisation, the Portuguese news agency Lusa, had submitted a complaint to the journalists’ union, Sindicato dos Jornalistas, claiming that several Portuguese media organisations were plagiarising texts authored and signed by its own journalists, which were not included in newswire services. Given the stance adopted by these organisations and media self-regulatory measures, news plagiarism cases have unsurprisingly been addressed more often by self-regulation, codes of ethics and deontology than by the law. And this traditional perspective of journalism as being exempt from plagiarism has been challenged, not least by journalistic practice, as well as by the practice illustrated by the cases discussed above. It is thus evident that, despite reporting facts, news is subject to principles of originality as much as other text genres, including student assignments. News plagiarism therefore is not treated much differently from academic plagiarism. Like academic plagiarism, it is not only subject to internal rules and regulations, but also tends to be resolved internally by the respective organisations.

[1] See http://www.ifj.org/en
[2] See http://www.jornalistas.online.pt/
[3] Available at http://static.publico.clix.pt/nos/livro_estilo/16p-palavras.html

Nevertheless, establishing a framework and guiding principles to address news plagiarism is not the only issue at stake. An additional challenge in handling news text reuse is that of detection. Coulthard & Johnson (2007) argue that the technologies that make it easier to plagiarise also make it easier to catch plagiarists. The technological developments of the last decades have, in fact, facilitated the detection procedure. But in cases of news plagiarism, instances of lifting are not uncommonly detected by intuition, although the feeling of déjà-vu is less likely to occur than in academic plagiarism. Readers often find themselves feeling that they have already read the same thing elsewhere, and initiate a whistleblowing process. The case of Clara Barata discussed above illustrates this point. Elsewhere, I demonstrated that lifting text from an original in the same language can be easily detected, using anything from simple to more complex text-matching tools and techniques; a straightforward comparison suffices in this case to identify the unoriginal instances (Sousa-Silva 2014). By contrast, detecting text reuse from an original in another language is comparatively more complex. Since the plagiarised (i.e. the original) and the plagiarising (i.e. the derivative) texts are in two different languages, translation works as an obfuscation technique that prevents a direct textual comparison. Firstly, machine and machine-assisted detection cannot be systematically used for text comparison. Secondly, manual searches using particularly suspect strings of text, such as those commonly performed by teaching staff, fail because search engines miss the source when the text is not absolutely identical.

[3] plagiarism detection: the case for forensic linguistics

In recent years, many people, from literary critics and copyright lawyers to teachers and forensic linguists, have shown a growing interest in the field of plagiarism and plagiarism detection, even if for different reasons (Coulthard & Johnson 2007). Whereas the literary critic may be interested in judging the literary quality of a literary work, the teacher is more interested in educating students, and hence concerned more with the moral values of plagiarism itself than with the financial implications of the infringement (Howard 1995; Robillard & Howard 2008; Scollon 1994, 1995). The copyright lawyer, by contrast, is prone to be more interested in the financial implications of plagiarism, and to seek the corresponding compensation. Plagiarism has been traditionally considered an immoral, more than an illegal act (Garner 2009). Consequently, it should be more appropriately addressed as an ethical, rather than a legal offense (Goldstein 2003). This is especially so because the works entitled to protection are immaterial and ubiquitous. As a result, they can be simultaneously used by different people, thus compromising the original author’s ability to control the use of his/her own work (Pereira 2003, pg. 20).

However, it has been demonstrated that plagiarism is indeed both immoral and illegal (Finnis 1991; Eiras & Fortes 2010), which makes it punishable by law (Pereira 2003). Plagiarism is thus more appropriately addressed as both an ethical and a legal issue. As I argued elsewhere, ‘[o]n the moral side, plagiarism brings social implications, with the power to ruin the reputation of the plagiarist; on the legal side, it implies the infringement of moral rights, and often financial rights, both of which are punishable by law’ (Sousa-Silva 2013, pg. 61). Indeed, as these financial rights are more easily quantifiable than the respective moral rights, it is not surprising that they are the ones more promptly addressed by the courts. It is not uncommon that instances of plagiarism bring along serious legal implications. And neither are the cases brought before the courts of law restricted to those having financial implications. Many high-profile cases brought to the fore in recent years show that plagiarism is not only seen as a violation of codes of ethics, but is also punished. News plagiarism is not an exception, as the cases presented above demonstrate. This makes plagiarism well suited for a Forensic Linguistics approach, as forensic linguists set as their research object the legal aspect of the act, and the result of such an act. In legal cases, forensic linguists can and do assist the investigative procedures, by helping ethics committees, boards and decision makers determine lifting; they also provide linguistic evidence to a Court as to whether two or more texts have been produced independently, or whether they build upon a previous original text. Forensic linguistics is the field of linguistics that applies a linguistic analysis across all types of interaction in the legal context (Caldas-Coulthard 2014). In other words, this field is above all focused on all aspects of the interaction between language and the law.
However, linguists operating in forensic contexts have contributed significantly to cases that span beyond the purely legal. In the field of plagiarism in particular, linguistic analyses have made significant advances in recent years in the detection of same-language plagiarism and ‘translingual’ plagiarism alike. It has been almost 20 years since Johnson (1997) compared a set of student texts to conclude that they were not original. By devising a method that consisted of comparing only lexical items, rather than using string matching techniques, she demonstrated that they were a result of collusion, i.e. a sort of group plagiarism. Although the text strings were altered in order to produce slightly different versions, a comparison of the lexical items showed that the texts had not been produced independently. Johnson’s linguistic analysis did not involve the courts, but was sufficient to demonstrate lifting among students. And more importantly, her analytical methods were later applied in court cases. Turell (2004) built upon Johnson’s (1997) work to investigate whether a linguistic analysis that had previously been tried and tested with student plagiarism could also be used to successfully determine plagiarism in published translated texts. She compared four translations of Shakespeare’s Julius Caesar into Spanish and demonstrated how a forensic linguistic analysis is sufficiently sound to prove that one translation derives from another translation of the same original, rather than having been produced independently. Moreover, she illustrated very clearly how this forensic analysis can be used to provide evidence of the lifting. Turell’s comparison of the four texts included overlapping vocabulary, shared once-only words, unique vocabulary and shared once-only phrases. The excellent performance of this method is based on the simple principle that, since all these elements are relatively independent of word order, they tend to perform better than text matching techniques. The case studied by Turell is a typical case of plagiarism that is often decided by the courts of law: these translations are themselves literary, original works, and hence are subject to copyright. Violation of copyright in these circumstances therefore has financial, in addition to moral, implications, owing to the fact that the translator and the publishing company own intellectual property rights over the translated work similar to the ones owned by the original author and the publisher of the original. From a forensic linguistic perspective, this task is particularly challenging because every translation is bound to reflect the form and content of the original, and the more literal the translation, the more difficult it is to show its originality. Despite this obstacle, Turell’s analysis proved that the suspect translation plagiarised a pre-existing translated text. Although these studies in the field of forensic linguistics, among others, have been paramount in the study of plagiarism, research has mostly focused on ‘intralingual plagiarism’, i.e. on the analysis of reuse of texts written in the same language.
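As a rough illustration of the word-order-independent measures listed in Turell’s comparison, the following Python sketch computes overlapping vocabulary, shared once-only words and unique vocabulary for two texts. It is not Johnson’s or Turell’s actual procedure: the tokeniser, the function name `lexical_profile` and the dictionary keys are assumptions introduced here for illustration only, and shared once-only phrases are left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a text into lowercase word tokens (letters only).

    A deliberately simple tokeniser, assumed for illustration.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def lexical_profile(text_a, text_b):
    """Compare two texts on measures that ignore word order.

    Hypothetical helper: returns sets of word types for each measure.
    """
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    vocab_a, vocab_b = set(counts_a), set(counts_b)
    # Once-only words: word types that occur exactly once in a text.
    once_a = {word for word, n in counts_a.items() if n == 1}
    once_b = {word for word, n in counts_b.items() if n == 1}
    return {
        "shared_vocabulary": vocab_a & vocab_b,
        "shared_once_only": once_a & once_b,
        "unique_to_a": vocab_a - vocab_b,
        "unique_to_b": vocab_b - vocab_a,
    }
```

Because every measure is a set operation over word types, reordering or lightly editing sentences changes the result far less than it would change an exact string match, which is the principle the comparison relies on.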
Conversely, there has been relatively little research into ‘translingual plagiarism’ (Sousa-Silva 2013). This is a case of plagiarism by translation, where a text lifts, verbatim or otherwise, from another text written in another language without a clear, proper and unambiguous attribution. Two reasons in particular may account for this fact. Firstly, research into plagiarism has been mainly English-centred. Not only has most research into academic integrity, education and policies been conducted in the Anglo-American context, but also the depth and breadth of the research object does not leave much room for concern with source texts written in other languages. Consequently, on the detection side, too, software has been designed to address the needs of this particular context. Moreover, if we take the Internet in general as an example, most texts are nowadays written in English[4], and the demand for texts written in other languages is comparatively much smaller. Secondly, despite the urgent need to detect textual lifting from other languages, not least as a result of the high volume of scientific production in English, a very strong effort is required to detect this type of plagiarism. Owing to these constraints, there is currently no means of systematically screening texts for translingual plagiarism in the same way as there is to detect same-language plagiarism. As a result, such cases can almost only be grasped by intuition, without any computer assistance.

[4] The Internet World Stats website reports that in 2013 English was by far the most widely used language on the Internet; see http://www.internetworldstats.com/stats7.htm

In most cases, translingual plagiarism consists of texts that are translated freely and ‘informally’ from another language, without acknowledging the original author. This is hardly the case of literary texts, a professional and acknowledged translation of which is usually commissioned. But translation of other text genres (e.g. news and blog comments, besides academic plagiarism) without attribution can easily pass unnoticed. This is mainly because, contrary to Turell’s study above, they do not plagiarise another translation in the same language, but rather the original, in another language. The text is thus not lifted word-for-word, which makes the plagiarism more difficult to monitor. In this respect, a forensic linguistic analysis is crucial, not only to assist the detection procedure, but also to demonstrate the extent of the borrowing, and whether a text is an instance of plagiarism, or on the contrary whether the textual reuse is acceptable. More importantly, this analysis is able to provide evidence that a text (or more than one) was not produced independently. This will be addressed in the next section.

[4] raising suspicion and detecting plagiarism

This paper first studies the detection of verbatim reuse of news articles. Subsequently, a method is proposed to raise suspicion that a text may have been plagiarised. Thirdly, it illustrates how to find evidence that a text has plagiarised another text in another language. This research is based on a corpus of news pieces that are publicly available, and which are supposed to have been produced independently, although on similar topics.

[4.1] Verbatim Plagiarism

Detecting verbatim plagiarism, i.e. where the derivative text lifts (almost) literally from an original in the same language, without alterations, is straightforward and easy. As long as the original is known, a simple comparison of the original and the suspect texts, manually or using common computer tools, suffices to identify the amount of overlap, as well as the extent of the lifting. In order to showcase this, I randomly selected a text made available by the Portuguese news agency Lusa, from a corpus of 28 news pieces that were authored and signed by in-house journalists. An Internet search of a few strings of text found two individual instances of textual reuse without acknowledgement, which consequently constitute plagiarism. The first one was published by the Portuguese quality newspaper, Jornal de Notícias (JN). The second one was published online by the TV broadcasting corporation, TVI. This text is reproduced in the following two extracts.

The news piece published by JN (Extract 1) has a textual overlap of 96%, i.e. 527 out of a total of 554 words (the original piece published by Lusa was 550 words long). The text published by TVI (Extract 2) has a textual overlap of 100%. This online news piece reused all the 550 words of the text published by Lusa, although a few additional words were added (the text published by TVI is 566 words long). This is the result of the slight alterations made to the original news article published in the newspaper. It should be noted that Lusa is referenced in passing, as quotes used in the text are attributed to the news agency. However, nowhere in the article is authorship attributed to the original news piece. The piece broadcast by TVI also references Lusa in passing, by attributing the quotes to the agency, but goes further than JN in that it attributes the authorship to their own reporter and the TV station newsroom (“Redacção/PP”). The changes introduced to the TVI text are only minor, even if compared to the ones introduced by JN. Interestingly, there is one sentence in the original article that lacks a word, and hence the reproduction of that error raises some issues of ungrammaticality: “Se, por exemplo, se encontra a arma do crime sem impressões digitais poderá ter pólen, não daquele local, mas da sua proveniência”. In order for the sentence to be grammatical, at least a pronoun is needed after “digitais” and before “poderá”, such as ‘ela’ or ‘esta’. However, neither JN nor TVI seems to have noticed it, and both reproduced the grammatical error. This provides clear evidence that the text is not original. Furthermore, chronological aspects show the directionality of the lifting, i.e. that JN and TVI lifted the text from Lusa (or from each other), but not the other way around.
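The overlap percentages reported for Extracts 1 and 2 can be illustrated with a simple word-count comparison. The Python sketch below is an assumption about how such a figure might be computed, not the procedure actually used in this study: it treats each text as a bag of words and measures the share of the original’s word tokens that reappear in the suspect text. The names `tokenize` and `overlap_share` are hypothetical. Read this way, the reported figures are consistent with dividing the reused words by the 550 words of the Lusa original: 527/550 ≈ 95.8% (rounded to 96%) and 550/550 = 100%.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a text into lowercase word tokens (letters only).

    A deliberately simple tokeniser, assumed for illustration.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def overlap_share(original, suspect):
    """Share of the original's word tokens that also occur in the suspect.

    Uses a multiset intersection, so a word repeated three times in the
    original only counts three times if the suspect repeats it too.
    """
    original_counts = Counter(tokenize(original))
    suspect_counts = Counter(tokenize(suspect))
    shared = sum((original_counts & suspect_counts).values())
    total = sum(original_counts.values())
    return shared / total if total else 0.0


# Arithmetic check against the figures reported in the text.
print(round(100 * 527 / 550))  # 96
print(round(100 * 550 / 550))  # 100
```

Since word order is ignored, this measure can overstate reuse when two texts merely share common vocabulary; for near-verbatim lifting of the kind described here, a side-by-side comparison of the matching strings remains the more direct evidence.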

[4.2] Investigating Suspected Plagiarism by Translation

As shown in the previous section, detecting verbatim news plagiarism is straightforward and easy, especially as the media go increasingly online. However, more sophisticated techniques are required when news pieces are plagiarised from other languages by journalists, who tend to translate the text freely into another language (usually, their mother tongue), often using machine translation services, such as Google Translate. In these cases, the output of the machine translation is frequently grammatically flawed. To a lesser or greater extent, adjustments are therefore required, not only to make the text readable, but also publishable. In order to raise the suspicion that a text derives from an original in another language, and consequently detect instances of plagiarism of this type (as is the case of Público discussed above), it is necessary to either rely on intuition (the feeling of déjà-vu), or else build upon linguistic analysis. The latter is also required to provide evidence of the lifting, as the former is insufficient in this respect. As part of the linguistic approach, a syntactic analysis has the potential to trigger suspicion that a text may be an instance of plagiarism, as long as the two languages involved have a different syntax. This builds on the very simple principle that a text written from sources in another language tends to retain syntactic elements of that language, whereas texts written originally in one language tend to adhere to that language’s standards. The following extracts illustrate this point:

Extract 3: The renewal of the Toural square in the center of Guimarães, will move to the end of the year, but the design is totally different from the planned study presented two years ago. The project challenged by vimaranenses resolve the tunnel road and underground parking. The car traffic will be maintained throughout the area, but there will be news. It is planned to create a street in the far east of Alameda de S. Damasus, within what is now the garden, and to distribute the traffic from the city center. The remaining garden is enhanced with more plant species, and have a new design, giving an idea of urban forest. The project, coordinated by Maria Manuel Oliveira, the department of architecture at the University of Minho, provides the return of the fountain of Toural, public source of the sixteenth century passed, about one hundred years, the garden of Caramel. One of the central ideas expressed by the architects is the reuse of existing elements, such as furniture. The assistance is extended to the Republic of Brazil and off street of Santo António, changing the configuration of public transport. The taxi stand will be reduced and parking of buses transferred to the field of Kitchen. In the tower of the old wall with the inscription Here Born Portugal plans to establish a viewpoint that is an ideal place to observe the new floor of the square, designed by the plastic artist Ana Jotta, based on the same rocks of quartz and basalt now available. The assistance will be financed by EU funds after being approved an application to the program of urban regeneration of the NSRF in the value of 9.9 million. Authority takes possession of convent Well near the Toural, the former Convent of Dominica, in the seventeenth century, will be incorporated in the project of Capital of Culture. The municipality approved yesterday by the declaration of ownership of the property where usucapião are installed several cultural associations. In the building, now dilapidated, will be installed in the residence artists. The camera will have to find an alternative site for the installation of the seats of Tertulia Nicolina and Child Center of Popular Culture, although not yet officially have contacted the associations. The building for the House of Memory is also flagged. This is an old industrial plastics, the Count of Margaride avenue, into the city. This partially empty factory has an area free in the back so that the building is created from scratch.

Extract 4: Iran rallies planned amid clampdown Anti-government protesters in Iran have announced they are to hold another rally in the capital to dispute the veracity of a presidential election. Supporters of candidate Mir Hossein Mousavi called on Wednesday for a rally to go ahead at 5pm local time (13:30 GMT), despite the authorities imposing a ban on the opposition gatherings. Mahmoud Ahmadinejad, the incumbent president, was officially declared winner of Friday’s election by a margin of two-to-one over Mir Hossein Mousavi. Hossein, a reformist candidate who was the nearest rival to Ahmadinejad, a conservative, has accused the authorities of rigging the vote. But Ahmadinejad has said that the result proved he has popular support. “The election result confirmed the work of the ninth government which was based on honesty and service to the people,” he said on Wednesday in a statement to Iran’s ISNA news agency. Violence on tape Despite the restrictions placed by the government on the media, violent scenes of police beating Mousavi supporters taken on mobile phones have been broadcast on news bulletins across the world. The Revolutionary Guard has warned the country’s online media it will face legal action if it “creates tension”. Within the country, mobile phone text services have been down since the election. There is no access to Facebook, Twitter, or YouTube. The interior ministry has ordered an investigation into an attack on university students in which it is claimed four people were killed. Anoushaka Maraslian, a Middle East analyst in London, told Al Jazeera: “University cities in Iran have always been very active in political dissent. That’s the concern of the elders; that’s the concern of the Guardian Council, and that’s why they’re making conessions, because they realise that young Iranians are leading the protests …with parallels to [the revolution in] 1979.” At least seven people have been killed in recent clashes between the authorities and the opposition movement, according to state media reports, while hundreds more are thought to have been injured. For its part, the foreign ministry summoned the Swiss ambassador, who represents US interests in Tehran, on Wednesday to protest at “interventionist” US statements on Iran’s election. Obama told CNBC there appeared to be little difference in policy between Ahmadinejad and Mousavi. “Either way we are going to be dealing with an Iranian regime that has historically been hostile to the United States,” he said. Mousavi has called on his supporters to hold peaceful demonstrations or gather in mosques on Thursday in solidarity with people killed or hurt in the post-election unrest. “In the course of the past days and as a consequence of illegal and violent encounters with [people protesting] against the outcome of the presidential election, a number of our countrymen were wounded or martyred,” Mousavi said on his website. “I ask the people to express their solidarity with the families …by coming together in mosques or taking part in peaceful demonstrations.”

Although it is clear that neither of the texts reproduced in Extracts 3 and 4 was originally written in English, their quality varies; Extract 3 is of very poor quality, and sometimes even incomprehensible, whereas Extract 4, despite not being entirely correct, is rather clear and intelligible. A reader of English without any knowledge of Portuguese will understand the translation of the article in Extract 4 better than they will understand the translation of the article in Extract 3. Surprisingly, they were both published in the same newspaper, the Portuguese quality newspaper Público. In order to avoid any bias arising from editorial policies, the articles were deliberately selected, at random, from two different sections of the same newspaper. Extract 3 was published in the Local news section, whereas Extract 4 was published in the World section of the newspaper. They were then translated into English using Google Translate (http://translate.google.com), which produced the English version of the texts transcribed above. The oddness often found in translated texts is a good trigger of suspected plagiarism, which can be complemented with machine translation so as to enable the search and subsequent side-by-side comparison of the suspect text against the potential original. Indeed, as I explained elsewhere (Sousa-Silva 2014), machine-translating suspect texts (in this case, written in Portuguese) into English should give the forensic linguist a clue as to whether the text might have originated somewhere else, in which case it would be considered plagiarism. Extracts 5 and 6 illustrate this method. Extract 5 reproduces the article that was originally published in Portuguese. The news piece does not attribute the text to any news agency in particular; on the contrary, only a general reference to “Agencies” is initially made. After translating this text into English, a few sentences were selected to perform an Internet search using lexical items as keywords, while discarding function words. These lexical items were therefore used as filtered n-grams (Maia et al. 2008). The search based on these search parameters returned two relevant articles: one was published by The Australian newspaper[5], and the other one was published on the Channel News Asia website[6]. With the exception of minor differences in details related to dates (e.g. ‘Sunday’ or ‘weekend’) and a paragraph used by Channel News Asia that was left out by The Australian, the two articles were entirely identical. In both cases, authorship was attributed to the same source, Agence France Presse (AFP) and, in the case of Channel News Asia, to “ls/yb”. Extract 6 transcribes the text published originally by The Australian. Since the two texts are reproduced in Extracts 5 and 6 in their original language, the comparison focused on identifying the strings with overlapping ideas, rather than the strings of identical text. The underlined text shows the overlapping strings. The numbers at the beginning of the underlined strings show the matching strings in the other text.

Extract 5: The Público news article Encontro com Abbas em Washington Obama defende um Estado palestiniano e o fim da expansão dos colonatos 2009-05-28 23:25:00 PÚBLICO, Agências O Presidente Barack Obama defendeu hoje a criação de um Estado palestiniano.
[01]No fim do seu primeiro encontro com o presidente da Autoridade Palestiniana, o líder norte-americano repetiu uma vez mais o seu [02]apelo a Israel [02]para que ponha fim à construção nos colonatos erguidos dos Territórios Palestinianos e honre os compromissos que assumiu. As duas partes, afirmou Obama na Casa Branca, têm [05]“obrigações face ao roteiro” — o plano internacional de 2003 para a resolução do conflito israelo-palestiniano. Nestas inclui-se “parar com a colonização”. [04]Durante a discussão com o novo primeiro-ministro israelita, Benjamin Netanyahu, a semana passada, “fui muito claro quanto à necessidade de travar a colonização”, esclareceu ainda Obama. Os palestinianos devem por seu turno fazer progressos na

Extract 6: The Australian news article

Obama presses Israel on settlements but rules out peace timetable May 29, 2009 US President Barack Obama has renewed pressure on Israel over settlements but rejected a timetable for his peace drive, noting domestic pressures heaped on Israeli Prime Minister Benjamin Netanyahu. [01]As Mr Obama met Palestinian leader Mahmud Abbas for the first time as president, he [02]called for a halt to settlement building on the occupied West Bank, as his administration sparred with Israel over the sensitive issue. Mr Obama vowed an “aggressive” mediation effort, ahead of his visit to Saudi Arabia and Egypt next week, while Mr Abbas pledged to live up to all previous peace agreements and warned [03]“time is of the essence” for a two-state solution. [04]The US president recalled that last week he had been “very clear” with Mr Netanyahu about the need to “stop settlements” and again stated his desire to see a two-state solution to the Israeli-Palestinian conflict. Asked if he would strong-arm Israel if it did not back down in its refusal to support a Palestinian state, Mr Obama said: “I think it’s important not to assume the worst, but to assume the best”. He rejected an opportunity to set a date for the establishment of a “viable, potential” Palestinian state. “I want to see progress made, and we will work very aggressively to achieve it. I don’t want to put an artificial timetable,” he said. “I am confident that we can move this forward if all parties are ready to meet their obligations.” On Wednesday, Secretary of State Hillary Clinton had significantly hardened the US position on settlements, prompting a blunt dismissal from Israel. But Mr Obama appeared to give Netanyahu some leeway, noting the fierce pressures imposed on the Israeli leader by his hawkish right-wing coalition. “I think that we don’t have a moment to lose, but I also don’t make decisions based on just a conversation that we had last week,” Mr Obama said. “Because obviously Prime Minister Netanyahu has to work through these issues in his own government, in his own coalition.” The US president also called on Mr Abbas to offer security improvements to Israel and to quell anti-Israel incitement in Palestinian mosques and schools. Mr Abbas warned that all parties should work to alleviate the plight of the Palestinians and move towards statehood. “I would like to take this opportunity to affirm to you that we are fully committed to all of our [05]obligations under the roadmap, from the ‘A’ to the ‘Z’,” he said. Mr Abbas added that he had shared ideas with Mr Obama based on the roadmap and the 2002 Saudi peace plan backed by the Arab league. The US-backed roadmap calls for a halt to Jewish settlement activity in Palestinian territories and an end to Palestinian attacks against Israel but has made little progress since it was drafted in 2003. Ms Clinton said Mr Obama “wants to see a stop to settlements. [06]Not some settlements, not outposts, not natural growth exceptions.” But Israel dismissed the blunt US call. [07]“Normal life” will be allowed in settlements in the occupied West Bank, government spokesman Mark Regev said, using a euphemism for continuing construction to accommodate population growth. He added the fate of settlements “will be determined in final status negotiations between Israel and the Palestinians and in the interim, normal life must be allowed to continue in those communities.” The Palestinian Authority has ruled out restarting peace talks with Israel unless it removes all roadblocks and freezes settlement activity. Mr Netanyahu told Mr Obama last week at their first White House meeting that he was willing to “immediately” relaunch the peace talks but failed to publicly back the creation of a Palestinian state or to freeze settlement activity. 
The Israeli prime minister told his cabinet at the weekend he did not intend to build new settlements but that “it makes no sense to ask us not to answer to the needs of natural growth and to stop all construction,” aides said. The Abbas meeting represented Mr Obama’s latest attempt to revive the stalled Middle East peace process, which have included talks with Jordan’s King Abdullah II, Mr Netanyahu and in London with Saudi King Abdullah. Next week, Mr Obama will meet the Saudi King in Riyadh and deliver a long-awaited address to the Muslim world in Cairo. But he said he would not lay out his long-awaited peace plan in the speech, which he said was designed to lay out a path for a “better” US relationship with the Islamic world. AFP
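
The keyword selection step described before Extract 5 (translate the suspect text, discard function words, and search with the remaining lexical items) can be illustrated with a short sketch. This is only one possible reading of "filtered n-grams" and not necessarily the procedure of Maia et al. (2008): the stop list `FUNCTION_WORDS` is a deliberately tiny, hypothetical inventory, the names `lexical_items`, `filtered_ngrams` and `search_query` are illustrative, and the sample sentence is adapted from Extract 6 rather than taken from the machine-translated text.

```python
import re

# Assumption: a very small, illustrative list of English function words.
# A real analysis would rely on a much fuller inventory.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "at",
    "for", "with", "by", "from", "as", "that", "this", "his", "her", "its",
    "he", "she", "it", "they", "is", "are", "was", "were", "be", "been",
    "has", "have", "had", "not",
}


def lexical_items(sentence):
    """Return the lowercased lexical (content) words of a sentence, in order."""
    tokens = re.findall(r"[a-z]+(?:[-'][a-z]+)*", sentence.lower())
    return [t for t in tokens if t not in FUNCTION_WORDS]


def filtered_ngrams(sentence, n=3):
    """Build n-grams over the lexical items only, skipping function words."""
    items = lexical_items(sentence)
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def search_query(sentence, max_terms=8):
    """Join the first few lexical items into a keyword query string."""
    return " ".join(lexical_items(sentence)[:max_terms])


# Illustrative sentence adapted from Extract 6.
sentence = ("The US president called for a halt to settlement building "
            "on the occupied West Bank.")
print(search_query(sentence))
print(filtered_ngrams(sentence)[:2])
```

The resulting query string could be pasted into any search engine; the sketch stops short of the search itself, since the article does not specify which search tool or parameters were used.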

The shallow linguistic analysis above shows that some sentences containing overlapping ideas consist of quotations, and hence tend to be appropriately used in the text. As they quote someone else’s direct speech, they are the type of facts that cannot be subject to plagiarism. The analysis also reveals that the order of the ideas differs in the two texts, so overlapping strings are used in different sections of the article. This might suggest that the text was produced independently. Additionally, the Portuguese article was published on 28 May, whereas the articles published by The Australian and on the Channel News Asia website both appeared on 29 May. Although prior publication is a strong indicator of originality, this does not mean that the Portuguese article does not derive from the original AFP newswire, especially considering that the two World section news articles (which attribute authorship to an international news agency, AFP) greatly overlap. Although access to the original AFP newswire is restricted, comparison with the two articles published on 29 May suggests that the Portuguese article also derives, at least partly, from the same source. The comparison shows, as well, that many strings in the article that are supposed to have been produced independently overlap with strings in the text whose authorship is attributed to AFP. Strikingly, the sentence “Ms Clinton said Mr Obama ‘wants to see a stop to settlements. Not some settlements, not outposts, not natural growth exceptions’” is attributed to Hillary Clinton in the Portuguese text, but AFP describes it as Obama’s reported speech.

[5] why oddness matters

The results of the analysis provide evidence that news plagiarism exists and can be detected, even in instances of text reporting facts. It is also forbidden, and seriously punished, by news corporations. The cases discussed demonstrate that, although quality newspapers are more careful in citing their sources (usually well-known international agencies), attribution is often incomplete, inadequate, or vague. In the cases presented in this paper, for instance, JN made no attribution at all, Público attributed authorship to “Agencies” without naming any agencies in particular, and TVI lifted the original text entirely and passed it off as their own. These practices commonly represent a violation of established standards and ethics policies, when these are regularly enforced. For instance, although Público has a clear ethics policy and instructions on when and how to cite, it published an article vaguely attributing authorship to “Agencies”. In this respect, news plagiarism is not much different from academic plagiarism, with the exception that the latter is done by people training as writers, whereas the former is done by professional writers. The analysis of the texts also shows that (free) machine translation tools are a good resource to test suspect cases of translingual plagiarism. In the case discussed, machine-translating a non-suspect article enabled the selection of some sentences that were used to conduct an Internet search. After discarding the function words and focusing on the lexical items, two articles published by different news companies were found that were likely to derive from the same source. Although it could be argued that the contrastive analysis of the Portuguese (suspect) text against the text whose authorship is attributed to AFP is not enough to sustain the claims of plagiarism, it clearly shows that the Portuguese version has not been produced independently, despite the absence of a one-to-one match between the Portuguese and the English versions. This suggests that the same piece of news is highly likely to include different releases from the foreign press and international websites.

[6] conclusion

The research presented in this article, despite being built upon a shallow linguistic analysis, supported the design of a new approach to translingual plagiarism detection, whose potential was previously demonstrated (Sousa-Silva 2014). It adds to an extensive body of research conducted over the last decades, which demonstrates that forensic linguistics has investigative and evidential potential in cases of plagiarism, as well as in cases of copyright infringement. On the investigative side, forensic linguistic analysis has assisted in the development of methods, tools and procedures to reveal and detect instances of plagiarism. On the evidential side, this approach has long served to demonstrate why a certain instance of reused text is plagiarism or, conversely, why a certain text is falsely accused. The latter, in particular, is an area that requires a more in-depth linguistic analysis, which is beyond the scope of this article. The forensic nature of plagiarism has often been challenged, on the grounds that most cases of plagiarism (such as academic ones) do not involve the courts. Indeed, academic plagiarism cases tend to be managed by the academy, just as news plagiarism cases tend to be addressed by the media corporations involved. Therefore, they are usually, but not always, judged as a moral rather than a legal issue, and settled outside the courts of law. The involvement of the courts of law in plagiarism cases (including academic ones) is not new, especially as a means of rescinding degrees. Nevertheless, given that accusations of plagiarism can and do have serious implications for the suspected plagiarist’s life, proving or disproving an instance of plagiarism is unquestionably relevant, both within and outside the courts of law. The future of research into plagiarism is anything but dull, and clearly offers a great opportunity for collaborative research involving forensic as well as computational linguists and engineers.
Although strong methods of linguistic research into plagiarism have been developed, there is always room for improvement, not only by designing new analytic methods, but also by adapting existing ones (whose relevance has been demonstrated) to new challenges. Computational forensic linguistics is definitely an area from which plagiarism detection can greatly benefit. Although systems that use linguistic information perform well, simple string-matching software often returns disappointing results. In this respect, Maia et al.’s (2008, p. 83) argument for collaboration between linguists and engineers remains as valid today as it was then: “[w]hat is needed is good will and serious attempts by both sides to understand each other’s point of view. If this can be made to happen, everyone will benefit and the results for research will be far greater than if they continue to work separately.” Like Alice, one cannot but become curiouser and curiouser…

[7] acknowledgments

This article is based on the research conducted as part of my PhD (Sousa-Silva 2013), and different aspects were presented at the 9th International Conference of the IAFL, in Amsterdam, in 2009, and at the IAMCR Conference, which took place in Braga in 2010. I would like to thank Belinda Maia, with whom I thoroughly discussed the research presented in this article. Her comments, her opinion, and her feedback were invaluable to the outcome of this study, and her permanent support of my research into forensic linguistics is truly appreciated. This work was partially supported by grant SFRH/BD/47890/2008 FCT-Portugal, co-financed by POPH/FSE.