inkdroid: Paper or Plastic
https://inkdroid.org/

Information Flux

Bossewitch & Sinnreich (2012) contains a useful description of information flux, a conceptual framework for thinking about the flow of information between an individual and a network with respect to knowledge and power. The authors build on well known analyses of surveillance and the panopticon (Foucault, 2012) to provide additional blueprints or strategies of information transfer, and a means for generating them.

This makes it sound complicated but really it’s just a method of diagramming how information moves in a system between various parties and giving the pattern a name.

Panopticon

Sousveillance

Off the Grid

Promiscuous Broadcaster

Voracious Collector

It’s interesting how power is written into these blueprints (encoded as multiple As and a singular B), particularly how less powerful parties can become powerful over time due to the information flux that is at play. The authors also note that information flux is a reductionist technique, in that it does not necessarily speak of values, which are lost when the transfers are simply diagrammed.

I thought this idea was particularly compelling and useful when considering the work I’ve been a part of with Documenting the Now, which is attempting to position research and archival practices for social media. Companies like Twitter and Facebook are powerful parties, but the archives are powerful as well when considering their interactions with the voices that are present in social media. Those same voices, collectively, also have power that they can exert. Anyway, these are just notes to remind myself later of this useful concept–and to share it with you if you are interested as well.

Here are some references mentioned in the article that I flagged to follow up on:

Gandy Jr, O. (2006). The new politics of surveillance and visibility, chapter Data Mining, Surveillance, and Discrimination in the Post-9/11 Environment, pages 363–384. University of Toronto Press. (On the ways in which electronic information resists deletion in its very nature).

Nippert-Eng, C. E. (2010). Islands of privacy. University of Chicago Press. (An ethnography of secret sharing – sounds like a really tough thing to do an ethnography on, so I’m super interested to read it.)

Clarke, A. C. and Baxter, S. (2000). The Light of Other Days. Tor. (A sci-fi novel about a technology that allows people to spy on each other via light sent through wormholes. The story explores the effects this technology has upon society.)


Prospectus

I’ve been trying to keep this blog updated as I move through the PhD program at the UMD iSchool. Sometimes it’s difficult to share things here for fear that the content or ideas are just too rough around the edges. The big assumption being that anybody even finds it, and then finds the time to read it.

As with most PhD programs the work is leading up to the dissertation. I’m finishing my coursework this semester and so I have put together a prospectus for the research I’d like to do in my dissertation. I’m going to spend the next 8 months or so doing a lot of background reading and writing about it, in order to set up this research. I imagine this prospectus will get revised some more before I share it with my committee, and the trajectory itself will surely change as I work through it. But I thought I’d share the prospectus in this preliminary state to see if anyone has suggestions for things to read or angles to take.

Many thanks to my advisor Ricky Punzalan for his help getting me this far.

Appraisal Practices in Web Archives

It is difficult to imagine today’s scientific, cultural and political systems without the web and the underlying Internet. As the web has become a dominant form of global communications and publishing over the last 25 years we have witnessed the emergence of web archiving as an increasingly important activity. Web archiving is the practice of collecting content from the web for preservation, which is then made accessible at another part of the web known as a web archive. Developing record keeping practices for web content is extremely important for the production of history (Brügger, 2017) and for sustaining the networked public sphere (Benkler, 2006). However, even with widespread practice we still understand very little about the processes by which web content is being selected for an archive.

Part of the reason for this is that the web is an immensely large, decentralized and constantly changing information landscape. Despite efforts to archive the entire web (Kahle, 2007), the idea of a complete archive of the web remains both economically infeasible (Rosenthal, 2012), and theoretically impossible (Masanès, 2006). Features of the web’s Hypertext Transfer Protocol (HTTP), such as code-on-demand (Fielding, 2000), content caching (Fielding, Nottingham, & Reschke, 2014) and personalization (Barth, 2011), have transformed what was originally conceived of as a document oriented web into an information system that delivers information based on who you are, when you ask, and what software you use (Berners-Lee & Fischetti, 2000). The very notion of a singular artifact that can be archived, which has been under strain since the introduction of electronic records (Bearman, 1989), is now being pushed to its conceptual limit.

The web is also a site of constant breakdown (Bowker & Star, 2000) in the form of broken links, failed business models, unsustainable infrastructure, obsolescence and general neglect. Ceglowski (2011) has estimated that about a quarter of all links break every 7 years. Even within highly curated regions of the web, such as scholarly publishing (Sanderson, Phillips, & Sompel, 2011) and jurisprudence (Zittrain, Albert, & Lessig, 2014), rates of link rot can be as high as 50%. Web archiving projects work in varying measures to stem this tide of loss–to save what is deemed worth saving before it becomes 404 Not Found. In many ways web archiving can be seen as a form of repair or maintenance work (Graham & Thrift, 2007; Jackson, 2014) that is conducted by archivists in collaboration with each other, as well as with tools and infrastructures that support their efforts.

Deciding what to keep and what gets to be labeled archival has long been a topic of discussion in archival science. Over the past two centuries archival researchers have developed a body of literature around the concept of appraisal, which is the practice of identifying and retaining records of enduring value. The rapid increase in the amount of records being generated, which began in the mid-20th century, led to the inevitable realization that it is impractical to attempt to preserve the complete documentary record. Appraisal decisions must be made, which necessarily shape the archive over time, and by extension our knowledge of the past (Bearman, 1989; Cook, 2011). It is in the particular contingencies of the historical moment that the archive is created, sustained and used (Booms, 1987; Harris, 2002). The desire for a technology that enables a complete archival record of the web, where everything is preserved and remembered in an archival panopticon, is an idea that has deep philosophical roots, and many social and political ramifications (Brothman, 2001; Mayer-Schönberger, 2011).

Notwithstanding these theoretical and practical complexities, the construction of web archives presents new design opportunities for archivists to work in collaboration with each other, as well as with the systems, services and bespoke software solutions used for performing the work. It is essential for these designs to be informed by a better understanding of the processes by which web content is selected for an archive. What are the approaches and theoretical underpinnings for appraisal in web archiving as a sociotechnical practice? To lay the foundation for answering this question I will be reviewing and integrating the research literature in three areas: Archives and Memory, Sociotechnical Systems (STS), and Praxiography.

Clearly, a firm grounding in the literature of appraisal practices in archives is an important dimension of this research project. Understanding the various appraisal techniques that have been articulated and deployed will help in assessing how these techniques are being translated to the appraisal of web content (Maemura, Becker, & Milligan, 2016). Particular attention will be paid to emerging practices for the appraisal of electronic records and web content. Because the web is a significantly different medium than archives have traditionally dealt with, it is important to situate archival appraisal within the larger context of social or collective memory practices (Jacobsen, Punzalan, & Hedstrom, 2013). In addition, the emerging practice of participatory archiving will also be examined to gain insight into how the web is reshaping the gatekeeping role of the archivist.

Appraisal practices for web content necessarily involve the use of computer technology, as both the means by which the archival processing is performed and the source of the content that is being archived. Any analysis of appraisal practices must account for the ways in which the archivist and the technology of the web work together as part of a sociotechnical system. While the specific technical implementations of web archiving systems are of interest, the subject of archival appraisal requires that these systems be studied for their social and cultural effects. The interdisciplinary approach of software studies provides a theoretical and methodological approach for analyzing computer technologies as assemblages of software, hardware, standards and social practices. Examining the literature of software studies as it relates to archival appraisal will also selectively include reading in the related areas of infrastructure, platform and algorithm studies.

Finally, since archival appraisal is at its core a practice, it is imperative to theoretically ground an analysis of appraisal using the literature of practice theory, or praxiography. Praxiography is a broad interdisciplinary field of research that draws upon branches of anthropology, sociology, history of science and philosophy in order to understand practice as a sociomaterial phenomenon. Ethnographic attention to topics such as rules, strategies, outcomes, training, mentorship, artifacts, work and history also provides an approach to empirical study that I plan on using in my research.

Timeline

2017-11-01 - Prospectus Draft

2017-12-01 - Prospectus Final Draft

2017-12-15 - Committee Review

2018-01-15 - Committee Approval Meeting

2018-09-01 - Proposal Final Draft

2018-10-01 - Proposal Defense

Reading

Archives and Memory

Anderson, K. D. (2011). Appraisal Learning Networks: How University Archivists Learn to Appraise Through Social Interaction. Los Angeles: University of California, Los Angeles.

Appraisal Talk

This is a draft of a talk I’m giving at SIGCIS on October 29, 2017. It’s part of a larger article that I will hopefully publish shortly or drop in a pre-print repository.

As the World Wide Web has become a prominent, if not the predominant, form of global communications and publishing over the last 25 years we have seen the emergence of web archiving as an increasingly important activity. The web is an immensely large and constantly changing information landscape that fundamentally resists the idea of “archiving it all” (Masanès, 2006). The web is also a site for constant breakdown in the form of broken links, failed business models, unsustainable infrastructure, obsolescence and general neglect. Web archiving projects work in varying measures to stem this tide of loss–to save what is deemed worth saving before it is 404 Not Found. In many ways you can think of web archiving as a form of repair or maintenance work that is conducted by archivists in collaboration with each other, as well as tools and infrastructures (Graham & Thrift, 2007 ; Jackson, 2014).

In this presentation I will describe some research I’ve been doing into how web archives are assembled and why I think this matters for historians of technology. What follows is essentially what Brügger (2012b) calls web historiography, where the focus is on the web as a particular technology of history rather than a particular history of web technology. The web, and by extension web archives, provide a singular view of life and culture since the web’s inception 25 years ago. Understanding how and why web archives are assembled is an important task for the scholars who are attempting to use them (Maemura, Becker, & Milligan, 2016). As we will see, it is the network of relationships and connections that a web archive is involved with that make it an archive.

By web archives I specifically mean archives of web content, not necessarily archives that are on the web. Brügger distinguishes between three types of content that can be found on the web:

digitized: content that has been converted to digital format by some means (image scanning, transcription, etc) and then placed on the web.

born-digital: content that is created in digital form (word processor files, blog posts, social media, digital photographs, etc) and can be found natively on the web.

reborn-digital: digitized or born-digital content that has been collected and preserved from the web, and then re-presented as part of a web archive.

It is this third category of reborn-digital content that I’m concerned with here. A prime example is the Internet Archive, which I imagine some of you have used as a source of material in your own research. There are now thousands of organizations around the world collecting web content for a variety of archival purposes.

The question of what and how web content ends up in an archive is of historiographical significance, because history is necessarily shaped by the evidence of the past that survives into the present. Since it is physically impossible to archive everything, archives have always contained gaps or silences. Trouillot (1995) provides a framework for thinking about the moments in which these silences enter the archive:

Silences enter the process of historical production at four crucial moments: the moment of fact creation (the making of sources); the moment of fact assembly (the making of archives); the moment of fact retrieval (the making of narratives); and the moment of retrospective significance (the making of history in the final instance).

Given the significance of the making of archives to the making of history, and the abundance of material on the web, how do archivists decide what to save?

Archivists have traditionally used the term appraisal to describe the process of determining the value of records, in order to justify their inclusion in the archive. While notions of value, and the methods for measuring it, differ, the activity of appraisal is central to the work of the archivist. To further specify this moment in which content becomes archival, Ketelaar (2001) introduced the neologism archivalization as

the conscious or unconscious choice (determined by social and cultural factors) to consider something worth archiving. Archivalization precedes archiving. The searchlight of archivalization has to sweep the world for something to light up in the archival sense, before we can proceed to register, to record, to inscribe it, in short before we archive it.

In order to better understand this process of lighting up web content in web archives I conducted 30 interviews with web archivists, software developers, researchers and activists to discover how they decide to preserve web content. Inspired by the work of Suchman (1995), Star (1999) and Kelty (2008) these were ethnographic interviews that aimed to develop a thick description of how practitioners enact appraisal in their particular work environments.

In the first pass at analysis I coded the jottings and field notes generated. These provided a detailed picture of the sociotechnical environment in which appraisal work is being performed (Summers & Punzalan, 2017). However, questions still remained about the particular psychological or social context for the decision making process around moments of archivalization in web archives.

On a second pass I performed a critical discourse analysis on the interview transcripts themselves. I selected critical discourse analysis (CDA) because it offers a theoretical framework for analyzing the way in which participants’ use of language reflects identity formation, figured worlds and communities of practice, while also speaking to the larger sociocultural context that web archiving work is taking place within.

A Discourse is a socially accepted association among ways of using language, of thinking, feeling, believing, valuing, and of acting that can be used to identify oneself as a member of a socially meaningful group or ‘social network’, or to signal (that one is playing) a socially meaningful ‘role’. (J. Gee, 2015, p. 143)

CDA provides a theoretical framework for empirically studying the way that form and function operate in language, and how this analysis can provide insight into social practices. One of CDA’s key proponents is James Gee, whose 7 building tasks provided me with a guide for analyzing my interview transcripts to gain insight into practices of appraisal in web archives (J. P. Gee, 2014). The 7 building tasks include:

Significance: how is language used to foreground and background certain things?

Activities: how is language being used to enact particular activities?

Identity: how is language being used to position specific identities and make them recognizable?

Relationships: what relationships are signaled in the use of language?

Politics: how are notions of value and norms established in the use of language?

Connections: how is language used to connect and disconnect ideas, activities, objects?

Sign systems and knowledge: how does language position (privilege or disprivilege) particular sign systems, or ways of knowing and believing?

There’s not enough time for me to get into all the details of my findings here, but I would like to share a brief look at this analysis as a way of introducing my key findings. All the names used in the transcriptions are pseudonyms, in order to allow the participants to be themselves as much as possible.

Line | Speaker | Utterance
41   | Jim     | Well Alex helped me get in contact with the employees /
42   |         | Alex was already on the ground with it.
43   | Ed      | Oh okay //
44   | Jim     | and Alex /
45   |         | KNEW /
46   |         | that it was going to be a lot of data /
47   |         | and was like /
48   |         | ok so [be a little more] /
49   | Ed      | [ahhhh]
50   | Jim     | careful with this

Here I am interviewing Jim, who works at a non-profit web archiving organization. I selected this snippet because it highlights how discourse reflects the relationships that are involved in the appraisal process. Just before this snippet Jim is talking about how he wasn’t sure whether a particular video streaming site could be archived because of the amount of data involved. He sought the advice of his immediate supervisor Ariana, who then brought in Alex, who is the Director of the archive. It turned out that the Director had a connection with a staff person who was working at the video streaming company, who could provide key information about the amount of data that needed to be archived. Here Jim is using the hierarchical, chain-of-command relationships to lend weight and formality to what is actually a much richer set of circular relationships within the organization. The relationships also extended outside the archive and into the organization that had created the video content.

We see this pattern reflected in another interview with Jack, an archivist at a large university who has been working to document the activities of the fracking industry within his state.

Line | Speaker | Utterance
1    | Jack    | I really see like one of / my next curatorial responsibilities being /
2    |         | not really more crawling or more selecting /
3    |         | but using the connections I’ve made here /
4    |         | to get more contact and more dialogue going with /
5    |         | with the actual communities I’ve been documenting //
6    |         | And I’m a little nervous about how it’s gonna go /
7    |         | because I went ahead and crawled a bunch of stuff /
8    |         | without really doing that in advance //

Here Jack is explicitly describing “connections” or relationships as an essential part of his job as an archivist. Just before this snippet he had finished describing how he got the idea to document fracking from a web archivist at another institution, who was already engaged in documenting fracking in his own state. Jack’s interest in documenting environmental issues had developed while working with a mentor at a previous university. Jack wanted to collaborate with this archivist to better document fracking activity as it extends across geopolitical boundaries. He sought approval from the Associate Dean of the Library, who was very supportive of the idea. However, as this snippet illustrates, Jack sees these professional relationships as necessary but not sufficient for doing the work. He sees dialogue with the communities being documented, in this case activist communities, as an important dimension of the work of web archiving.

In addition to focusing on relationships, I used Gee’s Making Strange Tool, a discourse analysis technique for foregrounding what might otherwise slip into the background:

In any communication, listeners/readers should try to act as if they were outsiders.

The use of crawling and selecting on line 2 is a phrase that Jack uses several times in the interview. Crawling refers to the behavior of software used to collect content from the web. The software that does this was originally referred to as a web spider because of the way it automatically and recursively follows links in web content for some period of time. But web spiders need to be told by a person where to begin crawling, which is the process of selection.

If you are thinking that selection and appraisal sound similar that’s because they are practically synonyms for each other. Both terms are concerned with identifying material that is of enduring value for preservation in an archive. Appraisal speaks to the theory, method or framework that is used for performing the activity of selection.

In physical archives, boxes of paper manuscripts, files, diskettes or hard drives change hands. A retiring researcher donates their personal papers or workstation to an archive. Or a particular business unit transfers a set of material to an archive according to a previously agreed upon record retention program. In either case a relationship between the record creator or owner and the archive is established. This relationship is intrinsic to the appraisal process.

But in web archives this material transaction is not necessary or it is transformed almost beyond recognition. The architecture and infrastructure of the web, as well as the underlying Internet, allow content to be instantly retrieved across vast distances. You only need to know the URL for the resource and to instruct your web client (be it a browser or a crawler) to retrieve it. When it is all working. As noted by Brügger (2012a) the reliability of archived copies of web content is not a given. Features of the HTTP protocol, such as cookies (Barth, 2011) and caching (Fielding, Nottingham, & Reschke, 2014) combined with the rendering capabilities of the client software mean that the idea of a single idealized, canonical representation of a web resource retreats from view. This seeming immateriality of web content is an illusion generated by the very real assemblage of physical networks, computing machinery, storage devices, electrical grids and cooling units that must operate in concert to deliver access.

As we saw with Jack, there is no need to enter into a conversation with a website owner to start archiving web content. When the content is on the web an archivist can start the archiving software, give it a URL, configure the crawling behavior (how far, how long, how much, etc) and let it do its work. The decision of what to crawl is detached from the relationships that have traditionally guided appraisal. But like a phantom limb, Jack still felt the significance of these connections between the archive and the content creators for doing archival work. He wanted to establish them, even if they were not technically necessary. The links of relationships between people have effectively been replaced by hypertext links that provide discoverability and access.

In many ways what this analysis seems to point to is an evolving practice of web archiving where traditional concepts of appraisal are being unbracketed from one context and reapplied in another. Focusing on the objects, be they paper files, boxes, or representations of HTTP transactions, is less at issue than the practices that involve those objects, and their network of interactions. This shift in attention recalls the work of ethnographer and philosopher Annemarie Mol, whose work studying the treatment of atherosclerosis highlights the importance of practice:

It is possible to refrain from understanding objects as the central focus of different people’s perspectives. It is possible to understand them instead as things manipulated in practices. If we do this–if instead of bracketing the practices in which objects are handled we foreground them–this has far reaching effects. Reality multiplies. (Mol, 2002, p. 5)

The web archive is situated among these multiple record realities, involving the creators of records, the preservers of records, and the users of records.

But to return to the question I started with: what does all this tell us about how web content is appraised, and about the historiography of the web? I think these brief examples highlight just how important it is to maintain the manifold of relationships between record creators and the archive. Appraisal, as it is embodied in the practices of archivists, and encoded into software tools, is a social enterprise that shapes the historical record. Just as the infrastructure of the web enables communication across great geographic distances, it also simultaneously moves to obscure the relationship between the archive and the archived. Further research is needed to discover practices that help bridge this gap and make it legible, while allowing for new conceptions of appraisal to develop and be translated.

If you’re a scholar who uses archives of web content I encourage you to reach out to the archivists you know, and to work with them to help build these practices and ensure that they are collecting the things you value. If you work as part of an organization and want to ensure that your web content is being collected and archived, try reaching out to an archivist to let them know of your interest. And of course if you are an archivist, and you are stymied by thinking about archiving web content, there are good reasons for that. The web is a big place, and it’s hard to know what to collect. Focusing on the relationships you have with the communities you document can help make it more manageable and meaningful.

Twitter and Tear Gas

In August 2014 I took part in a panel conversation at the Society of American Archivists meeting in Washington DC that focused on the imperative for archivists to interrogate the role of power, ethics and regulation in information systems. The conference itself stands out in my memory, because it began on 10 August, the day after Michael Brown was killed by police officer Darren Wilson in Ferguson, Missouri. I distinctly remember the hand that shot up immediately during the Q&A period to ask what, if anything, we will remember of the voices from Ferguson in social media that raised awareness of the injustice that had occurred there. Before anyone had much of a chance to respond another voice asked whether anyone had seen the blog post about how radically different Twitter and Facebook’s presentations of Ferguson were. The topics of power, ethics and regulation were not simply academic subjects for discussion; they were demands for understanding from information professionals actively engaged in the work of historical production.

The blog post mentioned that day was What Happens to #Ferguson Affects Ferguson by Zeynep Tufekci. It was published on the social media platform Medium, as the sustained protests in Ferguson began to propel the hashtag #BlackLivesMatter into many Twitter timelines and newsrooms around the world. Like so much of her work, Tufekci’s post asked her readers to think critically about the algorithmic shift we have been witnessing in our media and culture since the advent of the web and the rise of social media. Tufekci is a consummate public scholar, who uses online spaces like her blog, Twitter, Medium, TED talks and editorials in The Atlantic and The New York Times to advance a crucial discussion of how the affordances of information technology are both shaped, and being shaped, by social movements and political infrastructures. It is a pivotal time for scholars to step out from the pages of academic journals and into the World Wide Web spaces that are grappling with the impact of post-truth politics and fake news. It is into this time and place that Tufekci’s first book Twitter and Tear Gas: The Power and Fragility of Networked Protest is launched.

Tufekci’s book is divided into three main parts: 1) Making a Movement, 2) A Protester’s Tools, and 3) After the Protests. While these suggest a chronological ordering to the discussion, the different parts, and the ten chapters found within them, reflect a shifting attention to the specifics of networked social movements. Part 1 provides the reader with a general discussion of how the networked public sphere operates with respect to social movements. This is followed by Part 2, which takes a deeper dive into the specific affordances and sociotechnical logics of social media platforms such as Twitter, Facebook and Google. And finally, Part 3 integrates the previous discussion by articulating a theory of how social movements function in, and through, online spaces.

Throughout the book Tufekci focuses on the specifics of protest and counter-protest, while stressing that social media spaces are not disembodied and virtual phenomena, but actual, contingent configurations of people, technology and power. In teasing out the dimensions of the networked public sphere Tufekci reminds me of Kelty’s concept of a recursive public, in which the public’s participants are actively engaged in the maintenance, modification and design of the technical and material means that sustain the public itself (Kelty, 2008). In many ways Twitter and Tear Gas hacks the sociopolitical systems that it describes. It’s no mistake that the book is licensed under Creative Commons and is freely downloadable from its companion website.

Prior to her academic career, Tufekci worked as a software developer at IBM where she first encountered the information infrastructure we call the Internet. You can sense this training and engagement with practice in her work which always seems to be pushing up against, but not overstepping, the art of what is possible. As a sociologist she brings the eye of an ethnographer to her study of protest. Tufekci is not a distant observer, but a participant, with actual stakes in the political outcomes she describes. She is pictured on the dust jacket wearing a helmet to protect her from tear gas canisters that were shot into the crowd that she was a part of in the Gezi Park protests. The book sits on the solid foundations of her own experience as well as the experiences of activists and organisers that she interviews. But Twitter and Tear Gas also significantly engages with sociological theories to bring clarity and understanding to how social media and social movements are co-produced.

In the pages of Twitter and Tear Gas you will find scenes of protests from around the world that are put into conversation with each other. From Zapatista solidarity networks, to the disruption of the World Trade Organization in Seattle, the global anti-war protests after 9/11, to Occupy in Zuccotti Park, the Egyptian Revolution in Tahrir Square, the Gezi Park protests in Istanbul, the Indignados in the Plaza del Sol, the Umbrella Movement in Hong Kong, and BlackLivesMatter in Ferguson, Missouri. Twitter and Tear Gas functions as a historical document that describes how individuals engaged in political action were empowered and inextricably bound up with social media platforms. While it provides a useful map of the terrain for those of us in the present, I suspect that Twitter and Tear Gas will also be an essential text for future historians who are trying to reconstruct how these historical movements were entangled with information technology, when the applications, data sources and infrastructures no longer exist, or have been transformed by neglect, mergers and acquisitions, or the demands for something new, into something completely different. Even if we have web archives that preserve some “sliver of a sliver” of the past web (Harris, 2002) we still need to remember the stories of interaction and the technosocial contingencies that these dynamic platforms provided. Despite all the advances we have seen in information technology a book is still a useful way to do this.

One of the primary theoretical contributions of this text is the concept of capacity, or a social movement’s ability to marshal and effect narrative, electoral and disruptive change. Tufekci outlines how the affordances of social media platforms make possible the leaderless adhocracy of just-in-time protests, and how these compare to our historical understanding of the African-American Civil Rights Movement in the United States. The use of hashtags in Twitter allows protesters to communicate at great speed and distance to mobilise direct action in near real time. Planning the Civil Rights Movement took over a decade, and involved the development of complex communication networks to support long term strategic planning.

Being able to skip this capacity building phase allows networked social movements to respond more quickly and in a more decentralised fashion. This gives movements currency and can make them difficult for those in power to control. But doing so can often land these agile protests in what Tufekci calls a tactical freeze, where, after an initial successful march, the movement is unable to make further collective decisions that will advance their cause. In some ways this argument recalls Gladwell (2010), who uses the notion of weak ties (Granovetter, 1973) to contend that social media driven protests are fundamentally unable to produce significant activism on par with what was achieved during the civil rights era. But Tufekci is making a more nuanced point that draws upon a separate literature, notably the notion of capacity in the development work of Sen (1993) and the capability theory of justice of Nussbaum (2003). Tufekci’s application of these concepts to social movements, and her categories of capacity, combined with the mechanics of signaling by which capacities are observed and responded to, operate as a framework for understanding why we cannot use simple outcome measures, such as the number of people who attend a protest, when trying to understand the impact of networked social movements. For those who are listening she is also pointing to an area that is much in need of innovation, experimentation and study: tools and practices for collective decision-making that will allow people to thaw these tactical freezes.

Another significant theoretical thread running through Twitter and Tear Gas concerns the important role that attention plays in understanding the dynamics of networked protest. Social media is well understood as an attention economy, where users work for likes and retweets to get eyes on their content. Social and financial rewards can follow from this attention. While networked protest operates in a similar fashion, the dynamics of attention can often work against its participants, as they criticise each other in order to distinguish themselves. Tufekci also relates how the affordances of advertising platforms such as Google and Facebook made it profitable for Macedonian teenagers to craft and spread fake news stories that would draw attention away from traditional news sources, generate clicks and ad revenue, and as a side effect, profoundly disrupt political discourse.

Perhaps most significant is the new role that attention denial plays in online spaces, as a tactic employed by the state and other actors seeking to shape public opinion. Tufekci calls this the Reverse-Streisand Effect, since it uses the Internet to funnel attention away from a particular topic at hand. She highlights the work of King, Pan, & Roberts (2013), which analysed how China’s so-called 50 Cent Army of web commenters shapes public opinion not simply by censoring material on the web, but by drawing attention elsewhere at key moments. Social media platforms are geo-political arenas, where bot armies are deployed to drown out hashtags and thwart communication, or attack individuals with threats and volumes of traffic that severely disrupt the target’s use of the platform. When people’s eyes can be guided, or pushed away, censorship is no longer needed. It is truly chilling to consider the lengths that those in power, or seeking power, might stoop to in order to manufacture these events when needed.

As significant as these theoretical contributions are, it is Tufekci’s personal voice combined with flashes of insight that I remember most from Twitter and Tear Gas. The use of Occupy’s human microphone to amplify speakers’ voices and shape speech is a poignant metaphor for Twitter’s capacity for amplifying short message bursts that cascade through the network as retweets. In another passage Tufekci considers why so many protest camps set up libraries, and connects the work being done in social media to the work of pamphleteers throughout history. She describes the surreal experience of watching pastel hearts float across Periscope videos from Turkish Parliamentarians who were preparing to be bombed during an attempted coup. Near the end of the book she draws an analogy between the rise of fake news fueled by social media, and the ways in which Gutenberg’s printing press escalated the Catholic Church’s distribution of indulgences, opening itself up to the criticism found in Luther’s 95 theses–which were also printed. These stories generate a humanistic outlook that neither denies nor celebrates big data:

There is no perfect, ideal platform for social movements. There is no neutrality or impartiality–ethics, norms, identities, and compromise permeate all discussions and choices of design, affordances, policies, and algorithms on online platforms. And yet given the role of these platforms in governance and expression, acknowledging and exploring these ramifications and dimensions seems more important than ever. (p. 185)

In fact, saying that Tufekci’s book has an explicit narrative arc is an oversimplification. It functions more like a fabric that weaves theory, observation and story, as topics are introduced and returned to later; there is no set chronology or teleology that is being pursued. On finishing the book it is clear how the concepts of attention and capacity are present throughout. But Tufekci makes these theoretical connections not with over-abstraction and heavy citation, but by presenting scenes of protest where these concepts are being enacted. While there are certainly references to the supporting literature, the text is not densely packed with them. Finer theoretical manoeuvres are reserved for the endnotes, and do not overwhelm the reader as they move through the text. If you are teaching a course that surveys communications, sociology or the politics of social media platforms and information infrastructures more generally, Twitter and Tear Gas belongs on your syllabus. Your students will thank you: they can download the book for free, they can follow Tufekci on Twitter and Facebook, and her book speaks directly to the socio-political moment we are all living in.

Sen, A. (1993). Capability and well-being. In M. Nussbaum & A. Sen (Eds.), The quality of life. Oxford: Clarendon Press.

Analyzing Retweets

Yesterday I got into conversation with Ben Nimmo and Brian Krebs, who were the subjects of an intense botnet attack on Twitter. They were gaining large numbers of followers in a short period of time, and a selection of their tweets were being artificially boosted by as many as 80,000 retweets. You can read Brian’s detailed writeup here.

At first it seemed completely counter-intuitive to me that someone would direct their botnet (which in all likelihood they are paying for) to boost the followers and messages of people they disagree with. But I wasn’t thinking deviously enough. As Brian points out in his post, Twitter appear to have stepped up suspending botnet accounts and the beneficiaries of the botnet traffic. So boosting a user you don’t like could get them suspended. In addition, as Ben wrote about a few days ago, it is also an intimidation tactic that disrupts the target’s use of Twitter.

Our specific conversation was about how to analyze the retweets, since there are tens of thousands of them and Twitter’s statuses/retweets API endpoint is limited to fetching the last 100 retweets. However, it is possible to use the search/tweets endpoint to search for the retweets using the text of the tweet, as long as the retweets have been sent in the last 7 days, which is the furthest back Twitter allows you to search. So there is a brief window in which you can fetch the retweets.

If you, like Ben and Brian, find yourself needing to collect retweets I thought I would document the process a little bit here. The basic approach should work with different Twitter clients if you prefer to work in another language—I used twarc because I’m familiar with it, and it handles rate limiting easily. I also worked from the command line to explain the process at a higher level. You could certainly write a small program to do this.

So, I wanted to get the retweets for a tweet from Brian that generated a great deal of rapid retweet traffic that appeared to him to be bot driven:

Bring on the bots and sock puppet accounts. Amazing how a tweet about Putin always engenders defensive responses about Trump.

First you’ll need to install twarc for interacting with the Twitter API from the command line. If you don’t have Python yet you’ll need to go get that first.

pip install twarc

Now you are ready to tell twarc about your Twitter API keys. Go over to apps.twitter.com, create an app and note the keys down so you can tell twarc about them with:

twarc configure

With twarc and your Twitter keys in hand you are ready to collect the tweets using twarc’s search command. To run a search you need a query. In this case we’re going to use some identifying text from the tweet in question. The results are line-oriented JSON, where every line is a complete JSON document for a tweet. The JSON is exactly what is returned from the Twitter API for a tweet.

twarc search 'Bring on the bots and sock puppet accounts Amazing how a tweet about Putin' > briankrebs.jsonl

This command could run for a while depending on how many retweets there are to fetch. Twitter only allow you to get 17,000 every 15 minutes. twarc will handle waiting until it can go get more. You can watch the twarc.log file, which contains information about what it is doing.

Once that finishes you probably want to be absolutely sure the file only includes retweets of that specific tweet. It’s possible that your search generated some false positives if the words happened to be used in tweets that were not retweets of your subject tweet. One handy way of doing this is to use jq to filter them using the tweet id of the original tweet:
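The jq snippet embedded in the original post isn’t preserved in this copy, but a minimal sketch of such a filter might look like this (the id below is a placeholder for the id_str of the original tweet you searched for):

# keep only retweets whose original tweet id matches (placeholder id)
jq -c 'select(.retweeted_status.id_str == "ORIGINAL_TWEET_ID")' briankrebs.jsonl > briankrebs-retweets.jsonl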

Now that you have the JSON for the retweets you can do analysis of the users by creating a CSV file of information about them. For example I was interested in looking at the followers, friends and tweets counts, as well as when the account was created and the user’s preferred language. Yes, this user profile information can be found in the information you get for each tweet, or in this case, retweet. For the full details check out the Tweet Field Guide from Twitter. jq is also pretty good at extracting bits of the JSON and writing them as CSV:
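The one-liner embedded in the original post isn’t preserved here either; a sketch of one way to do it, using user fields from Twitter’s v1.1 tweet JSON (file names carry over from the steps above):

# write a header row, then one CSV row of profile information per retweet
echo 'screen_name,followers,friends,tweets,created_at,lang' > briankrebs-users.csv
jq -r '[.user.screen_name, .user.followers_count, .user.friends_count, .user.statuses_count, .user.created_at, .user.lang] | @csv' briankrebs-retweets.jsonl >> briankrebs-users.csv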

Here is the file I generated. You should be able to open that in your spreadsheet software of choice and look for patterns.

One other thing I was interested in doing was seeing what connections there might be between the retweeters of Brian’s tweet and the retweeters of another tweet by Ben, which he thought had been artificially boosted as well.

So I went through the exact same process to generate a file of user information for the retweeters of that tweet. With that file in hand I just needed to see what users were present in both. One nice little trick for doing the join is to use csvkit’s csvjoin.
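csvjoin performs an inner join by default, so assuming user CSVs for both accounts built as above, something along these lines keeps only the users present in both files:

csvjoin -c screen_name briankrebs-users.csv bennimmo-users.csv > overlap.csv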

I realize this post was a bit esoteric, but I wanted to write up the process in case you find yourself wanting to analyze retweets. One thing I don’t know the answer to is why the number of retweets returned isn’t exactly the same as the number of retweets displayed in the tweet. One explanation is that the search index is imperfect, and there is some hidden limitation apart from the ~7 day window that it will return results in. Another, more likely explanation is that some of the retweets were from accounts that have since been suspended or deleted, but the retweet count has not been adjusted to account for that. I guess only Twitter know the answer to that one.

Delete Forensics

TL;DR Deleted tweets in a #unitetheright dataset seem to largely be the result of Twitter proactively suspending accounts. Surprisingly, a number of previously identified deletes now appear to be available, which suggests users are temporarily taking their accounts private. Image and video URLs from protected, suspended and deleted accounts/tweets appear to still be available. The same appears to be true of Facebook.

Note: Data Artist Erin Gallagher provided lots of feedback and ideas for what follows in this post. Follow her on Medium to see more of her work, and details about this dataset shortly.

In my last post I jotted down some notes about how to identify deleted Twitter data using the technique of hydration. But, as I said near the end, calling these tweets deletes obscures what actually happened to the tweet. A delete implies that a user has decided to delete their tweet. Certainly this can happen, but the reality is a bit more complicated. Here are the scenarios I can think of (please get in touch if you can think of more):

1. The user could have decided to protect their account, or take it private. This will result in all their tweets becoming unavailable except to those users who are approved followers of the account.

2. The user could have decided to delete their account, which has the effect of deleting all of their tweets.

3. The user account could have been suspended by Twitter because it was identified as a source of spam or abuse of some kind.

4. If the tweet is not itself a retweet the user could have simply decided to delete the individual tweet.

5. If the tweet is a retweet then 1, 2, 3 or 4 may have happened to the original tweet.

6. If the tweet is a retweet and none of 1-4 hold then the user deleted their retweet. The original tweet still exists, but it is no longer marked as retweeted by the given user.

I know, this is like an IRS form from hell right? So how could we check these things programmatically? Let’s take a look at them one by one.

1. If an account has been protected you can go to the user’s Twitter profile on the web and look for the text “This account’s Tweets are protected.” in the HTML.

2. If the account has been completely deleted you can go to the user’s Twitter profile on the web and you will get a HTTP 404 Not Found error.

3. If the account has been suspended, attempting to fetch the user’s Twitter profile on the web will result in a HTTP 302 Found response that redirects to https://twitter.com/account/suspended

4. If the tweet is not a retweet and fetching the tweet on the web results in a HTTP 404 Not Found then the individual tweet has been deleted.

5. If the tweet is a retweet and one of 1, 2, 3 or 4 happened to the original tweet then that’s why it is no longer available.

6. If the tweet is a retweet and the original tweet is still available on the web then the user has decided to delete their retweet, or unretweet (I really hope that doesn’t become a word).

With this byzantine logic in hand it’s possible to write a program to do automated lookups on the live web, with some caching to prevent looking up the same information more than once. It is a bit slow because I added a sleep to not go at twitter.com too hard. The script also identifies itself with a link to the program on GitHub in the User-Agent string. I added this program deletes.py to the utility scripts in the twarc repository.
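deletes.py implements the logic above; as a rough illustration (not the actual script), the account-level checks amount to inspecting the HTTP response for a user’s profile page, which you can also do by hand with curl:

# hypothetical account name: a 404 means the account was deleted, a 302 to
# /account/suspended means it was suspended, and a 200 with "This account's
# Tweets are protected." in the body means it was protected
curl -s -o /dev/null -w '%{http_code} %{redirect_url}\n' https://twitter.com/SomeAccount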

So I ran deletes.py on the #unitetheright deletes I identified previously and here’s what it found:

Result                   | Count | Percent
ORIGINAL_USER_SUSPENDED  |  9437 | 57.2%
ORIGINAL_TWEET_DELETED   |  2529 | 15.3%
TWEET_OK                 |   980 |  5.9%
USER_DELETED             |   972 |  5.8%
USER_PROTECTED           |   654 |  3.9%
RETWEET_DELETED          |   612 |  3.7%
USER_SUSPENDED           |   502 |  3.0%
ORIGINAL_USER_PROTECTED  |   378 |  2.2%
TWEET_DELETED            |   367 |  2.2%
ORIGINAL_USER_DELETED    |    61 |  0.3%

I think it’s interesting to see that, at least with this dataset, the majority of the deletes were the result of Twitter proactively suspending users because of a tweet that had been retweeted a lot. Perhaps this is the result of Twitter monitoring when other users flag a user’s tweets as abusive or harmful, or block the user entirely. I think it speaks well of Twitter’s attempts to try to make their platform a more healthy environment. But of course we don’t know how many accounts ought to have been suspended, so we’re only seeing part of the story–the content that Twitter actually made efforts to address. But they appear to be trying, which is good to see.

Another stat that struck me as odd was the number of tweets that were actually available on the web (TWEET_OK). These are tweets that appeared to be unavailable three days ago when I hydrated my dataset. So in the past three days 980 tweets that appeared to be unavailable have reappeared. Since there’s no trash can on Twitter (you can’t undelete a tweet) that means that the creators of these tweets must have protected their account, and then flipped it back to public. I guess it’s also possible that Twitter suspended them, and then reversed their action. I’ve heard from other people who will take their account private when a tweet goes viral, to protect themselves from abuse and unwanted attention, and then turn it back to public again when the storm passes. I think this could be evidence of that happening.

One unexpected thing that I noticed in the process of digging through the results is that even after an account has been suspended it appears that media URLs associated with their tweets still resolve. For example the polNewsForever account was suspended but their profile image still resolves. In fact videos and images that polNewsForever have posted also still seem to resolve. The same is true of actual deletes. I’m not going to reference the images and videos here because they were suspended for a reason. So you will need to take my word for it…or run an experiment yourself…
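If you want to check this against your own data, here is a sketch of pulling the media URLs out of a file of deleted tweets (delete.json as built in the previous post; field names are from the standard tweet JSON):

jq -r '.entities.media[]?.media_url_https' delete.json | sort -u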

FWIW, a quick test on Facebook shows that it works the same way. I created a public post with an image, copied the URL for the image, deleted the post, and the image URL still worked. Maybe the content expires in their CDNs at some point? It would be weird if it just lived there forever like a dead neuron. I guess this highlights why it’s important to limit the distribution of the JSON data that contain these URLs.

Since the avatar URLs are still available it’s possible to go through the suspended accounts and look at their avatar images. Here’s what I found:

Notice the pattern? They aren’t eggheads, but pretty close. Another interesting thing to note is that 52% of the suspended accounts were created on or after August 11, 2017 (the date of the march). So a significant share of the suspensions look like Twitter trying to curb traffic created by bots.

UTR

I’ve always intended to use this blog as more of a place for rough working notes as well as somewhat more fully formed writing. So in that spirit, here are some rough notes from digging into a collection of tweets that used the #unitetheright hashtag. Specifically, I’ll describe a way of determining which tweets have been deleted.

Note: be careful with deleted Twitter data. Specifically be careful about how you publish it on the web. Users delete content for lots of reasons. Republishing deleted data on the web could be seen as a form of Doxing. I’m documenting this procedure for identifying deleted tweets because it can provide insight into how particularly toxic information is traveling on the web. Please use discretion in how you put this data to use.

So I started by building a dataset of #unitetheright data using twarc:

twarc search '#unitetheright' > tweets.json

I waited two days and then was able to gather some information about the tweets that were deleted. I was also interested in what content and websites people were linking to in their tweets because of the implications this has for web archives. Here are some basic stats about the dataset:

Deletes

So how do you get a sense of what has been deleted from your data? While it might make sense to write a program to do this eventually, I find it can be useful to work in a more exploratory way on the command line first, and then when I’ve got a good workflow I can put that into a program. I guess if I were a real data scientist I would be doing this in R or a Jupyter notebook at least. But I still enjoy working at the command line, so here are the steps I took to identify tweets that had been deleted from the original dataset:

First I extracted and sorted the tweet identifiers into a separate file using jq:

jq -r '.id_str' tweets.json | sort -n > ids.csv

Then I hydrated those ids with twarc. If the tweet has been deleted since it was first collected it cannot be hydrated:

twarc hydrate ids.csv > hydrated.json

I extracted these hydrated ids:

jq -r .id_str hydrated.json | sort -n > ids-hydrated.csv

Then I used diff to compare the pre- and post-hydration ids, and used a little bit of Perl to strip off the diff formatting, which results in a file of tweet ids that have been deleted.
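The embedded snippet is missing from this copy of the post, but the step looks roughly like this (ids that appear only in the first file are the deletes):

diff ids.csv ids-hydrated.csv | perl -ne 'print "$1\n" if /^< (\d+)$/' > ids-deleted.csv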

Since we have the data that was deleted we can now build a file of just deleted tweets. Maybe there’s a fancy way to do this on the command line but I found it easiest to write a little bit of Python to do it:
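That Python snippet isn’t preserved in this copy of the post. For what it’s worth, a jq equivalent of the step (a sketch using the tweets.json and ids-deleted.csv files created above, not the original script; --rawfile needs jq 1.6 or later):

# keep only tweets whose id_str appears in ids-deleted.csv
jq -c --rawfile ids ids-deleted.csv '($ids | split("\n")) as $deleted | select(.id_str as $id | $deleted | index($id))' tweets.json > delete.json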

After you run it you should have a file delete.json. You might want to convert it to CSV with something like twarc’s json2csv.py utility, to inspect it in a spreadsheet program.
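Assuming you have a clone of the twarc repository handy, that conversion looks something like:

python utils/json2csv.py delete.json > delete.csv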

Calling these tweets deleted is a bit of a misnomer. A user could have deleted their tweet, deleted their account, protected their account, or Twitter could have decided to suspend the user’s account. Also, the user could have done none of these things and simply retweeted a user who had done one of these things. Untangling what happened is left for another blog post. To be continued…

Post Custodial Logics

I love how Kelleher (2017) positions the (radical?) idea of funding the development of archival infrastructure where it is actually needed using such a logical appeal to the status quo:

One strategy that UTL employed in collaboration with project partners to address challenges of agency, differential access to resources, and the most direct application of benefit was very deliberate transactional use of project funding. Rather than assume transfer of documentation to UTL — either through donation or purchase — as required under the custodial paradigm, UTL instead helped to arrange and purchased negotiated access to documentation that remained in the custody or control of the partner organization. Project funds were put toward the arrangement, description, preservation, and digitization of documentation, just as they would have been if the archival materials were at UTL. But the investments were made not in Texas, but locally with the partner organizations. In this way, the partner organizations and in some cases communities were able to build infrastructure and skills in digitization, metadata, software development, and preservation appropriate to the context of their organizational goals and uses of the documentation. And in two cases at least, the human rights organization developed significant local expertise that served them well beyond their partnership with UTL. Additionally, rather than acquire the original records themselves — as called for under the custodial paradigm — UTL sometimes purchased digitized copies of documentation or gained non-exclusive access to documentation as they and partners made it available online. Though somewhat unusual for a custodial archival repository, this system was very familiar and comfortable for UTL as an academic library that annually spent hundreds of thousands of dollars for access to databases. Partner organizations, with funds earned in this manner, could and did hire and train, or otherwise provide direct humanitarian aid to individuals documented in the records, so at least some saw benefit from participation in the project.

Fri, 21 Jul 2017
https://inkdroid.org/2017/07/21/post-custodial/
Tags: archives

Assemblages of Appraisal

I had the opportunity to put together a poster for AERI this year. The poster presents a paper that I recently gave at CSCW (Summers & Punzalan, 2017). Creating it was a surprisingly useful process of distilling the paper to its essentials while re-presenting it visually. It occurred to me that the poster session audience and the typical web audience have something in common: limited attention. So I reworked the poster content here as a blog post to try to make my research findings a bit more accessible.

Even after more than 20 years of active web archiving, we know surprisingly little about how archivists appraise and select web content for preservation. Since we can’t keep it all, how we decide what to keep from the web is certain to shape the historical record (Cook, 2011). In this context, we ask the following research questions:

How are archivists deciding what to collect from the web?

How do technologies for web archiving figure in their appraisal decisions?

Are there opportunities to design more useful systems for the appraisal of content for web archives?

Methodology

To answer these questions I conducted a series of ethnographic interviews with 29 individuals involved in the selection of web content. Participants included web archivists as well as researchers, managers, local government employees, volunteers, social activists, and entrepreneurs. The field notes from these interviews were analyzed using inductive thematic analysis.

Analysis began with reading all the field notes together, followed by line-by-line coding. While the coding was done without reference to an explicit theoretical framework, it was guided by an interest in understanding archival appraisal as a sociotechnical and algorithmic system (Botticelli, 2000; Kitchin, 2016).

Findings

Coding and analysis surfaced six interconnected and interdependent themes that fell into two categories, the social and the technical, which were illustrated in the poster’s thematic diagram in green and yellow respectively.

Appraisal in the context of web archiving is a complex interplay between the following:

Crawl Modalities: The selection strategies designed into tools and chosen by archivists in their work: domains, websites, documents, topics, and events.

Time: How long to collect, how often to collect, how quickly web content needed to be gathered, perceptions of change in content.

Money: Grants from foundations and agencies to support collection activities, staffing, subscription fees, relationship between money and storage.

Conclusion

The findings highlighted sites of breakdown that are illustrated by the red lines in the thematic diagram. These breakdowns are examples of infrastructural inversion (Bowker & Star, 2000), or sites where the infrastructure of web archiving became legible.

Breakdowns between People and Tools were seen in the use of external applications such as email, spreadsheets and forms to provide missing communication features for documenting provenance and appraisal decisions.

Breakdowns in Money, Crawl Modalities and Information Structures occurred when archivists could not determine how much it would cost to archive a website, and attempted to estimate the size of websites.

Appraisal decisions depend on visualizations of the material archive.

While our chosen research methodology and findings do not suggest specific implications for design (Dourish & Bell, 2011), they do highlight rich sites for repair work as well as improvisational and participatory design (Jackson, 2014).

Acknowledgments

Thank you to Ricky Punzalan for much guidance during the planning and execution of the study. Leah Findlater and Jessica Vitak also helped in the selection of research methods. Nicholas Taylor, Jess Ogden and Samantha Abrams provided lots of useful feedback on early drafts, as well as pointers into the literature that were extremely helpful.

I also want to thank the Maryland Institute for Technology in the Humanities and the Documenting the Now project (funded by the Mellon Foundation) who provided generous support for this research. My most heartfelt thanks are reserved for the members of the web archiving community who shared their time, expertise and wisdom with me.

… we are concerned with the argument, implicit if not explicit in many discussions about the pitfalls of interdisciplinary investigation, that one primary measure of the strength of social or cultural investigation is the breadth of implications for design that result (Dourish, 2006). While we have both been involved in ethnographic work carried out for this explicit purpose, and continue to do so, we nonetheless feel that this is far from the only, or even the most significant, way for technological and social research practice to be combined. Just as from our perspective technological artifacts are not purely considered as “things you might want to use,” from their investigation we can learn more than simply “what kinds of things people want to use.” Instead, perhaps, we look to some of the questions that have preoccupied us throughout the book: Who do people want to be? What do they think they are doing? How do they think of themselves and others? Why do they do what they do? What does technology do for them? Why, when, and how are those things important? And what roles do and might technologies play in the social and cultural worlds in which they are embedded?

These investigations do not primarily supply ubicomp practitioners with system requirements, design guidelines, or road maps for future development. What they might provide instead are insights into the design process itself; a broader view of what digital technologies might do; an appreciation for the relevance of social, cultural, economic, historical, and political contexts as well as institutions for the fabric of everyday technological reality; a new set of conceptual resources to bring to bear within the design process; and a new set of questions to ask when thinking about technology and practice.

I’m very grateful to Jess Ogden for pointing me at this book by Dourish and Bell when I was recently bemoaning the fact that I struggled to find any concrete implications for design in Summers & Punzalan (2017).