Guiding Questions for Constructing Digital Archives -- Matt Pitchford
Historians talk about the materials that make up archives as “traces” of the past. What is left to us are only partial impressions of history, created through a three-step process of selection that is fundamentally shaped by power. First, there are the selections made when the archival document is created about who gets to speak: who and what should be observed and recorded. Second, there are the selections made when the documents are collected, as the archivist decides which objects to accept, how they are labeled and described, and which of them are made accessible to scholars and to what degree. Finally, there are the selections the scholar makes in deciding which documents to use to build a narrative and make claims.
Because even informal archives undergo this process, every scholar takes part in it. All academics would therefore benefit from thinking like an archivist when engaging in research and argument. This is especially true for digital scholars, because their archives are arguably less structured than traditional archives, both because of the recency of their creation and because of the sheer number of possible artifacts available. These textual traces are still shaped by the power dynamics of the digital itself: who has access to the internet, who is able to speak and circulate in particular web cultures, and so on. But these vast collections of tweets and websites are rarely cataloged in any direct way. They require additional work from the interested scholar in their collection, labeling, and analysis, the kind of work that an archivist might do in a traditional archive.
Scholars in communication have already spent time thinking about how archives, as rhetorical constructs themselves, shape our work. Cara Finnegan has argued that “archives—even transparent image archives—function as terministic screens, simultaneously revealing and concealing ‘facts,’ at once enabling and constraining interpretation.” Charles Morris III, in turn, notes that archives are dynamic sites of “rhetorical power.” Many archives pre-exist rhetorical analysis—they are collected by government agencies, librarians, interested persons, activists, or other third parties. But what about scholars interested in digital artifacts (texts, images, sounds, websites, tweets, forum posts, etc.)? Must all digital scholars craft their own archives?
Not everyone interested in digital and networked rhetorics has to construct an archive of their own making. But scholars of digital rhetoric will often find themselves confronting questions of curation, selection, and collection that are fundamentally archival in nature. Although not all digital rhetoricians will need to construct formal archives, all of us must justify and make clear the choices we make in terms of 1) the principles of selection, 2) the tools and means of selection, and 3) the future access/publicity of the archive. These decisions should be explained inside the scholarly work itself, especially if the archive will be public later. Digital scholars will make selections that reveal and conceal facts, and our ability to produce quality and ethical interpretations rests on our ability to attend to the ways in which power shapes what we collect, how we label it, and what we make accessible. Whose voices do we choose to reveal? Whose do we choose to hide?
In digital spaces, where rhetorical critics often work to construct their own collection, corpus, or selection of artifacts for analysis, the boundary between scholar and archivist is further blurred. For example, Robert Glenn Howard’s work with vernacular rhetoric is both an argument based on the selection of an artifact (like the Sinner’s Prayer on amateur websites) and a rhetorical analysis of how it functions performatively in a community. Digital spaces also complicate the institutional differences between collecting artifacts for disciplinary research and collecting artifacts for libraries, archives, or museums (which are themselves terms “slippery at best”).
Critics construct texts appropriate for rhetorical criticism. And although texts have always been fragmented, the sheer scope of digital rhetorics makes this fragmentation more apparent and thornier. Social media is an obvious space where this happens: there are millions of individual texts authored about election coverage on Twitter or vaccination on Facebook. Visual artifacts circulate on Instagram or between Snapchat users. On a (relatively) smaller scale, forums, Reddit communities, blogs, and fan spaces all provide rich sets of texts for rhetorical analysis. Accordingly, rhetorical scholars have constructed archives on which to base their claims and analysis. For example, Laurie Gries’s method of iconographic tracking is a meticulous, and manual, process of building an archival database of images. This database becomes the means through which we can see how images become viral, circulate, and mutate in a variety of digital contexts.
For me, one major aspect that sets the construction of digital archives apart from others is the scope of the archives themselves. Certainly, the amount of original material, as discussed above, is enormous. But a digital archive can itself contain an incredible number of discrete texts. Even a single day’s worth of data from social media, if collected in one place, is an overwhelming amount of information. What, then, should be included in the archive? I suggest the following three guiding questions: what are the principles of selection, what are the means of selection, and what is the collection’s future publicity?
First, the principles of selection (what should or should not be included in the archive) are idiosyncratic to each project. They depend on the question that the project is attempting to answer. How do people on Twitter talk about politics? What arguments do people on Facebook deploy surrounding understandings of science and truth? How do images articulate power? Each of these questions invites different archival structures and components. As Jenny Rice and Jeff Rice have argued, the fluctuating space of meaning in digital networks means that scholars should think about information usage in a way that is not about permanence or completeness, but that focuses on users and temporality. But then what? Are these collections to be composed through the order offered by users (hashtags, particular keyword terms, communities) or through something broader than an individual’s own experience of the Internet (Twitter follower networks, statistical samples of larger collections)? This is not a neutral process: what is included, and so made visible and archivable, shapes the nature of knowledge and claim construction from the archive itself.
The second guiding question, the means of selection, asks how to pragmatically put together the archive, and it invites other sets of questions. What knowledges are required to collect the data? Some of this knowledge may be technical: knowing computer code to scrape data based on keywords, time, or other parameters. Other sets of knowledge may be social and cultural: having the ethnographic or autoethnographic tools to be able to find and interview others online. Other knowledge may blend into the realm of the social scientist (is this sample significant enough to establish claims?) or the archivist (what metadata is pertinent, and how (inter)reliable is that metadata?). Finally, and unfortunately, the vast majority of social media data is in some way proprietary. This means that gaining access to the data itself may have a monetary cost, or have other restrictions and gatekeepers.
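To make the technical side of this question concrete, here is a minimal sketch in Python of selecting posts by keyword and time window, and of recording the selection criteria alongside the result so the choice can be explained in the scholarly work itself. Everything in it is an assumption made for illustration: the `Post` record, the function names, and the three sample posts are invented, and no particular platform's API or data format is implied.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Post:
    """Hypothetical record for one collected post (not any platform's schema)."""
    post_id: str
    author: str
    text: str
    created_at: datetime


def select_posts(posts, keywords, start, end):
    """Keep posts created in [start, end) whose text contains any keyword.

    Matching is a case-insensitive substring test, which is itself a
    selection decision: it will also match longer words that contain a keyword.
    """
    lowered = [k.lower() for k in keywords]
    return [
        p for p in posts
        if start <= p.created_at < end
        and any(k in p.text.lower() for k in lowered)
    ]


def describe_selection(keywords, start, end, examined, kept):
    """Record the criteria used, so the selection can be reported and audited."""
    return {
        "keywords": sorted(keywords),
        "window_start": start.isoformat(),
        "window_end": end.isoformat(),
        "posts_examined": examined,
        "posts_kept": kept,
    }


# Invented sample data, for illustration only.
utc = timezone.utc
sample_posts = [
    Post("1", "user_a", "Voting lines are long #Election", datetime(2020, 11, 3, 9, 0, tzinfo=utc)),
    Post("2", "user_b", "Lunch was great", datetime(2020, 11, 3, 12, 0, tzinfo=utc)),
    Post("3", "user_c", "Election results tonight", datetime(2020, 11, 5, 20, 0, tzinfo=utc)),
]

window_start = datetime(2020, 11, 3, tzinfo=utc)
window_end = datetime(2020, 11, 4, tzinfo=utc)
selected = select_posts(sample_posts, ["election"], window_start, window_end)
record = describe_selection(["election"], window_start, window_end, len(sample_posts), len(selected))
```

Even in a sketch this small, the third post shows what a time window conceals: it matches the keyword but falls outside the window, so it never enters the archive. Keeping a record like `record` next to the data is one way to make that exclusion visible to later readers.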
And then finally, what happens to the archive next? Is it made public as a part of the research? Making data public is not always feasible, or ethical, depending on what is collected or how. Some data may require IRB approval, and sharing it may require additional steps from the scholar to anonymize it, or may simply be impossible. If open data is the goal, there may need to be sacrifices of completeness elsewhere in the project. Or perhaps the project is based on previously collected data: public datasets of social media data already exist freely online (for example, CrisisLex or pushshift.io), as do data repositories at different institutions that help protect data into the future. Will your data be public into the future, or strictly personal?
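As one illustration of the "additional steps" that sharing might involve, the sketch below replaces account handles with keyed pseudonyms before release, using Python's standard `hmac` and `hashlib` modules. This is an assumption-laden example, not a recommended procedure: the record layout and function names are hypothetical, and pseudonymizing handles is weaker than anonymization, since the verbatim text of a post can often still be searched and traced back to its author. Whether a step like this is sufficient is a question for the relevant IRB, not for the code.

```python
import hashlib
import hmac


def pseudonymize(handle: str, secret_key: bytes, length: int = 12) -> str:
    """Map a handle to a stable pseudonym using HMAC-SHA256.

    The same handle and key always yield the same pseudonym, so one author's
    posts stay linked together. The key must be kept private and never
    released with the data.
    """
    digest = hmac.new(secret_key, handle.lower().encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:length]


def prepare_for_release(records, secret_key):
    """Return copies of the records with the author field pseudonymized.

    Assumes each record is a dict with "author" and "text" keys
    (a hypothetical layout chosen for this example).
    """
    return [
        {"author": pseudonymize(r["author"], secret_key), "text": r["text"]}
        for r in records
    ]


# Invented sample data and key, for illustration only.
records = [
    {"author": "SomeHandle", "text": "First post"},
    {"author": "somehandle", "text": "Second post"},
    {"author": "another_handle", "text": "Unrelated post"},
]
released = prepare_for_release(records, secret_key=b"keep-this-private")
```

A keyed hash is used here rather than a plain one because anyone can hash a list of known handles and compare the results. With a private key, that comparison requires the key as well.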
These guiding questions are not new, and perhaps not even surprising. But thinking like an archivist is a useful heuristic for rhetoricians interested in digital scholarship.
 Becker, Carl. “Everyman His Own Historian.” The American Historical Review, Vol. 37, No. 2 (Jan., 1932), pp. 221-236.
 Burns, Kathryn. Into the Archive: Writing and Power in Colonial Peru (Durham, N.C.: Duke University Press, 2010).
 Cara A. Finnegan. "What Is This a Picture Of?: Some Thoughts on Images and Archives." Rhetoric & Public Affairs 9, no. 1 (2006): 116-123. https://muse.jhu.edu/ (accessed July 26, 2019), 117.
 Charles E. Morris. "The Archival Turn in Rhetorical Studies; Or, the Archive's Rhetorical (Re)turn." Rhetoric & Public Affairs 9, no. 1 (2006): 113-115. https://muse.jhu.edu/ (accessed July 26, 2019).
 Clement, Tanya, Wendy Hagenmaier, and Jennie Levine Knies. "Toward a notion of the archive of the future: impressions of practice by librarians, archivists, and digital humanities scholars." The Library Quarterly 83, no. 2 (2013): 112-130.
 Howard, Robert Glenn. “A Theory of Vernacular Rhetoric: The Case of the ‘Sinner’s Prayer’ Online,” Folklore 116 (2005): 172-188.
 Ibid. 113.
 McGee, Michael Calvin. "Text, context, and the fragmentation of contemporary culture." Western Journal of Communication (includes Communication Reports) 54, no. 3 (1990): 274-289.
 Gries, Laurie. Still Life with Rhetoric: A New Materialist Approach for Visual Rhetorics (Boulder, CO: University Press of Colorado, 2015).
 Rice, Jenny and Jeff Rice. “Pop Up Archives” in Rhetoric and the Digital Humanities. Ridolfo, Jim, and William Hart-Davidson, eds. (University of Chicago Press, 2015).