In late July, the National Endowment for the Arts quietly issued a request for proposals for a national arts data repository to serve as a clearing house for “high-quality datasets with arts variables.” Given my experience as a data geek, this news seemed like Christmas in July. I started daydreaming about accessing flawlessly documented datasets (perfect codebooks! detailed workflows and logbooks!) and having a dedicated repository where I could store and share all of the data from my own projects. Most of all, I thought about how this archive might transform our sector.

Many hopes sprang to mind. A one-stop shop for longitudinal and reliable datasets could change how arts researchers document, publish, share, verify, and re-use data, and have implications for policymakers, practitioners, and administrators. For one, we could deepen the current dialogue(s) on arts research and cultural vibrancy indicators to examine their utility and comprehensiveness. This in turn could help the sector develop better frameworks to define arts and culture and more transparent benchmarks for measuring the health and vitality of the arts sector. While several arts and cultural indicator projects like the Urban Institute’s Arts and Culture Indicators Project or ArtPlace’s Vibrancy Indicators utilize widely available datasets, crucial segments of the data and documentation—like the actual methodology used to derive cell phone activity for ArtPlace’s Vibrancy Indicators—are not available for wide consumption. A popular adage in the sciences is to “trust, but verify” research results. Should the arts be any different? If a critical mass of datasets and supporting documentation is included in the NEA’s repository and accessed for re-use, researchers might communicate more openly and make strides toward building a shared framework for how we understand the arts.

Still, as excited as I was, my experience with arts-related datasets has taught me that working with data archives is rarely as simple as we’d like it to be. In my past role as a data curation consultant, I researched and advised on how best to document, select, publish, and deposit research data for potential re-use on behalf of various stakeholders in the arts and humanities. As part of a digital humanities project called Visualizing Statues, I investigated research methods to develop a data management plan (DMP), a process that is now required for all NEA and NEH research grants. The questions that came up were manifold and pretty universal to the arts and humanities sector: how to properly cite datasets, for example, or how to migrate proprietary file formats so that others could access datasets without using soon-obsolete software. One of the biggest of these questions was what constituted arts research data. The Visualizing Statues project used the National Science Board (NSB)’s Long-Lived Digital Data Collections definition of data as “any information…including text, numbers, images, video or movies.” This definition is a standard among data curators. Its chief advantage is that it provides a springboard for researchers to consider data as something with a wide and changing scope rather than something static. On the other hand, the definition is so broad that it makes articulating the exact provenance of datasets difficult as they change hands from “raw” to normalized and cleaned. In Visualizing Statues, “the data” included everything from photographs to geocoordinates to ArcGIS code to epigraphic documents in TEI XML. Citing the different types and versions of data became a major issue when it came time to select a data archive, as some have policies that are unfriendly to such a heterogeneous mix of formats.

Despite concerns over version control and the wide scope of data in Visualizing Statues, the good news was that the project reflected a trend toward data sharing and access in the arts and humanities. Our researchers uniformly expressed willingness to share and deposit their data in an archive. We were also able to access others’ research data in conjunction with this project, though all of it came directly from the principal investigators’ colleagues. This is pretty typical; in general, access to arts-related datasets varies greatly. Anyone can search for and download files from wide-access or public datasets like the NEA’s Survey of Public Participation in the Arts and the Urban Institute’s National Center for Charitable Statistics, but to access datasets from smaller projects, which typically are not publicly funded and rely on third-party consultants, interpersonal connections are often required. Even having friends in high places won’t work for everyone, as some datasets are kept for internal or highly restricted use. Given these roadblocks, arts researchers often duplicate efforts, whether knowingly or not, using precious project resources. The ability to inventory and access current and past datasets will let us better understand the scope of research in the sector while encouraging collaboration between researchers and keeping past research data relevant.

Certain data archives and collection initiatives in the sector are already trying to address issues of access. Princeton’s Cultural Policy and the Arts National Data Archive (CPANDA), which has been around since 2001, seems an obvious choice for depositing and retrieving arts-related datasets. CPANDA describes itself as the “world’s first interactive digital archive of policy-relevant data on the arts and cultural policy in the United States.” However, its site is like a digital ghost town, with its most recent dataset dated 2011. One reason for CPANDA’s disuse may be the rigidly hierarchical format of its finding aids and directories. Researchers can only conduct a simple search of terms used in survey questions or browse by the title or subject of a study, which makes searches difficult unless the researcher knows exactly what he or she is looking for. Say, for example, that you are a researcher looking for past surveys on how Americans have spent their leisure time since 2008. You could type “American Time Use” in the search bar and retrieve a list of survey questions with the terms American, time, and use in them. Or, you could browse by title, assuming you have a specific study in mind. Otherwise, you’re out of luck.

The Cultural Data Project (CDP), another seemingly obvious choice, is not an archive per se as much as a longitudinal collection of participating members’ financial data. The CDP was first launched in 2004 and states that it is “the emerging national standard for data collection in the arts and cultural sector.” However, the CDP requires that data deposits use the CDP’s data collection instrument (the Data Profile), which collects narrowly but deeply on financial and programming data. Unfortunately, access to CDP’s rich datasets is neither public nor wide, and researchers not otherwise affiliated with CDP or its participating members may find their request for CDP data denied.

More recently, Southern Methodist University’s Meadows School of the Arts and Cox School of Business announced a collaboration with CDP and others on a National Center for Arts Research (NCAR) that “will analyze the largest database of arts research ever assembled.” This “hub for critical [arts research] data” may be more in line with the NEA’s RFP than CPANDA or CDP’s efforts, though whether it will provide wide or public access to its data is unclear. The Center’s focus will be “analysis, insights, and enablement” while data gathering will be on an ad-hoc basis. While NCAR released an introduction to its inaugural report in early December, it’s too soon to tell whether it will yield viable datasets for arts researchers. In the meantime, we have a clear and timely need for an arts data archive that utilizes relevant analysis tools, regularly maintains its datasets, is simple to navigate, and allows users to easily deposit and download data.

Fortunately, the NEA solicitation addresses some of these needs. It specifies that the NEA is seeking “a contractor who has an established data archiving infrastructure and the capacity to create, maintain, update, and expand an archive of datasets, metadata, and links to related literature.” This means that not only the reports, tables, figures, codebooks, and appendices affiliated with the data will be available to the public, but (huzzah!) the actual data as well. Very importantly, the NEA’s proposal states that researchers using the repository should have access to enough resources within it to be able to verify or replicate the data. The standard of replicability is virtually unheard of in arts research, in large part because of our reliance on qualitative research methods. Requiring it for datasets deposited in this archive would go far toward ameliorating some of the trust issues with arts research that then-NEA Chief of Staff Jamie Bennett referred to in an interview with Barry Hessenius earlier this year:

I think we generally have a research and data problem in the arts. Our data sets are often not as robust, and our research is not always seen as being as rigorous as other sectors.

In addition, the NEA proposal asks the consultant to do the following to ensure a successful deployment and implementation of the data archive:

Design and develop a website to serve as a discipline-specific archive that stores arts-related datasets, metadata, and references to literature (or citations) that use the datasets

Acquire and process art-related datasets, metadata, and links to literature currently housed at CPANDA

Update, maintain, and add to this new archive of arts-related data for a base period of one year

As of publication, the NEA has not yet announced the awardee of the data archiving contract, though a source there indicated the agency would “have something to say soon.” Although I am more skeptical now than I was in late July, I am cautiously optimistic that the creation of the arts data archive along with the NEA and NEH’s data management plan requirements will further data sharing and access in the sector. Although the growing pains are still evident, these measures should stimulate the arts community’s interest in data and collaborative research, and will hopefully thereby cultivate a better understanding of the nature of human creative endeavor. As stated by the American Council of Learned Societies Commission on Cyberinfrastructure for the Humanities and Social Sciences:

We have remarkable opportunities to bring new analytic and interpretive power to bear on the materials and the methods of the humanities and the social sciences; by so doing, we can advance our understanding of human cultures past, present, and future. In the process, however, [stakeholders] will also have to re-examine their own . . . culture, rethinking its outward forms, its established practices, and its apparent assumptions.

For the arts, this emerging path may be tumultuous, but indicates great promise.

Why would you leave unacknowledged the fact that cultural institutions hate (and for those who do not hate, dislike and disdain) the Cultural Data Project? They complain that it requires too much time and too much diversion of resources, and that, as this post implies, it yields too little useful information beyond certain specified queries and circumstances. At what point do you ask yourself when the need for data, which is all about you, outweighs the need for the most efficient allocation of time and resources, which is all about the arts groups from whom you need the data in the first place?

http://createquity.com Ian David Moss

At what point do you ask yourself when the need for data, which is all about you, outweighs the need for the most efficient allocation of time and resources, which is all about the arts groups from whom you need the data in the first place?

I’ve been asking that question a lot lately, Leonard, and I think it’s an important one. I actually have a post in the works about right-sizing data collection efforts to the task at hand, and I think it can be argued that our sector (among others) has been guilty at times of falling in love with the benefits of data without fully considering the costs. I’m sure that many on the proverbial front lines of the sector would agree.

http://www.culturaldata.org Chris Caltagirone

Yvonne Lee’s call for more open access to cultural data resources is both timely and important. The Cultural Data Project is proud to be a leading resource for researchers. Ms. Lee’s depiction of the CDP, however, is not accurate. The CDP captures a significant amount of operational and programmatic data, in addition to detailed financial data. Of greater concern is her statement that “access to CDP’s rich datasets is neither public nor wide, and researchers not otherwise affiliated with CDP or its participating members may find their request for CDP data denied.” This is a serious misstatement that belies the facts. Since January 1, 2010 the CDP has responded to more than 250 requests by providing data, with fewer than 5 denials during that same time period. Between August 2012 and August 2013 access to the data was granted more than 65 times. We invite readers to review our CDP in Research page (http://www.culturaldata.org/research/) to explore the diversity of projects undertaken across the field using CDP data.

Requests for CDP data come from researchers, funders, students completing dissertations and theses, and advocates using CDP to talk about the strengths of the cultural sector and the impact the arts are having in their communities. That said, we are aware that our process for approving data requests, which also involved review by state-based stakeholders, has been time consuming and challenging for some. We have recently convened a Research Advisory Committee composed of five highly regarded researchers in the field to assist us in evaluating our procedures and making recommendations for improvements. You can read about the committee and its work here: http://www.culturaldata.org/wp-content/uploads/cdp-research-advisory-committee.pdf. I would encourage anyone who is interested in using CDP data to contact me directly at ccaltagirone@culturaldata.org.

http://createquity.com Ian David Moss

Thanks for this additional context, Christopher, and especially for the statistics regarding access to CDP data, which I had not seen before. Regarding the accuracy of Yvonne’s statements, since there have indeed been cases when access has been denied in the past, it doesn’t seem to me that any part of the sentence you quoted is untrue and I don’t think that a correction is warranted. However, your larger point is well taken: readers should not leave this article with the mistaken impression that getting access to CDP data is especially difficult, and I know you guys are working hard to make it even easier.