What is data interoperation “made of”?
27 July, 2016
Image by Pawel Kryj, freeimages.com
Reporting on a presentation I gave to the Council of Scientific Society Presidents (CSSP)
On May 9th, 2016, I was invited to speak as a representative of 4S to the meeting of the Council of Scientific Society Presidents (CSSP). Attending were many of the elected presidents of disciplinary bodies in the natural and engineering sciences. The topic at hand was "The Challenge of Managing Cyber-Data: The state of data integrity, data curation, and digital preservation."
I chose to present on the topic of "what is data interoperation made of?" This talk draws on the well-worn trick of the trade of opening the black box of a technical term, in this case to reveal the complex, sometimes embroiled, and largely forgotten practical and technical pathways that lead to the tidy ordered tables we call data. Below I say a little about the talk, and then more about its reception amongst these leading scientists in their fields.
To begin, I retold one of my favorite data collection stories: the occasion on which I participated in taking river-water temperatures in Baltimore County. As instructed by an ecological scientist, I waded into a river and positioned myself downstream from the thermometer to ensure my body heat did not affect the reading. Such practical and embodied standardizations are learned by all graduate students who do this work: everyday techniques that seek to ensure that longitudinal datasets are produced in the same way over time. I saw nods from those in ecologically allied fields across the audience.
Turning from the generation of data to the complex and winding pathways that data thereafter traverse, I told a tale from my more recent studies of HIV/AIDS research infrastructures. In projected slides, I traced the trajectories of HIV serostatus data collected in the early 1980s that had wound their way across generations of databases, cleaned and reformatted in various ways, before ending up, in 2012, alongside other such data collected from across North America. Together these were integrated to produce vast synoptic and "statistically powered" views of the trajectory of HIV disease across decades and demographics. Increasingly, such computationally driven data integration projects are being emphasized across all the sciences -- social, natural and engineering -- for their ability to find inventive and often dramatic new uses for these data.
Lastly, I recounted my core concern that these advantages of integration may be coming at the cost of abstracting away where those data came from: the rivers where they had been collected; the men and women who had donated their time to aid biomedical research; the technicians and graduate students who across the years had positioned their bodies downstream of instruments; and the multiple generations of cleaning, sharing, and harmonization those data had gone through before landing in the hands of the computational and information experts of interoperation. This is what I mean when I ask, "what is data integration made of?" I hope to remind us that interoperation rests upon the backgrounded work, sometimes decades' worth, of scientists, technicians, data managers and volunteers. Often the epistemic limits of these data also carry forward to integration projects: that is, forgetting the trajectories of data's generation does not equate with escaping the limits of their production.
To this part of the presentation I also saw nods across the audience, for these elder stateswomen and men of their diverse disciplines recognized that the worlds that had made up their graduate and early academic lives were in flux; that how they had once learned to collect and work with data was undergoing, in uneven stops and starts, great shifts. As Steve Jackson reminds us, transformations in the sociotechnical architectures of data generation and manipulation will be coupled with shifts in scientific vocations and identities.
Not so long ago, I would have expected a very different tone in such a room, one undergirded by an informational utopianism, for example:
"we are facing an explosion of data but storage is cheap, let's keep it all"
"data require annotation with metadata, let's find ways of automating that"
"interoperation is technologically challenging but only epistemically beneficial."
Instead, the room was quite receptive to my arguments, and already came equipped to hear me describe the challenges and tensions of data integration. After all, as the boosterism of Big Data that has touched virtually all fields has calmed, the leadership of the sciences is now faced with making the more practical and thorny financial, labor and regulatory commitments for the preservation, manipulation and interoperation of data in their fields.
Following my talk, attendees asked me other kinds of quite challenging questions. The tone in the room was perhaps close to angst. Paraphrasing now:
“If data may find uses well beyond the intent of those who generated them (or, with subject data, well beyond our current conception of informed consent), how should we plan for such uncertain and often ethically challenging futures for data?”
“If it is hard to know what needs to be preserved about the context for the production of data [or “metadata”, the who, what, how, and why of data production], how do we decide what metadata should be preserved along with data?”
“If each additional preservation of a datum must be accompanied by ongoing human labor and technical investment to sustain its meaning and utility, will that eat up the constrained time, money and resources of the sciences for their other meaningful tasks?”
These scientists were pained by the recognition that their fields would soon, if not already, need to make institutional-level commitments about which data should be kept and which should be thrown away or allowed to degrade. They were faced with deciding which specific datasets, amongst the troves in all disciplines, should receive the limited resources available for preservation and interoperation. And they were concerned that the potential advantages of data reuse would be coupled with the dangers of unpredictable future trajectories.
I came away from this meeting heartened. Here I had found only hints of once-prevalent archival naïvetés. In this sense it seems that STS and other scholars of data sharing, preservation and interoperation have been successful in spreading their understanding that these goals are hard, fraught and consequential, rather than easy, simple and a unilateral good. It appears that, within this group at least, technological solutionism has waned; that the arguments of those who have sought to articulate data's dangers and consequences are, at least, familiar; and that, in continuing my research and advocacy, I will perhaps no longer need to begin by explaining the basics of oncoming challenges for data preservation and interoperation, and can instead turn to the next set of challenges that follow. For to the questions the participants asked, I have as yet few answers to give…
Human Centered Design and Engineering (HCDE)
University of Washington