Working in libraries, or doing research on how people use, find, and store information, we often come across large (or very large) data sets. These data sets can be a veritable goldmine of information about how people behave, what information they need, and what services we need to build and develop.
I’ve worked a lot with big data: I used the OCLC circulation data set to examine the existence of browsing. I’ve looked at what people search for from the library homepage, and whether that changes with a web-scale search. This work prompted a project using our zero-results searches to help our users by changing purchasing, search parameters, and web site structure, a clear practical benefit to users.
I’ve also used anonymous data to examine how people read, because one of the ebook providers that Swinburne uses provides information about access down to the page level. All the data I’ve used has been anonymous; while I can often track individual users, I know nothing about them other than what they have looked at–not their name, age, gender or any other identifying information. Even so, there is something that really bothers people about being watched reading in person; Catherine C. Marshall (the expert in this field) describes it as “creepy”. Is it any less creepy to watch people read online?
There are many public examples of data being used in ways that are more challenging than simply watching people read. The large data set collected by 23andMe has ruined families; there is the notorious Facebook mood experiment; and financial institutions have used big data to engage in discriminatory (and arguably illegal) practices for over 50 years. All of these things have been managed by commercial entities, however. Those entities don’t have an internal structure responsible for maintaining human ethics, and such a structure would fix it, right? Well, that is not strictly true. A recent media story comments on technology that was originally developed at Cambridge being commercialised and used to affect the outcome of the American elections.
Not only are ethics committees unable to control commercial uses of big data, they are also often simply not interested in the use of anonymous data. The impetus for this blog post was a lengthy discussion amongst RADAR grant holders about whether their respective institutions considered anonymous data something that needed to be dealt with in the ethics process; many institutions do not. This leaves the burden of behaving ethically with the individual researcher. While the vast majority of researchers are ethical, cautious and respectful, the exact location of the line between acceptable and unacceptable behaviour is likely to vary substantially between fields and individual researchers.
It’s not clear to me what the right answer is. There is plenty of anonymous data that presents no risk, even on the “creepy” factor, and it would be an ethical harm to waste the time of ethics committees and researchers on applying for permission to use it. There are some uses of big data that are clearly more challenging. And in the middle is a large grey area that we as researchers will need to map and account for in a dawning age of data collection and use.
By Dana McKay and George Buchanan