By: Maya Samet, Data Science Fellow with the Arctic Data Center and Erin McLean, Community Engagement and Outreach Coordinator with the Arctic Data Center
Journals, funding agencies, and researchers are increasingly acknowledging the importance of making data publicly available (for a general discussion of the landscape, see Open Data Metrics: Lighting the Fire). Benefits of open data practices include visibility of research, reproducibility of results, prevention of effort duplication, and the possibility of conducting new, innovative types of high-quality research with aggregate datasets. In such an open science landscape, data citation practices are crucial for giving data creators credit for their work. The Make Data Count initiative additionally encourages researchers to cite data in order to increase research discovery by driving traffic between data and articles, and to generate reliable open data metrics for use by all research stakeholders (Lowenberg et al. 2019). According to the Scholix interoperability initiative, the role of data repositories in this process should be to generate usage and citation metrics for the datasets they host, and share them with community "hubs" such as OpenAIRE, CrossRef, and DataCite (Cousijn et al. 2019). Following these recommendations from the larger data citation research community, the Arctic Data Center has taken multiple steps towards producing data citation information for all datasets in our collection, including a new feature enabling dataset owners to directly register citations to their datasets.
Supporting Data Citation at the Arctic Data Center
Using the scythe R package developed by our team, we regularly query journal publishers for citations that include the DOI of any Arctic Data Center dataset and register those connections as dataset citations. We’ve also conducted a programmatic text search for citation mentions over all of our dataset abstracts, since some researchers use the abstracts to refer to publications affiliated with their data.
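To give a feel for what a text search like this can look like, here is a minimal Python sketch. It is not the scythe package or our production code: the function names, the DOI pattern, and the sample dataset identifier are all assumptions made up for illustration. It scans each abstract for anything shaped like a DOI and keeps the matches as candidate publication links.

```python
import re

# Loose DOI pattern (an assumption, not an official grammar): "10.", a 4-9 digit
# registrant code, a slash, then any run of characters that is not whitespace,
# a double quote, or an angle bracket.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_dois(text):
    """Return the unique DOIs mentioned in text, in order of first appearance."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Drop sentence punctuation that trails the DOI, and lowercase it,
        # since DOIs are case-insensitive.
        doi = match.group(0).rstrip(".,;:)]").lower()
        if doi not in found:
            found.append(doi)
    return found


def citations_from_abstracts(abstracts):
    """Map each dataset identifier to the other DOIs its abstract mentions.

    `abstracts` is a hypothetical dict of {dataset DOI: abstract text}.
    """
    links = {}
    for dataset_id, abstract in abstracts.items():
        # Ignore an abstract's mention of its own dataset DOI.
        dois = [d for d in find_dois(abstract) if d != dataset_id.lower()]
        if dois:
            links[dataset_id] = dois
    return links
```

For example, given a made-up dataset identifier such as 10.5555/example-dataset whose abstract ends with "doi:10.1371/journal.pone.0092590.", citations_from_abstracts would link that dataset to 10.1371/journal.pone.0092590. A pattern this loose will miss publications that are mentioned only by title or author, so the candidates it finds still need review.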
Though we’ve made progress with these programmatic methods, tracking all dataset use in publications is very difficult to do programmatically, because in many cases data that are used in a publication are not formally cited. According to Belter (2014), oceanographic datasets were more often informally mentioned in the body of an article than formally cited in the Acknowledgments or Reference sections. Another study by Zhao et al. (2017) found that datasets used in science publications were cited only 6% of the time and referred to using their DOI 9% of the time, with the rest of the references using language that is less standardized, traceable, or permanently identifiable. Data use is difficult to track in this landscape, and we know formal data citations aren’t telling the full story of how often data are relied on in scientific publications.
Individual researchers and data owners can help us with this. That is why we recently implemented a “Register Citation” feature allowing researchers to register known citations to their datasets. Researchers may register a citation whenever they know that a publication uses or refers to a dataset, and the citation will be viewable on the dataset profile within 24 hours.
Moving forward
We plan to integrate our data citation systems with DataCite, which would make Arctic Data Center data citations available through CrossRef and DOI.org, two DOI registration systems connected with many major publishers worldwide that enable cross-publisher citation linking. We’re also looking to continue developing our programmatic search for citations with different text mining techniques that would identify citations in varied contexts, and to expand the pool of publications we search across (currently, we query SCOPUS, Elsevier, and PubMed for citations).
We hope that this information is helpful to you. Our goal with this initiative is to foster the growth and improvement of data citation practices in the Arctic science community. You are welcome to reach out to us at support [at] arcticdata.io with any feedback or questions about these new features.
About the Authors
Maya Samet is a Data Science Fellow at the NCEAS Arctic Data Center and a Teaching Assistant at the UC Berkeley Data Analytics Bootcamp. She holds a BS from UC Santa Barbara in Statistical Science and has experience applying this education to research in various fields, including people analytics, psychology, and informatics. Her fellowship at NCEAS ends in January 2021, and she will be seeking new opportunities starting then. Contact her via email (samet [at] nceas.ucsb.edu).
Erin McLean is the Community Engagement and Outreach Coordinator with the Arctic Data Center, headquartered at NCEAS in Santa Barbara. She holds a bachelor of arts from Boston University in marine science and English literature and a master of science from the University of Rhode Island in biological and environmental sciences. A scientist, educator, and writer, she has built her career on making science more accessible to all. Contact her via email (mclean [at] nceas.ucsb.edu).
Citations
Belter, C. W. 2014. Measuring the Value of Research Data: A Citation Analysis of Oceanographic Data Sets. PLoS ONE, 9(3). doi:10.1371/journal.pone.0092590
Cousijn, H., Feeney, P., Lowenberg, D., Presani, E., & Simons, N. 2019. Bringing Citations and Usage Metrics Together to Make Data Count. Data Science Journal, 18(1), 9. doi:10.5334/dsj-2019-009
Lowenberg, D., Chodacki, J., Fenner, M., Kemp, J., & Jones, M. B. 2019. Open Data Metrics: Lighting the Fire (Version 1). Zenodo, 32-34. doi:10.5281/zenodo.3525349
Zhao, M., Yan, E., & Li, K. 2017. Data set mentions and citations: A content analysis of full-text publications. Journal of the Association for Information Science and Technology, 69(1), 32-46. doi:10.1002/asi.23919
Originally published on the Arctic Data Center blog