By Eric Crandall

August 25, 2021

News from: Publications

Students and biodiversity scientists whose research was upended by the pandemic held an online datathon to attach dates and locations to the world’s largest genomic database.

Many people know that scientists measure the biodiversity of an ecosystem by estimating the total number of different species present. Biodiversity can be used as an indicator of the health of the ecosystem because many species can provide many different ecosystem services, including converting sunlight into sugar, forming soils, cycling nutrients, cleaning air and water, and pollinating plants. Having many species also builds redundancy into ecosystems, which can help to prevent collapses.

This species diversity depends on an equally important lower level of biodiversity that is often invisible to the naked eye: genetic diversity. Just as an ecosystem can be made up of thousands of species, every individual plant or animal has thousands of genes in its genome that help it to adapt and survive in its unique environment. These genes come in different flavors called alleles, which serve species much as species serve ecosystems. The more combinations of alleles a species has, the better its chances of adapting to environmental upheavals, such as changes in climate or the appearance of an invasive species. The number and frequency of alleles in a species are a measure of its genetic diversity and can be used as a measure of the species’ evolutionary potential.
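
For readers who want to see how "number and frequency of alleles" can become a single number, here is a minimal sketch in Python. It computes expected heterozygosity (one minus the sum of squared allele frequencies), which is one common summary of genetic diversity; the article does not say which statistic the study's authors use, so treat the choice as an assumption. The function names and the toy sample are hypothetical.

```python
from collections import Counter


def allele_frequencies(alleles):
    """Return the frequency of each allele observed at a single gene."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def expected_heterozygosity(alleles):
    """One common diversity measure: 1 minus the sum of squared frequencies.

    Assumption: this statistic is chosen for illustration only; it is not
    stated to be the measure used in the study described in the article.
    """
    freqs = allele_frequencies(alleles)
    return 1.0 - sum(p * p for p in freqs.values())


# Hypothetical toy sample: eight gene copies carrying three different alleles.
sample = ["A", "A", "a", "A", "a", "b", "A", "a"]
freqs = allele_frequencies(sample)
print("alleles found:", len(freqs))
print("frequencies:", freqs)
print("expected heterozygosity:", expected_heterozygosity(sample))
```

A sample in which every gene copy carries the same allele scores 0, and the score rises as more alleles are present at more even frequencies.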

The genetic diversity of a species can be measured by sequencing portions of the genome from a sample of individuals; scientists who do this are called population geneticists. When population geneticists sequence the DNA of an organism, they are expected to share that genetic sequence data in a global database called the Sequence Read Archive (SRA), which currently contains over 600 terabytes of sequence data from wild species of plants, animals, and fungi and is growing exponentially. To be informative for any measurement of genetic diversity, or for monitoring a species’ evolutionary health, these genetic sequence data should be accompanied by information about where they are from (what species, and what geographic location) and when the genetic sample was collected. However, as a new paper published in the Proceedings of the National Academy of Sciences describes, these genetic data have a problem that sounds small but is actually huge: they usually don’t include information about the time and place where the sequenced organism was sampled. “Only about 14% of these SRA datasets contained information about when and where they were sampled,” said Dr. Rachel Toczydlowski, a postdoctoral researcher at Michigan State University and lead author of the paper. Without being able to place a genetic sample in time and geographic space, other researchers can’t use the data to understand how well a species might be able to adapt to climate change.
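
To make the idea of "placing a sample in time and space" concrete, here is a minimal sketch in Python of how one might count the share of sample records that carry both a collection date and coordinates. The field names `collection_date` and `lat_lon`, the list of placeholder values, and the toy records are assumptions for illustration; this is not the procedure the study's authors used.

```python
# Assumption: placeholder strings that submitters sometimes enter instead of
# a real value. This list is illustrative, not exhaustive.
MISSING = {"", "missing", "not collected", "not applicable", "na", "n/a"}


def has_value(record, field):
    """True if the field is present and is not an empty or placeholder value."""
    value = record.get(field)
    return value is not None and str(value).strip().lower() not in MISSING


def is_placeable(record):
    """A record can be placed in time and space only if it has both fields."""
    return has_value(record, "collection_date") and has_value(record, "lat_lon")


def fraction_placeable(records):
    """Share of records that have both a collection date and coordinates."""
    if not records:
        return 0.0
    return sum(is_placeable(r) for r in records) / len(records)


# Hypothetical toy records; real sample metadata would be loaded from a file.
records = [
    {"sample": "s1", "collection_date": "2015-06-02", "lat_lon": "21.3 N 157.8 W"},
    {"sample": "s2", "collection_date": "missing", "lat_lon": "21.3 N 157.8 W"},
    {"sample": "s3", "collection_date": "2016-01-10"},
    {"sample": "s4", "collection_date": "", "lat_lon": "not collected"},
]
print(f"{fraction_placeable(records):.0%} of records have both date and location")
```

Placeholder text counts as missing here because a field that says "not collected" is no more useful to another researcher than an empty one.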

Last summer, with their own research plans in shambles due to the pandemic, a group of biology graduate students and researchers from across the United States, Australia, and New Zealand started working to address this problem via an online, remote “datathon”. Starting with a list of about 800 datasets for which samples were missing latitude and longitude, the students searched online for published scientific papers associated with those datasets. They then read through each paper to find the geographic location and collection date for each sample reported in the SRA. When students couldn’t find this information, they emailed the authors of the study. Yet even after reading through over 500 such papers, they were only able to find dates and locations for about 33% of the samples that they worked on.


Caption: Losing Nemo? A sample of the clownfish Amphiprion ocellaris, commonly known as “Nemo”. Genomic data from hundreds of thousands of samples like this one are missing important metadata that describe the time and place they were sampled (Credit: Eric Crandall).

“It was really frustrating as an evolutionary biologist who has generated genetic sequence data myself to discover how many genetic sequences lack basic spatio-temporal information. It took me many hundreds of hours to secure permits, travel to field locations, find the species I was looking for, collect and preserve samples, extract and sequence DNA from those samples back in the lab, and write and run code to process the resulting genetic sequence data. Now multiply that effort by thousands of research projects across the globe. We are losing terabytes of genetic data that could otherwise provide invaluable baselines for biodiversity monitoring and help us to answer fundamental questions in biology,” said Toczydlowski.


About the author:

Eric Crandall is the senior author of the study and an Assistant Research Professor of Biology at Pennsylvania State University. He co-led the datathon together with researchers from Michigan State University, Massey University, University of Central Florida, University of Queensland, and University of Hawaii. Members of this team previously helped to develop a tool called the Genomic Observatories Metadatabase (GEOME), which allows researchers to archive this metadata more easily alongside the genetic sequences. They are using GEOME to attach metadata that were found through the datathon to over 25,000 genetic samples stored in the SRA.

Eric Crandall: “If we want to protect global biodiversity, we can’t just think at the level of species; we also need to think about the genetic diversity within each species.”


Access the publication here.

Read the press release from Michigan State University here.

Click here for a tweet thread of the lead author, Rachel H. Toczydlowski.

Biology students make lemonade out of pandemic lemons by improving accessibility of genetic data from wild animal and plant species