A PhD student in my lab was keen to perform some RNA-sequencing (RNA-seq) analysis as part of their PhD thesis. Their project was centred on using cellular and animal models to examine the brain and behavioural effects of a mutation in a gene, the human homologue of which is implicated in neurodevelopmental disorders. They were aware of three published RNA-seq studies on this same gene mutation. So, rather than reinventing the wheel, they decided to perform our specific analyses on these existing datasets, since all three papers indicated that the raw data had been deposited in online sequencing databases, as is conventional for studies of this kind.
Study one: too busy to share?
The first study, published in the American Journal of Medical Genetics B, used a cultured cell model.
The link to the data was dead, so as the supervisor, I wrote to the corresponding author to ask where the data was. They initially replied very quickly, indicating that they could send us the data. However, after two weeks we had heard nothing more, so I contacted the corresponding author again. They replied that their bioinformatics people were very busy and that it would probably take a while to send us the data because of other demands on their time.
We never heard from them again. This paper has been cited 21 times.
Study two: data shared, but poor quality
The second study, published in Nucleic Acids Research, used a mouse model, performing extensive RNA-seq studies of the brain at various postnatal timepoints.
The data from this study was available on Gene Expression Omnibus (GEO). However, when my PhD student downloaded the data there were problems: it seemed highly truncated and nowhere near the depth indicated in the paper. We contacted the authors, who very promptly supplied my PhD student with the raw data directly. However, when processing the data it again became apparent that there was a large discrepancy between the actual quality of the data and the quality reported in the text of the paper. In particular, the read depth was much lower (an average of 30 million reads was reported, but the actual average was closer to 8 million) and the number of repeated sequences was much greater.
As a consequence, when my PhD student processed the data, much of it failed our normal quality control, and the data that scraped through could not realistically be regarded as reliable. This paper, which was published quite recently, has been cited 6 times and is still being cited.
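For readers who want to run a similar sanity check on a downloaded dataset, the read depth and repeated-sequence comparison described above can be sketched in a few lines of Python. This is only an illustration, not the quality control pipeline used in our lab: the function names (`fastq_summary`, `flag_shallow`) and the `tolerance` threshold are hypothetical, it assumes standard four-line FASTQ records, and it treats only exact sequence matches as duplicates. Dedicated tools such as FastQC do this far more thoroughly.

```python
import gzip
from collections import Counter
from pathlib import Path


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def fastq_summary(path):
    """Return (read_count, duplicate_fraction) for one FASTQ file.

    Assumes the standard four-line FASTQ record layout. A read is counted
    as a duplicate when its exact sequence was already seen in the file.
    All sequences are held in memory, so this suits modest files only.
    """
    seen = Counter()
    with open_fastq(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record is the sequence
                seen[line.strip()] += 1
    total = sum(seen.values())
    if total == 0:
        return 0, 0.0
    duplicates = total - len(seen)  # reads beyond the first copy of each sequence
    return total, duplicates / total


def flag_shallow(path, reported_reads, tolerance=0.5):
    """Compare the observed read count with the depth reported in a paper.

    `tolerance` is an arbitrary illustrative cut-off: the file is flagged
    when it holds fewer than tolerance * reported_reads reads.
    """
    total, duplicate_fraction = fastq_summary(path)
    return {
        "reads": total,
        "duplicate_fraction": duplicate_fraction,
        "below_reported": total < tolerance * reported_reads,
    }
```

Calling `flag_shallow("sample.fastq.gz", reported_reads=30_000_000)` on each downloaded file (the file name here is a placeholder) would show at a glance whether the deposited data is anywhere near the depth stated in the paper.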
Study three: success
On a happier note, the final study, published in Neuron and using a primary cell culture model, was a dream – the data was readily available, and was exactly as the authors outlined in their manuscript.
Consequently, the PhD student was able to re-analyse these data and complete her PhD thesis.