bioRxiv, the preprint server for biology, recently turned 2 years old. This seems a good point to take a look at how bioRxiv has developed over this time and to discuss any concerns sceptical people may have about using the service.
Firstly, thanks to Richard Sever (@cshperspectives) for posting the data below. The first plot shows the number of new preprints deposited and the number that were revised, per month since bioRxiv opened in Nov 2013. There are now about 200 preprints being deposited per month and this number will continue to increase. The cumulative article count (of new preprints) shows that, as of the end of last month, there are >2500 preprints deposited at bioRxiv.
What is take up like across biology? To look at this, the number of articles in different subject categories can be totted up. Evolutionary Biology, Bioinformatics and Genomics/Genetics are the front-running disciplines. Obviously counting articles should be corrected for the size of these fields, but it’s clear that some large disciplines have not adopted preprinting in the same way. Cell biology, my own field, has some catching up to do. It’s likely that this reflects cultures within different fields. For example, genomics has a rich history of data deposition, sharing and openness. Other fields, less so…
So what are we waiting for?
I’d recommend that people wondering about preprinting go and read Stephen Curry’s post “just do it“. Any people who remain sceptical should keep reading…
Do I really want to deposit my best work on bioRxiv?
I’ve picked six preprints that were deposited in 2015. This selection demonstrates how important work is appearing first at bioRxiv and is being downloaded thousands of times before the papers appear in the pages of scientific journals.
- Accelerating scientific publishing in biology. A preprint about preprinting from Ron Vale, subsequently published in PNAS.
- Analysis of protein-coding genetic variation in 60,706 humans. A preprint summarising a huge effort from ExAC Exome Aggregation Consortium. 12,366 views, 4,534 downloads.
- TP53 copy number expansion correlates with the evolution of increased body size and an enhanced DNA damage response in elephants. This preprint was all over the news, e.g. Science.
- Sampling the conformational space of the catalytic subunit of human γ-secretase. CryoEM is the hottest technique in biology right now. Sjors Scheres’ group have been at the forefront of this revolution. This paper is now out in eLife.
- The genome of the tardigrade Hypsibius dujardini. The recent controversy over horizontal gene transfer in Tardigrades was rapidfire thanks to preprinting.
- CRISPR with independent transgenes is a safe and robust alternative to autonomous gene drives in basic research. This preprint concerning biosafety of CRISPR/Cas technology could be accessed immediately thanks to preprinting.
But many journals consider preprints to be previous publications!
Wrong. It is true that some journals have yet to change their policy, but the majority – including Nature, Cell and Science – are happy to consider manuscripts that have been preprinted. There are many examples of biology preprints that went on to be published in Nature (ancient genomes) and Science (hotspots in birds). If you are worried about whether the journal you want to submit your work to will allow preprinting, check this page first or the SHERPA/RoMEO resource. The journal “information to authors” page should have a statement about this, but you can always ask the Editor.
I’m going to get scooped
Preprints establish priority. It isn’t possible to be scooped if you deposit a preprint that is time-stamped showing that you were the first. The alternative is to send it to a journal where no record will exist that you submitted it if the paper is rejected, or sometimes even if they end up publishing it (see discussion here). Personally, I feel that the fear of scooping in science is overblown. In fields that are so hot that papers are coming out really fast the fear of scooping is high, everyone sees the work if its on bioRxiv or elsewhere – who was first is clear to all. Think of it this way: depositing a preprint at bioRxiv is just the same as giving a talk at a meeting. Preprints mean that there is a verifiable record available to everyone.
Preprints look ugly, I don’t want people to see my paper like that.
The depositor can format their preprint however they like! Check out Christophe Leterrier’s beautifully formatted preprint, or this one from Dennis Eckmeier. Both authors made their templates available so you can follow their example (1 and 2).
Yes but does -insert name of famous scientist- deposit preprints?
Lots of high profile scientists have already used bioRxiv. David Bartel, Ewan Birney, George Church, Ray Deshaies, Jennifer Doudna, Steve Henikoff, Rudy Jaenisch, Sophien Kamoun, Eric Karsenti, Maria Leptin, Rong Li, Andrew Murray, Pam Silver, Bruce Stillman, Leslie Vosshall and many more. Some sceptical people may find this argument compelling.
I know how publishing works now and I don’t want to disrupt the status quo
It’s paradoxical how science is all about pushing the frontiers, yet when it comes to publishing, scientists are incredibly conservative. Physics and Mathematics have been using preprinting as part of the standard route to publication for decades and so adoption by biology is nothing unusual and actually, we will simply be catching up. One vision for the future of scientific publishing is that we will deposit preprints and then journals will search out the best work from the server to highlight in their pages. The journals that will do this are called “overlay journals”. Sounds crazy? It’s already happening in Mathematics. Terry Tao, a Fields medal-winning mathematician recently deposited a solution to the Erdos discrepency problem on arXiv (he actually put them on his blog first). This was then “published” in Discrete Analysis, an overlay journal. Read about this here.
Disclaimer: other preprint services are available. F1000 Research, PeerJ Preprints and of course arXiv itself has quantitative biology section. My lab have deposited work at bioRxiv (1, 2 and 3) and I am an affiliate for the service, which means I check preprints before they go online.
Edit 14/12/15 07:13 put the scientists in alphabetical order. Added a part about scooping.
The post title comes from the term “white label” which is used for promotional vinyl copies of records ahead of their official release.
Our recent paper on “the mesh” in kinetochore fibres (K-fibres) of the mitotic spindle was our first adventure in 3D electron microscopy. This post is about some of the new data analysis challenges that were thrown up by this study. I promised a more technical post about this paper and here it is, better late than never.
In the paper we describe how over-expression of TACC3 causes the microtubules (MTs) in K-fibres to become “more wonky”. This was one of those observations that we could see by eye in the tomograms, but we needed a way to quantify it. And this meant coming up with a new spatial statistic.
After a few false starts*, we generated a method that I’ll describe here in the hope that the extra detail will be useful for other people interested in similar problems in cell biology.
The difficulty in this analysis comes from the fact that the fibres are randomly oriented, because of the way that the experiment is done. We section orthogonally to the spindle axis, but the fibre is rarely pointing exactly orthogonal to the tomogram. So the challenge is to reorient all the fibres to be able to pool numbers from across different fibres to derive any measurements. The IgorPro code to do this was made available with the paper. I have recently updated this code for a more streamlined workflow (available here).
We had two 3D point sets, one representing the position of each microtubule in the fibre at bottom of our tomogram and the other set is the position at the top. After creating individual MT waves from these point sets to work with, these waves could be plotted in 3D to have a look at them.
We need to normalise the fibres by getting them to all point in the same direction. We found that trying to pull out the average trajectory for the fibre didn’t work so well if there were lots of wonky MTs. So we came up with the following method:
- Calculate the total cartesian distance of all MT waves in an xy view, i.e. the sum of all projections of vectors on an xy plane.
- Rotate the fibre.
- Recalculate the total distance.
So we start off with this set of waves (Original). We rotate through 3D space and plot the total distance at each rotation to find the minimum, i.e. when most MTs are pointing straight at the viewer. This plot (Finding Minimum) is coloured so that hot colours are the smallest distance, it shows this calculation for a range of rotations in phi and theta. Once this minimum is found, the MT waves can be rotated by this value and the set is then normalised (you need to click on the pictures to see them properly).
Now we have all of the fibres that we imaged oriented in the same way, pointing to the zenith. This means we can look at angles relative to the z axis and derive statistics.
The next challenge was to make a measure of “wonkiness”. In other words, test how parallel the MTs are.
Violin plots of theta don’t really get across the wonkiness of the TACC3 overexpressed K-fibres (see figure above). To visualise this more clearly, each MT was turned into a vector starting at the origin and the point where the vector intersected with an xy plane set at an arbitrary distance in z (100 nm) was calculated. The scatter of these intersections demonstrates nicely how parallel the MTs are. If all MTs were perfectly parallel, they would all intersect at 0,0. In the control this is more-or-less true, with a bit of noise. In contrast, the TACC3-overexpressed group have much more scatter. What was nice is that the radial scatter was homogeneous, which showed that there was no bias in the acquisition of tomograms. The final touch was to generate a bivariate histogram which shows the scatter around 0,0 but it is normalised for the total number of points. Note that none of this possible without the first normalisation step.
The only thing that we didn’t have was a good term to describe what we were studying. “Wonkiness” didn’t sound very scientific and “parallelness” was also a bit strange. Parallelism is a word used in the humanities to describe analogies in art, film etc. However, it seemed the best term to describe the study of how parallel the MTs in a fibre are.
With a little help from my friends
The development of this method was borne out of discussions with Tom Honnor and Julia Brettschneider in the Statistics department in Warwick. The idea for the intersecting scatter plot came from Anne Straube in the office next door to me. They are all acknowledged in our paper for their input. A.G. at WaveMetrics helped me speed up my code by using MatrixOP and Euler’s rotation. His other suggestion of using PCA to do this would undoubtedly be faster, but I haven’t implemented this – yet. The bivariate histograms were made using JointHistogram() found here. JointHistogram now ships with Igor 7.
* as we describe in the paper
Several other strategies were explored to analyze deviations in trajectory versus the fiber axis. These were: examining the variance in trajectory angles, pairwise comparison of all MTs in the bundle, comparison to a reference MT that represented the fiber axis, using spherical rotation and rotating by an average value. These produced similar results, however, the one described here was the most robust and represents our best method for this kind of spatial statistical analysis.
The post title is taken from the Blondie LP “Parallel Lines”.
We have a new paper out! You can access it here.
Title of the paper: The mesh is a network of microtubule connectors that stabilizes individual kinetochore fibers of the mitotic spindle
What’s it about? When a cell divides, the two new cells need to get the right number of chromosomes. If this process goes wrong, it is a disaster which may lead to disease e.g. cancer. The cell shares the chromosomes using a “mitotic spindle”. This is a tiny machine made of microtubules and other proteins. We have found that the microtubules are held together by something called “the mesh”. This is a weblike structure which connects the microtubules and gives them structural support.
Does this have anything to do with cancer? Some human cancer cells have high levels of proteins called TACC3 and Aurora A kinase. We know that TACC3 is changed by Aurora A kinase. This changed form of TACC3 is part of the mesh. In our paper we mimic the cancer condition by increasing TACC3 levels. The mesh changes and the microtubules become wonky. This causes problems for dividing cells. It might be possible to target TACC3 using drugs to treat certain types of cancer, but this is a long way in the future.
Who did the work? Faye Nixon, a PhD student in the lab did most of the work. She used a method to look at mitotic spindles in 3D to study the mesh. My lab actually discovered the mesh by accident. A previous student, Dan Booth – back in 2011 – was looking at mitotic spindles to try and get 3D electron microscopy (tomography) working in the lab. Tomography works just like a CAT scan in a hospital, but on a much smaller scale. The mesh is found in the gaps between microtubules that are 25 nanometre wide (1 nanometre is 1 billionth of a metre), this is about 3,000 times smaller than a human hair, so it is very small! It was Dan who found the mesh and gave it the name. Other people in the lab did some really nice work which helped us to understand how the mesh works in dividing cells. Cristina Gutiérrez-Caballero did some experiments using a different type of microscope and Fiona Hood contributed some test tube experiments. Ian Prior at University of Liverpool, co-supervises Faye and helped with electron microscopy.
Have you discovered a new structure in cells? Yes and No. All cell biologists dream of finding a new structure in cells. It’s so unlikely though. Scientists have been looking at cells since the 17th Century and so the chances of seeing something that no-one has seen before are very small. In the 1970s, “inter-microtubule bridges” in the mitotic spindle were described using 2D electron microscopy. What we have done is to look at these structures in 3D for the first time and find that they are a network rather than individual connectors.
Nixon, F.M., Gutiérrez-Caballero, C., Hood, F.E., Booth, D.G., Prior, I.A. & Royle, S.J. (2015) The mesh is a network of microtubule connectors that stabilizes individual kinetochore fibers of the mitotic spindle eLife, doi: 10.7554/eLife.07635
This post is written in plain English to try to describe what is in the paper. I’m planning on writing a more technical post on some of the spatial statistics we developed as part of this paper.
The post title is from “Pull Together” a track from Shack’s H.M.S. Fable album.
My post on the strange data underlying the new impact factor for eLife was read by many people. Thanks for the interest and for the comments and discussion that followed. I thought I should follow up on some of the issues raised in the post.
- eLife received a 2013 Impact Factor despite only publishing 27 papers in the last three months of the census window. Other journals, such as Biology Open did not.
- There were spurious miscites to papers before eLife published any papers. I wondered whether this resulted in an early impact factor.
- The Web of Knowledge database has citations from articles in the past referring to future articles!
1. Why did eLife get an early Impact Factor? It turns out that there is something called a partial Impact Factor. This is where an early Impact Factor is awarded to some journals in special cases. This is described here in a post at Scholarly Kitchen. Cell Reports also got an early Impact Factor and Nature Methods got one a few years ago (thanks to Daniel Evanko for tweeting about Nature Methods’ partial Impact Factor). The explanation is that if a journal is publishing papers that are attracting large numbers of citations it gets fast-tracked for an Impact Factor.
2. In a comment, Rafael Santos pointed out that the miscites were “from a 2013 eLife paper to an inexistent 2010 eLife paper, and another miscite from a 2013 PLoS Computational Biology paper to an inexistent 2011 eLife paper”. The post at Scholarly Kitchen confirms that citations are not double-checked or cleaned up at all by Thomson-Reuters. It occurred to me that journals looking to game their Impact Factor could alter the year for citations to papers in their own journal in order to inflate their Impact Factor. But no serious journal would do that – or would they?
3. This is still unexplained. If anybody has any ideas (other than time travel) please leave a comment.
I noticed something strange about the 2013 Impact Factor data for eLife.
Before I get onto the problem. I feel I need to point out that I dislike Impact Factors and think that their influence on science is corrosive. I am a DORA signatory and I try to uphold those principles. I admit that, in the past, I used to check the new Impact Factors when they were released, but no longer. This year, when the 2013 Impact Factors came out I didn’t bother to log on to take a look. A chance Twitter conversation with Manuel Théry (@ManuelTHERY) and Christophe Leterrier (@christlet) was my first encounter with the new numbers.
Huh? eLife has an Impact Factor?
For those that don’t know, the 2013 Impact Factor is worked out by counting the total number of 2013 cites to articles in a given journal that were published in 2011 and 2012. This number is divided by the number of “citable items” in that journal in 2011 and 2012.
Now, eLife launched in October 2012. So it seems unfair that it gets an Impact Factor since it only published papers for 12.5% of the window under scrutiny. Is this normal?
I looked up the 2013 Impact Factor for Biology Open, a Company of Biologists journal that launched in January 2012* and… it doesn’t have one! So why does eLife get an Impact Factor but Biology Open doesn’t?**
Looking at the numbers for eLife revealed that there were 230 citations in 2013 to eLife papers in 2011 and 2012. One of which was a mis-citation to an article in 2011. This article does not exist (the next column shows that there were no articles in 2011). My guess is that Thomson Reuters view this as the journal existing for 2011 and 2012, and therefore deserving of an Impact Factor. Presumably there are no mis-cites in the Biology Open record and it will only get an Impact Factor next year. Doesn’t this call into question the veracity of the database? I have found other errors in records previously (see here). I also find it difficult to believe that no-one checked this particular record given the profile of eLife.
Perhaps unsurprisingly, I couldn’t track down the rogue citation. I did look at the cites to eLife articles from all years in Web of Science, the Thomson Reuters database (which again showed that eLife only started publishing in Oct 2012). As described before there are spurious citations in the database. Josh Kaplan’s eLife paper on UNC13/Tomosyn managed to rack up 5 citations in 2004, some 9 years before it was published (in 2013)! This was along with nine other papers that somehow managed to be cited in 2004 before they were published. It’s concerning enough that these data are used for hiring, firing and funding decisions, but if the data are incomplete or incorrect this is even worse.
Summary: I’m sure the Impact Factor of eLife will rise as soon as it has a full window for measurement. This would actually be 2016 when the 2015 Impact Factors are released. The journal has made it clear in past editorials (and here) that it is not interested in an Impact Factor and won’t promote one if it is awarded. So, this issue makes no difference to the journal. I guess the moral of the story is: don’t take the Impact Factor at face value. But then we all knew that already. Didn’t we?
* For clarity, I should declare that we have published papers in eLife and Biology Open this year.
** The only other reason I can think of is that eLife was listed on PubMed right away, while Biology Open had to wait. This caused some controversy at the time. I can’t see why a PubMed listing should affect Impact Factor. Anyhow, I noticed that Biology Open got listed in PubMed by October 2012, so in the end it is comparable to eLife.
Edit: There is an update to this post here.
Edit 2: This post is the most popular on Quantixed. A screenshot of visitors’ search engine queries (Nov 2014)…
The post title is taken from “Strange Things” from Big Black’s Atomizer LP released in 1986.