Last week, ALM (article-level metric) data for PLoS journals were uploaded to Figshare with the invitation to do something cool with them.
Well, it would be rude not to. Actually, I’m one of the few scientists on the planet who hasn’t published a paper with Public Library of Science (PLoS), so I have no personal agenda here. However, I love what PLoS is doing and what it has achieved in disrupting the scientific publishing system. Anyway, what follows is not in any way comprehensive, but I was interested to look at a few specific things:
- Is there a relationship between Twitter mentions and views of papers?
- What is the fraction of views that are PDF vs HTML?
- Can citations be predicted by more immediate article level metrics?
The tl;dr version is 1. Yes. 2. ~20%. 3. Can’t say but looks unlikely.
1. Twitter mentions versus paper views
All PLoS journals are covered. The field containing paper views is (I think) “Counter”, which combines views of HTML and PDF (see #2). A plot of Counter against Publication Date for all PLoS papers (upper plot) shows that the number of papers published has increased dramatically since the introduction of PLoS ONE in 2007. There is a large variance in the number of views, which you’d expect, and the views tail off for the most recent papers, since they have had less time to accumulate views.
Below is the same plot where the size and colour of the markers reflect their Twitter score (see key). There’s a sharp line that must correspond to the date when Twitter data started to be logged as an ALM. There’s a scattering of mentions of older literature after this date, but one 2005 paper stands out: Ioannidis’s paper Why Most Published Research Findings Are False. It has a huge number of views and a large Twitter score, especially considering that it was a seven-year-old paper when they started recording the data. A pattern emerges in the post-logging period: papers with more views are mentioned more on Twitter. The larger, darker markers are higher on the y-axis. Mentioning a paper on Twitter is sure to generate views of the paper, at some (unknown) conversion rate. However, as this is a single snapshot, we don’t know whether Twitter mentions drive more downloads of papers, or whether more “interesting”/highly downloaded work is talked about more on Twitter.
2. Fraction of PDF vs HTML views
I asked a few people what they thought the download ratio is for papers. Most thought 60-75% as PDF versus 40-25% HTML. I thought it would be lower, but I was surprised to see that it is, at most, 20% for PDF. The plot below shows the fraction of PDF downloads (counter_pdf/(counter_pdf+counter_html)) for all PLoS journals, and then broken down for PLoS Biol and PLoS ONE.
This was a surprise to me. I have colleagues who don’t like depositing post-print or pre-print papers because they say that they prefer their work to be seen typeset in PDF format. However, this shows that, at least for PLoS journals, the reader is choosing to not see a typeset PDF at all, but a HTML version.
Maybe the PLoS PDFs are terribly formatted and 80% of people don’t like them. There is an interesting comparison that can be done here, because all papers are deposited at PubMed Central (PMC) and so the same plot can be generated for the PDF fraction there. The PMC PDF format is different to the PLoS one, so we can test the idea that people prefer HTML over PDF at PLoS because they don’t like the PLoS format.
The fraction of PDF downloads is higher at PMC, but only around 30%. So either the PMC format is just as bad, or this is the way that readers like to consume the scientific literature. A colleague mentioned that HTML views are preferable to PDF if you actually want to do something with the data, e.g. for meta-analysis. This could have an effect. HTML views could be skim reading, whereas PDF is for people who want to read in detail… I wonder whether these fractions are similar at other publishers, particularly closed access publishers?
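For anyone who wants to reproduce the PDF fraction used in this section, here is a minimal sketch in Python. It only implements the ratio counter_pdf/(counter_pdf+counter_html) given above. The example rows and their DOIs are made up for illustration, and I am assuming the CSV columns carry these names, so check the header of the file you download.

```python
from statistics import median


def pdf_fraction(counter_pdf, counter_html):
    """Return PDF views as a fraction of all views, or None if there are no views."""
    total = counter_pdf + counter_html
    if total == 0:
        return None  # avoid dividing by zero for papers with no recorded views
    return counter_pdf / total


# Hypothetical rows: these numbers are NOT taken from the real dataset.
rows = [
    {"doi": "10.0000/example.1", "counter_pdf": 200, "counter_html": 800},
    {"doi": "10.0000/example.2", "counter_pdf": 150, "counter_html": 850},
    {"doi": "10.0000/example.3", "counter_pdf": 0, "counter_html": 0},
]

# Skip papers with no views at all, then summarise the rest.
fractions = [
    f
    for f in (pdf_fraction(r["counter_pdf"], r["counter_html"]) for r in rows)
    if f is not None
]
median_fraction = median(fractions)
```

The same function works for the PMC comparison if it is fed the PMC PDF and HTML counts instead.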
3. Citation prediction?
ALMs are immediate whereas citations are slow. If we assume for a moment that citations are a definitive means to determine the impact of a paper (which they may not be), then can ALMs predict citations? This would make them very useful in the evaluation of scientists and their endeavours. Unfortunately, this dataset is not sufficient to answer this properly, but with multiple timepoints, the question could be investigated. I looked at the number of paper downloads and also the Mendeley score to see how these two things may foretell citations. What follows is a strategy to do this in an unbiased way with few confounders.
The dataset has a Scopus column, but for some reason these data are incomplete. It is possible to download data (but not on this scale AFAIK) for citations from Web of Science and then use the DOI to cross-reference to the other dataset. This plot shows the Scopus data as a function of “Total Citations” from Web of Science, for 500 papers. I went with the Web of Science data as this appears more robust.
The question is whether there is a relationship between downloads of a paper (Counter, either PDF or HTML) and citations, or between Mendeley score and citations. I figured that downloading, saving to Mendeley and citing show three progressive levels of “commitment” to a paper, and so downloads and Mendeley scores may correlate differently with citations. Now, to look at this for all PLoS journals for all time would be silly because we know that citations are field-specific, journal-specific, time-sensitive etc. So I took the following dataset from Web of Science: the top 500 most-cited papers in PLoS ONE for the period of 2007-2010 limited to “cell biology”. By cross-referencing on DOI I could check the corresponding values for Counter and for Mendeley.
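The cross-referencing step is just a join on DOI. I did it in IgorPro, but a rough Python equivalent might look like the sketch below. The field names (doi, total_citations, counter, mendeley) and the example rows are my own placeholders, not the real column headers from Web of Science or the Figshare CSV. The DOI clean-up is also an assumption about how the two sources might differ.

```python
def normalise_doi(doi):
    """Lower-case a DOI and strip common prefixes so both sources match."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def cross_reference(wos_rows, alm_rows):
    """Keep only papers present in both sources, matched on DOI."""
    alm_by_doi = {normalise_doi(r["doi"]): r for r in alm_rows}
    merged = []
    for w in wos_rows:
        key = normalise_doi(w["doi"])
        alm = alm_by_doi.get(key)
        if alm is None:
            continue  # no ALM record for this Web of Science entry
        merged.append({
            "doi": key,
            "citations": w["total_citations"],
            "counter": alm["counter"],
            "mendeley": alm["mendeley"],
        })
    return merged


# Hypothetical rows for illustration only.
wos_rows = [
    {"doi": "doi:10.0000/EXAMPLE.1", "total_citations": 78},
    {"doi": "10.0000/example.9", "total_citations": 12},
]
alm_rows = [
    {"doi": "10.0000/example.1", "counter": 143952, "mendeley": 40},
    {"doi": "10.0000/example.2", "counter": 3000, "mendeley": 5},
]
merged = cross_reference(wos_rows, alm_rows)
```

Papers found in only one of the two sources are dropped, which is worth keeping an eye on: if many DOIs fail to match, the clean-up rules probably need adjusting.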
I was surprised that the correlation was very weak in both cases. I thought that the correlation would be stronger with Mendeley; however, signal-to-noise is a problem here, with few users of the service compared with counting downloads. Below each plot is a ranked view of the papers, with the Counter or Mendeley data presented as a rolling average. It’s a very weak correlation at best. Remember that this is post-hoc. Papers that have been cited more would be expected to generate more views and higher Mendeley scores, but this is not necessarily so. Predicting future citations based on Counter or Mendeley will be tough. To really know if this is possible, this approach needs to be used with multiple ALM timepoints to see if there is a predictive value for ALMs, but based on this single timepoint, it doesn’t seem as though prediction will be possible.
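If you want to put a number on “very weak”, one option is a rank correlation plus the rolling average over the ranked papers. The sketch below uses Spearman’s rank correlation, which is my choice for illustration and not necessarily what was used for the plots above. It also includes a simple moving mean, where the window size is a free parameter. The function names are my own, and everything is plain Python with no extra packages.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x ** 0.5 * var_y ** 0.5)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rolling_mean(values, window):
    """Moving average over consecutive windows of the given size."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

To get the ranked view, sort the papers by citations, pull out the Counter (or Mendeley) values in that order, and pass them to rolling_mean. A value of spearman(citations, counter) near zero would match the weak relationship described above.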
Again, looking at this for a closed access journal would be very interesting. The most-downloaded paper in this set had far more views (143,952) than other papers cited a similar number of times (78). The paper was this one, which I guess is of interest to bodybuilders! Presumably, it was heavily downloaded by people who are not in a position to cite the paper. Although these downloads didn’t result in extra citations, this paper has undeniable impact outside of academia. Because PLoS is open access, the bodybuilders were able to access the paper, rather than being met by a paywall. Think of the patients who are trying to find out more about their condition and can’t read any of the papers… The final point here is that ALMs have their own merit, irrespective of citations, which are the default metric for judging the impact of our work.
Methods: To crunch the numbers for yourself, head over to Figshare and download the CSV. A Web of Science subscription is needed for the citation data. All the plots were generated in IgorPro, but no programming is required for these comparisons and everything I’ve done here can easily be done in Excel or another package.
Edit: Matt Hodgkinson (@mattjhodgkinson) Snr Ed at PLoS ONE told me via Twitter that all ALM data (periodically updated) are freely available here. This means that some of the analyses I wrote about are possible.
The post title comes from Six Plus One a track on Dad Man Cat by Corduroy. Plus is as close to PLoS as I could find in my iTunes library.