The Great Curve: Citation distributions

This post follows on from a previous post on citation distributions and the wrongness of Impact Factor.

Stephen Curry had previously made the call that journals should “show us the data” that underlie the much-maligned Journal Impact Factor (JIF). However, this call made me wonder what “showing us the data” would look like and how journals might do it.

What citation distribution should we look at? The JIF looks at citations in a year to articles published in the preceding 2 years. This captures a period in a paper’s life, but it misses “slow burner” papers and also underestimates the impact of papers that just keep generating citations long after publication. I wrote a quick bit of code that would look at a decade’s worth of papers at one journal to see what happened to them as yearly cohorts over that decade. I picked EMBO J to look at since they have actually published their own citation distribution, and also they appear willing to engage with more transparency around scientific publication. Note that, when they published their distribution, it considered citations to papers via a JIF-style window over 5 years.

I pulled 4082 papers with a publication date of 2004-2014 from Web of Science (the search was limited to Articles) along with data on citations that occurred per year. I generated histograms to look at distribution of citations for each year. Papers published in 2004 are in the top row, papers from 2014 are in the bottom row. The first histogram shows citations in the same year as publication, in the next column, the following year and so-on. Number of papers is on y and on x the number of citations. Sorry for the lack of labelling! My excuse is that my code made a plot with “subwindows”, which I’m not too familiar with.

allPlot

What is interesting is that the distribution changes over time:

  • In the year of publication, most papers are not cited at all, which is expected since there is a lag to publication of papers which can cite the work and also some papers do not come out until later in the year, meaning the likelihood of a citing paper coming out decreases as the year progresses.
  • The following year most papers are picking up citations: the distribution moves rightwards.
  • Over the next few years the distribution relaxes back leftwards as the citations die away.
  • The distributions are always skewed. Few papers get loads of citations, most get very few.

Although I truncated the x-axis at 40 citations, there are a handful of papers that are picking up >40 cites per year up to 10 years after publication – clearly these are very useful papers!

To summarise these distributions I generated the median (and the mean – I know, I know) number of citations for each publication year-citation year combination and made plots.

citedist

The mean is shown on the left and median on the right. The layout is the same as in the multi-histogram plot above.

Follow along a row and you can again see how the cohort of papers attracts citations, peaks and then dies away. You can also see that some years were better than others in terms of citations, 2004 and 2005 were good years, 2007 was not so good. It is very difficult, if not impossible, to judge how 2013 and 2014 papers will fare into the future.

What was the point of all this? Well, I think showing the citation data that underlie the JIF is a good start. However, citation data are more nuanced than the JIF allows for. So being able to choose how we look at the citations is important to understand how a journal performs. Having some kind of widget that allows one to select the year(s) of papers to look at and the year(s) that the citations came from would be perfect, but this is beyond me. Otherwise, journals would probably elect to show us a distribution for a golden year (like 2004 in this case), or pick a window for comparison that looked highly favourable.

Finally, I think journals are unlikely to provide this kind of analysis. They should, if only because it is a chance for a journal to show how it publishes many papers that are really useful to the community. Anyway, maybe they don’t have to… What this quick analysis shows is that it can be (fairly) easily harvested and displayed. We could crowdsource this analysis using standardised code.

Below is the code that I used – it’s a bit rough and would need some work before it could be used generally. It also uses a 2D filtering method that was posted on IgorExchange by John Weeks.
cdcode

The post title is taken from “The Great Curve” by Talking Heads from their classic LP Remain in Light.

Advertisements

6 responses

  1. Nice, interesting blog. Actually in line with your analysis also from previous posts I checked the author numbers across journals and correlated it with their impact factor.
    https://plotspotter.wordpress.com/2015/09/13/collaborations-in-science-across-the-world/
    Not surprising, the higher the IF of a journal the more authors/publication are involved.

    1. Thanks! I looked at this for just two journals (J Cell Biol and Cell) previously https://quantixed.wordpress.com/2014/07/02/all-this-and-more/
      Your plot is much more comprehensive.

  2. I vote for violin plots of the distributions shown in your second to last (10th) column. Thanks for sharing the code!

    1. Thanks Nikolai. Yes, violins would be very useful here.

  3. […] been calls for journals to publish the distribution of citations to the papers they publish (1 2 3). The idea is to turn the focus away from just one number – the Journal Impact Factor […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: