Rollercoaster: ups and downs of Google Scholar citations

In the UK there is an advertising disclaimer that “the value of your investments may go down as well as up.” Since papers are our main commodity in science and citations are something of a return, surely the “value” of a published paper only ever increases over time. Doesn’t it?

I think this is true when citations to a paper are tracked at a conventional database (Web of Science for example). Citations are added and very rarely taken away. With Google Scholar it is a different story. Now, I am a huge Google Scholar fan so this post is not a criticism of the service at all. One of the nice things about GS is that it counts citations from the “grey literature”, i.e. theses, patents etc. But not so grey as to include blogs and news articles (most of the time). So you get a broader view of the influence of a paper beyond the confines of a conventional database. With this broader view comes volatility, as I’ll show below.

I don’t obsessively check my own page every day – honestly I don’t(!) – but I did happen to check my own page twice within a short space of time and I noticed that my H-index went up by 1 and then decreased by 1. I’m pretty sure I didn’t imagine this and so I began to wonder how stable the citation data in Google Scholar actually is and whether I could track cites automatically.

What goes up (must come down)

Manually checking GS every day is beyond me, and what are computers for anyway? I set up a little routine to grab my data each day and look at the stability of citations (details of how to do this are below if you’re interested).

You can click on the plot to see it in its full glory.

Each line is a plot of citations to a paper over many weeks. The grey line is no citations gained or lost, relative to the start. As the paper accrues citations the line becomes more red and if it loses citations below the starting point it turns blue. They are ranked by the integral of change in citation over time.

The data are retrieved daily so if a paper gains citations and loses an equal number in less than 24 hours, this is not detected.

You can see from the plot that the number of citations to a paper can go down as well as up. For one paper, citations dropped significantly from one day to the next, which undid two month’s worth of increases. This paper is my highest cited work and dropped 10 cites from 443 to 433.

I’m guessing that running this routine on someone working in a field with a higher citation rate would show more volatility.

The increases in citations have an obvious cause but what about the decreases? My guess is that they are duplicate citations which are removed when they are added to a “cluster” (Google’s way of dealing with multiple URLs for the same paper). Another cause is probably something that is subsequently judged to not be a paper, e.g. a blog post, and getting removed.

Please please tell me now

The alert emails from Google Scholar have always puzzled me. I have alerts set up to tell me when my work is cited. I love getting them – who doesn’t want to see who has cited their work? Annoyingly they arrive infrequently and only ever contain one or two new papers. I looked at the frequency of changes in citation number and checked when I received emails from Google Scholar.

Over the same period as the plot above, you can see that citations to my profile happen pretty frequently. Again, if my work was cited at a higher rate, I guess this would be even more frequent. But in this period I only received six or so alert emails. I don’t think GS waits until a citation is stable for a while before emailing, because they tend to come immediately after an update. The alert emails remain a mystery to me. It would be great if they came a bit more often and it would be even better if they told you which paper(s) they cite!

Summary

Google Scholar is a wonderful service that finds an extra 20% or so of the impact of your work compared to other databases. With this extra information comes volatility and the numbers you see on there probably shouldn’t be treated as absolute.

Methods

To do this I used Christian Kreibich’s python script to retrieve information from Google Scholar. I wrote a little shell script to run the scholar.py and set up a daemon to do this everyday at the same time. I couldn’t find a way to search my UserID and so the search for my name brings up some unrelated papers that need to be filtered. There are restrictions on what you can retrieve, so my script retrieved papers within three different time frames to avoid hitting the limit for paper information retrieval.

The daemon is a plist in ~/Library/LaunchAgents/

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.quantixed.gscrape</string>
    <key>KeepAlive</key>
    <false/>
    <key>RunAtLoad</key>
    <false/>
    <key>Program</key>
    <string>/path/to/the/shell/script/gscrape.sh</string>
      <key>StartCalendarInterval</key>
      <dict>
          <key>Hour</key>
          <integer>14</integer>
          <key>Minute</key>
          <integer>30</integer>
      </dict>
</dict>
</plist>

And the shell script is something like

#!/bin/bash
cd /path/to/the/shell/script/
/usr/bin/pythonw '/path/to/your/scholar.py-master/scholar.py' -c 500 --author "Joe Bloggs" --after=1999 --before=2007 --csv > a.csv
/usr/bin/pythonw '/path/to/your/scholar.py-master/scholar.py' -c 500 --author "Joe Bloggs" --after=2008 --before=2012 --csv > b.csv
/usr/bin/pythonw '/path/to/your/scholar.py-master/scholar.py' -c 500 --author "Joe Bloggs" --after=2013 --csv > c.csv
OF=all_$(date +%Y%m%d).csv
cat a.csv b.csv c.csv > $OF

To crunch the data I wrote something in Igor which reads in the CSVs and plotted out my data. This meant first getting a list of clusterIDs which correspond to my papers in order to filter out other people’s work.

I have a surprising number of tracks in my library with Rollercoaster in the title. I will go with indie wannabe act Northern Uproar for the title of this post.

“What goes up (must come down)” is from Graham & Brown’s Super Fresco wallpaper ad from 1984.

“Please please tell me now” is a lyric from Duran Duran’s “Is There Something I Should Know?”.

Advertisements

One response

  1. Very interesting, and an important caveat given that social media metrics seem to be gradually permeating into scientific evaluation (a bad thing, I would say). The tip of an iceberg..?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: