
Adventures in Code III: the quantixed ImageJ Update site

We have some macros for ImageJ/FIJI for making figures and blind analysis which could be useful to others.

I made an ImageJ Update Site so that the latest versions can be pushed out to the people in the lab, but this also gives the opportunity to share our code with the world. Feel free to add the quantixed ImageJ update site to your ImageJ or FIJI installation. Details of how to do that are here.

The code is maintained in this GitHub repo, which has a walkthrough for figure-making in the README. So, if you’d like to make figures the quantixed way, adding ROIs and zooms, then feel free to give this code a try. Please raise any issues there or get in touch some other way.

Disclaimer: this code is under development. I offer no guarantees as to its usefulness. I am not responsible for data loss or injury that may result from its use!

Update @ 10:35 2016-12-20: I should point out that other projects already exist for making figures (MagicMontage, FigureJ, ScientiFig). These projects are fine but they didn't do what I wanted, so I made my own.

The Digital Cell: Getting started with IgorPRO

This post follows on from “Getting Started“.

In the lab we use IgorPRO for pretty much everything. We have many analysis routines that run in Igor, we have scripts for processing microscope metadata etc, and we use it for generating all figures for our papers. Even so, people in the lab engage with it to varying extents. The main battle is that the use of Excel is pretty ubiquitous.

I am currently working on getting more people in the lab started with using Igor. I’ve found that everyone is keen to learn. The approach so far has been workshops to go through the basics. This post accompanies the first workshop, which is coupled to the first few pages of the Manual. If you’re interested in using Igor read on… otherwise you can skip to the part where I explain why I don’t want people in the lab to use Excel.

IgorPro is very powerful and the learning curve is steep, but the investment is worth it.

These are some of the things that Igor can do: publication-quality graphics, high-speed data display, the ability to handle large data sets, curve fitting, Fourier transforms, smoothing, statistics and other data analysis, waveform arithmetic, matrix math, image display and processing, a combined graphical and command-line user interface, automation and data processing via a built-in programming environment, and extensibility through modules written in C and C++. You can even play games in it!

The basics

The first thing to learn is about the objects in the Igor environment and how they work. There are four basic objects that all Igor users will encounter straight away.

  • Waves
  • Graphs
  • Tables
  • Layouts

All data is stored as waveforms (or waves for short). Waves can be displayed in graphs or tables. Graphs and tables can be placed in a Layout. This is basically how you make a figure.

The next things to check out are the command window (which displays the history), the data browser and the procedure window.
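
To make this concrete, here is a minimal sketch (the names are arbitrary) that you can paste into a procedure window and then call by typing BasicObjectsDemo() in the command window:

Function BasicObjectsDemo()
	// make a wave of 100 points containing a noisy sine wave
	Make/O/N=100 myWave = sin(p / 10) + gnoise(0.1)
	// display the wave in a graph and in a table
	Display/N=myGraph myWave
	Edit/N=myTable myWave
	// place the graph and the table in a layout - the starting point for a figure
	NewLayout/N=myLayout
	AppendLayoutObject/W=myLayout graph myGraph
	AppendLayoutObject/W=myLayout table myTable
End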

Essential IgorPro

  • Tables are not spreadsheets! Most important thing to understand. Tables are just a way of displaying a wave. They may look like a spreadsheet, but they are not.
  • Igor is case insensitive.
  • Spaces. Igor can handle spaces in the names of objects, but IMO they are best avoided.
  • Igor is 0-based, not 1-based (see the quick example after this list)
  • Logical naming and logical thought – beginners struggle with this and it’s difficult to get this right when you are working on a project, but consistent naming of objects makes life easier.
  • Programming versus not programming – you can get a long way without programming but at some point it will be necessary and it will save you a lot of time.
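
A quick way to see the indexing and case-insensitivity points for yourself is to type the following into the command window (the wave name is arbitrary):

Make/O/N=5 testWave = p * 10	// p is the point index, so testWave holds 0, 10, 20, 30, 40
Print testWave[0]	// prints 0 - the first point is point 0, not point 1
Print TESTWAVE[4]	// prints 40 - Igor doesn't care about the case of the name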

Pretty soon, you will go beyond the four basic objects and encounter other things. These include: Numeric and string variables, Data folders, Notebooks, Control panels, 3D plots – a.k.a. gizmo, Procedures.

Getting started guide

Why don’t we use Excel?

  • Excel can’t make high quality graphics for publication.
    • We do that in Igor.
    • So any effort in Excel is a waste of time.
  • Excel is error-prone.
    • Too easy for mistakes to be introduced.
    • Not auditable. Tough/impossible to find mistakes.
    • Igor has a history window that allows us to see what has happened.
  • Most people don’t know how to use it properly.
  • Not good for biological data – Transcription factor Oct4 gets converted to a date.
  • Limited to 1048576 rows and 16384 columns.
  • Related: useful link describing some spreadsheet crimes of data entry.

But we do use Excel a lot

  • Excel is useful for quick calculations and for preparing simple charts to show at lab meeting.
  • In the same way, PowerPoint is OK for rough figures for lab meeting.
  • But neither are publication-quality.
  • We do use Excel for Tracking Tables, Databases(!) etc.

The transition is tough, but worth it

Writing formulae in Excel is straightforward, and the first thing you will find is that to achieve the same thing in Igor is more complicated. For example, working out the mean for each row in an array (A1:Y20) in Excel would mean typing =AVERAGE(A1:Y1) in cell Z1 and copying this cell down to Z20. Done. In Igor there are several ways to do this, which itself can be unnerving. One way is to use the Waves Average panel. You need to know how this works to get it to do what you want.
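
Another route, once you are comfortable with the command line, is a MatrixOp one-liner. A minimal sketch, assuming the 20 x 25 array lives in a 2D wave called dataMat (a made-up name):

MatrixOp/O rowMeans = sumRows(dataMat) / numCols(dataMat)	// row-by-row mean, i.e. =AVERAGE(A1:Y1) copied down the column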

But before you turn back, thinking I’ll just do this in Excel and then import it… imagine you now want to subtract a baseline value from the data, scale it and then average. Imagine that your data are sampled at different intervals. How would you do that? Dealing with those simple cases in Excel is difficult-to-impossible. In Igor, it’s straightforward.
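
To give a flavour of why, here is a hedged sketch of that sort of processing. The wave names and numbers are invented; the key point is that the wave(x) syntax uses each wave's x-scaling, so traces sampled at different intervals can be combined directly:

Function ProcessTraces()
	WAVE trace0, trace1	// two hypothetical traces with different x-scaling
	// subtract a baseline (mean of the first 10 s of each trace), then scale
	trace0 -= mean(trace0, 0, 10)
	trace1 -= mean(trace1, 0, 10)
	trace0 *= 0.5	// arbitrary scale factor
	trace1 *= 0.5
	// average the two traces onto a common time base, 0-60 s in 0.1 s steps
	Make/O/N=601 avgTrace
	SetScale/P x 0, 0.1, "s", avgTrace
	avgTrace = (trace0(x) + trace1(x)) / 2	// wave(x) interpolates between points
End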

Resources for learning more Igor:

  • Igor Help – fantastic resource containing the manual and more. Access via Help or by typing ShowHelpTopic “thing I want to search for”.
  • Igor Manual – This PDF is available online or in Applications/Igor Pro/Manual. This used to be distributed as a hard copy… it is now ~3000 pages.
  • Guided Tour of IgorPro – this is a great way to start and will form the basis of the workshops.
  • Demos – Igor comes packed with Demos for most things from simple to advanced applications.
  • IgorExchange – Lots of code snippets and a forum to ask for advice or search for past answers.
  • Igor Tips – I’ve honestly never used these, but you can turn on tips in Igor, which reveal help on mouseover.
  • Igor mailing list – topics discussed here are pretty advanced.
  • Introduction to IgorPRO from Payam Minoofar is good. A faster start to learning to program than reading the manual.
  • Hands-on experience!

Part of a series on the future of cell biology in quantitative terms.

The Digital Cell: Getting Started

More on the theme of “The Digital Cell“: using quantitative, computational approaches in cell biology.

So you want to get started? Well, the short version of this post is:

Find something that you need to automate and get going!

Programming

I make no claim to be a computer wizard. My first taste of programming was the same as that of anyone who went to school in the UK in the 1980s: BBC BASIC. Although my programming only went as far as copying a few examples from the book, this experience definitely reduced the “fear of the command line”. My next encounter with coding was to learn HTML when I was an undergraduate. It was not until I was a postdoc that I realised that I needed to write scripts in order to get computers to do what I wanted them to do for my research.

Image analysis

I work in cell biology. My work involves a lot of microscopy. From the start, I used computer-based methods to quantify images. My first paper mentions quantifying images, but it wasn’t until I was a PhD student that I first used NIH Image (as it was called then) to extract quantitative information from confocal micrographs. I was also introduced to IgorPRO (version 3!) as a PhD student, but did no programming. That came later. As a postdoc, we used Scanalytics’ IPLab and Igor (as well as a bit of ImageJ, as NIH Image had by then become). IPLab had an easy scripting language and it was in this program that I learned to write macros for analysis. At this time there were people in the lab who were writing software in IgorPro and MATLAB. While I didn’t pick up programming in IgorPRO or MATLAB then, it made me realise what was possible.

When I started my own group I discovered that IPLab had been acquired by BD Biosciences and then stripped out. I had hundreds of useless scripts and needed a new solution. ImageJ had improved enormously by this time and so this became our default image analysis program. The first data analysis package I bought was IgorPro (version 6) and I have stuck with it since then. In a future post, I will probably return to whether or not this was a good path.

Getting started with programming

Around 2009, I was still unable to program properly. I needed a macro for baseline subtraction – something really simple – and realised I didn’t know how to do it. We didn’t have just one or two traces to modify, we had hundreds. This was simply not possible by hand. It was this situation that made me realise I needed to learn to program.

…having a concrete problem that is impossible to crack any other way is the best motivator for learning to program.

This might seem obvious, but having a concrete problem that is impossible to crack any other way is the best motivator for learning to program. I know many people who have decided they “want to learn to code” or they are “going to learn to use R”. This approach rarely works. Sitting down and learning this stuff without sufficient motivation is really tough. So I would advise someone wanting to learn programming to find something that needs automation and just get going. Just get something to work!

Don’t worry (initially) about any of the following:

  • What program/language to use – as long as it is possible, just pick something and do it
  • If your code is ugly or embarrassing to show to an expert – as long as it runs, it doesn’t matter
  • About copy-and-pasting from examples – it’s OK as long as you take time to understand what you are doing; this is a quick way to make progress. Resources such as stackoverflow are excellent for this
  • Bugs – you can squish them, they will frustrate you, but you might need some…
  • Help – ask for help. Online forums are great, experts love showing off their knowledge. If you have local expertise, even better!

Once you have written something (and it works)… congratulations, you are a computer programmer!

Seriously, that is all there is to it. OK, it’s a long way to being a good programmer or even a competent one, but you have made a start. Like Obi-Wan Kenobi says: you’ve taken your first step into a larger world.

So how do you get started with an environment like IgorPro? This will be the topic for next time.

Part of a series on the future of cell biology in quantitative terms.

The Digital Cell: Workflow

The future of cell biology, even for small labs, is quantitative and computational. What does this mean and what should it look like?

My group is not there yet, but in this post I’ll describe where we are heading. The graphic below shows my current view of the ideal workflow for my lab.

Workflow

The graphic is pretty self-explanatory, but to walk you through:

  • A lab member sets up a microscopy experiment. We have standardised procedures/protocols in a lab manual and systems are in place so that reagents are catalogued to minimise error.
  • Data goes straight from the microscope to the server (and is backed up). Images and metadata are held in a database and object identifiers are used for referencing in electronic lab notebooks (and for auditing).
  • Analysis of the data happens with varying degrees of human intervention. The outputs of all analyses are processed automatically. Code for doing these steps is under version control using git (GitHub).
  • Post-analysis, the processed outputs contain markers for QC and error checking. We can also trace back to the original data and check the analysis. Development of code happens here too, speeding up slow procedures via “software engineering”.
  • Figures are generated using scripts which are linked to the original data with an auditable record of any modification to the image.
  • Project management, particularly of paper writing, is via Trello. Writing papers is done using collaborative tools. Everything is synchronised to enable working from any location.
  • This is just an overview and some details are missing, e.g. backup of analyses is done locally and via the server.

Just to reiterate: my team is not at this point yet, but we are reasonably close. We have not yet implemented three of these things properly in my group, but in our latest project (via collaboration) the workflow has worked as described above.

The output is a manuscript! In the future I can see that publication of a paper as a condensed report will give way to making the data, scripts and analysis available, together with a written summary. This workflow is designed to allow this to happen easily, but this is the topic for another post.

Part of a series on the future of cell biology in quantitative terms.

The Digital Cell

If you are a cell biologist, you will have noticed the change in emphasis in our field.

At one time, cell biology papers were – in the main – qualitative. Micrographs of “representative cells”, western blots of a “typical experiment”… This descriptive style gave way to more quantitative approaches, converting observations into numbers that could be objectively assessed. More recently, as technology advanced, computing power increased and data sets became more complex, we have seen larger scale analysis, modelling, and automation begin to take centre stage.

This change in emphasis encompasses several areas including (in no particular order):

  • Statistical analysis
  • Image analysis
  • Programming
  • Automation allowing analysis at scale
  • Reproducibility
  • Version control
  • Data storage, archiving and accessing large datasets
  • Electronic lab notebooks
  • Computer vision and machine learning
  • Prospective and retrospective modelling
  • Mathematics and physics

The application of these approaches is not new to biology and they have been worked on extensively for years in certain fields, perhaps most obviously by groups that identified themselves as “systems biologists” or “computational biologists”, and by people working on large-scale cell biology projects. My feeling is that these methods have now permeated mainstream (read: small-scale) cell biology to such an extent that any groups that want to do cell biology in the future have to adapt in order to survive. It will change the skills that we look for when recruiting and it will shape the cell biologists of the future. Other fields such as biophysics and neuroscience are further through this change, while others have yet to begin. It is an exciting time to be a biologist.

I’m planning to post occasionally about the way that our cell biology research group is working on these issues: our solutions and our problems.

Part of a series on the future of cell biology in quantitative terms.

Adventures in code

An occasional series in esoteric programming issues.

As part of a larger analysis project I needed to implement a short program to determine the closest distance between two line segments in 3D space. This will be used to sort out which segments to compare… like I say, part of a bigger project. The best method is to find the closest point of one segment to the other, with the other segment treated as an infinite line. You can then check whether that point lies beyond the end of the segment; if it does, you use the limits of the segment to calculate the distance. There’s a discussion on stackoverflow here. The solutions point to one in C++ and one in MATLAB. The C++ version is easiest to port to Igor due to the similarity of the languages, but the explanation of the MATLAB code was more approachable. So I ported that to Igor to figure out how it works.

The MATLAB version is:

>> P = [-0.43256      -1.6656      0.12533]
P =
   -0.4326   -1.6656    0.1253
>> Q = [0.28768      -1.1465       1.1909]
Q =
    0.2877   -1.1465    1.1909
>> R = [1.1892    -0.037633      0.32729]
R =
    1.1892   -0.0376    0.3273
>> S = [0.17464     -0.18671      0.72579]
S =
    0.1746   -0.1867    0.7258
>> N = null(P-Q)
N =
   -0.3743   -0.7683
    0.9078   -0.1893
   -0.1893    0.6115
>> r = (R-P)*N
r =
    0.8327   -1.4306
>> s = (S-P)*N
s =
    1.0016   -0.3792
>> n = (s - r)*[0 -1;1 0];
>> n = n/norm(n);
>> n
n =
    0.9873   -0.1587
>> d = dot(n,r)
d =
    1.0491
>> d = dot(n,s)
d =
    1.0491
>> v = dot(s-r,d*n-r)/dot(s-r,s-r)
v =
    1.2024
>> u = (Q-P)'\((S - (S*N)*N') - P)'
u =
    0.9590
>> P + u*(Q-P)
ans =
    0.2582   -1.1678    1.1473
>> norm(P + u*(Q-P) - S)
ans =
    1.0710

and in IgorPro:

Function MakeVectors()
	Make/O/D/N=(1,3) matP={{-0.43256},{-1.6656},{0.12533}}
	Make/O/D/N=(1,3) matQ={{0.28768},{-1.1465},{1.1909}}
	Make/O/D/N=(1,3) matR={{1.1892},{-0.037633},{0.32729}}
	Make/O/D/N=(1,3) matS={{0.17464},{-0.18671},{0.72579}}
End

Function DoCalcs()
	WAVE matP,matQ,matR,matS
	MatrixOp/O tempMat = matP - matQ
	MatrixSVD tempMat
	Make/O/D/N=(3,2) matN
	Wave M_VT
	matN = M_VT[p][q+1]
	MatrixOp/O tempMat2 = (matR - matP)
	MatrixMultiply tempMat2, matN
	Wave M_product
	Duplicate/O M_product, mat_r
	MatrixOp/O tempMat2 = (matS - matP)
	MatrixMultiply tempMat2, matN
	Duplicate/O M_product, mat_s
	Make/O/D/N=(2,2) MatUnit
	matUnit = {{0,1},{-1,0}}
	MatrixOp/O tempMat2 = (mat_s - mat_r)
	MatrixMultiply tempMat2,MatUnit
	Duplicate/O M_Product, Mat_n
	Variable nn
	nn = norm(mat_n)
	MatrixOP/O new_n = mat_n / nn
	//new_n is now a vector with unit length
	Variable dd
	dd = MatrixDot(new_n,mat_r)
	//print dd
	//dd = MatrixDot(new_n,mat_s)
	//print dd
	dd = abs(dd)
	// now find v
	// v = dot(s-r,d*n-r)/dot(s-r,s-r)
	variable vv
	MatrixOp/O mat_s_r = mat_s - mat_r
	MatrixOp/O tempmat2 = dd * mat_n - mat_r
	vv = MatrixDot(mat_s_r,tempmat2) / MatrixDot(mat_s_r,mat_s_r)
	//print vv
	//because vv > 1 then closest point is s (because rs(1) = s) and therefore closest point on RS to infinite line PQ is S
	//what about the point on PQ is this also outside the segment?
	// u = (Q-P)'\((S - (S*N)*N') - P)'
	variable uu
	MatrixOp/O matQ_P = matQ - matP
	MatrixTranspose matQ_P
	//MatrixOP/O tempMat2 = ((matS - (matS * matN) * MatrixTranspose(MatN)) - MatrixTranspose(matP))
	Duplicate/O MatN, matNprime
	MatrixTranspose matNprime
	MatrixMultiply matS, matN
	Duplicate/O M_Product, matSN
	MatrixMultiply M_Product, matNprime
	MatrixOP/O tempMat2 = ((matS - M_product) - matP)
	MatrixTranspose tempMat2
	MatrixLLS matQ_P tempMat2
	Wave M_B
	uu = M_B[0]
	// find point on PQ that is closest to RS
	// P + u*(Q-P)
	MatrixOp/O matQ_P = matQ - matP
	MatrixOp/O matPoint = MatP + (uu * MatQ_P)
	MatrixOP/O distpoint = matPoint - matS
	Variable dist
	dist = norm(distpoint)
	Print dist
End

The sticking points were finding the Igor equivalents of

  • null()
  • norm()
  • dot()
  • \ otherwise known as mldivide

Which are:

  • MatrixSVD (answer is in the final two columns of wave M_VT)
  • norm()
  • MatrixDot()
  • MatrixLLS

MatrixLLS wouldn’t accept a mix of single-precision and double-precision waves, so this needed to be factored into the code.

As you can see, the Igor code is much longer. Overall, I think MATLAB handles Matrix Math better than Igor. It is certainly easier to write. I suspect that there are a series of Igor operations that can do what I am trying to do here, but this was an exercise in direct porting.

More work is needed to condense this down and also deal with every possible case. Then it needs to be incorporated into the bigger program. SIMPLE! Anyway, hope this helps somebody.

The post title is taken from the band Adventures In Stereo.

Tips from the blog VIII: Time Machine on QNAP NAS

This is just a quick tip as it took me a little while to sort this out. In the lab we have two QNAP TS-869 Pro NAS devices. Each was set up with a single RAID6 storage pool and I ran them as a primary and replicant via rsync. We recently bought a bigger server and so the plan was to repurpose one of the NAS boxes to be a Time Machine for all the computers in the lab.

We have around 10 computers that use Time Machine for small documents that reside on each computer (primary data is always on the server). So far, I’d relied on external hard drives to do this. However: 1) they fail, 2) they can get unplugged accidentally and fail to back up, and 3) they take up space.

As always the solution is simple and I’ll outline this with the drawbacks and then describe the other things I did to save you wasting time.

  1. Wipe the NAS. I went for RAID5, rather than RAID6. I figured this is safe enough. The NAS emails me if there is a problem with any of the disks and they can be replaced.
  2. Enable Time Machine. In the QNAP interface in Backup Server>Time Machine, click Enable Time Machine support. Set a limit if you like or 0 for no limit. Add a Username and Password.
  3. Pick the NAS as the Time Machine disk. On each Mac, wait for any running backup to complete, then turn Time Machine off. In Time Machine Preferences, pick the new disk. It will see your QNAP NAS Time Machine share. Click on it, enter the user/pass and click OK. Don’t select “use both disks” (an option in Yosemite onwards).
  4. That’s it. Wait for backup to complete, check it. Unplug external HD and repurpose.

You don’t need each user to have a share or an account on the NAS. You don’t need to mount the volume on the Mac at startup. The steps above are all you need to do.

The major drawback is that all users share the same Time Machine space. In theory, one person could swamp the backup with their files and this will limit how far back all users can go in the Time Machine. The NAS is pretty big, so I think this will be OK. A rule for putting data/big files in a directory on a user’s Mac and then excluding this directory from the Backup seems the obvious workaround.

What not to try

There is a guide on the web (on the QNAP website!) which shows you how to add Time Machine support for a Mac. It requires some command line action and is pretty complicated. Don’t do it. My guess is that this was the original hack and this has now been superseded by the “official” support on the QNAP interface. I’m not sure why they haven’t taken this guide down. There is another page on the site, outlining the steps above. Just use that.

I read on the QNAP forums about the drawback of using the official Time Machine backup (i.e. all users having to share). My brainwave was to create a user for each Mac, enable a home folder, limit the account with a quota and then set the Mac to mount the volume on startup. This could then be the Time Machine and allow me to limit the size of the Time Machine backup on a per-user basis. It didn’t work! Time Machine needs a specially formatted volume to work and so it cannot use a home folder in this way. Maybe there is a way to get this to work – it would be a better solution – but I couldn’t manage it.

So there you go. If you have one of these devices, this is what worked for us.

Tips from the blog VI: doc to pdf

A while back I made this little Automator script to convert Microsoft Word doc and docx files to PDF. It’s useful for when you are sent a bunch of Word files for committee work. Opening PDFs in Preview is nice and hassle-free. Struggling with Word is not.

It’s not my own work, I just put it together after googling around a bit. I’ll put it here for anyone to use.

To get it working:

  1. Open Automator. Choose the Service template and check: Service receives selected files and folders in Finder.app
  2. Set the actions: Get Folder Contents and Run AppleScript (this is where the script goes)
  3. Now Save the workflow. Suggested name Doc to PDF.

Now to run it:

  1. Select the doc/docx file(s) in the Finder window.
  2. Right-click and look for the service in the contextual menu. This should be down the bottom near “Reveal in Finder”.
  3. Run it.

If you want to put it onto another computer, go to ~/Library/Services and you will find the saved workflow there. Just copy it to the same location on your other computer.

Known bug: Word has to be open for the script to run. It also doesn’t shut down Word when it’s finished.

Here is the code.

property theList : {"doc", "docx"}

on run {input, parameters}
	set output to {}
	tell application "Microsoft Word" to set theOldDefaultPath to get default file path file path type documents path
	repeat with x in input
		try
			set theDoc to contents of x
			tell application "Finder"
				set theFilePath to container of theDoc as text
				set ext to name extension of theDoc
				if ext is in theList then
					set theName to name of theDoc
					copy length of theName to l
					copy length of ext to exl
					set n to l - exl - 1
					copy characters 1 through n of theName as string to theFilename
					set theFilename to theFilename & ".pdf"
					tell application "Microsoft Word"
						set default file path file path type documents path path theFilePath
						open theDoc
						set theActiveDoc to the active document
						save as theActiveDoc file format format PDF file name theFilename
						copy (POSIX path of (theFilePath & theFilename as string)) to end of output
						close theActiveDoc
					end tell
				end if
			end tell
		end try
	end repeat
	tell application "Microsoft Word" to set default file path file path type documents path path theOldDefaultPath
	return output
end run

 

Tips from the blog is a series and gets its name from a track from Black Sunday by Cypress Hill.

Waiting to happen II: Publication lag times

Following on from the last post about publication lag times at cell biology journals, I went ahead and crunched the numbers for all journals in PubMed for one year (2013). Before we dive into the numbers, a couple of points about this kind of information.

  1. Some journals “reset the clock” on the received date with manuscripts that are resubmitted. This makes comparisons difficult.
  2. The length of publication lag is not necessarily a reflection of the way the journal operates. As this comment points out, manuscripts are out of the journal’s hands (with the reviewers) for a substantial fraction of the time.
  3. The dataset is incomplete because the deposition of this information is not mandatory. About 1/3 of papers have the date information deposited (see below).
  4. Publication lag times go hand-in-hand with peer review. Moving to preprints and post-publication review would eradicate these delays.

Thanks for all the feedback on my last post, particularly those that highlighted the points above.

To see how all this was done, check out the Methods bit below, where you can download the full summary. I ended up with a list of publication lag times for 428500 papers published in 2013 (see left). To make a bit more sense of this, I split them by journal and then found the publication lag time stats for each. This had to be done per journal since PLoS ONE alone makes up 45560 of the records.

To try and visualise what these publication lag times look like for all journals, I made a histogram of the Median lag times for all journals using a 10 d bin width. It takes on average ~100 d to go from Received to Accepted and a further ~120 d to go from Accepted to Published. The whole process on average takes 239 days.

To get a feel for the variability in these numbers I plotted out the ranked Median times for each journal and overlaid Q25 and Q75 (dots). The IQR for some of the slower journals was >150 d. So the papers that they publish can have very different fates.

Is the publication lag time longer at higher-tier journals? To look at this, I used the Rec-Acc time and the 2013 Journal Impact Factor which, although widely derided and flawed, does correlate loosely with journal prestige. I have fewer journals in this dataset, because the lookup of JIFs didn’t find every journal in my starting set, either because the journal doesn’t have one or there were minor differences in the PubMed name and the Thomson-Reuters name. The median of the median Rec-Acc times for each bin is shown. So on average, journals with a JIF <1 will take 1 month longer to accept your paper than journals with a JIF ranging from 1-10. After this it rises again, to ~2 months longer at journals with a JIF over 10. Why? Perhaps at the lower end, the trouble is finding reviewers; whereas at the higher end, multiple rounds of review might become a problem.

The executive summary is below. These are the times (in days) for delays at all journals in PubMed for 2013.

Interval Median Q25 Q75
Received-to-Accepted 97 69 136
Accepted-to-Published 122 84 186
Received-to-Published 239 178 319

For comparison:

  1. Median time from ovulation to birth of a human being is 268 days.
  2. Mark Beaumont cycled around the world (29,446 km) in 194 days.
  3. Ellen MacArthur circumnavigated the globe single-handed in 72 days.

On the whole it seems that publishing in Cell Biology is quite slow compared to the whole of PubMed. Why this is the case is a tricky question. Is it because cell biologists submit papers too early and they need more revision? Are they more dogged in sending back rejected manuscripts? Is it because as a community we review too harshly and/or ask too much of the authors? Do Editors allow too many rounds of revision or not give clear guidance to expedite the time from Received-to-Accepted? It’s probably a combination of all of these factors and we’re all to blame.

Finally, this amusing tweet to show the transparency of EMBO J publication timelines raises the question: would these authors have been better off just sending the paper somewhere else?

Methods: I searched PubMed using journal article[pt] AND ("2013/01/01"[PDAT] : "2013/12/31"[PDAT]). This gave a huge xml file (~16 GB) which nokogiri balked at. So I divided the query up into subranges of those dates (1.4 GB) and ran the script on all xml files. This gave 1425643 records. I removed records that did not have a received date or those with greater than 12 in the month field (leaving 428513 records). 13 of these records did not have a journal name. This gave 428500 records from 3301 journals. Again, I filtered out negative values (papers accepted before they were received) and a couple of outliers (e.g. 6000 days!). With a bit of code it was quite straightforward to extract simple statistics for each of the journals. You can download the data here to look up the information for a journal of your choice (WordPress only allows xls, not txt/csv). The fields show the journal name and the number of valid articles. Then for Acc-Pub, Rec-Acc and Rec-Pub, the number, median, lower quartile and upper quartile times in days are given. I set a limit of 5 or more articles for calculation of the stats. Blank entries are where there was no valid data. Note that there are some differences with the table in my last post. This is because for that analysis I used a bigger date range and then filtered the year based on the published field. Here my search started out by specifying PDAT, which is slightly different.
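
For the “bit of code”, something along these lines works in Igor. This is a sketch only: journalW and lagW are hypothetical waves holding the journal name and the Received-to-Accepted time (in days) for each record:

Function StatsForJournal(jName)
	String jName
	WAVE/T journalW	// text wave of journal names, one per record
	WAVE lagW	// numeric wave of lag times in days, one per record
	// pull out the lag times for the journal of interest
	Extract/O lagW, oneJournal, (CmpStr(journalW[p], jName) == 0)
	if(numpnts(oneJournal) >= 5)	// only report stats for 5 or more articles
		StatsQuantiles oneJournal	// sets V_Median, V_Q25 and V_Q75, among others
		Print jName, V_Median, V_Q25, V_Q75
	endif
End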

The data are OK, but the publication date needs to be taken with a pinch of salt. For many records it was missing a month or day, so the date used for some records is approximate. In retrospect, using the Entrez date or one of the other required fields would probably have been better. I liked the idea of the publication date as this is when the paper finally appears in print, which still represents a significant delay at some journals. The Received-to-Accepted dates are valid though.

Waiting to Happen: Publication lag times in Cell Biology Journals

My interest in publication lag times continues. Previous posts have looked at how long it takes my lab to publish our work, how often trainees publish and I also looked at very long lag times at Oncogene. I recently read a blog post on automated calculation of publication lag times for Bioinformatics journals. I thought it would be great to do this for Cell Biology journals too. Hopefully people will find it useful and can use this list when thinking about where to send their paper.

What is publication lag time?

If you are reading this, you probably know how science publication works. Feel free to skip. Otherwise, it goes something like this. After writing up your work for publication, you submit it to a journal. Assuming that this journal will eventually publish the paper (there is usually a period of submitting, getting rejected, resubmitting to a different journal etc.), they receive the paper on a certain date. They send it out to review, they collate the reviews and send back a decision, you (almost always) revise your paper further and then send it back. This can happen several times. At some point it gets accepted on a certain date. The journal then prepares the paper for publication in a scheduled issue on a specific date (they can also immediately post papers online without formatting). All of these steps add significant delays. It typically takes 9 months to publish a paper in the biomedical sciences. In 2015 this sounds very silly, when world-wide dissemination of information is as simple as a few clicks on a trackpad. The bigger problem is that we rely on papers as a currency to get jobs or funding and so these delays can be more than just a frustration, they can affect your ability to actually do more science.

The good news is that it is very straightforward to parse the received, accepted and published dates from PubMed. So we can easily calculate the publication lags for cell biology journals. If you don’t work in cell biology, just follow the instructions below to make your own list.

The bad news is that the deposition of the date information in PubMed depends on the journal. The extra bad news is that three of the major cell biology journals do not deposit their data: J Cell Biol, Mol Biol Cell and J Cell Sci. My original plan was to compare these three journals with Traffic, Nat Cell Biol and Dev Cell. Instead, I extended the list to include other journals which take non-cell biology papers (and deposit their data).


A summary of the last ten years

Three sets of box plots here show the publication lags for eight journals that take cell biology papers. The journals are Cell, Cell Stem Cell, Current Biology, Developmental Cell, EMBO Journal, Nature Cell Biology, Nature Methods and Traffic (see note at the end about eLife). They are shown in alphabetical order. The box plots show the median and the IQR, whiskers show the 10th and 90th percentiles. The three plots show the time from Received-to-Published (Rec-Pub), and then a breakdown of this time into Received-to-Accepted (Rec-Acc) and Accepted-to-Published (Acc-Pub). The colours are just to make it easier to tell the journals apart and don’t have any significance.

You can see from these plots that the journals differ widely in the time it takes to publish a paper there. Current Biology is very fast, whereas Cell Stem Cell is relatively slow. The time it takes the journals to move papers from acceptance to publication is pretty constant, apart from Traffic, where it takes an average of ~3 months to get something into print. Remember that the paper is often online for this period so this is not necessarily a bad thing. I was not surprised that Current Biology was the fastest. At this journal, a presubmission inquiry is required and the referees are often lined up in advance. The staff are keen to publish rapidly, hence the name, Current Biology. I was amazed at Nature Cell Biology having such a short time from Received-to-Acceptance. The delay in Received-to-Acceptance comes from multiple rounds of revision and from doing extra experimental work. Anecdotally, it seems that the review at Nature Cell Biol should be just as lengthy as at Dev Cell or EMBO J. I wonder if the received date is accurate… it is possible to massage this date by first rejecting the paper but allowing a resubmission, and then using the resubmission date as the received date [Edit: see below]. One way to legitimately limit this delay is to only allow a certain time for revisions and only allow one round of corrections. This is what happens at J Cell Biol; unfortunately we don’t have this data to see how effective this is.


How has the lag time changed over the last ten years?

Have the slow journals always been slow? When did they become slow?  Again three plots are shown (side-by-side) depicting the Rec-Pub and then the Rec-Acc and Acc-Pub time. Now the intensity of red or blue shows the data for each year (2014 is the most intense colour). Again you can see that the dataset is not complete with missing date information for Traffic for many years, for example.

Interestingly, the publication lag has been pretty constant for some journals but not others. Cell Stem Cell and Dev Cell (but not the mothership – Cell) have seen increases as have Nature Cell Biology and Nature Methods. On the whole Acc-Pub times are stable, except for Nature Methods which is the only journal in the list to see an increase over the time period. This just leaves us with the task of drawing up a ranked list of the fastest to the slowest journal. Then we can see which of these journals is likely to delay dissemination of our work the most.

The Median times (in days) for 2013 are below. The journals are ranked in order of fastest to slowest for Received-to-Publication. I had to use 2013 because EMBO J is missing data for 2014.

Journal Rec-Pub Rec-Acc Acc-Pub
Curr Biol 159 99.5 56
Nat Methods 192 125 68
Cell 195 169 35
EMBO J 203 142 61
Nature Cell Biol 237 180 59
Traffic 244 161 86
Dev Cell 247 204 43
Cell Stem Cell 284 205 66

You’ll see that only Cell Stem Cell is over the threshold where it would be faster to conceive and give birth to a human being than to publish a paper there (on average). If the additional time wasted in submitting your manuscript to other journals is factored in, it is likely that most papers are at least on a par with the median gestation time.

If you are wondering why eLife is missing… as a new journal it didn’t have ten years worth of data to analyse. It did have a reasonably complete set for 2013 (but Rec-Acc only). The median time was 89 days, beating Current Biology by 10.5 days.

Methods

Please check out Neil Saunders’ post on how to do this. I did a PubMed search for (journal1[ta] OR journal2[ta] OR ...) AND journal article[pt] to make sure I didn’t get any reviews or letters etc. I limited the search from 2003 onwards to make sure I had 10 years of data for the journals that deposited it. I downloaded the file as xml and I used Ruby/Nokogiri to parse the file to csv. Installing Nokogiri is reasonably straightforward, but the documentation is pretty impenetrable. The ruby script I used was from Neil’s post (step 3) with a few lines added:


#!/usr/bin/ruby

require 'nokogiri'

f = File.open(ARGV.first)
doc = Nokogiri::XML(f)
f.close

doc.xpath("//PubmedArticle").each do |a|
  r = ["", "", "", "", "", "", "", "", "", "", ""]
  r[0] = a.xpath("MedlineCitation/Article/Journal/ISOAbbreviation").text
  r[1] = a.xpath("MedlineCitation/PMID").text
  r[2] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='received']/Year").text
  r[3] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='received']/Month").text
  r[4] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='received']/Day").text
  r[5] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='accepted']/Year").text
  r[6] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='accepted']/Month").text
  r[7] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='accepted']/Day").text
  r[8] = a.xpath("MedlineCitation/Article/Journal/JournalIssue/PubDate/Year").text
  r[9] = a.xpath("MedlineCitation/Article/Journal/JournalIssue/PubDate/Month").text
  r[10] = a.xpath("MedlineCitation/Article/Journal/JournalIssue/PubDate/Day").text
  puts r.join(",")
end

and then executed as described. The csv could then be imported into IgorPro and processed. Neil’s post describes a workflow for R, or you could use Excel or whatever at this point. As he notes, quite a few records are missing the date information and some of it is wrong, i.e. published before it was accepted. These need to be cleaned up. The other problem is that the month is sometimes an integer and sometimes a three-letter code. He uses lubridate in R to get around this, a loop-replace in Igor is easy to construct and even Excel can handle this with an IF statement, e.g. IF(LEN(G2)=3,MONTH(1&LEFT(G2,3)),G2) if the month is in G2. Good luck!
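
For completeness, the loop-replace in Igor might look something like this (a sketch only; the wave names are invented and assume the month column was loaded as a text wave):

Function FixMonths()
	WAVE/T monthTxt	// text wave of months: "1".."12" or "Jan".."Dec"
	Make/O/N=(numpnts(monthTxt)) monthNum
	String monthList = "Jan;Feb;Mar;Apr;May;Jun;Jul;Aug;Sep;Oct;Nov;Dec"
	Variable i
	for(i = 0; i < numpnts(monthTxt); i += 1)
		if(strlen(monthTxt[i]) == 3)	// three-letter code
			monthNum[i] = WhichListItem(monthTxt[i], monthList) + 1
		else	// already a number
			monthNum[i] = str2num(monthTxt[i])
		endif
	endfor
End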

Edit 9/3/15 @ 17:17: Several people (including Deborah Sweet and Bernd Pulverer from Cell Press/Cell Stem Cell and EMBO, respectively) have confirmed via Twitter that some journals use the date of resubmission as the submitted date. Cell Stem Cell and EMBO journals use the real dates. There is no way to tell whether a journal does this or not (from the deposited data). Stuart Cantrill from Nature Chemistry pointed out that his journal does declare that they sometimes reset the clock. I’m not sure about other journals. My own feeling is that – for full transparency – journals should 1) record the actual dates of submission, acceptance and publication, 2) deposit them in PubMed and add them to the paper. As pointed out by Jim Woodgett, scientists want the actual dates on their paper, partly because they are the real dates, but also to claim priority in certain cases. There is a conflict here, because journals might appear inefficient if they have long publication lag times. I think this should be an incentive for Editors to simplify revisions by giving clear guidance and limiting successive revision cycles. (This Edit was corrected 10/3/15 @ 11:04).

The post title is taken from “Waiting to Happen” by Super Furry Animals from the “Something 4 The Weekend” single.