The Digital Cell: Statistical tests

Statistical hypothesis testing, commonly referred to as “statistics”, is a topic of consternation among cell biologists.

This is a short practical guide I put together for my lab. Hopefully it will be useful to others. Note that statistical hypothesis testing is a huge topic and one post cannot hope to cover everything that you need to know.

What statistical test should I do?

To figure out what statistical test you need to do, look at the table below. But before that, you need to ask yourself a few things.

  • What are you comparing?
  • What is n?
  • What will the test tell you? What is your hypothesis?
  • What will the p value (or other summary statistic) mean?

If you are not sure about any of these things, whichever test you do is unlikely to tell you much.

The most important question is: what type of data do you have? This will help you pick the right test.

  • Measurement – most data you analyse in cell biology will be in this category. Examples are: number of spots per cell, mean GFP intensity per cell, diameter of nucleus, speed of cell migration…
    • Normally-distributed – this means it follows a “bell-shaped curve” otherwise called “Gaussian distribution”.
    • Not normally-distributed – data that doesn’t fit a normal distribution: skewed data, or better described by other types of curve.
  • Binomial – this is data where there are two possible outcomes. A good example here in cell biology would be a mitotic index measurement (the proportion of cells in mitosis). A cell is either in mitosis or it is not.
  • Other – maybe you have ranked or scored data. This is not very common in cell biology. A typical example here would be a scoring chart for a behavioural effect with agreed criteria (0 = normal, 5 = epileptic seizures). For a cell biology experiment, you might have a scoring system for a phenotype, e.g. fragmented Golgi (0 = not fragmented, 5 = totally dispersed). These arbitrary systems are not a good idea, especially if the person scoring is not blinded to the experimental procedure. Try to come up with an unbiased measurement procedure instead.

 

What do you want to do? | Measurement (Normal) | Measurement (not Normal) | Binomial
Describe one group | Mean, SD | Median, IQR | Proportion
Compare one group to a value | One-sample t-test | Wilcoxon test | Chi-square
Compare two unpaired groups | Unpaired t-test | Wilcoxon-Mann-Whitney two-sample rank test | Fisher's exact test or Chi-square
Compare two paired groups | Paired t-test | Wilcoxon signed rank test | McNemar's test
Compare three or more unmatched groups | One-way ANOVA | Kruskal-Wallis test | Chi-square test
Compare three or more matched groups | Repeated-measures ANOVA | Friedman test | Cochran's Q test
Quantify association between two variables | Pearson correlation | Spearman correlation | –
Predict value from another measured variable | Simple linear regression | Nonparametric regression | Simple logistic regression
Predict value from several measured or binomial variables | Multiple linear (or nonlinear) regression | – | Multiple logistic regression

Modified from Table 37.1 (p. 298) in Intuitive Biostatistics by Harvey Motulsky, 1995 OUP.

What do “paired/unpaired” and “matched/unmatched” mean?

Most of the data you will get in cell biology is unpaired or unmatched. Individual cells are measured and you have say, 20 cells in the control group and 18 different cells in the test group. These are unpaired (or unmatched in the case of more than one test group) because the cells are different in each group. If you had the same cell in two (or more) groups, the data would be paired (or matched). An example of a paired dataset would be where you have 10 cells that you treat with a drug. You take a measurement from each of them before treatment and a measurement after. So you have paired measurements: one for cell A before treatment, one after; one for cell B before and after, and so on.

How to do some of these tests in IgorPRO

The examples below assume that you have values in waves called data0, data1, data2, …; substitute your actual wave names.

Is it normally distributed?

The simplest way is to plot the data and see. You can plot your data using Analysis>Histogram… or Analysis>Packages>Percentiles and BoxPlot… Another possibility is to look at the skewness or kurtosis of the dataset (you can do this with WaveStats, see below).

However, if you only have a small number of measurements, or you want to be sure, you can do a test. There are several tests you can do (Kolmogorov-Smirnov, Jarque-Bera, Shapiro-Wilk). The easiest to do and most intuitive (in Igor) is Shapiro-Wilk.


StatsShapiroWilkTest data0

If p < 0.05, the test rejects the hypothesis that the data are normally distributed. Statistical tests on normally distributed data are called parametric, while those on non-normally distributed data are non-parametric.

Describe one group

To get the mean and SD (and lots of other statistics from your data):


Wavestats data0

To get the median and IQR:


StatsQuantiles/ALL data0

The mean and SD are also stored as variables (V_avg, V_sdev). StatsQuantiles calculates V_median, V_Q25, V_Q75, V_IQR, etc. Note that you can also get just the median by typing Print StatsMedian(data0) or – in Igor 7 – Print median(data0). There is often more than one way to do something in Igor.
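
For example, to pull the mean and SD out after running WaveStats quietly (/Q suppresses the printout, as explained in the notes below):

WaveStats/Q data0
Print V_avg, V_sdev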

Compare one group to a value

It is unlikely that you will need to do this. In cell biology, most of the time we do not have hypothetical values for comparison, we have experimental values from appropriate controls. If you need to do this:


StatsTTest/CI/T=1 data0
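
As written, this compares the mean of data0 to zero. As far as I can tell, the /MEAN flag sets the hypothetical value you want to compare against – check ShowHelpTopic “StatsTTest” to confirm:

// the /MEAN flag (the value here is arbitrary) sets the comparison value – an assumption, check the help
StatsTTest/CI/T=1/MEAN=100 data0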

Compare two unpaired groups

Use this for normally distributed data where you have test versus control, with no other groups. For paired data, use the additional flag /PAIR.


StatsTTest/CI/T=1 data0,data1
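
For the paired case mentioned above, the same command takes the /PAIR flag:

StatsTTest/CI/T=1/PAIR data0,data1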

For the non-parametric equivalent: if n is large, the computation takes a long time, so use the additional flag /APRX=2. If the data are paired, use the additional flag /WSRT.


StatsWilcoxonRankTest/T=1/TAIL=4 data0,data1
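
For instance, with a large n the approximation flag mentioned above is simply added to the same command (a sketch; check ShowHelpTopic “StatsWilcoxonRankTest” for the full set of options):

StatsWilcoxonRankTest/T=1/TAIL=4/APRX=2 data0,data1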

For binomial data, your waves will have two points, where point 0 corresponds to one outcome and point 1 to the other. Note that you can compare to expected values here; for example, a genetic cross experiment can be compared to expected Mendelian frequencies. To do Fisher’s exact test, you need a 2D wave representing a contingency table. McNemar’s test for paired binomial data is not available in Igor.

StatsChiTest/S/T=1 data0,data1
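
As a sketch, a mitotic index comparison might be set up like this (the wave names and counts are made up for illustration):

Make/O/N=2 mitoCtrl = {12, 88}	// control: point 0 = mitotic cells, point 1 = non-mitotic
Make/O/N=2 mitoDrug = {35, 65}	// treatment
StatsChiTest/S/T=1 mitoCtrl,mitoDrug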

If you have more than two groups, do not do multiple versions of these tests, use the correct method from the table.

Compare three or more unmatched groups

For normally-distributed data, you need to do a one-way ANOVA followed by a post-hoc test. The ANOVA will tell you whether there are any differences among the groups; a post-hoc test then tells you which groups differ. There are several post-hoc tests available, e.g. Dunnett’s is useful where you have one control group and a bunch of test conditions. We tend to use Tukey’s post-hoc comparison (the /NK flag also does the Newman-Keuls test).


StatsAnova1Test/T=1/Q/W/BF data0,data1,data2,data3
StatsTukeyTest/T=1/Q/NK data0,data1,data2,data3

The non-parametric equivalent is the Kruskal-Wallis test followed by a multiple comparison test; the Dunn-Holland-Wolfe method is used here.


StatsKWTest/T=1/Q data0,data1,data2,data3
StatsNPMCTest/T=1/DHW/Q data0,data1,data2,data3

Compare three or more matched groups

It’s unlikely that this kind of data will be obtained in a typical cell biology experiment.

StatsANOVA2RMTest/T=1 data0,data1,data2,data3

There are also operations for StatsFriedmanTest and StatsCochranTest.

Correlation

This is a straightforward command for two waves or one 2D wave. Waves (or columns) must be of the same length.


StatsCorrelation data0

At this point, you probably want to plot out the data and use Igor’s fitting functions. The best way to get started is with the example experiment, or just display your data and Analysis>Curve Fitting…
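
As a minimal sketch of that route, assuming data1 is to be fitted as a function of data0 (the /D flag generates a fit wave to overlay on the graph):

Display data1 vs data0
CurveFit line, data1/X=data0/D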

Hazard and survival data

In the lab we have, in the past, done survival/hazard analysis. This is a bit more complex; we used SPSS and would do so again, as Igor does not provide these functions.

Notes for use

The good news is that all of this is a lot more intuitive in Igor 7! There is a new menu item called Statistics, where most of these functions have a dialog with more information. In Igor 6.3 you are stuck with the command line. Igor 7 will be out soon (July 2016).

  • Note that there are further options for most of these commands. If you need to see them:
    • check the manual or Igor Help
    • or type ShowHelpTopic “StatsMedian” in the Command Window (put whatever command you want help with between the quotes).
  • Extra options are specified by “flags”; these are things like “/Q” that come after the command. For example, /Q means “quiet”, i.e. don’t print the output into the history window.
  • You should always either print the results to the history or put them into a table so that we can check them. Note that the table gets overwritten if you do the same test with different data, so printing is a good idea in that case.
  • The defaults in Igor are set up OK for our needs. For example, Igor does two-tailed comparison, alpha = 0.05, Welch’s correction, etc.
  • Most operations can handle waves of different length (or have flags set to handle this case).
  • If you are used to doing statistical tests in Excel, you might be wondering about tails and equal variances. The flags are set in the examples to do two-tailed analysis and unequal variances are handled by Welch’s correction.
  • There’s a school of thought that says it is safest to always use non-parametric tests. However, these tests are not as powerful, and so it is best to use parametric tests (t test, ANOVA) when you can.

Part of a series on the future of cell biology in quantitative terms.

The Digital Cell: Getting started with IgorPRO

This post follows on from “Getting Started“.

In the lab we use IgorPRO for pretty much everything. We have many analysis routines that run in Igor, we have scripts for processing microscope metadata etc, and we use it for generating all figures for our papers. Even so, people in the lab engage with it to varying extents. The main battle is that the use of Excel is pretty ubiquitous.

I am currently working on getting more people in the lab started with using Igor. I’ve found that everyone is keen to learn. The approach so far has been workshops to go through the basics. This post accompanies the first workshop, which is coupled to the first few pages of the Manual. If you’re interested in using Igor read on… otherwise you can skip to the part where I explain why I don’t want people in the lab to use Excel.

IgorPro is very powerful and the learning curve is steep, but the investment is worth it.

These are some of the things that Igor can do: Publication-quality graphics, High-speed data display, Ability to handle large data sets, Curve-fitting, Fourier transforms, smoothing, statistics, and other data analysis, Waveform arithmetic, Matrix math, Image display and processing, Combination graphical and command-line user interface, Automation and data processing via a built-in programming environment, Extensibility through modules written in the C and C++ languages. You can even play games in it!

The basics

The first thing to learn is about the objects in the Igor environment and how they work. There are four basic objects that all Igor users will encounter straight away.

  • Waves
  • Graphs
  • Tables
  • Layouts

All data is stored as waveforms (or waves for short). Waves can be displayed in graphs or tables. Graphs and tables can be placed in a Layout. This is basically how you make a figure.

The next things to check out are the command window (which displays the history), the data browser and the procedure window.

Essential IgorPro

  • Tables are not spreadsheets! Most important thing to understand. Tables are just a way of displaying a wave. They may look like a spreadsheet, but they are not.
  • Igor is case insensitive.
  • Spaces. Igor can handle spaces in the names of objects, but IMO they are best avoided.
  • Igor is 0-based not 1-based
  • Logical naming and logical thought – beginners struggle with this and it’s difficult to get this right when you are working on a project, but consistent naming of objects makes life easier.
  • Programming versus not programming – you can get a long way without programming but at some point it will be necessary and it will save you a lot of time.

Pretty soon, you will go beyond the four basic objects and encounter other things. These include: Numeric and string variables, Data folders, Notebooks, Control panels, 3D plots – a.k.a. gizmo, Procedures.

Getting started guide


Why don’t we use Excel?

  • Excel can’t make high quality graphics for publication.
    • We do that in Igor.
    • So any effort in Excel is a waste of time.
  • Excel is error-prone.
    • Too easy for mistakes to be introduced.
    • Not auditable. Tough/impossible to find mistakes.
    • Igor has a history window that allows us to see what has happened.
  • Most people don’t know how to use it properly.
  • Not good for biological data – Transcription factor Oct4 gets converted to a date.
  • Limited to 1048576 rows and 16384 columns.
  • Related: useful link describing some spreadsheet crimes of data entry.

But we do use Excel a lot

  • Excel is useful for quick calculations and for preparing simple charts to show at lab meeting.
  • Same way that Powerpoint is OK to do rough figures for lab meeting.
  • But neither are publication-quality.
  • We do use Excel for Tracking Tables, Databases(!) etc.

The transition is tough, but worth it

Writing formulae in Excel is straightforward, and the first thing you will find is that to achieve the same thing in Igor is more complicated. For example, working out the mean for each row in an array (A1:Y20) in Excel would mean typing =AVERAGE(A1:Y1) in cell Z1 and copying this cell down to Z20. Done. In Igor there are several ways to do this, which itself can be unnerving. One way is to use the Waves Average panel. You need to know how this works to get it to do what you want.
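
As a sketch of one programmatic route, assuming the same block of numbers had been imported as a 2D wave called data2D (a hypothetical name):

// data2D is a hypothetical 20-row, 25-column 2D wave; rowMeans is the equivalent of column Z in Excel
MatrixOp/O rowMeans = sumRows(data2D) / 25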

But before you turn back, thinking I’ll just do this in Excel and then import it… imagine you now want to subtract a baseline value from the data, scale it and then average. Imagine that your data are sampled at different intervals. How would you do that? Dealing with those simple cases in Excel is difficult-to-impossible. In Igor, it’s straightforward.

Resources for learning more Igor:

  • Igor Help – fantastic resource containing the manual and more. Access via Help or by typing ShowHelpTopic “thing I want to search for”.
  • Igor Manual – This PDF is available online or in Applications/Igor Pro/Manual. This used to be distributed as a hard copy… it is now ~3000 pages.
  • Guided Tour of IgorPro – this is a great way to start and will form the basis of the workshops.
  • Demos – Igor comes packed with Demos for most things from simple to advanced applications.
  • IgorExchange – Lots of code snippets and a forum to ask for advice or search for past answers.
  • Igor Tips – I’ve honestly never used these; you can turn on tips in Igor which reveal help on mouseover.
  • Igor mailing list – topics discussed here are pretty advanced.
  • Introduction to IgorPRO from Payam Minoofar is good. A faster start to learning to program than reading the manual.
  • Hands-on experience!

Part of a series on the future of cell biology in quantitative terms.

Everything In Its Right Place

Something that has driven me nuts for a while is the bug in FIJI/ImageJ when making montages of image stacks. This post is about a solution to this problem.

What’s a montage?

You have a stack of images and you want to array them in m rows by n columns. This is useful for showing a gallery of each frame in a movie or to separate the channels in a multichannel image.

What’s the bug/feature in ImageJ?

If you select Image>Stacks>Make Montage… you can specify how you want to lay out your montage. You can specify a “border” for this. Let’s say we have a stack of 12 images that are 300 x 300 pixels. Let’s arrange them into 3 rows and 4 columns with 0 border.

So far so good. We have an image that is 1200 x 900. But it looks a bit rubbish; we need some grouting (white pixel space between the images). We don’t need a border, but let’s ignore that for the moment. So the only way to do this in ImageJ is to specify a border of 8 pixels.

Looks a lot better. OK, there’s a border around the outside, which is no use, but it looks good. But wait a minute! Check out the size of the image (1204 x 904). This is only 4 pixels bigger in x and y, yet we added all that grouting. What’s going on?

The montage is not pixel perfect.


So the first image is not 300 x 300 any more. It is 288 x 288. Hmmm, maybe we can live with losing some data… but what’s this?


The next image in the row is not even square! It’s 292 x 288. How much this annoys you will depend on how much you like things being correct… The way I see it, this is science, if we don’t look after the details, who will? If I start with 300 x 300 images, it’s not too much to ask to end up with 300 x 300 images, is it? I needed to fix this.

Solutions

I searched for a while for a solution. It had clearly bothered other people in the past, but I guess people just found their own workaround.

ImageJ solution for multichannel array

So for a multichannel image, where the grayscale images are arrayed next to the merge, I wrote something in ImageJ to handle this. These macros are available here. There is a macro for doing the separation and arraying. Then there is a macro to combine these into a bigger figure.

Igor solution

For the exact case described above, where large stacks need to be tiled out into an m x n array, I have to admit I struggled to write something for ImageJ and instead wrote something for IgorPRO. Specifying 3 rows, 4 columns and a grout of 8 pixels gives the correct TIFF of 1224 x 916, with each frame shown in full and square. The code is available here; it works for 8-bit greyscale and RGB images.
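
As a quick check of the expected size: 4 columns x 300 px + 3 gaps x 8 px = 1224 px wide, and 3 rows x 300 px + 2 gaps x 8 px = 916 px high (grout between frames only, no outer border), which is exactly what the Igor code produces.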


I might update the code at some point to make sure it can handle all data types and to allow labelling and adding of a scale bar etc.

The post title is taken from “Everything In Its Right Place” by Radiohead from album Kid A.

The Digital Cell: Getting Started

More on the theme of “The Digital Cell“: using quantitative, computational approaches in cell biology.

So you want to get started? Well, the short version of this post is:

Find something that you need to automate and get going!

Programming

I make no claim to be a computer wizard. My first taste of programming was the same as anyone who went to school in the UK in the 1980s: BBC Basic. Although my programming only went as far as copying a few examples from the book (right), this experience definitely reduced the “fear of the command line”. My next encounter with coding was to learn HTML when I was an undergraduate. It was not until I was a postdoc that I realised that I needed to write scripts in order to get computers to do what I wanted them to do for my research.

Image analysis

I work in cell biology. My work involves a lot of microscopy. From the start, I used computer-based methods to quantify images. My first paper mentions quantifying images, but it wasn’t until I was a PhD student that I first used NIH Image (as it was called then) to extract quantitative information from confocal micrographs. I was also introduced to IgorPRO (version 3!) as a PhD student, but did no programming. That came later. As a postdoc, we used Scanalytics’ IPLab and Igor (as well as a bit of ImageJ as it had become). IPLab had an easy scripting language and it was in this program that I learned to write macros for analysis. At this time there were people in the lab who were writing software in IgorPro and MATLAB. While I didn’t pick up programming in IgorPRO or MATLAB then, it made me realise what was possible.

When I started my own group I discovered that IPLab had been acquired by BD Biosciences and then stripped out. I had hundreds of useless scripts and needed a new solution. ImageJ had improved enormously by this time and so this became our default image analysis program. The first data analysis package I bought was IgorPro (version 6) and I have stuck with it since then. In a future post, I will probably return to whether or not this was a good path.

Getting started with programming

Around 2009, I was still unable to program properly. I needed a macro for baseline subtraction – something really simple – and realised I didn’t know how to do it. We didn’t have just one or two traces to modify, we had hundreds. This was simply not possible by hand. It was this situation that made me realise I needed to learn to program.

…having a concrete problem that is impossible to crack any other way is the best motivator for learning to program.

This might seem obvious, but having a concrete problem that is impossible to crack any other way is the best motivator for learning to program. I know many people who have decided they “want to learn to code” or they are “going to learn to use R”. This approach rarely works. Sitting down and learning this stuff without sufficient motivation is really tough. So I would advise someone wanting to learn programming to find something that needs automation and just get going. Just get something to work!

Don’t worry (initially) about any of the following:

  • What program/language to use – as long as it is possible, just pick something and do it
  • If your code is ugly or embarrassing to show to an expert – as long as it runs, it doesn’t matter
  • About copy-and-pasting from examples – it’s OK as long as you take time to understand what you are doing; this is a quick way to make progress. Resources such as stackoverflow are excellent for this
  • Bugs – you can squish them, they will frustrate you, but you might need some…
  • Help – ask for help. Online forums are great, experts love showing off their knowledge. If you have local expertise, even better!

Once you have written something (and it works)… congratulations, you are a computer programmer!

Seriously, that is all there is to it. OK, it’s a long way to being a good programmer or even a competent one, but you have made a start. Like Obi Wan Kenobi says: you’ve taken your first step into a larger world.

So how do you get started with an environment like IgorPro? This will be the topic for next time.

Part of a series on the future of cell biology in quantitative terms.

The Digital Cell: Workflow

The future of cell biology, even for small labs, is quantitative and computational. What does this mean and what should it look like?

My group is not there yet, but in this post I’ll describe where we are heading. The graphic below shows my current view of the ideal workflow for my lab.

[Workflow graphic]

The graphic is pretty self-explanatory, but to walk you through:

  • A lab member sets up a microscopy experiment. We have standardised procedures/protocols in a lab manual and systems are in place so that reagents are catalogued to minimise error.
  • Data goes straight from the microscope to the server (and is backed up). Images and metadata are held in a database and object identifiers are used for referencing in electronic lab notebooks (and for auditing).
  • Analysis of the data happens with varying degrees of human intervention. The outputs of all analyses are processed automatically. Code for doing these steps is under version control using git (GitHub).
  • Post-analysis, the processed outputs contain markers for QC and error checking. We can also trace back to the original data and check the analysis. Development of code happens here too, speeding up slow procedures via “software engineering”.
  • Figures are generated using scripts which are linked to the original data with an auditable record of any modification to the image.
  • Project management, particularly of paper writing, is via Trello. Writing papers is done using collaborative tools. Everything is synchronised to enable working from any location.
  • This is just an overview and some details are missing, e.g. backup of analyses is done locally and via the server.

Just to reiterate: my team are not at this point yet, but we are reasonably close. We have not yet implemented three of these things properly in my group, but in our latest project (via collaboration) the workflow has worked as described above.

The output is a manuscript! In the future I can see that publication of a paper as a condensed report will give way to making the data, scripts and analysis available, together with a written summary. This workflow is designed to allow this to happen easily, but this is the topic for another post.

Part of a series on the future of cell biology in quantitative terms.

The Digital Cell

If you are a cell biologist, you will have noticed the change in emphasis in our field.

At one time, cell biology papers were – in the main – qualitative. Micrographs of “representative cells”, western blots of a “typical experiment”… This descriptive style gave way to more quantitative approaches, converting observations into numbers that could be objectively assessed. More recently, as technology advanced, computing power increased and data sets became more complex, we have seen larger scale analysis, modelling, and automation begin to take centre stage.

This change in emphasis encompasses several areas including (in no particular order):

  • Statistical analysis
  • Image analysis
  • Programming
  • Automation allowing analysis at scale
  • Reproducibility
  • Version control
  • Data storage, archiving and accessing large datasets
  • Electronic lab notebooks
  • Computer vision and machine learning
  • Prospective and retrospective modelling
  • Mathematics and physics

The application of these areas is not new to biology and has been worked on extensively for years in certain areas. Perhaps most obviously by groups that identified themselves as “systems biologists”, “computational biologists”, and people working on large-scale cell biology projects. My feeling is that these methods have now permeated mainstream (read: small-scale) cell biology to such an extent that any groups that want to do cell biology in the future have to adapt in order to survive. It will change the skills that we look for when recruiting and it will shape the cell biologists of the future. Other fields such as biophysics and neuroscience are further through this change, while others have yet to begin. It is an exciting time to be a biologist.

I’m planning to post occasionally about the way that our cell biology research group is working on these issues: our solutions and our problems.

Part of a series on the future of cell biology in quantitative terms.

Adventures in code II

I needed to generate a uniform random distribution of points inside a circle and, later, a sphere. This is part of a bigger project, but the code to do this is kind of interesting. There were no solutions available for IgorPro, but Stack Exchange had plenty of examples in Python and Mathematica. There are many ways to do this. The most popular seems to be to generate a uniform random set of points in a square or cube and then discard those that are greater than the radius away from the origin. I didn’t like this idea, because I needed to extend it to spheroids eventually, and as I saw it the computation time saved was minimal.

Here is the version for points in a circle (radius = 1, centred on the origin).

[Image: Igor code for points in a circle]
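
The code itself was posted as an image, so here is a rough sketch of one way to do it in Igor (not necessarily the original code), using the var + enoise(var) trick described below:

Function MakeCirclePoints(nPoints)
	Variable nPoints
	Make/O/D/N=(nPoints) xw, yw
	Variable i, rr, theta
	for(i = 0; i < nPoints; i += 1)
		rr = sqrt(0.5 + enoise(0.5))	// sqrt so that points are uniform over the area, not bunched at the centre
		theta = pi + enoise(pi)	// uniform angle between 0 and 2*pi
		xw[i] = rr * cos(theta)
		yw[i] = rr * sin(theta)
	endfor
End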

This gives a nice set of points, 1000 shown here.

[Image: 1000 random points in a circle]

And here is the version inside a sphere. This code has variable radius for the sphere.

[Image: Igor code for points in a sphere]
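
Again, this is a sketch rather than the original code; the same idea extends to a sphere with a variable radius:

Function MakeSpherePoints(nPoints, radius)
	Variable nPoints, radius
	Make/O/D/N=(nPoints) xw, yw, zw
	Variable i, rr, theta, phi
	for(i = 0; i < nPoints; i += 1)
		rr = radius * (0.5 + enoise(0.5))^(1/3)	// cube root so that points are uniform over the volume
		theta = acos(enoise(1))	// enoise(1) is uniform between -1 and 1, giving a uniform cos(theta)
		phi = pi + enoise(pi)	// uniform angle between 0 and 2*pi
		xw[i] = rr * sin(theta) * cos(phi)
		yw[i] = rr * sin(theta) * sin(phi)
		zw[i] = rr * cos(theta)
	endfor
End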

The three waves (xw,yw,zw) can be concatenated and displayed in a Gizmo. The code just plots out the three views.

[Image: three views of random points in a sphere]

My code uses var + enoise(var) to get a random variable from 0 to var. This is because enoise goes from -var to +var. There is an interesting discussion about whether this is a truly flat PDF here.

This is part of a bigger project where I’ve had invaluable help from Tom Honnor from Statistics.

This post is part of a series on esoterica in computer programming.

Voice Your Opinion: Editors shopping for preprints is the future

Today I saw a tweet from Manuel Théry (an Associate Editor at Mol Biol Cell), which said that he heard that the Editor-in-Chief of MBoC, David Drubin, shops for interesting preprints on bioRxiv to encourage the authors to submit to MBoC. This is not a surprise to me. I’ve read that authors of preprints on bioRxiv have been approached by journal Editors previously (here and here; there are many more). I’m pleased that David is forward-thinking and that MBoC are doing this actively.

I think this is the future.

Why? If we ignore for a moment the “far future” which may involve the destruction of most journals, leaving a preprint server and a handful of subject-specific websites which hunt down and feature content from the server and co-ordinate discussions and overviews of current trends… I actually think this is a good idea for the “immediate future” of science and science publishing. Two reasons spring to mind.

  1. Journals would be crazy to miss out: The manuscripts that I am seeing on bioRxiv are not stuff that’s been dumped there with no chance of “real publication”. This stuff is high profile. I mean that in the following sense: the work in my field that has been posted is generally interesting, it is from labs that do great science, and it is as good as work in any journal (obviously). For some reason I have found myself following what is being deposited here more closely than at any real journal. Journals would be crazy to miss out on this stuff.
  2. Levelling the playing field: For better or worse, papers are judged on where they are published. The thing that bothers me most about this is that manuscripts are only submitted to one or a few journals before “finding their home”. This process is highly noisy, and it means that if we accept that there is a journal hierarchy, your paper may or may not be deserving of the kudos it receives in its resting place. If all journals actively scour the preprint server(s), the authors can then pick the “highest bidder”. This would make things fairer in the sense that all journals in the hierarchy had a chance to consider the paper, and its resting place may actually reflect its true quality.

I don’t often post opinions here, but I thought this would take more than 140 characters to explain. If you agree or disagree, feel free to leave a comment!

Edit @ 11:46 16-05-26 Pedro Beltrao pointed out that this idea is not new; see a post of his from 2007.

Edit 16-05-26 Misattributed the track to Extreme Noise Terror (corrected). Also added some links thanks to Alexis Verger.

The post title comes from “Voice Your Opinion” by Unseen Terror. The version I have is from a Peel sessions compilation “Hardcore Holocaust”.

If I Can’t Change Your Mind

I have written previously about Journal Impact Factors (here and here). The response to these articles has been great and earlier this year I was asked to write something about JIFs and citation distributions for one of my favourite journals. I agreed and set to work.

Things started off so well. A title came straight to mind. In the style of quantixed, I thought The Number of The Beast would be amusing. I asked for opinions on Twitter and got an even better one from Scott Silverman (@sksilverman): Too Many Significant Figures, Not Enough Significance. Next, I found an absolute gem of a quote to kick off the piece. It was from the eminently quotable Sydney Brenner.

Before we develop a pseudoscience of citation analysis, we should remind ourselves that what matters absolutely is the scientific content of a paper and that nothing will substitute for either knowing it or reading it.

That quote was from a Loose Ends piece that Uncle Syd penned for Current Biology in 1995. Wow, 1995… that is quite a few years ago I thought to myself. Never mind. I pressed on.

There’s a lot of literature on JIFs, research assessment and in fact there are whole fields of scholarly activity (bibliometrics) devoted to this kind of analysis. I thought I’d better look back at what has been written previously. The “go to” paper for criticism of JIFs is Per Seglen’s analysis in the BMJ, published in 1997. I re-read this and I can recommend it if you haven’t already seen it. However, I started to feel uneasy. There was not much that I could add that hadn’t already been said, and what’s more it had been said 20 years ago.

Around about this time I was asked to review some fellowship applications for another EU country. The applicants had to list their publications, along with the JIF. I found this annoying. It was as if SF-DORA never happened.

There have been so many articles, blog posts and more written on JIFs. Why has nothing changed? It was then that I realised that it doesn’t matter how many things are written – however coherently argued – people like JIFs and they like to use them for research assessment. I was wasting my time writing something else. Sorry if this sounds pessimistic. I’m sure new trainees can be reached by new articles on this topic, but acceptance of JIF as a research assessment tool runs deep. It is like religious thought. No amount of atheist writing, no matter how forceful, cogent, whatever, will change people’s minds. That way of thinking is too deeply ingrained.

As the song says, “If I can’t change your mind, then no-one will”.

So I declared defeat and told the journal that I felt like I had said all that I could already say on my blog and that I was unable to write something for them. Apologies to all like-minded individuals for not continuing to fight the good fight.

But allow me one parting shot. I had a discussion on Twitter with a few people, one of whom said they disliked the “JIF witch hunt”. This caused me to think about why the JIF has hung around for so long and why it continues to have support. It can’t be that so many people are statistically illiterate or that they are unscientific in choosing to ignore the evidence. What I think is going on is a misunderstanding. Criticism of a journal metric as being unsuitable to judge individual papers is perceived as an attack on journals with a high JIF. Now, for good or bad, science is elitist and we are all striving to do the best science we can. Striving for the best, for many scientists, means aiming to publish in journals which happen to have a high JIF. So an attack on JIFs as a research assessment tool feels like an attack on what scientists are trying to do every day.

Because of this intense focus on high-JIF journals… what people don’t appreciate is that the reality is much different. The distribution of JIFs across journals is as skewed as the citation distributions that underlie the metric itself. What this means is that focussing on a minuscule fraction of papers appearing in high-JIF journals is missing the point. Most papers are in journals with low JIFs. As I’ve written previously, papers in journals with a JIF of 4 get similar citations to those in a journal with a JIF of 6. So the JIF tells us nothing about citations to the majority of papers and it certainly can’t predict the impact of these papers, which are the majority of our scientific output.

So what about those fellowship applicants? All of them had papers in journals with low JIFs (<8). The applicants’ papers were indistinguishable in that respect. What advice would I give to people applying to such a scheme? Well, I wouldn’t advise not giving the information asked for. To be fair to the funding body they also asked for number of citations for each paper, but for papers that are only a few months old, this number is nearly always zero. My advice would be to try and make sure that your paper is available freely for anyone to read. Many of the applicants’ papers were outside my expertise and so the title and abstract didn’t tell me much about the significance of the paper. So I looked at some of these papers to look at the quality of the data in there… if I had access. Applicants who had published in closed access journals are at a disadvantage here because if I couldn’t download the paper then it was difficult to assess what they had been doing.

I was thinking that this post would be a meta-meta-blogpost. Writing about an article which was written about something I wrote on my blog. I suppose it still is, except the article was never finished. I might post again about JIFs, but for now I doubt I will have anything new to say that hasn’t already been said.

The post title is taken from “If I Can’t Change Your Mind” by Sugar from their LP Copper Blue. Bob Mould was once asked about song-writing and he said that the perfect song was like a maths puzzle (I can’t find a link to support this, so this is from memory). If you are familiar with this song, songwriting and/or mathematics, then you will understand what he means.

Edit @ 08:22 16-05-20 I found an interview with Bob Mould where he says song-writing is like city-planning. Maybe he just compares song-writing to lots of different things in interviews. Nonetheless I like the maths analogy.

Adventures in code

An occasional series in esoteric programming issues.

As part of a larger analysis project I needed to implement a short program to determine the closest distance of two line segments in 3D space. This will be used to sort out which segments to compare… like I say, part of a bigger project. The best method to do this is to find the closest distance from one segment to the other when the other one is represented as an infinite line. You can then check whether that point is beyond the segment; if it is, you use the limits of the segment to calculate the distance. There’s a discussion on stackoverflow here. The solutions point to one in C++ and one in MATLAB. The C++ version is easiest to port to Igor due to the similarity of the languages, but the explanation of the MATLAB code was more approachable. So I ported that to Igor to figure out how it works.

The MATLAB version is:

>> P = [-0.43256      -1.6656      0.12533]
P =
   -0.4326   -1.6656    0.1253
>> Q = [0.28768      -1.1465       1.1909]
Q =
    0.2877   -1.1465    1.1909
>> R = [1.1892    -0.037633      0.32729]
R =
    1.1892   -0.0376    0.3273
>> S = [0.17464     -0.18671      0.72579]
S =
    0.1746   -0.1867    0.7258
>> N = null(P-Q)
N =
   -0.3743   -0.7683
    0.9078   -0.1893
   -0.1893    0.6115
>> r = (R-P)*N
r =
    0.8327   -1.4306
>> s = (S-P)*N
s =
    1.0016   -0.3792
>> n = (s - r)*[0 -1;1 0];
>> n = n/norm(n);
>> n
n =
    0.9873   -0.1587
>> d = dot(n,r)
d =
    1.0491
>> d = dot(n,s)
d =
    1.0491
>> v = dot(s-r,d*n-r)/dot(s-r,s-r)
v =
    1.2024
>> u = (Q-P)'\((S - (S*N)*N') - P)'
u =
    0.9590
>> P + u*(Q-P)
ans =
    0.2582   -1.1678    1.1473
>> norm(P + u*(Q-P) - S)
ans =
    1.0710

and in IgorPro:

Function MakeVectors()
	Make/O/D/N=(1,3) matP={{-0.43256},{-1.6656},{0.12533}}
	Make/O/D/N=(1,3) matQ={{0.28768},{-1.1465},{1.1909}}
	Make/O/D/N=(1,3) matR={{1.1892},{-0.037633},{0.32729}}
	Make/O/D/N=(1,3) matS={{0.17464},{-0.18671},{0.72579}}
End

Function DoCalcs()
	WAVE matP,matQ,matR,matS
	MatrixOp/O tempMat = matP - matQ
	MatrixSVD tempMat
	Make/O/D/N=(3,2) matN
	Wave M_VT
	matN = M_VT[p][q+1]
	MatrixOp/O tempMat2 = (matR - matP)
	MatrixMultiply tempMat2, matN
	Wave M_product
	Duplicate/O M_product, mat_r
	MatrixOp/O tempMat2 = (matS - matP)
	MatrixMultiply tempMat2, matN
	Duplicate/O M_product, mat_s
	Make/O/D/N=(2,2) MatUnit
	matUnit = {{0,1},{-1,0}}
	MatrixOp/O tempMat2 = (mat_s - mat_r)
	MatrixMultiply tempMat2,MatUnit
	Duplicate/O M_Product, Mat_n
	Variable nn
	nn = norm(mat_n)
	MatrixOP/O new_n = mat_n / nn
	//new_n is now a vector with unit length
	Variable dd
	dd = MatrixDot(new_n,mat_r)
	//print dd
	//dd = MatrixDot(new_n,mat_s)
	//print dd
	dd = abs(dd)
	// now find v
	// v = dot(s-r,d*n-r)/dot(s-r,s-r)
	variable vv
	MatrixOp/O mat_s_r = mat_s - mat_r
	MatrixOp/O tempmat2 = dd * mat_n - mat_r
	vv = MatrixDot(mat_s_r,tempmat2) / MatrixDot(mat_s_r,mat_s_r)
	//print vv
	//because vv > 1 the closest point is s (because rs(1) = s) and therefore the closest point on RS to the infinite line PQ is S
	//what about the point on PQ is this also outside the segment?
	// u = (Q-P)'\((S - (S*N)*N') - P)'
	variable uu
	MatrixOp/O matQ_P = matQ - matP
	MatrixTranspose matQ_P
	//MatrixOP/O tempMat2 = ((matS - (matS * matN) * MatrixTranspose(MatN)) - MatrixTranspose(matP))
	Duplicate/O MatN, matNprime
	MatrixTranspose matNprime
	MatrixMultiply matS, matN
	Duplicate/O M_Product, matSN
	MatrixMultiply M_Product, matNprime
	MatrixOP/O tempMat2 = ((matS - M_product) - matP)
	MatrixTranspose tempMat2
	MatrixLLS matQ_P tempMat2
	Wave M_B
	uu = M_B[0]
	// find point on PQ that is closest to RS
	// P + u*(Q-P)
	MatrixOp/O matQ_P = matQ - matP
	MatrixOp/O matPoint = MatP + (uu * MatQ_P)
	MatrixOP/O distpoint = matPoint - matS
	Variable dist
	dist = norm(distpoint)
	Print dist
End

The sticking points were finding the Igor equivalents of

  • null()
  • norm()
  • dot()
  • \ otherwise known as mldivide

Which are:

  • MatrixSVD (answer is in the final two columns of wave M_VT)
  • norm()
  • MatrixDot()
  • MatrixLLS

MatrixLLS wouldn’t accept a mix of single-precision and double-precision waves, so this needed to be factored into the code.
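
One way to deal with that is to force the relevant waves to double precision before calling MatrixLLS, for example (a sketch using the wave names from the function above):

Redimension/D matQ_P, tempMat2	// converts both waves to double precision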

As you can see, the Igor code is much longer. Overall, I think MATLAB handles Matrix Math better than Igor. It is certainly easier to write. I suspect that there are a series of Igor operations that can do what I am trying to do here, but this was an exercise in direct porting.

More work is needed to condense this down and also deal with every possible case. Then it needs to be incorporated into the bigger program. SIMPLE! Anyway, hope this helps somebody.

The post title is taken from the band Adventures In Stereo.
