I use a Garmin 800 GPS device to log my cycling activity. including my commutes. Since I have now built up nearly 4 years of cycling the same route, I had a good dataset to look at how accurate the device is.
I wrote some code to import all of the rides tagged with commute in rubiTrack 4 Pro (technical details are below). These tracks needed categorising so that they could be compared. Then I plotted them out as a gizmo in Igor Pro and compared them to a reference data set which I obtained via GPS Visualiser.
The reference dataset is black. Showing the “true” elevation at those particular latitude and longitude coordinates. Plotted on to that are the commute tracks coloured red-white-blue according to longitude. You can see that there are a range of elevations recorded by the device, apart from a few outliers they are mostly accurate but offset. This is strange because I have the elevation of the start and end points saved in the device and I thought it changed the altitude it was measuring to these elevation positions when recording the track, obviously not.
To look at the error in the device I plotted out the difference in the measured altitude at a given location versus the true elevation. For each route (to and from work) a histogram of elevation differences is shown to the right. The average difference is 8 m for the commute in and 4 m for the commute back. This is quite a lot considering that all of this is only ~100 m above sea level. The standard deviation is 43 m for the commute in and 26 m for the way back.
This post at VeloViewer comparing GPS data on Strava from pro-cyclists riding the St15 of 2015 Giro d’Italia sprang to mind. Some GPS devices performed OK, whereas others (including Garmin) did less well. The idea in that post is that rain affects the recording of some units. This could be true and although I live in a rainy country, I doubt it can account for the inaccuracies recorded here. Bear in mind that that stage was over some big changes in altitude and my recordings, very little. On the other hand, there are very few tracks in that post whereas there is lots of data here.
It’s interesting that the data is worse going in to work than coming back. I do set off quite early in the morning and it is colder etc first thing which might mean the unit doesn’t behave as well for the commute to work. Both to and from work tracks vary most in lat/lon recordings at the start of the track which suggests that the unit is slow to get an exact location – something every Garmin user can attest to. Although I always wait until it has a fix before setting off. The final two plots show what the beginning of the return from work looks like for location accuracy (travelling east to west) compared to a midway section of the same commute (right). This might mean the the inaccuracy at the start determines how inaccurate the track is. As I mentioned, the elevation is set for start and end points. Perhaps if the lat/lon is too far from the endpoint it fails to collect the correct elevation.
I’m disappointed with the accuracy of the device. However, I have no idea whether other GPS units (including phones) would outperform the Garmin Edge 800 or even if later Garmin models are better. This is a good but limited dataset. A similar analysis would be possible on a huge dataset (e.g. all strava data) which would reveal the best and worst GPS devices and/or the best conditions for recording the most accurate data.
I described how to get GPX tracks from rubiTrack 4 Pro into Igor and how to crunch them in a previous post. I modified the code to get elevation data out from the cycling tracks and generally made the code slightly more robust. This left me with 1,200 tracks. My commutes are varied. I frequently go from A to C via B and from C to A via D which is a loop (this is what is shown here). But I also go A to C via D, C to A via B and then I also often extend the commute to include 30 km of Warwickshire countryside. The tracks could be categorized by testing whether they began at A or C (this rejected some partial routes) and then testing whether they passed through B or D. These could then be plotted and checked visually for any routes which went off course, there were none. The key here is to pick the right B and D points. To calculate the differences in elevation, the simplest thing was to get GPS Visualiser to tell me what the elevation should be for all the points I had. I was surprised that the API could do half a million points without complaining. This was sufficient to do the rest. Note that the comparisons needed to be done as lat/lon versus elevation because due to differences in speed, time or trackpoint number lead to inherent differences in lat/lon (and elevation). Note also due to the small scale I didn’t bother converting lat/lon into flat earth kilometres.
The post title comes from “Elevation” by Television, which can be found on the classic “Marquee Moon” LP.
Towards the end of 2015, I started distance running. I thought it’d be fun to look at the frequency of my runs over the course of 2016.
Most of my runs were recorded with a GPS watch. I log my cycling data using Rubitrack, so I just added my running data to this. This software is great but to do any serious number crunching, other software is needed. Yes, I know that if I used strava I can do lots of things with my data… but I don’t. I also know that there are tools for R to do this, but I wrote something in Igor instead. The GitHub repo is here. There’s a technical description below, as well as some random thoughts on running (and cycling).
The animation shows the tracks I recorded as 2016 rolled by. The routes won’t mean much to you, but I can recognise most of them. You can see how I built up the distance to run a marathon and then how the runs became less frequent through late summer to October. I logged 975 km with probably another 50 km or so not logged.
To pull the data out of rubiTrack 4 Pro is actually quite difficult since there is no automated export. An applescript did the job of going through all the run activities and exporting them as gpx. There is an API provided by Garmin to take the data straight from the FIT files recorded by the watch, but everything is saved and tagged in rubiTrack, so gpx is a good starting point. GPX is an xml format which can be read into Igor using XMLutils XOP written by andyfaff. Previously, I’ve used nokogiri for reading XML, but this XOP keeps everything within Igor. This worked OK, but I had some trouble with namespaces which I didn’t resolve properly and what is in the code is a slight hack. I wrote some code which imported all the files and then processed the time frame I wanted to look at. It basically looks at a.m. and p.m. for each day in the timeframe. Igor deals with date/time nicely and so this was quite easy. Two lookups per day were needed because I often went for two runs per day (run commuting). I set the lat/lon at the start of each track as 0,0. I used the new alpha tools in IP7 to fade the tracks so that they decay away over time. They disappear with 1/8 reduction in opacity over a four day period. Igor writes out to mov which worked really nicely, but wordpress can’t host movies, so I added a line to write out TIFFs of each frame of the animation and assembled a nice gif using FIJI.
Getting started with running
Getting into running was almost accidental. I am a committed cyclist and had always been of the opinion: since running doesn’t improve aerobic cycling performance (only cycling does that), any activity other than cycling is a waste of time. However, I realised that finding time for cycling was getting more difficult and also my goal is to keep fit and not to actually be a pro-cyclist, so running had to be worth a try. Roughly speaking, running is about three times more time efficient compared to cycling. One hour of running approximates to three hours of cycling. I thought, I would just try it. Over the winter. No more than that. Of course, I soon got the running bug and ran through most of 2016. Taking part in a few running events (marathon, half marathons, 10K). A quick four notes on my experience.
- The key thing to keeping running is staying healthy and uninjured. That means building up distance and frequency of running very slowly. In fact, the limitation to running is the body’s ability to actually do the distance. In cycling this is different, as long as you fuel adequately and you’re reasonably fit, you could cycle all day if you wanted. This not true of running, and so, building up to doing longer distances is essential and the ramp up shouldn’t be rushed. Injuries will cost you lost weeks on a training schedule.
- There’s lots of things “people don’t tell you” about running. Blisters and things everyone knows about, but losing a toenail during a 20 km run? Encountering runner’s GI problems? There’s lots of surprises as you start out. Joining a club or reading running forums probably helps (I didn’t bother!). In case you are wondering, the respective answers are getting decent shoes fitted and well, there is no cure.
- Going from cycling to running meant going from very little upper body mass to gaining extra muscle. This means gaining weight. This is something of a shock to a cyclist and seems counterintuitive, since more activity should really equate to weight loss. I maintained cycling through the year, but was not expecting a gain of ~3 kilos.
- As with any sport, having something to aim for is essential. Training for training’s sake can become pointless, so line up something to shoot for. Sign up for an event or at least have an achievement (distance, average speed) in your mind that you want to achieve.
So there you have it. I’ll probably continue to mix running with cycling in 2017. I’ll probably extend the repo to do more with cycling data if I have the time.
The post title is taken from “Colours Running Out” by TOY from their eponymous LP.
I needed to generate a uniform random distribution of points inside a circle and, later, a sphere. This is part of a bigger project, but the code to do this is kind of interesting. There were no solutions available for IgorPro, but stackexchange had plenty of examples in python and mathematica. There are many ways to do this. The most popular seems to be to generate a uniform random set of points in a square or cube and then discard those that are greater than the radius away from the origin. I didn’t like this idea, because I needed to extend it to spheroids eventually, and as I saw it the computation time saved was minimal.
Here is the version for points in a circle (radius = 1, centred on the origin).
This gives a nice set of points, 1000 shown here.
And here is the version inside a sphere. This code has variable radius for the sphere.
The three waves (xw,yw,zw) can be concatenated and displayed in a Gizmo. The code just plots out the three views.
My code uses var + enoise(var) to get a random variable from 0,var. This is because enoise goes from -var to +var. There is an interesting discussion about whether this is a truly flat PDF here.
This is part of a bigger project where I’ve had invaluable help from Tom Honnor from Statistics.
This post is part of a series on esoterica in computer programming.
2016 was the 400 year anniversary of William Shakespeare’s death. Stratford-upon-Avon Rotary Club held the Shakespeare Marathon on the same weekend. Runners had an option of half or full marathon. There were apparently 3.5 K runners. Only 700 of whom were doing the full marathon. The chip results were uploaded last night and can be found here. Similar to my post on the Coventry Half Marathon, I thought I’d quickly analyse the data.
The best time was 02:34:51 by Adam Holland of Notfast. Fastest female runner was 3:14:39 by Josie Hinton of London Heathside.
Congrats to everyone who ran and thanks to the organisers and all the supporters out on the course.
The post title is taken from “Pledging My Time” a track from Blonde on Blonde by Bob Dylan
The end of the month sees the Coventry Half Marathon. I looked at what constitutes a good time over this course, based on 2015 results. I thought I’d post this here in case any one is interested.
The breakdown of runners by category for the 2015 event. Male Senior (MSEN) category has the most runners, constituting a wide age grouping. There were 3565 runners in total, 5 in an undetermined category and 9 DNFs. These 14 were not included in the analysis.
The best time last year was 01:10:21!
Good luck to everyone running this (or any other event) this year.
Edit: The 2016 Coventry Half Marathon happened today. I’m updating this post with the new data.
The width of violins has no special significance compared to 2015. Fastest time this year was 1:08:40 in the MSEN category.
There were more runners this year than last (4212 finishers), across all categories. Also this year there was a wheelchair category, which is not included here as there were only four competitors. FWIW, I placed somewhere in the first violin, in the lower whisker :-).
Congrats to everyone who ran and thanks to the all the supporters out on the course.
The post title is taken from “Pledging My Time” a track from Blonde on Blonde by Bob Dylan
This post follows on from a previous post on citation distributions and the wrongness of Impact Factor.
Stephen Curry had previously made the call that journals should “show us the data” that underlie the much-maligned Journal Impact Factor (JIF). However, this call made me wonder what “showing us the data” would look like and how journals might do it.
What citation distribution should we look at? The JIF looks at citations in a year to articles published in the preceding 2 years. This captures a period in a paper’s life, but it misses “slow burner” papers and also underestimates the impact of papers that just keep generating citations long after publication. I wrote a quick bit of code that would look at a decade’s worth of papers at one journal to see what happened to them as yearly cohorts over that decade. I picked EMBO J to look at since they have actually published their own citation distribution, and also they appear willing to engage with more transparency around scientific publication. Note that, when they published their distribution, it considered citations to papers via a JIF-style window over 5 years.
I pulled 4082 papers with a publication date of 2004-2014 from Web of Science (the search was limited to Articles) along with data on citations that occurred per year. I generated histograms to look at distribution of citations for each year. Papers published in 2004 are in the top row, papers from 2014 are in the bottom row. The first histogram shows citations in the same year as publication, in the next column, the following year and so-on. Number of papers is on y and on x the number of citations. Sorry for the lack of labelling! My excuse is that my code made a plot with “subwindows”, which I’m not too familiar with.
What is interesting is that the distribution changes over time:
- In the year of publication, most papers are not cited at all, which is expected since there is a lag to publication of papers which can cite the work and also some papers do not come out until later in the year, meaning the likelihood of a citing paper coming out decreases as the year progresses.
- The following year most papers are picking up citations: the distribution moves rightwards.
- Over the next few years the distribution relaxes back leftwards as the citations die away.
- The distributions are always skewed. Few papers get loads of citations, most get very few.
Although I truncated the x-axis at 40 citations, there are a handful of papers that are picking up >40 cites per year up to 10 years after publication – clearly these are very useful papers!
To summarise these distributions I generated the median (and the mean – I know, I know) number of citations for each publication year-citation year combination and made plots.
The mean is shown on the left and median on the right. The layout is the same as in the multi-histogram plot above.
Follow along a row and you can again see how the cohort of papers attracts citations, peaks and then dies away. You can also see that some years were better than others in terms of citations, 2004 and 2005 were good years, 2007 was not so good. It is very difficult, if not impossible, to judge how 2013 and 2014 papers will fare into the future.
What was the point of all this? Well, I think showing the citation data that underlie the JIF is a good start. However, citation data are more nuanced than the JIF allows for. So being able to choose how we look at the citations is important to understand how a journal performs. Having some kind of widget that allows one to select the year(s) of papers to look at and the year(s) that the citations came from would be perfect, but this is beyond me. Otherwise, journals would probably elect to show us a distribution for a golden year (like 2004 in this case), or pick a window for comparison that looked highly favourable.
Finally, I think journals are unlikely to provide this kind of analysis. They should, if only because it is a chance for a journal to show how it publishes many papers that are really useful to the community. Anyway, maybe they don’t have to… What this quick analysis shows is that it can be (fairly) easily harvested and displayed. We could crowdsource this analysis using standardised code.
Below is the code that I used – it’s a bit rough and would need some work before it could be used generally. It also uses a 2D filtering method that was posted on IgorExchange by John Weeks.
The post title is taken from “The Great Curve” by Talking Heads from their classic LP Remain in Light.
In the lab we have been doing quite a bit of analysis of cell migration in 2D. Typically RPE1 cells migrating on fibronectin-coated glass. There are quite a few tools out there to track cell movements and to analyse their migration. Naturally, none of these did quite what we wanted and none fitted nicely into our analysis workflow. This meant writing something from scratch in IgorPro. You can access the code from my GitHub pages.
We’ve previously published a paper doing these kinds of analysis, but this code is all-new, faster and more efficient
The CellMigration repo contains an ipf that has three functions which will import tracks into Igor and analyse them. We use manual tracking in ImageJ/FIJI to obtain the co-ordinates for analysis, this is the only non-Igor part. I figured it was not worth porting this part too. Instructions are given in the repo and are hopefully self-explanatory. Here’s a screenshot of a typical experiment.
The post title is taken from the track “At a Crawl” by The Melvins from their Ozma LP
I was searching for the excitation and emission spectra for mNeonGreen. I was able to find an image, but no values for the spectra. Here is an approximation of the spectra (xlsx format, still haven’t figured out csv for wordpress).
I got these values using IgorThief.ipf a very handy tool that allows the extraction of XY coordinates from a bitmapped plot (below).
mNeonGreen is available from Allele Biotechnologies.
Here is a great site for comparing fluorescent protein properties.
Edit @ 06:54 4/7/14: A web-based data thief was suggested by @dozenoaks
The post title is taken from “Green is the Colour” from the Pink Floyd LP “More”
In this post I’ll describe a computational method for splitting two sides of a cell biological structure. It’s a simple method that relies on principal component analysis, otherwise known as PCA. Like all things mathematical there are some great resources on the web, if you want to understand this operation in more detail (for example, this great post by Lior Pachter). PCA can applied to many biological problems, you’ve probably seen it used to find patterns in large data sets, e.g. from proteomic studies. It can also be useful for analysing microscopy data. Since our analysis using this method is unlikely to make it into print any time soon, I thought I’d put it up on Quantixed.
During mitosis, a cell forms a mitotic spindle to share copied chromosomes equally to the two new cells. Our lab is working on how this process works and how it goes wrong in cancer. The chromosomes attach to the spindle via kinetochores and during prometaphase they are moved to the middle of the cell. Here, the chromosomes are organised into a disc-like structure called the metaphase plate. The disc is thin in the direction of the spindle axis, but much larger in width and height. To examine the spatial distribution of kinetochores on the plate we wanted a way to approximately separate kinetochores on one side if the plate from the other.
Kinetochores can be easily detected in 3D confocal images of mitotic cells by particle analysis. Kinetochores are easily stained and appear as bright spots that a computer can pick out (we use Imaris for this). The cartesian coordinates of each detected kinetochore were saved as csv and fed into IgorPro. A procedure could then be run which works in three steps. The code is shown at the bottom, it is wrapped in further code that deals with multiple datasets from many cells/experiments etc. The three steps are:
- Analysis on each subset
I’ll describe each step and how it works.
1. Principal component analysis
This is used to find the 3rd eigenvector, which can be used to define a plane passing through the centre of the plate. This plane is used for division.
Now, because the metaphase plate is a disc it has three dimensions, the third of which – “thickness” – is the smallest. PCA will find the principal component, i.e. the direction in which there is most variance. Orthogonal to that is the second biggest variance and orthogonal to that direction is the smallest. These directions are called eigenvectors and their magnitude is the eigenvalue. As there are three dimensions to the data we can get all three eigenvectors out and the 3rd eigenvector corresponds to thickness of the metaphase plate. Metaphase plates in cells grown on coverslips are orientated similarly, but the cells themselves are at random orientations. PCA takes no notice of this and can simply reveal the direction of the smallest dimension of a 3D structure. The movie shows this in action for a simulated data set. The black spots are arranged in a disk shape about the origin. They are rotated about x by 45° (the blue spots). We then run PCA and show the eigenvectors as unit vectors (red lines). The 3rd eigenvector is normal to the plane of division, i.e. the 1st and 2nd eigenvectors lie on the plane of division.
Also, the centroid needs to be defined. This is simply the cartesian coordinates for the average of each dimension. It is sometimes referred to as the mean vector. In the example this was the origin, in reality this will depend on the position and the overall height of the cell.
A much longer method to get the eigenvectors is to define the variance-covariance matrix (sometimes called the dispersion matrix) for each dimension, for all kinetochores and then do an eigenvector decomposition on the matrix. PCA is one command, whereas the matrix calculation would be an extra loop followed by an additional command.
The distance of each kinetochore to the plane that we defined is calculated. If it is a positive value then the kinetochore lies on the same side as the normal vector (defined above). If it is negative then it is on the other side. The maths behind how to do this are in section 10.3.1 of Geometric Tools for Computer Graphics by Schneider & Eberly (starting on p. 374). Google it, there is a PDF version on the web. I’ll save you some time, you just need one equation that defines a plane,
Where the unit normal vector is [a b c] and a point on the plane is [x y z]. We’ll use the coordinates of the centroid as a point on the plane to find d. Now that we know this, we can use a similar equation to find the distance of any point to the plane,
Results for each kinetochore are used to sort each side of the plane into separate waves for further calculation. In the movie below, the red dots and blue dots show the positions of the kinetochores on either side of the division plane. It’s a bit of an optical illusion, but the cube is turning in a right hand fashion.
3. Analysis on each subset
Now that the data have been sorted, separate calculations can be carried out on each. In the example, we were interested in how the kinetochores were organised spatially and so we looked at the distance to nearest neighbour. This is done by finding the Euclidean distance from each kinetochore to every other kinetochore and putting the lowest value for each kinetochore into a new wave. However, this calculation can be anything you want. If there are further waves that specify other properties of the kinetochores, e.g. brightness, then these can be similarly processed here.
The code in its present form (not very streamlined) was fast and could be run on every cell from a number of experiments, reading out positional data for 10,000 kinetochores in ~2 s. For QC it is possible to display the two separated coordinated sets to check that the division worked fine (see above). The power of this method is that it doesn’t rely on imaging spindle poles or anything else to work out the orientation of the metaphase plate. It works well for metaphase cells, but cells with any misaligned chromosomes ruin the calculation. It is possible to remove these and still fit the plane, but for our analysis we focused on cells at metaphase with a defined plate.
What else can it be used for?
Other structures in the cell can be segregated in a similar way. For example, the Golgi apparatus has a trans and a cis side, which could be similarly divided (although using the 2nd eigenvector as normal to the plane, rather than the 3rd).
Acknowledgements: I’d like to thank A.G. at WaveMetrics Inc. for encouraging me to try PCA rather than my dispersion matrix approach.
The post title comes from “Division Day” a great single by Elliott Smith.
Having recently got my head around violin plots, I thought I would explain what they are and why you might want to use them.
There are several options when it comes to plotting summary data. I list them here in order of granularity, before describing violin plots and how to plot them in some detail.
This is the mainstay of most papers in my field. Typically, a bar representing the mean value that’s been measured is shown with an error bar which shows either the standard error of the mean, the standard deviation, or more rarely a confidence interval. The two data series plotted in all cases is the waiting time for Old Faithful eruptions (waiting), a classic dataset from R. I have added some noise to waiting (waiting_mod) for comparison. I think it’s fair to say that most people feel that the bar chart has probably had its day and that we should be presenting our data in more informative ways*.
Pros: compact, easy to tell differences between groups
Cons: hides the underlying distribution, obscures the n number
The box plot – like many things in statistics – was introduced by Tukey. It’s sometimes known as a Tukey plot, or a box-and-whiskers plot. The idea was to give an impression of the underlying distribution without showing a histogram (see below). Histograms are great, but when you need to compare many distributions they do not overlay well and take up a lot of space to show them side-by-side. In the simplest form, the “box” is the interquartile range (IQR, 25th and 75th percentiles) with a line to show the median. The whiskers show the 10th and 90th percentiles. There are many variations on this theme: outliers can be shown or not, the whiskers may show the limits of the dataset (or something else), the boxes can be notched or their width may represent the sample size…
Pros: compact, easy to tell differences between groups, shows normality/skewness
Cons: hides multimodal data, sometimes obscures the n number, many variations
A histogram is a method of showing the distribution of a dataset and was introduced by Pearson. The number of observations within a bin are counted and plotted. The ‘bars’ sit next to each other, because the variable being measured is continuous. The variable being measured is on the x-axis, rather than the category (as in the other plots).
Often the area of all the bars is normalised to 1 in order to assess the distributions without being confused by differences in sample size. As you can see here, “waiting” is bimodal. This was hidden in the bar chart and in the bot plot.
Related to histograms are other display types such as stemplots or stem-and-leaf plots.
Pros: shows multimodal data, shows normality/skewness clearly
Cons: not compact, difficult to overlay, bin size and position can be misleading
Scatter dot plot
It’s often said that if there are less than 10 data points, then best practice is to simply show the points. Typically the plot is shown together with a bar to show the mean (or median) and maybe with error bars showing s.e.m., s.d., IQR. There are a couple of methods of plotting the points, because they need to be scattered in x value in order to be visualised. Adding random noise is one approach, but this looks a bit messy (top). A symmetrical scatter can be introduced by binning (middle) and a further iteration is to bin the y values rather than showing their true location (bottom). There’s a further iteration which constrains the category width and overlays multiple points, but again the density becomes difficult to see.
These plots still look quite fussy, the binned version is the clearest but then we are losing the exact locations of the points, which seems counterintuitive. Another alternative to scattering the dots is to show a rug plot (see below) where there is no scatter.
Pros: shows all the observations
Cons: can be difficult to assess the distribution
This type of plot was introduced in the software package NCSS in 1997 and described in this paper: Hintze & Nelson (1998) The American Statistician 52(2):181-4 [PDF]. As the title says, violin plots are a synergism between box plot and density trace. A thin box plot is shown together with a symmetrical kernel density estimate (KDE, see explanation below). The point is to be able to quickly assess the distribution. You can see that the bimodality of waiting in the plot, but there’s no complication of lots of points just a smooth curve to see the data.
Pros: shows multimodal data, shows normality/skewness unambiguously
Cons: hides n, not familiar to many readers.
* Why is the bar chart so bad and why should I show my data another way?
The best demonstration of why the bar chart is bad is Anscombe’s Quartet (the figure to the right is taken from the Wikipedia page). These four datasets are completely different, yet they all have the same summary statistics. The point is, you would never know unless you plotted the data. A bar chart would look identical for all four datasets.
Making Violin Plots in IgorPro
I wanted to make Violin Plots in IgorPro, since we use Igor for absolutely everything in the lab. I wrote some code to do this and I might make some improvements to it in the future – if I find the time! This was an interesting exercise, because it meant forcing myself to understand how smoothing is done. What follows below is an aide memoire, but you may find it useful.
What is a kernel density estimate?
A KDE is a non-parametric method to estimate a probability density function of a variable. A histogram can be thought of as a simplistic non-parametric density estimate. Here, a rectangle is used to represent each observation and it gets bigger the more observations are made.
What’s wrong with using a histogram as a KDE?
The following examples are taken from here (which in turn are taken from the book by Bowman and Azzalini described below). A histogram is simplistic. We lose the location of each datapoint because of binning. Histograms are not smooth and the estimate is very sensitive to the size of the bins and also the starting location of the first bin. The histograms to the right show the same data points (in the rug plot).
Using the same bin size, they result in very different distributions depending on where the first bin starts. My first instinct to generate a KDE was to simply smooth a histogram, but this is actually quite inaccurate as it comes from a lossy source. Instead we need to generate a real KDE.
How do I make a KDE?
To do this we place a kernel (a Gaussian is commonly used) at each data point. The rationale behind this is that each observation can be thought of as being representative of a greater number of observations. It sounds a bit bizarre to assume normality to estimate a density non-parametrically, but it works. We can sum all of the kernels to give a smoothed distribution: the KDE. Easy? Well, yes as long as you know how wide to make the kernels. To do this we need to find the bandwidth, h (also called the smoothing parameter).
It turns out that this is not completely straightforward. The answer is summarised in this book: Bowman & Azzalini (1997) Applied Smoothing Techniques for Data Analysis. In the original paper on violin plots, they actually do not have a good solution for selecting h for drawing the violins, and they suggest trying several different values for h. They recommend starting at ~15% of the data range as a good starting point. Obviously if you are writing some code, the process of selecting h needs to be automatic.
Optimising h is necessary because if h is too large, the estimate with be oversmoothed and features will be lost. If h is too small, then it will be undersmoothed and bumpy. The examples to the right (again, taken from Bowman & Azzalini, via this page) show examples of undersmoothed, oversmoothed and optimal smoothing.
An optimal solution to find h is
This is termed Silverman’s rule-of-thumb. If smoothing is needed in more than one dimension, the multidimensional version is
You might need multidimensional smoothing to contextualise more than one parameter being measured. The waiting data used above describes the time to wait until the next eruption from Old Faithful. The duration of the eruption is measured, and also the wait to the next eruption can be extracted, giving three parameters. These can give a 3D density estimate as shown here in the example.
The Bowman & Azzalini recommend that, if the distribution is long-tailed, using the median absolute deviation estimator is robust for .
where is the median of the sample. All of this is something you don’t need to worry about if you use R to plot violins, the implementation in there is rock solid having been written in S plus and then ported to R years ago. You can even pick how the h selection is done from
sm.density, or even modify the optimal h directly using
To get this working in IgorPro, I used some code for 1D KDE that was already on IgorExchange. It needed a bit of modification because it used
FastGaussTransform to sum the kernels as a shortcut. It’s a very fast method, but initially gave an estimate that seemed to be undersmoothed. I spent a while altering the formula for h, hence the detail above. To cut a long story short,
FastGaussTransform uses Taylor expansion of the Gauss transform and it just needed more terms to do this accurately. This is set with the
/TET flag. Note also, that in Igor the width of a Gauss is sigma*2^1/2.
OK, so how do I make a Violin for plotting?
I used the draw tools to do this and placed the violins behind an existing box plot. This is necessary to be able to colour the violins (apparently transparency is coming to Igor in IP7). The other half of the violin needs to be calculated and then joined by the DrawPoly command. If the violins are trimmed, i.e. cut at the limits of the dataset, then this required an extra point to be added. Without trimming, this step is not required. The only other issue is how wide the violins are plotted. In R, the violins are all normalised so that information about n is lost. In the current implementation, box width is 0.1 and the violins are normalised to the area under the curve*(0.1/2). So, again information on n is lost.
Ideas for developments of the Violin Plot method in IgorPro
- incorporate it into the ipf for making boxplots so that it is integrated as an option to ‘calculate percentiles’
- find a better solution for setting the width of the violin
- add other bandwidth options, as in R
- add more options for colouring the violins
What do you think? Did I miss something? Let me know in the comments.
Bowman, A.W. & Azzalini, A. (1997) Applied Smoothing Techniques for Data Analysis : The Kernel Approach with S-Plus Illustrations: The Kernel Approach with S-Plus Illustrations. Oxford University Press.
Hintze, J.L. & Nelson, R.D. (1998) Violin plots: A Box Plot-Density Trace Synergism. The American Statistician, 52:181-4.