We have a new paper out! You can access it here.
This paper really was a team effort. Faye Nixon and Tom Honnor are joint-first authors. Faye did most of the experimental work in the final months of her PhD and Tom came up with the idea for the mathematical modelling and helped to rewrite our analysis method in R. Other people helped in lots of ways. George did extra segmentation, rendering and movie making. Nick helped during the revisions of the paper. Ali helped to image samples… the list is quite long.
The paper in a nutshell
We used a 3D imaging technique called SBF-SEM to see microtubules in dividing cells, then used computers to describe their organisation.
Serial block face scanning electron microscopy. This method allows us to take an image of a cell and then remove a tiny slice, take another image and so on. We then have a pile of images which covers the entire cell. Next we need to put them back together and make some sense of them.
How do you do that?
We use a computer to track where all the microtubules are in the cell. In dividing cells – in mitosis – the microtubules are in the form of a mitotic spindle. This is a machine that the cell builds to share the chromosomes to the two new cells. It’s very important that this process goes right. If it fails, mistakes can lead to diseases such as cancer. Before we started, it wasn’t known whether SBF-SEM had the power to see microtubules, but we show in this paper that it is possible.
We can see lots of other cool things inside the cell too like chromosomes, kinetochores, mitochondria, membranes. We made many interesting observations in the paper, although the focus was on the microtubules.
So you can see all the microtubules, what’s interesting about that?
The interesting thing is that our resolution is really good, and is at a large scale. This means we can determine the direction of all the microtubules in the spindle and use this for understanding how well the microtubules are organised. Previous work had suggested that proteins whose expression is altered in cancer cause changes in the organisation of spindle microtubules. Our computational methods allowed us to test these ideas for the first time.
Resolution at a large scale, what does that mean?
The spindle is made of thousands of microtubules. With a normal light microscope, we can see the spindle but we can’t tell individual microtubules apart. There are improvements in light microscopy (called super-resolution) but even with those improvements, right in the body of the spindle it is still not possible to resolve individual microtubules. SBF-SEM can do this. It doesn’t have the best resolution available though. A method called Electron Tomography has much higher resolution. However, to image microtubules at this large scale (meaning for one whole spindle), it would take months or years of effort! SBF-SEM takes a few hours. Our resolution is better than light microscopy, worse than electron tomography, but because we can see the whole spindle and image more samples, it has huge benefits.
What mathematical modelling did you do?
Cells are beautiful things but they are far from perfect. The microtubules in a mitotic spindle follow a pattern, but don’t do so exactly. So what we did was to create a “virtual spindle” where each microtubule had been made perfect. It was a bit like “photoshopping” the cell. Instead of straightening the noses of actresses, we corrected the path of every microtubule. How much photoshopping was needed told us how imperfect the microtubule’s direction was. This measure – which was a readout of microtubule “wonkiness” – could be done on thousands of microtubules and tell us whether cancer-associated proteins really cause the microtubules to lose organisation.
The publication process
The paper is published in Journal of Cell Science and it was a great experience. Last November, we put up a preprint on this work and left it up for a few weeks. We got some great feedback and modified the paper a bit before submitting it to a journal. One reviewer gave us a long list of useful comments that we needed to address. However, the other two reviewers didn’t think our paper was a big enough breakthrough for that journal. Our paper was rejected*. This can happen sometimes and it is frustrating as an author because it is difficult for anybody to judge which papers will go on to make an impact and which ones won’t. One of the two reviewers thought that because the resolution of SBF-SEM is lower than electron tomography, our paper was not good enough. The other one thought that because SBF-SEM will not surpass light microscopy as an imaging method (really!**) and because EM cannot be done live (the cells have to be fixed), it was not enough of a breakthrough. As I explained above, the power is that SBF-SEM is between these two methods. Somehow, the referees weren’t convinced. We did some more work, revised the paper, and sent it to J Cell Sci.
J Cell Sci is a great journal which is published by Company of Biologists, a not-for-profit organisation who put a lot of money back into cell biology in the UK. They are preprint friendly, they allow the submission of papers in any format, and most importantly, they have a fast-track*** option. This allowed me to send on the reviews we had and including our response to them. They sent the paper back to the reviewer who had a list of useful comments and they were happy with the changes we made. It was accepted just 18 days after we sent it in and it was online 8 days later. I’m really pleased with the whole publishing experience with J Cell Sci.
* I’m writing about this because we all have papers rejected. There’s no shame in that at all. Moreover, it’s obvious from the dates on the preprint and on the JCS paper that our manuscript was rejected from another journal first.
** Anyone who knows something about microscopy will find this amusing and/or ridiculous.
*** Fast-track is offered by lots of journals nowadays. It allows authors to send in a paper that has been reviewed elsewhere with the peer review file. How the paper has been revised in light of those comments is assessed by at the Editor and one peer reviewer.
Parallel lines is of course the title of the seminal Blondie LP. I have used this title before for a blog post, but it matches the topic so well.
I use a Garmin 800 GPS device to log my cycling activity. including my commutes. Since I have now built up nearly 4 years of cycling the same route, I had a good dataset to look at how accurate the device is.
I wrote some code to import all of the rides tagged with commute in rubiTrack 4 Pro (technical details are below). These tracks needed categorising so that they could be compared. Then I plotted them out as a gizmo in Igor Pro and compared them to a reference data set which I obtained via GPS Visualiser.
The reference dataset is black. Showing the “true” elevation at those particular latitude and longitude coordinates. Plotted on to that are the commute tracks coloured red-white-blue according to longitude. You can see that there are a range of elevations recorded by the device, apart from a few outliers they are mostly accurate but offset. This is strange because I have the elevation of the start and end points saved in the device and I thought it changed the altitude it was measuring to these elevation positions when recording the track, obviously not.
To look at the error in the device I plotted out the difference in the measured altitude at a given location versus the true elevation. For each route (to and from work) a histogram of elevation differences is shown to the right. The average difference is 8 m for the commute in and 4 m for the commute back. This is quite a lot considering that all of this is only ~100 m above sea level. The standard deviation is 43 m for the commute in and 26 m for the way back.
This post at VeloViewer comparing GPS data on Strava from pro-cyclists riding the St15 of 2015 Giro d’Italia sprang to mind. Some GPS devices performed OK, whereas others (including Garmin) did less well. The idea in that post is that rain affects the recording of some units. This could be true and although I live in a rainy country, I doubt it can account for the inaccuracies recorded here. Bear in mind that that stage was over some big changes in altitude and my recordings, very little. On the other hand, there are very few tracks in that post whereas there is lots of data here.
It’s interesting that the data is worse going in to work than coming back. I do set off quite early in the morning and it is colder etc first thing which might mean the unit doesn’t behave as well for the commute to work. Both to and from work tracks vary most in lat/lon recordings at the start of the track which suggests that the unit is slow to get an exact location – something every Garmin user can attest to. Although I always wait until it has a fix before setting off. The final two plots show what the beginning of the return from work looks like for location accuracy (travelling east to west) compared to a midway section of the same commute (right). This might mean the the inaccuracy at the start determines how inaccurate the track is. As I mentioned, the elevation is set for start and end points. Perhaps if the lat/lon is too far from the endpoint it fails to collect the correct elevation.
I’m disappointed with the accuracy of the device. However, I have no idea whether other GPS units (including phones) would outperform the Garmin Edge 800 or even if later Garmin models are better. This is a good but limited dataset. A similar analysis would be possible on a huge dataset (e.g. all strava data) which would reveal the best and worst GPS devices and/or the best conditions for recording the most accurate data.
I described how to get GPX tracks from rubiTrack 4 Pro into Igor and how to crunch them in a previous post. I modified the code to get elevation data out from the cycling tracks and generally made the code slightly more robust. This left me with 1,200 tracks. My commutes are varied. I frequently go from A to C via B and from C to A via D which is a loop (this is what is shown here). But I also go A to C via D, C to A via B and then I also often extend the commute to include 30 km of Warwickshire countryside. The tracks could be categorized by testing whether they began at A or C (this rejected some partial routes) and then testing whether they passed through B or D. These could then be plotted and checked visually for any routes which went off course, there were none. The key here is to pick the right B and D points. To calculate the differences in elevation, the simplest thing was to get GPS Visualiser to tell me what the elevation should be for all the points I had. I was surprised that the API could do half a million points without complaining. This was sufficient to do the rest. Note that the comparisons needed to be done as lat/lon versus elevation because due to differences in speed, time or trackpoint number lead to inherent differences in lat/lon (and elevation). Note also due to the small scale I didn’t bother converting lat/lon into flat earth kilometres.
The post title comes from “Elevation” by Television, which can be found on the classic “Marquee Moon” LP.
Statistical hypothesis testing, commonly referred to as “statistics”, is a topic of consternation among cell biologists.
This is a short practical guide I put together for my lab. Hopefully it will be useful to others. Note that statistical hypothesis testing is a huge topic and one post cannot hope to cover everything that you need to know.
What statistical test should I do?
To figure out what statistical test you need to do, look at the table below. But before that, you need to ask yourself a few things.
- What are you comparing?
- What is n?
- What will the test tell you? What is your hypothesis?
- What will the p value (or other summary statistic) mean?
If you are not sure about any of these things, whichever test you do is unlikely to tell you much.
The most important question is: what type of data do you have? This will help you pick the right test.
- Measurement – most data you analyse in cell biology will be in this category. Examples are: number of spots per cell, mean GFP intensity per cell, diameter of nucleus, speed of cell migration…
- Normally-distributed – this means it follows a “bell-shaped curve” otherwise called “Gaussian distribution”.
- Not normally-distributed – data that doesn’t fit a normal distribution: skewed data, or better described by other types of curve.
- Binomial – this is data where there are two possible outcomes. A good example here in cell biology would be a mitotic index measurement (the proportion of cells in mitosis). A cell is either in mitosis or it is not.
- Other – maybe you have ranked or scored data. This is not very common in cell biology. A typical example here would be a scoring chart for a behavioural effect with agreed criteria (0 = normal, 5 = epileptic seizures). For a cell biology experiment, you might have a scoring system for a phenotype, e.g. fragmented Golgi (0 = is not fragmented, 5 = is totally dispersed). These arbitrary systems are a not a good idea. Especially, if the person scoring is unblinded to the experimental procedure. Try to come up with an unbiased measurement procedure.
|What do you want to do?||Measurement
|Describe one group||Mean, SD||Median, IQR||Proportion|
|Compare one group to a value||One-sample t-test||Wilcoxon test||Chi-square|
|Compare two unpaired groups||Unpaired t-test||Wilcoxon-Mann-Whitney two-sample rank test||Fisher’s exact test
|Compare two paired groups||Paired t-test||Wilcoxon signed rank test||McNemar’s test|
|Compare three or more unmatched groups||One-way ANOVA||Kruskal-Wallis test||Chi-square test|
|Compare three or more matched groups||Repeated-measures ANOVA||Friedman test||Cochran’s Q test|
|Quantify association between two variables||Pearson correlation||Spearman correlation|
|Predict value from another measured variable||Simple linear regression||Nonparametric regression||Simple logistic regression|
|Predict value from several measured or binomial variables||Multiple linear (or nonlinear) regression||Multiple logistic regression|
Modified from Table 37.1 (p. 298) in Intuitive Biostatistics by Harvey Motulsky, 1995 OUP.
What do “paired/unpaired” and “matched/unmatched” mean?
Most of the data you will get in cell biology is unpaired or unmatched. Individual cells are measured and you have say, 20 cells in the control group and 18 different cells in the test group. These are unpaired (or unmatched in the case of more than one test group) because the cells are different in each group. If you had the same cell in two (or more) groups, the data would be paired (or matched). An example of a paired dataset would be where you have 10 cells that you treat with a drug. You take a measurement from each of them before treatment and a measurement after. So you have paired measurements: one for cell A before treatment, one after; one for cell B before and after, and so on.
How to do some of these tests in IgorPRO
The examples below assume that you have values in waves called data0, data1, data2,… substitute the wavenames for your actual wave names.
Is it normally distributed?
The simplest way is to plot them and see. You can plot out your data using Analysis>Histogram… or Analysis>Packages>Percentiles and BoxPlot… Another possibility is to look at skewness or kurtosis of the dataset (you can do this with WaveStats, see below)
However, if you only have a small number of measurements, or you want to be sure, you can do a test. There are several tests you can do (Kolmogorov-Smirnoff, Jarque-Bera, Shapiro-Wilk). The easiest to do and most intuitive (in Igor) is Shapiro-Wilk.
If p < 0.05 then the data are not normally distributed. Statistical tests on normally distributed data are called parametric, while those on non-normally distributed data are non-parametric.
Describe one group
To get the mean and SD (and lots of other statistics from your data):
To get the median and IQR:
The mean and sd are also stored as variables (V_avg, V_sdev). StatsQuantiles calculates V_median, V_Q25, V_Q75, V_IQR, etc. Note that you can just get the median by typing Print StatsMedian(data0) or – in Igor7 – Print median(data0). There is often more than one way to do something in Igor.
Compare one group to a value
It is unlikely that you will need to do this. In cell biology, most of the time we do not have hypothetical values for comparison, we have experimental values from appropriate controls. If you need to do this:
Compare two unpaired groups
Use this for normally distributed data where you have test versus control, with no other groups. For paired data, use the additional flag /PAIR.
For the non-parametric equivalent, if n is large computation takes a long time. Use additional flag /APRX=2. If the data are paired, use the additional flag /WSRT.
For binomial data, your waves will have 2 points. Where point 0 corresponds to one outcome and point 1, the other. Note that you can compare to expected values here, for example a genetic cross experiment can be compared to expected Mendelian frequencies. To do Fisher’s exact test, you need a 2D wave representing a contingency table. McNemar’s test for paired binomial data is not available in Igor
If you have more than two groups, do not do multiple versions of these tests, use the correct method from the table.
Compare three or more unmatched groups
For normally-distributed data, you need to do a 1-way ANOVA followed by a post-hoc test. The ANOVA will tell you if there are any differences among the groups and if it is possible to investigate further with a post-hoc test. You can discern which groups are different using a post-hoc test. There are several tests available, e.g. Dunnet’s is useful where you have one control value and a bunch of test conditions. We tend to use Tukey’s post-hoc comparison (the /NK flag also does Newman-Keuls test).
StatsAnova1Test/T=1/Q/W/BF data0,data1,data2,data3 StatsTukeyTest/T=1/Q/NK data0,data1,data2,data3
The non-parametric equivalent is Kruskal-Wallis followed by a multiple comparison test. Dunn-Holland-Wolfe method is used.
StatsKSTest/T=1/Q data0,data1,data2,data3 StatsNPMCTest/T=1/DHW/Q data0,data1,data2,data3
Compare three or more matched groups
It’s unlikely that this kind of data will be obtained in a typical cell biology experiment.
There are also operations for StatsFriedmanTest and StatsCochranTest.
Straightforward command for two waves or one 2D wave. Waves (or columns) must be of the same length
At this point, you probably want to plot out the data and use Igor’s fitting functions. The best way to get started is with the example experiment, or just display your data and Analysis>Curve Fitting…
Hazard and survival data
In the lab we have, in the past, done survival/hazard analysis. This is a bit more complex and we used SPSS and would do so again as Igor does not provide these functions.
Notes for use
The good news is that all of this is a lot more intuitive in Igor 7! There is a new Menu item called Statistics, where most of these functions have a dialog with more information. In Igor 6.3 you are stuck with the command line. Igor 7 will be out soon (July 2016).
- Note that there are further options to most of these commands, if you need to see them
- check the manual or Igor Help
- or type ShowHelpTopic “StatsMedian” in the Command Window (put whatever command you want help with between the quotes).
- Extra options are specified by “flags”, these are things like “/Q” that come after the command. For example, /Q means “quiet” i.e. don’t print the output into the history window.
- You should always either print the results to the history or put them into a table so that we can check them. Note that the table gets over written if you do the same test with different data, so printing in this case is a good idea.
- The defaults in Igor are setup OK for our needs. For example, Igor does two-tailed comparison, alpha = 0.05, Welch’s correction, etc.
- Most operations can handle waves of different length (or have flags set to handle this case).
- If you are used to doing statistical tests in Excel, you might be wondering about tails and equal variances. The flags are set in the examples to do two-tailed analysis and unequal variances are handled by Welch’s correction.
- There’s a school of thought that says that using non-parametric tests is best to be cautious. These tests are not as powerful and so it is best to use parametric tests (t test, ANOVA) when you can.
Part of a series on the future of cell biology in quantitative terms.
If you are a cell biologist, you will have noticed the change in emphasis in our field.
At one time, cell biology papers were – in the main – qualitative. Micrographs of “representative cells”, western blots of a “typical experiment”… This descriptive style gave way to more quantitative approaches, converting observations into numbers that could be objectively assessed. More recently, as technology advanced, computing power increased and data sets became more complex, we have seen larger scale analysis, modelling, and automation begin to take centre stage.
This change in emphasis encompasses several areas including (in no particular order):
- Statistical analysis
- Image analysis
- Automation allowing analysis at scale
- Version control
- Data storage, archiving and accessing large datasets
- Electronic lab notebooks
- Computer vision and machine learning
- Prospective and retrospective modelling
- Mathematics and physics
The application of these areas is not new to biology and has been worked on extensively for years in certain areas. Perhaps most obviously by groups that identified themselves as “systems biologists”, “computational biologists”, and people working on large-scale cell biology projects. My feeling is that these methods have now permeated mainstream (read: small-scale) cell biology to such an extent that any groups that want to do cell biology in the future have to adapt in order to survive. It will change the skills that we look for when recruiting and it will shape the cell biologists of the future. Other fields such as biophysics and neuroscience are further through this change, while others have yet to begin. It is an exciting time to be a biologist.
I’m planning to post occasionally about the way that our cell biology research group is working on these issues: our solutions and our problems.
Part of a series on the future of cell biology in quantitative terms.
I needed to generate a uniform random distribution of points inside a circle and, later, a sphere. This is part of a bigger project, but the code to do this is kind of interesting. There were no solutions available for IgorPro, but stackexchange had plenty of examples in python and mathematica. There are many ways to do this. The most popular seems to be to generate a uniform random set of points in a square or cube and then discard those that are greater than the radius away from the origin. I didn’t like this idea, because I needed to extend it to spheroids eventually, and as I saw it the computation time saved was minimal.
Here is the version for points in a circle (radius = 1, centred on the origin).
This gives a nice set of points, 1000 shown here.
And here is the version inside a sphere. This code has variable radius for the sphere.
The three waves (xw,yw,zw) can be concatenated and displayed in a Gizmo. The code just plots out the three views.
My code uses var + enoise(var) to get a random variable from 0,var. This is because enoise goes from -var to +var. There is an interesting discussion about whether this is a truly flat PDF here.
This is part of a bigger project where I’ve had invaluable help from Tom Honnor from Statistics.
This post is part of a series on esoterica in computer programming.
Top Trumps is a card game for children. The mind can wander when playing such games with kids… typically, I start thinking: what is the best strategy for this game? But also, as the game drags on: what is the quickest way to lose?
Since Top Trumps is based on numerical values with simple outcomes, it seemed straightforward to analyse the cards and to simulate different scenarios to look at these questions.
Many Top Trumps variants exist, but the pack I’ll focus on is Marvel Universe “Who’s Your Hero?” made by Winning Moves (cat. No.: 3399736). Note though that the approach can probably be adapted to handle any other Top Trumps set.
There are 30 cards featuring Marvel characters. Each card has six categories:
- Top Trumps Rating.
What is the best card and which one is the worst?
In order to determine this I pulled in all the data and compared each value to every other card’s value, and repeated this per category (code is here, the data are here). The scaling is different between category, but that’s OK, because the game only uses within field comparisons. This technique allowed me to add up how many cards have a lower value for a certain field for a given card, i.e. how many cards would that card beat. These victories could then be summed across all six fields to determine the “winningest card”.
The cumulative victories can be used to rank the cards and a category plot illustrates how “winningness” is distributed throughout the deck.
The best card in the deck is Iron Man. What is interesting is that Spider-Man has the designation Top Trump (see card), but he’s actually second in terms of wins over all other cards. Head-to-head, Spider-Man beats Iron Man in Skill and Mystique. They draw on Top Trumps Rating. But Iron Man beats Spider-Man on the three remaining fields. So if Iron Man comes up in your hand, you are most likely to defeat your opponent.
At the other end of the “winningest card” plot, the worst card, is Wasp. Followed by Ant Man and Bucky Barnes. There needs to be a terrible card in every Top Trump deck, and Wasp is it. She has pitiful scores in most fields. And can collectively only win 9 out of (6 * 29) = 174 contests. If this card comes up, you are pretty much screwed.
What about draws? It’s true that a draw doesn’t mean losing and the active player gets another turn, so a draw does have some value. To make sure I wasn’t overlooking this with my system of counting victories, I recalculated the values using a Football League points system (3 points for a win, 1 point for a draw and 0 for a loss). The result is the same, with only some minor changes in the ranking.
I went with the first evaluation system in order to simulate the games.
I wrote a first version of the code that would printout what was happening so I could check that the simulation ran OK. Once that was done, it was possible to call the function that runs the game, do this multiple (1 x 10^6) times and record who won (player 1 or player 2) and for how many rounds each game lasted.
A typical printout of a game (first 9 rounds) is shown here. So now I could test out different strategies: What is the best way to win and what is the best way to lose?
If you knew which category was the most likely to win, you could pick that one and just win every game? Well, not quite. If both players take this strategy, then the player that goes first has a slight advantage and wins 57.8% of the time. The games can go on and on, the longest is over 500 rounds. I timed a few rounds and it worked out around 15 s per round. So the longest game would take just over 2 hours.
Strategy 2: pick one category and stick with it
This one requires very little brainpower and suits the disengaged adult: just keep picking the same category. In this scenario, Player 1 just picks strength every time while Player 2 picks their best category. This is a great way to lose. Just 0.02% of games are won using this strategy.
Strategy 3: pick categories at random
The next scenario was to just pick random categories. I set up Player 1 to do this and play against Player 2 picking their best category. This means 0.2% of wins for Player 1. The games are over fairly quickly with the longest of 1 x 10^6 games stretching to 200 rounds.
If both players take this strategy, it results in much longer games (almost 2000 rounds for the longest). The player-goes-first advantage disappears and the wins are split 49.9 to 50.1%.
Strategy 4: pick your worst category
How does all of this compare with selecting the worst category? To look at this I made Player 2 take this strategy, while Player 1 picked the best category. The result was definitive, it is simply not possible for Player 2 to win. Player 1 wins 100% of all 1 x 10^6 games. The games are over in less than 60 rounds, with most being wrapped up in less than 35 rounds. Of course this would require almost as much knowledge of the deck as the winning strategy, but if you are determined to lose then it is the best strategy.
The hand you’re dealt
Head-to-head, the best strategy is to pick your best category (no surprise there), but whether you win or lose depends on the cards you are dealt. I looked at which player is dealt the worst card Wasp and at the outcome. The split of wins for player 1 (58% of games) are with 54% of those, Player 2 stated with Wasp. Being dealt this card is a disadvantage but it is not the kiss of death. This analysis could be extended to look at the outcome if the n worst cards end up in your hand. I’d predict that this would influence the outcome further than just having Wasp.
So there you have it: every last drop of fun squeezed out of a children’s game by computational analysis. At quantixed, we aim to please.
The post title is taken from “Weak Superhero” by Rocket From The Crypt off their debut LP “Paint As A Fragrance” on Headhunter Records
Well, the 2015/2016 season was one to forget for Crewe Alexandra. Relegation to League Two (English football’s 4th tier) was confirmed on 9th April with a 3-0 defeat to local rivals Port Vale. Painful.
Maybe Repeat Failure is a bit strong. Under Dario Gradi, the Railwaymen eventually broke into League One/Championship (the 2nd Tier) where they punched above their weight for 8 seasons. The stats for all league finishes can be downloaded and plotted out to get a sense of Crewe’s fortunes over a century-and-a-bit.
The data are normalised because the number of teams in each league has varied over the years from 16 to 24. There were several years where The Alex finished bottom but there was nowhere to go. You can see the trends that have seen the team promoted and then relegated. It looked inevitable that the team would go down this season.
Now, the reasons why the Alex have done so badly this season are complex, however there is a theme to Crewe’s performances over all of this time. Letting in too many goals. To a non-supporter this might seem utterly obvious – of course you lose a lot if you let in too many goals. But Crewe are incredibly leaky and their goal difference historically is absolutely horrendous. The Alex are currently in 64th place on the all-time table, between West Ham and Portsmouth, with 4242 points – not bad – however our goal difference is -952. That’s minus 952 goals. Only Hartlepool have a worse goal difference (-1042). That’s out of 144 teams. At Gresty Road they’ve scored 3384 and let in 2526. On the road they netted 2135 but let in 3945.
See you in League Two for 2016/2017.
The post title is taken from “Repeat Failure” by The Delgados from their Peloton LP.
Fans of probability love random processes. And lotteries are a great example of random number generation.
The UK National Lottery ran in one format from 19/11/1994 until 7/10/2015. I was talking to somebody who had played the same set of numbers in all of these lottery draws and I wondered what the net gain or loss has been for them over this period.
The basic format is that people buy a line of numbers (6 numbers, from 1-49) and try to match the six numbers (from 49 balls numbered 1-49) drawn from a machine. The aim is to match all six balls and win the jackpot. The odds of this are fantastically small (1 in ~14 million), but if they are the only person matching these numbers they can take away £3-5 million. There are prizes for matching three numbers (1 in ~56 chance), four numbers (1 in ~1,032), five numbers (1 in ~55,491) or five numbers plus a seventh “bonus ball” (1 in ~2,330,636). Typical prizes are £10, £100, £1,500, or £50,000, respectively.
The data for all draws are available here. I pulled all draws regardless of machine that was used or what set of balls was used. This is what the data look like.
The rows are the seven balls (colour coded 1-49) that came out of the machine over 2065 draws.
I wrote a quick bit of code which generated all possible combinations of lottery numbers and compared all of these combinations to the real-life draws. The 1 in 14 million that I referred to earlier is actually
This gives us the following.
Crunching these combinations against the real-life draw outcomes tells us what would have happened if every possible ticket had been bought for all draws. If we assume a £1 stake for each draw and ~14 million people each buying a unique combination line. Each person has staked £2065 for the draws we are considering.
- The unluckiest line is 6, 7, 10, 21, 26, 36. This would’ve only won 12 lots of three balls, i.e. £120 – a net loss of £1945
- The luckiest line is 3, 6, 13, 23, 27, 49. These numbers won 41 x three ball, 2 x four ball, 1 x jackpot, 1 x 5 balls + bonus.
- Out of all possible combinations, 13728621 of them are in the red by anything from £5 to £1945. This is 98.2% of combinations.
Pretty terrible odds all-in-all. Note that I used the typical payout values for this calculation. If all possible tickets had been purchased the payouts would be much higher. So this calculation tells us what an individual could expect if they played the same numbers for every draw.
Note that the unluckiest line and the luckiest line have an equal probability of success in the 2066th draw. There is nothing intrinsically unlucky or lucky about these numbers!
I played the lottery a few times when it started with a specified set of numbers. I matched 3 balls twice and 4 balls once. I’ve not played since 1998 or so. Using another function in my code, I could check what would’ve happened if I’d kept playing all those intervening years. Fortunately, I would’ve looked forward to a net loss with 43 x three balls and 2 x four balls. Since I actually had a ticket for some of those wins and hardly any for the 2020 losing draws, I feel OK about that. Discovering that my line had actually matched the jackpot would’ve been weird, so I’m glad that wasn’t the case.
There’s lots of fun to be had with this dataset and a quick google suggests that there are plenty of sites on the web doing just that.
Here’s a quick plot for fun. The frequency of balls drawn in the dataset:
- The ball drawn the least is 13
- The one drawn the most is 38
- Expected number of appearances is 295 (14455/49).
- 14455 is 7 balls from 2065 draws
Since October 2015, the Lottery changed to 1-59 balls and so the dataset used here is effectively complete unless they revert to the old format.
The title of this post comes from “Wrote for Luck” by The Happy Mondays from their 1988 LP Bummed. The Manic Street Preachers recorded a great cover version which was on the B-Side of Roses in The Hospital single.
I read this article on the BBC recently about alcohol consumption in the UK. In passing it mentions how many people in the UK are teetotal. I found the number reported – 21% – unbelievable so I checked out the source for the numbers.
Sure enough, ~20% of the UK population are indeed teetotal (see plots). The breakdown by gender and age is perhaps to be expected. There are fewer teetotal men than women. Older women (65+) in particular are more likely to be teetotal. There has been a slight tendency in recent years for more abstinence across the board, although last year is an exception. The BBC article noted that young people are pushing up the numbers with high rates of sobriety.
I thought this post would make a fun antidote in the run up to the holidays, which in the UK at least is strongly linked with alcohol consumption.
The post title is taken from “Lemonade Secret Drinker” by Mansun, which featured on their first EP (One). It’s a play on “Secret Lemonade Drinker” the theme from R Whites Lemonade TV commercial in the 70s/80s (which I believe was written and sung by Elvis Costello’s father).
Our recent paper on “the mesh” in kinetochore fibres (K-fibres) of the mitotic spindle was our first adventure in 3D electron microscopy. This post is about some of the new data analysis challenges that were thrown up by this study. I promised a more technical post about this paper and here it is, better late than never.
In the paper we describe how over-expression of TACC3 causes the microtubules (MTs) in K-fibres to become “more wonky”. This was one of those observations that we could see by eye in the tomograms, but we needed a way to quantify it. And this meant coming up with a new spatial statistic.
After a few false starts*, we generated a method that I’ll describe here in the hope that the extra detail will be useful for other people interested in similar problems in cell biology.
The difficulty in this analysis comes from the fact that the fibres are randomly oriented, because of the way that the experiment is done. We section orthogonally to the spindle axis, but the fibre is rarely pointing exactly orthogonal to the tomogram. So the challenge is to reorient all the fibres to be able to pool numbers from across different fibres to derive any measurements. The IgorPro code to do this was made available with the paper. I have recently updated this code for a more streamlined workflow (available here).
We had two 3D point sets, one representing the position of each microtubule in the fibre at bottom of our tomogram and the other set is the position at the top. After creating individual MT waves from these point sets to work with, these waves could be plotted in 3D to have a look at them.
We need to normalise the fibres by getting them to all point in the same direction. We found that trying to pull out the average trajectory for the fibre didn’t work so well if there were lots of wonky MTs. So we came up with the following method:
- Calculate the total cartesian distance of all MT waves in an xy view, i.e. the sum of all projections of vectors on an xy plane.
- Rotate the fibre.
- Recalculate the total distance.
So we start off with this set of waves (Original). We rotate through 3D space and plot the total distance at each rotation to find the minimum, i.e. when most MTs are pointing straight at the viewer. This plot (Finding Minimum) is coloured so that hot colours are the smallest distance, it shows this calculation for a range of rotations in phi and theta. Once this minimum is found, the MT waves can be rotated by this value and the set is then normalised (you need to click on the pictures to see them properly).
Now we have all of the fibres that we imaged oriented in the same way, pointing to the zenith. This means we can look at angles relative to the z axis and derive statistics.
The next challenge was to make a measure of “wonkiness”. In other words, test how parallel the MTs are.
Violin plots of theta don’t really get across the wonkiness of the TACC3 overexpressed K-fibres (see figure above). To visualise this more clearly, each MT was turned into a vector starting at the origin and the point where the vector intersected with an xy plane set at an arbitrary distance in z (100 nm) was calculated. The scatter of these intersections demonstrates nicely how parallel the MTs are. If all MTs were perfectly parallel, they would all intersect at 0,0. In the control this is more-or-less true, with a bit of noise. In contrast, the TACC3-overexpressed group have much more scatter. What was nice is that the radial scatter was homogeneous, which showed that there was no bias in the acquisition of tomograms. The final touch was to generate a bivariate histogram which shows the scatter around 0,0 but it is normalised for the total number of points. Note that none of this possible without the first normalisation step.
The only thing that we didn’t have was a good term to describe what we were studying. “Wonkiness” didn’t sound very scientific and “parallelness” was also a bit strange. Parallelism is a word used in the humanities to describe analogies in art, film etc. However, it seemed the best term to describe the study of how parallel the MTs in a fibre are.
With a little help from my friends
The development of this method was borne out of discussions with Tom Honnor and Julia Brettschneider in the Statistics department in Warwick. The idea for the intersecting scatter plot came from Anne Straube in the office next door to me. They are all acknowledged in our paper for their input. A.G. at WaveMetrics helped me speed up my code by using MatrixOP and Euler’s rotation. His other suggestion of using PCA to do this would undoubtedly be faster, but I haven’t implemented this – yet. The bivariate histograms were made using JointHistogram() found here. JointHistogram now ships with Igor 7.
* as we describe in the paper
Several other strategies were explored to analyze deviations in trajectory versus the fiber axis. These were: examining the variance in trajectory angles, pairwise comparison of all MTs in the bundle, comparison to a reference MT that represented the fiber axis, using spherical rotation and rotating by an average value. These produced similar results, however, the one described here was the most robust and represents our best method for this kind of spatial statistical analysis.
The post title is taken from the Blondie LP “Parallel Lines”.