Methods papers for MD997

I am now running a new module for masters students, MD997. The aim is to introduce the class to a range of advanced research methods and to get them to think about how to formulate their own research question(s).

The module is built around a paper which is allocated in the first session. I had to come up with a list of methods-type papers, which I am posting below. There are 16 students and I picked 23 papers. I aimed to cover their interests, which are biological but with some chemistry, physics and programming thrown in. The papers are a bit imaging-biased but I tried to get some ‘omics and neuro in there. There were some preprints on the list to make sure I covered the latest stuff.

The students picked their top 3 papers and we managed to assign them without too much trouble. Every paper got at least one vote. Some papers were in high demand. Fitzpatrick et al. on cryoEM of Alzheimer’s samples and the organoid paper from Lancaster et al. had by far the most votes.

The students present this paper to the class and also use it to formulate their own research proposal. Which one would you pick?

  1. Booth, D.G. et al. (2016) 3D-CLEM Reveals that a Major Portion of Mitotic Chromosomes Is Not Chromatin Mol Cell 64, 790-802. http://dx.doi.org/10.1016/j.molcel.2016.10.009
  2. Chai, H. et al. (2017) Neural Circuit-Specialized Astrocytes: Transcriptomic, Proteomic, Morphological, and Functional Evidence Neuron 95, 531-549 e9. http://dx.doi.org/10.1016/j.neuron.2017.06.029
  3. Chang, J.B. et al. (2017) Iterative expansion microscopy Nat Methods 14, 593-599. http://dx.doi.org/10.1038/nmeth.4261
  4. Chen, B.C. et al. (2014) Lattice light-sheet microscopy: imaging molecules to embryos at high spatiotemporal resolution Science 346, 1257998. http://dx.doi.org/10.1126/science.1257998
  5. Chung, K. & Deisseroth, K. (2013) CLARITY for mapping the nervous system Nat Methods 10, 508-13. http://dx.doi.org/10.1038/nmeth.2481
  6. Eichler, K. et al. (2017) The Complete Connectome Of A Learning And Memory Center In An Insect Brain bioRxiv. http://dx.doi.org/10.1101/141762
  7. Fitzpatrick, A.W.P. et al. (2017) Cryo-EM structures of tau filaments from Alzheimer’s disease Nature 547, 185-190. http://dx.doi.org/10.1038/nature23002
  8. Habib, N. et al. (2017) Massively parallel single-nucleus RNA-seq with DroNc-seq Nat Methods 14, 955-958. http://dx.doi.org/10.1038/nmeth.4407
  9. Hardman, G. et al. (2017) Extensive non-canonical phosphorylation in human cells revealed using strong-anion exchange-mediated phosphoproteomics bioRxiv. http://dx.doi.org/10.1101/202820
  10. Herzik, M.A., Jr. et al. (2017) Achieving better-than-3-A resolution by single-particle cryo-EM at 200 keV Nat Methods. http://dx.doi.org/10.1038/nmeth.4461
  11. Jacquemet, G. et al. (2017) FiloQuant reveals increased filopodia density during breast cancer progression J Cell Biol 216, 3387-3403. http://dx.doi.org/10.1083/jcb.201704045
  12. Jungmann, R. et al. (2014) Multiplexed 3D cellular super-resolution imaging with DNA-PAINT and Exchange-PAINT Nat Methods 11, 313-8. http://dx.doi.org/10.1038/nmeth.2835
  13. Kim, D.I. et al. (2016) An improved smaller biotin ligase for BioID proximity labeling Mol Biol Cell 27, 1188-96. http://dx.doi.org/10.1091/mbc.E15-12-0844
  14. Lancaster, M.A. et al. (2013) Cerebral organoids model human brain development and microcephaly Nature 501, 373-9. http://dx.doi.org/10.1038/nature12517
  15. Madisen, L. et al. (2012) A toolbox of Cre-dependent optogenetic transgenic mice for light-induced activation and silencing Nat Neurosci 15, 793-802. http://dx.doi.org/10.1038/nn.3078
  16. Penn, A.C. et al. (2017) Hippocampal LTP and contextual learning require surface diffusion of AMPA receptors Nature 549, 384-388. http://dx.doi.org/10.1038/nature23658
  17. Qin, P. et al. (2017) Live cell imaging of low- and non-repetitive chromosome loci using CRISPR-Cas9 Nat Commun 8, 14725. http://dx.doi.org/10.1038/ncomms14725
  18. Quick, J. et al. (2016) Real-time, portable genome sequencing for Ebola surveillance Nature 530, 228-232. http://dx.doi.org/10.1038/nature16996
  19. Ries, J. et al. (2012) A simple, versatile method for GFP-based super-resolution microscopy via nanobodies Nat Methods 9, 582-4. http://dx.doi.org/10.1038/nmeth.1991
  20. Rogerson, D.T. et al. (2015) Efficient genetic encoding of phosphoserine and its nonhydrolyzable analog Nat Chem Biol 11, 496-503. http://dx.doi.org/10.1038/nchembio.1823
  21. Russell, M.R. et al. (2017) 3D correlative light and electron microscopy of cultured cells using serial blockface scanning electron microscopy J Cell Sci 130, 278-291. http://dx.doi.org/10.1242/jcs.188433
  22. Strickland, D. et al. (2012) TULIPs: tunable, light-controlled interacting protein tags for cell biology Nat Methods 9, 379-84. http://dx.doi.org/10.1038/nmeth.1904
  23. Yang, J. et al. (2015) The I-TASSER Suite: protein structure and function prediction Nat Methods 12, 7-8. http://dx.doi.org/10.1038/nmeth.3213

If you are going to do a similar exercise, Twitter is invaluable for suggestions for papers. None of the students complained that they couldn’t find three papers which matched their interests. I set up a slide carousel in Powerpoint with the front page of each paper together with some key words to tell the class quickly what the paper was about. I gave them some discussion time and then collated their choices on the board. Assigning the papers was quite straightforward, trying to honour the first choices as far as possible. Having an excess of papers prevented too much horse trading for the papers that multiple people had picked.

Hopefully you find this list useful. I was inspired by Raphaël posting his own list here.

Advertisements

The Sound of Clouds: wordcloud of tweets using R

Another post using R and looking at Twitter data.

As I was typing out a tweet, I had the feeling that my vocabulary is a bit limited. Papers I tweet about are either “great”, “awesome” or “interesting”. I wondered what my most frequently tweeted words are.

Like the last post you can (probably) do what I’ll describe online somewhere, but why would you want to do that when you can DIY in R?

First, I requested my tweets from Twitter. I wasn’t sure of the limits of rtweet for retrieving tweets and the request only takes a few minutes. This gives you a download of everything including a csv of all your tweets. The text of those tweets is in a column called ‘text’.

 


## for text mining
library(tm)
## for building a corpus
library(SnowballC)
## for making wordclouds
library(wordcloud)
## read in your tweets
tweets <- read.csv('tweets.csv', stringsAsFactors = FALSE)
## make a corpus of the text of the tweets
tCorpus <- Corpus(VectorSource(tweets$text))
## remove all the punctation from tweets
tCorpus <- tm_map(tCorpus, removePunctuation)
## good idea to remove stopwords: high frequency words such as I, me and so on
tCorpus <- tm_map(tCorpus, removeWords, stopwords('english'))
## next step is to stem the words. Means that talking and talked become talk
tCorpus <- tm_map(tCorpus, stemDocument)
## now display your wordcloud
wordcloud(tCorpus, max.words = 100, random.order = FALSE)

For my @clathrin account this gave:

wordcloud.png

So my most tweeted word is paper, followed by cell and lab. I’m quite happy about that. I noticed that great is also high frequency, which I had a feeling would also be the case. It looks like @christlet, @davidsbristol, @jwoodgett and @cshperspectives are among my frequent twitterings, this is probably a function of the length of time we’ve been using twitter. The cloud was generated using 10.9K tweets over seven years, it might be interesting to look at any changes over this time…

The cloud is a bit rough and ready. Further filtering would be a good idea, but this quick exercise just took a few minutes.

The post title comes from “The Sound of Clouds” by The Posies from their Solid States LP.

I’m not following you: Twitter data and R

I wondered how many of the people that I follow on Twitter do not follow me back. A quick way to look at this is with R. OK, a really quick way is to give a 3rd party application access rights to your account to do this for you, but a) that isn’t safe, b) you can’t look at anyone else’s data, and c) this is quantixed – doing nerdy stuff like this is what I do. Now, the great thing about R is the availability of well-written packages to do useful stuff. I quickly found two packages twitteR and rtweet that are designed to harvest Twitter data. I went with rtweet and there were some great guides to setting up OAuth and getting going.

The code below set up my environment and pulled down lists of my followers and my “friends”. I’m looking at my main account and not the quantixed twitter account.


library(rtweet)
library(httpuv)
## setup your appname,api key and api secret
appname <- "whatever_name"
key <- "blah614h"
secret <- "blah614h"
## create token named "twitter_token"
twitter_token <- create_token(
app = appname,
consumer_key = key,
consumer_secret = secret)

clathrin_followers <- get_followers("clathrin", n = "all")
clathrin_followers_names <- lookup_users(clathrin_followers)
clathrin_friends <- get_friends("clathrin")
clathrin_friends_names <- lookup_users(clathrin_friends)

The terminology is that people that follow me are called Followers and people that I follow are called Friends. These are the terms used by Twitter’s API. I have almost 3000 followers and around 1200 friends.

This was a bit strange… I had fewer followers with data than actual followers. Same for friends: missing a few hundred in total. I extracted a list of the Twitter IDs that had no data and tried a few other ways to look them up. All failed. I assume that these are users who have deleted their account (and the Twitter ID stays reserved) or maybe they are suspended for some reason. Very strange.


## noticed something weird
## look at the twitter ids of followers and friends with no data
missing_followers <- setdiff(clathrin_followers$user_id,clathrin_followers_names$user_id)
missing_friends <- setdiff(clathrin_friends$user_id,clathrin_friends_names$user_id)

## find how many real followers/friends are in each set
aub <- union(clathrin_followers_names$user_id,clathrin_friends_names$user_id)
anb <- intersect(clathrin_followers_names$user_id,clathrin_friends_names$user_id)

## make an Euler plot to look at overlap
fit <- euler(c(
"Followers" = nrow(clathrin_followers_names) - length(anb),
"Friends" = nrow(clathrin_friends_names) - length(anb),
"Followers&Friends" = length(anb)))
plot(fit)

In the code above, I arranged in sets the “real Twitter users” who follow me or I follow them. There was an overlap of 882 users, leaving 288 Friends who don’t follow me back – boo hoo!

I next wanted to see who these people are, which is pretty straightforward.


## who are the people I follow who don't follow me back
bonly <- setdiff(clathrin_friends_names$user_id,anb)
no_follow_back <- lookup_users(bonly)

Looking at no_follow_back was interesting. There are a bunch of announcement accounts and people with huge follower counts that I wasn’t surprised do not follow me back. There are a few people on the list with whom I have interacted yet they don’t follow me, which is a bit odd. I guess they could have unfollowed me at some point in the past, but my guess is they were never following me in the first place. It used to be the case that you could only see tweets from people you followed, but the boundaries have blurred a lot in recent years. An intermediary only has to retweet something you have written for someone else to see it and you can then interact, without actually following each other. In fact, my own Twitter experience is mainly through lists, rather than my actual timeline. And to look at tweets in a list you don’t need to follow anyone on there. All of this led me to thinking: maybe other people (who follow me) are wondering why I don’t follow them back… I should look at what I am missing out on.

## who are the people who follow me but I don't follow back
aonly <- setdiff(clathrin_followers_names$user_id,anb)
no_friend_back <- lookup_users(aonly)
## save csvs with all user data for unreciprocated follows
write.csv(no_follow_back, file = "nfb.csv")
write.csv(no_friend_back, file = "nfb2.csv")

With this last bit of code, I was able to save a file for each subset of unreciprocated follows/friends. Again there were some interesting people on this list. I must’ve missed them following me and didn’t follow back.

I used these lists to prune my friends and to follow some interesting new people. The csv files contain the Twitter bio of all the accounts so it’s quick to go through and check who is who and who is worth following. Obviously you can search all of this content for keywords and things you are interested in.

So there you have it. This is my first “all R” post on quantixed – hope you liked it!

The post title is from “I’m Not Following You” the final track from the 1997 LP of the same name from Edwyn Collins.

Start Me Up: Endocytosis on demand

We have a new paper out. The title is New tools for ‘hot-wiring’ clathrin-mediated endocytosis with temporal and spatial precision. You can read it here.

Cells have a plasma membrane which is the barrier between the cell’s interior and the outside world. In order to import material from outside, cells have a special process called endocytosis. During endocytosis, cells form a tiny bubble of plasma membrane and pull it inside – taking with it a little pocket of the outside world. This process is very important to the cell. For example, it is one way that cells import nutrients to live. It also controls cell movement, growth, and how cells talk to one another. Because it is so important, cell biologists have studied how endocytosis works for decades.

Studying endocytosis is tricky. Like naughty children, cells simply do not do what they are told. There is no way to make a cell in the lab “do endocytosis”. It does it all the time, but we don’t know when or where on the cell surface a vesicle will be made. Not only that, but when a vesicle is made, we don’t really know what cargo it contains. It would be helpful to cell biologists if we could bring cells under control. This paper shows a way to do this. We demonstrate that clathrin-mediated endocytosis can be triggered, so that we can make it happen on-demand.

Endocytosis on-demand

Using a chemical which diffuses into the cell, we can trigger endocytosis to happen all over the cell. The movie on the right shows vesicles (bright white spots) forming after we add the chemical (at 0:00). The way that we designed the system means that the vesicles that form have one type of cargo in there. This is exciting because it means that we can now deliver things into cells using this cargo. So, we can trigger endocytosis on-demand and we can control the cargo, but we still cannot control where on the plasma membrane this happens.

We solved this problem by engineering a light-sensitive version of our system. With this new version we can use blue light to trigger endocytosis. Whereas the chemical diffused everywhere, the light can be focussed in a narrow region on the cell and endocytosis can be trigger only in that region. This means we control where, as well as when, a vesicle will form.

What does hot-wiring mean?

It is possible to start a car without a key by “hot-wiring” it. This happens in the movies, when the bad guy breaks into a car and just twists some wires together to start the car and make a getaway. To trigger endocytosis we used the cell’s own proteins, but we modified them. We chopped out all the unnecessary parts and just left the bare essentials. We call the process of triggering endocytosis “hot-wiring” because it is similar to just twisting the wires together rather than having a key.

It turns out that movies are not like real life, and hot-wiring a car is actually quite difficult and takes a while. So our systems are more like the Hollywood version than real life!

What is this useful for?

As mentioned above, the systems we have made are useful for cell biologists because they allow cells to be “tamed”. This means that we can accurately study the timing of endocytosis and which proteins are required in a very controlled way. It also potentially means that molecules can be delivered to cells that cannot normally enter. So we have a way to “force feed” cells with whatever we want. This would be most useful for drugs or nanoparticles that are not actively taken up by cells.

Who did the work?

Almost all of the work in the paper was by Laura Wood, a PhD student in the lab. She had help from fellow lab members Nick Clarke, who did the correlative light-electron microscopy, and Sourav Sarkar who did the binding experiments. Gabrielle Larocque, another PhD student did some fantastic work to revise the paper after Laura had departed for a post-doc position at another University. We put the paper up on bioRxiv in Summer 2016 and the paper has slowly made its way through peer review to be published in J Cell Biol today.

Wait? I’m a cell biologist! I want to know how this thing really works!

OK. The design is shown to the right. We made a plasma membrane “anchor” and a clathrin “hook” which is a fragment of protein which binds clathrin. The anchor and the hook have an FRB domain and an FKBP domain and these can be brought together by rapamycin. When the clathrin hook is at the membrane this is recognised by clathrin and vesicle formation can begin. The main hook we use is the appendage and hinge from the beta2 subunit of the AP2 complex.

Normally AP2, which has four subunits, needs to bind to PIP2 in the plasma membrane and undergo a conformational change to recognise a cargo molecule with a specific motif, only then can clathrin bind the beta2 appendage and hinge. By hot-wiring, we effectively remove all of those other proteins and all of those steps to just bring the clathrin binding bit to the membrane when we want. Being able to recreate endocytosis using such a minimalist system was a surprise. In vitro work from Dannhauser and Ungewickell had suggested this might be possible, but it really seems that the steps before clathrin engagement are not a precursor for endocytosis.

To make the light inducible version we used TULIPs (tunable light-controlled interacting proteins). So instead of FRB and FKBP we had a LOVpep and PDZ domain on the hook and anchor.

The post title comes from “Start Me Up” by The Rolling Stones. Originally on Tattoo You, but perhaps better known for its use by Microsoft in their Windows 95 advertising campaign. I’ve finally broken a rule that I wouldn’t use mainstream song titles for posts on this blog.

Fusion confusion: new paper on FGFR3-TACC3 fusions in cancer

We have a new paper out! This post is to explain what it’s about.

Cancer cells often have gene fusions. This happens because the DNA in cancer cells is really messed up. Sometimes, chromosomes can break and get reattached to a different one in a strange way. This means you get a fusion between one gene and another which makes a new gene, called a gene fusion. There are famous fusions that are known to cause cancer, such as the Philadelphia chromosome in chronic myelogenous leukaemia. This rearrangement of chromosomes 9 and 22 result in a fusion called BCR-ABL. There are lots of different gene fusions and a few years ago, a new fusion was discovered in bladder and brain cancers, called FGFR3-TACC3.

Genes encode proteins and proteins do jobs in cells. So the question is: how are the proteins from gene fusions different to their normal versions, and how do they cause cancer? Many of the gene fusions that scientists have found result in a protein that continues to send a signal to the cell when it shouldn’t. It’s thought that this transforms the cell to divide uncontrollably. FGFR3-TACC3 is no different. FGFR3 can send signals and the TACC3 part probably makes it do this uncontrollably. But, what about the TACC3 part? Does that do anything, or is this all about FGFR3 going wrong?

What is TACC3?

Chromosomes getting shared to the two daughter cells

TACC3, or transforming acidic coiled-coil protein 3 to give it its full name, is a protein important for cell division. It helps to share the chromosomes to the two daughter cells when a cell divides. Chromosomes are shared out by a machine built inside the cell called the mitotic spindle. This is made up of tiny threads called microtubules. TACC3 stabilises these microtubules and adds strength to this machine.

We wondered if cancer cells with FGFR3-TACC3 had problems in cell division. If they did, this might be because the TACC3 part of FGFR3-TACC3 is changed.

We weren’t the first people to have this idea. The scientists that found the gene fusion suggested that FGFR3-TACC3 might bind to the mitotic spindle but not be able to work properly. We decided to take a closer look…

What did you find?

First of all FGFR3-TACC3 is not actually bound to the mitotic spindle. It is at the cells membrane and in small vesicles in the cell. So if it is not part of the mitotic spindle, how can it affect cell division? One unusual thing about TACC3 is that it is a dimer, meaning two TACC3s are stuck together. Stranger than that, these dimers can stick to more dimers and multimerise into a much bigger protein. When we looked at the normal TACC3 in the cell we noticed that the amount bound to the spindle had decreased. We wondered whether the FGFR3-TACC3 was hoovering the normal TACC3 off the spindle, preventing normal cell division.

We made the cancer cells express a bit more normal TACC3 and this rescued the faulty division. We also got rid of the FGFR3-TACC3 fusion, and that also put things back to normal. Finally, we made a fake FGFR3-TACC3 which had a dummy part in place of FGFR3 and this was just as good at hoovering up normal TACC3 and causing cell division problems. So our idea seemed to be right!

What does this mean for cancer?

This project was to look at what is going on inside cancer cells and it is a long way from any cancer treatments. Drug companies can develop chemicals which stop cell signalling from fusions, these could work as anti-cancer agents. In the case of FGFR3-TACC3, what we are saying is: even if you stop the signalling there will still be cell division problems in the cancer cells. So an ideal treatment might be to block TACC3 interactions as well as stopping signalling. This is very difficult to do and is far in the future. Doing work like this is important to understand all the possible ways to tackle a specific cancer and to find any problems with potential treatments.

The people

Sourav Sarkar did virtually all the work for this paper and he is first author. Sourav left the lab before we managed to submit this paper and so the revision experiments requested by the peer reviewers were done by Ellis Ryan.

Why didn’t we post this paper as a preprint?

My group have generally been posting our new manuscripts as preprints while they undergo peer review, but we didn’t post this one. I was reluctant because many cancer journals at the time of submission did not allow preprints. This has changed a bit in the last few months, but back in February several key cancer journals did not accept papers that had appeared first as preprints.

The title of the post comes from “Fusion Confusion” 4th track on the Hazy EP by Dr Phibes & The House of Wax Equations.

Rollercoaster: ups and downs of Google Scholar citations

In the UK there is an advertising disclaimer that “the value of your investments may go down as well as up.” Since papers are our main commodity in science and citations are something of a return, surely the “value” of a published paper only ever increases over time. Doesn’t it?

I think this is true when citations to a paper are tracked at a conventional database (Web of Science for example). Citations are added and very rarely taken away. With Google Scholar it is a different story. Now, I am a huge Google Scholar fan so this post is not a criticism of the service at all. One of the nice things about GS is that it counts citations from the “grey literature”, i.e. theses, patents etc. But not so grey as to include blogs and news articles (most of the time). So you get a broader view of the influence of a paper beyond the confines of a conventional database. With this broader view comes volatility, as I’ll show below.

I don’t obsessively check my own page every day – honestly I don’t(!) – but I did happen to check my own page twice within a short space of time and I noticed that my H-index went up by 1 and then decreased by 1. I’m pretty sure I didn’t imagine this and so I began to wonder how stable the citation data in Google Scholar actually is and whether I could track cites automatically.

What goes up (must come down)

Manually checking GS every day is beyond me, and what are computers for anyway? I set up a little routine to grab my data each day and look at the stability of citations (details of how to do this are below if you’re interested).

You can click on the plot to see it in its full glory.

Each line is a plot of citations to a paper over many weeks. The grey line is no citations gained or lost, relative to the start. As the paper accrues citations the line becomes more red and if it loses citations below the starting point it turns blue. They are ranked by the integral of change in citation over time.

The data are retrieved daily so if a paper gains citations and loses an equal number in less than 24 hours, this is not detected.

You can see from the plot that the number of citations to a paper can go down as well as up. For one paper, citations dropped significantly from one day to the next, which undid two month’s worth of increases. This paper is my highest cited work and dropped 10 cites from 443 to 433.

I’m guessing that running this routine on someone working in a field with a higher citation rate would show more volatility.

The increases in citations have an obvious cause but what about the decreases? My guess is that they are duplicate citations which are removed when they are added to a “cluster” (Google’s way of dealing with multiple URLs for the same paper). Another cause is probably something that is subsequently judged to not be a paper, e.g. a blog post, and getting removed.

Please please tell me now

The alert emails from Google Scholar have always puzzled me. I have alerts set up to tell me when my work is cited. I love getting them – who doesn’t want to see who has cited their work? Annoyingly they arrive infrequently and only ever contain one or two new papers. I looked at the frequency of changes in citation number and checked when I received emails from Google Scholar.

Over the same period as the plot above, you can see that citations to my profile happen pretty frequently. Again, if my work was cited at a higher rate, I guess this would be even more frequent. But in this period I only received six or so alert emails. I don’t think GS waits until a citation is stable for a while before emailing, because they tend to come immediately after an update. The alert emails remain a mystery to me. It would be great if they came a bit more often and it would be even better if they told you which paper(s) they cite!

Summary

Google Scholar is a wonderful service that finds an extra 20% or so of the impact of your work compared to other databases. With this extra information comes volatility and the numbers you see on there probably shouldn’t be treated as absolute.

Methods

To do this I used Christian Kreibich’s python script to retrieve information from Google Scholar. I wrote a little shell script to run the scholar.py and set up a daemon to do this everyday at the same time. I couldn’t find a way to search my UserID and so the search for my name brings up some unrelated papers that need to be filtered. There are restrictions on what you can retrieve, so my script retrieved papers within three different time frames to avoid hitting the limit for paper information retrieval.

The daemon is a plist in ~/Library/LaunchAgents/

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.quantixed.gscrape</string>
    <key>KeepAlive</key>
    <false/>
    <key>RunAtLoad</key>
    <false/>
    <key>Program</key>
    <string>/path/to/the/shell/script/gscrape.sh</string>
      <key>StartCalendarInterval</key>
      <dict>
          <key>Hour</key>
          <integer>14</integer>
          <key>Minute</key>
          <integer>30</integer>
      </dict>
</dict>
</plist>

And the shell script is something like

#!/bin/bash
cd /path/to/the/shell/script/
/usr/bin/pythonw '/path/to/your/scholar.py-master/scholar.py' -c 500 --author "Joe Bloggs" --after=1999 --before=2007 --csv > a.csv
/usr/bin/pythonw '/path/to/your/scholar.py-master/scholar.py' -c 500 --author "Joe Bloggs" --after=2008 --before=2012 --csv > b.csv
/usr/bin/pythonw '/path/to/your/scholar.py-master/scholar.py' -c 500 --author "Joe Bloggs" --after=2013 --csv > c.csv
OF=all_$(date +%Y%m%d).csv
cat a.csv b.csv c.csv > $OF

To crunch the data I wrote something in Igor which reads in the CSVs and plotted out my data. This meant first getting a list of clusterIDs which correspond to my papers in order to filter out other people’s work.

I have a surprising number of tracks in my library with Rollercoaster in the title. I will go with indie wannabe act Northern Uproar for the title of this post.

“What goes up (must come down)” is from Graham & Brown’s Super Fresco wallpaper ad from 1984.

“Please please tell me now” is a lyric from Duran Duran’s “Is There Something I Should Know?”.

The Second Arrangement

To validate our analyses, I’ve been using randomisation to show that the results we see would not arise due to chance. For example, the location of pixels in an image can be randomised and the analysis rerun to see if – for example – there is still colocalisation. A recent task meant randomising live cell movies in the time dimension, where two channels were being correlated with one another. In exploring how to do this automatically, I learned a few new things about permutations.

Here is the problem: If we have two channels (fluorophores), we can test for colocalisation or cross-correlation and get a result. Now, how likely is it that this was due to chance? So we want to re-arrange the frames of one channel relative to the other such that frame i of channel 1 is never paired with frame i of channel 2. This is because we want all pairs to be different to the original pairing. It was straightforward to program this, but I became interested in the maths behind it.

The maths: Rearranging n objects is known as permutation, but the problem described above is known as Derangement. The number of permutations of n frames is n!, but we need to exclude cases where the ith member stays in the ith position. It turns out that to do this, you need to use the principle of inclusion and exclusion. If you are interested, the solution boils down to

n!\sum_{k=0}^{n}\frac{(-1)^k}{k!}

Which basically means: for n frames, there are n! number of permutations, but you need to subtract and add diminishing numbers of different permutations to get to the result. Full description is given in the wikipedia link. Details of inclusion and exclusion are here.

I had got as far as figuring out that the ratio of permutations to derangements converges to e. However,  you can tell that I am not a mathematician as I used brute force calculation to get there rather than write out the solution. Anyway, what this means in a computing sense, is that if you do one permutation, you might get a unique combination, with two you’re very likely to get it, and by three you’ll certainly have it.

Back to the problem at hand. It occurred to me that not only do we not want frame i of channel 1 paired with frame i of channel 2 but actually it would be preferable to exclude frames i ± 2, let’s say. Because if two vesicles are in the same location at frame i they may also be colocalised at frame i-1 for example. This is more complex to write down because for frames 1 and 2 and frames n and n-1, there are fewer possibilities for exclusion than for all other frames. For all other frames there are n-5 legal positions. This obviously sets a lower limit for the number of frames capable of being permuted.

The answer to this problem is solved by rook polynomials. You can think of the original positions of frames as columns on a n x n chess board. The rows are the frames that need rearranging, excluded positions are coloured in. Now the permutations can be thought of as Rooks in a chess game (they can move horizontally or vertically but not diagonally). We need to work out how many arrangements of Rooks are possible such that there is one rook per row and such that no Rook can take another.

If we have an 7 frame movie, we have a 7 x 7 board looking like this (left). The “illegal” squares are coloured in. Frame 1 must go in position D,E,F or G, but then frame 2 can only go in E, F or G. If a rook is at E1, then we cannot have a rook at E2. And so on.

To calculate the derangements:

1 + 29 x + 310 x^2 + 1544 x^3 + 3732 x^4 + 4136 x^5 + 1756 x^6 + 172 x^7

This is a polynomial expansion of this expression:

R_{m,n}(x) = n!x^nL_n^{m-n}(-x^{-1})

where L_n^\alpha(x) is an associated Laguerre polynomial. The solution in this case is 8 possibilities. From 7! = 5040 permutations. Of course our movies have many more frames and so the randomisation is not so limited. In this example, frame 4 can only either go in position A or G.

Why is this important? The way that the randomisation is done is: the frames get randomised and then checked to see if any “illegal” positions have been detected. If so, do it again. When no illegal positions are detected, shuffle the movie accordingly. In the first case, the computation time per frame is constant, whereas in the second case it could take much longer (because there will be more rejections). In the case of 7 frames, with the restriction of no frames at i ±2, then the failure rate is 5032/5040 = 99.8%. Depending on how the code is written, this can cause some (potentially lengthy) wait time. Luckily, the failure rate comes down with more frames.

What about it practice? The numbers involved in directly calculating the permutations and exclusions quickly becomes too big using non-optimised code on a simple desktop setup (a 12 x 12 board exceeds 20 GB). The numbers and rates don’t mean much, what I wanted to know was whether this slows down my code in a real test. To look at this I ran 100 repetitions of permutations of movies with 10-1000 frames. Whereas with the simple derangement problem permutations needed to be run once or twice, with greater restrictions, this means eight or nine times before a “correct” solution is found. The code can be written in a way that means that this calculation is done on a placeholder wave rather than the real data and then applied to the data afterwards. This reduces computation time. For movies of around 300 frames, the total run time of my code (which does quite a few things besides this) is around 3 minutes, and I can live with that.

So, applying this more stringent exclusion will work for long movies and the wait times are not too bad. I learned something about combinatorics along the way. Thanks for reading!

Further notes

The first derangement issue I mentioned is also referred to as the hat-check problem. Which refers to people (numbered 1,2,3 … n) with corresponding hats (labelled 1,2,3 … n). How many ways can they be given the hats at random such that they do not get their own hat?

Adding i+1 as an illegal position is known as problème des ménages. This is a problem of how to seat married couples so that they sit in a man-woman arrangement without being seated next to their partner. Perhaps i ±2 should be known as the vesicle problem?

The post title comes from “The Second Arrangement” by Steely Dan. An unreleased track recorded for the Gaucho sessions.

Parallel lines: new paper on modelling mitotic microtubules in 3D

We have a new paper out! You can access it here.

The people

This paper really was a team effort. Faye Nixon and Tom Honnor are joint-first authors. Faye did most of the experimental work in the final months of her PhD and Tom came up with the idea for the mathematical modelling and helped to rewrite our analysis method in R. Other people helped in lots of ways. George did extra segmentation, rendering and movie making. Nick helped during the revisions of the paper. Ali helped to image samples… the list is quite long.

The paper in a nutshell

We used a 3D imaging technique called SBF-SEM to see microtubules in dividing cells, then used computers to describe their organisation.

What’s SBF-SEM?

Serial block face scanning electron microscopy. This method allows us to take an image of a cell and then remove a tiny slice, take another image and so on. We then have a pile of images which covers the entire cell. Next we need to put them back together and make some sense of them.

How do you do that?

We use a computer to track where all the microtubules are in the cell. In dividing cells – in mitosis – the microtubules are in the form of a mitotic spindle. This is a machine that the cell builds to share the chromosomes to the two new cells. It’s very important that this process goes right. If it fails, mistakes can lead to diseases such as cancer. Before we started, it wasn’t known whether SBF-SEM had the power to see microtubules, but we show in this paper that it is possible.

We can see lots of other cool things inside the cell too like chromosomes, kinetochores, mitochondria, membranes. We made many interesting observations in the paper, although the focus was on the microtubules.

So you can see all the microtubules, what’s interesting about that?

The interesting thing is that our resolution is really good, and is at a large scale. This means we can determine the direction of all the microtubules in the spindle and use this for understanding how well the microtubules are organised. Previous work had suggested that proteins whose expression is altered in cancer cause changes in the organisation of spindle microtubules. Our computational methods allowed us to test these ideas for the first time.

Resolution at a large scale, what does that mean?

The spindle is made of thousands of microtubules. With a normal light microscope, we can see the spindle but we can’t tell individual microtubules apart. There are improvements in light microscopy (called super-resolution) but even with those improvements, right in the body of the spindle it is still not possible to resolve individual microtubules. SBF-SEM can do this. It doesn’t have the best resolution available though. A method called Electron Tomography has much higher resolution. However, to image microtubules at this large scale (meaning for one whole spindle), it would take months or years of effort! SBF-SEM takes a few hours. Our resolution is better than light microscopy, worse than electron tomography, but because we can see the whole spindle and image more samples, it has huge benefits.

What mathematical modelling did you do?

Cells are beautiful things but they are far from perfect. The microtubules in a mitotic spindle follow a pattern, but don’t do so exactly. So what we did was to create a “virtual spindle” where each microtubule had been made perfect. It was a bit like “photoshopping” the cell. Instead of straightening the noses of actresses, we corrected the path of every microtubule. How much photoshopping was needed told us how imperfect the microtubule’s direction was. This measure – which was a readout of microtubule “wonkiness” – could be done on thousands of microtubules and tell us whether cancer-associated proteins really cause the microtubules to lose organisation.

The publication process

The paper is published in Journal of Cell Science and it was a great experience. Last November, we put up a preprint on this work and left it up for a few weeks. We got some great feedback and modified the paper a bit before submitting it to a journal. One reviewer gave us a long list of useful comments that we needed to address. However, the other two reviewers didn’t think our paper was a big enough breakthrough for that journal. Our paper was rejected*. This can happen sometimes and it is frustrating as an author because it is difficult for anybody to judge which papers will go on to make an impact and which ones won’t. One of the two reviewers thought that because the resolution of SBF-SEM is lower than electron tomography, our paper was not good enough. The other one thought that because SBF-SEM will not surpass light microscopy as an imaging method (really!**) and because EM cannot be done live (the cells have to be fixed), it was not enough of a breakthrough. As I explained above, the power is that SBF-SEM is between these two methods. Somehow, the referees weren’t convinced. We did some more work, revised the paper, and sent it to J Cell Sci.

J Cell Sci is a great journal which is published by Company of Biologists, a not-for-profit organisation who put a lot of money back into cell biology in the UK. They are preprint friendly, they allow the submission of papers in any format, and most importantly, they have a fast-track*** option. This allowed me to send on the reviews we had and including our response to them. They sent the paper back to the reviewer who had a list of useful comments and they were happy with the changes we made. It was accepted just 18 days after we sent it in and it was online 8 days later. I’m really pleased with the whole publishing experience with J Cell Sci.

 

* I’m writing about this because we all have papers rejected. There’s no shame in that at all. Moreover, it’s obvious from the dates on the preprint and on the JCS paper that our manuscript was rejected from another journal first.

** Anyone who knows something about microscopy will find this amusing and/or ridiculous.

*** Fast-track is offered by lots of journals nowadays. It allows authors to send in a paper that has been reviewed elsewhere with the peer review file. How the paper has been revised in light of those comments is assessed by at the Editor and one peer reviewer.

Parallel lines is of course the title of the seminal Blondie LP. I have used this title before for a blog post, but it matches the topic so well.

Adventures in Code V: making a map of Igor functions

I’ve generated a lot of code for IgorPro. Keeping track of it all has got easier since I started using GitHub – even so – I have found myself writing something only to discover that I had previously written the same thing. I was thinking that it would be good to make a list of all functions that I’ve written to locate long lost functions.

This question was brought up on the Igor mailing list a while back and there are several solutions – especially if you want to look at dependencies. However, this two liner works to generate a file called funcfile.txt which contains a list of functions and the ipf file that they are appear in.

grep "^[ \t]*Function" *.ipf | grep -oE '[ \t]+[A-Za-z_0-9]+\(' | tr -d " " | tr -d "(" > output
for i in `cat output`; do grep -ie "$i" *.ipf | grep -w "Function" >> funcfile.txt ; done

Thanks to Thomas Braun on the mailing list for the idea. I have converted it to work on grep (BSD grep) 2.5.1-FreeBSD which runs on macOS. Use the terminal, cd to the directory containing your ipf files and run it. Enjoy!

EDIT: I did a bit more work on this idea and it has now expanded to its own repo. Briefly, funcfile.txt is converted to tsv and then parsed – using Igor – to json. This can be displayed using some d3.js magic.


Part of a series with code snippets and tips.

Realm of Chaos

Caution: this post is for nerds only.

I watched this numberphile video last night and was fascinated by the point pattern that was created in it. I thought I would quickly program my own version to recreate it and then look at patterns made by more points.

I didn’t realise until afterwards that there is actually a web version of the program used in the video here. It is a bit limited though so my code was still worthwhile.

A fractal triangular pattern can be created by:

  1. Setting three points
  2. Picking a randomly placed seed point
  3. Rolling a die and going halfway towards the result
  4. Repeat last step

If the first three points are randomly placed the pattern is skewed, so I added the ability to generate an equilateral triangle. Here is the result.

and here are the results of a triangle through to a decagon.

All of these are generated with one million points using alpha=0.25. The triangle, pentagon and hexagon make nice patterns but the square and polygons with more than six points make pretty uninteresting patterns.

Watching the creation of the point pattern from a triangular set is quite fun. This is 30000 points with a frame every 10 points.

Here is the code.

Some other notes: this version runs in IgorPro. In my version, the seed is set at the centre of the image rather than a random location. I used the random allocation of points rather than a six-sided dice.

The post title is taken from the title track from Bolt Thrower’s “Realm of Chaos”.