Andy, in response to my griping on #Rstats, asks “What’s wrong with an arrow going both ways!?!?!” In R one can perform assignment three ways:

a = 2
a <- 2
2 -> a

I’m pretty sure all these things are equivalent. The only subtlety I’ve noticed so far is that to set default arguments in function definitions you have to use `=`.
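A quick illustration of that last point (`pow` is just a made-up example):

pow <- function(x, n = 2) x ^ n   # fine: `=` sets the default
# pow <- function(x, n <- 2) ...  # syntax error: `<-` isn't allowed here
pow(3)                            # returns 9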

This redundancy drives me absolutely nuts as an R beginner. From a human point of view, code has two main jobs: to be written and to be read. Now for the first job, having three different ways to assign a value to a variable might be nice? I’m not sure how, but I guess it would sometimes be nice to write a function, then remember that you hadn’t assigned it to a variable, and do so at the end instead.

For the second job though, reading R is awful. Now I’ve heard horror stories about perl, so maybe it’s just that I’m used to readable code and that, in the big scheme of things, R is nice and readable. But yesterday, ploughing through someone’s metaprogramming, tearing my hair out at the obtuseness of everything, I was foiled for about an hour, not having realised that the fsking arrow could go both ways.

I think it highlights a big flaw in R. Sometimes it’s like no “destructive” design decisions have been taken since the original work of Ihaka and Gentleman. I spend so much time messing about trying to convert a data.frame of factors into a list of “vectors” of strings or something. It seems that no-one has ever said “OK, we need a way to assign a value to a variable: let’s pick one”. Instead three have emerged. There are two different class systems, S3 and S4, which only seem to have any real effect on the help files. There are loads of data types in regular use, with somewhat bastardised names (e.g. a ‘vector’, ‘matrix’, ‘list’ and ‘array’ are all different things; a vector doesn’t seem to be a subclass of matrix, for example).
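To make the data-type complaint concrete, here’s the sort of thing that trips me up (classes noted in the comments):

v <- c(1, 2, 3)             # class "numeric": an atomic vector
m <- matrix(1:6, nrow = 2)  # class "matrix": a vector with a dim attribute
l <- list(1, "a", TRUE)     # class "list": can hold anything
a <- array(1:8, c(2, 2, 2)) # class "array": n-dimensional
is.vector(m)                # FALSE: the dim attribute disqualifies it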

Now, I’m sure the pluralities in R could be held up as a good thing. They probably encode subtleties that I don’t understand, and my bitching is just me exposing my naivety about the language. And I know that R is descended from LISP and S, and that its user base either grew up with R or with one of these other languages, so things that are natural to that background will seem alien to someone like me with a MATLAB and Python background. And there’s backwards compatibility to consider, and so on.

So don’t get me wrong – I think R is an amazing piece of work, and the communities around R are incredibly devoted and responsive. It’s kind of like the “first” three episodes of Star Wars: there’s an amazing film in there somewhere, it just needs some editorial decisions to be made. Maybe a new “Hadley” language will be born whereby R is completely re-API’d into a single language that only uses data.frames…

I asked a question on the superb biostar stackexchange site. It’s here: http://biostar.stackexchange.com/questions/1054/homology-bioconductor

It’s about finding genome-wide homologies using Bioconductor. It turns out that Bioconductor has a package called biomaRt which allows you to query the Ensembl databases with ease. (Ensembl stores gene information for a bunch of different organisms.)

I thought I’d write down my solution here, as a sort of extended answer to my question on biostar, in case anyone trips up on the question there and would like a more complete answer. You’ll need to read the question before any of this code makes sense!
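One thing worth writing down before the code: the filter and attribute names used below aren’t guessable, but biomaRt will list them for you (the exact names depend on the Ensembl release):

library(biomaRt)
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
head(listFilters(mart))     # filter names, e.g. "affy_hg_u133a"
head(listAttributes(mart))  # attribute names, e.g. "ensembl_gene_id"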

library(biomaRt)
gen_hs2mm <- function(affyids){
    ensembl_hs <- useMart(
        "ensembl",
        dataset = "hsapiens_gene_ensembl"
    )
    hs2mm_filters <- c(
        "affy_hg_u133a",
        "with_mmusculus_homolog"
    )
    hs2mm_gene_atts <- c(
         "affy_hg_u133a",
        "ensembl_gene_id"
    )
    hs2mm_homo_atts <- c(
        "ensembl_gene_id",
        "mouse_ensembl_gene"
    )
    # the names in these lists are arbitrary
    hs2mm_value <- list(
        affyid = affyids,
        with_homolog = TRUE
    )
    # get the human genes and mouse orthologues
    hs2mm_gene <- getBM(
        attributes = hs2mm_gene_atts,
        filters = hs2mm_filters,
        values = hs2mm_value,
        mart = ensembl_hs
    )
    hs2mm_homo <- getBM(
        attributes = hs2mm_homo_atts,
        filters = hs2mm_filters,
        values = hs2mm_value,
        mart = ensembl_hs
    )
    # merge the two data.frames and return the result
    merge(hs2mm_gene, hs2mm_homo)
}

gen_mm2hs <- function(affyids){
    ensembl_mm <- useMart(
        "ensembl",
        dataset = "mmusculus_gene_ensembl"
    )
    mm2hs_filters <- c(
        "affy_mogene_1_0_st_v1",
        "with_hsapiens_homolog"
    )
    mm2hs_gene_atts <- c(
        "affy_mogene_1_0_st_v1",
        "ensembl_gene_id"
    )
    mm2hs_homo_atts <- c(
        "ensembl_gene_id",
        "human_ensembl_gene"
    )
    # the names in these lists are arbitrary
    mm2hs_value <- list(
        affyid = affyids,
        with_homolog = TRUE
    )
    # get the mouse genes and human orthologues
    mm2hs_gene <- getBM(
        attributes = mm2hs_gene_atts,
        filters = mm2hs_filters,
        values = mm2hs_value,
        mart = ensembl_mm
    )
    mm2hs_homo <- getBM(
        attributes = mm2hs_homo_atts,
        filters = mm2hs_filters,
        values = mm2hs_value,
        mart = ensembl_mm
    )
    # merge the two data.frames and return the result
    merge(mm2hs_gene, mm2hs_homo)
}
source('load_data.r')
# here immgen and cd4T are different ExpressionSet objects
# from Bioconductor.
# immgen is mouse data (from the Immunological Genome Project)
# and cd4T is human data.
# cd4T can be found on GEO using the accession ID GDS785.
# See ref [1].
immgen <- load_immgen()
cd4T <- load_GDS785()
hs2mm <- gen_hs2mm(rownames(exprs(cd4T)))
mm2hs <- gen_mm2hs(rownames(exprs(immgen)))
colnames(hs2mm)[1] <- 'human_ensembl_gene'
colnames(mm2hs)[1] <- 'mouse_ensembl_gene'
# the final thing is to merge the two tables to make a single
# table containing all the probes that are homologous, along
# with their respective Ensembl IDs. merge() joins on all shared
# column names by default, which here means both Ensembl ID columns.
homol <- merge(hs2mm,mm2hs)

[1] Lee MS, Hanspers K, Barker CS, Korn AP et al. Gene expression profiles during human CD4+ T cell differentiation. Int Immunol 2004 Aug;16(8):1109-24. PMID: 15210650.

Profiling in R

May 13, 2010

Giving a quick talk about vectorising code in R next month. Need to learn about profiling in R. This is what I have discovered in the last hour or so (when I should have been working):

Rprof()

This is the basic profiler in R. One turns it on:
Rprof("profile.out")
One runs some code … and one turns it off using, as always, R’s intuitive API:
Rprof(NULL)
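Putting that together, a minimal end-to-end run looks like this (the sorting workload is just a stand-in):
Rprof("profile.out")                  # start sampling into profile.out
x <- replicate(50, sort(runif(1e5)))  # the code being profiled
Rprof(NULL)                           # stop sampling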
To see the results, you can use `summaryRprof("profile.out")`, which parses the file Rprof made and presents it in a nice clear format, something like:
> summaryRprof("method1.out")
$by.self
               self.time self.pct total.time total.pct
"strsplit"          4.08     94.0       4.10      94.5
"+"                 0.12      2.8       0.12       2.8
"is.character"      0.12      2.8       0.12       2.8
"as.logical"        0.02      0.5       0.02       0.5

$by.total
               total.time total.pct self.time self.pct
"strsplit"           4.10      94.5      4.08     94.0
"+"                  0.12       2.8      0.12      2.8
"is.character"       0.12       2.8      0.12      2.8
"as.logical"         0.02       0.5      0.02      0.5
There’s more detail about `Rprof()` and `summaryRprof()` in their respective help pages.

proftools

This package presents an alternative to summaryRprof, using graphviz to generate teh pretteh. Not sure how massively useful it would be in practice, unless your code got pretty complex! I’d be interested to know how complicated people’s functions actually got when using R.
To make the graph, make sure you have graph and Rgraphviz properly installed (they’re Bioconductor packages, not CRAN ones), then bash out
pd <- readProfileData("profile.out")
plotProfileCallGraph(pd)
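If graphviz is being a pain, the same numbers are available as plain text; a minimal sketch:
library(proftools)
pd <- readProfileData("profile.out")  # parse the Rprof output
flatProfile(pd)                       # gprof-style flat listing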

profr

I kind of want everything Hadley Wickham does to be awesome, so when I see he’s made a profiling tool, I give it a lot more attention than it necessarily deserves. For example, you won’t find me hunting for documentation through anyone else’s github account. Nevertheless, this is what I find myself doing for `profr` (which in my head I will be pronouncing “proffer”). Note: it does turn out that I could have simply typed `?profr` in my R terminal (does this always work?).
To use, simply run
out <- profr(code_to_profile)
then to inspect it use the normal `head`, `tail` and `summary`, noting that `out` is (pretty much) a `data.frame`:
> head(out)
             f level time start  end  leaf source
8      example     1 0.44  0.00 0.44 FALSE  utils
9  <Anonymous>     2 0.04  0.00 0.04 FALSE   <NA>
10     library     2 0.02  0.04 0.06 FALSE   base
11      source     2 0.38  0.06 0.44 FALSE   base
12  prepare_Rd     3 0.04  0.00 0.04 FALSE   <NA>
13        %in%     3 0.02  0.04 0.06 FALSE   base
> summary(out)
             f          level             time           start
 match        : 7   Min.   : 1.000   Min.   :0.02   Min.   :0.0000
 %in%         : 6   1st Qu.: 5.000   1st Qu.:0.02   1st Qu.:0.1200
 is.factor    : 6   Median : 7.000   Median :0.02   Median :0.1600
 eval.with.vis: 6   Mean   : 8.113   Mean   :0.04   Mean   :0.1685
 inherits     : 5   3rd Qu.:10.500   3rd Qu.:0.02   3rd Qu.:0.2200
 <Anonymous>  : 3   Max.   :19.000   Max.   :0.44   Max.   :0.4200
 (Other)      :82
      end              leaf            source
 Min.   :0.0200   Mode :logical   Length:115
 1st Qu.:0.1400   FALSE:99        Class :character
 Median :0.2000   TRUE :16        Mode  :character
 Mean   :0.2085   NA's :0
 3rd Qu.:0.2500
 Max.   :0.4400

Here I should say that `code_to_profile` was the standard `example(glm)`. One can also use `plot()` or `ggplot()` to plot the ‘call tree’. I’m not quite sure what a ‘call tree’ is, and to be honest the (admittedly pretty) graph that `ggplot(out)` produces doesn’t enlighten me a great deal. That’s OK though – all I really wanted was the time spent.
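For the record, putting the whole thing together is only a few lines (a sketch; assumes ggplot2 is installed):
library(profr)
library(ggplot2)
out <- profr(example(glm))  # the workload mentioned above
plot(out)                   # base-graphics view of the call tree
ggplot(out)                 # the prettier ggplot2 version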

A quick point about Hadley’s code that I think highlights something wrong with R in general (and right with Python): in `profr`, from a user perspective, I don’t really have to learn anything new to make it work. I apply a function (handily called `profr`) to some code and I get back a `data.frame`, which I can investigate with very standard tools. I don’t have to write to a file (or generate a temp file to write to or anything), I don’t have to learn to stop and start hidden things, or that to stop profiling I pass `NULL` to the function that started it. I don’t have to use a whole new tool to parse what is basically a table. And to plot I can just use `plot` rather than having to disappear down the (admittedly fun) hole that is graphviz. It feels like every new thing I come across in R means learning a whole new API, set of concepts and object structure. (Admittedly ggplot is the total opposite of this, but hey, I guess that was the point.)

More Segment HMMs

April 23, 2009

It’s weird how you can muck about with structures for so long and still forget what the hell you’re doing all the time. This post is supposed to try and clarify segment HMMs a bit for me, so that I can approach the learning problem with a clear mind. It’s all very specific; and it’s all below the fold.


I keep getting bitten on the bum with Python and ‘proper’ programming things. Today it’s this:


In [253]: 8**14
Out[253]: 4398046511104L

In [254]: 8**(numpy.int32(14))
Out[254]: 0

I would feel guilty about this if I had to specify the type of numbers I was using, but this is a dynamic programming language! How the hell am I supposed to know that raising an int to a numpy.int32 power will force the result to also be a numpy.int32 (and therefore, silently, 0, because an int32 can’t hold my number), whereas raising an int to an int power will result in a long? Why didn’t the int32 become a long also? Or an int64, or something that could hold my number?

Gah!

Nasty Python Things

March 27, 2009

So I seem to keep writing commands that look like this:

delta[t][q] = max(
    [delta[tau][j] +
        pylab.log(
            pylab.array([
                output_dist(Q=q,L=(t-tau),Y=Y[tau+1:t]),
                duration_dist(Q=q,L=(t-tau)),
                transition_dist[q,j]]).prod())            
    for j in self.state_range])

Is this bad? The above is the max of a list. The list is made up using a list comprehension, where each element is the log of a product of a 1D array plus a bit. Each element of each array is a call to a function associated with my model. The trouble is, if I break it down into some for loops, then I start having to invent temporary names for my variables, which seems clunky.

Any opinions?


I came across the Microcosmographia Academica in one of Niranjan‘s slides (which are an interesting view into the world of academic politics and bureaucracy) and then found it in full at the webpage of Dr Utting of Kent University. It’s a lovely satire on academic politics by F. M. Cornford, from 1908, in the form of a short pamphlet. Just short enough, in fact, that I’ve wasted a good hour typesetting it [pdf,tex]. It starts with this small ‘advertisement’:

If you are young, do not read this book; it is not fit for you;
If you are old, throw it away; you have nothing to learn from it;
If you are unambitious, light the fire with it; you do not need its guidance.

But, if you are neither less than twenty-five years old, nor more than thirty;
And if you are ambitious withal, and your spirit hankers after academic politics;
Read, and may your soul (if you have a soul) find mercy!