Andy, in response to my griping on #Rstats, asks “What’s wrong with an arrow going both ways!?!?!” In R one can perform assignment three ways:

a = 2
a <- 2
2 -> a

I’m pretty sure all these things are equivalent. The only subtlety I’ve noticed so far is that to set default arguments in function definitions you have to use `=`.
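
To illustrate (a minimal sketch; the function `f` is made up for the example):

f <- function(n = 10) {  # `=` is required here; `n <- 10` would be a syntax error
    n * 2
}
f()       # 20
f(n = 3)  # 6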

This redundancy drives me absolutely nuts as an R beginner. From a human point of view, code has two main jobs: to be written and to be read. For the first job, having three different ways to assign a value to a variable might be nice? I’m not sure how, but I suppose it would occasionally be handy to write an expression, realise you hadn’t assigned it to a variable, and tack a `-> a` on at the end rather than going back to the start of the line.

For the second job, though, reading R is awful. Now I’ve heard horror stories about Perl, so maybe it’s just that I’m used to readable code and that, in the big scheme of things, R is nice and readable. But yesterday, ploughing through someone’s metaprogramming and tearing at my hair at the obtuseness of everything, I was foiled for about an hour, not having realised that the fsking arrow could go both ways.

I think it highlights a big flaw in R. Sometimes it’s as if no “destructive” design decisions have been taken since the original work of Ihaka and Gentleman. I spend so much time messing about trying to convert a data.frame of factors into a list of “vectors” of strings or something (see the sketch below). It seems that no-one has ever said “OK, we need a way to assign a value to a variable: let’s pick one”. Instead three have emerged. There are two different types of classes, S3 and S4 (which only seem to have any real effect on the help files). There are loads of data types in regular use, with somewhat bastardised names: a ‘vector’, ‘matrix’, ‘list’ and ‘array’ are all different things, and a vector doesn’t seem to be a subclass of matrix, for example.
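
For instance, here’s the sort of incantation I mean (a minimal sketch with made-up data):

df <- data.frame(a = c("x", "y"), b = c("p", "q"))
# df$a is a factor, not a character vector (stringsAsFactors defaults to TRUE)
strs <- lapply(df, as.character)  # now a plain list of character vectors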

Now, I’m sure the pluralities in R could be held up as a good thing. They probably encode subtleties that I don’t understand, and my bitching probably just exposes my naivety about the language. And I know that R is descended from Lisp and S, and that its user base grew up either with R or with one of those languages, so things that are natural to that background will seem alien to someone like me with a MATLAB and Python background. And there’s backwards compatibility to consider, and so on.

So don’t get me wrong – I think R is an amazing piece of work, and the communities around R are incredibly devoted and responsive. It’s kind of like the “first” three episodes of Star Wars: there’s an amazing film in there somewhere; it just needs some editorial decisions to be made. Maybe a new “Hadley” language will be born whereby R is completely re-API’d into a single language that only uses data.frames…

I asked a question on the superb biostar stackexchange site. It’s here: http://biostar.stackexchange.com/questions/1054/homology-bioconductor

It’s about finding genome-wide homologies using Bioconductor. It turns out that Bioconductor has a package called biomaRt which lets you query the Ensembl databases with ease. (Ensembl stores gene information for a bunch of different organisms.)
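
If, like me, you don’t know the magic filter and attribute names in advance, biomaRt will list them for you (a minimal sketch; it queries the live Ensembl service):

library(biomaRt)
mart <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
head(listFilters(mart))     # valid entries for the `filters` argument
head(listAttributes(mart))  # valid entries for the `attributes` argument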

I thought I’d write down my solution here, as a sort of extended answer to my question on biostar, in case anyone trips up on the question there and would like a more complete answer. You’ll need to read the question before any of this code makes sense!

library(biomaRt)
gen_hs2mm <- function(affyids){
    ensembl_hs <- useMart(
        "ensembl",
        dataset = "hsapiens_gene_ensembl"
    )
    hs2mm_filters <- c(
        "affy_hg_u133a",
        "with_mmusculus_homolog"
    )
    hs2mm_gene_atts <- c(
        "affy_hg_u133a",
        "ensembl_gene_id"
    )
    hs2mm_homo_atts <- c(
        "ensembl_gene_id",
        "mouse_ensembl_gene"
    )
    # the names in this list are arbitrary; the values are matched
    # to the filters by position
    hs2mm_values <- list(
        affyid = affyids,
        with_homolog = TRUE
    )
    # get the human genes and their mouse orthologues
    hs2mm_gene <- getBM(
        attributes = hs2mm_gene_atts,
        filters = hs2mm_filters,
        values = hs2mm_values,
        mart = ensembl_hs
    )
    hs2mm_homo <- getBM(
        attributes = hs2mm_homo_atts,
        filters = hs2mm_filters,
        values = hs2mm_values,
        mart = ensembl_hs
    )
    # merge the two data.frames on their shared ensembl_gene_id column
    merge(hs2mm_gene, hs2mm_homo)
}

gen_mm2hs <- function(affyids){
    ensembl_mm <- useMart(
        "ensembl",
        dataset = "mmusculus_gene_ensembl"
    )
    mm2hs_filters <- c(
        "affy_mogene_1_0_st_v1",
        "with_hsapiens_homolog"
    )
    mm2hs_gene_atts <- c(
        "affy_mogene_1_0_st_v1",
        "ensembl_gene_id"
    )
    mm2hs_homo_atts <- c(
        "ensembl_gene_id",
        "human_ensembl_gene"
    )
    # again, the names in this list are arbitrary
    mm2hs_values <- list(
        affyid = affyids,
        with_homolog = TRUE
    )
    # get the mouse genes and their human orthologues
    mm2hs_gene <- getBM(
        attributes = mm2hs_gene_atts,
        filters = mm2hs_filters,
        values = mm2hs_values,
        mart = ensembl_mm
    )
    mm2hs_homo <- getBM(
        attributes = mm2hs_homo_atts,
        filters = mm2hs_filters,
        values = mm2hs_values,
        mart = ensembl_mm
    )
    merge(mm2hs_gene, mm2hs_homo)
}

source('load_data.r')
# immgen and cd4T are ExpressionSet objects from Bioconductor:
# immgen is mouse data (from the Immunological Genome Project)
# and cd4T is human data, available on GEO under accession GDS785.
# See ref [1].
immgen <- load_immgen()
cd4T <- load_GDS785()
hs2mm <- gen_hs2mm(rownames(exprs(cd4T)))
mm2hs <- gen_mm2hs(rownames(exprs(immgen)))
colnames(hs2mm)[1] <- 'human_ensembl_gene'
colnames(mm2hs)[1] <- 'mouse_ensembl_gene'
# finally, merge the two tables into a single table containing all
# the homologous probes along with their respective Ensembl IDs
homol <- merge(hs2mm, mm2hs)

[1] Lee MS, Hanspers K, Barker CS, Korn AP et al. Gene expression profiles during human CD4+ T cell differentiation. Int Immunol 2004 Aug;16(8):1109-24. PMID: 15210650

Profiling in R

May 13, 2010

Giving a quick talk about vectorising code in R next month. Need to learn about profiling in R. This is what I have discovered in the last hour or so (when I should have been working):

Rprof()

This is the basic profiler in R. One turns it on:
Rprof("profile.out")
one runs some code … and then one turns it off using, as always, R’s intuitive API:
Rprof(NULL)
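
Put together (a minimal sketch; the `strsplit` loop is just a made-up workload to give the profiler something to sample):

Rprof("profile.out")  # start writing profiling samples to profile.out
for (i in 1:10000) strsplit(paste(letters, collapse = " "), " ")
Rprof(NULL)           # stop profiling
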
To see the results, you can use `summaryRprof("profile.out")`, which parses the file Rprof made and presents it in a nice clear format, something like:
> summaryRprof("method1.out")
$by.self
               self.time self.pct total.time total.pct
"strsplit"          4.08     94.0       4.10      94.5
"+"                 0.12      2.8       0.12       2.8
"is.character"      0.12      2.8       0.12       2.8
"as.logical"        0.02      0.5       0.02       0.5

$by.total
               total.time total.pct self.time self.pct
"strsplit"           4.10      94.5      4.08     94.0
"+"                  0.12       2.8      0.12      2.8
"is.character"       0.12       2.8      0.12      2.8
"as.logical"         0.02       0.5      0.02      0.5
There’s more detail about `Rprof()` and `summaryRprof()` in their help pages (`?Rprof` and `?summaryRprof`).

proftools

This package presents an alternative to summaryRprof, using graphviz to generate teh pretteh. Not sure how massively useful it would be in practice, unless your code got pretty complex! I’d be interested to know how complicated people’s functions actually got when using R.
To make the graph, make sure you have the graph and Rgraphviz packages properly installed, then bash out:
plotProfileCallGraph(readProfileData("profile.out"))
(I’m fairly sure you have to read the file in with `readProfileData` first, rather than passing the filename straight in.)
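
If you’re missing those dependencies, something like this should fetch them (a sketch, assuming the usual `biocLite` bootstrap; graph and Rgraphviz live on Bioconductor, while proftools is on CRAN):

source("http://bioconductor.org/biocLite.R")
biocLite(c("graph", "Rgraphviz"))  # Bioconductor dependencies
install.packages("proftools")      # the package itself, from CRAN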

profr

I kind of want everything Hadley Wickham does to be awesome, so when I see he’s made a profiling tool, I give it a lot more attention than it necessarily deserves. For example, you won’t find me hunting for documentation through anyone else’s github account. Nevertheless, this is what I find myself doing for `profr` (which in my head I will be pronouncing “proffer”). Note: it does turn out that I could have simply typed `?profr` in my R terminal (does this always work?).
To use, simply run
out <- profr(code_to_profile)
then to inspect it use the normal `head`, `tail` and `summary`, noting that `out` is (pretty much) a `data.frame`:
> head(out)
             f level time start  end  leaf source
8      example     1 0.44  0.00 0.44 FALSE  utils
9  <Anonymous>     2 0.04  0.00 0.04 FALSE   <NA>
10     library     2 0.02  0.04 0.06 FALSE   base
11      source     2 0.38  0.06 0.44 FALSE   base
12  prepare_Rd     3 0.04  0.00 0.04 FALSE   <NA>
13        %in%     3 0.02  0.04 0.06 FALSE   base
> summary(out)
             f          level             time          start
 match        : 7   Min.   : 1.000   Min.   :0.02   Min.   :0.0000
 %in%         : 6   1st Qu.: 5.000   1st Qu.:0.02   1st Qu.:0.1200
 is.factor    : 6   Median : 7.000   Median :0.02   Median :0.1600
 eval.with.vis: 6   Mean   : 8.113   Mean   :0.04   Mean   :0.1685
 inherits     : 5   3rd Qu.:10.500   3rd Qu.:0.02   3rd Qu.:0.2200
 <Anonymous>  : 3   Max.   :19.000   Max.   :0.44   Max.   :0.4200
 (Other)      :82
      end            leaf            source
 Min.   :0.0200   Mode :logical   Length:115
 1st Qu.:0.1400   FALSE:99        Class :character
 Median :0.2000   TRUE :16        Mode  :character
 Mean   :0.2085   NA's :0
 3rd Qu.:0.2500
 Max.   :0.4400

Here I should say that `code_to_profile` was the standard `example(glm)`; the whole workflow is sketched below. One can also use `plot()` or `ggplot()` to plot the ‘call tree’. I’m not quite sure what a ‘call tree’ is, and to be honest the (admittedly pretty) graph that `ggplot(out)` produces doesn’t enlighten me a great deal. That’s OK though – all I really wanted was the time spent.
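
For reference, the whole thing is only a few lines (a minimal sketch, assuming `profr` and `ggplot2` are installed):

library(profr)
library(ggplot2)
out <- profr(example(glm))  # profile a standard example
head(out)                   # it's (pretty much) a data.frame
summary(out)
ggplot(out)                 # the 'call tree' plot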

A quick point about Hadley’s code that I think highlights something wrong with R in general (and right with Python): in `profr`, from a user’s perspective, I don’t really have to learn anything new to make it work. I apply a function (handily called `profr`) to some code and I get back a `data.frame`, on which I can use very standard tools to investigate. I don’t have to write to a file (or generate a temp file to write to, or anything), I don’t have to learn to stop and start hidden things, or that to stop profiling I pass `NULL` to the function that started it. I don’t have to use a whole new tool to parse what is basically a table. And to plot I can just use `plot`, rather than having to disappear down the (admittedly fun) hole that is graphviz. It feels like every new thing I come across in R means learning a whole new API, set of concepts and object structure. (Admittedly ggplot is the total opposite of this, but hey, I guess that was the point.)