Andy, in response to my griping on #Rstats, asks “What’s wrong with an arrow going both ways!?!?!” In R one can perfom assignment three ways:

a = 2
a <- 2
2 -> a

I’m pretty sure all these things are equivalent. The only subtlety I’ve noticed so far is that to set default options in function definitions you have to use `=`.

This redundancy drives me absolutely nuts as an R beginner. From a human point of view code has two main jobs: to be written and to be read. Now for the first job, having three different ways to assign a value to a variable might be nice? I’m not sure how, but I guess it would be nice sometimes to write a function, then remember that you hadn’t assigned it a variable and do it at the end instead.

For the second job  though, reading R is awful. Now I’ve heard horror stories about perl, so maybe it’s just that I’m used to readable code and that, in the big scheme of things, R is nice and readable. But yesterday, ploughing through someone’s metaprogramming, tearing at my hair as to the obtuseness of everything, I was foiled for about an hour having not realised that the fsking arrow could go both ways.

I think it highlights a big flaw in R. Sometimes it’s like no “destructive” design decisions have been taken since the original work of Ihaka and Gentleman. I spend so much time messing about trying to convert a data.frame of factors into a list of “vectors” of strings or something. It seems that no-one has ever said “OK, we need a way to assign a value to a variable: let’s pick one”. Instead three have emerged. There are two different types of classes (which only seems to have any real effect on the help files). There are loads of data types that are in regular use, with somewhat bastardised names (i.e. a ‘vector’, ‘matrix’, ‘list’, ‘array’ are all different things – a vector doesn’t seem to be a subclass of matrix, for example).

Now, I’m sure that the pluralities in R could maybe be held up to be a good thing. They probably encode subtleties that I don’t understand, and that my bitching is just me exposing my naivety about the language. And I know that R is descended from LISP and S and that its user base either grew up with R or with one of these other languages. Therefore things that are natural to this background will seem alien to someone like me with a MATLAB and Python background. And there’s backwards compatibility to consider and so on.

So don’t get me wrong – I think R is an amazing piece of work, and the communities around R are incredibly devoted and responsive. It’s kind of like the “first” three episodes of Star Wars: there’s an amazing film in there somewhere, it just needs some editorial decisions to be made. Maybe a new “Hadley” language will be born whereby R is completely re-API’d into a single language that only uses data.frames…


Profiling in R

May 13, 2010

Giving a quick talk about vectorising code in R next month. Need to learn about profiling in R. This is what I have discovered in the last hour or so (when I should have been working):


This is the basic profiler in R. One turns it on:
One runs some code … and one turns it off using, as always, R’s intuitive API
To see the results, you can use `summaryRprof(“profile.out”)` which parses the file you made using Rprof and presents it in a nice clear format, which looks something like:
> summaryRprof("method1.out")
self.time self.pct total.time total.pct
"strsplit"          4.08     94.0       4.10      94.5
"+"                 0.12      2.8       0.12       2.8
"is.character"      0.12      2.8       0.12       2.8
"as.logical"        0.02      0.5       0.02       0.5
total.time total.pct self.time self.pct
"strsplit"           4.10      94.5      4.08     94.0
"+"                  0.12       2.8      0.12      2.8
"is.character"       0.12       2.8      0.12      2.8
"as.logical"         0.02       0.5      0.02      0.5
There’s more detail here about `Rprof()` and `summaryRprof()`.


This package presents an alternative to summaryRprof, using graphviz to generate teh pretteh. Not sure how massively useful it would be in practice, unless your code got pretty complex! I’d be interested to know how complicated people’s functions actually got when using R.
To make the graph, make sure you have graph and Rgraphviz properly installed, then bash out


I kind of want everything Hadley Wickham does to be awesome, so when I see he’s made a profiling tool, I give it a lot more attention than it necessarily deserves. For example, you won’t find me hunting for documentation through anyone else’s github account. Nevertheless, this is what I find myself doing for `profr` (which in my head I will be pronouncing “proffer”). Note: it does turn out that I could have simply typed ?profr in my R terminal (does this always work)?
To use, simply run
out <- profr(code_to_profile)
then to inspect it use the normal `head`, `tail` and `summary`, noting that `out` is (pretty much) a `data.frame`:
> head(out)
f level time start  end  leaf source
8      example     1 0.44  0.00 0.44 FALSE  utils
9  <Anonymous>     2 0.04  0.00 0.04 FALSE   <NA>
10     library     2 0.02  0.04 0.06 FALSE   base
11      source     2 0.38  0.06 0.44 FALSE   base
12  prepare_Rd     3 0.04  0.00 0.04 FALSE   <NA>
13        %in%     3 0.02  0.04 0.06 FALSE   base
> summary(out)
f          level             time          start
match        : 7   Min.   : 1.000   Min.   :0.02   Min.   :0.0000
%in%         : 6   1st Qu.: 5.000   1st Qu.:0.02   1st Qu.:0.1200
is.factor    : 6   Median : 7.000   Median :0.02   Median :0.1600
eval.with.vis: 6   Mean   : 8.113   Mean   :0.04   Mean   :0.1685
inherits     : 5   3rd Qu.:10.500   3rd Qu.:0.02   3rd Qu.:0.2200
<Anonymous>  : 3   Max.   :19.000   Max.   :0.44   Max.   :0.4200
(Other)      :82
end            leaf            source
Min.   :0.0200   Mode :logical   Length:115
1st Qu.:0.1400   FALSE:99        Class :character
Median :0.2000   TRUE :16        Mode  :character
Mean   :0.2085   NA's :0
3rd Qu.:0.2500
Max.   :0.4400

Here I should say that `code_to_profile` was the standard `example(glm)`. One can also use `plot()` or `ggplot()` to plot the ‘call tree’. I’m not quite sure what a ‘call tree’ is, and to be honest the (adimittedly pretty) graph that `ggplot(out)` produces doesn’t enlighten me a great deal. That’s OK though – all I really wanted was the time spent.

A quick point about Hadley’s code, that I think highlights something wrong with R in general (and right with Python) is that in `profr`, from a user perspective I don’t really have to learn anything new to make it work. I apply a function (handily called `profr`) to some code and I get back a `data.frame` on which I can use very standard tools in order to investigate. I don’t have to write to a file (or generate a temp file to write to or anything), I don’t have to learn to stop and start hidden things, or that to stop profiling I pass `NULL` to a function that started profiling. I don’t have to use a whole new tool to parse what is basically a table. And to plot I can just use `plot` rather than having to dissapear down the (admittedly fun) hole that is graphviz. It feels like every new thing I come across in R means learning a whole API and set of concepts and object structure. (Admittedly ggplot is the total opposite of this, but hey I guess that was the point)

I keep getting bitten on the bum with Python and ‘proper’ programming things. Today it’s this:

In [253]: 8**14
Out[253]: 4398046511104L

In [254]: 8**(numpy.int32(14))
Out[254]: 0

I would feel guilty about this if I had to specify the type of numbers I was using, but this is a dynamic programming language! How the hell am I supposed to know that raising an int by a numpy.int32 will force the result to also be an numpy.int32 (and therefore (silently) 0 because int32 can’t hold my number) whereas raising an int by an int will result in a long? Why didn’t the int32 become a long also? Or an int64 or something that could hold my number?



February 28, 2009

Look! A python module for learning bayesian network structures has surfaced on the JMLR software track – bits of the author’s blog are quite interesting too… This is article number 5 for the software track in JMLR, and it’s been going over a year now! I suppose 5 examples is quite enough to figure out what’s required for submission here, though, so maybe uptake will improve…

Latex, Beamer, Python, Beauty

February 25, 2009

This post quickly shows how to create pretty latex beamer slides, that don’t look anything much like maths seminar slides, and  that can include well formatted (Python) code, as well as the normal stuff that Beamer is so good at (maths, videos, etc). I’m assuming you’ve got a `standard’ install of latex (by standard I mean: like mine) that includes beamer and the listings package. I’m also assuming that you already know how to use beamer, so I’ve not included all the documentclass junk and so on. Here’s a taster of what it looks like, how to do it is after the break:


Read the rest of this entry »

Python Scope Example

February 20, 2009

Wow I just learnt a new thing about Python’s scope! Below are two examples.


x = 1
def foo():
    print x


x = 1
def foo():
    print x
    x = 2

Code snippet number 1 works. It looks up the variable x in its global namespace and prints it. Easy!

Code snippet number 2 breaks on line 3, because there’s an x defined in the function’s local namespace (even though it comes later) and so it knows that x is local and hence barfs when you try to print it before assigning it a value!

Having battled through this in Matlab, doing it the Python way is extremely easy on any platform. One makes use of Mencoder, which we can just call using os.system. Here’s an idiom, culled mostly from the scipy wiki:

from pylab import *
import os
# set the interactive mode of pylab to ON
# opens a new figure to plot into
fig_hndl = figure()
# make an empty list into which we'll append
# the filenames of the PNGs that compose each
# frame.
# filename for the name of the resulting movie
filename = 'animation'
number_of_frames = 100
for t in range(number_of_frames):
	# draw the frame
	# form a filename
	fname = '_tmp%03d.png'%t
	# save the frame
	# append the filename to the list
# call mencoder	
os.system("mencoder 'mf://_tmp*.png' -mf type=png:fps=10 
	-ovc lavc -lavcopts vcodec=wmv2 -oac copy 
	-o " + filename + ".mpg")
# cleanup
for fname in files: os.remove(fname)

Depending on how complicated the plot is, this can take anywhere from a while to ages and ages. It’s good quality, though! You can muck about with the frame rate in the call to encoder where it says fps=10. Change that 10 to something else (30 seems popular!). Also there are probably better codecs than wmv2!

Having written this, it strikes me we could probably do all this in Matlab in the same way, rather than trying to make movies using Matlab’s builtins…