I asked a question on the superb biostar stackexchange site. It’s here: http://biostar.stackexchange.com/questions/1054/homology-bioconductor

It’s about finding geneome-wide homologies using bioconductor. It turns out that bioconductor has a package called biomaRt which allows you to query the Ensembl databases with ease. (Ensembl stores gene information for a bunch of different organisms).

I thought I’d write down my solution here, as a sort of extended answer to my question on biostar, in case anyone trips up on the question there and would like a more complete answer. You’ll need to read the question before any of this code makes sense!

library(biomaRt)
gen_hs2mm <- function(affyids){
    ensembl_hs <- useMart(
        "ensembl",
        dataset = "hsapiens_gene_ensembl"
    )
    hs2mm_filters <- c(
        "affy_hg_u133a",
        "with_mmusculus_homolog"
    )
    hs2mm_gene_atts <- c(
         "affy_hg_u133a",
        "ensembl_gene_id"
    )
    hs2mm_homo_atts <- c(
        "ensembl_gene_id",
        "mouse_ensembl_gene"
    )
    # the names in these lists are arbitrary
    hs2mm_value = list(
        affyid=affyids,
        with_homolog=TRUE
    )
    # get the human genes and mouse orthologues
    hs2mm_gene <- getBM(
        attributes = hs2mm_gene_atts,
        filters = hs2mm_filters,
        value = hs2mm_value,
        mart = ensembl_hs
    )
    hs2mm_homo <- getBM(
        attributes = hs2mm_homo_atts,
        filters = hs2mm_filters,
        value = hs2mm_value,
        mart = ensembl_hs
    )
    # merge the two lists!
    hs2mm <- merge(hs2mm_gene,hs2mm_homo)
}

gen_mm2hs <- function(affyids){
    ensembl_mm <- useMart("ensembl",
        dataset = "mmusculus_gene_ensembl")
        mm2hs_filters <- c(
        "affy_mogene_1_0_st_v1",
        "with_hsapiens_homolog"
    )
    mm2hs_gene_atts <- c(
        "affy_mogene_1_0_st_v1",
        "ensembl_gene_id"
    )
    mm2hs_homo_atts <- c(
        "ensembl_gene_id",
        "human_ensembl_gene"
    )
    # the names in these lists are arbitrary
    mm2hs_value = list(
        affyids=affyids,
        with_homolog=TRUE
    )
    # get the mouse genes and human orthologues
    mm2hs_gene <- getBM(
        attributes = mm2hs_gene_atts ,
        filters = mm2hs_filters,
        value = mm2hs_value,
        mart = ensembl_mm
    )
    mm2hs_homo <- getBM(
        attributes = mm2hs_homo_atts,
        filters = mm2hs_filters,
        value = mm2hs_value,
        mart = ensembl_mm
    )
    mm2hs <- merge(mm2hs_gene,mm2hs_homo)
}
source('load_data.r')
# here immgen and cd4T are different experession set objects 
# from Bioconductor.
# immgen is mouse data (from the Immunological Genome Project) 
# and cd4T is human data
# cd4T can be found on GEO using the accessionID GDS785 
# See ref[1]
immgen <- load_immgen()
cd4T <- load_GDS785()
hs2mm <- gen_hs2mm(rownames(exprs(cd4T)))
mm2hs <- gen_mm2hs(rownames(exprs(immgen)))
colnames(hs2mm)[1] <- 'human_ensembl_gene'
colnames(mm2hs)[1] <- 'mouse_ensembl_gene'
# the final thing is to merge the two tables to make a single 
# table containing all the probes that are homologous, along 
# with their respsective EnsemblIDs
homol <- merge(hs2mm,mm2hs)

[1] Lee MS, Hanspers K, Barker CS, Korn AP et al. Gene expression profiles during human CD4+ T cell differentiation. Int Immunol2004 Aug;16(8):1109-24. PMID: 15210650

Advertisements

More Segment HMMs

April 23, 2009

It’s weird how you can muck about with structures for so long and still forget what the hell you’re doing all the time. This post is supposed to try and clarify segment HMMs a bit for me, so that I can approach the learning problem with a clear mind. It’s all very specific; and it’s all below the fold.

Read the rest of this entry »

Nasty Python Things

March 27, 2009

So I seem to keep writing commands that look like this:

delta[t][q] = max(
    [delta[tau][j] +
        pylab.log(
            pylab.array([
                output_dist(Q=q,L=(t-tau),Y=Y[tau+1:t]),
                duration_dist(Q=q,L=(t-tau)),
                transition_dist[q,j]]).prod())            
    for j in self.state_range])

Is this bad? The above is the max of a list. The list is made up using a list comprehension, where each element is the log of a product of a 1D array plus a bit. Each element of each array is a call to a function associated with my model. The trouble is, if I break it down into some for loops, then I start having to invent temporary names for my variables, which seems clunky.

Any opinions?

M

Hello World

January 23, 2009

Howdy!

I’m here to blog about my adventures in dynamic modelling and machine learning. I’ve got a PhD in spatiotemporal modelling, and I’ve been post-docing for a year and a half studying behaviour in a couple of different contexts. This work has covered quite a broad range of techniques in dynamic modelling, and has introduced me to some new ways of coding (mostly involving Python). So I thought it might be useful to write all this stuff down.