Tuesday, September 24, 2013

mapping Probe ID to Gene ID by using R

1) in or after 2006, there should be good Probe Id to GENE id map.

2) necessary package:
Biobase,
GEOquery

3) load the GPL platform file:
gpl5425 = getGEO('GPL5425',destdir=".")
         
## get annotation table
gpl5425_ann = Table(gpl5425)

#make the rownames to be ID
rownames(gpl5425_ann) = gpl5425_ann$ID

--------------------------------------------------------------------------------
                               ID                                        GB_ACC      GENE_SYMBOL                                    
AA799301_PROBE1 AA799301_PROBE1     AA799301   LOC498225            
AA799313_PROBE1 AA799313_PROBE1     AA799313      Siat10 ST3
AA799329_PROBE1 AA799329_PROBE1     AA799329     Col10a1      
AA799331_PROBE1 AA799331_PROBE1 NM_001007634        Pelo
--------------------------------------------------------------------------------

#assume the expression data is:
gse24417

--------------------------------------------------------------------------------
            ID_REF GSM204540 GSM204503 GSM204587 GSM204994
1 NM_016986_Probe1   14.5383   15.6827   14.0196   15.0969
2 NM_016987_Probe1   12.7629   15.6014   13.1481   14.4640
3 NM_016988_Probe1   14.0954   14.0583   13.6035   14.3924
4    M27893_Probe1   12.4787   12.6605   11.6335   14.0050
5 NM_012490_Probe1    8.9914    8.2654    9.2640    6.5239
--------------------------------------------------------------------------------

rownames(gse24417) = gse24417$ID_REF

#get the probe id, with upper character
pid = toupper(gse24417$ID_REF)

#get the gene symbol id
gid = gpl5425_ann[pid,]$GENE_SYMBOL

## gid contains empty value: "" , and NA value "NA", find the index of them and delete the lines.
blank_id = which( gid == "")
na_id = which(is.na(gid))

#now combine the final data
unwanted_index = c(blank_id,na_id)

gid = gid[-unwanted_index]
gse24417 = gse24417[-unwanted_index]

gse24417$ID_REF = gid

## now remove the repeated genes