r - Unlist multiple values in dataframe column but keep track of the row number

I have a data frame that contains a column with multiple values, consisting of gene name synonyms separated by semicolons:

score <- c("32.01", "19.5", "18.0")
symbol <- c("30 kda adipocyte complemen related protein", "aat1", "cachectin")
synonym <- c(
  "30 kda adipocyte complemen related protein; 30 kda adipocyte complement-related protein; acdc; acrp30; adipoq; apm-1; apm1; adipocyte c1q , collagen domain containing",
  "aat1; aat1; alt-1; alt1; alanine aminotransferase; alanine aminotransferase 1; gpt 1; gpt1; glutamate pyruvate transaminase; glutamic--alanine transaminase 1; glutamic--pyruvic transaminase 1",
  "cachectin; tnf alpha; tnf-a; tnfa; tnfsf-2; tnfsf2; tnfalpha; tumor necrosis factor; tumor necrosis factor ligand superfamily member 2; tumor necrosis factor precursor; tumor necrosis factor alpha"
)
df <- data.frame(score, symbol, synonym, stringsAsFactors = FALSE)

This is raw output from data mining. I'm mapping the official gene symbols in the data to Entrez IDs. The symbol column doesn't contain the gene symbol, so I have to extract it from the synonyms (typically, there's an official symbol in the list). My goal is to keep track of the row numbers so that, once I've mapped the symbols to Entrez IDs, I can identify the rows that didn't map.

I'm using strsplit and unlist to parse out the synonyms, but I lose track of the row each synonym came from:

tmp <- data.frame(unlist(strsplit(as.character(df$synonym), "; ")))

What I want looks like this:

originalrow <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3)
cbind(tmp, originalrow)

    synonym                                            originalrow
 1  30 kda adipocyte complemen related protein         1
 2  30 kda adipocyte complement-related protein        1
 3  acdc                                               1
 4  acrp30                                             1
 5  adipoq                                             1
 6  apm-1                                              1
 7  apm1                                               1
 8  adipocyte c1q , collagen domain containing         1
 9  aat1                                               2
10  aat1                                               2
11  alt-1                                              2
12  alt1                                               2
13  alanine aminotransferase                           2
14  alanine aminotransferase 1                         2
15  gpt 1                                              2
16  gpt1                                               2
17  glutamate pyruvate transaminase                    2
18  glutamic--alanine transaminase 1                   2
19  glutamic--pyruvic transaminase 1                   2
20  cachectin                                          3
21  tnf alpha                                          3
22  tnf-a                                              3
23  tnfa                                               3
24  tnfsf-2                                            3
25  tnfsf2                                             3
26  tnfalpha                                           3
27  tumor necrosis factor                              3
28  tumor necrosis factor ligand superfamily member 2  3
29  tumor necrosis factor precursor                    3
30  tumor necrosis factor alpha                        3

Any advice is appreciated!

Here you can split each row, combine the values from each row with its row number in a data.frame, and then bind all the data.frames together. You can do that with

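# Map() (capital M) calls data.frame() once per row; origrow is recycled
# across that row's synonyms. Split on "; " so no leading spaces remain.
do.call("rbind", Map(data.frame,
    synonym = strsplit(as.character(df$synonym), "; "),
    origrow = seq_along(df$synonym)
))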
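
As a possible alternative to building one data.frame per row, the row index can be repeated once per synonym with rep() and lengths(). The sketch below is only an illustration under some assumptions: it uses an abbreviated copy of the question's data (a few synonyms per row), and the `mapped` vector is a hypothetical stand-in for whatever the Entrez ID lookup actually returns.

```r
# Abbreviated version of the question's data (a few synonyms per row).
df <- data.frame(
  synonym = c("acdc; acrp30; adipoq", "aat1; alt1", "cachectin; tnfa; tnfsf2"),
  stringsAsFactors = FALSE
)

# One character vector of synonyms per original row.
parts <- strsplit(df$synonym, "; ", fixed = TRUE)

# Repeat each row index once per synonym found in that row.
long <- data.frame(
  synonym = unlist(parts),
  origrow = rep(seq_along(parts), lengths(parts)),
  stringsAsFactors = FALSE
)

# Hypothetical lookup result: pretend only these synonyms mapped to an Entrez ID.
mapped <- c("adipoq", "tnfa")

# Rows of df for which no synonym mapped.
unmapped_rows <- setdiff(seq_len(nrow(df)), long$origrow[long$synonym %in% mapped])
```

Because `origrow` travels with each synonym, the rows that never mapped fall out of a simple setdiff() against the original row numbers. Note that lengths() requires R 3.2.0 or later; on older versions, sapply(parts, length) should behave the same way here.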

