Reading Excel in R: how to find the start cell in messy spreadsheets -


i'm trying write r code read data mess of old spreadsheets. exact location of data varies sheet sheet: constant first column date , second column has "monthly return" header. in example, data starts in cell b5:

sample spreadsheet

how automate search of excel cells "monthly return" string using r?

at moment, best idea can come upload in r starting @ cell a1 , sort out mess in resulting (huge) matrices. i'm hoping more elegant solution

i haven't found way elegantly, i'm familiar problem (getting data factset pa reports -> excel -> r, right?). understand different reports have different formats, , can pain.

for different version of annoyingly formatted spreadsheets, following. it's not elegant (it requires 2 reads of file) works. reading file twice, make sure columns of correct type, , headers. it's easy mess column imports, i'd rather have code read file twice go through , clean columns myself, , read_excel defaults, if start @ right row, pretty good.

also, it's worth noting of today (2017-04-20), readxl had update. installed new version see if make easy, don't believe that's case, although mistaken.

library(readxl) library(stringr) library(dplyr)  f_path <- file.path("whatever.xlsx")  if (!file.exists(f_path)) {   f_path <- file.choose() }  # read twice, temp_read figure out data starts...  # maybe need -  #   excel_sheets <- readxl::excel_sheets(f_path) #   desired_sheet <- which(stringr::str_detect(excel_sheets,"2 factor brinson attribution")) desired_sheet <- 1 temp_read <- readxl::read_excel(f_path,sheet = desired_sheet)  skip_rows <- null col_skip <- 0 search_string <- "monthly returns" max_cols_to_search <- 10 max_rows_to_search <- 10  # note, - 0, may need add/subtract row if end skipping far later. while (length(skip_rows) == 0) {   col_skip <- col_skip + 1   if (col_skip == max_cols_to_search) break   skip_rows <- which(stringr::str_detect(temp_read[1:max_rows_to_search,col_skip][[1]],search_string)) - 0  }  # ... re-read known starting point. real_data <- readxl::read_excel(   f_path,   sheet = desired_sheet,   skip = skip_rows )  # don't need if start @ right row # given weird spreadsheets weird in own way # may want operate on col_skip, maybe so: # real_data <- real_data %>% #   select(-(1:col_skip)) 

Comments