My goal
I'm struggling to create a subset of a dataframe based on the content of the categorical variable s11aq1a20. In the how-tos I came across, the categorical variable contained string data, but in my case it holds integer values that have a specific meaning (1 = yes, 0 = no, 9 = unknown). Therefore, I added categories to let pandas label the values properly.
Ideally, both case A and case B in the sample code below would contain 5 rows after subsetting is done. Currently, it only works if I don't label the integer values.
What I have figured out so far
- Case B shows that subsetting isn't performed as expected once the categories are added with the following line:
    df.s11aq1a20 = df.s11aq1a20.cat.rename_categories(['yes', 'no', 'unknown'])
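One thing that may be worth checking before the rename: as far as I know, read_csv with dtype 'category' parses the categories as strings, and rename_categories with a list assigns the new names by position in the current category order. The sketch below uses a small made-up in-memory CSV (not nesarc_short.csv) to show both behaviours and, as an alternative, a dict that maps each old category to its label explicitly. Whether this explains case B depends on the values actually present in the real file.

```python
import io
import pandas as pd

# Made-up stand-in for the real file: three 1s, two 0s, one 9.
csv_text = "s11aq1a20\n1\n0\n9\n1\n1\n0\n"
df = pd.read_csv(io.StringIO(csv_text), dtype={"s11aq1a20": "category"})

# Inspect the categories pandas inferred; they are strings, in sorted order.
categories = list(df.s11aq1a20.cat.categories)
print(categories)

# A list renames by position: the first existing category gets the first name.
# With the sorted order above, rows that held 0 end up labelled 'yes'.
by_position = df.s11aq1a20.cat.rename_categories(["yes", "no", "unknown"])
print((by_position == "yes").sum())

# A dict renames by old value, so the order of the categories does not matter.
labels = {"1": "yes", "0": "no", "9": "unknown"}
by_mapping = df.s11aq1a20.cat.rename_categories(labels)
print((by_mapping == "yes").sum())
```

Printing df.s11aq1a20.cat.categories on the real data before renaming should show which values the three labels would be attached to.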
Sample dataset
The sample dataset (nesarc_short.csv) used for testing can be found here: https://pastebin.com/nktebsdr
Example code:
    import pandas as pd

    dataset_path = 'nesarc_short.csv'

    print('case a: numerical -> working\n')
    df = pd.read_csv(dataset_path, low_memory=False, na_values=' ')
    print("a: rows before: " + str(len(df.s11aq1a20)))  # outputs: 100
    df = df[(df.s11aq1a20 == 1)]
    print("a: rows after: " + str(len(df.s11aq1a20)))  # outputs: 5

    ###############################################################

    print('\ncase b: categorical -> not working\n')
    df = pd.read_csv(dataset_path, low_memory=False, dtype={'s11aq1a20': 'category'}, na_values=' ')
    # if the next line is commented out, subsetting works but no labels are available
    df.s11aq1a20 = df.s11aq1a20.cat.rename_categories(['yes', 'no', 'unknown'])
    print("b: rows before: " + str(len(df.s11aq1a20)))  # outputs: 100
    df = df[(df.s11aq1a20 == 'yes') | (df.s11aq1a20 == '1') | (df.s11aq1a20 == 1)]
    print("b: rows after: " + str(len(df.s11aq1a20)))  # outputs: 0
Console output
    case a: numerical -> working

    a: rows before: 100
    a: rows after: 5

    case b: categorical -> not working

    b: rows before: 100
    b: rows after: 0
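For comparison, here is a minimal sketch of one possible workaround: keep the column numeric so the filter from case A still works, and add the labels in a separate categorical column. The data is a made-up in-memory CSV, and both the idnum column and the s11aq1a20_label column name are hypothetical.

```python
import io
import pandas as pd

# Made-up stand-in for the real file: two rows coded 1, two coded 0, one coded 9.
csv_text = "idnum,s11aq1a20\n1,1\n2,0\n3,9\n4,1\n5,0\n"
df = pd.read_csv(io.StringIO(csv_text))

# Map each integer code to its label and store the result as a categorical.
labels = {1: "yes", 0: "no", 9: "unknown"}
df["s11aq1a20_label"] = df["s11aq1a20"].map(labels).astype("category")

# Filtering by code and filtering by label select the same rows.
by_code = df[df["s11aq1a20"] == 1]
by_label = df[df["s11aq1a20_label"] == "yes"]
print(len(by_code), len(by_label))
```

This keeps the original integers available for numeric comparisons while the label column is used for display or grouping.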