python - How to subset a pandas dataframe based on a condition of a categorical variable


My goal

I'm struggling to create a subset of a dataframe based on the content of the categorical variable s11aq1a20. In the how-tos I came across, the categorical variable contained string data, but in my case it holds integer values that have a specific meaning (yes = 1, no = 0, 9 = unknown). Therefore, I added categories to let pandas label the values properly.

Ideally, both case A and case B in the sample code below would contain 5 rows after the subsetting is done. Currently, it only works if I don't label the integer values.

What I have figured out so far

  • Case B shows that subsetting isn't performed as expected once the categories are added with the following line:

df.s11aq1a20 = df.s11aq1a20.cat.rename_categories(['yes', 'no', 'unknown'])
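
The `rename_categories` call above passes a list, which pandas matches to the existing categories by position rather than by value. A minimal sketch of the difference, assuming a small made-up Series that uses the coding described in the question (the data and the names `codes`, `by_position`, `by_value`, and `yes_rows` are illustrative and not taken from nesarc_short.csv):

```python
import pandas as pd

# Hypothetical data using the coding from the question: 1 = yes, 0 = no, 9 = unknown
codes = pd.Series([1, 0, 9, 1, 0], dtype="category")

# Inferred categories are stored in sorted order, here 0, 1, 9
print(list(codes.cat.categories))

# A list is matched by position: 0 -> 'yes', 1 -> 'no', 9 -> 'unknown'
by_position = codes.cat.rename_categories(["yes", "no", "unknown"])

# A dict is matched by value, so each code gets the label written next to it
by_value = codes.cat.rename_categories({1: "yes", 0: "no", 9: "unknown"})

print(by_position.tolist())
print(by_value.tolist())

# Comparing against a label selects the rows that carry that label
yes_rows = by_value[by_value == "yes"]
print(len(yes_rows))
```

Whether this is what happens with the real file depends on which categories pandas actually infers there, so printing `df.s11aq1a20.cat.categories` before renaming may be a cheap first check.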

Sample dataset

The sample dataset (nesarc_short.csv) used for testing can be found here: https://pastebin.com/nktebsdr

Example code:

import pandas as pd

dataset_path = 'nesarc_short.csv'

print('case a: numerical -> working\n')

# Read the column with its default (numeric) dtype
df = pd.read_csv(dataset_path, low_memory=False, na_values=' ')

print("a: rows before: " + str(len(df.s11aq1a20)))  # outputs: 100

df = df[(df.s11aq1a20 == 1)]

print("a: rows after: " + str(len(df.s11aq1a20)))  # outputs: 5

###############################################################

print('\ncase b: categorical -> not working\n')

# Read the same column as a categorical this time
df = pd.read_csv(dataset_path, low_memory=False, dtype={'s11aq1a20': 'category'}, na_values=' ')

# If this line is commented out, subsetting works, but no labels are available
df.s11aq1a20 = df.s11aq1a20.cat.rename_categories(['yes', 'no', 'unknown'])

print("b: rows before: " + str(len(df.s11aq1a20)))  # outputs: 100

df = df[(df.s11aq1a20 == 'yes') | (df.s11aq1a20 == '1') | (df.s11aq1a20 == 1)]

print("b: rows after: " + str(len(df.s11aq1a20)))  # outputs: 0

Console output

case a: numerical -> working

a: rows before: 100
a: rows after: 5

case b: categorical -> not working

b: rows before: 100
b: rows after: 0
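
One way to narrow down why case B ends up with 0 rows might be to look at the categories pandas inferred before relabelling them. The sketch below is only a stand-in under stated assumptions: it reads a tiny inline CSV instead of nesarc_short.csv, and `csv_text`, `labels`, and `yes_rows` are hypothetical names introduced for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for nesarc_short.csv, using the coding from the question
csv_text = "id,s11aq1a20\n1,1\n2,0\n3,9\n4,1\n5,0\n"
df = pd.read_csv(io.StringIO(csv_text), dtype={"s11aq1a20": "category"})

# Inspect what pandas inferred: the category values and how often each occurs
print(df["s11aq1a20"].cat.categories)
print(df["s11aq1a20"].value_counts())

# Relabel through a function so the mapping does not depend on category order;
# str(c) keeps the lookup working whether the categories are strings or integers
labels = {"1": "yes", "0": "no", "9": "unknown"}
df["s11aq1a20"] = df["s11aq1a20"].cat.rename_categories(lambda c: labels[str(c)])

# Filter on the label once the categories are known to be mapped as intended
yes_rows = df[df["s11aq1a20"] == "yes"]
print(len(yes_rows))
```

If the printed categories or counts on the real file differ from what the list passed to `rename_categories` assumes, that mismatch would be the first thing to check.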

