Python: pass a pandas DataFrame, parameters, and functions to scipy.optimize.minimize


I am trying to use SciPy's scipy.optimize.minimize function to minimize a function I have created. However, the function I am trying to optimize is constructed from other functions that perform calculations based on a pandas DataFrame.

I understand that SciPy's minimize function can take multiple arguments via a tuple (e.g., Structure of inputs to scipy minimize function). However, I do not know how to pass in a function that relies on a pandas DataFrame.

I have created a reproducible example below.

    import pandas as pd
    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import minimize


    ####################     data     ####################
    # Initialize dataframe.
    data = pd.DataFrame({'id_i': ['aa', 'bb', 'cc', 'xx', 'dd'],
                         'id_j': ['zz', 'yy', 'xx', 'bb', 'aa'],
                         'y': [0.30, 0.60, 0.70, 0.45, 0.65],
                         'num': [1000, 2000, 1500, 1200, 1700],
                         'bar': [-4.0, -6.5, 1.0, -3.0, -5.5],
                         'mu': [-4.261140, -5.929608, 1.546283, -1.810941, -3.186412]})

    data['foo_1'] = data['bar'] - 11 * norm.ppf(1/1.9)
    data['foo_2'] = data['bar'] - 11 * norm.ppf(1 - (1/1.9))

    # Store list of ids.
    id_list = sorted(pd.unique(pd.concat([data['id_i'], data['id_j']], axis=0)))


    ####################     functions     ####################
    # Function 1: intermediate calculation to calculate predicted values.
    def calculate_y_pred(row, delta_params, sigma_param, id_list):

        # Extract the relevant values from delta_params.
        delta_i = delta_params[id_list.index(row['id_i'])]
        delta_j = delta_params[id_list.index(row['id_j'])]

        # Calculate the adjusted version of mu.
        mu_adj = row['mu'] - delta_i + delta_j

        # Calculate the predicted value of y.
        y_pred = norm.cdf(row['foo_1'], loc=mu_adj, scale=sigma_param) / \
                    (norm.cdf(row['foo_1'], loc=mu_adj, scale=sigma_param) +
                         (1 - norm.cdf(row['foo_2'], loc=mu_adj, scale=sigma_param)))

        return y_pred

    # Function to calculate the log-likelihood (for a row of the DataFrame data).
    def loglik_row(row, delta_params, sigma_param, id_list):

        # Calculate the log-likelihood for the row.
        y_pred = calculate_y_pred(row, delta_params, sigma_param, id_list)
        y_obs = row['y']
        n = row['num']
        loglik_row = np.log(norm.pdf(((y_obs - y_pred) * np.sqrt(n)) / np.sqrt(y_pred * (1-y_pred))) /
                                 np.sqrt(y_pred * (1-y_pred) / n))

        return loglik_row

    # Function to calculate the sum of the negative log-likelihood.
    # This function is called via SciPy's minimize function.
    def loglik_total(data, id_list, params):

        # Extract parameters.
        delta_params = list(params[0:len(id_list)])
        sigma_param = init_params[-1]

        # Calculate the negative log-likelihood for every row in data, and sum the values.
        loglik_total = -np.sum( data.apply(lambda row: loglik_row(row, delta_params, sigma_param, id_list), axis=1) )

        return loglik_total


    ####################     optimize     ####################
    # Provide initial parameter guesses.
    delta_params = [0 for id in id_list]
    sigma_param = 11
    init_params = tuple(delta_params + [sigma_param])

    # Maximize the log likelihood (minimize the negative log likelihood).
    minimize(fun=loglik_total, x0=init_params,
                 args=(data, id_list), method='nelder-mead')

This results in the following error: AttributeError: 'numpy.ndarray' object has no attribute 'apply' (the entire error output is below). I believe the error occurs because minimize treats x as a NumPy array, whereas I want to pass a pandas DataFrame.

    AttributeError                            Traceback (most recent call last)
    <ipython-input-93-9a5866bd626e> in <module>()
          1 minimize(fun=loglik_total, x0=init_params,
    ----> 2             args=(data, id_list), method='nelder-mead')

    /users/adam/anaconda/lib/python2.7/site-packages/scipy/optimize/_minimize.pyc in minimize(fun, x0, args, method, jac, hess, hessp, bounds, constraints, tol, callback, options)
        436                       callback=callback, **options)
        437     elif meth == 'nelder-mead':
    --> 438         return _minimize_neldermead(fun, x0, args, callback, **options)
        439     elif meth == 'powell':
        440         return _minimize_powell(fun, x0, args, callback, **options)

    /users/adam/anaconda/lib/python2.7/site-packages/scipy/optimize/optimize.pyc in _minimize_neldermead(func, x0, args, callback, maxiter, maxfev, disp, return_all, initial_simplex, xatol, fatol, **unknown_options)
        515
        516     for k in range(n + 1):
    --> 517         fsim[k] = func(sim[k])
        518
        519     ind = numpy.argsort(fsim)

    /users/adam/anaconda/lib/python2.7/site-packages/scipy/optimize/optimize.pyc in function_wrapper(*wrapper_args)
        290     def function_wrapper(*wrapper_args):
        291         ncalls[0] += 1
    --> 292         return function(*(wrapper_args + args))
        293
        294     return ncalls, function_wrapper

    <ipython-input-69-546e169fc54e> in loglik_total(data, id_list, params)
          6
          7     # calculate negative log-likelihood every row in data , sum values.
    ----> 8     loglik_total = -np.sum( data.apply(lambda row: loglik_row(row, delta_params, sigma_param, id_list), axis=1) )
          9
         10     return loglik_total

    AttributeError: 'numpy.ndarray' object has no attribute 'apply'

What is the proper way to handle the DataFrame data and call my function loglik_total within SciPy's minimize function? Any suggestions are welcome and appreciated.

Possible solution: Note, I have considered editing my functions to treat data as a NumPy array rather than a pandas DataFrame. However, I would like to avoid this if possible for a couple of reasons: 1) in loglik_total, I use pandas' apply function to apply the loglik_row function to every row of data; 2) it is convenient to refer to the columns of data by their column names rather than by numerical indices.

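It is not an issue with the data format; you called loglik_total in the wrong manner. The traceback shows that minimize calls function(*(wrapper_args + args)), so the parameter vector always arrives as the first argument and your DataFrame ended up in the wrong slot. Here is the modified version, with the correct order of arguments (params has to go first; then you pass the additional arguments in the same order as in args of your minimize call). It also reads sigma_param from params rather than from the global init_params, so the optimizer can actually vary it: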

    def loglik_total(params, data, id_list):

        # Extract parameters.
        delta_params = list(params[0:len(id_list)])
        sigma_param = params[-1]

        # Calculate the negative log-likelihood for every row in data, and sum the values.
        lt = -np.sum( data.apply(lambda row: loglik_row(row, delta_params, sigma_param, id_list), axis=1) )

        return lt
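The calling convention can also be checked in isolation. The following is only a minimal sketch with made-up data: toy_df, toy_objective and seen_types are hypothetical names that do not appear in the question. It records the types of the arguments minimize hands to the objective, and fits a straight line so there is something to optimize.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize

# Hypothetical toy data (not the data from the question): y = 2*x + 1 exactly.
toy_df = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0],
                       "y": [1.0, 3.0, 5.0, 7.0]})

# Records the argument types seen on each call of the objective.
seen_types = []

def toy_objective(params, df, label):
    # minimize calls this as toy_objective(x, *args):
    # the parameter vector comes first, then the entries of args in order.
    seen_types.append((type(params), type(df), type(label)))
    slope, intercept = params
    resid = df["y"] - (slope * df["x"] + intercept)
    return float((resid ** 2).sum())

res = minimize(toy_objective, x0=[1.0, 0.5],
               args=(toy_df, "demo"), method="Nelder-Mead")

# First call: NumPy array, then the DataFrame, then the string.
print(seen_types[0])
print(res.x)
```

The DataFrame reaches the objective unchanged as long as it sits in args; only the first positional argument is the NumPy array that minimize varies.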

If you now call minimize with this version of loglik_total:

    res = minimize(fun=loglik_total, x0=init_params,
                args=(data, id_list), method='nelder-mead')

It runs through nicely (note the order x, data, id_list, which is the same order you pass to loglik_total), and res looks as follows:

     final_simplex: (array([[  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
              3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
              6.43380010e+04,  -1.85436851e+09],
           [  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
              3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
              6.43380010e+04,  -1.85436851e+09],
           [  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
              3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
              6.43380010e+04,  -1.85436851e+09],
           [  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
              3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
              6.43380010e+04,  -1.85436851e+09],
           [  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
              3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
              6.43380010e+04,  -1.85436851e+09],
           [  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
              3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
              6.43380010e+04,  -1.85436851e+09],
           [  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
              3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
              6.43380010e+04,  -1.85436851e+09],
           [  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
              3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
              6.43380010e+04,  -1.85436851e+09],
           [  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
              3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
              6.43380010e+04,  -1.85436851e+09]]), array([-0., -0., -0., -0., -0., -0., -0., -0., -0.]))
               fun: -0.0
           message: 'Optimization terminated successfully.'
              nfev: 930
               nit: 377
            status: 0
           success: True
                 x: array([  2.55758096e+05,   6.99890451e+04,  -1.41860117e+05,
             3.88586258e+05,   3.19488400e+05,   4.90209168e+04,
             6.43380010e+04,  -1.85436851e+09])

Whether the output makes sense, I cannot judge though :)
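One possible reading of the result, which I have not verified against the original run: the last entry of x (the sigma parameter) is negative, and a negative scale is not a valid parameter for the normal distribution. The sketch below uses made-up numbers and assumes a recent pandas version, where summing skips NaN by default. It shows that norm.cdf returns NaN for a negative scale and that the negated sum of an all-NaN Series is zero, which would be consistent with fun: -0.0.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# A negative scale is outside the valid parameter range, so SciPy returns nan.
cdf_bad_scale = norm.cdf(-4.0, loc=0.0, scale=-1.0)
print(cdf_bad_scale)

# Hypothetical per-row log-likelihoods that all came out as NaN.
all_nan = pd.Series([np.nan, np.nan, np.nan])

# np.sum on a Series uses pandas' sum, which skips NaN by default,
# so the total is zero and the NaNs never reach the optimizer.
neg_sum = -np.sum(all_nan)
print(neg_sum)
```

If that is what happened here, the optimizer may simply have wandered into a region where every row is NaN. Keeping sigma positive (for example by returning np.inf from loglik_total when sigma_param <= 0, or by optimizing over the logarithm of sigma) would be one way to rule this out.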

