python - Pandas - Searching Column of Data Frame from List Efficiently


I am trying to figure out the most efficient way to search a data frame in pandas for a list (or dataframe) of other values without using brute-force methods. Is there a way to vectorize it? I know I can loop over each element of the list (or dataframe) and extract the data using the loc method, but I was hoping for something faster. I have a data frame with 1 million rows and I need to search within it to extract the index of 600,000 rows.

Example:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'wholelist': np.round(1000000 * np.random.rand(1000000), 0)})
    df2 = pd.DataFrame({'thingstofind': np.arange(50000) + 50000})
    df.loc[1:10, :]
    # Edited: come to think of it, the 'arange' method would have been better to populate the arrays.

I want the most efficient way to get the index of each df2 value in df, where it exists in df.

Thanks!

Here's an approach with np.searchsorted, since it seems the second dataframe has its elements sorted and unique -

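    def find_index(a, b, invalid_specifier=-1):
        # b must be sorted: find where each element of a would be inserted in b
        idx = np.searchsorted(b, a)
        # Insertion points past the end of b are clamped so that b[idx] is valid
        idx[idx == b.size] = 0
        # Elements of a that are not actually present in b get the invalid marker
        idx[b[idx] != a] = invalid_specifier
        return idx

    def process_dfs(df, df2):
        a = df.wholelist.values.ravel()
        b = df2.thingstofind.values.ravel()
        return find_index(a, b, invalid_specifier=-1)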
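The function above relies on b being sorted. If that assumption does not hold, one possible variant is to pass np.searchsorted its sorter argument and map the result back to positions in the original b. This is only a sketch and not part of the original answer: the name find_index_unsorted is hypothetical, and b is still assumed to hold unique values and to be non-empty.

```python
import numpy as np

def find_index_unsorted(a, b, invalid_specifier=-1):
    # Hypothetical helper: same idea as find_index, but b may be unsorted.
    # Assumes b is non-empty and contains unique values.
    sorter = np.argsort(b)                      # permutation that sorts b
    pos = np.searchsorted(b, a, sorter=sorter)  # positions within sorted b
    pos[pos == b.size] = 0                      # clamp out-of-range positions
    idx = sorter[pos]                           # map back to positions in b
    idx[b[idx] != a] = invalid_specifier        # mark values absent from b
    return idx

a = np.array([3, 5, 8, 4, 12])
b = np.array([9, 2, 5, 3, 8])
result = find_index_unsorted(a, b)
print(result)
```

The extra argsort costs O(n log n) on b, so when b is already known to be sorted the plain find_index avoids that work.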

Sample run on arrays -

    In [200]: a
    Out[200]: array([ 3,  5,  8,  4,  3,  2,  5,  2, 12,  6,  3,  7])

    In [201]: b
    Out[201]: array([2, 3, 5, 6, 7, 8, 9])

    In [202]: find_index(a,b, invalid_specifier=-1)
    Out[202]: array([ 1,  2,  5, -1,  1,  0,  2,  0, -1,  3,  1,  4])

Sample run on dataframes -

    In [188]: df
    Out[188]:
        wholelist
    0           3
    1           5
    2           8
    3           4
    4           3
    5           2
    6           5
    7           2
    8          12
    9           6
    10          3
    11          7

    In [189]: df2
    Out[189]:
       thingstofind
    0             2
    1             3
    2             5
    3             6
    4             7
    5             8
    6             9

    In [190]: process_dfs(df, df2)
    Out[190]: array([ 1,  2,  5, -1,  1,  0,  2,  0, -1,  3,  1,  4])
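As a cross-check, pandas offers a comparable lookup through Index.get_indexer, which also returns -1 for values that are absent and requires the index being searched to be unique. The sketch below reuses the small sample data from the run above. It is an alternative under those assumptions rather than the method from the answer, and it has not been timed against np.searchsorted. Series.isin is included because it yields the matching row labels of df, which may be closer to what the question asks for.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'wholelist': [3, 5, 8, 4, 3, 2, 5, 2, 12, 6, 3, 7]})
df2 = pd.DataFrame({'thingstofind': [2, 3, 5, 6, 7, 8, 9]})

# Position of each df value within df2, or -1 when it is not present.
# get_indexer requires the values in df2.thingstofind to be unique.
lookup = pd.Index(df2['thingstofind'])
positions = lookup.get_indexer(df['wholelist'].to_numpy())

# Row labels of df whose value appears anywhere in df2.
matching_rows = df.index[df['wholelist'].isin(df2['thingstofind'])]

print(positions)
print(list(matching_rows))
```

On this sample, positions is the same array that process_dfs returns, and matching_rows lists the rows of df where that array is not -1.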
