I'm trying to figure out an efficient way to search a pandas DataFrame for a list (or DataFrame) of other values without using brute-force methods. Is there a way to vectorize it? I know I can loop over each element of the list (or DataFrame) and extract the data using the .loc method, but I'm hoping for something faster. I have a DataFrame with 1 million rows and need to search within it to extract the index of 600,000 rows.
Example:
    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'wholelist': np.round(1000000*(np.random.rand(1000000)), 0)})
    df2 = pd.DataFrame({'thingstofind': np.arange(50000) + 50000})
    df.loc[1:10, :]
    # Edited: come to think of it, the 'arange' method would have been better to populate the arrays.
I want an efficient way to get the index of the values of df2 in df, if they exist in df.
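For reference, the brute-force version I have in mind looks roughly like this (just a sketch of the kind of per-value .loc lookup I'd like to avoid; my actual code differs):

    # One boolean-mask + .loc lookup per value of df2 -- works, but slow.
    found = {}
    for val in df2['thingstofind']:
        matches = df.loc[df['wholelist'] == val]
        if len(matches):
            found[val] = matches.index[0]  # keep the first matching row label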
Thanks!
Here's an approach with np.searchsorted.
It seems the second dataframe has its elements sorted and unique -
    def find_index(a, b, invalid_specifier=-1):
        # Positions where each element of a would be inserted into the sorted array b
        idx = np.searchsorted(b, a)
        # Clip out-of-range positions to 0 so they can be indexed safely
        idx[idx == b.size] = 0
        # Positions whose value doesn't actually match get the invalid marker
        idx[b[idx] != a] = invalid_specifier
        return idx

    def process_dfs(df, df2):
        a = df.wholelist.values.ravel()
        b = df2.thingstofind.values.ravel()
        return find_index(a, b, invalid_specifier=-1)
Sample run on arrays -
    In [200]: a
    Out[200]: array([ 3,  5,  8,  4,  3,  2,  5,  2, 12,  6,  3,  7])

    In [201]: b
    Out[201]: array([2, 3, 5, 6, 7, 8, 9])

    In [202]: find_index(a, b, invalid_specifier=-1)
    Out[202]: array([ 1,  2,  5, -1,  1,  0,  2,  0, -1,  3,  1,  4])
Sample run on dataframes -
    In [188]: df
    Out[188]:
        wholelist
    0           3
    1           5
    2           8
    3           4
    4           3
    5           2
    6           5
    7           2
    8          12
    9           6
    10          3
    11          7

    In [189]: df2
    Out[189]:
       thingstofind
    0             2
    1             3
    2             5
    3             6
    4             7
    5             8
    6             9

    In [190]: process_dfs(df, df2)
    Out[190]: array([ 1,  2,  5, -1,  1,  0,  2,  0, -1,  3,  1,  4])
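If you'd rather stay in pandas, a minimal sketch of an equivalent lookup uses Index.get_indexer, which returns the position in df2 of each value of df and -1 where there's no match; it only needs the df2 values to be unique, not sorted:

    # Sketch: same lookup via pandas' hash-based Index.get_indexer
    # (assumes the values in df2.thingstofind are unique).
    import pandas as pd

    def process_dfs_pandas(df, df2):
        lookup = pd.Index(df2.thingstofind.values)       # index over the values to find
        return lookup.get_indexer(df.wholelist.values)   # position in df2 for each df value, -1 if absent

On the small dataframes above this should give the same output as process_dfs.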