python - Efficiently counting duplicate values in a numpy column and appending the counts


I have a dataset representing a directed graph. The first column is the source node, the second column is the target node, and the third column can be ignored (it is essentially a weight). For example:

    0 1 3
    0 13 1
    0 37 1
    0 51 1
    0 438481 1
    1 0 3
    1 4 354
    1 10 2602
    1 11 2689
    1 12 1
    1 18 345
    1 19 311
    1 23 1
    1 24 366
    ...

What I would like to do is append the out-degree of each node. For example, if I added the out-degree for node 0, I would have:

    0 1 3 5
    0 13 1 5
    0 37 1 5
    0 51 1 5
    0 438481 1 5
    1 0 3
    ...

I have code that does this, but it is extremely slow because I am using a for loop:

    import numpy as np

    def save_degrees(x):
        # Append a zero-filled column that will hold the out-degrees.
        new_col = np.zeros(x.shape[0], dtype=int)
        x = np.column_stack((x, new_col))
        node_ids, degrees = np.unique(x[:, 0], return_counts=True)
        # Slow part: one full scan of the array per unique node.
        for node_id, deg in zip(node_ids, degrees):
            indices = x[:, 0] == node_id
            x[:, -1][indices] = deg
        return x

    train_x = np.load('data/train_x.npy')
    train_x = save_degrees(train_x)
    np.save('data/train_x_degrees.npy', train_x)

Is there a more efficient way to build this data structure?

You can use numpy.unique.

Suppose the input data is in the array data:

    In [245]: data
    Out[245]:
    array([[     0,      1,      3],
           [     0,     13,      1],
           [     0,     37,      1],
           [     0,     51,      1],
           [     0, 438481,      1],
           [     1,      0,      3],
           [     1,      4,    354],
           [     1,     10,   2602],
           [     1,     11,   2689],
           [     1,     12,      1],
           [     1,     18,    345],
           [     1,     19,    311],
           [     1,     23,      1],
           [     1,     24,    366],
           [     2,     10,      1],
           [     2,     13,      3],
           [     2,     99,      5],
           [     3,     25,     13],
           [     3,     99,     15]])

Find the unique values in the first column, along with the "inverse" array and the counts of occurrences of each unique value:

    In [246]: nodes, inv, counts = np.unique(data[:, 0], return_inverse=True, return_counts=True)
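To make the role of the "inverse" array concrete, here is a minimal sketch on a made-up source column (the array `sources` is illustrative and not part of the data above). Each entry of `inv` is the position of that row's value within the sorted unique values, so indexing `nodes` with `inv` rebuilds the original column.

```python
import numpy as np

# Hypothetical source-node column, used only for illustration.
sources = np.array([0, 0, 0, 1, 1, 2])

nodes, inv, counts = np.unique(sources, return_inverse=True, return_counts=True)

# inv[i] is the index into `nodes` of the value found in row i,
# so nodes[inv] reproduces the original column.
rebuilt = nodes[inv]

print(nodes)    # sorted unique node ids
print(inv)      # per-row index into `nodes`
print(counts)   # number of rows for each unique node id
```

Because `inv` has one entry per row, any per-node array indexed with it is expanded to one value per row.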

Your column of out-degrees is counts[inv]:

    In [247]: out_degrees = counts[inv]

    In [248]: out_degrees
    Out[248]: array([5, 5, 5, 5, 5, 9, 9, 9, 9, 9, 9, 9, 9, 9, 3, 3, 3, 2, 2])
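Putting the pieces together, a loop-free replacement for save_degrees from the question might look like the sketch below. The function name `append_out_degrees` and the small `edges` array are assumptions made for illustration; only `np.unique` and `np.column_stack` come from the question and answer.

```python
import numpy as np

def append_out_degrees(edges):
    """Return a copy of `edges` with each row's source out-degree as the last column."""
    # counts holds one out-degree per unique source node; inv maps each row
    # to its source node's position in the sorted unique array.
    _, inv, counts = np.unique(edges[:, 0], return_inverse=True, return_counts=True)
    # counts[inv] expands the per-node counts to one value per row.
    return np.column_stack((edges, counts[inv]))

# Hypothetical edge list: columns are (source, target, weight).
edges = np.array([[0, 1, 3],
                  [0, 13, 1],
                  [1, 0, 3],
                  [1, 4, 354],
                  [1, 10, 2602],
                  [2, 10, 1]])
result = append_out_degrees(edges)
print(result)
```

This sorts the source column once inside `np.unique` rather than scanning the whole array once per node, which is where the loop version spends its time.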

This assumes that a pair (source_node, target_node) does not occur more than once in the data array; otherwise a repeated edge is counted once for every row in which it appears.
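If that assumption does not hold, one possible workaround is to count distinct targets per source instead of rows. The sketch below is an illustration under that assumption, not part of the original answer; the helper name `distinct_out_degrees` and the sample array are hypothetical.

```python
import numpy as np

def distinct_out_degrees(edges):
    """Per-row out-degree, counting each (source, target) pair only once."""
    # Collapse repeated (source, target) pairs into a single row each.
    pairs = np.unique(edges[:, :2], axis=0)
    # Count the distinct targets for every source node.
    nodes, counts = np.unique(pairs[:, 0], return_counts=True)
    # `nodes` is sorted and contains every source id in `edges`,
    # so searchsorted returns each row's exact position in `nodes`.
    return counts[np.searchsorted(nodes, edges[:, 0])]

# Hypothetical edge list in which the edge 0 -> 1 appears twice.
edges = np.array([[0, 1, 3],
                  [0, 1, 7],
                  [0, 2, 1],
                  [1, 0, 3]])
degrees = distinct_out_degrees(edges)
print(degrees)
```

Here node 0 gets an out-degree of 2 rather than 3, because the duplicated edge to node 1 is counted once.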

