Faster AUC in sklearn or Python
Solution 1:
Is there a faster/better way to do this?
Since the calculation of each true/pred pair is independent (if I understood your setup), you should be able to reduce total processing time by using multiprocessing, effectively parallelizing the calculations:
import multiprocessing as mp

from sklearn import metrics

def roc(v):
    """Calculate one true/pred pair, return (index, auc)."""
    i, true, pred = v
    fpr, tpr, thresholds = metrics.roc_curve(true, pred, drop_intermediate=True)
    auc = metrics.auc(fpr, tpr)
    return i, auc

pool = mp.Pool(3)
result = pool.map_async(roc, ((i, true[i], pred[i]) for i in range(2)))
pool.close()
pool.join()
print(result.get())
=>
[(0, 1.0), (1, 0.83333333333333326)]
Here Pool(3) creates a pool of 3 worker processes, and .map_async maps all true/pred pairs to the roc function, passing one pair at a time. The index is sent along so that results can be mapped back to their inputs.
If the true/pred pairs are too large to serialize and send to the processes, you might need to write the data into some external data structure before calling roc, pass it just the index i, and read the data for each pair true[i]/pred[i] from within roc before processing.
A Pool automatically manages the scheduling of its processes. To reduce the risk of a memory hog, you might want to pass the maxtasksperchild=1 parameter to Pool(...), which starts a fresh process for each true/pred pair (choose any other number as you see fit).
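For example:

# recycle each worker after one task to cap per-process memory growth
pool = mp.Pool(3, maxtasksperchild=1)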
Update
I'm stuck on a machine with only two processors
Naturally, that is a limiting factor. However, considering the availability of cloud computing resources at very reasonable cost, where you pay only for the time you actually need, you might want to look at hardware alternatives before spending eons of hours optimizing a calculation that can be parallelized so effectively. That's a luxury in its own right, really.
Solution 2:
find a better way to vectorize the AUC calculation so that it can be broadcasted across multiple rows
Probably not: sklearn already uses efficient numpy operations for the relevant parts of its calculation:
# -- calculate tps, fps, thresholds
# sklearn.metrics.ranking:_binary_clf_curve()
(...)
distinct_value_indices = np.where(np.logical_not(isclose(
    np.diff(y_score), 0)))[0]
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]

# accumulate the true positives with decreasing threshold
tps = (y_true * weight).cumsum()[threshold_idxs]

if sample_weight is not None:
    fps = weight.cumsum()[threshold_idxs] - tps
else:
    fps = 1 + threshold_idxs - tps
return fps, tps, y_score[threshold_idxs]
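To see what those lines do, here is a tiny worked example with made-up scores, assuming unit sample weights (so the weight term drops out) and replacing isclose with a plain inequality for brevity:

import numpy as np

y_true = np.array([0, 1, 1, 0])
y_score = np.array([0.1, 0.8, 0.4, 0.35])

# sort by decreasing score, as _binary_clf_curve does before this excerpt
desc = np.argsort(y_score)[::-1]
y_true, y_score = y_true[desc], y_score[desc]

distinct_value_indices = np.where(np.diff(y_score) != 0)[0]
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]

tps = y_true.cumsum()[threshold_idxs]   # true positives per threshold
fps = 1 + threshold_idxs - tps          # false positives per threshold
print(fps, tps)                         # -> [0 0 1 2] [1 2 2 2]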
# -- calculate auc
# sklearn.metrics.ranking:auc()
...
area = direction * np.trapz(y, x)
...
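As a quick illustration of that last line, metrics.auc is just a trapezoidal integration over the curve points, which you can reproduce directly (a tiny sketch with made-up fpr/tpr values):

import numpy as np
from sklearn import metrics

fpr = np.array([0.0, 0.5, 1.0])
tpr = np.array([0.0, 0.75, 1.0])

# metrics.auc(x, y) and the raw trapezoidal rule agree
assert np.isclose(metrics.auc(fpr, tpr), np.trapz(tpr, fpr))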
You might be able to optimize this by profiling these functions and removing operations that you can apply more efficiently beforehand. A quick profiling of your example input scaled to 5M rows reveals a few potential bottlenecks (marked >>> below):
# your for ... loop wrapped in function roc()
%prun -s cumulative roc()
722 function calls (718 primitive calls) in 5.005 seconds

Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    5.005    5.005 <string>:1(<module>)
        1    0.000    0.000    5.005    5.005 <ipython-input-51-27e30c04d997>:1(roc)
        2    0.050    0.025    5.004    2.502 ranking.py:417(roc_curve)
        2    0.694    0.347    4.954    2.477 ranking.py:256(_binary_clf_curve)
>>>     2    0.000    0.000    2.356    1.178 fromnumeric.py:823(argsort)
>>>     2    2.356    1.178    2.356    1.178 {method 'argsort' of 'numpy.ndarray' objects}
        6    0.062    0.010    0.961    0.160 arraysetops.py:96(unique)
>>>     6    0.750    0.125    0.750    0.125 {method 'sort' of 'numpy.ndarray' objects}
>>>     2    0.181    0.090    0.570    0.285 numeric.py:2281(isclose)
        2    0.244    0.122    0.386    0.193 numeric.py:2340(within_tol)
        2    0.214    0.107    0.214    0.107 {method 'cumsum' of 'numpy.ndarray' objects}
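Since the sorting calls dominate that profile, one direction worth exploring is to compute the AUC directly from the rank-sum (Mann-Whitney) identity, which needs a single sort per pair and skips building the full curve. This is a sketch only, assuming binary 0/1 labels and no tied scores, not a drop-in replacement for roc_curve:

import numpy as np

def auc_rank_sum(y_true, y_score):
    """AUC via the rank-sum identity (no full ROC curve).

    Assumes distinct scores; tied scores would need averaged ranks.
    """
    order = np.argsort(y_score)              # the single O(n log n) step
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    # count of correctly ordered (pos, neg) pairs, normalized to [0, 1]
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)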