Faster AUC in sklearn or Python
Solution 1:
Is there a faster/better way to do this?
Since the calculation of each true/pred pair is independent (if I understood your setup), you should be able to reduce total processing time by using multiprocessing, effectively parallelizing the calculations:
import multiprocessing as mp

from sklearn import metrics

def roc(v):
    """Calculate one true/pred pair, return (index, auc)."""
    i, true, pred = v
    fpr, tpr, thresholds = metrics.roc_curve(true, pred, drop_intermediate=True)
    auc = metrics.auc(fpr, tpr)
    return i, auc

pool = mp.Pool(3)
result = pool.map_async(roc, ((i, true[i], pred[i]) for i in range(2)))
pool.close()
pool.join()
print(result.get())
=>
[(0, 1.0), (1, 0.83333333333333326)]
Here Pool(3) creates a pool of 3 worker processes, and .map_async maps all true/pred pairs to the roc function, passing one pair at a time. The index is sent along so that results can be mapped back to their inputs.
If the true/pred pairs are too large to serialize and send to the processes, you might need to write the data into some external data structure before calling roc, pass it just the index i, and read the data for each pair true[i]/pred[i] from within roc before processing.
A Pool automatically manages the scheduling of its processes. To reduce the risk of a memory hog, you might want to pass the maxtasksperchild=1 parameter to Pool(...), which starts a fresh process for each true/pred pair (choose any other number as you see fit).
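For example:

# recycle each worker after one task to cap per-process memory growth
pool = mp.Pool(3, maxtasksperchild=1)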
Update
I'm stuck on a machine with only two processors
Naturally, that is a limiting factor. However, considering the availability of cloud computing resources at very reasonable cost, where you pay only for the time you actually need, you might want to look at hardware alternatives before spending eons of hours optimizing a calculation that can be parallelized so effectively. That's a luxury in its own right, really.
Solution 2:
find a better way to vectorize the AUC calculation so that it can be broadcasted across multiple rows
Probably not: sklearn already uses efficient numpy operations for the relevant parts of its calculation:
# -- calculate tps, fps, thresholds
# sklearn.metrics.ranking:_binary_clf_curve()
(...)
distinct_value_indices = np.where(np.logical_not(isclose(
    np.diff(y_score), 0)))[0]
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]

# accumulate the true positives with decreasing threshold
tps = (y_true * weight).cumsum()[threshold_idxs]

if sample_weight is not None:
    fps = weight.cumsum()[threshold_idxs] - tps
else:
    fps = 1 + threshold_idxs - tps
return fps, tps, y_score[threshold_idxs]
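To see what those lines do, here is a tiny worked example with made-up scores, assuming unit sample weights (so the weight term drops out) and replacing isclose with a plain inequality for brevity:

import numpy as np

y_true = np.array([0, 1, 1, 0])
y_score = np.array([0.1, 0.8, 0.4, 0.35])

# sort by decreasing score, as _binary_clf_curve does before this excerpt
desc = np.argsort(y_score)[::-1]
y_true, y_score = y_true[desc], y_score[desc]

distinct_value_indices = np.where(np.diff(y_score) != 0)[0]
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]

tps = y_true.cumsum()[threshold_idxs]   # true positives per threshold
fps = 1 + threshold_idxs - tps          # false positives per threshold
print(fps, tps)                         # -> [0 0 1 2] [1 2 2 2]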
# -- calculate auc
# sklearn.metrics.ranking:auc()
...
area = direction * np.trapz(y, x)
...
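As a quick illustration of that last line, metrics.auc is just a trapezoidal integration over the curve points, which you can reproduce directly (a tiny sketch with made-up fpr/tpr values):

import numpy as np
from sklearn import metrics

fpr = np.array([0.0, 0.5, 1.0])
tpr = np.array([0.0, 0.75, 1.0])

# metrics.auc(x, y) and the raw trapezoidal rule agree
assert np.isclose(metrics.auc(fpr, tpr), np.trapz(tpr, fpr))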
You might be able to optimize this by profiling these functions and removing operations that you can apply more efficiently beforehand. A quick profiling of your example input scaled to 5M rows reveals a few potential bottlenecks (marked >>> below):
# your for ... loop wrapped in function roc()
%prun -s cumulative roc()
722 function calls (718 primitive calls) in 5.005 seconds

Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    5.005    5.005 <string>:1(<module>)
        1    0.000    0.000    5.005    5.005 <ipython-input-51-27e30c04d997>:1(roc)
        2    0.050    0.025    5.004    2.502 ranking.py:417(roc_curve)
        2    0.694    0.347    4.954    2.477 ranking.py:256(_binary_clf_curve)
>>>     2    0.000    0.000    2.356    1.178 fromnumeric.py:823(argsort)
>>>     2    2.356    1.178    2.356    1.178 {method 'argsort' of 'numpy.ndarray' objects}
        6    0.062    0.010    0.961    0.160 arraysetops.py:96(unique)
>>>     6    0.750    0.125    0.750    0.125 {method 'sort' of 'numpy.ndarray' objects}
>>>     2    0.181    0.090    0.570    0.285 numeric.py:2281(isclose)
        2    0.244    0.122    0.386    0.193 numeric.py:2340(within_tol)
        2    0.214    0.107    0.214    0.107 {method 'cumsum' of 'numpy.ndarray' objects}
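Since the sorting calls dominate that profile, one direction worth exploring is to compute the AUC directly from the rank-sum (Mann-Whitney) identity, which needs a single sort per pair and skips building the full curve. This is a sketch only, assuming binary 0/1 labels and no tied scores, not a drop-in replacement for roc_curve:

import numpy as np

def auc_rank_sum(y_true, y_score):
    """AUC via the rank-sum identity (no full ROC curve).

    Assumes distinct scores; tied scores would need averaged ranks.
    """
    order = np.argsort(y_score)              # the single O(n log n) step
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    # count of correctly ordered (pos, neg) pairs, normalized to [0, 1]
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)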