
Python - Fast Way To Sample Data From Array When Sample Size Changes

I am trying to sample data from a list of integers. The tricky part is that each sample should have a different size, to emulate some other data I have. I am doing this in a for loop right now and am looking for a faster way.

Solution 1:

It seems that np.random.choice is indeed not optimized for sampling without replacement. However, you can get much better performance by using Generator.choice.

I see a 14x speedup for your parameters:

import numpy as np

data = np.arange(10**6)
sample_sizes = np.random.randint(1, 70_000, 100)

def f(data, sample_sizes):
    # baseline: legacy np.random.choice in a plain loop
    result = []
    for s in sample_sizes:
        result.append(np.random.choice(data, s, replace=False))
    return result

def f2(data, sample_sizes):
    # Generator.choice has a much faster replace=False code path
    g = np.random.Generator(np.random.PCG64())
    n = data.shape[0]
    return [data[g.choice(n, k, replace=False)] for k in sample_sizes]

%timeit f(data, sample_sizes)
%timeit f2(data, sample_sizes)
1 loop, best of 3: 5.18 s per loop
1 loop, best of 3: 375 ms per loop
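
For reference, on recent NumPy versions (1.17+) the generator can also be created with the np.random.default_rng() convenience constructor, which returns the same PCG64-backed Generator. A minimal sketch of that variant (f3 is my name for it, not part of the original answer):

import numpy as np

def f3(data, sample_sizes):
    # default_rng() is shorthand for Generator(PCG64())
    rng = np.random.default_rng()
    n = data.shape[0]
    # same approach as f2: sample indices without replacement, then gather
    return [data[rng.choice(n, k, replace=False)] for k in sample_sizes]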

Solution 2:

Depending on your hardware and data sizes, using multiprocessing can give a sizeable speed-up. This needs to be measured for your specific problem setup, however. For example, using multiprocessing.pool.Pool:

from functools import partial
from multiprocessing.pool import Pool

with Pool() as pool:
    result = pool.map(partial(sample, data), sizes)
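
Note that the snippet above assumes data and sizes are already defined. On platforms whose default start method is spawn (Windows, and macOS on recent Python versions), the Pool also has to be created under an if __name__ == '__main__': guard, or each worker will re-import the module and try to start a pool of its own. A self-contained sketch, with hypothetical example sizes:

from functools import partial
from multiprocessing.pool import Pool
from random import sample

# module level, so spawned workers re-create it when they import the module
data = list(range(1_000_000))
sizes = [10, 500, 7_000]  # hypothetical sample sizes for illustration

if __name__ == '__main__':
    with Pool() as pool:
        # partial(sample, data) pins the population argument, so each
        # worker call evaluates sample(data, k) for one k from sizes
        result = pool.map(partial(sample, data), sizes)
        print([len(r) for r in result])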

Performance comparison

Here are some example results (using 4 CPU cores):

from functools import partial
from multiprocessing.pool import Pool
from random import choices, sample
from statistics import mean, stdev
import time


def baseline(data, sizes):
    return [sample(data, k) for k in sizes]


def multiprocessing(data, sizes):
    with Pool(4) as pool:
        return pool.map(partial(sample, data), sizes)


def timeit(f, *args, n=7):
    timings = []
    for __ in range(n):
        t_start = time.time()  # using time because of multiprocessing
        f(*args)
        t_stop = time.time()
        timings.append(t_stop - t_start)
    print(f'[{f.__name__}] {mean(timings):.2f} +/- {stdev(timings):.2f} s')


data = list(range(1_000_000))
sizes = choices(range(max(data) // 100), k=1_000)

timeit(baseline, data, sizes)
timeit(multiprocessing, data, sizes)

which gives:

[baseline] 3.19 +/- 0.07 s
[multiprocessing] 2.10 +/- 0.02 s

But again, this depends on hardware and data, so it needs to be checked on each individual system.
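
The two solutions can also be combined, running the fast Generator.choice sampling from Solution 1 across a pool of workers. A hedged sketch (init_worker and draw are my own names, and whether this beats the single-process Generator version depends on the sample sizes and the cost of shipping results between processes):

import numpy as np
from multiprocessing.pool import Pool

data = np.arange(1_000_000)  # module level, so workers can see it

_rng = None  # one Generator per worker process

def init_worker():
    global _rng
    # default_rng() seeds from OS entropy, so each worker gets an
    # independent stream without explicit seed bookkeeping
    _rng = np.random.default_rng()

def draw(k):
    idx = _rng.choice(data.shape[0], k, replace=False)
    return data[idx]

if __name__ == '__main__':
    sizes = np.random.randint(1, 70_000, 100)
    with Pool(4, initializer=init_worker) as pool:
        result = pool.map(draw, sizes)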
