python – Splunk download using dask delayed


I am trying to download data from a splunk instance for a period of time using delayed. I have created a function that works like this:

@delayed
def _execute_query(query, time_period):
    spl_service = Service(username, password, host, port)
    return spl_service(query, time_period)

def execute_splunk_query(query, time_period):
    time_periods = time_period.split(timedelta(minutes=10)
    delayed = []
    for tm_per in time_periods:
        results = _execute_query(query, tm_per)
        delayed.append(results)
    return delayed

Assume data are larger than my laptops memory for that period of time so I want to use dask for it. I use the distributed scheduler so I set up a local cluster like this

from dask.distributed import Client
client = Client()
# setup query and time_period

results = execute_splunk_query(query, time_period)

# i do the following to convert to dataframe
results_bag = dask.bag.from_delayed(results)
data_df = results_bag.to_dataframe()
data_df.persist() # works with client and local cluster

Then this process takes too long and I get a MemoryError. So I have some questions on that:

  1. Doesn’t dask take care of that (Memory) by moving data from memory to disk? Does this work when persisting data.
  2. I see that when the delayed tasks are executed I don’t see any performance boost, like they are executed sequenctially and not in parallel. I can see that is true also in dask dashboard.

Could you please explain the above a little more in details?



لینک منبع

پاسخ دهید

نشانی ایمیل شما منتشر نخواهد شد. بخش‌های موردنیاز علامت‌گذاری شده‌اند *