I am trying to download data from a Splunk instance for a given period of time using `dask.delayed`. I have created a function that works like this:
```python
from datetime import timedelta
from dask import delayed

@delayed
def _execute_query(query, time_period):
    # Service is my thin wrapper around the Splunk API
    spl_service = Service(username, password, host, port)
    return spl_service(query, time_period)

def execute_splunk_query(query, time_period):
    # split the overall window into 10-minute chunks
    time_periods = time_period.split(timedelta(minutes=10))
    delayed_results = []
    for tm_per in time_periods:
        results = _execute_query(query, tm_per)
        delayed_results.append(results)
    return delayed_results
```
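For context, `time_period` is a small helper object of my own; its `split` method just chunks the overall window into fixed-size intervals, roughly like this sketch (the names here are illustrative, not my actual code):

```python
from datetime import datetime, timedelta

def split_window(start, end, step=timedelta(minutes=10)):
    """Yield (chunk_start, chunk_end) pairs covering [start, end)."""
    current = start
    while current < end:
        chunk_end = min(current + step, end)
        yield (current, chunk_end)
        current = chunk_end

# e.g. list(split_window(datetime(2020, 1, 1), datetime(2020, 1, 1, 1)))
# -> six 10-minute windows
```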
Assume the data for that period are larger than my laptop's memory, which is why I want to use Dask. I use the distributed scheduler, so I set up a local cluster like this:
```python
import dask.bag
from dask.distributed import Client

client = Client()  # local cluster

# set up query and time_period
results = execute_splunk_query(query, time_period)

# convert the list of delayed results to a dataframe
results_bag = dask.bag.from_delayed(results)
data_df = results_bag.to_dataframe()
data_df = data_df.persist()  # persist returns a new collection; works with the local cluster
```
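For what it's worth, if keeping everything in cluster memory is the problem, I understand I could stream the result to disk instead of persisting. A sketch, assuming Parquet output is acceptable (the path is illustrative):

```python
# Write each partition to disk as it is computed, instead of
# holding the whole dataframe in cluster memory with persist().
data_df.to_parquet("splunk_results/", engine="pyarrow")
```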
This process takes too long, and eventually I get a MemoryError. So I have some questions:
- Doesn't Dask take care of memory pressure by spilling data from memory to disk? Does that still work when persisting data?
- When the delayed tasks are executed I don't see any performance boost; they seem to run sequentially rather than in parallel, and the Dask dashboard confirms this. (A sketch of the cluster configuration I have in mind follows this list.)
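In case the default local-cluster settings matter here, this is a sketch of how I could configure the workers explicitly; all values are illustrative. As far as I know, each worker starts spilling to its local disk as it approaches its `memory_limit`, and `n_workers` / `threads_per_worker` bound how many query tasks can run concurrently:

```python
from dask.distributed import Client

# Explicit local-cluster setup; the numbers are illustrative.
client = Client(
    n_workers=4,            # four worker processes
    threads_per_worker=1,   # one task at a time per worker
    memory_limit="2GB",     # workers spill to disk as they near this
)

print(client.dashboard_link)  # watch the task stream for parallelism
```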
Could you please explain the above in a little more detail?