
Image by the author | Ideogram
As data scientists, we often work with large datasets or complex models that take a long time to run. To save time and get results faster, we use tools that perform tasks in parallel or across many machines. Two popular Python libraries for this are Ray and Dask. Both help speed up data processing and model training, but they are suited to different kinds of tasks.
In this article, we will explain what Ray and Dask are and when to choose each of them.
# What are Dask and Ray?
Dask is a library for processing large amounts of data. It is designed to feel familiar to users of Pandas, NumPy, or scikit-learn. Dask splits data and tasks into smaller parts and runs them in parallel. This makes it ideal for data scientists who want to scale up their data analysis without learning many new concepts.
Ray is a more general tool that helps you build and run distributed applications. It is especially powerful for machine learning and AI.
Ray also has additional libraries built on top of it, for example:
- Ray Tune for hyperparameter tuning in machine learning
- Ray Train for training models on many GPUs
- Ray Serve for deploying models as web services
Ray is great if you want to build scalable machine learning pipelines or deploy AI applications that must perform complex tasks in parallel.
# Feature comparison
A side-by-side comparison of Dask and Ray on their core attributes:
| Feature | Dask | Ray |
|---|---|---|
| Core abstraction | Dataframes, arrays, delayed tasks | Remote functions, actors |
| Best for | Scalable data processing, machine learning pipelines | Distributed training, tuning, and serving of ML models |
| Ease of use | High for Pandas/NumPy users | Moderate, steeper learning curve |
| Ecosystem | Integrates with scikit-learn, XGBoost | Built-in libraries: Tune, Serve, RLlib |
| Scalability | Very good for batch processing | Excellent, with more control and flexibility |
| Scheduling | Work-stealing scheduler | Dynamic, actor-based scheduling |
| Cluster management | Native, or via Kubernetes, YARN | Ray Dashboard, Kubernetes, AWS, GCP |
| Community/maturity | Older, mature, widely adopted | Growing fast, strong support |
# When to use which?
Choose Dask if:
- You work with Pandas/NumPy and want scalability
- You process tabular data or arrays
- You run batch ETL or feature engineering jobs
- You need dataframe or array abstractions with lazy execution
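For custom workflows that do not fit the dataframe or array abstractions, `dask.delayed` offers the same lazy model for plain Python functions. A minimal sketch:

```python
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(a, b):
    return a + b

# Nothing runs yet; Dask only records a task graph.
total = add(inc(1), inc(2))
result = total.compute()  # the two inc() calls can run in parallel
print(result)  # 5
```

Calling `.compute()` walks the graph and executes independent branches in parallel, which is the same mechanism Dask dataframes and arrays use internally.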
Choose Ray if:
- You need to run many independent Python functions in parallel
- You want to build machine learning pipelines, serve models, or manage long-running tasks
- You need microservice-style scaling with stateful tasks
# Ecosystem tools
Both libraries offer or integrate with a range of tools covering the data science life cycle, but with different emphases:
| Task | Dask | Ray |
|---|---|---|
| Dataframes | dask.dataframe | Modin (built on Ray or Dask) |
| Arrays | dask.array | No native support; relies on NumPy |
| Hyperparameter tuning | Manual or with Dask-ML | Ray Tune (advanced features) |
| Machine learning pipelines | dask-ml, custom workflows | Ray Train, Ray Tune, Ray AIR |
| Model serving | Custom Flask/FastAPI setup | Ray Serve |
| Reinforcement learning | Not supported | RLlib |
| Dashboard | Built-in, very detailed | Built-in, simplified |
# Real-world scenarios
// Large-scale data cleaning and feature engineering
Use Dask.
Why? Dask integrates smoothly with Pandas and NumPy, and many data teams already use these tools. If your dataset is too large to fit in memory, Dask can divide it into smaller partitions and process them in parallel. This helps with tasks such as data cleaning and creating new features.
Example:
```python
import dask.dataframe as dd
import numpy as np

# Read many CSV files in parallel; '*' matches each file in the bucket.
df = dd.read_csv('s3://data/large-dataset-*.csv')
df = df[df['amount'] > 100]
# Apply the log per partition to avoid materializing the full column.
df['log_amount'] = df['amount'].map_partitions(np.log)
df.to_parquet('s3://processed/output/')
```
This code reads many large CSV files from an S3 bucket in parallel using Dask. It filters rows where the amount is greater than 100, applies a log transformation, and saves the result as Parquet files.
// Parallel hyperparameter tuning for machine learning models
Use Ray.
Why? Ray Tune is great for trying different settings while training machine learning models. It integrates with tools such as PyTorch and XGBoost, and it can stop bad runs early to save time.
Example:
```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_fn(config):
    # Model training logic here
    ...

tune.run(
    train_fn,
    config={"lr": tune.grid_search([0.01, 0.001, 0.0001])},
    scheduler=ASHAScheduler(metric="accuracy", mode="max")
)
```
This code defines a training function and uses Ray Tune to test several learning rates. The ASHA scheduler automatically schedules the trials, stops poor ones early, and keeps the best configuration.
// Distributed array computations
Use Dask.
Why? Dask arrays are helpful when working with large numerical datasets. Dask splits the array into blocks and processes them in parallel.
Example:
```python
import dask.array as da

# A 10,000 x 10,000 array stored as 100 blocks of 1,000 x 1,000.
x = da.random.random((10000, 10000), chunks=(1000, 1000))
y = x.mean(axis=0).compute()
```
This code creates a large random array divided into chunks that can be processed in parallel. It then computes the mean of each column using Dask's parallel computing power.
// Building an end-to-end machine learning service
Use Ray.
Why? Ray is designed not only for training models but also for serving them and managing their life cycle. With Ray Serve, you can deploy models to production, run preprocessing logic in parallel, and even scale stateful actors.
Example:
```python
from ray import serve

@serve.deployment
class ModelDeployment:
    def __init__(self):
        # load_model() is assumed to be defined elsewhere and to
        # return a trained model with a .predict() method.
        self.model = load_model()

    def __call__(self, request_body):
        data = request_body
        return self.model.predict([data])[0]

serve.run(ModelDeployment.bind())
```
This code defines a class that loads a machine learning model and exposes it as an API using Ray Serve. The class receives a request, makes a prediction with the model, and returns the result.
# Final recommendations
| Use case | Recommended tool |
|---|---|
| Scalable data analysis (pandas-style) | Dask |
| Large-scale machine learning training | Ray |
| Hyperparameter optimization | Ray |
| Out-of-core data computation | Dask |
| Real-time machine learning model serving | Ray |
| Custom highly parallel pipelines | Ray |
| Integration with the PyData stack | Dask |
# Conclusion
Ray and Dask are tools that help data scientists handle large amounts of data and run programs faster. Ray is good for tasks that require high flexibility, such as machine learning projects. Dask is useful if you want to work with large datasets using tools similar to Pandas or NumPy.
Which one you choose depends on your project and the type of data you work with. It is worth trying both on small examples to see which suits your workflow better.
Jayita Gulati is a machine learning enthusiast and technical writer driven by her passion for building machine learning models. She holds a master's degree in computer science from the University of Liverpool.
