Friday, March 20, 2026

Apache Airflow 2.10 ushers in a new era of AI data orchestration



Moving data from where it is created to where it can be put to work for analytics and AI isn’t always a straight line. That’s the job of data orchestration technologies like the open-source Apache Airflow project, which help build the data pipelines that get data where it’s needed.

Today the Apache Airflow project is set to release version 2.10, its first major update since Airflow 2.9 debuted in April. Airflow 2.10 introduces hybrid execution, enabling organizations to optimize resource allocation across workloads ranging from straightforward SQL queries to compute-intensive machine learning (ML) tasks. Improved lineage capabilities provide greater visibility into data flows, which is critical for governance and compliance.

Taking it a step further, Astronomer, a leading commercial provider of Apache Airflow software, is updating its Astro platform to integrate the open source dbt-core (Data Build Tool) technology, unifying data orchestration and transformation workflows onto a single platform.

These enhancements are designed to streamline data operations and bridge the gap between traditional data workflows and emerging AI applications. The updates offer enterprises a more flexible approach to data orchestration, addressing the challenges of managing diverse data environments and AI processes.

“If you think about why you’re implementing orchestration in the first place, it’s because you want to coordinate across your data supply chain, you want a centralized view of visibility,” Julian LaNeve, CTO of Astronomer, told VentureBeat.

How Airflow 2.10 Improves Data Orchestration with Hybrid Workload Execution

One of the biggest updates in Airflow 2.10 is the introduction of a feature called hybrid execution.

Before this update, Airflow users had to choose a single execution mode for their entire deployment, typically either the Kubernetes executor or Airflow’s Celery executor. Kubernetes is better suited to heavier compute tasks that require granular control at the individual task level. Celery, on the other hand, is lighter-weight and better suited to simpler tasks.

However, as LaNeve explained, real-world data pipelines often have a mix of workload types. For example, he noted that in an Airflow deployment, an organization might simply need to execute a straightforward SQL query to get data. A machine learning workflow could also connect to the same data pipeline, requiring a more advanced Kubernetes deployment to run. This is now possible with hybrid execution.

The hybrid execution capability is a departure from previous versions of Airflow, which forced users into a one-size-fits-all choice for their entire deployment. Users can now match each component of a data pipeline to the right level of compute resources and control.

“The ability to make choices at the pipeline and job level, rather than having the same execution mode for all of them, I think really opens up a whole new level of flexibility and efficiency for Airflow users,” LaNeve said.
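In Airflow 2.10, this per-task choice surfaces as an executor parameter that can be set on individual tasks. The following is a minimal sketch, not a production pipeline: it assumes an Airflow 2.10+ deployment configured with both a Celery and a Kubernetes executor, and the task names and logic are hypothetical placeholders.

```python
# Hypothetical DAG sketch of Airflow 2.10 hybrid execution.
# Assumes a deployment with both CeleryExecutor and KubernetesExecutor configured.
from airflow.decorators import dag, task
from pendulum import datetime


@dag(start_date=datetime(2024, 8, 1), schedule=None, catchup=False)
def hybrid_pipeline():
    # Lightweight SQL-style work: route to the lighter Celery executor.
    @task(executor="CeleryExecutor")
    def extract_rows():
        # Placeholder for a simple query against a warehouse.
        return ["row1", "row2"]

    # Compute-heavy ML work: route to Kubernetes for per-task resource control.
    @task(executor="KubernetesExecutor")
    def train_model(rows):
        # Placeholder for a resource-intensive training step.
        return f"model trained on {len(rows)} rows"

    train_model(extract_rows())


hybrid_pipeline()
```

Because the executor is now chosen per task rather than per deployment, the SQL step and the ML step in the same pipeline can each run on the infrastructure that suits them.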

Why Data Lineage in Data Orchestration Matters for AI

Understanding where data comes from is the domain of data lineage, a key capability for both traditional data analytics and newer AI workloads.

Before Airflow 2.10, there were limitations in how lineage could be tracked. LaNeve said that with the new lineage features, Airflow can better capture dependencies and data flows in pipelines, even for custom Python code. This improved lineage tracking is critical for AI and machine learning workflows, where data quality and provenance are paramount.
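One way lineage is expressed in Airflow is by annotating tasks with the datasets they read and write; the 2.10 improvements LaNeve describes extend what Airflow can capture automatically beyond such explicit declarations. The sketch below shows the explicit form only; the dataset URIs and task logic are hypothetical.

```python
# Hypothetical sketch: declaring dataset inlets/outlets so Airflow's lineage
# tooling can record which data a task consumes and produces.
from airflow.datasets import Dataset
from airflow.decorators import dag, task
from pendulum import datetime

raw_orders = Dataset("s3://example-bucket/raw/orders.parquet")      # hypothetical URI
clean_orders = Dataset("s3://example-bucket/clean/orders.parquet")  # hypothetical URI


@dag(start_date=datetime(2024, 8, 1), schedule=None, catchup=False)
def lineage_demo():
    @task(inlets=[raw_orders], outlets=[clean_orders])
    def transform():
        # Placeholder transformation; the lineage metadata comes from the
        # inlets/outlets declarations above, not from the function body.
        pass

    transform()


lineage_demo()
```

Declared outlets also feed Airflow’s data-aware scheduling, so downstream pipelines can trigger when a dataset they depend on is updated.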

“A key element of any AI application that people are building today is trust,” LaNeve said.

If an AI system delivers incorrect or unreliable output, users will stop relying on it. Robust lineage information helps solve this problem by providing a clear, auditable trail showing how data was acquired, transformed, and used to train a model. Strong lineage capabilities also enable more comprehensive data governance and security controls around sensitive information used in AI applications.

Looking to the Future of Airflow 3.0

“Data governance, security and privacy are becoming more important than ever as we want to ensure we have full control over how our data is used,” LaNeve said.

While Airflow 2.10 brings some significant improvements, LaNeve is already looking ahead to Airflow 3.0.

The goal of Airflow 3.0, according to LaNeve, is to modernize the technology for the AI era. Key priorities for Airflow 3.0 include making the platform more language-agnostic, allowing users to write tasks in any language, and making Airflow more data-aware, shifting the focus from coordinating processes to managing data flows.

“We want to make sure Airflow sets the standard for orchestration for the next 10 to 15 years,” he said.
