Tuesday, March 10, 2026

From logs to insights: The AI breakthrough that redefines observability


Presented by Elastic


Logs are set to become the primary tool for finding the “why” when diagnosing network incidents

Modern IT environments have a data problem: there is too much of it. Organizations managing enterprise environments are increasingly challenged to detect and diagnose problems in real time, optimize performance, improve reliability, and ensure security and compliance – all within tight budgets.

Today’s observability landscape is crowded with tools offering solutions. Most of them focus on DevOps teams or Site Reliability Engineers (SREs) analyzing logs, metrics, and traces to spot patterns, understand what is happening on the network, and diagnose the cause of a problem or incident. The trouble is that this process creates information overload: a single Kubernetes cluster can emit 30 to 50 gigabytes of logs per day, and suspicious behavior patterns can slip past human eyes.

“Right now in the world of artificial intelligence, it’s an anachronism to think of humans observing infrastructure,” says Ken Exner, chief product officer at Elastic. “I hate to break it to you, but machines are better than humans at pattern matching.”

The industry-wide focus on visualizing symptoms forces engineers to search for answers manually. The key “why” is hidden in the logs, but because logs contain huge amounts of unstructured data, the industry treats them as a tool of last resort. That forces teams into costly trade-offs: spend countless hours building intricate data pipelines, strip valuable data out of logs and risk critical visibility gaps, or simply log and forget.

Search AI company Elastic recently released a new observability feature called Streams, which aims to become the go-to signal for investigations by taking noisy logs and transforming them into patterns, context, and meaning.

Streams uses AI to automatically partition and parse raw logs, extracting relevant fields and significantly reducing the effort SREs spend making logs usable. Streams also automatically surfaces significant events, such as critical errors and anomalies, from the surrounding log context, giving SREs early warnings and a clear understanding of their workloads so they can investigate and resolve issues faster. The ultimate goal is to recommend remediation steps.

“Streams takes raw, voluminous, unstructured data, automatically structures it into a usable form, automatically alerts you to problems, and helps you fix them,” says Exner. “That’s the magic of Streams.”
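To make that concrete, here is a minimal sketch of the kind of transformation involved: turning an unstructured log line into structured, queryable fields. The log format, field names, and pattern below are hypothetical illustrations, not Elastic’s implementation – and the hand-written regex is precisely the manual effort Streams is meant to eliminate by inferring structure automatically.

```python
import re

# Hypothetical raw log line from an application container.
RAW_LOG = "2026-03-10T14:02:17Z ERROR payments-svc request_id=9f3c latency_ms=5120 upstream=db-primary timeout"

# A hand-maintained pattern like this is the manual work Streams automates.
PATTERN = re.compile(
    r"(?P<timestamp>\S+)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<service>\S+)\s+"
    r"request_id=(?P<request_id>\S+)\s+"
    r"latency_ms=(?P<latency_ms>\d+)\s+"
    r"upstream=(?P<upstream>\S+)\s+"
    r"(?P<message>.*)"
)

def parse(line: str) -> dict:
    """Extract structured fields from a raw log line; fall back to the raw text."""
    match = PATTERN.match(line)
    return match.groupdict() if match else {"message": line}

print(parse(RAW_LOG)["latency_ms"])  # -> "5120", now a queryable field
```

Once fields like latency_ms and upstream are first-class, critical errors and anomalies become signals a system can query and alert on, rather than text a human has to grep.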

Broken workflow

Streams upends an observability process that some say is broken. Typically, SREs instrument metrics, logs, and traces. They then configure alerts and service-level objectives (SLOs) – often hard-coded rules that fire when a service or process exceeds a threshold or matches a particular pattern.
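For illustration, a hard-coded rule of this kind might look like the following sketch – a simple rolling threshold check written in plain Python rather than any particular alerting product’s syntax; the metric, threshold, and window are assumptions:

```python
from statistics import mean

ERROR_RATE_THRESHOLD = 0.05  # hypothetical SLO: alert above a 5% error rate
WINDOW = 5                   # evaluate over the last five samples

def slo_breached(samples: list[float]) -> bool:
    """Return True when the rolling average error rate exceeds the threshold."""
    return mean(samples[-WINDOW:]) > ERROR_RATE_THRESHOLD

# Simulated per-minute error rates; the spike at the end trips the rule.
error_rates = [0.01, 0.02, 0.01, 0.03, 0.02, 0.04, 0.09, 0.12]
if slo_breached(error_rates):
    print("ALERT: error-rate SLO breached - paging the on-call SRE")
```

The brittleness is the point: every threshold and pattern has to be anticipated and encoded by hand, and the rule only reports the symptom, not the cause.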

When an alert fires, it points to the metric showing an anomaly. From there, SREs open a metrics dashboard where they can visualize the problem, compare the alert against other metrics such as CPU, memory, and I/O, and start looking for patterns.

They may then need to inspect traces and check upstream and downstream dependencies in the application to find the root cause. Once they figure out what is causing the problem, they go to the logs of that database or service to try to debug it.

Some companies simply add more tools when the current ones prove ineffective, leaving SREs jumping from tool to tool to monitor and troubleshoot issues across their infrastructure and applications.

“You’re jumping between different tools. You’re relying on a human to interpret these things, visually check the relationships between systems in the service map, visually look at the charts in the dashboard to figure out where and what the problem is,” Exner says. “But AI is automating that workflow.”

With AI-powered Streams, logs are used not only reactively to troubleshoot issues but also proactively: surfacing potential problems and generating information-rich alerts that help teams get straight to troubleshooting, proposing a fix, or even resolving the issue outright and then notifying the team that it has been addressed.

“I believe that logs, which are the richest set of information and the original signal type, will start to drive much of the automation that a site reliability engineer typically performs manually today,” he adds. “A human shouldn’t have to be in the loop, digging through data to figure out where and what the problem is, and then, once they find the root cause, trying to find a way to fix it.”

The future of observability

Large language models (LLMs) can play a key role in the future of observability. LLMs excel at recognizing patterns in massive amounts of repetitive data, which closely resembles log and telemetry data in complex, dynamic systems. Today’s LLMs can also be trained on specific IT processes; paired with automation tools, an LLM has the information and the means to resolve anything from database errors to Java heap problems. Embedding them in platforms that provide context and meaning will be essential.
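As one hypothetical illustration of that pairing – an assumption for this sketch, not a description of Elastic’s product – a team might collapse repetitive log lines into templates and hand only the rare outliers to an LLM for a suggested diagnosis; ask_llm below is a placeholder for whichever model endpoint a team actually uses:

```python
import re
from collections import Counter

def template(line: str) -> str:
    """Collapse variable tokens (numbers, hex ids) so repeated events share one template."""
    return re.sub(r"0x[0-9a-fA-F]+|\b\d+\b", "<VAR>", line)

def rare_events(lines: list[str], max_count: int = 2) -> list[str]:
    """Return lines whose template appears rarely - candidate anomalies for the LLM."""
    counts = Counter(template(l) for l in lines)
    return [l for l in lines if counts[template(l)] <= max_count]

def ask_llm(prompt: str) -> str:
    """Placeholder: in practice, call whichever LLM endpoint the team has chosen."""
    return f"[model response to {len(prompt)} characters of context]"

logs = [
    "conn 41 accepted from 10.0.0.7",
    "conn 42 accepted from 10.0.0.9",
    "conn 43 accepted from 10.0.0.4",
    "OutOfMemoryError: Java heap space in worker 7",
]
anomalies = rare_events(logs)  # only the Java heap error survives the filter
print(ask_llm(
    "These log lines are rare outliers in a Kubernetes workload. "
    "Explain the likely root cause and suggest a remediation:\n" + "\n".join(anomalies)
))
```

Pre-filtering matters because, at tens of gigabytes of logs per day, no model can read everything: the clustering does the pattern matching machines are good at, and the LLM supplies the diagnosis.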

Exner says fully automatic remediation will take time, but LLM-generated runbooks and playbooks will become standard practice within the next few years. In other words, the LLM will propose remediation steps and a human will verify and apply them, rather than calling in an expert.

Addressing the skills gap

Applying AI holistically to observability could also help address the critical shortage of talent needed to manage IT infrastructure. Hiring is slow because organizations need teams with extensive experience and an understanding of potential problems and how to solve them quickly. Exner says that experience can come from a context-aware LLM instead.

“We can help tackle the skills gap by giving people LLM-powered skills that make everyone an instant expert,” he explains. “I think it will make it much easier for us to take entry-level practitioners and make them experts in both security and observability, and it will enable more novice practitioners to act like experts.”

Streams is now available in Elastic Observability. To get started, read more about Streams.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they are always clearly marked. For more information, please contact sales@venturebeat.com.
