A beginner's guide to data extraction with LangExtract and LLMs

Photo by the author

# Introduction

Did you know that a lot of valuable information still lives in unstructured text, such as research articles, clinical notes, and financial reports? Extracting reliable, structured information from these texts has always been a challenge. LangExtract is an open source Python library (published by Google) that tackles this problem using large language models (LLMs). You define what to extract with a simple prompt and a few examples, and it then uses an LLM (such as Google Gemini, OpenAI, or a local model) to extract that information from documents of any length. It is also useful for its handling of very long documents (via chunking and multi-pass processing) and its interactive visualization of results. Let’s look at this library in more detail.

# 1. Installation and configuration

To install LangExtract locally, first make sure you have Python 3.10+ installed. The library is available on PyPI and is installed with pip install langextract.

For an isolated environment, first create and activate a virtual environment, then install the package:

python -m venv langextract_env
source langextract_env/bin/activate  # On Windows: langextract_env\Scripts\activate
pip install langextract

Installing from source and running with Docker are also supported; see the LangExtract repository for details.
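Before going further, it can help to confirm that the interpreter meets the 3.10+ requirement and that the package is visible on the import path. The following is a minimal sketch using only the standard library; the helper names python_ok and is_installed are hypothetical and not part of LangExtract.

```python
import importlib.util
import sys


def python_ok(version_info=sys.version_info) -> bool:
    """Return True if the interpreter is Python 3.10 or newer."""
    return tuple(version_info[:2]) >= (3, 10)


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found (without importing it)."""
    return importlib.util.find_spec(package) is not None


print("Python 3.10+:", python_ok())
print("langextract installed:", is_installed("langextract"))
```

If the second line prints False inside an activated virtual environment, the package was most likely installed into a different interpreter.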

# 2. Configuring API keys (for cloud models)

LangExtract itself is free and open source, but if you use cloud-hosted LLMs (such as Google Gemini or OpenAI GPT models), you must provide an API key. You can set the LANGEXTRACT_API_KEY environment variable or store it in a .env file in your working directory. For example:

export LANGEXTRACT_API_KEY="YOUR_API_KEY_HERE"

or in a .env file:

cat >> .env << 'EOF'
LANGEXTRACT_API_KEY=your-api-key-here
EOF
echo '.env' >> .gitignore

Local LLMs run through Ollama or other local backends do not require an API key. To use OpenAI models, run pip install langextract[openai], set your OPENAI_API_KEY, and pass an OpenAI model_id. For Vertex AI (enterprise users), service account authentication is supported.
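A quick way to catch a missing key before making any model calls is to look it up yourself. The sketch below is a hypothetical helper (load_api_key is not a LangExtract function) that checks the environment first and then falls back to a simple KEY=VALUE .env file. It assumes a plain .env format with no export prefixes or multi-line values.

```python
import os
from pathlib import Path
from typing import Optional


def load_api_key(name: str = "LANGEXTRACT_API_KEY", env_path: str = ".env") -> Optional[str]:
    """Return the key from the environment, else from a KEY=VALUE .env file."""
    value = os.environ.get(name)
    if value:
        return value
    path = Path(env_path)
    if not path.is_file():
        return None
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == name:
            # Drop optional surrounding quotes; treat an empty value as missing.
            return val.strip().strip("'\"") or None
    return None


api_key = load_api_key()
if api_key is None:
    print("LANGEXTRACT_API_KEY is not set; cloud models will fail to authenticate.")
```

Failing early like this gives a clearer message than an authentication error raised in the middle of a long extraction run.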

# 3. Defining the extraction task

You tell LangExtract what information to extract by writing a clear prompt description and providing one or more ExampleData annotations that show what a correct extraction looks like on sample text. For example, to extract characters, emotions, and relationships from a passage of literature, you might write:

import langextract as lx

prompt = """
  Extract characters, emotions, and relationships in order of appearance.
  Use exact text for extractions. Do not paraphrase or overlap entities.
  Provide meaningful attributes for each entity to add context."""
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? ...",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            )
        ]
    )
]

These examples (taken from the LangExtract README) tell the model exactly what kind of structured output is expected. You can create similar examples for your domain.
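Because the prompt asks for exact text in order of appearance, it is worth checking that your own examples follow the same rule. The sketch below is a hypothetical helper (check_example is not part of LangExtract) that takes the example text and the list of extraction_text strings as plain Python values and reports spans that are missing or out of order.

```python
def check_example(text: str, extraction_texts: list[str]) -> list[str]:
    """Return a list of problems: spans missing from `text` or out of order."""
    problems = []
    cursor = 0  # position just after the previous match
    for span in extraction_texts:
        pos = text.find(span, cursor)
        if pos != -1:
            cursor = pos + len(span)
        elif span in text:
            # Present, but only before the cursor: wrong order or overlap.
            problems.append(f"out of order or overlapping: {span!r}")
        else:
            problems.append(f"not found verbatim: {span!r}")
    return problems


sample = "ROMEO. But soft! What light through yonder window breaks? ..."
print(check_example(sample, ["ROMEO", "But soft!"]))
print(check_example(sample, ["But soft!", "ROMEO", "Juliet"]))
```

The first call reports no problems. The second flags "ROMEO" as out of order and "Juliet" as not present verbatim.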

# 4. Starting the extraction

Once you have defined the prompt and examples, you simply call the lx.extract() function. The key arguments are:

text_or_documents: Your input text, a list of texts, or even a URL string (LangExtract can fetch and parse text from Project Gutenberg or another URL).
prompt_description: The extraction instructions (a string).
examples: A list of ExampleData objects that illustrate the desired output.
model_id: The ID of the LLM to use (e.g. "gemini-2.5-flash" for Google Gemini Flash, an Ollama model such as "gemma2:2b", or an OpenAI model such as "gpt-4o").
Other optional parameters: extraction_passes (re-runs extraction to improve recall on long texts), max_workers (processes chunks in parallel), fence_output, use_schema_constraints, etc.

For example:

input_text = '''JULIET. O Romeo, Romeo! wherefore art thou Romeo?
Deny thy father and refuse thy name;
Or, if thou wilt not, be but sworn my love,
And I'll no longer be a Capulet.
ROMEO. Shall I hear more, or shall I speak at this?
JULIET. 'Tis but thy name that is my enemy;
Thou art thyself, though not a Montague.
What’s in a name? That which we call a rose
By any other name would smell as sweet.'''


result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash"
)

This sends the prompt and examples along with the text to the selected LLM and returns a result object. For long texts, LangExtract automatically splits the input into chunks, runs the calls in parallel batches, and merges the results.
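The chunk, process in parallel, and merge pattern described above can be illustrated with the standard library alone. This is a conceptual sketch, not LangExtract's actual implementation: find_speakers is a hypothetical regex stand-in for the LLM call, and the function names and chunking rule (split on line boundaries) are assumptions made for the example.

```python
import re
from concurrent.futures import ThreadPoolExecutor

SPEAKER = re.compile(r"^([A-Z]{2,})\.", re.MULTILINE)


def split_into_chunks(text: str, max_chars: int) -> list[tuple[int, str]]:
    """Split on line boundaries into (start_offset, chunk) pieces.

    Chunks stay under max_chars unless a single line is longer than that.
    """
    chunks, buf, start, size = [], [], 0, 0
    for line in text.splitlines(keepends=True):
        if buf and size + len(line) > max_chars:
            chunks.append((start, "".join(buf)))
            start += size
            buf, size = [], 0
        buf.append(line)
        size += len(line)
    if buf:
        chunks.append((start, "".join(buf)))
    return chunks


def find_speakers(chunk: str) -> list[tuple[int, str]]:
    """Stand-in for a model call: (offset_in_chunk, name) for speaker labels."""
    return [(m.start(1), m.group(1)) for m in SPEAKER.finditer(chunk)]


def extract_all(text: str, max_chars: int = 120, max_workers: int = 4):
    """Process chunks in parallel, then map offsets back to the full text."""
    chunks = split_into_chunks(text, max_chars)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_chunk = list(pool.map(lambda c: find_speakers(c[1]), chunks))
    return [(start + off, name)
            for (start, _), hits in zip(chunks, per_chunk)
            for off, name in hits]
```

The important detail is the merge step: each chunk reports positions relative to itself, so the chunk's starting offset is added back to recover positions in the original document. That is what makes it possible to highlight extractions in context later.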

# 5. Handling results and visualizations

The output of lx.extract() is a Python object (often called result) that contains the extracted entities and their attributes. You can inspect it programmatically or save it for later. LangExtract also provides helper functions for saving results: for example, you can save the results to a JSONL (JSON Lines) file, with one document per line, and generate an interactive HTML visualization. For example:

lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")
html = lx.visualize("extraction_results.jsonl")
with open("viz.html", "w") as f:
    f.write(html if isinstance(html, str) else html.data)

This writes an extraction_results.jsonl file and an interactive viz.html file. The JSONL format is convenient for large datasets and further processing, and the HTML file highlights each extracted span in context (color-coded by class), which makes human review easier.

Output visualization: LangExtract
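Since the saved file has one JSON document per line, it can be post-processed with the standard json module. The sketch below counts extractions per class. It assumes each line is an object with an "extractions" list whose entries carry an "extraction_class" field, mirroring the Extraction attributes used earlier; check these field names against your own file, and note that count_by_class is a hypothetical helper.

```python
import json
from collections import Counter


def count_by_class(jsonl_path: str) -> Counter:
    """Count extractions per class across all documents in a JSONL file."""
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            doc = json.loads(line)
            # Field names are assumptions; adjust if your file differs.
            for item in doc.get("extractions") or []:
                counts[item.get("extraction_class", "unknown")] += 1
    return counts
```

Calling count_by_class("extraction_results.jsonl") then gives a quick summary, for example how many characters versus emotions were found, without opening the HTML view.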

# 6. Supported input formats

LangExtract is flexible about its input. You can supply:

Plain text strings: Any text you load into Python (e.g. from a file or database) can be processed.
URLs: As noted above, you can pass a URL (e.g. a Project Gutenberg link) as text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt". LangExtract will download the document and extract from it.
List of texts: Pass a list of Python strings to process multiple documents in one call.
Rich text or Markdown: Because LangExtract works at the text level, you can also feed it Markdown or HTML if you pre-process it into raw text. (LangExtract itself does not parse PDF files or images; you must extract the text first.)
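For HTML input, the pre-processing step can be as simple as stripping tags before passing the text on. The following is a minimal sketch using Python's built-in html.parser; the TextOnly class and html_to_text function are hypothetical names for this example, and a dedicated HTML-to-text library may handle whitespace and layout better on real pages.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the text content of an HTML string, joined with single spaces."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

The returned string can then be passed as text_or_documents like any other plain text.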

# 7. Conclusion

LangExtract makes it simple to turn unstructured text into structured data. With high accuracy, clear source mapping, and simple customization, it performs well where rule-based methods fail, which makes it especially useful for complex or domain-specific extractions. Although there is still room for improvement, LangExtract is already a robust tool for extracting structured information in 2025.

Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of artificial intelligence and medicine. She is co-author of the e-book “Maximizing Productivity with ChatGPT”. As a 2022 Google Generation Scholar for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is an ardent advocate for change and founded FEMCodes to empower women in STEM fields.
