# Introduction
PDF files are part of many workflows. You may need to merge reports, split large files, extract text or tables, add watermarks, or redact sensitive content. These are all routine tasks, but handling multiple files manually is slow and error-prone. These five Python scripts automate them. They run from the command line, support batch processing, and are easy to configure.
You can find all scripts on GitHub.
# 1. Merge and split PDF files
// Pain point
Combining multiple PDF files into one and splitting a large PDF file into separate files by page range are among the most common PDF tasks. Doing either manually is tedious, especially when dealing with many files or a large number of pages.
// What the script does
Combines a folder of PDF files into a single output file in a configurable order, or splits a single PDF file into separate files by page ranges, every N pages, or a list of specific page numbers. Both operations are handled by the same script through a mode flag.
// How it works
The script uses pypdf for all page-level operations. In merge mode, it reads all PDF files from the input folder, sorts them by file name (or by a custom order specified in a text file), and writes them sequentially to a single output PDF file. In split mode, it accepts a list of page ranges, a fixed chunk size, or a list of page numbers at which to split. Each split segment is written to a numbered output file. In merge mode, the metadata of the first input file is retained.
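The merge and split steps described above can be sketched roughly as follows. This is a minimal illustration, not the downloadable script: the function names (`chunk_ranges`, `parse_page_ranges`, `merge_pdfs`, `split_pdf`) and the `part_001.pdf` naming scheme are assumptions made for the example, and it assumes a recent pypdf release in which `PdfWriter.append` is available. Custom ordering from a text file is left out.

```python
from pathlib import Path


def chunk_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return zero-based, end-exclusive (start, stop) pairs of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def parse_page_ranges(spec: str) -> list[tuple[int, int]]:
    """Turn a one-based spec such as "1-3,5" into zero-based, end-exclusive pairs."""
    ranges = []
    for part in spec.split(","):
        first, _, last = part.strip().partition("-")
        start = int(first)
        stop = int(last) if last else start
        if start < 1 or stop < start:
            raise ValueError(f"invalid page range: {part!r}")
        ranges.append((start - 1, stop))
    return ranges


def merge_pdfs(input_dir: str, output_path: str) -> None:
    """Append every PDF in input_dir, sorted by file name, to one output file."""
    # Imported here so the pure helpers above work without pypdf installed.
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    paths = sorted(Path(input_dir).glob("*.pdf"))
    if not paths:
        raise FileNotFoundError(f"no PDF files found in {input_dir}")
    writer = PdfWriter()
    for path in paths:
        writer.append(str(path))
    # Carry over the document information of the first input file, if it has any.
    first_metadata = PdfReader(str(paths[0])).metadata
    if first_metadata is not None:
        writer.add_metadata(first_metadata)
    writer.write(output_path)


def split_pdf(input_path: str, ranges: list[tuple[int, int]], output_dir: str) -> None:
    """Write each (start, stop) page range to its own numbered output file."""
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(input_path)
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for index, (start, stop) in enumerate(ranges, start=1):
        writer = PdfWriter()
        for page in reader.pages[start:stop]:
            writer.add_page(page)
        writer.write(str(out / f"part_{index:03d}.pdf"))
```

Splitting a file every four pages would then combine the two helpers, for example `split_pdf("report.pdf", chunk_ranges(total_pages, 4), "out")`, where `report.pdf` and `out` are placeholder paths and `total_pages` is the page count of the input.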
⏩ Download PDF Merge and Split Script
# 2. Extract text and tables from PDF files
// Pain point
Getting usable data out of a PDF file, whether it's text from a report or tabular data from a statement, has to happen before any further processing can. Copying and pasting from a PDF viewer is impractical for files longer than a few pages, and the output is rarely clean.
// What the script does
Extracts text and tables from one or more PDF files and writes the results to structured output files. Text is saved as plain text or Markdown files. Tables are saved in CSV or Excel format, with one sheet per table. It handles text-based PDF files and supports basic layout-preserving extraction.
// How it works
The script uses pypdf for basic text extraction and pdfplumber for layout-aware extraction and table detection. For each input file, it goes page by page, extracting blocks of text and detecting table areas with the pdfplumber table finder. Extracted tables are normalized (blank rows are removed and headers are detected) and written to separate output files. A summary report lists the pages and tables found in each file and flags pages for which extraction produced no results.
⏩ Download the PDF text and table extraction script
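A rough sketch of that page-by-page loop, using pdfplumber only, might look like this. The helper names (`normalize_table`, `extract_pdf`), the output file naming, and the shape of the returned summary are assumptions for illustration. Header detection is reduced here to "the first non-blank row is the header", which is a simplification of whatever the actual script does.

```python
import csv
from pathlib import Path


def normalize_table(rows):
    """Strip cell whitespace, turn None cells into "", and drop fully blank rows.

    Assumption: the first remaining row is treated as the header by callers.
    """
    cleaned = []
    for row in rows:
        cells = [(cell or "").strip() for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def extract_pdf(pdf_path: str, output_dir: str) -> dict:
    """Write one .txt file and one CSV per table; return a small summary dict."""
    # Imported here so normalize_table can be used without pdfplumber installed.
    import pdfplumber  # third-party: pip install pdfplumber

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    text_parts, empty_pages, table_count = [], [], 0

    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            tables = [normalize_table(table) for table in page.extract_tables()]
            tables = [table for table in tables if table]
            if not text.strip() and not tables:
                empty_pages.append(number)  # flag pages that produced nothing
            text_parts.append(text)
            for table in tables:
                table_count += 1
                csv_path = out / f"{stem}_table_{table_count}.csv"
                with open(csv_path, "w", newline="", encoding="utf-8") as handle:
                    csv.writer(handle).writerows(table)

    (out / f"{stem}.txt").write_text("\n\n".join(text_parts), encoding="utf-8")
    return {
        "file": Path(pdf_path).name,
        "pages": len(text_parts),
        "tables": table_count,
        "empty_pages": empty_pages,
    }
```

Pages listed under `empty_pages` are often scanned images, which would need OCR rather than text extraction.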
# 3. Stamping, watermarking and adding page numbers
// Pain point
Adding a watermark, stamp, or page numbers to a batch of PDF files before distributing them is simple in principle, but doing it in a graphical user interface (GUI) means processing one file at a time. When the batch is large or the task recurs, it calls for automation.
// What the script does
Applies a text or image stamp to each page of one or more PDF files. Supports diagonal watermarks, header/footer text, page numbers, and graphic overlays. Position, font size, opacity and color are configurable. Processes entire folders in batch.
// How it works
The script uses pypdf for page manipulation and ReportLab to generate the stamp layer. For each input PDF file, it creates a single-page stamp PDF in memory using ReportLab. It renders text at the configured position, angle, font, and transparency, or places an image at the specified coordinates. This stamp page is then merged with each page of the source PDF using pypdf page merging. The result is written to a new output file, leaving the original unchanged. Page numbers are treated as a special case and generate a unique stamp per page.
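The stamp-then-merge approach can be sketched as below. It is an illustrative outline rather than the script itself: `page_label`, `build_stamp`, and `stamp_pdf` are hypothetical names, the grey Helvetica text and the footer offset of 20 points are arbitrary choices, and the shared watermark is sized from the first page only, so documents with mixed page sizes would need more care. Image stamps are omitted.

```python
import io


def page_label(number: int, total: int, template: str = "Page {n} of {total}") -> str:
    """Format the footer text for one page (numbers are one-based)."""
    return template.format(n=number, total=total)


def build_stamp(text, width, height, *, angle=45.0, font_size=48,
                opacity=0.3, footer=False) -> bytes:
    """Render a single-page PDF that contains only the stamp text."""
    # Imported here so page_label can be used without ReportLab installed.
    from reportlab.pdfgen import canvas  # third-party: pip install reportlab

    buffer = io.BytesIO()
    layer = canvas.Canvas(buffer, pagesize=(width, height))
    layer.setFont("Helvetica", font_size)
    layer.setFillColorRGB(0.5, 0.5, 0.5)
    layer.setFillAlpha(opacity)
    if footer:
        # Centered near the bottom edge, used for page numbers.
        layer.drawCentredString(width / 2, 20, text)
    else:
        # Rotate around the page center for a diagonal watermark.
        layer.translate(width / 2, height / 2)
        layer.rotate(angle)
        layer.drawCentredString(0, 0, text)
    layer.save()
    return buffer.getvalue()


def stamp_pdf(input_path: str, output_path: str, text: str,
              number_pages: bool = False) -> None:
    """Merge a watermark (or a per-page number) onto every page; write a new file."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(input_path)
    writer = PdfWriter()
    total = len(reader.pages)
    shared_stamp = None  # one watermark page reused for all pages

    for number, page in enumerate(reader.pages, start=1):
        width = float(page.mediabox.width)
        height = float(page.mediabox.height)
        if number_pages:
            # Page numbers differ per page, so each page gets its own stamp.
            data = build_stamp(page_label(number, total), width, height,
                               font_size=10, opacity=1.0, footer=True)
            stamp = PdfReader(io.BytesIO(data)).pages[0]
        else:
            if shared_stamp is None:
                data = build_stamp(text, width, height)
                shared_stamp = PdfReader(io.BytesIO(data)).pages[0]
            stamp = shared_stamp
        page.merge_page(stamp)
        writer.add_page(page)

    writer.write(output_path)
```

Because the output goes to a separate path, the source file stays untouched, which matches the behavior described above.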
⏩ Download the PDF stamping script
# 4. Redacting sensitive content
// Pain point
Before sharing a PDF file externally, you often need to remove sensitive content such as names, reference numbers, financial details, and addresses. Manually drawing black boxes over text in a PDF editor works visually, but not every tool actually removes the underlying text, and the approach is impractical for more than a few pages.
// What the script does
Scans PDF pages for text matching patterns you define (regular expressions, exact strings, or predefined categories such as email addresses and phone numbers) and permanently redacts the matched content, replacing it with black rectangles. It creates a new PDF with the underlying text removed, not just visually obscured.
// How it works
The script uses PyMuPDF, which provides both text search with bounding box coordinates and the ability to draw redaction annotations that, when applied, permanently remove the underlying content. For each page, the script searches for all matches of each configured pattern, marks their bounding boxes as redaction annotations, and then applies them, which removes the text from the page's content stream. A report lists each redaction made, including the page number, the matched text (before redaction), and the pattern that triggered it.
⏩ Download the PDF redaction script
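A minimal sketch of that search, annotate, and apply sequence is shown below. The two regular expressions, the `PATTERNS` dictionary, and the function names are illustrative assumptions, not the script's real configuration. Since PyMuPDF's `search_for` looks up literal strings, the sketch first runs the regexes over the page text and then searches for each matched string. That lookup can also hit the same string elsewhere on the page, and matches that wrap across lines may be missed.

```python
import re

# Example patterns only; real category definitions would likely be stricter.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_matches(text, patterns=PATTERNS):
    """Return unique (pattern_name, matched_text) pairs in first-seen order."""
    found = {}
    for name, pattern in patterns.items():
        for match in pattern.finditer(text):
            found.setdefault((name, match.group()), None)
    return list(found)


def redact_pdf(input_path: str, output_path: str, patterns=PATTERNS) -> list[dict]:
    """Redact every match on every page and return a report of what was removed."""
    # Imported here so find_matches can be used without PyMuPDF installed.
    import fitz  # PyMuPDF, third-party: pip install pymupdf

    report = []
    with fitz.open(input_path) as doc:
        for page in doc:
            for name, matched in find_matches(page.get_text(), patterns):
                for rect in page.search_for(matched):
                    # Mark the bounding box; the black fill is drawn on apply.
                    page.add_redact_annot(rect, fill=(0, 0, 0))
                    report.append(
                        {"page": page.number + 1, "pattern": name, "text": matched}
                    )
            # Applying the annotations removes the text under them.
            page.apply_redactions()
        doc.save(output_path)
    return report
```

The report contains the sensitive strings themselves, so it should be stored as carefully as the unredacted originals.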
# 5. Extracting metadata and generating an inventory of PDF files
// Pain point
When working with a large collection of PDF files, it's often useful to know basic facts about each one: the number of pages, file size, creation date, author, whether the file is encrypted, and whether it contains text or is a scanned image. Checking each file individually in a viewer is not practical at scale.
// What the script does
Scans a folder of PDF files and extracts metadata from each one, including page count, file size, creation and modification dates, author, producer, encryption status, and whether the document contains searchable text or scanned images. Saves everything in one CSV or Excel inventory file.
// How it works
The script uses pypdf to read document metadata from the PDF information dictionary and pdfplumber to sample pages for text content. For each file, it tries to open the PDF and read the standard metadata fields. It then samples the first few pages to determine whether the file contains extractable text or consists of scanned page images. Encrypted files that cannot be opened are marked rather than silently skipped. The output table contains one row per file with all extracted fields, and a summary row at the bottom with totals and averages.
⏩ Download the PDF inventory script
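The inventory pass could be outlined along these lines. The column names in `FIELDS`, the 20-character threshold in `has_extractable_text`, and the three-page sample are assumptions chosen for the example. As a simplification, the sketch marks every encrypted file without attempting to open it, and its summary row carries totals only, not averages.

```python
import csv
from pathlib import Path

FIELDS = ["file", "size_bytes", "pages", "author", "producer",
          "created", "modified", "encrypted", "has_text", "error"]


def has_extractable_text(samples, min_chars: int = 20) -> bool:
    """Heuristic: text-based if the sampled pages hold at least min_chars characters."""
    return sum(len(text.strip()) for text in samples) >= min_chars


def inspect_pdf(path: Path, sample_pages: int = 3) -> dict:
    """Collect one inventory row for a single PDF file."""
    # Imported here so the heuristic above works without these libraries.
    import pdfplumber  # third-party: pip install pdfplumber
    from pypdf import PdfReader  # third-party: pip install pypdf

    row = dict.fromkeys(FIELDS, "")
    row.update(file=path.name, size_bytes=path.stat().st_size)
    try:
        reader = PdfReader(str(path))
        row["encrypted"] = reader.is_encrypted
        if reader.is_encrypted:
            row["error"] = "encrypted, not opened"
            return row
        row["pages"] = len(reader.pages)
        meta = reader.metadata
        if meta is not None:
            row.update(
                author=meta.author or "",
                producer=meta.producer or "",
                created=meta.creation_date or "",
                modified=meta.modification_date or "",
            )
        with pdfplumber.open(str(path)) as pdf:
            samples = [page.extract_text() or "" for page in pdf.pages[:sample_pages]]
        row["has_text"] = has_extractable_text(samples)
    except Exception as exc:
        # Broad on purpose: a damaged file is recorded instead of stopping the batch.
        row["error"] = f"{type(exc).__name__}: {exc}"
    return row


def build_inventory(folder: str, output_csv: str) -> list[dict]:
    """Write one CSV row per PDF in folder, followed by a totals row."""
    rows = [inspect_pdf(path) for path in sorted(Path(folder).glob("*.pdf"))]
    summary = dict.fromkeys(FIELDS, "")
    summary.update(
        file=f"TOTAL ({len(rows)} files)",
        size_bytes=sum(row["size_bytes"] for row in rows),
        pages=sum(row["pages"] or 0 for row in rows),
    )
    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
        writer.writerow(summary)
    return rows
```

A file with `has_text` set to false after sampling is a likely scan, although a document whose first pages are blank or image-only covers would be misclassified by this heuristic.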
# Summary
These five Python scripts handle PDF tasks that typically turn into repetitive manual work: file splitting, content extraction, batch processing, and document workflow cleanup. Each script is designed to work safely on individual files or entire folders, writing new output files rather than modifying the originals.
Start with a small batch, check the results, then scale up to larger folders when everything looks good. In most cases, setup only involves installing the listed dependencies and adjusting the configuration sections for file paths and settings.
Bala Priya C is a software developer and technical writer from India. She likes working at the intersection of mathematics, programming, data analytics and content creation. Her areas of interest and specialization include DevOps, data analytics and natural language processing. She enjoys reading, writing, coding and coffee! She is currently working on learning and sharing her knowledge with the developer community by writing tutorials, guides, reviews, and more. Bala also creates engaging resource overviews and coding tutorials.
