Translating PDFs Without Breaking Layout: Is It Really Possible?

Apr 19, 2025

I recently came across an open-source tool called PDFMathTranslate. It can translate PDF files into multiple languages without breaking the layout.

Figure 1: A demo of PDFMathTranslate's results. [Source].

Curious about how it handles PDF parsing under the hood, I decided to dig into the source code.

In this article, I’ll first walk through the code, then share a few thoughts and reflections.

Overview

Since the project doesn’t include an architecture diagram, I sketched one myself while exploring the code, focusing on the PDF parsing logic.

Figure 2: The PDF parsing logic of PDFMathTranslate. Image by author.

Main Entry

The core functionality of PDFMathTranslate lives in the pdf2zh directory, with pdf2zh.py serving as the main entry point.

pdf2zh.py is a command-line tool that extracts both text and images from the original file, runs the text through your chosen translation service (Google, OpenAI, and over 20 others), and then generates a translated PDF. It also supports multiple runtime modes, including GUI, web interface, and Celery-based task queues.

def main(args: Optional[List[str]] = None) -> int:
    from rich.logging import RichHandler

    logging.basicConfig(level=logging.INFO, handlers=[RichHandler()])

    # disable httpx, openai, httpcore, http11 logs
    logging.getLogger("httpx").setLevel("CRITICAL")
    logging.getLogger("httpx").propagate = False
    logging.getLogger("openai").setLevel("CRITICAL")
    logging.getLogger("openai").propagate = False
    logging.getLogger("httpcore").setLevel("CRITICAL")
    logging.getLogger("httpcore").propagate = False
    logging.getLogger("http11").setLevel("CRITICAL")
    logging.getLogger("http11").propagate = False

    parsed_args = parse_args(args)

    if parsed_args.config:
        ConfigManager.custome_config(parsed_args.config)

    if parsed_args.debug:
        log.setLevel(logging.DEBUG)

    if parsed_args.onnx:
        ModelInstance.value = OnnxModel(parsed_args.onnx)
    else:
        ModelInstance.value = OnnxModel.load_available()

    if parsed_args.interactive:
        from pdf2zh.gui import setup_gui

        if parsed_args.serverport:
            setup_gui(
                parsed_args.share, parsed_args.authorized, int(parsed_args.serverport)
            )
        else:
            setup_gui(parsed_args.share, parsed_args.authorized)
        return 0

    if parsed_args.flask:
        from pdf2zh.backend import flask_app

        flask_app.run(port=11008)
        return 0

    if parsed_args.celery:
        from pdf2zh.backend import celery_app

        celery_app.start(argv=sys.argv[2:])
        return 0

    if parsed_args.prompt:
        try:
            with open(parsed_args.prompt, "r", encoding="utf-8") as file:
                content = file.read()
            parsed_args.prompt = Template(content)
        except Exception:
            raise ValueError("prompt error.")

    print(parsed_args)
    if parsed_args.babeldoc:
        return yadt_main(parsed_args)
    if parsed_args.dir:
        untranlate_file = find_all_files_in_directory(parsed_args.files[0])
        parsed_args.files = untranlate_file
        translate(model=ModelInstance.value, **vars(parsed_args))
        return 0

    translate(model=ModelInstance.value, **vars(parsed_args))
    return 0

The main logic of pdf2zh.py centers around the main() function, with two key parts worth noting:

Layout model loading (OnnxModel.load_available() or a custom model via --onnx): Uses an ONNX-based layout model (like DocLayout-YOLO) to detect paragraph, formula, and image regions—laying the groundwork for aligned translation later on.
Per-file translation (translate()): Handles everything from parsing the original PDF and running the translation to rebuilding the layout and generating the output PDF.
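Besides those two parts, main() has a smaller branch worth a quick look: when --prompt is given, it reads the file and wraps its contents in a Template before translation starts. The sketch below shows roughly how that pattern behaves. It assumes Template is Python's standard string.Template (the excerpt does not show the import), and the helper name load_prompt_template and the placeholders ${lang_in}, ${lang_out}, and ${text} are hypothetical names chosen for illustration, not confirmed parts of the project.

```python
import tempfile
from pathlib import Path
from string import Template  # assumption: the excerpt's Template is string.Template


def load_prompt_template(path: str) -> Template:
    """Read a prompt file and wrap it in a Template (hypothetical helper)."""
    try:
        content = Path(path).read_text(encoding="utf-8")
    except OSError as exc:
        # Narrower than the bare `except Exception` in main(); the original
        # error is chained so the cause is not lost.
        raise ValueError(f"could not read prompt file: {path}") from exc
    return Template(content)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        prompt_file = Path(tmp) / "prompt.txt"
        # Placeholder names are illustrative assumptions.
        prompt_file.write_text(
            "Translate from ${lang_in} to ${lang_out}:\n${text}",
            encoding="utf-8",
        )
        template = load_prompt_template(str(prompt_file))
        # safe_substitute leaves unknown placeholders untouched instead of raising.
        print(template.safe_substitute(lang_in="en", lang_out="zh", text="Hello"))
```

Chaining the original exception is a small deviation from the excerpt, which raises a plain ValueError("prompt error.") and so hides whether the file was missing or unreadable.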

Translate

Next, let’s take a closer look at the translate() function, which lives in high_level.py. Think of high_level.py as the engine behind the entire PDF translation pipeline. For each PDF, it:

Parses the layout and extracts text blocks and images
Calls the translation service concurrently (number of threads controlled by --thread)
Rebuilds the layout
Generates the output PDFs
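
To make the concurrent step more concrete, here is a minimal sketch of translating a list of text blocks with a thread pool. It is not the project's actual code: the use of ThreadPoolExecutor, the function names translate_blocks and fake_translate, and the thread parameter (standing in for --thread) are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fake_translate(text: str) -> str:
    """Stand-in for a real translation service call (hypothetical)."""
    return f"[zh] {text}"


def translate_blocks(
    blocks: List[str],
    translate_one: Callable[[str], str],
    thread: int = 4,
) -> List[str]:
    """Translate text blocks concurrently while preserving their order."""
    if thread < 1:
        raise ValueError("thread must be at least 1")
    with ThreadPoolExecutor(max_workers=thread) as pool:
        # pool.map yields results in input order, so each translation can be
        # written back to the position its source block came from.
        return list(pool.map(translate_one, blocks))


if __name__ == "__main__":
    paragraphs = ["Introduction", "Related work", "Conclusion"]
    for line in translate_blocks(paragraphs, fake_translate, thread=2):
        print(line)
```

Threads suit this step because each call mostly waits on a network response, and keeping the results in input order matters because the rebuilt layout has to put every translated block back where its source text was.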
