Why Extracting Hindi Text from PDFs Is So Much Harder Than English (And How You Can Do It)

As a Digital Humanities student working with Hindi-language texts, I expected extracting text from a Hindi PDF to be a straightforward process, much like English documents. But after days of trial and error, I learned that working with Indian-language PDFs is a whole different ballgame.

Let me walk you through the problem, why it’s harder than it seems, and how to finally solve it — depending on what kind of PDF you’re working with.

The Problem

I wanted to extract Hindi text from a PDF and save it as a UTF-8 encoded .txt file. My use case was academic — importing clean Hindi text into annotation tools, digital corpora, and natural language processing pipelines. But I hit a wall:

  1. I could select and copy the text, but pasting it gave me gibberish
  2. Optical Character Recognition (OCR) tools (like Tesseract) produced blank outputs or mixed-up English characters
  3. Adobe Acrobat (German version) didn’t support Hindi OCR
  4. Online PDF converters also failed, since they assumed the content was English
  5. Some PDFs looked fine, but others were actually scanned images, not real text

Why Is Hindi Text So Difficult to Extract?

The root problem is often legacy Devanagari fonts. Fonts like Kruti Dev, Shivaji01, Chanakya, and others were commonly used in Indian desktop publishing before Unicode became standard. These fonts:

  1. Show Devanagari script on screen
  2. But are internally stored as Roman characters

This means the letter “क” might actually be encoded as “d”, “भ” as “B”, and so on. So when you copy and paste the text, you get:

dbZ dqN ugha dgk fd

instead of:

कई कुछ नहीं कहा कि

Many OCR systems are trained only on modern Unicode fonts, so they simply don’t recognize these legacy typefaces. In the pre-Unicode era, every publisher used its own custom font encoding, meaning the same visual glyph might map to different characters depending on the software. Since these encodings were never standardized, OCR engines struggle to interpret the text accurately: they see a series of unfamiliar shapes rather than meaningful characters.
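To make this concrete, here is a toy sketch of what a legacy-to-Unicode converter does under the hood. The mapping below covers only the five sample words above and is not the real Kruti Dev table (real converters apply hundreds of carefully ordered, often multi-character rules), but the principle is the same:

```python
# Toy legacy-to-Unicode converter. The mapping covers only the sample
# words above; a real Kruti Dev table has hundreds of rules, many of
# them multi-character and order-sensitive.
LEGACY_TO_UNICODE = {
    "dbZ": "कई",
    "dqN": "कुछ",
    "ugha": "नहीं",
    "dgk": "कहा",
    "fd": "कि",
}

def convert(legacy_text: str) -> str:
    # Apply longer rules first so short codes don't clobber longer matches
    for legacy in sorted(LEGACY_TO_UNICODE, key=len, reverse=True):
        legacy_text = legacy_text.replace(legacy, LEGACY_TO_UNICODE[legacy])
    return legacy_text

print(convert("dbZ dqN ugha dgk fd"))  # कई कुछ नहीं कहा कि
```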

Solution 1: Use a Font-to-Unicode Converter (If Your PDF Uses Legacy Fonts)

If you’re able to select and copy the text from the PDF (even if it comes out as gibberish), it likely uses one of these legacy fonts. A friend pointed me to a free tool, the Font2Unicode Converter, developed specifically for Gurmukhi and Hindi legacy fonts:

  1. Open your PDF and copy the gibberish text
  2. Go to the tool and paste it in
  3. From the dropdown, select your font (e.g. Shivaji01) or let auto-detect do its job
  4. Click ‘Convert Legacy font to Unicode’
  5. Copy the converted text — now it’s proper Hindi Unicode!

This works like magic if the source font is known and the text isn’t an image.

Solution 2: Python OCR Script (If Your PDF Is Scanned or Image-Based)

But what if such tools can’t recognise the font or return proper Unicode?

That means your PDF is likely made of scanned images, not real text. In that case, you’ll need to perform OCR. I used the following Tesseract-based Python workflow:

  1. Convert PDF pages to .jpg images (using tools like ILovePDF or Adobe)
  2. Store the images in a folder
  3. Run an OCR script over the images (my full script is in the file ‘If Your PDF Is Scanned or Image-Based’; a minimal sketch follows below)
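For reference, here is a minimal sketch of that kind of script. It assumes pytesseract and Pillow are installed, Tesseract has the Hindi (“hin”) language pack, and the page images live in a folder called pages (a placeholder name):

```python
# Minimal OCR sketch: run Tesseract's Hindi model over a folder of
# page images and collect the results in one UTF-8 text file.
# Assumes: pip install pytesseract pillow, plus Tesseract itself with
# the Hindi ("hin") language pack installed.
import os

from PIL import Image
import pytesseract

IMAGE_DIR = "pages"            # placeholder folder of page images
OUTPUT_FILE = "hindi_text.txt"

with open(OUTPUT_FILE, "w", encoding="utf-8") as out:
    # Sort filenames so pages come out in order (page_001.jpg, ...)
    for name in sorted(os.listdir(IMAGE_DIR)):
        if not name.lower().endswith((".jpg", ".jpeg", ".png")):
            continue
        image = Image.open(os.path.join(IMAGE_DIR, name))
        # lang="hin" selects the Devanagari/Hindi model
        out.write(pytesseract.image_to_string(image, lang="hin"))
        out.write("\n")
```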

Limitations:

  1. This process is slow for large PDFs
  2. Accuracy varies based on image quality
  3. Tesseract must be correctly installed and configured (along with the Hindi language pack)

Solution 3: For Clean Unicode PDFs

There’s a rare but ideal case where the PDF:

  1. Uses Unicode-compliant Hindi fonts
  2. Is digitally generated (not scanned)
  3. Doesn’t use non-standard encodings

In this case, we can extract text directly using tools like pdfplumber or PyMuPDF. My initial code is in the file ‘For Clean Unicode Pdf Files’; a rough sketch follows below.
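As a sketch, the pdfplumber version boils down to a loop over pages (assuming pdfplumber is installed and “document.pdf” is a placeholder filename):

```python
# Direct extraction sketch for a clean, Unicode-compliant PDF.
# Assumes: pip install pdfplumber; "document.pdf" is a placeholder.
import pdfplumber

with pdfplumber.open("document.pdf") as pdf, \
        open("hindi_text.txt", "w", encoding="utf-8") as out:
    for page in pdf.pages:
        text = page.extract_text()
        # extract_text() may return None (or an empty string) for
        # pages with no text layer, so guard before writing
        if text:
            out.write(text + "\n")
```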

Why this often fails:

  1. PDFs created using non-Unicode fonts (e.g., Shivaji01) will return garbled Roman characters, not readable Hindi
  2. Image-based PDFs will return None
  3. No OCR will be applied unless it’s explicitly added

Final Thoughts

Working with Hindi PDFs is not just a technical challenge. It’s also a historical problem, rooted in the legacy of Indian publishing practices and font usage. So the first step is: Know what kind of PDF you’re dealing with.
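If you’re unsure which case you have, a quick check of the first page helps: no text layer suggests a scan, Roman-only gibberish suggests a legacy font, and readable Devanagari means clean Unicode. Here is a minimal sketch, again assuming pdfplumber and a placeholder filename:

```python
# Quick triage: which of the three PDF types am I dealing with?
# Assumes pdfplumber is installed; "document.pdf" is a placeholder.
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    sample = pdf.pages[0].extract_text()

if not sample:
    print("No text layer: likely a scanned PDF -> use OCR (Solution 2)")
elif any("\u0900" <= ch <= "\u097F" for ch in sample):
    # U+0900..U+097F is the Devanagari Unicode block
    print("Devanagari found: likely clean Unicode -> extract directly (Solution 3)")
else:
    print("Text but no Devanagari: likely a legacy font -> convert it (Solution 1)")
```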

Type of PDF                    | Best Solution
Legacy font (e.g., Shivaji01)  | Use Font2Unicode Tool
Image-based scanned PDF        | Use Python + Tesseract OCR
Modern Unicode PDF             | Use pdfplumber or similar tools

Let’s hope we eventually move toward fully Unicode-compliant Indian language publishing — but until then, we work with what we’ve got.
