How to Create a Synthetic Dataset for Donut OCR in Your Custom Language

In this tutorial, we will create a synthetic dataset for training Donut (a vision Transformer encoder paired with a text decoder) on right-to-left (RTL) languages such as Arabic, Urdu, Persian, and Hebrew. It is written for machine learning engineers and uses SynthDoG to generate the images, with Urdu as the worked example. The same setup can be tailored to other RTL languages by modifying the configuration appropriately.

Donut (🍩), or Document Understanding Transformer, is an OCR-free, end-to-end Transformer model designed for document understanding. Unlike traditional methods, Donut bypasses the need for external OCR engines or APIs, yet it achieves state-of-the-art results across a range of visual document tasks, such as document classification and information extraction (also known as document parsing). While datasets for left-to-right languages like English, Spanish, and Chinese are readily available and can be generated with SynthDoG, creating datasets for right-to-left languages such as Arabic, Urdu, and Hebrew is more involved, because the text has to be shaped and laid out from right to left.

We'll use SynthDoG 🐶, a Synthetic Document Generator, to make model pretraining adaptable across various languages and domains.

Here is a list of the main languages written in right-to-left scripts:

  • Arabic
  • Aramaic
  • Azeri (when written in the Arabic script)
  • Dhivehi/Maldivian
  • Hebrew
  • Kurdish (Sorani)
  • Persian/Farsi
  • Urdu

 

Step 1: Installing Required Libraries

First, clone the SynthDoG-RTL GitHub repository to get access to all the necessary tools and configurations:

git clone https://github.com/aiviewz/Synthdog-RTL.git
cd Synthdog-RTL

This repository contains everything you need to get started, including configuration examples, templates, and background resources.

Next, install the required dependencies:

pip install synthtiger Pillow==9.5.0
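
Before moving on, it can help to confirm that both packages actually landed in the active environment. The snippet below is a minimal sketch using only the standard library; the helper names `installed_version` and `report` are made up for this example and are not part of synthtiger.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages=("synthtiger", "Pillow")):
    """Map each package name to its installed version (None when not installed)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

If Pillow reports a version other than 9.5.0, re-run the pip command above so that the pinned version is used.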

Step 2: Setting Up Your Project Structure

Inside the cloned Synthdog-RTL directory, organize the project with the following structure:

 

/Synthdog-RTL
  ├── resources/
  │   ├── background/
  │   ├── paper/
  │   ├── font/
  │   │   └── ur/         # Folder for Urdu fonts
  │   └── corpus/
  │       └── urdu_sample.txt  # Sample text for Urdu
  └── config_ur.yaml       # Configuration file for Urdu

 

Explanation:

  • background/: Contains background images for synthetic documents.
  • paper/: Contains images used for the paper texture, if needed.
  • font/ur/: Place .ttf font files for the Urdu language here.
  • corpus/urdu_sample.txt: Contains sample paragraphs for Urdu.
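
If some of these folders are missing from your checkout, they can be created in one go. The following is a small sketch that mirrors the layout above; `scaffold` and `RESOURCE_DIRS` are illustrative names, and the script assumes it is run from the repository root.

```python
from pathlib import Path

# Sub-directories from the layout above, relative to the repository root.
RESOURCE_DIRS = (
    "resources/background",
    "resources/paper",
    "resources/font/ur",
    "resources/corpus",
)


def scaffold(root):
    """Create any missing resource folders under `root` and return their paths."""
    root = Path(root)
    created = []
    for rel in RESOURCE_DIRS:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)  # no-op if the folder already exists
        created.append(path)
    return created


if __name__ == "__main__":
    for path in scaffold("."):
        print(path)
```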

Step 3: Creating config_ur.yaml Configuration File

Below is a sample config_ur.yaml file configured for Urdu. This file determines how your synthetic dataset will be generated, including text layout, image effects, and dataset size:

#config_ur.yaml

quality: [95, 100]
landscape: 0.5
short_size: [720, 1024]
aspect_ratio: [1, 2]

background:
  image:
    paths: [resources/background]
    weights: [1]

  effect:
    args:
      # Gaussian blur
      - prob: 1
        args:
          sigma: [0, 10]

document:
  fullscreen: 0.5
  landscape: 0.5
  short_size: [480, 1024]
  aspect_ratio: [1, 2]

  paper:
    image:
      paths: [resources/paper]
      weights: [1]
      alpha: [0, 0.2]
      grayscale: 1
      crop: 1
  content:
    margin: [0, 0.1]
    text:
      path: resources/corpus/urdu_sample.txt
    font:
      paths: [resources/font/ur]
      weights: [1]
      bold: 0
    layout:
      text_scale: [0.0334, 0.1]
      max_row: 10
      max_col: 1
      fill: [0.5, 1]
      full: 0.1
      align: [right]  # Aligns RTL text to the right
      stack_spacing: [0.0334, 0.0334]
      stack_fill: [0.5, 1]
      stack_full: 0.1
    textbox:
      fill: [0.5, 1]
    textbox_color:
      prob: 0.2
      args:
        gray: [0, 64]
        colorize: 1
    content_color:
      prob: 0.2
      args:
        gray: [0, 64]
        colorize: 1
  rtl: true  # Enables RTL language support
  effect:
    args:
      # Elastic distortion
      - prob: 0.3
        args:
          alpha: [0, 0.4]
          sigma: [0, 0.5]
      # Gaussian noise
      - prob: 0.3
        args:
          scale: [0, 3]
          per_channel: 0
      # Perspective distortion
      - prob: 0.5
        args:
          weights: [750, 50, 50, 25, 25, 25, 25, 50]
          args:
            - percents: [[0.75, 1], [0.75, 1], [0.75, 1], [0.75, 1]]
            - percents: [[0.75, 1], [1, 1], [0.75, 1], [1, 1]]
            - percents: [[1, 1], [0.75, 1], [1, 1], [0.75, 1]]
            - percents: [[0.75, 1], [1, 1], [1, 1], [1, 1]]
            - percents: [[1, 1], [0.75, 1], [1, 1], [1, 1]]
            - percents: [[1, 1], [1, 1], [0.75, 1], [1, 1]]
            - percents: [[1, 1], [1, 1], [1, 1], [0.75, 1]]
            - percents: [[1, 1], [1, 1], [1, 1], [1, 1]]

effect:
  args:
    # Color adjustments
    - prob: 0.2
      args:
        rgb: [[0, 255], [0, 255], [0, 255]]
        alpha: [0, 0.2]
    # Shadow effects
    - prob: 0.3
      args:
        intensity: [0, 160]
        amount: [0, 1]
        smoothing: [0.2, 0.4]
        bidirectional: 0
    # Contrast enhancement
    - prob: 1
      args:
        alpha: [1, 1.5]
    # Brightness adjustment
    - prob: 1
      args:
        beta: [-48, 0]
    # Motion blur
    - prob: 0.4
      args:
        k: [3, 5]
        angle: [0, 360]
    # Gaussian blur
    - prob: 0.2
      args:
        sigma: [0, 1.5]
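
A typo in a path or a probability outside the 0 to 1 range is easy to miss in a file this long. The sketch below checks a config that has already been loaded into a Python dictionary. `check_config` is a hypothetical helper, not part of SynthDoG, and it assumes that `landscape` and `fullscreen` are probabilities, as the sample values suggest.

```python
from pathlib import Path


def check_config(cfg, root="."):
    """Return a list of problems found in a SynthDoG-style config dictionary."""
    root = Path(root)
    problems = []
    document = cfg.get("document", {})
    content = document.get("content", {})

    # Collect every resource path the config mentions.
    paths = list(cfg.get("background", {}).get("image", {}).get("paths", []))
    paths += document.get("paper", {}).get("image", {}).get("paths", [])
    paths += content.get("font", {}).get("paths", [])
    corpus = content.get("text", {}).get("path")
    if corpus:
        paths.append(corpus)
    for rel in paths:
        if not (root / rel).exists():
            problems.append(f"missing path: {rel}")

    # Assumption: these keys are probabilities and should lie in [0, 1].
    probabilities = {
        "landscape": cfg.get("landscape"),
        "document.landscape": document.get("landscape"),
        "document.fullscreen": document.get("fullscreen"),
    }
    for name, value in probabilities.items():
        if value is not None and not 0 <= value <= 1:
            problems.append(f"{name} should be between 0 and 1, got {value}")
    return problems
```

To use it on the real file, load config_ur.yaml with PyYAML's `yaml.safe_load` (assuming PyYAML is installed in your environment) and pass the resulting dictionary to `check_config`. An empty list means nothing obvious is wrong.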

 

Step 4: Creating Sample Corpus (urdu_sample.txt)

Create a text file named urdu_sample.txt inside the resources/corpus/ folder. This file should contain sample Urdu text paragraphs. You can replace the text with any other RTL language content:

Example of urdu_sample.txt:

یہ ایک نمونہ پیراگراف ہے جسے آپ اپنی مرضی کے مطابق تبدیل کر سکتے ہیں۔
دوسری لائن میں کچھ اضافی متن شامل کریں۔
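
When a corpus is scraped or copied from mixed sources, it is worth checking that it really is mostly right-to-left text. The sketch below uses the bidirectional classes from Python's standard `unicodedata` module; `rtl_ratio` and `clean_corpus` are illustrative names, not SynthDoG functions.

```python
import unicodedata


def rtl_ratio(text):
    """Fraction of strongly directional characters in `text` that are right-to-left."""
    rtl = ltr = 0
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in ("R", "AL"):  # Hebrew-style and Arabic-style letters
            rtl += 1
        elif cls == "L":  # left-to-right letters such as Latin
            ltr += 1
    total = rtl + ltr
    return rtl / total if total else 0.0


def clean_corpus(raw):
    """Strip surrounding whitespace and drop empty lines."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


if __name__ == "__main__":
    with open("resources/corpus/urdu_sample.txt", encoding="utf-8") as handle:
        lines = clean_corpus(handle.read())
    print(f"{len(lines)} lines, RTL ratio {rtl_ratio(' '.join(lines)):.2f}")
```

A ratio close to 1 suggests the file contains mostly RTL script. A low value usually means that Latin text or markup has slipped into the corpus.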

Step 5: Adding Fonts

Place .ttf font files for Urdu in the resources/font/ur/ directory. Ensure the font supports the language you are targeting.

To get high-quality fonts for your target RTL language, you can download them from Google Fonts. Here's how you can do it:

  1. Go to Google Fonts.

  2. In the search bar, type the name of your target language (e.g., "Urdu", "Arabic", "Hebrew") to filter fonts that support that language.

  3. Click on the font you want to use. You will see a "Download family" button on the top-right corner of the font details page. Click it to download the font family as a .zip file.

  4. Extract the .zip file to find .ttf files (TrueType Font files). Copy these .ttf files to the appropriate directory in your project:

# Run from inside the Synthdog-RTL directory
cp path_to_downloaded_fonts/*.ttf resources/font/ur/

Make sure that the font directory is set in config_ur.yaml:

font:
  paths: [resources/font/ur]

This process can be repeated for other languages by creating new directories under resources/font/ (e.g., resources/font/ar for Arabic fonts).
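
One way to check that a font "supports the language" is to compare the characters in your corpus with the code points the font maps to glyphs. The sketch below assumes the fontTools package is available in your environment; `missing_characters` and `font_codepoints` are hypothetical helper names, and the font file name in the usage note is a placeholder.

```python
def missing_characters(text, supported_codepoints):
    """Return the sorted, unique non-whitespace characters the font cannot map."""
    wanted = {ch for ch in text if not ch.isspace()}
    return sorted(ch for ch in wanted if ord(ch) not in supported_codepoints)


def font_codepoints(font_path):
    """Read the Unicode code points a font maps to glyphs (requires fontTools)."""
    from fontTools.ttLib import TTFont  # imported lazily; assumed to be installed

    font = TTFont(font_path)
    try:
        cmap = font.getBestCmap() or {}  # None when the font has no Unicode cmap
        return set(cmap)
    finally:
        font.close()
```

For example, `missing_characters(corpus_text, font_codepoints("resources/font/ur/YourFont.ttf"))` returns the characters that would have no glyph. Code point coverage is a necessary check, but it does not guarantee that letters join and shape correctly, so it is still worth inspecting a few generated images by eye.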

Step 6: Generating the Dataset

Run the following command in a terminal from inside the Synthdog-RTL directory to generate your dataset. Adjust the parameters -c (number of samples) and -w (number of workers) as needed:

synthtiger -o ./outputs/SynthDoG_ur -c 1000 -w 2 -v template.py SynthDoG config_ur.yaml

Parameter Explanation:

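  • -o: Output directory where generated images will be saved.
  • -c: Number of samples to generate.
  • -w: Number of worker processes used to speed up generation.
  • -v: Prints error messages raised during generation.
  • template.py SynthDoG: The template script and the name of the template class inside it.
  • config_ur.yaml: Configuration file with all the settings for generation.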
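
Once generation finishes, you can read the labels back to spot-check them. The sketch below assumes the fork keeps the output layout of upstream SynthDoG: split folders such as train/ under the output directory, each holding a metadata.jsonl file whose lines contain a file_name and a ground_truth JSON string with gt_parse.text_sequence inside. If your output looks different, adjust the keys accordingly. `read_metadata` is an illustrative name.

```python
import json
from pathlib import Path


def read_metadata(split_dir):
    """Yield (file_name, text) pairs from the metadata.jsonl file in `split_dir`."""
    with open(Path(split_dir) / "metadata.jsonl", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            record = json.loads(line)
            # In the assumed format, ground_truth is itself a JSON-encoded string.
            truth = json.loads(record["ground_truth"])
            yield record["file_name"], truth["gt_parse"]["text_sequence"]


if __name__ == "__main__":
    # Assumed location of the training split for the command above.
    for name, text in read_metadata("./outputs/SynthDoG_ur/train"):
        print(name, text[:60])
```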

Step 7: Modifying Configuration for Different Effects

Here’s a brief guide to some parameters you can tweak in the config_ur.yaml file:

  1. Document Layout:

    • landscape: Probability of generating a landscape document. Set it to 1 for landscape only or 0 for portrait only.
    • fullscreen: Probability that the document fills the whole image. Set it to 1 so that no background shows around the page.

  2. Text Adjustments:

    • font.bold: Set to 1 to make the text bold.
    • align: Modify to left or center if you want different text alignments for other languages.

  3. Effects:

    • Elastic distortion: Adjust alpha and sigma for distortion intensity.
    • Gaussian blur: Change sigma in effect.args to increase/decrease the blur.
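
If you want to try several of these tweaks without editing the YAML by hand each time, one option is to load the config into a dictionary and apply overrides in code. The following is a minimal sketch; `with_overrides` is a hypothetical helper, and the dotted keys simply follow the nesting of the sample config_ur.yaml.

```python
import copy


def with_overrides(config, overrides):
    """Return a deep copy of `config` with dotted-key overrides applied."""
    result = copy.deepcopy(config)  # leave the original config untouched
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            node = node.setdefault(key, {})  # create missing levels on the way down
        node[leaf] = value
    return result


if __name__ == "__main__":
    base = {"document": {"landscape": 0.5, "content": {"font": {"bold": 0}}}}
    portrait_bold = with_overrides(
        base, {"document.landscape": 0, "document.content.font.bold": 1}
    )
    print(portrait_bold)
```

Each variant can then be written to its own YAML file and passed to the synthtiger command in place of config_ur.yaml.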

Step 8: Extending to Other RTL Languages

To generate synthetic datasets for other RTL languages, repeat the steps above and:

  • Modify the corpus file to include text for the target language.
  • Add relevant fonts to the resources/font/<language-code>/ directory.
  • Update the paths in the configuration file to point to the new corpus and font directories.
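
The path updates in the last bullet can be sketched in a few lines. The helper below is hypothetical: it takes the Urdu config as a dictionary and returns a copy pointing at another language's corpus and font folder. The language code and corpus file name in the example are placeholders.

```python
import copy


def derive_config(base, lang_code, corpus_file):
    """Copy `base` and point its corpus and font paths at another language."""
    cfg = copy.deepcopy(base)
    content = cfg["document"]["content"]
    content["text"]["path"] = f"resources/corpus/{corpus_file}"
    content["font"]["paths"] = [f"resources/font/{lang_code}"]
    return cfg


if __name__ == "__main__":
    urdu = {
        "document": {
            "content": {
                "text": {"path": "resources/corpus/urdu_sample.txt"},
                "font": {"paths": ["resources/font/ur"], "weights": [1]},
            }
        }
    }
    # "ar" and "arabic_sample.txt" are example values only.
    print(derive_config(urdu, "ar", "arabic_sample.txt"))
```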

This method is suitable for generating synthetic data for Arabic, Urdu, Persian, Hebrew, and other right-to-left languages.

Conclusion

This guide showed how to create synthetic datasets for Donut OCR using SynthDoG. With the flexibility of config_ur.yaml, you can adjust parameters to match the specific needs of your project and target language.

For more information and updates, refer to the SynthDoG-RTL GitHub repository.

 
