AiViewz - Your Space for All Types of Content Creation

Donut (🍩), or Document Understanding Transformer, is an OCR-free, end-to-end Transformer model for document understanding. Unlike traditional pipelines, Donut needs no external OCR engine or API, yet it achieves state-of-the-art results across a range of visual document tasks, such as document classification and information extraction (also known as document parsing). Datasets for left-to-right languages like English, Spanish, and Chinese are readily available and can be generated with SynthDoG, but creating datasets for right-to-left (RTL) languages, such as Arabic and Urdu, is somewhat more complex.

We'll use SynthDoG 🐶, a Synthetic Document Generator, to make model pretraining adaptable across various languages and domains.

Here is a list of the main languages written in right-to-left scripts:

Arabic
Aramaic
Azeri (when written in Arabic script)
Dhivehi/Maldivian
Hebrew
Kurdish (Sorani)
Persian/Farsi
Urdu

Step 1: Installing Required Libraries

First, clone the SynthDoG-RTL GitHub repository to get access to all the necessary tools and configurations:

git clone https://github.com/aiviewz/Synthdog-RTL.git
cd Synthdog-RTL

This repository contains everything you need to get started, including configuration examples, templates, and background resources.

Next, install the required dependencies:

pip install synthtiger Pillow==9.5.0
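After installing, it can help to confirm which versions actually ended up in your environment. The sketch below uses only the standard library. The guide does not say why Pillow is pinned to 9.5.0; one plausible reason (an assumption here) is that Pillow 10 removed older text-measurement methods such as ImageFont.getsize that some text-rendering code still calls.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two packages installed in this step.
for name in ("synthtiger", "Pillow"):
    print(name, installed_version(name) or "not installed")
```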

Step 2: Setting Up Your Project Structure

Inside the cloned Synthdog-RTL directory, organize the project with the following structure:

/Synthdog-RTL
  ├── resources/
  │   ├── background/
  │   ├── paper/
  │   ├── font/
  │   │   └── ur/         # Folder for Urdu fonts
  │   └── corpus/
  │       └── urdu_sample.txt  # Sample text for Urdu
  └── config_ur.yaml       # Configuration file for Urdu

Explanation:

background/: Contains background images for synthetic documents.
paper/: Images for the paper texture if needed.
font/ur/: Place .ttf font files for the Urdu language here.
corpus/urdu_sample.txt: Contains sample paragraphs for Urdu.
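The layout above can be checked with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name check_layout is hypothetical, and it assumes you run it against the cloned Synthdog-RTL folder. It creates any missing folders and reports which of the two expected files are still absent, without overwriting anything.

```python
from pathlib import Path

# Folders and files taken from the project structure described above.
SUBDIRS = [
    "resources/background",
    "resources/paper",
    "resources/font/ur",
    "resources/corpus",
]
FILES = ["resources/corpus/urdu_sample.txt", "config_ur.yaml"]


def check_layout(root):
    """Create missing folders under root and return the expected files that are absent."""
    root = Path(root)
    for sub in SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    return [name for name in FILES if not (root / name).is_file()]
```

Calling check_layout(".") from inside the repository returns an empty list once the corpus and configuration file are in place.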

Step 3: Creating `config_ur.yaml` Configuration File

Below is a sample config_ur.yaml file configured for Urdu. This file determines how your synthetic dataset will be generated, including text layout, image effects, and dataset size:

#config_ur.yaml

quality: [95, 100]
landscape: 0.5
short_size: [720, 1024]
aspect_ratio: [1, 2]

background:
  image:
    paths: [resources/background]
    weights: [1]

  effect:
    args:
      # Gaussian blur
      - prob: 1
        args:
          sigma: [0, 10]

document:
  fullscreen: 0.5
  landscape: 0.5
  short_size: [480, 1024]
  aspect_ratio: [1, 2]

  paper:
    image:
      paths: [resources/paper]
      weights: [1]
      alpha: [0, 0.2]
      grayscale: 1
      crop: 1
  content:
    margin: [0, 0.1]
    text:
      path: resources/corpus/urdu_sample.txt
    font:
      paths: [resources/font/ur]
      weights: [1]
      bold: 0
    layout:
      text_scale: [0.0334, 0.1]
      max_row: 10
      max_col: 1
      fill: [0.5, 1]
      full: 0.1
      align: [right]  # Aligns RTL text to the right
      stack_spacing: [0.0334, 0.0334]
      stack_fill: [0.5, 1]
      stack_full: 0.1
    textbox:
      fill: [0.5, 1]
    textbox_color:
      prob: 0.2
      args:
        gray: [0, 64]
        colorize: 1
    content_color:
      prob: 0.2
      args:
        gray: [0, 64]
        colorize: 1
  rtl: true  # Enables RTL language support
  effect:
    args:
      # Elastic distortion
      - prob: 0.3
        args:
          alpha: [0, 0.4]
          sigma: [0, 0.5]
      # Gaussian noise
      - prob: 0.3
        args:
          scale: [0, 3]
          per_channel: 0
      # Perspective distortion
      - prob: 0.5
        args:
          weights: [750, 50, 50, 25, 25, 25, 25, 50]
          args:
            - percents: [[0.75, 1], [0.75, 1], [0.75, 1], [0.75, 1]]
            - percents: [[0.75, 1], [1, 1], [0.75, 1], [1, 1]]
            - percents: [[1, 1], [0.75, 1], [1, 1], [0.75, 1]]
            - percents: [[0.75, 1], [1, 1], [1, 1], [1, 1]]
            - percents: [[1, 1], [0.75, 1], [1, 1], [1, 1]]
            - percents: [[1, 1], [1, 1], [0.75, 1], [1, 1]]
            - percents: [[1, 1], [1, 1], [1, 1], [0.75, 1]]
            - percents: [[1, 1], [1, 1], [1, 1], [1, 1]]

effect:
  args:
    # Color adjustments
    - prob: 0.2
      args:
        rgb: [[0, 255], [0, 255], [0, 255]]
        alpha: [0, 0.2]
    # Shadow effects
    - prob: 0.3
      args:
        intensity: [0, 160]
        amount: [0, 1]
        smoothing: [0.2, 0.4]
        bidirectional: 0
    # Contrast enhancement
    - prob: 1
      args:
        alpha: [1, 1.5]
    # Brightness adjustment
    - prob: 1
      args:
        beta: [-48, 0]
    # Motion blur
    - prob: 0.4
      args:
        k: [3, 5]
        angle: [0, 360]
    # Gaussian blur
    - prob: 0.2
      args:
        sigma: [0, 1.5]
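To get a feel for the numbers in this configuration, here is a small arithmetic sketch. It assumes that a weights list is treated as relative sampling weights and that prob is the chance an effect is applied at all; both are assumptions about how the generator reads the file, not something this guide states. Under those assumptions, the eight perspective variants would be chosen with the probabilities computed below.

```python
def selection_probabilities(weights):
    """Normalize relative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


# Weights copied from the perspective distortion block in config_ur.yaml.
perspective_weights = [750, 50, 50, 25, 25, 25, 25, 50]
probs = selection_probabilities(perspective_weights)

# The perspective effect itself has prob 0.5, so each variant's overall
# chance per image is 0.5 times its normalized weight (under the assumption above).
effect_prob = 0.5
for index, p in enumerate(probs):
    print(f"variant {index}: {p:.3f} when applied, {effect_prob * p:.4f} overall")
```

The weights sum to 1000, so the first variant accounts for 75% of the cases in which the effect is applied.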

Step 4: Creating Sample Corpus (`urdu_sample.txt`)

Create a text file named urdu_sample.txt inside the resources/corpus/ folder. This file should contain sample Urdu text paragraphs. You can replace the text with any other RTL language content:

Example of urdu_sample.txt:

یہ ایک نمونہ پیراگراف ہے جسے آپ اپنی مرضی کے مطابق تبدیل کر سکتے ہیں۔
دوسری لائن میں کچھ اضافی متن شامل کریں۔
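Before generating data, you may want to confirm that the corpus really is right-to-left text. The sketch below is one possible check using Python's standard unicodedata module: it counts the letters whose Unicode bidirectional class is R or AL (the classes used by Hebrew and Arabic-script letters). The function names and the 0.8 threshold are illustrative choices, not part of SynthDoG.

```python
import unicodedata


def rtl_ratio(text):
    """Return the fraction of letters in text that belong to an RTL script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    rtl = sum(unicodedata.bidirectional(ch) in ("R", "AL") for ch in letters)
    return rtl / len(letters)


def low_rtl_lines(lines, threshold=0.8):
    """Return 1-based numbers of non-empty lines whose RTL ratio is below threshold."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.strip() and rtl_ratio(line) < threshold
    ]
```

For example, passing the lines of urdu_sample.txt to low_rtl_lines flags any line that is mostly Latin or another left-to-right script, which usually means stray text slipped into the corpus.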

Step 5: Adding Fonts

Place .ttf font files for Urdu in the resources/font/ur/ directory. Ensure the font supports the language you are targeting.

To get high-quality fonts for your target RTL language, you can download them from Google Fonts. Here's how you can do it:

Go to Google Fonts.
In the search bar, type the name of your target language (e.g., "Urdu", "Arabic", "Hebrew") to filter fonts that support that language.
Click on the font you want to use. You will see a "Download family" button in the top-right corner of the font details page. Click it to download the font family as a .zip file.
Extract the .zip file to find the .ttf (TrueType font) files. From the Synthdog-RTL directory, copy these .ttf files into the font folder:

cp path_to_downloaded_fonts/*.ttf resources/font/ur/

Make sure the font directory is set in config_ur.yaml:

font:
  paths: [resources/font/ur]

This process can be repeated for other languages by creating new directories under resources/font/ (e.g., resources/font/arabic for Arabic fonts).

Step 6: Generating the Dataset

Run the following command in a terminal, from the Synthdog-RTL directory, to generate your dataset. Adjust the parameters -c (number of samples) and -w (number of workers) as needed:

synthtiger -o ./outputs/SynthDoG_ur -c 1000 -w 2 -v template.py SynthDoG config_ur.yaml

Parameter Explanation:

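-o: Output directory where the generated dataset will be saved.
-c: Number of samples to generate.
-w: Number of worker processes used to speed up generation.
-v: Verbose mode; prints error messages raised during generation.
template.py SynthDoG: The template script and the name of the template class defined in it.
config_ur.yaml: Configuration file with all the settings for generation.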
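Once generation finishes, it is worth spot-checking the labels. The sketch below assumes the output follows the layout used by the upstream SynthDoG template: each split folder (for example train) holds a metadata.jsonl file whose rows have a file_name and a ground_truth JSON string containing gt_parse.text_sequence. If this fork writes a different format, adjust the keys accordingly; the function name read_metadata is hypothetical.

```python
import json
from pathlib import Path


def read_metadata(split_dir):
    """Return (file_name, text) pairs from the metadata.jsonl file in split_dir."""
    records = []
    with open(Path(split_dir) / "metadata.jsonl", encoding="utf-8") as fp:
        for line in fp:
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            # ground_truth is assumed to be a JSON string nested inside the row.
            ground_truth = json.loads(row["ground_truth"])
            records.append((row["file_name"], ground_truth["gt_parse"]["text_sequence"]))
    return records
```

A call such as read_metadata("outputs/SynthDoG_ur/train") would then let you print a few image names next to their Urdu text and compare them with the rendered images.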

Step 7: Modifying Configuration for Different Effects

Here’s a brief guide to some parameters you can tweak in the config_ur.yaml file:

Document Layout:
- landscape: Probability of a landscape page. Set to 1 for all landscape documents or 0 for all portrait.
- fullscreen: Probability that the document fills the whole image. Set to 1 so every document covers the background completely.
Text Adjustments:
- font.bold: Set to 1 to make the text bold.
- align: Modify to left or center if you want different text alignments for other languages.
Effects:
- Elastic distortion: Adjust alpha and sigma for distortion intensity.
- Gaussian blur: Change sigma in effect.args to increase/decrease the blur.
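If you experiment with many variants, editing the file by hand gets tedious. The following sketch shows one way to apply such tweaks programmatically to a configuration that has already been loaded into a Python dictionary (for instance with a YAML parser, which is assumed rather than shown here). The helper with_overrides and the dotted-key convention are illustrative, not part of SynthDoG.

```python
import copy


def with_overrides(config, overrides):
    """Return a copy of config with dotted-key overrides applied.

    Raises KeyError if an intermediate key in a dotted path does not exist.
    """
    result = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            node = node[key]
        node[leaf] = value
    return result


# A trimmed-down stand-in for the structure of config_ur.yaml.
base = {
    "document": {
        "landscape": 0.5,
        "fullscreen": 0.5,
        "content": {"font": {"bold": 0}, "layout": {"align": ["right"]}},
    }
}

# Portrait-only pages with bold text; the original dictionary stays unchanged.
portrait_bold = with_overrides(
    base, {"document.landscape": 0, "document.content.font.bold": 1}
)
```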

Step 8: Extending to Other RTL Languages

To generate synthetic datasets for other RTL languages, repeat the steps above and:

Modify the corpus file to include text for the target language.
Add relevant fonts to the resources/font/<language-code>/ directory.
Update the paths in the configuration file to point to the new corpus and font directories.

This method is suitable for generating synthetic data for Arabic, Urdu, Persian, Hebrew, and other right-to-left languages.
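The three steps above can be sketched as a small helper that derives a new configuration from config_ur.yaml. This is a hedged illustration: the function name derive_config, the language code, and the corpus file name are all placeholders you choose yourself, and the helper relies on the Urdu paths appearing in the file exactly as shown in Step 3.

```python
from pathlib import Path


def derive_config(root, lang_code, corpus_name, source="config_ur.yaml"):
    """Copy the Urdu config to config_<lang_code>.yaml with font and corpus paths swapped."""
    root = Path(root)
    text = (root / source).read_text(encoding="utf-8")
    # Plain string replacement; it only works if the Urdu paths match Step 3 exactly.
    text = text.replace("resources/font/ur", f"resources/font/{lang_code}")
    text = text.replace(
        "resources/corpus/urdu_sample.txt", f"resources/corpus/{corpus_name}"
    )
    # Create the font folder for the new language so fonts can be dropped in.
    (root / "resources" / "font" / lang_code).mkdir(parents=True, exist_ok=True)
    target = root / f"config_{lang_code}.yaml"
    target.write_text(text, encoding="utf-8")
    return target
```

For Arabic, for example, derive_config(".", "ar", "arabic_sample.txt") would write config_ar.yaml, which you can then pass to the synthtiger command from Step 6 in place of config_ur.yaml.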

Conclusion

This guide showed how to create synthetic datasets for Donut OCR using SynthDoG. With the flexibility of the YAML configuration file (config_ur.yaml in this guide), you can adjust parameters to match the specific needs of your project and target language.

For more information and updates, refer to the SynthDoG-RTL GitHub repository.
