Donut (🍩), or Document Understanding Transformer, is an OCR-free, end-to-end Transformer model designed for document understanding. Unlike traditional methods, Donut does not need external OCR engines or APIs, yet it achieves state-of-the-art results across a range of visual document tasks, such as document classification and information extraction (also known as document parsing). While datasets for left-to-right languages like English, Spanish, and Chinese are readily available and can be generated with SynthDoG, creating datasets for right-to-left (RTL) languages, such as Arabic and Urdu, is more involved.
We'll use SynthDoG 🐶, a Synthetic Document Generator, to make model pretraining adaptable across various languages and domains.
Here is a list of the main languages that use right-to-left scripts:
- Arabic
- Aramaic
- Azeri
- Dhivehi/Maldivian
- Hebrew
- Kurdish (Sorani)
- Persian/Farsi
- Urdu
Step 1: Installing Required Libraries
First, clone the SynthDoG-RTL GitHub repository to get access to all the necessary tools and configurations:
git clone https://github.com/aiviewz/Synthdog-RTL.git
cd Synthdog-RTL
This repository contains everything you need to get started, including configuration examples, templates, and background resources.
Next, install the required dependencies:
pip install synthtiger Pillow==9.5.0
Step 2: Setting Up Your Project Structure
Inside the cloned Synthdog-RTL directory, organize the project with the following structure:
/Synthdog-RTL
├── resources/
│   ├── background/
│   ├── paper/
│   ├── font/
│   │   └── ur/              # Folder for Urdu fonts
│   └── corpus/
│       └── urdu_sample.txt  # Sample text for Urdu
└── config_ur.yaml           # Configuration file for Urdu
Explanation:
- background/: Contains background images for synthetic documents.
- paper/: Images for the paper texture if needed.
- font/ur/: Place .ttf font files for the Urdu language here.
- corpus/urdu_sample.txt: Contains sample paragraphs for Urdu.
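As a convenience, the layout above could be created with a few lines of Python rather than by hand. This is only a sketch: the function name `create_skeleton` is hypothetical, and it assumes you run it with the cloned repository as the root.

```python
from pathlib import Path

# Sub-directories from the layout above, relative to the project root.
RESOURCE_DIRS = [
    "resources/background",
    "resources/paper",
    "resources/font/ur",
    "resources/corpus",
]


def create_skeleton(root: Path) -> list:
    """Create the resource folders and an empty corpus file; return the folders."""
    created = []
    for rel in RESOURCE_DIRS:
        folder = root / rel
        folder.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(folder)
    corpus = root / "resources/corpus/urdu_sample.txt"
    corpus.touch(exist_ok=True)  # placeholder; existing text is left untouched
    return created
```

Calling `create_skeleton(Path("."))` from inside the Synthdog-RTL directory leaves existing folders and files as they are, so it can be repeated safely.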
Step 3: Creating the config_ur.yaml Configuration File
Below is a sample config_ur.yaml file configured for Urdu. This file determines how your synthetic dataset will be generated, including text layout, image effects, and dataset size:
# config_ur.yaml
quality: [95, 100]
landscape: 0.5
short_size: [720, 1024]
aspect_ratio: [1, 2]

background:
  image:
    paths: [resources/background]
    weights: [1]

  effect:
    args:
      # Gaussian blur
      - prob: 1
        args:
          sigma: [0, 10]

document:
  fullscreen: 0.5
  landscape: 0.5
  short_size: [480, 1024]
  aspect_ratio: [1, 2]

  paper:
    image:
      paths: [resources/paper]
      weights: [1]
      alpha: [0, 0.2]
      grayscale: 1
      crop: 1

  content:
    margin: [0, 0.1]
    text:
      path: resources/corpus/urdu_sample.txt
    font:
      paths: [resources/font/ur]
      weights: [1]
      bold: 0
    layout:
      text_scale: [0.0334, 0.1]
      max_row: 10
      max_col: 1
      fill: [0.5, 1]
      full: 0.1
      align: [right]  # Aligns RTL text to the right
      stack_spacing: [0.0334, 0.0334]
      stack_fill: [0.5, 1]
      stack_full: 0.1
    textbox:
      fill: [0.5, 1]
    textbox_color:
      prob: 0.2
      args:
        gray: [0, 64]
        colorize: 1
    content_color:
      prob: 0.2
      args:
        gray: [0, 64]
        colorize: 1
    rtl: true  # Enables RTL language support

  effect:
    args:
      # Elastic distortion
      - prob: 0.3
        args:
          alpha: [0, 0.4]
          sigma: [0, 0.5]
      # Gaussian noise
      - prob: 0.3
        args:
          scale: [0, 3]
          per_channel: 0
      # Perspective distortion
      - prob: 0.5
        args:
          weights: [750, 50, 50, 25, 25, 25, 25, 50]
          args:
            - percents: [[0.75, 1], [0.75, 1], [0.75, 1], [0.75, 1]]
            - percents: [[0.75, 1], [1, 1], [0.75, 1], [1, 1]]
            - percents: [[1, 1], [0.75, 1], [1, 1], [0.75, 1]]
            - percents: [[0.75, 1], [1, 1], [1, 1], [1, 1]]
            - percents: [[1, 1], [0.75, 1], [1, 1], [1, 1]]
            - percents: [[1, 1], [1, 1], [0.75, 1], [1, 1]]
            - percents: [[1, 1], [1, 1], [1, 1], [0.75, 1]]
            - percents: [[1, 1], [1, 1], [1, 1], [1, 1]]

effect:
  args:
    # Color adjustments
    - prob: 0.2
      args:
        rgb: [[0, 255], [0, 255], [0, 255]]
        alpha: [0, 0.2]
    # Shadow effects
    - prob: 0.3
      args:
        intensity: [0, 160]
        amount: [0, 1]
        smoothing: [0.2, 0.4]
        bidirectional: 0
    # Contrast enhancement
    - prob: 1
      args:
        alpha: [1, 1.5]
    # Brightness adjustment
    - prob: 1
      args:
        beta: [-48, 0]
    # Motion blur
    - prob: 0.4
      args:
        k: [3, 5]
        angle: [0, 360]
    # Gaussian blur
    - prob: 0.2
      args:
        sigma: [0, 1.5]
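Before generating anything, it can help to confirm that every folder and file the configuration points to actually exists. The sketch below assumes the YAML has already been parsed into a Python dict (for example with PyYAML's `yaml.safe_load`, if that package is available in your environment) and that the keys are nested as in the sample above. The function name `missing_paths` is hypothetical.

```python
from pathlib import Path


def missing_paths(config: dict, root: Path) -> list:
    """Return the config-referenced paths that do not exist under root."""
    document = config.get("document", {})
    content = document.get("content", {})

    candidates = []
    # Image folders for the page background and the paper texture.
    candidates += config.get("background", {}).get("image", {}).get("paths", [])
    candidates += document.get("paper", {}).get("image", {}).get("paths", [])
    # Font folders and the single corpus file.
    candidates += content.get("font", {}).get("paths", [])
    corpus = content.get("text", {}).get("path")
    if corpus:
        candidates.append(corpus)

    return [p for p in candidates if not (root / p).exists()]
```

An empty result means all referenced resources were found; anything else is a list of paths to create or correct before running the generator.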
Step 4: Creating Sample Corpus (urdu_sample.txt)
Create a text file named urdu_sample.txt inside the resources/corpus/ folder. This file should contain sample Urdu text paragraphs. You can replace the text with content in any other RTL language:
Example of urdu_sample.txt:
یہ ایک نمونہ پیراگراف ہے جسے آپ اپنی مرضی کے مطابق تبدیل کر سکتے ہیں۔
دوسری لائن میں کچھ اضافی متن شامل کریں۔
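A quick sanity check on the corpus is to measure how much of it is actually right-to-left text, which catches a file that was accidentally saved with the wrong content. The sketch below uses the Unicode bidirectional classes exposed by Python's standard `unicodedata` module; the function name `rtl_ratio` is hypothetical, and what counts as an acceptable ratio is left to you.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}            # Hebrew-type and Arabic-type strong characters
STRONG_CLASSES = RTL_CLASSES | {"L"}  # add left-to-right strong characters


def rtl_ratio(text: str) -> float:
    """Fraction of strongly directional characters that are right-to-left."""
    classes = [unicodedata.bidirectional(ch) for ch in text]
    # Ignore digits, spaces, and punctuation, which have no strong direction.
    strong = [cls for cls in classes if cls in STRONG_CLASSES]
    if not strong:
        return 0.0
    return sum(cls in RTL_CLASSES for cls in strong) / len(strong)
```

For example, you could read resources/corpus/urdu_sample.txt with UTF-8 encoding and pass its contents to `rtl_ratio`; a value near 1 suggests the corpus is predominantly RTL text, while a value near 0 suggests the wrong file.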
Step 5: Adding Fonts
Place .ttf font files for Urdu in the resources/font/ur/ directory. Ensure the font supports the language you are targeting.
To get high-quality fonts for your target RTL language, you can download them from Google Fonts. Here's how you can do it:
- Go to Google Fonts.
- In the search bar, type the name of your target language (e.g., "Urdu", "Arabic", "Hebrew") to filter fonts that support that language.
- Click on the font you want to use. You will see a "Download family" button in the top-right corner of the font details page. Click it to download the font family as a .zip file.
- Extract the .zip file to find the .ttf files (TrueType Font files). Copy these .ttf files to the appropriate directory in your project (run from inside the Synthdog-RTL directory):
cp path_to_downloaded_fonts/*.ttf resources/font/ur/
Make sure that the font directory is set in config_ur.yaml:
font:
  paths: [resources/font/ur]
This process can be repeated for other languages by creating new directories under resources/font/ (e.g., resources/font/arabic for Arabic fonts).
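The extract-and-copy step can also be scripted. The following is a minimal sketch using only the standard library; the function name `extract_ttf` is hypothetical, and it assumes the downloaded archive is an ordinary .zip file whose fonts may sit in sub-folders.

```python
import zipfile
from pathlib import Path


def extract_ttf(zip_path: Path, font_dir: Path) -> list:
    """Copy every .ttf member of the archive into font_dir, flattening folders."""
    font_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            if not member.lower().endswith(".ttf"):
                continue  # skip licence files, variable-font notes, etc.
            target = font_dir / Path(member).name  # drop any sub-folder prefix
            target.write_bytes(archive.read(member))
            written.append(target)
    return written
```

A call such as `extract_ttf(Path("downloaded_family.zip"), Path("resources/font/ur"))` (the archive name here is a placeholder) returns the list of font files it wrote, so an empty list signals that the archive contained no .ttf files.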
Step 6: Generating the Dataset
Run the following command in a terminal to generate your dataset. Adjust the parameters -c (number of samples) and -w (number of workers) as needed:
synthtiger -o ./outputs/SynthDoG_ur -c 1000 -w 2 -v template.py SynthDoG config_ur.yaml
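Parameter Explanation:
- -o: Output directory where generated images will be saved.
- -c: Number of samples to generate.
- -w: Number of worker processes used to speed up generation.
- -v: Prints error messages while generating data.
- template.py SynthDoG: The template script and the name of the template class inside it.
- config_ur.yaml: Configuration file with all the settings for generation.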
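If you generate several datasets, it may be convenient to assemble the command in Python. The sketch below only builds the argument list shown in this step; the helper name `build_command` is hypothetical, and the flags are the ones already used in the command above.

```python
import shlex


def build_command(config: str, output_dir: str, count: int, workers: int) -> list:
    """Assemble the synthtiger argument list used in this step."""
    return [
        "synthtiger",
        "-o", output_dir,
        "-c", str(count),
        "-w", str(workers),
        "-v",
        "template.py", "SynthDoG", config,
    ]


cmd = build_command("config_ur.yaml", "./outputs/SynthDoG_ur", count=1000, workers=2)
print(shlex.join(cmd))  # show the command as it would be typed in a shell
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside the Synthdog-RTL directory, which avoids shell quoting problems when paths contain spaces.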
Step 7: Modifying Configuration for Different Effects
Here's a brief guide to some parameters you can tweak in the config_ur.yaml file:
- Document Layout:
  - landscape: A probability. Set to 1 for landscape documents only or 0 for portrait only.
  - fullscreen: A probability. Set to 1 so that the document always fills the entire image.
- Text Adjustments:
  - font.bold: Set to 1 to make the text bold.
  - align: Change to left or center if you want different text alignments for other languages.
- Effects:
  - Elastic distortion: Adjust alpha and sigma for distortion intensity.
  - Gaussian blur: Change sigma in effect.args to increase or decrease the blur.
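When experimenting with these settings, one option is to apply overrides to a parsed copy of the configuration instead of editing the file each time. The sketch below assumes the config is already a nested Python dict, keyed as in the sample config_ur.yaml; the helper name `with_override` and the dotted-key convention are hypothetical.

```python
import copy


def with_override(config: dict, dotted_key: str, value) -> dict:
    """Return a copy of config with one nested key replaced."""
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    node = updated
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})  # create missing levels on the way down
    node[leaf] = value
    return updated


base = {"document": {"fullscreen": 0.5, "content": {"font": {"bold": 0}}}}
bold_fullscreen = with_override(
    with_override(base, "document.content.font.bold", 1),
    "document.fullscreen",
    1,
)
```

The returned dict can then be written back to a new YAML file with whatever YAML library you use, keeping the original configuration intact for comparison.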
Step 8: Extending to Other RTL Languages
To generate synthetic datasets for other RTL languages, repeat the steps above and:
- Modify the corpus file to include text for the target language.
- Add relevant fonts to the resources/font/<language-code>/ directory.
- Update the paths in the configuration file to point to the new corpus and font directories.
This method is suitable for generating synthetic data for Arabic, Urdu, Persian, Hebrew, and similar right-to-left languages.
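The path updates for a new language can be sketched as a small function over a parsed configuration. It assumes the config is a nested dict keyed as in the sample config_ur.yaml and that resources follow the resources/corpus/ and resources/font/<language-code>/ convention used in this guide. The function name and the example values ("ar", "arabic_sample.txt") are hypothetical.

```python
import copy


def config_for_language(base: dict, lang_code: str, corpus_file: str) -> dict:
    """Copy a parsed config and repoint its corpus and font paths."""
    derived = copy.deepcopy(base)  # the Urdu config stays unchanged
    content = derived.setdefault("document", {}).setdefault("content", {})
    content.setdefault("text", {})["path"] = f"resources/corpus/{corpus_file}"
    content.setdefault("font", {})["paths"] = [f"resources/font/{lang_code}"]
    return derived


urdu = {
    "document": {
        "content": {
            "text": {"path": "resources/corpus/urdu_sample.txt"},
            "font": {"paths": ["resources/font/ur"], "weights": [1], "bold": 0},
        }
    }
}
arabic = config_for_language(urdu, "ar", "arabic_sample.txt")
```

All other settings, such as layout and effects, carry over from the base configuration, so only the corpus and fonts differ between languages.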
Conclusion
This guide helps you create high-quality synthetic datasets for Donut using SynthDoG. With the flexibility of config_ur.yaml, you can adjust parameters to match the specific needs of your project and target language.
For more information and updates, refer to the SynthDoG-RTL GitHub repository.