Recognizing text from image files

Tuan Nguyen
Jul 6, 2020

Step 1: Install Tesseract OCR

Download the installer: https://github.com/UB-Mannheim/tesseract/wiki

Note: to recognize Vietnamese, during installation click the + next to Additional language data and tick Vietnamese.

Installation path on 64-bit Windows: C:\Program Files\Tesseract-OCR

Step 2: Create an environment variable on Windows

Right-click This PC > Properties > Advanced system settings > on the Advanced tab choose Environment Variables... > under System variables choose New > Variable name: enter TESSDATA_PREFIX > Variable value: enter C:\Program Files\Tesseract-OCR\tessdata
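To confirm that the variable is visible to Python and that the Vietnamese data was actually installed, a small check like the one below can help. This is only a sketch: the function name find_language_data is made up for illustration, and it assumes TESSDATA_PREFIX points directly at the tessdata folder, as set in this step.

```python
import os
from pathlib import Path


def find_language_data(lang="vie", env=None):
    """Return the path to <lang>.traineddata under TESSDATA_PREFIX, or None."""
    # env is a hypothetical parameter so the check can be tried without
    # touching the real environment; by default os.environ is used.
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        return None  # the variable is not set in this process
    candidate = Path(prefix) / f"{lang}.traineddata"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    found = find_language_data("vie")
    if found is None:
        print("vie.traineddata not found; check TESSDATA_PREFIX and the installer's language options")
    else:
        print(f"Vietnamese language data found at {found}")
```

A terminal that was already open before the variable was created will not see it, so open a new terminal before running the check.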

Step 3: Install the required modules

Open a terminal and run the following commands:

pip install opencv-python

pip install numpy

pip install pytesseract
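After the installs finish, it can be worth checking that all three modules are importable from the Python interpreter you plan to use. The sketch below relies only on the standard library; missing_modules is a hypothetical helper name, not part of any of these packages.

```python
import importlib.util


def missing_modules(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names can differ from pip package names:
    # opencv-python is imported as cv2.
    required = ["cv2", "numpy", "pytesseract"]
    missing = missing_modules(required)
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All modules available")
```

If something is reported as missing even though pip succeeded, pip and python probably belong to different installations; running python -m pip install with the same interpreter usually resolves that.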

Step 4: Write the code

# Import modules
import cv2
import numpy as np
import pytesseract

# Path to the installed Tesseract OCR executable
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load the image (cv2.imread returns None if the file cannot be read)
img = cv2.imread('Capture.png')
if img is None:
    raise FileNotFoundError("Could not read Capture.png")

# Convert the image to grayscale
imgGray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Convert the image to black and white (block size 85, constant 11)
adaptive_threshold = cv2.adaptiveThreshold(imgGray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 85, 11)

# Set the page segmentation mode (PSM); 3 is fully automatic segmentation
config = "--psm 3"

# lang="vie" enables Vietnamese recognition; if lang is omitted, the default is English
text = pytesseract.image_to_string(adaptive_threshold, config=config, lang="vie")

# Append the result to outFile.doc (plain text that Word can open)
with open("outFile.doc", "a", encoding="utf-8") as f:
    f.write(text)

# Uncomment to preview the thresholded image
# cv2.imshow("imgBlack&White", adaptive_threshold)
# cv2.waitKey(0)
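If you want to try other images or segmentation modes, the same steps can be wrapped in a function. The version below is a rough sketch, not the only way to structure it: build_psm_config and image_to_text are names invented here, and it assumes Tesseract is either on PATH or that tesseract_cmd has been set as in the script above. Tesseract accepts PSM values from 0 to 13, so the helper rejects anything else before Tesseract is called.

```python
def build_psm_config(psm=3):
    """Build the Tesseract config string for a page segmentation mode (0-13)."""
    if not isinstance(psm, int) or not 0 <= psm <= 13:
        raise ValueError(f"psm must be an integer from 0 to 13, got {psm!r}")
    return f"--psm {psm}"


def image_to_text(path, lang="vie", psm=3, block_size=85, c=11):
    """Run the grayscale -> adaptive threshold -> OCR steps on one image."""
    # Imported here so build_psm_config can be used without OpenCV installed.
    import cv2
    import pytesseract

    img = cv2.imread(path)
    if img is None:  # cv2.imread returns None instead of raising on failure
        raise FileNotFoundError(f"Could not read image: {path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # block_size must be an odd number greater than 1
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, block_size, c
    )
    return pytesseract.image_to_string(binary, lang=lang, config=build_psm_config(psm))
```

For example, image_to_text("Capture.png", psm=6) treats the image as a single uniform block of text, which may suit a cropped paragraph better than the automatic mode 3.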
