• Latest
  • Trending
Simple Text Extraction Using Python And Tesseract OCR

Simple Text Extraction Using Python And Tesseract OCR

July 4, 2022
ATC Ghana supports Girls-In-ICT Program

ATC Ghana supports Girls-In-ICT Program

April 25, 2023
Vice President Dr. Bawumia inaugurates  ICT Hub

Vice President Dr. Bawumia inaugurates ICT Hub

April 2, 2023
Co-Creation Hub’s edtech accelerator puts $15M towards African startups

Co-Creation Hub’s edtech accelerator puts $15M towards African startups

February 20, 2023
Data Leak Hits Thousands of NHS Workers

Data Leak Hits Thousands of NHS Workers

February 20, 2023
EU Cybersecurity Agency Warns Against Chinese APTs

EU Cybersecurity Agency Warns Against Chinese APTs

February 20, 2023
How Your Storage System Will Still Be Viable in 5 Years’ Time?

How Your Storage System Will Still Be Viable in 5 Years’ Time?

February 20, 2023
The Broken Promises From Cybersecurity Vendors

Cloud Infrastructure Used By WIP26 For Espionage Attacks on Telcos

February 20, 2023
Instagram and Facebook to get paid-for verification

Instagram and Facebook to get paid-for verification

February 20, 2023
YouTube CEO Susan Wojcicki steps down after nine years

YouTube CEO Susan Wojcicki steps down after nine years

February 20, 2023
Inaugural AfCFTA Conference on Women and Youth in Trade

Inaugural AfCFTA Conference on Women and Youth in Trade

September 6, 2022
Instagram fined €405m over children’s data privacy

Instagram fined €405m over children’s data privacy

September 6, 2022
8 Most Common Causes of a Data Breach

5.7bn data entries found exposed on Chinese VPN

August 18, 2022
  • Consumer Watch
  • Kids Page
  • Directory
  • Events
  • Reviews
Monday, 12 May, 2025
  • Login
itechnewsonline.com
  • Home
  • Tech
  • Africa Tech
  • InfoSEC
  • Data Science
  • Data Storage
  • Business
  • Opinion
Subscription
Advertise
No Result
View All Result
itechnewsonline.com
No Result
View All Result

Simple Text Extraction Using Python And Tesseract OCR

by ITECHNEWS
July 4, 2022
in Data Science, Leading Stories
0 0
0
Simple Text Extraction Using Python And Tesseract OCR

Introduction

Hello! In this quick tutorial I will show how to create a simple program using Python and Tesseract OCR that can extract text from image files. 😃


What is Tesseract?

Tesseract is an open source OCR (Optical Character Recognition) engine that can recognize multiple languages.

YOU MAY ALSO LIKE

ATC Ghana supports Girls-In-ICT Program

Vice President Dr. Bawumia inaugurates ICT Hub

An OCR engine can save time by digitilizing documents rather than manually typing the content of the document.


Installing Teesseract and the engine

Instalation method will vary on the Operating System you use.

Instructions on how to download Tesseract can be found here:
https://tesseract-ocr.github.io/tessdoc/Downloads.html

You will also needed the trained data for the language you are dealing with, pre-trained data can be found via the link below are downloaded via the command line etc depending on the Operating System you are using.
https://github.com/tesseract-ocr/tessdata


Initializing the Python Virtual Environment

Creating a virtual environment can be done via the following command:

python3 -m venv env
source env/bin/activate

Coding the program

Next we need to actually write some Python!
Open up “main.py” in your preferred IDE and add the following:

Importing the modules and setting the Tesseract command.

Please note that the Tesseract command will again vary based on the Operating System you are using. (For the record I’m using linux)

import argparse
import pytesseract
import cv2

# Path to the location of the Tesseract-OCR executable/command
pytesseract.pytesseract.tesseract_cmd = "/usr/bin/tesseract"

Next we need to create a method that reads text from an image file and saves it into a file called “results.txt”.

def read_text_from_image(image):
  """Reads text from an image file and outputs found text to text file"""
  # Convert the image to grayscale
  gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

  # Perform OTSU Threshold
  ret, thresh = cv2.threshold(gray_image, 0, 255, cv2.THRESH_OTSU | cv2.THRESH_BINARY_INV)

  rect_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (18, 18))

  dilation = cv2.dilate(thresh, rect_kernel, iterations = 1)

  contours, hierachy = cv2.findContours(dilation, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

  image_copy = image.copy()

  for contour in contours:
    x, y, w, h = cv2.boundingRect(contour)

    cropped = image_copy[y : y + h, x : x + w]

    file = open("results.txt", "a")

    text = pytesseract.image_to_string(cropped)

    file.write(text)
    file.write("\n")

  file.close()

This method basically takes an image, converts it to grayscale, gets the text from the image and then for each found text appends it to the results file.

Finally we need to write the main method.

if __name__ == "__main__":
  ap = argparse.ArgumentParser()
  ap.add_argument("-i", "--image", required = True, help = "Path to input file")
  args = vars(ap.parse_args())

  image = cv2.imread(args["image"])
  read_text_from_image(image)

Like my previous tutorials the main method basically just takes an image file as input and then passes it on to the text extraction method.

The program can then be run via the following command:

python main.py -i text.png

If all works well you should have a “results.txt” file in your working directory that contains the text from the image.


Conclusion

Here I have shown how to create a simple program that extracts text from an image using Python and Tesseract OCR.

Source: Ethan
Tags: Simple Text Extraction Using Python And Tesseract OCR
ShareTweetShare
Plugin Install : Subscribe Push Notification need OneSignal plugin to be installed.

Search

No Result
View All Result

Recent News

ATC Ghana supports Girls-In-ICT Program

ATC Ghana supports Girls-In-ICT Program

April 25, 2023
Vice President Dr. Bawumia inaugurates  ICT Hub

Vice President Dr. Bawumia inaugurates ICT Hub

April 2, 2023
Co-Creation Hub’s edtech accelerator puts $15M towards African startups

Co-Creation Hub’s edtech accelerator puts $15M towards African startups

February 20, 2023

About What We Do

itechnewsonline.com

We bring you the best Premium Tech News.

Recent News With Image

ATC Ghana supports Girls-In-ICT Program

ATC Ghana supports Girls-In-ICT Program

April 25, 2023
Vice President Dr. Bawumia inaugurates  ICT Hub

Vice President Dr. Bawumia inaugurates ICT Hub

April 2, 2023

Recent News

  • ATC Ghana supports Girls-In-ICT Program April 25, 2023
  • Vice President Dr. Bawumia inaugurates ICT Hub April 2, 2023
  • Co-Creation Hub’s edtech accelerator puts $15M towards African startups February 20, 2023
  • Data Leak Hits Thousands of NHS Workers February 20, 2023
  • Home
  • InfoSec
  • Opinion
  • Africa Tech
  • Data Storage

© 2021-2022 iTechNewsOnline.Com - Powered by BackUPDataSystems

No Result
View All Result
  • Home
  • Tech
  • Africa Tech
  • InfoSEC
  • Data Science
  • Data Storage
  • Business
  • Opinion

© 2021-2022 iTechNewsOnline.Com - Powered by BackUPDataSystems

Welcome Back!

Login to your account below

Forgotten Password?

Retrieve your password

Please enter your username or email address to reset your password.

Log In
Go to mobile version