Teach Python how to read
This is the first blog post of a three- or four-part project (I haven't decided yet). The main idea is to create a model that tells me how much I would like a book, given an image of one of its pages as input.
In this part, I will show you how to turn an image of text into actual text using pytesseract. So let's first get our packages.
!apt-get update
!apt-get install libleptonica-dev -y
!apt-get install tesseract-ocr tesseract-ocr-dev -y
!apt-get install libtesseract-dev -y
!apt-get install tesseract-ocr-deu -y
!pip install pytesseract
from fastai.vision.all import *
from fastai.vision.widgets import *
import numpy as np
from pathlib import Path
import pytesseract
import re
from PIL import Image, ImageFilter
import pandas as pd
I took about 100 photos of pages from books I own (all in German) and put them in an images folder. Let's have a look at the directory.
p = Path.cwd()/'images'
img_paths = list(p.iterdir())
# .name is the file name itself; unlike .parts[5] it does not depend on how deep the folder is nested
img_paths[0].name
Besides the text from the image, I would like to extract the title of the book so I can easily join my ratings to the texts later. I use a regex for that: the title is the part of the file name between the last underscore and the extension.
# raw string, so the backslash in \. reaches the regex engine unchanged
title_list = [re.match(r"^.*_(.*)\..*$", path.name).group(1) for path in img_paths]
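To see what the regex picks out, here is a minimal sketch on made-up file names. The names below are hypothetical and only assume a `<anything>_<title>.<extension>` scheme; `extract_title` is a helper introduced for this illustration, and it returns None instead of raising when a name does not fit the scheme.

```python
import re
from pathlib import Path

# Hypothetical file names, assumed to follow "<anything>_<title>.<extension>"
example_paths = [Path("images/001_derPate.jpg"), Path("images/scan_02_diePest.png")]

title_pattern = re.compile(r"^.*_(.*)\..*$")

def extract_title(path):
    """Return the part between the last underscore and the last dot, or None."""
    match = title_pattern.match(Path(path).name)
    return match.group(1) if match else None

example_titles = [extract_title(path) for path in example_paths]
print(example_titles)
```

Because the first `.*` is greedy, only the text after the last underscore ends up in the title, so extra underscores earlier in the name do no harm.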
We use pytesseract to extract the text from the images. To improve the OCR quality I tried a lot of image transformations: cropping, binarizing and more. The only thing that worked for me was to first rotate the image and then apply a median filter.
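def proc_img(img_path):
    im1 = Image.open(img_path)
    # rotate by 270 degrees (counter-clockwise); expand=True grows the canvas so nothing gets cropped
    im1 = im1.rotate(angle=270, resample=0, expand=True)
    # median filter (default 3x3 window) to suppress speckle noise
    im1 = im1.filter(ImageFilter.MedianFilter)
    return im1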
All of my input images are from German books, so I need to use lang="deu".
def get_text(img):
    return pytesseract.image_to_string(img, lang="deu")
I again use a regular expression to get rid of common mistakes pytesseract makes: putting a \n somewhere or confusing an s with a 5.
def use_pattern(text):
    # look up each match in the replacement table (keys are stored in escaped form)
    return pattern.sub(lambda m: rep[re.escape(m.group(0))], text)
# define desired replacements here
rep = {"\n": "", "`": "", '%':"", '°': '', '&':'', '‘':'', '€':'e', '®':'', '\\': '', '5':'s', '1':'i', '_':'', '-':''}
# escape the keys so they are matched literally, then join them into one alternation pattern
rep = dict((re.escape(k), v) for k, v in rep.items())
pattern = re.compile("|".join(rep.keys()))
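A small sketch of the same dictionary-driven replacement on a made-up OCR string may help to see its side effects. It uses only a subset of the table above, and the names `replacements`, `cleanup_pattern` and `clean_text` are introduced here for illustration. As a slight variation, the keys are escaped only while building the pattern, so the dictionary can be looked up with the matched text directly.

```python
import re

# A subset of the replacement table, just for illustration
replacements = {"\n": "", "5": "s", "1": "i", "_": "", "&": ""}

# Escape the keys only when building the pattern, so the dict can be
# looked up directly with the matched text.
cleanup_pattern = re.compile("|".join(re.escape(key) for key in replacements))

def clean_text(text):
    """Replace every key found in the text with its value from the table."""
    return cleanup_pattern.sub(lambda m: replacements[m.group(0)], text)

sample = "Da5 i5t\nein Te5t"  # made-up OCR output
print(repr(clean_text(sample)))
print(repr(clean_text("Kapitel 15")))
```

Two things are worth keeping in mind with this approach: removing "\n" without inserting a space glues the words on both sides of the line break together, and genuine digits such as page or chapter numbers get rewritten as well.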
Finally, we use a list comprehension (they're super useful) to put all of the text into a list of texts.
text_list = [use_pattern(get_text(proc_img(str(path)))) for path in img_paths]
And now let's put that into a pandas dataframe.
d = {'text':text_list,'title':title_list}
df = pd.DataFrame(d)
df.head()
df.text[3]
Looking good! For this project I also need my ratings for each of the books. I use a dictionary and the map function to easily create a column with my ratings.
rating_lasse = {'derPate': 5,
                'ButchersCrossing': 4,
                'derSeewolf': 5,
                'JekyllandHyde': 4,
                'gegendenStrich': 1,
                'FruestueckmitKaengurus': 5,
                'HuckleberryFinn': 4,
                'diePest': 2,
                'HerzderFinsternis': 3,
                'derSpieler': 4}
df['rating'] = df['title'].map(rating_lasse)
df.head()
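One thing to watch out for: map leaves NaN in the rating column for every title that is not a key of the dictionary, for example because of a typo or different capitalization in a file name. A quick sanity check could look like the sketch below; the titles and ratings in it are made-up stand-ins, not my real data.

```python
# Hypothetical titles as they might come out of the file names
example_titles = ['derPate', 'diePest', 'DerPate', 'derSpieler']
# Hypothetical ratings dictionary
example_ratings = {'derPate': 5, 'diePest': 2, 'derSpieler': 4}

# Titles that would end up without a rating after the mapping
missing = sorted(set(example_titles) - set(example_ratings))
print(missing)
```

With the real data, the same check would compare the values of the title column against the keys of rating_lasse.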
And that's it, let's save this dataframe and we're ready to move on to the model training!
df.to_csv(p/'datasets/text_df.csv', encoding='utf8', index=False)
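Note that to_csv does not create missing folders, so the datasets folder has to exist before saving. A minimal sketch of creating it first, using a temporary directory as a stand-in for the images folder (the names `base` and `out_path` are only for this example):

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for the images folder `p`
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    out_path = base / 'datasets' / 'text_df.csv'
    # create the target folder (and any missing parents) before writing
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text('text,title,rating\n', encoding='utf8')
    created = out_path.exists()

print(created)
```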
I hope you enjoyed this blog post. Stay tuned for the next one!
Lasse