How would Lasse rate this book?

This is the third part of my little project: a system that extracts text from images of book pages and predicts from that text how much I will like the book. In this notebook I want to show you how to use ipywidgets to turn a notebook into something we can use as a web application. I will also show you how to download the trained model from part 2 from my private Google Drive. So let's get started!

!pip install googledrivedownloader
!pip install transformers
Successfully installed googledrivedownloader-0.4
Successfully installed click-7.1.2 filelock-3.0.12 regex-2020.9.27 sacremoses-0.0.43 tokenizers-0.8.1rc2 transformers-3.3.0
from fastai.vision.all import *
from fastai.vision.widgets import *
from PIL import Image, ImageFilter 
import pytesseract
import re
from transformers import BertTokenizer, BertForSequenceClassification
from pathlib import Path
from torch.utils.data import TensorDataset, DataLoader

Deep learning models in particular can get quite large. Mine was too big to push to git, so I needed a workaround: I uploaded the trained model from part 2 to my Google Drive and use google_drive_downloader to pull it into the notebook.

from google_drive_downloader import GoogleDriveDownloader as gdd

gdd.download_file_from_google_drive(file_id='1kk_SvwpwZeuLnZirW5vbrd8FEnm7yJRt',
                                    dest_path='./export.pkl',
                                    unzip=True)
Downloading 1kk_SvwpwZeuLnZirW5vbrd8FEnm7yJRt into ./export.pkl... Done.
Unzipping...Done.
import warnings

warnings.filterwarnings("ignore")

Next, we reuse all the steps you already know from part 2: rotate and filter the image, extract the text with pytesseract, tokenize the text and put it in a dataloader, and load the pre-trained model from the awesome Hugging Face transformers library.

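def proc_img(input_img):
    # Rotate by 270 degrees; expand=True grows the canvas so nothing is cropped.
    # resample=0 selects nearest-neighbour resampling.
    img = input_img.rotate(angle=270, resample=0, expand=True)
    # Median filter (default 3x3 window) to suppress speckle noise before OCR.
    img = img.filter(ImageFilter.MedianFilter())
    return img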
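
To see what the rotation does, here is a small sketch on a synthetic image. The blank 400x200 canvas is an assumption for illustration, not one of my book pages. With expand enabled, rotating by 270 degrees swaps width and height instead of cropping the page:

```python
from PIL import Image, ImageFilter

# Hypothetical stand-in for a photographed book page (landscape, 400x200).
page = Image.new("RGB", (400, 200), color="white")

# Same rotation as in proc_img: 270 degrees, canvas expanded to fit.
rotated = page.rotate(angle=270, resample=0, expand=True)

# The median filter smooths noise but leaves the image size untouched.
filtered = rotated.filter(ImageFilter.MedianFilter(size=3))

print(page.size, rotated.size, filtered.size)
```
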
def get_text(img):
    return pytesseract.image_to_string(img, lang="deu")
def use_pattern(text):
    return pattern.sub(lambda m: rep[re.escape(m.group(0))], text)
rep = {"\n": "", "`": "", '%':"", '°': '', '&':'', '‘':'', '€':'e', '®':'', '\\': '', '5':'s', '1':'i', '_':'', '-':''} # define desired replacements here

# Escape the keys so they match literally, then compile them into one alternation pattern.
rep = {re.escape(k): v for k, v in rep.items()}
pattern = re.compile("|".join(rep.keys()))
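
Here is a minimal, self-contained sketch of how this replacement works. The reduced table and the sample string are made up for illustration; the real table is the one defined above.

```python
import re

# A reduced, hypothetical replacement table (a subset of the real one).
rep = {"\n": "", "5": "s", "1": "i", "€": "e", "_": ""}

# Escape the keys so they match literally, then join them into one pattern.
rep = {re.escape(k): v for k, v in rep.items()}
pattern = re.compile("|".join(rep.keys()))

def use_pattern(text):
    # Look up each match (escaped the same way as the keys) in the table.
    return pattern.sub(lambda m: rep[re.escape(m.group(0))], text)

# Made-up OCR output with typical mix-ups: 5 for s, 1 for i.
sample = "Da5 i5t e1n Te5t_\n"
cleaned = use_pattern(sample)
print(cleaned)
```

Keep in mind that the table replaces every 5 and every 1, so genuine numbers in the text get rewritten as well.
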
# Tokenize the text and map the tokens to their word IDs.
def tokenize_text(sent):
    
    input_ids = []
    attention_masks = []

    encoded_dict = tokenizer.encode_plus(
                        sent,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        truncation=True,           # Truncate to max_length.
                        max_length = 256,
                        padding = 'max_length',    # Pad to max_length (replaces the deprecated pad_to_max_length).
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )

    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])

    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

    # Convert the lists into tensors.
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    
    return input_ids, attention_masks
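
If padding and attention masks are new to you, the following toy sketch shows the idea in plain Python. The helper pad_and_mask and the token ids are made up for illustration; the real tokenizer additionally takes care of the special tokens when it truncates.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    # Truncate to max_length, then pad with pad_id up to max_length.
    ids = token_ids[:max_length]
    n_real = len(ids)
    ids = ids + [pad_id] * (max_length - n_real)
    # 1 marks a real token, 0 marks padding.
    mask = [1] * n_real + [0] * (max_length - n_real)
    return ids, mask

# Five made-up token ids, padded to a length of 8.
ids, mask = pad_and_mask([101, 7, 8, 9, 102], max_length=8)
print(ids)
print(mask)
```

The model uses the mask to ignore the padded positions, which is why both tensors are passed to it later on.
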
def create_dataloader(text):
    
    input_ids, attention_masks = tokenize_text(text)
    dataset = TensorDataset(input_ids, attention_masks)
    batch_size = 1
    app_dataloader = DataLoader(
                dataset, # The validation samples.
                batch_size = batch_size # Evaluate with this batch size.
            )
    return app_dataloader
def predict(dataloader):
    # Run inference on the uploaded page; everything stays on the CPU.
    device = torch.device('cpu')
    # Put model in evaluation mode (disables dropout).
    model.eval()

    # Collect the logits of every batch.
    predictions = []

    for batch in dataloader:
        # Move the batch tensors to the CPU.
        batch = tuple(t.to(device) for t in batch)

        # Unpack the inputs from our dataloader.
        b_input_ids, b_input_mask = batch

        # No gradients are needed for inference; this saves memory and
        # speeds up prediction.
        with torch.no_grad():
            # Forward pass, calculate logit predictions.
            outputs = model(b_input_ids, token_type_ids=None,
                            attention_mask=b_input_mask)

        logits = outputs[0]

        # Convert the logits to a NumPy array and store them.
        predictions.append(logits.detach().cpu().numpy())

    # The dataloader holds a single text, so the flat argmax is the
    # index of the class with the highest logit for that one sample.
    return np.argmax(predictions)
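
The step from logits to a star rating can be sketched without torch. The helper logits_to_stars and the logit values are made up for illustration, and the softmax confidence is an optional extra that the notebook itself does not compute.

```python
import math

def logits_to_stars(logits):
    # The index of the largest logit is the predicted class (0 to 4).
    best = max(range(len(logits)), key=lambda i: logits[i])
    # Softmax turns logits into probabilities (shifted by the max for stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Classes are zero-based, star ratings start at 1.
    return best + 1, probs[best]

# Made-up logits for the five classes.
stars, confidence = logits_to_stars([-1.2, 0.3, 2.5, 0.9, -0.4])
print(stars, round(confidence, 2))
```
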
PRE_TRAINED_MODEL_NAME = 'bert-base-german-cased'

# Load the BERT tokenizer
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', PRE_TRAINED_MODEL_NAME)    # Download vocabulary from S3 and cache.
n_classes=5

model = BertForSequenceClassification.from_pretrained(
    PRE_TRAINED_MODEL_NAME, # The 12-layer German BERT model with a cased vocabulary.
    num_labels = n_classes, # The number of output labels: one per star rating (1 to 5).
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)
Downloading: "https://github.com/huggingface/pytorch-transformers/archive/master.zip" to /root/.cache/torch/hub/master.zip


p = Path.cwd()

Even though we trained the model on a GPU, we don't want to depend on one in production. So I load the model weights onto the CPU.

device = torch.device('cpu')
model.load_state_dict(torch.load(p/'export.pkl', map_location=device))
btn_upload = widgets.FileUpload()
out_pl = widgets.Output()
rating_widget = widgets.Label()
btn_run = widgets.Button(description='Lasses Empfehlung:')
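def on_click_text(change):
    # Use the most recently uploaded image and process it once.
    img = PILImage.create(btn_upload.data[-1])
    processed = proc_img(img)
    out_pl.clear_output()
    with out_pl: display(processed.to_thumb(256,256))
    text = use_pattern(get_text(processed))
    star_rating = predict(create_dataloader(text))
    # Class indices run from 0 to 4, so add 1 to get the star rating.
    # The German message reads: "Lasse would give this book N star(s) out of 5 stars!"
    rating_widget.value = f'Lasse würde diesem Buch {star_rating+1} Stern(e) von 5 Sternen geben!'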
btn_run.on_click(on_click_text)
VBox([widgets.Label('Upload Bild von Buchseite'),
     btn_upload, btn_run, out_pl, rating_widget])

Perfect, that worked like a charm! Coming up, I will show you how to take this notebook and turn it into a little web app. So stay tuned!

Lasse