
Sentiment Prediction of Drug Reviews

In this section, we predict the sentiment of customers’ drug reviews using machine learning, deep learning, and transformer models.

import numpy as np
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
file_path = '/content/drive/MyDrive/DrugReviews/DrugReviews_cleaned.csv'
df = pd.read_csv(file_path)
df
MedicineUsedFor MedicineBrandName MedicineGenericName ReviewDate UserName IntakeTime Reviews ReviewLength Rating NumberOfLikes
0 Cough Acetaminophen / Codeine Not Mentioned 1-Apr-08 smoore Not Specified Works good as a cough suppressant. 34 9 24
1 Cough Benzonatate Not Mentioned 1-Apr-08 Anonymous Not Specified Pneumonia cough was non-stop - gave almost ins... 210 9 39
2 Dermatologic Lesion Methylprednisolone Dose Pack Methylprednisolone 1-Apr-08 Anonymous Not Specified This steriod helped kill the pain of my condit... 162 8 24
3 Hypogonadism, Male Androgel Not Mentioned 1-Apr-08 MikeC... Not Specified I'm a 35 year old male and I had no idea that ... 105 9 380
4 Depression Celexa Not Mentioned 1-Apr-08 Cherpie Not Specified It is so nice to have my life back!!! 37 10 206
... ... ... ... ... ... ... ... ... ... ...
255945 Birth Control Isibloom Desogestrel / Ethinyl Estradiol 9-Sep-22 Skylar Not Specified This birth control is awful severe nausea and ... 108 1 0
255946 Underactive Thyroid Unithroid Levothyroxine 9-Sep-22 Syd Taken for less than 1 month Post partial thyroidectomy due to a large beni... 224 2 7
255947 Bacterial Infection Amoxicillin / Clavulanate Not Mentioned 9-Sep-22 FLgirl Taken for less than 1 month I was given this for a tooth abscess. I was sc... 957 9 1
255948 Strep Throat Augmentin Amoxicillin / Clavulanate 9-Sep-22 peein... Taken for less than 1 month This stuff is great if you wanna pee out of yo... 263 1 0
255949 Erectile Dysfunction Sildenafil Not Mentioned 9-Sep-22 rando... Taken for less than 1 month I'm 53 now and can't recall when my penis and ... 489 10 2

255950 rows × 10 columns

import re
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

stop_words = set(stopwords.words('english'))
stop_words.remove('not')

lemmatizer = WordNetLemmatizer()

# Lowercase text (done outside the function for faster execution)
df['Reviews'] = df['Reviews'].str.lower()

def clean_text(text):
    # Remove punctuation
    text = re.sub(r'[^\w\s]', ' ', text)
    # Tokenize text
    text = word_tokenize(text)
    # Remove stopwords
    text = [word for word in text if word not in stop_words]
    # Lemmatization
    text = [lemmatizer.lemmatize(word=word, pos='v') for word in text]
    # Join all text
    text = ' '.join(text)

    return text
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
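As a sanity check on the cleaning steps, here is a dependency-free sketch of the same pipeline (lowercasing, punctuation removal, stopword filtering with 'not' retained). Lemmatization is skipped since it needs NLTK, and the stopword set is a tiny illustrative subset, not NLTK's full list:

```python
import re

# Tiny illustrative stopword subset (the notebook uses NLTK's full
# English list with 'not' removed)
STOP_WORDS = {'a', 'an', 'the', 'is', 'it', 'i', 'me', 'my'}

def clean_text_sketch(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', ' ', text)   # punctuation -> space
    tokens = text.split()                  # naive tokenization
    tokens = [w for w in tokens if w not in STOP_WORDS]
    return ' '.join(tokens)

print(clean_text_sketch("It is NOT working for me!"))  # -> not working for
```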
# Clean text data
df['ReviewsClean'] = df['Reviews'].apply(lambda x: clean_text(str(x)))
df['Rating'].value_counts().sort_index()
Rating
1     53670
2     13442
3     11466
4      8119
5     12867
6      9543
7     13266
8     25803
9     35341
10    72433
Name: count, dtype: int64
df['is_positive'] = np.where(df['Rating'] > 5, 1, 0)
df.sample(5)
MedicineUsedFor MedicineBrandName MedicineGenericName ReviewDate UserName IntakeTime Reviews ReviewLength Rating NumberOfLikes ReviewsClean is_positive
55950 Psoriasis Raptiva Not Mentioned 15-May-09 Anonymous Not Specified it worked great and should not have been pulle... 64 10 1 work great not pull market 1
146999 Insomnia Sonata Zaleplon 25-Jun-08 duwa Not Specified amazing. helps me fall asleep. i don't have a ... 207 10 66 amaze help fall asleep problem stay asleep use... 1
121363 Adhd Amphetamine / Dextroamphetamine Not Mentioned 22-Jan-22 eric Taken for 1 to 6 months i just started taking adderall again after ten... 124 8 0 start take adderall ten years without realize ... 1
123311 Depression Mirtazapine Not Mentioned 22-Mar-18 Anonymous Taken for 1 to 6 months deeper sleep but not sleeping any more than be... 74 5 3 deeper sleep not sleep start take 15mg 0
168587 Hepatitis C Harvoni Not Mentioned 28-Aug-16 ShazDog Not Specified i contracted hep c 1a thirty years ago. 39 10 36 contract hep c 1a thirty years ago 1
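The binarization above maps ratings 6-10 to positive (1) and 1-5 to negative (0); the same thresholding rule on a toy array:

```python
import numpy as np

# Same rule as df['is_positive'] = np.where(df['Rating'] > 5, 1, 0)
ratings = np.array([1, 5, 6, 10])
is_positive = np.where(ratings > 5, 1, 0)
print(is_positive.tolist())  # [0, 0, 1, 1]
```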
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
sentiments = df['Reviews'].apply(lambda x: sid.polarity_scores(x))
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
sentiments = pd.DataFrame(sentiments.tolist())
sentiments
neg neu pos compound
0 0.000 0.580 0.420 0.4404
1 0.000 0.947 0.053 0.2289
2 0.390 0.610 0.000 -0.9402
3 0.191 0.809 0.000 -0.5423
4 0.000 0.640 0.360 0.6697
... ... ... ... ...
255945 0.248 0.752 0.000 -0.6808
255946 0.213 0.686 0.100 -0.5574
255947 0.063 0.863 0.074 0.2731
255948 0.034 0.816 0.151 0.7650
255949 0.056 0.908 0.036 -0.2206

255950 rows × 4 columns
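VADER's compound score is conventionally read with ±0.05 thresholds; a small helper (hypothetical, not part of the notebook's pipeline) shows how the rows above would be labeled:

```python
# Conventional VADER reading: compound >= 0.05 is positive,
# <= -0.05 is negative, anything in between is neutral
def label_from_compound(compound):
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'

print(label_from_compound(0.4404))   # positive  (row 0 above)
print(label_from_compound(-0.9402))  # negative  (row 2 above)
```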

df = pd.concat([df, sentiments], axis=1)
df.sample(5)
MedicineUsedFor MedicineBrandName MedicineGenericName ReviewDate UserName IntakeTime Reviews ReviewLength Rating NumberOfLikes ReviewsClean is_positive neg neu pos compound
234245 Birth Control Loryna Drospirenone / Ethinyl Estradiol 7-Jul-18 Boano Taken for 1 to 6 months i have been on a generic form of yaz for about... 245 1 2 generic form yaz 3 years switch vestura nikki ... 0 0.107 0.893 0.000 -0.5873
84098 Birth Control Depo-Provera Not Mentioned 18-Sep-19 MiMi Not Specified i'm sharing this to everyone who is considerin... 964 2 9 share everyone consider depo provera shoot ble... 0 0.116 0.781 0.103 -0.5417
87300 Bladder Infection Macrobid Not Mentioned 19-Jan-14 mommy... Taken for less than 1 month had a severe bladder infection..passing blood ... 672 8 77 severe bladder infection pass blood clot urine... 1 0.099 0.831 0.070 -0.3025
35162 Acne Yaz Drospirenone / Ethinyl Estradiol 13-Dec-16 Anonymous Not Specified i've had mild to moderate acne since i was 14 ... 721 1 16 mild moderate acne since 14 24 10 yrs try ever... 0 0.198 0.732 0.069 -0.9705
15722 Abnormal Uterine Bleeding Depo-Provera Not Mentioned 10-Oct-19 End... Taken for 1 to 6 months i'm so over this shot! i got this in mid-septe... 416 1 9 shoot get mid september wait exit body many di... 0 0.229 0.737 0.034 -0.9539

Modelling Cleaned Reviews with Deep Learning

train_df = df[['ReviewsClean', 'is_positive']]
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_df['ReviewsClean'], train_df['is_positive'],
    test_size = 0.2, random_state = 100)
y_train.value_counts()
is_positive
1    125041
0     79719
Name: count, dtype: int64
y_test.value_counts()
is_positive
1    31345
0    19845
Name: count, dtype: int64
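The counts above imply a roughly 61% positive share in both splits. This majority-class rate is worth noting as a baseline, since the shallow models below plateau at almost exactly this accuracy:

```python
# Positive share computed from the value counts above
train_share = 125041 / (125041 + 79719)
test_share  = 31345 / (31345 + 19845)
print(round(train_share, 4), round(test_share, 4))  # 0.6107 0.6123
```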
import tensorflow as tf

from keras import Input
from keras.models import Sequential
from keras.layers import (
    TextVectorization, Embedding, LSTM, Dense, Bidirectional, Dropout)

from keras.optimizers import Adam
from keras.regularizers import L1, L2, L1L2

from transformers import TFAutoModelForSequenceClassification

Shallow Neural Network

max_tokens   = 7500
input_length = 128
output_dim   = 128

vectorizer_layer = TextVectorization(
    max_tokens  = max_tokens,
    output_mode = 'int',
    standardize = 'lower_and_strip_punctuation',
    output_sequence_length = input_length
)

vectorizer_layer.adapt(X_train)

embedding_layer = Embedding(
    input_dim    = max_tokens,
    output_dim   = output_dim,
    input_length = input_length
)
# Define model
model = Sequential([
    Input(shape=(1,), dtype=tf.string),
    vectorizer_layer,
    embedding_layer,
    Dense(1, activation='sigmoid')
])

# Compile model
model.compile(optimizer=Adam(learning_rate=0.001),
    loss='binary_crossentropy', metrics=['accuracy'])

model.summary()

# Fit model
model.fit(X_train, y_train, epochs=5)

# Evaluate model accuracy on the test set
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f'Test set accuracy: {test_acc}')
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 text_vectorization (TextVe  (None, 128)               0         
 ctorization)                                                    
                                                                 
 embedding (Embedding)       (None, 128, 128)          960000    
                                                                 
 dense (Dense)               (None, 128, 1)            129       
                                                                 
=================================================================
Total params: 960129 (3.66 MB)
Trainable params: 960129 (3.66 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/5
6399/6399 [==============================] - 57s 8ms/step - loss: 0.6652 - accuracy: 0.6129
Epoch 2/5
6399/6399 [==============================] - 25s 4ms/step - loss: 0.6645 - accuracy: 0.6135
Epoch 3/5
6399/6399 [==============================] - 24s 4ms/step - loss: 0.6644 - accuracy: 0.6135
Epoch 4/5
6399/6399 [==============================] - 23s 4ms/step - loss: 0.6643 - accuracy: 0.6136
Epoch 5/5
6399/6399 [==============================] - 22s 3ms/step - loss: 0.6643 - accuracy: 0.6136
1600/1600 [==============================] - 4s 2ms/step - loss: 0.6639 - accuracy: 0.6150
Test set accuracy: 0.6149800419807434
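Two details of the summary above are worth verifying. The embedding table holds nearly all of the parameters (max_tokens × output_dim), and the Dense layer is applied per time step (output shape (None, 128, 1)) because the embedding output is never pooled; a numpy sketch shows what mean pooling over the sequence axis (e.g. a GlobalAveragePooling1D layer) would do:

```python
import numpy as np

# Embedding parameter count: max_tokens * output_dim
print(7500 * 128)  # 960000, matching the summary above

# Mean pooling collapses the sequence axis so a Dense layer can
# produce one prediction per review instead of one per time step
batch = np.random.rand(2, 128, 128)   # (batch, seq_len, embed_dim)
pooled = batch.mean(axis=1)
print(pooled.shape)  # (2, 128)
```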

Multi-Layer Deep Text Classification Model

# Define model
model_reg = Sequential([
    Input(shape=(1,), dtype=tf.string),
    vectorizer_layer,
    embedding_layer,
    Dense(128, activation='relu',
         kernel_regularizer=L1(l1=0.0005)),
    Dropout(rate=0.6),
    Dense( 64, activation='relu',
         kernel_regularizer=L1L2(l1=0.0005, l2=0.0005)),
    Dense( 32, activation='relu',
         kernel_regularizer=L2(l2=0.0005)),
    Dense( 16, activation='relu',
         kernel_regularizer=L2(l2=0.0005)),
    Dense(  8, activation='relu',
         kernel_regularizer=L2(l2=0.0005)),
    Dense(  1, activation='sigmoid')
])

# Compile model
model_reg.compile(
    optimizer=Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy'])

model_reg.summary()

# Fit model
model_reg.fit(X_train, y_train, epochs=5)

# Evaluate model accuracy on the test set
reg_test_loss, reg_test_acc = model_reg.evaluate(X_test, y_test)
print(f'Test set accuracy: {reg_test_acc}')
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 text_vectorization (TextVe  (None, 128)               0         
 ctorization)                                                    
                                                                 
 embedding (Embedding)       (None, 128, 128)          960000    
                                                                 
 dense_1 (Dense)             (None, 128, 128)          16512     
                                                                 
 dropout (Dropout)           (None, 128, 128)          0         
                                                                 
 dense_2 (Dense)             (None, 128, 64)           8256      
                                                                 
 dense_3 (Dense)             (None, 128, 32)           2080      
                                                                 
 dense_4 (Dense)             (None, 128, 16)           528       
                                                                 
 dense_5 (Dense)             (None, 128, 8)            136       
                                                                 
 dense_6 (Dense)             (None, 128, 1)            9         
                                                                 
=================================================================
Total params: 987521 (3.77 MB)
Trainable params: 987521 (3.77 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/5
6399/6399 [==============================] - 73s 11ms/step - loss: 0.6809 - accuracy: 0.6107
Epoch 2/5
6399/6399 [==============================] - 42s 7ms/step - loss: 0.6701 - accuracy: 0.6107
Epoch 3/5
6399/6399 [==============================] - 41s 6ms/step - loss: 0.6699 - accuracy: 0.6107
Epoch 4/5
6399/6399 [==============================] - 41s 6ms/step - loss: 0.6699 - accuracy: 0.6107
Epoch 5/5
6399/6399 [==============================] - 41s 6ms/step - loss: 0.6699 - accuracy: 0.6107
1600/1600 [==============================] - 6s 4ms/step - loss: 0.6691 - accuracy: 0.6123
Test set accuracy: 0.6123266220092773
# Scratches for dashboard: Rating factor for Ranks

# rating_factor = df.groupby(['MedicineBrandName', 'MedicineUsedFor']).agg(
#     avg_rating = ('Rating', lambda x: np.mean(x)),
#     std_rating = ('Rating', lambda x: np.std(x)),
#     count_reviews = ('Reviews', lambda x: x.count())
# ).reset_index()

# rating_factor.sort_values(
#     ['count_reviews', 'avg_rating', 'std_rating'],
#     ascending=[False, False, True])[:50]

# x = (rating_factor['avg_rating'] / rating_factor['std_rating'])

# rating_factor['impact_factor'] = (rating_factor['count_reviews'] *
#                                   (1 - np.exp(-x)) / (1 + np.exp(-x)))

# rating_factor.sort_values(
#     ['count_reviews', 'avg_rating', 'impact_factor'],
#     ascending=False).loc[rating_factor['MedicineUsedFor'] ==
#         'Weight Loss (Obesity/Overweight)'][:25]

# rating_factor.sort_values(
#     ['count_reviews', 'avg_rating', 'impact_factor'],
#     ascending=False)[:25]

# rating_factor.sort_values(
#     ['count_reviews', 'avg_rating', 'impact_factor'],
#     ascending=False).loc[rating_factor['MedicineUsedFor'] == 'Birth Control'][:25]

Multi-Layer Bidirectional LSTM Model

# Define model
# Note on activations:
## The architecture was originally designed with activation='elu',
## but that configuration could not run within Colab's compute
## limits (Keras' cuDNN-accelerated LSTM requires activation='tanh').

## Therefore, all 'elu' activations were changed to 'tanh'.

model_ml_bi_lstm = Sequential([

    Input(shape=(1,), dtype=tf.string),

    vectorizer_layer,
    embedding_layer,

    Bidirectional(LSTM(128,
        activation='tanh',
        return_sequences=True)),
    Bidirectional(LSTM(128,
        activation='tanh',
        return_sequences=True)),
    Bidirectional(LSTM(64,
        activation='tanh')),

    Dense( 64, activation='tanh',
         kernel_regularizer=L1L2(
             l1=0.0001, l2=0.0001)),
    Dense( 32, activation='tanh',
         kernel_regularizer=L2(l2=0.0001)),
    Dense(  8, activation='tanh',
         kernel_regularizer=L2(l2=0.0005)),
    Dense(  8, activation='tanh',
         kernel_regularizer=L2(l2=0.0005)),
    Dense(  8, activation='tanh'),
    Dense(  4, activation='tanh'),

    Dense(  1, activation='sigmoid')

])

# Compile model
model_ml_bi_lstm.compile(optimizer=Adam(learning_rate=0.0001),
              loss='binary_crossentropy', metrics=['accuracy'])

model_ml_bi_lstm.summary()

# Fit
model_ml_bi_lstm.fit(X_train, y_train, epochs=5)

# Evaluate
bi_lstm_test_loss, bi_lstm_test_acc = model_ml_bi_lstm.evaluate(X_test, y_test)
print(f'Test set accuracy: {bi_lstm_test_acc}')
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 text_vectorization (TextVe  (None, 128)               0         
 ctorization)                                                    
                                                                 
 embedding (Embedding)       (None, 128, 128)          960000    
                                                                 
 bidirectional (Bidirection  (None, 128, 256)          263168    
 al)                                                             
                                                                 
 bidirectional_1 (Bidirecti  (None, 128, 256)          394240    
 onal)                                                           
                                                                 
 bidirectional_2 (Bidirecti  (None, 128)               164352    
 onal)                                                           
                                                                 
 dense_7 (Dense)             (None, 64)                8256      
                                                                 
 dense_8 (Dense)             (None, 32)                2080      
                                                                 
 dense_9 (Dense)             (None, 8)                 264       
                                                                 
 dense_10 (Dense)            (None, 8)                 72        
                                                                 
 dense_11 (Dense)            (None, 8)                 72        
                                                                 
 dense_12 (Dense)            (None, 4)                 36        
                                                                 
 dense_13 (Dense)            (None, 1)                 5         
                                                                 
=================================================================
Total params: 1792545 (6.84 MB)
Trainable params: 1792545 (6.84 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/5
6399/6399 [==============================] - 256s 38ms/step - loss: 0.5273 - accuracy: 0.7767
Epoch 2/5
6399/6399 [==============================] - 216s 34ms/step - loss: 0.4614 - accuracy: 0.7966
Epoch 3/5
6399/6399 [==============================] - 213s 33ms/step - loss: 0.4362 - accuracy: 0.8040
Epoch 4/5
6399/6399 [==============================] - 213s 33ms/step - loss: 0.4182 - accuracy: 0.8111
Epoch 5/5
6399/6399 [==============================] - 215s 34ms/step - loss: 0.4032 - accuracy: 0.8186
1600/1600 [==============================] - 24s 14ms/step - loss: 0.4515 - accuracy: 0.7910
Test set accuracy: 0.7910333871841431
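The Bidirectional layer parameter counts in the summary can be checked by hand: an LSTM has four gates, each with (input_dim + units + 1) × units weights, and the Bidirectional wrapper doubles that for the forward and backward directions:

```python
# Parameter count of a Bidirectional(LSTM(units)) layer
def bi_lstm_params(input_dim, units):
    # 4 gates, each with input, recurrent, and bias weights;
    # doubled for the forward and backward passes
    return 2 * 4 * (input_dim + units + 1) * units

print(bi_lstm_params(128, 128))  # 263168, first Bidirectional layer
print(bi_lstm_params(256, 128))  # 394240, second layer (input = 2 x 128)
```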

Building a Transformer Model:

distilbert-base-uncased

import os

# !pip install --upgrade transformers
# !pip install tf-keras
# os.environ['TF_USE_LEGACY_KERAS'] = '1'
import transformers
from transformers import DistilBertTokenizer

print(transformers.__version__)
4.42.4
# Findings: Characteristics of 'distilbert-base-uncased':
## model_max_length = 512       ## vocab_size = 30522
## Note: DistilBertTokenizer is the Python ("slow") implementation
## (is_fast=False below); DistilBertTokenizerFast is the Rust-based
## fast tokenizer.

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
tokenizer
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
DistilBertTokenizer(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
    0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
# Max length: 128 tokens per sequence.
# Truncation: text is pruned to the maximum length when it
#             exceeds the model's max input length.
# Padding:    when a sequence is shorter than 128, the remainder
#             is filled with padding tokens so all lengths match.

# Tokenize both training and test data
train_encodings = tokenizer(list(X_train),
  max_length=128, truncation=True, padding=True)
test_encodings = tokenizer(list(X_test),
  max_length=128, truncation=True, padding=True)
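The truncation and padding behavior can be sketched in plain Python (hypothetical helper with a toy max_length of 8; real token ids come from the tokenizer):

```python
# Mirrors truncation=True, padding=True at a given max_length
def pad_or_truncate(ids, max_length=8, pad_id=0):
    ids = ids[:max_length]                           # truncate long inputs
    return ids + [pad_id] * (max_length - len(ids))  # pad short inputs

print(pad_or_truncate([101, 7592, 102]))   # [101, 7592, 102, 0, 0, 0, 0, 0]
print(pad_or_truncate(list(range(12))))    # [0, 1, 2, 3, 4, 5, 6, 7]
```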
# Convert the data into TensorFlow datasets for
# more efficient computation

train_dataset = tf.data.Dataset.from_tensor_slices(
    ( dict(train_encodings), tf.constant(y_train.values, dtype=tf.int32) )
)

test_dataset  = tf.data.Dataset.from_tensor_slices(
    ( dict( test_encodings), tf.constant(y_test.values, dtype=tf.int32) )
)

# Shuffle the train set (but not the test set, which should mimic
# real-world prediction order) and batch both for training efficiency.
train_dataset = train_dataset.shuffle(len(X_train)).batch(16)
test_dataset  = test_dataset.batch(16)
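Batch size determines the steps per epoch seen in the training logs: 204,760 training rows at batch size 16 give 12,798 steps, while Keras' default batch size of 32 explains the 6,399 steps in the earlier models:

```python
import math

train_rows = 125041 + 79719            # 204760, from the value counts above
print(math.ceil(train_rows / 16))      # 12798 steps per epoch (DistilBERT)
print(math.ceil(train_rows / 32))      # 6399 steps (Keras default batch)
```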
model_distilbert = (
    TFAutoModelForSequenceClassification
        .from_pretrained('distilbert-base-uncased', num_labels=2))
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
# model_distilbert = (
#     TFAutoModelForSequenceClassification
#         .from_pretrained('distilbert-base-uncased', num_labels=2))

# Compile the model with a custom Adam optimizer (learning rate 3e-5)
# and a from-logits sparse categorical cross-entropy loss.

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss      = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics   = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]

model_distilbert.compile(
    optimizer = optimizer,
    loss      = loss,
    metrics   = metrics
)

model_distilbert.summary()

# Note: validation_data is the training set here, so the val_*
# metrics below track training performance; the held-out test set
# is evaluated separately afterwards.
model_distilbert.fit(train_dataset,
    epochs=5, validation_data=train_dataset)
Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
=================================================================
Total params: 66955010 (255.41 MB)
Trainable params: 66955010 (255.41 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/5
12798/12798 [==============================] - 1733s 133ms/step - loss: 0.4433 - accuracy: 0.7871 - val_loss: 0.3563 - val_accuracy: 0.8420
Epoch 2/5
12798/12798 [==============================] - 1684s 132ms/step - loss: 0.3637 - accuracy: 0.8348 - val_loss: 0.2859 - val_accuracy: 0.8796
Epoch 3/5
12798/12798 [==============================] - 1685s 132ms/step - loss: 0.2939 - accuracy: 0.8711 - val_loss: 0.1917 - val_accuracy: 0.9244
Epoch 4/5
12798/12798 [==============================] - 1684s 132ms/step - loss: 0.2246 - accuracy: 0.9045 - val_loss: 0.1219 - val_accuracy: 0.9555
Epoch 5/5
12798/12798 [==============================] - 1684s 132ms/step - loss: 0.1706 - accuracy: 0.9296 - val_loss: 0.0794 - val_accuracy: 0.9735
<tf_keras.src.callbacks.History at 0x7fb31c2e8970>
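The classification head's parameter counts follow from DistilBERT's hidden size of 768: pre_classifier is a 768→768 Dense layer, and classifier maps those 768 features to the 2 labels:

```python
hidden = 768
print(hidden * hidden + hidden)  # 590592 pre_classifier params (W + b)
print(hidden * 2 + 2)            # 1538 classifier params (W + b)
```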
distilbert_test_loss, distilbert_test_acc = (
  model_distilbert.evaluate(test_dataset)
)
print(f'Test set accuracy: {distilbert_test_acc}')
3200/3200 [==============================] - 102s 32ms/step - loss: 0.5264 - accuracy: 0.8098
Test set accuracy: 0.8097870945930481

Conclusion

We’ve built several text classification models; their accuracies compare as follows:

Model              Training Acc   Test Acc
model              61.36%         61.50%
model_reg          61.07%         61.23%
model_ml_bi_lstm   81.86%         79.10%
model_distilbert   92.96%         80.98%

Using a pre-trained transformer improved accuracy over the other models in distinguishing positive reviews from negative ones. We can use this model to predict the sentiment of upcoming drug reviews.

Suggestion

With the project ending here, the author notes there is still room for further development. Some suggestions for higher accuracy:

* Increase the number of epochs, especially with larger inputs
* Tune hyperparameters on the best-performing models, from the number of layers to the optimizer’s learning rate
* Several models were modified so they could run on Google Colab, such as the Bidirectional LSTM with its changed activation function; fitting the dataset to the original architectures could recover their intended accuracy
* Experiment with other model types (more LSTM variants or transformers)