
Sentiment Prediction of Drug Reviews

In this section, we predict the sentiment of customers’ drug reviews using machine learning, deep learning, and transformer models.

import numpy as np
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
file_path = '/content/drive/MyDrive/DrugReviews/DrugReviews_cleaned.csv'
df = pd.read_csv(file_path)
df
MedicineUsedFor MedicineBrandName MedicineGenericName ReviewDate UserName IntakeTime Reviews ReviewLength Rating NumberOfLikes
0 Cough Acetaminophen / Codeine Not Mentioned 1-Apr-08 smoore Not Specified Works good as a cough suppressant. 34 9 24
1 Cough Benzonatate Not Mentioned 1-Apr-08 Anonymous Not Specified Pneumonia cough was non-stop - gave almost ins... 210 9 39
2 Dermatologic Lesion Methylprednisolone Dose Pack Methylprednisolone 1-Apr-08 Anonymous Not Specified This steriod helped kill the pain of my condit... 162 8 24
3 Hypogonadism, Male Androgel Not Mentioned 1-Apr-08 MikeC... Not Specified I'm a 35 year old male and I had no idea that ... 105 9 380
4 Depression Celexa Not Mentioned 1-Apr-08 Cherpie Not Specified It is so nice to have my life back!!! 37 10 206
... ... ... ... ... ... ... ... ... ... ...
255945 Birth Control Isibloom Desogestrel / Ethinyl Estradiol 9-Sep-22 Skylar Not Specified This birth control is awful severe nausea and ... 108 1 0
255946 Underactive Thyroid Unithroid Levothyroxine 9-Sep-22 Syd Taken for less than 1 month Post partial thyroidectomy due to a large beni... 224 2 7
255947 Bacterial Infection Amoxicillin / Clavulanate Not Mentioned 9-Sep-22 FLgirl Taken for less than 1 month I was given this for a tooth abscess. I was sc... 957 9 1
255948 Strep Throat Augmentin Amoxicillin / Clavulanate 9-Sep-22 peein... Taken for less than 1 month This stuff is great if you wanna pee out of yo... 263 1 0
255949 Erectile Dysfunction Sildenafil Not Mentioned 9-Sep-22 rando... Taken for less than 1 month I'm 53 now and can't recall when my penis and ... 489 10 2

255950 rows × 10 columns

import re
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

stop_words = set(stopwords.words('english'))
stop_words.remove('not')

lemmatizer = WordNetLemmatizer()

# Lowercase text (done outside the function for faster execution)
df['Reviews'] = df['Reviews'].str.lower()

def clean_text(text):
    # Remove punctuation
    text = re.sub(r'[^\w\s]', ' ', text)
    # Tokenize text
    text = word_tokenize(text)
    # Remove stopwords
    text = [word for word in text if word not in stop_words]
    # Lemmatization
    text = [lemmatizer.lemmatize(word=word, pos='v') for word in text]
    # Join all text
    text = ' '.join(text)

    return text
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
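As a sanity check on the cleaning steps, here is a dependency-free sketch of the same pipeline (lowercasing, punctuation removal, stopword filtering with 'not' retained). Lemmatization is skipped since it needs NLTK, and the stopword set is a tiny illustrative subset, not NLTK's full list:

```python
import re

# Tiny illustrative stopword subset (the notebook uses NLTK's full
# English list with 'not' removed)
STOP_WORDS = {'a', 'an', 'the', 'is', 'it', 'i', 'me', 'my'}

def clean_text_sketch(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', ' ', text)   # punctuation -> space
    tokens = text.split()                  # naive tokenization
    tokens = [w for w in tokens if w not in STOP_WORDS]
    return ' '.join(tokens)

print(clean_text_sketch("It is NOT working for me!"))  # -> not working for
```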
# Clean text data
df['ReviewsClean'] = df['Reviews'].apply(lambda x: clean_text(str(x)))
df['Rating'].value_counts().sort_index()
Rating
1     53670
2     13442
3     11466
4      8119
5     12867
6      9543
7     13266
8     25803
9     35341
10    72433
Name: count, dtype: int64
df['is_positive'] = np.where(df['Rating'] > 5, 1, 0)
df.sample(5)
MedicineUsedFor MedicineBrandName MedicineGenericName ReviewDate UserName IntakeTime Reviews ReviewLength Rating NumberOfLikes ReviewsClean is_positive
55950 Psoriasis Raptiva Not Mentioned 15-May-09 Anonymous Not Specified it worked great and should not have been pulle... 64 10 1 work great not pull market 1
146999 Insomnia Sonata Zaleplon 25-Jun-08 duwa Not Specified amazing. helps me fall asleep. i don't have a ... 207 10 66 amaze help fall asleep problem stay asleep use... 1
121363 Adhd Amphetamine / Dextroamphetamine Not Mentioned 22-Jan-22 eric Taken for 1 to 6 months i just started taking adderall again after ten... 124 8 0 start take adderall ten years without realize ... 1
123311 Depression Mirtazapine Not Mentioned 22-Mar-18 Anonymous Taken for 1 to 6 months deeper sleep but not sleeping any more than be... 74 5 3 deeper sleep not sleep start take 15mg 0
168587 Hepatitis C Harvoni Not Mentioned 28-Aug-16 ShazDog Not Specified i contracted hep c 1a thirty years ago. 39 10 36 contract hep c 1a thirty years ago 1
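The binarization above maps ratings 6-10 to positive (1) and 1-5 to negative (0); the same thresholding rule on a toy array:

```python
import numpy as np

# Same rule as df['is_positive'] = np.where(df['Rating'] > 5, 1, 0)
ratings = np.array([1, 5, 6, 10])
is_positive = np.where(ratings > 5, 1, 0)
print(is_positive.tolist())  # [0, 0, 1, 1]
```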
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
sentiments = df['Reviews'].apply(lambda x: sid.polarity_scores(x))
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
sentiments = pd.DataFrame(sentiments.tolist())
sentiments
neg neu pos compound
0 0.000 0.580 0.420 0.4404
1 0.000 0.947 0.053 0.2289
2 0.390 0.610 0.000 -0.9402
3 0.191 0.809 0.000 -0.5423
4 0.000 0.640 0.360 0.6697
... ... ... ... ...
255945 0.248 0.752 0.000 -0.6808
255946 0.213 0.686 0.100 -0.5574
255947 0.063 0.863 0.074 0.2731
255948 0.034 0.816 0.151 0.7650
255949 0.056 0.908 0.036 -0.2206

255950 rows × 4 columns
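VADER's compound score is conventionally read with ±0.05 thresholds; a small helper (hypothetical, not part of the notebook's pipeline) shows how the rows above would be labeled:

```python
# Conventional VADER reading: compound >= 0.05 is positive,
# <= -0.05 is negative, anything in between is neutral
def label_from_compound(compound):
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'

print(label_from_compound(0.4404))   # positive  (row 0 above)
print(label_from_compound(-0.9402))  # negative  (row 2 above)
```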

df = pd.concat([df, sentiments], axis=1)
df.sample(5)
MedicineUsedFor MedicineBrandName MedicineGenericName ReviewDate UserName IntakeTime Reviews ReviewLength Rating NumberOfLikes ReviewsClean is_positive neg neu pos compound
234245 Birth Control Loryna Drospirenone / Ethinyl Estradiol 7-Jul-18 Boano Taken for 1 to 6 months i have been on a generic form of yaz for about... 245 1 2 generic form yaz 3 years switch vestura nikki ... 0 0.107 0.893 0.000 -0.5873
84098 Birth Control Depo-Provera Not Mentioned 18-Sep-19 MiMi Not Specified i'm sharing this to everyone who is considerin... 964 2 9 share everyone consider depo provera shoot ble... 0 0.116 0.781 0.103 -0.5417
87300 Bladder Infection Macrobid Not Mentioned 19-Jan-14 mommy... Taken for less than 1 month had a severe bladder infection..passing blood ... 672 8 77 severe bladder infection pass blood clot urine... 1 0.099 0.831 0.070 -0.3025
35162 Acne Yaz Drospirenone / Ethinyl Estradiol 13-Dec-16 Anonymous Not Specified i've had mild to moderate acne since i was 14 ... 721 1 16 mild moderate acne since 14 24 10 yrs try ever... 0 0.198 0.732 0.069 -0.9705
15722 Abnormal Uterine Bleeding Depo-Provera Not Mentioned 10-Oct-19 End... Taken for 1 to 6 months i'm so over this shot! i got this in mid-septe... 416 1 9 shoot get mid september wait exit body many di... 0 0.229 0.737 0.034 -0.9539

Modelling Cleaned Reviews with Deep Learning

train_df = df[['ReviewsClean', 'is_positive']]
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    train_df['ReviewsClean'], train_df['is_positive'],
    test_size = 0.2, random_state = 100)
y_train.value_counts()
is_positive
1    125041
0     79719
Name: count, dtype: int64
y_test.value_counts()
is_positive
1    31345
0    19845
Name: count, dtype: int64
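The counts above imply a roughly 61% positive share in both splits. This majority-class rate is worth noting as a baseline, since the shallow models below plateau at almost exactly this accuracy:

```python
# Positive share computed from the value counts above
train_share = 125041 / (125041 + 79719)
test_share  = 31345 / (31345 + 19845)
print(round(train_share, 4), round(test_share, 4))  # 0.6107 0.6123
```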
import tensorflow as tf

from keras import Input
from keras.models import Sequential
from keras.layers import (
    TextVectorization, Embedding, LSTM, Dense, Bidirectional, Dropout)

from keras.optimizers import Adam
from keras.regularizers import L1, L2, L1L2

from transformers import TFAutoModelForSequenceClassification

Shallow Neural Network

max_tokens   = 7500
input_length = 128
output_dim   = 128

vectorizer_layer = TextVectorization(
    max_tokens  = max_tokens,
    output_mode = 'int',
    standardize = 'lower_and_strip_punctuation',
    output_sequence_length = input_length
)

vectorizer_layer.adapt(X_train)

embedding_layer = Embedding(
    input_dim    = max_tokens,
    output_dim   = output_dim,
    input_length = input_length
)
# Define model
model = Sequential([
    Input(shape=(1,), dtype=tf.string),
    vectorizer_layer,
    embedding_layer,
    Dense(1, activation='sigmoid')
])

# Compile model
model.compile(optimizer=Adam(learning_rate=0.001),
    loss='binary_crossentropy', metrics=['accuracy'])

model.summary()

# Fit model
model.fit(X_train, y_train, epochs=5)

# Evaluate model accuracy on the test set
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f'Test set accuracy: {test_acc}')
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 text_vectorization (TextVe  (None, 128)               0         
 ctorization)                                                    
                                                                 
 embedding (Embedding)       (None, 128, 128)          960000    
                                                                 
 dense (Dense)               (None, 128, 1)            129       
                                                                 
=================================================================
Total params: 960129 (3.66 MB)
Trainable params: 960129 (3.66 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/5
6399/6399 [==============================] - 57s 8ms/step - loss: 0.6652 - accuracy: 0.6129
Epoch 2/5
6399/6399 [==============================] - 25s 4ms/step - loss: 0.6645 - accuracy: 0.6135
Epoch 3/5
6399/6399 [==============================] - 24s 4ms/step - loss: 0.6644 - accuracy: 0.6135
Epoch 4/5
6399/6399 [==============================] - 23s 4ms/step - loss: 0.6643 - accuracy: 0.6136
Epoch 5/5
6399/6399 [==============================] - 22s 3ms/step - loss: 0.6643 - accuracy: 0.6136
1600/1600 [==============================] - 4s 2ms/step - loss: 0.6639 - accuracy: 0.6150
Test set accuracy: 0.6149800419807434
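Two details of the summary above are worth verifying. The embedding table holds nearly all of the parameters (max_tokens × output_dim), and the Dense layer is applied per time step (output shape (None, 128, 1)) because the embedding output is never pooled; a numpy sketch shows what mean pooling over the sequence axis (e.g. a GlobalAveragePooling1D layer) would do:

```python
import numpy as np

# Embedding parameter count: max_tokens * output_dim
print(7500 * 128)  # 960000, matching the summary above

# Mean pooling collapses the sequence axis so a Dense layer can
# produce one prediction per review instead of one per time step
batch = np.random.rand(2, 128, 128)   # (batch, seq_len, embed_dim)
pooled = batch.mean(axis=1)
print(pooled.shape)  # (2, 128)
```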

Multi-Layer Deep Text Classification Model

# Define model
model_reg = Sequential([
    Input(shape=(1,), dtype=tf.string),
    vectorizer_layer,
    embedding_layer,
    Dense(128, activation='relu',
         kernel_regularizer=L1(l1=0.0005)),
    Dropout(rate=0.6),
    Dense( 64, activation='relu',
         kernel_regularizer=L1L2(l1=0.0005, l2=0.0005)),
    Dense( 32, activation='relu',
         kernel_regularizer=L2(l2=0.0005)),
    Dense( 16, activation='relu',
         kernel_regularizer=L2(l2=0.0005)),
    Dense(  8, activation='relu',
         kernel_regularizer=L2(l2=0.0005)),
    Dense(  1, activation='sigmoid')
])

# Compile model
model_reg.compile(
    optimizer=Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy'])

model_reg.summary()

# Fit model
model_reg.fit(X_train, y_train, epochs=5)

# Evaluate model accuracy on the test set
reg_test_loss, reg_test_acc = model_reg.evaluate(X_test, y_test)
print(f'Test set accuracy: {reg_test_acc}')
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 text_vectorization (TextVe  (None, 128)               0         
 ctorization)                                                    
                                                                 
 embedding (Embedding)       (None, 128, 128)          960000    
                                                                 
 dense_1 (Dense)             (None, 128, 128)          16512     
                                                                 
 dropout (Dropout)           (None, 128, 128)          0         
                                                                 
 dense_2 (Dense)             (None, 128, 64)           8256      
                                                                 
 dense_3 (Dense)             (None, 128, 32)           2080      
                                                                 
 dense_4 (Dense)             (None, 128, 16)           528       
                                                                 
 dense_5 (Dense)             (None, 128, 8)            136       
                                                                 
 dense_6 (Dense)             (None, 128, 1)            9         
                                                                 
=================================================================
Total params: 987521 (3.77 MB)
Trainable params: 987521 (3.77 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/5
6399/6399 [==============================] - 73s 11ms/step - loss: 0.6809 - accuracy: 0.6107
Epoch 2/5
6399/6399 [==============================] - 42s 7ms/step - loss: 0.6701 - accuracy: 0.6107
Epoch 3/5
6399/6399 [==============================] - 41s 6ms/step - loss: 0.6699 - accuracy: 0.6107
Epoch 4/5
6399/6399 [==============================] - 41s 6ms/step - loss: 0.6699 - accuracy: 0.6107
Epoch 5/5
6399/6399 [==============================] - 41s 6ms/step - loss: 0.6699 - accuracy: 0.6107
1600/1600 [==============================] - 6s 4ms/step - loss: 0.6691 - accuracy: 0.6123
Test set accuracy: 0.6123266220092773
# Scratches for dashboard: Rating factor for Ranks

# rating_factor = df.groupby(['MedicineBrandName', 'MedicineUsedFor']).agg(
#     avg_rating = ('Rating', lambda x: np.mean(x)),
#     std_rating = ('Rating', lambda x: np.std(x)),
#     count_reviews = ('Reviews', lambda x: x.count())
# ).reset_index()

# rating_factor.sort_values(
#     ['count_reviews', 'avg_rating', 'std_rating'],
#     ascending=[False, False, True])[:50]

# x = (rating_factor['avg_rating'] / rating_factor['std_rating'])

# rating_factor['impact_factor'] = (rating_factor['count_reviews'] *
#                                   (1 - np.exp(-x)) / (1 + np.exp(-x)))

# rating_factor.sort_values(
#     ['count_reviews', 'avg_rating', 'impact_factor'],
#     ascending=False).loc[rating_factor['MedicineUsedFor'] ==
#         'Weight Loss (Obesity/Overweight)'][:25]

# rating_factor.sort_values(
#     ['count_reviews', 'avg_rating', 'impact_factor'],
#     ascending=False)[:25]

# rating_factor.sort_values(
#     ['count_reviews', 'avg_rating', 'impact_factor'],
#     ascending=False).loc[rating_factor['MedicineUsedFor'] == 'Birth Control'][:25]

Multi-Layer Bidirectional LSTM Model

# Define model
# Note on activations:
## The architecture was originally designed with activation='elu',
## but that configuration could not run within Colab's compute
## limits (Keras' cuDNN-accelerated LSTM requires activation='tanh').

## Therefore, all 'elu' activations were changed to 'tanh'.

model_ml_bi_lstm = Sequential([

    Input(shape=(1,), dtype=tf.string),

    vectorizer_layer,
    embedding_layer,

    Bidirectional(LSTM(128,
        activation='tanh',
        return_sequences=True)),
    Bidirectional(LSTM(128,
        activation='tanh',
        return_sequences=True)),
    Bidirectional(LSTM(64,
        activation='tanh')),

    Dense( 64, activation='tanh',
         kernel_regularizer=L1L2(
             l1=0.0001, l2=0.0001)),
    Dense( 32, activation='tanh',
         kernel_regularizer=L2(l2=0.0001)),
    Dense(  8, activation='tanh',
         kernel_regularizer=L2(l2=0.0005)),
    Dense(  8, activation='tanh',
         kernel_regularizer=L2(l2=0.0005)),
    Dense(  8, activation='tanh'),
    Dense(  4, activation='tanh'),

    Dense(  1, activation='sigmoid')

])

# Compile model
model_ml_bi_lstm.compile(optimizer=Adam(learning_rate=0.0001),
              loss='binary_crossentropy', metrics=['accuracy'])

model_ml_bi_lstm.summary()

# Fit
model_ml_bi_lstm.fit(X_train, y_train, epochs=5)

# Evaluate
bi_lstm_test_loss, bi_lstm_test_acc = model_ml_bi_lstm.evaluate(X_test, y_test)
print(f'Test set accuracy: {bi_lstm_test_acc}')
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 text_vectorization (TextVe  (None, 128)               0         
 ctorization)                                                    
                                                                 
 embedding (Embedding)       (None, 128, 128)          960000    
                                                                 
 bidirectional (Bidirection  (None, 128, 256)          263168    
 al)                                                             
                                                                 
 bidirectional_1 (Bidirecti  (None, 128, 256)          394240    
 onal)                                                           
                                                                 
 bidirectional_2 (Bidirecti  (None, 128)               164352    
 onal)                                                           
                                                                 
 dense_7 (Dense)             (None, 64)                8256      
                                                                 
 dense_8 (Dense)             (None, 32)                2080      
                                                                 
 dense_9 (Dense)             (None, 8)                 264       
                                                                 
 dense_10 (Dense)            (None, 8)                 72        
                                                                 
 dense_11 (Dense)            (None, 8)                 72        
                                                                 
 dense_12 (Dense)            (None, 4)                 36        
                                                                 
 dense_13 (Dense)            (None, 1)                 5         
                                                                 
=================================================================
Total params: 1792545 (6.84 MB)
Trainable params: 1792545 (6.84 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/5
6399/6399 [==============================] - 256s 38ms/step - loss: 0.5273 - accuracy: 0.7767
Epoch 2/5
6399/6399 [==============================] - 216s 34ms/step - loss: 0.4614 - accuracy: 0.7966
Epoch 3/5
6399/6399 [==============================] - 213s 33ms/step - loss: 0.4362 - accuracy: 0.8040
Epoch 4/5
6399/6399 [==============================] - 213s 33ms/step - loss: 0.4182 - accuracy: 0.8111
Epoch 5/5
6399/6399 [==============================] - 215s 34ms/step - loss: 0.4032 - accuracy: 0.8186
1600/1600 [==============================] - 24s 14ms/step - loss: 0.4515 - accuracy: 0.7910
Test set accuracy: 0.7910333871841431
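The Bidirectional layer parameter counts in the summary can be checked by hand: an LSTM has four gates, each with (input_dim + units + 1) × units weights, and the Bidirectional wrapper doubles that for the forward and backward directions:

```python
# Parameter count of a Bidirectional(LSTM(units)) layer
def bi_lstm_params(input_dim, units):
    # 4 gates, each with input, recurrent, and bias weights;
    # doubled for the forward and backward passes
    return 2 * 4 * (input_dim + units + 1) * units

print(bi_lstm_params(128, 128))  # 263168, first Bidirectional layer
print(bi_lstm_params(256, 128))  # 394240, second layer (input = 2 x 128)
```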

Building a Transformer Model:

distilbert-base-uncased

import os

# !pip install --upgrade transformers
# !pip install tf-keras
# os.environ['TF_USE_LEGACY_KERAS'] = '1'
import transformers
from transformers import DistilBertTokenizer

print(transformers.__version__)
4.42.4
# Findings: Characteristics of 'distilbert-base-uncased':
## model_max_length = 512       ## vocab_size = 30522
## Note: DistilBertTokenizer is the Python ("slow") implementation
## (is_fast=False below); DistilBertTokenizerFast is the Rust-based
## fast tokenizer.

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
tokenizer
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
DistilBertTokenizer(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
    0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
# Max length: 128 tokens per sequence.
# Truncation: text is pruned to the maximum length when it
#             exceeds the model's max input length.
# Padding:    when a sequence is shorter than 128, the remainder
#             is filled with padding tokens so all lengths match.

# Tokenize both training and test data
train_encodings = tokenizer(list(X_train),
  max_length=128, truncation=True, padding=True)
test_encodings = tokenizer(list(X_test),
  max_length=128, truncation=True, padding=True)
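The truncation and padding behavior can be sketched in plain Python (hypothetical helper with a toy max_length of 8; real token ids come from the tokenizer):

```python
# Mirrors truncation=True, padding=True at a given max_length
def pad_or_truncate(ids, max_length=8, pad_id=0):
    ids = ids[:max_length]                           # truncate long inputs
    return ids + [pad_id] * (max_length - len(ids))  # pad short inputs

print(pad_or_truncate([101, 7592, 102]))   # [101, 7592, 102, 0, 0, 0, 0, 0]
print(pad_or_truncate(list(range(12))))    # [0, 1, 2, 3, 4, 5, 6, 7]
```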
# Convert the data into TensorFlow datasets for
# more efficient computation

train_dataset = tf.data.Dataset.from_tensor_slices(
    ( dict(train_encodings), tf.constant(y_train.values, dtype=tf.int32) )
)

test_dataset  = tf.data.Dataset.from_tensor_slices(
    ( dict( test_encodings), tf.constant(y_test.values, dtype=tf.int32) )
)

# Shuffle the train set (but not the test set, which should mimic
# real-world prediction order) and batch both for training efficiency.
train_dataset = train_dataset.shuffle(len(X_train)).batch(16)
test_dataset  = test_dataset.batch(16)
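Batch size determines the steps per epoch seen in the training logs: 204,760 training rows at batch size 16 give 12,798 steps, while Keras' default batch size of 32 explains the 6,399 steps in the earlier models:

```python
import math

train_rows = 125041 + 79719            # 204760, from the value counts above
print(math.ceil(train_rows / 16))      # 12798 steps per epoch (DistilBERT)
print(math.ceil(train_rows / 32))      # 6399 steps (Keras default batch)
```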
model_distilbert = (
    TFAutoModelForSequenceClassification
        .from_pretrained('distilbert-base-uncased', num_labels=2))
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
# model_distilbert = (
#     TFAutoModelForSequenceClassification
#         .from_pretrained('distilbert-base-uncased', num_labels=2))

# Compile the model with a custom Adam optimizer (learning rate 3e-5)
# and a from-logits sparse categorical cross-entropy loss.

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss      = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics   = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]

model_distilbert.compile(
    optimizer = optimizer,
    loss      = loss,
    metrics   = metrics
)

model_distilbert.summary()

# Note: validation_data is the training set here, so the val_*
# metrics below track training performance; the held-out test set
# is evaluated separately afterwards.
model_distilbert.fit(train_dataset,
    epochs=5, validation_data=train_dataset)
Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
=================================================================
Total params: 66955010 (255.41 MB)
Trainable params: 66955010 (255.41 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/5
12798/12798 [==============================] - 1733s 133ms/step - loss: 0.4433 - accuracy: 0.7871 - val_loss: 0.3563 - val_accuracy: 0.8420
Epoch 2/5
12798/12798 [==============================] - 1684s 132ms/step - loss: 0.3637 - accuracy: 0.8348 - val_loss: 0.2859 - val_accuracy: 0.8796
Epoch 3/5
12798/12798 [==============================] - 1685s 132ms/step - loss: 0.2939 - accuracy: 0.8711 - val_loss: 0.1917 - val_accuracy: 0.9244
Epoch 4/5
12798/12798 [==============================] - 1684s 132ms/step - loss: 0.2246 - accuracy: 0.9045 - val_loss: 0.1219 - val_accuracy: 0.9555
Epoch 5/5
12798/12798 [==============================] - 1684s 132ms/step - loss: 0.1706 - accuracy: 0.9296 - val_loss: 0.0794 - val_accuracy: 0.9735
<tf_keras.src.callbacks.History at 0x7fb31c2e8970>
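The classification head's parameter counts follow from DistilBERT's hidden size of 768: pre_classifier is a 768→768 Dense layer, and classifier maps those 768 features to the 2 labels:

```python
hidden = 768
print(hidden * hidden + hidden)  # 590592 pre_classifier params (W + b)
print(hidden * 2 + 2)            # 1538 classifier params (W + b)
```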
distilbert_test_loss, distilbert_test_acc = (
  model_distilbert.evaluate(test_dataset)
)
print(f'Test set accuracy: {distilbert_test_acc}')
3200/3200 [==============================] - 102s 32ms/step - loss: 0.5264 - accuracy: 0.8098
Test set accuracy: 0.8097870945930481

Conclusion

We’ve built several text classification models; their accuracies compare as follows:

Model              Training Acc   Test Acc
model              61.36%         61.50%
model_reg          61.07%         61.23%
model_ml_bi_lstm   81.86%         79.10%
model_distilbert   92.96%         80.98%

Using a pre-trained transformer improved accuracy over the other models in distinguishing positive reviews from negative ones. We can use this model to predict the sentiment of upcoming drug reviews.

Suggestion

With the project ending here, the author notes there is still room for further development. Some suggestions for higher accuracy:

* Increase the number of epochs, especially with larger inputs
* Tune hyperparameters on the best-performing models, from the number of layers to the optimizer’s learning rate
* Several models were modified so they could run on Google Colab, such as the Bidirectional LSTM with its changed activation function; fitting the dataset to the original architectures could recover their intended accuracy
* Experiment with other model types (more LSTM variants or transformers)