import numpy as np
import pandas as pd

Sentiment Prediction of Drug Reviews

In this section, we will predict the sentiment of customers' drug reviews using machine learning, deep learning, and transformer models.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
file_path = '/content/drive/MyDrive/DrugReviews/DrugReviews_cleaned.csv'
df = pd.read_csv(file_path)
df

|  | MedicineUsedFor | MedicineBrandName | MedicineGenericName | ReviewDate | UserName | IntakeTime | Reviews | ReviewLength | Rating | NumberOfLikes |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cough | Acetaminophen / Codeine | Not Mentioned | 1-Apr-08 | smoore | Not Specified | Works good as a cough suppressant. | 34 | 9 | 24 |
| 1 | Cough | Benzonatate | Not Mentioned | 1-Apr-08 | Anonymous | Not Specified | Pneumonia cough was non-stop - gave almost ins... | 210 | 9 | 39 |
| 2 | Dermatologic Lesion | Methylprednisolone Dose Pack | Methylprednisolone | 1-Apr-08 | Anonymous | Not Specified | This steriod helped kill the pain of my condit... | 162 | 8 | 24 |
| 3 | Hypogonadism, Male | Androgel | Not Mentioned | 1-Apr-08 | MikeC... | Not Specified | I'm a 35 year old male and I had no idea that ... | 105 | 9 | 380 |
| 4 | Depression | Celexa | Not Mentioned | 1-Apr-08 | Cherpie | Not Specified | It is so nice to have my life back!!! | 37 | 10 | 206 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 255945 | Birth Control | Isibloom | Desogestrel / Ethinyl Estradiol | 9-Sep-22 | Skylar | Not Specified | This birth control is awful severe nausea and ... | 108 | 1 | 0 |
| 255946 | Underactive Thyroid | Unithroid | Levothyroxine | 9-Sep-22 | Syd | Taken for less than 1 month | Post partial thyroidectomy due to a large beni... | 224 | 2 | 7 |
| 255947 | Bacterial Infection | Amoxicillin / Clavulanate | Not Mentioned | 9-Sep-22 | FLgirl | Taken for less than 1 month | I was given this for a tooth abscess. I was sc... | 957 | 9 | 1 |
| 255948 | Strep Throat | Augmentin | Amoxicillin / Clavulanate | 9-Sep-22 | peein... | Taken for less than 1 month | This stuff is great if you wanna pee out of yo... | 263 | 1 | 0 |
| 255949 | Erectile Dysfunction | Sildenafil | Not Mentioned | 9-Sep-22 | rando... | Taken for less than 1 month | I'm 53 now and can't recall when my penis and ... | 489 | 10 | 2 |
255950 rows × 10 columns
import re
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
stop_words = set(stopwords.words('english'))
stop_words.remove('not')
lemmatizer = WordNetLemmatizer()
# Lower text (outside fx for faster execution time)
df['Reviews'] = df['Reviews'].str.lower()
def clean_text(text):
    # Remove punctuation
    text = re.sub(r'[^\w\s]', ' ', text)
    # Tokenize text
    text = word_tokenize(text)
    # Remove stopwords
    text = [word for word in text if word not in stop_words]
    # Lemmatization
    text = [lemmatizer.lemmatize(word=word, pos='v') for word in text]
    # Join all text
    text = ' '.join(text)
    return text

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Package omw-1.4 is already up-to-date!
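Before applying the pipeline to all 255,950 reviews, a quick sanity check on a made-up sentence (the input string is hypothetical; the expected output is shown as a comment):

# Hypothetical example: stopwords drop, 'not' survives, verbs are lemmatized
example = "it did not help my cough, and i was feeling dizzy!"
print(clean_text(example))
# -> not help cough feel dizzy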
# Clean text data
df['ReviewsClean'] = df['Reviews'].apply(lambda x: clean_text(str(x)))

df['Rating'].value_counts().sort_index()

Rating
1 53670
2 13442
3 11466
4 8119
5 12867
6 9543
7 13266
8 25803
9 35341
10 72433
Name: count, dtype: int64
df['is_positive'] = np.where(df['Rating'] > 5, 1, 0)
df.sample(5)

|  | MedicineUsedFor | MedicineBrandName | MedicineGenericName | ReviewDate | UserName | IntakeTime | Reviews | ReviewLength | Rating | NumberOfLikes | ReviewsClean | is_positive |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 55950 | Psoriasis | Raptiva | Not Mentioned | 15-May-09 | Anonymous | Not Specified | it worked great and should not have been pulle... | 64 | 10 | 1 | work great not pull market | 1 |
| 146999 | Insomnia | Sonata | Zaleplon | 25-Jun-08 | duwa | Not Specified | amazing. helps me fall asleep. i don't have a ... | 207 | 10 | 66 | amaze help fall asleep problem stay asleep use... | 1 |
| 121363 | Adhd | Amphetamine / Dextroamphetamine | Not Mentioned | 22-Jan-22 | eric | Taken for 1 to 6 months | i just started taking adderall again after ten... | 124 | 8 | 0 | start take adderall ten years without realize ... | 1 |
| 123311 | Depression | Mirtazapine | Not Mentioned | 22-Mar-18 | Anonymous | Taken for 1 to 6 months | deeper sleep but not sleeping any more than be... | 74 | 5 | 3 | deeper sleep not sleep start take 15mg | 0 |
| 168587 | Hepatitis C | Harvoni | Not Mentioned | 28-Aug-16 | ShazDog | Not Specified | i contracted hep c 1a thirty years ago. | 39 | 10 | 36 | contract hep c 1a thirty years ago | 1 |
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
sentiments = df['Reviews'].apply(lambda x: sid.polarity_scores(x))

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
sentiments = pd.DataFrame(sentiments.tolist())
sentiments

|  | neg | neu | pos | compound |
|---|---|---|---|---|
| 0 | 0.000 | 0.580 | 0.420 | 0.4404 |
| 1 | 0.000 | 0.947 | 0.053 | 0.2289 |
| 2 | 0.390 | 0.610 | 0.000 | -0.9402 |
| 3 | 0.191 | 0.809 | 0.000 | -0.5423 |
| 4 | 0.000 | 0.640 | 0.360 | 0.6697 |
| ... | ... | ... | ... | ... |
| 255945 | 0.248 | 0.752 | 0.000 | -0.6808 |
| 255946 | 0.213 | 0.686 | 0.100 | -0.5574 |
| 255947 | 0.063 | 0.863 | 0.074 | 0.2731 |
| 255948 | 0.034 | 0.816 | 0.151 | 0.7650 |
| 255949 | 0.056 | 0.908 | 0.036 | -0.2206 |
255950 rows × 4 columns
df = pd.concat([df, sentiments], axis=1)
df.sample(5)

|  | MedicineUsedFor | MedicineBrandName | MedicineGenericName | ReviewDate | UserName | IntakeTime | Reviews | ReviewLength | Rating | NumberOfLikes | ReviewsClean | is_positive | neg | neu | pos | compound |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 234245 | Birth Control | Loryna | Drospirenone / Ethinyl Estradiol | 7-Jul-18 | Boano | Taken for 1 to 6 months | i have been on a generic form of yaz for about... | 245 | 1 | 2 | generic form yaz 3 years switch vestura nikki ... | 0 | 0.107 | 0.893 | 0.000 | -0.5873 |
| 84098 | Birth Control | Depo-Provera | Not Mentioned | 18-Sep-19 | MiMi | Not Specified | i'm sharing this to everyone who is considerin... | 964 | 2 | 9 | share everyone consider depo provera shoot ble... | 0 | 0.116 | 0.781 | 0.103 | -0.5417 |
| 87300 | Bladder Infection | Macrobid | Not Mentioned | 19-Jan-14 | mommy... | Taken for less than 1 month | had a severe bladder infection..passing blood ... | 672 | 8 | 77 | severe bladder infection pass blood clot urine... | 1 | 0.099 | 0.831 | 0.070 | -0.3025 |
| 35162 | Acne | Yaz | Drospirenone / Ethinyl Estradiol | 13-Dec-16 | Anonymous | Not Specified | i've had mild to moderate acne since i was 14 ... | 721 | 1 | 16 | mild moderate acne since 14 24 10 yrs try ever... | 0 | 0.198 | 0.732 | 0.069 | -0.9705 |
| 15722 | Abnormal Uterine Bleeding | Depo-Provera | Not Mentioned | 10-Oct-19 | End... | Taken for 1 to 6 months | i'm so over this shot! i got this in mid-septe... | 416 | 1 | 9 | shoot get mid september wait exit body many di... | 0 | 0.229 | 0.737 | 0.034 | -0.9539 |
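Since VADER already yields a polarity score without any training, it is worth checking how far the compound score alone gets as a baseline; a minimal sketch, assuming a simple zero threshold (not part of the original analysis):

# Rule-based baseline: compound > 0 counts as a positive review
vader_pred = (df['compound'] > 0).astype(int)
print('VADER-only accuracy:', (vader_pred == df['is_positive']).mean())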
Modelling Cleaned Reviews with Deep Learning
train_df = df[['ReviewsClean', 'is_positive']]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
train_df['ReviewsClean'], train_df['is_positive'],
    test_size = 0.2, random_state = 100)

y_train.value_counts()

is_positive
1 125041
0 79719
Name: count, dtype: int64
y_test.value_counts()

is_positive
1 31345
0 19845
Name: count, dtype: int64
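The split is imbalanced (roughly 61% positive in both train and test). One hedged option, not used in the models below, is to pass class weights to model.fit so the minority class is not under-served:

from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weights are inversely proportional to class frequency
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_train)
class_weight = dict(enumerate(weights))
print(class_weight)  # roughly {0: 1.28, 1: 0.82}
# Usage: model.fit(X_train, y_train, epochs=5, class_weight=class_weight)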
import tensorflow as tf
from keras import Input
from keras.models import Sequential
from keras.layers import (
TextVectorization, Embedding, LSTM, Dense, Bidirectional, Dropout)
from keras.optimizers import Adam
from keras.regularizers import L1, L2, L1L2
from transformers import TFAutoModelForSequenceClassification

Shallow Neural Network
max_tokens = 7500
input_length = 128
output_dim = 128
vectorizer_layer = TextVectorization(
max_tokens = max_tokens,
output_mode = 'int',
standardize = 'lower_and_strip_punctuation',
output_sequence_length = input_length
)
vectorizer_layer.adapt(X_train)
embedding_layer = Embedding(
input_dim = max_tokens,
output_dim = output_dim,
input_length = input_length
)
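A quick check that the adapted vectorizer maps words to integer ids and pads to the fixed sequence length (the sample string is hypothetical):

sample_ids = vectorizer_layer(tf.constant([['not help cough feel dizzy']]))
print(sample_ids.shape)  # (1, 128): five word ids followed by zero padding

# Define model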
model = Sequential([
Input(shape=(1,), dtype=tf.string),
vectorizer_layer,
embedding_layer,
Dense(1, activation='sigmoid')
])
# Compile model
model.compile(optimizer=Adam(learning_rate=0.001),
loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
# Fit model
model.fit(X_train, y_train, epochs=5)
# Evaluate model in accuracy
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f'Test set accuracy: {test_acc}')

Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
 text_vectorization (TextVectorization)  (None, 128)            0
 embedding (Embedding)                    (None, 128, 128)       960000
 dense (Dense)                            (None, 128, 1)         129
=================================================================
Total params: 960129 (3.66 MB)
Trainable params: 960129 (3.66 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/5
6399/6399 [==============================] - 57s 8ms/step - loss: 0.6652 - accuracy: 0.6129
Epoch 2/5
6399/6399 [==============================] - 25s 4ms/step - loss: 0.6645 - accuracy: 0.6135
Epoch 3/5
6399/6399 [==============================] - 24s 4ms/step - loss: 0.6644 - accuracy: 0.6135
Epoch 4/5
6399/6399 [==============================] - 23s 4ms/step - loss: 0.6643 - accuracy: 0.6136
Epoch 5/5
6399/6399 [==============================] - 22s 3ms/step - loss: 0.6643 - accuracy: 0.6136
1600/1600 [==============================] - 4s 2ms/step - loss: 0.6639 - accuracy: 0.6150
Test set accuracy: 0.6149800419807434
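The summary above shows the sigmoid output with shape (None, 128, 1): without a pooling or flattening layer the model scores every token position rather than the review as a whole, which likely explains why accuracy stalls near the majority-class rate (~61%). A hedged fix, not run here, is to pool the embedded sequence into one review-level vector first:

from keras.layers import GlobalAveragePooling1D

model_pooled = Sequential([
    Input(shape=(1,), dtype=tf.string),
    vectorizer_layer,
    Embedding(max_tokens, output_dim),  # fresh embedding for this sketch
    GlobalAveragePooling1D(),           # (None, 128): one vector per review
    Dense(1, activation='sigmoid')      # one probability per review
])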
Multi-Layer Deep Text Classification Model
# Define model
model_reg = Sequential([
Input(shape=(1,), dtype=tf.string),
vectorizer_layer,
embedding_layer,
Dense(128, activation='relu',
kernel_regularizer=L1(l1=0.0005)),
Dropout(rate=0.6),
Dense( 64, activation='relu',
kernel_regularizer=L1L2(l1=0.0005, l2=0.0005)),
Dense( 32, activation='relu',
kernel_regularizer=L2(l2=0.0005)),
Dense( 16, activation='relu',
kernel_regularizer=L2(l2=0.0005)),
Dense( 8, activation='relu',
kernel_regularizer=L2(l2=0.0005)),
Dense( 1, activation='sigmoid')
])
# Compile model
model_reg.compile(
optimizer=Adam(learning_rate=0.001),
loss='binary_crossentropy',
metrics=['accuracy'])
model_reg.summary()
# Fit model
model_reg.fit(X_train, y_train, epochs=5)
# Evaluate model in accuracy
reg_test_loss, reg_test_acc = model_reg.evaluate(X_test, y_test)
print(f'Test set accuracy: {reg_test_acc}')

Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
 text_vectorization (TextVectorization)  (None, 128)            0
 embedding (Embedding)                    (None, 128, 128)       960000
 dense_1 (Dense)                          (None, 128, 128)       16512
 dropout (Dropout)                        (None, 128, 128)       0
 dense_2 (Dense)                          (None, 128, 64)        8256
 dense_3 (Dense)                          (None, 128, 32)        2080
 dense_4 (Dense)                          (None, 128, 16)        528
 dense_5 (Dense)                          (None, 128, 8)         136
 dense_6 (Dense)                          (None, 128, 1)         9
=================================================================
Total params: 987521 (3.77 MB)
Trainable params: 987521 (3.77 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/5
6399/6399 [==============================] - 73s 11ms/step - loss: 0.6809 - accuracy: 0.6107
Epoch 2/5
6399/6399 [==============================] - 42s 7ms/step - loss: 0.6701 - accuracy: 0.6107
Epoch 3/5
6399/6399 [==============================] - 41s 6ms/step - loss: 0.6699 - accuracy: 0.6107
Epoch 4/5
6399/6399 [==============================] - 41s 6ms/step - loss: 0.6699 - accuracy: 0.6107
Epoch 5/5
6399/6399 [==============================] - 41s 6ms/step - loss: 0.6699 - accuracy: 0.6107
1600/1600 [==============================] - 6s 4ms/step - loss: 0.6691 - accuracy: 0.6123
Test set accuracy: 0.6123266220092773
# Scratches for dashboard: Rating factor for Ranks
# rating_factor = df.groupby(['MedicineBrandName', 'MedicineUsedFor']).agg(
# avg_rating = ('Rating', lambda x: np.mean(x)),
# std_rating = ('Rating', lambda x: np.std(x)),
# count_reviews = ('Reviews', lambda x: x.count())
# ).reset_index()
# rating_factor.sort_values(
# ['count_reviews', 'avg_rating', 'std_rating'],
# ascending=[False, False, True])[:50]
# x = (rating_factor['avg_rating'] / rating_factor['std_rating'])
# rating_factor['impact_factor'] = (rating_factor['count_reviews'] *
# (1 - np.exp(-x)) / (1 + np.exp(-x)))
# rating_factor.sort_values(
# ['count_reviews', 'avg_rating', 'impact_factor'],
# ascending=False).loc[rating_factor['MedicineUsedFor'] ==
# 'Weight Loss (Obesity/Overweight)'][:25]
# rating_factor.sort_values(
# ['count_reviews', 'avg_rating', 'impact_factor'],
# ascending=False)[:25]
# rating_factor.sort_values(
# ['count_reviews', 'avg_rating', 'impact_factor'],
#     ascending=False).loc[rating_factor['MedicineUsedFor'] == 'Birth Control'][:25]

Multi-Layer Bidirectional LSTM Model
# Define model
# Some tweaks:
## The model was originally designed with activation='elu'.
## However, Keras only uses the fast cuDNN LSTM kernel on
## Colab's GPU when activation='tanh', so all 'elu'
## activations were changed to 'tanh'.
model_ml_bi_lstm = Sequential([
Input(shape=(1,), dtype=tf.string),
vectorizer_layer,
embedding_layer,
Bidirectional(LSTM(128,
activation='tanh',
return_sequences=True)),
Bidirectional(LSTM(128,
activation='tanh',
return_sequences=True)),
Bidirectional(LSTM(64,
activation='tanh')),
Dense( 64, activation='tanh',
kernel_regularizer=L1L2(
l1=0.0001, l2=0.0001)),
Dense( 32, activation='tanh',
kernel_regularizer=L2(l2=0.0001)),
Dense( 8, activation='tanh',
kernel_regularizer=L2(l2=0.0005)),
Dense( 8, activation='tanh',
kernel_regularizer=L2(l2=0.0005)),
Dense( 8, activation='tanh'),
Dense( 4, activation='tanh'),
Dense( 1, activation='sigmoid')
])
# Compile model
model_ml_bi_lstm.compile(optimizer=Adam(learning_rate=0.0001),
loss='binary_crossentropy', metrics=['accuracy'])
model_ml_bi_lstm.summary()
# Fit
model_ml_bi_lstm.fit(X_train, y_train, epochs=5)
# Evaluate
bi_lstm_test_loss, bi_lstm_test_acc = model_ml_bi_lstm.evaluate(X_test, y_test)
print(f'Test set accuracy: {bi_lstm_test_acc}')

Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
 text_vectorization (TextVectorization)  (None, 128)            0
 embedding (Embedding)                    (None, 128, 128)       960000
 bidirectional (Bidirectional)            (None, 128, 256)       263168
 bidirectional_1 (Bidirectional)          (None, 128, 256)       394240
 bidirectional_2 (Bidirectional)          (None, 128)            164352
dense_7 (Dense) (None, 64) 8256
dense_8 (Dense) (None, 32) 2080
dense_9 (Dense) (None, 8) 264
dense_10 (Dense) (None, 8) 72
dense_11 (Dense) (None, 8) 72
dense_12 (Dense) (None, 4) 36
dense_13 (Dense) (None, 1) 5
=================================================================
Total params: 1792545 (6.84 MB)
Trainable params: 1792545 (6.84 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/5
6399/6399 [==============================] - 256s 38ms/step - loss: 0.5273 - accuracy: 0.7767
Epoch 2/5
6399/6399 [==============================] - 216s 34ms/step - loss: 0.4614 - accuracy: 0.7966
Epoch 3/5
6399/6399 [==============================] - 213s 33ms/step - loss: 0.4362 - accuracy: 0.8040
Epoch 4/5
6399/6399 [==============================] - 213s 33ms/step - loss: 0.4182 - accuracy: 0.8111
Epoch 5/5
6399/6399 [==============================] - 215s 34ms/step - loss: 0.4032 - accuracy: 0.8186
1600/1600 [==============================] - 24s 14ms/step - loss: 0.4515 - accuracy: 0.7910
Test set accuracy: 0.7910333871841431
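Because the vectorizer lives inside the model, the trained network can score raw strings directly; a minimal sketch with hypothetical cleaned reviews:

probs = model_ml_bi_lstm.predict(tf.constant([
    ['work great not pull market'],
    ['awful severe nausea stop take']]))
print(probs)  # each row is P(positive); above 0.5 reads as a positive review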
Building a Transformer Model: distilbert-base-uncased
import os
# !pip install --upgrade transformers
# !pip install tf-keras
# os.environ['TF_USE_LEGACY_KERAS'] = '1'

import transformers
from transformers import DistilBertTokenizer
print(transformers.__version__)

4.42.4
# Findings: Characteristics of 'distilbert-base-uncased':
## model_max_length = 512
## vocab_size = 30522
## Note: DistilBertTokenizer is the pure-Python ("slow") tokenizer
## (is_fast=False in the output below); the Rust-based
## DistilBertTokenizerFast is the more efficient variant.
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
tokenizer

/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:89: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
DistilBertTokenizer(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True), added_tokens_decoder={
0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
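What the tokenizer actually produces for one (hypothetical) review — token ids framed by [CLS]/[SEP], plus an attention mask marking the real tokens:

enc = tokenizer('not help cough feel dizzy',
                max_length=128, truncation=True, padding='max_length')
print(enc['input_ids'][:8])       # starts with 101 ([CLS]); zeros are [PAD]
print(enc['attention_mask'][:8])  # 1 for real tokens, 0 for padding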
# Max length : 128 tokens per sequence.
# Truncation : text longer than the maximum length is cut off.
# Padding    : sequences shorter than 128 are filled with padding
#              tokens so every sequence has the same length.
# Tokenize both training and test data
train_encodings = tokenizer(list(X_train),
max_length=128, truncation=True, padding=True)
test_encodings = tokenizer(list(X_test),
    max_length=128, truncation=True, padding=True)

# Convert the data into TensorFlow datasets for
# more effective computation
train_dataset = tf.data.Dataset.from_tensor_slices(
( dict(train_encodings), tf.constant(y_train.values, dtype=tf.int32) )
)
test_dataset = tf.data.Dataset.from_tensor_slices(
( dict( test_encodings), tf.constant(y_test.values, dtype=tf.int32) )
)
# Shuffle the training set (but not the test set, so evaluation stays
# deterministic), and batch both to improve training efficiency.
train_dataset = train_dataset.shuffle(len(X_train)).batch(16)
test_dataset = test_dataset.batch(16)

model_distilbert = (
    TFAutoModelForSequenceClassification
    .from_pretrained('distilbert-base-uncased', num_labels=2))

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
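Before compiling, a quick hedged check that the batched dataset yields what the classifier expects — a dict of (16, 128) tensors plus a (16,) label vector:

for features, labels in train_dataset.take(1):
    print(features['input_ids'].shape, labels.shape)  # (16, 128) (16,)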
# Compile the model
## Fine-tuning uses a much smaller learning rate (3e-5) than the
## Keras models above; the model outputs raw logits, so the loss
## is configured with from_logits=True.
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]
model_distilbert.compile(
    optimizer=optimizer,
    loss=loss,
    metrics=metrics
)
model_distilbert.summary()
# Fit (note: validation_data reuses the training set here, so the
# val_accuracy below is optimistic; the held-out test set is
# evaluated separately afterwards)
model_distilbert.fit(train_dataset,
    epochs=5, validation_data=train_dataset)

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
 distilbert (TFDistilBertMainLayer)       multiple               66362880
 pre_classifier (Dense)                   multiple               590592
 classifier (Dense)                       multiple               1538
 dropout_19 (Dropout)                     multiple               0
=================================================================
Total params: 66955010 (255.41 MB)
Trainable params: 66955010 (255.41 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Epoch 1/5
12798/12798 [==============================] - 1733s 133ms/step - loss: 0.4433 - accuracy: 0.7871 - val_loss: 0.3563 - val_accuracy: 0.8420
Epoch 2/5
12798/12798 [==============================] - 1684s 132ms/step - loss: 0.3637 - accuracy: 0.8348 - val_loss: 0.2859 - val_accuracy: 0.8796
Epoch 3/5
12798/12798 [==============================] - 1685s 132ms/step - loss: 0.2939 - accuracy: 0.8711 - val_loss: 0.1917 - val_accuracy: 0.9244
Epoch 4/5
12798/12798 [==============================] - 1684s 132ms/step - loss: 0.2246 - accuracy: 0.9045 - val_loss: 0.1219 - val_accuracy: 0.9555
Epoch 5/5
12798/12798 [==============================] - 1684s 132ms/step - loss: 0.1706 - accuracy: 0.9296 - val_loss: 0.0794 - val_accuracy: 0.9735
<tf_keras.src.callbacks.History at 0x7fb31c2e8970>
distilbert_test_loss, distilbert_test_acc = (
model_distilbert.evaluate(test_dataset)
)
print(f'Test set accuracy: {distilbert_test_acc}')

3200/3200 [==============================] - 102s 32ms/step - loss: 0.5264 - accuracy: 0.8098
Test set accuracy: 0.8097870945930481
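A minimal sketch of scoring a new (hypothetical) review with the fine-tuned model; softmax converts the two logits into class probabilities:

inputs = tokenizer('this medicine gave me my life back',
                   return_tensors='tf', max_length=128,
                   truncation=True, padding=True)
logits = model_distilbert(inputs).logits
print(tf.nn.softmax(logits, axis=-1).numpy())  # [[P(negative), P(positive)]]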
Conclusion
We’ve built a text classification model with comparison of accuracies as follows:
| Model | model | model_reg | model_ml_bi_lstm | model_distilbert |
|---|---|---|---|---|
| Training Acc | 61.36% | 61.07% | 81.86% | 92.96% |
| Test Acc | 61.50% | 61.23% | 79.10% | 80.98% |
Fine-tuning a pre-trained transformer gave the highest test accuracy at separating positive reviews from negative ones. We can use this model to score incoming drug reviews.
Suggestion
With the project ending here, the author realizes there is still room for development. Some suggestions for higher accuracy:

* Increase the number of epochs, especially for the larger models.
* Run hyperparameter tuning on the best-performing models, from the number of layers to the optimizer's learning rate (see the sketch after this list).
* Several models were modified so they could run on Google Colab, such as the Bidirectional LSTM with its changed activation function; fitting the dataset to the original designs would show what accuracy they can reach.
* Experiment with other types of models (more LSTM variants or other transformers).
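As a concrete starting point for the tuning suggestion above, a minimal sketch with keras-tuner; the search space here is illustrative, not the author's:

import keras_tuner as kt

def build_model(hp):
    m = Sequential([
        Input(shape=(1,), dtype=tf.string),
        vectorizer_layer,
        Embedding(max_tokens, output_dim),  # fresh embedding per trial
        Bidirectional(LSTM(hp.Choice('units', [64, 128]), activation='tanh')),
        Dense(1, activation='sigmoid')
    ])
    m.compile(optimizer=Adam(hp.Choice('learning_rate', [1e-3, 1e-4])),
              loss='binary_crossentropy', metrics=['accuracy'])
    return m

tuner = kt.RandomSearch(build_model, objective='val_accuracy', max_trials=4)
tuner.search(X_train, y_train, epochs=2, validation_split=0.1)
print(tuner.get_best_hyperparameters()[0].values)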