symptom-disease-1.2

Library Import

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#import fastbook
#fastbook.setup_book()
#from fastbook import *
from fastai.tabular.all import *
import numpy as np
from numpy import random
from tqdm import tqdm
from ipywidgets import interact
from fastai.imports import *
np.set_printoptions(linewidth=130)
from fastai.text.all import *
from pathlib import Path
import os
import warnings
import gc
import pickle
from joblib import dump, load

import tokenize, ast
from io import BytesIO

from transformers import AutoModelForCausalLM,AutoTokenizer,BitsAndBytesConfig
import torch

import ipywidgets as widgets

from openai import OpenAI

import kagglehub

# Download the latest version of the dataset.
# dataset_download returns the local path as a string; a trailing comma
# after the call would wrap it in a one-element tuple.
path = kagglehub.dataset_download("rubanzasilva/symptoms-disease-no-id")

print("Path to dataset files:", path)

Warning: Looks like you're using an outdated `kagglehub` version (installed: 0.3.10), please consider upgrading to the latest version (0.3.11).
Path to dataset files: /teamspace/studios/this_studio/.cache/kagglehub/datasets/rubanzasilva/symptoms-disease-no-id/versions/1

path = Path('/teamspace/studios/this_studio/.cache/kagglehub/datasets/rubanzasilva/symptoms-disease-no-id/versions/1')
path

Path('/teamspace/studios/this_studio/.cache/kagglehub/datasets/rubanzasilva/symptoms-disease-no-id/versions/1')

#symptom_df = pd.read_csv(path_lm/'symptom_synth.csv',index_col=0)
symptom_df = pd.read_csv(path/'symptom_no_id.csv')
sd_df = pd.read_csv(path/'symptom_disease_no_id_col.csv')
symptom_df.head()

	text
0	I have been experiencing a skin rash on my arms, legs, and torso for the past few weeks. It is red, itchy, and covered in dry, scaly patches.
1	My skin has been peeling, especially on my knees, elbows, and scalp. This peeling is often accompanied by a burning or stinging sensation.
2	I have been experiencing joint pain in my fingers, wrists, and knees. The pain is often achy and throbbing, and it gets worse when I move my joints.
3	There is a silver like dusting on my skin, especially on my lower back and scalp. This dusting is made up of small scales that flake off easily when I scratch them.
4	My nails have small dents or pits in them, and they often feel inflammatory and tender to the touch. Even there are minor rashes on my arms.

sd_df

	label	text
0	Psoriasis	I have been experiencing a skin rash on my arms, legs, and torso for the past few weeks. It is red, itchy, and covered in dry, scaly patches.
1	Psoriasis	My skin has been peeling, especially on my knees, elbows, and scalp. This peeling is often accompanied by a burning or stinging sensation.
2	Psoriasis	I have been experiencing joint pain in my fingers, wrists, and knees. The pain is often achy and throbbing, and it gets worse when I move my joints.
3	Psoriasis	There is a silver like dusting on my skin, especially on my lower back and scalp. This dusting is made up of small scales that flake off easily when I scratch them.
4	Psoriasis	My nails have small dents or pits in them, and they often feel inflammatory and tender to the touch. Even there are minor rashes on my arms.
...	...	...
1195	diabetes	I'm shaking and trembling all over. I've lost my sense of taste and smell, and I'm exhausted. I occasionally get palpitations or a speeding heart.
1196	diabetes	Particularly in the crevices of my skin, I have skin rashes and irritations. My skin bruises and cuts take a while to heal as well.
1197	diabetes	I regularly experience these intense urges and the want to urinate. I frequently feel drowsy and lost. I've also significantly lost my vision.
1198	diabetes	I have trouble breathing, especially outside. I start to feel hot and start to sweat. I frequently have urinary tract infections and yeast infections.
1199	diabetes	I constantly sneeze and have a dry cough. My infections don't seem to be healing, and I have palpitations. My throat does ache occasionally, but it usually gets better.

1200 rows × 2 columns


#| include: false
#| code-fold: true
#| output: false
#| code-summary: "Hugging Face login"
from huggingface_hub import login
login()

# Define the model name
model_name = "meta-llama/Llama-2-7b-hf"

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model with 8-bit quantization directly to GPU.
# Passing load_in_8bit straight to from_pretrained is deprecated;
# the setting goes in a BitsAndBytesConfig instead.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=0,  # Use first GPU
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights to reduce memory usage
)

# Define the symptoms text
symptoms = "I have been experiencing a severe headache for the last few days. It's worse in the mornings and associated with nausea and vomiting. I feel a bit lightheaded, and my vision is blurry at times."

# Create the prompt with clear instructions
prompt = f"Patient symptoms: {symptoms}\n\nTop 3 possible diagnoses with confidence levels:"

# Tokenize the input prompt
toks = tokenizer(prompt, return_tensors="pt")

# Generate prediction following Jeremy's implementation
# Move tokens to GPU, generate with max_new_tokens=40, then move result back to CPU
res = model.generate(**toks.to("cuda"), max_new_tokens=40).to('cpu')

# Decode the generated tokens to text
diagnosis = tokenizer.batch_decode(res)[0]

# Print the full output
print("Complete model output:")
print(diagnosis)

Complete model output:
<s> Patient symptoms: I have been experiencing a severe headache for the last few days. It's worse in the mornings and associated with nausea and vomiting. I feel a bit lightheaded, and my vision is blurry at times.

Top 3 possible diagnoses with confidence levels:

1. [Migraine (94% confidence)](https://en.wikipedia.org/wiki/Migraine)
2. [Sinusitis (92% confidence

# Basic parsing to extract just the generated diagnoses
# This might need adjustment based on the actual output format
marker = "Top 3 possible diagnoses with confidence levels:"
# partition never raises: if the marker is missing, `found` is empty
# and we fall back to the full decoded text
_, found, tail = diagnosis.partition(marker)
generated_text = (tail if found else diagnosis).strip()

print("\nExtracted diagnoses:")
print(generated_text)


Extracted diagnoses:
1. [Migraine (94% confidence)](https://en.wikipedia.org/wiki/Migraine)
2. [Sinusitis (92% confidence
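
The completions in this notebook come back in several shapes: markdown links, `\strong{...}` markup, `20%: Name` prefixes, and lines cut off by `max_new_tokens`. The sketch below is one possible way to turn such text into `(diagnosis, confidence)` pairs. The function name `parse_diagnoses` and the regular expressions are assumptions written against the outputs shown here, not a general parser, and a confidence that was truncated before its `%` sign is reported as `None`.

```python
import re

def parse_diagnoses(text):
    """Return (name, confidence) pairs from numbered lines like '1. Migraine (94%)'.

    Hypothetical helper: the patterns only cover the output formats seen in
    this notebook, and confidence is None when no 'NN%' is present.
    """
    results = []
    for line in text.splitlines():
        item = re.match(r"\s*\d+\.\s+(.*)", line)
        if not item:
            continue  # skip headings, blank lines, and free-form prose
        body = item.group(1)

        # Confidence: first number that is directly followed by a percent sign
        conf = re.search(r"(\d{1,3})\s*%", body)
        confidence = int(conf.group(1)) if conf else None

        name = re.sub(r"\]\(.*$", "", body)                 # drop a markdown link target
        name = re.sub(r"\\strong\{([^}]*)\}", r"\1", name)  # unwrap \strong{...}
        # Drop '(94% confidence)', a truncated '(40', or a leading '20%:'
        name = re.sub(r"\(\d{1,3}[^)]*\)?|\d{1,3}\s*%:?", "", name)
        results.append((name.strip(" []*"), confidence))
    return results
```

Calling `parse_diagnoses(generated_text)` on the extracted text above would give one pair per numbered line, which is easier to compare against dataset labels than the raw string.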

# Try with sampling for more diverse outputs
res_with_sampling = model.generate(
    **toks.to("cuda"), 
    max_new_tokens=40, 
    do_sample=True,  # Enable sampling
    temperature=0.7  # Control randomness (lower = more focused)
).to('cpu')

# Decode sampled response
diagnosis_with_sampling = tokenizer.batch_decode(res_with_sampling)[0]
print("\nOutput with sampling enabled:")
print(diagnosis_with_sampling)


Output with sampling enabled:
<s> Patient symptoms: I have been experiencing a severe headache for the last few days. It's worse in the mornings and associated with nausea and vomiting. I feel a bit lightheaded, and my vision is blurry at times.

Top 3 possible diagnoses with confidence levels:

1. Migraine
2. Tension headache
3. Cluster headache

My 2nd most confident diagnosis is migraine. I am not confident about my

# Multiple symptom descriptions
symptom_list = [
    "Persistent cough, fever of 101°F for 5 days, and fatigue.",
    "Red, itchy rash on face and arms, started after camping trip.",
    "Joint pain in fingers and wrists, worse in the morning, with stiffness."
]

# Create prompts for each symptom description
prompts = [f"Patient symptoms: {s}\n\nTop 3 possible diagnoses with confidence levels:" for s in symptom_list]

# Process each prompt
for prompt in prompts:
    # Tokenize
    toks = tokenizer(prompt, return_tensors="pt")
    
    # Generate (using Jeremy's style)
    res = model.generate(**toks.to("cuda"), max_new_tokens=40, do_sample=True).to('cpu')
    
    # Decode
    diagnosis = tokenizer.batch_decode(res)[0]
    
    # Print result
    print("\n" + "="*50)
    print(prompt)
    print("-"*50)
    print(diagnosis.split(prompt)[1].strip() if prompt in diagnosis else diagnosis)


==================================================
Patient symptoms: Persistent cough, fever of 101°F for 5 days, and fatigue.

Top 3 possible diagnoses with confidence levels:
--------------------------------------------------
1. Influenza A
2. Influenza B
3. Influenza C

### 1. Influenza A

|

==================================================
Patient symptoms: Red, itchy rash on face and arms, started after camping trip.

Top 3 possible diagnoses with confidence levels:
--------------------------------------------------
1. Poison ivy
2. Shingles
3. Lyme disease

### 1. Poison ivy

- **Confidence level:**

==================================================
Patient symptoms: Joint pain in fingers and wrists, worse in the morning, with stiffness.

Top 3 possible diagnoses with confidence levels:
--------------------------------------------------
1. 20%: Bursitis
2. 20%: Carpal Tunnel Syndrome
3. 20%: Arthritis

# Standard-library random: this rebinds the name imported from numpy earlier,
# and its randint includes both endpoints
import random

# Load your datasets
#sd_df = pd.read_csv('path/to/symptom_disease_no_id_col.csv')

# Reuse the tokenizer and model loaded above; loading them again would put
# a second copy of the 7B weights on the GPU.

# Select a random symptom from your dataset to test
random_idx = random.randint(0, len(sd_df) - 1)
test_symptoms = sd_df.iloc[random_idx]['text']
actual_diagnosis = sd_df.iloc[random_idx]['label']

# Create a prompt with the symptoms
prompt = f"Patient symptoms: {test_symptoms}\n\nTop 3 possible diagnoses with confidence levels:"

# Tokenize the input
toks = tokenizer(prompt, return_tensors="pt")

# Generate prediction (Jeremy's style)
res = model.generate(**toks.to("cuda"), max_new_tokens=50, do_sample=True, temperature=0.7).to('cpu')

# Decode the response
prediction = tokenizer.batch_decode(res)[0]

# Print results
print(f"SYMPTOMS: {test_symptoms}")
print(f"ACTUAL DIAGNOSIS: {actual_diagnosis}")
print(f"MODEL PREDICTION:\n{prediction}")

SYMPTOMS: I am having some diarrhea and constipation, which has been quite concerning. In my stomach, there is a severe, painful ache. I'm constantly exhausted and don't feel like eating anything.
ACTUAL DIAGNOSIS: Typhoid
MODEL PREDICTION:
<s> Patient symptoms: I am having some diarrhea and constipation, which has been quite concerning. In my stomach, there is a severe, painful ache. I'm constantly exhausted and don't feel like eating anything.

Top 3 possible diagnoses with confidence levels:

1. \strong{Acute appendicitis} (80%)

2. \strong{Diverticulitis} (60%)

3. \strong{Ulcerative colitis} (40
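
Here the dataset label (Typhoid) does not appear anywhere in the model's three guesses. A rough way to score such comparisons over many rows is to check whether the label occurs in the generated text, ignoring case and punctuation, since labels in `sd_df` are capitalised inconsistently (`Psoriasis`, `diabetes`). The helpers below are a minimal sketch under that assumption; `normalize` and `label_in_prediction` are hypothetical names, and a plain word match will miss synonyms (for example "flu" for "influenza").

```python
import re

def normalize(text: str) -> str:
    """Lowercase and replace every run of non-alphanumerics with one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def label_in_prediction(actual_label: str, generated_text: str) -> bool:
    """True if the normalized label occurs as whole words in the generated text.

    Hypothetical scoring helper: a crude match, not a clinical comparison.
    """
    label = normalize(actual_label)
    if not label:
        return False
    pattern = rf"\b{re.escape(label)}\b"
    return re.search(pattern, normalize(generated_text)) is not None
```

For a single row this would be called as `label_in_prediction(actual_diagnosis, prediction.split(prompt)[-1])`, so that the symptom text in the prompt itself is not counted as a match.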