This project analyzes open-ended patient reviews to predict sentiment (positive or negative) using Natural Language Processing (NLP) techniques and Large Language Models (LLMs).
- Goal: Classify patient feedback based on sentiment
- Dataset: 996 hospital reviews with labeled sentiment
- Techniques used:
  - Text cleaning and preprocessing (NLTK, regex)
  - Exploratory Data Analysis (word clouds, word frequencies, review length)
  - Feature extraction with TF-IDF
  - Sentiment classification using:
    - Logistic Regression (baseline)
    - distilBERT LLM from Hugging Face (zero-shot)
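The cleaning, TF-IDF, and Logistic Regression steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `clean_text` helper, the tiny `STOPWORDS` set (standing in for NLTK's stopword list), the lowercasing step, and the toy reviews are all assumptions made for the example.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical stopword subset for illustration; NLTK's list is much longer.
STOPWORDS = {"the", "a", "an", "is", "was", "were", "and", "to", "of", "for", "i", "it", "in"}


def clean_text(text: str) -> str:
    """Lowercase, replace non-letter characters with spaces, and drop stopwords."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)


# Toy reviews invented for this sketch (1 = positive, 0 = negative).
reviews = [
    "The doctor was kind and the staff were helpful.",
    "Great care, very clean rooms!",
    "Waited 2 hours and the reception was rude.",
    "Terrible billing process, long wait.",
]
labels = [1, 1, 0, 0]

# TF-IDF features feed a Logistic Regression classifier; clean_text runs
# on each review before tokenization.
model = make_pipeline(
    TfidfVectorizer(preprocessor=clean_text),
    LogisticRegression(max_iter=1000),
)
model.fit(reviews, labels)

predictions = model.predict(["Kind staff and clean rooms", "Rude reception, long wait"])
```

Wrapping the vectorizer and classifier in one pipeline keeps the same cleaning and vocabulary applied at training and prediction time, which avoids fitting TF-IDF on test data by accident.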
patient-sentiment-healthcare/
├── data/
│   ├── dataset_hospital_reviews.csv            # raw
│   └── dataset_hospital_reviews_cleaned.csv    # processed
├── notebooks/
│   ├── 01_data_cleaning_and_eda.ipynb          # Data cleaning + EDA
│   └── 02_modeling_and_llm_comparison.ipynb    # Model training + LLM comparison
└── README.md
- Accuracy: 0.86
- High precision on positive class
- Poor recall on negative class
- Accuracy: 0.78
- Much better at identifying negative reviews
- Balanced recall across classes
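Accuracy alone can hide weak recall on one class, which is why the results above are read per class. The sketch below shows how accuracy, precision, and recall relate on a small set of invented labels; the `per_class_report` helper and the label lists are hypothetical and do not reproduce the project's numbers. scikit-learn's `classification_report` reports the same quantities.

```python
def per_class_report(y_true, y_pred, labels=(0, 1)):
    """Return overall accuracy plus precision and recall for each class label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    correct = sum(t == p for t, p in zip(y_true, y_pred))
    report = {"accuracy": correct / len(y_true)}

    for label in labels:
        true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)  # how often the model chose this label
        actual = sum(t == label for t in y_true)     # how often the label really occurs
        report[label] = {
            "precision": true_pos / predicted if predicted else 0.0,
            "recall": true_pos / actual if actual else 0.0,
        }
    return report


# Invented labels for illustration only (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0]

report = per_class_report(y_true, y_pred)
```

In this toy case five of six predictions are correct, yet only one of the two negative reviews is recovered, so a model can score well on accuracy while still missing half of the negative feedback.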
"Wait hour despite appointment isn't first time happened understanding manage appointment queue it's random unorganised lot scope improve"
--> Detected as NEGATIVE by distilBERT
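A prediction like the one above could be obtained with the Hugging Face `pipeline` API, roughly as sketched here. The checkpoint name `distilbert-base-uncased-finetuned-sst-2-english` is an assumption (it is a publicly available DistilBERT sentiment model, but the README does not say which checkpoint the notebook uses), and the helper names `load_sentiment_classifier` and `classify_reviews` are hypothetical.

```python
def load_sentiment_classifier(
    model_name="distilbert-base-uncased-finetuned-sst-2-english",  # assumed checkpoint
):
    """Load a pretrained sentiment pipeline; weights are downloaded on first use."""
    # Imported lazily so the module can be loaded without `transformers` installed.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=model_name)


def classify_reviews(texts, classifier):
    """Run the classifier on each review and map its labels to 1 (positive) / 0 (negative).

    The classifier is expected to return one dict per text, shaped like
    {"label": "NEGATIVE", "score": 0.99}.
    """
    # truncation=True cuts reviews that exceed the model's maximum input length.
    results = classifier(list(texts), truncation=True)
    return [1 if result["label"] == "POSITIVE" else 0 for result in results]
```

In a Colab cell this would be used as `classifier = load_sentiment_classifier()` followed by `classify_reviews(reviews, classifier)`. No training step is involved: the pretrained model is applied to the reviews as-is, and its 0/1 outputs can be compared directly against the dataset labels.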
- Python, Pandas, Scikit-learn, NLTK, Matplotlib, Seaborn
- Hugging Face Transformers (distilBERT)
- Google Colab (for LLM execution)
- Open 02_modeling_and_llm_comparison.ipynb in Google Colab
- Mount your Google Drive and upload the cleaned dataset (or use the one provided)
- Run the cells to explore, train, and evaluate both models