DIGI405 Text Analysis Project Notebook
0.2.5 - 2025-10-08 - optional functionality to run inference on augmented data or data from a CSV
0.2.3 - 2025-09-15 - metric ordering
0.2.2 - 2025-09-08 - quality of life improvements, probability labels
0.2.1 - 2025-08-19 - ensure filtering / grid search handled as expected
Note: Search notebook for 0.2.1/0.2.2/0.2.3/0.2.5 to find changes if you want to apply to your existing notebook.
Introduction
You should use this notebook as a starting point for your DIGI405 project. It provides code to select your dataset, and run a complete text classification pipeline with textplumber, a package that provides an easy to use interface to methods covered in this course.
Name: David Ewing
Student ID: 82171165
Project option: Sentiment
Problem Statement
This project evaluates the quality of sentiment labels applied to the Cardiff NLP [tweet_eval sentiment dataset](https://huggingface.co/datasets/cardiffnlp/tweet_eval/viewer/sentiment), whose training split holds approximately 45,600 tweets labelled *negative*, *neutral*, or *positive*.
My central questions:
- How reliably is a text classifier able to reproduce the human-assigned sentiment labels?
- What does a text classifier's performance tell us about the quality and consistency of the labels?
Sub-questions:
- What kinds of problems exist with the labelling of tweets?
- What types of tweet are systematically hard to classify?
- Which feature(s) are most informative for distinguishing sentiment classes?
Evaluation metrics
The training data is imbalanced: neutral accounts for about 45% of tweets, while negative accounts for only about 16% (7,093 of 45,615). Macro avg F1 will therefore be used as the primary evaluation metric:
- macro avg F1 weights each class equally regardless of size,
- Precision and recall are reported per-class to show us where misclassifications occur, and
- Accuracy is not used as the primary metric.
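To illustrate why accuracy is not the primary metric, here is a minimal sketch using scikit-learn's `accuracy_score` and `f1_score`. The labels are hypothetical toy values (not taken from the dataset), and the "classifier" simply predicts neutral every time.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical toy labels: 0 = negative, 1 = neutral, 2 = positive.
y_true = [1] * 8 + [0, 2]
# A degenerate classifier that always predicts the majority class (neutral).
y_pred = [1] * 10

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
per_class_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)

print(f"accuracy     = {accuracy:.3f}")
print(f"macro F1     = {macro_f1:.3f}")
print(f"per-class F1 = {per_class_f1.round(3)}")
```

Accuracy rewards the majority-class guess, whereas macro F1 averages in the zero scores for the two classes that are never predicted, so it comes out far lower.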
Model plan
We will run the pipeline once per classifier, looping over the feature sets each time:
- logistic regression as a baseline, and
- decision tree to see what rules emerge from the data.
Features to test, building from simple to more complex:
- unigram token counts (small vocab to start)
- unigram tokens, TF-IDF weighted (larger vocab)
- bigrams
- VADER sentiment lexicon counts (likely the most relevant given the task)
- text statistics (tweet length, punctuation)
- POS tag sequences
We will also try running as a binary task (negative vs positive only) to check whether neutral is where most of the difficulty sits.
Notebook structure
Sections 1-4 provide code you should modify or extend. In your report, you can refer to code sections by their section number, e.g. 2.1.
1. Setup
You must select the Python 3.12 kernel to run the code in this notebook.
import subprocess
import sys

# Map each import name to the pip package that provides it.
required_packages = {
    'datasets': 'datasets',
    'sklearn': 'scikit-learn',
    'textplumber': 'textplumber',
    'imblearn': 'imbalanced-learn',
    'nlpaug': 'nlpaug',
    'spacy': 'spacy',
}
missing = []
for import_name, package_name in required_packages.items():
    try:
        __import__(import_name)
        print(f"OK : {package_name}")
    except ModuleNotFoundError:
        print(f"INSTALL : {package_name}")
        missing.append(package_name)
# Install anything that could not be imported, using the current interpreter's pip.
if missing:
    print(f"\nInstalling {len(missing)} missing package(s)...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q"] + missing)
    print("OK : Installation complete\n")
else:
    print("\nOK : All dependencies available\n")
OK : datasets
OK : scikit-learn
OK : textplumber
OK : imbalanced-learn
OK : nlpaug
OK : spacy

OK : All dependencies available
from datasets import load_dataset, ClassLabel, DatasetDict
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif, chi2
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import FeatureUnion
from sklearn.metrics import confusion_matrix, classification_report
from textplumber.core import *
from textplumber.clean import *
from textplumber.preprocess import *
from textplumber.tokens import *
from textplumber.pos import *
from textplumber.embeddings import *
from textplumber.report import *
from textplumber.store import *
from textplumber.lexicons import *
from textplumber.textstats import *
from imblearn.under_sampling import RandomUnderSampler
from IPython.display import Image
import matplotlib.pyplot as plt
plt.switch_backend('Agg')
import warnings
warnings.filterwarnings("ignore", message="Your stop_words may be inconsistent with your preprocessing")
These settings control the display of Pandas dataframes in the notebook.
pd.set_option('display.max_columns', None) # show all columns
pd.set_option('display.max_colwidth', 500) # increase this to see more text in the dataframe
Get word lists:
- The stop word list is from NLTK.
- All of the word lists (including the stop word list) can be used to extract lexicon count features, i.e. counts of how many words in a text belong to each list.
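As a rough illustration of what a lexicon count feature is, the sketch below counts tokens that belong to each named word list. It uses plain Python rather than textplumber's `LexiconCountVectorizer`, whose internals are not shown here; the function name, the toy lexicons and the simple regex tokenizer are all assumptions made for the example.

```python
import re

def lexicon_counts(text, lexicons):
    """Count how many tokens of `text` fall in each named word list."""
    # Simplistic tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return {name: sum(token in words for token in tokens)
            for name, words in lexicons.items()}

# Hypothetical toy lexicons, not the VADER or Empath lists.
toy_lexicons = {
    "positive": {"best", "great"},
    "negative": {"awful", "hate"},
}

counts = lexicon_counts("Best day EVER! Great game, not awful.", toy_lexicons)
print(counts)
```

Each lexicon contributes one numeric column per text, which is why a small lexicon set yields a very compact feature matrix compared with token counts.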
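import pickle
from pathlib import Path        # used by save_results / load_results below
from sklearn.base import clone  # used by train_and_evaluate_model below

stop_words = get_stop_words()
stop_words_lexicon = {'stop_words': stop_words}
empath_lexicons = get_empath_lexicons()
vader_lexicons = get_sentiment_lexicons()

def _tok(ngram, max_f, feature_store, stop_words):
    """Build a count-based token pipeline for the given n-gram range and vocabulary size."""
    return Pipeline([('vec', TokensVectorizer(
        feature_store=feature_store, vectorizer_type='count',
        max_features=max_f, lowercase=True, remove_punctuation=True,
        stop_words=stop_words, ngram_range=ngram))])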
def setup_feature_configs(feature_store, stop_words, vader_lexicons):
    """Return a dict mapping a feature-set name to the FeatureUnion that extracts it."""
    return {
        'unigrams': FeatureUnion([('tokens', _tok((1, 1), 200, feature_store, stop_words))]),
        'bigrams': FeatureUnion([('tokens', _tok((2, 2), 200, feature_store, stop_words))]),
        'uni+pos': FeatureUnion([('tokens', _tok((1, 1), 100, feature_store, stop_words)),
                                 ('pos', Pipeline([('vec', POSVectorizer(feature_store=feature_store)),
                                                   ('scl', StandardScaler(with_mean=False))]))]),
        'textstats': FeatureUnion([('ts', Pipeline([('vec', TextstatsTransformer(feature_store=feature_store)),
                                                    ('scl', StandardScaler(with_mean=False))]))]),
        'lexicon': FeatureUnion([('lex', Pipeline([('vec', LexiconCountVectorizer(feature_store=feature_store, lexicons=vader_lexicons)),
                                                   ('scl', StandardScaler(with_mean=False))]))]),
        'embeddings': FeatureUnion([('emb', Model2VecEmbedder(feature_store=feature_store))]),
    }

def setup_classifiers():
    """Return the (name, estimator) pairs compared in this notebook."""
    return [
        ('LR', LogisticRegression(max_iter=5000, random_state=42)),
        ('DT', DecisionTreeClassifier(max_depth=3, random_state=42)),
    ]
def save_results(results_dict, run_name):
    """Pickle a results dict to ../results/runs/<run_name>.pkl and return the path."""
    path = Path(f'../results/runs/{run_name}.pkl')
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, 'wb') as f:
        pickle.dump(results_dict, f)
    return path

def load_results(run_name):
    """Load a previously pickled results dict, or return None if it does not exist."""
    path = Path(f'../results/runs/{run_name}.pkl')
    if path.exists():
        with open(path, 'rb') as f:
            return pickle.load(f)
    return None
def train_and_evaluate_model(feat_name, feat_union, clf_name, clf, X_train, y_train, X_test, y_test,
                             target_classes, target_names, feature_store, results_dict, is_binary=False, store_pipeline=False, silent=True):
    """Fit one feature-set/classifier pipeline and record its test metrics in results_dict."""
    pipe = Pipeline([
        ('cleaner', TextCleaner(strip_whitespace=True)),
        ('spacy', SpacyPreprocessor(feature_store=feature_store)),
        ('features', feat_union),
        ('classifier', clone(clf) if is_binary else clf),
    ])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    rpt = classification_report(
        y_test, y_pred,
        labels=target_classes, target_names=target_names,
        digits=3, zero_division=0, output_dict=True)
    results_dict[(feat_name, clf_name)] = {
        'macro_f1': rpt['macro avg']['f1-score'],
        'report': rpt,
        'y_pred': y_pred,
    }
    if store_pipeline:
        results_dict[(feat_name, clf_name)]['pipeline'] = pipe
    if not silent:
        if is_binary:
            print(f'\n [binary] {feat_name} x {clf_name}')
        else:
            print(f'\n {feat_name} x {clf_name}')
        print(classification_report(
            y_test, y_pred,
            labels=target_classes, target_names=target_names,
            digits=3, zero_division=0))
        plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
def save_run(run_num, title, results_dict, clf_name, y_test, classes, names):
    """Save a report CSV and confusion-matrix PNG per feature set; return the best result."""
    # Relies on the notebook-level globals feat_configs and FIGS_DIR.
    print(f"\n{title}")
    for feat_name in feat_configs.keys():
        rpt = results_dict[(feat_name, clf_name)]['report']
        y_pred = results_dict[(feat_name, clf_name)]['y_pred']
        stem = f'cm_run{run_num}_{feat_name}_{clf_name}'
        pd.DataFrame(rpt).T.to_csv(FIGS_DIR / f'{stem}.csv')
        fig_path = FIGS_DIR / f'{stem}.png'
        with plt.ioff():
            plot_confusion_matrix(y_test, y_pred, classes, names)
            plt.savefig(fig_path, bbox_inches='tight', dpi=150)
            plt.close('all')
        print(f" SAVED : {stem}")
    best = max([(k, results_dict[k]['macro_f1']) for k in results_dict], key=lambda x: x[1])
    print(f" best: {best[0][0]} x {best[0][1]} macro F1 = {best[1]:.3f}")
    return best
2. Load and inspect data
2.0 Parquet split cache
Builds reproducible train/test splits from the raw CSVs and saves them as parquet files. The training portion is balanced by random undersampling, while the test portion keeps the validation split's original class distribution. If the files already exist, the cell is a no-op. Both the 3-class task (negative / neutral / positive) and the binary task (negative / positive) are prepared here so downstream cells can load them without repeating the undersampling.
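The build-once, reuse-afterwards pattern described above can be sketched in isolation as follows. This is a simplified stand-in, not the notebook's own cell: `load_or_build` and `build_toy_split` are hypothetical names, the data is a three-row toy table, and CSV is used instead of parquet so the example needs nothing beyond pandas.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_or_build(path, build_fn):
    """Return the table cached at `path`, building and saving it first if absent."""
    path = Path(path)
    if path.exists():
        return pd.read_csv(path), "cached"
    df = build_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return df, "built"

def build_toy_split():
    # Hypothetical stand-in for the expensive filtering and undersampling step.
    return pd.DataFrame({
        "split": ["train", "train", "test"],
        "text": ["good game", "bad day", "see you tomorrow"],
        "label": [2, 0, 1],
    })

with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "splits_toy.csv"
    first, status_first = load_or_build(cache_path, build_toy_split)
    second, status_second = load_or_build(cache_path, build_toy_split)

print(status_first, status_second)
```

Because the cached file fixes the random undersample, every later run sees exactly the same training rows, which keeps results comparable across runs.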
import shutil
from pathlib import Path
CLEANUP_BEFORE_RUN = True
if CLEANUP_BEFORE_RUN:
    # Removes cached run results and figures, so every run is recomputed from scratch.
    shutil.rmtree(Path('../results/runs'), ignore_errors=True)
    shutil.rmtree(Path('../figs'), ignore_errors=True)
    print("CLEANUP : results/runs and figs")
CLEANUP : results/runs and figs
from pathlib import Path
import pandas as pd
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
DATA_DIR = Path('../data').resolve()
RESULTS_DATA_DIR = Path('../results/data').resolve()
FIGS_DIR = Path('../figs').resolve()
DATA_DIR.mkdir(parents=True, exist_ok=True)
RESULTS_DATA_DIR.mkdir(parents=True, exist_ok=True)
FIGS_DIR.mkdir(parents=True, exist_ok=True)
print(f"OK : {DATA_DIR}")
print(f"OK : {RESULTS_DATA_DIR}")
print(f"OK : {FIGS_DIR}")
TRAIN_CSV = DATA_DIR / 'tweets_train.csv'
VAL_CSV = DATA_DIR / 'tweets_validation.csv'
if not TRAIN_CSV.exists() or not VAL_CSV.exists():
    print("Downloading from HuggingFace...")
    from datasets import load_dataset
    dataset_hf = load_dataset('cardiffnlp/tweet_eval', 'sentiment')
    train_df = dataset_hf['train'].to_pandas()
    val_df = dataset_hf['validation'].to_pandas()
    train_df.to_csv(TRAIN_CSV, index=False)
    val_df.to_csv(VAL_CSV, index=False)
    print(f"OK : Downloaded and saved to {DATA_DIR}")
else:
    print(f"OK : Using existing files from {DATA_DIR}")

PARQUET_3CLASS = RESULTS_DATA_DIR / 'splits_3class.parquet'
PARQUET_BINARY = RESULTS_DATA_DIR / 'splits_binary.parquet'

def _build_parquet(classes, out_path):
    """Filter to `classes`, undersample the training rows, and save both splits to one parquet file."""
    train_df = pd.read_csv(TRAIN_CSV)
    val_df = pd.read_csv(VAL_CSV)
    mask_tr = train_df['label'].isin(classes)
    mask_te = val_df['label'].isin(classes)
    X_tr = train_df.loc[mask_tr, 'text'].to_numpy().reshape(-1, 1)
    y_tr = train_df.loc[mask_tr, 'label'].to_numpy()
    # Only the training data is balanced; the test data keeps its original distribution.
    X_tr, y_tr = RandomUnderSampler(random_state=82171165).fit_resample(X_tr, y_tr)
    X_tr = X_tr.reshape(-1)
    X_te = val_df.loc[mask_te, 'text'].to_numpy()
    y_te = val_df.loc[mask_te, 'label'].to_numpy()
    combined = pd.concat([
        pd.DataFrame({'split': 'train', 'text': X_tr, 'label': y_tr}),
        pd.DataFrame({'split': 'test', 'text': X_te, 'label': y_te}),
    ], ignore_index=True)
    combined.to_parquet(out_path, index=False)
    print(f'OK : {out_path.name} (train n={len(X_tr):,}, test n={len(X_te):,})')

for label, classes, path in [
    ('3-class (neg/neu/pos)', [0, 1, 2], PARQUET_3CLASS),
    ('binary (neg/pos)', [0, 2], PARQUET_BINARY),
]:
    if path.exists():
        print(f"OK : {path.name} already exists")
    else:
        print(f"Building {label}...")
        _build_parquet(classes, path)
OK : /home/dew59/DIGI405/data
OK : /home/dew59/DIGI405/results/data
OK : /home/dew59/DIGI405/figs
OK : Using existing files from /home/dew59/DIGI405/data
OK : splits_3class.parquet already exists
OK : splits_binary.parquet already exists
2.1 Choose a dataset and preview the labels
Below you can select a dataset for the assignment. The options are sentiment, essay and genre. Change the value of dataset_option below. The datasets available on Huggingface.co will be downloaded automatically and a link provided to the dataset card with more information. The genre dataset was distributed with this notebook.
Note: The movie_reviews dataset is being used to demonstrate the notebook and is not one of your options for the assignment.
# Choose 'essay', 'sentiment', or 'genre' ('movie_reviews' is just for testing/demonstration)
dataset_option = 'sentiment'
if dataset_option == 'movie_reviews':
    dataset_name = 'polsci/sentiment-polarity-dataset-v2.0'
    dataset_dir = None
    target_labels = ['neg', 'pos']
    text_column = 'text'
    label_column = 'label'
    train_split_name = 'train'
    test_split_name = 'train'
    print('The movie_reviews is to demonstrate the notebook and is not an assignment option.')
elif dataset_option == 'sentiment':
    dataset_name = 'cardiffnlp/tweet_eval'
    dataset_dir = 'sentiment'
    target_labels = ['negative', 'neutral', 'positive']
    text_column = 'text'
    label_column = 'label'
    train_split_name = 'train'
    test_split_name = 'validation'
    print('You selected the sentiment dataset. Read more about this at https://huggingface.co/datasets/cardiffnlp/tweet_eval')
elif dataset_option == 'essay':
    dataset_name = 'polsci/ghostbuster-essay-cleaned'
    dataset_dir = None
    target_labels = ['claude', 'gpt', 'human']
    text_column = 'text'
    label_column = 'label'
    train_split_name = 'train'
    test_split_name = 'test'
    print('You selected the essay dataset. Read more about this at https://huggingface.co/datasets/polsci/ghostbuster-essay-cleaned')
elif dataset_option == 'genre':
    dataset_name = 'genre'
    dataset_type = 'json'
    # Note: Quality of life improvement for version 0.2.2
    dataset_dir = '/srv/source-data/genre_dataset.json' # if you are running this locally change to the path on your machine
    target_labels = ['Fiction', 'Letter', 'Notice', 'Obituary', 'Poetry or verse', 'Recipe', 'Review']
    text_column = 'text'
    label_column = 'label'
    train_split_name = 'train'
    test_split_name = 'test'
    print('You selected the genre dataset.')
else:
    print('Try again! That was not an option!')
You selected the sentiment dataset. Read more about this at https://huggingface.co/datasets/cardiffnlp/tweet_eval
Important notes about specific datasets:
- Make sure you go to the relevant Huggingface page to read more about the essay and sentiment datasets. Note the sentiment dataset is one subset of the larger 'tweet_eval' dataset.
- For the sentiment dataset, it is challenging to get good accuracy with three classes. If you like, you can remove the `neutral` class. There is a cell below that does this for you, so don't change the cell above.
- For the essay dataset, there are differences in punctuation between classes. You should use `character_replacements = {"’": "'", '“': '"', '”': '"',}` in the `TextCleaner` component in your pipeline to make sure you are not overfitting to a quirk of the data.
This loads the dataset.
from datasets import load_from_disk, DatasetDict
DATASET_CACHE = DATA_DIR / 'tweet_eval_raw'
if dataset_option == 'genre':
    if DATASET_CACHE.exists():
        print(f"\nLoading cached genre dataset from {DATASET_CACHE}...")
        dataset = load_from_disk(str(DATASET_CACHE))
        print("OK : Dataset loaded from cache\n")
    else:
        print(f"\nLoading genre dataset from {dataset_dir}...")
        dataset = load_dataset(dataset_type, data_files=dataset_dir)
        train_dataset = dataset['train'].filter(lambda example: example['split'] == 'train')
        test_dataset = dataset['train'].filter(lambda example: example['split'] == 'test')
        dataset = DatasetDict({
            'train': train_dataset,
            'test': test_dataset
        })
        print(f"Caching dataset to {DATASET_CACHE}...")
        dataset.save_to_disk(str(DATASET_CACHE))
        print("OK : Genre dataset cached\n")
else:
    if DATASET_CACHE.exists():
        print(f"\nLoading cached dataset from {DATASET_CACHE}...")
        dataset = load_from_disk(str(DATASET_CACHE))
        print("OK : Dataset loaded from cache\n")
    else:
        print(f"\nDownloading {dataset_option} dataset from HuggingFace...")
        dataset = load_dataset(dataset_name, data_dir=dataset_dir)
        print(f"Caching dataset to {DATASET_CACHE}...")
        dataset.save_to_disk(str(DATASET_CACHE))
        print("OK : Dataset downloaded and cached\n")
Loading cached dataset from /home/dew59/DIGI405/data/tweet_eval_raw...
OK : Dataset loaded from cache
# dataset loaded and cached above
This cell will show you information on the dataset fields and the splits.
preview_dataset(dataset)
Split: train (45615 samples)
Available fields: text, label
- Field 'text' has 45586 unique values
Value(dtype='string', id=None)
- Field 'label' has 3 unique values
ClassLabel(names=['negative', 'neutral', 'positive'], id=None)
Split: validation (2000 samples)
Available fields: text, label
- Field 'text' has 2000 unique values
Value(dtype='string', id=None)
- Field 'label' has 3 unique values
ClassLabel(names=['negative', 'neutral', 'positive'], id=None)
Split: test (12284 samples)
Available fields: text, label
- Field 'text' has 12284 unique values
Value(dtype='string', id=None)
- Field 'label' has 3 unique values
ClassLabel(names=['negative', 'neutral', 'positive'], id=None)
Notices
- Field 'text' appears to be a text column.
- Field 'label' is a label column (ClassLabel).
Here is the breakdown of the composition of labels in each data-set split.
# casting label column to ClassLabel if not already
cast_column_to_label(dataset, label_column)
label_names = get_label_names(dataset, label_column)
dfs = {}
for split in dataset.keys():
    dfs[split] = dataset[split].to_pandas()
    dfs[split].insert(1, 'label_name', dfs[split][label_column].apply(lambda x: dataset[split].features[label_column].int2str(x)))
    print('Labels for {}:'.format(split))
    preview_label_counts(dfs[split], label_column, label_names)
Column 'label' is already a ClassLabel.
Labels for train:
| label | label_name | count |
|---|---|---|
| 0 | negative | 7093 |
| 1 | neutral | 20673 |
| 2 | positive | 17849 |
Labels for validation:
| label | label_name | count |
|---|---|---|
| 0 | negative | 312 |
| 1 | neutral | 869 |
| 2 | positive | 819 |
Labels for test:
| label | label_name | count |
|---|---|---|
| 0 | negative | 3972 |
| 1 | neutral | 5937 |
| 2 | positive | 2375 |
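The training counts in the first table can be turned into class shares to quantify the imbalance. A small sketch with pandas, using only the three training counts reported above (the variable names are illustrative):

```python
import pandas as pd

# Training-split counts copied from the label table above.
train_counts = pd.Series({"negative": 7093, "neutral": 20673, "positive": 17849})

# Share of each class in the training split.
shares = (train_counts / train_counts.sum()).round(3)

# Ratio of the largest class to the smallest class.
imbalance_ratio = train_counts.max() / train_counts.min()

print(shares)
print(f"largest / smallest class = {imbalance_ratio:.2f}")
```

Neutral makes up roughly 45% of the training tweets and is close to three times the size of the negative class, which is the motivation for undersampling and for reporting macro avg F1.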
2.2 Configure the labels (optional)
- You can override the default labels for the data-set here to make the task more or less challenging. High accuracy does not guarantee a high grade.
- See the assignment instructions and the dataset card or corresponding paper for explanations of the data.
- Read the comments below, uncomment the relevant lines for your data-set, and amend the label names if needed.
- Remember, this is optional.
# for the movie reviews dataset (this is just for testing/demonstration) - there are 2 labels and that is it!
# for the sentiment dataset - there are 3 labels - you can make the task simpler as a binary classification problem using one of these options:
#target_labels = ['negative', 'neutral']
#target_labels = ['negative', 'positive']
#target_labels = ['neutral', 'positive']
# for the essay dataset - there are 3 labels - you can make the task simpler as a binary classification problem using one of these options:
#target_labels = ['claude', 'gpt']
#target_labels = ['human', 'gpt']
#target_labels = ['human', 'claude']
# for the genre dataset - there are 7 labels - you can turn the task into one or more binary classification problems using options such as:
#target_labels = ['Letter', 'Notice']
#target_labels = ['Letter', 'Fiction']
#target_labels = ['Review', 'Fiction']
#target_labels = ['Notice', 'Obituary']
print(target_labels)
['negative', 'neutral', 'positive']
2.3 Prepare the train and test splits
- This cell handles the train-test split for you.
- Some of the data-sets are unbalanced. This cell will balance the training data using under-sampling.
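To make the balancing step concrete, the sketch below implements random undersampling by hand with NumPy: every class is cut down to the size of the smallest class by sampling without replacement. It is an illustration of the idea only. The notebook itself uses imbalanced-learn's `RandomUnderSampler`, and the helper name `undersample_to_minority` and the toy arrays are assumptions.

```python
import numpy as np

def undersample_to_minority(X, y, seed=0):
    """Keep a random subset of each class so that all classes match the smallest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Hypothetical toy data: 2 negative, 4 neutral, 3 positive.
X_toy = np.array([f"tweet {i}" for i in range(9)])
y_toy = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2])

X_bal, y_bal = undersample_to_minority(X_toy, y_toy)
print(np.bincount(y_bal))
```

Undersampling discards data from the larger classes, so the balanced training set is smaller than the original; only the training split should be balanced this way, never the test split.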
target_classes = [label_names.index(name) for name in target_labels]
target_names = [label_names[i] for i in target_classes]
if train_split_name == test_split_name:
    X = dataset[train_split_name].to_pandas()
    X.insert(1, 'label_name', dfs[train_split_name][label_column].apply(lambda x: dataset[train_split_name].features[label_column].int2str(x)))
    y = np.array(dataset[train_split_name][label_column])
    mask = np.isin(y, target_classes)
    X = X.loc[mask]
    y = y[mask]
    # creating df splits with original data first - so can look at the train data if needed
    dfs['train'], dfs['test'], y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    # we're just using the text for features
    X_train = np.array(dfs['train'][text_column])
    X_test = np.array(dfs['test'][text_column])
else:
    X_train = np.array(dataset[train_split_name][text_column])
    y_train = np.array(dataset[train_split_name][label_column])
    X_test = np.array(dataset[test_split_name][text_column])
    y_test = np.array(dataset[test_split_name][label_column])
    mask = np.isin(y_train, target_classes)
    mask_test = np.isin(y_test, target_classes)
    X_train = X_train[mask]
    y_train = y_train[mask]
    X_test = X_test[mask_test]
    y_test = y_test[mask_test]
# this cell undersamples all but the minority class to balance the training data
X_train = X_train.reshape(-1, 1)
X_train, y_train = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)
X_train = X_train.reshape(-1)
preview_splits(X_train, y_train, X_test, y_test, target_classes = target_classes, target_names = target_names)
Train: 21279 samples, 3 classes
| label | label_name | count |
|---|---|---|
| 0 | negative | 7093 |
| 1 | neutral | 7093 |
| 2 | positive | 7093 |
Test: 2000 samples, 3 classes
| label | label_name | count |
|---|---|---|
| 1 | neutral | 869 |
| 2 | positive | 819 |
| 0 | negative | 312 |
2.4 Preview the texts
Time to get to know your data. We will only preview the train split.
y_train_names = map(lambda x: label_names[x], y_train)
# Note: Version 0.2.1 corrects display of the dataframe to ensure filtering by the selected labels
display(dfs['train'][dfs['train']['label_name'].isin(y_train_names)].sample(10))
| | text | label_name | label |
|---|---|---|---|
| 36281 | @user Nia's asking if you want a current event for tomorrow? She can print it right now | neutral | 1 |
| 30963 | @user @user @user @user #tbt My 1st Ducks game, only took 15yrs to get there! Best day EVER! | positive | 2 |
| 16016 | I stand with Harper on this. You shouldn't be leaving in the 7th on what essentially is a playoff game. It's not like the game went | neutral | 1 |
| 12744 | Janet Jackson is coming on September 18th I will give someone $50 to shoot her in the fucking face | negative | 0 |
| 6238 | @user won twice on the last day to stay up. Finished 10th and carling cup final. Sent West Ham down!!! Lot of good times\u002c more to come | positive | 2 |
| 38022 | I want to go see Equalizer tomorrow | positive | 2 |
| 41141 | Come celebrate National Hot Dog Day tomorrow with us at #spsmarket and get a $1.00 off any of our hot dogs all... | positive | 2 |
| 35598 | @user Jan, I just spoke to Joe Zellner & SMB team, they are fine. We even had two couples who went ahead with their weddings today :)." | positive | 2 |
| 13805 | Jane was ever-so-smart on Sunday's #TheMentalist. Great performance by Owain too! I like the new style adopted in season 5. Keep it up! | positive | 2 |
| 17403 | @user i see this fight as pointless and declare you as my equal in gay, may we share the title for eternity" | negative | 0 |
Enter the index (the number in the first column) as selected_index to see the row. The limit value controls how much of the text you see. Set a higher limit to see more of the text or set it to 0 to see all of the text.
# We can display the full text of a selected article by dataframe index
selected_index = 10
preview_row_text(dfs['train'], selected_index, text_column = text_column, limit=400) # change limit to see more of the text if needed
| Attribute | Value |
|---|---|
| label_name | neutral |
| label | 1 |
text: @user Well said on HMW. Can you now address why Texans fans file out of the stadium midway through the 4th qtr of every game?
2.5 Dataset class distribution
Class counts at each stage of data preparation: raw HF splits, then filtered to the target labels, then after undersampling. The validation split is used as the held-out test set throughout.
_lmap = {0: 'negative', 1: 'neutral', 2: 'positive'}
train_raw = pd.read_csv(DATA_DIR / 'tweets_train.csv')
val_raw = pd.read_csv(DATA_DIR / 'tweets_validation.csv')
stages = [
('Train — raw HF', train_raw, None),
('Train — filtered to target', train_raw[train_raw['label'].isin(target_classes)], None),
('Train — after undersampling', None, (y_train, target_classes)),
('Test — validation (filtered)', val_raw[val_raw['label'].isin(target_classes)], None),
]
col_w = max(len(n) for n, _, _ in stages) + 2
hdr = f"{'Stage':{col_w}} {'negative':>10} {'neutral':>10} {'positive':>10} {'Total':>10}"
sep = '-' * len(hdr)
print(hdr)
print(sep)
for name, df, arr_info in stages:
    if arr_info is not None:
        # Stage given as a label array: count each target class directly.
        y_arr, classes = arr_info
        counts = {c: int((y_arr == c).sum()) for c in classes}
    else:
        counts = df['label'].value_counts().to_dict()
    neg = counts.get(0, 0)
    neu = counts.get(1, 0)
    pos = counts.get(2, 0)
    tot = neg + neu + pos
    print(f"{name:{col_w}} {neg:>10,} {neu:>10,} {pos:>10,} {tot:>10,}")
print(sep)
| Stage | negative | neutral | positive | Total |
|---|---|---|---|---|
| Train: raw HF | 7,093 | 20,673 | 17,849 | 45,615 |
| Train: filtered to target | 7,093 | 20,673 | 17,849 | 45,615 |
| Train: after undersampling | 7,093 | 7,093 | 7,093 | 21,279 |
| Test: validation (filtered) | 312 | 869 | 819 | 2,000 |
3. Create a classification pipeline and train a model
Create a scikit-learn pipeline to preprocess the texts and train a classification model. The pipeline components will be added as you work through the notebook. There are a number of pipeline components you can access through the textplumber package. You will have an opportunity to learn about this in labs, but documentation is available here.
To speed up preprocessing, some of the pipeline components store the preprocessed data in a cache to avoid recomputing it. Run this as is: it will create an SQLite file named after your dataset option in the directory of the notebook. This will speed up some repeated processing (e.g. tokenization with spaCy).
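The general idea of caching preprocessing results in SQLite can be sketched with the standard library alone. This is not how textplumber's `TextFeatureStore` is implemented (its internals are not shown in this notebook); the class `TinyTokenCache`, its table layout and the whitespace tokenizer are all hypothetical.

```python
import json
import sqlite3

class TinyTokenCache:
    """Hypothetical cache: stores the tokens of each text in an SQLite table."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS tokens (text TEXT PRIMARY KEY, tokens TEXT)")
        self.misses = 0  # number of times the tokens had to be computed

    def tokenize(self, text):
        row = self.conn.execute(
            "SELECT tokens FROM tokens WHERE text = ?", (text,)).fetchone()
        if row is not None:
            return json.loads(row[0])  # cache hit: reuse the stored result
        self.misses += 1
        tokens = text.lower().split()  # stand-in for an expensive spaCy call
        self.conn.execute(
            "INSERT INTO tokens VALUES (?, ?)", (text, json.dumps(tokens)))
        self.conn.commit()
        return tokens

cache = TinyTokenCache()
first = cache.tokenize("Best day EVER")
second = cache.tokenize("Best day EVER")
print(first, cache.misses)
```

With a file path instead of `":memory:"`, the stored results would survive a kernel restart, which is where most of the time saving in repeated runs would come from.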
feature_store = TextFeatureStore(f'assignment-{dataset_option}.sqlite')
The pipeline below includes a number of different components. Most are commented out on the first run of the notebook. There are lots of options for each component. You will need to look at the documentation and examples in labs to learn about these. These components can extract different kinds of features, any of which can be applied to build a model. The feature types include:
- Token features
- Bigram features
- Parts of speech features
- Lexicon-based features
- Document-level statistics
- Text embeddings
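How several feature types end up in one matrix can be shown with plain scikit-learn. The sketch below combines token counts with two document-level statistics through `FeatureUnion`. It deliberately uses `CountVectorizer` and `FunctionTransformer` rather than the textplumber components, and the `text_stats` helper and example texts are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

def text_stats(texts):
    """Two document-level statistics per text: character count and '!' count."""
    return np.array([[len(t), t.count("!")] for t in texts], dtype=float)

features = FeatureUnion([
    ("tokens", CountVectorizer(max_features=5)),   # token count columns
    ("stats", FunctionTransformer(text_stats)),    # two statistic columns
])

docs = ["Best day ever!", "I hate this game", "See you tomorrow!!"]
X = features.fit_transform(docs)

# Columns from each transformer are concatenated side by side.
print(X.shape)
```

Each named transformer contributes its own block of columns, so adding or removing a feature type changes the width of the matrix but not the rest of the pipeline.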
pipeline = Pipeline([
('cleaner', TextCleaner(strip_whitespace=True)), # for the essay dataset you should use character_replacements = {"’": "'", '“': '"', '”': '"',}
('spacy', SpacyPreprocessor(feature_store=feature_store)),
('features', FeatureUnion([
('tokens', # token features - these can be single tokens or ngrams of tokens using TokensVectorizer - see textplumber documentation for examples
Pipeline([
('spacy_token_vectorizer', TokensVectorizer(feature_store = feature_store, vectorizer_type='count', max_features=100, lowercase = True, remove_punctuation = True, stop_words = stop_words, min_df=0.0, max_df=1.0, ngram_range=(1, 1))),
# ('selector', SelectKBest(score_func=mutual_info_classif, k=100)), # uncomment for feature selection
# ('scaler', StandardScaler(with_mean=False)),
], verbose = True)),
# ('pos', # pos features - these can be a single label or ngrams of pos tags using POSVectorizer - see textplumber documentation for examples
# Pipeline([
# ('spacy_pos_vectorizer', POSVectorizer(feature_store=feature_store)),
# #('selector', SelectKBest(score_func=mutual_info_classif, k=5)),
# ('scaler', StandardScaler(with_mean=False)),
# ], verbose = True)),
#('textstats', # document-level text statistics using TextstatsTransformer - see textplumber documentation for examples
# Pipeline([
# ('textstats_vectorizer', TextstatsTransformer(feature_store=feature_store)),
# ('scaler', StandardScaler(with_mean=False)),
# ], verbose = True)),
# ('lexicon', # lexicon features - defined above are empath_lexicons, sentiment_lexicons and stop_words_lexicon - see textplumber documentation for examples
# Pipeline([
# ('lexicon_vectorizer', LexiconCountVectorizer(feature_store=feature_store, lexicons=empath_lexicons)), # the notebook has already provided example lexicons right at the top!
# #('selector', SelectKBest(score_func=mutual_info_classif, k=5)),
# ('scaler', StandardScaler(with_mean=False)),
# ], verbose = True)),
# ('embeddings', Model2VecEmbedder(feature_store=feature_store)), # extract embeddings using Model2Vec - textplumber documentation for examples
], verbose = True)),
('classifier', LogisticRegression(max_iter=5000, random_state=42)) # for logistic regression - only select one classifier!
#('classifier', DecisionTreeClassifier(max_depth = 3, random_state=42)) # for decision tree - only select one classifier!
], verbose = True) # using verbose because I like to see what is going on
display(pipeline)
Pipeline(steps=[('cleaner', TextCleaner(strip_whitespace=True)),
('spacy',
SpacyPreprocessor(feature_store=<textplumber.store.TextFeatureStore object at 0x7f919e99be30>)),
('features',
FeatureUnion(transformer_list=[('tokens',
Pipeline(steps=[('spacy_token_vectorizer',
TokensVectorizer(feature_store=<textplumber.store.TextFeatureStore object at 0x7f919e99be30>,...
remove_punctuation=True,
stop_words=["'d",
"'ll",
"'m",
"'re",
"'s",
"'ve",
'a',
'about',
'above',
'after',
'again',
'against',
'ain',
'all',
'am',
'an',
'and',
'any',
'are',
'aren',
'as',
'at',
'be',
'because',
'been',
'before',
'being',
'below',
'between',
'both', ...]))],
verbose=True))],
verbose=True)),
('classifier',
LogisticRegression(max_iter=5000, random_state=42))],
verbose=True)
import time
_df3 = pd.read_parquet(PARQUET_3CLASS)
X_train = _df3.loc[_df3['split'] == 'train', 'text'].to_numpy()
y_train = _df3.loc[_df3['split'] == 'train', 'label'].to_numpy()
X_test = _df3.loc[_df3['split'] == 'test', 'text'].to_numpy()
y_test = _df3.loc[_df3['split'] == 'test', 'label'].to_numpy()
print(f'3-class splits loaded (train={len(X_train):,}, test={len(X_test):,})')
feat_configs = setup_feature_configs(feature_store, stop_words, vader_lexicons)
# RUN 1: 3-class + Logistic Regression
run1 = 'run_3class_LR'
loaded = load_results(run1)
if loaded:
    results_3class_LR = loaded
    print(f"CELL : Loaded {run1} from cache")
else:
    print(f"CELL : Computing {run1}...")
    results_3class_LR = {}
    start = time.time()
    for feat_name, feat_union in feat_configs.items():
        train_and_evaluate_model(feat_name, feat_union, 'LR', LogisticRegression(max_iter=5000, random_state=42),
                                 X_train, y_train, X_test, y_test, target_classes, target_names,
                                 feature_store, results_3class_LR, is_binary=False, store_pipeline=True, silent=True)
    elapsed = time.time() - start
    save_results(results_3class_LR, run1)
    print(f"CELL : Completed {run1} in {elapsed:.1f}s")
# RUN 2: 3-class + Decision Tree
run2 = 'run_3class_DT'
loaded = load_results(run2)
if loaded:
    results_3class_DT = loaded
    print(f"CELL : Loaded {run2} from cache")
else:
    print(f"CELL : Computing {run2}...")
    results_3class_DT = {}
    start = time.time()
    for feat_name, feat_union in feat_configs.items():
        train_and_evaluate_model(feat_name, feat_union, 'DT', DecisionTreeClassifier(max_depth=3, random_state=42),
                                 X_train, y_train, X_test, y_test, target_classes, target_names,
                                 feature_store, results_3class_DT, is_binary=False, store_pipeline=True, silent=True)
    elapsed = time.time() - start
    save_results(results_3class_DT, run2)
    print(f"CELL : Completed {run2} in {elapsed:.1f}s")
3-class splits loaded (train=21,279, test=2,000)
CELL : Computing run_3class_LR...
CELL : Completed run_3class_LR in 41.2s
CELL : Computing run_3class_DT...
CELL : Completed run_3class_DT in 41.3s
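The cell above depends on `load_results` and `save_results`, which are defined in an earlier cell and not shown in this section. As a rough illustration of the compute-once-then-reload pattern, here is a minimal sketch that assumes results are pickled to one file per run. The names `save_cached`, `load_cached` and `CACHE_DIR` are hypothetical and are not the notebook's actual helpers.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical cache location; the notebook's real cache directory is not shown here.
CACHE_DIR = Path(tempfile.gettempdir()) / "digi405_results_cache"


def save_cached(results, run_name, cache_dir=CACHE_DIR):
    """Pickle a results dict to <cache_dir>/<run_name>.pkl."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    with open(cache_dir / f"{run_name}.pkl", "wb") as fh:
        pickle.dump(results, fh)


def load_cached(run_name, cache_dir=CACHE_DIR):
    """Return the cached results dict, or None when no cache file exists."""
    path = cache_dir / f"{run_name}.pkl"
    if not path.exists():
        return None
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

With helpers like these, testing `loaded is not None` rather than `if loaded:` would keep a legitimately empty results dict from being treated as a cache miss.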
_best_LR = save_run(1, "RUN 1: 3-class + LR", results_3class_LR, 'LR', y_test, target_classes, target_names)
RUN 1: 3-class + LR
SAVED : cm_run1_unigrams_LR
SAVED : cm_run1_bigrams_LR
SAVED : cm_run1_uni+pos_LR
SAVED : cm_run1_textstats_LR
SAVED : cm_run1_lexicon_LR
SAVED : cm_run1_embeddings_LR
best: embeddings x LR macro F1 = 0.590
Results for all six feature configurations × two classifiers are printed above. pipeline and y_predicted are set to the best-performing combination (highest macro F1) so that the evaluation cells work without modification.
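`save_run` is defined earlier in the notebook and its body is not shown in this section. The selection it reports could be sketched as follows, assuming each results dict is keyed by `(feature_set, classifier)` and stores a `macro_f1` entry, which is the shape the summary-table cells in this notebook read from. `best_by_macro_f1` is a hypothetical helper name, and the values are the rounded 3-class LR scores reported in this notebook.

```python
# Rounded macro F1 scores for the 3-class LR run, as reported in this notebook.
results = {
    ("unigrams", "LR"): {"macro_f1": 0.467},
    ("bigrams", "LR"): {"macro_f1": 0.334},
    ("uni+pos", "LR"): {"macro_f1": 0.467},
    ("textstats", "LR"): {"macro_f1": 0.389},
    ("lexicon", "LR"): {"macro_f1": 0.540},
    ("embeddings", "LR"): {"macro_f1": 0.590},
}


def best_by_macro_f1(results):
    """Return (key, macro_f1) for the entry with the highest macro F1."""
    best_key = max(results, key=lambda k: results[k]["macro_f1"])
    return best_key, results[best_key]["macro_f1"]


best_key, best_f1 = best_by_macro_f1(results)
print(f"best: {best_key[0]} x {best_key[1]} macro F1 = {best_f1:.3f}")
```

Ties are resolved in favour of the first entry encountered, so the order of the feature configurations matters when two scores are equal (unigrams and uni+pos both round to 0.467 here).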
_best_DT = save_run(2, "RUN 2: 3-class + DT", results_3class_DT, 'DT', y_test, target_classes, target_names)
RUN 2: 3-class + DT
SAVED : cm_run2_unigrams_DT
SAVED : cm_run2_bigrams_DT
SAVED : cm_run2_uni+pos_DT
SAVED : cm_run2_textstats_DT
SAVED : cm_run2_lexicon_DT
SAVED : cm_run2_embeddings_DT
best: lexicon x DT macro F1 = 0.539
# # Note: Version 0.2.1 commented grid search out by default as intended
# # Note: if you get a warning about tokenizers and parallelism - uncomment this line
# # os.environ["TOKENIZERS_PARALLELISM"] = "false"
# # setup gridsearch to test different max_features
# from sklearn.model_selection import GridSearchCV
# param_grid = {
# 'features__tokens__spacy_token_vectorizer__max_features': [50, 100, 150, 200, 250, 300], # this assumes you are using the tokens part of the pipeline
# # 'features__tokens__selector__k': [50, 100, 150, 200, 250, 300], # this assumes you have enabled the selector for tokens
# }
# grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='f1_macro', verbose=100, n_jobs=1)
# grid_search.fit(X_train, y_train)
# print('\n-----------------------------------------------------------------')
# print("Best parameters found: ", grid_search.best_params_)
# print("Best score found: ", grid_search.best_score_)
# print('-----------------------------------------------------------------\n')
# y_pred = grid_search.predict(X_test)
# print(classification_report(y_test, y_pred, target_names = target_names, digits=3))
# plot_confusion_matrix(y_test, y_pred, target_classes, target_names)
3.x Binary task (negative vs positive)
Repeats the experiment with neutral tweets removed. This addresses CQ2: if classifier performance on the binary task is substantially higher than on the 3-class task, it suggests that neutral is the primary source of labelling ambiguity rather than the negative/positive boundary.
from sklearn.base import clone
_dfb = pd.read_parquet(PARQUET_BINARY)
X_train_b = _dfb.loc[_dfb['split'] == 'train', 'text'].to_numpy()
y_train_b = _dfb.loc[_dfb['split'] == 'train', 'label'].to_numpy()
X_test_b = _dfb.loc[_dfb['split'] == 'test', 'text'].to_numpy()
y_test_b = _dfb.loc[_dfb['split'] == 'test', 'label'].to_numpy()
print(f'binary splits loaded (train={len(X_train_b):,}, test={len(X_test_b):,})')
bin_classes = [0, 2]
bin_names = ['negative', 'positive']
feat_configs = setup_feature_configs(feature_store, stop_words, vader_lexicons)
# RUN 3: Binary + Logistic Regression
run3 = 'run_binary_LR'
loaded = load_results(run3)
if loaded:
    results_binary_LR = loaded
    print(f"CELL : Loaded {run3} from cache")
else:
    print(f"CELL : Computing {run3}...")
    results_binary_LR = {}
    start = time.time()
    for feat_name, feat_union in feat_configs.items():
        train_and_evaluate_model(feat_name, feat_union, 'LR', LogisticRegression(max_iter=5000, random_state=42),
                                 X_train_b, y_train_b, X_test_b, y_test_b, bin_classes, bin_names,
                                 feature_store, results_binary_LR, is_binary=True, store_pipeline=False, silent=True)
    elapsed = time.time() - start
    save_results(results_binary_LR, run3)
    print(f"CELL : Completed {run3} in {elapsed:.1f}s")
# RUN 4: Binary + Decision Tree
run4 = 'run_binary_DT'
loaded = load_results(run4)
if loaded:
    results_binary_DT = loaded
    print(f"CELL : Loaded {run4} from cache")
else:
    print(f"CELL : Computing {run4}...")
    results_binary_DT = {}
    start = time.time()
    for feat_name, feat_union in feat_configs.items():
        train_and_evaluate_model(feat_name, feat_union, 'DT', DecisionTreeClassifier(max_depth=3, random_state=42),
                                 X_train_b, y_train_b, X_test_b, y_test_b, bin_classes, bin_names,
                                 feature_store, results_binary_DT, is_binary=True, store_pipeline=False, silent=True)
    elapsed = time.time() - start
    save_results(results_binary_DT, run4)
    print(f"CELL : Completed {run4} in {elapsed:.1f}s")
binary splits loaded (train=14,186, test=1,131)
CELL : Computing run_binary_LR...
CELL : Completed run_binary_LR in 21.3s
CELL : Computing run_binary_DT...
CELL : Completed run_binary_DT in 22.6s
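The binary splits are read from a parquet file prepared elsewhere, so the filtering step itself is not visible in this section. A minimal sketch of how neutral tweets might be dropped is shown below. It assumes the label ids 0 = negative, 1 = neutral, 2 = positive (consistent with `bin_classes = [0, 2]` above); `drop_neutral` is a hypothetical helper and the example tweets are invented.

```python
# Assumed label ids: 0 = negative, 1 = neutral, 2 = positive.
NEGATIVE, NEUTRAL, POSITIVE = 0, 1, 2


def drop_neutral(texts, labels, neutral=NEUTRAL):
    """Keep only the (text, label) pairs whose label is not the neutral class."""
    kept = [(t, y) for t, y in zip(texts, labels) if y != neutral]
    texts_b = [t for t, _ in kept]
    labels_b = [y for _, y in kept]
    return texts_b, labels_b


# Invented example tweets, for illustration only.
toy_texts = ["worst game ever", "match starts at 7pm", "what a win!"]
toy_labels = [NEGATIVE, NEUTRAL, POSITIVE]

texts_b, labels_b = drop_neutral(toy_texts, toy_labels)
print(texts_b, labels_b)
```

The remaining labels keep their original ids (0 and 2) rather than being renumbered to 0 and 1, which is why the binary runs pass `[0, 2]` as the class list.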
_best_bin_LR = save_run(3, "RUN 3: binary + LR", results_binary_LR, 'LR', y_test_b, bin_classes, bin_names)
RUN 3: binary + LR
SAVED : cm_run3_unigrams_LR
SAVED : cm_run3_bigrams_LR
SAVED : cm_run3_uni+pos_LR
SAVED : cm_run3_textstats_LR
SAVED : cm_run3_lexicon_LR
SAVED : cm_run3_embeddings_LR
best: embeddings x LR macro F1 = 0.806
_best_bin_DT = save_run(4, "RUN 4: binary + DT", results_binary_DT, 'DT', y_test_b, bin_classes, bin_names)
RUN 4: binary + DT
SAVED : cm_run4_unigrams_DT
SAVED : cm_run4_bigrams_DT
SAVED : cm_run4_uni+pos_DT
SAVED : cm_run4_textstats_DT
SAVED : cm_run4_lexicon_DT
SAVED : cm_run4_embeddings_DT
best: lexicon x DT macro F1 = 0.638
feat_names = list(feat_configs.keys())
col = 8
print(f"\n{'3-CLASS TASK':48}")
print(f"{'Feature set':14} {'LR':>{col}} {'DT':>{col}}")
print('-' * 40)
for fn in feat_names:
    f1_lr = results_3class_LR[(fn, 'LR')]['macro_f1']
    f1_dt = results_3class_DT[(fn, 'DT')]['macro_f1']
    print(f"{fn:14} {f1_lr:>{col}.3f} {f1_dt:>{col}.3f}")
print(f"\n{'BINARY TASK':48}")
print(f"{'Feature set':14} {'LR':>{col}} {'DT':>{col}}")
print('-' * 40)
for fn in feat_names:
    f1_lr = results_binary_LR[(fn, 'LR')]['macro_f1']
    f1_dt = results_binary_DT[(fn, 'DT')]['macro_f1']
    print(f"{fn:14} {f1_lr:>{col}.3f} {f1_dt:>{col}.3f}")
3-CLASS TASK
Feature set          LR       DT
----------------------------------------
unigrams          0.467    0.277
bigrams           0.334    0.232
uni+pos           0.467    0.327
textstats         0.389    0.368
lexicon           0.540    0.539
embeddings        0.590    0.419

BINARY TASK
Feature set          LR       DT
----------------------------------------
unigrams          0.667    0.521
bigrams           0.525    0.448
uni+pos           0.656    0.561
textstats         0.538    0.553
lexicon           0.639    0.638
embeddings        0.806    0.614
print(f"\n{'LOGISTIC REGRESSION':48}")
print(f"{'Feature set':14} {'3-class':>8} {'binary':>8} {'gain':>8}")
print('-' * 52)
for fn in feat_names:
    f1_3 = results_3class_LR[(fn, 'LR')]['macro_f1']
    f1_b = results_binary_LR[(fn, 'LR')]['macro_f1']
    print(f"{fn:14} {f1_3:>8.3f} {f1_b:>8.3f} {f1_b - f1_3:>+8.3f}")
print(f"\n{'DECISION TREE':48}")
print(f"{'Feature set':14} {'3-class':>8} {'binary':>8} {'gain':>8}")
print('-' * 52)
for fn in feat_names:
    f1_3 = results_3class_DT[(fn, 'DT')]['macro_f1']
    f1_b = results_binary_DT[(fn, 'DT')]['macro_f1']
    print(f"{fn:14} {f1_3:>8.3f} {f1_b:>8.3f} {f1_b - f1_3:>+8.3f}")
LOGISTIC REGRESSION
Feature set     3-class   binary     gain
----------------------------------------------------
unigrams          0.467    0.667   +0.200
bigrams           0.334    0.525   +0.191
uni+pos           0.467    0.656   +0.189
textstats         0.389    0.538   +0.149
lexicon           0.540    0.639   +0.099
embeddings        0.590    0.806   +0.216

DECISION TREE
Feature set     3-class   binary     gain
----------------------------------------------------
unigrams          0.277    0.521   +0.244
bigrams           0.232    0.448   +0.217
uni+pos           0.327    0.561   +0.234
textstats         0.368    0.553   +0.185
lexicon           0.539    0.638   +0.098
embeddings        0.419    0.614   +0.195
4. Evaluate your model and investigate model predictions
You already have some metrics in the cell above. Below is some additional reporting to help you understand your model.
4.1 Classifier-specific features
If you are using a Decision Tree classifier in your pipeline, this will plot it ...
if pipeline.named_steps['classifier'].__class__.__name__ == 'DecisionTreeClassifier':
    fig_path = FIGS_DIR / 'decision_tree.png'
    with plt.ioff():
        plot_decision_tree_from_pipeline(pipeline, X_train, y_train, target_classes, target_names, 'classifier', 'features')
        plt.savefig(fig_path, bbox_inches='tight', dpi=150)
        plt.close('all')
    print(f" [saved] decision_tree.png")
else:
    print('The classifier is not a decision tree - so no plot is shown!')
The classifier is not a decision tree - so no plot is shown!
If you are using a Logistic Regression classifier in your pipeline, this will plot the coefficients of the features in the model.
if pipeline.named_steps['classifier'].__class__.__name__ == 'LogisticRegression':
    try:
        fig_path = FIGS_DIR / 'lr_features.png'
        with plt.ioff():
            plot_logistic_regression_features_from_pipeline(pipeline, target_classes, target_names, top_n=20, classifier_step_name='classifier', features_step_name='features')
            plt.savefig(fig_path, bbox_inches='tight', dpi=150)
            plt.close('all')
        print(f" [saved] lr_features.png")
    except AttributeError:
        print("[SKIP] plot_logistic_regression_features_from_pipeline: TokensVectorizer does not support get_feature_names_out()")
[SKIP] plot_logistic_regression_features_from_pipeline: TokensVectorizer does not support get_feature_names_out()
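Because the plot was skipped, it may help to see what a coefficient ranking involves. The sketch below assumes a coefficient matrix with one row per class and one column per feature (the layout of scikit-learn's `LogisticRegression.coef_` for a multiclass model) and a matching list of feature names. `top_features_per_class` is a hypothetical helper, and the feature names and coefficient values are invented for illustration; they are not taken from the fitted model.

```python
def top_features_per_class(coef, feature_names, class_names, top_n=3):
    """Rank features by coefficient within each class, largest first.

    coef: list of rows, one per class, each with one value per feature.
    """
    ranked = {}
    for class_name, row in zip(class_names, coef):
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked[class_name] = [(feature_names[j], row[j]) for j in order[:top_n]]
    return ranked


# Invented values, for illustration only.
toy_features = ["excellent", "terrible", "tomorrow"]
toy_coef = [
    [-1.2, 1.5, 0.1],   # negative
    [-0.3, -0.4, 0.6],  # neutral
    [1.4, -1.1, -0.2],  # positive
]

ranked = top_features_per_class(toy_coef, toy_features, ["negative", "neutral", "positive"], top_n=1)
for class_name, feats in ranked.items():
    print(class_name, feats)
```

A ranking like this only reads well when the features are named tokens or lexicon scores; for embedding dimensions the names carry no meaning on their own.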
from IPython.display import display, Image as IPImage
run_specs = [
    ("RUN 1: 3-CLASS + LOGISTIC REGRESSION", 1, list(feat_configs.keys()), 'LR'),
    ("RUN 2: 3-CLASS + DECISION TREE", 2, list(feat_configs.keys()), 'DT'),
    ("RUN 3: BINARY + LOGISTIC REGRESSION", 3, list(feat_configs.keys()), 'LR'),
    ("RUN 4: BINARY + DECISION TREE", 4, list(feat_configs.keys()), 'DT'),
]
for run_title, run_num, feat_names_list, clf_name in run_specs:
    print(f"\n{run_title}")
    for feat_name in feat_names_list:
        stem = f'cm_run{run_num}_{feat_name}_{clf_name}'
        csv_path = FIGS_DIR / f'{stem}.csv'
        fig_path = FIGS_DIR / f'{stem}.png'
        print(f'\n {feat_name} x {clf_name}')
        if csv_path.exists():
            rpt_df = pd.read_csv(csv_path, index_col=0)
            print(rpt_df.to_string())
        else:
            print(f" MISSING : {csv_path.name}")
        if fig_path.exists():
            display(IPImage(str(fig_path)))
        else:
            print(f" MISSING : {fig_path.name}")
RUN 1: 3-CLASS + LOGISTIC REGRESSION
unigrams x LR
precision recall f1-score support
negative 0.263407 0.535256 0.353066 312.000
neutral 0.543563 0.481013 0.510379 869.000
positive 0.638191 0.465201 0.538136 819.000
accuracy 0.483000 0.483000 0.483000 0.483
macro avg 0.481720 0.493824 0.467193 2000.000
weighted avg 0.538609 0.483000 0.497204 2000.000
bigrams x LR
precision recall f1-score support
negative 0.192369 0.775641 0.308280 312.0000
neutral 0.547297 0.186421 0.278112 869.0000
positive 0.589686 0.321123 0.415810 819.0000
accuracy 0.333500 0.333500 0.333500 0.3335
macro avg 0.443117 0.427729 0.334067 2000.0000
weighted avg 0.509287 0.333500 0.339206 2000.0000
uni+pos x LR
precision recall f1-score support
negative 0.264179 0.567308 0.360489 312.000
neutral 0.564626 0.477560 0.517456 869.000
positive 0.621849 0.451770 0.523338 819.000
accuracy 0.481000 0.481000 0.481000 0.481
macro avg 0.483551 0.498880 0.467094 2000.000
weighted avg 0.541189 0.481000 0.495378 2000.000
textstats x LR
precision recall f1-score support
negative 0.203779 0.483974 0.286800 312.0000
neutral 0.529231 0.395857 0.452930 869.0000
positive 0.502463 0.373626 0.428571 819.0000
accuracy 0.400500 0.400500 0.400500 0.4005
macro avg 0.411824 0.417819 0.389434 2000.0000
weighted avg 0.467499 0.400500 0.417039 2000.0000
lexicon x LR
precision recall f1-score support
negative 0.402778 0.464744 0.431548 312.0000
neutral 0.591687 0.556962 0.573800 869.0000
positive 0.613139 0.615385 0.614260 819.0000
accuracy 0.566500 0.566500 0.566500 0.5665
macro avg 0.535868 0.545697 0.539869 2000.0000
weighted avg 0.571002 0.566500 0.568177 2000.0000
embeddings x LR
precision recall f1-score support
negative 0.410681 0.714744 0.521637 312.000
neutral 0.645441 0.513234 0.571795 869.000
positive 0.701044 0.655678 0.677603 819.000
accuracy 0.603000 0.603000 0.603000 0.603
macro avg 0.585722 0.627885 0.590345 2000.000
weighted avg 0.631588 0.603000 0.607299 2000.000
RUN 2: 3-CLASS + DECISION TREE
unigrams x DT
precision recall f1-score support
negative 0.000000 0.000000 0.000000 312.000
neutral 0.453575 0.978136 0.619759 869.000
positive 0.793651 0.122100 0.211640 819.000
accuracy 0.475000 0.475000 0.475000 0.475
macro avg 0.415742 0.366745 0.277133 2000.000
weighted avg 0.522078 0.475000 0.355952 2000.000
bigrams x DT
precision recall f1-score support
negative 0.450000 0.028846 0.054217 312.000
neutral 0.438614 0.990794 0.608051 869.000
positive 0.823529 0.017094 0.033493 819.000
accuracy 0.442000 0.442000 0.442000 0.442
macro avg 0.570715 0.345578 0.231920 2000.000
weighted avg 0.598013 0.442000 0.286371 2000.000
uni+pos x DT
precision recall f1-score support
negative 0.202934 0.532051 0.293805 312.000
neutral 0.467281 0.583429 0.518936 869.000
positive 0.793814 0.094017 0.168122 819.000
accuracy 0.375000 0.375000 0.375000 0.375
macro avg 0.488010 0.403166 0.326954 2000.000
weighted avg 0.559758 0.375000 0.340157 2000.000
textstats x DT
precision recall f1-score support
negative 0.208633 0.557692 0.303665 312.0000
neutral 0.582822 0.218642 0.317992 869.0000
positive 0.477381 0.489621 0.483424 819.0000
accuracy 0.382500 0.382500 0.382500 0.3825
macro avg 0.422945 0.421985 0.368360 2000.0000
weighted avg 0.481270 0.382500 0.383501 2000.0000
lexicon x DT
precision recall f1-score support
negative 0.400000 0.474359 0.434018 312.000
neutral 0.591687 0.556962 0.573800 869.000
positive 0.613300 0.608059 0.610668 819.000
accuracy 0.565000 0.565000 0.565000 0.565
macro avg 0.534996 0.546460 0.539495 2000.000
weighted avg 0.570635 0.565000 0.567091 2000.000
embeddings x DT
precision recall f1-score support
negative 0.259823 0.657051 0.372389 312.0000
neutral 0.503597 0.483314 0.493247 869.0000
positive 0.620690 0.285714 0.391304 819.0000
accuracy 0.429500 0.429500 0.429500 0.4295
macro avg 0.461370 0.475360 0.418980 2000.0000
weighted avg 0.513518 0.429500 0.432648 2000.0000
RUN 3: BINARY + LOGISTIC REGRESSION
unigrams x LR
precision recall f1-score support
negative 0.464427 0.753205 0.574572 312.000000
positive 0.876800 0.669109 0.759003 819.000000
accuracy 0.692308 0.692308 0.692308 0.692308
macro avg 0.670613 0.711157 0.666787 1131.000000
weighted avg 0.763042 0.692308 0.708125 1131.000000
bigrams x LR
precision recall f1-score support
negative 0.352632 0.858974 0.500000 312.000000
positive 0.881402 0.399267 0.549580 819.000000
accuracy 0.526083 0.526083 0.526083 0.526083
macro avg 0.617017 0.629121 0.524790 1131.000000
weighted avg 0.735534 0.526083 0.535903 1131.000000
uni+pos x LR
precision recall f1-score support
negative 0.451923 0.753205 0.564904 312.000000
positive 0.873977 0.652015 0.746853 819.000000
accuracy 0.679929 0.679929 0.679929 0.679929
macro avg 0.662950 0.702610 0.655878 1131.000000
weighted avg 0.757548 0.679929 0.696660 1131.000000
textstats x LR
precision recall f1-score support
negative 0.338004 0.618590 0.437146 312.000000
positive 0.787500 0.538462 0.639594 819.000000
accuracy 0.560566 0.560566 0.560566 0.560566
macro avg 0.562752 0.578526 0.538370 1131.000000
weighted avg 0.663501 0.560566 0.583746 1131.000000
lexicon x LR
precision recall f1-score support
negative 0.433453 0.772436 0.555300 312.000000
positive 0.876522 0.615385 0.723099 819.000000
accuracy 0.658709 0.658709 0.658709 0.658709
macro avg 0.654987 0.693910 0.639199 1131.000000
weighted avg 0.754296 0.658709 0.676809 1131.000000
embeddings x LR
precision recall f1-score support
negative 0.646489 0.855769 0.736552 312.000000
positive 0.937326 0.821734 0.875732 819.000000
accuracy 0.831123 0.831123 0.831123 0.831123
macro avg 0.791908 0.838752 0.806142 1131.000000
weighted avg 0.857095 0.831123 0.837337 1131.000000
RUN 4: BINARY + DECISION TREE
unigrams x DT
precision recall f1-score support
negative 0.351190 0.189103 0.245833 312.000000
positive 0.737279 0.866911 0.796857 819.000000
accuracy 0.679929 0.679929 0.679929 0.679929
macro avg 0.544235 0.528007 0.521345 1131.000000
weighted avg 0.630772 0.679929 0.644851 1131.000000
bigrams x DT
precision recall f1-score support
negative 0.692308 0.028846 0.055385 312.000000
positive 0.728980 0.995116 0.841507 819.000000
accuracy 0.728559 0.728559 0.728559 0.728559
macro avg 0.710644 0.511981 0.448446 1131.000000
weighted avg 0.718864 0.728559 0.624646 1131.000000
uni+pos x DT
precision recall f1-score support
negative 0.355499 0.445513 0.395448 312.000000
positive 0.766216 0.692308 0.727389 819.000000
accuracy 0.624226 0.624226 0.624226 0.624226
macro avg 0.560857 0.568910 0.561419 1131.000000
weighted avg 0.652915 0.624226 0.635819 1131.000000
textstats x DT
precision recall f1-score support
negative 0.348881 0.599359 0.441038 312.000000
positive 0.789916 0.573871 0.664781 819.000000
accuracy 0.580902 0.580902 0.580902 0.580902
macro avg 0.569398 0.586615 0.552909 1131.000000
weighted avg 0.668251 0.580902 0.603059 1131.000000
lexicon x DT
precision recall f1-score support
negative 0.431858 0.782051 0.556442 312.000000
positive 0.879859 0.608059 0.719134 819.000000
accuracy 0.656057 0.656057 0.656057 0.656057
macro avg 0.655859 0.695055 0.637788 1131.000000
weighted avg 0.756272 0.656057 0.674253 1131.000000
embeddings x DT
precision recall f1-score support
negative 0.410714 0.810897 0.545259 312.000000
positive 0.885437 0.556777 0.683658 819.000000
accuracy 0.626879 0.626879 0.626879 0.626879
macro avg 0.648076 0.683837 0.614458 1131.000000
weighted avg 0.754479 0.626879 0.645479 1131.000000
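The difference between the `macro avg` and `weighted avg` rows in the reports above can be checked by hand. This sketch recomputes both for the embeddings x LR report on the 3-class task, using the per-class F1 scores and supports printed in that report; the function names are illustrative only.

```python
# Per-class F1 and support from the "embeddings x LR" 3-class report above.
per_class = {
    "negative": {"f1": 0.521637, "support": 312},
    "neutral": {"f1": 0.571795, "support": 869},
    "positive": {"f1": 0.677603, "support": 819},
}


def macro_avg(per_class, metric="f1"):
    """Unweighted mean: every class counts equally, whatever its size."""
    return sum(c[metric] for c in per_class.values()) / len(per_class)


def weighted_avg(per_class, metric="f1"):
    """Support-weighted mean: larger classes pull the average towards them."""
    total = sum(c["support"] for c in per_class.values())
    return sum(c[metric] * c["support"] for c in per_class.values()) / total


print(f"macro avg F1    = {macro_avg(per_class):.6f}")
print(f"weighted avg F1 = {weighted_avg(per_class):.6f}")
```

The negative class has the lowest F1 and the smallest support, so it lowers the macro average more than the weighted one. That is the behaviour the project relies on when it uses macro F1 as the primary metric.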
4.2 Investigate correct and incorrect predictions
To see the predictions of your model run this cell. The output can be quite long depending on the dataset and the number of misclassifications. The Pandas max_rows is configured at the top of the cell to restrict the length of output. You can adjust this as required. This is reset back to the Pandas default at the end of the cell.
pipeline = results_3class_LR[_best_LR[0]]['pipeline']
y_predicted = results_3class_LR[_best_LR[0]]['y_pred']
pd.set_option('display.max_rows', 5)
predictions_df = pd.DataFrame(data = {'true': y_test, 'predicted': y_predicted})
y_predicted_probs = pipeline.predict_proba(X_test)
y_predicted_probs = np.round(y_predicted_probs, 3)
# Note: Version 0.2.2 changed the following line to ensure probability labels are correct regardless of the order of target classes
columns = [f'{label_names[c]}_prob' for c in pipeline.named_steps['classifier'].classes_ if c in target_classes]
predictions_df['predicted'] = predictions_df['predicted'].apply(lambda x: label_names[x])
predictions_df['true'] = predictions_df['true'].apply(lambda x: label_names[x])
predictions_df['correct'] = predictions_df['true'] == predictions_df['predicted']
predictions_df['text'] = X_test
predictions_df = pd.concat([predictions_df, pd.DataFrame(y_predicted_probs, columns=columns)], axis=1)
# One table per (true class, predicted class) pair
for true_name in target_names:
    for predicted_name in target_names:
        if true_name == predicted_name:
            print(f'\nCORRECTLY CLASSIFIED: {true_name}')
        else:
            print(f'\n{true_name} INCORRECTLY CLASSIFIED as: {predicted_name}')
        print('=================================================================')
        display(predictions_df[(predictions_df['true'] == true_name) & (predictions_df['predicted'] == predicted_name)])
pd.set_option('display.max_rows', 60)
CORRECTLY CLASSIFIED: negative
=================================================================
| | true | predicted | correct | text | negative_prob | neutral_prob | positive_prob |
|---|---|---|---|---|---|---|---|
| 7 | negative | negative | True | Omg this show is so predictable even for the 3rd ep. Rui En\u2019s ex boyfriend was framed for murder probably\u002c by the rich guy. | 0.528 | 0.322 | 0.150 |
| 19 | negative | negative | True | The sad part about this is tomorrow Nicki will be the angry black woman who went after poor white girl Miley | 0.985 | 0.013 | 0.001 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1989 | negative | negative | True | @user @user Islam is an Abrahamic faith, Andrew. It may make you feel a little uneasy but it's the same God you worship. Sorry." | 0.762 | 0.196 | 0.042 |
| 1992 | negative | negative | True | kingpin Saudi Arabia posted a record $98 billion budget deficit in 2015 due to the sharp fall in oil prices finance ministry said on Monday | 0.724 | 0.260 | 0.016 |
223 rows × 7 columns
negative INCORRECTLY CLASSIFIED as: neutral
=================================================================
| | true | predicted | correct | text | negative_prob | neutral_prob | positive_prob |
|---|---|---|---|---|---|---|---|
| 2 | negative | neutral | False | When girls become bandwagon fans of the Packers because of Harry. Do y'all even know who Aaron Rodgers is? Or what a 1st down is? | 0.320 | 0.400 | 0.280 |
| 10 | negative | neutral | False | @user so the thing next Thursday isn't free, you'd have to pay $15 to get in since you don't go to UMBC :/ and it ends at 11:30" | 0.352 | 0.528 | 0.121 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1973 | negative | neutral | False | "I do worry about the mentality of the average tweeter when I see Jeremy Kyle\u002c and \""""Christmas\"""" trending on November 1st..." | 0.367 | 0.470 | 0.163 |
| 1993 | negative | neutral | False | @user @user I think after Charlie Hebdo the French did NOT react as the US did after 9/11. But they may do this time around. | 0.333 | 0.579 | 0.088 |
59 rows × 7 columns
negative INCORRECTLY CLASSIFIED as: positive
=================================================================
| | true | predicted | correct | text | negative_prob | neutral_prob | positive_prob |
|---|---|---|---|---|---|---|---|
| 79 | negative | positive | False | "When I'm soaring on Sunday afternoon, I learn Frank Gifford--one of my faves on the field and inside the broadcast booth--has died." | 0.077 | 0.426 | 0.498 |
| 229 | negative | positive | False | just bought my 1st Heineken beer in Las Vegas. ps I\u2019ve lived here for 5 yrs ~what took me so long! | 0.076 | 0.115 | 0.809 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1695 | negative | positive | False | @user @user michael ball is incredible 10th anniversary with him and colm is sick | 0.054 | 0.096 | 0.850 |
| 1835 | negative | positive | False | "\""""@nodoubt: Tune into @user tomorrow for a special @user #PushAndShove News segment during the 7AM & 9AM hours!\"""" NOOOOOOOOO" | 0.284 | 0.288 | 0.427 |
30 rows × 7 columns
neutral INCORRECTLY CLASSIFIED as: negative
=================================================================
| | true | predicted | correct | text | negative_prob | neutral_prob | positive_prob |
|---|---|---|---|---|---|---|---|
| 9 | neutral | negative | False | Irving Plaza NYC Blackout Saturday night. Got limited spots left on the guest list. Tweet me why you think you deserve them | 0.512 | 0.230 | 0.258 |
| 17 | neutral | negative | False | Why do y'all want Nicki to be pregnant so bad like maybe around the 7th album but she's literally still in her prime. | 0.941 | 0.047 | 0.012 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1971 | neutral | negative | False | Bowling tomorrow c; Don\u2019t want things to be awkard lol | 0.410 | 0.269 | 0.321 |
| 1996 | neutral | negative | False | Harper's Worst Offense against Refugees may be Climate Record as rising temperatures add to chaos in the Middle East | 0.936 | 0.060 | 0.005 |
224 rows × 7 columns
CORRECTLY CLASSIFIED: neutral
=================================================================
| | true | predicted | correct | text | negative_prob | neutral_prob | positive_prob |
|---|---|---|---|---|---|---|---|
| 3 | neutral | neutral | True | @user I may or may not have searched it up on google | 0.242 | 0.665 | 0.093 |
| 12 | neutral | neutral | True | We just received more tickets for Blue Rodeo at The KEE to Bala Saturday May 19th and Sunday May 20th. Tickets... | 0.020 | 0.583 | 0.397 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1990 | neutral | neutral | True | "The BAGRANGI new Pic,Of SALMAN khan That VERY FAMOUS IN PAK CENEMA'S at the 1st day of EID that pic,made 1.5 milion Rs Lolywood/Bolywood" | 0.128 | 0.454 | 0.418 |
| 1999 | neutral | neutral | True | "Interview with Devon Alexander \""""Speed Kills\"""" (VIDEO) On Tuesday Oct 16th we had the privilege of catch up with... | 0.247 | 0.457 | 0.296 |
446 rows × 7 columns
neutral INCORRECTLY CLASSIFIED as: positive
=================================================================
| | true | predicted | correct | text | negative_prob | neutral_prob | positive_prob |
|---|---|---|---|---|---|---|---|
| 0 | neutral | positive | False | Dark Souls 3 April Launch Date Confirmed With New Trailer: Embrace the darkness. | 0.054 | 0.359 | 0.587 |
| 4 | neutral | positive | False | Here's your starting TUESDAY MORNING Line up at Gentle Yoga with Laura 9:30 am to 10:30 am... | 0.032 | 0.476 | 0.491 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | neutral | positive | False | "LONDON (AP) "" Prince George celebrates his second birthday on Wednesday and while he's just a toddler, he's al... | 0.095 | 0.364 | 0.541 |
| 1998 | neutral | positive | False | Gonna watch Final Destination 5 tonight. I always leave the theater so afraid of everything. No huge escalators for sure :S | 0.156 | 0.142 | 0.702 |
199 rows × 7 columns
positive INCORRECTLY CLASSIFIED as: negative
=================================================================
| | true | predicted | correct | text | negative_prob | neutral_prob | positive_prob |
|---|---|---|---|---|---|---|---|
| 28 | positive | negative | False | tomorrow I've to wake up early so Zayn's erformance on VMA better be true otherwise u'll regret for playing with my emotions and sleep | 0.436 | 0.240 | 0.323 |
| 30 | positive | negative | False | Nicki did that for white media Idgaf . Nicki may act like she don't give af but she cares what the media thinks | 0.745 | 0.233 | 0.021 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1948 | positive | negative | False | When I wake up tomorrow I'll be in a different country. Whoa! I didn't run into a David Beckham at the airport. That's a bummer. | 0.409 | 0.330 | 0.261 |
| 1988 | positive | negative | False | @user call Hafiz saeed sir he may help u out. Maybe Pope can b handy . Try it. | 0.444 | 0.394 | 0.162 |
96 rows × 7 columns
positive INCORRECTLY CLASSIFIED as: neutral
=================================================================
| | true | predicted | correct | text | negative_prob | neutral_prob | positive_prob |
|---|---|---|---|---|---|---|---|
| 6 | positive | neutral | False | #US 1st Lady Michelle Obama speaking at the 2015 Beating the Odds Summit to over 130 college-bound students at the pentagon office. >> | 0.167 | 0.604 | 0.230 |
| 16 | positive | neutral | False | Tom Brady is locked for Thursday. Let the season begin! #RepeatSeason | 0.165 | 0.655 | 0.179 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1987 | positive | neutral | False | "\""""@_eryflores: March 16 Luke Bryan is gonna at the Houston Rodeo. I HAVE to go\u002c Its a MUST!\""""" | 0.145 | 0.574 | 0.281 |
| 1997 | positive | neutral | False | Hold on... Sam Smith may do the theme to Spectre!? Dope!!!!!! #007 #SPECTRE #JamesBond | 0.152 | 0.689 | 0.159 |
186 rows × 7 columns
CORRECTLY CLASSIFIED: positive
=================================================================
| | true | predicted | correct | text | negative_prob | neutral_prob | positive_prob |
|---|---|---|---|---|---|---|---|
| 1 | positive | positive | True | "National hot dog day, national tequila day, then national dance day... Sounds like a Friday night." | 0.047 | 0.268 | 0.684 |
| 8 | positive | positive | True | "What a round by Paul Dunne, good luck tomorrow and I hope you win the Open." | 0.042 | 0.134 | 0.825 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1982 | positive | positive | True | This Saturday & Sunday come join us the @user at the Pomona Fairplex! Your ticket can WIN you a Brand New Car! | 0.028 | 0.345 | 0.626 |
| 1994 | positive | positive | True | Beautiful Bouquet with our Beautiful Bentley #bride #groom #wedding #wednesday #weddingcars #love #Repost... | 0.016 | 0.078 | 0.905 |
537 rows × 7 columns
# Note: Quality of life improvement for version 0.2.2
# We can display the full text of any selected row (correctly or incorrectly classified) by dataframe index
selected_index = 15
preview_row_text(predictions_df, selected_index, text_column = text_column, limit=400) # change limit to see more of the text if needed
| Attribute | Value |
|---|---|
| true | positive |
| predicted | positive |
| correct | True |
| negative_prob | 0.052 |
| neutral_prob | 0.154 |
| positive_prob | 0.794 |
text: "Thank you @user for the message. I'm very proud to be a Liverpudlian, may i get your follback? #LiverpudlianLoyalitasTanpaBatas #YNWA"
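The row counts under the nine tables in this section form the confusion matrix for the embeddings x LR model, so the per-class precision and recall in its report can be rebuilt from them. The sketch below does that with the counts printed above (for example, 223 negative tweets predicted as negative); the function names are illustrative only.

```python
# Row counts of the nine prediction tables above, as confusion[true][predicted].
confusion = {
    "negative": {"negative": 223, "neutral": 59, "positive": 30},
    "neutral": {"negative": 224, "neutral": 446, "positive": 199},
    "positive": {"negative": 96, "neutral": 186, "positive": 537},
}


def recall(confusion, label):
    """Share of tweets truly in `label` that were predicted as `label`."""
    row = confusion[label]
    return row[label] / sum(row.values())


def precision(confusion, label):
    """Share of tweets predicted as `label` that truly are `label`."""
    predicted_as_label = sum(row[label] for row in confusion.values())
    return confusion[label][label] / predicted_as_label


for label in confusion:
    print(f"{label:9} precision={precision(confusion, label):.3f} recall={recall(confusion, label):.3f}")
```

Read this way, 668 of the 794 misclassified test tweets have neutral as either the true or the predicted label (59 + 224 + 199 + 186), which is consistent with neutral being the main source of confusion.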
4.3 Run inference on new (or old) data
You can also run inference on new data (or any of the texts from training/validation) by changing the contents of the texts list below. This outputs a prediction, the probabilities of each class and the features present within the text that are used by the model to make its predictions. The numbers for each feature are the input to the final step of the pipeline. They may be scaled or transformed depending on the pipeline components you've chosen.
texts = ['''
It was excellent!
''',
'''
This was a terrible movie!
''',
'''
This might not not be the best movie ever made, or it could be the best movie of no time.
''',
]
y_inference = pipeline.predict(texts)
preprocessor = Pipeline(pipeline.steps[:-1])
feature_names = preprocessor.named_steps['features'].get_feature_names_out()
for i, text in enumerate(texts):
    print(f"Text {i}: {text}")
    print(f"\tPredicted class: {label_names[y_inference[i]]}")
    print()
    y_inference_proba = pipeline.predict_proba([text])
    # Note: Version 0.2.2 changed the following lines to ensure probability labels are correct regardless of the order of target classes
    for idx, prob in enumerate(y_inference_proba[0]):
        c = pipeline.named_steps['classifier'].classes_[idx]
        if c in target_classes:
            print(f"\tProbability of class {label_names[c]}: {prob:.2f}")
    # End change for 0.2.2
    print()
    print("\tFeatures:")
    embeddings = 0
    frequencies = preprocessor.transform([text])
    if not isinstance(frequencies, np.ndarray):
        frequencies = frequencies.toarray()
    frequencies = frequencies[0].T
    for j, freq in enumerate(frequencies):
        # Count embedding dimensions instead of printing each one. Accept both the
        # template prefix ('embeddings_') and this pipeline's prefix ('emb__').
        if feature_names[j].startswith(('embeddings_', 'emb__')):
            embeddings += 1
        elif freq > 0:
            print(f"\t{feature_names[j]}: {freq:.2f}")
    if embeddings > 0:
        print(f"\tFeatures also include {embeddings} embedding dimensions")
    print()
Text 0: It was excellent! Predicted class: positive Probability of class negative: 0.00 Probability of class neutral: 0.00 Probability of class positive: 1.00 Features: emb__emb_3: 0.17 emb__emb_4: 0.05 emb__emb_5: 0.30 emb__emb_7: 0.11 emb__emb_8: 0.01 emb__emb_9: 0.01 emb__emb_12: 0.05 emb__emb_14: 0.01 emb__emb_15: 0.04 emb__emb_17: 0.06 emb__emb_22: 0.03 emb__emb_23: 0.06 emb__emb_24: 0.01 emb__emb_27: 0.03 emb__emb_28: 0.10 emb__emb_30: 0.06 emb__emb_31: 0.01 emb__emb_35: 0.07 emb__emb_38: 0.06 emb__emb_40: 0.17 emb__emb_41: 0.04 emb__emb_42: 0.04 emb__emb_43: 0.06 emb__emb_44: 0.14 emb__emb_45: 0.10 emb__emb_46: 0.01 emb__emb_48: 0.05 emb__emb_52: 0.04 emb__emb_53: 0.07 emb__emb_54: 0.04 emb__emb_55: 0.01 emb__emb_56: 0.14 emb__emb_58: 0.03 emb__emb_66: 0.00 emb__emb_67: 0.07 emb__emb_68: 0.03 emb__emb_69: 0.03 emb__emb_72: 0.03 emb__emb_73: 0.01 emb__emb_74: 0.04 emb__emb_76: 0.06 emb__emb_79: 0.05 emb__emb_82: 0.09 emb__emb_85: 0.08 emb__emb_86: 0.03 emb__emb_90: 0.02 emb__emb_94: 0.01 emb__emb_101: 0.01 emb__emb_108: 0.03 emb__emb_111: 0.03 emb__emb_112: 0.06 emb__emb_115: 0.03 emb__emb_116: 0.10 emb__emb_118: 0.02 emb__emb_119: 0.05 emb__emb_121: 0.08 emb__emb_123: 0.01 emb__emb_124: 0.05 emb__emb_125: 0.10 emb__emb_126: 0.05 emb__emb_127: 0.08 emb__emb_130: 0.11 emb__emb_133: 0.02 emb__emb_137: 0.01 emb__emb_139: 0.05 emb__emb_142: 0.09 emb__emb_144: 0.03 emb__emb_146: 0.08 emb__emb_151: 0.01 emb__emb_154: 0.06 emb__emb_155: 0.04 emb__emb_156: 0.11 emb__emb_157: 0.03 emb__emb_158: 0.10 emb__emb_159: 0.02 emb__emb_160: 0.07 emb__emb_162: 0.01 emb__emb_163: 0.01 emb__emb_165: 0.07 emb__emb_169: 0.03 emb__emb_171: 0.05 emb__emb_173: 0.05 emb__emb_182: 0.08 emb__emb_183: 0.05 emb__emb_185: 0.10 emb__emb_187: 0.02 emb__emb_189: 0.03 emb__emb_190: 0.06 emb__emb_195: 0.07 emb__emb_196: 0.00 emb__emb_199: 0.08 emb__emb_200: 0.05 emb__emb_207: 0.02 emb__emb_208: 0.04 emb__emb_209: 0.02 emb__emb_210: 0.08 emb__emb_213: 0.02 emb__emb_214: 0.16 emb__emb_215: 0.02 
emb__emb_217: 0.00 emb__emb_218: 0.00 emb__emb_220: 0.07 emb__emb_223: 0.03 emb__emb_224: 0.02 emb__emb_225: 0.07 emb__emb_226: 0.02 emb__emb_227: 0.01 emb__emb_230: 0.03 emb__emb_231: 0.02 emb__emb_233: 0.00 emb__emb_234: 0.12 emb__emb_236: 0.01 emb__emb_239: 0.06 emb__emb_240: 0.03 emb__emb_243: 0.01 emb__emb_244: 0.05 emb__emb_245: 0.06 emb__emb_246: 0.02 emb__emb_248: 0.01 emb__emb_250: 0.03 emb__emb_251: 0.00 emb__emb_254: 0.03 Text 1: This was a terrible movie! Predicted class: negative Probability of class negative: 1.00 Probability of class neutral: 0.00 Probability of class positive: 0.00 Features: emb__emb_2: 0.22 emb__emb_3: 0.06 emb__emb_4: 0.07 emb__emb_6: 0.10 emb__emb_9: 0.07 emb__emb_11: 0.10 emb__emb_12: 0.03 emb__emb_13: 0.14 emb__emb_16: 0.04 emb__emb_17: 0.13 emb__emb_18: 0.02 emb__emb_20: 0.07 emb__emb_21: 0.11 emb__emb_22: 0.01 emb__emb_23: 0.13 emb__emb_26: 0.04 emb__emb_27: 0.00 emb__emb_29: 0.07 emb__emb_32: 0.04 emb__emb_33: 0.01 emb__emb_34: 0.07 emb__emb_35: 0.10 emb__emb_40: 0.08 emb__emb_45: 0.02 emb__emb_48: 0.01 emb__emb_49: 0.03 emb__emb_51: 0.04 emb__emb_55: 0.06 emb__emb_60: 0.03 emb__emb_63: 0.02 emb__emb_64: 0.00 emb__emb_65: 0.01 emb__emb_69: 0.03 emb__emb_70: 0.04 emb__emb_72: 0.09 emb__emb_74: 0.04 emb__emb_75: 0.02 emb__emb_77: 0.00 emb__emb_79: 0.00 emb__emb_82: 0.03 emb__emb_84: 0.00 emb__emb_92: 0.02 emb__emb_93: 0.03 emb__emb_95: 0.04 emb__emb_97: 0.02 emb__emb_99: 0.12 emb__emb_100: 0.03 emb__emb_101: 0.03 emb__emb_103: 0.07 emb__emb_105: 0.02 emb__emb_107: 0.08 emb__emb_110: 0.05 emb__emb_111: 0.08 emb__emb_113: 0.02 emb__emb_116: 0.02 emb__emb_118: 0.01 emb__emb_119: 0.03 emb__emb_123: 0.05 emb__emb_124: 0.01 emb__emb_125: 0.04 emb__emb_127: 0.01 emb__emb_128: 0.11 emb__emb_129: 0.00 emb__emb_132: 0.11 emb__emb_133: 0.07 emb__emb_138: 0.07 emb__emb_139: 0.01 emb__emb_143: 0.01 emb__emb_144: 0.08 emb__emb_146: 0.03 emb__emb_148: 0.02 emb__emb_149: 0.02 emb__emb_150: 0.02 emb__emb_153: 0.13 emb__emb_156: 0.02 
emb__emb_157: 0.02 emb__emb_159: 0.07 emb__emb_160: 0.07 emb__emb_161: 0.04 emb__emb_165: 0.02 emb__emb_171: 0.06 emb__emb_173: 0.08 emb__emb_175: 0.01 emb__emb_176: 0.01 emb__emb_177: 0.04 emb__emb_179: 0.02 emb__emb_180: 0.03 emb__emb_182: 0.02 emb__emb_183: 0.04 emb__emb_185: 0.08 emb__emb_188: 0.01 emb__emb_191: 0.04 emb__emb_193: 0.08 emb__emb_198: 0.02 emb__emb_200: 0.00 emb__emb_201: 0.05 emb__emb_207: 0.01 emb__emb_208: 0.05 emb__emb_209: 0.00 emb__emb_210: 0.08 emb__emb_212: 0.00 emb__emb_214: 0.03 emb__emb_217: 0.07 emb__emb_221: 0.03 emb__emb_223: 0.04 emb__emb_226: 0.06 emb__emb_228: 0.02 emb__emb_230: 0.02 emb__emb_232: 0.02 emb__emb_234: 0.12 emb__emb_235: 0.01 emb__emb_238: 0.00 emb__emb_239: 0.03 emb__emb_240: 0.04 emb__emb_241: 0.01 emb__emb_244: 0.03 emb__emb_245: 0.05 emb__emb_246: 0.01 emb__emb_247: 0.04 emb__emb_248: 0.02 emb__emb_252: 0.03 Text 2: This might not not be the best movie ever made, or it could be the best movie of no time. Predicted class: positive Probability of class negative: 0.35 Probability of class neutral: 0.11 Probability of class positive: 0.54 Features: emb__emb_3: 0.05 emb__emb_4: 0.05 emb__emb_5: 0.13 emb__emb_8: 0.04 emb__emb_9: 0.13 emb__emb_14: 0.06 emb__emb_17: 0.05 emb__emb_20: 0.09 emb__emb_21: 0.06 emb__emb_29: 0.05 emb__emb_32: 0.01 emb__emb_34: 0.02 emb__emb_35: 0.13 emb__emb_36: 0.04 emb__emb_38: 0.02 emb__emb_39: 0.01 emb__emb_40: 0.05 emb__emb_41: 0.01 emb__emb_44: 0.11 emb__emb_45: 0.09 emb__emb_46: 0.02 emb__emb_48: 0.04 emb__emb_49: 0.04 emb__emb_53: 0.03 emb__emb_54: 0.01 emb__emb_55: 0.03 emb__emb_58: 0.02 emb__emb_59: 0.03 emb__emb_60: 0.03 emb__emb_63: 0.08 emb__emb_68: 0.04 emb__emb_70: 0.12 emb__emb_72: 0.10 emb__emb_74: 0.04 emb__emb_78: 0.03 emb__emb_83: 0.05 emb__emb_85: 0.07 emb__emb_86: 0.02 emb__emb_88: 0.06 emb__emb_90: 0.04 emb__emb_93: 0.02 emb__emb_95: 0.02 emb__emb_99: 0.01 emb__emb_100: 0.05 emb__emb_101: 0.05 emb__emb_103: 0.08 emb__emb_105: 0.01 emb__emb_107: 0.07 emb__emb_108: 0.02 
emb__emb_110: 0.10 emb__emb_111: 0.03 emb__emb_116: 0.07 emb__emb_118: 0.01 emb__emb_119: 0.07 emb__emb_122: 0.04 emb__emb_123: 0.03 emb__emb_124: 0.03 emb__emb_126: 0.00 emb__emb_128: 0.13 emb__emb_129: 0.06 emb__emb_131: 0.03 emb__emb_132: 0.03 emb__emb_133: 0.03 emb__emb_136: 0.07 emb__emb_138: 0.06 emb__emb_143: 0.11 emb__emb_146: 0.01 emb__emb_147: 0.02 emb__emb_148: 0.05 emb__emb_150: 0.01 emb__emb_151: 0.06 emb__emb_153: 0.13 emb__emb_156: 0.02 emb__emb_159: 0.07 emb__emb_160: 0.17 emb__emb_161: 0.06 emb__emb_162: 0.08 emb__emb_163: 0.01 emb__emb_164: 0.03 emb__emb_165: 0.02 emb__emb_166: 0.04 emb__emb_167: 0.02 emb__emb_168: 0.08 emb__emb_171: 0.03 emb__emb_172: 0.02 emb__emb_173: 0.11 emb__emb_175: 0.02 emb__emb_179: 0.01 emb__emb_180: 0.04 emb__emb_182: 0.02 emb__emb_184: 0.04 emb__emb_185: 0.02 emb__emb_189: 0.04 emb__emb_191: 0.07 emb__emb_193: 0.09 emb__emb_195: 0.01 emb__emb_196: 0.02 emb__emb_198: 0.11 emb__emb_201: 0.04 emb__emb_203: 0.06 emb__emb_207: 0.03 emb__emb_208: 0.02 emb__emb_209: 0.04 emb__emb_210: 0.04 emb__emb_211: 0.00 emb__emb_212: 0.01 emb__emb_214: 0.07 emb__emb_215: 0.00 emb__emb_217: 0.03 emb__emb_219: 0.04 emb__emb_221: 0.02 emb__emb_222: 0.01 emb__emb_223: 0.01 emb__emb_226: 0.07 emb__emb_227: 0.07 emb__emb_228: 0.01 emb__emb_235: 0.00 emb__emb_236: 0.01 emb__emb_237: 0.03 emb__emb_240: 0.01 emb__emb_241: 0.01 emb__emb_245: 0.03 emb__emb_246: 0.07 emb__emb_248: 0.03 emb__emb_251: 0.03 emb__emb_252: 0.04 emb__emb_254: 0.00
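The feature names in the output above start with `emb__emb_`, so a check for the prefix `embeddings_` does not match them and every embedding dimension is listed individually. A minimal sketch of one way to summarise such output, assuming the embedding prefixes are known in advance: `summarise_features` is a hypothetical helper, and the `tokens__` names and all values are invented for illustration.

```python
def summarise_features(feature_names, values,
                       embedding_prefixes=("embeddings_", "emb__")):
    """Return (non-zero named features, number of embedding dimensions)."""
    named = {}
    n_embedding_dims = 0
    for name, value in zip(feature_names, values):
        # str.startswith accepts a tuple, so several prefixes can be checked at once
        if name.startswith(embedding_prefixes):
            n_embedding_dims += 1  # counted whether or not the value is zero
        elif value > 0:
            named[name] = value
    return named, n_embedding_dims


# Toy input: two embedding dimensions and two made-up token features
feature_names = ["emb__emb_3", "emb__emb_4", "tokens__excellent", "tokens__terrible"]
values = [0.17, 0.05, 0.42, 0.0]

named, n_emb = summarise_features(feature_names, values)
for name, value in named.items():
    print(f"{name}: {value:.2f}")
print(f"Features also include {n_emb} embedding dimensions")
```

The prefixes to pass depend on how the feature step in your own pipeline names its outputs, so check `get_feature_names_out()` before relying on the defaults used here.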
4.4 (Optional) Run inference on augmented data¶
This is new functionality for version 0.2.5.
This is optional functionality to run inference on augmented data. You do not need to add this code to your notebook, or run this code, or discuss this to successfully complete the assignment. If you do want to use it, copy and paste the following cells into your notebook.
import nlpaug.augmenter.char as nac
The following cell augments the texts from 4.3 to demonstrate the augmentation. Two augmentations are provided: uncomment the one you want to use. You can change the maximum number of augmentations applied to whatever makes sense for the length of the text you are working with. You can read the nlpaug documentation for more information.
Please note: nlpaug also includes functionality that applies transformations using contextual embeddings and large language models. Using this functionality on the class JupyterHub is not permitted, as it may negatively affect performance for your classmates. You can run the augmentation options below. If you want to run the more complex augmentations on your own machine, that is up to you to work out. This is not necessary for the class assignment. Also note that the word-level transformations in nlpaug require libraries that are incompatible with the existing Python environment.
aug = nac.KeyboardAug(aug_char_max=10) # simulates common typos based on keyboard layout
#aug = nac.OcrAug(aug_char_max=50) # simulates ocr errors
augmented_texts = aug.augment(texts)
for augmented_text in augmented_texts:
    print(augmented_text)
It was RxcelO2nt!
TBiZ was a ^DrribPe movie!
This might not not be the bRs$ moDje ever Nave, or it cijld be the fDst m*Die of no ti<$.
The next cell creates an augmented version of your test data.
X_test_augmented = np.array(aug.augment(list(X_test))) # you could obviously inspect X_test_augmented if you wanted to
Run predictions on the augmented data ...
y_predicted_augmented = pipeline.predict(X_test_augmented)
rpt_aug = classification_report(y_test, y_predicted_augmented, labels=target_classes, target_names=target_names, digits=3, zero_division=0, output_dict=True)
pd.DataFrame(rpt_aug).T.to_csv(FIGS_DIR / 'cm_augmented.csv')
fig_path = FIGS_DIR / 'cm_augmented.png'
with plt.ioff():
    plot_confusion_matrix(y_test, y_predicted_augmented, target_classes, target_names)
    plt.savefig(fig_path, bbox_inches='tight', dpi=150)
    plt.close('all')
print(f" [saved] cm_augmented.csv and cm_augmented.png")
[saved] cm_augmented.csv and cm_augmented.png
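One way to read the augmented results is to compare macro avg F1 on the original and augmented test data, since that is the primary metric for this project. The sketch below is a plain-Python illustration with invented labels and predictions. `macro_f1` is a hypothetical helper, and in the notebook the same kind of score could come from scikit-learn's `f1_score(..., average="macro")`.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 (a class with no TP, FP or FN scores 0)."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented toy data: 0 = negative, 1 = neutral, 2 = positive
classes = [0, 1, 2]
y_true = [0, 0, 1, 1, 2, 2]
y_pred_clean = [0, 0, 1, 1, 2, 1]      # assumed predictions on the original texts
y_pred_augmented = [0, 1, 1, 1, 2, 0]  # assumed predictions on the augmented texts

f1_clean = macro_f1(y_true, y_pred_clean, classes)
f1_augmented = macro_f1(y_true, y_pred_augmented, classes)
print(f"macro F1 (original):  {f1_clean:.3f}")
print(f"macro F1 (augmented): {f1_augmented:.3f}")
print(f"drop under augmentation: {f1_clean - f1_augmented:.3f}")
```

A large drop would suggest the model leans heavily on exact surface forms. Because keyboard augmentation is random, the size of the drop will vary between runs.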
4.5 (Optional) Run inference on data from a CSV¶
This is new functionality for version 0.2.5.
This is optional functionality to run inference on arbitrary data from a CSV. You do not need to add this code to your notebook, or run this code, or discuss this to successfully complete the assignment. If you do want to use it, copy and paste the following cells into your notebook.
You can create your own data, generate some data, or convert a dataset you found online. This is something for you to work out if you use these cells. Remember, this is optional.
Running the following cells will be straightforward if you have a CSV with a column called text (with your text) and a column called label (with label names that match the training dataset label names).
from datasets import Dataset, DatasetDict
import os
Change the values below for your file name and column names.
Check that the CSV loads OK. If the label text in your CSV does not match the label names in the original data, this is something you can work out how to resolve.
csv_file = 'example.csv'
csv_label_column = 'label'
csv_text_column = 'text'
if not os.path.exists(csv_file):
    print('There is no CSV file, so nothing to do here.')
else:
    csv_df = pd.read_csv(csv_file)
    display(csv_df.sample(5))
There is no CSV file, so nothing to do here.
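If the label text in a CSV does not match the training label names, one option is to normalise the labels before converting the data. The sketch below assumes the training names are negative, neutral and positive. The `aliases` mapping, the `normalise_label` helper and the inline CSV content are all hypothetical examples.

```python
import csv
import io

# Assumed label names from the training dataset
training_label_names = ["negative", "neutral", "positive"]

# Hypothetical alternative spellings that might appear in a CSV
aliases = {"neg": "negative", "neu": "neutral", "pos": "positive"}


def normalise_label(raw):
    """Map a raw CSV label to a training label name, or raise if unrecognised."""
    cleaned = raw.strip().lower()
    cleaned = aliases.get(cleaned, cleaned)
    if cleaned not in training_label_names:
        raise ValueError(f"Unrecognised label: {raw!r}")
    return cleaned


# Inline stand-in for a CSV file with 'text' and 'label' columns
csv_text = "text,label\nGreat day!,POS\nNot sure about this., Neutral\nAwful service.,neg\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
for row in rows:
    row["label"] = normalise_label(row["label"])

labels = [row["label"] for row in rows]
print(labels)
```

With a pandas DataFrame, the same function could be applied to the label column with `csv_df[csv_label_column].map(normalise_label)`. Raising on unknown labels is a deliberate choice here, so that unexpected values are noticed rather than silently dropped.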
For consistency, we're converting to the datasets format and making sure the label column representation matches the dataset used for training.
if not os.path.exists(csv_file):
    print('There is no CSV file, so nothing to do here.')
else:
    # Convert pandas DataFrame to Hugging Face Dataset
    csv_dataset = Dataset.from_pandas(csv_df)
    # Wrap the new data in a DatasetDict as a 'test' split
    csv_dataset = DatasetDict({
        'test': csv_dataset
    })
    # Reuse the label feature from the training dataset so label ids match
    existing_class_feature = dataset[list(dataset.keys())[0]].features[label_column]
    csv_dataset = csv_dataset.cast_column(csv_label_column, existing_class_feature)
    preview_dataset(csv_dataset)
    # Use the CSV column names set above, which may differ from the training dataset's
    X_csv = np.array(csv_dataset['test'][csv_text_column])
    y_csv = np.array(csv_dataset['test'][csv_label_column])
There is no CSV file, so nothing to do here.
Running inference with our trained model on the new data.
if not os.path.exists(csv_file):
    print('There is no CSV file, so nothing to do here.')
else:
    y_predicted_csv = pipeline.predict(X_csv)
    rpt_csv = classification_report(y_csv, y_predicted_csv, labels=target_classes, target_names=target_names, digits=3, zero_division=0, output_dict=True)
    pd.DataFrame(rpt_csv).T.to_csv(FIGS_DIR / 'cm_csv.csv')
    fig_path = FIGS_DIR / 'cm_csv.png'
    with plt.ioff():
        plot_confusion_matrix(y_csv, y_predicted_csv, target_classes, target_names)
        plt.savefig(fig_path, bbox_inches='tight', dpi=150)
        plt.close('all')
    print(f" [saved] cm_csv.csv and cm_csv.png")
There is no CSV file, so nothing to do here.