Can a text classifier reliably reproduce human-assigned sentiment labels, and what does its performance reveal about label quality? Assignment 2 applies Logistic Regression and Decision Tree classifiers to the Cardiff NLP tweet_eval sentiment dataset (~45,000 tweets labelled negative, neutral, or positive), evaluating six feature types across 3-class and binary task framings.
Written report on sentiment label quality in the Cardiff NLP tweet_eval dataset. Covers the problem framing (label reliability as the central question), feature engineering rationale (unigram counts, TF-IDF, bigrams, VADER lexicon scores, text statistics, POS sequences), classifier comparison (Logistic Regression vs Decision Tree), and evaluation design using macro F1 as the primary metric given class imbalance (neutral ≈ 45%). Discusses which tweet types are systematically hard to classify and what misclassification patterns suggest about the consistency of the original human labels.
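To illustrate why macro F1 rather than accuracy serves as the primary metric under this kind of imbalance, the sketch below scores a hypothetical classifier that always predicts neutral. The ten labels are made up for the example and are not drawn from tweet_eval, and the helper names `per_class_f1` and `macro_f1` are illustrative, not taken from the notebook. scikit-learn's `f1_score` with `average="macro"` should give the same value.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Made-up, imbalanced toy labels (assumption: illustration only).
y_true = ["neutral"] * 6 + ["positive"] * 3 + ["negative"] * 1
y_pred = ["neutral"] * 10  # hypothetical majority-class predictor

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.2f}")
```

On this toy data the majority-class predictor reaches 0.60 accuracy but only 0.25 macro F1, because the two classes it never predicts each contribute an F1 of zero to the average.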
Static HTML render of the Jupyter notebook with all code, outputs, tables, and figures embedded inline. Covers the full four-run pipeline (3-class LR, 3-class DT, binary LR, binary DT) with per-feature classification reports, confusion matrices, and a cross-run macro F1 comparison table. Readable without a Jupyter environment.
Executable pipeline: data loading from the Cardiff NLP tweet_eval parquet files, feature construction via textplumber, and four independent model runs whose results are cached to disk for idempotent re-execution. Reporting cells display confusion matrices and per-feature F1 tables after all computation is complete. Requires Python with textplumber, scikit-learn, VADER, and the processed tweet_eval dataset.
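One way the disk caching described above could work is sketched below. This is an assumption about the general pattern, not the notebook's actual code: the function name `cached_run`, the JSON format, and the `results_cache` directory are all hypothetical.

```python
import json
from pathlib import Path


def cached_run(name, compute, cache_dir="results_cache"):
    """Return the cached result for `name`, computing and saving it if absent.

    `compute` is a zero-argument callable returning JSON-serialisable results
    (for example, a dict of per-feature macro F1 scores).
    """
    path = Path(cache_dir) / f"{name}.json"
    if path.exists():
        # Cache hit: skip training and evaluation entirely.
        return json.loads(path.read_text())
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```

With this pattern, re-running a cell such as `cached_run("3class_lr", run_model)` (where `run_model` is a hypothetical training function) would read the stored result instead of refitting, and deleting the cache file would force a fresh run.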