STAT 419 — Applied Multivariate Statistics

Multivariate Analysis of Wine Quality

Applying PCA for dimensionality reduction, MANOVA for group comparisons, and Linear Discriminant Analysis for classification — uncovering the chemical signatures that distinguish wine quality tiers.

Python · scikit-learn · scipy · seaborn · Multivariate Methods

Overview

Wine quality assessment traditionally relies on expert sommeliers, but can physicochemical measurements predict quality objectively? This project applies three core multivariate statistical techniques to the UCI Wine Quality dataset (6,497 wines with 11 chemical properties): Principal Component Analysis to visualize the high-dimensional structure, MANOVA to formally test whether quality groups differ across multiple chemical dimensions simultaneously, and Linear Discriminant Analysis for classification into quality tiers.

The analysis suggests that wine quality is partly encoded in chemical composition, particularly the interplay between volatile acidity, alcohol content, sulphates, and residual sugar. The LDA classifier reaches 78.3% cross-validated accuracy on a 3-tier quality classification by using supervised discriminant directions, which, unlike principal components, are chosen specifically to separate the tiers.

PCA · MANOVA · Linear Discriminant Analysis · Multivariate Normality · Box's M Test · Scree Plot

Data and Quality Tiers

The dataset combines 1,599 red and 4,898 white Portuguese "Vinho Verde" wines, each rated on a 0–10 quality scale by expert tasters. To create a three-class problem with interpretable groups, I collapsed the ratings into three quality tiers. The tiers are far from balanced: the medium tier holds about 70% of the wines.

Tier     Quality Score   Count   Proportion
Low      3–4             640     9.9%
Medium   5–6             4,535   69.8%
High     7–9             1,322   20.3%

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from scipy import stats
import seaborn as sns

# Load and combine red + white wines
red = pd.read_csv("winequality-red.csv", sep=";")
white = pd.read_csv("winequality-white.csv", sep=";")
red["type"] = "red"
white["type"] = "white"
wine = pd.concat([red, white], ignore_index=True)

# Create quality tiers
wine["tier"] = pd.cut(wine["quality"],
                       bins=[0, 4, 6, 10],
                       labels=["Low", "Medium", "High"])

# Chemical features (standardized)
features = ["fixed acidity", "volatile acidity", "citric acid",
            "residual sugar", "chlorides", "free sulfur dioxide",
            "total sulfur dioxide", "density", "pH",
            "sulphates", "alcohol"]

X = StandardScaler().fit_transform(wine[features])
y = wine["tier"]
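
Because `pd.cut` uses right-closed intervals by default, the bins `[0, 4, 6, 10]` correspond to (0, 4], (4, 6], and (6, 10]. The short check below, a sketch on a hand-made series of scores rather than the wine data, shows how each score from 3 to 9 lands in a tier. The names `scores`, `tiers`, and `mapping` are illustrative.

```python
import pandas as pd

# Hand-made scores covering the range that appears in the tier table (3 to 9).
scores = pd.Series([3, 4, 5, 6, 7, 8, 9], name="quality")

# Same bin edges and labels as in the code above; intervals are right-closed.
tiers = pd.cut(scores, bins=[0, 4, 6, 10], labels=["Low", "Medium", "High"])

# Map each score to the tier it was assigned.
mapping = dict(zip(scores, tiers.astype(str)))
print(mapping)
```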

Principal Component Analysis

PCA reduces the 11-dimensional chemical space to a lower-dimensional representation while preserving maximum variance. The scree plot reveals that the first 4 principal components explain 72.8% of total variance, with a clear "elbow" after PC2.

# Fit PCA
pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

print("Variance explained by component:")
for i, (var, cum) in enumerate(zip(
    pca.explained_variance_ratio_, cumvar
)):
    print(f"  PC{i+1}: {var:.1%}  (cumulative: {cum:.1%})")

Component   Variance Explained   Cumulative   Top Loadings
PC1         27.5%                27.5%        total SO₂, free SO₂, residual sugar
PC2         22.3%                49.8%        volatile acidity, citric acid, pH
PC3         13.7%                63.5%        alcohol, density, chlorides
PC4         9.3%                 72.8%        sulphates, fixed acidity

[Figure: PCA biplot of PC1 (27.5%) vs. PC2 (22.3%), colored by quality tier: Low (3–4), Medium (5–6), High (7–9). Loading arrows shown for alcohol, volatile acidity, and total SO₂.]

Quality tiers show partial separation along PC1, driven largely by alcohol content (rightward = higher quality). Substantial overlap in the medium tier makes this a challenging classification problem.
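
A "Top Loadings" column like the one in the table above can be read off the rows of `pca.components_`. The sketch below shows one way to do this on synthetic data, so the feature names `f1` to `f5` and the variables `raw`, `loadings`, and `top` are assumptions for illustration only. Note that some texts define loadings as eigenvectors scaled by the square root of the eigenvalue; here the unscaled eigenvectors are used.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: f2 is made strongly correlated with f1, the rest are noise.
rng = np.random.default_rng(0)
demo_names = ["f1", "f2", "f3", "f4", "f5"]
raw = rng.normal(size=(200, 5))
raw[:, 1] += 2 * raw[:, 0]

# Standardize, then fit PCA, mirroring the workflow used for the wine data.
Z = StandardScaler().fit_transform(raw)
pca_demo = PCA().fit(Z)

# Rows of components_ are components; transpose so rows are features.
loadings = pd.DataFrame(
    pca_demo.components_.T,
    index=demo_names,
    columns=[f"PC{i + 1}" for i in range(len(demo_names))],
)

# For each component, list the three features with the largest absolute loading.
top = {pc: loadings[pc].abs().nlargest(3).index.tolist() for pc in loadings.columns}
print(top["PC1"])
```

In this toy setup the first component is dominated by the correlated pair `f1` and `f2`, which is the same pattern the wine table summarizes per component.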

MANOVA

While PCA is exploratory, MANOVA provides a formal multivariate hypothesis test: do the quality tiers have significantly different mean vectors across all 11 chemical dimensions simultaneously? Using Pillai's trace (generally regarded as the most robust of the four standard MANOVA statistics when covariance homogeneity is violated), the test strongly rejects the null hypothesis of equal mean vectors.

from statsmodels.multivariate.manova import MANOVA

# Fit MANOVA
formula = " + ".join([f'Q("{f}")' for f in features])
formula = formula + " ~ tier"

manova = MANOVA.from_formula(formula, data=wine)
print(manova.mv_test())

Pillai's trace: 0.483
Approx. F-statistic: 168.3
P-value: < 0.001

The MANOVA confirms that the three quality tiers have significantly different multivariate chemical profiles (Pillai's trace = 0.483, F(22, 12970) = 168.3, p < 0.001). This justifies proceeding with discriminant analysis to find the optimal directions for separating the groups.
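
To make the test statistic less of a black box, the sketch below computes Pillai's trace as tr(H(H + E)⁻¹) from the between-group (H) and within-group (E) sums-of-squares-and-cross-products matrices, together with the degrees of freedom of its usual F approximation. The function names `pillai_trace` and `pillai_f_df` and the synthetic data are assumptions for illustration; this is not the code used to produce the results above.

```python
import numpy as np

def pillai_trace(X, groups):
    """Pillai's trace V = tr(H (H + E)^-1) for a one-way MANOVA."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    H = np.zeros((p, p))  # between-group (hypothesis) SSCP
    E = np.zeros((p, p))  # within-group (error) SSCP
    for g in np.unique(groups):
        Xg = X[groups == g]
        d = (Xg.mean(axis=0) - grand_mean)[:, None]
        H += len(Xg) * (d @ d.T)
        R = Xg - Xg.mean(axis=0)
        E += R.T @ R
    return float(np.trace(H @ np.linalg.inv(H + E)))

def pillai_f_df(n_obs, n_features, n_groups):
    """Degrees of freedom of the F approximation to Pillai's trace."""
    q = n_groups - 1                      # hypothesis degrees of freedom
    s = min(n_features, q)
    m = (abs(n_features - q) - 1) / 2
    n = (n_obs - n_groups - n_features - 1) / 2
    return int(s * (2 * m + s + 1)), int(s * (2 * n + s + 1))

# Synthetic three-group example with a shifted third group.
rng = np.random.default_rng(0)
groups_demo = np.repeat([0, 1, 2], 30)
X_demo = rng.normal(size=(90, 4))
X_demo[groups_demo == 2] += 1.0
V_demo = pillai_trace(X_demo, groups_demo)

# Degrees of freedom for the wine setting: N = 6,497, p = 11, g = 3.
df_wine = pillai_f_df(6497, 11, 3)
print(V_demo, df_wine)
```

With N = 6,497, p = 11, and g = 3, `pillai_f_df` returns (22, 12970), which matches the degrees of freedom quoted above.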

Linear Discriminant Analysis

LDA finds the linear combinations of features that maximize the ratio of between-group to within-group variance, that is, the directions along which the quality tiers are most separated. The number of discriminant functions is at most min(g − 1, p); with g = 3 groups and p = 11 features, LDA produces 2.

from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import classification_report, confusion_matrix

# Fit LDA
lda = LinearDiscriminantAnalysis()
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Cross-validated predictions
y_pred = cross_val_predict(lda, X, y, cv=cv)

print(classification_report(y, y_pred, digits=3))

# Discriminant function coefficients
lda.fit(X, y)
coef_df = pd.DataFrame(
    lda.scalings_, index=features,
    columns=["LD1", "LD2"]
).round(3)
print(coef_df.sort_values("LD1", ascending=False))
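
The statement that three groups yield two discriminant functions can be checked directly from the underlying eigenproblem. The sketch below builds the within-group and between-group scatter matrices on synthetic data and solves the generalized eigenproblem S_B v = λ S_W v with `scipy.linalg.eigh`. All variable names and the data are illustrative assumptions, and scikit-learn's default SVD solver reaches the same directions by a different numerical route.

```python
import numpy as np
from scipy.linalg import eigh

# Synthetic data: three groups in five dimensions with random group means.
rng = np.random.default_rng(0)
n_per_group, n_features = 50, 5
means = rng.normal(size=(3, n_features))
X_syn = np.vstack([rng.normal(loc=mu, size=(n_per_group, n_features)) for mu in means])
labels = np.repeat([0, 1, 2], n_per_group)

# Within-group (S_W) and between-group (S_B) scatter matrices.
grand_mean = X_syn.mean(axis=0)
S_W = np.zeros((n_features, n_features))
S_B = np.zeros((n_features, n_features))
for g in np.unique(labels):
    Xg = X_syn[labels == g]
    R = Xg - Xg.mean(axis=0)
    S_W += R.T @ R
    d = (Xg.mean(axis=0) - grand_mean)[:, None]
    S_B += len(Xg) * (d @ d.T)

# Generalized symmetric eigenproblem; eigh returns ascending eigenvalues.
eigvals = eigh(S_B, S_W, eigvals_only=True)[::-1]

# Count eigenvalues that are nonzero up to numerical tolerance.
n_discriminants = int(np.sum(eigvals > 1e-8 * eigvals.max()))
print(n_discriminants, eigvals[:2] / eigvals[:2].sum())
```

Because S_B is built from three group means that are tied together by one grand mean, its rank is at most 2, so only two eigenvalues are nonzero regardless of the number of features.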

Classification Performance

Tier           Precision   Recall   F1-Score   Support
Low            0.592       0.478    0.529      640
Medium         0.816       0.874    0.844      4,535
High           0.714       0.627    0.668      1,322
Weighted Avg   0.775       0.783    0.777      6,497

Overall accuracy: 78.3%
Medium-tier F1: 0.844
Discriminant functions: 2
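
Two pieces of context help when reading these figures. First, with tiers this imbalanced, a classifier that always predicts Medium already scores the Medium proportion from the tier table. Second, the support-weighted average recall is, by construction, the same quantity as overall accuracy. The sketch below illustrates both; the toy labels `y_true` and `y_hat` are made up for the demonstration, and only `tier_counts` comes from the write-up.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Tier counts as reported in the tier table.
tier_counts = {"Low": 640, "Medium": 4535, "High": 1322}
baseline = max(tier_counts.values()) / sum(tier_counts.values())
print(f"Majority-class baseline: {baseline:.1%}")

# Toy labels (hypothetical) to show accuracy equals support-weighted recall.
y_true = np.array(["Low"] * 2 + ["Medium"] * 6 + ["High"] * 2)
y_hat = np.array(["Low", "Medium"] + ["Medium"] * 5 + ["High"] + ["High", "Medium"])

acc = accuracy_score(y_true, y_hat)
weighted_recall = recall_score(y_true, y_hat, average="weighted")
print(acc, weighted_recall)
```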

Discriminant Function Interpretation

The first discriminant function (LD1, explaining 87% of between-group variance) loads heavily on alcohol (+0.72), volatile acidity (-0.48), and density (-0.38). This aligns with oenological knowledge: higher-quality wines tend to have higher alcohol, lower volatile acidity (which causes vinegar-like off-notes), and lower density. The second function (LD2) primarily captures the residual sugar and SO₂ dimensions, distinguishing sweet-style wines from dry ones within quality tiers.
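
The share of between-group variance carried by each discriminant function is exposed by scikit-learn as `explained_variance_ratio_` on a fitted `LinearDiscriminantAnalysis` (available for the default SVD solver). The following sketch reads it on synthetic three-class data; the group centers and the names `X_toy`, `y_toy`, and `lda_toy` are assumptions for illustration and do not reproduce the 87% figure above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic three-class data whose centers differ mostly along one direction.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0, 0.0, 0.0],
                    [2.0, 0.0, 0.0, 0.0],
                    [4.0, 1.0, 0.0, 0.0]])
X_toy = np.vstack([rng.normal(loc=c, size=(60, 4)) for c in centers])
y_toy = np.repeat(["Low", "Medium", "High"], 60)

lda_toy = LinearDiscriminantAnalysis().fit(X_toy, y_toy)

# One ratio per discriminant function (here two, since there are three classes).
ratios = lda_toy.explained_variance_ratio_
for i, r in enumerate(ratios, start=1):
    print(f"LD{i}: {r:.1%} of between-group variance")
```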

Key Takeaways

Wine quality is statistically encoded in chemical composition. MANOVA confirms significant multivariate differences between quality tiers (p < 0.001). PCA shows that four components capture 72.8% of the variance in the chemical space, and LDA uses the quality labels to find discriminant directions that achieve 78.3% classification accuracy. The features with the largest LD1 loadings (alcohol, volatile acidity, and density) align with established wine science, lending credibility to the statistical model.

This project illustrates the complementary nature of multivariate techniques: PCA provides unsupervised exploration, MANOVA gives formal inference, and LDA delivers supervised classification. Set against a majority-class baseline of 69.8% (always predicting the medium tier), the 78.3% accuracy demonstrates both the promise and the limits of chemical analysis for quality prediction. The remaining 21.7% error likely reflects, at least in part, subjective aspects of taste that chemistry alone cannot capture, such as aromatic complexity and mouthfeel balance.