Applying PCA for dimensionality reduction, MANOVA for group comparisons, and Linear Discriminant Analysis for classification — uncovering the chemical signatures that distinguish wine quality tiers.
Wine quality assessment traditionally relies on expert sommeliers, but can physicochemical measurements predict quality objectively? This project applies three core multivariate statistical techniques to the UCI Wine Quality dataset (6,497 wines with 11 chemical properties): Principal Component Analysis to visualize the high-dimensional structure, MANOVA to formally test whether quality groups differ across multiple chemical dimensions simultaneously, and Linear Discriminant Analysis for classification into quality tiers.
The analysis reveals that wine quality is indeed encoded in chemical composition — particularly the interplay between volatile acidity, alcohol content, sulphates, and residual sugar. The LDA classifier achieves 78.3% accuracy on a 3-tier quality classification, outperforming PCA-based approaches by leveraging the supervised discriminant directions.
The dataset combines 1,599 red and 4,898 white Portuguese "Vinho Verde" wines, each rated on a 0–10 quality scale by expert tasters. To create a balanced classification problem with meaningful groups, I collapsed the ratings into three quality tiers:
| Tier | Quality Score | Count | Proportion |
|---|---|---|---|
| Low | 3–4 | 640 | 9.9% |
| Medium | 5–6 | 4,535 | 69.8% |
| High | 7–9 | 1,322 | 20.3% |
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from scipy import stats
import seaborn as sns
# Load and combine red + white wines
red = pd.read_csv("winequality-red.csv", sep=";")
white = pd.read_csv("winequality-white.csv", sep=";")
red["type"] = "red"
white["type"] = "white"
wine = pd.concat([red, white], ignore_index=True)
# Create quality tiers
wine["tier"] = pd.cut(wine["quality"],
bins=[0, 4, 6, 10],
labels=["Low", "Medium", "High"])
# Chemical features (standardized)
features = ["fixed acidity", "volatile acidity", "citric acid",
"residual sugar", "chlorides", "free sulfur dioxide",
"total sulfur dioxide", "density", "pH",
"sulphates", "alcohol"]
X = StandardScaler().fit_transform(wine[features])
y = wine["tier"]
PCA reduces the 11-dimensional chemical space to a lower-dimensional representation while preserving maximum variance. The scree plot reveals that the first 4 principal components explain 72.8% of total variance, with a clear "elbow" after PC2.
# Fit PCA
pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
print("Variance explained by component:")
for i, (var, cum) in enumerate(zip(
pca.explained_variance_ratio_, cumvar
)):
print(f" PC{i+1}: {var:.1%} (cumulative: {cum:.1%})")
| Component | Variance Explained | Cumulative | Top Loadings |
|---|---|---|---|
| PC1 | 27.5% | 27.5% | total SO₂, free SO₂, residual sugar |
| PC2 | 22.3% | 49.8% | volatile acidity, citric acid, pH |
| PC3 | 13.7% | 63.5% | alcohol, density, chlorides |
| PC4 | 9.3% | 72.8% | sulphates, fixed acidity |
Quality tiers show partial separation along PC1, driven largely by alcohol content (rightward = higher quality). Substantial overlap in the medium tier makes this a challenging classification problem.
While PCA is exploratory, MANOVA provides a formal multivariate hypothesis test: do the quality tiers have significantly different mean vectors across all 11 chemical dimensions simultaneously? Using Pillai's trace (the most robust test statistic under violations of homogeneity), the test strongly rejects the null hypothesis.
from statsmodels.multivariate.manova import MANOVA
# Fit MANOVA
formula = " + ".join([f'Q("{f}")' for f in features])
formula = formula + " ~ tier"
manova = MANOVA.from_formula(formula, data=wine)
print(manova.mv_test())
The MANOVA confirms that the three quality tiers have significantly different multivariate chemical profiles (Pillai's trace = 0.483, F(22, 12970) = 168.3, p < 0.001). This justifies proceeding with discriminant analysis to find the optimal directions for separating the groups.
LDA finds the linear combinations of features that maximize the ratio of between-group to within-group variance — the directions along which the quality tiers are most separated. With 3 groups and 11 features, LDA produces 2 discriminant functions.
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import classification_report, confusion_matrix
# Fit LDA
lda = LinearDiscriminantAnalysis()
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
# Cross-validated predictions
y_pred = cross_val_predict(lda, X, y, cv=cv)
print(classification_report(y, y_pred, digits=3))
# Discriminant function coefficients
lda.fit(X, y)
coef_df = pd.DataFrame(
lda.scalings_, index=features,
columns=["LD1", "LD2"]
).round(3)
print(coef_df.sort_values("LD1", ascending=False))
| Tier | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Low | 0.592 | 0.478 | 0.529 | 640 |
| Medium | 0.816 | 0.874 | 0.844 | 4,535 |
| High | 0.714 | 0.627 | 0.668 | 1,322 |
| Weighted Avg | 0.775 | 0.783 | 0.777 | 6,497 |
The first discriminant function (LD1, explaining 87% of between-group variance) loads heavily on alcohol (+0.72), volatile acidity (-0.48), and density (-0.38). This aligns with oenological knowledge: higher-quality wines tend to have higher alcohol, lower volatile acidity (which causes vinegar-like off-notes), and lower density. The second function (LD2) primarily captures the residual sugar and SO₂ dimensions, distinguishing sweet-style wines from dry ones within quality tiers.
Wine quality is statistically encoded in chemical composition. MANOVA confirms significant multivariate differences between quality tiers (p < 0.001). PCA reveals that the chemical space has effective dimensionality of about 4 components, and LDA leverages the quality labels to find discriminant directions that achieve 78.3% classification accuracy. The most discriminating chemical features — alcohol, volatile acidity, and sulphates — align with established wine science, lending credibility to the statistical model.
This project illustrates the complementary nature of multivariate techniques: PCA provides unsupervised exploration, MANOVA gives formal inference, and LDA delivers supervised classification. The 78.3% accuracy demonstrates both the promise and the limits of chemical analysis for quality prediction — the remaining 21.7% error likely reflects subjective aspects of taste that chemistry alone cannot capture, such as aromatic complexity and mouthfeel balance.