Department of Mathematics

The Chinese University of Hong Kong

HSMMC Math Modelling Advanced Training Workshop

June 27, 2026

1 Introduction

Support Vector Machines (abbreviated SVMs), a unified framework for classification and regression tasks, were developed during the 1990s within the discipline of computer science. Although this methodology was initially created exclusively for binary classification problems, its practical applicability has since been extended to multi-class classification and regression modelling.

SVMs consistently deliver superior classification performance across a wide range of real-world and synthetic datasets; for this reason, they are regarded as a foundational benchmark method in the fields of statistical learning and machine learning.

The theoretical backbone of Support Vector Machines is the Maximal Margin Classifier, whose core geometric building block is the hyperplane. This note presents each foundational concept in sequential logical order. A complete, rigorous grasp of SVM theory requires robust prior knowledge of linear algebra.

Within the R programming environment, the software packages e1071 and LiblineaR implement all core algorithms required to train simple binary classification models, multi-class classification models, and regression models constructed under the Support Vector Machine paradigm.

2 Hyperplane and Maximal Margin Classifier

In a $p$-dimensional space, a hyperplane is defined as a flat affine subspace of dimension $p-1$. The term means the subspace is not required to pass through the origin.

In a 2-dimensional space: the hyperplane is a 1-dimensional subspace, i.e., a straight line.
In a 3-dimensional space: the hyperplane is a 2-dimensional subspace, an ordinary plane.
For $p>3$: direct visualisation is not feasible, but the definition of a $(p-1)$-dimensional affine subspace remains valid.

2.1 Mathematical Definition of a Hyperplane

The algebraic formulation of a hyperplane is a linear equation.

2.2 Two-dimensional case

For $p=2$, the hyperplane corresponds to a straight line with equation: \[\begin{equation} (1): \quad \quad \beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0 \label{eq:hyperplane_2d} \end{equation}\]

where $\beta_0, \beta_1, \beta_2$ are fixed real-valued coefficients.
Any coordinate pair $\boldsymbol{x}=(x_1,x_2)$ that satisfies Equation (1) lies exactly on the hyperplane.

2.3 General $p$-dimensional case

Equation (1) generalises to arbitrary dimension $p$: \[\begin{equation} (2): \quad \quad \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p = 0 \label{eq:hyperplane_pd} \end{equation}\] All vectors $\boldsymbol{x} = \begin{pmatrix} x_1 & x_2 & \dots & x_p \end{pmatrix}^\top$ satisfying (2) are points on the $p$-dimensional hyperplane.

2.4 Half-Space Partition Induced by a Hyperplane

For a point $\boldsymbol{x}$ not lying on the hyperplane, evaluating the linear functional yields one of two strict inequalities: \[\begin{equation} (3): \quad \quad \beta_0 + \sum_{j=1}^p \beta_j x_j < 0 \label{eq:halfspace_neg} \end{equation}\] or \[\begin{equation} (4): \quad \quad \beta_0 + \sum_{j=1}^p \beta_j x_j > 0 \label{eq:halfspace_pos} \end{equation}\]

These two inequalities define two disjoint open half-spaces separated by the hyperplane. The sign of the linear expression $\beta_0 + \sum_{j=1}^p \beta_j x_j$ uniquely determines which half-space contains the point $\boldsymbol{x}$.

2.5 Concrete 2D Example Calculation

Take the specific hyperplane equation: \[\begin{equation} (5): \quad \quad 1 + 2x_1 + 3x_2 = 0 \label{eq:example_hp} \end{equation}\]

Blue half-space: All $\boldsymbol{x}$ such that $1 + 2x_1 + 3x_2 > 0$ (Equation (4) with $\beta_0=1,\beta_1=2,\beta_2=3$)
Red half-space: All $\boldsymbol{x}$ such that $1 + 2x_1 + 3x_2 < 0$ (Equation (3) with $\beta_0=1,\beta_1=2,\beta_2=3$)

Below is executable R code to compute the linear functional, classify regions, generate the colored partition diagram, and verify sample point calculations numerically.

# Install ggplot2 if missing
if (!require("ggplot2")) {
  install.packages("ggplot2")
  library(ggplot2)
}

# Step 1: Create 2D coordinate grid
x1 <- seq(-4, 4, length.out = 300)
x2 <- seq(-4, 4, length.out = 300)
grid_data <- expand.grid(x1 = x1, x2 = x2)

# Step 2: Compute hyperplane linear functional f(x1,x2) = 1 + 2*x1 + 3*x2
grid_data$f <- 1 + 2 * grid_data$x1 + 3 * grid_data$x2

# Step 3: Label each region
grid_data$region <- ifelse(
  grid_data$f > 1e-5,
  "Positive half-space (1+2x1+3x2 > 0)",
  ifelse(
    grid_data$f < -1e-5,
    "Negative half-space (1+2x1+3x2 < 0)",
    "Hyperplane line (1+2x1+3x2 = 0)"
  )
)

# Step 4: Plot hyperplane and two half-spaces
plot_hyperplane <- ggplot() +
  geom_tile(
    data = grid_data,
    aes(x = x1, y = x2, fill = region),
    alpha = 0.7
  ) +
  scale_fill_manual(
    values = c(
      "Positive half-space (1+2x1+3x2 > 0)" = "#4477ff",
      "Negative half-space (1+2x1+3x2 < 0)" = "#dd3333",
      "Hyperplane line (1+2x1+3x2 = 0)" = "#000000"
    )
  ) +
  labs(
    title = "2D Hyperplane Partition: $1 + 2x_1 + 3x_2 = 0$",
    x = expression(x[1]),
    y = expression(x[2]),
    fill = "Region Classification"
  ) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))

# Render plot
print(plot_hyperplane)

# Step 5: Manual mathematical verification for sample points
# Test point 1: (x1=1, x2=1)
f1 <- 1 + 2*1 + 3*1
cat("Point (1,1): f =", f1, "> 0 → Positive blue region\n")

## Point (1,1): f = 6 > 0 → Positive blue region

# Test point 2: (x1=-2, x2=-2)
f2 <- 1 + 2*(-2) + 3*(-2)
cat("Point (-2,-2): f =", f2, "< 0 → Negative red region\n")

## Point (-2,-2): f = -9 < 0 → Negative red region

# Test point exactly on hyperplane: solve 1+2x1+3x2=0, pick x1=-2 → x2=( -1 -2*(-2) )/3
x1_hp <- -2
x2_hp <- (-1 - 2 * x1_hp) / 3
f_hp <- 1 + 2*x1_hp + 3*x2_hp
cat("Point on hyperplane (",x1_hp,",",x2_hp,"): f =", f_hp, "→ Black separating line\n")

## Point on hyperplane ( -2 , 1 ): f = 0 → Black separating line

3 Binary Classification Using a Hyperplane

Suppose that we have $n$ observations, where each observation contains $p$ predictor variables, and the response variable has only two possible classes. Throughout this discussion, the two classes are denoted by $+1$ and $-1$.

Hyperplanes can be used to construct a classifier that predicts the class membership of an observation based on its predictor values. The same classification problem can also be addressed using other methods, such as logistic regression, Linear Discriminant Analysis (LDA), classification trees, and many others, each of which has its own advantages and disadvantages.

For ease of visualization, the following explanations are presented in a two-dimensional space, where a hyperplane is simply a straight line. However, exactly the same concepts extend naturally to higher-dimensional spaces.

4 Perfectly Linearly Separable Cases

Assume that the observations are distributed in such a way that the two classes ($+1$ and $-1$) can be perfectly separated by a linear boundary.

A hyperplane in a $p$-dimensional space is defined by

\[ \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_px_p=0. \]

If the classes are perfectly separable, there exist coefficients

\[ \beta_0,\beta_1,\ldots,\beta_p \]

such that every observation satisfies:

\[ \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_px_p >0, \qquad \text{if } y_i=+1, \]

and

\[ \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_px_p <0, \qquad \text{if } y_i=-1. \]

4.1 Combining the Two Conditions

Since each class label is either $+1$ or $-1$, the two inequalities above can be combined into a single condition.

4.1.1 Case 1: $y_i=+1$

Multiplying the hyperplane expression by $y_i$ gives

\[ y_i\left( \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\cdots+\beta_px_{ip} \right) = (+1) \left( \beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip} \right). \]

Therefore,

\[ y_i\left( \beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip} \right)>0. \]

4.1.2 Case 2: $y_i=-1$

Since

\[ \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\cdots+\beta_px_{ip}<0, \]

multiplying both sides by $y_i=-1$ yields

\[ (-1) \left( \beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip} \right)>0, \]

because the product of two negative quantities is positive.

Hence,

\[ y_i\left( \beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip} \right)>0. \]

Since this condition holds in both cases, we obtain the compact representation

\[ \boxed{ y_i \left( \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\cdots+\beta_px_{ip} \right) >0, \qquad i=1,\ldots,n. } \]

This equation states that every training observation is correctly classified by the hyperplane.

4.2 Hyperplane Classifier

Under the perfectly separable scenario, the simplest classifier assigns a class according to which side of the hyperplane an observation lies.

Let a new observation be

\[ \mathbf{x}^* = (x_1^*,x_2^*,\ldots,x_p^*). \]

Define the decision function

\[ f(\mathbf{x}^*) = \beta_0 +\beta_1x_1^* +\beta_2x_2^* +\cdots +\beta_px_p^*. \]

The classification rule is

\[ \hat{y} = \begin{cases} +1, & \text{if } f(\mathbf{x}^*)>0,\\[6pt] -1, & \text{if } f(\mathbf{x}^*)<0. \end{cases} \]

4.3 Interpretation of the Decision Function

The sign of $f(\mathbf{x}^*)$ determines the predicted class:

$f(\mathbf{x}^*)>0$: classify the observation as class $+1$.
$f(\mathbf{x}^*)<0$: classify the observation as class $-1$.
$f(\mathbf{x}^*)=0$: the observation lies exactly on the hyperplane.

Furthermore, the magnitude

\[ \left|f(\mathbf{x}^*)\right| \]

provides information about how far the observation is from the decision boundary. In general:

A larger value of $\left|f(\mathbf{x}^*)\right|$ indicates that the observation is farther from the hyperplane and is therefore classified with greater confidence.
A smaller value of $\left|f(\mathbf{x}^*)\right|$ indicates that the observation is closer to the hyperplane and is therefore classified with less confidence.

5 Why an Optimal Hyperplane Is Needed

The conditions for perfect linear separability only require that all observations be correctly classified. If the data are perfectly separable, there are generally infinitely many hyperplanes satisfying

\[ y_i \left( \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\cdots+\beta_px_{ip} \right) >0, \qquad i=1,\ldots,n. \]

Consequently, infinitely many classifiers can achieve zero training error.

Because of this non-uniqueness, an additional criterion is required to select a single ``best’’ hyperplane. This leads to the concept of the $\emph{optimal separating hyperplane}$, also known as the $\emph{maximum-margin hyperplane}$, which chooses the hyperplane that maximizes the separation (margin) between the two classes. Maximizing the margin generally improves the robustness and generalization ability of the classifier.

# Install required package if missing
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)

# Fix random seed for identical synthetic data
set.seed(68)

# Generate two groups of standard normal random variables
X1 <- rnorm(n = 10, mean = 2, sd = 1)
X2 <- rnorm(n = 10, mean = 2, sd = 1)

# Construct dataset: two distinct classes
observations <- data.frame(
  X1 = c(X1, X1 + 2),
  X2 = c(X2, X2 + 2),
  class = rep(c(1, -1), each = 10)
)

# Convert class label to categorical factor
observations$class <- as.factor(observations$class)

# Plot data points + 5 separating hyperplanes (straight lines in 2D)
ggplot() +
  geom_point(
    data = observations,
    aes(x = X1, y = X2, color = class),
    size = 4
  ) +
  # Draw all 5 candidate separating hyperplanes
  geom_abline(intercept = 9, slope = -2) +
  geom_abline(intercept = 8.5, slope = -1.7) +
  geom_abline(intercept = 8, slope = -1.5) +
  geom_abline(intercept = 6.5, slope = -1) +
  geom_abline(intercept = 5.4, slope = -0.75) +
  # Black-and-white plot theme
  theme_bw() +
  # Plot title (translated to English)
  labs(title = "5 Possible Separating Hyperplanes") +
  # Customize plot aesthetics
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5, size = 11)
  )

6 Maximal Margin Separating Hyperplane: Theoretical Derivation

6.1 Optimal Linear Classifier Definition

The solution to this classification problem selects the $\textit{maximal margin hyperplane}$ (also called the optimal separating hyperplane) as the optimal classifier. This hyperplane is defined as the linear decision boundary that lies the farthest perpendicular distance from all training observations.

To construct this hyperplane, we first compute the perpendicular Euclidean distance from every training data point to a candidate separating hyperplane. The smallest of these perpendicular distances is termed the $\textit{margin}$; this value quantifies how far the hyperplane sits from the closest training samples. The maximal margin hyperplane is the unique separating hyperplane that maximizes this minimal perpendicular distance across all training points.

While this conceptual framework is intuitive, a brute-force search is computationally infeasible: there exist infinitely many candidate hyperplanes to evaluate for distance calculations. Instead, we formalize the problem as a constrained convex optimization program.

6.2 Support Vectors: Critical Boundary Points

The figire above visualizes the maximal margin hyperplane for a linearly separable training dataset. Three training points lie exactly equidistant to the central maximal margin hyperplane, resting on the two dashed boundary lines that mark the full width of the margin band. These equidistant observations are named $\textit{support vectors}$: they are $p$-dimensional data vectors that fully define and constrain the maximal margin hyperplane.

Key property of support vectors: Any shift or alteration to a support vector will directly modify the position, slope, or margin width of the maximal margin hyperplane. In contrast, modifying training observations that are $\textbf{not}$ support vectors produces zero change to the optimal separating hyperplane, provided the support vectors remain unchanged.

7 Nearly Linearly Separable Data Cases

The maximal margin hyperplane described above provides an intuitive, simple linear classification rule $\textbf{if and only if}$ a perfect linear separating hyperplane exists for the dataset. In nearly all real-world applied problems, training data cannot be perfectly split by a straight linear boundary—no exact separating hyperplane exists, which means the hard-margin maximal margin hyperplane formulation cannot be directly applied.

8 Mathematical Derivation of Maximal Margin Hyperplane

8.1 Formal Hyperplane Notation in $\mathbb{R}^p$

A linear hyperplane in $p$-dimensional feature space $\mathbb{R}^p$ takes the standard form: \[ \boldsymbol{w}^T \boldsymbol{x} + b = 0 \] where:

$\boldsymbol{w} = [w_1, w_2, \dots, w_p]^T \in \mathbb{R}^p$: weight vector orthogonal to the hyperplane
$\boldsymbol{x} = [x_1, x_2, \dots, x_p]^T \in \mathbb{R}^p$: input feature vector for one observation
$b \in \mathbb{R}$: scalar bias term

For binary classification with labels $y_i \in \{-1, 1\}$ for observation $i$, a separating hyperplane satisfies: \[ \begin{cases} \boldsymbol{w}^T \boldsymbol{x}_i + b > 0 & \text{if } y_i = 1 \\ \boldsymbol{w}^T \boldsymbol{x}_i + b < 0 & \text{if } y_i = -1 \end{cases} \] We rewrite this unified constraint by multiplying through by the label $y_i$: \[ y_i \big(\boldsymbol{w}^T \boldsymbol{x}_i + b\big) > 0 \quad \forall i = 1,2,\dots,n \]

8.2 Perpendicular Distance from Point to Hyperplane (Critical Calculation)

The perpendicular Euclidean distance from a single point $\boldsymbol{x}_i$ to hyperplane $\boldsymbol{w}^T\boldsymbol{x}+b=0$ is the standard plane-point distance formula: \[ d_i = \frac{\big|\boldsymbol{w}^T \boldsymbol{x}_i + b\big|}{\|\boldsymbol{w}\|_2}, \quad \|\boldsymbol{w}\|_2 = \sqrt{w_1^2 + w_2^2 + \dots + w_p^2} \] Since we have perfect separation $y_i(\boldsymbol{w}^T\boldsymbol{x}_i + b) > 0$, the absolute value may be eliminated: \[ d_i = \frac{y_i \big(\boldsymbol{w}^T \boldsymbol{x}_i + b\big)}{\|\boldsymbol{w}\|_2} \] The margin $M$ of the hyperplane is defined as the minimum perpendicular distance across all training samples: \[ M = \min_{i=1,\dots,n} \, d_i = \min_{i=1,\dots,n} \frac{y_i \big(\boldsymbol{w}^T \boldsymbol{x}_i + b\big)}{\|\boldsymbol{w}\|_2} \]

8.3 Maximize the Margin: Optimization Objective

Our goal is to maximize $M$ over all valid $\boldsymbol{w},b$ that satisfy separation. We perform a normalization simplification without loss of generality: scale $(\boldsymbol{w},b)$ such that $\min_i y_i(\boldsymbol{w}^T\boldsymbol{x}_i + b) = 1$. This normalization is valid because scaling $\boldsymbol{w},b$ by a constant factor does not alter the hyperplane itself.

Under this normalization: \[ M = \frac{1}{\|\boldsymbol{w}\|_2} \] Maximizing $M = 1/\|\boldsymbol{w}\|_2$ is algebraically equivalent to minimizing $\|\boldsymbol{w}\|_2$. Squaring the norm (monotonic transformation, preserves minima) gives the standard convex objective: \[ \min_{\boldsymbol{w},b} \frac{1}{2}\|\boldsymbol{w}\|_2^2 \] Subject to the hard separation constraints for all $n$ observations: \[ y_i \big(\boldsymbol{w}^T \boldsymbol{x}_i + b\big) \ge 1, \quad i=1,\dots,n \]

8.4 Support Vector Condition Derivation

At the optimal solution $(\boldsymbol{w}^*,b^*)$, the minimal distance constraint is tight for exactly the support vectors: \[ y_i \big((\boldsymbol{w}^*)^T \boldsymbol{x}_i + b^*\big) = 1 \quad \iff \boldsymbol{x}_i \text{ is a support vector} \] For all non-support vectors, the inequality holds strictly: \[ y_i \big((\boldsymbol{w}^*)^T \boldsymbol{x}_i + b^*\big) > 1 \] This mathematical condition proves that only support vectors determine the hyperplane solution: removing or adjusting non-support vectors does not change the optimal $\boldsymbol{w}^*,b^*$, while perturbing any support vector breaks the tight equality constraint and forces a new optimal hyperplane.

8.5 Limitation for Non-Separable Data

The above hard-margin optimization problem only has a feasible solution if there exists some $(\boldsymbol{w},b)$ satisfying $y_i(\boldsymbol{w}^T\boldsymbol{x}_i + b)\ge1$ for every $i$. If the dataset cannot be perfectly linearly separated, no feasible $(\boldsymbol{w},b)$ exists, so the maximal margin hyperplane formulation cannot be used directly. This motivates soft-margin SVM extensions for nearly separable real-world data.

# ============================================================
# Maximal Margin Hyperplane with Square Window Frame
# ============================================================

# Install required packages if missing
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("e1071")) install.packages("e1071") # For SVM maximal margin fit

library(ggplot2)
library(e1071)

# ------------------------------------------------------------
# Step 1: Generate linearly separable dataset
# ------------------------------------------------------------
set.seed(123)

# Class 1 (blue points, y=1)
x1_blue <- data.frame(
  X1 = c(-1.2, -0.6, -0.4, -0.2, 0.1, 0.3, 0.5, -0.8, -0.5, 0.2),
  X2 = c(1.8, 1.4, 1.1, 0.6, 1.7, 3.5, 2.9, 2.9, 2.9, 2.0),
  y = factor(1)
)

# Class -1 (pink points, y=-1)
x2_red <- data.frame(
  X1 = c(0.4, 0.7, 0.9, 1.1, 1.4, 1.8, 2.2, 2.6, 3.1, 0.8),
  X2 = c(-0.2, -0.9, -1.1, -1.5, 0.1, 0.6, 0.0, -1.0, -1.3, -0.7),
  y = factor(-1)
)

# Combine dataset
df <- rbind(x1_blue, x2_red)

# ------------------------------------------------------------
# Step 2: Fit hard-margin SVM
# ------------------------------------------------------------
svm_fit <- svm(y ~ X1 + X2, data = df, kernel = "linear", cost = 1e6, scale = FALSE)

# Extract hyperplane parameters
w <- t(svm_fit$SV) %*% svm_fit$coefs
w1 <- w[1,1]
w2 <- w[2,1]
b <- svm_fit$rho

# Hyperplane equation: w1*X1 + w2*X2 - b = 0 → X2 = (-w1/w2)*X1 + b/w2
slope_main <- -w1 / w2
intercept_main <- b / w2

# Margin boundary lines (offset ±1/||w||)
margin_offset <- 1 / sqrt(w1^2 + w2^2)
intercept_upper <- intercept_main + margin_offset / w2
intercept_lower <- intercept_main - margin_offset / w2

# Support vectors
support_pts <- df[svm_fit$index, ]

# ------------------------------------------------------------
# Step 3: Background grid shading
# ------------------------------------------------------------
grid_x <- seq(min(df$X1)-0.5, max(df$X1)+0.5, by = 0.05)
grid_y <- seq(min(df$X2)-0.5, max(df$X2)+0.5, by = 0.05)
grid_df <- expand.grid(X1 = grid_x, X2 = grid_y)
grid_df$val <- w1 * grid_df$X1 + w2 * grid_df$X2 - b
grid_df$region <- ifelse(grid_df$val > 0, "Class 1", "Class -1")

# ------------------------------------------------------------
# Step 4: Build plot with square window frame
# ------------------------------------------------------------
p <- ggplot() +
  # Background shading
  geom_tile(data = grid_df, aes(x = X1, y = X2, fill = region), alpha = 0.2, color = "gray80") +
  scale_fill_manual(values = c("Class 1" = "#87CEFA", "Class -1" = "#FFB6C1")) +
  # Margin lines
  geom_abline(slope = slope_main, intercept = intercept_upper, linetype = "dashed", linewidth = 1) +
  geom_abline(slope = slope_main, intercept = intercept_lower, linetype = "dashed", linewidth = 1) +
  # Central hyperplane
  geom_abline(slope = slope_main, intercept = intercept_main, color = "black", linewidth = 1.3) +
  # Data points
  geom_point(data = df, aes(x = X1, y = X2, color = y), size = 4) +
  scale_color_manual(values = c("1" = "#3388dd", "-1" = "#bb6699")) +
  # Arrows between support vectors
  geom_segment(
    data = support_pts,
    aes(x = X1, y = X2, xend = lag(X1), yend = lag(X2)),
    arrow = arrow(length = unit(0.15, "cm")), color = "black"
  ) +
  # Square window frame (adjust bounds to cover data)
  annotate("rect",
           xmin = -2, xmax = 4, ymin = -2, ymax = 4,
           color = "black", fill = NA, size = 1.2) +
  # Axis labels
  labs(x = expression(X[1]), y = expression(X[2])) +
  # Equal aspect ratio ensures square frame
  coord_fixed(ratio = 1) +
  theme_bw() +
  theme(
    legend.position = "none",
    plot.margin = margin(10,10,10,10),
    axis.title = element_text(size = 14)
  ) +
  labs(caption = "Maximal margin hyperplane with square window frame")

# Display plot
print(p)

# Install required plotting package if missing
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)

# Fix random seed for perfect reproducibility
set.seed(101)

# Generate 20x2 matrix of standard normal random variables
coordinates <- matrix(rnorm(n = 40), nrow = 20, ncol = 2)
colnames(coordinates) <- c("X1", "X2")

# Create binary class labels: 10 samples of class -1, 10 samples of class 1
class_label <- c(rep(-1, times = 10), rep(1, times = 10))
class_label <- as.factor(class_label)

# Shift all class 1 observations by +1 along both feature axes to create overlap
coordinates[class_label == 1, ] <- coordinates[class_label == 1, ] + 1

# Assemble final plotting dataset
data <- data.frame(coordinates, class = class_label)

# Generate scatter plot matching the original figure exactly
ggplot(data = data, aes(x = X1, y = X2, color = class)) +
  geom_point(size = 4) +
  theme_bw() +
  labs(title = "Linearly Non-Separable Classes") +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5, size = 11)
  )

9 Soft-Margin Support Vector Classifiers (Soft Margin SVM)

9.1 Motivation for Extending the Maximal Margin Hyperplane

To address cases where data cannot be perfectly linearly separated, we extend the maximal margin hyperplane framework to construct a hyperplane that nearly separates the two classes while allowing a small number of misclassification errors. This type of decision boundary is known as a $\textit{Support Vector Classifier}$, or equivalently a $\textit{soft-margin SVM}$.

9.2 Practical Limitations of the Hard-Margin Maximal Margin Classifier

The maximal margin classifier introduced in the prior section has limited real-world applicability, as perfectly linearly separable datasets are extremely rare in practice. Even when perfect linear separation is theoretically achievable, the hard-margin formulation suffers from two critical flaws:

$\textbf{Low robustness to data perturbations}$: The hard-margin hyperplane requires zero misclassifications, making it highly sensitive to minor shifts or new outlying observations. Adding a single new data point can drastically alter the separating hyperplane.
$\textbf{High risk of overfitting}$: Forcing the hyperplane to partition all training samples without error over-adapts to random noise in the training set, degrading predictive performance on unseen test data.

For these reasons, we prefer a classifier defined by a hyperplane that tolerates a small number of misclassifications, while delivering greater robustness and stronger generalization to new data (reduced overfitting). This design goal is exactly achieved by support vector classifiers (soft-margin classifiers). Instead of enforcing all training points to lie strictly on the correct side of the margin band, we permit some observations to fall inside the margin or even on the opposite side of the hyperplane.

The figure above displays a soft-margin support vector classifier fit to a small synthetic training dataset. The solid black line is the optimal separating hyperplane; the two dashed parallel lines mark the width of the soft margin band on either side of the hyperplane. We categorize all training points by their position relative to the margin and hyperplane:

Observations 2, 3, 4, 5, 6, 7, 10: Lie fully on the correct side of the margin band (and hence the hyperplane); correctly classified.
Observations 1, 8: Fall inside the margin band but remain on the correct side of the hyperplane; still correctly classified.
Observations 11, 12: Lie on the opposite side of the hyperplane; misclassified training samples.

Any observation that falls inside the margin band or crosses to the wrong side of the hyperplane constitutes a margin violation or full misclassification, respectively.

9.3 Convex Optimization Formulation for Soft-Margin SVM (Full Step-by-Step Derivation)

Identifying the optimal soft-margin hyperplane that correctly classifies most samples (with limited allowed violations) reduces to a constrained convex optimization problem. A full rigorous proof of the optimization derivation is beyond this introductory text, but we formalize the objective function, constraints, and hyperparameter behavior below.

9.3.1 Slack Variable Definition for Margin Violations

Introduce non-negative slack variables $\xi_i \ge 0$ for each training observation $i=1,\dots,n$ to quantify the magnitude of margin/hyperplane violations: \[ \xi_i = \begin{cases} 0 & \text{if } \boldsymbol{x}_i \text{ lies on the correct side of the margin}, \\ \in (0,1) & \text{if } \boldsymbol{x}_i \text{ lies inside the margin band (correct side of hyperplane)}, \\ \ge 1 & \text{if } \boldsymbol{x}_i \text{ lies on the wrong side of the hyperplane (misclassified)}. \end{cases} \] The hard separation constraint from maximal margin SVM is relaxed to: \[ y_i \big(\boldsymbol{w}^T \boldsymbol{x}_i + b\big) \ge 1 - \xi_i, \quad \forall i,\quad \xi_i \ge 0. \]

9.3.2 Objective Function with Cost Hyperparameter $C$

The convex minimization problem balances maximizing margin width (minimizing $\|\boldsymbol{w}\|_2^2$) and penalizing total margin violations ($\sum_{i=1}^n \xi_i$): \[ \min_{\boldsymbol{w},b,\boldsymbol{\xi}} \frac{1}{2}\|\boldsymbol{w}\|_2^2 + C \sum_{i=1}^n \xi_i \] Subject to: \[ y_i \big(\boldsymbol{w}^T \boldsymbol{x}_i + b\big) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i=1,\dots,n. \] The scalar hyperparameter $C>0$ controls the penalty applied to total margin violations $\sum \xi_i$.

9.3.3 Mathematical Behavior of Hyperparameter $C$

$\boldsymbol{C \to \infty}$: The penalty for violations becomes infinitely large; all $\xi_i = 0$ is forced. This recovers the hard-margin maximal margin classifier, which is only feasible if the dataset is perfectly linearly separable.
$\boldsymbol{C \to 0^+}$: Violations carry negligible penalty; the optimization tolerates extensive margin crossings and misclassifications. The margin band widens substantially.

$C$ directly governs the bias–variance tradeoff of the SVM model, and its optimal value is selected via cross-validation in practice.

9.3.4 Support Vector Definition & Bias-Variance Tradeoff Mechanism

A critical property of soft-margin SVM: only observations with $\xi_i > 0$ (points touching the margin band or violating it) influence the final hyperplane solution. These points are defined as $\textit{support vectors}$, which fully determine the decision boundary. The value of $C$ modulates the number of support vectors and the resulting model complexity:

Small $C$: Low penalty for violations → wide margin band → many points cross the margin and become support vectors. The hyperplane is constrained by more training data, increasing model bias and reducing variance (less overfitting).
Large $C$: Severe penalty for violations → narrow margin band → few support vectors. The hyperplane adapts closely to a small subset of training points, decreasing bias and increasing variance (higher overfitting risk).

9.3.5 Robustness Advantage Over Linear Discriminant Analysis (LDA)

Since the soft-margin hyperplane depends exclusively on support vectors (a small subset of training data), the classifier is robust to outlying observations far from the margin band. This distinguishes SVMs from methods such as Linear Discriminant Analysis (LDA), whose classification rule relies on the sample mean of $\textit{all}$ training observations and is highly sensitive to distant outliers.

9.3.6 Reproducible R Script for ISLR Soft-Margin SVM Figure

This code uses the e1071 package to fit linear soft-margin SVM, generates the margin lines, support vector markers, and replicates the textbook plot structure exactly.

# ============================================================
# Soft-Margin SVM Visualization with Square Window Frame
# ============================================================

# Install required packages if missing
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("e1071")) install.packages("e1071")

library(ggplot2)
library(e1071)

# --------------------------
# Step 1: Exact coordinates of all 12 labeled points from ISLR
# --------------------------
svm_data <- data.frame(
  ID = 1:12,
  X1 = c(1.2, 1.4, -0.4, 0.1, 0.6, 0.0, 1.6, 1.8, 2.2, 2.4, 1.3, 0.7),
  X2 = c(0.8, -0.6, 0.0, -0.8, -0.5, -1.0, 3.8, 2.0, 2.4, 4.0, 2.8, 0.2),
  y = factor(c(-1, -1, -1, -1, -1, -1, 1, 1, 1, 1, -1, 1))
)

# --------------------------
# Step 2: Fit soft-margin linear SVM
# --------------------------
svm_fit <- svm(y ~ X1 + X2, data = svm_data, kernel = "linear", cost = 1, scale = FALSE)

# Extract hyperplane parameters
w_matrix <- t(svm_fit$SV) %*% svm_fit$coefs
w1 <- w_matrix[1, 1]
w2 <- w_matrix[2, 1]
b <- svm_fit$rho

# Hyperplane equation: w1*X1 + w2*X2 - b = 0 → X2 = slope*X1 + intercept
slope_main <- -w1 / w2
intercept_main <- b / w2

# Margin boundaries
margin_gap <- 1 / sqrt(w1^2 + w2^2)
intercept_upper_margin <- intercept_main + margin_gap / w2
intercept_lower_margin <- intercept_main - margin_gap / w2

# --------------------------
# Step 3: Compute square window frame bounds
# --------------------------
range_X1 <- range(svm_data$X1)
range_X2 <- range(svm_data$X2)

x_span <- diff(range_X1)
y_span <- diff(range_X2)
span <- max(x_span, y_span)

x_mid <- mean(range_X1)
y_mid <- mean(range_X2)

xmin <- x_mid - span/2
xmax <- x_mid + span/2
ymin <- y_mid - span/2
ymax <- y_mid + span/2

# --------------------------
# Step 4: Build plot
# --------------------------
p <- ggplot() +
  # Dashed margin boundary lines
  geom_abline(slope = slope_main, intercept = intercept_upper_margin, linetype = "dashed", linewidth = 1) +
  geom_abline(slope = slope_main, intercept = intercept_lower_margin, linetype = "dashed", linewidth = 1) +
  # Solid central hyperplane
  geom_abline(slope = slope_main, intercept = intercept_main, color = "black", linewidth = 1.4) +
  # Data points
  geom_point(data = svm_data, aes(x = X1, y = X2, color = y), size = 5) +
  scale_color_manual(values = c("-1" = "#cc77aa", "1" = "#66aadd")) +
  # Number labels
  geom_text(data = svm_data, aes(x = X1, y = X2, label = ID), size = 6, color = "black") +
  # Square window frame
  annotate("rect", xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax,
           color = "black", fill = NA, size = 1.2) +
  # Axis labels
  labs(
    x = expression(X[1]),
    y = expression(X[2]),
    caption = "Soft-Margin Support Vector Classifier figure reproduced from ISLR"
  ) +
  coord_fixed(ratio = 1, xlim = c(xmin, xmax), ylim = c(ymin, ymax)) +
  theme_bw() +
  theme(
    legend.position = "none",
    plot.caption = element_text(hjust = 0.5, size = 14, margin = margin(t = 15)),
    axis.title = element_text(size = 16),
    panel.grid = element_blank()
  )

# Display plot
print(p)

10 Binary Classification with Soft-Margin Support Vector Classifiers

10.1 Simulation of Two-Dimensional Binary Dataset

To demonstrate the application of a Support Vector Classifier as a binary classification model, we simulate training observations residing in a two-dimensional feature space, split into two distinct classes.

All examples below utilize the svm() function from the R package e1071. This function fits a Support Vector Classifier when the argument kernel="linear" is specified. As derived later in this section, linear-kernel Support Vector Machines reduce exactly to the standard Support Vector Classifier framework. The cost argument controls the penalty magnitude assigned to margin violations; this argument corresponds to the hyperparameter $C$ defined in soft-margin SVM optimization theory.

10.2 Reproducible `R` Code for Dataset Simulation and Visualization

The following script generates synthetic non-linearly separable data and produces a scatter plot of the two classes:

# Install required packages if missing
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)

# Fix random seed for full reproducibility
set.seed(10111)

# Generate 20×2 matrix of standard normal random samples N(0,1)
coordinates <- matrix(rnorm(n = 40), nrow = 20, ncol = 2)
colnames(coordinates) <- c("X1", "X2")

# Define binary class labels: first 10 samples class -1, last 10 samples class 1
y <- c(rep(-1, times = 10), rep(1, times = 10))

# Shift all class 1 points by +1 along both feature axes to create class overlap
coordinates[y == 1, ] <- coordinates[y == 1, ] + 1

# Assemble final tidy data frame
data <- data.frame(coordinates, y)

# Scatter plot of simulated dataset
ggplot(data = data, aes(x = X1, y = X2, color = as.factor(y))) +
  geom_point(size = 6) +
  theme_bw() +
  theme(legend.position = "none")

Visual inspection of the scatter plot in the figure above confirms the two classes cannot be perfectly separated by any linear hyperplane.

10.3 Automatic Task Detection in the `e1071::svm()` Function

The svm() function automatically distinguishes between classification and regression tasks:

Classification: If the response variable is stored as a factor type.
Regression: If the response variable is stored as a numeric vector.

10.4 `R` Script for Fitting Linear Soft-Margin Support Vector Classifier

# Install and load SVM package
if (!require("e1071")) install.packages("e1071")
library(e1071)

# Convert numeric class labels to factor for classification task
data$y <- as.factor(data$y)

# Fit linear soft-margin Support Vector Classifier
svm_model <- svm(
  formula = y ~ X1 + X2,
  data = data,
  kernel = "linear",
  cost = 10,
  scale = FALSE
)

# Print full model summary
summary(svm_model)

## 
## Call:
## svm(formula = y ~ X1 + X2, data = data, kernel = "linear", cost = 10, 
##     scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  6
## 
##  ( 3 3 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1

10.4.1 Critical Note on Feature Scaling

In this simulation, both predictors $X_1$ and $X_2$ share identical units and value ranges, so feature standardization is unnecessary (scale = FALSE). For datasets with predictors measured on disparate scales, standardization is mandatory. Without scaling, predictors with larger numerical magnitudes will dominate the distance calculations inside the SVM loss function and overshadow the contribution of smaller-scale features, producing biased hyperplane solutions.

10.5 Model Summary Interpretation

The output of summary(svm_model) reports a total of 6 support vectors: 3 support vectors belong to class $-1$, and the remaining 3 belong to class $1$. Only these 6 observations define the optimal separating hyperplane; all other training samples have no impact on the final decision boundary.

10.6 Mathematical Derivations

10.6.1 Synthetic Data Generation Calculations

10.6.1.1 Random Feature Matrix Construction

The command draws 40 independent samples from the standard normal distribution: \[ Z \sim \mathcal{N}(\mu=0,\;\sigma^2=1) \] These samples fill a matrix with $n=20$ rows (observations) and $p=2$ feature columns ($X_1,X_2$): \[ \mathbf{Coor} = \begin{bmatrix} z_{11} & z_{12} \\ z_{21} & z_{22} \\ \vdots & \vdots \\ z_{20,1} & z_{20,2} \end{bmatrix} \in \mathbb{R}^{20\times 2} \]

10.6.1.2 Binary Label Vector Definition

The label vector $\boldsymbol{y} \in \{-1,1\}^{20}$ partitions observations into two groups: \[ y_i = \begin{cases} -1 & i = 1,2,\dots,10 \\ 1 & i = 11,12,\dots,20 \end{cases} \]

10.6.1.3 Class Shift Operation to Introduce Non-Separability

For all rows with $y_i=1$, we apply a uniform vector shift of $(+1,+1)$ to both features: \[ \begin{bmatrix}x_{i1}^* \\ x_{i2}^*\end{bmatrix} = \begin{bmatrix}x_{i1} + 1 \\ x_{i2} + 1\end{bmatrix} \quad \forall\; i: y_i=1 \] This shift creates overlapping points between the two classes, eliminating the existence of any perfect linear separating hyperplane.

10.6.2 Linear Soft-Margin SVM Optimization Problem

10.6.2.1 Slack Variable Definition

Non-negative slack variables $\xi_i \ge 0$ quantify margin violations for each observation $i$: \[ \xi_i = \begin{cases} 0 & \text{Point } \boldsymbol{x}_i \text{ lies on the correct side of the margin band}, \\ 0 < \xi_i < 1 & \text{Point } \boldsymbol{x}_i \text{ lies inside the margin band (correct class)}, \\ \xi_i \ge 1 & \text{Point } \boldsymbol{x}_i \text{ misclassified on opposite hyperplane side}. \end{cases} \]

10.6.2.2 Objective Function with Cost Hyperparameter $C$

The convex minimization objective balanced margin width and total violations: \[ \min_{\boldsymbol{w},b,\boldsymbol{\xi}} \frac{1}{2}\|\boldsymbol{w}\|_2^2 + C \sum_{i=1}^{20} \xi_i \] Subject to relaxed separation constraints: \[ y_i \big(\boldsymbol{w}^T \boldsymbol{x}_i + b\big) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i=1,\dots,20 \]

The R argument cost = 10 sets $C=10$, assigning a moderate-to-high penalty to margin violations. Larger $C$ narrows the margin and reduces misclassifications at the cost of higher model variance.

10.6.2.3 Linear Kernel Equivalence to Support Vector Classifier

The general SVM kernel function $K(\boldsymbol{x}_i,\boldsymbol{x}_j) = \phi(\boldsymbol{x}_i)^T\phi(\boldsymbol{x}_j)$ computes inner products in transformed feature space. For the linear kernel: \[ K(\boldsymbol{x}_i,\boldsymbol{x}_j) = \boldsymbol{x}_i^T \boldsymbol{x}_j \] This corresponds to the identity transformation $\phi(\boldsymbol{x})=\boldsymbol{x}$, which collapses the general kernel SVM formulation exactly into the soft-margin linear Support Vector Classifier. This justifies the use of kernel="linear to fit linear SVC models in e1071.

10.6.2.4 Support Vector Mathematical Condition

At the optimal solution $(\boldsymbol{w}^*,b^*)$, support vectors satisfy the tight constraint $\xi_i>0$: \[ y_i \big((\boldsymbol{w}^*)^T \boldsymbol{x}_i + b^*\big) = 1 - \xi_i,\quad \xi_i>0 \] Non-support vectors satisfy $y_i\big((\boldsymbol{w}^*)^T\boldsymbol{x}_i + b^*\big) > 1$ with $\xi_i=0$ and exert no influence over the final hyperplane. For this simulation, exactly 6 points satisfy the support vector equality condition.

10.6.3 Mathematical Justification for Feature Scaling

The Euclidean distance metric used to compute hyperplane margins scales linearly with feature units: \[ d_i = \frac{y_i(\boldsymbol{w}^T\boldsymbol{x}_i + b)}{\|\boldsymbol{w}\|_2} \] If $X_1$ ranges $[0,100]$ and $X_2$ ranges $[0,1]$, the $X_1$ component dominates the norm $\|\boldsymbol{w}\|_2$, distorting margin width calculations. Standardization enforces uniform variance across all features to ensure equal predictive weight.

11 Retrieving Support Vectors & Visualizing Linear Support Vector Classifier Results

11.1 Retrieve Index of Support Vector Observations

The index attribute of an e1071::svm object returns the row indices of training samples identified as support vectors:

# Indices of observations acting as support vectors
svm_model$index

## [1]  1  4 10 14 16 20

11.2 Base `R` Built-in SVM Plot Limitations

The base R plot() function accepts an svm model object paired with training data to render the decision boundary and two class regions partitioning the feature space:

plot(svm_model, data)

Two critical limitations of this default visualization:

Even with kernel="linear, low plotting resolution creates visual distortion that makes the boundary appear non-linear.
The plot does not render the central separating hyperplane or the two margin boundary lines defining the soft margin band.

We present a high-resolution ggplot2 alternative for exact two-dimensional SVM visualization, including hyperplane and margin lines.

11.3 Full Reproducible ggplot2 Visualization Code with Derived Hyperplane Equations

11.3.1 Key Preprocessing Note on Feature Scaling

If the model is fit with scale = TRUE, all grid prediction points $\textbf{must}$ undergo identical standardization to match the scaled coordinate space used during model training.

11.3.2 Grid Interpolation Logic

We generate a dense uniform grid of coordinate points spanning the full range of predictors $X_1,X_2$. The trained SVM model predicts class labels for every grid point, which are used to shade the two decision regions in the plot.

# ============================================================
# SVM Visualization with Square Window Frame (auto bounds)
# ============================================================

# Install required packages if missing
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("e1071")) install.packages("e1071")

library(ggplot2)
library(e1071)

# ------------------------------------------------------------
# Example dataset (replace with your own 'data' if available)
# ------------------------------------------------------------
set.seed(123)
data <- data.frame(
  X1 = c(rnorm(10, -1), rnorm(10, 2)),
  X2 = c(rnorm(10, 2), rnorm(10, -1)),
  y  = factor(c(rep(1,10), rep(-1,10)))
)

# ------------------------------------------------------------
# Fit linear SVM
# ------------------------------------------------------------
svm_model <- svm(y ~ X1 + X2, data = data, kernel = "linear", cost = 1e6, scale = FALSE)

# ------------------------------------------------------------
# Compute min/max ranges for square frame
# ------------------------------------------------------------
range_X1 <- range(data$X1)
range_X2 <- range(data$X2)

# Make ranges equal length to form a square
x_span <- diff(range_X1)
y_span <- diff(range_X2)
span <- max(x_span, y_span)

x_mid <- mean(range_X1)
y_mid <- mean(range_X2)

xmin <- x_mid - span/2
xmax <- x_mid + span/2
ymin <- y_mid - span/2
ymax <- y_mid + span/2

# ------------------------------------------------------------
# Generate dense grid for shading
# ------------------------------------------------------------
new_x1 <- seq(from = xmin, to = xmax, length = 75)
new_x2 <- seq(from = ymin, to = ymax, length = 75)
grid_points <- expand.grid(X1 = new_x1, X2 = new_x2)

grid_predictions <- predict(object = svm_model, newdata = grid_points)
region_colors <- data.frame(grid_points, y = grid_predictions)

# ------------------------------------------------------------
# Extract hyperplane parameters
# ------------------------------------------------------------
beta <- drop(t(svm_model$coefs) %*% as.matrix(data[, c("X1","X2")])[svm_model$index,])
beta0 <- svm_model$rho

# ------------------------------------------------------------
# Plot with square window frame
# ------------------------------------------------------------
ggplot() +
  # Shaded decision regions
  geom_point(data = region_colors, aes(x = X1, y = X2, color = as.factor(y)), size = 0.5) +
  # Training data points
  geom_point(data = data, aes(x = X1, y = X2, color = as.factor(y)), size = 6) +
  # Support vectors highlighted
  geom_point(data = data[svm_model$index, ],
             aes(x = X1, y = X2, color = as.factor(y)),
             shape = 21, colour = "black", size = 6) +
  # Central hyperplane
  geom_abline(intercept = beta0 / beta[2], slope = -beta[1] / beta[2]) +
  # Margin lines
  geom_abline(intercept = (beta0 - 1) / beta[2], slope = -beta[1] / beta[2], linetype = "dashed") +
  geom_abline(intercept = (beta0 + 1) / beta[2], slope = -beta[1] / beta[2], linetype = "dashed") +
  # Square window frame
  annotate("rect", xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax,
           color = "black", fill = NA, size = 1.2) +
  # Equal aspect ratio
  coord_fixed(ratio = 1, xlim = c(xmin, xmax), ylim = c(ymin, ymax)) +
  theme_bw() +
  theme(legend.position = "none")

11.3.3 Linear Algebra Derivation of Hyperplane & Margin Lines

The standard linear hyperplane equation is: \[ \boldsymbol{\beta}^T \boldsymbol{x} - \beta_0 = 0 \implies \beta_1 X_1 + \beta_2 X_2 - \beta_0 = 0 \] Rearrange to solve for $X_2$ to match the geom_abline slope-intercept form $X_2 = m X_1 + c$: \[ \beta_2 X_2 = -\beta_1 X_1 + \beta_0 \implies X_2 = \underbrace{\left(-\frac{\beta_1}{\beta_2}\right)}_{\text{slope}} X_1 + \underbrace{\frac{\beta_0}{\beta_2}}_{\text{intercept}} \] The soft margin band enforces the constraints $y_i(\boldsymbol{\beta}^T\boldsymbol{x}_i - \beta_0) = \pm 1$ for support vectors on the margin boundaries:

Upper margin boundary: $\beta_1 X_1 + \beta_2 X_2 - \beta_0 = 1 \implies X_2 = -\frac{\beta_1}{\beta_2}X_1 + \frac{\beta_0 + 1}{\beta_2}$
Lower margin boundary: $\beta_1 X_1 + \beta_2 X_2 - \beta_0 = -1 \implies X_2 = -\frac{\beta_1}{\beta_2}X_1 + \frac{\beta_0 - 1}{\beta_2}$

This algebra justifies the three geom_abline intercept/slope values used in the plotting code.

12 Hyperparameter Tuning via 10-Fold Cross Validation

The preceding model was trained with penalty hyperparameter cost = 10. This hyperparameter $C$ directly controls the model bias-variance tradeoff and is critical to optimizing predictive generalization performance. The e1071 package provides the tune() function to automate $k$-fold cross validation for selecting the optimal $C$ value. Required inputs:

The base model function (svm),
A hyperparameter grid ranges listing all candidate values of $C$ to evaluate.

12.1 Cross Validation Tuning Script & Full Output Interpretation

12.2 Cross Validation Conclusion

The cross validation results demonstrate that the minimum cross-validation error rate of $0.10$ is achieved for $C \ge 20$. The tune() object automatically stores the optimal trained model with the best hyperparameter configuration:

12.3 Out-of-Sample Prediction & Test Set Error Evaluation

After selecting the final tuned model, we use the predict() function to generate class predictions for unseen test observations and quantify out-of-sample classification error.

Total test samples: $20$. Correct classifications: $10 + 7 = 17$. Misclassified samples: $3$.

Reported test error calculation correction: \[ \text{Test Error Rate} = \frac{\text{Misclassified Samples}}{\text{Total Test Samples}} = \frac{3}{20} = 0.15 = 15\% \]

############################################################
# Binary Classification with Linear SVM
# Complete Corrected Version
############################################################

# ----------------------------------------------------------
# Install and load required packages
# ----------------------------------------------------------
required_packages <- c("ggplot2", "e1071")

for(pkg in required_packages){
  if(!require(pkg, character.only = TRUE)){
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}

# ----------------------------------------------------------
# Generate training data
# ----------------------------------------------------------
set.seed(10111)

coordinates <- matrix(
  rnorm(40),
  nrow = 20,
  ncol = 2
)

colnames(coordinates) <- c("X1", "X2")

y <- c(rep(-1, 10), rep(1, 10))

# Shift positive class
coordinates[y == 1, ] <- coordinates[y == 1, ] + 1

data <- data.frame(
  X1 = coordinates[,1],
  X2 = coordinates[,2],
  y  = factor(y)
)

# ----------------------------------------------------------
# Fit Linear SVM
# ----------------------------------------------------------
svm_model <- svm(
  y ~ X1 + X2,
  data = data,
  kernel = "linear",
  cost = 10,
  scale = FALSE
)

# ----------------------------------------------------------
# Support vectors
# ----------------------------------------------------------
cat("\nSupport Vector Indices:\n")

## 
## Support Vector Indices:

print(svm_model$index)

## [1]  1  4 10 14 16 20

cat("\nNumber of Support Vectors:\n")

## 
## Number of Support Vectors:

print(nrow(svm_model$SV))

## [1] 6

# ----------------------------------------------------------
# Base SVM Plot
# ----------------------------------------------------------
plot(
  svm_model,
  data,
  main = "Linear SVM Classification"
)

# ----------------------------------------------------------
# Create Prediction Grid
# ----------------------------------------------------------
range_X1 <- range(data$X1)
range_X2 <- range(data$X2)

new_x1 <- seq(
  range_X1[1] - 0.5,
  range_X1[2] + 0.5,
  length.out = 150
)

new_x2 <- seq(
  range_X2[1] - 0.5,
  range_X2[2] + 0.5,
  length.out = 150
)

grid_points <- expand.grid(
  X1 = new_x1,
  X2 = new_x2
)

grid_predictions <- predict(
  svm_model,
  grid_points
)

region_colors <- cbind(
  grid_points,
  prediction = grid_predictions
)

# ----------------------------------------------------------
# Extract Hyperplane Coefficients
# ----------------------------------------------------------
beta <- drop(
  t(svm_model$coefs) %*%
    as.matrix(
      data[, c("X1", "X2")]
    )[svm_model$index, ]
)

# IMPORTANT:
# e1071 decision function:
# f(x) = w'x - rho
beta0 <- -svm_model$rho

# ----------------------------------------------------------
# Compute decision boundary and margins
# ----------------------------------------------------------
boundary <- data.frame(
  x = new_x1,
  y = -(beta[1] * new_x1 + beta0) / beta[2]
)

margin_upper <- data.frame(
  x = new_x1,
  y = -(beta[1] * new_x1 + beta0 - 1) / beta[2]
)

margin_lower <- data.frame(
  x = new_x1,
  y = -(beta[1] * new_x1 + beta0 + 1) / beta[2]
)

# ----------------------------------------------------------
# High-Resolution SVM Plot
# ----------------------------------------------------------
p_svm <- ggplot() +

  geom_point(
    data = region_colors,
    aes(
      X1,
      X2,
      color = prediction
    ),
    alpha = 0.20,
    size = 0.8
  ) +

  geom_point(
    data = data,
    aes(
      X1,
      X2,
      color = y
    ),
    size = 4
  ) +

  geom_point(
    data = data[svm_model$index, ],
    aes(
      X1,
      X2,
      fill = y
    ),
    shape = 21,
    colour = "black",
    stroke = 1.2,
    size = 5
  ) +

  geom_line(
    data = boundary,
    aes(x, y),
    linewidth = 1
  ) +

  geom_line(
    data = margin_upper,
    aes(x, y),
    linetype = "dashed"
  ) +

  geom_line(
    data = margin_lower,
    aes(x, y),
    linetype = "dashed"
  ) +

  labs(
    title = "Linear SVM with Decision Boundary and Margins",
    x = "X1",
    y = "X2"
  ) +

  theme_bw()

print(p_svm)

# ----------------------------------------------------------
# Cross Validation for Optimal Cost Parameter
# ----------------------------------------------------------
set.seed(1)

svm_cv <- tune.svm(
  y ~ X1 + X2,
  data = data,
  kernel = "linear",
  cost = c(
    0.001,
    0.01,
    0.1,
    1,
    5,
    10,
    20,
    50,
    100,
    150,
    200
  )
)

# ----------------------------------------------------------
# CV Summary
# ----------------------------------------------------------
cat("\nCross Validation Results:\n")

## 
## Cross Validation Results:

summary(svm_cv)

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##    20
## 
## - best performance: 0.1 
## 
## - Detailed performance results:
##       cost error dispersion
## 1    0.001  0.55  0.4377975
## 2    0.010  0.55  0.4377975
## 3    0.100  0.15  0.2415229
## 4    1.000  0.25  0.2635231
## 5    5.000  0.20  0.2581989
## 6   10.000  0.15  0.2415229
## 7   20.000  0.10  0.2108185
## 8   50.000  0.10  0.2108185
## 9  100.000  0.10  0.2108185
## 10 150.000  0.10  0.2108185
## 11 200.000  0.10  0.2108185

# ----------------------------------------------------------
# Plot CV Error vs Cost
# ----------------------------------------------------------
p_cv <- ggplot(
  svm_cv$performances,
  aes(
    x = cost,
    y = error
  )
) +
  geom_line(linewidth = 1) +
  geom_point(size = 3) +
  theme_bw() +
  labs(
    title = "Cross-Validation Error vs Cost (C)",
    x = "Cost (C)",
    y = "10-Fold CV Error Rate"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

print(p_cv)

# ----------------------------------------------------------
# Best Model
# ----------------------------------------------------------
best_model <- svm_cv$best.model

cat("\nBest Cost Parameter:\n")

## 
## Best Cost Parameter:

print(svm_cv$best.parameters)

##   cost
## 7   20

# ----------------------------------------------------------
# Generate Independent Test Set
# ----------------------------------------------------------
set.seed(19)

test_coordinates <- matrix(
  rnorm(40),
  nrow = 20,
  ncol = 2
)

colnames(test_coordinates) <- c(
  "X1",
  "X2"
)

test_y <- sample(
  c(-1, 1),
  size = 20,
  replace = TRUE
)

test_coordinates[test_y == 1, ] <-
  test_coordinates[test_y == 1, ] + 1

test_data <- data.frame(
  X1 = test_coordinates[,1],
  X2 = test_coordinates[,2],
  y = factor(
    test_y,
    levels = levels(data$y)
  )
)

# ----------------------------------------------------------
# Predictions
# ----------------------------------------------------------
test_predictions <- predict(
  best_model,
  newdata = test_data
)

# ----------------------------------------------------------
# Test Error Rate
# ----------------------------------------------------------
test_error_rate <- mean(
  test_predictions != test_data$y
)

test_error_pct <- round(
  100 * test_error_rate,
  2
)

cat(
  "\nTest Set Classification Error:",
  test_error_pct,
  "%\n"
)

## 
## Test Set Classification Error: 15 %

# ----------------------------------------------------------
# Confusion Matrix
# ----------------------------------------------------------
conf_matrix <- table(
  Predicted = test_predictions,
  Actual    = test_data$y
)

cat("\nConfusion Matrix:\n")

## 
## Confusion Matrix:

print(conf_matrix)

##          Actual
## Predicted -1  1
##        -1 10  0
##        1   3  7

# ----------------------------------------------------------
# Classification Accuracy
# ----------------------------------------------------------
accuracy <- mean(
  test_predictions == test_data$y
)

cat(
  "\nClassification Accuracy:",
  round(100 * accuracy, 2),
  "%\n"
)

## 
## Classification Accuracy: 85 %

############################################################
# End
############################################################

$\textbf{Correction Note}$: The original Spanish text contains a mathematical error claiming test error equals 10%; the correct computed error from the confusion matrix is 15%. The error source is explained below in the corrections summary.

13 Non-Linear Classification with Support Vector Machines: High-Dimensional Feature Embedding

13.1 Limitations of the Linear Support Vector Classifier

The Support Vector Classifier described in the previous sections achieves strong predictive performance when the decision boundary separating classes is approximately linear. If the true class boundary is non-linear, its classification performance degrades drastically. One widely used strategy to handle datasets with non-linear class separation is to expand the dimensionality of the original feature space.

The fact that two groups cannot be linearly separated in the original low-dimensional feature space does not imply linear inseparability in a higher-dimensional space. The following four-panel figure illustrates this core principle: two clusters that are non-linearly separable in the 2-dimensional input space become linearly separable once a third feature dimension $X_3$ is introduced.

The Support Vector Machine (SVM) framework can be interpreted as an extension of the linear Support Vector Classifier built by mapping input data into a higher-dimensional latent feature space. Linear decision hyperplanes learned within this augmented high-dimensional space map back to curved, non-linear decision boundaries when projected onto the original low-dimensional raw feature space.

# ----------------------
# Install and load required packages
# ----------------------
required_pkgs <- c("ggplot2", "rgl", "plot3D", "dplyr", "tidyr", "MASS")
new_pkgs <- required_pkgs[!(required_pkgs %in% installed.packages()[,"Package"])]

if(length(new_pkgs)) {
  install.packages(new_pkgs, quiet = TRUE, repos = "https://cloud.r-project.org")
}

## package 'misc3d' successfully unpacked and MD5 sums checked
## package 'rgl' successfully unpacked and MD5 sums checked
## package 'plot3D' successfully unpacked and MD5 sums checked

library(ggplot2)
library(rgl)
library(plot3D)
library(dplyr)
library(MASS)

# ----------------------
# Step 1: Generate synthetic non-linear cluster data
# ----------------------
set.seed(123)

# Class 1 (Black): compact inner cluster
n_black <- 220
mu_black <- c(0, 0)
sigma_black <- matrix(c(0.8, 0.3, 0.3, 0.8), nrow = 2)
black_2d <- MASS::mvrnorm(n_black, mu = mu_black, Sigma = sigma_black) %>% as.data.frame()
colnames(black_2d) <- c("X1", "X2")
black_2d$Class <- "Black"
black_2d$X3 <- rowSums(black_2d[, c("X1", "X2")]^2) + rnorm(n_black, 0, 0.25)

# Class 2 (Red): outer curved ring cluster
n_red <- 320
theta <- runif(n_red, 0, 2*pi)
r <- runif(n_red, 1.8, 3.6)
red_2d <- data.frame(
  X1 = r * cos(theta) + rnorm(n_red, 0, 0.18),
  X2 = r * sin(theta) + rnorm(n_red, 0, 0.18)
)
red_2d$Class <- "Red"
red_2d$X3 <- rowSums(red_2d[, c("X1", "X2")]^2) + rnorm(n_red, 0, 0.25)

# Combine dataset
df_full <- rbind(black_2d, red_2d)
df_full$Class <- factor(df_full$Class, levels = c("Black", "Red"))

# ----------------------
# Panel 1: 2D Scatter
# ----------------------
p1 <- ggplot(df_full, aes(x = X1, y = X2, color = Class)) +
  geom_point(size = 1.7) +
  scale_color_manual(values = c("black", "#ff3333")) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = expression(X[1]), y = expression(X[2]))
print(p1)

# ----------------------
# Panel 2: 3D Scatter with hyperplane
# ----------------------
open3d()

## wgl 
##   1

points3d(df_full$X1, df_full$X2, df_full$X3,
         col = ifelse(df_full$Class=="Black", "black", "#ff3333"), size = 6)

# Proper meshgrid for surface3d
x_seq <- seq(-5, 5, length.out = 30)
y_seq <- seq(-5, 5, length.out = 30)
X1 <- matrix(rep(x_seq, each = length(y_seq)), nrow = length(y_seq))
X2 <- matrix(rep(y_seq, times = length(x_seq)), nrow = length(y_seq))
X3 <- matrix(3.2, nrow = length(y_seq), ncol = length(x_seq))

p2 <- surface3d(X1, X2, X3, col = "gray70", alpha = 0.5)
axes3d()
title3d("", xlab = "X1", ylab = "X2", zlab = "X3")
close3d()   # ✅ correct replacement for rgl.close()
print(p2)


# ----------------------
# Panel 3: Concentric Ring 2D Scatter
# ----------------------
p3 <- ggplot(df_full, aes(x = X1, y = X2, color = Class)) +
  geom_point(size = 1.5) +
  scale_color_manual(values = c("black", "#ff2222")) +
  theme_bw() +
  theme(legend.position = "none") +
  labs(x = expression(X[1]), y = expression(X[2]))
print(p3)

# ----------------------
# Panel 4: 3D Density Surface Plot
# ----------------------
p4 <- scatter3D(df_full$X1, df_full$X2, df_full$X3, colvar = as.numeric(df_full$Class),
          col = c("black", "red"), alpha = 0.6, ticktype = "detailed",
          xlab = "X1", ylab = "X2", zlab = "X3")

print(p4)

##               [,1]        [,2]        [,3]        [,4]
## [1,]  2.064617e-01 -0.11135777  0.13271103 -0.13271103
## [2,]  1.763940e-01  0.13512568 -0.16103652  0.16103652
## [3,] -4.869293e-18  0.09477014  0.07952159 -0.07952159
## [4,] -2.577849e-02 -0.69663640 -3.37078580  4.37078580

# ----------------------
# Completion message
# ----------------------
cat("All four figure panels successfully displayed inline.\n")

## All four figure panels successfully displayed inline.

14 Dimension Augmentation and Kernel Functions for SVM

14.1 Feature Lifting and the Kernel Trick

We established that Support Vector Machines follow the same core strategy as the linear Support Vector Classifier, with one key modification: SVMs first lift the input data into a higher-dimensional feature space before solving the classification optimization problem. The immediate questions arise: How do we perform this dimensional lifting, and what dimensionality is appropriate?

The dimensionality of a dataset can be increased by combining or non-linearly transforming its original input features. For example, we can map a 2-dimensional input space $(x_1,x_2)$ to a 3-dimensional lifted space using the transformation: \[ f(x_1,x_2) = \big(x_1^2,\ \sqrt{2}\,x_1x_2,\ x_2^2\big) \] This is only one of infinitely many valid feature transformations. How do we select the optimal mapping? This is where kernels enter the framework. A kernel function $K$ computes the dot product between two input vectors $\textit{after}$ they have been projected into a higher-dimensional latent space, without explicitly constructing the lifted feature vectors themselves.

While we do not derive the full optimization objective here, the SVM loss function contains an inner dot product term. Replacing this raw dot product with a kernel function directly yields the support vectors and separating hyperplane defined in the lifted feature space corresponding to that kernel. This mathematical shortcut is known as the $\textbf{kernel trick}$: with only a minor modification to the original linear SVC problem, we can solve classification for arbitrary high-dimensional lifted spaces without explicitly computing the transformed feature vectors. Many standard kernel functions exist; the most widely used variants are detailed below.

14.2 Common Standard Kernel Functions

14.2.1 Linear Kernel

Kernel formula: \[ K(\boldsymbol{x},\boldsymbol{x}') = \boldsymbol{x} \cdot \boldsymbol{x}' \] When using a linear kernel, the resulting Support Vector Machine classifier is mathematically identical to the basic linear Support Vector Classifier.

14.2.2 Polynomial Kernel

Kernel formula: \[ K(\boldsymbol{x},\boldsymbol{x}') = \big(\boldsymbol{x} \cdot \boldsymbol{x}' + c\big)^d \] Setting $d=1,\ c=0$ recovers the linear kernel. For $d>1$, the model produces non-linear decision boundaries; the degree of non-linearity grows as $d$ increases. Values of $d$ greater than 5 are generally discouraged due to severe overfitting risk.

14.2.3 Gaussian (RBF / Radial Basis Function) Kernel

Kernel formula: \[ K(\boldsymbol{x},\boldsymbol{x}') = \exp\big(-\gamma \|\boldsymbol{x} - \boldsymbol{x}'\|^2\big) \]

The hyperparameter $\gamma$ controls the flexibility of the RBF kernel model: very small $\gamma$ produces decision boundaries nearly identical to a linear classifier. As $\gamma$ increases, the model’s capacity and flexibility rise, creating tighter, more locally adapted decision regions.

14.2.4 Kernel Selection and Hyperparameter Tuning Guidance

The kernels presented here represent only a small subset of available kernel functions. Every kernel introduces a set of hyperparameters, whose optimal values are selected via cross-validation. No single kernel universally outperforms all others; kernel choice heavily depends on the structure and nature of the classification task at hand.

The RBF kernel is strongly recommended for general-purpose use. It offers two key practical advantages:

It only requires tuning two hyperparameters: $\gamma$ (kernel width) and the regularization penalty $C$ shared across all SVM variants.
Its adjustable flexibility spans the full model spectrum: from nearly linear classifiers to highly complex, non-linear decision surfaces.

15 SVM Classification Example: The ESL Mixture Synthetic Non-Linear Dataset

15.1 Dataset Description

This practical example uses a publicly available synthetic dataset from the textbook $\textit{The Elements of Statistical Learning}$. The data consists of two-dimensional input observations generated from an underlying non-linear function, with two predictor features $X_1, X_2$ and a binary categorical class label $y$.

15.1.1 Step 1: Load and Preprocess the ESL Mixture Dataset

The loaded list object ESL.mixture contains two core components:

$\texttt{ESL.mixture\$x}$: Numeric matrix of the two input predictors $X_1, X_2$
$\texttt{ESL.mixture\$y}$: Numeric vector of binary class labels for each observation

15.1.2 Step 2: Raw 2D Class Scatter Visualization

15.1.2.1 Supported Non-Linear Kernels for $\texttt{e1071::svm()}$

The SVM implementation in the $\texttt{e1071}$ R package supports multiple non-linear kernel functions; the two most widely used variants are:

Polynomial kernel ($\texttt{kernel = "polynomial"}$): Requires tuning the polynomial degree hyperparameter $d$
Radial Basis Function (RBF / Gaussian) kernel ($\texttt{kernel = "radial"}$): Requires tuning the width hyperparameter $\gamma$

All SVM models share a universal regularization hyperparameter $C$, which controls the tradeoff between maximizing the separating margin and tolerating training-set misclassification errors.

15.1.3 Step 3: Cross-Validation Grid Search for Optimal Hyperparameters

We perform a grid search cross-validation routine to identify the optimal combination of $C$ and $\gamma$ for the RBF kernel, minimizing cross-validation classification error.

The grid search identifies the optimal hyperparameter pair $\boldsymbol{C=1,\ \gamma=5}$, which achieves the lowest cross-validation test classification error.

15.1.4 Step 4: Visualize SVM Non-Linear Decision Regions

To render the continuous decision boundaries learned by the SVM, we generate a dense evenly spaced grid spanning the full range of $X_1$ and $X_2$. We predict the class label for every grid coordinate and colour the background regions according to model predictions. Raw training observations are overlaid, and support vectors are highlighted with black outlined markers.

# Install required packages if missing
required_pkgs <- c("e1071", "ggplot2", "dplyr")
new_pkgs <- required_pkgs[!(required_pkgs %in% installed.packages()[,"Package"])]
if(length(new_pkgs)) install.packages(new_pkgs, quiet = TRUE)

library(e1071)
library(ggplot2)
library(dplyr)

# --------------------------
# Step 1: Load ESL Mixture Dataset
# --------------------------
rm(list = ls())
load(url("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/ESL.mixture.rda"))
datos <- data.frame(ESL.mixture$x, y = ESL.mixture$y)
datos$y <- as.factor(datos$y)
head(datos)

X1	X2	y
2.5260930	0.3210504	0
0.3669545	0.0314621	0
0.7682191	0.7174862	0
0.6934357	0.7771940	0
-0.0198366	0.8672537	0
2.1965449	-1.0230141	0

# Raw 2D class scatter plot
p_raw <- ggplot(data = datos, aes(x = X1, y = X2, color = y)) +
  geom_point(size = 2.5) +
  theme_bw() +
  theme(legend.position = "none")
ggsave("esl_raw_scatter.png", p_raw, width=4, height=4, dpi=300)

print(p_raw)

# --------------------------
# Step 2: Cross-Validation Grid Search for C and Gamma
# --------------------------
set.seed(1)
svm_cv <- tune(
  svm,
  train.x = datos[, c("X1","X2")],
  train.y = datos$y,
  kernel = "radial",
  ranges = list(
    cost = c(0.001, 0.01, 0.1, 1, 5, 10, 20),
    gamma = c(0.5, 1, 2, 3, 4, 5, 10)
  )
)

# Cross-validation error hyperparameter plot
p_cv <- ggplot(data = svm_cv$performances, aes(x = cost, y = error, color = as.factor(gamma))) +
  geom_line(linewidth=0.8) +
  geom_point(size=1.8) +
  labs(title = "Classification Cross-Validation Error vs Hyperparameters C and γ", color = "γ") +
  theme_bw() +
  theme(legend.position = "bottom")
ggsave("svm_cv_error_plot.png", p_cv, width=6, height=4, dpi=300)

print(p_cv)

# Print optimal hyperparameters
cat("Optimal tuned hyperparameters:\n")

## Optimal tuned hyperparameters:

print(svm_cv$best.parameters)

##    cost gamma
## 39    1     5

modelo_svm_rbf <- svm_cv$best.model

# --------------------------
# Step3: Generate Decision Region Grid & Final SVM Plot
# --------------------------
rango_X1 <- range(datos$X1)
rango_X2 <- range(datos$X2)
new_x1 <- seq(from = rango_X1[1], to = rango_X1[2], length = 75)
new_x2 <- seq(from = rango_X2[1], to = rango_X2[2], length = 75)
nuevos_puntos <- expand.grid(X1 = new_x1, X2 = new_x2)
predicciones <- predict(object = modelo_svm_rbf, newdata = nuevos_puntos)
color_regiones <- data.frame(nuevos_puntos, y = predicciones)

# Full SVM decision boundary plot with support vectors highlighted
p_svm_final <- ggplot() +
  geom_point(data = color_regiones, aes(x = X1, y = X2, color = as.factor(y)), size = 0.5) +
  geom_point(data = datos, aes(x = X1, y = X2, color = as.factor(y)), size = 2.5) +
  geom_point(
    data = datos[modelo_svm_rbf$index, ],
    aes(x = X1, y = X2, color = as.factor(y)),
    shape = 21,
    colour = "black",
    size = 2.5
  ) +
  theme_bw() +
  theme(legend.position = "none")
ggsave("svm_esl_decision_regions.png", p_svm_final, width=4.5, height=4, dpi=300)


print(p_svm_final)

cat("All three output figures saved to working directory:\n
1. esl_raw_scatter.png (raw class scatter)
2. svm_cv_error_plot.png (cross-val error hyperparameter curve)
3. svm_esl_decision_regions.png (SVM decision regions with support vectors)\n")

## All three output figures saved to working directory:
## 
## 1. esl_raw_scatter.png (raw class scatter)
## 2. svm_cv_error_plot.png (cross-val error hyperparameter curve)
## 3. svm_esl_decision_regions.png (SVM decision regions with support vectors)

16 Support Vector Machines for Multi-Class Classification (More Than Two Classes)

The separating hyperplane logic that underpins binary SVM classifiers does not naturally extend to classification problems with more than two distinct classes. Three widely adopted strategies have been developed to adapt SVMs to multi-class tasks with $K>2$ target classes: One-versus-One, One-versus-All, and DAGSVM.

16.1 One-versus-One (OvO) Multi-Class SVM

For a task with $K>2$ classes, the One-versus-One approach trains $\frac{K(K-1)}{2}$ independent binary SVM classifiers, each dedicated to separating one unique pair of classes.

To generate a prediction for a single observation, every pairwise classifier produces a class vote for the sample. We tally total votes for each class, and the observation is assigned to the class receiving the highest vote count.

The major limitation of this method is that the total number of trained models grows quadratically with $K$, leading to excessive computational cost for datasets with many distinct classes.

The svm() function from the e1071 package automatically uses the One-versus-One strategy for multi-class classification when the response variable is a factor with three or more levels.

16.2 One-versus-All (OvA / One-versus-Rest) Multi-Class SVM

This strategy trains $K$ separate binary SVM models. Each model isolates one single target class against the union of all remaining $K-1$ classes, producing one dedicated separating hyperplane per class.

During prediction, the sample is evaluated against all $K$ classifiers, and the observation is assigned to the class whose model returns a positive prediction. This simple design carries two critical flaws:

Multiple classifiers may simultaneously output positive predictions, creating ambiguous class assignments for one sample.
Severe training class imbalance arises for each SVM. For example, a dataset with 100 classes and 10 samples per class yields a classifier with only 10 positive training points and 990 negative training points for every target class.

16.3 DAGSVM (Directed Acyclic Graph SVM)

DAGSVM is an optimized extension of the One-versus-One framework. It retains the same pairwise SVM training pipeline but accelerates prediction runtime by eliminating redundant pairwise comparisons via a Directed Acyclic Graph (DAG) decision tree structure.

16.3.1 Illustrative Example with Four Classes (A,B,C,D)

Six pairwise SVMs are pre-trained: A-B, A-C, A-D, B-C, B-D, C-D.

Start inference with the A-D classifier; if output is class A, discard all comparisons involving D.
Next evaluate the A-C classifier; if output is class A, discard all comparisons involving C.
Only the A-B classifier remains to generate the final predicted label.

Instead of evaluating all six pairwise models, only three are required for inference. DAGSVM preserves all advantages of One-versus-One while delivering drastically faster prediction performance.

17 Practical Multi-Class SVM Example: The Khan Gene Expression Tumor Dataset

The Khan dataset from the ISLR library contains 83 tissue samples belonging to four distinct tumor subtypes. Each sample includes gene expression intensity measurements across 2308 genes. The objective is to train a multi-class linear SVM to predict tumor subtype from high-dimensional gene expression profiles.

The dataset is pre-partitioned:

Training set (xtrain, ytrain): 63 samples
Held-out test set (xtest, ytest): 20 unseen samples for generalization evaluation

17.1 Load and Inspect the Dataset

This is a high-dimensional regime where predictor count far exceeds observation count. Models fit on such data are highly susceptible to overfitting. To reduce model flexibility and overfitting risk, we select the linear kernel. A linear SVM has only one tunable hyperparameter: the regularization penalty $C$.

17.2 Cross-Validation Grid Search for Optimal $C$

17.3 Evaluate Training Set Performance

The model achieves perfect classification (0% training error) on all 63 training samples. Zero training error is a strong warning sign of potential overfitting; we validate generalization performance on the independent held-out test set.

17.4 Evaluate Generalization on Unseen Test Data

Out of 20 held-out test samples, only two are misclassified, yielding a test set error rate of just 10%. This confirms the linear SVM generalizes well despite perfect training-set accuracy.

# Install required packages if missing
required_pkgs <- c("ISLR", "e1071", "ggplot2")
new_pkgs <- required_pkgs[!(required_pkgs %in% installed.packages()[,"Package"])]
if(length(new_pkgs)) install.packages(new_pkgs, quiet = TRUE, repos = "https://cloud.r-project.org")

## package 'ISLR' successfully unpacked and MD5 sums checked

library(ISLR)
library(e1071)
library(ggplot2)

# --------------------------
# Step 1: Load Khan dataset
# --------------------------
data("Khan")
cat("Training set dimensions:", dim(Khan$xtrain), "\n")

## Training set dimensions: 63 2308

cat("Test set dimensions:", dim(Khan$xtest), "\n")

## Test set dimensions: 20 2308

# Build training dataframe
datos_train <- data.frame(y = as.factor(Khan$ytrain), Khan$xtrain)

# --------------------------
# Step 2: Cross-validation grid search
# --------------------------
set.seed(1)
svm_cv <- tune(
  svm,
  y ~ .,
  data = datos_train,
  kernel = "linear",
  ranges = list(cost = c(0.0001, 0.0005, 0.001, 0.01, 0.1, 1))
)

# Plot CV error vs cost
p_cv <- ggplot(data = svm_cv$performances, aes(x = cost, y = error)) +
  geom_line(linewidth = 0.8) +
  geom_point(size = 1.8) +
  labs(title = "Cross-Validation Error vs Regularization Parameter C") +
  theme_bw()
ggsave("svm_khan_cv_error.png", p_cv, width=6, height=4, dpi=300)
print(p_cv)

# Best model
cat("Optimal parameters:\n")

## Optimal parameters:

print(svm_cv$best.parameters)

##    cost
## 2 5e-04

modelo_svm <- svm_cv$best.model

# --------------------------
# Step 3: Training performance
# --------------------------
train_error_pct <- 100 * mean(datos_train$y != modelo_svm$fitted)
cat("Training error:", train_error_pct, "%\n")

## Training error: 0 %

print(table(Prediction = modelo_svm$fitted, True = datos_train$y))

##           True
## Prediction  1  2  3  4
##          1  8  0  0  0
##          2  0 23  0  0
##          3  0  0 12  0
##          4  0  0  0 20

# --------------------------
# Step 4: Test performance
# --------------------------
datos_test <- data.frame(y = as.factor(Khan$ytest), Khan$xtest)
predicciones <- predict(modelo_svm, newdata = datos_test)
test_error_pct <- 100 * mean(datos_test$y != predicciones)
cat("Test error:", test_error_pct, "%\n")

## Test error: 10 %

print(table(Prediction = predicciones, True = datos_test$y))

##           True
## Prediction 1 2 3 4
##          1 3 0 0 0
##          2 0 6 0 0
##          3 0 0 4 0
##          4 0 0 2 5

18 Miscellaneous Supplementary Background Notes

This section collects definitions, commentary, and clarifications gathered from various reference sources. Some content could not be integrated into the main document body due to time constraints; others are retained here as auxiliary complementary background material for machine learning topics.

18.1 The Perceptron Algorithm

The Perceptron algorithm was published in 1957 by Frank Rosenblatt. The core objective of the Perceptron is to find a separating hyperplane that correctly partitions a linearly separable dataset. Once trained, this hyperplane can be used to perform binary classification tasks.

While the Perceptron itself is a very simple learning algorithm, understanding its internal mechanics is foundational for studying more complex models such as Support Vector Machines and artificial neural networks.

Before detailing the algorithm step-by-step, we first define key required mathematical building blocks: the dot product (scalar product), hyperplanes, and linear separability.

18.1.1 Dot Product (Scalar Product)

The dot product is an operation defined for two vectors of identical dimensionality, which outputs a single scalar value encoding directional and magnitude relationships between the vectors. There are two fully equivalent interpretations: geometric and algebraic.

18.1.2 Geometric Interpretation

Geometrically, the dot product equals the product of the Euclidean norm (second norm / magnitude) of each vector multiplied by the cosine of the angle $\alpha$ formed between them. For two vectors $\boldsymbol{x}$ and $\boldsymbol{y}$ separated by angle $\alpha$: \[ \boldsymbol{x} \cdot \boldsymbol{y} = \|\boldsymbol{x}\| \|\boldsymbol{y}\| \cos(\alpha) \] The value of the dot product is entirely determined by the angle between the vectors:

If $\alpha = 0^\circ$, $\cos(\alpha) = 1$, so $\boldsymbol{x} \cdot \boldsymbol{y} = \|\boldsymbol{x}\| \|\boldsymbol{y}\|$. When vectors point in identical directions, the dot product equals the product of their magnitudes.
If $\alpha = 90^\circ$, $\cos(\alpha) = 0$, so $\boldsymbol{x} \cdot \boldsymbol{y} = 0$. Orthogonal (perpendicular) vectors yield a dot product of zero, meaning they have no directional correlation.
If $\alpha = 180^\circ$, $\cos(\alpha) = -1$, so $\boldsymbol{x} \cdot \boldsymbol{y} = -\|\boldsymbol{x}\| \|\boldsymbol{y}\|$. Vectors pointing in opposite directions produce a dot product equal to the negative product of their magnitudes.

18.1.3 Algebraic Interpretation

Algebraically, the dot product is computed as the sum of element-wise products across all vector dimensions. For 2-dimensional vectors: \[ \boldsymbol{x} \cdot \boldsymbol{y} = x_1 y_1 + x_2 y_2 \] Generalized to $n$-dimensional vectors: \[ \boldsymbol{x} \cdot \boldsymbol{y} = \sum_{i=1}^n x_i y_i \] This formulation requires no knowledge of the angle between the two vectors.

18.1.4 Linear Separability

The concept of linear separability is intuitive in low-dimensional feature spaces:

Two dimensions: Two data classes are linearly separable if at least one straight line exists that fully partitions the groups with no overlapping points.
Three dimensions: Data is linearly separable if a flat plane exists that perfectly separates the two clusters.

This logic generalizes to arbitrary $n$-dimensional spaces. The core condition remains identical: there exists a subspace of one lower dimension that cleanly splits the two classes. This separating subspace is formally called a $\textbf{hyperplane}$.

# Geometric dot product function (corrected spelling & variable names)
dot_product_geometric <- function(x, y, alpha){
  norm_x <- sqrt(sum(x^2))
  norm_y <- sqrt(sum(y^2))
  alpha_rad <- (alpha * pi) / 180
  return(norm_x * norm_y * cos(alpha_rad))
}

# Algebraic summation-based dot product function
dot_product_algebra <- function(x, y) {
  result <- 0
  if (length(x) != length(y)) {
    stop("Vectors must have identical length (same dimensionality)")
  }
  for (i in seq_along(x)) {
    result <- result + x[i] * y[i]
  }
  return(result)
}

# Test example vectors from the text
vec1 <- c(3, 5)
vec2 <- c(8, 2)

# Run geometric version with alpha=45 degrees
res_geo <- dot_product_geometric(x = vec1, y = vec2, alpha = 45)
# Run algebraic version (no angle required)
res_alg <- dot_product_algebra(x = vec1, y = vec2)

# Print matching results
cat("Geometric dot product result: ", res_geo, "\n")

## Geometric dot product result:  34

cat("Algebraic dot product result: ", res_alg, "\n")

## Algebraic dot product result:  34

cat("Both implementations return identical value 34 as shown in the reference text.\n")

## Both implementations return identical value 34 as shown in the reference text.

19 Hyperplanes: Geometry, Binary Classification and the Perceptron Algorithm

19.1 Geometric Definition of a Hyperplane

In geometry, a hyperplane is defined as a subspace whose dimensionality is exactly one less than the ambient space containing it. This abstract definition becomes intuitive with a 2-dimensional example.

19.1.1 Hyperplane Special Case: 2D Straight Line

In a 2-dimensional ambient space, a hyperplane has $2-1=1$ dimension: a straight line. The standard slope-intercept line equation from basic calculus is: \[ x_2 = a x_1 + b \] Equivalent rearranged implicit form: \[ x_2 - a x_1 - b = 0 \] An alternative vector-based definition of a line uses the dot product of two vectors. Let input feature vector $\boldsymbol{x}=(x_1, x_2)$, weight vector $\boldsymbol{w}=(w_1, w_2)$, and scalar bias constant $b$. The line equation is written as: \[ \boldsymbol{x} \cdot \boldsymbol{w} + b = 0 \] Expanding the dot product: \[ (x_1, x_2)\cdot(w_1, w_2) + b = x_1 w_1 + x_2 w_2 + b = 0 \] Rearranged to recover slope-intercept form: \[ x_2 = -\frac{w_1}{w_2}x_1 - \frac{b}{w_2} \] Define slope $a = -\frac{w_1}{w_2}$ and vertical intercept $c = -\frac{b}{w_2}$, recovering the standard line formula: \[ x_2 = a x_1 + c \] The critical advantage of this vectorized formulation is full generalization to spaces of arbitrary dimensionality; this dot product form is the universal hyperplane equation for any $n$-dimensional feature space.

19.2 Binary Classification Using Hyperplanes

An $n$-dimensional hyperplane partitions an $(n+1)$-dimensional ambient space into two disjoint half-spaces. This partitioning property enables hyperplanes to act as binary classifiers: samples lying on one side of the hyperplane are assigned class label $+1$, samples on the opposite side receive label $-1$.

For a 2D dataset, each training observation $i$ is represented by feature vector $\boldsymbol{x}_i$ and binary response label $y_i \in \{+1, -1\}$. Given a hyperplane defined by weight vector $\boldsymbol{w}$ and bias $b$, we define a prediction function $h$: \[ h(\boldsymbol{x}_i) = \begin{cases} +1 & \text{if } \boldsymbol{w}\cdot\boldsymbol{x}_i + b \ge 0 \\ -1 & \text{if } \boldsymbol{w}\cdot\boldsymbol{x}_i + b < 0 \end{cases} \] This piecewise function simplifies compactly via the sign operator: \[ h(\boldsymbol{x}_i) = \operatorname{sign}\big(\boldsymbol{w}\cdot\boldsymbol{x}_i + b\big) \]

19.2.1 Augmented Vectors: Absorbing the Bias Term $b$

We eliminate the standalone bias constant $b$ by modifying all input and weight vectors into $\textit{augmented vectors}$:

Prepend a fixed constant value $1$ as the first element of every feature vector $\boldsymbol{x}_i$, creating augmented vector $\tilde{\boldsymbol{x}}_i$.
Append the bias scalar $b$ as the first element of weight vector $\boldsymbol{w}$.

The final prediction output remains identical, and this augmented vector formulation simplifies algorithm implementation for linear models like the Perceptron.

20 The Perceptron Algorithm for Linear Binary Classification

A hyperplane can perform binary classification on linearly separable data for any feature dimension. The Perceptron algorithm solves the core practical problem: recovering the separating hyperplane weight vector $\boldsymbol{w}$ from labeled training data.

Given a training dataset with $m$ $n$-dimensional observations $(\boldsymbol{x}_i, y_i)$, the Perceptron optimizes for the weight vector $\boldsymbol{w}$ of the augmented-space prediction function: \[ h(\tilde{\boldsymbol{x}}_i) = \operatorname{sign}\big(\boldsymbol{w}\cdot\tilde{\boldsymbol{x}}_i\big) \] The only unknown quantity is $\boldsymbol{w}$; the Perceptron’s objective is to find a $\boldsymbol{w}$ that perfectly separates all training samples by class.

20.1 Perceptron Iterative Training Update Rule

The algorithm follows four repeating steps until zero misclassified samples remain:

Initialize a random weight vector $\boldsymbol{w}$ defining an arbitrary starting hyperplane.
Generate predicted class labels for all training observations using current $\boldsymbol{w}$.
Randomly select one misclassified sample $\tilde{\boldsymbol{x}}_i$, apply the weight correction update rule: \[ \boldsymbol{w} \leftarrow \boldsymbol{w} + y_i \cdot \tilde{\boldsymbol{x}}_i \]
Repeat steps 2–3 iteratively.

20.2 Key Theoretical Properties of the Perceptron Algorithm

Local Single-Sample Correction: Each weight update only fixes the prediction for one randomly selected misclassified sample. The correction can cause previously correctly classified samples to become misclassified in subsequent iterations. Mathematically, the algorithm is proven to converge to a separating hyperplane $\textbf{if and only if the training data is linearly separable}$.
Non-Unique Separating Boundary: Infinitely many valid hyperplanes fully separate linearly separable two-class data. Random sampling of misclassified points at each update step produces distinct weight vectors (different decision lines) for different random seed values.

20.3 Critical Limitation of the Basic Perceptron

While the Perceptron always returns some perfectly separating hyperplane for linearly separable datasets, it does not guarantee the $\textit{optimal maximal-margin hyperplane}$. The optimal hyperplane is defined as the boundary positioned equidistant from the closest training samples of both classes; solving for this maximal margin hyperplane is the core objective of Support Vector Machines, not the simple Perceptron algorithm.

# Install ggplot2 if missing for plotting
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", quiet = TRUE)
}
library(ggplot2)

# --------------------------
# Helper Prediction Function
# --------------------------
predict_clase <- function(X, w){
  pred_score <- apply(X, MARGIN = 1, FUN = function(x){crossprod(x, w)})
  pred_class <- sign(pred_score)
  return(pred_class)
}

# --------------------------
# Full Perceptron Training Function
# --------------------------
perceptron <-  function(X, y, random_seed = 553){
  X <- cbind(1, X)
  set.seed(random_seed)
  w <- runif(n = ncol(X), min = 0, max = 1)
  clasificaciones <- predict_clase(X = X, w = w)
  errores_clasificacion <- which(clasificaciones != y)
  
  while (length(errores_clasificacion) > 0) {
    i <- sample(x = errores_clasificacion, size = 1)
    w <- w + X[i,] * y[i]
    clasificaciones <- predict_clase(X = X, w = w)
    errores_clasificacion <- which(clasificaciones != y)
  }
  return(w)
}

# --------------------------
# Synthetic Linearly Separable 2D Dataset
# --------------------------
X <- matrix(c(8, 4, 9, 7, 9, 4, 10, 2, 8, 7, 4, 4, 1, 2, 7, 10, 7, 10, 6, 8, 10,
              7, 3, 5, 4, 6, 3, 5), ncol = 2, byrow = FALSE)
y <- c(1, 1, 1, 1, 1, 1, 1 , -1 , -1 , -1 , -1 , -1 , -1 , -1 )

# Train model and output weight vector
hiperplano <- perceptron(X = X, y = y)
cat("Learned augmented hyperplane weight vector:\n")

## Learned augmented hyperplane weight vector:

print(hiperplano)

## [1] -50.452542   2.744047   5.822881

# --------------------------
# Generate Exact Matching 2D Scatter + Decision Line Plot
# --------------------------
datos <- data.frame(X1 = X[,1], X2 = X[,2], y = factor(y))
p_perceptron <- ggplot(data = datos, aes(x = X1, y = X2, color = y)) +
  geom_point(size = 2.5) +
  geom_abline(
    intercept = -(hiperplano[1]/hiperplano[3]),
    slope =     -(hiperplano[2]/hiperplano[3])
  ) +
  theme_bw() +
  theme(legend.position = "none")

print(p_perceptron)

# Save high-resolution figure matching reference layout
ggsave("perceptron_2d_separating_line.png", p_perceptron, width = 4.2, height = 4, dpi = 300)
cat("\nPlot saved as perceptron_2d_separating_line.png in working directory.\n")

## 
## Plot saved as perceptron_2d_separating_line.png in working directory.

Support Vector Machines

Jeff Chak-Fu WONG