Department of Mathematics
The Chinese University of Hong Kong
HSMMC Math Modelling Advanced Training Workshop
June 27, 2026
Support Vector Machines (abbreviated SVMs), a unified framework for classification and regression tasks, were developed during the 1990s within the discipline of computer science. Although this methodology was initially created exclusively for binary classification problems, its practical applicability has since been extended to multi-class classification and regression modelling.
SVMs consistently deliver superior classification performance across a wide range of real-world and synthetic datasets; for this reason, they are regarded as a foundational benchmark method in the fields of statistical learning and machine learning.
The theoretical backbone of Support Vector Machines is the Maximal Margin Classifier, whose core geometric building block is the hyperplane. This note presents each foundational concept in sequential logical order. A complete, rigorous grasp of SVM theory requires robust prior knowledge of linear algebra.
Within the R programming environment, the software
packages e1071 and LiblineaR implement all
core algorithms required to train simple binary classification models,
multi-class classification models, and regression models constructed
under the Support Vector Machine paradigm.
In a \(p\)-dimensional space, a hyperplane is defined as a flat affine subspace of dimension \(p-1\). The term means the subspace is not required to pass through the origin.
The algebraic formulation of a hyperplane is a linear equation.
For \(p=2\), the hyperplane corresponds to a straight line with equation: \[\begin{equation} (1): \quad \quad \beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0 \label{eq:hyperplane_2d} \end{equation}\]
where \(\beta_0, \beta_1, \beta_2\)
are fixed real-valued coefficients.
Any coordinate pair \(\boldsymbol{x}=(x_1,x_2)\) that satisfies
Equation (1) lies exactly on the hyperplane.
Equation (1) generalises to arbitrary dimension \(p\): \[\begin{equation} (2): \quad \quad \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p = 0 \label{eq:hyperplane_pd} \end{equation}\] All vectors \(\boldsymbol{x} = \begin{pmatrix} x_1 & x_2 & \dots & x_p \end{pmatrix}^\top\) satisfying (2) are points on the \(p\)-dimensional hyperplane.
For a point \(\boldsymbol{x}\) not lying on the hyperplane, evaluating the linear functional yields one of two strict inequalities: \[\begin{equation} (3): \quad \quad \beta_0 + \sum_{j=1}^p \beta_j x_j < 0 \label{eq:halfspace_neg} \end{equation}\] or \[\begin{equation} (4): \quad \quad \beta_0 + \sum_{j=1}^p \beta_j x_j > 0 \label{eq:halfspace_pos} \end{equation}\]
These two inequalities define two disjoint open half-spaces separated by the hyperplane. The sign of the linear expression \(\beta_0 + \sum_{j=1}^p \beta_j x_j\) uniquely determines which half-space contains the point \(\boldsymbol{x}\).
Take the specific hyperplane equation: \[\begin{equation} (5): \quad \quad 1 + 2x_1 + 3x_2 = 0 \label{eq:example_hp} \end{equation}\]
Below is executable R code to compute the linear
functional, classify regions, generate the colored partition diagram,
and verify sample point calculations numerically.
# Install ggplot2 if missing
if (!require("ggplot2")) {
install.packages("ggplot2")
library(ggplot2)
}
# Step 1: Create 2D coordinate grid
x1 <- seq(-4, 4, length.out = 300)
x2 <- seq(-4, 4, length.out = 300)
grid_data <- expand.grid(x1 = x1, x2 = x2)
# Step 2: Compute hyperplane linear functional f(x1,x2) = 1 + 2*x1 + 3*x2
grid_data$f <- 1 + 2 * grid_data$x1 + 3 * grid_data$x2
# Step 3: Label each region
grid_data$region <- ifelse(
grid_data$f > 1e-5,
"Positive half-space (1+2x1+3x2 > 0)",
ifelse(
grid_data$f < -1e-5,
"Negative half-space (1+2x1+3x2 < 0)",
"Hyperplane line (1+2x1+3x2 = 0)"
)
)
# Step 4: Plot hyperplane and two half-spaces
plot_hyperplane <- ggplot() +
geom_tile(
data = grid_data,
aes(x = x1, y = x2, fill = region),
alpha = 0.7
) +
scale_fill_manual(
values = c(
"Positive half-space (1+2x1+3x2 > 0)" = "#4477ff",
"Negative half-space (1+2x1+3x2 < 0)" = "#dd3333",
"Hyperplane line (1+2x1+3x2 = 0)" = "#000000"
)
) +
labs(
title = "2D Hyperplane Partition: $1 + 2x_1 + 3x_2 = 0$",
x = expression(x[1]),
y = expression(x[2]),
fill = "Region Classification"
) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
# Render plot
print(plot_hyperplane)# Step 5: Manual mathematical verification for sample points
# Test point 1: (x1=1, x2=1)
f1 <- 1 + 2*1 + 3*1
cat("Point (1,1): f =", f1, "> 0 → Positive blue region\n")## Point (1,1): f = 6 > 0 → Positive blue region
# Test point 2: (x1=-2, x2=-2)
f2 <- 1 + 2*(-2) + 3*(-2)
cat("Point (-2,-2): f =", f2, "< 0 → Negative red region\n")## Point (-2,-2): f = -9 < 0 → Negative red region
# Test point exactly on hyperplane: solve 1+2x1+3x2=0, pick x1=-2 → x2=( -1 -2*(-2) )/3
x1_hp <- -2
x2_hp <- (-1 - 2 * x1_hp) / 3
f_hp <- 1 + 2*x1_hp + 3*x2_hp
cat("Point on hyperplane (",x1_hp,",",x2_hp,"): f =", f_hp, "→ Black separating line\n")## Point on hyperplane ( -2 , 1 ): f = 0 → Black separating line
Suppose that we have \(n\) observations, where each observation contains \(p\) predictor variables, and the response variable has only two possible classes. Throughout this discussion, the two classes are denoted by \(+1\) and \(-1\).
Hyperplanes can be used to construct a classifier that predicts the class membership of an observation based on its predictor values. The same classification problem can also be addressed using other methods, such as logistic regression, Linear Discriminant Analysis (LDA), classification trees, and many others, each of which has its own advantages and disadvantages.
For ease of visualization, the following explanations are presented in a two-dimensional space, where a hyperplane is simply a straight line. However, exactly the same concepts extend naturally to higher-dimensional spaces.
Assume that the observations are distributed in such a way that the two classes (\(+1\) and \(-1\)) can be perfectly separated by a linear boundary.
A hyperplane in a \(p\)-dimensional space is defined by
\[ \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_px_p=0. \]
If the classes are perfectly separable, there exist coefficients
\[ \beta_0,\beta_1,\ldots,\beta_p \]
such that every observation satisfies:
\[ \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_px_p >0, \qquad \text{if } y_i=+1, \]
and
\[ \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_px_p <0, \qquad \text{if } y_i=-1. \]
Since each class label is either \(+1\) or \(-1\), the two inequalities above can be combined into a single condition.
Multiplying the hyperplane expression by \(y_i\) gives
\[ y_i\left( \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\cdots+\beta_px_{ip} \right) = (+1) \left( \beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip} \right). \]
Therefore,
\[ y_i\left( \beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip} \right)>0. \]
Since
\[ \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\cdots+\beta_px_{ip}<0, \]
multiplying both sides by \(y_i=-1\) yields
\[ (-1) \left( \beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip} \right)>0, \]
because the product of two negative quantities is positive.
Hence,
\[ y_i\left( \beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip} \right)>0. \]
Since this condition holds in both cases, we obtain the compact representation
\[ \boxed{ y_i \left( \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\cdots+\beta_px_{ip} \right) >0, \qquad i=1,\ldots,n. } \]
This equation states that every training observation is correctly classified by the hyperplane.
Under the perfectly separable scenario, the simplest classifier assigns a class according to which side of the hyperplane an observation lies.
Let a new observation be
\[ \mathbf{x}^* = (x_1^*,x_2^*,\ldots,x_p^*). \]
Define the decision function
\[ f(\mathbf{x}^*) = \beta_0 +\beta_1x_1^* +\beta_2x_2^* +\cdots +\beta_px_p^*. \]
The classification rule is
\[ \hat{y} = \begin{cases} +1, & \text{if } f(\mathbf{x}^*)>0,\\[6pt] -1, & \text{if } f(\mathbf{x}^*)<0. \end{cases} \]
The sign of \(f(\mathbf{x}^*)\) determines the predicted class:
Furthermore, the magnitude
\[ \left|f(\mathbf{x}^*)\right| \]
provides information about how far the observation is from the decision boundary. In general:
The conditions for perfect linear separability only require that all observations be correctly classified. If the data are perfectly separable, there are generally infinitely many hyperplanes satisfying
\[ y_i \left( \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\cdots+\beta_px_{ip} \right) >0, \qquad i=1,\ldots,n. \]
Consequently, infinitely many classifiers can achieve zero training error.
Because of this non-uniqueness, an additional criterion is required to select a single ``best’’ hyperplane. This leads to the concept of the \(\emph{optimal separating hyperplane}\), also known as the \(\emph{maximum-margin hyperplane}\), which chooses the hyperplane that maximizes the separation (margin) between the two classes. Maximizing the margin generally improves the robustness and generalization ability of the classifier.
# Install required package if missing
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)
# Fix random seed for identical synthetic data
set.seed(68)
# Generate two groups of standard normal random variables
X1 <- rnorm(n = 10, mean = 2, sd = 1)
X2 <- rnorm(n = 10, mean = 2, sd = 1)
# Construct dataset: two distinct classes
observations <- data.frame(
X1 = c(X1, X1 + 2),
X2 = c(X2, X2 + 2),
class = rep(c(1, -1), each = 10)
)
# Convert class label to categorical factor
observations$class <- as.factor(observations$class)
# Plot data points + 5 separating hyperplanes (straight lines in 2D)
ggplot() +
geom_point(
data = observations,
aes(x = X1, y = X2, color = class),
size = 4
) +
# Draw all 5 candidate separating hyperplanes
geom_abline(intercept = 9, slope = -2) +
geom_abline(intercept = 8.5, slope = -1.7) +
geom_abline(intercept = 8, slope = -1.5) +
geom_abline(intercept = 6.5, slope = -1) +
geom_abline(intercept = 5.4, slope = -0.75) +
# Black-and-white plot theme
theme_bw() +
# Plot title (translated to English)
labs(title = "5 Possible Separating Hyperplanes") +
# Customize plot aesthetics
theme(
legend.position = "none",
plot.title = element_text(hjust = 0.5, size = 11)
)The solution to this classification problem selects the \(\textit{maximal margin hyperplane}\) (also called the optimal separating hyperplane) as the optimal classifier. This hyperplane is defined as the linear decision boundary that lies the farthest perpendicular distance from all training observations.
To construct this hyperplane, we first compute the perpendicular Euclidean distance from every training data point to a candidate separating hyperplane. The smallest of these perpendicular distances is termed the \(\textit{margin}\); this value quantifies how far the hyperplane sits from the closest training samples. The maximal margin hyperplane is the unique separating hyperplane that maximizes this minimal perpendicular distance across all training points.
While this conceptual framework is intuitive, a brute-force search is computationally infeasible: there exist infinitely many candidate hyperplanes to evaluate for distance calculations. Instead, we formalize the problem as a constrained convex optimization program.
The figire above visualizes the maximal margin hyperplane for a linearly separable training dataset. Three training points lie exactly equidistant to the central maximal margin hyperplane, resting on the two dashed boundary lines that mark the full width of the margin band. These equidistant observations are named \(\textit{support vectors}\): they are \(p\)-dimensional data vectors that fully define and constrain the maximal margin hyperplane.
Key property of support vectors: Any shift or alteration to a support vector will directly modify the position, slope, or margin width of the maximal margin hyperplane. In contrast, modifying training observations that are \(\textbf{not}\) support vectors produces zero change to the optimal separating hyperplane, provided the support vectors remain unchanged.
The maximal margin hyperplane described above provides an intuitive, simple linear classification rule \(\textbf{if and only if}\) a perfect linear separating hyperplane exists for the dataset. In nearly all real-world applied problems, training data cannot be perfectly split by a straight linear boundary—no exact separating hyperplane exists, which means the hard-margin maximal margin hyperplane formulation cannot be directly applied.
A linear hyperplane in \(p\)-dimensional feature space \(\mathbb{R}^p\) takes the standard form: \[ \boldsymbol{w}^T \boldsymbol{x} + b = 0 \] where:
For binary classification with labels \(y_i \in \{-1, 1\}\) for observation \(i\), a separating hyperplane satisfies: \[ \begin{cases} \boldsymbol{w}^T \boldsymbol{x}_i + b > 0 & \text{if } y_i = 1 \\ \boldsymbol{w}^T \boldsymbol{x}_i + b < 0 & \text{if } y_i = -1 \end{cases} \] We rewrite this unified constraint by multiplying through by the label \(y_i\): \[ y_i \big(\boldsymbol{w}^T \boldsymbol{x}_i + b\big) > 0 \quad \forall i = 1,2,\dots,n \]
The perpendicular Euclidean distance from a single point \(\boldsymbol{x}_i\) to hyperplane \(\boldsymbol{w}^T\boldsymbol{x}+b=0\) is the standard plane-point distance formula: \[ d_i = \frac{\big|\boldsymbol{w}^T \boldsymbol{x}_i + b\big|}{\|\boldsymbol{w}\|_2}, \quad \|\boldsymbol{w}\|_2 = \sqrt{w_1^2 + w_2^2 + \dots + w_p^2} \] Since we have perfect separation \(y_i(\boldsymbol{w}^T\boldsymbol{x}_i + b) > 0\), the absolute value may be eliminated: \[ d_i = \frac{y_i \big(\boldsymbol{w}^T \boldsymbol{x}_i + b\big)}{\|\boldsymbol{w}\|_2} \] The margin \(M\) of the hyperplane is defined as the minimum perpendicular distance across all training samples: \[ M = \min_{i=1,\dots,n} \, d_i = \min_{i=1,\dots,n} \frac{y_i \big(\boldsymbol{w}^T \boldsymbol{x}_i + b\big)}{\|\boldsymbol{w}\|_2} \]
Our goal is to maximize \(M\) over all valid \(\boldsymbol{w},b\) that satisfy separation. We perform a normalization simplification without loss of generality: scale \((\boldsymbol{w},b)\) such that \(\min_i y_i(\boldsymbol{w}^T\boldsymbol{x}_i + b) = 1\). This normalization is valid because scaling \(\boldsymbol{w},b\) by a constant factor does not alter the hyperplane itself.
Under this normalization: \[ M = \frac{1}{\|\boldsymbol{w}\|_2} \] Maximizing \(M = 1/\|\boldsymbol{w}\|_2\) is algebraically equivalent to minimizing \(\|\boldsymbol{w}\|_2\). Squaring the norm (monotonic transformation, preserves minima) gives the standard convex objective: \[ \min_{\boldsymbol{w},b} \frac{1}{2}\|\boldsymbol{w}\|_2^2 \] Subject to the hard separation constraints for all \(n\) observations: \[ y_i \big(\boldsymbol{w}^T \boldsymbol{x}_i + b\big) \ge 1, \quad i=1,\dots,n \]
At the optimal solution \((\boldsymbol{w}^*,b^*)\), the minimal distance constraint is tight for exactly the support vectors: \[ y_i \big((\boldsymbol{w}^*)^T \boldsymbol{x}_i + b^*\big) = 1 \quad \iff \boldsymbol{x}_i \text{ is a support vector} \] For all non-support vectors, the inequality holds strictly: \[ y_i \big((\boldsymbol{w}^*)^T \boldsymbol{x}_i + b^*\big) > 1 \] This mathematical condition proves that only support vectors determine the hyperplane solution: removing or adjusting non-support vectors does not change the optimal \(\boldsymbol{w}^*,b^*\), while perturbing any support vector breaks the tight equality constraint and forces a new optimal hyperplane.
The above hard-margin optimization problem only has a feasible solution if there exists some \((\boldsymbol{w},b)\) satisfying \(y_i(\boldsymbol{w}^T\boldsymbol{x}_i + b)\ge1\) for every \(i\). If the dataset cannot be perfectly linearly separated, no feasible \((\boldsymbol{w},b)\) exists, so the maximal margin hyperplane formulation cannot be used directly. This motivates soft-margin SVM extensions for nearly separable real-world data.
# ============================================================
# Maximal Margin Hyperplane with Square Window Frame
# ============================================================
# Install required packages if missing
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("e1071")) install.packages("e1071") # For SVM maximal margin fit
library(ggplot2)
library(e1071)
# ------------------------------------------------------------
# Step 1: Generate linearly separable dataset
# ------------------------------------------------------------
set.seed(123)
# Class 1 (blue points, y=1)
x1_blue <- data.frame(
X1 = c(-1.2, -0.6, -0.4, -0.2, 0.1, 0.3, 0.5, -0.8, -0.5, 0.2),
X2 = c(1.8, 1.4, 1.1, 0.6, 1.7, 3.5, 2.9, 2.9, 2.9, 2.0),
y = factor(1)
)
# Class -1 (pink points, y=-1)
x2_red <- data.frame(
X1 = c(0.4, 0.7, 0.9, 1.1, 1.4, 1.8, 2.2, 2.6, 3.1, 0.8),
X2 = c(-0.2, -0.9, -1.1, -1.5, 0.1, 0.6, 0.0, -1.0, -1.3, -0.7),
y = factor(-1)
)
# Combine dataset
df <- rbind(x1_blue, x2_red)
# ------------------------------------------------------------
# Step 2: Fit hard-margin SVM
# ------------------------------------------------------------
svm_fit <- svm(y ~ X1 + X2, data = df, kernel = "linear", cost = 1e6, scale = FALSE)
# Extract hyperplane parameters
w <- t(svm_fit$SV) %*% svm_fit$coefs
w1 <- w[1,1]
w2 <- w[2,1]
b <- svm_fit$rho
# Hyperplane equation: w1*X1 + w2*X2 - b = 0 → X2 = (-w1/w2)*X1 + b/w2
slope_main <- -w1 / w2
intercept_main <- b / w2
# Margin boundary lines (offset ±1/||w||)
margin_offset <- 1 / sqrt(w1^2 + w2^2)
intercept_upper <- intercept_main + margin_offset / w2
intercept_lower <- intercept_main - margin_offset / w2
# Support vectors
support_pts <- df[svm_fit$index, ]
# ------------------------------------------------------------
# Step 3: Background grid shading
# ------------------------------------------------------------
grid_x <- seq(min(df$X1)-0.5, max(df$X1)+0.5, by = 0.05)
grid_y <- seq(min(df$X2)-0.5, max(df$X2)+0.5, by = 0.05)
grid_df <- expand.grid(X1 = grid_x, X2 = grid_y)
grid_df$val <- w1 * grid_df$X1 + w2 * grid_df$X2 - b
grid_df$region <- ifelse(grid_df$val > 0, "Class 1", "Class -1")
# ------------------------------------------------------------
# Step 4: Build plot with square window frame
# ------------------------------------------------------------
p <- ggplot() +
# Background shading
geom_tile(data = grid_df, aes(x = X1, y = X2, fill = region), alpha = 0.2, color = "gray80") +
scale_fill_manual(values = c("Class 1" = "#87CEFA", "Class -1" = "#FFB6C1")) +
# Margin lines
geom_abline(slope = slope_main, intercept = intercept_upper, linetype = "dashed", linewidth = 1) +
geom_abline(slope = slope_main, intercept = intercept_lower, linetype = "dashed", linewidth = 1) +
# Central hyperplane
geom_abline(slope = slope_main, intercept = intercept_main, color = "black", linewidth = 1.3) +
# Data points
geom_point(data = df, aes(x = X1, y = X2, color = y), size = 4) +
scale_color_manual(values = c("1" = "#3388dd", "-1" = "#bb6699")) +
# Arrows between support vectors
geom_segment(
data = support_pts,
aes(x = X1, y = X2, xend = lag(X1), yend = lag(X2)),
arrow = arrow(length = unit(0.15, "cm")), color = "black"
) +
# Square window frame (adjust bounds to cover data)
annotate("rect",
xmin = -2, xmax = 4, ymin = -2, ymax = 4,
color = "black", fill = NA, size = 1.2) +
# Axis labels
labs(x = expression(X[1]), y = expression(X[2])) +
# Equal aspect ratio ensures square frame
coord_fixed(ratio = 1) +
theme_bw() +
theme(
legend.position = "none",
plot.margin = margin(10,10,10,10),
axis.title = element_text(size = 14)
) +
labs(caption = "Maximal margin hyperplane with square window frame")
# Display plot
print(p)# Install required plotting package if missing
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)
# Fix random seed for perfect reproducibility
set.seed(101)
# Generate 20x2 matrix of standard normal random variables
coordinates <- matrix(rnorm(n = 40), nrow = 20, ncol = 2)
colnames(coordinates) <- c("X1", "X2")
# Create binary class labels: 10 samples of class -1, 10 samples of class 1
class_label <- c(rep(-1, times = 10), rep(1, times = 10))
class_label <- as.factor(class_label)
# Shift all class 1 observations by +1 along both feature axes to create overlap
coordinates[class_label == 1, ] <- coordinates[class_label == 1, ] + 1
# Assemble final plotting dataset
data <- data.frame(coordinates, class = class_label)
# Generate scatter plot matching the original figure exactly
ggplot(data = data, aes(x = X1, y = X2, color = class)) +
geom_point(size = 4) +
theme_bw() +
labs(title = "Linearly Non-Separable Classes") +
theme(
legend.position = "none",
plot.title = element_text(hjust = 0.5, size = 11)
)To address cases where data cannot be perfectly linearly separated, we extend the maximal margin hyperplane framework to construct a hyperplane that nearly separates the two classes while allowing a small number of misclassification errors. This type of decision boundary is known as a \(\textit{Support Vector Classifier}\), or equivalently a \(\textit{soft-margin SVM}\).
The maximal margin classifier introduced in the prior section has limited real-world applicability, as perfectly linearly separable datasets are extremely rare in practice. Even when perfect linear separation is theoretically achievable, the hard-margin formulation suffers from two critical flaws:
For these reasons, we prefer a classifier defined by a hyperplane that tolerates a small number of misclassifications, while delivering greater robustness and stronger generalization to new data (reduced overfitting). This design goal is exactly achieved by support vector classifiers (soft-margin classifiers). Instead of enforcing all training points to lie strictly on the correct side of the margin band, we permit some observations to fall inside the margin or even on the opposite side of the hyperplane.
The figure above displays a soft-margin support vector classifier fit to a small synthetic training dataset. The solid black line is the optimal separating hyperplane; the two dashed parallel lines mark the width of the soft margin band on either side of the hyperplane. We categorize all training points by their position relative to the margin and hyperplane:
Any observation that falls inside the margin band or crosses to the wrong side of the hyperplane constitutes a margin violation or full misclassification, respectively.
Identifying the optimal soft-margin hyperplane that correctly classifies most samples (with limited allowed violations) reduces to a constrained convex optimization problem. A full rigorous proof of the optimization derivation is beyond this introductory text, but we formalize the objective function, constraints, and hyperparameter behavior below.
Introduce non-negative slack variables \(\xi_i \ge 0\) for each training observation \(i=1,\dots,n\) to quantify the magnitude of margin/hyperplane violations: \[ \xi_i = \begin{cases} 0 & \text{if } \boldsymbol{x}_i \text{ lies on the correct side of the margin}, \\ \in (0,1) & \text{if } \boldsymbol{x}_i \text{ lies inside the margin band (correct side of hyperplane)}, \\ \ge 1 & \text{if } \boldsymbol{x}_i \text{ lies on the wrong side of the hyperplane (misclassified)}. \end{cases} \] The hard separation constraint from maximal margin SVM is relaxed to: \[ y_i \big(\boldsymbol{w}^T \boldsymbol{x}_i + b\big) \ge 1 - \xi_i, \quad \forall i,\quad \xi_i \ge 0. \]
The convex minimization problem balances maximizing margin width (minimizing \(\|\boldsymbol{w}\|_2^2\)) and penalizing total margin violations (\(\sum_{i=1}^n \xi_i\)): \[ \min_{\boldsymbol{w},b,\boldsymbol{\xi}} \frac{1}{2}\|\boldsymbol{w}\|_2^2 + C \sum_{i=1}^n \xi_i \] Subject to: \[ y_i \big(\boldsymbol{w}^T \boldsymbol{x}_i + b\big) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i=1,\dots,n. \] The scalar hyperparameter \(C>0\) controls the penalty applied to total margin violations \(\sum \xi_i\).
\(C\) directly governs the bias–variance tradeoff of the SVM model, and its optimal value is selected via cross-validation in practice.
A critical property of soft-margin SVM: only observations with \(\xi_i > 0\) (points touching the margin band or violating it) influence the final hyperplane solution. These points are defined as \(\textit{support vectors}\), which fully determine the decision boundary. The value of \(C\) modulates the number of support vectors and the resulting model complexity:
Since the soft-margin hyperplane depends exclusively on support vectors (a small subset of training data), the classifier is robust to outlying observations far from the margin band. This distinguishes SVMs from methods such as Linear Discriminant Analysis (LDA), whose classification rule relies on the sample mean of \(\textit{all}\) training observations and is highly sensitive to distant outliers.
This code uses the e1071 package to fit linear
soft-margin SVM, generates the margin lines, support vector markers, and
replicates the textbook plot structure exactly.
# ============================================================
# Soft-Margin SVM Visualization with Square Window Frame
# ============================================================
# Install required packages if missing
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("e1071")) install.packages("e1071")
library(ggplot2)
library(e1071)
# --------------------------
# Step 1: Exact coordinates of all 12 labeled points from ISLR
# --------------------------
svm_data <- data.frame(
ID = 1:12,
X1 = c(1.2, 1.4, -0.4, 0.1, 0.6, 0.0, 1.6, 1.8, 2.2, 2.4, 1.3, 0.7),
X2 = c(0.8, -0.6, 0.0, -0.8, -0.5, -1.0, 3.8, 2.0, 2.4, 4.0, 2.8, 0.2),
y = factor(c(-1, -1, -1, -1, -1, -1, 1, 1, 1, 1, -1, 1))
)
# --------------------------
# Step 2: Fit soft-margin linear SVM
# --------------------------
svm_fit <- svm(y ~ X1 + X2, data = svm_data, kernel = "linear", cost = 1, scale = FALSE)
# Extract hyperplane parameters
w_matrix <- t(svm_fit$SV) %*% svm_fit$coefs
w1 <- w_matrix[1, 1]
w2 <- w_matrix[2, 1]
b <- svm_fit$rho
# Hyperplane equation: w1*X1 + w2*X2 - b = 0 → X2 = slope*X1 + intercept
slope_main <- -w1 / w2
intercept_main <- b / w2
# Margin boundaries
margin_gap <- 1 / sqrt(w1^2 + w2^2)
intercept_upper_margin <- intercept_main + margin_gap / w2
intercept_lower_margin <- intercept_main - margin_gap / w2
# --------------------------
# Step 3: Compute square window frame bounds
# --------------------------
range_X1 <- range(svm_data$X1)
range_X2 <- range(svm_data$X2)
x_span <- diff(range_X1)
y_span <- diff(range_X2)
span <- max(x_span, y_span)
x_mid <- mean(range_X1)
y_mid <- mean(range_X2)
xmin <- x_mid - span/2
xmax <- x_mid + span/2
ymin <- y_mid - span/2
ymax <- y_mid + span/2
# --------------------------
# Step 4: Build plot
# --------------------------
p <- ggplot() +
# Dashed margin boundary lines
geom_abline(slope = slope_main, intercept = intercept_upper_margin, linetype = "dashed", linewidth = 1) +
geom_abline(slope = slope_main, intercept = intercept_lower_margin, linetype = "dashed", linewidth = 1) +
# Solid central hyperplane
geom_abline(slope = slope_main, intercept = intercept_main, color = "black", linewidth = 1.4) +
# Data points
geom_point(data = svm_data, aes(x = X1, y = X2, color = y), size = 5) +
scale_color_manual(values = c("-1" = "#cc77aa", "1" = "#66aadd")) +
# Number labels
geom_text(data = svm_data, aes(x = X1, y = X2, label = ID), size = 6, color = "black") +
# Square window frame
annotate("rect", xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax,
color = "black", fill = NA, size = 1.2) +
# Axis labels
labs(
x = expression(X[1]),
y = expression(X[2]),
caption = "Soft-Margin Support Vector Classifier figure reproduced from ISLR"
) +
coord_fixed(ratio = 1, xlim = c(xmin, xmax), ylim = c(ymin, ymax)) +
theme_bw() +
theme(
legend.position = "none",
plot.caption = element_text(hjust = 0.5, size = 14, margin = margin(t = 15)),
axis.title = element_text(size = 16),
panel.grid = element_blank()
)
# Display plot
print(p)To demonstrate the application of a Support Vector Classifier as a binary classification model, we simulate training observations residing in a two-dimensional feature space, split into two distinct classes.
All examples below utilize the svm() function from the
R package e1071. This function fits a Support
Vector Classifier when the argument kernel="linear" is
specified. As derived later in this section, linear-kernel Support
Vector Machines reduce exactly to the standard Support Vector Classifier
framework. The cost argument controls the penalty magnitude
assigned to margin violations; this argument corresponds to the
hyperparameter \(C\) defined in
soft-margin SVM optimization theory.
R Code for Dataset Simulation and VisualizationThe following script generates synthetic non-linearly separable data and produces a scatter plot of the two classes:
# Install required packages if missing
if (!require("ggplot2")) install.packages("ggplot2")
library(ggplot2)
# Fix random seed for full reproducibility
set.seed(10111)
# Generate 20Ă—2 matrix of standard normal random samples N(0,1)
coordinates <- matrix(rnorm(n = 40), nrow = 20, ncol = 2)
colnames(coordinates) <- c("X1", "X2")
# Define binary class labels: first 10 samples class -1, last 10 samples class 1
y <- c(rep(-1, times = 10), rep(1, times = 10))
# Shift all class 1 points by +1 along both feature axes to create class overlap
coordinates[y == 1, ] <- coordinates[y == 1, ] + 1
# Assemble final tidy data frame
data <- data.frame(coordinates, y)
# Scatter plot of simulated dataset
ggplot(data = data, aes(x = X1, y = X2, color = as.factor(y))) +
geom_point(size = 6) +
theme_bw() +
theme(legend.position = "none")Visual inspection of the scatter plot in the figure above confirms the two classes cannot be perfectly separated by any linear hyperplane.
e1071::svm() FunctionThe svm() function automatically distinguishes between
classification and regression tasks:
factor type.R
Script for Fitting Linear Soft-Margin Support Vector Classifier# Install and load SVM package
if (!require("e1071")) install.packages("e1071")
library(e1071)
# Convert numeric class labels to factor for classification task
data$y <- as.factor(data$y)
# Fit linear soft-margin Support Vector Classifier
svm_model <- svm(
formula = y ~ X1 + X2,
data = data,
kernel = "linear",
cost = 10,
scale = FALSE
)
# Print full model summary
summary(svm_model)##
## Call:
## svm(formula = y ~ X1 + X2, data = data, kernel = "linear", cost = 10,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
##
## Number of Support Vectors: 6
##
## ( 3 3 )
##
##
## Number of Classes: 2
##
## Levels:
## -1 1
In this simulation, both predictors \(X_1\) and \(X_2\) share identical units and value
ranges, so feature standardization is unnecessary
(scale = FALSE). For datasets with predictors measured on
disparate scales, standardization is mandatory. Without scaling,
predictors with larger numerical magnitudes will dominate the distance
calculations inside the SVM loss function and overshadow the
contribution of smaller-scale features, producing biased hyperplane
solutions.
The output of summary(svm_model) reports a total of 6
support vectors: 3 support vectors belong to class \(-1\), and the remaining 3 belong to class
\(1\). Only these 6 observations define
the optimal separating hyperplane; all other training samples have no
impact on the final decision boundary.
The command draws 40 independent samples from the standard normal distribution: \[ Z \sim \mathcal{N}(\mu=0,\;\sigma^2=1) \] These samples fill a matrix with \(n=20\) rows (observations) and \(p=2\) feature columns (\(X_1,X_2\)): \[ \mathbf{Coor} = \begin{bmatrix} z_{11} & z_{12} \\ z_{21} & z_{22} \\ \vdots & \vdots \\ z_{20,1} & z_{20,2} \end{bmatrix} \in \mathbb{R}^{20\times 2} \]
The label vector \(\boldsymbol{y} \in \{-1,1\}^{20}\) partitions observations into two groups: \[ y_i = \begin{cases} -1 & i = 1,2,\dots,10 \\ 1 & i = 11,12,\dots,20 \end{cases} \]
For all rows with \(y_i=1\), we apply a uniform vector shift of \((+1,+1)\) to both features: \[ \begin{bmatrix}x_{i1}^* \\ x_{i2}^*\end{bmatrix} = \begin{bmatrix}x_{i1} + 1 \\ x_{i2} + 1\end{bmatrix} \quad \forall\; i: y_i=1 \] This shift creates overlapping points between the two classes, eliminating the existence of any perfect linear separating hyperplane.
Non-negative slack variables \(\xi_i \ge 0\) quantify margin violations for each observation \(i\): \[ \xi_i = \begin{cases} 0 & \text{Point } \boldsymbol{x}_i \text{ lies on the correct side of the margin band}, \\ 0 < \xi_i < 1 & \text{Point } \boldsymbol{x}_i \text{ lies inside the margin band (correct class)}, \\ \xi_i \ge 1 & \text{Point } \boldsymbol{x}_i \text{ misclassified on opposite hyperplane side}. \end{cases} \]
The convex minimization objective balanced margin width and total violations: \[ \min_{\boldsymbol{w},b,\boldsymbol{\xi}} \frac{1}{2}\|\boldsymbol{w}\|_2^2 + C \sum_{i=1}^{20} \xi_i \] Subject to relaxed separation constraints: \[ y_i \big(\boldsymbol{w}^T \boldsymbol{x}_i + b\big) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i=1,\dots,20 \]
The R argument cost = 10 sets \(C=10\), assigning a moderate-to-high
penalty to margin violations. Larger \(C\) narrows the margin and reduces
misclassifications at the cost of higher model variance.
The general SVM kernel function \(K(\boldsymbol{x}_i,\boldsymbol{x}_j) =
\phi(\boldsymbol{x}_i)^T\phi(\boldsymbol{x}_j)\) computes inner
products in transformed feature space. For the linear kernel: \[
K(\boldsymbol{x}_i,\boldsymbol{x}_j) = \boldsymbol{x}_i^T
\boldsymbol{x}_j
\] This corresponds to the identity transformation \(\phi(\boldsymbol{x})=\boldsymbol{x}\),
which collapses the general kernel SVM formulation exactly into the
soft-margin linear Support Vector Classifier. This justifies the use of
kernel="linear to fit linear SVC models in
e1071.
At the optimal solution \((\boldsymbol{w}^*,b^*)\), support vectors satisfy the tight constraint \(\xi_i>0\): \[ y_i \big((\boldsymbol{w}^*)^T \boldsymbol{x}_i + b^*\big) = 1 - \xi_i,\quad \xi_i>0 \] Non-support vectors satisfy \(y_i\big((\boldsymbol{w}^*)^T\boldsymbol{x}_i + b^*\big) > 1\) with \(\xi_i=0\) and exert no influence over the final hyperplane. For this simulation, exactly 6 points satisfy the support vector equality condition.
The Euclidean distance metric used to compute hyperplane margins scales linearly with feature units: \[ d_i = \frac{y_i(\boldsymbol{w}^T\boldsymbol{x}_i + b)}{\|\boldsymbol{w}\|_2} \] If \(X_1\) ranges \([0,100]\) and \(X_2\) ranges \([0,1]\), the \(X_1\) component dominates the norm \(\|\boldsymbol{w}\|_2\), distorting margin width calculations. Standardization enforces uniform variance across all features to ensure equal predictive weight.
The index attribute of an e1071::svm object
returns the row indices of training samples identified as support
vectors:
## [1] 1 4 10 14 16 20
R
Built-in SVM Plot LimitationsThe base R plot() function accepts an svm
model object paired with training data to render the decision boundary
and two class regions partitioning the feature space:
Two critical limitations of this default visualization:
kernel="linear, low plotting resolution
creates visual distortion that makes the boundary appear
non-linear.We present a high-resolution ggplot2 alternative for
exact two-dimensional SVM visualization, including hyperplane and margin
lines.
If the model is fit with scale = TRUE, all grid
prediction points \(\textbf{must}\)
undergo identical standardization to match the scaled coordinate space
used during model training.
We generate a dense uniform grid of coordinate points spanning the full range of predictors \(X_1,X_2\). The trained SVM model predicts class labels for every grid point, which are used to shade the two decision regions in the plot.
# ============================================================
# SVM Visualization with Square Window Frame (auto bounds)
# ============================================================
# Install required packages if missing
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("e1071")) install.packages("e1071")
library(ggplot2)
library(e1071)
# ------------------------------------------------------------
# Example dataset (replace with your own 'data' if available)
# ------------------------------------------------------------
set.seed(123)
data <- data.frame(
X1 = c(rnorm(10, -1), rnorm(10, 2)),
X2 = c(rnorm(10, 2), rnorm(10, -1)),
y = factor(c(rep(1,10), rep(-1,10)))
)
# ------------------------------------------------------------
# Fit linear SVM
# ------------------------------------------------------------
svm_model <- svm(y ~ X1 + X2, data = data, kernel = "linear", cost = 1e6, scale = FALSE)
# ------------------------------------------------------------
# Compute min/max ranges for square frame
# ------------------------------------------------------------
range_X1 <- range(data$X1)
range_X2 <- range(data$X2)
# Make ranges equal length to form a square
x_span <- diff(range_X1)
y_span <- diff(range_X2)
span <- max(x_span, y_span)
x_mid <- mean(range_X1)
y_mid <- mean(range_X2)
xmin <- x_mid - span/2
xmax <- x_mid + span/2
ymin <- y_mid - span/2
ymax <- y_mid + span/2
# ------------------------------------------------------------
# Generate dense grid for shading
# ------------------------------------------------------------
new_x1 <- seq(from = xmin, to = xmax, length = 75)
new_x2 <- seq(from = ymin, to = ymax, length = 75)
grid_points <- expand.grid(X1 = new_x1, X2 = new_x2)
grid_predictions <- predict(object = svm_model, newdata = grid_points)
region_colors <- data.frame(grid_points, y = grid_predictions)
# ------------------------------------------------------------
# Extract hyperplane parameters
# ------------------------------------------------------------
beta <- drop(t(svm_model$coefs) %*% as.matrix(data[, c("X1","X2")])[svm_model$index,])
beta0 <- svm_model$rho
# ------------------------------------------------------------
# Plot with square window frame
# ------------------------------------------------------------
ggplot() +
# Shaded decision regions
geom_point(data = region_colors, aes(x = X1, y = X2, color = as.factor(y)), size = 0.5) +
# Training data points
geom_point(data = data, aes(x = X1, y = X2, color = as.factor(y)), size = 6) +
# Support vectors highlighted
geom_point(data = data[svm_model$index, ],
aes(x = X1, y = X2, color = as.factor(y)),
shape = 21, colour = "black", size = 6) +
# Central hyperplane
geom_abline(intercept = beta0 / beta[2], slope = -beta[1] / beta[2]) +
# Margin lines
geom_abline(intercept = (beta0 - 1) / beta[2], slope = -beta[1] / beta[2], linetype = "dashed") +
geom_abline(intercept = (beta0 + 1) / beta[2], slope = -beta[1] / beta[2], linetype = "dashed") +
# Square window frame
annotate("rect", xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax,
color = "black", fill = NA, size = 1.2) +
# Equal aspect ratio
coord_fixed(ratio = 1, xlim = c(xmin, xmax), ylim = c(ymin, ymax)) +
theme_bw() +
theme(legend.position = "none")The standard linear hyperplane equation is: \[
\boldsymbol{\beta}^T \boldsymbol{x} - \beta_0 = 0 \implies \beta_1 X_1 +
\beta_2 X_2 - \beta_0 = 0
\] Rearrange to solve for \(X_2\) to match the geom_abline
slope-intercept form \(X_2 = m X_1 +
c\): \[
\beta_2 X_2 = -\beta_1 X_1 + \beta_0 \implies X_2 =
\underbrace{\left(-\frac{\beta_1}{\beta_2}\right)}_{\text{slope}} X_1 +
\underbrace{\frac{\beta_0}{\beta_2}}_{\text{intercept}}
\] The soft margin band enforces the constraints \(y_i(\boldsymbol{\beta}^T\boldsymbol{x}_i -
\beta_0) = \pm 1\) for support vectors on the margin
boundaries:
This algebra justifies the three geom_abline
intercept/slope values used in the plotting code.
The preceding model was trained with penalty hyperparameter
cost = 10. This hyperparameter \(C\) directly controls the model
bias-variance tradeoff and is critical to optimizing predictive
generalization performance. The e1071 package provides the
tune() function to automate \(k\)-fold cross validation for selecting the
optimal \(C\) value. Required
inputs:
svm),ranges listing all candidate
values of \(C\) to evaluate.The cross validation results demonstrate that the minimum
cross-validation error rate of \(0.10\)
is achieved for \(C \ge 20\). The
tune() object automatically stores the optimal trained
model with the best hyperparameter configuration:
After selecting the final tuned model, we use the
predict() function to generate class predictions for unseen
test observations and quantify out-of-sample classification error.
Total test samples: \(20\). Correct classifications: \(10 + 7 = 17\). Misclassified samples: \(3\).
Reported test error calculation correction: \[ \text{Test Error Rate} = \frac{\text{Misclassified Samples}}{\text{Total Test Samples}} = \frac{3}{20} = 0.15 = 15\% \]
############################################################
# Binary Classification with Linear SVM
# Complete Corrected Version
############################################################
# ----------------------------------------------------------
# Install and load required packages
# ----------------------------------------------------------
required_packages <- c("ggplot2", "e1071")
for(pkg in required_packages){
if(!require(pkg, character.only = TRUE)){
install.packages(pkg)
library(pkg, character.only = TRUE)
}
}
# ----------------------------------------------------------
# Generate training data
# ----------------------------------------------------------
set.seed(10111)
coordinates <- matrix(
rnorm(40),
nrow = 20,
ncol = 2
)
colnames(coordinates) <- c("X1", "X2")
y <- c(rep(-1, 10), rep(1, 10))
# Shift positive class
coordinates[y == 1, ] <- coordinates[y == 1, ] + 1
data <- data.frame(
X1 = coordinates[,1],
X2 = coordinates[,2],
y = factor(y)
)
# ----------------------------------------------------------
# Fit Linear SVM
# ----------------------------------------------------------
svm_model <- svm(
y ~ X1 + X2,
data = data,
kernel = "linear",
cost = 10,
scale = FALSE
)
# ----------------------------------------------------------
# Support vectors
# ----------------------------------------------------------
cat("\nSupport Vector Indices:\n")##
## Support Vector Indices:
## [1] 1 4 10 14 16 20
##
## Number of Support Vectors:
## [1] 6
# ----------------------------------------------------------
# Base SVM Plot
# ----------------------------------------------------------
plot(
svm_model,
data,
main = "Linear SVM Classification"
)# ----------------------------------------------------------
# Create Prediction Grid
# ----------------------------------------------------------
range_X1 <- range(data$X1)
range_X2 <- range(data$X2)
new_x1 <- seq(
range_X1[1] - 0.5,
range_X1[2] + 0.5,
length.out = 150
)
new_x2 <- seq(
range_X2[1] - 0.5,
range_X2[2] + 0.5,
length.out = 150
)
grid_points <- expand.grid(
X1 = new_x1,
X2 = new_x2
)
grid_predictions <- predict(
svm_model,
grid_points
)
region_colors <- cbind(
grid_points,
prediction = grid_predictions
)
# ----------------------------------------------------------
# Extract Hyperplane Coefficients
# ----------------------------------------------------------
beta <- drop(
t(svm_model$coefs) %*%
as.matrix(
data[, c("X1", "X2")]
)[svm_model$index, ]
)
# IMPORTANT:
# e1071 decision function:
# f(x) = w'x - rho
beta0 <- -svm_model$rho
# ----------------------------------------------------------
# Compute decision boundary and margins
# ----------------------------------------------------------
boundary <- data.frame(
x = new_x1,
y = -(beta[1] * new_x1 + beta0) / beta[2]
)
margin_upper <- data.frame(
x = new_x1,
y = -(beta[1] * new_x1 + beta0 - 1) / beta[2]
)
margin_lower <- data.frame(
x = new_x1,
y = -(beta[1] * new_x1 + beta0 + 1) / beta[2]
)
# ----------------------------------------------------------
# High-Resolution SVM Plot
# ----------------------------------------------------------
p_svm <- ggplot() +
geom_point(
data = region_colors,
aes(
X1,
X2,
color = prediction
),
alpha = 0.20,
size = 0.8
) +
geom_point(
data = data,
aes(
X1,
X2,
color = y
),
size = 4
) +
geom_point(
data = data[svm_model$index, ],
aes(
X1,
X2,
fill = y
),
shape = 21,
colour = "black",
stroke = 1.2,
size = 5
) +
geom_line(
data = boundary,
aes(x, y),
linewidth = 1
) +
geom_line(
data = margin_upper,
aes(x, y),
linetype = "dashed"
) +
geom_line(
data = margin_lower,
aes(x, y),
linetype = "dashed"
) +
labs(
title = "Linear SVM with Decision Boundary and Margins",
x = "X1",
y = "X2"
) +
theme_bw()
print(p_svm)# ----------------------------------------------------------
# Cross Validation for Optimal Cost Parameter
# ----------------------------------------------------------
set.seed(1)
svm_cv <- tune.svm(
y ~ X1 + X2,
data = data,
kernel = "linear",
cost = c(
0.001,
0.01,
0.1,
1,
5,
10,
20,
50,
100,
150,
200
)
)
# ----------------------------------------------------------
# CV Summary
# ----------------------------------------------------------
cat("\nCross Validation Results:\n")##
## Cross Validation Results:
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 20
##
## - best performance: 0.1
##
## - Detailed performance results:
## cost error dispersion
## 1 0.001 0.55 0.4377975
## 2 0.010 0.55 0.4377975
## 3 0.100 0.15 0.2415229
## 4 1.000 0.25 0.2635231
## 5 5.000 0.20 0.2581989
## 6 10.000 0.15 0.2415229
## 7 20.000 0.10 0.2108185
## 8 50.000 0.10 0.2108185
## 9 100.000 0.10 0.2108185
## 10 150.000 0.10 0.2108185
## 11 200.000 0.10 0.2108185
# ----------------------------------------------------------
# Plot CV Error vs Cost
# ----------------------------------------------------------
p_cv <- ggplot(
svm_cv$performances,
aes(
x = cost,
y = error
)
) +
geom_line(linewidth = 1) +
geom_point(size = 3) +
theme_bw() +
labs(
title = "Cross-Validation Error vs Cost (C)",
x = "Cost (C)",
y = "10-Fold CV Error Rate"
) +
theme(
plot.title = element_text(hjust = 0.5)
)
print(p_cv)# ----------------------------------------------------------
# Best Model
# ----------------------------------------------------------
best_model <- svm_cv$best.model
cat("\nBest Cost Parameter:\n")##
## Best Cost Parameter:
## cost
## 7 20
# ----------------------------------------------------------
# Generate Independent Test Set
# ----------------------------------------------------------
set.seed(19)
test_coordinates <- matrix(
rnorm(40),
nrow = 20,
ncol = 2
)
colnames(test_coordinates) <- c(
"X1",
"X2"
)
test_y <- sample(
c(-1, 1),
size = 20,
replace = TRUE
)
test_coordinates[test_y == 1, ] <-
test_coordinates[test_y == 1, ] + 1
test_data <- data.frame(
X1 = test_coordinates[,1],
X2 = test_coordinates[,2],
y = factor(
test_y,
levels = levels(data$y)
)
)
# ----------------------------------------------------------
# Predictions
# ----------------------------------------------------------
test_predictions <- predict(
best_model,
newdata = test_data
)
# ----------------------------------------------------------
# Test Error Rate
# ----------------------------------------------------------
test_error_rate <- mean(
test_predictions != test_data$y
)
test_error_pct <- round(
100 * test_error_rate,
2
)
cat(
"\nTest Set Classification Error:",
test_error_pct,
"%\n"
)##
## Test Set Classification Error: 15 %
# ----------------------------------------------------------
# Confusion Matrix
# ----------------------------------------------------------
conf_matrix <- table(
Predicted = test_predictions,
Actual = test_data$y
)
cat("\nConfusion Matrix:\n")##
## Confusion Matrix:
## Actual
## Predicted -1 1
## -1 10 0
## 1 3 7
# ----------------------------------------------------------
# Classification Accuracy
# ----------------------------------------------------------
accuracy <- mean(
test_predictions == test_data$y
)
cat(
"\nClassification Accuracy:",
round(100 * accuracy, 2),
"%\n"
)##
## Classification Accuracy: 85 %
############################################################
# End
############################################################\(\textbf{Correction Note}\): The original Spanish text contains a mathematical error claiming test error equals 10%; the correct computed error from the confusion matrix is 15%. The error source is explained below in the corrections summary.
The Support Vector Classifier described in the previous sections achieves strong predictive performance when the decision boundary separating classes is approximately linear. If the true class boundary is non-linear, its classification performance degrades drastically. One widely used strategy to handle datasets with non-linear class separation is to expand the dimensionality of the original feature space.
The fact that two groups cannot be linearly separated in the original low-dimensional feature space does not imply linear inseparability in a higher-dimensional space. The following four-panel figure illustrates this core principle: two clusters that are non-linearly separable in the 2-dimensional input space become linearly separable once a third feature dimension \(X_3\) is introduced.
The Support Vector Machine (SVM) framework can be interpreted as an extension of the linear Support Vector Classifier built by mapping input data into a higher-dimensional latent feature space. Linear decision hyperplanes learned within this augmented high-dimensional space map back to curved, non-linear decision boundaries when projected onto the original low-dimensional raw feature space.
# ----------------------
# Install and load required packages
# ----------------------
required_pkgs <- c("ggplot2", "rgl", "plot3D", "dplyr", "tidyr", "MASS")
new_pkgs <- required_pkgs[!(required_pkgs %in% installed.packages()[,"Package"])]
if(length(new_pkgs)) {
install.packages(new_pkgs, quiet = TRUE, repos = "https://cloud.r-project.org")
}## package 'misc3d' successfully unpacked and MD5 sums checked
## package 'rgl' successfully unpacked and MD5 sums checked
## package 'plot3D' successfully unpacked and MD5 sums checked
library(ggplot2)
library(rgl)
library(plot3D)
library(dplyr)
library(MASS)
# ----------------------
# Step 1: Generate synthetic non-linear cluster data
# ----------------------
set.seed(123)
# Class 1 (Black): compact inner cluster
n_black <- 220
mu_black <- c(0, 0)
sigma_black <- matrix(c(0.8, 0.3, 0.3, 0.8), nrow = 2)
black_2d <- MASS::mvrnorm(n_black, mu = mu_black, Sigma = sigma_black) %>% as.data.frame()
colnames(black_2d) <- c("X1", "X2")
black_2d$Class <- "Black"
black_2d$X3 <- rowSums(black_2d[, c("X1", "X2")]^2) + rnorm(n_black, 0, 0.25)
# Class 2 (Red): outer curved ring cluster
n_red <- 320
theta <- runif(n_red, 0, 2*pi)
r <- runif(n_red, 1.8, 3.6)
red_2d <- data.frame(
X1 = r * cos(theta) + rnorm(n_red, 0, 0.18),
X2 = r * sin(theta) + rnorm(n_red, 0, 0.18)
)
red_2d$Class <- "Red"
red_2d$X3 <- rowSums(red_2d[, c("X1", "X2")]^2) + rnorm(n_red, 0, 0.25)
# Combine dataset
df_full <- rbind(black_2d, red_2d)
df_full$Class <- factor(df_full$Class, levels = c("Black", "Red"))
# ----------------------
# Panel 1: 2D Scatter
# ----------------------
p1 <- ggplot(df_full, aes(x = X1, y = X2, color = Class)) +
geom_point(size = 1.7) +
scale_color_manual(values = c("black", "#ff3333")) +
theme_bw() +
theme(legend.position = "none") +
labs(x = expression(X[1]), y = expression(X[2]))
print(p1)## wgl
## 1
points3d(df_full$X1, df_full$X2, df_full$X3,
col = ifelse(df_full$Class=="Black", "black", "#ff3333"), size = 6)
# Proper meshgrid for surface3d
x_seq <- seq(-5, 5, length.out = 30)
y_seq <- seq(-5, 5, length.out = 30)
X1 <- matrix(rep(x_seq, each = length(y_seq)), nrow = length(y_seq))
X2 <- matrix(rep(y_seq, times = length(x_seq)), nrow = length(y_seq))
X3 <- matrix(3.2, nrow = length(y_seq), ncol = length(x_seq))
p2 <- surface3d(X1, X2, X3, col = "gray70", alpha = 0.5)
axes3d()
title3d("", xlab = "X1", ylab = "X2", zlab = "X3")
close3d() # âś… correct replacement for rgl.close()
print(p2)
# ----------------------
# Panel 3: Concentric Ring 2D Scatter
# ----------------------
p3 <- ggplot(df_full, aes(x = X1, y = X2, color = Class)) +
geom_point(size = 1.5) +
scale_color_manual(values = c("black", "#ff2222")) +
theme_bw() +
theme(legend.position = "none") +
labs(x = expression(X[1]), y = expression(X[2]))
print(p3)# ----------------------
# Panel 4: 3D Density Surface Plot
# ----------------------
p4 <- scatter3D(df_full$X1, df_full$X2, df_full$X3, colvar = as.numeric(df_full$Class),
col = c("black", "red"), alpha = 0.6, ticktype = "detailed",
xlab = "X1", ylab = "X2", zlab = "X3")## [,1] [,2] [,3] [,4]
## [1,] 2.064617e-01 -0.11135777 0.13271103 -0.13271103
## [2,] 1.763940e-01 0.13512568 -0.16103652 0.16103652
## [3,] -4.869293e-18 0.09477014 0.07952159 -0.07952159
## [4,] -2.577849e-02 -0.69663640 -3.37078580 4.37078580
# ----------------------
# Completion message
# ----------------------
cat("All four figure panels successfully displayed inline.\n")## All four figure panels successfully displayed inline.
We established that Support Vector Machines follow the same core strategy as the linear Support Vector Classifier, with one key modification: SVMs first lift the input data into a higher-dimensional feature space before solving the classification optimization problem. The immediate questions arise: How do we perform this dimensional lifting, and what dimensionality is appropriate?
The dimensionality of a dataset can be increased by combining or non-linearly transforming its original input features. For example, we can map a 2-dimensional input space \((x_1,x_2)\) to a 3-dimensional lifted space using the transformation: \[ f(x_1,x_2) = \big(x_1^2,\ \sqrt{2}\,x_1x_2,\ x_2^2\big) \] This is only one of infinitely many valid feature transformations. How do we select the optimal mapping? This is where kernels enter the framework. A kernel function \(K\) computes the dot product between two input vectors \(\textit{after}\) they have been projected into a higher-dimensional latent space, without explicitly constructing the lifted feature vectors themselves.
While we do not derive the full optimization objective here, the SVM loss function contains an inner dot product term. Replacing this raw dot product with a kernel function directly yields the support vectors and separating hyperplane defined in the lifted feature space corresponding to that kernel. This mathematical shortcut is known as the \(\textbf{kernel trick}\): with only a minor modification to the original linear SVC problem, we can solve classification for arbitrary high-dimensional lifted spaces without explicitly computing the transformed feature vectors. Many standard kernel functions exist; the most widely used variants are detailed below.
Kernel formula: \[ K(\boldsymbol{x},\boldsymbol{x}') = \boldsymbol{x} \cdot \boldsymbol{x}' \] When using a linear kernel, the resulting Support Vector Machine classifier is mathematically identical to the basic linear Support Vector Classifier.
Kernel formula: \[ K(\boldsymbol{x},\boldsymbol{x}') = \big(\boldsymbol{x} \cdot \boldsymbol{x}' + c\big)^d \] Setting \(d=1,\ c=0\) recovers the linear kernel. For \(d>1\), the model produces non-linear decision boundaries; the degree of non-linearity grows as \(d\) increases. Values of \(d\) greater than 5 are generally discouraged due to severe overfitting risk.
Kernel formula: \[ K(\boldsymbol{x},\boldsymbol{x}') = \exp\big(-\gamma \|\boldsymbol{x} - \boldsymbol{x}'\|^2\big) \]
The hyperparameter \(\gamma\) controls the flexibility of the RBF kernel model: very small \(\gamma\) produces decision boundaries nearly identical to a linear classifier. As \(\gamma\) increases, the model’s capacity and flexibility rise, creating tighter, more locally adapted decision regions.
The kernels presented here represent only a small subset of available kernel functions. Every kernel introduces a set of hyperparameters, whose optimal values are selected via cross-validation. No single kernel universally outperforms all others; kernel choice heavily depends on the structure and nature of the classification task at hand.
The RBF kernel is strongly recommended for general-purpose use. It offers two key practical advantages:
This practical example uses a publicly available synthetic dataset from the textbook \(\textit{The Elements of Statistical Learning}\). The data consists of two-dimensional input observations generated from an underlying non-linear function, with two predictor features \(X_1, X_2\) and a binary categorical class label \(y\).
The loaded list object ESL.mixture contains two core
components:
The SVM implementation in the \(\texttt{e1071}\) R package supports multiple non-linear kernel functions; the two most widely used variants are:
All SVM models share a universal regularization hyperparameter \(C\), which controls the tradeoff between maximizing the separating margin and tolerating training-set misclassification errors.
We perform a grid search cross-validation routine to identify the optimal combination of \(C\) and \(\gamma\) for the RBF kernel, minimizing cross-validation classification error.
The grid search identifies the optimal hyperparameter pair \(\boldsymbol{C=1,\ \gamma=5}\), which achieves the lowest cross-validation test classification error.
To render the continuous decision boundaries learned by the SVM, we generate a dense evenly spaced grid spanning the full range of \(X_1\) and \(X_2\). We predict the class label for every grid coordinate and colour the background regions according to model predictions. Raw training observations are overlaid, and support vectors are highlighted with black outlined markers.
# Install required packages if missing
required_pkgs <- c("e1071", "ggplot2", "dplyr")
new_pkgs <- required_pkgs[!(required_pkgs %in% installed.packages()[,"Package"])]
if(length(new_pkgs)) install.packages(new_pkgs, quiet = TRUE)
library(e1071)
library(ggplot2)
library(dplyr)
# --------------------------
# Step 1: Load ESL Mixture Dataset
# --------------------------
rm(list = ls())
load(url("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/ESL.mixture.rda"))
datos <- data.frame(ESL.mixture$x, y = ESL.mixture$y)
datos$y <- as.factor(datos$y)
head(datos)| X1 | X2 | y |
|---|---|---|
| 2.5260930 | 0.3210504 | 0 |
| 0.3669545 | 0.0314621 | 0 |
| 0.7682191 | 0.7174862 | 0 |
| 0.6934357 | 0.7771940 | 0 |
| -0.0198366 | 0.8672537 | 0 |
| 2.1965449 | -1.0230141 | 0 |
# Raw 2D class scatter plot
p_raw <- ggplot(data = datos, aes(x = X1, y = X2, color = y)) +
geom_point(size = 2.5) +
theme_bw() +
theme(legend.position = "none")
ggsave("esl_raw_scatter.png", p_raw, width=4, height=4, dpi=300)
print(p_raw)# --------------------------
# Step 2: Cross-Validation Grid Search for C and Gamma
# --------------------------
set.seed(1)
svm_cv <- tune(
svm,
train.x = datos[, c("X1","X2")],
train.y = datos$y,
kernel = "radial",
ranges = list(
cost = c(0.001, 0.01, 0.1, 1, 5, 10, 20),
gamma = c(0.5, 1, 2, 3, 4, 5, 10)
)
)
# Cross-validation error hyperparameter plot
p_cv <- ggplot(data = svm_cv$performances, aes(x = cost, y = error, color = as.factor(gamma))) +
geom_line(linewidth=0.8) +
geom_point(size=1.8) +
labs(title = "Classification Cross-Validation Error vs Hyperparameters C and Îł", color = "Îł") +
theme_bw() +
theme(legend.position = "bottom")
ggsave("svm_cv_error_plot.png", p_cv, width=6, height=4, dpi=300)
print(p_cv)## Optimal tuned hyperparameters:
## cost gamma
## 39 1 5
modelo_svm_rbf <- svm_cv$best.model
# --------------------------
# Step3: Generate Decision Region Grid & Final SVM Plot
# --------------------------
rango_X1 <- range(datos$X1)
rango_X2 <- range(datos$X2)
new_x1 <- seq(from = rango_X1[1], to = rango_X1[2], length = 75)
new_x2 <- seq(from = rango_X2[1], to = rango_X2[2], length = 75)
nuevos_puntos <- expand.grid(X1 = new_x1, X2 = new_x2)
predicciones <- predict(object = modelo_svm_rbf, newdata = nuevos_puntos)
color_regiones <- data.frame(nuevos_puntos, y = predicciones)
# Full SVM decision boundary plot with support vectors highlighted
p_svm_final <- ggplot() +
geom_point(data = color_regiones, aes(x = X1, y = X2, color = as.factor(y)), size = 0.5) +
geom_point(data = datos, aes(x = X1, y = X2, color = as.factor(y)), size = 2.5) +
geom_point(
data = datos[modelo_svm_rbf$index, ],
aes(x = X1, y = X2, color = as.factor(y)),
shape = 21,
colour = "black",
size = 2.5
) +
theme_bw() +
theme(legend.position = "none")
ggsave("svm_esl_decision_regions.png", p_svm_final, width=4.5, height=4, dpi=300)
print(p_svm_final)cat("All three output figures saved to working directory:\n
1. esl_raw_scatter.png (raw class scatter)
2. svm_cv_error_plot.png (cross-val error hyperparameter curve)
3. svm_esl_decision_regions.png (SVM decision regions with support vectors)\n")## All three output figures saved to working directory:
##
## 1. esl_raw_scatter.png (raw class scatter)
## 2. svm_cv_error_plot.png (cross-val error hyperparameter curve)
## 3. svm_esl_decision_regions.png (SVM decision regions with support vectors)
The separating hyperplane logic that underpins binary SVM classifiers does not naturally extend to classification problems with more than two distinct classes. Three widely adopted strategies have been developed to adapt SVMs to multi-class tasks with \(K>2\) target classes: One-versus-One, One-versus-All, and DAGSVM.
For a task with \(K>2\) classes, the One-versus-One approach trains \(\frac{K(K-1)}{2}\) independent binary SVM classifiers, each dedicated to separating one unique pair of classes.
To generate a prediction for a single observation, every pairwise classifier produces a class vote for the sample. We tally total votes for each class, and the observation is assigned to the class receiving the highest vote count.
The major limitation of this method is that the total number of trained models grows quadratically with \(K\), leading to excessive computational cost for datasets with many distinct classes.
The svm() function from the e1071 package
automatically uses the One-versus-One strategy for multi-class
classification when the response variable is a factor with three or more
levels.
This strategy trains \(K\) separate binary SVM models. Each model isolates one single target class against the union of all remaining \(K-1\) classes, producing one dedicated separating hyperplane per class.
During prediction, the sample is evaluated against all \(K\) classifiers, and the observation is assigned to the class whose model returns a positive prediction. This simple design carries two critical flaws:
DAGSVM is an optimized extension of the One-versus-One framework. It retains the same pairwise SVM training pipeline but accelerates prediction runtime by eliminating redundant pairwise comparisons via a Directed Acyclic Graph (DAG) decision tree structure.
Six pairwise SVMs are pre-trained: A-B, A-C, A-D, B-C, B-D, C-D.
Instead of evaluating all six pairwise models, only three are required for inference. DAGSVM preserves all advantages of One-versus-One while delivering drastically faster prediction performance.
The Khan dataset from the ISLR library
contains 83 tissue samples belonging to four distinct tumor subtypes.
Each sample includes gene expression intensity measurements across 2308
genes. The objective is to train a multi-class linear SVM to predict
tumor subtype from high-dimensional gene expression profiles.
The dataset is pre-partitioned:
xtrain, ytrain): 63
samplesxtest, ytest): 20
unseen samples for generalization evaluationThis is a high-dimensional regime where predictor count far exceeds observation count. Models fit on such data are highly susceptible to overfitting. To reduce model flexibility and overfitting risk, we select the linear kernel. A linear SVM has only one tunable hyperparameter: the regularization penalty \(C\).
The model achieves perfect classification (0% training error) on all 63 training samples. Zero training error is a strong warning sign of potential overfitting; we validate generalization performance on the independent held-out test set.
Out of 20 held-out test samples, only two are misclassified, yielding a test set error rate of just 10%. This confirms the linear SVM generalizes well despite perfect training-set accuracy.
# Install required packages if missing
required_pkgs <- c("ISLR", "e1071", "ggplot2")
new_pkgs <- required_pkgs[!(required_pkgs %in% installed.packages()[,"Package"])]
if(length(new_pkgs)) install.packages(new_pkgs, quiet = TRUE, repos = "https://cloud.r-project.org")## package 'ISLR' successfully unpacked and MD5 sums checked
library(ISLR)
library(e1071)
library(ggplot2)
# --------------------------
# Step 1: Load Khan dataset
# --------------------------
data("Khan")
cat("Training set dimensions:", dim(Khan$xtrain), "\n")## Training set dimensions: 63 2308
## Test set dimensions: 20 2308
# Build training dataframe
datos_train <- data.frame(y = as.factor(Khan$ytrain), Khan$xtrain)
# --------------------------
# Step 2: Cross-validation grid search
# --------------------------
set.seed(1)
svm_cv <- tune(
svm,
y ~ .,
data = datos_train,
kernel = "linear",
ranges = list(cost = c(0.0001, 0.0005, 0.001, 0.01, 0.1, 1))
)
# Plot CV error vs cost
p_cv <- ggplot(data = svm_cv$performances, aes(x = cost, y = error)) +
geom_line(linewidth = 0.8) +
geom_point(size = 1.8) +
labs(title = "Cross-Validation Error vs Regularization Parameter C") +
theme_bw()
ggsave("svm_khan_cv_error.png", p_cv, width=6, height=4, dpi=300)
print(p_cv)## Optimal parameters:
## cost
## 2 5e-04
modelo_svm <- svm_cv$best.model
# --------------------------
# Step 3: Training performance
# --------------------------
train_error_pct <- 100 * mean(datos_train$y != modelo_svm$fitted)
cat("Training error:", train_error_pct, "%\n")## Training error: 0 %
## True
## Prediction 1 2 3 4
## 1 8 0 0 0
## 2 0 23 0 0
## 3 0 0 12 0
## 4 0 0 0 20
# --------------------------
# Step 4: Test performance
# --------------------------
datos_test <- data.frame(y = as.factor(Khan$ytest), Khan$xtest)
predicciones <- predict(modelo_svm, newdata = datos_test)
test_error_pct <- 100 * mean(datos_test$y != predicciones)
cat("Test error:", test_error_pct, "%\n")## Test error: 10 %
## True
## Prediction 1 2 3 4
## 1 3 0 0 0
## 2 0 6 0 0
## 3 0 0 4 0
## 4 0 0 2 5
This section collects definitions, commentary, and clarifications gathered from various reference sources. Some content could not be integrated into the main document body due to time constraints; others are retained here as auxiliary complementary background material for machine learning topics.
The Perceptron algorithm was published in 1957 by Frank Rosenblatt. The core objective of the Perceptron is to find a separating hyperplane that correctly partitions a linearly separable dataset. Once trained, this hyperplane can be used to perform binary classification tasks.
While the Perceptron itself is a very simple learning algorithm, understanding its internal mechanics is foundational for studying more complex models such as Support Vector Machines and artificial neural networks.
Before detailing the algorithm step-by-step, we first define key required mathematical building blocks: the dot product (scalar product), hyperplanes, and linear separability.
The dot product is an operation defined for two vectors of identical dimensionality, which outputs a single scalar value encoding directional and magnitude relationships between the vectors. There are two fully equivalent interpretations: geometric and algebraic.
Geometrically, the dot product equals the product of the Euclidean norm (second norm / magnitude) of each vector multiplied by the cosine of the angle \(\alpha\) formed between them. For two vectors \(\boldsymbol{x}\) and \(\boldsymbol{y}\) separated by angle \(\alpha\): \[ \boldsymbol{x} \cdot \boldsymbol{y} = \|\boldsymbol{x}\| \|\boldsymbol{y}\| \cos(\alpha) \] The value of the dot product is entirely determined by the angle between the vectors:
Algebraically, the dot product is computed as the sum of element-wise products across all vector dimensions. For 2-dimensional vectors: \[ \boldsymbol{x} \cdot \boldsymbol{y} = x_1 y_1 + x_2 y_2 \] Generalized to \(n\)-dimensional vectors: \[ \boldsymbol{x} \cdot \boldsymbol{y} = \sum_{i=1}^n x_i y_i \] This formulation requires no knowledge of the angle between the two vectors.
The concept of linear separability is intuitive in low-dimensional feature spaces:
This logic generalizes to arbitrary \(n\)-dimensional spaces. The core condition remains identical: there exists a subspace of one lower dimension that cleanly splits the two classes. This separating subspace is formally called a \(\textbf{hyperplane}\).
# Geometric dot product function (corrected spelling & variable names)
dot_product_geometric <- function(x, y, alpha){
norm_x <- sqrt(sum(x^2))
norm_y <- sqrt(sum(y^2))
alpha_rad <- (alpha * pi) / 180
return(norm_x * norm_y * cos(alpha_rad))
}
# Algebraic summation-based dot product function
dot_product_algebra <- function(x, y) {
result <- 0
if (length(x) != length(y)) {
stop("Vectors must have identical length (same dimensionality)")
}
for (i in seq_along(x)) {
result <- result + x[i] * y[i]
}
return(result)
}
# Test example vectors from the text
vec1 <- c(3, 5)
vec2 <- c(8, 2)
# Run geometric version with alpha=45 degrees
res_geo <- dot_product_geometric(x = vec1, y = vec2, alpha = 45)
# Run algebraic version (no angle required)
res_alg <- dot_product_algebra(x = vec1, y = vec2)
# Print matching results
cat("Geometric dot product result: ", res_geo, "\n")## Geometric dot product result: 34
## Algebraic dot product result: 34
## Both implementations return identical value 34 as shown in the reference text.
In geometry, a hyperplane is defined as a subspace whose dimensionality is exactly one less than the ambient space containing it. This abstract definition becomes intuitive with a 2-dimensional example.
In a 2-dimensional ambient space, a hyperplane has \(2-1=1\) dimension: a straight line. The standard slope-intercept line equation from basic calculus is: \[ x_2 = a x_1 + b \] Equivalent rearranged implicit form: \[ x_2 - a x_1 - b = 0 \] An alternative vector-based definition of a line uses the dot product of two vectors. Let input feature vector \(\boldsymbol{x}=(x_1, x_2)\), weight vector \(\boldsymbol{w}=(w_1, w_2)\), and scalar bias constant \(b\). The line equation is written as: \[ \boldsymbol{x} \cdot \boldsymbol{w} + b = 0 \] Expanding the dot product: \[ (x_1, x_2)\cdot(w_1, w_2) + b = x_1 w_1 + x_2 w_2 + b = 0 \] Rearranged to recover slope-intercept form: \[ x_2 = -\frac{w_1}{w_2}x_1 - \frac{b}{w_2} \] Define slope \(a = -\frac{w_1}{w_2}\) and vertical intercept \(c = -\frac{b}{w_2}\), recovering the standard line formula: \[ x_2 = a x_1 + c \] The critical advantage of this vectorized formulation is full generalization to spaces of arbitrary dimensionality; this dot product form is the universal hyperplane equation for any \(n\)-dimensional feature space.
An \(n\)-dimensional hyperplane partitions an \((n+1)\)-dimensional ambient space into two disjoint half-spaces. This partitioning property enables hyperplanes to act as binary classifiers: samples lying on one side of the hyperplane are assigned class label \(+1\), samples on the opposite side receive label \(-1\).
For a 2D dataset, each training observation \(i\) is represented by feature vector \(\boldsymbol{x}_i\) and binary response label \(y_i \in \{+1, -1\}\). Given a hyperplane defined by weight vector \(\boldsymbol{w}\) and bias \(b\), we define a prediction function \(h\): \[ h(\boldsymbol{x}_i) = \begin{cases} +1 & \text{if } \boldsymbol{w}\cdot\boldsymbol{x}_i + b \ge 0 \\ -1 & \text{if } \boldsymbol{w}\cdot\boldsymbol{x}_i + b < 0 \end{cases} \] This piecewise function simplifies compactly via the sign operator: \[ h(\boldsymbol{x}_i) = \operatorname{sign}\big(\boldsymbol{w}\cdot\boldsymbol{x}_i + b\big) \]
We eliminate the standalone bias constant \(b\) by modifying all input and weight vectors into \(\textit{augmented vectors}\):
The final prediction output remains identical, and this augmented vector formulation simplifies algorithm implementation for linear models like the Perceptron.
A hyperplane can perform binary classification on linearly separable data for any feature dimension. The Perceptron algorithm solves the core practical problem: recovering the separating hyperplane weight vector \(\boldsymbol{w}\) from labeled training data.
Given a training dataset with \(m\) \(n\)-dimensional observations \((\boldsymbol{x}_i, y_i)\), the Perceptron optimizes for the weight vector \(\boldsymbol{w}\) of the augmented-space prediction function: \[ h(\tilde{\boldsymbol{x}}_i) = \operatorname{sign}\big(\boldsymbol{w}\cdot\tilde{\boldsymbol{x}}_i\big) \] The only unknown quantity is \(\boldsymbol{w}\); the Perceptron’s objective is to find a \(\boldsymbol{w}\) that perfectly separates all training samples by class.
The algorithm follows four repeating steps until zero misclassified samples remain:
While the Perceptron always returns some perfectly separating hyperplane for linearly separable datasets, it does not guarantee the \(\textit{optimal maximal-margin hyperplane}\). The optimal hyperplane is defined as the boundary positioned equidistant from the closest training samples of both classes; solving for this maximal margin hyperplane is the core objective of Support Vector Machines, not the simple Perceptron algorithm.
# Install ggplot2 if missing for plotting
if (!requireNamespace("ggplot2", quietly = TRUE)) {
install.packages("ggplot2", quiet = TRUE)
}
library(ggplot2)
# --------------------------
# Helper Prediction Function
# --------------------------
predict_clase <- function(X, w){
pred_score <- apply(X, MARGIN = 1, FUN = function(x){crossprod(x, w)})
pred_class <- sign(pred_score)
return(pred_class)
}
# --------------------------
# Full Perceptron Training Function
# --------------------------
perceptron <- function(X, y, random_seed = 553){
X <- cbind(1, X)
set.seed(random_seed)
w <- runif(n = ncol(X), min = 0, max = 1)
clasificaciones <- predict_clase(X = X, w = w)
errores_clasificacion <- which(clasificaciones != y)
while (length(errores_clasificacion) > 0) {
i <- sample(x = errores_clasificacion, size = 1)
w <- w + X[i,] * y[i]
clasificaciones <- predict_clase(X = X, w = w)
errores_clasificacion <- which(clasificaciones != y)
}
return(w)
}
# --------------------------
# Synthetic Linearly Separable 2D Dataset
# --------------------------
X <- matrix(c(8, 4, 9, 7, 9, 4, 10, 2, 8, 7, 4, 4, 1, 2, 7, 10, 7, 10, 6, 8, 10,
7, 3, 5, 4, 6, 3, 5), ncol = 2, byrow = FALSE)
y <- c(1, 1, 1, 1, 1, 1, 1 , -1 , -1 , -1 , -1 , -1 , -1 , -1 )
# Train model and output weight vector
hiperplano <- perceptron(X = X, y = y)
cat("Learned augmented hyperplane weight vector:\n")## Learned augmented hyperplane weight vector:
## [1] -50.452542 2.744047 5.822881
# --------------------------
# Generate Exact Matching 2D Scatter + Decision Line Plot
# --------------------------
datos <- data.frame(X1 = X[,1], X2 = X[,2], y = factor(y))
p_perceptron <- ggplot(data = datos, aes(x = X1, y = X2, color = y)) +
geom_point(size = 2.5) +
geom_abline(
intercept = -(hiperplano[1]/hiperplano[3]),
slope = -(hiperplano[2]/hiperplano[3])
) +
theme_bw() +
theme(legend.position = "none")
print(p_perceptron)# Save high-resolution figure matching reference layout
ggsave("perceptron_2d_separating_line.png", p_perceptron, width = 4.2, height = 4, dpi = 300)
cat("\nPlot saved as perceptron_2d_separating_line.png in working directory.\n")##
## Plot saved as perceptron_2d_separating_line.png in working directory.