Department of Mathematics

The Chinese University of Hong Kong

HSMMC Math Modelling Advanced Training Workshop

May 16, 2026

1 Gradient Descent Method (GDM) for One-Variable Functions

We now introduce the Gradient Descent Method to obtain an approximate solution to the least-squares problem numerically.

There is a problem of finding the minimum value of a one-variable function \(f(x)\) that can be differentiated as follows.

import numpy as np
import matplotlib.pyplot as plt

# Create x values (matches xmin=-3, xmax=3)
x = np.linspace(-3, 3, 100)

# Define function f(x) = x² + 1
y = x ** 2 + 1

# Create figure
plt.figure(figsize=(6, 4))

# Plot the function (blue, thick line)
plt.plot(x, y, 'blue', linewidth=2)

# Dashed vertical line at x* = 0 (from y=0 to y=1)
plt.plot([0, 0], [0, 1], 'k--')  # black dashed line

# Label x* below the x-axis
plt.text(0, -0.3, r'$x^*$', ha='center', fontsize=12)

# Axes settings (match TikZ exactly)
plt.axhline(0, color='black', linewidth=0.8)  # x-axis
plt.axvline(0, color='black', linewidth=0.8)  # y-axis
plt.xlabel(r'$x$', fontsize=12)
plt.ylabel(r'$f(x)$', fontsize=12)
plt.xlim(-3, 3)

## (-3.0, 3.0)

plt.ylim(-1, 5)

## (-1.0, 5.0)

# Remove ticks (ticks=none)
plt.xticks([])

## ([], [])

plt.yticks([])

## ([], [])

# Show plot
plt.tight_layout()
plt.show()

Then, the point \(x^*\) giving the minimum value of the function (it is called the optimal solution) by Fermat’s critical point theorem satisfies the following: \[ f'(x^*) = 0. \] Therefore, we can determine which of the solutions satisfying the equation \(f'(x) = 0\) is the optimal solution. However, when the function is complex, it is not easy to find a critical point by solving the equation. In this case, the critical point is obtained numerically. In this section, we look at the gradient descent method, which is the most popular numerical optimization method to find the minimum value of a given function. The basic idea of the gradient descent method is to find the slope of the function, move it toward downhill, and repeat it until it reaches the extreme value.

Gradient Descent Method] Algorithm

Step 1 Set an initial iterate \(x_1\), tolerance \(0 \leq \epsilon \ll 1\), initial learning rate \(\eta\) (eta), and iteration number \(k := 1\).
Step 2 Compute \(g_k = f'(x_k)\). If \(|g_k| \leq \epsilon\), then stop.
Step 3 Set \(x_{k+1} = x_k - \eta g_k\), \(k := k + 1\) and go to Step 2.

Let’s explain this GDM algorithm. Set \(x_k\) and compute \(g_k = f'(x_k)\). If \(|g_k| < \epsilon\) (it satisfies the critical point theorem within the margin of error we allow), then the algorithm stops and will give \(x_k\) as the optimal solution. Here, \(\epsilon\) (epsilon, tolerance) satisfies \(0 \leq \epsilon \ll 1\). A notation \(``\ll"\) in this inequality means that the epsilon is much smaller than \(1\). Hence, we give a small tolerance epsilon \(\epsilon = 10^{-6}\), which is very close to zero, and with a learning rate \(\eta\). And let the iteration number \(k\) be \(1\). If \(|g_k| > \epsilon\), \(x_{k+1} = x_k - \eta g_k\) is determined using the formula. Then \(g_k = f'(x_k)\) is calculated to determine whether or not the critical point theorem is satisfied. If the critical point theorem is not satisfied, then \(x_1\) is determined similarly. In this way, we create \(x_1, x_2, \dots (\to x^*)\). If the value of \(g_k = f'(x_k)\) is within a given tolerance after some repeated steps \(x_{k+1} = x_k - \eta f'(x_k)\), then the algorithm stops. We expect some \(x_k\) or the limit \(x^*\) to satisfy \(f'(x^*) = 0\).

Let’s see how the GDM works in detail. Suppose the slope of the tangent line to the function \(f\) is negative at the approximate solution \(x_k\) after the \(k\)-th iteration. It is \(g_k = f'(x_k) < 0\) as shown in the figure below. This means the function decreases when \(x_k\) moves from left to right, so we can expect \(x^*\) is located on the right side of \(x_k\). If \(x_k\) moves from the left to the right, then \(x_{k+1}\) is closer to the optimal value \(x^*\). Therefore, we move from \(x_k\) to \(x_{k+1} = x_k - \eta g_k\) with \(-\eta g_k > 0\) direction.

import numpy as np
import matplotlib.pyplot as plt

# ----------------------
# Data for the function
# ----------------------
x = np.linspace(-3, 3, 100)
y = x ** 2 + 1

# ----------------------
# Create plot
# ----------------------
plt.figure(figsize=(7, 5))

# Plot f(x) = x² + 1 (blue thick line)
plt.plot(x, y, color='blue', linewidth=3)

# Dashed vertical line at x* = 0
plt.plot([0, 0], [0, 1], 'k--', linewidth=1.5)

# Red arrow: (-1.5, 3.25) → (-0.75, 1.5625)
plt.arrow(
    -1.5, 3.25,        # start point
    0.75, -1.6875,     # dx, dy
    color='red',
    linewidth=2,
    head_width=0.12,
    head_length=0.15,
    length_includes_head=True
)

# ----------------------
# All labels (exact positions)
# ----------------------
plt.text(-1.7, 3.5, r"$f'(x_k) < 0$", color='red', fontsize=14)
plt.text(-1.5, -0.4, r"$x_k$", ha='center', fontsize=12)
plt.text(-0.75, -0.4, r"$x_{k+1}$", ha='center', fontsize=12)
plt.text(0, -0.4, r"$x^*$", ha='center', fontsize=12)

# ----------------------
# Axes styling (match TikZ)
# ----------------------
plt.axhline(0, color='black', linewidth=1)
plt.axvline(0, color='black', linewidth=1)
plt.xlabel(r'$x$', fontsize=12)
plt.ylabel(r'$f(x)$', fontsize=12)

plt.xlim(-3, 3)

## (-3.0, 3.0)

plt.ylim(-1, 5)

## (-1.0, 5.0)

# Remove ticks (ticks=none)
plt.xticks([])

## ([], [])

plt.yticks([])

## ([], [])

# Show plot
plt.tight_layout()
plt.show()

Similarly, the slope of the tangent line to the function \(f\) is positive at the approximate solution \(x_k\) after the \(k\)-th iteration. It is \(g_k = f'(x_k) > 0\) as shown in the figure below. It means the function is increasing when \(x_k\) moves from left to right, so we can expect \(x^*\) is located on the left side of \(x_k\). If \(x_k\) moves to the left, then \(x_{k+1}\) is closer to the optimal value \(x^*\). Therefore, we move from \(x_k\) to \(x_{k+1} = x_k - \eta g_k\) with \(-\eta g_k < 0\) direction. If we repeat the process until \(x_k\) moves to \(x^*\), then \(f'(x_k)\) will converge to \(0\). In this way, we get the approximate solution \(x_k\).

import numpy as np
import matplotlib.pyplot as plt

# Create x values for the curve
x = np.linspace(-3, 3, 100)
y = x ** 2 + 1

# Create figure
plt.figure(figsize=(7, 5))

# Plot the function: f(x) = x² + 1 (blue thick line)
plt.plot(x, y, 'blue', linewidth=3)

# Dashed vertical line at x* = 0
plt.plot([0, 0], [0, 1], 'k--', linewidth=1.5)

# Red arrow: (1.5, 3.25) → (0.75, 1.5625)
plt.arrow(1.5, 3.25, -0.75, -1.6875,
          color='red', linewidth=2,
          head_width=0.12, head_length=0.15,
          length_includes_head=True)

# All labels (exact positions)
plt.text(1.7, 3.5, r"$f'(x_k) > 0$", color='red', fontsize=14)
plt.text(1.5, -0.4, r"$x_k$", ha='center', fontsize=12)
plt.text(0.75, -0.4, r"$x_{k+1}$", ha='center', fontsize=12)
plt.text(0, -0.4, r"$x^*$", ha='center', fontsize=12)

# Axes style (matches TikZ exactly)
plt.axhline(0, color='black', linewidth=1)
plt.axvline(0, color='black', linewidth=1)
plt.xlabel(r'$x$', fontsize=12)
plt.ylabel(r'$f(x)$', fontsize=12)

# Axis limits
plt.xlim(-3, 3)

## (-3.0, 3.0)

plt.ylim(-1, 5)

## (-1.0, 5.0)

# Remove ticks (ticks=none)
plt.xticks([])

## ([], [])

plt.yticks([])

## ([], [])

# Show the plot
plt.tight_layout()
plt.show()

This shows that the GDM will find \(x_1, x_2, x_3, \dots\) that satisfy \[f(x_1) > f(x_2) > f(x_3) > \dots.\] Here, \(\eta\) determines the magnitude of movements. We call this \(``\)the learning rate\("\). If the learning rate is too large, \(x_{k+1}\) can pass over \(x^*\), or even the function value may increase. So we should be careful to choose a reasonable learning rate \(\eta\).

import numpy as np
import matplotlib.pyplot as plt

# Create figure with two subplots (side by side)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# --------------------------
# Shared function data
# --------------------------
x = np.linspace(-4, 4, 100)
y = x**2 + 1

# ==========================
# LEFT PLOT: Do not converge (too large step)
# ==========================
ax1.plot(x, y, 'blue', linewidth=3)

# Red horizontal arrow: (-2,5) → (2,5)
ax1.arrow(-2, 5, 4, 0, 
          color='red', linewidth=2, 
          head_width=0.2, head_length=0.15, length_includes_head=True)

# Labels
ax1.text(-2, -0.8, r'$x_k$', ha='center', fontsize=12)
ax1.text(2, -0.8, r'$x_{k+1}$', ha='center', fontsize=12)
ax1.text(0, -1.5, 'Do not converge', ha='center', fontsize=12)

# Axes styling (matches TikZ)
ax1.axhline(0, color='black', linewidth=1)
ax1.axvline(0, color='black', linewidth=1)
ax1.set_xlabel(r'$x$')
ax1.set_ylabel(r'$f(x)$')
ax1.set_xlim(-4, 4)

## (-4.0, 4.0)

ax1.set_ylim(-1, 10)

## (-1.0, 10.0)

ax1.set_xticks([])

## []

ax1.set_yticks([])

## []

# ==========================
# RIGHT PLOT: Too slow (small step)
# ==========================
ax2.plot(x, y, 'blue', linewidth=3)

# Three small red arrows
ax2.arrow(-2, 5, 0.25, -0.9375, 
          color='red', linewidth=2, head_width=0.15, head_length=0.1, length_includes_head=True)
ax2.arrow(-1.75, 4.0625, 0.1875, -0.6211, 
          color='red', linewidth=2, head_width=0.15, head_length=0.1, length_includes_head=True)
ax2.arrow(-1.5625, 3.4414, 0.15625, -0.4687, 
          color='red', linewidth=2, head_width=0.15, head_length=0.1, length_includes_head=True)

# Labels
ax2.text(-2, -0.8, r'$x_k$', ha='center', fontsize=12)
ax2.text(-1.75, -0.8, r'$x_{k+1}$', ha='center', fontsize=12)
ax2.text(-1.5625, -0.8, r'$x_{k+2}$', ha='center', fontsize=12)
ax2.text(-1.40625, -0.8, r'$x_{k+3}$', ha='center', fontsize=12)
ax2.text(-1, -1.5, 'Too slow', ha='center', fontsize=12)

# Axes styling
ax2.axhline(0, color='black', linewidth=1)
ax2.axvline(0, color='black', linewidth=1)
ax2.set_xlabel(r'$x$')
ax2.set_ylabel(r'$f(x)$')
ax2.set_xlim(-4, 4)

## (-4.0, 4.0)

ax2.set_ylim(-1, 10)

## (-1.0, 10.0)

ax2.set_xticks([])

## []

ax2.set_yticks([])

## []

# Final layout
plt.tight_layout()
plt.show()

import numpy as np
import matplotlib.pyplot as plt

# ==============================================
# 1. Define the function and its derivative (from LaTeX)
# f(x) = x² + 1    f'(x) = 2x
# ==============================================
def f(x):
    return x**2 + 1

def df(x):
    return 2 * x

# ==============================================
# 2. Gradient Descent Algorithm (EXACT as LaTeX)
# ==============================================
def gradient_descent(x_start, eta=0.1, tol=1e-6, max_iter=1000):
    x = x_start
    path = [x]  # Store iteration path for plotting
    
    for k in range(max_iter):
        grad = df(x)
        
        # Convergence condition: |f'(x_k)| ≤ ε
        if abs(grad) <= tol:
            print(f"lgorithm Succeed! (Converged at iteration {k+1})")
            break
        
        # Update rule: x_{k+1} = x_k - η * f'(x_k)
        x = x - eta * grad
        path.append(x)
    else:
        print("Did not converge within max iterations")
    
    return x, np.array(path)

# ==============================================
# 3. Run GDM with parameters from LaTeX
# ==============================================
x_opt, path = gradient_descent(x_start=1.5, eta=0.1, tol=1e-6)

## lgorithm Succeed! (Converged at iteration 68)

f_opt = f(x_opt)

# Print final results
print(f"\nOptimal x* = {x_opt:.6f}")

## 
## Optimal x* = 0.000000

print(f"Minimum f(x*) = {f_opt:.6f}")

## Minimum f(x*) = 1.000000

print(f"Iterations: {len(path)}")

## Iterations: 68

# ==============================================
# 4. Plot 1: Function + Convergence Path (matches LaTeX figure)
# ==============================================
x_plot = np.linspace(-3, 3, 500)
y_plot = f(x_plot)

plt.figure(figsize=(10, 5))
plt.plot(x_plot, y_plot, 'b-', linewidth=2, label=r'$f(x) = x^2 + 1$')
plt.scatter(path, f(path), color='red', s=30, zorder=5)
plt.plot(path, f(path), 'r--', linewidth=1.5, label='GD Path')
plt.axvline(x=0, color='k', linestyle='dashed', linewidth=1, label=r'$x^* = 0$')

plt.xlabel(r'$x$')
plt.ylabel(r'$f(x)$')
plt.title('Gradient Descent on $f(x) = x^2 + 1$')
plt.grid(alpha=0.3)
plt.legend()
plt.xlim(-3, 3)

## (-3.0, 3.0)

plt.ylim(-1, 5)

## (-1.0, 5.0)

plt.show()

# ==============================================
# 5. Demo: Too Large Learning Rate (Diverges)
# ==============================================
print("\n--- Testing TOO LARGE learning rate (η=1.0) ---")

## 
## --- Testing TOO LARGE learning rate (η=1.0) ---

x_bad, path_bad = gradient_descent(x_start=-2, eta=1.0, max_iter=5)

## Did not converge within max iterations

# ==============================================
# 6. Demo: Too Small Learning Rate (Too Slow)
# ==============================================
print("\n--- Testing TOO SMALL learning rate (η=0.01) ---")

## 
## --- Testing TOO SMALL learning rate (η=0.01) ---

x_slow, path_slow = gradient_descent(x_start=-2, eta=0.01, max_iter=50)

## Did not converge within max iterations

2 Example 1: Gradient Descent for Single-Variable Function

Find the minimum of \(f(x) = 2x^2 - 3x + 2\). Take \(x_1 = 0\), \(\eta = 0.1\), and \(\epsilon = 10^{-6}\) in the following code.

Solution

From \(f'(x) = 4x - 3 = 0\), the critical point becomes \(x = \frac{3}{4} = 0.75\). Since \(f(x)\) is convex (opens upward), the minimum value \[ f\left(\frac{3}{4}\right) = \frac{7}{8} = 0.875 \] is obtained at \(x = \frac{3}{4}\).

The output of the algorithm is:

Algorithm Succeed!
\(x^* = 0.749999834194560\)
\(|g^*| = 6.63221759289456e-7\)
\(f(x^*) = 0.875000000000055\)
Iteration number = \(31\)

In the above example, the points created by the gradient descent method \((x_1, f(x_1)),\ (x_2, f(x_2)),\ (x_3, f(x_3)),\ \dots\) are shown on the coordinate plane along with the graph of the function \(y = f(x)\). From the figure, it is easy to see that the sequence converged to the point \((x^*, f(x^*))\) on the curve with the minimum value.

It can be seen that the sequence converges to the optimal solution \(x^*\).

\[ \begin{array}{cc} \textbf{Parameter} & \textbf{Value} \\ \hline \textrm{Function} & f(x) = 2x^2 - 3x + 2 \\ \textrm{Derivative} & f'(x) = 4x - 3 \\ \textrm{Initial iterate} ~x_1 & 0.0 \\ \textrm{Tolerance}~ \epsilon & 10^{-6} \\ \textrm{Learning rate} ~\eta & 0.1 \\ \textrm{Converged solution} ~ x^* & 0.75 \\ \textrm{Minimum value}~ f(x^*) & 0.875 \\ \textrm{Iterations to converge} & 31 \\ \hline \end{array} \]

Summary of Example 1 Gradient Descent Results

import numpy as np
import matplotlib.pyplot as plt

# ==============================================
# 1. Define function and derivative (from LaTeX)
# f(x) = 2x² - 3x + 2
# f'(x) = 4x - 3
# ==============================================
def f(x):
    return 2 * x**2 - 3 * x + 2

def df(x):
    return 4 * x - 3

# ==============================================
# 2. Gradient Descent (EXACT parameters from LaTeX)
# ==============================================
x0 = 0.0        # initial iterate
tol = 1e-6      # tolerance
eta = 0.1       # learning rate
max_iter = 300
path = []       # store iteration path

for k in range(max_iter):
    path.append(x0)
    g0 = df(x0)
    
    # Stopping condition
    if abs(g0) <= tol:
        print("Algorithm Succeed!")
        break
    
    # Update step
    x0 = x0 - eta * g0

## Algorithm Succeed!

# ==============================================
# 3. Print results (matches LaTeX output)
# ==============================================
print(f"x* = {x0}")

## x* = 0.7499998341945602

print(f"|g*| = {abs(df(x0))}")

## |g*| = 6.632217592894563e-07

print(f"f(x*) = {f(x0)}")

## f(x*) = 0.8750000000000553

print(f"iteration number = {k + 1}")

## iteration number = 31

# ==============================================
# 4. Plot (EXACTLY matches your LaTeX figure)
# ==============================================
x_plot = np.linspace(-0.2, 1.2, 200)
y_plot = f(x_plot)

plt.figure(figsize=(8, 6))
# Plot the blue curve
plt.plot(x_plot, y_plot, 'b-', linewidth=2, label=r'$f(x)=2x^2-3x+2$')

# Plot red iteration points
x_path = np.array(path)
y_path = f(x_path)
plt.scatter(x_path, y_path, color='red', s=25, zorder=5)

# Dashed line at optimal point x* = 0.75
plt.plot([0.75, 0.75], [0, 0.875], 'k--', linewidth=1)
plt.text(0.75, -0.05, r'$x^*$', ha='center', fontsize=12)

# Axes settings (same as LaTeX)
plt.axhline(0, color='black', linewidth=0.8)
plt.axvline(0, color='black', linewidth=0.8)
plt.xlabel(r'$x$')
plt.ylabel(r'$f(x)$')
plt.xlim(-0.2, 1.2)

## (-0.2, 1.2)

plt.ylim(0.8, 2.2)

## (0.8, 2.2)

plt.grid(alpha=0.3)
plt.title("Gradient Descent Path on $f(x) = 2x^2 - 3x + 2$")
plt.tight_layout()
plt.show()

3 Example 2: Gradient Descent for Single-Variable Function

Find the minimum value of \(f(x) = 9x^2 - 7x + 6\). Use \(x_1 = 0\), \(\eta = 0.1\), and \(\epsilon = 10^{-6}\) with the above GDM code.

Solution

First, compute the derivative: \[ f'(x) = 18x - 7. \] Setting \(f'(x) = 0\) gives the critical point: \[ 18x - 7 = 0 \implies x^* = \frac{7}{18} \approx 0.3889. \] Since \(f''(x) = 18 > 0\), the function is convex, so \(x^* = \frac{7}{18}\) is the minimum point. The minimum value is: \[ f\left(\frac{7}{18}\right) = 9\left(\frac{7}{18}\right)^2 - 7\left(\frac{7}{18}\right) + 6 = \frac{167}{36} \approx 4.6389. \]

Summary of Example 2 Gradient Descent Results

\[ \begin{array}{cc} \textbf{Parameter} & \textbf{Value} \\ \hline \textrm{Function} & f(x) = 9x^2 - 7x + 6 \\ \textrm{Derivative} & f'(x) = 18x - 7 \\ \textrm{Initial iterate}~ x_1 & 0.0 \\ \textrm{Tolerance}~ \epsilon & 10^{-6} \\ \textrm{Learning rate}~ \eta & 0.1 \\ \textrm{True minimum}~ x^* & 7/18 \approx 0.3889 \\ \textrm{Minimum value}~ f(x^*) & 167/36 \approx 4.6389 \\ \hline \end{array} \]

import numpy as np
import matplotlib.pyplot as plt

# --------------------------
# Define the function and its derivative
# --------------------------
def f(x):
    return 9 * x**2 - 7 * x + 6

def df(x):
    return 18 * x - 7

# --------------------------
# Gradient Descent Parameters
# --------------------------
x = 0.0          # initial value x1 = 0
eta = 0.1        # learning rate η = 0.1
epsilon = 1e-6   # tolerance ε = 10⁻⁶
max_iter = 500
path = []        # to store the path of x values

# --------------------------
# Run Gradient Descent
# --------------------------
for k in range(max_iter):
    path.append(x)
    gradient = df(x)
    
    if abs(gradient) <= epsilon:
        print("Algorithm Succeed!")
        break
    
    x = x - eta * gradient

## Algorithm Succeed!

# --------------------------
# Print results
# --------------------------
print(f"Optimal x* = {x:.8f}")

## Optimal x* = 0.38888894

print(f"Gradient at x* = {abs(df(x)):.2e}")

## Gradient at x* = 9.21e-07

print(f"Minimum f(x*) = {f(x):.8f}")

## Minimum f(x*) = 4.63888889

print(f"Iterations taken: {k + 1}")

## Iterations taken: 72

# --------------------------
# Plot the function and path
# --------------------------
x_plot = np.linspace(-0.2, 1.0, 200)
y_plot = f(x_plot)

plt.figure(figsize=(8, 5))
plt.plot(x_plot, y_plot, 'b-', linewidth=2, label=r'$f(x)=9x^2-7x+6$')
plt.scatter(path, [f(p) for p in path], color='red', s=25, zorder=5)
plt.axvline(x=7/18, color='k', linestyle='--', label=r'True minimum $x^*=7/18$')

plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Gradient Descent on $f(x)=9x^2-7x+6$')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

4 Application of the Gradient Descent Method

The GDM introduced in the first session is for minimizing a function \(f(x)\) of one variable. Let’s generalize the algorithm for minimizing a function of several variables. The least-squares problem we learned is a multivariate function \(f(\mathbf{x})\) with at least two independent variables. The least-squares problem was solved using the error function \(E(\mathbf{u})\) transformed using multiple variables. The least-squares problem can also be solved by the gradient descent method (GDM).

GDM Algorithm for minimizing a multi-variable function

Step 1 Set an initial iterate \(\mathbf{u}_1 \in \mathbb{R}^n\), tolerance \(0\leq\epsilon<1\), initial learning rate \(\eta\) (eta), and iteration number \(k:=1\).
Step 2 Compute \(\mathbf{g}_k = \nabla E(\mathbf{u}_k)\). If \(\|\mathbf{g}_k\| \leq \epsilon\), then stop.
Step 3 Set \(\mathbf{u}_{k+1} = \mathbf{u}_k - \eta \mathbf{g}_k\), \(k:=k+1\) and go to Step 2.

All the steps of the GDM algorithm for minimizing a multi-variable function are the same as for a one-variable function.

Correspondence between One-Variable and Multi-Variable Functions

\[ \begin{array} \hline \textbf{One variable function} & \textbf{Multi-variable function} \\ \hline \textrm{Scalar}~ x & \textrm{Vector}~ \mathbf{u} \\ \textrm{Absolute value}~ |a| & \textrm{Norm of vector}~ \|\mathbf{g}\| \\ \textrm{Derivative}~ f'(x) & \textrm{Gradient}~ \nabla E(\mathbf{u}) \\ \hline \end{array} \]

Let \(x,y\) be independent variables and \(z\) be a third variable. Now we can consider a function \(z=f(x,y)\), which is dependent on \(x\) and \(y\). Functions of several variables can be defined in the same manner.

In a 3-dimensional (coordinate) space, \(z=f(x,y)\) can be considered as a point \((x,y,z)\). As \(x\) and \(y\) move, the graph of \(z=f(x,y)\) becomes a surface. For example, the graph of a two-variable function \(z=f(x,y)=-xye^{-x^2-y^2}\) can be drawn using the codes as follows.

# Import required libraries (standard Python)
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

 

# --------------------------
# Example 1: 3D Surface Plot of f(x,y) = -xye^(-x²-y²)
# --------------------------
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Create grid
x_surf = np.linspace(-2, 2, 100)
y_surf = np.linspace(-2, 2, 100)
X_surf, Y_surf = np.meshgrid(x_surf, y_surf)

# Define the surface function
f = -X_surf * Y_surf * np.exp(-X_surf**2 - Y_surf**2)

# Plot 3D surface
surf = ax.plot_surface(X_surf, Y_surf, f, cmap='viridis', 
                       alpha=0.6, edgecolor='none')
ax.set_title('3D Surface of $f(x,y) = -xye^{-x^2 - y^2}$', fontsize=14)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
ax.view_init(30, 30)  # Same viewing angle as your LaTeX figure

# Show all plots
plt.tight_layout()
plt.show()

The graph of \(f\) in the figure above intuitively shows that \(f\) has a local maximum value at the peak and a local minimum value at the valley. To find those critical points, the notion of a gradient of a multi-variable function is required.

The gradient of a function of two variables \(f(x,y)\) is defined as follows: \[ \operatorname{grad} f(x,y) = \nabla f(x,y) = \begin{bmatrix} \frac{\partial f}{\partial x} \\[6pt] \frac{\partial f}{\partial y} \end{bmatrix}. \]

The gradient of \(f(x,y)\) is a \(2\times1\) vector, where \(\frac{\partial f}{\partial x}\) means the partial derivative of \(f\) with respect to \(x\). The partial derivative with respect to \(x\) considers other variables except \(x\) is constant. Similarly, \(\frac{\partial f}{\partial y}\) means the partial derivative of \(f\) with respect to \(y\).

The gradient of \(z = f(x,y) = -xye^{-x^2-y^2}\) can be found as follows.

The output is: \[ (2x^2 y e^{(-x^2 - y^2)} - y e^{(-x^2 - y^2)},\ 2x y^2 e^{(-x^2 - y^2)} - x e^{(-x^2 - y^2)}) \]

Answer:

\[ \operatorname{grad} f(x,y) = \nabla f(x,y) = \left( 2x^2 y e^{-(x^2+y^2)} - y e^{-(x^2+y^2)},\ 2x y^2 e^{-(x^2+y^2)} - x e^{-(x^2+y^2)} \right) \]

import sympy as sp
import numpy as np
import matplotlib.pyplot as plt

# 1. Symbolic gradient computation (SymPy)
x, y = sp.symbols('x y')
f = -x * y * sp.exp(-x**2 - y**2)

# Compute partial derivatives
df_dx = sp.diff(f, x)
df_dy = sp.diff(f, y)

# Print the symbolic gradient (matches your image's formula)
print("Symbolic Gradient (SymPy):")

## Symbolic Gradient (SymPy):

print(f"∂f/∂x = {df_dx}")

## ∂f/∂x = 2*x**2*y*exp(-x**2 - y**2) - y*exp(-x**2 - y**2)

print(f"∂f/∂y = {df_dy}")

## ∂f/∂y = 2*x*y**2*exp(-x**2 - y**2) - x*exp(-x**2 - y**2)

# 2. Convert to numerical functions (NumPy)
df_dx_num = sp.lambdify((x, y), df_dx, 'numpy')
df_dy_num = sp.lambdify((x, y), df_dy, 'numpy')
f_num = sp.lambdify((x, y), f, 'numpy')

# 3. Visualize the gradient vector field
x_vals = np.linspace(-2, 2, 20)
y_vals = np.linspace(-2, 2, 20)
X, Y = np.meshgrid(x_vals, y_vals)

# Evaluate gradient at grid points
U = df_dx_num(X, Y)
V = df_dy_num(X, Y)

# Plot vector field + surface contours
plt.figure(figsize=(8, 6))
contour = plt.contour(X, Y, f_num(X, Y), cmap='viridis')
plt.clabel(contour, inline=True, fontsize=8)
plt.quiver(X, Y, U, V, color='red', scale=50, width=0.003)
plt.title(r'Gradient Vector Field of $f(x,y) = -x y e^{-x^2 - y^2}$')
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha=0.3)
plt.axis('equal')

## (np.float64(-2.0), np.float64(2.0), np.float64(-2.0), np.float64(2.0))

plt.show()

5 Gradient of Multi-Variable Functions

Similarly, the gradient of a multi-variable function with three or more variables \(f\) can be obtained. In general, the gradient of the \(n\)-variable function \(f(x_1, x_2, \dots, x_n)\) is defined as: \[ \operatorname{grad} f(x_1, x_2, \dots, x_n) = \nabla f(x_1, x_2, \dots, x_n) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\[6pt] \frac{\partial f}{\partial x_2} \\[6pt] \vdots \\[6pt] \frac{\partial f}{\partial x_n} \end{bmatrix}. \]

6 Example 3: Least-Squares Problem with Gradient Descent

Given:

\[ \begin{array}{ccccccc} \hline x_i & u_i \\ \hline 0 & 1 \\ 1 & 3 \\ 2 & 4 \\ 3 & 4 \\ \hline \end{array} \]

The first example of the least-squares problem was as follows. For each data \((x_i, u_i)\), let \(\hat{u}_i\) be the value obtained by substituting \(x_i\) in a linear function \(u = a + bx\). Thus: \[ \hat{u}_i = a + bx_i. \] If this equation’s solution does not exist, it is possible to find \(a, b\) where the square of error \((u_i - \hat{u}_i)^2\) is minimized. The error function \(E(\mathbf{u})\), which is the sum of squared errors \((u_i - \hat{u}_i)^2\), is defined as follows (the factor \(\frac{1}{2}\) is included for computational convenience and does not affect the conclusion):

\[\begin{align*} E(\mathbf{u}) & = E(a,b) = \frac{1}{2} \left\{ (a-1)^2 + (a+b-3)^2 + (a+2b-4)^2 + (a+3b-4)^2 \right\} \\ & = \frac{1}{2} \| A\mathbf{u} - \mathbf{y} \|^2. \end{align*}\]

Let’s use the GDM to find the approximate solution \(\hat{\mathbf{u}}\) that minimizes \(E(\mathbf{u})\).

Set an initial iterate \(\mathbf{u}_1 = (-2.5, -2.5)\), tolerance \(\epsilon = 10^{-6}\), and initial learning rate \(\eta = 0.1\).

GDM Parameters for Least-Squares Problem

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# ==============================
# 1. Define Error Function E(a,b)
# ==============================
def error_func(a, b):
    return 0.5 * (
        (a - 1)**2 +
        (a + b - 3)**2 +
        (a + 2*b - 4)**2 +
        (a + 3*b - 4)**2
    )

# ==============================
# 2. Define Gradient of E(a,b)
# ==============================
def gradient(a, b):
    dE_da = (a - 1) + (a + b - 3) + (a + 2*b - 4) + (a + 3*b - 4)
    dE_db = (a + b - 3) + 2*(a + 2*b - 4) + 3*(a + 3*b - 4)
    return np.array([dE_da, dE_db])

# ==============================
# 3. Gradient Descent Parameters (MATCH YOUR LATEX)
# ==============================
u = np.array([-2.5, -2.5])   # initial point
tol = 1e-6                   # tolerance
eta = 0.1                    # learning rate
max_iter = 300
path = []                    # store iteration path

# ==============================
# 4. Run Gradient Descent
# ==============================
for k in range(max_iter):
    grad = gradient(u[0], u[1])
    grad_norm = np.linalg.norm(grad)
    path.append((u[0], u[1], error_func(u[0], u[1])))
    
    if grad_norm <= tol:
        print("Algorithm Succeed!")
        break
    
    u = u - eta * grad

## Algorithm Succeed!

# ==============================
# 5. Print Final Results
# ==============================
print(f"u* = {u}")

## u* = [1.49999929 1.00000033]

print(f"E(x*) = {error_func(u[0], u[1]):.15f}")

## E(x*) = 0.500000000000364

print(f"iteration number = {k + 1}")

## iteration number = 118

# ==============================
# 6. Plot 1: 3D Error Surface + Descent Path
# ==============================
a_vals = np.linspace(-3, 5, 50)
b_vals = np.linspace(-3, 5, 50)
A, B = np.meshgrid(a_vals, b_vals)
Z = error_func(A, B)

path_np = np.array(path)

fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(A, B, Z, cmap='viridis', alpha=0.6, edgecolor='none')
ax.plot(path_np[:,0], path_np[:,1], path_np[:,2], 'r-', linewidth=2)
ax.scatter(path_np[:,0], path_np[:,1], path_np[:,2], c='black', s=15)

ax.set_xlabel('a')
ax.set_ylabel('b')
ax.set_zlabel('E(a,b)')
ax.view_init(30, 30)
ax.set_box_aspect([5, 5, 1])
plt.title("3D Error Function & Gradient Descent Path")
plt.show()

# ==============================
# 7. Plot 2: Least-Squares Line Fit
# ==============================
x_data = np.array([0, 1, 2, 3])
y_data = np.array([1, 3, 4, 4])
x_fit = np.linspace(-0.5, 3.5, 100)
y_fit = u[0] + u[1] * x_fit

plt.figure(figsize=(8, 6))
plt.scatter(x_data, y_data, c='red', s=60, label='Data points')
plt.plot(x_fit, y_fit, 'b-', linewidth=2, label=f'Fit: y = {u[0]:.2f} + {u[1]:.2f}x')
plt.xlabel('x')
plt.ylabel('y')
plt.grid(alpha=0.3)
plt.legend()
plt.title('Least-Squares Linear Fit')
plt.xlim(-1, 4)

## (-1.0, 4.0)

plt.ylim(-1, 5)

## (-1.0, 5.0)

plt.show()

As we found earlier, we found the least-square line by algorithm from the given data. The least-squares solution is obtained with the algorithm. The least-squares line \(y = \frac{3}{2} + x\) was obtained with the answer as a coefficient.

Summary of GDM least-squares linear fit

\[ \begin{array}{ccc} \textbf{Parameter} & \textbf{Value} \\ \hline \textrm{Initial iterate}~\mathbf{u}_1 & (-2.5, -2.5) \\ \textrm{Tolerance}~ \epsilon & 10^{-6} \\ \textrm{Learning rate}~ \eta & 0.1 \\ \textrm{Converged solution}~ \mathbf{u}^* & (1.5, 1.0) \\ \textrm{Minimum error}~ E(\mathbf{x}^*) & 0.5 \\ \textrm{Iterations to converge}~ & 118 \\ \textrm{Least-squares line}~ & y = x + \frac{3}{2} \\ \hline \end{array} \]

Least-Squares Line Solution

The least-square line is \[ y = \frac{3}{2} + x. \]

Therefore, the line we get from the least-squares solution is \[ y = \frac{3}{2} + x. \]

7 Example 4: Least-Squares Problem with Gradient Descent

We can also find a quadratic function \(y = a + bx + cx^2\) that best fits the given data. In this case, the error function \(E(\mathbf{u})\) is as follows: \[ \begin{aligned} E(a,b,c) &= \frac{1}{2}\left\{(a-1)^2 + (a+b+c-3)^2 + (a+2b+4c-4)^2 + (a+3b+9c-4)^2\right\} \\ &= \frac{1}{2} \| A\mathbf{u} - \mathbf{y} \|^2. \end{aligned} \]

Summary of least-squares problems

\[ \begin{array}{ccc} \textbf{Problem Type} & \textbf{Form/Result} \\ \hline \textrm{Linear least-squares model} & y = a + bx \\ \textrm{Linear least-squares solution} & y = x + \frac{3}{2} \\ \textrm{Quadratic least-squares model} & y = a + bx + cx^2 \\ \textrm{Error function for quadratic fit} & E(a,b,c) = \frac{1}{2}\|A\mathbf{u}-\mathbf{y}\|^2 \\ \hline \end{array} \]

import numpy as np
import matplotlib.pyplot as plt

# --------------------------
# 1. Define the error function E(a,b,c)
# --------------------------
def E(a, b, c):
    return 0.5 * (
        (a - 1)**2 +
        (a + b + c - 3)**2 +
        (a + 2*b + 4*c - 4)**2 +
        (a + 3*b + 9*c - 4)**2
    )

# --------------------------
# 2. Compute the gradient of E(a,b,c) (analytical form)
# --------------------------
def grad_E(a, b, c):
    # Partial derivative with respect to a
    dE_da = (a - 1) + (a + b + c - 3) + (a + 2*b + 4*c - 4) + (a + 3*b + 9*c - 4)
    
    # Partial derivative with respect to b
    dE_db = (a + b + c - 3) + 2*(a + 2*b + 4*c - 4) + 3*(a + 3*b + 9*c - 4)
    
    # Partial derivative with respect to c
    dE_dc = (a + b + c - 3) + 4*(a + 2*b + 4*c - 4) + 9*(a + 3*b + 9*c - 4)
    
    return np.array([dE_da, dE_db, dE_dc])

# --------------------------
# 3. Gradient Descent Algorithm
# --------------------------
# Initial parameters (matches your code)
u = np.array([1.0, 1.0, 1.0])  # initial value (a,b,c)
tol = 1e-6                     # tolerance level
eta = 0.01                     # learning rate
max_iter = 5000                # max iterations

for k in range(max_iter):
    g = grad_E(u[0], u[1], u[2])
    gn = np.linalg.norm(g)
    
    if gn <= tol:
        print("Algorithm Succeed!")
        break
    
    u = u - eta * g

## Algorithm Succeed!

# --------------------------
# 4. Print results (matches your output)
# --------------------------
print(f"u* = {u}")

## u* = [ 1.00000138  2.49999724 -0.49999918]

print(f"E(x*) = {E(u[0], u[1], u[2])}")

## E(x*) = 1.5966473717177356e-12

print(f"iteration number = {k + 1}")

## iteration number = 4207

# --------------------------
# 5. Plot the data points and least-squares quadratic curve
# --------------------------
# Original data points
x_data = np.array([0, 1, 2, 3])
y_data = np.array([1, 3, 4, 4])

# Generate curve points
x_curve = np.linspace(-0.5, 3.5, 100)
y_curve = u[0] + u[1] * x_curve + u[2] * (x_curve ** 2)

# Plot
plt.figure(figsize=(8, 6))
plt.scatter(x_data, y_data, color='red', zorder=5, label='Data Points')
plt.plot(x_curve, y_curve, color='blue', label=f'Quadratic Fit: $y = {u[0]:.2f} + {u[1]:.2f}x + {u[2]:.2f}x^2$')

plt.xlabel('x')
plt.ylabel('y')
plt.title('Least-Squares Quadratic Curve via Gradient Descent')
plt.legend()
plt.grid(alpha=0.3)
plt.ylim(-1, 5)

## (-1.0, 5.0)

plt.xlim(-0.5, 4)

## (-0.5, 4.0)

plt.show()