Math Behind Optimization Techniques

Vtantravahi
9 min read · Dec 25, 2023
Optimization Plane Illusion

This article gives you a broad look at the mathematics operating behind the algorithms and optimizer API calls we make regularly from libraries like Keras.

What is optimization and what does it do?

According to Wikipedia, mathematical optimization (or mathematical programming) is the selection of a best element, with regard to some criterion, from some set of available alternatives.

To put it in simple terms, optimization algorithms find the values of the weights (parameters) that reduce the error when mapping inputs to outputs/targets. These algorithms have a tremendous impact on the accuracy and training speed of deep learning models.

How does it pick the optimum?

Local Vs Global Minima

When we plot our input samples and their function on a 2D plane, it resembles the diagram above, with several minima to which the algorithm could converge; yet our algorithm needs to find the best, or at least a satisfactory, optimum. So what does it use behind the scenes to achieve this?

That’s where optimizers come into the picture, and here we will discuss two families of optimization techniques, namely:

  • Zeroth Order Optimizing Techniques
    - Golden Section Method
    - Bisection Method
  • First Order Optimizing Techniques
    - Newton’s Single Variable Method
    - Newton’s Multi Variable Method

Zeroth Order Optimization Techniques

  1. Golden Section Method
    1. It's a technique for finding an extremum of a function within a specific interval. For a strictly unimodal function with an extremum inside the interval, it will find that extremum, while for an interval containing multiple extrema (possibly including the interval boundaries), it will converge to one of them. If the only extremum on the interval lies on a boundary of the interval, it will converge to that boundary point. The method operates by successively narrowing the range of values within the specified interval, which makes it relatively slow but very robust (a short Python sketch follows the terminology below).
    2. It was developed by the American statistician Jack Carl Kiefer in 1956, who also developed the Fibonacci search method.
    3. The technique derives its name from the fact that the algorithm maintains the function values for four points whose three interval widths are in the ratio φ:1:φ, where φ is the golden ratio.

Terminology

Extremum: According to Wikipedia, in mathematical analysis, the maxima and minima (the respective plurals of maximum and minimum) of a function, known collectively as extrema (the plural of extremum), are the largest and smallest values of the function, either within a given range (the local or relative extrema) or on the entire domain (the global or absolute extrema).

Unimodal Function: A function f(x) is said to be unimodal on an interval if it has exactly one maximum or minimum on that interval.

Unimodal Function Plots

Golden Ratio: In mathematics, two quantities a and b (with a > b > 0) are said to be in the golden ratio if the ratio of their sum to the larger quantity equals the ratio of the larger to the smaller, that is, (a + b)/a = a/b = φ, where φ = (1 + √5)/2 ≈ 1.618.

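To make the interval-narrowing concrete, here is a minimal Python sketch for a strictly unimodal function; the helper name golden_section_min and its tolerance default are illustrative choices, not any particular library's API:

import math

def golden_section_min(f, a, b, tol=1e-6):
    """Shrink [a, b] around the minimizer of a unimodal function f."""
    inv_phi = (math.sqrt(5) - 1) / 2        # 1/φ ≈ 0.618
    c = b - inv_phi * (b - a)               # two interior probe points keep the
    d = a + inv_phi * (b - a)               # φ:1:φ spacing as the bracket shrinks
    while (b - a) > tol:
        if f(c) < f(d):                     # the minimum lies in [a, d]
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:                               # the minimum lies in [c, b]
            a, c = c, d
            d = a + inv_phi * (b - a)
    return (a + b) / 2

# Example: the minimum of (x - 2)^2 on [0, 5] is found near x = 2
x_min = golden_section_min(lambda x: (x - 2)**2, 0, 5)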

2. Bisection Method

The Bisection Method is a root-finding technique that repeatedly bisects an interval and then selects a subinterval in which a root must lie for further processing. It is a very straightforward and reliable method for finding a zero of a function as long as the function changes sign over the interval [a, b]. The method is based on the Intermediate Value Theorem, which guarantees that a continuous function which changes sign over an interval has at least one root in that interval.

It is one of the oldest root-finding algorithms, dating back to antiquity. It has been formalized in various mathematical forms over the centuries and remains a fundamental method for numerical approximation of solutions to equations.

How does the algorithm work?

The method is applicable for numerically solving the equation f(x) = 0 for the real variable x, where f is a continuous function defined on an interval [a, b] and where f(a) and f(b) have opposite signs, which indicates that f must have at least one root in the interval. The steps are as follows:

  • Compute the midpoint c = (a + b) / 2 of the interval.
  • Evaluate the function at the midpoint, f(c).
  • If f(c) is close enough to zero, then c is the root.
  • Otherwise, determine whether the sign of f(c) is the same as the sign of f(a) or f(b).
  • Replace the endpoint of the interval (either a or b) with c, such that the sign of the function at the new a and b remains opposite. This maintains the guarantee that the interval contains a root.
  • Repeat the process until the interval is sufficiently small and the midpoint is an acceptable approximation to the root.

To put these iterations concretely, here is the method as a small Python function:

def bisection(f, a, b, tol, nmax):
    """Return a value that differs from a root of f(x) = 0 by less than tol.

    Preconditions: a < b and either f(a) < 0 < f(b) or f(b) < 0 < f(a).
    """
    if f(a) * f(b) >= 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    n = 1
    while n <= nmax:                        # limit iterations to prevent an infinite loop
        c = (a + b) / 2                     # new midpoint
        if f(c) == 0 or (b - a) / 2 < tol:  # solution found
            return c
        n += 1                              # increment step counter
        if f(c) * f(a) > 0:                 # same sign: replace a, the root lies in [c, b]
            a = c
        else:                               # opposite signs: replace b, the root lies in [a, c]
            b = c
    raise RuntimeError("Method failed: maximum number of iterations exceeded")

Terminology:

  1. Root: In mathematics, a root of a function f is a number x such that f(x) = 0.
  2. Intermediate Value Theorem: A theorem in calculus that states if a function f is continuous on the interval [a, b] and N is any number between f(a) and f(b), then there is at least one number c in the interval (a, b) such that f(c) = N.
  3. Convergence: The process of approaching a limit. In the context of the Bisection Method, convergence refers to the successive reduction in the interval size, approaching the root.
Bisection Method

First Order Optimization Techniques

  1. Newton’s Single Variable Method:

It is an iterative technique for finding increasingly accurate approximations to the roots (or zeroes) of a real-valued function. The method uses the first derivative of the function to approximate the tangent line at a given point, and then uses the zero of the tangent line as the next approximation for the root. This method is particularly useful for smooth functions and can converge very rapidly if the initial guess is close to the actual root.

Let f(x) be a differentiable function, and let x_0 be an initial guess for the root of the function. The Newton-Raphson iteration is given by the formula

x_{n+1} = x_n - f(x_n) / f'(x_n),

where f'(x) is the derivative of f(x).

The geometric interpretation of this formula is that x_{n+1} is the point where the tangent to the curve at x_n intersects the x-axis.

The method assumes that the function f is sufficiently smooth and that its derivative f'(x) does not vanish at the root. Convergence is typically assumed when the change in x is below a certain threshold,

|x_{n+1} - x_n| < \delta,

or when the function value is close enough to zero,

|f(x_n)| < \epsilon,

where \epsilon and \delta are small positive numbers determined by the desired precision.
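As a minimal Python sketch of this iteration (assuming the caller supplies both f and its derivative df; the name newton_1d and the default thresholds are illustrative):

def newton_1d(f, df, x0, eps=1e-8, delta=1e-8, nmax=50):
    """Newton-Raphson iteration for a root of a single-variable function."""
    x = x0
    for _ in range(nmax):
        fx = f(x)
        if abs(fx) < eps:              # function value close enough to zero
            return x
        x_new = x - fx / df(x)         # x_{n+1} = x_n - f(x_n)/f'(x_n); assumes df(x) != 0
        if abs(x_new - x) < delta:     # change in x below the threshold
            return x_new
        x = x_new
    raise RuntimeError("Newton's method did not converge")

# Example: the root of f(x) = x^2 - 2 starting from x0 = 1.5 converges to √2
root = newton_1d(lambda x: x**2 - 2, lambda x: 2 * x, 1.5)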

Terminology:

  1. Root: A root of the function f(x) is a solution to the equation f(x)=0.
  2. Tangent Line: A line that touches the curve at a single point and has the same slope as the curve there. In Newton’s method, the equation of the tangent line at x_n is y = f(x_n) + f'(x_n)(x - x_n); setting y = 0 and solving for x yields the next approximation x_{n+1}.
  3. Convergence: The method is said to converge if the sequence {x_n} approaches a limit, which would be the root of f(x).
  4. Divergence: If the sequence {x_n} does not approach a root, or the values become arbitrarily large, the method is said to diverge.

2. Newton’s Multi-Variable Method:

An extension of the single variable method, it is used to find a local minimum or maximum (collectively known as extrema) of a function of several variables. It is an iterative method that uses the first and second derivatives (the gradient and the Hessian matrix, respectively) to choose both the direction and the size of each step toward a point where the gradient vanishes.

For a multivariable function f(X) with X ∈ R^n, the Newton-Raphson update equation in multiple dimensions is

X_{n+1} = X_n - [H_f(X_n)]^{-1} \nabla f(X_n).

Here, \nabla f(X_n) is the gradient vector of partial derivatives, and H_f(X_n) is the Hessian matrix of second-order partial derivatives at X_n.

Similar to the single-variable case, the multivariable Newton’s method requires that the function f be twice continuously differentiable and that the Hessian at the extremum be non-singular. Convergence is achieved when the norm of the gradient is smaller than a predetermined threshold,

||\nabla f(X_n)|| < \epsilon.

Alternatively, convergence can also be declared if the change in the variable vector is below a certain threshold,

||X_{n+1} - X_n|| < \delta.
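A minimal NumPy sketch of this update (the names newton_nd, grad, and hess are illustrative; it solves the linear system H_f(X_n) · step = \nabla f(X_n) rather than explicitly inverting the Hessian):

import numpy as np

def newton_nd(grad, hess, x0, eps=1e-8, delta=1e-8, nmax=50):
    """Multivariable Newton iteration toward a stationary point."""
    x = np.asarray(x0, dtype=float)
    for _ in range(nmax):
        g = grad(x)
        if np.linalg.norm(g) < eps:               # ||∇f(X_n)|| below the threshold
            return x
        step = np.linalg.solve(hess(x), g)        # solve H_f(X_n) step = ∇f(X_n)
        x_new = x - step                          # X_{n+1} = X_n - [H_f(X_n)]^{-1} ∇f(X_n)
        if np.linalg.norm(x_new - x) < delta:     # change in the variable vector below the threshold
            return x_new
        x = x_new
    raise RuntimeError("Newton's method did not converge")

# Example: minimize f(x, y) = (x - 1)^2 + 2(y + 3)^2; the iteration lands on (1, -3)
grad = lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 3)])
hess = lambda v: np.array([[2.0, 0.0], [0.0, 4.0]])
x_star = newton_nd(grad, hess, [0.0, 0.0])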

Terminology:

  • Gradient (∇f): A vector containing all the first partial derivatives of the function. It points in the direction of the steepest ascent in the function’s graph.
  • Hessian Matrix (Hf): A square matrix of second-order partial derivatives of the function. It describes the local curvature of the function.
  • Inverse Hessian ([Hf]−1): The inverse of the Hessian matrix, used to determine the step direction and size in Newton’s method.
Multi Variable Method

Conclusion

In conclusion, the journey through mathematical optimization reveals it as the silent engine powering the machine learning algorithms that drive much of modern computational intelligence. From the Zeroth Order methods like the Golden Section and Bisection methods, which do not require the calculation of derivatives, to the First Order methods such as Newton’s Single Variable and Multi-Variable techniques, which utilize derivatives for more rapid convergence, we see a landscape rich with strategies for finding the most efficient path to a solution.

As the field of artificial intelligence continues to expand, the role of optimization grows ever more central. It is the bridge between theoretical mathematical constructs and practical, real-world applications that span from natural language processing to autonomous vehicles. Understanding the mathematical underpinnings of optimization gives us insight into how and why machine learning models arrive at certain solutions, and empowers us to innovate and improve upon the current methodologies.

In essence, optimization is not just a mathematical concept or a programming practice; it is a critical cog in the machinery of artificial intelligence that propels us toward the future — one algorithmic step at a time. Whether we are fine-tuning a simple model or training a complex deep neural network, the principles of optimization guide us towards achieving the highest level of accuracy and efficiency possible in our quest to decode the patterns hidden within data.
