
Introduction

Convexity is an important concept in machine learning, especially in optimization. Many optimization algorithms are designed to minimize convex functions because every local minimum of a convex function is a global minimum, and that minimum can be found with efficient algorithms. In contrast, non-convex functions may have many local minima, and finding the global minimum can be difficult.

In machine learning, convexity is important for many tasks, including linear regression, logistic regression, and support vector machines. The loss functions used in these tasks are convex, which allows efficient optimization using algorithms such as gradient descent. (Neural network losses are generally non-convex, but the same gradient-based methods are still applied to them.)

Convex set

A set $C$ is convex if, for any two points $x_1, x_2 \in C$, the line segment connecting them lies entirely in the set too. Mathematically, for all $0 \le \theta \le 1$: $x_\theta = \theta x_1 + (1-\theta)x_2 \in C$.

Examples of convex sets are hyperplanes and halfspaces. A hyperplane in $n$-dimensional space is the set of points satisfying the equation $a_1 x_1 + a_2 x_2 + \dots + a_n x_n = a^T x = b$ with $b, a_i \in \mathbb{R}$ ($i = 1, 2, \dots, n$). For $a^T x_1 = a^T x_2 = b$ and $0 \le \theta \le 1$: $a^T x_\theta = a^T(\theta x_1 + (1-\theta)x_2) = \theta b + (1-\theta)b = b$, so the hyperplane is convex. A halfspace in $n$-dimensional space is the set of points satisfying the inequality $a_1 x_1 + a_2 x_2 + \dots + a_n x_n = a^T x \le b$ with $b, a_i \in \mathbb{R}$ ($i = 1, 2, \dots, n$).
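As a quick numeric sketch of the halfspace case (pure Python; `in_halfspace` is an illustrative helper, not a standard API), a convex combination of two points in a halfspace stays in the halfspace:

```python
# Membership test for the halfspace {x : a^T x <= b}; names are illustrative.
def in_halfspace(a, b, x):
    return sum(ai * xi for ai, xi in zip(a, x)) <= b

a, b = [1.0, 2.0], 3.0
x1, x2 = [1.0, 1.0], [-1.0, 0.0]       # a^T x1 = 3 <= 3, a^T x2 = -1 <= 3
assert in_halfspace(a, b, x1) and in_halfspace(a, b, x2)

# Any convex combination of two points in the halfspace stays in it.
theta = 0.3
x_mid = [theta * u + (1 - theta) * v for u, v in zip(x1, x2)]
assert in_halfspace(a, b, x_mid)
```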

Another example is a norm ball. A Euclidean ball is the set of points satisfying $B(x_c, r) = \{x \mid \|x - x_c\|_2 \le r\}$. To prove that this is a convex set, take any $x_1, x_2 \in B(x_c, r)$ and $0 \le \theta \le 1$:

$\|x_\theta - x_c\|_2 = \|\theta(x_1 - x_c) + (1-\theta)(x_2 - x_c)\|_2 \le \theta\|x_1 - x_c\|_2 + (1-\theta)\|x_2 - x_c\|_2 \le \theta r + (1-\theta)r = r$.

So $x_\theta \in B(x_c, r)$.
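This proof can be spot-checked numerically; a minimal sketch with illustrative helper names:

```python
import math
import random

def in_ball(x, center, r):
    """Check membership in the Euclidean ball B(center, r)."""
    return math.dist(x, center) <= r + 1e-12

def convex_combo(x1, x2, theta):
    """Return theta*x1 + (1 - theta)*x2, elementwise."""
    return [theta * a + (1 - theta) * b for a, b in zip(x1, x2)]

random.seed(0)
center, r = (0.0, 0.0), 1.0
# Sample point pairs; whenever both lie in the ball, every convex
# combination of them must lie in the ball too.
for _ in range(1000):
    x1 = (random.uniform(-1, 1), random.uniform(-1, 1))
    x2 = (random.uniform(-1, 1), random.uniform(-1, 1))
    if in_ball(x1, center, r) and in_ball(x2, center, r):
        theta = random.random()
        assert in_ball(convex_combo(x1, x2, theta), center, r)
```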

Ellipsoids are convex sets, too.

The same holds if we use the $p$-norm ($p \in \mathbb{R}$, $p \ge 1$): $\|x\|_p = (|x_1|^p + \dots + |x_n|^p)^{1/p}$; the corresponding norm balls are convex sets too. The Mahalanobis distance is a norm as well. The Mahalanobis norm of a vector $x \in \mathbb{R}^n$ is $\|x\|_A = \sqrt{x^T A^{-1} x}$, with $A$ a matrix such that $x^T A^{-1} x \ge 0$ for all $x \in \mathbb{R}^n$.

The intersection of halfspaces and hyperplanes is also convex; it is called a polyhedron. Take halfspaces $a_i^T x \le b_i$, $i = 1, 2, \dots, m$, and hyperplanes $c_i^T x = d_i$, $i = 1, 2, \dots, p$. With $A = [a_1, a_2, \dots, a_m]$, $b = [b_1, b_2, \dots, b_m]^T$, $C = [c_1, c_2, \dots, c_p]$, $d = [d_1, d_2, \dots, d_p]^T$, the polyhedron is $\{x \mid A^T x \preceq b,\ C^T x = d\}$, where the inequality holds elementwise.

A point $x$ is a convex combination of points $x_1, x_2, \dots, x_k$ if $x = \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_k x_k$ with $\theta_i \ge 0$ and $\theta_1 + \theta_2 + \dots + \theta_k = 1$. The convex hull of a set is the collection of all points that are convex combinations of points of that set. The convex hull is a convex set, and the convex hull of a convex set is the set itself. In other words, the convex hull of a set is the smallest convex set that contains the set. Two sets are linearly separable if their convex hulls do not intersect. Going further, we state the separating hyperplane theorem: if two nonempty convex sets $C$ and $D$ are disjoint (no intersection), then there exist a vector $a$ and a scalar $b$ such that $a^T x \le b$ for all $x \in C$ and $a^T x \ge b$ for all $x \in D$. The set of all points $x$ satisfying $a^T x = b$ is a hyperplane, and it is a separating hyperplane.
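The definition of a convex combination can be illustrated in a few lines of Python (`random_convex_combination` is a hypothetical helper, not a standard API):

```python
import random

def random_convex_combination(points, rng):
    """Draw random nonnegative weights summing to 1 and mix the points."""
    w = [rng.random() for _ in points]
    s = sum(w)
    w = [wi / s for wi in w]              # now w_i >= 0 and sum(w) == 1
    dim = len(points[0])
    x = [sum(w[i] * points[i][d] for i in range(len(points)))
         for d in range(dim)]
    return x, w

rng = random.Random(42)
square = [(0, 0), (1, 0), (1, 1), (0, 1)]  # convex hull is the unit square
x, w = random_convex_combination(square, rng)
assert abs(sum(w) - 1) < 1e-12 and all(wi >= 0 for wi in w)
assert 0 <= x[0] <= 1 and 0 <= x[1] <= 1   # combination lands inside the hull
```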

Convex function

A function is convex if its domain is a convex set and the line segment connecting any two points on its graph lies on or above the graph. Mathematically, a function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if its domain is convex and $f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta)f(y)$ for all $x, y$ in the domain of $f$ and $0 \le \theta \le 1$.

Image: graph of a convex function

Similarly, a function $f: \mathbb{R}^n \to \mathbb{R}$ is strictly convex if its domain is a convex set and $f(\theta x + (1-\theta)y) < \theta f(x) + (1-\theta)f(y)$ for all $x, y$ in the domain of $f$ with $x \ne y$ and $0 < \theta < 1$.

We define concave and strictly concave functions analogously ($f$ is concave if $-f$ is convex). If a function is strictly convex and it has a minimum, then that minimum is the unique global minimum. Here are some properties:

  • If $f(x)$ is convex then $af(x)$ is convex if $a > 0$ and concave if $a < 0$

  • The sum of two convex functions is a convex function, with domain the intersection of the two domains (the intersection of two convex sets is a convex set)

  • If $f_1, f_2, \dots, f_m$ are convex then $f(x) = \max\{f_1(x), f_2(x), \dots, f_m(x)\}$ is convex on the intersection of all the domain sets.
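The pointwise-maximum property can be probed numerically with a midpoint check (a necessary condition for convexity); a sketch with illustrative names:

```python
def is_midpoint_convex(f, xs):
    """Check f((a+b)/2) <= (f(a)+f(b))/2 over sample pairs (necessary condition)."""
    eps = 1e-9
    return all(f(0.5 * (a + b)) <= 0.5 * (f(a) + f(b)) + eps
               for a in xs for b in xs)

f1 = lambda x: (x - 1) ** 2          # convex
f2 = lambda x: abs(x + 2)            # convex
g = lambda x: max(f1(x), f2(x))      # pointwise max, convex by the property above

xs = [i / 10 for i in range(-50, 51)]
assert is_midpoint_convex(f1, xs)
assert is_midpoint_convex(g, xs)
```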

Examples

Examples of univariate convex functions:

  • $y = ax + b$, since the line connecting any two points lies on its graph

  • $y = e^{ax}$, $a \in \mathbb{R}$
  • $y = x^a$ on $\mathbb{R}^+$, with $a \ge 1$ or $a \le 0$

  • the negative entropy $y = x\log x$, $x \in \mathbb{R}^+$

Examples of univariate concave functions:

  • $y = ax + b$, since $-y$ is a convex function. Note that $y = ax + b$ is both convex and concave

  • $y = x^a$, $x \in \mathbb{R}^+$, $0 \le a \le 1$
  • $y = \log(x)$

$f(x) = a^T x + b$ is called an affine function. With a matrix variable it can be written $f(X) = \mathrm{tr}(A^T X) + b$. The quadratic function $f(x) = ax^2 + bx + c$ is convex if $a > 0$, concave if $a < 0$. With a vector $x = [x_1, x_2, \dots, x_n]^T$ it can also be written in matrix form: $f(x) = x^T A x + b^T x + c$. Usually $A$ is a symmetric matrix ($a_{ij} = a_{ji}$ for all $i, j$). If $A$ is positive semidefinite ($x^T A x \ge 0$ for all $x$), then $f(x)$ is a convex function. If $A$ is negative semidefinite ($x^T A x \le 0$ for all $x$), $f(x)$ is a concave function.
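For the 2×2 case, positive semidefiniteness has a simple closed form ($a \ge 0$, $c \ge 0$, $\det \ge 0$), which gives a small self-contained convexity check for quadratic functions; helper names are illustrative:

```python
def is_psd_2x2(A):
    """A = [[a, b], [b, c]] symmetric: PSD iff a >= 0, c >= 0 and det(A) >= 0."""
    (a, b), (_, c) = A
    return a >= 0 and c >= 0 and a * c - b * b >= 0

def quad(A, b, c, x):
    """f(x) = x^T A x + b^T x + c for 2-vectors."""
    q = sum(x[i] * A[i][j] * x[j] for i in range(2) for j in range(2))
    lin = sum(b[i] * x[i] for i in range(2))
    return q + lin + c

A = [[2.0, 0.5], [0.5, 1.0]]          # symmetric and PSD, so f is convex
assert is_psd_2x2(A)

# Midpoint-convexity spot check on a grid of point pairs.
pts = [(i * 0.5, j * 0.5) for i in range(-4, 5) for j in range(-4, 5)]
for u in pts:
    for v in pts:
        mid = ((u[0] + v[0]) / 2, (u[1] + v[1]) / 2)
        assert quad(A, [1, -1], 0, mid) <= \
            0.5 * (quad(A, [1, -1], 0, u) + quad(A, [1, -1], 0, v)) + 1e-9
```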

Derivatives

For a differentiable function, we can check the derivatives to see whether it is convex or not. The tangent line of $f$ at a point $(x_0, f(x_0))$ is $y = f'(x_0)(x - x_0) + f(x_0)$. For a multivariate function, let $\nabla f(x_0)$ be the gradient of $f$ at $x_0$; the tangent plane is $y = \nabla f(x_0)^T(x - x_0) + f(x_0)$. A function $f$ that is differentiable everywhere on a convex domain is convex if and only if $f(x) \ge f(x_0) + \nabla f(x_0)^T(x - x_0)$ for all $x, x_0$ in the domain. Similarly, it is strictly convex if equality holds only at $x = x_0$. Visually, a function is convex if the tangent at any point of the graph lies below the graph.
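The first-order condition, that the tangent lies below the graph, can be verified numerically for a convex function such as $f(x) = e^x$; a sketch:

```python
import math

def tangent(f, df, x0):
    """Tangent line of f at x0: t(x) = f(x0) + f'(x0) * (x - x0)."""
    return lambda x: f(x0) + df(x0) * (x - x0)

# f(x) = e^x is convex, and exp is its own derivative:
# the tangent at any point lies below the graph.
f, df = math.exp, math.exp
t = tangent(f, df, 1.0)
xs = [i / 10 for i in range(-30, 31)]
assert all(f(x) >= t(x) - 1e-12 for x in xs)
```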

For a multivariate function whose variable is a vector of dimension $d$, the first derivative is a $d$-dimensional vector and the second derivative is a $d \times d$ square matrix, denoted $\nabla^2 f(x)$ and called the Hessian. A twice-differentiable function is convex if its domain is convex and its Hessian is a positive semidefinite matrix for all $x$ in the domain: $\nabla^2 f(x) \succeq 0$. If the Hessian is a positive definite matrix, the function is strictly convex. Similarly, if the Hessian is negative definite, the function is strictly concave. For univariate functions, this condition reads $f''(x) \ge 0$ for all $x$ in the convex domain.

We have some examples:

  • The negative entropy function $f(x) = x\log(x)$ is strictly convex, since the domain $x > 0$ is convex and $f''(x) = \frac{1}{x}$ is greater than 0 for all $x$ in the domain.

  • $f(x) = x^2 + 5\sin(x)$ is not a convex function, since the second derivative $f''(x) = 2 - 5\sin(x)$ can have a negative value.

  • The cross entropy function is a strictly convex function. For two probabilities $x$ and $1 - x$ with $0 < x < 1$ and a constant $a \in [0, 1]$: $f(x) = -(a\log(x) + (1-a)\log(1-x))$ has second derivative $\frac{a}{x^2} + \frac{1-a}{(1-x)^2}$, which is a positive number.

  • If $A$ is a positive definite matrix, $f(x) = \frac{1}{2}x^T A x$ is convex since its Hessian is $A$, a positive definite matrix.

  • For the negative entropy function of two variables, $f(x, y) = x\log(x) + y\log(y)$ on $x, y > 0$: the gradient is $[\log(x) + 1, \log(y) + 1]^T$ and the Hessian is

$\begin{bmatrix} 1/x & 0 \\ 0 & 1/y \end{bmatrix}$

This is a diagonal matrix with positive entries, so it is a positive definite matrix. Therefore the negative entropy is a strictly convex function.
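These second-derivative computations can be cross-checked with finite differences; `hessian_2d` below is an illustrative helper using central differences:

```python
import math

def hessian_2d(f, x, y, h=1e-4):
    """Central finite-difference estimate of the 2x2 Hessian of f(x, y)."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / (h * h)
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / (h * h)
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h * h)
    return [[fxx, fxy], [fxy, fyy]]

neg_entropy = lambda x, y: x * math.log(x) + y * math.log(y)
H = hessian_2d(neg_entropy, 0.5, 2.0)
# Analytic Hessian is diag(1/x, 1/y) = diag(2, 0.5); off-diagonals vanish.
assert abs(H[0][0] - 2.0) < 1e-3 and abs(H[1][1] - 0.5) < 1e-3
assert abs(H[0][1]) < 1e-3
```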

Jensen’s Inequality

Given a convex function f, we have the generalization of the convexity definition:

$\sum_i \alpha_i f(x_i) \ge f\left(\sum_i \alpha_i x_i\right)$

and $\mathbb{E}[f(X)] \ge f(\mathbb{E}[X])$, where $\sum_i \alpha_i = 1$, $\alpha_i \ge 0$, and $X$ is a random variable. In other words, the expectation of a convex function is no less than the convex function of the expectation.
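Jensen's inequality is easy to observe with Monte Carlo sampling; a sketch:

```python
import random

random.seed(1)
f = lambda x: x * x                      # convex
xs = [random.gauss(0, 1) for _ in range(10000)]

mean_of_f = sum(f(x) for x in xs) / len(xs)   # estimates E[f(X)]
f_of_mean = f(sum(xs) / len(xs))              # f applied to the sample mean
assert mean_of_f >= f_of_mean                 # Jensen: E[f(X)] >= f(E[X])
```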

Properties

Local Minima are Global Minima

Consider a convex function $f$ on a convex set $X$. Suppose $x^* \in X$ is a local minimum: there exists a small positive value $p$ such that for all $x \in X$ with $0 < \|x - x^*\| \le p$, we have $f(x^*) \le f(x)$. To prove that this is also the global minimum, we use contradiction. Assume $x^*$ is not the global minimum: there exists $x' \in X$ with $f(x') < f(x^*)$. Then there exists $\lambda \in [0, 1)$, namely $\lambda = 1 - \frac{p}{\|x^* - x'\|}$, such that $0 < \|\lambda x^* + (1-\lambda)x' - x^*\| \le p$.

However, according to the definition of convex function, we have:

$f(\lambda x^* + (1-\lambda)x') \le \lambda f(x^*) + (1-\lambda)f(x') < \lambda f(x^*) + (1-\lambda)f(x^*) = f(x^*)$

This contradicts $x^*$ being a local minimum, so the assumption fails and $x^*$ is the global minimum.
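This property is what makes gradient descent reliable on convex problems. A sketch (plain Python, illustrative names) showing that every starting point reaches the same minimizer of a strictly convex quadratic:

```python
def gradient_descent(df, x0, lr=0.1, steps=200):
    """Plain gradient descent; on a convex function, any stationary point
    it converges to is the global minimum."""
    x = x0
    for _ in range(steps):
        x -= lr * df(x)
    return x

# f(x) = (x - 3)^2 + 1 is strictly convex with global minimum at x = 3.
df = lambda x: 2 * (x - 3)
for start in (-10.0, 0.0, 25.0):       # every start reaches the same minimizer
    assert abs(gradient_descent(df, start) - 3.0) < 1e-6
```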