Derivatives on Normed Vector Spaces
This post explores the conception of the derivative as the best local linear approximation of a continuous function in the general setting of normed vector spaces, and generalizes classical results to spaces of functions and the functionals defined on them.

The fundamental idea of differential calculus is local linear approximation. The total derivative of a continuous function is the best linear approximation of the function at a given point in its domain. In order to define continuous functions, we require a metric space, and, in order to define linear functions, we require a vector space. It is therefore natural to study differential calculus in the context of a space which is both a metric space and a vector space; such spaces are called normed vector spaces.
Since each Euclidean space \(\mathbb{R}^n\) is a normed vector space, this approach encompasses all the classical notions from differential calculus. However, this perspective permits the immediate generalization of all results to spaces other than Euclidean space, in particular, spaces of functions. Maps from such spaces of functions into \(\mathbb{R}\) are called functionals. We may then consider various notions of derivatives of functionals, which is part of the subject matter of the mathematical disciplines of the Calculus of Variations and Functional Analysis.
Furthermore, we will develop the results from the perspective of tangent spaces, which permits these results to be generalized within differential geometry, where tangent spaces play an important role.
These concepts find many applications in physics, computer science, engineering, and elsewhere.
Vector Spaces
Vectors are often conceived geometrically as "arrows" emanating from a common point. We may scale (extend or contract) such arrows (which corresponds to multiplying the arrow by some scalar factor), or translate such arrows (which corresponds to adding one arrow to another arrow). If coordinates are introduced, such geometric vectors can be represented as tuples of numbers \((v_1,\dots,v_n)\). The most straightforward operations to define on tuples of numbers are the scalar multiplication of a tuple by a number \(a\)
\[a \cdot (v_1,\dots,v_n) = (av_1,\dots,av_n)\]
and the addition of two tuples
\[(v_1,\dots,v_n) + (w_1,\dots,w_n) = (v_1+w_1,\dots,v_n+w_n).\]
Thus, a vector space is a set (whose elements are considered vectors) endowed with some appropriate notion of "scalar multiplication" and "addition".
A vector space can easily be defined with respect to an arbitrary field, which represents the "scalars". However, we will only consider real vector spaces, which are vector spaces whose field of scalars is \(\mathbb{R}\). Unless otherwise noted, the term "vector space" will designate a real vector space.
Definition 1 (Real Vector Space). A real vector space is a set \(V\) whose elements are called vectors together with an operation \(+ : V \times V \rightarrow V\) called vector addition and an operation \(\cdot : \mathbb{R} \times V \rightarrow V\) called scalar multiplication (where the elements of \(\mathbb{R}\) are called scalars) that satisfy the following axioms:
- Associativity of Vector Addition: \(v_1 + (v_2 + v_3) = (v_1 + v_2) + v_3\) for all \(v_1,v_2,v_3 \in V\).
- Commutativity of Vector Addition: \(v_1 + v_2 = v_2 + v_1\) for all \(v_1,v_2 \in V\).
- Existence of an Additive Identity: there exists an element in \(V\) denoted \(0\) such that \(v + 0 = 0 + v = v\) for all \(v \in V\).
- Existence of Additive Inverses: for all \(v \in V\) there exists an element in \(V\) denoted \(-v\) called the additive inverse of \(v\) such that \(v + (-v) = (-v) + v = 0\).
- Scalar Multiplicative Identity: \(1 \cdot v = v\) for all \(v \in V\), where \(1 \in \mathbb{R}\) is the real number one.
- Left-Associativity of Scalar Multiplication: \(a_1 \cdot (a_2 \cdot v) = (a_1a_2) \cdot v\) for all \(a_1,a_2 \in \mathbb{R}\) and \(v \in V\).
- Right-Distributivity of Scalar Multiplication: \((a_1 + a_2) \cdot v = a_1 \cdot v + a_2 \cdot v\) for all \(a_1,a_2 \in \mathbb{R}\) and \(v \in V\).
- Left-Distributivity of Scalar Multiplication: \(a \cdot (v_1 + v_2) = a \cdot v_1 + a \cdot v_2\) for all \(a \in \mathbb{R}\) and \(v_1,v_2 \in V\).
The scalar product \(a \cdot v\) is often denoted simply as \(av\).
Example 1. (Euclidean Spaces) The prototypical example of a vector space is Euclidean space \(\mathbb{R}^n\) (i.e. the set of all tuples \((v_1, \dots, v_n)\) of \(n\) real numbers), with addition being component-wise addition of tuples of real numbers, i.e.
\[(v_1, \dots, v_n) + (w_1, \dots, w_n) = (v_1+w_1,\dots,v_n+w_n),\]
and scalar multiplication being component-wise scalar multiplication, i.e.
\[a \cdot (v_1, \dots, v_n) = (av_1, \dots, av_n).\]
Because the operations in \(\mathbb{R}^n\) are defined in terms of corresponding operations in \(\mathbb{R}\), they inherit the properties of the operations in \(\mathbb{R}\), inducing a vector space structure on \(\mathbb{R}^n\).
The axioms then guarantee that an abstract vector space behaves in a manner similar to this more concrete Euclidean space.
Example 2. (Spaces of Continuous Functions) Another example of a vector space is the space \(C[a,b]\) consisting of all continuous functions \(f : [a,b] \rightarrow \mathbb{R}\), with vector addition defined point-wise as
\[(f + g)(x) = f(x) + g(x)\]
and scalar multiplication defined point-wise as
\[(a \cdot f)(x) = af(x)\]
for all \(f,g \in C[a,b]\), \(x \in [a,b]\), and \(a \in \mathbb{R}\).
Because the operations in \(C[a,b]\) are defined in terms of corresponding operations in \(\mathbb{R}\), they inherit the properties of the operations in \(\mathbb{R}\), which is a vector space with respect to these operations, thus inducing a vector space structure on \(C[a,b]\).
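To make the pointwise operations concrete, here is a minimal Python sketch (our own illustration; the helper names add and smul are not part of the mathematics) representing elements of \(C[a,b]\) as ordinary functions:

```python
import math

# Elements of C[a, b] represented as plain Python functions;
# vector addition and scalar multiplication are defined pointwise.

def add(f, g):
    """Pointwise sum: (f + g)(x) = f(x) + g(x)."""
    return lambda x: f(x) + g(x)

def smul(a, f):
    """Pointwise scalar multiple: (a . f)(x) = a * f(x)."""
    return lambda x: a * f(x)

h = add(smul(2.0, math.sin), math.cos)  # the function 2*sin + cos
print(h(0.0))  # 2*sin(0) + cos(0) = 1.0
```

The closures returned by add and smul are again functions on \([a,b]\), mirroring the fact that \(C[a,b]\) is closed under these operations.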
Linear Maps
Next we consider the appropriate notion of a map between vector spaces. Such a map should be a homomorphism (i.e. it should "preserve the structure" of a vector space). The "structure" that a vector space possesses is the algebraic operations of vector addition and scalar multiplication.
Definition 2 (Linear Map). Given vector spaces \(V\) and \(W\), a function \(T : V \rightarrow W\) is called a linear map (or linear transformation) if the following conditions are satisfied:
- \(T(v_1 + v_2) = T(v_1) + T(v_2)\) for all \(v_1,v_2 \in V\).
- \(T(a \cdot v) = a \cdot T(v)\) for all \(a \in \mathbb{R}\) and \(v \in V\).
Metric Spaces
A metric space is a space in which we can measure the distance between its points.
Definition 3 (Metric Space). A metric space is a set \(X\) together with a function \(d : X \times X \rightarrow \mathbb{R}\) called a metric satisfying the following properties:
- Positive Definiteness: \(d(x_1,x_2) \ge 0\) for all \(x_1,x_2 \in X\), and \(d(x_1, x_2) = 0\) if and only if \(x_1 = x_2\).
- Symmetry: \(d(x_1,x_2) = d(x_2,x_1)\) for all \(x_1,x_2 \in X\).
- Triangle Inequality: \(d(x_1,x_2) + d(x_2,x_3) \ge d(x_1,x_3)\) for all \(x_1,x_2,x_3 \in X\).
Example 3. (Euclidean Spaces) Each Euclidean space \(\mathbb{R}^n\) is a metric space using the Euclidean metric
\[d(x,y) = \sqrt{(x_1 - y_1)^2 + \dots + (x_n - y_n)^2}.\]
This is the usual measure of distance in Euclidean space, which is based on the Pythagorean Theorem.
Example 4. (Spaces of Continuous Functions) Each space \(C[a,b]\) of continuous functions \(f : [a,b] \rightarrow \mathbb{R}\) is a metric space using the supremum metric
\[d(f,g) = \sup_{x \in [a,b]}\lvert f(x) - g(x)\rvert.\]
We won't examine the details, but since \(\lvert f - g \rvert\) is continuous on the closed interval \([a,b]\), the Extreme Value Theorem ensures that this supremum exists (and is in fact attained).
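Numerically, the supremum metric can be approximated by sampling a fine grid; this only approximates the true supremum, but it illustrates the definition. A sketch, assuming NumPy (the name sup_metric is ours):

```python
import numpy as np

def sup_metric(f, g, a, b, n=10_001):
    """Approximate d(f, g) = sup |f(x) - g(x)| by sampling a grid on [a, b]."""
    x = np.linspace(a, b, n)
    return float(np.max(np.abs(f(x) - g(x))))

# Distance between sin and cos on [0, 2*pi]; the exact value is sqrt(2).
print(sup_metric(np.sin, np.cos, 0.0, 2 * np.pi), np.sqrt(2))
```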
Limits
The purpose of metric spaces is to define various limits, which, in turn, permit the definition of continuous functions.
Definition 4 (Sequence). A sequence within a metric space \(X\) is a function \(a : \mathbb{N} \rightarrow X\). The notation \(a_n = a(n)\) is often used to denote an individual term of the sequence. The sequence itself is often denoted \((a_n)_{n \in \mathbb{N}}\).
Definition 5 (Limit of a Sequence). A limit of a sequence \((a_n)_{n \in \mathbb{N}}\) within a metric space \((X, d)\) is a point \(L \in X\) such that, for every \(\varepsilon \gt 0\), there exists an \(N \in \mathbb{N}\) such that, for every \(n > N\), \(d(L, a_n) \lt \varepsilon\). In this case, the sequence is said to converge to the limit, and we write
\[\lim_{n \to \infty}a_n = L.\]
In other words, the sequence becomes arbitrarily close to a limiting point. This may be conceived as a sort of adversarial game: our opponent specifies a number \(\varepsilon\) indicating how close the sequence must come to \(L\). We then produce a number \(N\) such that every term \(a_n\) with \(n \gt N\) is strictly within distance \(\varepsilon\) of \(L\). If we always "win" the game in this manner for any \(\varepsilon\) whatsoever, no matter how small, then \(L\) is a limit of the sequence. Thus, we may always find terms of the sequence as close to the limit as we wish.
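This game can be played mechanically for a concrete sequence. The following sketch (an illustration of ours) plays it for the sequence \(a_n = 1/(n+1)\) in \(\mathbb{R}\) with limit \(L = 0\), producing a suitable \(N\) for any given \(\varepsilon\):

```python
def find_N(eps):
    """Return an N such that |a_n - 0| < eps for every n > N,
    where a_n = 1/(n + 1)."""
    n = 0
    while 1.0 / (n + 1) >= eps:
        n += 1
    return n  # the sequence decreases, so all later terms also qualify

for eps in (0.1, 0.01, 0.001):
    N = find_N(eps)
    print(eps, N, 1.0 / (N + 2))  # the term a_{N+1} is within eps
```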
It is often useful to talk about the set of points strictly within some distance of another point, which is analogous to a sphere (or "ball"), which consists of all points within some radius of a central point.
Definition 6 (Open Ball). Given a metric space \((X,d)\), a point \(x \in X\), and a real number \(r \gt 0\), the open ball around \(x\) of radius \(r\) is the following set of points:
\[B(x,r) = \{ x' \in X : d(x,x') \lt r\}.\]
Thus, the definition of a limit of a sequence can be rephrased using open balls, since the condition \(d(L, a_n) \lt \varepsilon\) is equivalent to \(a_n \in B(L,\varepsilon)\).
Next, we will demonstrate that, if a limit of a sequence exists, then it is necessarily unique, and hence we may speak of "the" limit of a sequence.
Theorem 1. If the limit of a sequence in a metric space exists, then it is unique.
Proof. Suppose that \(L_1\) and \(L_2\) are both limits of a sequence \((a_n)_{n \in \mathbb{N}}\) within a metric space \((X, d)\), and that \(L_1 \neq L_2\). Since \(L_1 \ne L_2\), it follows by positive definiteness that \(d(L_1, L_2) \gt 0\) and hence \(d(L_1,L_2)/2 \gt 0\). Let \(\varepsilon \in \mathbb{R}\) be any real number such that \(0 \lt \varepsilon \lt d(L_1,L_2)/2\). Since \(L_1\) is a limit, there exists a natural number \(N_1 \in \mathbb{N}\) such that \(d(L_1, a_n) \lt \varepsilon\) for all \(n \gt N_1\). Since \(L_2\) is a limit, there likewise exists a natural number \(N_2 \in \mathbb{N}\) such that \(d(L_2, a_n) \lt \varepsilon\) for all \(n \gt N_2\). Let \(N = \max(N_1, N_2)\). Then, for all \(n \gt N\), since \(n \gt N_1\) and \(n \gt N_2\), it follows that \(d(L_1, a_n) \lt \varepsilon\) and \(d(L_2, a_n) \lt \varepsilon\). By the triangle inequality, we have the following inequality:
\begin{align}d(L_1,L_2) &\le d(L_1,a_n) + d(L_2,a_n) \\&\lt \varepsilon + \varepsilon \\&\lt \frac{d(L_1,L_2)}{2} + \frac{d(L_1,L_2)}{2} \\&= d(L_1,L_2).\end{align}
It follows that \(d(L_1,L_2) \lt d(L_1,L_2)\), which is a contradiction, and hence \(L_1 = L_2\). \(\square\)
Note that, for any function \(f : X \rightarrow Y\) and sequence \((a_n)_{n \in \mathbb{N}}\) in \(X\), we may produce a sequence \((f(a_n))_{n \in \mathbb{N}}\) in \(Y\).
It is also possible to define the limit of a function in a metric space.
Definition 7 (Limit of a Function). A limit of a function \(f : X \rightarrow Y\) between metric spaces \((X,d_X)\) and \((Y,d_Y)\) at a point \(a \in X\) is a point \(L \in Y\) such that, for every \(\varepsilon \gt 0\), there exists a \(\delta \gt 0\) such that \(d_Y(f(x),L) \lt \varepsilon\) whenever \(d_X(x,a) \lt \delta\). This limit is denoted
\[\lim_{x \to a}f(x).\]
This means that, as \(x\) approaches \(a\) in the domain, \(f(x)\) approaches \(L\) in the codomain, i.e. \(f(x)\) can be made arbitrarily close to the limit \(L\) by making \(x\) sufficiently close to \(a\).
Just as limits of sequences in metric spaces are unique, a similar argument shows that limits of functions in metric spaces are unique.
Theorem 2. Limits of functions in metric spaces are unique.
Proof. Suppose that \(L_1\) and \(L_2\) are both limits of a function \(f : X \rightarrow Y\) between metric spaces at a point \(a \in X\), where \(L_1 \ne L_2\). Then, by positive definiteness, \(d(L_1, L_2) \gt 0\) and thus \(d(L_1, L_2)/2 \gt 0\). Let \(\varepsilon \in \mathbb{R}\) be any real number such that \(0 \lt \varepsilon \lt d(L_1,L_2)/2\). Then, since \(L_1\) is a limit, there exists a \(\delta_1 \gt 0\) such that, for all \(x \in X\), \(d(f(x), L_1) \lt \varepsilon\) whenever \(d(x, a) \lt \delta_1\). Likewise, since \(L_2\) is a limit, there exists a \(\delta_2 \gt 0\) such that, for all \(x \in X\), \(d(f(x), L_2) \lt \varepsilon\) whenever \(d(x, a) \lt \delta_2\). Let \(\delta = \min(\delta_1,\delta_2)\). Then, since \(\delta \le \delta_1\) and \(\delta \le \delta_2\), it follows that, for all \(x \in X\), \(d(f(x),L_1) \lt \varepsilon\) and \(d(f(x), L_2) \lt \varepsilon\) whenever \(d(x,a) \lt \delta\). By the triangle inequality, we have the following inequality:
\begin{align}d(L_1,L_2) &\le d(L_1, f(x)) + d(L_2, f(x)) \\&\lt \varepsilon + \varepsilon \\&\lt \frac{d(L_1,L_2)}{2} + \frac{d(L_1,L_2)}{2} \\&= d(L_1,L_2). \end{align}
Thus, \(d(L_1,L_2) \lt d(L_1,L_2)\), which is a contradiction, and hence \(L_1 = L_2\). \(\square\)
Continuity
Limits permit the definition of continuous functions.
Definition 8 (Continuity). A function \(f : X \rightarrow Y\) between metric spaces \(X\) and \(Y\) is continuous at the point \(p \in X\) if, for every convergent sequence \((a_n)_{n \in \mathbb{N}}\) with \(\lim_{n \to \infty}a_n = p\), the following equality is satisfied:
\[\lim_{n \to \infty}f(a_n) = f(p).\]
A function is continuous if it is continuous at every point in its domain.
In other words, a function is continuous if it preserves limits of sequences. Thus, a continuous function is a homomorphism of metric spaces, i.e. a "structure-preserving map" between metric spaces. The "structure" that metric spaces possess is the relation of convergent sequences to their respective limit points, and continuous functions preserve this relation in the sense that they map convergent sequences to convergent sequences.
The following theorem offers a useful alternative definition of continuous functions which is often more technically convenient.
Theorem 3. A function \(f : X \rightarrow Y\) between metric spaces \((X,d_X)\) and \((Y,d_Y)\) is continuous at a point \(x_0 \in X\) if and only if
\[\lim_{x \to x_0}f(x) = f(x_0).\]
Proof. (If). Let \(x_0 \in X\) and suppose that \(\lim_{x \to x_0}f(x) = f(x_0)\). Let \((a_n)_{n \in \mathbb{N}}\) be any sequence and suppose that \(\lim_{n \to \infty}a_n = x_0\). Let \(\varepsilon \gt 0\). Then, by hypothesis, there exists a \(\delta \gt 0\) such that for all \(x \in X\) satisfying \(d(x,x_0) \lt \delta\), \(d(f(x), f(x_0)) \lt \varepsilon\). Since \(\lim_{n \to \infty}a_n = x_0\), there exist an \(N \in \mathbb{N}\) such that \(d(a_n,x_0) < \delta\) for all \(n \gt N\). Then, since \(d(a_n,x_0) < \delta\) for all \(n \gt N\), it follows that \(d(f(a_n), f(x_0)) \lt \varepsilon\) for all \(n \gt N\), and thus \(\lim_{n \to \infty}f(a_n) = f(x_0)\) and \(f\) is continuous at \(x_0\).
(Only if). We will prove the contrapositive. Let \(x_0 \in X\) and suppose that \(\lim_{x \to x_0}f(x) \ne f(x_0)\). This means that there exists an \(\varepsilon \gt 0\) such that for all \(\delta \gt 0\) there exists an \(x \in X\) such that \(d(x, x_0) \lt \delta\) and \(d(f(x), f(x_0)) \ge \varepsilon\). In particular, then, since \(1/(n+1) \gt 0\) for all \(n \in \mathbb{N}\), for each \(n\) there exists an \(a_n \in X\) such that \(d(a_n,x_0) \lt 1/(n+1)\) and \(d(f(a_n), f(x_0)) \ge \varepsilon\). Thus, we may (by the Axiom of Choice) define a sequence \((a_n)_{n \in \mathbb{N}}\). Since \(1/(n+1)\) can be made arbitrarily small, it follows that \(\lim_{n \to \infty}a_n = x_0\) (we can always find an \(N \in \mathbb{N}\) such that \(1/(N+1) \lt \varepsilon'\) for any \(\varepsilon' \gt 0\), and thus \(1/(n+1) \lt \varepsilon'\) for every \(n \gt N\)). However, since \(d(f(a_n), f(x_0)) \ge \varepsilon\) for every \(n\), it follows that \(\lim_{n \to \infty}f(a_n) \ne f(x_0)\), so \(f\) is not continuous at \(x_0\). \(\square\)
This alternative characterization of continuity can be rephrased in terms of open balls: a function \(f : X \rightarrow Y\) is continuous at a point \(x_0 \in X\) if and only if for every \(\varepsilon \gt 0\) there exists a \(\delta \gt 0\) such that \(f(B(x_0, \delta)) \subseteq B(f(x_0), \varepsilon)\).
The composition of continuous maps is again continuous, as the following theorem shows.
Theorem 4. If \(f : Y \rightarrow Z\) and \(g : X \rightarrow Y\) are continuous maps then so is \(f \circ g : X \rightarrow Z\).
Proof. Let \(a \in X\) and suppose \(\varepsilon \gt 0\). Then, since \(f\) is continuous, \(\lim_{y \to g(a)}f(y) =f(g(a))\), which means that there exists a \(\delta_f \gt 0\) such that, for all \(y \in Y\), \(d(f(y), f(g(a))) \lt \varepsilon\) whenever \(d(y,g(a)) \lt \delta_f\). Since \(g\) is continuous, \(\lim_{x \to a}g(x) = g(a)\), which means that there exists a \(\delta_g \gt 0\) such that, for all \(x \in X\), \(d(g(x),g(a)) \lt \delta_f\) whenever \(d(x,a) \lt \delta_g\). Set \(\delta_{f \circ g} = \delta_g\), and suppose \(d(x,a) \lt \delta_{f \circ g}\). Then \(d(g(x),g(a)) \lt \delta_f\), which means that \(d(f(g(x)),f(g(a))) \lt \varepsilon\). Thus, \(\lim_{x \to a}f(g(x)) = f(g(a))\), and, since \(a\) was arbitrary, \(f \circ g\) is continuous. \(\square\)
An immediate consequence of continuity is that, for continuous functions \(f\) and \(g\), we can perform substitution within limits, i.e.
\[\lim_{y \to g(a)}f(y) = f(g(a)) = \lim_{x \to a}f(g(x)).\]
Normed Vector Spaces
A normed vector space is both a vector space and a metric space. Normed vector spaces are equipped with norms which measure the "size" of vectors.
Definition 9 (Normed Vector Space). A normed vector space (over \(\mathbb{R}\)) is a vector space \(V\) (over \(\mathbb{R}\)) equipped with an operation \(\lVert \cdot \rVert : V \rightarrow \mathbb{R}\) satisfying the following axioms:
- Positive Definiteness: \(\lVert v \rVert \ge 0\) for all \(v \in V\), and \(\lVert v \rVert = 0\) if and only if \(v = 0\).
- Absolute Homogeneity: \(\lVert av\rVert = \lvert a \rvert \lVert v \rVert\) for all \(a \in \mathbb{R}\) and \(v \in V\).
- Triangle Inequality: \(\lVert v_1 + v_2 \rVert \le \lVert v_1 \rVert + \lVert v_2 \rVert\) for all \(v_1,v_2 \in V\).
These axioms are very similar to the axioms for a metric space. Every normed vector space \(V\) has a canonical metric called the induced metric (or induced distance) defined for all \(v_1,v_2 \in V\) as follows:
\[d(v_1, v_2) = \lVert v_1 - v_2 \rVert.\]
Example 5. The Euclidean norm is the standard norm on Euclidean spaces \(\mathbb{R}^n\). For each \(x \in \mathbb{R}^n\), it is defined using the Pythagorean Theorem as follows:
\[\lVert (x_1,\dots,x_n) \rVert = \sqrt{x_1^2 + \dots + x_n^2}.\]
The induced distance in Euclidean space for \(x,y \in \mathbb{R}^n\) is therefore
\[d(x,y) = \lVert (x_1,\dots,x_n) - (y_1,\dots,y_n) \rVert = \sqrt{(x_1-y_1)^2 + \dots + (x_n-y_n)^2}.\]
Example 6. The supremum norm on \(C[a,b]\) is defined for each \(f \in C[a,b]\) as
\[\lVert f \rVert = \sup_{x \in [a,b]}\lvert f(x) \rvert.\]
The induced distance between \(f,g \in C[a,b]\) is therefore
\[d(f,g) = \sup_{x \in [a,b]}\lvert f(x) - g(x)\rvert.\]
Limits in Normed Vector Spaces
Since every normed vector space is a metric space under its induced metric, it is possible to define limits in normed vector spaces. We will primarily be interested in limits of functions. Explicitly, the definition is as follows:
Definition 10 (Limit of a Function). A limit of a function \(f : X \rightarrow Y\) between normed vector spaces \(X\) and \(Y\) at a point \(a \in X\) is a point \(L \in Y\) such that for every \(\varepsilon \gt 0\) there exists a \(\delta \gt 0\) such that \(\lVert f(x) - L \rVert_Y \lt \varepsilon\) whenever \(\lVert x - a \rVert_X \lt \delta\).
There are several useful properties of limits in normed vector spaces which we will exploit.
Theorem 5. For any function \(f : X \rightarrow Y\) between normed vector spaces \(X\) and \(Y\), \(\lim_{h \to a}f(h) = 0\) if and only if \(\lim_{h \to a}\lVert f(h) \rVert = 0\).
Proof. In fact, these two statements, when expanded, say exactly the same thing. If \(\lim_{h \to a}f(h) = 0\), then, by definition, for every \(\varepsilon \gt 0\) there exists a \(\delta \gt 0\) such that \(\lVert f(h) - 0\rVert_Y = \lVert f(h) \rVert_Y \lt \varepsilon\) whenever \(\lVert h -a \rVert_X \lt \delta\). If \(\lim_{h \to a}\lVert f(h) \rVert = 0\) (a limit in \(\mathbb{R}\)), then, by definition, for every \(\varepsilon \gt 0\), there exists a \(\delta \gt 0\) such that \(\lvert \lVert f(h) \rVert_Y - 0 \rvert \lt \varepsilon\) whenever \(\lVert h-a \rVert_X \lt \delta\). Observe that
\begin{align}\lvert \lVert f(h) \rVert_Y - 0 \rvert &= \left\lvert \lVert f(h) \rVert_Y \right\rvert \\&= \lVert f(h) \rVert_Y,\end{align}
since norms are non-negative. Thus, both conditions state that \(\lVert f(h) \rVert_Y \lt \varepsilon\) whenever \(\lVert h-a \rVert_X \lt \delta\), and the two statements are equivalent. \(\square\)
Theorem 6. For any function \(f : X \rightarrow Y\) between normed vector spaces \(X\) and \(Y\),
\[\lim_{h \to a}f(h) = \lim_{h \to 0}f(a + h)\]
whenever either limit exists.
Proof. Suppose \(\lim_{h \to a}f(h) = L\) and \(\varepsilon \gt 0\). Then, by definition, there exists a \(\delta \gt 0\) such that, for all \(h \in X\), \(\lVert f(h) - L \rVert \lt \varepsilon\) whenever \(\lVert h-a \rVert \lt \delta\). Suppose that \(\lVert h - 0\rVert = \lVert h \rVert \lt \delta\). Then, it follows that \(\lVert (a + h) - a \rVert = \lVert h \rVert \lt \delta\), and thus \(\lVert f(a+h) - L \rVert \lt \varepsilon\), hence \(\lim_{h \to 0}f(a+h) = L\).
Conversely, suppose that \(\lim_{h \to 0}f(a+h) = L\) and \(\varepsilon \gt 0\). Then, by definition, there exists a \(\delta \gt 0\) such that, for all \(h \in X\), \(\lVert f(a+h) - L \rVert \lt \varepsilon\) whenever \(\lVert h-0 \rVert = \lVert h \rVert \lt \delta\). Suppose that \(\lVert h-a \rVert \lt \delta\). Then, it follows that \(\lVert f(a+(h-a))-L\rVert = \lVert f(h) - L\rVert \lt \varepsilon\), and hence \(\lim_{h \to a}f(h) = L\). \(\square\)
This theorem indicates that we may "translate" limits by "shifting" them accordingly.
Theorem 7. For any function \(f : X \rightarrow Y\) between normed vector spaces,
\[\lim_{t \to 0}f(th) = \lim_{h \to 0}f(h)\]
whenever \(h \ne 0\) and \(\lim_{h \to 0}f(h)\) exists.
Proof. Suppose \(\lim_{h \to 0}f(h) = L\) and \(\varepsilon \gt 0\). Then, there exists a \(\delta \gt 0\) such that \(\lVert f(h) - L \rVert_Y \lt \varepsilon\) whenever \(\lVert h - 0 \rVert_X = \lVert h \rVert_X \lt \delta\). Suppose \(\lVert t -0\rVert_{\mathbb{R}} = \lvert t \rvert \lt \delta / \lVert h \rVert_X\) (which is well-defined since \(h \ne 0\) and hence \(\lVert h \rVert_X \gt 0\)). It then follows that \(\lvert t \rvert \lVert h \rVert_X = \lVert th \rVert_X \lt \delta\), and hence \(\lVert f(th) - L\rVert_Y \lt \varepsilon\). Thus, \(\lim_{t \to 0}f(th) = L\). \(\square\)
This theorem says that we can "parameterize" limits by the parameter \(t\).
Theorem 8. For any function \(f : X \rightarrow Y\) between normed vector spaces and \(c \in \mathbb{R}\), \(\lim_{h \to a}cf(h) = c \cdot \lim_{h \to a}f(h)\) whenever \(c \ne 0\) and either limit exists, or \(c = 0\) and \(\lim_{h \to a}f(h)\) exists.
Proof. If \(c=0\) and \(\lim_{h \to a}f(h) = L\), then \(\lim_{h \to a}cf(h) = 0 = 0 \cdot L\).
Suppose \(c \ne 0\), \(\lim_{h \to a}cf(h) = L\), and \(\varepsilon \gt 0\). Then there exists a \(\delta \gt 0\) such that \(\lVert cf(h)-L\rVert_Y \lt \lvert c \rvert \cdot \varepsilon\) whenever \(\lVert h-a\rVert_X \lt \delta\). Suppose \(\lVert h-a\rVert_X \lt \delta\). Then
\begin{align}\left\lVert f(h) - \frac{1}{c}L \right\rVert_Y &= \left\lVert \frac{1}{c}\left(cf(h) - L\right) \right\rVert_Y \\&= \frac{1}{\lvert c \rvert} \cdot \lVert cf(h) - L \rVert_Y \\&\lt \frac{1}{\lvert c \rvert} \cdot \lvert c \rvert \cdot \varepsilon \\&= \varepsilon.\end{align}
Thus, \(\lim_{h \to a}f(h) = \frac{1}{c}L\) and so \(c \cdot \lim_{h \to a}f(h) = L\).
Conversely, suppose \(c \ne 0\), \(\lim_{h \to a}f(h) = L\), and \(\varepsilon \gt 0\). Then there exists a \(\delta \gt 0\) such that \(\lVert f(h)-L\rVert_Y \lt \frac{1}{\lvert c \rvert} \cdot \varepsilon\) whenever \(\lVert h-a\rVert_X \lt \delta\). Suppose \(\lVert h-a\rVert_X \lt \delta\). Then
\begin{align}\lVert cf(h)-cL\rVert_Y &= \lvert c \rvert \cdot \lVert f(h)-L \rVert_Y \\&\lt \lvert c \rvert \cdot \frac{1}{\lvert c \rvert} \cdot \varepsilon \\&= \varepsilon.\end{align}
Thus, \(\lim_{h \to a}cf(h) = cL\). \(\square\)
Theorem 9. Limits of functions into a finite-dimensional normed vector space are computed component-wise. Let \(f : X \rightarrow Y\) be a function between normed vector spaces \(X\) and \(Y\). Suppose \(Y\) has dimension \(m\) and let \((e_j)\) be any basis for \(Y\). Then \(\lim_{h \to a}f(h)\) exists if and only if each of the limits \(\lim_{h \to a}f^j(h)\) exists, and
\[\lim_{h \to a}f(h) = \sum_{j=1}^m \left[\left(\lim_{h \to a}f^j(h)\right) \cdot e_j\right],\]
where \(f^j\) is the \(j\)-th component function of \(f\) relative to the basis \((e_j)\), i.e., by definition, \(f(h) = \sum_{j=1}^m(f^j(h) \cdot e_j)\).
Proof. Suppose \(\lim_{h \to a}f(h) = L\) and fix \(j\). Since \(Y\) is finite-dimensional, the coordinate functional \(v \mapsto v^j\) relative to the basis \((e_j)\) is continuous (a standard consequence of the equivalence of all norms on a finite-dimensional space), so there exists a constant \(C_j \gt 0\) such that \(\lvert v^j \rvert \le C_j \cdot \lVert v \rVert_Y\) for all \(v \in Y\). Let \(\varepsilon \gt 0\). Then there exists a \(\delta \gt 0\) such that \(\lVert f(h)-L \rVert_Y \lt \varepsilon / C_j\) whenever \(\lVert h-a\rVert_X \lt \delta\). For such \(h\),
\[\lvert f^j(h) - L^j \rvert \le C_j \cdot \lVert f(h) - L \rVert_Y \lt \varepsilon.\]
Thus
\[\lim_{h \to a} f^j(h) = L^j,\]
and therefore
\begin{align}\lim_{h \to a}f(h) &= L \\&= \sum_{j=1}^m (L^j \cdot e_j) \\&= \sum_{j=1}^m\left[ \left(\lim_{h \to a} f^j(h)\right) \cdot e_j\right].\end{align}
Conversely, suppose that \(\lim_{h \to a}f^j(h) = L^j\) for each \(j\) and let \(\varepsilon \gt 0\). Let \(e = \max_j \lVert e_j \rVert_Y\). Then, for each \(j\), there exists a \(\delta^j \gt 0\) such that \(\lvert f^j(h)-L^j \rvert \lt \frac{1}{m \cdot e} \cdot \varepsilon\) whenever \(\lVert h-a \rVert_X \lt \delta^j\). Let \(\delta = \min_j(\delta^j)\) and suppose that \(\lVert h-a \rVert_X \lt \delta\). Then, for each \(j\), it follows that
\[\lvert f^j(h) - L^j \rvert \lt \frac{1}{m \cdot e} \cdot \varepsilon,\]
and thus
\[\sum_{j=1}^m \lvert f^j(h) - L^j \rvert \lt \frac{1}{ e} \cdot \varepsilon.\]
Furthermore, since \(\lVert e_j \rVert_Y \le e\) for all \(j\), it follows that
\[\sum_{j=1}^m \left(\lvert f^j(h) - L^j \rvert \cdot \lVert e_j \rVert_Y\right) \lt \varepsilon.\]
Then
\begin{align}\lVert f(h) - L \rVert_Y &= \left\lVert \sum_{j=1}^m (f^j(h) \cdot e_j) - \sum_{j=1}^m(L^j \cdot e_j) \right\rVert_Y \\&= \left\lVert \sum_{j=1}^m (f^j(h) \cdot e_j - L^j \cdot e_j) \right\rVert_Y \\&\le \sum_{j=1}^m \lVert (f^j(h) \cdot e_j - L^j \cdot e_j) \rVert_Y \\&= \sum_{j=1}^m \lVert (f^j(h) - L^j) \cdot e_j \rVert_Y \\&= \sum_{j=1}^m \left(\lvert f^j(h) - L^j \rvert \cdot \lVert e_j \rVert_Y\right) \\&\lt \varepsilon.\end{align}
Thus, \(\lim_{h \to a}f(h) = L\). \(\square\)
Derivatives
The derivative \(df_a\) of a continuous map \(f : X \rightarrow Y\) between normed vector spaces \(X\) and \(Y\) at a point \(a \in X\) is the best local linear approximation of \(f\) at \(a\).
Since \(df_a\) is linear, it must be a linear map between vector spaces. It is local in the sense that these vector spaces are concentrated ("centered") at the points \(a\) and \(f(a)\), respectively, that is, the zero vector or "origin" of each vector space is the point \(a\) and \(f(a)\), respectively. We will call these vector spaces tangent spaces and we will denote them \(T_aX\) and \(T_{f(a)}Y\). We will refer to the spaces \(X\) and \(Y\) as the underlying spaces of their respective tangent spaces.
The tangent space \(T_aX\) should be a normed vector space which contains the same vectors as \(X\), yet has its origin translated to the point \(a\).
We can define a map \(T_a : T_aX \rightarrow X\) as follows:
\[T_a(x) = x - a.\]
Conversely, we can define a map \(T_a^{-1} : X \rightarrow T_aX\) as follows:
\[T_a^{-1}(x) = x + a.\]
We can define addition \(+_a\) for any \(x,y \in T_aX\) by translating to \(X\), then adding, then translating back to \(T_aX\), as follows:
\[x +_a y = T_a^{-1}(T_a(x) + T_a(y)) = ((x - a) + (y - a)) + a = x + y - a.\]
We can likewise define scalar multiplication \(\cdot_a\) for any \(s \in \mathbb{R}\) and \(x \in T_aX\) by translating to \(X\), then multiplying, then translating back to \(T_aX\), as follows:
\[s \cdot_a x = T_a^{-1}(s \cdot T_a(x)) = (s \cdot (x - a)) + a.\]
Note that the vector \(a\) becomes the zero vector in \(T_aX\), since, for any vector \(x \in T_aX\), it follows that
\[x +_a a = ((x - a) + (a - a)) + a = x.\]
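A quick numerical sanity check of these translated operations, with \(X = \mathbb{R}^2\) and an arbitrarily chosen base point \(a\) (a sketch; the helper names are ours):

```python
import numpy as np

a = np.array([1.0, -2.0])  # base point of the tangent space T_a(R^2)

def T(x):        # T_a : T_a X -> X, sends x to x - a
    return x - a

def T_inv(v):    # T_a^{-1} : X -> T_a X, sends v to v + a
    return v + a

def add_a(x, y):    # translated addition: x +_a y = x + y - a
    return T_inv(T(x) + T(y))

def smul_a(s, x):   # translated scalar multiplication: s*(x - a) + a
    return T_inv(s * T(x))

x = np.array([3.0, 5.0])
print(add_a(x, a))     # a acts as the zero vector: returns x unchanged
print(smul_a(0.0, x))  # scaling by zero returns a, the zero of T_a X
```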
Next, we can define a linear map \(\bar{L}_{a,b} : T_aX \rightarrow T_bY\) in terms of a linear map \(L_{a,b} : X \rightarrow Y\) by translating to \(X\), applying \(L_{a,b}\), then translating to \(T_bY\), as follows:
\[\bar{L}_{a,b}(h) = T_b^{-1}(L_{a,b}(T_a(h))) = b + L_{a,b}(h - a).\]
In fact, every linear map \(\bar{L}_{a,b} : T_aX \rightarrow T_bY\) arises this way, since, given \(\bar{L}_{a,b}\), we can define
\[L_{a,b}(h) = T_{b}(\bar{L}_{a,b}(T_a^{-1}(h))) = \bar{L}_{a,b}(h + a) - b,\]
which implies that
\[L_{a,b}(h-a) = \bar{L}_{a,b}((h-a)+a) - b,\]
and thus
\[\bar{L}_{a,b}(h) = b + L_{a,b}(h - a).\]
This means that we may specify a linear map \(\bar{L}_{a,b} : T_aX \rightarrow T_bY\) by giving a linear map \(L_{a,b} : X \rightarrow Y\) satisfying this relation.
We want to define a particular such linear map \(\bar{L}_{a,f(a)}\) which we will denote \(\bar{d}f_a : T_aX \rightarrow T_{f(a)}Y\) in terms of a corresponding linear map \(L_{a,f(a)}\) which we will denote \(df_a : X \rightarrow Y\). It should therefore satisfy the relation
\[\bar{d}f_a(h) = f(a) + df_a(h - a).\]
We furthermore want this linear map to approximate the continuous function \(f\), i.e.
\[f(h) \approx \bar{d}f_a(h).\]
We can define the error of this approximation as follows:
\[\varepsilon(h) = f(h) - \bar{d}f_a(h) = f(h) - f(a) - df_a(h - a).\]
There is therefore an exact equation
\[f(h) = \bar{d}f_a(h) + \varepsilon(h).\]
Written in terms of \(df_a\), this states that
\[f(h) = f(a) + df_a(h -a) + \varepsilon(h).\]
Now, the "best" linear approximation should, in some sense, minimize the error. Certainly, we want the error to approach \(0\) as \(h\) approaches \(a\), i.e. it should be the case that \(\lim_{h \to a}\varepsilon(h) = 0\). However, this will not suffice for a definition. To see why, first note that we also want the map \(df_a\) to be continuous, and thus \(\lim_{h \to a}df_a(h) = df_a(a)\). Likewise, since \(f\) is continuous, \(\lim_{h \to a}f(h) = f(a)\). Then note the following:
\begin{align}\lim_{h \to a}\varepsilon(h) &= \lim_{h \to a}(f(h) - f(a) - df_a(h - a)) \\&= \lim_{h \to a}f(h) - \lim_{h \to a}f(a) - \lim_{h \to a}df_a(h - a) \\&= f(a) - f(a) - \lim_{h \to a}(df_a(h) - df_a(a)) \\&= -\left(\lim_{h \to a}df_a(h) - \lim_{h \to a}df_a(a)\right) \\&= -(df_a(a) - df_a(a)) \\&= 0.\end{align}
Thus, the error vanishes for all such maps, so we cannot use this as a definition. However, we can require that the "best" linear approximation minimize the error "fastest". That is, we require that \(\lVert \varepsilon(h) \rVert_Y\) approaches \(0\) "faster" than \(h\) approaches \(a\), which we define as follows:
\[\lim_{h \to a}\frac{\lVert \varepsilon(h) \rVert_Y}{\lVert h - a \rVert_X} = 0.\]
Intuitively, this means that, as \(h\) approaches \(a\), \(\lVert \varepsilon(h) \rVert_Y\) gets much smaller than \(\lVert h - a \rVert_X\), since the ratio vanishes.
We use the notation \(f \in o(g)\) to mean that
\[\lim_{x \to 0}\frac{f(x)}{g(x)} = 0,\]
and we thus require that \(\lVert \varepsilon(h) \rVert_Y \in o\left(\lVert h-a \rVert_X\right)\).
Expanding this definition, we have
\[\lim_{h \to a}\frac{\lVert f(h) - f(a) - df_a(h - a)\rVert_Y}{\lVert h-a \rVert_X} = 0.\]
Definition 11 (Bounded Linear Map). A linear map \(f : X \rightarrow Y\) between normed vector spaces is bounded if there exists a constant \(c \in \mathbb{R}\) such that \(\lVert f(x) \rVert_Y \le c \cdot \lVert x \rVert_X\) for all \(x \in X\).
Theorem 10. A linear map between normed vector spaces is bounded if and only if it is continuous.
Proof. Suppose that \(f : X \rightarrow Y\) is a bounded linear map. Then, there exists a \(c \in \mathbb{R}\) (which we may take to be positive) such that \(\lVert f(x) \rVert \le c \cdot \lVert x \rVert\) for all \(x\in X\). Suppose \(\varepsilon \gt 0\), and define \(\delta = \varepsilon/c\). Then \(\delta \gt 0\), and if \(\lVert x-y \rVert < \delta\), then
\begin{align}\lVert f(x) - f(y) \rVert &= \lVert f(x-y) \rVert \\& \le c \cdot \lVert x-y \rVert \\&\lt c \cdot \delta \\&= c \cdot \frac{\varepsilon}{c} \\&= \varepsilon.\end{align}
Conversely, suppose \(f\) is continuous. Then, at the point \(0 \in X\), it follows, by definition, that there exists a \(\delta^* \gt 0\) such that, for all \(x \in X\), \(\lVert f(x) - f(0) \rVert = \lVert f(x-0) \rVert = \lVert f(x) \rVert \lt 1\) whenever \(\lVert x - 0 \rVert = \lVert x \rVert \lt \delta^*\). Then, choosing any number \(0 \lt \delta \lt \delta^*\), we have, for all \(x \in X\), that \(\lVert f(x) \rVert \le 1\) whenever \(\lVert x \rVert \le \delta\). Note that, for any \(x \ne 0\),
\[\left \lVert \delta \cdot \frac{x}{\lVert x \rVert} \right\rVert = \delta.\]
Then, for any \(x \ne 0\), it follows that
\begin{align}\lVert f(x) \rVert &= \left\lVert f\left(\frac{\lVert x \rVert}{ \delta} \cdot \frac{\delta}{\lVert x \rVert}\cdot x\right) \right\rVert \\&= \frac{\lVert x \rVert}{\delta} \cdot \left\lVert f\left(\delta \cdot \frac{x}{\lVert x \rVert}\right) \right\rVert \\&\le \frac{\lVert x \rVert}{\delta} \cdot 1 \\&= \frac{1}{\delta} \cdot \lVert x \rVert.\end{align}
For \(x = 0\), \(\lVert f(0) \rVert = 0 \le \frac{1}{\delta} \cdot \lVert 0 \rVert\), so \(f\) is bounded with \(c = 1/\delta\).
\(\square\)
Definition 12 (Operator Norm). The operator norm of a bounded linear map \(L : X \rightarrow Y\) between normed vector spaces is defined as follows:
\[\lVert L \rVert = \inf C\]
where
\[C = \{c \ge 0 : \lVert Lx \rVert_Y \le c \cdot \lVert x \rVert_X \text{ for all } x \in X\}.\]
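For a linear map between Euclidean spaces represented by a matrix \(A\), this infimum is the largest singular value of \(A\). The sketch below (assuming NumPy) compares the exact value with a lower estimate obtained by sampling unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))   # a linear map R^3 -> R^3 in matrix form

# With Euclidean norms, the operator norm is the largest singular value.
exact = np.linalg.norm(A, 2)

# Lower estimate: the supremum of ||A x|| over sampled unit vectors x.
xs = rng.normal(size=(100_000, 3))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
estimate = np.max(np.linalg.norm(xs @ A.T, axis=1))

print(exact, estimate)  # estimate <= exact, and close to it
```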
We are now prepared to define the derivative.
Definition 13 (Derivative). The derivative of a function \(f : X \rightarrow Y\) between normed vector spaces \(X\) and \(Y\) at a point \(a \in X\) is a bounded linear map \(df_a : X \rightarrow Y\) such that
\[\lim_{h \to a}\frac{\lVert f(h) - f(a) - df_a(h - a)\rVert_Y}{\lVert h-a \rVert_X} = 0.\]
If the derivative of a function exists at a point, the function is said to be differentiable at the point. If the derivative of a function exists at all points in its domain, then the function is said to be differentiable.
There are many equivalent ways to express this definition. By the previous theorems on limits, this may be expressed as
\[\lim_{h \to a}\frac{f(h) - f(a) - df_a(h - a)}{\lVert h-a \rVert_X} = 0\]
or as
\[\lim_{h \to 0}\frac{\lVert f(a+h) - f(a) - df_a(h)\rVert_Y}{\lVert h \rVert_X} = 0\]
or as
\[\lim_{h \to 0}\frac{f(a+h) - f(a) - df_a(h)}{\lVert h \rVert_X} = 0.\]
Next, we will demonstrate that derivatives are unique if they exist, and so we may speak of "the" derivative of a function.
Theorem 11. The derivative of a function \(f : X \rightarrow Y\) between normed vector spaces \(X\) and \(Y\) at a point \(a \in X\) is unique if it exists.
Proof. Suppose that there exist linear maps \(d_1f_a, d_2f_a : X \rightarrow Y\) such that
\[\lim_{h \to 0}\frac{\lVert f(a+h) - f(a) - d_1f_a(h)\rVert_Y}{\lVert h \rVert_X} = 0\]
and
\[\lim_{h \to 0}\frac{\lVert f(a+h) - f(a) - d_2f_a(h )\rVert_Y}{\lVert h \rVert_X} = 0.\]
Note that, for any \(h \in X\),
\begin{align}d_1f_a(h) - d_2f_a(h) &= (f(a+h) - f(a) - d_2f_a(h)) \\&- (f(a+h)-f(a)-d_1f_a(h)).\end{align}
By the triangle inequality, this implies that
\begin{align}\lVert d_1f_a(h) - d_2f_a(h) \rVert_Y &= \lVert (f(a+h) - f(a) - d_2f_a(h)) \\&- (f(a+h)-f(a)-d_1f_a(h)) \rVert_Y \\&\le \lVert (f(a+h) - f(a) - d_2f_a(h)) \rVert_Y \\&+ \lVert (f(a+h)-f(a)-d_1f_a(h)) \rVert_Y.\end{align}
This further implies that
\begin{align}\lim_{h \to 0} \frac{\lVert d_1f_a(h) - d_2f_a(h) \rVert_Y}{\lVert h \rVert_X} &\le \lim_{h \to 0} \frac{\lVert f(a+h)-f(a)-d_1f_a(h) \rVert_Y}{\lVert h \rVert_X} \\&+ \lim_{h \to 0} \frac{\lVert f(a+h)-f(a)-d_2f_a(h) \rVert_Y}{\lVert h \rVert_X} \\&= 0.\end{align}
By positive definiteness, this implies that
\[\lim_{h \to 0} \frac{\lVert d_1f_a(h) - d_2f_a(h) \rVert_Y}{\lVert h \rVert_X} = 0.\]
Then, for any fixed \(h \in X\) with \(h \ne 0\), it follows, by Theorem 7, that
\begin{align}0 &= \lim_{h \to 0} \frac{\lVert d_1f_a(h) - d_2f_a(h) \rVert_Y}{\lVert h \rVert_X} \\&= \lim_{t \to 0} \frac{\lVert d_1f_a(th) - d_2f_a(th) \rVert_Y}{\lVert th \rVert_X} \\&= \lim_{t \to 0} \frac{\lVert t \cdot d_1f_a(h) - t \cdot d_2f_a(h) \rVert_Y}{\lVert th \rVert_X} \\&= \lim_{t \to 0} \frac{\lVert t \cdot (d_1f_a(h) - d_2f_a(h)) \rVert_Y}{\lVert th \rVert_X} \\&= \lim_{t \to 0} \frac{\lvert t \rvert \cdot \lVert d_1f_a(h) - d_2f_a(h) \rVert_Y}{\lvert t \rvert \cdot \lVert h \rVert_X} \\&= \lim_{t \to 0} \frac{\lVert d_1f_a(h) - d_2f_a(h) \rVert_Y}{\lVert h \rVert_X}.\end{align}
Since the ultimate expression is not a function of \(t\), it follows that
\[\frac{\lVert d_1f_a(h) - d_2f_a(h) \rVert_Y}{\lVert h \rVert_X} = 0\]
and thus
\[\lVert d_1f_a(h) - d_2f_a(h) \rVert_Y = 0,\]
which means that \(d_1f_a(h) - d_2f_a(h) = 0\) and thus \(d_1f_a(h) = d_2f_a(h)\). If \(h = 0\), then \(d_1f_a(0) = 0 = d_2f_a(0)\) since each is a linear map and linear maps always map zero vectors to zero vectors. Thus, \(d_1f_a = d_2f_a\). \(\square\)
Next, we will demonstrate that a function is differentiable at a point only if it is continuous at the point.
Theorem 12. A function \(f : X \rightarrow Y\) between normed vector spaces is differentiable at a point \(a \in X\) only if it is continuous at \(a\).
Proof. Suppose that \(f\) is differentiable at \(a\). Then, by definition,
\[\lim_{h \to a}\frac{f(h) - f(a) - df_a(h - a)}{\lVert h-a \rVert_X} = 0.\]
Note that, since \(df_a\) is continuous at \(a\), it follows that
\begin{align}\lim_{h \to a}df_a(h - a) &= \lim_{h \to a}(df_a(h) - df_a(a)) \\&= \lim_{h \to a}df_a(h) - \lim_{h \to a} df_a(a) \\&= df_a(a) - df_a(a) \\&= 0.\end{align}
Also note that \(\lim_{h \to a} \lVert h - a \rVert_X = 0\).
It then follows that
\[\lim_{h \to a}\frac{ f(h) - f(a) - df_a(h - a)}{\lVert h-a \rVert_X} \cdot \lim_{h \to a} \lVert h - a \rVert_X + \lim_{h \to a}df_a(h - a) = 0.\]
Consolidating these limits yields
\[\lim_{h \to a}\left[\frac{f(h) - f(a) - df_a(h - a)}{\lVert h-a \rVert_X} \cdot \lVert h - a \rVert_X + df_a(h - a)\right] = 0.\]
This means that
\[\lim_{h \to a}(f(h) - f(a)) = 0,\]
and thus
\[\lim_{h \to a}f(h) = f(a).\] \(\square\)
Example 7. Consider the function \(f(x) = x^2\). We require that \(f(a+h)-f(a)= df_a(h)+\varepsilon(h)\), so we compute
\begin{align}f(a+h)-f(a) &= (a+h)^2 - a^2 \\&= a^2 + 2ah + h^2 - a^2 \\&= 2ah + h^2.\end{align}
Thus, if we define \(df_a(h) = 2ah\) and \(\varepsilon(h) = h^2\), then the equation is satisfied. We need to confirm that \(\lvert \varepsilon(h) \rvert \in o\left(\lvert h \rvert\right)\), i.e. that
\[\lim_{h \to 0}\frac{\lvert h^2 \rvert}{\lvert h \rvert} = 0.\]
This is equivalent to \(\lim_{h \to 0}\lvert h \rvert = 0\), which is clearly true. Thus \(df_a\) is the linear map
\[h \mapsto 2a \cdot h.\]
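We can also confirm Example 7 numerically: the error ratio \(\lvert f(a+h) - f(a) - 2ah \rvert / \lvert h \rvert\) should tend to \(0\) as \(h \to 0\). A sketch:

```python
def f(x):
    return x * x

a = 1.5

def df_a(h):   # the claimed derivative of x^2 at a
    return 2 * a * h

for h in (1e-1, 1e-2, 1e-3, 1e-4):
    ratio = abs(f(a + h) - f(a) - df_a(h)) / abs(h)
    print(h, ratio)  # the ratio equals |h| exactly here, so it tends to 0
```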
Differentials
There is another way to conceive of the derivative as the best linear approximation to the difference function.
Given any map \(f : X \rightarrow Y\) between normed vector spaces \(X\) and \(Y\), the difference function \(\Delta_af\) of \(f\) at the point \(a \in X\) is defined, for every \(h \in X\), as
\[\Delta_af(h) = f(a + h) - f(a).\]
The difference function simply measures how \(f\) varies as it deviates from point \(a\) by vector \(h\). We can then define the derivative as the best linear approximation of the difference function. That is, we want \(df_a : X \rightarrow Y\) to be a linear map such that
\[\Delta_af(h) \approx df_a(h),\]
which means that there is an exact equation
\[\Delta_af(h) = df_a(h) + \varepsilon(h),\]
where \(\varepsilon(h)\) indicates the error, i.e. the difference between \(\Delta_af(h)\) and its approximation \(df_a(h)\), and is thus defined as
\[\varepsilon(h) = \Delta_af(h) - df_a(h).\]
We require that \(\varepsilon(h) \in o\left(\lVert h \rVert_X\right)\), that is
\[\lim_{h \to 0}\frac{\varepsilon(h)}{\lVert h \rVert_X} = 0.\]
This means that
\[\lim_{h \to 0}\frac{\lVert \Delta_af(h) - df_a(h) \rVert_Y}{\lVert h \rVert_X} = 0.\]
Thus, as \(h\) approaches \(0\), the difference between \(\Delta_af(h)\) and \(df_a(h)\) vanishes "faster" than \(h\) approaches \(0\).
Expanding this definition, we recover a condition equivalent to our original definition of the derivative:
\[\lim_{h \to 0}\frac{\lVert f(a+h) - f(a) - df_a(h) \rVert_Y}{\lVert h \rVert_X} = 0.\]
Affine Maps
An affine space is similar to a vector space, except there is no distinguished zero element.
Definition 14 (Affine Space). An affine space of dimension \(n\) consists of a set \(P\) whose elements are called points, an \(n\)-dimensional vector space \(V\) whose vectors translate the points of \(P\), and a map \(+ : P \times V \rightarrow P\) satisfying the following axioms:
- \(p + (u + v) = (p + u) + v\) for all \(p \in P\) and \(u,v \in V\).
- \(p + 0 = p\) for all \(p \in P\).
- For all \(p,q \in P\) there exists a unique element denoted \(q - p \in V\) such that \(q = p + (q - p)\).
In other words, the vector space \(V\) acts on the set of points, inducing translations (technically there is a free transitive action of the additive group underlying \(V\)).
Next, we consider what an affine map, that is, a homomorphism of affine spaces, should be. Let \(f : A \rightarrow B\) be a map between the point sets \(A\) and \(B\) of two affine spaces whose associated vector spaces are \(\vec{A}\) and \(\vec{B}\), respectively. A homomorphism should preserve both the translation operation (\(+\)) and the translational inverses \((a'-a)\) for \(a',a \in A\). We cannot write \(f(a+v) = f(a) + f(v)\) since \(f\) only maps points. Thus, there must be a pair of mappings consisting of \(f : A \rightarrow B\) and a linear map \(\vec{f} : \vec{A} \rightarrow \vec{B}\) such that
\[f(a + v) = f(a) + \vec{f}(v).\]
Likewise, \(f\) itself cannot preserve the translational inverses \(a'-a \in V\), and so we define \(\vec{f}\) by requiring that it preserve these inverses, namely
\[\vec{f}(a'-a) = f(a') - f(a).\]
However, for this to be a well-defined function, it must be the case that \(\vec{f}(a_1'-a_1) = \vec{f}(a_2'-a_2)\) whenever \(a_1'-a_1 = a_2'-a_2\), and thus we require that \(f(a_1') - f(a_1) = f(a_2') - f(a_2)\).
Note that, for a point \(a\) and a vector \(v\), since \(a+v\) is a point, the third axiom guarantees that there exists a unique vector denoted \((a+v)-a\) such that \(a+((a+v)-a) = a+v\). But \(v\) itself also satisfies the relation \(a+v=a+v\), so, it follows that \((a+v)-a=v\).
Thus, for every point \(a\), and vector \(v\), there exists a unique point \(a+v\) such that \(v=(a+v)-a\). This means that the map \(\vec{f}\) is completely determined, since, for each vector \(v\), \(\vec{f}(v) = \vec{f}((a+v)-a) = f(a+v) - f(a)\). Furthermore, the third axiom again guarantees that \(f(a+v)-f(a)\) is the unique element satisfying \(f(a) + (f(a+v)-f(a)) = f(a+v)\), and thus, since \(\vec{f}(v) = f(a+v) - f(a)\), it also follows that \(f(a) + \vec{f}(v) = f(a+v)\). Thus, the condition on \(\vec{f}\) determines the condition on \(f\).
We have thus arrived at the following definition.
Definition 15 (Affine Map). An affine map between affine spaces \((A, \vec{A})\) and \((B,\vec{B})\) is a map \(f : A \rightarrow B\) such that the linear map \(\vec{f} : \vec{A} \rightarrow \vec{B} \) defined as \(\vec{f}(a'-a) = f(a')-f(a)\) is well-defined, that is, \(f(a_1')-f(a_1) = f(a_2')-f(a_2)\) whenever \(a_1'-a_1 = a_2'-a_2\).
Now, any vector space \(X\) can be considered as an affine space over itself, i.e. the set of points is \(X\) and the set of vectors is \(X\), and affine addition is simply vector addition.
Now, every affine map \(f : X \rightarrow Y\) between vector spaces as affine spaces over themselves must satisfy
\[ f(a+h) = f(a) + df_a(h),\]
for some linear map \(df_a : X \rightarrow Y\) and all \(h \in X\), and, since \(h-a\) makes sense in this context, we can substitute \(h-a\) in place of \(h\) and infer that
\[ f(h) = f(a) + df_a(h-a).\]
Thus, every affine map is of this form in this context.
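Concretely, an affine map on \(\mathbb{R}^n\) has the familiar form \(x \mapsto Ax + b\), and its linear part can be recovered as \(f(a+v) - f(a)\) for any base point \(a\). A sketch (assuming NumPy):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
b = np.array([1.0, -1.0])

def f(x):               # an affine map f(x) = A x + b on R^2
    return A @ x + b

def linear_part(v, a):  # f(a + v) - f(a); for an affine map this is A v
    return f(a + v) - f(a)

v = np.array([1.0, 2.0])
print(linear_part(v, np.zeros(2)))           # equals A v
print(linear_part(v, np.array([5.0, 7.0])))  # same result: independent of a
```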
An affine approximation of a continuous map \(f : X \rightarrow Y\) between normed vector spaces \(X\) and \(Y\) is an affine map between \(X\) and \(Y\) as affine spaces over themselves which approximates \(f\), and thus, since every affine map has the form \(h \mapsto f(a) + df_a(h)\) for some linear map \(df_a : X \rightarrow Y\), this means that
\[f(h) \approx f(a) + df_a(h).\]
There is therefore an equation
\[f(h) = f(a) + df_a(h) + \varepsilon(h),\]
where the error \(\varepsilon(h)\) is given by
\[\varepsilon(h) = f(h) - f(a) - df_a(h).\]
The best affine approximation of \(f\) is then the affine approximation such that
\[\lim_{h \to a}\frac{\lVert \varepsilon(h) \rVert_Y}{\lVert h-a \rVert_X} = 0.\]
Explicitly, this means that
\[\lim_{h \to a}\frac{\lVert f(h) - f(a) - df_a(h) \rVert_Y}{\lVert h-a \rVert_X} = 0,\]
which is precisely the definition of the derivative from above.
Perspectives
In summary, there are thus at least three equivalent perspectives on the derivative of a continuous map \(f : X \rightarrow Y\) between normed vector spaces \(X\) and \(Y\) at a point \(a \in X\):
- The derivative is the best linear approximation of \(f\) at the point \(a\), i.e. a linear map \(\bar{d}f_a : T_aX \rightarrow T_{f(a)}Y\) between tangent spaces satisfying certain properties.
- The derivative is the best linear approximation of the difference function \(\Delta_af\), i.e. a linear map \(df_a : X \rightarrow Y\) satisfying certain properties.
- The derivative is the linear portion of the best affine approximation of \(f\) at the point \(a\).
The Chain Rule
The chain rule is one of the most important properties of derivatives.
Theorem 13. For any differentiable maps \(f : Y \rightarrow Z\) and \(g: X \rightarrow Y\) between normed vector spaces \(X\), \(Y\), and \(Z\) and point \(a \in X\),
\[d(f \circ g)_a = df_{g(a)} \circ dg_a.\]
Proof. Consider the following:
\begin{align}(f \circ g)(a+h) &= f(g(a+h)) \\&= f(g(a) + dg_a(h) + \varepsilon_g(h)) \\&= f(g(a)) + df_{g(a)}(dg_a(h) + \varepsilon_g(h)) + \varepsilon_f(dg_a(h) + \varepsilon_g(h)) \\&= f(g(a)) + df_{g(a)}(dg_a(h)) + df_{g(a)}(\varepsilon_g(h)) + \varepsilon_f(dg_a(h) + \varepsilon_g(h)) \\&= f(g(a)) + df_{g(a)}(dg_a(h)) + df_{g(a)}(\varepsilon_g(h)) + \varepsilon_f(g(a+h)-g(a)).\end{align}
It thus follows that
\[\varepsilon_{f \circ g}(h) = df_{g(a)}(\varepsilon_g(h)) + \varepsilon_f(g(a+h)-g(a)).\]
Since \(df_{g(a)}\) is bounded, there exists a constant \(c \in \mathbb{R}\) such that \(\lVert df_{g(a)}(\varepsilon_g(h)) \rVert \le c \cdot \lVert \varepsilon_g(h) \rVert\), and since \(\lVert \varepsilon_g(h) \rVert \in o\left(\lVert h \rVert\right)\), it follows that \(\lVert df_{g(a)}(\varepsilon_g(h)) \rVert \in o\left(\lVert h \rVert\right)\).
Next, define the following functions:
- \(D(h) = \frac{f(g(a) + h) - f(g(a)) - df_{g(a)}(h)}{\lVert h \rVert}\) for \(h \ne 0\), with \(D(0) = 0\).
- \(E(h) = g(a+h)-g(a).\)
Since \(f\) is differentiable at \(g(a)\), \(D\) is continuous at \(0\), and since \(g\) is continuous (being differentiable), \(E\) is continuous with \(E(0) = 0\). It follows that \(\lim_{h \to 0}(D \circ E)(h) = D(E(0)) = 0\), which means (interpreting the quotient as \(0\) whenever \(g(a+h) = g(a)\)) that
\[\lim_{h \to 0}\frac{f(g(a+h)) - f(g(a)) - df_{g(a)}(g(a+h)-g(a))}{\lVert g(a+h)-g(a) \rVert} = 0.\]
Next, note that, since \(dg_a\) is bounded and \(\lVert \varepsilon_g(h) \rVert \in o\left(\lVert h \rVert\right)\), for all sufficiently small \(\lVert h \rVert\) we have
\[\frac{\lVert g(a+h)-g(a) \rVert}{\lVert h \rVert} \le \frac{\lVert dg_a(h) \rVert}{\lVert h \rVert} + \frac{\lVert \varepsilon_g(h) \rVert}{\lVert h \rVert} \le \lVert dg_a \rVert + 1,\]
so the quotient
\[\frac{\lVert g(a+h)-g(a) \rVert}{\lVert h \rVert}\]
is bounded near \(0\). Then, since the product of a function tending to \(0\) with a bounded function tends to \(0\), it follows that
\[\lim_{h \to 0}\frac{f(g(a+h)) - f(g(a)) - df_{g(a)}(g(a+h)-g(a))}{\lVert g(a+h)-g(a) \rVert}\cdot \frac{\lVert g(a+h)-g(a) \rVert}{\lVert h \rVert} = 0,\]
and thus
\[\lim_{h \to 0}\frac{f(g(a+h)) - f(g(a)) - df_{g(a)}(g(a+h)-g(a))}{\lVert h \rVert} = 0.\]
This means precisely that \(\lVert \varepsilon_f(g(a+h)-g(a)) \rVert \in o\left(\lVert h \rVert\right)\).
It thus follows that \(\varepsilon_{f \circ g}(h) = df_{g(a)}(\varepsilon_g(h)) + \varepsilon_f(g(a+h)-g(a)) \in o\left(\lVert h \rVert\right)\). \(\square\)
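In coordinates, the chain rule says that the Jacobian matrix of \(f \circ g\) at \(a\) is the product of the Jacobians of \(f\) at \(g(a)\) and of \(g\) at \(a\). The following sketch (our illustration, using finite differences) checks this numerically for two maps \(\mathbb{R}^2 \rightarrow \mathbb{R}^2\):

```python
import numpy as np

def g(x):
    return np.array([x[0] * x[1], x[0] + x[1] ** 2])

def f(y):
    return np.array([np.sin(y[0]), y[0] * y[1]])

def jacobian(F, a, eps=1e-6):
    """Central finite-difference Jacobian of F at a."""
    a = np.asarray(a, dtype=float)
    cols = []
    for i in range(a.size):
        e = np.zeros_like(a)
        e[i] = eps
        cols.append((F(a + e) - F(a - e)) / (2 * eps))
    return np.stack(cols, axis=1)

a = np.array([0.5, -1.0])
lhs = jacobian(lambda x: f(g(x)), a)      # matrix of d(f o g)_a
rhs = jacobian(f, g(a)) @ jacobian(g, a)  # matrix of df_{g(a)} o dg_a
print(np.max(np.abs(lhs - rhs)))          # ~0, up to finite-difference error
```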
Directional Derivatives
The directional derivative indicates the change in a function in a particular direction.
Definition 16 (Directional Derivative). A directional derivative of a map \(f : X \rightarrow Y\) between normed vector spaces \(X\) and \(Y\) at a point \(a \in X\) is a bounded linear map \(D_af : X \rightarrow Y\) defined for each \(h \in X\) as follows:
\[D_af(h) = \lim_{t \to 0}\frac{f(a + th) - f(a)}{t}.\]
Note that, unlike the case for the total derivative, the existence of a directional derivative does not necessarily imply that the function is continuous.
Also note that, in a manner analogous to the total derivative, we may also define the directional derivative as
\[f(a+th)=f(a)+t \cdot D_af(h)+\varepsilon(t),\]
where
\[\varepsilon(t) = f(a+th)-f(a)-t \cdot D_af(h)\]
and \(\varepsilon(t) \in o(t)\), meaning
\[\lim_{t \to 0}\frac{ f(a+th)-f(a)- t \cdot D_af(h)}{t} = 0,\]
which implies that
\[D_af(h) = \lim_{t \to 0}\frac{f(a+th)-f(a)}{t}.\]
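The defining limit suggests a direct numerical approximation: for small \(t\), the quotient \((f(a+th)-f(a))/t\) should approach \(D_af(h)\). A sketch for \(f(x,y) = x^2y\), whose directional derivative at \(a\) in direction \(h\) is \(2a_1a_2h_1 + a_1^2h_2\):

```python
import numpy as np

def f(x):
    return x[0] ** 2 * x[1]

a = np.array([1.0, 2.0])
h = np.array([3.0, -1.0])

# For f(x, y) = x^2 y: D_a f(h) = 2*a1*a2*h1 + a1^2*h2 = 11.
exact = 2 * a[0] * a[1] * h[0] + a[0] ** 2 * h[1]

for t in (1e-1, 1e-3, 1e-5):
    print(t, (f(a + t * h) - f(a)) / t, exact)
```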
Theorem 14. For every continuous map \(f : X \rightarrow Y\) between normed vector spaces \(X\) and \(Y\), point \(a \in X\), and vector \(h \in X\), if \(f\) is differentiable at \(a\), then
\[df_a(h) = D_af(h).\]
Proof. Since \(f\) is differentiable at \(a\), by definition,
\[\lim_{h \to 0}\frac{f(a+h)-f(a)-df_a(h)}{\lVert h \rVert} = 0.\]
If \(h = 0\), both sides vanish, so suppose \(h \ne 0\). Parameterizing the limit (Theorem 7), and noting that \(\lVert th \rVert_X = \lvert t \rvert \cdot \lVert h \rVert_X\), where \(\lVert h \rVert_X\) is a nonzero constant, it follows that
\[\lim_{t \to 0}\frac{f(a+th)-f(a)-df_a(th)}{\lvert t \rvert} = 0.\]
Thus, it also follows that
\[\lim_{t \to 0^+}\frac{f(a+th)-f(a)-df_a(th)}{\lvert t \rvert} = 0,\]
which means that
\[\lim_{t \to 0^+}\frac{f(a+th)-f(a)-df_a(th)}{t} = 0.\]
Since
\begin{align}\lim_{t \to 0^+}\frac{df_a(th)}{t} &= \lim_{t \to 0^+}\frac{t \cdot df_a(h)}{t} \\&= \lim_{t \to 0^+}df_a(h) \\&= df_a(h),\end{align}
adding this limit to the previous one yields
\[\lim_{t \to 0^+}\frac{f(a+th)-f(a)}{t} = df_a(h).\]
Similarly,
\[\lim_{t \to 0^-}\frac{f(a+th)-f(a)-df_a(th)}{\lvert t \rvert} = 0,\]
and, since \(\lvert t \rvert = -t\) for \(t \lt 0\), multiplying by \(-1\) shows that
\[\lim_{t \to 0^-}\frac{f(a+th)-f(a)-df_a(th)}{t} = 0.\]
Since
\begin{align}\lim_{t \to 0^-}\frac{df_a(th)}{t} &= \lim_{t \to 0^-}\frac{t \cdot df_a(h)}{t} \\&= \lim_{t \to 0^-}df_a(h) \\&= df_a(h),\end{align}
adding this limit to the previous one yields
\[\lim_{t \to 0^-}\frac{f(a+th)-f(a)}{t} = df_a(h).\]
Since the two one-sided limits agree, it follows that
\[D_af(h) = \lim_{t \to 0}\frac{f(a+th)-f(a)}{t} = df_a(h).\]
\(\square\)
The converse is not necessarily true: the existence of directional derivatives does not guarantee the existence of the total derivative.
The directional derivative likewise satisfies a chain rule.
Theorem 15. If \(f : Y \rightarrow Z\) and \(g : X \rightarrow Y\) are continuous maps between normed vector spaces \(X\), \(Y\), and \(Z\) and the directional derivatives \(D_{g(a)}f(h)\) and \(D_ag(h)\) exist, then, for all \(h \in X\),
\[D_a(f \circ g)(h) = D_{g(a)}f(D_ag(h)).\]
Proof. Consider the following:
\begin{align}(f \circ g)(a+th) &= f(g(a+th)) \\&= f(g(a) + t \cdot D_ag(h) + \varepsilon_g(t)) \\&= f(g(a) + t \cdot (D_ag(h) + (1/t) \cdot \varepsilon_g(t))) \\&= f(g(a)) + t \cdot D_{g(a)}f(D_ag(h) + (1/t) \cdot \varepsilon_g(t)) + \varepsilon_f(D_ag(h) + (1/t) \cdot \varepsilon_g(t)) \\&= f(g(a)) + t \cdot D_{g(a)}f(D_ag(h)) + D_{g(a)}f(\varepsilon_g(t)) + \varepsilon_f(D_ag(h) + (1/t) \cdot \varepsilon_g(t)) \\&= f(g(a)) + t \cdot D_{g(a)}f(D_ag(h)) + D_{g(a)}f(\varepsilon_g(t)) + \varepsilon_f((1/t) \cdot (g(a+th)-g(a))).\end{align}
It thus follows that
\[\varepsilon_{f \circ g}(t) = D_{g(a)}f(\varepsilon_g(t)) + \varepsilon_f\left(\frac{g(a+th)-g(a)}{t}\right).\]
Since \(D_{g(a)}f\) is bounded, there exists a constant \(c \in \mathbb{R}\) such that \(\lVert D_{g(a)}f(\varepsilon_g(t)) \rVert \le c \cdot \lVert \varepsilon_g(t) \rVert\), and since \(\lVert \varepsilon_g(t) \rVert \in o(t)\), it follows that \(\lVert D_{g(a)}f(\varepsilon_g(t)) \rVert \in o(t)\).
Next, define the following functions:
- \(D(h) = \frac{f(g(a) + h) - f(g(a)) - D_{g(a)}f(h)}{\lVert h \rVert}\) for \(h \ne 0\), with \(D(0) = 0\).
- \(E(h) = g(a+h)-g(a).\)
Since \(D\) is continuous at \(0\) and \(E\) is continuous with \(E(0) = 0\), it follows that \(\lim_{h \to 0}(D \circ E)(h) = D(E(0)) = 0\), which means that
\[\lim_{h \to 0}\frac{f(g(a+h)) - f(g(a)) - D_{g(a)}f(g(a+h)-g(a))}{\lVert g(a+h)-g(a) \rVert} = 0.\]
Then, parameterizing this limit, we obtain
\[\lim_{t \to 0}\frac{f(g(a+th)) - f(g(a)) - D_{g(a)}f(g(a+th)-g(a))}{\lVert g(a+th)-g(a) \rVert} = 0.\]
Thus, since, by hypothesis, the limit
\[\lim_{t \to 0}\frac{g(a+th)-g(a)}{t}\]
exists, the quotient \(\frac{\lVert g(a+th)-g(a) \rVert}{\lvert t \rvert}\) is bounded near \(0\), and it follows that
\[\lim_{t \to 0}\frac{f(g(a+th)) - f(g(a)) - D_{g(a)}f(g(a+th)-g(a))}{\lVert g(a+th)-g(a) \rVert}\cdot \frac{\lVert g(a+th)-g(a) \rVert}{ t} = 0,\]
and thus
\[\lim_{t \to 0}\frac{f(g(a+th)) - f(g(a)) - D_{g(a)}f(g(a+th)-g(a))}{ t } = 0,\]
and furthermore
\[\lim_{t \to 0}\frac{f(g(a+th)) - f(g(a)) - t \cdot D_{g(a)}f\left(\frac{g(a+th)-g(a)}{t}\right)}{ t } = 0.\]
This means precisely that \(\lVert \varepsilon_f((1/t) \cdot (g(a+th)-g(a))) \rVert \in o(t)\).
It thus follows that \(\varepsilon_{f \circ g}(t) = D_{g(a)}f(\varepsilon_g(t)) + \varepsilon_f((1/t) \cdot (g(a+th)-g(a))) \in o(t)\). \(\square\)
The directional derivative satisfies a few other important properties. The first is constant factorization.
Theorem 16. For any differentiable map \(f : X \rightarrow Y\) between normed vector spaces \(X\) and \(Y\), point \(a \in X\), constant scalar \(c \in \mathbb{R}\), and vector \(h \in X\),
\[D_a(c \cdot f)(h) = c \cdot D_af(h).\]
Proof. By definition,
\begin{align}D_a(c \cdot f)(h) &= \lim_{t \to 0}\frac{c \cdot f(a+th)-c \cdot f(a)}{t} \\&= c \cdot \lim_{t \to 0}\frac{f(a+th)- f(a)}{t} \\&= c \cdot D_af(h).\end{align}
\(\square\)
Another property satisfied by directional derivatives is the sum rule.
Theorem 16. For any pair of differentiable maps \(f, g:X \rightarrow Y\), point \(a \in X\), and vector \(h \in X\),
\[D_a(f+g)(h) = D_af(h) + D_ag(h).\]
Proof. By definition,
\begin{align}D_a(f+g)(h) &= \lim_{t \to 0}\frac{(f+g)(a+th)-(f+g)(a)}{t} \\&= \lim_{t \to 0}\frac{f(a+th)+g(a+th)-f(a)-g(a)}{t} \\&= \lim_{t \to 0}\frac{f(a+th)-f(a)}{t}+\lim_{t \to 0}\frac{g(a+th)-g(a)}{t} \\&= D_af(h) + D_ag(h).\end{align}
\(\square\)
Directional derivatives also satisfy a product rule.
Theorem 17. For any pair of differentiable maps \(f,g : X \rightarrow \mathbb{R}\), point \(a \in X\), and vector \(h \in X\),
\[D_a(fg)(h) = f(a) \cdot D_ag(h) + g(a) \cdot D_af(h) .\]
Proof. Consider the following:
\begin{align}D_a(fg)(h) &= \lim_{t \to 0}\frac{(fg)(a+th)-(fg)(a)}{t}\\&= \lim_{t \to 0}\frac{f(a+th)g(a+th)-f(a)g(a)}{t}\\&= \lim_{t \to 0}\frac{f(a+th)g(a+th)-f(a)g(a) + (f(a+th)g(a) - f(a+th)g(a))}{t}\\&=\lim_{t \to 0}\frac{f(a+th)(g(a+th)-g(a)) + g(a)(f(a+th)-f(a))}{t}\\&= \left[\lim_{t \to 0}f(a+th)\lim_{t \to 0}\frac{g(a+th)-g(a)}{t}\right] + \left[g(a)\lim_{t \to 0}\frac{f(a+th)-f(a)}{t}\right]\\&= f(a)D_ag(h) + g(a)D_af(h)\end{align}
This uses the fact that \(\lim_{t \to 0}f(a+th) = f(a)\) since \(f\) is differentiable and therefore continuous. \(\square\)
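All three rules can likewise be confirmed numerically. The sketch below checks constant factorization, the sum rule, and the product rule by finite differences; the particular maps, point, and direction are illustrative assumptions:

```python
import numpy as np

def directional(F, a, h, t=1e-6):
    # Central finite-difference approximation of D_a F(h).
    return (F(a + t * h) - F(a - t * h)) / (2 * t)

# Two sample scalar-valued maps on R^2.
f = lambda x: np.sin(x[0]) * x[1]
g = lambda x: x[0] ** 2 + np.cos(x[1])

a = np.array([0.5, 1.5])
h = np.array([2.0, -1.0])
c = 3.0

# Constant factorization (Theorem 15):
print(directional(lambda x: c * f(x), a, h), c * directional(f, a, h))
# Sum rule (Theorem 16):
print(directional(lambda x: f(x) + g(x), a, h),
      directional(f, a, h) + directional(g, a, h))
# Product rule (Theorem 17):
print(directional(lambda x: f(x) * g(x), a, h),
      f(a) * directional(g, a, h) + g(a) * directional(f, a, h))
```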
Partial Derivatives
Partial derivatives are an important special case of directional derivatives in finite-dimensional normed vector spaces. A partial derivative is a directional derivative in the direction of a basis vector.
Definition 17 (Partial Derivative). Given a continuous map \(f : X \rightarrow Y\) between normed vector spaces \(X\) and \(Y\), where \(X\) has dimension \(n\), and a basis \((e_i)\) for \(X\), the partial derivative of \(f\) with respect to the \(i\)-th coordinate function \(x^i\) is defined as follows:
\[\frac{\partial f}{\partial x^i}(a) = D_af(e_i).\]
In the case of functions \(f : \mathbb{R} \rightarrow Y\) and the standard basis \(e_1 = 1\) for \(\mathbb{R}\), the notation
\[\frac{df}{dx}(a) = D_af(e_1) = D_af(1)\]
is used. Expanding this definition, it follows that
\[\frac{df}{dx}(a) = \lim_{t \to 0}\frac{f(a + t) - f(a)}{t}.\]
When this expression is applied to functions \(f : \mathbb{R} \rightarrow \mathbb{R}\), it recovers the definition of the classical derivative. Thus, the classical derivative is a special case of a partial derivative.
This also means that the directional derivative can alternatively be defined as follows:
\[D_af(h) = \frac{d}{dt}\bigg\rvert_0 f(a + th).\]
Writing \(g(t) = f(a + th)\), this means that
\[D_af(h) = \frac{dg}{dt}(0) = \lim_{t \to 0}\frac{g(0 + t) - g(0)}{t} = \lim_{t \to 0}\frac{f(a + th) - f(a)}{t}.\]
Directional derivatives are linear combinations of partial derivatives, just as vectors are linear combinations of basis vectors, since
\begin{align}D_af(h) &= D_af(h^ie_i) \\&= h^iD_af(e_i) \\&= h^i\frac{\partial f}{\partial x^i}(a).\end{align}
If the codomain \(Y\) of the continuous map \(f : X \rightarrow Y\) is an \(m\)-dimensional space with basis \((e_j)\), then the directional derivative can be analyzed even further. Since limits in finite-dimensional normed vector spaces are computed component-wise, it follows that \(D_af(h) = D_af^j(h) \cdot e_j\), and thus
\begin{align}D_af(h) &= D_af^j(h^ie_i) \cdot e_j \\&= h^iD_af^j(e_i) \cdot e_j\\&= h^i\frac{\partial f^j}{\partial x^i}(a) \cdot e_j.\end{align}
This implies that we may represent \(D_af\) in coordinates as an \(m \times n\) matrix, called the Jacobian matrix, whose entry in row \(j\) and column \(i\) is
\[\frac{\partial f^j}{\partial x^i}(a),\]
since the product
\[\begin{bmatrix}\frac{\partial f^1}{\partial x^1}(a) & \dots & \frac{\partial f^1}{\partial x^n}(a) \\ \vdots & \ddots & \vdots \\ \frac{\partial f^m}{\partial x^1}(a) & \dots & \frac{\partial f^m}{\partial x^n}(a)\end{bmatrix} \begin{bmatrix}h^1 \\ \vdots \\ h^n\end{bmatrix}\]
computes the coordinates for \(D_af(h)\) in the given basis.
The chain rule then implies that the Jacobian matrix representing \(D_a(f \circ g)\) is the product of the Jacobian matrices representing \(D_{g(a)}f\) and \(D_ag\) in the respective bases. Explicitly, if \(g : X \rightarrow Y\) and \(f : Y \rightarrow Z\) with \(\dim X = p\), \(\dim Y = n\), and \(\dim Z = m\), and with coordinates \((x^k)\) on \(X\) and \((y^i)\) on \(Y\), the product is
\[\begin{bmatrix}\frac{\partial f^1}{\partial y^1}(g(a)) & \dots & \frac{\partial f^1}{\partial y^n}(g(a)) \\ \vdots & \ddots & \vdots \\ \frac{\partial f^m}{\partial y^1}(g(a)) & \dots & \frac{\partial f^m}{\partial y^n}(g(a))\end{bmatrix} \begin{bmatrix}\frac{\partial g^1}{\partial x^1}(a) & \dots & \frac{\partial g^1}{\partial x^p}(a) \\ \vdots & \ddots & \vdots \\ \frac{\partial g^n}{\partial x^1}(a) & \dots & \frac{\partial g^n}{\partial x^p}(a)\end{bmatrix}.\]
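Both facts, namely that the Jacobian-vector product computes the directional derivative and that the Jacobian of a composite is the product of the Jacobians, can be checked with a small finite-difference sketch. The maps below are arbitrary smooth examples assumed for demonstration:

```python
import numpy as np

def jacobian(F, a, t=1e-6):
    # Approximate the Jacobian of F at a; column i holds the partial
    # derivative of F with respect to the i-th coordinate.
    a = np.asarray(a, dtype=float)
    cols = []
    for i in range(a.size):
        e = np.zeros_like(a)
        e[i] = 1.0
        cols.append((F(a + t * e) - F(a - t * e)) / (2 * t))
    return np.column_stack(cols)

# Sample maps g : R^2 -> R^3 and f : R^3 -> R^2.
g = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])
f = lambda y: np.array([y[0] + y[1] * y[2], np.exp(y[0])])

a = np.array([0.4, 1.2])
h = np.array([1.0, -2.0])

J_g = jacobian(g, a)     # 3 x 2 matrix
J_f = jacobian(f, g(a))  # 2 x 3 matrix

# The Jacobian-vector product computes the directional derivative ...
print(jacobian(lambda x: f(g(x)), a) @ h)
# ... and the Jacobian of f ∘ g is the product of the two Jacobians.
print(J_f @ J_g @ h)
```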
Functional Differentials
The functional differential or total functional derivative is simply the total derivative of a map whose domain is a space of functions and whose codomain is \(\mathbb{R}\). "Functional" is a generic term which typically refers to a map from a certain type of vector space, usually a space of functions, into \(\mathbb{R}\). For instance, the space \(C[a,b]\) of continuous maps \(f : [a,b] \rightarrow \mathbb{R}\) might be taken as the domain, and then a functional is a map \(F : C[a,b] \rightarrow \mathbb{R}\), whose value at \(f\) is customarily written \(F[f]\). Thus, functionals can be conceived as higher-order functions which accept functions as input and produce numbers as output.
Definition 18 (Functional Differential / Total Functional Derivative). The functional differential or total functional derivative of a functional \(F : B \rightarrow \mathbb{R}\) defined on a Banach space \(B\), at a point \(f \in B\), is the total derivative of \(F\) at the point \(f\): a bounded linear functional denoted \(\delta_f F : B \rightarrow \mathbb{R}\) (abbreviated \(\delta_f\) when \(F\) is clear from context) satisfying, for each \(\varphi \in B\),
\[F[f + \varphi] = F[f] + \delta_f[\varphi] + \varepsilon[\varphi],\]
where
\[\varepsilon[\varphi] = F[f + \varphi] - F[f] - \delta_f[\varphi],\]
and \(\lvert \varepsilon[\varphi] \rvert \in o\left(\lVert \varphi \rVert_B\right)\), meaning that
\[\lim_{\varphi \to 0}\frac{\lvert \varepsilon[\varphi] \rvert}{\lVert \varphi \rVert_B} = 0.\]
Thus, the functional differential is just the total derivative of a map between a normed vector space of functionals and \(\mathbb{R}\).
The total derivative may not always exist, however, so many authors instead define the functional differential to be the directional derivative.
The most natural way to extract a single number from an entire function is to integrate it in some manner. Thus, functionals are typically expressed as integrals.
Example 8. Consider the functional \(F : C[a,b] \rightarrow \mathbb{R}\) defined as follows for every \(f \in C[a,b]\):
\[F[f] = \int_a^b\left(f(x)\right)^2~dx.\]
Let \(\varphi \in C[a,b]\) and consider \(\delta_f[\varphi]\). Here, we use the supremum norm \(\lVert \varphi \rVert_{\infty}\) on \(C[a,b]\).
Since \(F[f+\varphi]=F[f] + \delta_f[\varphi] + \varepsilon[\varphi]\), this means that
\[F[f + \varphi]-F[f] \approx \delta_f[\varphi],\]
so we compute
\begin{align}F[f + \varphi]-F[f] &= \int_a^b \left(f(x)+\varphi(x)\right)^2 - \left(f(x)\right)^2~dx \\&= \int_a^b \left(f(x)\right)^2 + 2f(x)\varphi(x) + \left(\varphi(x)\right)^2 - \left(f(x)\right)^2~dx \\&= \int_a^b 2f(x)\varphi(x)~dx + \int_a^b \left(\varphi(x)\right)^2~dx.\end{align}
Thus, if we define
\[\delta_f[\varphi] = \int_a^b 2f(x)\varphi(x)~dx,\]
which is indeed a continuous (and thus bounded) linear map, it follows that
\[\varepsilon[\varphi] = \int_a^b \left(\varphi(x)\right)^2~dx,\]
which vanishes as \(\lVert \varphi \rVert_{\infty}\) approaches \(0\). To see this, suppose \(\varepsilon > 0\), and define \(\delta = \varepsilon / (b-a)\). If \(0 \lt \lVert \varphi \rVert_{\infty} \lt \delta\), then we compute
\begin{align}\frac{1}{\lVert \varphi \rVert_{\infty}} \cdot \left\lvert \int_a^b \left(\varphi(x)\right)^2~dx \right\rvert &\le \frac{1}{\lVert \varphi \rVert_{\infty}} \cdot \int_a^b \left \lvert\varphi(x)\right \rvert^2~dx \\&\le \frac{1}{\lVert \varphi \rVert_{\infty}} \cdot \int_a^b \lVert \varphi \rVert_{\infty}^2~dx \\&= \frac{1}{\lVert \varphi \rVert_{\infty}} \cdot (b-a) \cdot \lVert \varphi \rVert_{\infty}^2 \\&= (b-a) \cdot \lVert \varphi \rVert_{\infty} \\&\lt (b-a) \cdot \delta \\&= \varepsilon.\end{align}
Thus, it follows that
\[\lim_{\varphi \to 0}\frac{\varepsilon[\varphi]}{\lVert \varphi\rVert_{\infty}} = 0.\]
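Discretizing Example 8 makes this limit visible numerically. In the sketch below, the sample \(f\) and direction \(\varphi\) are arbitrary choices; the remainder ratio shrinks linearly with the scale \(s\) of the perturbation \(s\varphi\), exactly as the computation above predicts, since \(\varepsilon[s\varphi] = s^2\int_a^b \varphi^2~dx\):

```python
import numpy as np

a, b, n = 0.0, 1.0, 10_001
x = np.linspace(a, b, n)

def F(f):
    # F[f] = ∫_a^b f(x)^2 dx, approximated by the trapezoidal rule.
    return np.trapz(f ** 2, x)

def dF(f, phi):
    # The differential computed in Example 8: δ_f[φ] = ∫ 2 f φ dx.
    return np.trapz(2 * f * phi, x)

f = np.sin(2 * np.pi * x)    # a sample point f in C[a,b]
phi = np.cos(3 * np.pi * x)  # a sample perturbation φ in C[a,b]

# The remainder ratio |ε[sφ]| / ||sφ||_∞ vanishes as sφ shrinks.
for s in [1e-1, 1e-2, 1e-3]:
    eps = F(f + s * phi) - F(f) - dF(f, s * phi)
    print(s, abs(eps) / np.max(np.abs(s * phi)))
```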
Functional Derivatives
Directional derivatives can be defined for functionals since they are maps between normed vector spaces. Since spaces of functionals are generally infinite-dimensional, the definition of partial derivatives given above does not apply. However, it is natural to ask whether there is some analogous notion for functionals.
We previously established that, for an \(n\)-dimensional vector space \(X\) with basis \((e_i)\) and coordinate functions \((x^i)\), the total derivative \(df_a\) of a continuous map \(f : X \rightarrow \mathbb{R}\) at a point \(a \in X\) is given by
\[df_a(h) = \sum_{i=1}^n \frac{\partial f}{\partial x^i}(a) \cdot h^i.\]
Then, by formal analogy, if we replace the sum over the index \(i\) with an integral over the entire domain \(\Omega\), replace the finite set of coordinates \((h^i)\) with a function \(\varphi(x)\), and replace the point \(a\) with the function \(f\), and if we denote the formal analogue of the partial derivative by \((\delta F/\delta f)(x)\), we obtain
\[\delta_f[\varphi] = \int_{\Omega} \frac{\delta F}{\delta f}(x) \cdot \varphi(x)~dx.\]
A functional derivative is precisely such a formal analog of the partial derivative. The terminology is perhaps misleading, since one might expect that a "functional derivative" means the (total) derivative \(\delta_f F\) of a functional \(F\).
Definition 19 (Functional Derivative). The functional derivative of a functional \(F : [\Omega, \mathbb{R}] \rightarrow \mathbb{R}\) with respect to \(f : \Omega \rightarrow \mathbb{R}\) is a function denoted
\[\frac{\delta F}{\delta f} : \Omega \rightarrow \mathbb{R}\]
such that for all \(\varphi : \Omega \rightarrow \mathbb{R}\)
\[\delta_f[\varphi] = \int_{\Omega} \frac{\delta F}{\delta f}(x) \cdot \varphi(x)~dx.\]
Example 9. Consider the functional \(F : C[a,b] \rightarrow \mathbb{R}\) defined as follows for every \(f \in C[a,b]\):
\[F[f] = \int_a^b\left(f(x)\right)^2~dx.\]
We previously determined that
\[\delta_f F[\varphi] = \int_a^b 2f(x)\varphi(x)~dx.\]
Thus, it follows that
\[\frac{\delta F}{\delta f}(x) = 2f(x).\]
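The functional derivative can also be recovered numerically: perturb \(f\) by a narrow bump of unit area concentrated near a point \(x\), and the first-order response of \(F\) approximates \((\delta F/\delta f)(x)\). A minimal sketch, with an arbitrarily chosen sample \(f\) (the grid and bump construction are assumptions of the discretization):

```python
import numpy as np

a, b, n = 0.0, 1.0, 1001
x = np.linspace(a, b, n)
dx = x[1] - x[0]

def F(f):
    # F[f] = ∫_a^b f(x)^2 dx, approximated by the trapezoidal rule.
    return np.trapz(f ** 2, x)

f = np.exp(x)  # a sample f

# A discrete "bump" of unit area at the midpoint x = 0.5; the response
# (F[f + s·bump] - F[f]) / s approximates (δF/δf)(0.5).
i = n // 2
bump = np.zeros(n)
bump[i] = 1.0 / dx
s = 1e-6

print((F(f + s * bump) - F(f)) / s)  # ≈ 2 f(0.5) = 2 e^{1/2} ≈ 3.2974
print(2 * f[i])                      # the predicted value 2 f(x)
```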
Euler-Lagrange Formula
The Euler-Lagrange formula provides a method for computing functional derivatives subject to certain boundary conditions. This formula also yields an equation which permits calculating the extrema of functionals subject to the boundary conditions. It is thus useful for optimization problems.
We consider a wide class of functionals whose integrands are parameterized by a variable \(x\), the value of a function \(f(x)\), and the value of its first derivative \(f'(x)\), which we denote as \(L(x,f(x),f'(x))\), so that the functionals under consideration all have the form
\[F[f] = \int_a^b L(x,f(x),f'(x))~dx.\]
In other words, \(x \in \mathbb{R}\), \(f, f' : \mathbb{R} \rightarrow \mathbb{R}\), \(L : \mathbb{R}^3 \rightarrow \mathbb{R}\) is the map \((y_1, y_2, y_3) \mapsto L(y_1,y_2,y_3)\), and \(L(x,f(x),f'(x))\) denotes the composite map \(x \mapsto (x,f(x),f'(x)) \mapsto L(x,f(x),f'(x))\).
The function \(L\) is typically assumed to be sufficiently differentiable (for instance, at least twice differentiable).
Our goal is to compute the functional derivative of \(F\) at \(f\) subject to the generic boundary conditions \(f(a) = A\) and \(f(b) = B\).
We seek a solution to the equation
\[\frac{d}{dt}\bigg\lvert_0 F[f + t \cdot \varphi] = \int_a^b \frac{\delta F}{\delta f}(x) \cdot \varphi(x)~dx.\]
We restrict the class of admissible "test" functions \(\varphi\) to those such that \((f + t \cdot \varphi)(a) = f(a) = A\) and \((f + t \cdot \varphi)(b) = f(b) = B\), which implies that \(\varphi(a) = 0\) and \(\varphi(b) = 0\).
We then calculate
\begin{align}\frac{d}{dt}\bigg\lvert_0 F[f + t \cdot \varphi] &= \frac{d}{dt}\bigg\lvert_0\int_a^b L(x,f(x) + t \cdot \varphi(x),f'(x) + t \cdot \varphi'(x))~dx \\&= \int_a^b \frac{d}{dt}\bigg\lvert_0 L(x,f(x) + t \cdot \varphi(x),f'(x) + t \cdot \varphi'(x))~dx \\&= \int_a^b \left[\varphi(x) \cdot \frac{\partial L}{\partial y_2}(x,f(x)+t\cdot\varphi(x),f'(x)+t\cdot\varphi'(x)) + \varphi'(x) \cdot \frac{\partial L}{\partial y_3}(x,f(x)+t\cdot\varphi(x),f'(x)+t\cdot\varphi'(x)) \right]_{t=0} ~dx\\&= \int_a^b \varphi(x) \cdot \frac{\partial L}{\partial y_2}(x,f(x),f'(x)) + \varphi'(x) \cdot\frac{\partial L}{\partial y_3}(x,f(x),f'(x)) ~dx\end{align}
Next, we can apply integration by parts to the second term of the integrand, using (with the arguments \((x,f(x),f'(x))\) suppressed) \(\int_a^b \varphi' \cdot \frac{\partial L}{\partial y_3}~dx = \left[\varphi \cdot \frac{\partial L}{\partial y_3}\right]_a^b - \int_a^b \varphi \cdot \frac{d}{dx}\frac{\partial L}{\partial y_3}~dx\), to obtain the following:
\[\int_a^b \left[\frac{\partial L}{\partial y_2}(x,f(x),f'(x)) - \frac{d}{dx}\frac{\partial L}{\partial y_3}(x,f(x),f'(x))\right] \cdot \varphi(x)~dx + \left[\varphi(x) \cdot \frac{\partial L}{\partial y_3}(x,f(x),f'(x))\right]_a^b.\]
Applying the boundary conditions \(\varphi(a) = 0\) and \(\varphi(b) = 0\), this yields
\[\frac{d}{dt}\bigg\lvert_0 F[f + t \cdot \varphi] = \int_a^b \left[\frac{\partial L}{\partial y_2}(x,f(x),f'(x)) - \frac{d}{dx}\frac{\partial L}{\partial y_3}(x,f(x),f'(x))\right] \cdot \varphi(x)~dx.\]
Thus, under the appropriate boundary conditions, the functional derivative is
\[\frac{\delta F}{\delta f}(x) = \frac{\partial L}{\partial y_2}(x,f(x),f'(x)) - \frac{d}{dx}\frac{\partial L}{\partial y_3}(x,f(x),f'(x)).\]
The expression
\[\frac{\partial L}{\partial y_2}(x,f(x),f'(x)) - \frac{d}{dx}\frac{\partial L}{\partial y_3}(x,f(x),f'(x))\]
is called the Euler-Lagrange formula.
Theorem 18. (The Fundamental Lemma of the Calculus of Variations) If \(f \in C[a,b]\) and if \(\int_a^b f(x)h(x)~dx=0\) for all \(h \in C[a,b]\) with \(h(a) = h(b) = 0\), then \(f(x) = 0\) for all \(x \in [a,b]\).
Proof. We will prove the contrapositive. Let \(f \in C[a,b]\), and suppose that \(f\) is non-zero somewhere in \([a,b]\). Since \(f\) is continuous, there is some interval \([c,d] \subset [a,b]\) on which \(f\) is everywhere non-zero and has constant sign. Define a function \(h\) such that \(h(x) = (x - c)(d - x)\) if \(x \in [c,d]\) and \(h(x) = 0\) otherwise. Then, \(h \in C[a,b]\) and \(h(a) = h(b) = 0\). Then
\[\int_a^b f(x)h(x)~dx = \int_c^d f(x)h(x)~dx,\]
and since \(h(x) > 0\) on the open interval \((c,d)\), it follows that the integral is positive when \(f\) is positive on \([c,d]\) and negative when \(f\) is negative, so \(\int_c^d f(x)h(x)~dx \ne 0\). \(\square\)
The Euler-Lagrange formula is often used to find the extrema of functionals. At an extremum, for every admissible \(\varphi\),
\[\frac{d}{dt}\bigg\lvert_0 F[f + t \cdot \varphi] = \int_a^b \frac{\delta F}{\delta f}(x) \cdot \varphi(x)~dx = 0.\]
Thus, by the Euler-Lagrange formula, under the boundary conditions, it follows that
\[\int_a^b \left[\frac{\partial L}{\partial y_2}(x,f(x),f'(x)) - \frac{d}{dx}\frac{\partial L}{\partial y_3}(x,f(x),f'(x))\right] \cdot \varphi(x)~dx = 0,\]
and applying the Fundamental Lemma of the Calculus of Variations, it follows that
\[\frac{\partial L}{\partial y_2}(x,f(x),f'(x)) - \frac{d}{dx}\frac{\partial L}{\partial y_3}(x,f(x),f'(x)) = 0.\]
This equation is called the Euler-Lagrange equation.
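For integrands simple enough to differentiate symbolically, the Euler-Lagrange expression can be assembled mechanically. The following sympy sketch applies the recipe to the arclength integrand used in Example 10 below; the symbol names mirror the \((y_1,y_2,y_3)\) convention of this section:

```python
import sympy as sp

x = sp.symbols('x')
f = sp.Function('f')
y1, y2, y3 = sp.symbols('y1 y2 y3')

# Integrand of the arclength functional from Example 10 below.
L = sp.sqrt(1 + y3 ** 2)

# Substitute y1 = x, y2 = f(x), y3 = f'(x), then form the
# Euler-Lagrange expression ∂L/∂y2 - d/dx ∂L/∂y3.
subs = {y1: x, y2: f(x), y3: sp.diff(f(x), x)}
EL = sp.diff(L, y2).subs(subs) - sp.diff(sp.diff(L, y3).subs(subs), x)

# Prints -f''(x)/((f'(x))^2 + 1)^(3/2); setting this to zero forces
# f''(x) = 0, anticipating the straight-line solution derived below.
print(sp.simplify(EL))
```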
Example 10. We will apply the Euler-Lagrange equation to compute the shortest path between two points \(A\) and \(B\) on the Euclidean plane \(\mathbb{R}^2\). Each path is represented as a curve, that is, as a function \(f : [a,b] \rightarrow \mathbb{R}\). The curve can be conceived as the graph of \(f\), i.e. the set of points \((x, f(x)) \in \mathbb{R}^2\). The curve can thus be parameterized by the map \(r(x) = (r^1(x),r^2(x)) = (x,f(x))\). To compute the length of the curve, we use a line integral:
\[F[f] = \int_a^b \lVert r'(x) \rVert~dx.\]
Note that
\begin{align}\lVert r'(x) \rVert &= \left\lVert \left(\frac{\partial r^1}{\partial x}(x), \frac{\partial r^2}{\partial x}(x)\right) \right\rVert \\&= \left\lVert \left(1, \frac{d f}{d x}(x) \right) \right\rVert \\&= \sqrt{1 + (f'(x))^2},\end{align}
so the integral can also be written as
\[F[f] = \int_a^b L(x,f(x),f'(x))~dx = \int_a^b \sqrt{1 + (f'(x))^2}~dx,\]
where \(L\) is the map
\[L(y_1,y_2,y_3) = \sqrt{1 + y_3^2}.\]
The map \(F\) is thus a functional. Our goal is to find the function \(f\) that minimizes the curve length, i.e. the minimum of the functional \(F\).
We compute
\[\frac{\partial L}{\partial y_2} = 0,\]
and
\[\frac{\partial L}{\partial y_3} = \frac{y_3}{\sqrt{1 + y_3^2}},\]
so that
\[\frac{\partial L}{\partial y_3}(x,f(x),f'(x)) = \frac{f'(x)}{\sqrt{1 + (f'(x))^2}}.\]
Writing \(L(x,f(x),f'(x))\) as an abbreviation for the function \(x \mapsto L(x,f(x),f'(x))\), the Euler-Lagrange equation yields
\[\frac{\partial L}{\partial y_2}(x,f(x),f'(x)) - \frac{d}{dx}\frac{\partial L}{\partial y_3}(x,f(x),f'(x)) = 0 - \frac{d}{dx}\frac{f'(x)}{\sqrt{1 + (f'(x))^2}} = 0,\]
so
\[ \frac{d}{dx}\frac{f'(x)}{\sqrt{1 + (f'(x))^2}} = 0.\]
This yields a differential equation, which we solve by integrating both sides to obtain
\[\frac{f'(x)}{\sqrt{1 + (f'(x))^2}} = C\]
for some constant of integration \(C\).
It then follows that
\[\frac{(f'(x))^2}{1 + (f'(x))^2} = C^2,\]
and so, provided \(C \neq 0\) (if \(C = 0\), then \(f'(x) = 0\) identically and \(f\) is constant, which is already a straight line),
\[\frac{1}{C^2} = \frac{1 + (f'(x))^2}{(f'(x))^2} = \frac{1}{(f'(x))^2} + 1.\]
Then
\[\frac{1}{(f'(x))^2} = \frac{1}{C^2} - 1 = \frac{1-C^2}{C^2},\]
so
\[(f'(x))^2 = \frac{C^2}{1-C^2}\]
and
\[f'(x) = \pm \sqrt{\frac{C^2}{1-C^2}}.\]
Since \(f'\) is continuous, it must be everywhere equal to a single constant \(\alpha\), which is either the positive or the negative root. Then, integrating the differential equation
\[f'(x) = \alpha,\]
we obtain
\[f(x) = \alpha \cdot x + \beta\]
for some constant of integration \(\beta\). Thus, the shortest path between two points in the Euclidean plane is a straight line.
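As a closing numerical check, discretizing the length functional shows that bending the straight line strictly increases its length. The endpoints \(f(0) = 0\) and \(f(1) = 1\) and the particular perturbation below are illustrative assumptions:

```python
import numpy as np

# Discretized curve length F[f] = ∫ sqrt(1 + f'(x)^2) dx on [0, 1].
x = np.linspace(0.0, 1.0, 1001)

def length(f):
    return np.trapz(np.sqrt(1.0 + np.gradient(f, x) ** 2), x)

straight = x                             # f(x) = αx + β with α = 1, β = 0
perturbed = x + 0.1 * np.sin(np.pi * x)  # same endpoints, bent path

print(length(straight))   # ≈ sqrt(2) ≈ 1.41421
print(length(perturbed))  # strictly larger, as the theory predicts
```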