Total Derivatives
This post contains a detailed explanation of the concept of the total derivative of a function of several variables. It attempts to explain the total derivative in a way which relates it to concepts from differential geometry (such as tangent spaces).

The total derivative \(df_a\) of a function \(f : \mathbb{R}^n \rightarrow \mathbb{R}^m\) at a point \(a \in \mathbb{R}^n\) represents the best linear approximation of \(f\) at \(a\). But what exactly does this mean?
Well, we want \(df_a\) to be a linear map. So, first, we have to define the vector spaces that will serve as the domain and codomain of this linear map.
The Tangent Space
Let's consider the tangent space \(T_a\mathbb{R}^n\), a vector space which represents the best linear approximation of \(\mathbb{R}^n\) at the point \(a\). Since \(\mathbb{R}^n\) is already a vector space, the best linear approximation at the point \(a\) is just the translation of \(\mathbb{R}^n\) to the point \(a\) (i.e. a vector space isomorphic to \(\mathbb{R}^n\) "centered" at \(a\)). In other words, it is a vector space whose underlying set of vectors is also \(\mathbb{R}^n\), but whose zero vector is \(a\) and whose addition and scalar multiplication operations are redefined accordingly.
We can map a vector \(v \in T_a\mathbb{R}^n\) to a vector in \(\mathbb{R}^n\) by subtracting \(a\):
\[T_a(v) = v - a\]
Conversely, we can map a vector \(v \in \mathbb{R}^n\) to a vector in \(T_a\mathbb{R}^n\) by adding \(a\):
\[T_a^{-1}(v) = v + a\]
Each of these maps is a bijection, and once the vector space operations on \(T_a\mathbb{R}^n\) are defined (below), each is a linear isomorphism.
Then, we can define addition in \(T_a\mathbb{R}^n\); we simply translate to \(\mathbb{R}^n\), add, then translate back to \(T_a\mathbb{R}^n\):
\begin{align*}v + w &= T_a^{-1}(T_a(v) + T_a(w))\\ &= (v-a)+(w-a)+a\\ &= v+w-a\end{align*}
We can define scalar multiplication in \(T_a\mathbb{R}^n\); we simply translate to \(\mathbb{R}^n\), multiply, then translate back to \(T_a\mathbb{R}^n\):
\[sv = T_a^{-1}(s T_a(v)) = s(v - a) + a.\]
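The transported operations above can be sketched concretely. Here is a minimal Python sketch, assuming points and tangent vectors are represented as tuples of floats (the function names are my own, chosen for illustration):

```python
# A minimal sketch of the tangent space T_a R^n, assuming points and
# vectors are represented as tuples of floats.

def to_ambient(a, v):
    # T_a : T_a R^n -> R^n, translate by subtracting a
    return tuple(vi - ai for vi, ai in zip(v, a))

def from_ambient(a, v):
    # T_a^{-1} : R^n -> T_a R^n, translate by adding a
    return tuple(vi + ai for vi, ai in zip(v, a))

def tangent_add(a, v, w):
    # v + w in T_a R^n: translate down, add, translate back (= v + w - a)
    s = tuple(x + y for x, y in zip(to_ambient(a, v), to_ambient(a, w)))
    return from_ambient(a, s)

def tangent_scale(a, s, v):
    # s * v in T_a R^n: translate down, scale, translate back (= s(v - a) + a)
    return from_ambient(a, tuple(s * x for x in to_ambient(a, v)))

a = (1.0, 2.0)
# a itself acts as the zero vector of T_a R^2:
assert tangent_add(a, a, (5.0, -3.0)) == (5.0, -3.0)
assert tangent_scale(a, 0.0, (5.0, -3.0)) == a
```

The assertions confirm that \(a\) plays the role of the zero vector under the transported operations.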
Likewise, we can define a linear map between tangent spaces \(\bar{d}f_a : T_a\mathbb{R}^n \rightarrow T_{f(a)}\mathbb{R}^m\) in terms of a linear map \(df_a : \mathbb{R}^n \rightarrow \mathbb{R}^m\); we simply translate from \(T_a\mathbb{R}^n\) to \(\mathbb{R}^n\), then apply \(df_a\), then translate from \(\mathbb{R}^m\) to \(T_{f(a)}\mathbb{R}^m\) as follows:
\begin{align*}\bar{d}f_a(h) &= T_{f(a)}^{-1}(df_a(T_a(h)))\\ &= f(a) + df_a(h - a) \end{align*}
This means that the following diagram commutes:
\begin{CD} T_a\mathbb{R}^n @>\bar{d}f_a>> T_{f(a)}\mathbb{R}^m\\ @VVT_aV @AAT_{f(a)}^{-1}A\\ \mathbb{R}^n @>df_a>> \mathbb{R}^m \end{CD}
Moreover, the map \(\bar{d}f_a\) is canonically isomorphic to the map \(df_a\) (the previous diagram commutes and each of the maps \(T_a\) and \(T_{f(a)}^{-1}\) is an isomorphism), so we can use them interchangeably.
The Error Function
We want \(\bar{d}f_a\) to approximate \(f\):
\[f(h) \approx f(a) + df_a(h - a)\]
The error \(\varepsilon\) in the approximation is thus given by
\[\varepsilon(h) = f(h) - f(a) - df_a(h - a)\]
so that there is an equation
\[f(h) = f(a) + df_a(h-a) + \varepsilon(h) \]
Now, the "best" approximation minimizes the error is some sense. Certainly, we want the error to become arbitrarily close to \(0\) as the input \(h\) approaches \(a\), i.e. that \(\lim_{h \to a}\varepsilon(h) = 0\). However, if \(f\) is continuous, this is true for all linear maps, and thus it cannot serve as a definition. Note that, since all linear maps on finite-dimensional normed vector spaces are continuous (which means that \(\lim_{h \to a}df_a(h) = df_a(a)\)),
\begin{align}\lim_{h \to a}df_a(h - a) &= \lim_{h \to a}df_a(h) - df_a(a)\\ &= df_a(a) - df_a(a)\\ &= 0.\end{align}
By definition, the continuity of \(f\) means that \(\lim_{h \to a}f(h) = f(a)\). Then
\begin{align}\lim_{h \to a}\varepsilon(h) &= \lim_{h \to a}f(h) - f(a) - df_a(h - a)\\&= f(a) - f(a) - \lim_{h \to a}df_a(h - a)\\&= 0.\end{align}
However, we can require that the "best" approximation minimize the error "fastest". The function \(||h-a||\) indicates the distance of a vector \(h\) from \(a\) in the domain. The function \(\varepsilon\) indicates the error in the linear approximation in the codomain. We can require that the error in the linear approximation in the codomain converge "faster" to \(0\) than the distance in the domain converges to \(0\), i.e.:
\[||\varepsilon(h)|| \in o(||h - a||)\]
which is little-\(o\) notation, meaning that
\[\lim_{h \to a} \frac{||\varepsilon(h)||}{||h - a||} = 0 \]
Intuitively, this means that, as \(h\) approaches \(a\), \(||\varepsilon(h)||\) gets much smaller than \(||h-a||\), since the ratio vanishes.
Writing this explicitly, we obtain
\[\lim_{h \to a}\frac{||f(h) - f(a) - df_a(h - a)||}{||h-a||} = 0.\]
The Total Derivative Defined
We have thus arrived at the following definition.
Definition (Total Derivative) The total derivative of a function \(f : \mathbb{R}^n \rightarrow \mathbb{R}^m\) at a point \(a \in \mathbb{R}^n \), if it exists, is the unique linear map \(df_a : \mathbb{R}^n \rightarrow \mathbb{R}^m\) such that
\[\lim_{h \to a}\frac{||f(h) - f(a) - df_a(h - a)||}{||h-a||} = 0.\]
A function is differentiable at \(a\) if the total derivative at \(a\) exists, and is differentiable if the total derivative exists at every point of its domain.
We will establish that it is indeed unique later.
Note that we state the definition for the entire domain \(\mathbb{R}^n\), but it works equally well for functions \(f : U \rightarrow \mathbb{R}^m\) defined on an open subset \(U \subseteq \mathbb{R}^n\).
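To see the defining limit in action, here is a small numerical sketch in Python; the sample map \(f(x,y) = (x^2, xy)\), the point \(a = (1,2)\), and the candidate derivative are assumptions chosen for illustration:

```python
import math

# Numerically check the defining limit for f(x, y) = (x^2, x*y) at a = (1, 2),
# whose total derivative should be the linear map df_a(h) = (2*h1, 2*h1 + h2).

def f(x, y):
    return (x * x, x * y)

def df_a(h1, h2):
    # candidate linear map (the Jacobian of f at (1, 2) applied to h)
    return (2 * h1, 2 * h1 + h2)

a = (1.0, 2.0)
ratios = []
for k in range(1, 6):
    t = 10.0 ** (-k)
    h = (a[0] + t, a[1] + t)          # approach a along a diagonal
    fa, fh = f(*a), f(*h)
    d = df_a(h[0] - a[0], h[1] - a[1])
    err = tuple(fh[j] - fa[j] - d[j] for j in range(2))
    # ||eps(h)|| / ||h - a||, which should shrink toward 0
    ratios.append(math.hypot(*err) / math.hypot(h[0] - a[0], h[1] - a[1]))

assert all(r2 < r1 for r1, r2 in zip(ratios, ratios[1:]))
```

Here the ratio shrinks roughly linearly in \(||h - a||\), consistent with the little-\(o\) condition.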
Limits in Normed Vector Spaces
Let's review some useful facts about limits in normed vector spaces.
Theorem For any function \(f : V \rightarrow W\) between normed vector spaces \(V\) and \(W\), \(\lim_{h \to a}f(h) = 0\) if and only if \(\lim_{h \to a}||f(h)|| = 0\).
Proof. In fact, these two statements say the same thing, namely, that, for any \(\varepsilon > 0\), there exists a \(\delta > 0\) such that \(||f(h)|| \lt \varepsilon\) whenever \(||h-a|| \lt \delta\).
Theorem For any function \(f : V \rightarrow W\) between normed vector spaces \(V\) and \(W\), \(\lim_{h \to a}f(h) = \lim_{h \to 0}f(a+h)\).
Proof. Suppose \(\lim_{h \to a}f(h) = L\) and \(\varepsilon \gt 0\). Then, there is a \(\delta \gt 0\) such that \(||f(h) - L|| \lt \varepsilon\) whenever \(||h-a|| \lt \delta\). Now suppose \(v \in V\) with \(||v|| \lt \delta\). Then \(||(v+a)-a|| = ||v|| \lt \delta\), and thus \(||f(v+a) - L|| \lt \varepsilon\), and so \(\lim_{h \to 0}f(a+h) = L\). Conversely, suppose \(\lim_{h \to 0}f(a+h) = L\) and \(\varepsilon \gt 0\). Then, there is a \(\delta \gt 0\) such that \(||f(a+h) - L|| \lt \varepsilon\) whenever \(||h|| \lt \delta\). Now suppose \(||v-a|| \lt \delta\). Then \(||f(a+(v-a)) - L|| = ||f(v) - L|| \lt \varepsilon\), and so \(\lim_{h \to a}f(h) = L\).
Theorem For any function \(f : \mathbb{R}^n \rightarrow \mathbb{R}^m\), \(\lim_{h \to a}f(h) = \left(\lim_{h \to a}f^j(h)\right)\cdot e_j\), where \(f^j\) is the \(j\)-th component function, \(e_j\) is the \(j\)-th standard basis vector, and the summation convention is used; in other words, limits are computed component-wise.
Proof. Suppose that \(\lim_{h \to a}f(h) = L\) and \(\varepsilon > 0\). Then, using the Euclidean norm, there is some \(\delta \gt 0\) such that, whenever \(||h-a|| \lt \delta\), we have
\[||f(h) - L|| = \sqrt{\sum_{j=1}^m (f^j(h) - L^j)^2} \lt \varepsilon.\]
This means that
\[\sum_{j=1}^m (f^j(h) - L^j)^2\lt \varepsilon^2\]
and thus, since every term in the sum is nonnegative, for each index \(j\),
\[(f^j(h) - L^j)^2 \lt \varepsilon^2\]
and finally
\[|f^j(h) - L^j| \lt \varepsilon.\]
Thus, \(\lim_{h \to a}f^j(h) = L^j\).
For the converse, suppose that \(\lim_{h \to a}f^j(h) = L^j\) for all \(1 \leq j \leq m\) and \(\varepsilon \gt 0\). Then, there exist \(\delta^j \gt 0\) such that \(|f^j(h) - L^j| < \varepsilon/\sqrt{m}\) whenever \(||h-a|| \lt \delta^j\). Whenever \(||h-a|| \lt \min_{1\leq j \leq m}\delta^j\), it then follows that
\begin{align}||f(h) - L|| &= \sqrt{\sum_{j=1}^m (f^j(h) - L^j)^2}\\&\lt \sqrt{\sum_{j=1}^m \varepsilon^2/m}\\&= \varepsilon \end{align}
and so \(\lim_{h \to a}f(h) = L\).
Theorem For any function \(f : V \rightarrow W\) between normed vector spaces and any nonzero vector \(v \in V\), \(\lim_{t \to 0}f(tv) = \lim_{h \to 0}f(h)\) whenever the latter limit exists.
Proof. Suppose \(\lim_{h \to 0}f(h) = L\) and \(\varepsilon \gt 0\). Then, there exists a \(\delta \gt 0\) such that \(||f(h) - L|| \lt \varepsilon\) whenever \(||h|| \lt \delta\). If \(|t| \lt \delta/||v||\) (which is defined since \(v \neq 0\)), then \(|t|||v|| = ||tv|| < \delta\). Then \(||f(tv) - L|| \lt \varepsilon\), so \(\lim_{t \to 0}f(tv) = L\).
Alternative Definitions
The theorems on limits indicate that there are several equivalent definitions of the total derivative. For instance:
\[\lim_{h \to a}\frac{f(h) - f(a) - df_a(h - a)}{||h-a||} = 0\]
\[\lim_{h \to 0}\frac{||f(a + h) - f(a) - df_a(h)||}{||h||} = 0\]
\[\lim_{h \to 0}\frac{f(a + h) - f(a) - df_a(h )}{||h||} = 0\]
Uniqueness of the Total Derivative
In order to call \(df_a\) "the" total derivative of \(f\), we need to show that it is unique. Suppose that we have
\[\lim_{h \to a}\frac{||f(h) - f(a) - d_1f_a(h - a)||}{||h-a||} = 0\]
and
\[\lim_{h \to a}\frac{||f(h) - f(a) - d_2f_a(h - a)||}{||h-a||} = 0\]
for two linear maps \(d_1f_a,d_2f_a : \mathbb{R}^n \rightarrow \mathbb{R}^m\) (equivalently, using the \(h \to 0\) form of the definition). For any vector \(h \in \mathbb{R}^n\), we have
\begin{align}d_1f_a(h) - d_2f_a(h) &= (f(a + h) - f(a) - d_2f_a(h)) \\&- (f(a + h) - f(a) - d_1f_a(h))\end{align}.
By the triangle inequality property of norms, this implies that
\begin{align}||d_1f_a(h) - d_2f_a(h)|| &= ||(f(a + h) - f(a) - d_2f_a(h)) \\&- (f(a + h) - f(a) - d_1f_a(h))||\\&\leq ||(f(a + h) - f(a) - d_2f_a(h))|| \\&+ ||(f(a + h) - f(a) - d_1f_a(h))||\end{align}.
This implies that
\begin{align}\lim_{h \to 0} \frac{||d_1f_a(h) - d_2f_a(h)||}{||h||} &\leq \lim_{h \to 0}\frac{||f(a + h) - f(a) - d_1f_a(h)||}{||h||} \\&+ \lim_{h \to 0}\frac{||f(a + h) - f(a) - d_2f_a(h)||}{||h||} \\ &= 0.\end{align}
This means that
\[\lim_{h \to 0} \frac{||d_1f_a(h) - d_2f_a(h)||}{||h||} = 0.\]
Thus, in particular, for an arbitrary vector \(v \neq 0\),
\[\lim_{h \to 0} \frac{||d_1f_a(h) - d_2f_a(h)||}{||h||} = \lim_{t \to 0} \frac{||d_1f_a(tv) - d_2f_a(tv)||}{||tv||}\]
It then follows that
\begin{align}0 &= \lim_{t \to 0} \frac{||d_1f_a(tv) - d_2f_a(tv)||}{||tv||}\\ &= \lim_{t \to 0} \frac{||(d_1f_a - d_2f_a)(tv)||}{||tv||}\\ &= \lim_{t \to 0} \frac{||t(d_1f_a - d_2f_a)(v)||}{||tv||}\\ &= \lim_{t \to 0} \frac{|t|||(d_1f_a - d_2f_a)(v)||}{|t|||v||}\\ &= \lim_{t \to 0} \frac{||(d_1f_a - d_2f_a)(v)||}{||v||}\\ &= \frac{||(d_1f_a - d_2f_a)(v)||}{||v||}.\end{align}
This means that \(||(d_1f_a - d_2f_a)(v)|| = 0\), and thus \((d_1f_a - d_2f_a)(v) = 0\) and \(d_1f_a(v) = d_2f_a(v)\). Also, \(d_1f_a(0) = 0 = d_2f_a(0)\) since both are linear, so \(d_1f_a = d_2f_a\). Thus, the total derivative, if it exists, is necessarily unique.
Differentiable Implies Continuous
If a function is differentiable at a point, then it is also continuous at that point.
A function \(f\) is continuous at a point \(a\) if \(\lim_{h \to a}f(h) = f(a)\).
If \(f\) is differentiable at a point \(a\), then
\[\lim_{h \to a}\frac{f(h)-f(a)-df_a(h - a)}{||h-a||} = 0.\]
Note that, since all linear maps on finite-dimensional normed vector spaces are continuous,
\begin{align}\lim_{h \to a}df_a(h - a) &= \lim_{h \to a}df_a(h) - df_a(a)\\ &= df_a(a) - df_a(a)\\ &= 0.\end{align}
We then have that
\[\lim_{h \to a}\frac{f(h)-f(a)-df_a(h - a)}{||h-a||} \cdot \lim_{h \to a}||h - a|| + \lim_{h \to a}df_a(h - a) = 0\]
since \(0 \cdot 0 + 0 = 0\).
Consolidating the terms from these limits, we obtain
\[\lim_{h \to a}\left[\frac{f(h)-f(a)-df_a(h - a)}{||h-a||} ||h - a|| + df_a(h - a)\right] = 0.\]
This implies that
\[\lim_{h \to a}\left[f(h) - f(a)\right] = 0\]
and thus \(\lim_{h \to a}f(h) = f(a)\), which is precisely continuity at \(a\).
Directional Derivatives
The directional derivative \(D_af\) indicates the rate of change of a function \(f : \mathbb{R}^n \rightarrow \mathbb{R}^m\) at a point \(a \in \mathbb{R}^n\) in the direction of a vector \(v \in \mathbb{R}^n\):
\[D_af(v) = \lim_{t \to 0}\frac{f(a + tv) - f(a)}{t}.\]
Note that many authors define the directional derivative only for functions \(f : \mathbb{R}^n \rightarrow \mathbb{R}\). However, limits of functions \(f : \mathbb{R}^n \rightarrow \mathbb{R}^m\) are computed component-wise, so these definitions entail each other. Thus, it follows that
\[D_af(v) = D_af^j(v)\cdot e_j.\]
where \(f^j\) is the \(j\)-th component function, \(e_j\) is the \(j\)-th standard basis vector, and the summation convention is used.
Next we want to establish that \(df_a(v) = D_af(v)\), i.e. that the total derivative of a function \(f\) at a point \(a\) evaluated at a vector \(v\) is equal to the directional derivative of \(f\) in the direction of \(v\) at the point \(a\). This is why it is called the "total" derivative: it indicates the rate of change in every direction.
To see this, note that, if \(f\) is differentiable at a point \(a\), then, by one of the equivalent definitions,
\[\lim_{t \to 0}\frac{f(a + th) - f(a) - df_a(th)}{||th||} = 0.\]
We then compute
\begin{align}0 &= \lim_{t \to 0}\frac{f(a + th) - f(a) - df_a(th)}{||th||}\\ &= \lim_{t \to 0}\frac{f(a + th) - f(a) - df_a(th)}{|t|||h||}\\ &= \frac{1}{||h||}\lim_{t \to 0}\frac{f(a + th) - f(a) - df_a(th)}{|t|}\end{align}
which further implies that
\[\lim_{t \to 0}\frac{f(a + th) - f(a) - df_a(th)}{|t|} = 0.\]
Since limits are computed component-wise, we only need to establish that
\[\lim_{t \to 0}\frac{f^j(a + th) - f^j(a) - (df^j)_a(th)}{|t|} = 0\]
for arbitrary \(f^j\).
Due to the absolute value in the denominator, note that multiplying by \(t/|t|\) (which has absolute value \(1\)) preserves a limit of \(0\), so we also have
\[\lim_{t \to 0}\frac{f^j(a + th) - f^j(a) - (df^j)_a(th)}{t} = 0.\]
This means that
\[\lim_{t \to 0}\left[\frac{f^j(a + th) - f^j(a)}{t} - \frac{(df^j)_a(th)}{t}\right] = 0\]
and, since the limit of the second term exists (as computed below), we conclude that
\[\lim_{t \to 0}\frac{f^j(a + th) - f^j(a)}{t} = \lim_{t \to 0}\frac{(df^j)_a(th)}{t}\]
which, by linearity, implies
\begin{align}\lim_{t \to 0}\frac{f^j(a + th) - f^j(a)}{t} &= \lim_{t \to 0}\frac{t \cdot (df^j)_a(h)}{t}\\&= \lim_{t \to 0}(df^j)_a(h)\\&= (df^j)_a(h).\end{align}
Thus, \(df_a(h) = D_af(h)\).
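This identity is easy to check numerically. A quick Python sketch, with an illustrative sample function \(f(x,y) = xy + y^2\) (my choice, not from the discussion above):

```python
# Check numerically that df_a(v) agrees with the directional derivative
# D_af(v) = lim_{t->0} (f(a + t*v) - f(a)) / t, for a sample function.

def f(x, y):
    return x * y + y * y

a, v = (2.0, 1.0), (1.0, 3.0)
# df_a(v) = v^1 df/dx(a) + v^2 df/dy(a), with df/dx = y and df/dy = x + 2y
exact = v[0] * a[1] + v[1] * (a[0] + 2 * a[1])   # = 1 + 12 = 13
t = 1e-6
approx = (f(a[0] + t * v[0], a[1] + t * v[1]) - f(*a)) / t
assert abs(approx - exact) < 1e-4
```

The difference quotient at small \(t\) matches the value of the linear map \(df_a\) applied to \(v\).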
Partial Derivatives
Partial derivatives are an important special case of directional derivatives. They are directional derivatives in the direction of the standard basis vectors \(e_i\). The following notation is used:
\[\frac{\partial f}{\partial x^i}(a) = D_af(e_i).\]
An alternative, but equivalent, definition of the directional derivative of a function \(f : \mathbb{R}^n \rightarrow \mathbb{R}\) is the following:
\[D_af(v) = \frac{d}{dt}\bigg\rvert_{t = 0}f(a + tv).\]
By the chain rule (which we have not yet discussed), this is equal to
\[v^i\frac{\partial f}{\partial x^i}(a).\]
The Classical Derivative
The classical derivative of a 1-dimensional function \(f : \mathbb{R} \rightarrow \mathbb{R}\) is technically a partial derivative (i.e. a directional derivative in the direction of the basis vector \(e_1 = 1\), which is a scalar in this context). Instead of the notation \((\partial f / \partial x) (a)\), the notation \(f'(a) = (df/ dx)(a)\) is used for functions \(f : \mathbb{R} \rightarrow \mathbb{R}\), so that
\[\frac{df}{dx}(a) = \frac{\partial f}{\partial x}(a) = (D_af)(1) = \lim_{t \to 0}\frac{f(a + t) - f(a)}{t}.\]
More generally, we then have for any \(h \in \mathbb{R}\)
\[df_a(h) = D_af(h) = h \frac{\partial f}{\partial x}(a) = h \frac{df}{dx}(a) = h \cdot f'(a).\]
Thus, the classical derivative represents the linear map \(h \mapsto f'(a) \cdot h\), which is the total derivative.
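As a sanity check, the following sketch (with the sample function \(f(x) = x^3\), an assumption for illustration) confirms that the difference quotient recovers \(f'(a)\) and that \(df_a(h) = f'(a) \cdot h\):

```python
# For f(x) = x^3 at a = 2, the classical derivative is f'(2) = 12, and the
# total derivative is the linear map h -> 12 * h.

def f(x):
    return x ** 3

a, h = 2.0, 0.5
t = 1e-7
fprime = (f(a + t) - f(a)) / t       # difference quotient approximation
assert abs(fprime - 12.0) < 1e-4     # classical derivative f'(2) = 12
assert abs(fprime * h - 6.0) < 1e-4  # df_a(h) = f'(a) * h = 12 * 0.5
```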
The Jacobian Matrix
Every linear map between finite-dimensional vector spaces is represented by a matrix relative to a choice of bases. The matrix \(Df(a)\) that represents the total derivative \(df_a\) with respect to the standard bases is called the Jacobian matrix.
The component functions \(f^j\) allow us to analyze \(df_a(h)\) into \(m\) components in the codomain, and the partial derivatives allow us to further analyze \((df^j)_a(h)\) into \(n\) components. Together, this fully analyzes the total derivative into \(m \times n\) components as follows:
\begin{align}df_a(h) &= D_af(h)\\&= D_af^j(h) \cdot e_j\\&= (df^j)_a(h) \cdot e_j\\&= h^i \cdot \frac{\partial f^j}{\partial x^i}(a) \cdot e_j.\end{align}
This implies that the entry of the matrix representing \(df_a\) in row \(j\) and column \(i\) is
\[\left(\frac{\partial f^j}{\partial x^i}(a)\right),\]
since we have
\[\begin{bmatrix}\frac{\partial f^1}{\partial x^1}(a) & \dots & \frac{\partial f^1}{\partial x^n}(a)\\ \vdots & \ddots & \vdots\\ \frac{\partial f^m}{\partial x^1}(a) & \dots & \frac{\partial f^m}{\partial x^n}(a) \end{bmatrix} \begin{bmatrix} h^1\\ \vdots \\ h^n\end{bmatrix} = df_a(h).\]
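The entries \(\frac{\partial f^j}{\partial x^i}(a)\) can be approximated by central differences and assembled into the matrix. A Python sketch, with the sample function \(f(x,y) = (x^2 y,\ 5x + \sin y)\) chosen for illustration:

```python
import math

# Assemble the Jacobian matrix of f : R^2 -> R^2 by central differences:
# entry (j, i) approximates the partial derivative of f^j with respect to x^i.

def f(x, y):
    return (x * x * y, 5 * x + math.sin(y))

def jacobian(f, a, eps=1e-6):
    m, n = len(f(*a)), len(a)
    J = [[0.0] * n for _ in range(m)]
    for i in range(n):                       # column i: partials in x^i
        ap, am = list(a), list(a)
        ap[i] += eps
        am[i] -= eps
        fp, fm = f(*ap), f(*am)
        for j in range(m):                   # row j: component f^j
            J[j][i] = (fp[j] - fm[j]) / (2 * eps)
    return J

a = (2.0, 0.0)
J = jacobian(f, a)
# Exact Jacobian at (2, 0): [[2xy, x^2], [5, cos y]] = [[0, 4], [5, 1]]
expected = [[0.0, 4.0], [5.0, 1.0]]
assert all(abs(J[j][i] - e) < 1e-4
           for j, row in enumerate(expected) for i, e in enumerate(row))
```

Multiplying \(J\) by a column vector \(h\) then approximates \(df_a(h)\), exactly as in the matrix equation above.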
Affine versus Linear Approximation
Many authors prefer to call the total derivative an affine approximation rather than a linear approximation. Both are correct, but it depends on the perspective. If the approximation is viewed as a map \(h \mapsto f(a) + df_a(h - a)\) with signature \(\mathbb{R}^n \rightarrow \mathbb{R}^m\), then this is indeed an affine approximation. However, if we think of the approximation as the map \(\bar{d}f_a : T_a\mathbb{R}^n \rightarrow T_{f(a)}\mathbb{R}^m\), then it is a linear map which is isomorphic to the map \(df_a\). The latter perspective generalizes immediately to differential geometry (smooth manifolds), where the differential of a smooth map is viewed as a linear transformation between tangent spaces.
Generalizations
The definitions above generalize from the spaces \(\mathbb{R}^n\) immediately to arbitrary normed vector spaces. This generalization is often called the Fréchet derivative.
Examples
Example 1. Consider the function \(f : \mathbb{R} \rightarrow \mathbb{R}\) given by \(f(x) = x^2\). The total derivative is the unique linear map \(df_a : \mathbb{R} \rightarrow \mathbb{R}\) satisfying the following:
\[\lim_{h \to 0}\frac{(a+h)^2 - a^2 - df_a(h)}{||h||} = 0.\]
This means that
\[\lim_{h \to 0}\frac{a^2 + 2ah + h^2 - a^2 - df_a(h)}{||h||} = 0\]
and so
\[\lim_{h \to 0}\frac{2ah + h^2 - df_a(h)}{||h||} = 0.\]
Rearranging (and assuming that the limit on the right exists), we obtain
\[\lim_{h \to 0}\frac{h^2}{||h||} + \lim_{h \to 0}\frac{2ah - df_a(h)}{||h||} = 0.\]
Since
\[\lim_{h \to 0}\frac{h^2}{||h||} = 0,\]
the total derivative only needs to satisfy
\[\lim_{h \to 0}\frac{2ah - df_a(h)}{||h||} = 0.\]
This is satisfied if \(df_a(h) = 2ah\), and indeed this limit exists. Thus, \(df_a\) is a linear map represented by the real number \(2a\), which is the classical derivative \(f'(a)\).
Example 2. Consider the function \(f : \mathbb{R}^2 \rightarrow \mathbb{R}\) given by \(f(x,y) = x^2 + y^2\). Recall that
\[df_a(h) = D_af(h) = h^i\frac{\partial f}{\partial x^i}(a).\]
Writing \(h = (x,y)\) and \(x\) for the coordinate function \(h^1\) and \(y\) for the coordinate function \(h^2\), we obtain
\[df_a(x,y) = x \frac{\partial f}{\partial x}(a) + y \frac{\partial f}{\partial y}(a).\]
Since \(\frac{\partial f}{\partial x} = 2x\) and \(\frac{\partial f}{\partial y} = 2y\), writing \(a = (a^1, a^2)\), this becomes
\[df_a(x,y) = 2a^1x + 2a^2y.\]
Now, some readers may be familiar with calculus, and recognize the expression
\[df = \frac{\partial f}{\partial x}dx + \frac{\partial f}{\partial y}dy.\]
This alternative formulation represents the global differential \(df\) as a covector field, which we will eventually discuss on this blog. This covector field, when evaluated at the point \(a\) and applied to the tangent vector representation of \((x,y)\), yields the same result. A full explanation is out of scope here, but we will provide the calculation that demonstrates the relationship between the two formulations.
One way to see that they are equivalent is to represent the point \((x,y) \in \mathbb{R}^2\) in the tangent space \(T_a\mathbb{R}^2\) as
\[x\frac{\partial}{\partial x}\bigg\rvert_a + y \frac{\partial}{\partial y}\bigg\rvert_a\]
and then compute
\begin{align}df_a\left(x\frac{\partial}{\partial x}\bigg\rvert_a + y \frac{\partial}{\partial y}\bigg\rvert_a\right) &= \left(x\frac{\partial}{\partial x}\bigg\rvert_a + y \frac{\partial}{\partial y}\bigg\rvert_a\right)(f)\\&= x\frac{\partial f}{\partial x}(a) + y\frac{\partial f}{\partial y}(a)\\&= 2a^1x + 2a^2y.\end{align}
We can also use the formulation at the point \(a\)
\[df_a = \frac{\partial f}{\partial x}(a)dx_a + \frac{\partial f}{\partial y}(a)dy_a\]
and apply this to the same vector to compute
\[df_a\left(x\frac{\partial}{\partial x}\bigg\rvert_a + y \frac{\partial}{\partial y}\bigg\rvert_a\right) = \left(\frac{\partial f}{\partial x}(a)dx_a + \frac{\partial f}{\partial y}(a)dy_a\right) \left(x\frac{\partial}{\partial x}\bigg\rvert_a + y \frac{\partial}{\partial y}\bigg\rvert_a\right).\]
We can compute
\begin{align}\frac{\partial f}{\partial x}(a)dx_a\left(x\frac{\partial}{\partial x}\bigg\rvert_a + y \frac{\partial}{\partial y}\bigg\rvert_a\right) &=\frac{\partial f}{\partial x}(a)\left[dx_a\left(x\frac{\partial}{\partial x}\bigg\rvert_a\right) + dx_a\left(y \frac{\partial}{\partial y}\bigg\rvert_a\right)\right]\\&= \frac{\partial f}{\partial x}(a)\left[x dx_a\left(\frac{\partial}{\partial x}\bigg\rvert_a\right) + y dx_a\left( \frac{\partial}{\partial y}\bigg\rvert_a\right)\right]\\&= \frac{\partial f}{\partial x}(a)\left[x \frac{\partial x}{\partial x}(a) + y \frac{\partial x}{\partial y}(a)\right]\\&= x \frac{\partial f}{\partial x}(a).\end{align}
Likewise, we compute
\[\frac{\partial f}{\partial y}(a)dy_a\left(x\frac{\partial}{\partial x}\bigg\rvert_a + y \frac{\partial}{\partial y}\bigg\rvert_a\right) = y \frac{\partial f}{\partial y}(a).\]
Thus, we finally obtain
\[df_a\left(x\frac{\partial}{\partial x}\bigg\rvert_a + y \frac{\partial}{\partial y}\bigg\rvert_a\right) = x \frac{\partial f}{\partial x}(a) + y \frac{\partial f}{\partial y}(a) = 2a^1x + 2a^2y.\]
In summary, the formulation in terms of covector fields is a global, point-free formulation, whereas the total derivative is a local, point-wise formulation.
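Finally, Example 2 can be checked numerically. A sketch, where the specific point \(a = (3, -1)\) and tangent vector components are illustrative assumptions:

```python
# For f(x, y) = x^2 + y^2 at a = (a1, a2), the total derivative applied to
# (x, y) should equal 2*a1*x + 2*a2*y; compare with a difference quotient.

def f(x, y):
    return x * x + y * y

a1, a2 = 3.0, -1.0
x, y = 0.5, 2.0                  # components of the tangent vector
t = 1e-6
approx = (f(a1 + t * x, a2 + t * y) - f(a1, a2)) / t
exact = 2 * a1 * x + 2 * a2 * y  # = 3.0 - 4.0 = -1.0
assert abs(approx - exact) < 1e-4
```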