[Online Review] Linear Algebra

Fundamental concepts of linear algebra which help understand statistical and machine learning techniques.

Vector and Matrix Algebra


The span of a list of vectors ${v_1,…, v_k}$ in $R_n$ is the collection of all linear combinations of ${v_1,…, v_k}$.

Upper and Lower Triangular Matrices

Upper Triangular means that the entries of A “below the diagonal are zero.

Lower Triangular means that the entries of A “above” the diagonal are zero.

  • not just for SQUARE matrices

Systems of Linear Equations & Nonsigular Matrices

Reduced Row Echelon Form (rref)


A matrix A is in reduced row echelon form (rref) if each of the following is satisfied:

  1. Any zero-rows occur at the bottom.
  2. The first nonzero entry of a nonzero row is a 1 (called a pivot).
  3. Every pivot occurs to the right of the pivots above it.
  4. All nonpivot entries [in a column that contains a pivot] are zero.

Rank and Nullity

  • Rank : the number of pivot columns. –> dependent variables
  • Nullity: the number of nonpivot columns. –> independent variables (free)

rref Systems(增加了竖线右边那列)

Consistency 相容/可解

consistent (or solvable): the augmented(竖线右边的那列) column contains no pivot.

inconsistent (or unsolvable): the augmented column contains a pivot.

Number of Solutions

An rref system has:

  • exactly one solution if
    • it is consistent. no pivot in augmented column
    • it has NO free variables. rank = number of variables, nullity = 0
  • infinitely many solutions if
    • it is consistent. no pivot in augmented column
    • it has free variables. rank \< number of variables, nullity > 0
  • no solutions if
    • it is inconsistent. augmented column contains a pivot

Gauss-Jordan Elimination — how to get rref of a matrix


  1. If $a_{ij} = 0$, switch the $i^{th}$ row with the first row below it whose jth entry is nonzero. If no such row exists, increase j by one and repeat this step.
  2. Multiply the ith row by 1/aij to make a pivot entry in the position $(i,j)$.
  3. Eliminate all other entries in the jth column by subtracting suitable multiples of the ith row from the other rows.
  4. Increase i and j by one and return to Step 1.
    The algorithm terminates after the last row or last column is processed. The output of this algorithm is the reduced row echelon form rref(A) of A.

Nonsingular Matrices 非奇异矩阵(可逆矩阵)

$Ax = b$ —> $x=A^{-1}b$
$A^{-1}$: inverse / reciprocal. 逆

Invertibility 可逆性

a $n × n$ matrix A is invertible / nonsingular if and only if rank(A) = n
* only apply to square matrix

Gauss-Jordan Algorithm for Computing $A^{-1}$

  1. Form $[A | I_n]$.
  2. Solve $[A | I_n]$ —> $[rref(A) | B]$.
  3. If $rref(A) \neq I_n$, then A is not invertible.
  4. If $rref(A) = I_n$, then A is invertible and $A^{-1} = B$.


Matrix Inverse Rules

  • Involution: If A is invertible, then $(A^{-1})^{-1} = A$.
  • Transposition: If A is invertible, then $(A^{-1})^{T} = (A^{T})^{-1}$
  • Reverse Multiplication: If A and B are invertible, then $(AB)^{-1}= B^{-1}A^{-1}$
  • Scaling Rule: If A is invertible and $c \neq 0$, then $(c · A)^{-1} = \frac{1}{c} · A^{-1}$.

Least Squares Approximations

Fitting Lines to Data

The Method of Least Squares

Let ${v_1, v_2,…, v_k}$ be the columns of a matrix $A = [v_1, v_2,…, v_k]$
The list of vectors ${v_1, v_2,…, v_k}$ is

  • linearly independent if rank(A) = k — full column rank
  • linearly dependent if rank(A) < k

Nonsingularity of $A^TA$

  • If $A_{n*k}$ has full column rank ($rank(A)=k$), then $A^{T}A$ is nonsingular ($A^{-1}$ exists).
  • If C is nonsingular, then $Cx=d$ has a unique solution. (because we can do $C^{-1}Cx=C^{-1}d$ —> $ x = C^{-1}d$
    This means that the system $A^TAx = A^Tb$ has a unique solution x , even if $Ax = b$ has no solution.

Least Squares Approximation

When $Ax = b$ has no solution, solve $A^TAx =A^Tb$ for x .

The vector x is called the least squares approximate solution to $Ax = b$ .

This approximate solution makes the error $E = ∥Ax − b∥^2$ as small as possible.

Gram-Schmidt Orthogonalization and QR-Factorization

Inversion: $(AB)^T= B^{T}A^{T}$
Transposition: $(AB)^{-1}= B^{-1}A^{-1}$

Orthonormal list of vector

A list of vectors ${q_1, q_2,…, q_k}$ is called orthonormal if

  • each vector is a unit vecto: $∥q_i∥ = 1$
  • each pair of vectors is orthogonal: $q_i q_j = 0$

Matrices with Orthonormal Columns

If matrix Q is composed of orthonormal columns, then $Q^{T}Q = I$ . (proved by its definition)
Note that these matrices need NOT be square.

Orthogonal Matrix

when Q is a square matrix, the relation $Q^{T}Q = I$ means that $Q^{-1}$ exists, and $Q^{-1}= Q^{T}$.
Such matrices are called orthogonal matrices.

Least-Squares Approximation for Qx = b

The least-squares approximation $x$ of $Qx = b$ is $x = Q^{T}b$.
Consequently, if $Qx = QQ^{T}b \neq b$ then the system $Qx = b$ is inconsistent.(先乘$Q^{T}$,再乘$Q$倒回去,但是结果不等,说明$Q^{T}b$ 只是近似解)


Given $A = QR$, the least squares approximate solution to $Ax = b$ is the solution x to the upper-triangular system $Rx = Q^{T}b$.

Using Gram-Schmidt Algorithm to Compute A=QR

Input: matrix $A$ with linearly independent columns ($rank(A) =k$).
Output: find $Q, R$ for the QR-decomposition $A = QR$.

Explaination: projecting $v$ onto $w$

* There are other way of computing Q, R and we might get results that are sligthly different in signals.

Ax=b -> QRx = b -> Rx = Q^{T}b


Eigenvalues & Eigenvectors

Eigenvalues. 特征值


  • Calculate λ: $|A − λ · I_n| = 0$
  • Determine if $λ$ is an Eigenvalue of $A$: $λ$ is an eigenvalue of an $n×n$ matrix $A$ if the matrix $A−λI_n$ is singular ( $rank(A − λ I_n) < n$).


An eigenvector of $A$ associated to the eigenvalue $λ$ is a vector $v$ satisfying the equation $Av = λv$.

  • Test to Determine if v is an Eigenvector of A: A vector v != 0 is an eigen vector of A if and only if Av =λ·v for some scalar λ. In this situation, λ is an eigenvalue of A.


An eigenpair of a matrix $A$ is a pair $(λ,v)$ where $λ$ is an eigen value of $A$ and $v$ is an eigenvector of $A$ corresponding to the eigenvalue $λ$.

  • one eigenvalue may have multiple eigenvectors.

Pivot Basis

Geometric Multiplicity 几何重数

The geometric multiplicity of λ as an eigenvalue of $A$ is a number that measures the number of linearly independent eigenvectors (nonpivot columns) of $A$ associated to λ.
$gm_A(λ) = nullity(A−λ·I_n)$

Find Pivot Basis Associated to $λ$

for $(A − λ · I_n) x = 0$

form $[rref[A − λ · I_n | 0 ]$

write the general solution to $(A−λ·I_n)x = 0$ as $x = c_1·v_1 + c_2·v_2 +···+c_g·v_g$ .

The list of vectors ${v1, v2,…, vg}$ is called the pivot basis associated to $λ$. (but not the eigenvector associated to λ)

Eigenvectors v.s Pivot Basis

The eigenvectors of $A$ associated to $λ$ are the linear combinations of the pivot basis ${v_1, v_2,…, v_g}$. That is, every eigenvector of $A$ associated to $λ$ is of the form $v = c_1v_1 + c_2 v_2 + · · · + c_gv_g$ where $c_1,c_2,…,c_g$ are scalars.

Diagonalization 对于方阵


An $n × n$ matrix $A$ is diagonalizable if it is possible to find

  • an invertible $n × n$ matrix $P$ and (可逆的方阵P)
  • an $n × n$ diagonal matrix $D$ (对角矩阵D)
    satisfying $A = PDP$.

The matrix power $A^k$ is given by $A^k = PD^kP^{-1}$ (连乘)

Determin diagonalizability

A is diagonalizable if and only if the sum of the geometric multiplicities of each eigenvalue $λ_i$ of $A$ is equal to $n$: $\sum{gm_A(λ_i)} = n$

Diagonalize a Matrix

Let A be a diagonalizable matrix. To find a diagonalization $A = PDP^{-1}$, start with two empty $n × n$ matrices $P$ and $D$. Then complete the following steps for each eigenvalue $λ$ of $A$.

  1. Find the pivot basis ${v_1, v_2,…, v_g}$ associated to $λ$. (solve $[rref[A − λ · I_n | 0 ]$ )
  2. P: Insert each vector from the pivot basis ${ v_1, v_2, . . . , v_g}$ into the first available columns of $P$.
  3. Q: Insert $λ$ onto the first available diagonal entries of $D$ a total number of $gm_A(λ)$ times.

Spectral Theorem and Principal Component Analysis

symmetric: $A^T= A$
orthogonal: $Q^T= Q^{-1}$

Spectral Theorem (Diagonalization of Symmetric Matrices)

Let $A$ be a symmetric matrix ($A^T= A$). Then there is a diagonalization $A = QDQ^T$ where Q is an orthogonal matrix ($Q^{T}= Q^{-1}$) .

It indicates that any given symmetric matrix is diagonalizable and that the eigenvectors chosen during the diagonalization process may be chosen to be orthonormal.


The Covariance Matrix 协方差矩阵

$A$ -> $A_c:$ centered data.
When using a line to approximate centered data, the line is always of the form $f (x) = mx$.

That is, the line always passes through the origin.

Suppose that data is stored in an $n × p$ matrix $A$, where each column corresponds to a variable.

The covariance matrix of A is $Σ= \frac{1}{n-1}{A_c}^TA_c$

positive definite

  • symmetric: $Σ^T= Σ$ -> we can use spectral theory to diagonalize $Σ=QDQ^T$ where Q is orthogonal

  • eigenvalues $\lambda_i$of the covariance matrix are positive numbers

Orthogonal Distance

  • least squares: use vertical distance to measure the error (linear regression)
  • principal component analysis (PCA): use orthogonal distance to measure the error (orthogonal regression)

The Orthogonal Error in Approximating Centered Data by f (x) = m · x

Consider the problem of using a line $f(x) = mx$ to approximate a centered data set, whose points are listed as vectors ${p_1, p_2, . . . , p_n}$.

The orthogonal error in using $f (x) = mx$ to approximate the data set is: the sum of the squares of the orthogonal distances from each of these points to the vector $w = (1, m)$.(代表条线的一个向量)
That is, $E=∥p_1 −proj_w(p_1)∥^2+∥p_2 −proj_w(p_2)∥^2 +···+∥p_n −proj_w(p_n)∥^2$

Orthogonal Regression

Goal: Given a data set, find the line that minimizes the orthogonal error.

First Principal Component

Suppose that $λ_1$ is the largest eigenvalue of the covariance matrix $Σ$(which is positive definite). A first principal component of $Σ$ is an eigenpair $(λ_1, v_1)$ of $Σ$.

Get the regression line

  • Let $A$ be a matrix whose rows consist of data points.
  • Let $(λ_1, v_1)$ be a first principal component of the covariance matrix $Σ$.
  • Write the vector $v_1$ in coordinates $v_1 = (x, y)$ and define $m = \frac{y}{x}$ .
  • $f (x) = mx$ minimizes the orthogonal error.