Fundamental concepts of linear algebra that help in understanding statistical and machine learning techniques.

# Vector and Matrix Algebra

**Span**

The span of a list of vectors $\{v_1, …, v_k\}$ in $\mathbb{R}^n$ is the collection of all linear combinations of $\{v_1, …, v_k\}$.

**Upper and Lower Triangular Matrices**

Upper Triangular means that the entries of A "below" the diagonal are zero.

Lower Triangular means that the entries of A “above” the diagonal are zero.

- These definitions are not limited to square matrices.

# Systems of Linear Equations & Nonsingular Matrices

## Reduced Row Echelon Form (rref)

**Definition**

A matrix A is in reduced row echelon form (rref) if each of the following is satisfied:

- Any zero-rows occur at the bottom.
- The first nonzero entry of a nonzero row is a 1 (called a pivot).
- Every pivot occurs to the right of the pivots above it.
- All nonpivot entries [in a column that contains a pivot] are zero.

**Rank and Nullity**

- Rank: the number of pivot columns → dependent variables
- Nullity: the number of nonpivot columns → independent (free) variables
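As a quick numerical check (the matrix here is my own example), NumPy's `matrix_rank` gives the rank, and the nullity is the number of remaining columns:

```python
# Hypothetical example: rank + nullity = number of columns.
import numpy as np

A = np.array([[1., 2., 3.],
              [2., 4., 6.],   # this row is 2x the first row, so it adds no rank
              [1., 0., 1.]])

rank = np.linalg.matrix_rank(A)   # number of pivot columns
nullity = A.shape[1] - rank       # number of nonpivot (free) columns

print(rank, nullity)   # prints: 2 1
```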

## rref Systems (augmented with the column to the right of the vertical bar)

### Consistency (Solvability)

Consistent (solvable): the augmented column (the column to the right of the vertical bar) contains no pivot.

Inconsistent (unsolvable): the augmented column contains a pivot.

### Number of Solutions

An rref system has:

**Exactly one solution** if

- it is consistent (no pivot in the augmented column), and
- it has **no** free variables (rank = number of variables, nullity = 0).

**Infinitely many solutions** if

- it is consistent (no pivot in the augmented column), and
- it has free variables (rank < number of variables, nullity > 0).

**No solutions** if

- it is inconsistent (the augmented column contains a pivot).

## Gauss-Jordan Elimination — how to get rref of a matrix

Working from left to right, for each column: turn the current row's entry into a 1 and eliminate the other entries to 0.

1. If $a_{ij} = 0$, switch the $i^{th}$ row with the first row below it whose $j^{th}$ entry is nonzero. If no such row exists, increase $j$ by one and repeat this step.
2. Multiply the $i^{th}$ row by $1/a_{ij}$ to make a pivot entry in position $(i, j)$.
3. Eliminate all other entries in the $j^{th}$ column by subtracting suitable multiples of the $i^{th}$ row from the other rows.
4. Increase $i$ and $j$ by one and return to Step 1.

The algorithm terminates after the last row or last column is processed. The output of this algorithm is the reduced row echelon form rref(A) of A.
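The steps above can be sketched in Python (the function name and example matrix are my own):

```python
# A minimal sketch of Gauss-Jordan elimination producing rref(A).
import numpy as np

def rref(A, tol=1e-12):
    R = A.astype(float).copy()
    m, n = R.shape
    i = 0
    for j in range(n):                 # sweep columns left to right
        # Step 1: find a row at or below i with a nonzero entry in column j
        pivot_rows = [r for r in range(i, m) if abs(R[r, j]) > tol]
        if not pivot_rows:
            continue                   # no pivot in this column; move on to the next
        R[[i, pivot_rows[0]]] = R[[pivot_rows[0], i]]   # swap it into place
        R[i] = R[i] / R[i, j]          # Step 2: scale the row so the pivot is 1
        for r in range(m):             # Step 3: eliminate the rest of column j
            if r != i:
                R[r] = R[r] - R[r, j] * R[i]
        i += 1                         # Step 4: move to the next row
        if i == m:
            break                      # last row processed; terminate
    return R

A = np.array([[1., 2., 1.],
              [2., 4., 0.],
              [0., 0., 1.]])
R = rref(A)   # rows: [1, 2, 0], [0, 0, 1], [0, 0, 0]
```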

## Nonsingular Matrices (invertible matrices)

$Ax = b$ → $x = A^{-1}b$

$A^{-1}$: the inverse of $A$.

### Invertibility

An $n × n$ matrix $A$ is invertible (nonsingular) if and only if `rank(A) = n`.

* **This applies only to square matrices.**

### Gauss-Jordan Algorithm for Computing $A^{-1}$

- Form $[A | I_n]$.
- Row-reduce $[A | I_n]$ → $[rref(A) | B]$.
- If $rref(A) \neq I_n$, then $A$ is **not** invertible.
- If $rref(A) = I_n$, then $A$ is invertible and $A^{-1} = B$.
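A sketch of the $[A | I_n]$ method using SymPy's `rref` (the example matrix is my own):

```python
# Compute an inverse by row-reducing the augmented matrix [A | I].
import sympy as sp

A = sp.Matrix([[2, 1],
               [5, 3]])

aug = A.row_join(sp.eye(2))    # form [A | I_2]
R, _ = aug.rref()              # row-reduce to [rref(A) | B]
left, B = R[:, :2], R[:, 2:]

assert left == sp.eye(2)       # rref(A) = I_2, so A is invertible
assert A * B == sp.eye(2)      # and B = A^{-1}
```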

### Matrix Inverse Rules

- **Involution**: If $A$ is invertible, then $(A^{-1})^{-1} = A$.
- **Transposition**: If $A$ is invertible, then $(A^{-1})^{T} = (A^{T})^{-1}$.
- **Reverse Multiplication**: If $A$ and $B$ are invertible, then $(AB)^{-1} = B^{-1}A^{-1}$.
- **Scaling Rule**: If $A$ is invertible and $c \neq 0$, then $(c · A)^{-1} = \frac{1}{c} · A^{-1}$.
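These rules are easy to spot-check numerically (the matrices below are chosen arbitrarily but are invertible):

```python
# Numerical spot-check of the four inverse rules.
import numpy as np

A = np.array([[2., 1.], [5., 3.]])
B = np.array([[1., 2.], [3., 7.]])
inv = np.linalg.inv

assert np.allclose(inv(inv(A)), A)               # involution
assert np.allclose(inv(A).T, inv(A.T))           # transposition
assert np.allclose(inv(A @ B), inv(B) @ inv(A))  # reverse multiplication
assert np.allclose(inv(3 * A), inv(A) / 3)       # scaling rule
```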

# Least Squares Approximations

## Fitting Lines to Data

## The Method of Least Squares

Let $\{v_1, v_2, …, v_k\}$ be the columns of a matrix $A = [v_1, v_2, …, v_k]$.

The list of vectors $\{v_1, v_2, …, v_k\}$ is

- **linearly independent** if $rank(A) = k$ — **full column rank**
- **linearly dependent** if $rank(A) < k$

**Nonsingularity of $A^TA$**

- If $A_{n×k}$ has full column rank ($rank(A) = k$), then $A^{T}A$ is nonsingular ($(A^{T}A)^{-1}$ exists).
- If $C$ is nonsingular, then $Cx = d$ has a unique solution (because we can do $C^{-1}Cx = C^{-1}d$ → $x = C^{-1}d$).

This means that the system $A^TAx = A^Tb$ has a unique solution $x$, even if $Ax = b$ has no solution.

**Least Squares Approximation**

When $Ax = b$ has no solution, solve $A^TAx =A^Tb$ for x .

The vector x is called the **least squares approximate solution** to $Ax = b$ .

This approximate solution makes the error $E = ∥Ax − b∥^2$ as small as possible.
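A minimal sketch of the normal-equations approach, fitting a line $y ≈ c_0 + c_1 t$ (the data values are made up for illustration):

```python
# Least squares via the normal equations A^T A x = A^T b.
import numpy as np

t = np.array([0., 1., 2.])
y = np.array([1., 2., 2.])

A = np.column_stack([np.ones_like(t), t])   # design matrix with full column rank
x = np.linalg.solve(A.T @ A, A.T @ y)       # unique solution of A^T A x = A^T y

# Sanity check against NumPy's built-in least squares solver.
x_ref, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(x, x_ref)
```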

## Gram-Schmidt Orthogonalization and QR-Factorization

Transposition: $(AB)^T = B^{T}A^{T}$

Inversion: $(AB)^{-1} = B^{-1}A^{-1}$

**Orthonormal** List of Vectors

A list of vectors $\{q_1, q_2, …, q_k\}$ is called orthonormal if

- each vector is a **unit** vector: $∥q_i∥ = 1$
- each pair of distinct vectors is **orthogonal**: $q_i · q_j = 0$ for $i \neq j$

**Matrices with Orthonormal Columns**

If a matrix $Q$ has orthonormal columns, then $Q^{T}Q = I$ (this follows directly from the definition).

Note that these matrices need NOT be square.

**Orthogonal** Matrix

When $Q$ is a **square** matrix, the relation $Q^{T}Q = I$ means that $Q^{-1}$ exists and $Q^{-1} = Q^{T}$.

Such matrices are called **orthogonal** matrices.

If the columns of a square matrix are orthonormal (mutually orthogonal and each of unit length), then the matrix is orthogonal.

**Least-Squares Approximation for Qx = b**

The least-squares approximation $x$ of $Qx = b$ is $x = Q^{T}b$.

Consequently, if $Qx = QQ^{T}b \neq b$, then the system $Qx = b$ is inconsistent. (Multiply by $Q^{T}$ first, then multiply back by $Q$; if the result does not equal $b$, then $Q^{T}b$ is only an approximate solution.)

**QR-Decompositions**

Given $A = QR$, the least squares approximate solution to $Ax = b$ is the solution x to the upper-triangular system $Rx = Q^{T}b$.

**Using Gram-Schmidt Algorithm to Compute A=QR**

Input: matrix $A$ with linearly independent columns ($rank(A) =k$).

Output: find $Q, R$ for the QR-decomposition $A = QR$.

Explanation: the core step is projecting $v$ onto $w$: $proj_w(v) = \frac{v · w}{w · w} w$.

* **There are other ways of computing $Q$ and $R$, and the results may differ slightly in signs.**

$Ax = b$ → $QRx = b$ → $Rx = Q^{T}b$
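This chain can be sketched with NumPy's built-in QR on toy data (a different QR routine may flip signs in $Q$ and $R$, but the solution $x$ is unaffected):

```python
# Least squares via QR: given A = QR, solve the triangular system R x = Q^T b.
import numpy as np

A = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])
b = np.array([1., 2., 2.])

Q, R = np.linalg.qr(A)           # Q has orthonormal columns, R is upper triangular
x = np.linalg.solve(R, Q.T @ b)  # the least squares approximate solution

assert np.allclose(A.T @ A @ x, A.T @ b)   # x also satisfies the normal equations
```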

# Eigenvalues

## Eigenvalues & Eigenvectors

One application of eigenvalues: solving recurrence relations.

**Eigenvalues**

- Calculate $λ$: solve the characteristic equation $|A − λ · I_n| = 0$.
- Determine if $λ$ is an Eigenvalue of $A$: $λ$ is an eigenvalue of an $n×n$ matrix $A$ if the matrix $A−λI_n$ is singular ( $rank(A − λ I_n) < n$).

**Eigenvectors**

An eigenvector of $A$ associated to the eigenvalue $λ$ is a vector $v$ satisfying the equation $Av = λv$.

- Test to determine if $v$ is an eigenvector of $A$: a vector $v \neq 0$ is an eigenvector of $A$ if and only if $Av = λ·v$ for some scalar $λ$. In this situation, $λ$ is an eigenvalue of $A$.

**Eigenpair**

An eigenpair of a matrix $A$ is a pair $(λ, v)$ where $λ$ is an eigenvalue of $A$ and $v$ is an eigenvector of $A$ corresponding to the eigenvalue $λ$.

- one eigenvalue may have multiple eigenvectors.
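A numerical check of the eigenpair definition (the example matrix is my own):

```python
# Each eigenpair (λ, v) satisfies Av = λv, and A − λI is singular.
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])

lams, V = np.linalg.eig(A)   # eigenvalues, and eigenvectors as columns of V
for lam, v in zip(lams, V.T):
    assert np.allclose(A @ v, lam * v)                       # Av = λv
    assert np.linalg.matrix_rank(A - lam * np.eye(2)) < 2    # A − λI is singular
```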

## Pivot Basis

**Geometric Multiplicity** 几何重数

The geometric multiplicity of $λ$ as an eigenvalue of $A$ is **a number** that counts the **linearly independent** eigenvectors of $A$ associated to $λ$ (one per nonpivot column of $A − λ · I_n$).

$gm_A(λ) = nullity(A−λ·I_n)$

**Find the Pivot Basis Associated to $λ$**

For $(A − λ · I_n)x = 0$:

- Form $rref[A − λ · I_n | 0]$.
- Write the general solution to $(A − λ · I_n)x = 0$ as $x = c_1·v_1 + c_2·v_2 + ⋯ + c_g·v_g$.

The list of vectors $\{v_1, v_2, …, v_g\}$ is called the **pivot basis** associated to $λ$ (it is a basis for the eigenvectors, not the full set of eigenvectors associated to $λ$).

**Eigenvectors vs. Pivot Basis**

The eigenvectors of $A$ associated to $λ$ are the **linear combinations** of the pivot basis $\{v_1, v_2, …, v_g\}$. That is, every eigenvector of $A$ associated to $λ$ is of the form $v = c_1v_1 + c_2v_2 + ⋯ + c_gv_g$ where $c_1, c_2, …, c_g$ are scalars (not all zero).
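A sketch using SymPy, where the pivot basis is the nullspace of $A − λI_n$ (the example matrix is my own):

```python
# Pivot basis and geometric multiplicity via the nullspace of A − λI.
import sympy as sp

A = sp.Matrix([[2, 0, 0],
               [0, 2, 0],
               [0, 0, 3]])
lam = 2

basis = (A - lam * sp.eye(3)).nullspace()   # pivot basis associated to λ = 2
gm = len(basis)                             # gm_A(λ) = nullity(A − λI)

# Any nonzero linear combination of the pivot basis is an eigenvector for λ:
v = 3 * basis[0] + 5 * basis[1]
assert A * v == lam * v
```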

## Diagonalization (for square matrices)

**Diagonalizable**

An $n × n$ matrix $A$ is diagonalizable if it is possible to find

- an **invertible** $n × n$ matrix $P$, and
- an $n × n$ **diagonal** matrix $D$

satisfying $A = PDP^{-1}$.

The matrix power $A^k$ is given by $A^k = PD^kP^{-1}$ (the inner factors cancel in the repeated product $PDP^{-1}PDP^{-1}⋯$).

**Determining Diagonalizability**

A is diagonalizable if and only if the sum of the geometric multiplicities of the eigenvalues $λ_i$ of $A$ equals $n$: $\sum_i gm_A(λ_i) = n$.

**Diagonalize a Matrix**

Let A be a diagonalizable matrix. To find a diagonalization $A = PDP^{-1}$, start with two empty $n × n$ matrices $P$ and $D$. Then complete the following steps for each eigenvalue $λ$ of $A$.

- Find the pivot basis $\{v_1, v_2, …, v_g\}$ associated to $λ$ (solve $rref[A − λ · I_n | 0]$).
- $P$: insert each vector from the **pivot basis** $\{v_1, v_2, …, v_g\}$ into the first available columns of $P$.
- $D$: insert $λ$ into the first available diagonal entries of $D$ a total of $gm_A(λ)$ times.

## Spectral Theorem and Principal Component Analysis

symmetric: $A^T= A$

orthogonal: $Q^T= Q^{-1}$

**Spectral Theorem (Diagonalization of Symmetric Matrices)**

Let $A$ be a **symmetric** matrix ($A^T= A$). Then there is a diagonalization $A = QDQ^T$ where Q is an **orthogonal** matrix ($Q^{T}= Q^{-1}$) .

It indicates that any given symmetric matrix is diagonalizable and that the eigenvectors chosen during the diagonalization process may be chosen to be orthonormal.

Usefulness: if we can find an orthogonal $Q$, there is no need to compute $Q^{-1}$; we can simply use $Q^T$.
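A small check of the spectral theorem with NumPy's `eigh`, which is specialized for symmetric matrices (example matrix is my own):

```python
# For symmetric A, eigh gives an orthogonal Q with A = Q D Q^T.
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])      # symmetric: A^T = A

lams, Q = np.linalg.eigh(A)   # eigh is specialized for symmetric matrices
D = np.diag(lams)

assert np.allclose(Q.T @ Q, np.eye(2))   # Q is orthogonal, so Q^{-1} = Q^T
assert np.allclose(A, Q @ D @ Q.T)       # A = Q D Q^T, no inverse needed
```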

**The Covariance Matrix**

$A$ → $A_c$: the **centered** data (each column has its mean subtracted).

When using a line to approximate centered data, the line is always of the form $f (x) = mx$.

That is, the line always passes through the origin.

Suppose that data is stored in an $n × p$ matrix $A$, where each column corresponds to a variable.

The **covariance matrix** of A is $Σ= \frac{1}{n-1}{A_c}^TA_c$

**Positive Definite**

- Symmetric: $Σ^T = Σ$ → we can use the spectral theorem to diagonalize $Σ = QDQ^T$ where $Q$ is orthogonal.
- The eigenvalues $λ_i$ of the covariance matrix are **positive** numbers.

**Orthogonal Distance**

- Least squares: uses **vertical** distance to measure the error (linear regression).
- Principal component analysis (PCA): uses **orthogonal** distance to measure the error (orthogonal regression).

**The Orthogonal Error in Approximating Centered Data by f (x) = m · x**

Consider the problem of using a line $f(x) = mx$ to approximate a centered data set, whose points are listed as vectors ${p_1, p_2, . . . , p_n}$.

The orthogonal error in using $f(x) = mx$ to approximate the data set is the sum of the squares of the orthogonal distances from each of these points to the vector $w = (1, m)$ (a vector that represents the line).

That is, $E=∥p_1 −proj_w(p_1)∥^2+∥p_2 −proj_w(p_2)∥^2 +···+∥p_n −proj_w(p_n)∥^2$

**Orthogonal Regression**

Goal: Given a data set, find the line that minimizes the orthogonal error.

**First Principal Component**

Suppose that $λ_1$ is the **largest** eigenvalue of the covariance matrix $Σ$ (which is positive definite). A first principal component of $Σ$ is an eigenpair $(λ_1, v_1)$ of $Σ$.

**Get the regression line**

- Let $A$ be a matrix whose rows consist of data points.
- Let $(λ_1, v_1)$ be a first principal component of the covariance matrix $Σ$.
- Write the vector $v_1$ in coordinates $v_1 = (x, y)$ and define $m = \frac{y}{x}$ .
- $f (x) = mx$ minimizes the orthogonal error.
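The whole recipe can be sketched end to end (the data points are made up and roughly follow $y ≈ 2x$; `eigh` returns eigenvalues in ascending order, so the last column of eigenvectors is the first principal component):

```python
# Orthogonal regression via PCA: center, form Σ, take the top eigenpair, read off m.
import numpy as np

A = np.array([[1., 2.1],
              [2., 3.9],
              [3., 6.0],
              [4., 8.1]])        # rows are data points (x_i, y_i)

Ac = A - A.mean(axis=0)          # center each column
n = A.shape[0]
Sigma = Ac.T @ Ac / (n - 1)      # covariance matrix

lams, V = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
v1 = V[:, -1]                    # eigenvector for the largest eigenvalue λ1
m = v1[1] / v1[0]                # slope of the orthogonal regression line f(x) = m·x
```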