In signal processing, data science, and machine learning, it is ubiquitous to represent quantitative data as a vector. For the purposes of these notes, we shall define a vector as an ordered, finite collection of numerical values along a single dimension. A vector will be represented by a bold letter. For example, $$\mathbf{x} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7 \end{bmatrix}$$ is a length-7 vector. The primary difference between vectors and our signal notation $x[n]$ is that we generally represent $x[n]$ as an infinitely long set of values, indexed by $n$. In contrast, vectors are usually finite in length (although they can be infinitely long as well). Otherwise, $x[n]$ and $\mathbf{x}$ are very similar.
Vectors represent a wide variety of data in engineering, such as a time sequence of samples. A vector can be written as either a row or a column; we will generally assume vectors are column vectors (unless transposed) in these notes.
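As a concrete illustration, here is a minimal sketch in Python with NumPy (an assumption of these examples, not a requirement of the notes) showing the length-7 vector above in column and row form:

```python
import numpy as np

# The length-7 vector from the example above.
x = np.array([1, 2, 3, 4, 5, 6, 7])

x_col = x.reshape(-1, 1)   # explicit column vector, shape (7, 1)
x_row = x_col.T            # transpose: row vector, shape (1, 7)

print(x_col.shape, x_row.shape)   # (7, 1) (1, 7)
```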
The inner product (or dot product) of two vectors is an operation that multiplies the two vectors element-wise and sums the results. The inner product is represented by $$\mathbf{x} \cdot \mathbf{y} = \mathbf{x}^T\mathbf{y} = x_1y_1+x_2y_2+x_3y_3+\ldots$$ where $(\cdot)^T$ represents the transpose operation. The transpose of a matrix interchanges its rows and columns. For example, the transpose of a column vector is a row vector, altering the dimensions from $n \times 1$ to $1 \times n$. The inner product is a frequently used operation in pattern recognition and, by extension, machine learning. The inner product of two vectors produces a scalar and is often considered a measure of similarity between vectors. The inner product of a vector with itself is a measure of the vector's size. Additionally, two vectors are said to be orthogonal if their inner product is zero.
For example, $$\mathbf{x}^T\mathbf{y} = \begin{bmatrix} 1 &-3 &0 &2 &5 \end{bmatrix}\begin{bmatrix}0 \\1 \\4 \\-4 \\0 \end{bmatrix} = 0 - 3 + 0 - 8 + 0 = -11 \; .$$ It is standard to utilize the inner product to compute measures of distance. Specifically, we see that $$\|\mathbf{x}\|_2 = \sqrt{\mathbf{x}^T\mathbf{x}} = \sqrt{x_1^{2} + x_2^{2} + x_3^{2} + \cdots} \; .$$ This is known as the 2-norm of the vector $\mathbf{x}$; the notation $\| \cdot \|$ refers to a norm. We can observe that when the vector $\mathbf{x}$ has two components, the 2-norm is equivalent to the familiar two-dimensional Euclidean norm. Hence, the 2-norm measures the straight-line distance between two vectors (usually from the origin, known as the zero vector), each representing a point in an abstract space whose number of dimensions equals the length of the vector.
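Both the worked inner product and the 2-norm are easy to verify numerically; a brief sketch using the example values above:

```python
import numpy as np

x = np.array([1, -3, 0, 2, 5])
y = np.array([0, 1, 4, -4, 0])

inner = x @ y             # element-wise multiply and sum: 0 - 3 + 0 - 8 + 0
norm2 = np.sqrt(x @ x)    # 2-norm via the inner product of x with itself

print(inner)                          # -11
print(norm2, np.linalg.norm(x, 2))    # same value both ways
```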
Norms are valuable as they provide a means to quantify the size of errors in the data. Many machine learning algorithms, for example, aim to minimize the squared error $\mathbf{e}^T \mathbf{e}$ (where $\mathbf{e} = \mathbf{x} - \widehat{\mathbf{x}}$ is an error vector between some data $\mathbf{x}$ and a prediction $\widehat{\mathbf{x}}$).
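A brief sketch of the squared error, where the prediction values in `x_hat` are made up purely for illustration:

```python
import numpy as np

x     = np.array([1.0, -3.0, 0.0, 2.0, 5.0])
x_hat = np.array([0.9, -2.5, 0.2, 2.1, 4.0])   # illustrative prediction values

e = x - x_hat             # error vector
squared_error = e @ e     # e^T e, the quantity many learning algorithms minimize
print(squared_error)
```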
Other norms, or distances, exist and are used for various purposes in machine learning. Two of the most commonly used are the 1-norm and the $\infty$-norm. The 1-norm is defined by $$\|\mathbf{x}\|_1 = |x_1| + |x_2| + |x_3| + \cdots \; .$$ This is often known as the Manhattan distance: in two-dimensional space, it is the distance traversed between two points when travel is restricted to 90-degree turns, as on a city grid. In machine learning, the 1-norm is often minimized when we desire the resulting vector to be mostly zeros (i.e., sparse).
The $\infty$-norm is defined by $$\| \mathbf{x} \|_{\infty} = \max \left( |x_1|, |x_2|, |x_3|, \ldots \right) \; .$$ In two dimensions, it is the larger of the two axis-aligned distances. In machine learning, the $\infty$-norm is often minimized when we desire the resulting vector to be resilient to worst-case changes, since it bounds the largest element.
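Both norms are one-liners to compute; the sketch below evaluates them on the example vector from earlier and cross-checks against NumPy's built-in `np.linalg.norm`:

```python
import numpy as np

x = np.array([1, -3, 0, 2, 5])

norm1   = np.sum(np.abs(x))   # |1| + |-3| + |0| + |2| + |5| = 11
norminf = np.max(np.abs(x))   # largest magnitude element = 5

print(norm1,   np.linalg.norm(x, 1))        # both 11
print(norminf, np.linalg.norm(x, np.inf))   # both 5
```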
The inner product and 2-norm together are used to define the correlation coefficient between two vectors. The correlation coefficient between vectors $\mathbf{x}$ and $\mathbf{y}$ is defined by $$c = \frac{\mathbf{x}^T \mathbf{y}}{\| \mathbf{x} \|_2 \| \mathbf{y} \|_2} = \cos(\theta) \; .$$ The correlation coefficient is a measure of similarity between $\mathbf{x}$ and $\mathbf{y}$ such that $-1 \leq c \leq 1$. The value $\theta$ is the angle between $\mathbf{x}$ and $\mathbf{y}$. This measure of similarity ignores differences in scaling (i.e., multiplying all elements of either vector by a single positive number).
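A short sketch computing $c$ and the implied angle $\theta$ for the example vectors used earlier; the final line verifies that a positive scaling of $\mathbf{x}$ leaves $c$ unchanged:

```python
import numpy as np

x = np.array([1.0, -3.0, 0.0, 2.0, 5.0])
y = np.array([0.0, 1.0, 4.0, -4.0, 0.0])

c = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.degrees(np.arccos(c))   # angle between the vectors, in degrees

print(c, theta)

# Scaling either vector by a positive constant leaves c unchanged:
print((3 * x) @ y / (np.linalg.norm(3 * x) * np.linalg.norm(y)))
```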
When $c = 1$, the two vectors are equivalent, neglecting a positive scaling factor. Geometrically, this is the same as stating that the two vectors point in the same direction ($\theta = 0$ degrees). When $c = -1$, the two vectors are the negative of each other, neglecting a positive scaling factor. Geometrically, this is the same as stating that the two vectors point in opposite directions ($\theta = 180$ degrees). When $c = 0$, the two vectors are orthogonal to each other ($\theta = 90$ degrees). When orthogonal, one vector has no linear relationship with the other vector.
The correlation coefficient is widely used in pattern recognition to determine how closely two vectors match. In this manner, the correlation coefficient can be used to create a simple classifier. If, for example, we have one known example from each of four classes, we can compute the correlation coefficient between an unknown example and each known example, and assign the unknown to the class with the highest correlation, as sketched below. This is the fundamental premise for most classification algorithms. The two major differences among them are (1) how we measure similarity and (2) how we incorporate multiple examples of each class (rather than the one example per class in the demonstration above).
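A minimal sketch of such a correlation-based classifier, assuming one stored template per class; the template values, the unknown example, and names like `templates` are purely illustrative:

```python
import numpy as np

def correlation(x, y):
    # Correlation coefficient: inner product normalized by the 2-norms.
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

templates = {                      # hypothetical known examples, one per class
    "A": np.array([1.0, 0.0, 1.0, 0.0]),
    "B": np.array([0.0, 1.0, 0.0, 1.0]),
    "C": np.array([1.0, 1.0, 0.0, 0.0]),
    "D": np.array([0.0, 0.0, 1.0, 1.0]),
}

unknown = np.array([0.9, 0.1, 1.1, 0.0])   # made-up unlabeled example

# Predict the class whose template correlates most strongly with the unknown.
scores = {label: correlation(unknown, t) for label, t in templates.items()}
print(max(scores, key=scores.get), scores)
```

Because the correlation coefficient ignores positive scaling, this classifier is insensitive to the overall amplitude of the unknown example.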