Artificial Neural Networks

Neural computing is a machine learning paradigm which aims to emulate neurological function and learning. An “Artificial Neural Network” (ANN/NN) is constructed by appropriately connecting a group of adaptable nodes (“artificial neurons”). An ANN is a model of neural computation built from simplified units which mimic the integration and activation properties of real neurons. The simplest type of ANN, composed of a single layer of neurons, is the perceptron. The simplest architecture (a network of neurons) is a directed acyclic graph in which neurons are arranged into layers and do not connect to one another within a layer. Directed edges connect neurons in a layer $H$ to neurons in a layer $H+1$. Layers of neurons are called “hidden” when they are neither input nor output layers. The number of neurons in a layer is its width. Over the years, numerous neural network architectures have been proposed, including graph neural networks and recurrent neural networks, which allow for cyclic connections amongst neurons. Importantly, the network is always directed.
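
To make the layered, acyclic structure concrete, here is a minimal sketch (my own illustrative construction, not taken from any library): a feedforward architecture described only by its layer widths, with the directed edges enumerated explicitly.

```python
import itertools

# Widths of each layer: 2 inputs, one hidden layer of 3 neurons, 1 output.
# Layer 0 is the input layer, the last layer is the output layer.
widths = [2, 3, 1]

# Every neuron in layer H connects to every neuron in layer H+1; there are
# no intralayer or backward edges, so the graph is a directed acyclic graph.
edges = [
    ((H, j), (H + 1, i))  # edge from neuron j in layer H to neuron i in layer H+1
    for H in range(len(widths) - 1)
    for j, i in itertools.product(range(widths[H]), range(widths[H + 1]))
]

print(len(edges))  # 2*3 + 3*1 = 9 directed edges
```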

Neural networks are a particularly alluring class of machine learning algorithms because of the Universal Approximation Theorem (UAT) [1]. The original version states that a neural network with a single hidden layer and a sufficiently large (but finite) number of neurons can approximate any continuous function defined on a bounded domain to arbitrary accuracy. Various extensions of this theorem give bounds on a neural network’s width or depth. Note that the UAT assumes the function to be approximated, $g: A\rightarrow \mathbb{R}$, is bounded: $\exists\, m, M$ s.t. $m\leq g(x)\leq M$ $\forall x\in A$.
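
The following toy experiment illustrates the spirit of the UAT (it is my own sketch, not Cybenko’s construction): a single hidden layer of sigmoid neurons with randomly chosen input weights, whose output weights are then fitted by least squares, approximates the continuous, bounded function $g(x)=\sin(x)$ on the bounded domain $[-3, 3]$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-3.0, 3.0, 500)[:, None]      # points in the bounded domain A
g = np.sin(x)                                  # target function, bounded by m = -1, M = 1

width = 50                                     # number of hidden neurons
W1 = rng.normal(scale=3.0, size=(1, width))    # random input weights
b1 = rng.normal(scale=3.0, size=width)         # random biases
H = sigmoid(x @ W1 + b1)                       # hidden activations, shape (500, width)

# Fit only the output-layer weights by linear least squares.
w2, *_ = np.linalg.lstsq(H, g, rcond=None)
approx = H @ w2

print("max |g - approx| =", float(np.max(np.abs(g - approx))))  # small for modest width
```

Increasing `width` drives the error down further, which is the qualitative behaviour the theorem guarantees.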

A neuron is a non-linear function $f: \textbf{x}\in\mathbb{R}^d \rightarrow \mathbb{R}$. A neuron $i$ at a layer $N+1$ takes as input the outputs of all of its predecessor neurons in layer $N$ and produces a value that will in turn be an input to neurons in the succeeding layer: $h_{i}^{N+1}=\sigma\big(\sum_j w_{ij} h_j^N + b_i\big)$, where $h_j^{N}$ is the state of a neuron $j$ in layer $N$, $w_{ij}$ is the weight of the edge $(j,i)$, $b_i$ is the “bias” term of neuron $i$, and $\sigma$ is the non-linearity, such as a sigmoid, step function, tanh, ReLU, etc. One can collect the states of all neurons in a layer into a vector and rewrite the mapping in vector form: $\textbf{h}^{N+1} = \sigma(\textbf{W} \cdot \textbf{h}^{N} + \textbf{b} )$. If the input to a neural network is $\textbf{x}\in \mathbb{R}^{n}$ and there are $d_1$ neurons in the first layer, then
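
A minimal sketch of this update, with made-up numbers (layer $N$ has 3 neurons, layer $N+1$ has 2), showing that the scalar and vector forms agree:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h_N = np.array([0.2, -1.0, 0.5])        # states h_j^N of the 3 neurons in layer N
W = np.array([[0.1, -0.3, 0.8],         # W[i, j] = w_ij, the weight of edge (j, i)
              [0.5,  0.2, -0.4]])
b = np.array([0.05, -0.1])              # biases b_i of the 2 neurons in layer N+1

# Scalar form: h_i^{N+1} = sigma(sum_j w_ij h_j^N + b_i)
h_next_scalar = np.array([sigmoid(W[i] @ h_N + b[i]) for i in range(2)])

# Vector form: h^{N+1} = sigma(W h^N + b)
h_next_vector = sigmoid(W @ h_N + b)

assert np.allclose(h_next_scalar, h_next_vector)
print(h_next_vector)
```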

$\textbf{h}^{1} = \sigma(\textbf{W}_1 \cdot \textbf{x} + \textbf{b}_1 )$, $\textbf{W}_1\in\mathbb{R}^{d_1 \times n}$, $\textbf{b}_1\in\mathbb{R}^{d_1}$,

$\textbf{h}^{2} = \sigma(\textbf{W}_2 \cdot \textbf{h}^1 + \textbf{b}_2 )$, $\textbf{W}_2\in\mathbb{R}^{d_2 \times d_1}$, $\textbf{b}_2\in\mathbb{R}^{d_2}$,

finally, after some number of layers, the neurons at the last layer $N$ produce an $m$-dimensional output $\hat{\textbf{y}}$:

$\hat{\textbf{y}}= \textbf{h}^{N} = \sigma(\textbf{W}_N \cdot \textbf{h}^{N-1} + \textbf{b}_N )$, $\textbf{W}_N\in\mathbb{R}^{m \times d_{N-1}}$, $\textbf{b}_N\in\mathbb{R}^{m}$.
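
Chained together, these equations are just a loop over layers. A minimal sketch of the full forward pass, with made-up dimensions $n=4$, $d_1=5$, $d_2=3$, $m=2$ and random parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d1, d2, m = 4, 5, 3, 2

x = rng.normal(size=n)
W1, b1 = rng.normal(size=(d1, n)),  rng.normal(size=d1)   # W_1 in R^{d1 x n}
W2, b2 = rng.normal(size=(d2, d1)), rng.normal(size=d2)   # W_2 in R^{d2 x d1}
W3, b3 = rng.normal(size=(m, d2)),  rng.normal(size=m)    # W_3 in R^{m x d2}

h1 = sigmoid(W1 @ x + b1)      # shape (d1,)
h2 = sigmoid(W2 @ h1 + b2)     # shape (d2,)
y_hat = sigmoid(W3 @ h2 + b3)  # shape (m,): the network output

print(y_hat.shape)  # (2,)
```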

One can see that a neural network represents a highly non-linear function $\psi: \textbf{x}\in \mathbb{R}^{n}\rightarrow \hat{\textbf{y}} \in\mathbb{R}^{m}$. Coming back to the UAT, it states that if we have some bounded function $g: \textbf{x} \in \mathbb{R}^{n}\rightarrow\textbf{y} \in \mathbb{R}^{m}$, there exists a neural network $\psi$ that can approximate $g$ with arbitrary accuracy. In other words, for the same input $\textbf{x}$, $\psi$ produces an output $\hat{\textbf{y}}\approx \textbf{y}$.

To achieve such an approximation, the neural network must be trained, i.e. its parameters (the entries of the weight matrices $\textbf{W}$ and the bias vectors $\textbf{b}$) have to be adjusted in a standard supervised learning setting. The most common method for computing gradients in deep ANNs is backpropagation, which I will write about next time. Backpropagation enables the minimization of a loss function of the form $\mathcal{L}= \mathbf{E}_{ (\textbf{x},\textbf{y})\in\mathcal{D}} \big[ ||\textbf{y}- \psi(\textbf{x}) ||\big]$, where $||\cdot||$ is a norm. The parameters that achieve the minimal loss on the dataset (minimal with a caveat from statistical learning theory, which relates the quality of the approximation to the finiteness of the training data and to overfitting) yield our best approximation of $g$.
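
Before getting to backpropagation, the loss itself is easy to write down. A minimal sketch, assuming a finite dataset, the Euclidean norm, and the forward pass from the earlier snippets (the data and dimensions here are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def psi(x, params):
    """Forward pass through the layers (W_1, b_1), ..., (W_N, b_N)."""
    h = x
    for W, b in params:
        h = sigmoid(W @ h + b)
    return h

rng = np.random.default_rng(0)
n, d1, m = 4, 5, 2
params = [(rng.normal(size=(d1, n)), rng.normal(size=d1)),
          (rng.normal(size=(m, d1)), rng.normal(size=m))]

dataset = [(rng.normal(size=n), rng.normal(size=m)) for _ in range(10)]

# Empirical expectation over the dataset: the mean norm of the error ||y - psi(x)||.
loss = np.mean([np.linalg.norm(y - psi(x, params)) for x, y in dataset])
print(loss)
```

Training means adjusting `params` so that this number goes down; backpropagation supplies the gradients needed to do so efficiently.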

References

[1] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.