2.2. Choice of Activation Functions

The activation functions in neurons are the building blocks of an ANN model. Similar to the neurons in a biological system, the activation function determines whether a neuron should be
turned on or off according to the inputs. In its simplest form, such an on/off response can be represented by a threshold function, also known as a Heaviside function in the ANN literature:

$$G\left(\gamma_{h,0}+x_t'\gamma_h\right)=\begin{cases}1, & \text{if } \gamma_{h,0}+x_t'\gamma_h\ge c,\\ 0, & \text{if } \gamma_{h,0}+x_t'\gamma_h< c,\end{cases} \tag{4}$$

where c is the threshold and the remaining variables are defined previously. In more complex systems, the neurons may need to output bounded real values rather than an on/off signal, so it is common to select sigmoid (S-shaped) and squashing (bounded) activation functions; it is also required that an activation function be bounded and differentiable. The two most commonly used sigmoid functions in ANN models are the logistic function and the hyperbolic tangent (Tanh) function, given in (5) and (6):

$$G\left(\gamma_{h,0}+x_t'\gamma_h\right)=\frac{1}{1+e^{-(\gamma_{h,0}+x_t'\gamma_h)}}, \tag{5}$$

$$G\left(\gamma_{h,0}+x_t'\gamma_h\right)=\frac{e^{(\gamma_{h,0}+x_t'\gamma_h)}-e^{-(\gamma_{h,0}+x_t'\gamma_h)}}{e^{(\gamma_{h,0}+x_t'\gamma_h)}+e^{-(\gamma_{h,0}+x_t'\gamma_h)}}. \tag{6}$$
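As an illustrative sketch (not part of the original text), the three activation functions in (4)-(6) can be written in a few lines of Python/NumPy; the bias γ_{h,0}, weight vector γ_h, and input x_t below are arbitrary example values.

```python
import numpy as np

def heaviside(z, c=0.0):
    """Threshold (Heaviside) activation of (4): 1 if z >= c, else 0."""
    return np.where(z >= c, 1.0, 0.0)

def logistic(z):
    """Logistic sigmoid of (5): maps z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent of (6): maps z into (-1, 1)."""
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

# One hidden neuron h with bias gamma_{h,0}, weights gamma_h, and input x_t
gamma_h0 = 0.5
gamma_h = np.array([0.2, -0.3])
x_t = np.array([1.0, 2.0])
z = gamma_h0 + x_t @ gamma_h          # gamma_{h,0} + x_t' gamma_h
print(heaviside(z), logistic(z), tanh(z))
```

Note the bounded ranges: the Heaviside output is {0, 1}, the logistic output lies in (0, 1), and the Tanh output lies in (-1, 1), which is why the latter two qualify as squashing functions.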
2.3. Learning Process to Update the Weights of Interconnections

Training ANNs can be divided into supervised training and unsupervised training. Supervised learning requires pairs of training samples, each pair composed of inputs and desired outputs (i.e., observations); the learning process adjusts the interconnection weights to reduce the difference between the outputs inferred by the ANN model and the actual observations. Unsupervised learning, in contrast, seeks hidden structure in unlabeled data with, for example, statistical inference. In the context of this paper, the authors review only a subset of the influential supervised learning algorithms. To approximate complex systems effectively, the interconnection weights in the ANNs have to be estimated from the existing observations. A simple example with a single target output y and the network function y = f_{G,q}(x; θ) is used to illustrate how to update the weights. θ is the
vector of interconnection weights. Once the activation function G and the structure of the hidden layers are determined and a training sample of T observations is given, the optimal θ can be obtained by minimizing the mean squared error (MSE) in (7); setting the first-order derivative of (7) with respect to θ to zero yields the condition in (8), which is solved recursively by (9):

$$\frac{1}{T}\sum_{i=1}^{T}\left[y_i-f_{G,q}(x_i;\theta)\right]^2, \tag{7}$$

$$\mathbb{E}\left[\nabla f_{G,q}(x;\theta)\left(y-f_{G,q}(x;\theta)\right)\right]=0, \tag{8}$$

where ∇f_{G,q}(x; θ) is the gradient vector of f_{G,q} with respect to θ. Rumelhart et al. designed a recursive gradient-descent-based algorithm to estimate θ̂ as follows [10]:

$$\hat{\theta}_{t+1}=\hat{\theta}_t+\eta_t\,\nabla f_{G,q}(x_t;\hat{\theta}_t)\left[y_t-f_{G,q}(x_t;\hat{\theta}_t)\right], \tag{9}$$

where η_t is the learning rate. Equation (9) is the so-called backpropagation algorithm and is a generalized form of the "delta rule" of the single-layer perceptron model [5].
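To make the recursion in (9) concrete, the sketch below assumes (only for illustration, not as the paper's specification) the common single-hidden-layer form f_{G,q}(x; θ) = β_0 + Σ_{h=1}^{q} β_h G(γ_{h,0} + x'γ_h) with logistic hidden units G, a toy target, and a fixed learning rate η; each pass of the loop applies one update of (9) with a single observation (x_t, y_t).

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, beta0, beta, gamma0, gamma):
    """f_{G,q}(x; theta) for one hidden layer of q logistic units and a linear output."""
    hidden = logistic(gamma0 + gamma @ x)            # shape (q,)
    return beta0 + beta @ hidden, hidden

def gradient(x, hidden, beta):
    """Gradient of f with respect to theta = (beta0, beta, gamma0, gamma)."""
    d_hidden = hidden * (1.0 - hidden)               # derivative of the logistic unit
    return 1.0, hidden, beta * d_hidden, np.outer(beta * d_hidden, x)

rng = np.random.default_rng(0)
q, p = 3, 2                                          # number of hidden units, input dimension
beta0, beta = 0.0, rng.normal(size=q)
gamma0, gamma = np.zeros(q), rng.normal(size=(q, p))
eta = 0.05                                           # learning rate eta_t (held constant here)

for t in range(1000):                                # one observation (x_t, y_t) per step
    x_t = rng.normal(size=p)
    y_t = np.sin(x_t[0]) + 0.5 * x_t[1]              # toy target the network approximates
    f_t, hidden = forward(x_t, beta0, beta, gamma0, gamma)
    err = y_t - f_t                                  # y_t - f_{G,q}(x_t; theta_t) in (9)
    g_b0, g_b, g_g0, g_g = gradient(x_t, hidden, beta)
    beta0  += eta * g_b0 * err                       # theta_{t+1} = theta_t + eta * grad * err
    beta   += eta * g_b  * err
    gamma0 += eta * g_g0 * err
    gamma  += eta * g_g  * err
```

Because the gradient is computed from the output error back through the hidden-layer weights, the loop reproduces in miniature the error backpropagation that [10] describes for more general architectures.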