Parameters of a neural network

In Why is a neural network a function? we saw that there should be a function $\mathbf{f}$ that maps $\mathbf{x}$ to $\mathbf{y}$, and that we have no idea what it looks like. What we do know is that "a function manipulates its input $\mathbf{x}$ and outputs $\mathbf{y}$". So what is the source of the ambiguity we are facing? The answer is that we do not know which manipulations are applied to $\mathbf{x}$ to produce $\mathbf{y}$. If we knew these manipulations, we would know the function and we would be done. That last sentence is the key to this post: the goal is to get an idea of what the term "manipulations" means, which I explain below.

The simplest way of thinking about "manipulations" of the input $\mathbf{x}$ is to think of them as interactions with some other "unknown" parameters that result in $\mathbf{y}$. Why did I emphasize the word unknown? Because we want to replace "unknown manipulations" with "a set of unknown parameters". This is our first step towards knowing the unknown function $\mathbf{f}$. The dependence on "an unknown set of parameters" can be denoted by writing $\mathbf{f}$ as $\mathbf{f}_{\pmb{\theta}^*}$, where "the unknown set of parameters" $\pmb{\theta}^*$ represents the "unknown manipulations" that are applied to $\mathbf{x}$ to get $\mathbf{y}$. Hence, mathematically, one can write $\mathbf{y}= \mathbf{f}_{\pmb{\theta}^*}(\mathbf{x})$.
Pictorially, we wish to have the following picture.

Although wishful thinking is a human tendency, the above wish can be fulfilled only if we know $\pmb{\theta}^*$. However, $\pmb{\theta}^*$ is unknown, so we need a strategy to find it. Before I offer a remedy, recall that $\pmb{\theta}^*$ is a set of well-chosen parameters in $\mathbf{f}_{\pmb{\theta}^*}$ that yields $\mathbf{y}$ when $\mathbf{x}$ is passed in. Since we do not know $\pmb{\theta}^*$, we stop here and go with the following rationale. It comes from the fact that, at the end of the day, we want $\mathbf{y}$ (the class of the flower), and we are not concerned with which function produces the correct $\mathbf{y}$ for the input $\mathbf{x}$.

To lay out the rationale, suppose that we plug in some "known" arbitrary $\pmb{\theta}$ instead of $\pmb{\theta}^*$, i.e., we use $\mathbf{f}_{\pmb{\theta}}$. What happens then? In general we get a different output $\hat{\mathbf{y}}$, i.e., $\hat{\mathbf{y}}= \mathbf{f}_{\pmb{\theta}}(\mathbf{x})$. What would you wish for now, given this last formula? We wish $\hat{\mathbf{y}}$ to be as close as possible to $\mathbf{y}$, where $\hat{\mathbf{y}}= \mathbf{f}_{\pmb{\theta}}(\mathbf{x})$ and $\mathbf{y}= \mathbf{f}_{\pmb{\theta}^*}(\mathbf{x})$. Notice that if we let $\pmb{\theta}$ be equal to $\pmb{\theta}^*$, this new wish is trivially fulfilled, but $\pmb{\theta}^*$ is not known, so we cannot fulfill it that way. However, if we could fulfill it by some other means, no one would care about the function anymore. This shifts our focus from $\mathbf{f}_{\pmb{\theta}^*}$ to $\mathbf{y}$ and $\hat{\mathbf{y}}$. I will get back to $\mathbf{f}_{\pmb{\theta}^*}$ in other posts. For now, try to understand what I said: "we want $\hat{\mathbf{y}}$ to be as close as possible to $\mathbf{y}$". But what does closeness mean here? How can we quantify closeness and then measure it?
To answer these questions, first notice that $\mathbf{y}$ is a probability vector in $\mathbb{R}^3$: its components are nonnegative and add up to one. Therefore $\hat{\mathbf{y}}$ has to be a probability vector too, so that we can compare them. Thus we need a tool that tells us how close two probability vectors are to one another. I will introduce this tool in the next post.
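The idea of plugging an arbitrary $\pmb{\theta}$ into $\mathbf{f}_{\pmb{\theta}}$ can be sketched in a few lines of Python. This is only an illustration under assumptions of my own, not the actual network: I assume that $\mathbf{f}_{\pmb{\theta}}$ is a linear map followed by a softmax (one common way to force the output to be a probability vector), that $\mathbf{x}$ has 4 features, and the numbers in `x` and `y` are made up. The names `f` and `softmax` are hypothetical helpers.

```python
import math
import random


def softmax(scores):
    """Turn arbitrary real scores into a probability vector."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def f(theta, x):
    """Hypothetical f_theta: a linear map followed by softmax (an assumption).

    theta is a list of 3 rows (one per class), each holding len(x) weights.
    """
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in theta]
    return softmax(scores)


random.seed(0)
x = [5.1, 3.5, 1.4, 0.2]  # made-up input with 4 features (assumed size)
y = [1.0, 0.0, 0.0]       # made-up target: a probability vector in R^3

# An arbitrary "known" theta, standing in for the unknown theta*.
theta = [[random.uniform(-1.0, 1.0) for _ in x] for _ in range(3)]

y_hat = f(theta, x)
print("y_hat =", y_hat)
print("sum of components =", sum(y_hat))
```

Whatever $\pmb{\theta}$ we pick, `y_hat` has nonnegative components that add up to one, so it lives in the same space as `y` and the two can be compared. How to measure the distance between them is left open here, as in the text.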

Recap

1. We started with "an unknown function" $\mathbf{f}$ that takes $\mathbf{x}$ as the input and outputs $\mathbf{y}$.
2. We know that "the unknown $\mathbf{f}$" does some manipulations on $\mathbf{x}$, so we assumed these manipulations are equivalent to interactions with some "unknown parameters" $\pmb{\theta}^*$. Thus, $\mathbf{f}$ becomes $\mathbf{f}_{\pmb{\theta}^*}$.
3. Finally, we shifted our focus from $\mathbf{f}_{\pmb{\theta}^*}$ to $\mathbf{y}$ and $\hat{\mathbf{y}}$ in order to devise a strategy for finding $\pmb{\theta}^*$.
