How many inputs does my model have? How many outputs does my model have? These two related questions can often lead to great confusion when setting up a model such as a neural network or a support vector machine. These models work by accepting a fixed number of inputs and returning a fixed number of outputs based on those inputs. If you need to review how models work this article may help.
Ideally your model would have the same number of inputs and outputs as your collected data. However, this is rarely the case. This is because data are rarely presented to the model in exactly the same way as you originally received the data.
There are two primary reasons that the actual input and output to the model differ from your collected data.
- Normalization
- Time Series
If you are dealing with neither normalization nor time series then you might well have a 1:1 relationship between your collected data inputs and outputs and the model. Even if you are normalizing you might still be 1:1. However, if you are using time series, then your model likely has considerably more inputs and outputs than your collected data.
Normalization
Normalization essentially takes a value from one number range and converts it to another number range. Many models work best when input and output fall in a small fixed range, such as 0 to 1. Because of this, normalization is often required. Consider a test score of 30 points on a 40-point test. Is this a good score? Normally you would normalize such a score to a 100-point range. You simply divide 30 by 40 and get 0.75, which is a score of 75%. This likely gives you an idea of how good the score was! See how much easier it is to deal with a normalized score? Your model likely feels the same way! This is range normalization. Range normalization does not change your input or output count.
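Range normalization is simple enough to sketch in a few lines. The function below is only an illustration; the name `normalize_range` and its parameters are my own and do not come from any particular library. It maps a value from the range of the collected data onto a target range, which defaults to 0 to 1.

```python
def normalize_range(value, data_low, data_high, norm_low=0.0, norm_high=1.0):
    """Map value from [data_low, data_high] onto [norm_low, norm_high]."""
    if data_high == data_low:
        raise ValueError("data_low and data_high must differ")
    # Fraction of the way from data_low to data_high.
    fraction = (value - data_low) / (data_high - data_low)
    return norm_low + fraction * (norm_high - norm_low)


# The test score from the text: 30 points on a test scored from 0 to 40.
score = normalize_range(30, 0, 40)
print(score)  # 0.75
```

One number goes in and one number comes out, which is why range normalization leaves the input and output counts alone.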
Normalizing a nominal value, on the other hand, does change your count. A nominal value is non-numeric. Values such as gender, color, pay grade and species are all examples of nominal values. Normalization takes these values and transforms them into numbers. Consider the famous Iris data set. Each line in the table contains data for one individual iris flower. There are four measurements. These four measurements would be the four inputs. All four inputs would be range normalized, and thus would still be four inputs. However, the output would be the species of iris that each line corresponds to. For this data set we consider the species setosa, versicolor and virginica. To normalize a nominal value you must know how many different nominal values there are. In this case there are three. If we were to use one-of-n encoding, we would have three outputs from the one species value. If we were to use equilateral encoding, we would have two outputs. Either way, the single value of iris species has expanded into either two or three different values. Assuming we used one-of-n, the four measurements would map to four inputs and the single species value would map to three outputs.
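The one-of-n mapping for the iris data can be sketched as follows. This is a minimal illustration, not the behavior of any specific library. The helper names, the ordering of the species list and the four measurement values are all assumptions made for the example, and the measurements are assumed to be range normalized already.

```python
# The order of the species is an arbitrary choice for this sketch.
SPECIES = ["setosa", "versicolor", "virginica"]


def one_of_n(value, categories):
    """Return a list with 1.0 at the position of value and 0.0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if category == value else 0.0 for category in categories]


def encode_iris_row(measurements, species):
    """Return (inputs, outputs) for one flower."""
    # The four measurements pass straight through as four inputs.
    return list(measurements), one_of_n(species, SPECIES)


# Made-up, already normalized measurements for a single flower.
inputs, outputs = encode_iris_row([0.22, 0.63, 0.07, 0.04], "versicolor")
print(len(inputs), len(outputs))  # 4 3
print(outputs)  # [0.0, 1.0, 0.0]
```

The single species column has become three outputs, one per possible species.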
Sometimes you mix nominal and ranged values. Consider a more complex system that might try to predict the number of sick days for an employee.
Now we have a somewhat more complex mapping. The output is 1:1, because sick days is just a simple number. Sick days will probably be normalized to a range, but it is still just a single number. The inputs are slightly more complex.
The first column is age. This maps to a single input. Age will likely be range normalized. Likewise, number of children is just a number to be range normalized. Neither age nor number of children requires additional inputs. Gender is nominal because it is non-numeric. However, because there are just two values, we do not need to use one-of-n. We could just specify 0 to mean male and 1 to mean female. The fourth column specifies level. We don’t want to simply convert this to a number. There are four levels, among them hourly employee, manager and subject matter expert (SME). There is not a logical progression. An SME is hired as an expert and is likely well compensated, but is not on the “management track”. As a result, these four levels are not a progression. We need four values to perform a one-of-n encoding. That gives seven inputs in total: one each for age, number of children and gender, plus four for level.
As you can see, some of the values flow directly into the model. Other values expand on their way to the model. You must be aware of this as you prepare your data for the model. Some software will perform such mappings automatically, but often you must deal with these mappings yourself.
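Here is a rough sketch of how one employee record could expand on its way to the model. Everything named here is hypothetical: the function names, the age and children ranges used for normalization, and the level labels. In particular, `"level_4"` is only a placeholder for the fourth level.

```python
# Hypothetical labels; "level_4" is a placeholder for the fourth level.
LEVELS = ["hourly", "manager", "sme", "level_4"]
GENDER_CODES = {"male": 0.0, "female": 1.0}


def normalize_range(value, low, high):
    """Map value from [low, high] onto [0, 1]."""
    return (value - low) / (high - low)


def encode_employee(age, children, gender, level):
    """Turn four collected columns into the model's input list."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    level_flags = [1.0 if name == level else 0.0 for name in LEVELS]
    return [
        normalize_range(age, 18, 70),      # assumed age range
        normalize_range(children, 0, 10),  # assumed range for children
        GENDER_CODES[gender],              # two values, so one input is enough
    ] + level_flags                        # one-of-n: four inputs


row = encode_employee(age=44, children=2, gender="female", level="manager")
print(len(row))  # 7
```

Four collected columns become seven model inputs, and only the level column is responsible for the growth.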
Time Series
In the previous section the one row of collected data resulted in one set of inputs/outputs for the model. In time series, this is not the case. Time series seeks to predict. As a result, several rows of data will be used to make this prediction. These groupings of rows are called windows. Every window is of a specific size. The rows that you use to make a prediction are called the input window. The input window is used to predict a number of rows into the future. These future rows are called the future window. The size of these two windows does not need to be the same. For example, you might use the last ten rows to predict the next three. In this case you would have an input window of ten and a future window of three.
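To make the two windows concrete, here is a small sketch that works purely with row indices. The function names are my own, and the figure of 100 rows is just an assumed data set size for illustration.

```python
def split_window(start, input_size, future_size):
    """Return (input_rows, future_rows) as lists of row indices."""
    split = start + input_size
    input_rows = list(range(start, split))
    future_rows = list(range(split, split + future_size))
    return input_rows, future_rows


def count_windows(total_rows, input_size, future_size):
    """Number of complete input/future pairs that fit in total_rows rows."""
    return max(0, total_rows - (input_size + future_size) + 1)


# An input window of ten and a future window of three, starting at row 0.
input_rows, future_rows = split_window(0, 10, 3)
print(input_rows)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(future_rows)  # [10, 11, 12]

# With an assumed 100 rows of data, this many pairs can be built.
print(count_windows(100, 10, 3))  # 88
```

The future window always begins on the row right after the input window ends.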
Time series is often used in finance, to predict future movement of a security. Time series is also frequently used to process signals, such as audio. The examples I use to illustrate will be financial. Suppose you wanted to use the price of gold and the USA prime interest rate to predict the unemployment rate. You might have the following rows. These numbers are entirely made up!
|Row #|Price of Gold|Prime Rate|Unemployment Rate|
|---|---|---|---|
|1|1000|0.02|0.80|
|2|1001|0.021|0.81|
|3|1002|0.022|0.82|
|4|1003|0.023|0.83|
|5|…|…|0.84|
|6|…|…|0.85|
We would like to use 3 rows to predict the next two. Fortunately, there are no nominal values. It is quite possible to combine time series and nominal values. We will save that for the next section.
You might be wondering what the amount of time is between each row. For this example, it really does not matter. Each row could be a minute, day, year or century. It really does not matter, so long as they are uniform.
Our first input/output pair will draw on three input rows and two output rows. We are using price of gold and prime rate to predict unemployment. Therefore our first input to the model will have six values (3 time slices * 2 values). We will also have two outputs (2 time slices * 1 value). This results in the first input/output pair being the following:
- Input: 1000, 0.02, 1001, 0.021, 1002, 0.022
- Output: 0.83, 0.84
See how we combine three time slices into the inputs for the model. The next pair simply slides the window down. This process continues as long as we have data.
- Input: 1001, 0.021, 1002, 0.022, 1003, 0.023
- Output: 0.84, 0.85
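The sliding window can be sketched in code as follows. This is only one way to do it, and the function name is my own. The rows hold the made-up values quoted in the pairs above. The price of gold and prime rate for rows 5 and 6 are not quoted in the text and are never read by these two pairs, so they are left as `None`.

```python
# (price of gold, prime rate, unemployment rate), one tuple per row.
ROWS = [
    (1000, 0.02, 0.80),
    (1001, 0.021, 0.81),
    (1002, 0.022, 0.82),
    (1003, 0.023, 0.83),
    (None, None, 0.84),  # only unemployment is needed from this row
    (None, None, 0.85),  # only unemployment is needed from this row
]


def sliding_pairs(rows, input_size, future_size):
    """Slide a window down the rows and build (inputs, outputs) pairs."""
    pairs = []
    window = input_size + future_size
    for start in range(len(rows) - window + 1):
        past = rows[start:start + input_size]
        future = rows[start + input_size:start + window]
        # Inputs: gold and prime rate from every row of the input window.
        inputs = [value for gold, prime, _ in past for value in (gold, prime)]
        # Outputs: unemployment from every row of the future window.
        outputs = [unemployment for _, _, unemployment in future]
        pairs.append((inputs, outputs))
    return pairs


for inputs, outputs in sliding_pairs(ROWS, input_size=3, future_size=2):
    print(inputs, outputs)
```

With six rows, an input window of three and a future window of two, exactly two pairs fit, and they match the two pairs listed above.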
For the two above pairs there is a strict separation of input and output data. However, often we want to use the same value to predict itself. Consider looking for patterns in a particular stock. For the above table, we will look at how to use the price of gold, prime rate and past unemployment rates to predict future unemployment rates. Now our pair looks like this.
- Input: 1000, 0.02, 0.80, 1001, 0.021, 0.81, 1002, 0.022, 0.82
- Output: 0.83, 0.84
Notice the output is still the same. We are trying to predict rows 4 and 5 based on rows 1, 2 and 3. This time we are using ALL of the values from those three rows, which gives nine inputs (3 time slices * 3 values). It is okay for a column to be both input and output. However, you never want the input and future windows to overlap.
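One way to express this is to let the caller choose which columns feed the input and which feed the output. The sketch below is an assumption about how you might structure it, not a standard API. The name `build_pair` and the column-index arguments are hypothetical. Values that the text never quotes are left as `None` and are not read.

```python
# (price of gold, prime rate, unemployment rate), one tuple per row.
ROWS = [
    (1000, 0.02, 0.80),
    (1001, 0.021, 0.81),
    (1002, 0.022, 0.82),
    (1003, 0.023, 0.83),
    (None, None, 0.84),  # gold and prime rate not quoted in the text
]


def build_pair(rows, start, input_size, future_size, input_cols, output_cols):
    """Build one (inputs, outputs) pair whose input window begins at rows[start]."""
    input_idx = range(start, start + input_size)
    future_idx = range(start + input_size, start + input_size + future_size)
    if future_idx.stop > len(rows):
        raise ValueError("not enough rows for this window")
    # The two windows sit side by side, so no row belongs to both.
    assert not set(input_idx) & set(future_idx)
    inputs = [rows[i][c] for i in input_idx for c in input_cols]
    outputs = [rows[i][c] for i in future_idx for c in output_cols]
    return inputs, outputs


# Column 2 (unemployment) is used both as an input and as the output.
inputs, outputs = build_pair(ROWS, 0, 3, 2, input_cols=(0, 1, 2), output_cols=(2,))
print(len(inputs))  # 9
print(outputs)      # [0.83, 0.84]
```

The same column appears on both sides, but it is always read from different rows.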
Time series introduces an abstraction between your collected data and what is actually sent to the model. Rows are now grouped and sent to the model.
Both Normalization and Time Series
It is possible to use both time series and nominal normalization on the same data set. The mapping becomes more complex. You must first apply the normalization and then use time series to package multiple rows, as was done in the last section.
|Row #|Sentiment|Prime Rate|Unemployment Rate|
|---|---|---|---|
|1|bearish|0.02|0.80|
|2|bearish|0.021|0.81|
|3|bearish|0.022|0.82|
|4|…|…|0.83|
|5|…|…|0.84|
The sentiment input has three possible values: bullish, neutral and bearish. This means it now takes five inputs to represent each row of the above table. A single time slice has three values from sentiment, one from prime rate, and one from unemployment rate. This results in a total of five inputs per time slice. However, because the input window is of size three, there are 15 total inputs sent to the model.
Using one-of-n encoding, the first pair would be.
- Input: 1, 0, 0, 0.02, 0.80, 1, 0, 0, 0.021, 0.81, 1, 0, 0, 0.022, 0.82
- Output: 0.83, 0.84
As you can see, a sequence of “1,0,0” is added to the front of each of the three time slices to indicate that the current sentiment is bearish.
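Putting the two steps together might look like the sketch below. The function names are my own. The text only tells us that bearish is encoded as 1,0,0, so the positions of neutral and bullish in the list are an assumption. The sentiment and prime rate for rows 4 and 5 are not quoted in the text, so they are left as `None` and never encoded. As in the pairs above, the prime rate and unemployment values are passed through without range normalization.

```python
# "bearish" first matches the 1,0,0 encoding; the other two positions are assumed.
SENTIMENTS = ["bearish", "neutral", "bullish"]

# (sentiment, prime rate, unemployment rate), one tuple per row.
ROWS = [
    ("bearish", 0.02, 0.80),
    ("bearish", 0.021, 0.81),
    ("bearish", 0.022, 0.82),
    (None, None, 0.83),  # only unemployment is needed from this row
    (None, None, 0.84),  # only unemployment is needed from this row
]


def encode_slice(sentiment, prime, unemployment):
    """Step 1: normalize one row into five values using one-of-n."""
    if sentiment not in SENTIMENTS:
        raise ValueError(f"unknown sentiment: {sentiment!r}")
    flags = [1 if name == sentiment else 0 for name in SENTIMENTS]
    return flags + [prime, unemployment]


def first_pair(rows, input_size=3, future_size=2):
    """Step 2: package the encoded rows of the first window into one pair."""
    inputs = []
    for row in rows[:input_size]:
        inputs.extend(encode_slice(*row))
    outputs = [row[2] for row in rows[input_size:input_size + future_size]]
    return inputs, outputs


inputs, outputs = first_pair(ROWS)
print(len(inputs))  # 15
print(outputs)      # [0.83, 0.84]
```

Normalization happens per row, and only then are the rows grouped into a window, which is the order described at the start of this section.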