These are mostly reading notes from the wonderful book Artificial Intelligence by Melanie Mitchell.
A path through the tree is followed from the root towards one leaf, choosing a random branch at every branching node. This approach is useful for huge decision trees, for example in the game of Go, where evaluating every path is computationally infeasible. When a large number of random paths is evaluated, a good enough solution is likely to be found; evaluating all paths, in contrast, would guarantee the optimal solution.
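The random-path idea can be sketched in a few lines of Python. The tree, node names, and leaf scores below are made up purely for illustration:

```python
import random

# Toy game tree: branching nodes map to child names, leaves map to a score.
tree = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1", "b2"],
    "a1": 3, "a2": 7, "b1": 5, "b2": 1,
}

def random_rollout(tree, node="root"):
    """Follow one random path from the root to a leaf, return (path, score)."""
    path = [node]
    while isinstance(tree[node], list):      # still at a branching node
        node = random.choice(tree[node])     # pick a random branch
        path.append(node)
    return path, tree[node]                  # leaf reached: return its score

def best_of_n(tree, n=100):
    """Sample n random paths and keep the best one found."""
    return max((random_rollout(tree) for _ in range(n)), key=lambda r: r[1])

path, score = best_of_n(tree)
```

With enough samples, the best path found is very likely the optimal one for a tree this small; for a Go-sized tree, "good enough" is all one can hope for.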
Genetic algorithms can find a good solution in a non-gradient environment. This is possible whenever a program can follow a rule set. The rules are changed in every iteration and the outcome is evaluated. The rules need a binary representation, so that a binary string can be genetically modified. The general pattern is:
- Initialisation with a random rule set
- Select best performing rules
- Apply genetic operators such as Crossover and Mutation
- Terminate under a condition, for example after a number of rounds or at a cost-threshold
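The steps above can be sketched as a minimal genetic algorithm in Python. The fitness function (count the 1-bits, the classic "OneMax" toy problem) and all parameters are my own illustrative choices, not from the book:

```python
import random

random.seed(42)
LENGTH, POP, GENERATIONS = 16, 20, 60

def fitness(bits):
    # Toy fitness: number of 1-bits in the binary string.
    return sum(bits)

def crossover(a, b):
    # Cut both parents at a random point and glue the halves together.
    cut = random.randrange(1, LENGTH)
    return a[:cut] + b[cut:]

def mutate(bits, rate=0.05):
    # Flip each bit with a small probability.
    return [b ^ 1 if random.random() < rate else b for b in bits]

# Initialisation with a random rule set (here: random bitstrings)
population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]

for _ in range(GENERATIONS):
    # Select best performing rules
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]
    # Apply genetic operators: Crossover and Mutation
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

# Terminate under a condition (here: a fixed number of rounds)
best = max(population, key=fitness)
```

Keeping the top half unchanged ("elitism") guarantees the best solution found so far is never lost between rounds.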
First a word about what Q-Tables are good for. Whenever a program has to reach one single goal, and the way to achieve this goal is unknown, Q-Tables can do the trick. An intuitive example is to find the exit of a maze: One final reward without the evaluation of intermediate steps.
The States the program can be in form the rows of the Q-Table; the Options the algorithm can choose from in any given State form the columns. To stay with the maze example: if the player is in the State of standing in the middle of an open field, the Options are to move in any direction (North, East, South, …).
At first, the moves start from a random location and happen randomly in order to find the reward. Only the action that led to the reward is stored with a high number of points in the Q-Table (for example 100 points; the exact value is part of fine-tuning for the use case). For example, if the State is “One step straight ahead towards the exit” and the chosen Option “Move straight” unlocks the reward, this State-Option combination receives a high number.
In the second round, finding the reward itself no longer counts, as that solution was already found. What matters now is how the algorithm finds the State that led to the reward; in the example, what comes right before standing in front of the exit? Once this pre-reward step is found, another high number is stored in the Q-Table for the respective State-Option combination.
This pattern of finding the step before each previously rewarded State continues round after round. There is one exception: with a low chance, the algorithm chooses an Option that is not (yet) associated with a high number of points. That way, it can discover potentially better solutions and tries moves it has never tried before.
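A minimal Q-learning sketch for a 1-D corridor "maze". The layout, the reward of 100, and the learning parameters are illustrative choices:

```python
import random

random.seed(1)

# Toy maze: a corridor of 5 cells; the exit (reward) is at cell 4.
# States are cell indices, Options are moves: 0 = left, 1 = right.
N_STATES, ACTIONS, EXIT = 5, (0, 1), 4
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # learning rate, discount, exploration

# The Q-Table: one row per State, one column per Option, all starting at 0.
q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 100 if nxt == EXIT else 0
    return nxt, reward

for episode in range(200):
    state = random.randrange(N_STATES - 1)       # random start location
    while state != EXIT:
        # With a low chance, try an Option not (yet) associated with points.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[state][a])
        nxt, reward = step(state, action)
        # Q-learning update: propagate the reward back, one step per round.
        q[state][action] += ALPHA * (reward + GAMMA * max(q[nxt]) - q[state][action])
        state = nxt
```

After training, "move right" scores higher than "move left" in every cell, with the numbers shrinking the further a State is from the exit.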
Whenever the output for an input of n values is known, this technique shines. That means the input needs to be quantifiable and the output has to be known already, at least for a training set. Between the Inputs and Outputs lies a “deep” or “hidden” layer. This layer multiplies the Inputs by weights and forwards the results to the Outputs. Every node in the hidden layer is connected to every Input and every Output.
When training the network, the multipliers at the hidden layer(s) are gradually adapted to best match all Inputs to their expected result.
This technique is conceptually easy to understand, but extremely hard to master, as it requires fine-tuning and a good intuition on how to apply changes.
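A toy version of such a network with one hidden layer, trained by backpropagation. The XOR training set, layer sizes, and learning rate are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set (XOR): Inputs with a known expected Output.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer: every hidden node is connected to every Input and Output.
W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for _ in range(2000):
    # Forward pass: the hidden layer multiplies the Inputs and forwards them.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((out - y) ** 2)))
    # Backward pass: gradually adapt the multipliers to match the targets.
    d_out = 2 * (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0)
```

The loss shrinks over the iterations; how fast, and whether it gets stuck, is exactly the fine-tuning problem the text mentions.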
Also known as “self-learning”. Good behaviour is rewarded, while bad behaviour is ignored. An example gives an intuition of how reinforcement learning works in practice. AlphaZero, the Go-playing machine, uses a combination of deep neural networks and Monte Carlo trees (at least one of the earlier versions did; the approach was changed later). The deep neural network suggests which Monte Carlo trees to try. The best-performing Monte Carlo tree is fed back into the neural network, so that in the next iteration the network can make even better suggestions for a similar situation. This technique only works if the outcome of a round can be clearly scored, so that the neural network only trains itself with good inputs.
A language is natural if humans speak it.
An early approach for natural language processing was the use of recurrent neural networks. A recurrent neural network feeds its own values back into the network, together with the next input. Let’s say every word in the dictionary gets a value. A sentence, now formed of a string of values, is fed into the recurrent neural network one word at a time. To keep track of the sentence’s state, the output from the previous words is fed back into the network together with the value of the current word. This input technique is also called one-hot, as only one word is active at any time. After years of experimentation, the results are not as good as with other techniques for language processing.
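One recurrent step can be sketched with an untrained network. The tiny dictionary, hidden-state size, and weights below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny dictionary: every word gets an index (its one-hot slot).
vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3}
V, H = len(vocab), 5                     # vocabulary size, hidden-state size

W_in = rng.normal(size=(V, H)) * 0.1     # input  -> hidden weights
W_rec = rng.normal(size=(H, H)) * 0.1    # hidden -> hidden (the feedback loop)

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0                       # only one word is active at a time
    return v

h = np.zeros(H)                          # the sentence state, initially empty
for word in "the cat sat down".split():
    x = one_hot(vocab[word], V)
    # Previous output is fed back in together with the current word's value.
    h = np.tanh(x @ W_in + h @ W_rec)
```

After the loop, `h` summarises the whole sentence in a fixed-size vector, which is both the strength and the bottleneck of this design.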
Unfortunately not further explained in Mitchell's book and only briefly referenced in the context of Translation. The LSTM addresses the deficits recurrent neural networks have with sequential inputs. The cell state is the information transportation system from step to step: the information can stay the same over time or gradually change, and it is used to generate the unit's output together with the other operations explained below. LSTMs have four hidden layers. The first layer decides which information to "forget" from the cell state. The next two layers figure out what to add to the cell state. The fourth layer renders the output from the cell state and the cell's input for this particular step.
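Based on the standard LSTM equations (not spelled out in the book), the four layers can be sketched like this, with untrained toy weights:

```python
import numpy as np

rng = np.random.default_rng(0)

H = 4                                    # size of cell state and hidden state

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per hidden layer of the LSTM; shapes assume the step
# input has the same size as the hidden state, purely for brevity.
Wf, Wi, Wc, Wo = (rng.normal(size=(2 * H, H)) * 0.1 for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(z @ Wf)                  # layer 1: what to "forget" from cell state
    i = sigmoid(z @ Wi)                  # layers 2+3: how much to add to the
    c_tilde = np.tanh(z @ Wc)            # cell state, and the candidate values
    c = f * c_prev + i * c_tilde         # cell state: the transport from step to step
    o = sigmoid(z @ Wo)                  # layer 4: render this step's output
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(H), np.zeros(H)
for x in (rng.normal(size=H) for _ in range(3)):   # three toy input steps
    h, c = lstm_step(x, h, c)
```

Note how the cell state `c` only ever changes through the forget gate `f` and the addition `i * c_tilde`, which is what lets information survive unchanged over many steps.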
Before explaining how word vectors are generated, I start with a motivating example: Word vectors give the similarity of meaning between two words. Words that are more closely related have a shorter distance than words with very different meanings. A surprising property is the distance between word correlations: measuring the distance Man -> Prince yields almost the same distance as Woman -> Princess. To make use of this property, the word cloud can answer questions. Asking Fish -> Water; Bird -> ? will give the answer Air. This is possible by measuring the distance Fish -> Water and then checking what lies at the same distance from the word Bird. Google released a 300-dimensional word cloud called word2vec. How did they create it?
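The distance trick can be demonstrated with tiny hand-picked vectors. Real word2vec vectors are 300-dimensional and learned; these 3-dimensional ones are fabricated so the analogy works out:

```python
import numpy as np

# Hypothetical word vectors, made up purely for illustration.
vectors = {
    "fish":  np.array([1.0, 0.2, 0.1]),
    "water": np.array([1.0, 0.9, 0.1]),
    "bird":  np.array([0.1, 0.2, 1.0]),
    "air":   np.array([0.1, 0.9, 1.0]),
    "stone": np.array([0.5, 0.1, 0.5]),
}

def analogy(a, b, c):
    """Solve a -> b as c -> ?: apply the offset b - a to c, return the
    nearest remaining word."""
    target = vectors[c] + (vectors[b] - vectors[a])
    candidates = (w for w in vectors if w not in (a, b, c))
    return min(candidates, key=lambda w: np.linalg.norm(vectors[w] - target))

answer = analogy("fish", "water", "bird")   # -> "air"
```

The offset Fish -> Water is the same direction as Bird -> Air by construction here; in real word2vec spaces, such shared offsets emerge from training.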
Start with a neural network with one hidden layer containing 300 (hidden) units. The input and output layers have a unit for every word in the dictionary, so they are rather huge layers. From a sentence, feed adjacent words into the network. For example, feed Burger and train for Restaurant; also train for the opposite case. Once the training is done, extract the word vector: every word from the dictionary is “lit up” in the network, and the “illumination” of the hidden layer marks that word's position in the 300-dimensional space.
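The "lighting up" step amounts to multiplying a one-hot vector with the trained input-to-hidden weights, which simply selects one row of the weight matrix. A sketch with untrained stand-in weights:

```python
import numpy as np

rng = np.random.default_rng(0)

V, D = 6, 300                       # toy dictionary size, 300 hidden units
W_hidden = rng.normal(size=(V, D))  # stand-in for trained input->hidden weights

def word_vector(word_index):
    # "Light up" one word at the input layer and read off the hidden layer.
    one_hot = np.zeros(V)
    one_hot[word_index] = 1.0
    return one_hot @ W_hidden       # the illumination of the 300 hidden units

vec = word_vector(2)
```

Because the input is one-hot, the word vector is exactly row 2 of the weight matrix; training the network is what makes those rows meaningful.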
It is possible to project a higher-dimensional space into two or three dimensions, so that humans can visualise and inspect them more easily.
Traditionally, translation systems were composed of human-made rules. Google drastically changed this in 2016 with the release of its machine-learning translation system.
The outline of this idea is to have an encoder and a decoder neural network. The encoder is fed a sentence using the one-hot approach (see recurrent neural networks). At the end, a stop sequence is used and the activation of the units in the network is extracted. This activation is fed into the decoder network for the target language, which converts the sequence of values back into a sequence of words. In contrast to simple neural networks, the units in these encoder/decoder networks are “long short-term memory” (LSTM) units. These units account for inputs that arrive over time and autonomously decide which inputs to generously “forget”. The length of the sentence in the original language can differ from the length of the resulting sentence.
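Stripped of the LSTM machinery, the encoder/decoder handshake can be sketched with plain recurrent steps. The weights are untrained toys; a real system learns them from parallel text:

```python
import numpy as np

rng = np.random.default_rng(0)

H = 4                                   # hidden-state size, chosen arbitrarily

W_enc = rng.normal(size=(H, H)) * 0.1   # encoder: input  -> hidden
U_enc = rng.normal(size=(H, H)) * 0.1   # encoder: hidden -> hidden
W_dec = rng.normal(size=(H, H)) * 0.1   # decoder: hidden -> hidden

def encode(sequence):
    """Feed the source sentence in step by step; the final activation of the
    units is the fixed-size summary handed over to the decoder."""
    h = np.zeros(H)
    for x in sequence:
        h = np.tanh(x @ W_enc + h @ U_enc)
    return h

def decode(summary, out_len):
    """Unroll the decoder from the summary; the output length is free to
    differ from the input length."""
    h, outputs = summary, []
    for _ in range(out_len):
        h = np.tanh(h @ W_dec)
        outputs.append(h)
    return outputs

src = [rng.normal(size=H) for _ in range(5)]   # a 5-step source "sentence"
out = decode(encode(src), out_len=3)           # a 3-step target "sentence"
```

Everything the decoder knows about the source sentence must pass through that single summary vector, which is why the LSTM's selective forgetting matters so much in the real system.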
How to evaluate whether a machine can really understand the contents of a sentence, rather than simply reacting to it in a trained way? A powerful approach is to ask a question like “The couch did not fit through the hallway, because it was too narrow. What was too narrow?”. A machine that neither understands what a couch or a hallway is, nor knows their dimensions and relations, will never be able to answer this question. This test offers an effectively unlimited number of possibilities, as new questions can simply be created: “Water was filled from the bottle into the glass until it was empty. What was empty?”. Given enough questions in this pattern, only a machine that truly understands the text can score close to 100%; otherwise it will score roughly 50%, the result of random choices.