The Astonishing Hypothesis: The Scientific Search for the Soul

Chapter 13 Neural Networks

"...I believe that the best test of a model is whether its designers can answer these questions: What do you now know that you didn't know before? And how do you prove that it is true?" —James M. Bower

A neural network is a collection of interconnected units, each having the properties of an extremely simplified neuron. Neural networks are used to simulate the behavior of parts of the nervous system, to produce useful commercial devices, and to test general theories of how the brain works. Why do neuroscientists need theory so badly? If they understood the exact behavior of individual neurons, they might hope to predict the properties of populations of interacting neurons. Regrettably, things are not so easy. The behavior of individual neurons is often far from simple, and neurons are almost always connected together in complex ways. Furthermore, the overall system is usually highly nonlinear. A linear system, in its simplest form, exactly doubles its output when the input doubles; that is, the output is proportional to the input.
① For example, on the surface of a pond, when two trains of small ripples meet, they pass through each other without interfering. To calculate the combined effect of the two wave trains, one need only add, at every point in space and time, the effect of the first wave to that of the second. Each train of waves behaves independently of the other. This is generally not true for waves of large amplitude. The laws of physics dictate that this simple additivity breaks down at large amplitudes. A breaking wave is highly nonlinear: once the amplitude exceeds a certain threshold, the wave behaves in an entirely new way. That is not just "more of the same" but something with new properties. Nonlinear behavior is common in everyday life, especially in love and war. As the song goes: "Kissing her once isn't half as good as kissing her twice."
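The distinction can be made concrete in a few lines of code. This is a minimal sketch (not from the original text): a proportional response obeys superposition, while a saturating response, like the breaking wave, does not. The function names and numbers are our own illustrative choices.

```python
# Illustrative sketch: superposition holds for a linear system but
# fails once a saturating nonlinearity is introduced.

def linear(x):
    return 3.0 * x          # output strictly proportional to input

def saturating(x):
    return min(x, 1.0)      # clips at a threshold, like a breaking wave

a, b = 0.75, 0.5

# Linear: the response to the combined input equals the sum of the
# individual responses.
assert linear(a + b) == linear(a) + linear(b)

# Nonlinear: past the threshold the combined response is no longer
# the sum of the parts ("not just more of the same").
print(saturating(a) + saturating(b))  # 1.25
print(saturating(a + b))              # 1.0
```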

If a system is nonlinear, it is usually much harder to understand mathematically than a linear one, and its behavior can be far more complex. Predicting the behavior of interacting populations of neurons thus becomes difficult, especially since the end result is often counterintuitive.

The high-speed digital computer is one of the most important technological developments of the last fifty years. It is often called a von Neumann computer, in honor of the brilliant scientist who helped create it. Since computers can manipulate symbols and numbers, as the human brain can, it is natural to imagine the brain as some form of fairly complex von Neumann computer. Such comparisons, taken to extremes, lead to unrealistic theories.

Computers are built from inherently fast components. Even a personal computer has a basic cycle time, or clock rate, of more than 10 million operations per second, whereas the typical firing rate of a neuron is only in the range of 100 spikes per second. The computer's basic operations are thus some hundred thousand times faster than a neuron's, and fast supercomputers like the Cray machines widen the gap further still. Roughly speaking, a computer's operations are serial, one after another. The brain, in contrast, usually works in a massively parallel way; for example, about a million axons run from each eye to the brain, all working simultaneously. This high degree of parallelism is repeated at almost every stage of the system, and this wiring somehow compensates for the relative slowness of neurons. It also means that the loss of even a few scattered neurons is unlikely to alter the brain's behavior significantly. In technical terms, the brain is said to "degrade gracefully." A computer is fragile: even small damage to it, or a small error in a program, can cause havoc. A computer's errors degrade catastrophically.

Computers are highly stable in operation. Because their individual components are so reliable, they produce exactly the same output when given the same input. Individual neurons, by contrast, are more variable. They are governed by signals that modulate their behavior, and some of their properties change as they "compute." A typical neuron may have hundreds or even tens of thousands of inputs from many different places, and its axon makes numerous projections; a basic component of a computer, the transistor, has only a handful of inputs and outputs.

In a computer, information is encoded as trains of pulses, 0s and 1s, and in this form the computer transfers information from one specific place to another with high precision. Information can be sent to a specific address to retrieve or change the contents stored there. This allows information to be stored in a particular location in memory and used again at some later time. Nothing like this precision exists in the brain. Although the pattern of spikes a neuron sends along its axon (rather than just its average firing rate) may carry some information, there is no precisely addressed information encoded in the spikes. ① Memory must therefore be "stored" in a different form.

The brain looks nothing like a general-purpose computer. Different parts of the brain, and even different parts of the neocortex, are specialized, at least in part, to process different types of information. Most memory appears to be stored in the same places where the current processing is carried out. All of this is completely unlike the traditional von Neumann computer, whose basic operations (addition, multiplication, and so on) are performed in only one or a few places, while its memory is stored in many quite different places.

Finally, computers are carefully designed by engineers, whereas brains have evolved over many generations of animals by natural selection. As described in Chapter 1, this produces a substantially different style of design.

People are used to talking about computers in terms of hardware and software. Since one does not need to know the details of the hardware (circuits and so on) in order to write software (computer programs), people, especially psychologists, have argued that it is not necessary to know anything about the "hardware" of the brain. In fact it is inappropriate to impose this distinction on the operation of the brain: there is no clear separation between the brain's hardware and its software. What lends the approach a degree of plausibility is that, although brain activity is highly parallel, there appears to be some form of serial (attention-controlled) mechanism sitting on top of all this parallelism. At a high level, in places far removed from the sensory input, the brain can superficially be said to resemble a computer.

One can judge a theoretical approach by its results. Computers do what they are programmed to do and are therefore good at solving certain types of problems, such as large-scale number crunching, rigorous logical reasoning, and chess. Most people cannot do these things as quickly or as well. But even the most modern computers are helpless at tasks that ordinary people do quickly and effortlessly, such as seeing objects and understanding what they mean.

Significant progress has been made in recent years in designing a new generation of computers that work in a more parallel fashion. Most designs use many small computers, or parts of small computers, linked together and running concurrently. The exchange of information between the small computers, and the global control of the computation, are handled by fairly elaborate devices. Such supercomputers are especially useful for problems, like weather forecasting, in which the same basic operations must be carried out at many places at once.

The AI community has also moved toward designing programs that are more brainlike. They replace the strict logic normally used in computing with a kind of fuzzy logic: propositions no longer have to be true or false but only more or less probable. The program tries to find, among a set of propositions, the combination most likely to yield a conclusion, setting aside those it considers less probable. In its general setting this approach is indeed more brainlike than the earlier AI approaches, but in other respects, especially in the storage of memories, it is less so. It may therefore be difficult to compare its behavior with that of the real brain at all levels.

A group of previously little-known theorists developed a still more brainlike approach, now known as the PDP (Parallel Distributed Processing) approach. The subject has a long history, of which I can sketch only a little. The work of Warren McCulloch and Walter Pitts in 1943 was one of the earliest attempts in this direction. They showed that, in principle, "networks" of very simple units connected together could compute any logical or arithmetic function. Because the units of such a network resemble greatly simplified neurons, it is now often called a "neural network."

This result was so encouraging that it led many people to believe that this is how the brain works. It may well have helped in the design of modern computers, but its most striking conclusion turned out to be badly wrong about the brain.

The next big advance was a very simple one-layer device invented by Frank Rosenblatt, which he called the Perceptron. Its significance was that, although its connections were initially random, it could change them by a simple, well-defined rule, so that it could be taught to perform certain simple tasks, such as recognizing printed letters in fixed positions. The perceptron receives only two kinds of feedback about a task: right or wrong. One merely tells it whether its (provisional) answer was correct; it then changes its connections according to the perceptron learning rule. Rosenblatt showed that for a certain class of simple problems, those that are "linearly separable," a perceptron will learn the correct behavior after a finite number of training trials.
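The learning rule just described can be sketched in a few lines. This is a hypothetical minimal illustration, not Rosenblatt's original formulation: a single unit with a threshold output is nudged toward the correct answer whenever the teacher says it is wrong. The task (logical AND, which is linearly separable), the learning rate, and all names are our own choices.

```python
import random

# A minimal sketch of a perceptron trained with the perceptron learning
# rule on a linearly separable task (logical AND).

def step(x):
    return 1 if x > 0 else 0     # all-or-none output

def train_perceptron(samples, epochs=20, lr=0.1, seed=0):
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in range(2)]   # random initial weights
    b = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = step(w[0] * x1 + w[1] * x2 + b)
            err = target - out        # the teacher only says right or wrong
            w[0] += lr * err * x1     # nudge the connections toward
            w[1] += lr * err * x2     # the correct behavior
            b += lr * err
    return w, b

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)
print([step(w[0] * x1 + w[1] * x2 + b) for (x1, x2), _ in AND])  # [0, 0, 0, 1]
```

After a handful of corrections the initially random connections settle on weights that classify every case correctly, as Rosenblatt's theorem guarantees for separable problems.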

The result attracted wide attention, partly because of its mathematical beauty. Unfortunately the perceptron had bad luck, and its influence faded quickly. Marvin Minsky and Seymour Papert showed that a perceptron's structure and learning rule cannot handle the "exclusive-or" problem (deciding, for example, whether something is an apple or an orange, but not both), and so cannot learn it. They wrote a book setting out the perceptron's limitations in detail. This killed interest in perceptrons for many years (Minsky later admitted that he had gone too far), and much of the work on such questions turned instead to the methods of artificial intelligence. ①

It is possible to construct multilayer networks of simple units that can solve the exclusive-or problem (and similar tasks) that simple one-layer networks cannot. Such a network must have connections at several different levels, and the problem is to know which of those connections, initially random, must be modified to let the network perform the desired operation. Minsky and Papert's contribution would have been greater had they provided an answer to this question, rather than driving the perceptron into a dead end.

The next development to gain widespread attention came from John Hopfield, a Caltech physicist turned molecular biologist and brain theorist. In 1982 he proposed a network now known as the Hopfield network (see Figure 53). This is a simple network with feedback. Each unit can have only two outputs: -1 (signifying inhibition) or +1 (signifying excitation), but each unit has multiple inputs. Each connection is assigned a particular strength. At each moment a unit sums the effects ② arriving from all its connections. If this sum is greater than 0, it sets its output to +1 (roughly speaking, when the unit's excitatory input exceeds its inhibitory input, its output is positive); otherwise it outputs -1. This sometimes means that a unit's output changes because the inputs from other units have changed.
① Despite this, a number of theoretical workers continued to labor in obscurity, among them Stephen Grossberg, Jim Anderson, Teuvo Kohonen, and David Willshaw.
② The influence of each input on the unit is obtained by multiplying the incoming signal (+1 or -1) by the corresponding weight. (If the incoming signal is -1 and the weight is +2, the influence is -2.)

The computation is repeated over and over until the outputs of all the units become stable. ① Hopfield proved theoretically that, given a set of weights (connection strengths) and any input, the network will not wander about indefinitely, nor go into oscillation, but will quickly reach a stable state.

① In the Hopfield network the states of the units are changed not all at the same time but one at a time, in random order.

Hopfield's arguments were convincing and clearly put. His network had enormous appeal for mathematicians and physicists, who felt that they had at last found a way to get into brain research (as we say in California). That the network seriously violates biology in many of its details did not worry them. But how are the strengths of all these connections to be set?
In 1949 the Canadian psychologist Donald Hebb published his book The Organization of Behavior. It was then widely believed, as it is now, that a key factor in learning is the modification of the strengths of the connections (synapses) between neurons. Hebb realized that it was not enough to strengthen a synapse merely because it was active. He envisaged a mechanism that would operate only when the activities of the two neurons were correlated. A passage from his book has since been widely quoted: "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased." This mechanism, and various similar rules, are now called "Hebbian."

Hopfield used a form of Hebb's rule to set the connection weights in his network. For a given pattern, if two units have the same output, the weight of the connection between them is set to +1; if they have opposite outputs, both weights are set to -1. Roughly speaking, each unit encourages its "friends" and tries to weaken its "enemies."

How does the Hopfield network behave? If it is given the correct pattern of unit activity, it stays in that state. There is nothing remarkable in that, since what it has been given is already the answer. What is remarkable is that if only a small part of the pattern is presented as a "cue," the network, after a brief evolution, settles into the correct output, that is, into the whole pattern.
By repeatedly adjusting the outputs of its units, the network arrives at a stable coalition of unit activities. In effect it retrieves a memory from something that is merely close to the "memory" it stores. It is therefore said to be "content-addressable": there is no separate, unique signal that acts as an "address"; any recognizable part of the input pattern can serve as the address. This begins to look slightly like human memory.

Notice that a memory need not be held in an active state; it can be entirely passive, since it is embedded in the pattern of weights, that is, in the strengths of the connections between all the various units. The network can be completely inactive (all outputs set to 0), yet as soon as a signal is fed in, the network springs into activity and within a short time settles into the state of activity corresponding to the pattern it is supposed to remember. Presumably the recall of human long-term memory has this general property (except that the pattern of activity is not maintained permanently). You remember many things that you are not recalling at this moment.

A neural network (in particular a Hopfield network) can "remember" one pattern, but can it remember a second one as well? Provided the patterns are not too similar to one another, a network can memorize several of them: given a sufficiently large part of one pattern, the network will settle into that whole pattern after a few cycles. Because each memory is distributed over many connections, memory is spread throughout the system; and because any one connection may take part in several memories, memories can be superimposed. Memory is also robust: altering a few connections usually does not appreciably change the behavior of the network.
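The whole scheme, Hebbian storage followed by content-addressable recall, fits in a short program. This is a hypothetical toy sketch, not Hopfield's own code: two eight-unit patterns are stored by the friends-and-enemies rule, and a corrupted cue is cleaned up by updating units one at a time in random order, exactly as described above. The patterns, sizes, and names are our own choices.

```python
import random

# A tiny Hopfield network: Hebb-style storage of two patterns, then
# asynchronous recall from a partial (corrupted) cue.

def store(patterns, n):
    # Units that agree in a pattern get a positive weight between them,
    # units that disagree a negative one. No self-connections.
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w

def recall(w, state, n, sweeps=5, seed=1):
    rng = random.Random(seed)
    s = list(state)
    for _ in range(sweeps):
        for i in rng.sample(range(n), n):   # units updated one by one, random order
            total = sum(w[i][j] * s[j] for j in range(n))
            s[i] = 1 if total > 0 else -1   # +1 if excitation wins, else -1
    return s

n = 8
A = [1, 1, 1, 1, -1, -1, -1, -1]
B = [1, -1, 1, -1, 1, -1, 1, -1]
w = store([A, B], n)

cue = [1, 1, 1, 1, -1, -1, 1, 1]   # pattern A with its last two units corrupted
print(recall(w, cue, n) == A)       # True: the cue retrieves the whole pattern
```

Note that after recall the memory sits passively in the weights again; the pattern of activity is only recreated when a cue is fed in.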
It should come as no surprise that there is a price to pay for these features. Loading too many memories into a network easily confuses it: even given a cue, or a complete pattern, as input, the network produces meaningless output. ① Some have suggested that something like this happens when we dream (Freud called it "condensation"), but that is beside the point. It is worth noting that all these properties arise naturally. They were not deliberately built in by the designer; they follow from the nature of the units, their pattern of connections, and the rule for adjusting the weights.

Hopfield networks have a further property: when several of the training inputs are in fact roughly similar to one another, then once the connection weights have been computed, what the network "remembers" is a sort of average of the trained patterns. This is another property somewhat like the brain's. When we humans listen to a particular tone, we perceive it as the same even if it varies within a certain range: the inputs are similar but different, while the output, what we hear, is the same.

These simple networks cannot compare with the complexity of the brain, but their very simplicity makes it possible to understand their behavior, and the features that appear in simple networks may well appear in more complex networks with the same general properties. They also suggest ways of thinking about what particular brain circuits might do. For example, there is a region of the hippocampus called CA3 whose connections do in fact resemble those of a content-addressable network. Whether it really behaves like one remains, of course, to be tested experimentally.
Interestingly, these simple neural networks share some of the characteristics of holograms. In a hologram several images can be stored on top of one another; any part of the hologram can be used to recover the whole image, though with reduced sharpness; and a hologram is robust against small defects. The analogy is often enthusiastically promoted by people who know little about either the brain or holograms, but it is almost certainly worthless, for two reasons. Detailed mathematical analysis shows that neural networks and holograms are mathematically distinct. More important, while neural networks are built from units bearing some resemblance to real neurons, there is no evidence in the brain of the apparatus or the processes that holography requires.

A more recent book that made a great impact is the thick two-volume work Parallel Distributed Processing by David Rumelhart, James McClelland, and the PDP group. ① The book came out in 1986 and quickly became a best-seller, at least by academic standards. I am nominally a member of the PDP group, and I wrote a chapter of the book together with Chiko Asanuma, but I played a very minor part. Almost my only contribution was to insist that they stop calling the units of their networks neurons.

The psychology department of the University of California, San Diego is about a mile from the Salk Institute. In the late seventies and early eighties I used to walk over to the small informal meetings of their discussion group. The ground I strolled across in those days is now a huge parking lot, and as the pace of life has quickened I now drive between the two places.
The group was led by Rumelhart and McClelland, though McClelland soon left for the East Coast. Both began as psychologists, but both became disenchanted with symbol processing, and together they developed an "interactive activation" model of word processing. Encouraged by Geoffrey Hinton, another former student of Christopher Longuet-Higgins, they embarked on a more ambitious "connectionist" program. They adopted the term parallel distributed processing because it covers more ground than the earlier term, associative memory.

In the early days of network modeling, some theorists bravely experimented with hardware. They wired together small, still clumsy electronic circuits, often including old-fashioned relays, to simulate their very simple networks. Far more complex neural networks can be studied today, thanks to the vastly faster and cheaper modern computer. New ideas about networks can now be simulated and tested on computers (mostly digital ones), without recourse to crude analog circuits or to the rather difficult mathematical arguments of the earlier work.

The book Parallel Distributed Processing, published in 1986, had been in gestation since late 1981. This was fortunate, because in the meantime a particular algorithm had been developed (or rather revived and applied), building on earlier work, and it quickly made a great impression. The book's enthusiastic readers included not only brain theorists and psychologists but also mathematicians, physicists, and engineers, and even workers in artificial intelligence, although the initial reaction of this last group was rather hostile. Eventually neuroscientists and molecular biologists heard of it too.
The book's subtitle is "Explorations in the Microstructure of Cognition." It is something of a hodgepodge, but one particular algorithm produced striking results. That algorithm is now known as the "error back-propagation algorithm," usually shortened to "backprop." To understand it, one needs to know a little about learning algorithms in general.

Some forms of learning in neural networks are called "unsupervised." This means there is no guidance from outside: changes to any connection depend only on local conditions within the network. The simple Hebbian rule has this property. In supervised learning, by contrast, a signal telling the network how well it is performing is supplied from outside.

Unsupervised learning is appealing because, in a sense, the network teaches itself. Theorists have nevertheless devised a more efficient kind of learning rule, one that requires a "teacher" to tell the network whether its response to a given input is good, poor, or bad. One such rule is called the "delta rule."

Training a network requires a set of inputs, called the "training set"; we shall see an example shortly when we discuss NETtalk. A useful training set must be a fair sample of the inputs the network is likely to meet after training. The signals of the training set usually have to be presented many times over, so a great deal of training is needed before the network learns to perform well. This is partly because the connections of such networks usually start out random. (In the brain, by contrast, the initial connections are controlled by genetic mechanisms and are often far from completely random.)

How is the network trained? When a signal from the training set is fed in, the network produces an output; that is, each output unit takes on some particular level of activity. The teacher then tells each output unit its error, the difference between its actual state and the correct one.
The name delta comes from this difference between the actual activity and the required one (in mathematics δ is often used for a small, finite difference). The network's learning rule uses this information to calculate how to adjust the weights so as to improve the network's performance.

The Adaline network was an early example of learning with a teacher. It was devised by Bernard Widrow and M. E. Hoff in 1960, and the delta rule is therefore also called the Widrow-Hoff rule. They designed the rule so that the total error always decreases at each correction step. ① This means that during training the network will eventually reach a minimum of the error. That much is certain; what is not clear is whether it is the true global minimum or merely a local one. In the language of physical geography: have we reached a crater lake, or some lower pond, or the ocean, or a depressed sea (one below sea level) like the Dead Sea?

The training algorithm can be adjusted so that the steps taken toward the local minimum are large or small. If the steps are too large, the network jumps back and forth about the minimum (it goes downhill at first, but travels so far that it ends up going uphill again). If the steps are very small, the algorithm takes an extremely long time to reach the bottom. One can also use more subtle schemes of adjustment.

Back-propagation is a special case of learning with a teacher. For it to work, the units of the network must have certain special properties. Their output need not be binary (that is, 0 or 1, or +1 or -1) but is graded. It usually takes a value between 0 and +1. Theorists conveniently assume that this corresponds to the average firing rate of a neuron (with the maximum rate taken as +1), though they are often vague about the period over which this average is supposed to be taken.
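The delta rule itself can be shown in a few lines. This is a hypothetical sketch, not Widrow and Hoff's original Adaline circuit: a single linear unit adjusts each weight in proportion to the delta (target minus output) times the corresponding input, so that each correction step pushes the total error downhill. The data, learning rate, and names are our own illustrative choices.

```python
# A minimal sketch of the delta (Widrow-Hoff) rule for one linear unit.

def total_error(w, b, data):
    return sum((t - (w * x + b)) ** 2 for x, t in data)

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # targets follow t = 2x + 1
w, b = 0.0, 0.0

before = total_error(w, b, data)
lr = 0.05                       # step size toward the minimum
for _ in range(200):
    for x, t in data:
        out = w * x + b
        delta = t - out         # the small difference the rule is named for
        w += lr * delta * x     # weight change proportional to delta times input
        b += lr * delta
after = total_error(w, b, data)

print(after < before)           # True: each pass pushes the error downhill
```

With a step size this small the unit settles close to w = 2 and b = 1; make `lr` much larger and it will overshoot and jump about the minimum, as the text describes.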
How is the size of this graded output determined? As before, each unit forms the weighted sum of its inputs, but now there is no longer a sharp threshold. If the sum is small, the output is almost 0; as the sum grows, the output increases; when the sum is large, the output approaches its maximum value. The sigmoid function shown in Figure 54 embodies this typical relationship between the summed input and the output. If one takes the average firing rate of a real neuron as its output, it behaves in roughly the same way.

This smooth-looking curve has two important properties. It is mathematically "differentiable," that is, it has a well-defined, finite slope everywhere; the back-propagation algorithm depends on this property. More important, the curve is nonlinear, as real neurons are: the output does not always double when the (internal) input doubles. This nonlinearity lets it handle a wider range of problems than a strictly linear system could.

Now consider a typical back-propagation network. It usually has three layers of units (see Figure 55). The bottom layer is the input layer. The next is called the layer of "hidden units," because these units are not directly connected to the world outside the network. The top layer is the output layer. Each unit in the bottom layer is connected to every unit in the layer above, and the same holds for the middle layer. The network has only forward connections; there are no lateral connections, and no backward projections except those used in training. Its structure could hardly be simpler.
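The behavior of such a graded unit is easy to demonstrate. The sketch below (our own illustration; the particular input values are arbitrary) shows the standard sigmoid: output near 0 for small summed input, near the maximum for large summed input, and nonlinear in between.

```python
import math

# The sigmoid relation between a unit's summed input and its graded output.

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))   # smooth, differentiable, bounded in (0, 1)

# Small summed input: output near 0; large summed input: output near 1.
print(round(sigmoid(-6.0), 3))   # 0.002
print(round(sigmoid(0.0), 3))    # 0.5
print(round(sigmoid(6.0), 3))    # 0.998

# Nonlinear: doubling the input does not double the output.
print(sigmoid(4.0) < 2 * sigmoid(2.0))  # True
```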
At the start of training all the weights are set at random, so the network's initial responses to the signals are meaningless. Then a training input is given, an output is produced, and the weights are adjusted according to the back-propagation rule. It works like this. After the network has produced its output for a training signal, each unit of the top layer is told the difference between its actual output and the "correct" output. The unit uses this information to make small adjustments to the weight of each synapse reaching it from the units of the hidden layer. It then passes the information backward to each hidden unit. Each hidden unit collects the error information handed back from all the units of the top layer and uses it to adjust all the synapses coming up from the bottom layer. The algorithm is arranged so that, overall, the adjustments always reduce the error. The process is repeated many times. (The algorithm is general and can be applied to feed-forward networks with more than three layers.)

After enough training, the network is ready for use. A test set of inputs is now presented. The test set is chosen so that its general (statistical) properties are like those of the training set while otherwise differing from it. (At this stage the weights are held constant, in order to examine the behavior of the trained network.) If the results are unsatisfactory, the designer goes back to the beginning and modifies the structure of the network, the way the inputs and outputs are coded, the parameters of the training rule, or the total number of training trials.
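The whole procedure can be sketched compactly. The following is a hypothetical toy version, not the PDP group's code: a three-layer network with sigmoid units is trained by back-propagation on the exclusive-or task that defeated the one-layer perceptron. The layer sizes, learning rate, and seed are our own illustrative choices; the assertion only checks that training reduces the total error, since from some random starting weights the network may settle in a local minimum.

```python
import math, random

# A compact sketch of a three-layer feed-forward network trained by
# back-propagation: forward pass, error at the output layer, error
# passed back to the hidden layer, small weight adjustments downhill.

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

rng = random.Random(42)
n_in, n_hid = 2, 3
# w[i][j]: weight from input j to hidden unit i (last entry is a bias);
# v[i]: weight from hidden unit i to the single output unit.
w = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hid)]
v = [rng.uniform(-1, 1) for _ in range(n_hid + 1)]

def forward(x):
    h = [sigmoid(sum(wi[j] * xj for j, xj in enumerate(x)) + wi[-1]) for wi in w]
    out = sigmoid(sum(v[i] * h[i] for i in range(n_hid)) + v[-1])
    return h, out

def loss(samples):
    return sum((t - forward(x)[1]) ** 2 for x, t in samples)

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
lr = 0.5
before = loss(XOR)
for _ in range(2000):
    for x, t in XOR:
        h, out = forward(x)
        d_out = (t - out) * out * (1 - out)          # output-layer error signal
        for i in range(n_hid):
            d_hid = d_out * v[i] * h[i] * (1 - h[i])  # error sent back to hidden unit i
            for j in range(n_in):
                w[i][j] += lr * d_hid * x[j]
            w[i][-1] += lr * d_hid
            v[i] += lr * d_out * h[i]
        v[-1] += lr * d_out

print(loss(XOR) < before)   # True: training pushes the total error down
```

The factor `out * (1 - out)` is the slope of the sigmoid, which is why the algorithm needs a differentiable output function, as noted above.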
All this may seem abstract; an example should make it clearer. Terry Sejnowski and Charles Rosenberg provided a famous demonstration in 1987. They called their network NETtalk. Its task was to convert written English into spoken English. The irregular spelling of English, which makes it a particularly difficult language to pronounce, makes this no easy task, and of course the rules of English pronunciation were not explicitly given to the network in advance. During training the network was given a correction signal after each attempt, and it learned from this. The input was fed through the network letter by letter in a special way, and NETtalk's total output was a string of symbols corresponding to the spoken sounds. To make the demonstration more vivid, the output of the network was coupled to a separate, pre-existing machine (a digital speech synthesizer) that turned NETtalk's output into sounds, so that one could hear the machine "reading" English aloud.

Since the pronunciation of an English letter depends to a large extent on the letters before and after it, the input layer read a window of seven letters at a time. ① The units of the output layer corresponded to the 21 articulatory features needed to specify the phonemes ②, with five further units handling syllable boundaries and stress. Figure 56 shows the general layout.
They trained the network on excerpts from two texts, each accompanied by the phonetic transcription needed to instruct the machine. The first was taken from the Merriam-Webster Pocket Dictionary. The second was rather odd: a sample of the continuous speech of a child. The initial weights had small random values and were updated after each word processed during training. The experimenters wrote programs so that, given the input and the (correct) output, the computer could do all this automatically. In judging the actual output, the program took as its best guess the phoneme closest to the pronunciation produced, since usually several of the "sound" output units were active at once.

It is fascinating to hear the machine learn to "read" English. ① At first, because the initial connections are random, one hears only a confused string of sounds. NETtalk quickly learns to distinguish vowels from consonants, but at first it knows only one vowel and one consonant, so it babbles. Later it recognizes word boundaries and produces word-like strings of sounds. After about ten passes through the training set the words become distinct, and the reading sounds rather like a small child's.

The actual results were not perfect. English pronunciation sometimes depends on the meaning of the word, about which NETtalk knew nothing, and similar sounds were often confused, such as the "th" sounds in thesis and throw. Tested on a further passage of speech from the same child, the machine did well, showing that it could generalize what it had learned from its fairly small training vocabulary (1,024 words) to new words it had never met. ② This is called "generalization." Evidently the network is not simply a lookup table of the words on which it was trained. Its ability to generalize depends on the redundancy of English pronunciation: not every English word is pronounced in its own unique way, though foreigners new to English may be inclined to think so.
(The difficulty arises because English has two roots, Latin and Germanic, which between them give it a rich vocabulary.)

One advantage of a neural network, compared with most data collected from real neurons, is that after training one can easily examine the receptive field of every hidden unit. Does a given letter excite only a few hidden units, or is its activity spread over many of them, as in a hologram? The answer is closer to the former. Although no single hidden unit is dedicated to each letter-to-sound correspondence, neither does each correspondence spread over all the hidden units.

It is also possible to examine how the hidden units behave in clusters (that is, which of them have similar properties). Sejnowski and Rosenberg found that "...the most important distinction is the complete separation of vowels and consonants; the clusters of hidden units show different patterns for these two classes. For the vowels the next most important variable is the letter itself, whereas the clustering of the consonants follows a mixed strategy that relies more on the similarity of their sounds." The significance of this rather untidy arrangement, typical of neural networks, is its uncanny resemblance to the responses of many real cortical neurons, such as those of the visual system, and its difference from the tidy design an engineer would impose on such a system.

Their conclusion was that NETtalk is a demonstration in miniature of many aspects of learning. First, the network starts with some reasonable "innate" knowledge, embodied in the input and output representations chosen by the experimenters, but with no particular knowledge of English: the network could be trained on any language sharing the same set of letters and phonemes.
language training.其次,网络通过学习获得了它的能力,其间经历了几个不同的训练阶段,并达到了一种显著的水平。最后,信息分布在网络之中,因而没有一个单元或连接是必不可少的,作为结果,网络具有容错能力,对增长的损害是故障弱化的。此外,网络从损伤中恢复的速度比重新学习要快得多。 尽管这些与人类的学习和记忆很相似,但NET talk过于简单,还不能作为人类获得阅读能力的一个好的模型。网络试图用一个阶段完成人类发育中两个阶段出现的过程,即首先是儿童学会说话;只有在单词及其含义的表达已经建立好以后,他们才学习阅读。同时,我们不仅具有使用字母-发音对应的能力,似乎还能达到整个单词的发音表达,但在网络中并没有单词水平的表达。注意到网络上并没有什么地方清楚地表达英语的发音规则,这与标准的计算机程序不同。它们内在地镶嵌在习得的权重模式当中。这正是小孩学习语言的方式。它能正确他说话,但对它的脑所默认的规则一无所知。 ① NET talk有几条特性是与生物学大为抵触的。网络的单元违背了一条规律,即一个神经元只能产生兴奋性或抑制性输出,而不会二者皆有。更为严重的是,照字面上说,反传算法要求教师信息快速地沿传递向前的操作信息的同一个突触发送回去。这在脑中是完全不可能发生的。试验中用了独立的回路来完成这一步,但对我而言它们显得过于勉强,并不符合生物原型。 尽管有这些局限性,NET talk展示了一个相对简单的神经网络所能完成的功能,给人印象非常深刻。别忘了那里只有不足500个神经元和2万个连接。如果包括(在前面的脚注中列出的)某些限制和忽略,这个数目将会大一些,但恐怕不会大10倍。而在每一侧新皮层边长大约四分之一毫米的一小块表面(比针尖还小)有大约5000个神经元。因而与脑相比,NET talk仅是极小的一部分。②所以它能学会这样相对复杂的任务给人印象格外深刻。 另一个神经网络是由西德尼·莱基(Sidney Lehky)和特里·塞吉诺斯基设计的。他们的网络所要解决的问题是在不知道光源方向的情况下试图从某些物体的阴影中推断出其三维形状(第四章描述的所谓从阴影到形状问题)。对隐层单元的感受野进行检查时发现了令人吃惊的结果。其中一些感受野与实验中在脑视觉第一区(V1区)发现的一些神经元非常相似。它们总是成为边缘检测器或棒检测器,但在训练过程中,并未向网络呈现过边或棒,设计者也未强行规定感受野的形状。它们的出现是训练的结果。此外,当用一根棒来测试网络时,其输出层单元的反应类似于V1区具有端点抑制(end-stopping)的复杂细胞。 网络和反传算法二者都在多处与生物学违背,但这个例子提出了这样一个回想起来应该很明显的问题:仅仅从观察脑中一个神经元的感受野并不能推断出它的功能,正如第十一章描述的那样,了解它的投射野,即它将轴突传向哪些神经元,也同样重要。 我们已经关注了神经网络中"学习"的两种极端情况:由赫布规则说明的无教师学习和反传算法那样的有教师学习。此外还有若干种常见的类型。一种同样重要的类型是"竞争学习"。①其基本思想是网络操作中存在一种胜者为王机制,使得能够最好地表达了输入的含义的那个单元(或更实际他说是少数单元)抑制了其他所有单元。学习过程中,每一步中只修正与胜者密切相关的那些连接,而不是系统的全部连接。这通常用一个三层网络进行模拟,如同标准的反传网络,但又有显著差异,即它的中间层单元之间具有强的相互连接。这些连接的强度通常是固定的,并不改变。通常短程连接是兴奋性的,而长程的则是抑制性的,一个单元倾向于与其近邻友好而与远处的相对抗。这种设置意味着中间层的神经元为整个网络的活动而竞争。在一个精心设计的网络中,在任何一次试验中通常只有一个胜者。 这种网络并没有外部教师。网络自己寻找最佳反应。这种学习算法使得只有胜者及其近邻单元调节输入权重。这种方式使得当前的那种特殊反应在将来出现可能性更大。由于学习算法自动将权重推向所要求的方向,每个隐单元将学会与一种特定种类的输入相联系。 ① 到此为止我们考虑的网络处理的是静态的输入,并在一个时间间隔后产生一个静态的输出。很显然在脑中有一些操作能表达一个时间序列,如口哨吹出一段曲调或理解一种语言并用之交谈。人们初步设计了一些网络来着手解决这个问题,但目前尚不深入。(NET talk确实产生了一个时间序列,但这只是数据传入和传出网络的一种方法,而不是它的一种特性。) 语言学家曾经强调,目前在语言处理方面(如句法规则)根据人工智能理论编写的程序处理更为有效。其本质原因是网络擅长于高度并行的处理,而这种语言学任务要求一定程度的序列式处理。脑中具有注意系统,它具有某种序列式的本性,对低层的并行处理进行操作,迄今为止神经网络并未达到要求的这种序列处理的复杂程度,虽然它应当出现。 
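A bare-bones version of competitive learning, with a winner-take-all step and weight changes only for the winning unit, might look like the following. The clustered data, the number of units, and the learning rate are all invented for the example; real competitive networks (such as those of Grossberg and Kohonen) are considerably more elaborate, and the competition here is computed directly rather than through fixed excitatory and inhibitory connections.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: 180 points drawn from three clusters on the unit circle.
angles = np.concatenate([rng.normal(mu, 0.1, 60) for mu in (0.5, 2.5, 4.5)])
X = np.c_[np.cos(angles), np.sin(angles)]

# Three competing units; weight vectors kept at unit length.
W = rng.normal(size=(3, 2))
W /= np.linalg.norm(W, axis=1, keepdims=True)

LR = 0.1
for _ in range(20):                  # a few passes through the data
    for x in rng.permutation(X):
        winner = np.argmax(W @ x)    # the unit that best matches the input wins
        # Only the winner's weights move, toward the current input,
        # making this same response more likely in the future...
        W[winner] += LR * (x - W[winner])
        W[winner] /= np.linalg.norm(W[winner])  # ...then renormalize

# After training, each input is claimed by the unit it activates most.
labels = np.argmax(X @ W.T, axis=1)
print("inputs claimed per unit:", np.bincount(labels, minlength=3))
```

There is no teacher anywhere in the loop: the update rule alone pushes each unit's weights toward the class of inputs on which it happens to win.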
Real neurons (with their axons, synapses, and dendrites) have unavoidable time delays and undergo continual change as they process signals. Most designers of neural networks regard these features as a nuisance and avoid them. That attitude may be mistaken: it is almost certain that evolution has built upon these changes and delays and profited from them.

One possible criticism of these neural networks is that, because they use such a broadly unrealistic learning algorithm, they cannot in fact reveal much about the brain. There are two answers. One is to try algorithms that seem more acceptable biologically; the other is more effective and more general. David Zipser of the University of California, San Diego, a molecular biologist turned neural theorist, has pointed out that backpropagation is a very good method for identifying the nature of the system under study. He calls it "neural system identification". His point is that if the structure of a network at least approximates that of the real thing, and enough of the system's constraints are known, then backpropagation, as a method of minimizing error, will usually arrive at a solution whose general properties resemble those of the real biological system. This is a first step in the right direction toward understanding the system's behavior.

If the structure of the neurons and their connections is reasonably realistic, and enough constraints have been built into the system, the resulting model may be useful because it is sufficiently similar to the real thing. It then becomes possible to study the behavior of the model's components in detail, far more quickly and thoroughly than by performing the same experiments on an animal.

It is important to understand that the scientific goal does not stop there. For example, the model may show that a certain class of synapses in it must change in the way prescribed by backpropagation. But backpropagation does not occur in the real system, so the modeler must find a plausible, realistic learning rule for that class of synapses. Those particular synapses might, for instance, need only some form of Hebbian rule. These realistic learning rules may be local and may differ from one part of the model to another. If necessary, some global signals may be introduced, and the model must then be run again.

If the model still works, the experimenters must show that this kind of learning really does occur in the predicted place, and must uncover the cellular and molecular mechanisms involved in it to support the claim. Only then can we rise from these "interesting" demonstrations to genuinely convincing scientific results.

All this implies that a large number of models and their variants must be tested. Fortunately, with the development of extremely fast and cheap computers, many models can now be simulated, so one can check whether a given arrangement actually behaves as hoped. Even the most advanced computers, however, have difficulty testing the large, complicated models one would really like to try.

The insistence that every model be checked by simulation has, regrettably, two by-products. If a hypothetical model behaves rather successfully, its designers find it hard to believe that it is incorrect. Yet experience teaches that several very different models can produce the same behavior. To establish which of these models comes closer to the truth, further evidence appears to be needed, such as the exact properties of the real neurons and of the molecules in that part of the brain.

The other danger is that overemphasis on a successful model can inhibit freer thinking about the problem and so hold back theory. Nature operates in a particular way, and too narrow a discussion of a problem may lead people to abandon a very valuable idea because of some special difficulty, when evolution may have used some additional trick to get around that difficulty. Despite these reservations, simulating a theory, if only to get a feel for how it actually works, is useful.

What can we conclude about neural networks? Their underlying design is more brainlike than the architecture of a standard computer; yet their units are not as complex as real neurons, and the structure of most networks is greatly oversimplified compared with the circuitry of the neocortex. At present a network must be quite small if it is to be simulated in a reasonable time on an ordinary computer. This will improve as computers become faster and as highly parallel computers, built more like networks, are produced commercially, but it will remain a serious obstacle for a long time.

Despite these limitations, neural networks have already shown an astonishing ability to perform tasks, and the whole field is alive with new ideas. Although many of these networks will be forgotten, solid progress will surely come from understanding them, grasping their limitations, and devising new ways to improve them. The networks may well have important commercial applications. Even though they sometimes lead theorists away from the biological facts, they will in the end produce useful ideas and inventions. Perhaps the most important result of all this work on neural networks is that it has suggested new ideas about how the brain may work.

In the past, many aspects of the brain appeared completely incomprehensible. Thanks to all these new ideas, one can now at least glimpse the possibility of designing brain models that are biologically realistic, rather than models with no biological justification that capture only some limited aspects of brain behavior. Even now these new ideas have sharpened our discussion of experiments. We know more about what must be learned concerning individual neurons; we can point to aspects of the circuitry that we do not yet understand well enough (such as the backward pathways in the neocortex); we view the behavior of single neurons from a new standpoint; and we realize that the next important task on the experimental agenda is the behavior of whole populations of them. Neural networks still have a long way to go, but at last they are off to a good start.

① Charles Anderson and David Van Essen have proposed that the brain contains devices that route information along prescribed paths from one place to another. The idea is still controversial.

① The network was based on an earlier one, called a "spin glass", which physicists had proposed, inspired by a theoretical concept.
① This corresponds to a (local) minimum of a well-defined mathematical function (called the "energy function", a term taken from spin glasses). Hopfield also gave a simple rule for setting the weights so that each particular pattern of activity of the network corresponds to a minimum of the energy function.

① For a Hopfield network, the output can be regarded as a weighted sum of those of the network's stored memories that are closely related to the output (apparently a slip for "input" — translator's note).

① In 1968 Christopher Longuet-Higgins, starting from the hologram, invented a device called the "holophone". He later invented another device, called the "correlograph", which finally took the form of a particular kind of neural network. His student David Willshaw studied it in detail in the course of his doctoral thesis.

② In collaboration with other theorists of similar outlook, they produced Parallel Models of Associative Memory in 1981, edited by Geoffrey Hinton and Jim Anderson. Its readers were mainly workers on neural networks, and its influence was not as broad as that of the later book.

① PDP is the abbreviation of Parallel Distributed Processing.

① More precisely, it is the mean of the squares of the errors that decreases, so the rule is sometimes called the least-mean-square (LMS) rule.

① Each of the 29 "letters" has a corresponding unit; these comprise the 26 letters of the alphabet plus three that mark punctuation and boundaries. The input layer therefore requires 29 × 7 = 203 units.

② For example, because the consonants p and b are both pronounced by first bringing the lips together, both are called "labial stops".

③ The middle (hidden) layer initially had 80 hidden units; this was later changed to 120, which gave better results. In all the machine had to adjust about 20,000 synapses. The weights could be positive or negative. The authors did not build a truly parallel network to do this; instead they simulated the network on a medium-sized, fast computer (a VAX 11/780 FPA).

① The computer usually could not work fast enough to produce the sounds in real time, so the output had to be recorded first and played back at higher speed before people could understand it.

② Sejnowski and Rosenberg also showed that the network was quite resistant to random damage to the connections they had set up; under these conditions its behavior "degraded gracefully". They also tried an input window of 11 letters instead of 7, which markedly improved the network's performance. Adding a second layer of hidden units did not improve its performance, but it helped the network to generalize better.

① Besides those listed above, NETtalk contains many other simplifications. Although its authors profess distributed representations, both the input and the output use "grandmother cells": there is, for example, one unit representing "the letter a in the third position of the window". This was done to reduce the computing time required and is a reasonable form of simplification. Feeding the data in sequentially, seven letters at a time, is perfectly acceptable in an AI program but appears to conflict with the biological facts. The winner-take-all step at the output is not carried out by "units", nor is there a set of units representing the difference between the predicted and the actual output (that is, the teaching signal). These operations are all performed by the program.

② The comparison is not entirely fair, because a unit of a neural network is better thought of as the equivalent of a small group of related neurons in the brain. A more appropriate figure would therefore be about 80,000 neurons (the number of neurons beneath a square millimeter of cortex).

① It was developed by Stephen Grossberg, Teuvo Kohonen, and others.

① I shall not discuss the limitations of competitive networks. Obviously there must be enough hidden units to accommodate everything the network is trying to learn from the inputs it is given; the training must proceed neither too fast nor too slow; and so on. Such networks need careful design if they are to work properly. No doubt more sophisticated applications based on the basic idea of competitive learning will be invented before long.
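Hopfield's weight rule and the energy function mentioned in the footnotes above are simple enough to show directly. In this sketch (the pattern size, number of stored memories, and update count are arbitrary choices for illustration), memories are stored as a sum of outer products, and asynchronous updates can only lower the energy, so a corrupted pattern slides toward a nearby minimum.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two invented patterns of +1/-1 activity to be stored as memories.
N = 16
patterns = np.array([rng.choice([-1, 1], N) for _ in range(2)])

# Hopfield's simple weight rule: the sum of the outer products of the
# stored patterns, with no self-connections.
W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0)

def energy(s):
    """The 'energy function' of the footnote: E = -1/2 * s^T W s."""
    return -0.5 * s @ W @ s

def recall(s, steps=200):
    """Asynchronous updates; each unit flips to match its input field,
    which can never increase the energy."""
    s = s.copy()
    for i in rng.integers(0, N, steps):
        s[i] = 1 if W[i] @ s >= 0 else -1
    return s

# Corrupt a stored memory by flipping a few units, then let the
# network settle into a nearby energy minimum.
probe = patterns[0].copy()
probe[:3] *= -1
settled = recall(probe)
print("recovered stored pattern:", bool((settled == patterns[0]).all()))
print("energy fell:", energy(settled) <= energy(probe))
```

Because the weights were set so that each stored pattern sits at (or near) a minimum of the energy function, the settled state is the network's "weighted" reconstruction of the memory most closely related to the probe.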