
Chapter 11. Can We Talk About This?

Being Digital, by Nicholas Negroponte
Typing is not an ideal interface for most people. If we could talk to our computers, even the staunchest anti-machine holdouts would probably use them with more enthusiasm. Current computers, however, are still deaf and dumb. Why? The lack of progress in speech recognition has little to do with a lack of technology; it comes from a lack of perspective. Whenever I see speech-recognition demonstrations or product advertisements in which a person holds a microphone to his or her lips, I am astonished: have they really forgotten that one of the great values of speech is that it leaves your hands free to do something else? When I see people talking with their faces close to a screen, I wonder: have they forgotten that being able to act from a distance is one of the reasons for using voice? And when I hear people demand voice systems that can recognize any user, I ask myself: have they forgotten that we are talking to personal computers, not public ones? Why does everyone seem to be looking at the problem from the wrong direction?

The reason is simple. Until recently, we have been driven by two misleading ideas. The first is a legacy of the old telephone system: the hope that anyone, anywhere, could pick up a handset and give orders to a computer without going through an operator, no matter who the speaker was. The other lingering idea comes from office automation: the desire for a talking typewriter, a machine into which we could dictate without pause and which would transcribe our speech into written words. The focus on these two goals has kept us, for years, from the far more achievable (and useful) one of enabling computers to recognize speech and to know what is being talked about.

We also ignore the value of speech beyond words. For example, today's computers demand a person's undivided attention. You usually have to sit still and attend to both the process and the content of the interaction. Using a computer while walking around, or letting it join as one party in a conversation among several people, is simply out of the question. Voice recognition could change all that.

Being able to use a computer from arm's length or farther is very important. Imagine talking with someone whose nose was pressed against your face the whole time! We usually talk to people at a distance, occasionally turning away while we do something else at the same time. Sometimes the other person has even moved out of sight, into another room, and we keep talking. This happens all the time. I would like a computer within earshot, one that can pick my voice out of surrounding noise such as air conditioners or airplanes flying overhead.

Another reason speech is better than text is that it carries incidental channels of information. As anyone with young children or pets knows, how you say something can matter more than what you say. Tone of voice is crucial. No matter how much an owner brags about a beloved puppy's intelligence, the dog seems to respond only to tone of voice; its internal capacity for parsing complex words is essentially zero.

Beyond their literal meaning, spoken words convey a great deal of information at the same time. With exactly the same words we can express passion, sarcasm, anger, playfulness, flattery, or exhaustion. In computer speech-recognition research these subtle differences have been ignored or, worse, treated as blemishes rather than features. Yet it is precisely these qualities that make speech a richer input medium than typing.

Making the computer "obey"

If your command of a foreign language is decent but not yet fluent, you may find it hard to follow a news broadcast over background noise. To a fully fluent speaker, the same noise is at worst a nuisance. Recognizing language and understanding it are inseparable. For the moment, computers cannot understand meaning the way you and I do, by first drawing on a shared sense of what things mean. While the computers of the future will undoubtedly be more intelligent, for the time being we have to tackle the problem of machine speech recognition apart from the problem of machine understanding. Once the two tasks are separated, the way forward is clear: we must turn spoken words into computer-readable commands.

The problem of speech recognition has three variables: vocabulary size, the machine's dependence on a particular speaker, and the degree to which words are run together.

We can think of these three variables as the axes of a three-dimensional space. On the vocabulary axis, the fewer words the system must recognize, the easier the job. If the system knows in advance who is speaking, the problem is simpler still. And if the speaker pronounces each word separately, the computer can pick the words apart far more easily. At the origin of the three axes we find a very small vocabulary, total dependence on one speaker's voice, and words spoken with distinct ... pauses ... between ... them.

As we move out along any axis, that is, as we enlarge the vocabulary the computer can recognize, allow the system to serve any speaker, or let words run together to a greater and greater degree, each step forward makes the problem harder. At the far ends of all three axes, we would expect a computer to recognize any word, spoken by anybody, slurred together to any degree. It is commonly assumed that a speech-recognition system must sit at the extremes of all three axes before it is useful to people. This is totally wrong.
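As a rough way to picture this three-axis model, here is a minimal sketch in Python. The axis names and scoring weights are my own illustrative assumptions, not measurements from any real recognizer.

```python
# A minimal sketch of the three-axis model of recognition difficulty.
# The scoring weights are illustrative assumptions, not measured values.

from dataclasses import dataclass

@dataclass
class RecognitionTask:
    vocabulary_size: int       # words the system must recognize at once
    speaker_independent: bool  # must it understand any speaker?
    connected_speech: bool     # words run together, or spoken discretely?

    def difficulty(self) -> float:
        """Moving out along any axis makes the task harder."""
        score = float(max(self.vocabulary_size, 1)) ** 0.5
        if self.speaker_independent:
            score *= 10.0  # any voice, any accent
        if self.connected_speech:
            score *= 5.0   # no clear pauses to segment on
        return score

# The origin of the axes: tiny vocabulary, one known voice, discrete words.
origin = RecognitionTask(50, speaker_independent=False, connected_speech=False)
# The far corner: anyone, a huge vocabulary, words slurred together.
extreme = RecognitionTask(50_000, speaker_independent=True, connected_speech=True)

print(origin.difficulty())   # small number: the easy case
print(extreme.difficulty())  # orders of magnitude harder
```

The point of the toy scoring is only that difficulty multiplies as you move out along the axes, which is why a useful system need not sit at all three extremes.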

Let's consider them one at a time. On vocabulary size, we tend to ask: how many words are too many, 500, 5,000, or 50,000? But the real question is: how many words does the computer need to hold in memory, ready to recognize, at any one moment? That question prompts us to group words by context, so that whole blocks of vocabulary can be swapped into memory as needed. When I ask my computer to place calls, it loads the names from my electronic phone book. When I plan a trip somewhere, it loads place names. If we think of the vocabulary needed in any given situation as a "word window," then the computer only has to pick words out of a much smaller pool of sounds at a time, perhaps 500 rather than 50,000.
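A minimal sketch of the "word window" idea might look like the following. The contexts, word lists, and function names are hypothetical, invented here for illustration.

```python
# A sketch of "word windows": only the vocabulary relevant to the current
# context is loaded into the active recognition set.

CONTEXTS = {
    "telephony": ["call", "answer", "hang", "up", "redial"],
    "travel": ["book", "flight", "depart", "arrive", "return"],
}

PHONE_BOOK = ["alice", "bob", "carol"]   # loaded when placing calls
PLACES = ["boston", "paris", "tokyo"]    # loaded when planning trips

def load_window(context: str) -> set[str]:
    """Swap in the few hundred words needed right now, instead of
    keeping all 50,000 in memory at once."""
    window = set(CONTEXTS[context])
    if context == "telephony":
        window.update(PHONE_BOOK)
    elif context == "travel":
        window.update(PLACES)
    return window

active = load_window("telephony")
print(f"recognizer is now listening for {len(active)} words")
```

The design choice is the one the text describes: the recognizer's working vocabulary tracks the task at hand, so its search space stays small even though the total vocabulary is large.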

The assumption that we need speaker-independent recognition is a holdover from the old telephone company requirement: the phone company's central computer had to understand everybody in order to provide "universal service." Today computers are more pervasive and more personal. We can do more of the speech recognition at the periphery of the network, in a personal computer, in a microphone, or with the help of a small smart card. If I want to talk to an airline's computer from a phone booth, I can dial into my home computer, or pull out my pocket computer, and have it convert my voice into a signal the machine can understand, and then pass that along to the airline's computer.
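A sketch of that division of labor, under my own assumptions: a hypothetical pocket device does the speaker-dependent recognition locally and forwards only a structured, machine-readable request. Every name below is invented for illustration.

```python
# Recognition "at the periphery": the personal device knows its owner's
# voice; the remote computer only ever sees an unambiguous message.

import json

def recognize_locally(audio: bytes) -> str:
    """Speaker-dependent recognition on the personal device.
    (Stub: a real recognizer trained to one voice would go here.)"""
    return "book flight boston to paris"

def to_request(utterance: str) -> dict:
    """Convert recognized words into a machine-readable command."""
    words = utterance.split()
    return {"action": words[0], "details": " ".join(words[1:])}

def send_to_airline(request: dict) -> None:
    """Forward the structured request; the airline computer never has
    to understand anyone's voice."""
    print("sending:", json.dumps(request))

send_to_airline(to_request(recognize_locally(b"...")))
```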

The third variable is connected speech. When talking to a computer, we do not want to spit out every single word in exaggerated isolation, pausing after each one like a tourist addressing a foreign child. This axis is therefore the most challenging. But we can simplify the problem somewhat by treating language as the sound of many words flowing together, rather than as a string of individual word-sounds. In fact, handling your particular way of running words together may well become part of personalizing and training your own computer. Once we think of speech as an interactive, conversational medium, we are not far from the easiest part of speech recognition.

Words not in the dictionary

Speech is a medium filled with sounds that appear in no dictionary. Speech is not just more colorful than black-and-white text; features of dialogue, such as the use of non-verbal signals like body language, often give a conversation layers of additional meaning.

In 1978 we at MIT had an advanced speaker-dependent recognition system capable of recognizing connected speech. But like many systems of its kind, then and now, it misfired at the slightest tension in the speaker's voice. When a graduate student demonstrated the system to our sponsors, we expected it to perform perfectly; instead, the student's voice tightened with anxiety and the system failed completely.

A few years later, another student had a brilliant idea: find the places where the user pauses in speech, and program the computer to make an "aha" sound at the appropriate moments. While the person spoke, the machine would interject "aha," "a-ha" or "mm-hmm" every so often. These sounds had a tremendously soothing effect (as if the machine were encouraging the user to keep talking); the user relaxed, and the system's performance improved by leaps and bounds.

This idea illustrates two important points: first, not every utterance needs a literal meaning to be valuable in communication; second, some sounds are pure conversational etiquette. When you answer the phone and fail to say "um-hmm" to the caller at suitable intervals, the caller grows nervous and eventually asks, "Hey, are you listening?" "Aha" or "hmm" does not mean yes, no, or maybe; it basically transmits one bit of information: "I'm here."
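A minimal sketch of that pause-filling trick follows, assuming hypothetical speak() and silence-measuring hooks; the timing thresholds are made up.

```python
# The "aha" back-channel: fill long pauses with a short, meaningless
# sound that says "I'm here." Thresholds and hooks are assumptions.

import random
import time

FILLERS = ["aha", "a-ha", "mm-hmm"]
PAUSE_THRESHOLD = 2.0  # seconds of silence before reassuring the speaker

def backchannel_loop(seconds_of_silence, speak, running):
    """Poll the pause length; interject a filler when it grows too long."""
    while running():
        if seconds_of_silence() > PAUSE_THRESHOLD:
            speak(random.choice(FILLERS))  # carries no meaning, only presence
            time.sleep(PAUSE_THRESHOLD)    # wait before reassuring again
        time.sleep(0.1)

# Toy usage: simulate three seconds of silence, then stop.
start = time.time()
backchannel_loop(
    seconds_of_silence=lambda: time.time() - start,
    speak=print,
    running=lambda: time.time() - start < 3.0,
)
```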
Parallel expression

Imagine this situation: you are sitting around a table with a group of people who all speak French, except you. You studied a year of bad French in middle school. Suddenly someone turns to you and asks, "Would you like some more wine?" You understand completely. Then the same person turns the conversation to French politics, and unless you are extremely fluent, it is like listening to Martians (and even fluency may not save you).

You might think: "Would you like some more wine?" is simple French a child could understand, while politics demands more sophisticated language. True, but that is not the important difference between the two exchanges. When someone asks whether you want a refill, he may be reaching for the bottle and looking at your empty glass. In other words, the message you are decoding is not sound alone but several parallel and redundant messages, with all the subjects and objects present in the same space and time. Acting together, these cues let you understand what he means.

Let me repeat: redundancy is a good thing. The use of parallel channels (gesture, eye contact, and talk) is at the heart of human communication. People gravitate naturally toward parallel expression. If you speak only a little Italian, talking with an Italian on the telephone can be very difficult. But when you check into an Italian hotel and find no soap in your room, you do not pick up the phone; you go down to the front desk, haul out every scrap of your crash-course Italian to ask for soap, and even mime a bit of washing while you talk.

When we are in a foreign place, we do everything we can to convey our intentions and to read every available signal for the slightest scrap of meaning. Computers find themselves in just such a foreign land: the land of human beings.

There are two ways to make a computer talk: replay previously recorded speech, or synthesize sound from letters, syllables, or (most likely) phonemes. Each has its pros and cons. Making a computer speak is like making music: you can store the sound and replay it (like a CD), or you can re-create it synthetically from the score (like a musician).

Replaying stored utterances gives the most natural-sounding speech, especially when a complete message is stored as one piece. That is how most telephone announcements are recorded. But when you try to splice recorded sounds or individual words together, the result is less satisfactory, because the overall rhythm is missing. In the past, people were reluctant to use prerecorded speech in human-computer interfaces because it consumed a great deal of storage; today that is much less of a problem. The real limitation is the obvious one: you must record the speech in advance. If you want the computer to speak names without mangling them, you must first store those names. Stored speech cannot cope with arbitrary text. For that, we use the second method: synthesis.

A speech synthesizer reads a stream of text word by word according to rules (just as you are doing with this sentence). Every language is different, and so is the difficulty of synthesizing it. English is among the hardest to synthesize, because we write it in a strange and seemingly illogical way. Other languages, such as Turkish, are much easier. Turkish is in fact very easy to synthesize, because Kemal Atatürk moved Turkish from Arabic to Latin script in 1929, and the conversion produced a one-to-one correspondence between sounds and letters: every letter is pronounced, with no silent letters or confusing compound vowels. At the word level, therefore, Turkish is a speech synthesizer's dream come true.

Even when a machine can pronounce each and every word, problems remain. It is very hard to string synthesized words together with the overall rhythm and intonation of a phrase or sentence. Yet doing so matters enormously, not just so that the computer sounds pleasant, but so that it can add the color, expression, and intonation that suit the content and intent of what it says. Otherwise the voice from the computer is as unappetizingly monotonous as a drunken Swede muttering to himself.
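As a sketch of why such one-to-one orthographies are easy to synthesize, here is a toy letter-to-phoneme mapper. The phoneme symbols are simplified assumptions of mine, not a complete Turkish phone set.

```python
# Rule-based synthesis for a Turkish-like orthography: one letter, one
# sound. The mapping below is a simplified, partial illustration.

LETTER_TO_PHONEME = {
    "a": "a", "b": "b", "c": "dʒ", "ç": "tʃ", "d": "d", "e": "e",
    "i": "i", "k": "k", "l": "l", "m": "m", "r": "r", "s": "s",
    "ş": "ʃ", "t": "t", "u": "u", "z": "z",
}

def to_phonemes(word: str) -> list[str]:
    """No silent letters, no compound vowels: just map each letter."""
    return [LETTER_TO_PHONEME.get(ch, "?") for ch in word.lower()]

print(to_phonemes("merak"))  # -> ['m', 'e', 'r', 'a', 'k']
```

English resists this approach precisely because its spelling breaks the one-letter-one-sound rule, forcing real synthesizers to rely on exception dictionaries and context-dependent rules.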
We are starting to see (and hear) systems that combine speech synthesis with stored sound, and as digitization becomes more common, the ultimate solution will be a blend of the two.

The trend toward miniaturization

In the next millennium we will find ourselves talking to machines as much as we talk to people, or even more. What seems to bother people most about talking to inanimate objects is self-consciousness. We feel perfectly at ease talking to dogs and canaries, but it feels absurd to talk to a doorknob or a lamppost (unless you are dead drunk). Will I feel silly talking to my oven? Probably no sillier than I feel talking to an answering machine.

The trend toward miniaturization will make voice input far more ubiquitous than it is today. Computers keep getting smaller: devices that took up an entire room yesterday sit on your desk today, and tomorrow you will wear a computer on your wrist. Many desktop users do not fully appreciate how much computers have shrunk over the past ten years, because the parts have changed size in different ways: the keyboard has stayed the same as far as possible, while monitors have grown larger. As a result, the overall footprint of a desktop computer today is still about that of a Macintosh fifteen years ago.

If you have not used a modem in a long time, the change in modem sizes tells the story better. Less than fifteen years ago, a 1,200-baud modem (about $1,000) was nearly the size of an oven lying on its side, and a 9,600-baud modem was a big iron cage on a shelf. Today you can find a 19,200-baud modem on a smart card. Even at credit-card size there is plenty of underused room inside; a fair amount of today's design is pure packaging (filling the slot, or staying big enough to hold without losing). The reason we do not shrink things like modems down to the size of a pin is not primarily technical; it is that we would mislay them and never find them again.

Once freed from the constraint of finger span (which dictates the shape and size of a comfortable keyboard), a computer's size is governed instead by pockets, wallets, wristwatches, ballpoint pens, and other such objects. In those form factors, a credit card is close to the minimum size we want, and the display is so small that a graphical interface makes little sense. A pen-shaped system is likely to prove an awkward transitional device, at once too big and too small. Buttons are not ideal either: look at your TV and VCR remote controls and you will see their limits. Push-button devices are designed entirely for young people with slender fingers and sharp eyes.

For all these reasons, miniaturization will inevitably drive improvements in speech production and speech recognition, and push voice toward becoming the dominant interface for computers embedded in small objects. The actual recognition system need not be housed in the cuff link or the watch fob itself; the small device can hand the work off by communicating with machines nearby. The point is that, once miniaturized, it must be driven by voice.

Reach out and touch someone

Many years ago, the director of development at Hallmark Cards told me that his company's chief competitor was AT&T: "Reach out and touch someone," the slogan goes, and what it promises is feeling conveyed through voice.
The channel of voice carries not just the signal but everything that travels with it: understanding, reflection, sympathy, tolerance. We say that someone "sounds" honest, that an argument "sounds" shaky, that something doesn't "sound" right. Feeling rides along inside the sound. Just as we reach out and touch someone by telephone, we will find ourselves conveying our wishes to machines by voice. Some people will order their computers about like drill instructors; others will use the voice of reason. Speaking and delegating are inseparable. Will you give orders to seven dwarfs? Quite possibly. Twenty years from now you may be talking to a group of eight-inch-high holographic assistants sitting on your desk. That expectation is not at all far-fetched. What is certain is that voice will be the primary channel of communication between you and your interface agents.
Press "Left Key ←" to return to the previous chapter; Press "Right Key →" to enter the next chapter; Press "Space Bar" to scroll down.
Chapters
Chapters
Setting
Setting
Add
Return
Book