Andrew Turner was an Electrical Engineering undergraduate at the University of Southern California in 2000. He is an audio recording engineer and works as an electrical engineer for the TMH Corporation.
Sound consists of variations in air pressure arriving at the ears. The human hearing system deciphers these relatively simple variations into a substantial amount of important information, much of which humans rely upon for survival. Since the hearing process is both psychological (associated with the brain) and acoustical (associated with the physics of sound), its study is commonly referred to as psychoacoustics. This paper discusses the basics of the human hearing system and psychoacoustics, and then explores auditory localization; that is, the brain’s ability to interpret sound (audio) as a three-dimensional image. By understanding how the ear-brain system works, the reader can understand how seemingly complex surround-sound systems create an immersive auditory experience.
The technology used in simulating surround sound systems makes use of human perception of sound in order to simulate specific sound effects. By understanding the working of the human auditory system, one can understand the engineering behind the design of various commercial sound systems.
Basics of Psychoacoustics with Respect to Localization
Psychoacoustics, or the study of human perception of sound, provides an explanation of how the human ear-brain system interprets sound and decodes information from a pair of receivers (ears) into a complete, 3-dimensional auditory ‘image’. In the perception of sound, there are three general domains that humans can distinguish: pitch, volume (or loudness), and time. These three domains provide the essential information or cues necessary for localization, which is the process of determining where a sound came from in three-dimensional space. In order to understand the basics of human auditory localization, one must first understand the method by which sound is received through the ears, as well as the basic definitions of pitch, loudness, and time.
Anatomy of the Human Ear
The human ear is composed of three main parts: the outer ear, the middle ear, and the inner ear. The outer ear consists of the pinna, the ear canal, and the eardrum (Fig. 1). The pinna both directs sound into the ear canal and “encodes” the incoming sounds with directional information that the brain interprets. The ear canal carries this encoded information to the eardrum, which vibrates according to the incoming sound pattern. The middle ear is mechanically connected to the eardrum by means of three bones: hammer, anvil, and stirrup. These bones move in accordance with the vibrating eardrum and are connected to the inner ear, where the cochlea translates the vibrations into neuron impulses. The auditory nerve then carries the impulses to the brain, where they are interpreted.
Quantification and Interpretation of Sound Based on Pitch, Loudness and Time
The pitch of a sound is directly associated with its frequency. Each key or note on a piano has an associated pitch. As the keys are played from left to right, the pitch increases. The reason for this increase is related to the speed at which the strings inside a piano vibrate back and forth when struck by the hammers that the individual keys operate. A string at a given tension and length will tend to resonate (or vibrate) at a specific frequency. This frequency is known as the fundamental frequency. For example, the note “middle-C” has a fundamental frequency of 261.6 Hz (it vibrates 261.6 times a second).
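The left-to-right rise in pitch can be made concrete with a short calculation. The sketch below is only an illustration: it assumes a standard 88-key piano tuned in equal temperament with the A above middle-C at 440 Hz, a tuning convention that this paper does not itself specify. The function name and key numbering (key 1 is the leftmost key) are chosen here for the example.

```python
def key_frequency(key_number, a4_hz=440.0):
    """Fundamental frequency in Hz of a key on an 88-key piano.

    Assumes equal temperament with key 49 (A above middle-C) at a4_hz.
    Each step to the right multiplies the frequency by 2**(1/12).
    """
    return a4_hz * 2.0 ** ((key_number - 49) / 12.0)


MIDDLE_C_KEY = 40  # middle-C is the 40th key from the left

if __name__ == "__main__":
    for key in (1, MIDDLE_C_KEY, 49, 88):
        print(f"key {key:2d}: {key_frequency(key):8.1f} Hz")
```

Under these assumptions the 40th key comes out at roughly 261.6 Hz, matching the middle-C figure quoted above.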
The second domain perceived by the ear-brain system is loudness, which is closely associated with acoustic pressure. Pressure is defined as a force over a given amount of area and is measured in newtons of force per square meter of area. One newton per square meter is equal to one Pascal of pressure. Acoustic pressure is similar to water pressure in pipes. High water pressure is present in fire hydrant pipes while low water pressure is present in shower pipes. The water from a fire hydrant can exert a large force on anything in front of it because of its high pressure, while a shower’s low pressure generally does not exert very much force. The force generated by a sound moves the eardrum and the various parts of the middle ear in accordance with its magnitude. Consequently, a high acoustic pressure that causes the eardrum to move a greater distance is a loud sound, while a low acoustic pressure is a soft sound.
Acoustic pressure is expressed in decibels of Sound Pressure Level (dB SPL), calculated from the measured pressure $p$ and a reference pressure $p_0$ of $20 \times 10^{-6}$ (20 micro) Pascal:

$$\text{SPL} = 20 \log_{10}\left(\frac{p}{p_0}\right)\ \text{dB}$$

A pressure of 20 micro Pascal is therefore equal to 0 dB SPL, which is approximately the quietest sound a human can hear. The “maximum” a human can hear is called the Threshold of Pain and is commonly referenced as 120 dB SPL, which is equivalent to 20 Pascal (1,000,000 times 20 micro Pascal). Exceeding 120 dB SPL is damaging to human hearing and may cause discomfort. As a general rule, an increase or decrease in Sound Pressure Level by 10 dB is approximately a doubling or halving of loudness, respectively.
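The decibel figures above can be checked numerically. The following is a minimal sketch using the standard definition of sound pressure level, 20·log10(p / p_ref), with the 20 micro Pascal reference given in the text. The function names are illustrative, and `loudness_ratio` encodes only the rule of thumb that 10 dB corresponds to a doubling of loudness, not a precise psychoacoustic model.

```python
import math

P_REF_PA = 20e-6  # reference pressure in pascals, defined as 0 dB SPL


def spl_db(pressure_pa):
    """Sound pressure level in dB SPL for an acoustic pressure in pascals."""
    return 20.0 * math.log10(pressure_pa / P_REF_PA)


def pressure_from_spl(spl):
    """Acoustic pressure in pascals for a level in dB SPL (inverse of spl_db)."""
    return P_REF_PA * 10.0 ** (spl / 20.0)


def loudness_ratio(delta_db):
    """Approximate change in perceived loudness for a level change in dB.

    Rule of thumb only: every 10 dB is treated as a doubling (or halving).
    """
    return 2.0 ** (delta_db / 10.0)


if __name__ == "__main__":
    print(f"20 uPa -> {spl_db(20e-6):.1f} dB SPL")
    print(f"20 Pa  -> {spl_db(20.0):.1f} dB SPL")
    print(f"+20 dB -> about {loudness_ratio(20):.0f}x as loud")
```

Note that the pressure ratio between the threshold of hearing and the threshold of pain is one million, while the rule of thumb puts the corresponding loudness ratio at only about 2 to the 12th power, which is why a logarithmic scale is convenient.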
The third and simplest aspect of sound that humans perceive is the time domain. The time domain is important because, unlike light, the time it takes for sound waves to travel from one place to another is not negligible. Time, with respect to sound, is easily measured and observed (particularly in the form of echoes in canyons). Echoes in canyons are heard as a result of sound traveling across varying length paths before reaching an observer’s ears. In a canyon echo, a source emits a sound that travels away from the source.
The sound travels in many different paths, some directly towards the observer and others directly into the walls of the canyon. First, the observer hears the sound as it comes directly from the source since it is the shortest distance for the sound to travel. Next, the observer hears the sound that traveled directly from the source into the walls of the canyon after it reflects off of the walls. The sound that travels directly into the walls of the canyon takes more time to arrive at the observer’s ears because it travels a longer path (from the source to the wall, then to the observer as opposed to directly to the observer). Since these travel distances are long in a large canyon, the time it takes for sound to arrive at the observer’s ears is quite long (sometimes nearing one second).
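The canyon example reduces to a comparison of two path lengths. The sketch below assumes a speed of sound of 344 m/sec and uses made-up distances purely for illustration; the function names are not from any library.

```python
SPEED_OF_SOUND_M_S = 344.0  # assumed speed of sound in air


def travel_time_s(distance_m):
    """Time in seconds for sound to cover a distance in meters."""
    return distance_m / SPEED_OF_SOUND_M_S


def echo_delay_s(direct_m, source_to_wall_m, wall_to_observer_m):
    """Delay between the direct sound and a single wall reflection."""
    reflected_m = source_to_wall_m + wall_to_observer_m
    return travel_time_s(reflected_m) - travel_time_s(direct_m)


if __name__ == "__main__":
    # Hypothetical canyon: observer 10 m from the source, wall 172 m away.
    delay = echo_delay_s(direct_m=10.0, source_to_wall_m=172.0,
                         wall_to_observer_m=172.0)
    print(f"echo arrives {delay:.2f} s after the direct sound")
```

With these hypothetical distances the reflected path is 334 m longer than the direct path, so the echo arrives a little under one second late.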
With these three basic definitions in mind, and an understanding of the basic inner-workings of the physical ear, we can discuss some of the more complex issues of human hearing. We will now go into the psychoacoustic interpretations (performed by the brain) of sound signals composed of these three domains.
Psychoacoustic Interpretation of Pitch and Loudness
While pitch and loudness are nearly synonymous with frequency and acoustic pressure, respectively, pitch and loudness are not completely independent from each other. The perceived loudness of a sound is closely tied to the pitch of the sound. If a sound consisting of both low and high pitches (e.g., a piano being played at both ends of the keyboard) is played to an observer while varying the loudness (volume) of the sound, the observer will notice changes in pitch alongside the obvious changes in loudness. This phenomenon was explored in subjective listening-test surveys performed by H. Fletcher and W. A. Munson and later refined by D. W. Robinson and R. S. Dadson. These listening tests, as illustrated below, reveal the relationship between loudness and pitch.
The set of curves is titled the “Equal-Loudness Contours.” A given frequency and loudness (“Intensity”) can be taken as defining a point of origin on the graph (e.g., 1000 Hz and 40 decibels). This point of origin is located at the intersection of the vertical line marked “1000 Hz” (which illustrates intensity at a fixed frequency) with the horizontal line marked “40” (illustrating frequency at a fixed intensity). By starting at the predefined point of origin and varying the frequency (moving to the left or right) while following the curve associated with the starting point, one sees that hearing sensitivity (how the ear-brain system responds to a sound) is not consistent or ‘flat’ across frequency. This is particularly obvious for low frequencies (bass) and high frequencies (treble). For example, in order to make a low frequency (e.g., 40 Hz) have the same perceived loudness as a midrange frequency (e.g., 2 kHz), more sound pressure level is required for the low frequency. It is also apparent that, as the overall level of the curve is raised, the curve tends to flatten out, particularly at low frequencies. This illustrates that the difference between low frequency intensities and midrange frequency intensities is minimized as the volume or loudness is increased. Therefore, the louder the sound, the more even the response of the ear.
Psychoacoustic Interpretation of Time Domain
The final domain that must be explained with respect to human hearing is that of the time or phase domain. The human ears are physically separated by the width of the head (approximately 7 inches). This width can lead to a delay between the times at which a sound reaches each ear. The delay can be calculated by forming a triangle between the source and the two ears, dividing the length of each source-to-ear side by the speed of sound (1130 ft/sec or 344 m/sec) to obtain the travel time to each ear, and subtracting one result from the other. A graphical set of examples of this phenomenon is shown below, where the lengths of lines “y” and “z” are different.
A sound that radiates from a location directly in front of the head (source A), centered between the two ears, makes an isosceles triangle (the length of the two “x” lines being equal) from the source to the two ears. It therefore arrives at both ears at the same time because the distance to each ear is the same. When the source moves off of the center axis (sources B and C), the two path lengths differ, and the resulting differences in arrival time (as small as 10 microseconds) are interpreted by the brain as localization cues.
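The triangle calculation can be sketched in a few lines. This is a simplified geometric model, assuming straight-line paths from the source to two points 7 inches apart and a speed of sound of 344 m/sec; it ignores the way sound actually bends around the head. The coordinate convention and function name are choices made for this example.

```python
import math

SPEED_OF_SOUND_M_S = 344.0
EAR_SPACING_M = 7 * 0.0254  # approximately 7 inches, in meters


def interaural_time_difference_s(source_x_m, source_y_m):
    """Arrival-time difference between the ears for a point source.

    The ears sit at (-d/2, 0) and (+d/2, 0); the listener faces +y.
    A positive result means the sound reaches the right ear first.
    """
    half_spacing = EAR_SPACING_M / 2.0
    left_path = math.hypot(source_x_m + half_spacing, source_y_m)
    right_path = math.hypot(source_x_m - half_spacing, source_y_m)
    # Path length divided by speed gives travel time; subtract the two.
    return (left_path - right_path) / SPEED_OF_SOUND_M_S


if __name__ == "__main__":
    for label, x, y in (("straight ahead", 0.0, 2.0),
                        ("slightly right", 0.1, 2.0),
                        ("directly right", 2.0, 0.0)):
        itd_us = interaural_time_difference_s(x, y) * 1e6
        print(f"{label:15s}: {itd_us:7.1f} microseconds")
```

In this model a source directly ahead gives a difference of zero, and a source directly to one side gives the largest difference, about half a millisecond.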
At bass frequencies, the very long wavelengths associated with the low frequencies are unaffected by a 7 inch barrier. As the frequency of a sound increases into the midrange and high frequency ranges, the head does act as a physical barrier. This barrier diminishes the loudness of a sound from one ear to the next (with the ear closer to the sound source receiving a louder sound than the ear farther away). When the effects of the head-barrier are significant (as the sound’s frequency is increased to the midrange region), the brain “pays attention to” or interprets loudness differences rather than time differences to derive localization cues. If a sound is louder in one ear than in the other, the source must be closer to that ear since the energy of sound decreases over distance (thereby decreasing loudness). For localization of low to lower-mid frequencies, the time relationship is relied upon. As the frequencies increase above 700 Hz, the head begins to function as a barrier or wall to sound waves. Therefore, the Sound Pressure Level is not equal at both ears.
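The role of wavelength can be illustrated by comparing it with the width of the head. The sketch below assumes a speed of sound of 344 m/sec and treats the 700 Hz figure as a hard switch between cues, which is a deliberate simplification: in reality the transition is gradual. The names used are illustrative.

```python
SPEED_OF_SOUND_M_S = 344.0
HEAD_WIDTH_M = 7 * 0.0254   # approximately 7 inches, in meters
CROSSOVER_HZ = 700.0        # frequency above which the head acts as a barrier


def wavelength_m(frequency_hz):
    """Wavelength in meters of a sound at the given frequency."""
    return SPEED_OF_SOUND_M_S / frequency_hz


def dominant_cue(frequency_hz):
    """Localization cue relied upon at this frequency (simplified)."""
    if frequency_hz < CROSSOVER_HZ:
        return "time difference"
    return "level difference"


if __name__ == "__main__":
    for f in (40.0, 700.0, 5000.0):
        ratio = wavelength_m(f) / HEAD_WIDTH_M
        print(f"{f:6.0f} Hz: wavelength {wavelength_m(f):5.2f} m "
              f"({ratio:5.1f}x head width), cue: {dominant_cue(f)}")
```

A 40 Hz tone has a wavelength many times the width of the head, while a 5 kHz tone has a wavelength well under half of it, which is consistent with the head becoming an effective barrier only at higher frequencies.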
From Psychoacoustics to Sound Systems
To summarize, the brain is able to detect extremely small time differences between the two ears as well as very slight SPL differences at higher frequencies. The combination of these methods of localization allows the ear-brain system to localize horizontal sound sources to within 1 degree. Localization cues are critical to human survival since they provide significant information as to where objects (such as a fast moving truck) are in space.
The pinna focuses sound into the ear canal. The shape of the pinna (visible in Fig. 1) is essentially an acoustical information encoder. The pinna imposes (encodes) specific characteristics onto the incoming sound as it directs the sound to the ear canal. This encoding takes into account a sound’s angle of incidence (the angle at which the sound hits the ear) on the horizontal and vertical axes. The pinna encodes sound in two principal ways.
First, sound is encoded in terms of overall frequency content (by changing the loudness of certain frequencies). Second, sound is encoded in terms of time of arrival to the ear canal (recall the three domains of perception – pitch, loudness and time). In order to direct sound towards the ear, the pinna has a conical form that causes reflections of sound to be redirected towards the ear canal. These reflections vary according to frequency as well as angle. When the reflections off of the pinna are added together, a specific frequency response or emphasis/de-emphasis on certain frequencies occurs at the ear canal opening. An individual’s pinna therefore has a specific set of frequency responses for all different directions, both vertical and horizontal, for each different ear. The frequency responses are constant for a specific pinna, and do not change significantly over time. The brain can thus interpret these responses and translate the encoded sound into directional information.
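The way added reflections emphasize some frequencies and suppress others can be seen in a toy model. The sketch below is not a model of a real pinna: it assumes a single reflection that arrives a fixed time after the direct sound, and the 50 microsecond delay is a hypothetical value chosen only to place the first cancellation at 10 kHz. A different angle of incidence would correspond to a different delay and therefore a different pattern.

```python
import cmath
import math


def reflection_gain(frequency_hz, delay_s, reflection_strength=1.0):
    """Magnitude of direct sound plus one delayed reflection.

    The result is relative to the direct sound alone: 2.0 means the two
    paths reinforce each other fully, 0.0 means they cancel.
    """
    phase = -2.0 * math.pi * frequency_hz * delay_s
    return abs(1.0 + reflection_strength * cmath.exp(1j * phase))


if __name__ == "__main__":
    delay = 50e-6  # hypothetical reflection delay in seconds
    for f in (1000.0, 5000.0, 10000.0, 20000.0):
        print(f"{f:7.0f} Hz: gain {reflection_gain(f, delay):.2f}")
```

In this model, frequencies whose period is twice the delay are cancelled, while frequencies whose period equals the delay are reinforced, giving the kind of direction-dependent emphasis and de-emphasis described above.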
Directional information is exploited throughout modern day theaters and sound systems. With the influx of DVD and high-capacity storage media, it is possible to trick the human auditory system by providing more than conventional 2-channel, left to right localization. It is possible, using a technique known as binaural recording, to recreate a complete 3-dimensional sound-scape for an individual listener. By placing two tiny microphones next to a person’s ear canals and recording that person’s experiences, the acoustic effects of the pinna and the “head-barrier” are recorded simultaneously. When reproduced over a pair of headphones or two tiny loudspeakers placed in the same location as the microphones (next to the ear canal), the audio image is complete and full. Because the pinna creates many of the localization cues that are interpreted by the brain, and the recording captures those cues, it is apparent why this technique works. Binaural techniques are employed with a reasonable degree of success by computer game manufacturers, where individual users often listen on headphones while playing their games.
A problem arises when a sound must be reproduced for a larger audience, as it must in theaters. There is no practical method of directly piping sound into 100 or more people’s ears, bypassing the individuals’ pinnae. In order to accomplish immersive audio on a larger scale, it is necessary to provide a number of loudspeakers in various positions in space. Each of these individual loudspeakers takes the responsibility of reproducing a certain location in the 3-D sound-space (Fig. 2). Current sound systems in theaters often employ multiple horizontal loudspeakers in order to provide directional localization cues to the audience. Dolby Stereo, Dolby Digital, DTS, and Sony SDDS all employ from 3 to 7 loudspeakers across the horizontal plane to accomplish surround-sound. These sound systems make use of the practical and affordable technology available. For example, in the case of the most popular surround sound format for homes, Dolby Digital or “Dolby AC-3,” engineers are restricted by the storage medium (a DVD or Laser Disc) and equipment (DVD player or Laser Disc player) to a certain amount of data. Since the designers of Dolby Digital must work efficiently within these constraints, only the most effective loudspeakers are utilized (namely the Front Left, Front Center, Front Right, Back Left, and Back Right loudspeakers in the horizontal plane).
As technology progresses, more channels can be reasonably encoded and utilized in sound for film, music, and video-games, getting closer and closer to emulating a complete 3-dimensional sound-space. By utilizing their understanding of psychoacoustic localization principles and the methods by which the ears work, sound system designers for film, music, and video games will be able to efficiently create simulative and immersive experiences for listeners.
- B. Truax. “Handbook for Acoustic Ecology.” Simon Fraser University. Internet: http://www.sfu.ca/sca/Manuals/ZAAPf/t/the_ear.html, Jun. 21, 1999. [Nov. 1, 2000].
- F. Everest. The Master Handbook of Acoustics. 3rd ed. McGraw-Hill: New York (1994), pp. 40-41, 54.
- P. Lennie. Class Lecture, Topic: “The Auditory System.” Faculty of Brain and Cognitive Sciences, University of Rochester, Rochester, New York. Available: http://www.bcs.rochester.edu/bcs/programs/courses/info/245/ARCHIVES/S96/auditory_where.html. [Nov. 1, 2000].
- J. Eargle. Handbook of Recording Engineering. 3rd ed. Chapman & Hall: New York (1996).
- T. Holman. 5.1 Surround Sound, Up and Running. Focal Press: Boston (2000).