2 Artificial Neural Networks and Music
2.1 Some Connectionist Basics
2.1.1 Artificial Neural Networks
When classical cognitive science arose in the 1950s, it viewed cognition as the rule-governed manipulation of symbols, analogous to the operations of a digital computer or of a formal logic. The view that thinking is performing a sort of mental logic is called logicism (Oaksford & Chater, 1991, 2007). Logicism, in the form of classical cognitive science, had many early and compelling successes, particularly in the computer simulation of higher-order cognition (Feigenbaum & Feldman, 1995; Newell & Simon, 1972).
However, while classical theorists promised that thinking machines were on the horizon, these promises were repeatedly broken. Some researchers began to question the foundations and the potential of classical cognitive science (Dreyfus, 1972, 1992). In particular, some challenged the notion that cognition is the rule-governed manipulation of symbols. Arguments arose that while the brain is likely an information processor, it is unlikely to be similar to a digital computer. As a result, some cognitive scientists adopted models of information processing that are more biologically plausible. These cognitive scientists are known as connectionists. In the mid-1980s, connectionist cognitive science arose as a reaction against its logicist ancestor.
Connectionist cognitive scientists employ artificial neural networks as models of human information processing (Dawson, 2004, 2005). An artificial neural network is a computer simulation of interconnected processing units. Each processing unit is analogous to a neuron and behaves as follows: First, it computes the total signal that it is receiving from other processors in the network. Second, the processor converts this total signal into some level of internal activity. Third, the processor sends its internal activity on to other processors. All of these operations are mathematical: signals and processor activities are all numbers that are determined by simple mathematical equations. In addition, these operations are parallel: many different processing units can be operating at the same time.
If processors in a PDP (Parallel Distributed Processing) network are analogous to neurons, then connections between processors are analogous to synapses between neurons. Each connection in a network has an associated weight (a numerical value) that indicates the connection’s strength, as well as whether it is excitatory (positive weight) or inhibitory (negative weight). The connection is a communication channel that modifies a numerical signal sent through it by multiplying the signal by the connection’s weight. In general, an artificial neural network has layers of processing units; signals pass through weighted connections from one layer to the next. The function of a typical network is to generate a desired response to a stimulus. The stimulus (e.g., a signal from the environment) is encoded as a pattern of activity in a layer of input units. The network’s response to the stimulus is represented as a pattern of activity in its layer of output units. Intervening layers of processors in the system, called hidden units, detect more complex stimulus features.
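To make these operations concrete, the following is a minimal sketch in Python (with NumPy) of such a forward pass, using the layer sizes of the network in Figure 2-1, described next. The logistic activation function used here is one common choice (discussed in Section 2.4); the weight values and function names are illustrative assumptions, not a specification from the text.

```python
import numpy as np

def logistic(net):
    """Convert a unit's net input into activity between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-net))

def feedforward(stimulus, w_hidden, w_output):
    """Propagate input-unit activity through hidden units to output units.

    Each row of a weight matrix holds the weighted connections feeding
    one unit: positive weights are excitatory, negative are inhibitory.
    """
    hidden_net = w_hidden @ stimulus    # each hidden unit sums its weighted signals
    hidden_act = logistic(hidden_net)   # net input is converted into activity
    output_net = w_output @ hidden_act  # activity is sent on through more connections
    return logistic(output_net)         # output activity is the network's response

# A 12-input, 7-hidden, 12-output network with small random weights,
# as a network would be configured at the start of training.
rng = np.random.default_rng(seed=0)
w_hidden = rng.normal(0.0, 0.1, size=(7, 12))
w_output = rng.normal(0.0, 0.1, size=(12, 7))
response = feedforward(np.zeros(12), w_hidden, w_output)
```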
Figure 2-1 provides an example of a musical artificial neural network. This network consists of 12 input units (the circles at the bottom of the figure), seven hidden units (the circles in the middle of the figure), and 12 output units (the circles at the top of the figure). Each line between circles in Figure 2-1 represents a weighted connection from one processor to another. In this network, each input unit has a connection to each hidden unit, and each hidden unit has a connection to each output unit. There are no direct connections between input and output units. This particular network is an example of a chord progression network that will be discussed later in this book. It is presented one chord from a musical sequence and responds with the next chord in the sequence. When a chord is presented to the input units (in this case by activating four pitches, B, C, E, and G, shown in grey in the figure), signals are sent through its layers, producing responses in the output units. In Figure 2-1, the output units shaded in grey have turned on in response to the stimulus, while the unshaded output units remain off. Musically speaking, Figure 2-1 illustrates a network that has been presented a C major seventh (Cmaj7) chord and has responded with a C minor seventh (Cmin7) chord.
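The input encoding implied by this description, one input unit per pitch-class, is easy to sketch. The C = 0 through B = 11 index convention below is an assumption for illustration:

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]

def encode_chord(notes):
    """Encode a set of pitch-classes as a pattern of input-unit activity."""
    activities = [0.0] * 12
    for note in notes:
        activities[PITCH_CLASSES.index(note)] = 1.0
    return activities

# The Figure 2-1 stimulus: the four pitch-classes of Cmaj7.
stimulus = encode_chord(["C", "E", "G", "B"])
# -> [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
```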
Figure 2-1 provides an example of one type of artificial neural network. There are many different types of networks, including distributed associative memories, feedforward networks, recurrent networks, and self-organizing maps (Amit, 1989; Anderson, 1995; Bechtel & Abrahamsen, 2002; Dawson, 2004; Gluck & Myers, 2001; Grossberg, 1988). Within each of these network types one finds many different variations, including different learning rules, numerous functions for computing incoming signals, various methods for computing processor activity, and so on. In other words, the domain “artificial neural network” is wide and varied. The current book explores a small subset of these possible network types in the context of music.
Figure 2-1 An example artificial neural network that, when presented a stimulus chord, responds with another chord.
2.1.2 Teaching Networks
How might a network like the one illustrated in Figure 2-1 “know” what chord to respond with when it is presented a stimulus? An artificial neural network’s pattern of connectivity—its set of connection weights—defines its response to a stimulus. As a result, this pattern of connectivity is analogous to a computer program. However, one does not program artificial neural networks in any conventional sense. Instead, one teaches them. Networks receive a sequence of input patterns, and learn, by adjusting their connection weights, to produce the correct responses to presented patterns.
Typically, when a network is trained to classify patterns, it is presented a set of input patterns for which desired responses are already known. That is, each stimulus is associated with a desired response, and the goal is to train a network so that it generates this desired response for each pattern in the training set. To accomplish this, a supervised learning algorithm is used to train the network.
Supervised learning in general proceeds as follows: We start with a network whose connection weights are given small, random initial values. We present one of the training patterns to the network; it generates a response to this pattern using its current connection weights. Early in learning, because the network is started randomly, we expect its responses to be highly inaccurate. We can measure this inaccuracy by comparing the desired response for each output unit (the response that we want) to the observed response (the actual response generated by the network to the stimulus). We do so by taking the mathematical difference between the desired and observed responses. This difference is the error of an output unit.
Once error has been computed, it is used to modify connection weights in order to reduce network error. That is, after connection weights change, the next time the same pattern is presented the network will generate less error in response to it. There are a variety of learning rules that can be used to train artificial neural networks (Bishop, 1995; Caudill & Butler, 1992; Grossberg, 1988; Ripley, 1996; Rojas, 1996; Shepherd, 1997). For networks that include hidden units, the error computed for each output unit must be sent backward through the network in order for hidden unit error to be determined, and for hidden unit connection weights to be modified (Rumelhart, Hinton, & Williams, 1986; Rumelhart & McClelland, 1986b). All of the various supervised learning rules share one common feature: each time connection weights are modified, network error decreases. With enough training, the goal is to reduce network error to a magnitude small enough to say that the network has learned the correct response to each of the training stimuli.
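As a concrete sketch of this idea for the simplest case, a network with no hidden units, the delta rule changes each weight in proportion to an output unit's error and the activity of the input unit feeding it. (For networks with hidden units, error must first be propagated backward, as just described.) This is an illustrative rule, not the specific rule used later in this book, and the chord encoding follows the hypothetical C = 0 convention from above.

```python
import numpy as np

def train_step(weights, stimulus, desired, lr=0.1):
    """One supervised learning step for a network with no hidden units."""
    observed = weights @ stimulus                # the network's current response
    error = desired - observed                   # desired minus observed response
    weights += lr * np.outer(error, stimulus)    # delta rule: reduce future error
    return np.sum(error ** 2)                    # total squared network error

# Train on one stimulus-response pair: Cmaj7 in, Cmin7 out
# (pitch-classes encoded as 12-element vectors, C = position 0).
rng = np.random.default_rng(seed=1)
weights = rng.normal(0.0, 0.1, size=(12, 12))
stimulus = np.zeros(12); stimulus[[0, 4, 7, 11]] = 1.0  # C, E, G, B
desired = np.zeros(12); desired[[0, 3, 7, 10]] = 1.0    # C, Eb, G, Bb
for sweep in range(50):
    sse = train_step(weights, stimulus, desired)        # error shrinks each sweep
```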
2.1.3 What Can Networks Do?
A connectionist network is a computer simulation that converts an input pattern into an output response. What kind of stimulus-response mappings can artificial neural networks learn to generate?
One common task is pattern recognition (Lippmann, 1989; Pao, 1989; Ripley, 1996). When a network performs pattern recognition, it identifies its input pattern as belonging to a particular class. For instance, one might present a network a set of musical notes that the network could then classify as representing a particular type of musical chord (Yaremchuk & Dawson, 2005). The Figure 2-1 network can be considered a pattern recognition system, because it generates a discrete or categorical response to a stimulus.
Another task that connectionist networks can accomplish is function approximation (Siegelmann, 1999; Takane, Oshima-Takane, & Shultz, 1994). In a function approximation task, a network maps an input pattern into a continuous output response. In general, the input units represent the values of one or more x-variables, and the output unit(s) represents some function of these variables. That is, the network computes a function y = f(x1, x2, ..., xn) where an output unit represents the value of y. For instance, later we will discuss a network that is presented a summary of the notes in a particular song, and generates the probability that the song is written in a particular key. Pattern recognition and function approximation are two very general tasks that artificial neural networks can accomplish very well. As a result, connectionist models have arisen in a variety of different research domains, including perception (Carpenter & Grossberg, 1992; Wechsler, 1992), language (Mammone, 1993), brain function (Amit, 1989; Burnod, 1990; Gluck & Myers, 2001), and animal learning (Dawson, 2008; Enquist & Ghirlanda, 2005; Schmajuk, 1997).
Connectionist models have also been applied to a wide variety of problems in music and in musical cognition (Bharucha, 1999; Fiske, 2004; Griffith & Todd, 1999; Todd & Loy, 1991). A variety of network architectures have been applied to such topics as classifying pitch and tonality, assigning rhythm and metre, classifying and completing melodic structure, and composing new musical pieces. Let us briefly consider some examples of musical connectionism.
Connectionist networks can accomplish a variety of tasks that require classification of basic elements of Western music (e.g., pitch, tonality, and harmony). Artificial neural networks have been trained to classify chords (Laden & Keefe, 1989; Yaremchuk & Dawson, 2005, 2008), to assign notes to structures similar to the tonal hierarchy (Leman, 1991; Scarborough, Miller, & Jones, 1989), to model the effects of musical expectations on musical perception (Bharucha, 1987; Bharucha & Todd, 1989), to add harmony to melodies (Berkeley & Raine, 2011; Shibata, 1991), to determine the musical key of a melody (Griffith, 1995), to identify a melody even when it has been transposed into a different key (Benuskova, 1995; Bharucha & Todd, 1989; Page, 1994; Stevens & Latimer, 1992), and to detect the chord patterns in a composition (Gjerdingen, 1992).
Artificial neural networks can also model some perceptual illusions involving pitch. One example is virtual pitch (Terhardt, Stoll, & Seewann, 1982a, 1982b). In this illusion, one constructs a musical signal from a combination of sine waves (i.e., harmonics) but does not include the lowest-frequency sine wave, the fundamental frequency. The fundamental frequency determines the pitch of the tone (i.e., the octave in which the pitch is experienced). Human listeners, however, do not hear a tone missing its fundamental frequency in a different octave. Instead, they hear the tone in the correct octave, as if the missing fundamental frequency is put back into the stimulus (Fletcher, 1924). Certain types of artificial neural networks can use context (i.e., the presence of harmonic sine waves) to add the missing fundamental (Benuskova, 1994; Sano & Jenkins, 1989).
Artificial neural networks can also handle other important aspects of music that are independent of tonality, such as assigning rhythm and metre (Desain & Honing, 1989; Griffith & Todd, 1999). For example, one network for assigning rhythm and metre uses a system of oscillating processors—units that fire at a set frequency (Large & Kolen, 1994). The phase of an oscillator’s frequency can vary, and signals between processors enable their phases to entrain. This permits the network to represent the metrical structure of a musical input, even if the actual input is noisy or imperfect. This notion can be elaborated in a self-organizing network that permits preferences for, or expectancies of, certain rhythmic patterns to determine the final representation that the network converges to (Gasser, Eck, & Port, 1999).
The examples cited above generally involve using artificial neural networks to detect properties of existing music. The ability of networks to process tonality, harmony, metre, and rhythm also permits them to generate new music. Composition has in fact been one of the most successful applications of musical connectionism. Networks can compose single-voiced melodies on the basis of learned musical structure (Mozer, 1991; Todd, 1989); can compose harmonized melodies or multiple-voiced pieces (Adiloglu & Alpaslan, 2007; Bellgard & Tsang, 1994; Hoover & Stanley, 2009; Mozer, 1994); can improvise when presented new jazz melodies and harmonies (Franklin, 2006); and can improvise by composing variations on learned melodies (Nagashima & Kawashima, 1997).
The logic of network composition is that the relationship between successive notes in a melody, or between different notes played at the same time in a harmonized or multiple-voiced piece, is not random, but is instead constrained by stylistic, melodic, and acoustic constraints (Huron, 2006; Kohonen, Laine, Tiits, & Torkkola, 1991; Lewis, 1991; Mozer, 1991, 1994; Temperley, 2007). Networks can learn these constraints and then use them to generate the next note in a new composition.
The ability of artificial neural networks to exploit similarity relationships positions them to capture regularities that are difficult to express in language or using formal rules (Loy, 1991). This permits networks to solve musical problems that involve very abstract properties. For example, human subjects can accurately classify the genre or style of a short musical selection within a quarter of a second (Gjerdingen & Perrott, 2008). The notion of style or genre is too vague to be formalized in a fashion suitable for a classical rule-governed system (Loy, 1991). However, neural networks are up to the task, and can: classify musical patterns as belonging to the early works of Mozart (Gjerdingen, 1990); classify selections as belonging to different genres of Western music (Mostafa & Billor, 2009); evaluate the affective aesthetics of a melody (Cangelosi, 2010; Coutinho & Cangelosi, 2009; Katz, 1995); and even predict the possibility that a particular song has “hit potential” (Monterola, Abundo, Tugaff, & Venturina, 2009).
Artificial neural networks have musical applications that extend beyond human cognition. For instance, with the wide availability of digital music, networks are proving to be useful in serving as adaptive systems for selecting music, or generating musical playlists, based on a user’s mood or past preferences combined with the ability to process properties of the stored music (Bugatti, Flammini, & Migliorati, 2002; Jun, Rho, & Hwang, 2010; Liu, Hsieh, & Tsai, 2010; Munoz-Exposito, Garcia-Galan, Ruiz-Reyes, & Vera-Candeas, 2007; Wieczorkowska & Kubera, 2010). Networks can also be used to automatically transcribe music (Marolt, 2004a, 2004b) and to generate realistic-sounding singing voices by manipulating vibrato (Gu & Lin, 2014).
Clearly, there is a great deal of interest in using artificial neural networks to study musical cognition. Bharucha (1999) provides five different advantages of connectionist research on music. First, artificial neural networks can account for how music is learned. Second, connectionist theories of such learning are biologically plausible. Third, networks provide accounts of music perception phenomena, such as contextual effects and the filling-in of incomplete information. Fourth, networks exploit similarity-based regularities that are important in theories of musical cognition. Fifth, networks may discover regularities (e.g., in musical styles) that elude more formal analyses.
This fifth observation made by Bharucha (1999) hearkens back to the tension between universal laws and musical aesthetics faced by psychophysical researchers, as discussed in Chapter 1. Proposing that some aspects of music cannot be captured by formal rules is similar to claiming, like Helmholtz, that natural laws cannot explain musical aesthetics.
Much of the musical connectionism pursued in later chapters of this book reacts against this fifth point of Bharucha (1999). The current research does not agree that a main goal of musical connectionism is to capture informal regularities. Instead, the current research uses musical networks to reveal formal properties of music. The remainder of this chapter explores the uneasy relationship between formal and informal accounts of music, with a particular interest in connectionism’s role in this relationship.
2.2 Romanticism and Connectionism
2.2.1 Musical Romanticism
At the end of the period stretching from 1543 to 1687, the scientific revolution evolved into the Enlightenment (Ede & Cormack, 2004). The Enlightenment saw many of the ideas born during the scientific revolution extended and modified, particularly with respect to individualism, freedom, politics, and commerce. The resulting Industrial Revolution transferred power and wealth from the nobility to the commercial class (Plantinga, 1984).
The ideas that characterize the Enlightenment profoundly influenced politics, thinking, and art; this influence created discontent with the existing social order and led to the 1789 revolution in France. These upheavals, in turn, gave rise to an artistic and intellectual movement called Romanticism (Claudon, 1980), which roughly spanned the period from just before the French Revolution through to the end of the 19th century. Because the Enlightenment evolved from the scientific revolution, it too exalted reason and rationality. In contrast, Romanticism emphasized the individual, the irrational, and the imaginative (Einstein, 1947; Plantinga, 1984). It replaced reason with an emphasis on the imaginary and the sublime. Romantic artists looked back longingly at unspoiled, pre-industrial existence by depicting wild or fanciful settings. Nature was their inspiration. Romanticism appealed to the untamed mountains and chasms of the Alps to oppose the Enlightenment’s view of an ordered, structured world.
Many argue that the most purely Romanticist art was music, because music expresses mystical or imaginative ideas and emotions that language cannot (Einstein, 1947; Plantinga, 1984; Sullivan, 1927). Language, of course, is a key vehicle of reason and rationality. In rejecting language, Romanticism focused upon purely instrumental music that “became the choicest means of saying what could not be said, of expressing something deeper than the word had been able to express” (Einstein, 1947, p. 32). Romanticist composers strove to replace the calculated, rational form of such music as Bach’s contrapuntal fugues (Gaines, 2005; Hofstadter, 1979) with a music that expressed intense emotion and communicated the sublime (Einstein, 1947; Longyear, 1988; Plantinga, 1984; Whittall, 1987).
2.2.2 Connectionism as Romanticism
In reacting against classical cognitive science, connectionism also rejects the Cartesian rationalism that permeates classical cognitive science (Dawson, 2013). The rise of connectionist cognitive science is analogous to the Romanticist reaction against the Enlightenment. Dawson (2013) outlines many intellectual parallels between Romanticist music and connectionist cognitive science. We explore two of these parallels below.
The first parallel is the rejection of the logical. Romanticism rejected reason by moving away from language, particularly in music. This is paralleled in connectionism’s claim that cognitive explanations need not appeal to explicit rules or symbols (Bechtel, 1994; Bechtel & Abrahamsen, 2002; Horgan & Tienson, 1996; Ramsey, Stich, & Rumelhart, 1991; Rumelhart & McClelland, 1986a). Connectionism abandons logicism, and assumes that the internal workings of its networks do not involve the rule-governed manipulation of symbols. “We would all like to attain a better understanding of the internal operations of networks, but focusing our search on functional equivalents to symbolic operations could keep us from noticing what is most worth seeing” (Bechtel, 1994, p. 458).
The second parallel is connectionism’s sympathy with Romanticism’s emphasis on nature. Cartesian philosophy, and the classical cognitive science that it inspired, view the mind as disembodied from the natural world. Connectionists reject this perspective by developing models that are biologically plausible or neuronally inspired (McClelland & Rumelhart, 1986; Rumelhart & McClelland, 1986b). Connectionism emphasizes the brain.
Biological inspiration carries with it sympathy with the sublime. Connectionists accept that the internal structure of their networks is very difficult to understand (Dawson, 1998, 2004, 2009; Dawson & Shamanski, 1994; McCloskey, 1991; Mozer & Smolensky, 1989; Seidenberg, 1993). This is because their networks mimic the mysterious structure of the brain. “One thing that connectionist networks have in common with brains is that if you open them up and peer inside, all you can see is a big pile of goo” (Mozer & Smolensky, 1989, p. 3). In the connectionist literature, detailed analyses of the internal structure of a network, coupled with accounts of how network structures solve problems of interest, are rare (Dawson, 1998, 2004, 2005, 2009, 2013; Dawson & Shamanski, 1994).
Connectionism’s rejection of logicism and its embrace of the sublime account for its current popularity in the study of musical cognition. Some researchers believe that artificial neural networks can capture musical regularities that cannot be rationally expressed (Bharucha, 1999; Rowe, 2001; Todd & Loy, 1991). Of course, this belief about the utility of networks parallels the Romanticist view that one cannot formalize important characteristics of music.
This Romanticist perspective is readily evident, for example, in discussions of networks that compose music. Such networks are presumed to internalize constraints that are difficult to formalize. “Nonconnectionist algorithmic approaches in the computer arts have often met with the difficulty that ‘laws’ of art are characteristically fuzzy and ill-suited for algorithmic description” (Lewis, 1991, p. 212). Furthermore, these “laws” are unlikely to arise from analyzing the internal structure of a network, “since the hidden units typically compute some complicated, often uninterpretable function of their inputs” (Todd, 1989, p. 31). Such accounts of modern connectionist networks evoke the earlier musings of Helmholtz about the nature of musical aesthetics.
Connectionist Romanticism raises some questions that we explore in detail in the final section of this chapter. First, if the musical regularities captured by artificial neural networks cannot be formally expressed, then what is the purpose of such networks in a cognitive science of music? Second, is it possible that musical networks can capture formal musical regularities?
2.3 Against Connectionist Romanticism
2.3.1 Bonini’s Paradox
Models attempt to enhance our understanding of the world. Cognitive scientists use many different kinds of models. These include statistical models that describe data (Kruschke, 2011; Lunneborg, 1994), mathematical models that provide quantifiable laws (Atkinson, Bower, & Crothers, 1965; Coombs, Dawes, & Tversky, 1970; Restle, 1971), and computer simulations that themselves generate behaviour of interest (Dutton & Starbuck, 1971; Newell & Simon, 1961, 1972). Regardless of type, a model serves to increase understanding by providing a simplified and tractable account of some phenomenon of interest.
Merely creating a model, however, does not always guarantee greater understanding. This is particularly true of computer simulations of cognitive processes (Lewandowsky, 1993). Such simulations can encounter what Dutton and Starbuck (1971) call Bonini’s paradox. This paradox occurs when a computer simulation is as difficult to understand as the phenomenon being modelled. There are reasons to believe that the Romanticism of connectionism leads directly to Bonini’s paradox, particularly in the study of musical cognition.
To begin, the internal structure of artificial neural networks is notoriously difficult to understand. This is because of their parallel, distributed, and nonlinear nature. Connectionists, after training a network, are often hard pressed to describe how it actually accomplishes its task.
In the early stages of the connectionist revolution, this was not a pressing concern. The 1980s was a period of “gee whiz” connectionism (Dawson, 2009), in which connectionists modelled phenomena that were prototypical for classical cognitive science. In the mid-1980s, it was sufficiently interesting to show that such phenomena might be accounted for by alternative kinds of models. Researchers during this period were not required to delve into the details of the internal structures of networks to explain their operations. However, in modern connectionist cognitive science it is necessary for researchers to spell out exactly how networks function (Dawson, 2004). In the absence of such details, connectionist models have absolutely nothing to contribute to cognitive science (McCloskey, 1991). It is no longer enough for a network to be inspired by the (sublime) brain. A network must provide details that give insight into how brains might actually process information. An uninterpreted network produces Bonini’s paradox.
Importantly, many musical networks have an additional wrinkle that makes them difficult to understand. In connectionism, there are two general approaches to network training. One is supervised learning. In supervised learning, a researcher defines a set of desired input/output pairings, and trains a network to generate these, typically with an error-correcting learning rule such as the one introduced in the next chapter. That is, in supervised learning the researcher knows beforehand what a network is designed to do and teaches the network to respond accordingly.
The other approach to network training is called unsupervised learning and is typically employed in what are called self-organizing networks (Amit, 1989; Carpenter & Grossberg, 1992; Grossberg, 1980, 1987, 1988; Kohonen, 1977, 1984). In unsupervised learning, one presents input patterns to a network, but these patterns are not paired with desired outputs. Instead, networks stabilize after every input and then modify their weights to encode this stable state. As a result, a self-organizing network learns the statistical regularities in its input patterns and generates responses that reflect these regularities, without external guidance or teaching.
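A minimal sketch of this kind of learning, in the style of a Kohonen self-organizing map: no desired outputs are given, and each unit's weights simply drift toward the inputs it matches best. The usual neighbourhood function is omitted for brevity, so this is an assumption-laden simplification rather than any particular published algorithm.

```python
import numpy as np

def self_organize(patterns, n_units=7, lr=0.2, epochs=25, seed=0):
    """Unsupervised learning: units come to encode clusters in the inputs."""
    rng = np.random.default_rng(seed)
    weights = rng.random((n_units, patterns.shape[1]))
    for _ in range(epochs):
        for x in patterns:
            # Find the best-matching unit and pull it toward the input.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            weights[bmu] += lr * (x - weights[bmu])
    # Each row now approximates one statistical regularity in the inputs,
    # learned without any external teacher.
    return weights
```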
In general, connectionist cognitive science is much more likely to use supervised learning than unsupervised learning. However, it has been argued that this is not true in the study of musical cognition (Dawson, 2013), which seems to have a marked preference for unsupervised learning. In two collections of papers about connectionist musical cognition (Griffith & Todd, 1999; Todd & Loy, 1991), one finds many more self-organizing networks than one would expect to find in other domains of cognitive science.
How is the use of self-organizing networks related to Bonini’s paradox? First, one constructs these networks from the same components used to create other sorts of artificial neural networks. Therefore, they are just as difficult to interpret as any other kind of network. Second, if one uses unsupervised learning to train a network, then difficulties in understanding the network’s internal structure are compounded by the additional fact that one may not even know what it is that the network has learned. If one does not know what a network’s responses are supposed to signify, then how can one understand the network?
2.3.2 An Alternative Paradigm
Of course, a strong motivator for applying connectionism to musical cognition, and for preferring unsupervised learning, is the Romanticist view that important aspects of music are informal. Presumably, networks can capture these regularities informally. However, any input/output relationship that can be realized in an artificial neural network must be formal. All connectionist networks are mathematical engines that compute functions: they map numerical elements from an input domain onto numerical elements in an output domain. In other words, musical networks do not have the advantage of capturing informal regularities that symbolic languages cannot capture, as some propose (Bharucha, 1999). Instead, networks have the disadvantage of capturing formal regularities that are difficult to ascertain or to express. We must discover and detail these regularities if connectionism is to contribute to the cognition of music.
Certainly, networks are hard to interpret. However, it is not impossible to explore the internal structure of a trained network in order to explain how it converts its inputs into its responses. Connectionist cognitive scientists have developed many techniques for interpreting the internal structure of artificial neural networks (Baesens, Setiono, Mues, & Vanthienen, 2003; Berkeley, Dawson, Medler, Schopflocher, & Hornsby, 1995; Dawson, 2004, 2005; Gallant, 1993; Hanson & Burr, 1990; Hayashi, Setiono, & Yoshida, 2000; Hinton, 1986; Moorhead, Haig, & Clement, 1989; Omlin & Giles, 1996; Setiono, Baesens, & Mues, 2011; Setiono, Thong, & Yap, 1998; Taha & Ghosh, 1999).
The research detailed in the remaining chapters of this book concerns training artificial neural networks on musical tasks and then interpreting the internal structure of each trained network. We interpret networks in order to discover the manner in which they solve musical problems. Earlier we saw that Krumhansl (1990a) made a number of design decisions that guided her studies of musical cognition. Similar decisions guide the research described in the chapters that follow. Let us consider these design decisions.
Krumhansl (1990a) focused her experimental research by studying subjects’ responses to musical pitch, typically building her stimuli from the manageable set of 12 pitch-classes. She did this because the principle of octave equivalence captures the notion that pitch-class is a psychologically valid concept, because pitch-class is the foundation of Western tonal music, and because combinations of pitch-classes can be used to define more complex musical entities such as intervals, chords, and scales.
The simulation research reported in this book also focuses on tasks that involve pitch-class representations of stimuli. One reason for this design decision is an endorsement of all of Krumhansl’s (1990a) reasons for focusing on pitch. Later we demonstrate that pitch-class representations permit the definition of a number of interesting musical tasks. For instance, a network can be presented inputs represented in terms of constituent pitch-classes and can learn to perform such tasks as identifying a scale’s tonic, determining whether a scale is major or minor, classifying various types of triads and tetrachords, and generating the next chord in a progression.
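For example, a single training pair for the tonic-identification task might look like the following sketch. This is a hypothetical encoding for illustration; the actual training sets are detailed in later chapters.

```python
def one_hot(indices, size=12):
    """Activate the named positions in a 12-unit pitch-class vector."""
    activities = [0.0] * size
    for i in indices:
        activities[i] = 1.0
    return activities

# C major scale as pitch-class numbers (C = 0, D = 2, ..., B = 11).
c_major = [0, 2, 4, 5, 7, 9, 11]

# Stimulus: the scale's seven pitch-classes; desired response: its tonic, C.
training_pair = (one_hot(c_major), one_hot([0]))
```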
A second reason for emphasizing pitch-class representations in the current research is that a great deal of the theory of Western music is related to pitch-class (Forte, 1973). We will see many examples in which network interpretations relate to music theory, and this relationship is facilitated when training sets are represented using pitch-classes.
Krumhansl (1990a) also made a number of design decisions concerning her experimental methods, such as whether subjects required musical expertise, and what types of judgments were required of subjects. We make a number of analogous design decisions concerning the nature of the networks to study.
First, all of the simulation studies that I report involve supervised learning. That is, I train networks to generate a desired set of input/output responses. This is because my primary goal is to interpret the internal structure of trained networks. To accomplish this research goal it is extremely helpful to know precisely the responses that a network has learned.
Second, all of the tasks my networks learn via supervised training involve well-established concepts in Western music theory. One reason for this is that music theory itself is typically used to construct a set of training stimuli. A second reason is that for such tasks music theory itself is a powerful aid to network interpretation.
Third, all of the simulation studies that I report seek the simplest network architecture capable of learning a desired input/output mapping. For networks that include hidden units, this means finding the smallest number of required hidden units. If a network with no hidden units can solve a problem, then I study that network. The reason for seeking the simplest network that can learn a task is my goal of network interpretation: simpler networks are easier to interpret.
Fourth, many of the simulation studies that I report use a particular architecture that I call networks of value units (Dawson & Schopflocher, 1992). Such networks employ an activation function tuned so that processing units only turn on to a narrow range of incoming signals. One advantage of this architecture is that networks of value units have many desirable properties when the goal is to interpret their internal structure (Berkeley et al., 1995; Dawson, 1998, 2004, 2013). As the value unit architecture is not standard, the next section describes it in more detail, and explains its advantages for a research project that has as its goal the interpretation of the internal structure of trained networks.
2.4 The Value Unit Architecture
2.4.1 Activation Functions
A key element of a processor in an artificial neural network is its activation function. An activation function is a mathematical equation that converts a processor’s net input into a numerical value called the processor’s activity. If the processor is an output unit, then its activity is a response. If the processor is a hidden unit, then its activity is passed on as a signal to other processors in the network.
In modern artificial neural networks, most activation functions are nonlinear. The most common in the literature is the logistic function:

$$f(net) = \frac{1}{1 + e^{-(net - \theta)}}$$
In this equation, net is a processor’s net input, and θ is the unit’s bias. Figure 2-2 illustrates the logistic function. That it is nonlinear is evident in its sigmoid shape. The values of this function range from zero (when net input is at negative infinity) to one (when net input is at positive infinity). When net input is equal to θ, it produces an activity of 0.5. Thus, the bias is analogous to a processor’s threshold. Processors that use the logistic activation function are called integration devices (Ballard, 1986).
Figure 2-2 The logistic activation function used by an integration device to convert net input into activity.
The logistic activation function is the most common in connectionism and was fundamental to the discovery of learning rules for networks that include hidden units (Rumelhart et al., 1986). However, it is not the only activation function to be found in artificial neural networks. One review paper notes that an extremely large number of different activation functions exist in the connectionist literature (Duch & Jankowski, 1999).
One alternative activation function (Dawson & Schopflocher, 1992) uses a particular form of the Gaussian equation:

$$G(net) = e^{-\pi(net - \mu)^2}$$
In this equation, the value µ is analogous to the bias of an integration device. However, when net input equals µ this equation produces a maximum activity of one. As net input moves away from µ in either direction, activity drops quickly toward zero. Figure 2-3 illustrates the shape of this activation function. Because this function generates high activity to a very narrow range of net input values, processors that use this activation function are called value units (Ballard, 1986).
Figure 2-3 The Gaussian activation function used by a value unit to convert net input into activity.
One characteristic of an integration device is that after net input reaches a sufficiently high value, its activity is essentially “on.” In Figure 2-2, there would not be an appreciable difference between activity when net input is 6 and activity when net input is 600. A value unit exhibits a very different sensitivity to net input: it generates very high activity for a narrow range of net inputs, and very low activity for any net input outside this narrow range. This distinctive activation function often gives value units advantages over the more traditional integration device architecture (Dawson & Schopflocher, 1992).
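This contrast is easy to verify numerically. The sketch below sets θ and µ to zero purely for illustration:

```python
import numpy as np

def logistic(net, theta=0.0):
    """Integration device: sigmoid activity ranging from 0 to 1."""
    return 1.0 / (1.0 + np.exp(-(net - theta)))

def gaussian(net, mu=0.0):
    """Value unit: maximum activity of 1 only when net input equals mu."""
    return np.exp(-np.pi * (net - mu) ** 2)

print(logistic(6.0), logistic(600.0))  # ~0.9975 and 1.0: both essentially "on"
print(gaussian(0.0), gaussian(6.0))    # 1.0 at mu, essentially 0 outside the band
```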
First, for many problems, networks of value units learn much faster than do networks of integration devices. Second, networks of value units tend to require fewer hidden units than do networks of integration devices when confronted with a complex problem.
Third, and most important in the context of this book, value units have emergent properties that make the internal structure of networks that contain them easier to interpret than networks of integration devices (Berkeley et al., 1995). In general, this is because each value unit is tuned, via its activation function, to respond to very particular combinations of stimulus features. This is not the case for integration devices, which in essence turn on when a sufficient number of stimulus features are present. The tuning that is inherent in the value unit architecture provides a window into how networks solve problems that is rarely available with the more traditional architecture. I have been able to take advantage of value unit properties to interpret the internal structure of many networks trained to solve problems in a wide variety of domains (Dawson & Boechler, 2007; Dawson, Boechler, & Orsten, 2005; Dawson, Boechler, & Valsangkar-Smyth, 2000a; Dawson, Medler, & Berkeley, 1997; Dawson, Medler, McCaughan, Willson, & Carbonaro, 2000b; Dawson & Piercey, 2001; Dawson & Zimmerman, 2003; Leighton & Dawson, 2001; Medler, Dawson, & Kingstone, 2005; Yaremchuk & Dawson, 2005, 2008).
This is not to say that network interpretation requires the value unit architecture. In some instances, networks of integration devices are indeed better for this task (Graham & Dawson, 2005). However, our experience has shown that value units are often advantageous when network interpretation is the goal, and it is for this reason that they are used in many of the simulations that we report. On the one hand, the success of these simulations provides more evidence in support of this network choice. On the other hand, the point of this book is the value of network interpretations; seeking interpretations in networks that use other activation functions is an important task that should be encouraged. Perhaps results like those reported in the chapters that follow will stimulate researchers to interpret the internal structure of other types of networks.
2.5 Summary and Implications
The musical cognitivism that was introduced in Chapter 1 is situated in classical cognitive science, which assumes that cognition results from the rule-governed manipulation of mental representations. Chapter 2 began by pointing out that alternative notions of cognition exist within cognitive science (Dawson, 2004). One of these is connectionism, which views cognition as emerging from non-symbolic information processing in the brain. Connectionist cognitive science models this type of information processing with artificial neural networks. Chapter 2 introduced some of the basic properties of these networks, including processing units, weighted connections, and the ability of networks to learn from experience. It then provided an overview of the general methodological considerations that guided the simulations reported in the chapters that follow. These include focusing on musical tasks that involve pitch-class representations of stimuli, the use of supervised learning, seeking the simplest networks capable of solving musical problems, and a preference for the value unit architecture. In general, the goal of the simulation research is to interpret the internal structure of networks trained on musical tasks in order to determine the kind of musical regularities that these networks exploit and represent.
2.5.1 Methodological Implications
The types of networks explored in the chapters that follow were initially developed in a tradition that explored spatial pattern recognition (McClelland & Rumelhart, 1986; Minsky & Papert, 1969; Pao, 1989; Ripley, 1996; Rosenblatt, 1962; Rumelhart & McClelland, 1986b; Schwab & Nusbaum, 1986). As a result, the networks typically learn to make judgments about sets of pitches that are presented simultaneously to a network—that is, across a spatial array of input units—instead of being presented in succession. Such input representations are closely related to the more abstract representations of music that arise when mathematical set theory is applied to music (Forte, 1973, 1985; Roig-Francolí, 2008; Straus, 2005).
Choosing this type of input representation means that the musical tasks we consider involve classification (e.g., identifying a scale’s type, classifying types of musical chords, identifying a composition’s musical key, and so on). I will for the most part not be concerned with temporal properties of music, such as rhythm. I will also not explore temporal properties that involve presenting musical stimuli over time, note by note. However, it is important to realize that I am not in principle limited to analyzing non-temporal properties of music. For instance, later in this book we will see how a (spatial) network is presented an input chord, and then generates the next chord to be played in a particular progression. In addition, I could, in principle, present music temporally to these spatial networks. For instance, I could use a network’s input units as a temporal window—an encoding of the pitches being “heard” at a particular moment in time—and then pass a musical stimulus over time through this input window.
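As a hedged illustration of that windowing idea (not a technique used in the simulations that follow), one could slice a note sequence into overlapping windows and present each window as a single spatial input:

```python
def temporal_windows(note_sequence, width=4):
    """Slide a fixed-width window across a sequence of notes.

    Each window is a snapshot of the pitches "heard" at one moment,
    suitable for encoding as activity on a spatial network's input units.
    """
    return [note_sequence[i:i + width]
            for i in range(len(note_sequence) - width + 1)]

melody = ["C", "E", "G", "B", "C", "E"]
for window in temporal_windows(melody):
    print(window)  # ['C', 'E', 'G', 'B'], then ['E', 'G', 'B', 'C'], ...
```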
By limiting the tasks below to those that I can present easily to our architectures of choice, I am simply exploring the possibility that such networks can provide new insights that may be relevant to musical cognition or to music theory. To the extent that I encounter success, I will be motivated to explore more complicated encodings in the future, pursuing similar temporal discoveries with the architectures described below. Furthermore, the exploration of music is not limited to the architectures that are employed in this book. Other architectures have been used (Griffith & Todd, 1999; Todd & Loy, 1991), such as self-organizing networks that learn statistical properties of inputs without requiring an external teacher (Gjerdingen, 1990; Kohonen, 2001; Page, 1994) or recurrent networks that are explicitly designed to detect regularities in time (Elman, 1990; Franklin, 2004, 2006).
A key theme of this book is that regardless of the architecture that one uses to explore music with neural networks, or of the types of musical regularities being investigated, after training a network it is critically important to interpret its internal structure. The literature has long established the computational power of artificial neural networks, so the fact that a network can learn a particular task should not by itself be either interesting or surprising. The interesting information provided by networks can only emerge from their interpretation: finding out how networks actually solve the tasks that they learn (Dawson, 1998, 2004, 2009, 2013).
2.5.2 Synthetic Psychology
The methodological implications discussed in the previous section pertain directly to a connectionist study of music. Importantly, the networks to be presented in the chapters that follow reflect a more general research program called synthetic psychology (Braitenberg, 1984). In synthetic psychology, models are built first, and then used to produce data. The hope is that this data will include surprising phenomena, and that we will be in a position to offer straightforward accounts of these surprises because of our knowledge about the model itself.
It has been argued that one can use artificial neural networks in cognitive science to conduct synthetic psychology (Dawson, 2004). However, for this particular flavour of synthetic psychology to succeed, network interpretations must be supplied. It is never a surprise to find that some network can learn a particular task. This is because networks are, in principle, extremely powerful information processors (Siegelmann & Sontag, 1991). Following from this, one can only be surprised by the methods for solving problems that networks discover as they learn. Of course, to experience such surprises one must examine the internal structure of trained networks. From this perspective, one can consider this book a case study of how connectionists can perform synthetic psychology.
One consequence of using networks to advance synthetic psychology is that one aspect of connectionist Romanticism is not abandoned: the emphasis on individual networks. In some cases, such as when we examine how learning speeds are affected by different encodings, we train multiple networks on the same problem, treat each network as a different subject, and use statistics to get a sense of general performance. However, this is not the typical approach in this book. Instead, I will usually focus on a single network that has been trained on a particular problem, and interpret the internal structure of this particular subject.
There is, of course, a concern that this approach will not reveal average or typical network structures, because it focuses upon individuals instead of groups. However, my experience—particularly with the musical problems that I detail in the chapters that follow—is that this is not the case. I have examined the connection weights of many different networks trained on the same task as part of the research that has culminated in this book, and I repeatedly find similar internal structures in different networks. Indeed, I have taken advantage of this to use some of the tasks described later as exercises in a neural network course. Students in this course train networks in class and then find structures in their networks similar to those reported in the chapters that follow. I am confident that my interpretations reflect typical network solutions to these musical problems. Of course, this does not rule out the interesting possibility that networks that use different internal structures to solve these problems can still be discovered.
2.5.3 Seeking New Music Theory
If the zeitgeist of the connectionist cognitive science of music is to capture that which cannot be formalized, then the general paradigm outlined above might seem odd. If we use pitch-class representations, if we use supervised learning, and if we train networks on established concepts of music theory, then what new information can we hope to learn? Should it not be the case that all we will pull out of our networks is the music theory that we put in?
Interestingly, this is not the case. My networks typically reveal alternative solutions to musical problems that lead to new ideas in music theory. The task of the remainder of this book is to provide evidence to support this claim. Let us begin by considering networks that are trained on a basic musical task: identifying the tonic note of a musical scale.