“Chapter 9: Connectionist Reflections” in “Connectionist Representations of Tonal Music”
9
Connectionist Reflections
9.1 A Less Romantic Connectionism
The Overture that began this book fancifully asked whether it is true that two systems that generate the same musical inputs and outputs must also share an underlying theory of music. Do the aliens inside the mother ship in Close Encounters of the Third Kind require a theory of Western music to jam with the human scientists below? Alternatively, is it possible that some alien musical theory can also produce identical musical patterns? The purpose of this book was to explore this issue, replacing fictional organisms from another world with agents that are more practical: artificial neural networks. I trained networks to perform a number of tasks that mapped musical stimuli to responses that are well defined in Western music theory. After successfully teaching these networks, I conducted in-depth analyses of their internal structure. In general, I discovered formalisms inside these networks that differed from those that are typical of Western music. The purpose of this chapter is to step back and consider the general results that have been detailed in the preceding chapters, and to discuss the implications of these results. However, before embarking on this summary, let us first consider the relationship between the connectionist research we have been considering and more typical studies that employ artificial neural networks.
9.1.1 A Romantic Revolution
Connectionist cognitive science erupted in the mid-1980s with the discovery of learning rules capable of training networks with a layer of hidden units (Ackley, Hinton, & Sejnowski, 1985; Rumelhart et al., 1986). Connectionism, as a reaction to classical cognitive science, has many parallels with the Romanticist reaction against the age of reason (Dawson, 2013). Two of these parallels are of particular interest to us as we reflect upon the results presented in this book.
First, as noted in Chapter 2, when connectionist cognitive science arose it explicitly abandoned theories that appealed to the rule-governed manipulation of symbols. When researchers introduced artificial neural networks by training them to perform prototypically classical tasks, such as changing the tense of verbs (Rumelhart & McClelland, 1986a) or solving logic problems (Bechtel & Abrahamsen, 2002), this was done to demonstrate that such tasks did not require using explicit rules and symbols. Connectionism is Romanticist in its abandonment of the formal or logical.
Second, connectionist cognitive science attacked the classical view for proposing theories that were neither neuronally inspired nor biologically plausible. A central tenet of connectionism is that intelligence emerges from the unique and complex interactions among vast numbers of nonlinear neurons (Churchland, 1986; Churchland & Sejnowski, 1992; Searle, 1984). However, in appealing to biologically plausible information processing, connectionists moved toward theories that were nearly impossible to elaborate fully. When inspired by the brain, connectionists are swept away by the sublime, revealing their Romanticism.
Connectionism’s rejection of the formal and its acceptance of the sublime are further cemented by the fact that connectionists rarely provide detailed interpretations of their networks’ internal structures. Perhaps this is particularly true of the limited connectionist literature on music cognition. Many researchers believe that a key advantage of artificial neural networks is their ability to adapt to musical regularities that cannot be formalized (Bharucha, 1999; Rowe, 2001; Todd & Loy, 1991).
Recent developments in artificial neural network research take connectionist Romanticism even further. In the 1980s, the connectionist revolution began with networks that used only one or two layers of hidden units. Nowadays there is a growing sense that such networks are inadequate for adapting to complex, real-world situations. There is an emerging interest in a new set of techniques, called deep learning, that allow researchers to train networks with many layers of hidden units (Bengio, Courville, & Vincent, 2013; Hinton, 2007; Hinton, Osindero, & Teh, 2006; Hinton & Salakhutdinov, 2006; Larochelle, Mandel, Pascanu, & Bengio, 2012). However, the interpretation of deep networks is even more challenging than the interpretation of shallower networks (Erhan, Courville, & Bengio, 2010). Thus, as connectionism turns more and more to deep learning, it becomes even more Romanticist in nature.
9.1.2 Reducing Romanticism
The research presented in this book moves in the direction opposite to deep learning. I have trained very shallow networks—some having no hidden units at all—on very simple musical tasks, employing very straightforward representations. This approach reacts against connectionist Romanticism. I am not concerned with informal, sublime musical properties. Instead, this research project explores the ability of shallow networks to capture formal musical regularities. Furthermore, I am particularly interested in whether networks can reveal new formal properties that are not typical of modern music theory. This book has explored the viability of an anti-Romanticist use of artificial neural networks.
This exploration was largely inspired by one of the seminal publications in musical cognition, Carol Krumhansl’s Cognitive Foundations of Musical Pitch (Krumhansl, 1990a). Krumhansl used perceptual experiments to explore her subjects’ responses to musical stimuli. Recognizing the enormous breadth of possible musical stimuli, not to mention the range of possible musical responses and the variety of potential musical expertise, Krumhansl shaped and streamlined her research by making a number of important design decisions. She developed paradigms that permitted responses to musical stimuli to be obtained easily. She focused on responses to a limited but well-established domain, musical pitch, typically creating her musical stimuli from a simple and tractable set of building blocks: the 12 pitch-classes of Western tonal music. Krumhansl’s research led to a number of fundamental insights into music and musical cognition. These insights were possible because of her design decisions.
Before conducting our simulations, I made a number of design decisions that were similar in spirit to those made by Krumhansl. First, I employed supervised learning rules for our artificial neural networks, because knowing exactly what a network had learned to do would facilitate the interpretation of its internal structure. In supervised learning, networks converge to a solution only after they generate the desired (and known) response to each input pattern.
Second, all of our simulations trained networks to generate simple, well-established regularities of Western tonal music. Again, I hoped that a solid formal understanding of what the networks learned to do would better position me to interpret their internal structure. These tasks included identifying the tonic and the root of musical scales, key-finding, identifying the types of different triads and tetrachords, and generating chord progressions.
Third, most of our simulations used very simple input and output representations. Many of the networks described in earlier chapters used pitch-class encoding, which meant that very simple networks—networks that only had 12 input units—could learn a musical task of interest. Our choice of this representation paralleled Krumhansl’s (1990a) assumption of octave equivalence in her cognitive studies of musical pitch.
Fourth, all of our simulations sought the simplest networks capable of converging upon a solution to the task. Simple networks are easier to interpret. By seeking the simplest networks, I was able to make some surprising discoveries. For instance, identifying a scale’s tonic, performing key-finding, or generating the next chord in a jazz progression, can all be accomplished by simple perceptrons that do not require any hidden units.
Fifth, the point of each of our simulations was not simply to create a network to solve a particular musical task. Instead, the point was to interpret the internal structure of a trained network in order to determine the solution to the task that it had discovered.
All of the research presented in this book explores a basic question: can shallow networks trained on simple musical problems reveal anything novel about the structure of music? In general, I can say that the answer to this question is a resounding yes. Let us now reflect on the various results reported in the preceding chapters.
9.2 Synthetic Psychology of Music
9.2.1 Synthetic Psychology
Cognitive science is mostly conducted using an analytic strategy called reverse engineering (Cummins, 1983; Dawson, 2013). In reverse engineering, data is collected from a behaving system, and then some type of model—a model of data, a mathematical model, or a computer simulation—is fit to the data (Dawson, 2004). The purpose of the model is to provide a concise account of the regularities in the data. Reverse engineers collect data first, and fit the model to the data second. However, an alternative research approach called forward engineering is also available to cognitive scientists (Dawson, 2004, 2013; Dawson et al., 2010a). Forward engineering is also known as synthetic psychology (Braitenberg, 1984). In forward engineering, a model is first put together from a basic set of interesting components. Then the behaviour of the model is observed in various situations. In forward engineering, the model does not describe data that has already been collected, but instead is the source of the data. Synthetic psychologists build their models first, and only collect data second—from the model they have constructed.
Artificial neural networks have been proposed as an ideal medium in which synthetic psychology can be conducted (Dawson, 2004). When used in forward engineering, the basic components of artificial neural networks (activation function, learning rule, etc.) are the basic building blocks used to construct the model. Because the network is trained on what to do, but is not told specifically how to accomplish this task, networks can be a source of surprising algorithms or representations that can be used to propose novel theories in cognitive science. However, for connectionist synthetic psychology to succeed, it requires substantial reverse engineering after synthesizing a network. The surprising revelations produced by connectionist forward engineering reveal themselves only after understanding the internal structure of a trained network. The insights that networks provide are not found in their behaviour (the input/output mappings they produce), but instead exist in the regularities that networks have discovered to mediate their behaviour.
This book serves as a case study in connectionist synthetic psychology. First, I forward engineer networks to solve basic musical problems. Second, I reverse engineer the networks to determine how they solve these problems, and to relate these methods to traditional music theory. The success of this approach is measured by the nature and number of surprising representations that I discover in the trained networks. The following sections summarize these major discoveries. The surprises that I revealed concern the complexity of networks trained on various tasks, the suitability of value units for musical tasks, the importance of the tritone, the use of strange circles, and the properties of distributed representations of music.
9.2.2 Network Complexity
One of the mysteries confronting a connectionist forward engineer at the start of a project concerns network complexity. What kind of network is required to learn a task? Is a multilayer perceptron necessary? If so, how many hidden units does it require? When one begins a simulation project, one typically has certain expectations about network complexity. Surprises often occur when these expectations are shown to be false.
One example of this occurred in the initial simulations involving scale tonics and scale modes (Chapters 3 and 4). Conventional music theory defines a very specific pattern of musical intervals between adjacent notes in both a major and a harmonic minor scale. It does not provide a similar general definition for the tonic of such a scale. Because of this, I expected that classifying a scale’s mode would be an easier problem than classifying its tonic. However, our simulations demonstrated that this expectation was false. Scale tonic identification is an easier problem that can be solved by a perceptron. Classifying scale modes is more complex, and requires the use of a multilayer perceptron. A second example of this sort of surprise was provided in our exploration of the ii-V-I progression problem. The expectation was that the pitch-class encoding of this problem would require a multilayer perceptron. It was astonishing, then, that this version of the problem could be learned by a perceptron that used integration devices as output units.
In general, it is interesting and perhaps somewhat surprising that small networks can solve all the various musical problems that I have considered, even when I use pitch-class encoding. The most complicated networks that were encountered (the more abstract encodings of the Coltrane changes) required between nine and eleven hidden units. The remainder of the multilayer perceptrons reported in the book required far fewer hidden units.
9.2.3 The Value of Value Units
One reason for the relative simplicity of the networks that I have reported is that many of them use value units. I opted for value units because they offer many advantages over more traditional architectures when network interpretation is involved (Berkeley et al., 1995; Dawson, 1998, 2004, 2005, 2013; Dawson & Boechler, 2007; Dawson et al., 2005; Dawson et al., 2000a; Dawson et al., 2000b; Nickerson, Bloomfield, Dawson, Charrier, & Sturdy, 2007). However, the simplicity of most of the networks suggests that the value unit architecture is particularly well suited for the synthetic psychology of music. Perhaps this is because the activation function of a value unit is tuned so that the unit only turns on to a very narrow range of net inputs (Dawson & Schopflocher, 1992). As a result, in most of my simulations value units learned to respond to a very small number of musical patterns. Clearly, this tuned sensitivity of the architecture facilitated network interpretation. However, it also permitted very simple networks to learn the problems that I defined. It appears that underlying almost all of the tonal tasks that I studied is a definite relationship between certain musical properties and a desired musical judgment.
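The tuned sensitivity of a value unit can be sketched directly. The code below is my own illustration of the Gaussian activation function described by Dawson and Schopflocher (1992); the parameter names are mine.

```python
import math

# A sketch of the value unit activation function after Dawson and
# Schopflocher (1992): a Gaussian, G(net) = exp(-pi * (net - mu)^2),
# that produces strong activity only for net inputs falling near mu.
def value_unit_activation(net, mu=0.0):
    return math.exp(-math.pi * (net - mu) ** 2)

# Activity is maximal at mu and falls off sharply on either side,
# so the unit responds to only a narrow range of net inputs.
peak = value_unit_activation(0.0)
off_peak = value_unit_activation(1.0)  # already near zero
```

A net input only one unit away from μ drives activity below 0.05, which is why a value unit ends up responding to a very small number of input patterns.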
This is not to say that the particular musical properties exploited by my networks are traditional or unsurprising. Most of the network interpretations revealed a very different logic underlying some aspect of Western tonality. These novel formalisms are summarized below.
9.2.4 The Prominent Tritone
There appears to be a bias against the tritone in Western music. A long history of studying the consonance of the various musical intervals has indicated that the tritone is one of the most dissonant. For many years, researchers have studied the perceptual properties of the various musical intervals (Bidelman & Krishnan, 2009; Guernsey, 1928; Helmholtz & Ellis, 1863/1954; Krumhansl, 1990a; Malmberg, 1918; McDermott & Hauser, 2004; McLachlan, Marco, Light, & Wilson, 2013; Plantinga & Trehub, 2014; Plomp & Levelt, 1965; Seashore, 1938/1967). Perhaps it is because of its dissonance that the tritone is one of the least frequently appearing intervals, both in Western music and in the music of other cultures (Vos & Troost, 1989). Indeed, the rarity of the tritone is one of the reasons that its presence may be important for key-finding (Browne, 1981; Butler, 1989).
The networks that I have explored do not appear to share this bias against the tritone. Starting with the analysis of the scale mode network, one of the surprises that emerged from interpreting musical networks is their strong preference for the tritone. Repeatedly I found that networks took advantage of the fact that two pitch-classes were a tritone apart to structure their responses to musical stimuli. Table 9-1 tabulates the many examples of tritone usage that we have encountered.
Table 9-1 Examples from previous chapters of tritone relationships identified in a variety of network interpretations.
Task | Regularity | Depiction
Detecting scale mode | Tritone balance in both hidden units | Figures 4-3 and 4-4
Detecting scale mode | Grouping of minor scales with identical balanced tritones in hidden unit space | Figure 4-2 and Table 4-1
Triad classification | Tritone balance in hidden unit weights | Figures 6-4 and 6-5
Classifying added note tetrachords | Tritone equivalence in hidden unit weights | Figure 6-24
Classifying extended tetrachords | Tritone equivalence in hidden unit weights | Figures 7-8 and 7-10
Classifying extended tetrachords | Tritone balance in hidden unit weights | Figures 7-11 and 7-12
ii-V-I progression problem | Tritone organization of weight space in MDS solution | Figures 8-9 and 8-10
I repeatedly encountered two general types of tritone exploitation. The first is tritone balance, in which two pitch-classes a tritone apart are assigned connection weights that are equal in magnitude but opposite in sign. As a result, when input units representing each of these pitch-classes are simultaneously active their signals cancel each other out, typically increasing processor activity when value units are part of a network’s architecture.
The second is tritone equivalence, in which two pitch-classes a tritone apart are assigned identical connection weights. As a result, in terms of network processing both of these pitch-classes are functionally identical. Tritone equivalence frequently appears in networks trained on harmonic tasks like chord classification.
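The two weight schemes can be sketched concretely. The weight values below are arbitrary illustrations of my own, using 12-element weight vectors indexed by pitch-class (C = 0 through B = 11).

```python
def tritone_partner(pc):
    """The pitch-class a tritone (six semitones) away."""
    return (pc + 6) % 12

# Tritone balance: weights of equal magnitude but opposite sign, so
# co-active tritone partners contribute a net input of zero.
balanced = [0.0] * 12
balanced[0], balanced[6] = 0.8, -0.8   # C and F# cancel each other

# Tritone equivalence: identical weights, so the two pitch-classes
# are functionally interchangeable to the receiving unit.
equivalent = [0.0] * 12
equivalent[0] = equivalent[6] = 0.8
```

Under balance, simultaneously activating C and F♯ yields a net input of zero, which for a value unit tuned to μ = 0 actually maximizes activity.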
Tritone balance and tritone equivalence are characteristics of connection weights between input units and other processors. It is not surprising that when these tritone regularities are seen, I also find tritone organization in analyses that are more abstract. For instance, plots of points in hidden unit spaces, or from various MDS analyses of weights or unit activities, organize themselves so that points related by a tritone are close together in the space.
All of these results lead naturally to a key question: Why do so many musical networks exploit the tritone? One possibility is that the tritone, which is a musical distance of six semitones, divides the octave exactly into two. Perhaps the networks discover that many musical tasks can be solved by identifying the same regularities in each half of the octave.
9.2.5 Strange Circles
The many examples of tritone equivalence provided in Table 9-1 illustrate another surprising property revealed in many network interpretations: the use of strange circles. A strange circle involves an equivalence class of pitch-classes that are related to each other by a specific musical interval, such as the six pitch-classes that define a circle of major seconds.
Network usage makes these circles “strange” because in a variety of circumstances different pitch-classes that belong to the same musical circle are assigned the identical connection weight. As a result, to the network these different pitch-classes are functionally identical. Table 9-1’s listing of tritone equivalences picks out the occasions in which the strange circles are based on the tritone. Table 9-2 provides the instances of strange circles based on other musical intervals that have been encountered in our network interpretations.
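These equivalence classes can be computed directly: a strange circle is an orbit of the 12 pitch-classes under repeated transposition by a fixed interval. A minimal sketch of my own, using integer pitch-class notation:

```python
# Compute the "strange circles" for a given interval (in semitones):
# each circle is the set of pitch-classes reached by repeatedly
# transposing a starting pitch-class by that interval, mod 12.
def interval_circles(interval):
    circles, seen = [], set()
    for start in range(12):
        if start in seen:
            continue
        circle, pc = [], start
        while pc not in seen:
            seen.add(pc)
            circle.append(pc)
            pc = (pc + interval) % 12
        circles.append(circle)
    return circles
```

Major seconds partition the 12 pitch-classes into two circles of six, minor thirds into three circles of four, major thirds into four circles of three, and tritones into six circles of two.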
One interesting property revealed by Table 9-2 is that the use of strange circles based on intervals other than the tritone only seems to emerge for tasks involving harmonic stimuli. While strange circles of tritones appear in other tasks, the additional circles only appear when networks learn to classify triads or tetrachords.
Table 9-2 Examples from previous chapters of the use of strange circles in a variety of network interpretations.
Task | Regularity | Depiction
Triad classification | Circles of minor thirds | Figures 6-3 and 6-27
Triad classification | Circles of major thirds | Figures 6-4 and 6-5
Triad classification | Circles of major seconds | Figure 6-6
Classifying extended tetrachords | Circles of major seconds | Figure 7-5
Classifying extended tetrachords | Circles of minor thirds | Figures 7-7 and 7-8
Classifying extended tetrachords | Circles of major thirds | Figure 7-9
9.2.6 Distributed Representations
One of the major contributions of connectionism to the study of cognition has been the proposal of alternative forms of mental representation. Perhaps the most important of these connectionist contributions has been the notion of coarse coding, or distributed representation (Hinton et al., 1986; Pollack, 1990; Thrun, 1995). Although distributed representations are technically very difficult to define (Van Gelder, 1991), intuitively they involve simultaneous activities in a number of different hidden units; these activities are combined to produce a correct response. Coarse codes are interesting because each active component on its own seems to have poor sensitivity to the properties relevant to making correct judgments. In a distributed representation, each component is an inaccurate detector, but the combination of these poor components leads to high accuracy.
Distributed representations have been repeatedly encountered when interpreting musical networks. In two notable instances networks appear to solve musical problems by seeking intersections between groups of possibilities picked out by various inaccurate hidden unit detectors.
One example of this type of processing occurred when a multilayer perceptron was trained to perform key-finding (Section 5.4.2). Plots of each hidden unit’s responses to the various keys revealed that each was a very inaccurate detector of musical key (Figure 5-5). However, if one sought the intersection of the sets of keys picked out by each hidden unit’s activity, then the correct musical key could be isolated.
A second example of coarse coding was revealed in the examination of extended tetrachord classification. One reason for using the value unit architecture in many of our simulations is that such units often produce bands of activity where each level of activity captures different subsets of input patterns (Berkeley et al., 1995; Dawson, 2004; Dawson & Boechler, 2007; Dawson & Piercey, 2001). This in turn can facilitate network interpretation. The hidden units in the extended tetrachord network demonstrated distinct banding (Section 7.3). Each of these bands picked out different subsets of extended chord types, again demonstrating the inaccuracy of detection by each individual hidden unit. However, if one determined the intersection of the different sets of chords picked out by each hidden unit’s band, then the correct type of tetrachord was the result.
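The intersection logic that both examples share can be sketched in a few lines. The candidate sets below are invented for illustration; they are not the actual bands reported in Chapters 5 or 7.

```python
# Coarse coding by intersection: each hidden unit is an inaccurate
# detector that picks out a *set* of candidate answers, but the
# intersection of those sets isolates the correct one.
hidden_unit_candidates = [
    {"C major", "G major", "F major"},   # keys consistent with unit 1's activity
    {"C major", "A minor", "G major"},   # keys consistent with unit 2's activity
    {"C major", "D minor", "A minor"},   # keys consistent with unit 3's activity
]

answer = set.intersection(*hidden_unit_candidates)
# Only "C major" survives all three inaccurate detectors.
```

No single unit identifies the key, yet the combination does, which is the signature of a distributed code.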
The two instances provided above are the most prototypical examples of coarse coding that were revealed in my simulation studies. However, a more liberal notion of distributed coding permits us to claim that several other examples of this type of representation were discovered. For instance, on several occasions I described input stimuli as points in a hidden unit space, where the activity of each hidden unit to a pattern provides the coordinates of its location. Output units generate correct responses to problems by carving the hidden unit space into decision regions. Importantly, a hidden unit space is a distributed representation because the location of any point in this space depends upon considering the activity in each hidden unit simultaneously.
From this perspective, hidden units are not required to create distributed representations; such representations are found in perceptrons as well. For example, the signals passing through every connection weight in a scale tonic perceptron must be considered in order to determine the tonic of an input scale. Furthermore, the set of weights in that perceptron combines the properties of major scales and harmonic minor scales into a single (distributed) representation. Similarly, signals sent through all the weights of an ii-V-I perceptron provide a distributed representation of conditional probabilities related to specific output pitch-classes.
9.2.7 Summary
The results reviewed in this section indicate that my musical networks have yielded a number of interesting and surprising regularities. Even though these networks learned tasks that can be defined using traditional music theory, they have discovered non-traditional means for mediating their input/output mappings.
Why might these results be of interest? The final sections of this chapter consider the implications of these results for two different domains: music and musical cognition.
9.3 Musical Implications
Section 9.2 provided a general overview of my simulation results. Simple artificial neural networks can easily be trained to perform musical tasks that are based on Western tonality. In addition, the internal structure of these networks can be interpreted; these interpretations reveal formal musical regularities. Many of these regularities provide interesting departures from traditional music theory. Ignoring musical cognition for the time being, what are the implications of such results for the study of music in general?
9.3.1 Levels of Investigation
Our consideration of these implications will be aided by recognizing that cognitive science investigates phenomena at different levels of analysis, each of which requires a special vocabulary to capture particular kinds of regularities (Dawson, 1998, 2013; Pylyshyn, 1984). Following the lead of computational vision pioneer David Marr (Marr, 1982), the most abstract level of analysis is the computational level. At this level, researchers investigate what kind of information processing problem is being solved by a system of interest. The computational level of analysis typically uses formal methods that provide proofs that answer this question.
The second level of investigation is the algorithmic level. At the algorithmic level, researchers are typically concerned with determining the particular information processes involved in solving an information-processing problem. That is, what algorithm or procedure is being used to solve an information-processing problem identified at the computational level of analysis? Experimental paradigms, like those developed by cognitive psychologists, typically provide the methods required to perform an investigation at the algorithmic level.
Marr’s third level of investigation is the implementational level. For Marr, this was the level where the methods of neuroscience explained how the information processes identified at the algorithmic level are brought into being by the brain. In modern cognitive science, it is useful to consider two separate questions related to implementation. The first belongs to the architectural level of investigation, at which one determines the most basic information processes that are wired into the brain (e.g., primitive symbols and primitive rules). Once the architecture has been determined, the second question—an implementational analysis à la Marr—can be addressed to explain how the architecture is built into the brain. As far as the relationship between our musical networks and music is concerned, the computational and algorithmic levels of analysis are highly relevant. Let us consider our network contributions in the context of these two levels.
9.3.2 Theory Informs Algorithms
In connectionist cognitive science, the computational level of analysis is concerned with defining the input/output mapping performed by an artificial neural network. At this most abstract level, a network is a device that computes a mathematical function that converts input information into output information. The computational level of analysis defines the function being computed.
From this perspective, music theory itself defines and provides the input/output functions that networks were trained to generate in the preceding chapters. Identifying a scale’s tonic or mode, or classifying triads or tetrachords into chord types, all involve functions whose formal structures are completely defined by music theory.
The tasks described in the preceding paragraph are formal but are not typically expressed mathematically. Fortunately, the formal apparatus of modern music theory permits mathematical definitions of these input/output mappings. Mathematical set theory was applied to music beginning in the 1960s in order to describe atonal music (Babbitt, 1960, 1961; Forte, 1973, 1985; Lewin, 2007; Straus, 1991). While aimed at atonal music, the properties that set theory formalizes can also be used to describe regularities in tonal music.
Indeed, it seems natural to consider that the function of many of my trained networks is to perform set theory operations. One of the basic operations in musical set theory is to express a musical pattern in normal order. A network that is presented a scale in pitch-class representation, and then delivers its tonic, can be thought of as a device that renders the stimulus into a set of elements in normal order, and then returns the first element in that set. A network presented a scale in the same format, but which delivers the scale’s mode, can be considered a device for assigning something akin to a Forte number to the input pattern.
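The normal-order operation itself can be given a short computational sketch. The code below is a simplified version of my own: it picks the most compact rotation of a pitch-class set, and omits the further tie-breaking rules that full treatments (e.g., Forte, 1973) apply when several rotations are equally compact.

```python
# Simplified normal order: among all rotations of the sorted pitch-class
# set, pick the one with the smallest span from first to last element
# (mod 12). Tie-breaking rules from full set theory are omitted.
def normal_order(pcs):
    pcs = sorted(set(pcs))
    rotations = [pcs[i:] + pcs[:i] for i in range(len(pcs))]
    return min(rotations, key=lambda r: (r[-1] - r[0]) % 12)

# A C major triad in any voicing reduces to [0, 4, 7]; its first
# element (0, i.e., C) is the root, just as a root-finding network
# is trained to report.
```

Seen this way, a tonic-identifying network computes something like `normal_order(...)[0]` for its input pattern.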
9.3.3 Algorithms Inform Theory
Section 9.3.2 suggests that music theory in general, and musical set theory in particular, provides an appropriate formalism for a computational account of my musical networks. At the very least, music theory defines the training sets that I created for our networks. While the computational level defines what input/output mapping is being computed, analysis at the algorithmic level—network interpretation—reveals how this mapping is mediated by a network’s internal components (Dawson, 1998, 2004, 2013). We have seen that the analysis of musical networks can reveal formal properties that are quite different from those used to define their training set. For example, one of the main results of the algorithmic analyses summarized in Section 9.2 was the discovery that networks often use strange circles to solve harmonic problems. In a strange circle, different pitch-classes that belong to a circle defined by a particular musical interval (e.g., circles of major thirds, major seconds, or tritones) are all treated as being the same pitch.
Musical set theory uses one strange circle—the circle of octaves—when it makes the assumption of octave equivalence. This assumption limits the basic elements of set theory to the 12 different pitch-classes. The strange circles discovered in the musical networks point to a radically different set of basic elements. For instance, a set theory based on circles of major seconds would consider only two different kinds of elements: pitch-classes that belong to one circle, and pitch-classes that belong to the other. Similarly, a set theory based on circles of major thirds would consider only four different kinds of elements, because each pitch-class can belong to only one of four different circles.
If one were to develop a musical formalism based on one or more of the strange circles, then it seems obvious that it would be quite different from musical set theory. However, it might be both interesting and viable. After all, the networks appear to use such a theory to classify types of chords. For another example, consider a second major finding reported in Section 9.2, the discovery of tritone balance in a variety of networks. Unlike tritone equivalence, tritone balance means that the signal generated by a unit representing one pitch-class is cancelled by the signal generated by the unit representing the pitch-class a tritone away.
Tritone balance has some interesting implications for musical set theory. After assuming octave equivalence, music set theorists then order the elements that define a musical stimulus in a particular way. For instance, Forte numbering assigns the pitch-class C the value 0, the pitch-class C♯ the value 1, and so on. This means that in music set theory pitch-classes are organized around a circle of minor seconds (Figure 6-9).
In the circle of minor seconds, pitch-classes that are a tritone apart are opposite one another across the diameter of the circle. Tritone balance occurs when there is a special relationship between these opposite pitch-classes. To make this relationship explicit in music set theory one might first adopt a different numbering system. For instance, if C is assigned the number x, then F♯—a tritone away from C—could be assigned the number –x. Additional operations on sets, involving sums of these numbers, would then have to be invented to take advantage of whatever tritone balance might offer.
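As a hedged illustration of what such a numbering might look like, the sketch below assigns signed values so that tritone-related pitch-classes cancel. The mapping and the `balance` helper are hypothetical, invented here for illustration; they are not part of standard musical set theory:

```python
# Hypothetical signed numbering: pitch-classes 0-5 (C through F) receive
# the values 1..6, and each pitch-class a tritone higher receives the
# value of equal magnitude but opposite sign.
signed = {pc: pc + 1 if pc < 6 else -(pc - 5) for pc in range(12)}
# C -> 1, C# -> 2, ..., F -> 6, F# -> -1, G -> -2, ..., B -> -6

def balance(pitch_class_set):
    """Sum the signed values; tritone-related pairs cancel to zero."""
    return sum(signed[pc] for pc in pitch_class_set)

balance({0, 6})               # C with F#: 1 + (-1) = 0
balance({0, 2, 4, 6, 8, 10})  # a whole-tone collection also sums to 0
```

Under such a scheme, a sum of zero flags a stimulus in which every pitch-class is balanced by its tritone partner, which is exactly the relationship the networks exploited.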
The previous examples have shown how certain properties discovered from network interpretations might inform musical set theory. Importantly, networks offer other information pertinent to the computational consideration of music. For example, there is a long history of generating maps that represent the similarity between notes or scales in terms of the distances between points (Krumhansl, 2005; Schoenberg, 1969; Tymoczko, 2012). My musical networks provide a variety of new properties for generating maps that have different arrangements than those mentioned above. For instance, instead of measuring scale similarity in terms of shared pitch-classes, one can measure scale similarity in terms of connection weights. Similarly, at many points in preceding chapters we considered hidden unit spaces. These spaces are alternative maps of musical stimuli in which the coordinates of each point in the map are provided by hidden unit activities.
The point of considering different sorts of network-derived maps is that in many instances they might arrange musical stimuli in a fashion that is quite different from that found in other musical maps. By exploring these differences—by considering why certain musical entities are close to one another and why others are not—it is possible to develop alternative musical theories.
Musical networks can also provide evidence related to other computational issues. For example, one general type of question that often arises in computational analyses concerns the complexity of one situation in comparison to another. Network training provides one approach to answering such questions. In our simulations, because of our interest in network interpretation, we sought to identify the simplest network capable of solving a musical problem. Comparing the structure of networks trained on different problems provides an indication of their relative complexity. For instance, we discovered that a value unit perceptron could solve the scale tonic problem but could not solve the scale mode problem. This suggests that identifying a scale’s tonic is a simpler information-processing problem than identifying its mode. Similarly, the various simulations reported in Chapter 7 indicated that the ii-V-I progression is simpler than the Coltrane changes. This is because an integration device perceptron is all that is required for the former, but a value unit perceptron or a multilayer network of value units is required for the latter, depending upon the choice of encoding.
9.3.4 Network Structure and Composition
The previous sections have pointed out that network interpretations can lead to alternative formal accounts of musical regularities. One interesting possibility raised by this discovery is that the novel formal properties discovered inside a network can be used to provide new methods for musical composition. An example of this possibility is described below.
Atonal music has no discernible musical key or tonal centre because all 12 pitch-classes from Western music occur equally often. Arnold Schoenberg invented a method, called the 12-tone technique or dodecaphony, for composing atonal music. In dodecaphony, one begins a new composition by arranging all 12 pitch-classes in some desired order; this arrangement is the tone row. The first note from the tone row is then used to begin the new piece. The duration of this note, and whether or not it is repeated, is under the composer’s control. However, once the use of this note is complete, dodecaphony takes control: the 12-tone method prevents the composer from using it again until all of the other 11 notes in the tone row have been used. Their use, naturally, follows the same procedure used for the first note: the composer decides upon duration and repetition, uses the note, and then moves on to the next note in the tone row. Let us now consider another approach to composing atonal music, one inspired by a feature that we have observed in several network interpretations.
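The note-selection rule of dodecaphony just described can be sketched in a few lines of Python (a hedged illustration; the function name and its parameters are my own, and the composer's control over duration is reduced here to a simple repetition count):

```python
import random

def dodecaphonic_line(tone_row, n_events, max_repeats=2, seed=None):
    """Walk the tone row in order; the current note may be repeated
    (the composer's choice, randomized here), but no note recurs
    until all 11 others in the row have been used."""
    rng = random.Random(seed)
    line = []
    while len(line) < n_events:
        for pc in tone_row:
            for _ in range(rng.randint(1, max_repeats)):
                line.append(pc)
                if len(line) == n_events:
                    return line
    return line
```

With `max_repeats=1` the line simply cycles through the row; larger values let a note linger before the method forces the composer onward.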
We have seen several examples of artificial neural networks whose hidden units employ connection weights that assign various subsets of pitch-classes to classes, such as the four different circles of major thirds (Figure 6-17) or the two different circles of major seconds (Figure 6-13). Furthermore, these circles are often strange in the sense that the hidden units treat each member of the circle as being the same pitch-class. That is, all of the pitch-classes that belong to one circle of major seconds may be assigned the same connection weight (e.g., to the connection from a pitch-class input unit to a hidden unit). This means that the hidden unit is “deaf” to any differences between members of this subset of pitch-classes. For a hidden unit that uses equivalence classes based on circles of major seconds, there are only two pitch-classes: some “name” x (the weight assigned to C, D, E, F♯, G♯, and A♯) and some other “name” y (the weight assigned to C♯, D♯, F, G, A, and B).
Why do networks use strange circle equivalence classes to represent musical structure? One reason is that networks discover that notes that belong to the same strange circle are not typically used together to solve musical problems, such as classifying a musical chord. Instead, the network discovers that combining notes from different strange circles is more successful. This use of equivalence classes—combining pitch-classes from different circles, but not from the same circle—suggests an alternative approach to composing atonal music.
Imagine a musical composition constructed from a set of different musical voices. Let each of these voices be derived from one strange circle. The notes sung by this voice are selected by randomly choosing from the set of pitch-classes that belong to the strange circle. For instance, if one voice was associated with a particular circle of major thirds, then one could write its notes by randomly choosing one note at a time from the set [C, E, G♯]. To make the voice more musically interesting, one could add a randomly selected rest to the mix by selecting from the set [C, E, G♯, R] where R indicates a rest (i.e., no note is to be sung).
If one associated different voices with different strange circles, and composed via random selection as described above, then one would be following the general principle discovered by the network: pitch-classes from different strange circles can occur together, but pitch-classes from the same strange circle cannot. Furthermore, one could use this method to compose atonal music by wisely choosing which strange circles to use to create different voices. For instance, imagine creating a piece that included four voices, each associated with a different circle of major thirds. This composition would be atonal, in Schoenberg’s sense, because the four circles combine to include all 12 possible pitch-classes. Randomly selecting pitches from each of these circles would produce a composition that did not have a tonal centre because each of the 12 pitch-classes would occur equally often when the composition was considered as a whole.
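This compositional procedure is easy to sketch. The Python fragment below (an illustration of the method described above; the helper names are my own) builds one voice per strange circle by random sampling:

```python
import random

REST = "R"  # a rest: no note is sung on this event

def compose_voice(circle, length, include_rests=True, seed=None):
    """One voice: each event is drawn at random from a single
    strange circle (plus, optionally, a rest)."""
    rng = random.Random(seed)
    choices = list(circle) + ([REST] if include_rests else [])
    return [rng.choice(choices) for _ in range(length)]

# Four voices, one per circle of major thirds; together the circles
# cover all 12 pitch-classes, so the piece has no tonal centre.
thirds = [["C", "E", "G#"], ["C#", "F", "A"],
          ["D", "F#", "A#"], ["D#", "G", "B"]]
voices = [compose_voice(circle, 16, seed=i)
          for i, circle in enumerate(thirds)]
```

Octave placement and note durations remain the composer's decisions, exactly as in the score described below.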
One example of this sort of composition can be found at the following website: http://cognitionandreality.blogspot.ca/2013/03/composing-atonal-music-using-strange.html. The website provides a musical score created by using this approach to composition. This score includes six staves, one for each voice. Each voice is generated by randomly selecting from one strange circle (and including rests in this sampling procedure). The top two staves, written in quarter notes, are each drawn from a different circle of major seconds. The bottom four staves, written in half notes, are each drawn from a different circle of major thirds. The score is created by applying two additional musical assumptions. First, while each circle generates a pitch-class name, the composer decided how high or low (in terms of octave) each note is positioned. Second, in order to ensure that all notes tend to occur equally often in the score, the two circles of major seconds are sampled twice as frequently as the other four strange circles.
At the bottom of this web page, one can find links that play some of the voices individually, some combinations of a small number of the voices, and all of the voices together. On listening to these samples, one discovers that individual voices, each drawn from a single strange circle, are musical but not musically interesting. Music that is more interesting emerges from combining the random outputs of different circles. Sets of strange circles other than those used to create the score discussed above could also be used for composing. What kinds of atonal pieces can be created when many different strange circles are available? To answer this question, I created a Java program that uses David Koelle’s music package jFugue (Koelle, 2008). This package lets the programmer define strings of musical notes, and then takes care of playing them. The program that I wrote lets the user choose a composition’s tempo and length with a mouse, and then place a checkmark beside every strange circle to be used in a piece. The user can decide whether to include rests, and set the duration and the octave (2 is lowest, 5 is highest) for each set of circles. A press of the “compose” button leads to a pause while the various voices are constructed, and then the piece is played through the computer’s speakers. One can easily explore the possibilities of strange circle composing by using this program and listening to the sounds that it creates. This program is also available as part of the same blog post mentioned above.
9.4 Implications for Musical Cognition
Music theory, and its formalizations, defines the input/output mappings that our artificial neural networks have learned. Thus, it provides the vocabulary for the computational level analysis of the networks. My network interpretations have revealed how these computational mappings are mediated, and thus provide the algorithmic level analysis. However, we saw in Section 9.3 that these algorithmic level results could inform the computational level as well. What are the implications of our simulations for the study of musical cognition?
In general, the experimental study of cognition focuses upon the algorithmic level. This is because experimental cognitive psychology attempts to discover the procedures used by human subjects to process information (Dawson, 1998, 2013). From this perspective, there should be an important relationship between results in musical cognition and our simulations.
9.4.1 Networks and Algorithms
Even the most committed forward engineer realizes that at some point their models must be related to human data. A synthetic cognitive science of music must eventually find empirical links between networks and human musical cognition. How are these links to be established? Fortunately, trained networks provide many different kinds of evidence that can be used to compare their musical representations and processes with those of human subjects.
To illustrate, let us consider what is called relative complexity evidence (Pylyshyn, 1984). Relative complexity evidence compares a system’s processing of one type of stimulus with its processing of another. For instance, when training musical networks, this could involve comparing the learning of different patterns over time. Are some types of patterns harder to learn than others?
In Chapter 6, I interpreted the structure of a multilayer perceptron trained to classify four different types of triads. To collect relative complexity evidence, I could save the state of the network (e.g., its structure, its responses to patterns, its errors) after every 250 epochs of training.
When this is done, some interesting properties are revealed. A typical network of this type performs well on augmented and diminished triads very early in training. It generates highly accurate responses to both of these triad types after only 250 epochs of training. In contrast, it has more difficulty learning both major and minor triads. About 1000 epochs of training are required to reduce the error generated to major triads to the same level of error generated to the augmented and diminished triads. Minor triads provide the greatest challenge; about 1750 epochs of training are required before this type of stimulus is learned.
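A minimal harness for collecting this sort of evidence might look like the following sketch. The `train_one_epoch` and `error_by_type` callables are hypothetical stand-ins for whatever network simulator is used, and the decaying error values in the toy example are invented purely to make the sketch runnable:

```python
def record_learning_curves(train_one_epoch, error_by_type,
                           total_epochs, interval=250):
    """Train, snapshotting the error per stimulus type every `interval`
    epochs; returns {epoch: {stimulus_type: error}} learning curves."""
    history = {}
    for epoch in range(1, total_epochs + 1):
        train_one_epoch()
        if epoch % interval == 0:
            history[epoch] = dict(error_by_type())
    return history

# Toy stand-in: error decays at a different rate for each triad type.
state = {"epoch": 0}
def train_one_epoch():
    state["epoch"] += 1
def error_by_type():
    t = state["epoch"]
    return {"augmented": 1 / t, "diminished": 1 / t,
            "major": 4 / t, "minor": 7 / t}

curves = record_learning_curves(train_one_epoch, error_by_type, 2000)
```

Plotting each stimulus type's column of `curves` over epochs yields exactly the relative complexity profile described above.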
Relative complexity evidence can be easily obtained to get a sense of the dynamics of learning a particular musical task. How might we compare this evidence to the behaviour of human subjects? One approach is to create a new experimental paradigm, one that is as straightforward as the probe tone method (Krumhansl, 1990a). In general, this new experimental paradigm involves teaching human subjects on the same musical task that was presented to the network, where this teaching is done in a manner similar to that used for network training. For example, consider the triad classification problem from Chapter 6. One can build a block of training patterns to present a human subject, where a single block includes each of the 48 triads. In a given block, the order of patterns is randomized. During training, a subject hears a triad, and then classifies the sound. For instance, they might assign the sound to one of four different categories (A, B, C, or D). Of course, each category is associated with a triad type, but the subject need not be provided with the names of these types in order to respond. Because the subject is being trained in a fashion analogous to the network, learning needs to be supervised. After the subject classifies the stimulus, they are told what the correct response was. Then the next stimulus in the block can be presented.
Note that in this paradigm a block of trials for a human subject is analogous to an epoch of training for a network. So, if a subject is run through a series of training blocks, then they are learning in a similar fashion to the network. We continue training until an acceptable degree of accuracy has been achieved, assuming that the feedback that a subject receives after each trial improves their performance. Once the subject has “converged” to a solution to the triad classification problem, their data can be analyzed in a fashion similar to the network’s. For instance, the subject’s average accuracy for each triad type can be measured for each block of training. As a result, relative complexity evidence for human subjects can be directly compared to the same kind of evidence collected for networks. One could argue that the similarity between these two sources of data reflects the degree to which the network is learning the problem in the same way as a human.
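The block structure of this paradigm can be sketched as follows. This is a hypothetical illustration: the `classify` and `correct_category` callables stand in for the subject's response and the experimenter's answer key, and the toy subject below simply answers "A" on every trial:

```python
import random

def run_block(triads, classify, correct_category, seed=None):
    """One training block: present every triad once in random order,
    collect a classification, give feedback, and return the accuracy
    achieved for each triad type."""
    rng = random.Random(seed)
    block = list(triads)
    rng.shuffle(block)
    hits, totals = {}, {}
    for stimulus in block:
        response = classify(stimulus)
        answer = correct_category(stimulus)  # feedback after the trial
        ttype = stimulus["type"]
        totals[ttype] = totals.get(ttype, 0) + 1
        hits[ttype] = hits.get(ttype, 0) + (response == answer)
    return {t: hits[t] / totals[t] for t in totals}

# The 48 triads: four types built on each of the 12 roots.
triads = [{"root": r, "type": t}
          for r in range(12)
          for t in ("major", "minor", "augmented", "diminished")]
key = {"major": "A", "minor": "B", "augmented": "C", "diminished": "D"}
accuracy = run_block(triads,
                     classify=lambda s: "A",          # toy subject
                     correct_category=lambda s: key[s["type"]],
                     seed=0)
```

Running a subject through repeated blocks and collecting each block's per-type accuracies produces the human analogue of the network's learning curves.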
Of course, in practice this paradigm would involve exploring a number of different design decisions. We saw earlier that one could use different encodings of the same training set to affect network learning. Clearly, a variety of network encodings would have to be explored and compared to human data. We also have a variety of design decisions to explore when the human data is collected. Is performance affected by the timbre of stimuli? Is it affected by the octave in which stimuli are presented? Is it affected by inversions of chords? In short, this general approach to studying musical cognition opens the door to a wide range of studies that involve exploring different tasks, different network settings, and different experimental stimuli.
9.4.2 Musical Representations
The discussion in Section 9.4.1 concerns how one might use artificial neural networks to inform the algorithmic analysis of musical cognition, and how to explore the relationship between how networks and humans learn the same musical task. However, in addition to specifying algorithms cognitive scientists also must specify the architecture of cognition (Dawson, 1998, 2013). That is, they must determine the basic representations and operations available for solving an information-processing problem (Pylyshyn, 1984). Interpreting musical networks can inform the architectural mission of the cognitive science of music.
As was noted in Chapter 1, one of the central assumptions of cognitivism is that humans actively process information. Musical cognition is thought to proceed by actively integrating musical stimuli with mental representations of music (Cook, 1999; Deliège & Sloboda, 1997; Deutsch, 1982, 1999, 2013; Francès, 1988; Howell et al., 1985; Krumhansl, 1990a; Lerdahl, 2001; Lerdahl & Jackendoff, 1983; Sloboda, 1985; Snyder, 2000; Temperley, 2001). Furthermore, the act of organizing the music that we hear can affect how we represent it; presumably, musical representations change as a function of our musical experience and training.
It is therefore not surprising that at the heart of musical cognition one finds proposals about the nature of musical representation. For instance, the evidence supporting the existence of the tonal hierarchy (Krumhansl, 1990a) suggests that musical harmony is represented hierarchically in a system that makes certain musical structures more stable, central, or important than others depending upon context (Bharucha & Krumhansl, 1983). This in turn suggests that musical representation may be analogous to how semantic concepts are represented in prototype theory (Rosch, 1975; Rosch & Mervis, 1975).
Other kinds of representation have been proposed. The tonal hierarchy for each key could be explicitly represented, as is required in models of key-finding (Krumhansl & Kessler, 1982). Harmonic structures could be represented spatially, where distances between represented entities reflect their similarity (Krumhansl, 2005). There is a long tradition of employing spatial manifolds as representational primitives for cognition (Cutting, 1986; Cutting & Proffitt, 1982; Kosslyn, 1980, 1994; Shepard, 1984, 1990). Some have proposed that musical cognition is mediated by a language-like generative grammar (Lerdahl, 2001; Lerdahl & Jackendoff, 1983), while others have proposed representations that capture music’s probabilistic structure (Huron, 2006; Temperley, 2007). In the context of such representational proposals, what is the role of the simulations that we explored in preceding chapters?
If a key goal of musical cognitivism is to identify potential representational formats, then this search should be as broad as possible. Artificial neural networks offer a medium for discovering new representational proposals. Many of our discoveries that were summarized earlier in this chapter—the prominent tritone, strange circles, and coarse codes—can be interpreted not just as contributions to music theory but also as contributions to music cognition. When we pull these regularities out of our networks, it is reasonable to ask whether they might also play a role in human cognition.
A network interpretation might simply point to musical information that a representation should make explicit because it affects musical information processing. For instance, we saw that the tritone plays an important role in many of our networks. This is consistent with some results in the psychology of music. It has been claimed that human listeners can quickly identify a composition’s musical key by detecting the presence of rare musical intervals like the tritone (Brown & Butler, 1981; Browne, 1981; Butler, 1989; Van Egmond & Butler, 1997), although this theory has not gone unchallenged (Krumhansl, 1990b). Perhaps more importantly, our network interpretations may also inform architectural proposals for musical cognition. When I discover a particular representational structure in my networks, such as the use of particular strange circles or a specific kind of coarse coding, it is natural to ask whether these structures are also part of the cognitive architecture for music. For instance, are strange circles literally part of the structure of musical representations? To answer such a question, one must design experiments that explore the relevance of the proposed representation. However, designing these studies comes after generating the question. An important source of such questions is network interpretation.
9.5 Future Directions
This book explored the viability of a synthetic cognitive science of music. Simple artificial neural networks learned basic musical tasks related to Western tonality. When a network completed its training, I interpreted its internal structure. The primary question explored in this book is whether this approach can inform the cognitive science of music. The results summarized in the current chapter, and detailed in the preceding chapters, should convince the reader that the approach introduced in this book holds a great deal of promise. Even though I adopted a very simple approach to training our networks, and even though I trained networks on tasks that are well understood in traditional music theory, I was able to uncover a number of novel and surprising results. My network interpretations revealed a number of representational insights that can inform both music theory (Section 9.3) and musical cognition (Section 9.4). Now that I have established the viability of this approach, it is possible to use it to venture into further, more complex, domains.
With respect to computational-level investigations, researchers can now proceed to explore a greater variety of musical information-processing problems. For instance, the Forte numbering system from musical set theory is used to assign an identifying number to a musical entity (Forte, 1973, 1985). This is useful because in many cases two musical entities that seem to be quite different may be assigned the same Forte number, which in turn implies that they are functionally equivalent. Musical set theory can therefore define more involved problems in which, for example, any two musical entities assigned the same Forte number must generate identical network outputs.
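As a concrete starting point, the functional equivalence of two pitch-class sets can be tested by comparing their prime forms, since sets sharing a Forte number share a prime form. The sketch below uses Rahn's packing convention, which agrees with Forte's on all but a handful of sets; the helper name is my own:

```python
def prime_form(pcs):
    """Rahn-style prime form of a pitch-class set: the lexicographically
    smallest zero-based rotation of the set or of its inversion."""
    pcs = sorted({p % 12 for p in pcs})
    best = None
    for form in (pcs, sorted((-p) % 12 for p in pcs)):
        for i in range(len(form)):
            rot = form[i:] + form[:i]
            norm = tuple((p - rot[0]) % 12 for p in rot)
            if best is None or norm < best:
                best = norm
    return best

# A C major triad and an A minor triad share a prime form, so a
# set-theoretic training task could require a network to produce
# identical outputs for both:
prime_form([0, 4, 7]) == prime_form([9, 0, 4])  # True
```

A training set for this more involved problem would pair each input chord with the output pattern assigned to its prime form rather than to its surface spelling.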
With respect to algorithmic-level investigations, experimental psychologists can now proceed to research that has the goal of relating evidence gathered from networks to analogous evidence collected from human subjects. A general approach to this kind of research was proposed in Section 9.4. Wider variations are also feasible. For instance, one could train neural networks on a task analogous to Krumhansl’s probe tone method. One could also train networks to rate the dissonance or consonance of musical stimuli, where network output is informed not by music theory but instead by existing experimental results.
With respect to the architectural level, or even the implementational level, researchers can now consider more complex sorts of encodings. All of the networks reported in this book have encoded stimuli in a fashion that maps directly onto music theory (e.g., pitch-class, pitch). Other physical or physiological encodings are possible. For example, what is the effect of representing musical inputs as collections of sine wave frequencies, or in a fashion that emulates the encoding of the basilar membrane?
Similarly, the design decisions that guided architectural selections for my simulations led me to focus on musical regularities that could be obtained from spatially represented stimuli. For the most part, I avoided the study of musical stimuli that were presented through time, or the study of temporal musical regularities such as rhythm. My success in the simulations reported in this book indicates that studying temporal aspects of music with (interpreted) networks is a crucial next step. Furthermore, the exploration of music is not limited to the architectures that are employed in this book; many other network architectures can be explored (Griffith & Todd, 1999; Todd & Loy, 1991). These include self-organizing networks that learn statistical properties of inputs without requiring an external teacher (Gjerdingen, 1990; Kohonen, 2001; Page, 1994), or recurrent networks that are explicitly designed to detect regularities in time (Elman, 1990; Franklin, 2004, 2006).
There is also a growing interest in deep learning networks (Bengio et al., 2013; Hinton, 2007; Hinton et al., 2006; Hinton & Salakhutdinov, 2006; Larochelle et al., 2012). These networks, which have many more layers of hidden units than we have been considering in this book, have solved many difficult pattern recognition tasks in natural language, image classification, and the processing of sound (Hinton, 2007; Hinton et al., 2006; Mohamed, Dahl, & Hinton, 2012; Sarikaya, Hinton, & Deoras, 2014). Some of these tasks involve processing music, including its temporal properties (Humphrey, Bello, & LeCun, 2013).
The power of deep learning as a technology is becoming well established. However, it is important to remember that the goal of a connectionist cognitive science of music is not to generate new technologies. Instead, it is to enhance our understanding of musical cognition, or of music theory, by providing insights into these domains. These insights require us to investigate how networks solve problems, and to use these interpretations of network processing to inform theory. Deep learning provides a powerful technology, but techniques for interpreting the structure of deep belief networks are in their infancy (Erhan et al., 2010). Until their internal structure can be properly understood, these powerful networks are unlikely to provide new directions to a cognitive science of music.
My hope is that the results reported in this book will serve as an impetus for continued exploration, pursuing investigations of additional musical properties using the architectures described here, or employing new kinds of artificial neural networks. However, it is important to remember that the success of a connectionist cognitive science of music depends on one fundamental research goal: interpreting the internal structure of a network after it learns. Network interpretations will be the source of new theoretical insights into musical cognition.