
8

Jazz Progression Networks

8.1 The ii-V-I Progression

All of the musical tasks that I have considered in previous chapters ignored the element of time. One reason for this, detailed in Chapter 2, was that the network architectures that I have explored are designed to recognize spatial, but not temporal, patterns. However, even these sorts of networks can be used to deal with some temporal aspects of music. In this chapter, I explore time by training networks on sequences of chords. After learning, if a stimulus chord is presented to the network, then it responds with the next chord in the sequence.

The chapter begins with an introduction to chord progressions, focusing upon one important progression in jazz, the ii-V-I. Various methods are discussed that can be used to encode the chords in this progression for network training. I discover that the type of encoding is an important determinant of how long it takes a network to learn this progression. A network is then interpreted to reveal how it encodes the probabilistic structure of this progression. The chapter then turns to a second progression, the Coltrane changes, which is an elaboration of the ii-V-I progression. Various network encodings are explored, and their impact on learning is determined. The interpretation of this network relates the Coltrane changes to the strange circles that have been discussed in the preceding two chapters.

8.1.1 Chord Progressions

The basic element of harmony is the musical interval, the simultaneous presence of two tones a specific musical distance apart. Chords generally involve presenting more than two tones simultaneously, and therefore involve the presence of several musical intervals. Just as the presence of a single tone cannot by itself establish a musical key, a triad in isolation cannot establish tonality (Schoenberg, 1969). In order for tonality to be established, a succession of triads—or, more generally, a succession of harmonies—must occur. The structure of this succession ensures that one chord naturally leads the listener to the next. Such a structured succession of chords is called a chord progression or, in jazz, “the changes.” Chord progressions are central to the structure of most jazz compositions (Broze & Shanahan, 2013; Sudnow, 1978). This chapter explores the properties of networks that learn particular progressions from jazz: when presented a chord, the network responds with the next chord in the progression.

8.1.2 Basic Changes

Let us begin with a succession of chords called the ii-V-I chord progression, which is likely the one most commonly encountered in jazz (Levine, 1989). In its typical form, this progression involves three different tetrachords, each defined in the same musical key; as a result, we can write the ii-V-I progression for each of the 12 different major keys in Western music. The three chords in any of these versions of the progression are constructed using particular notes in a major scale as their root; the scale used defines the key of the progression.

The first tetrachord in the ii-V-I progression is the minor seventh chord constructed using the second note of the progression’s major scale. This is the ii chord; its Roman numeral is written in lower case because it is minor, and indicates the position of this chord’s root in the major scale for the chord’s musical key. For instance, the second note in the C major scale is D, so the ii tetrachord for the key of C is Dm7, which includes the notes D, F, A, and C.

The second tetrachord in the ii-V-I progression is the dominant seventh tetrachord constructed using the fifth note of the progression’s major scale as its root. In the C major scale this note is G, so in the key of C the V chord in the progression is G7, which uses the notes G, B, D, and F.

The third tetrachord in the ii-V-I progression is the major seventh tetrachord constructed using the first note of the progression’s major scale as its root. In the C major scale this note is C, so in the key of C the I chord in the progression is Cmaj7, which contains the notes C, E, G, and B.

The procedure illustrated above for the key of C can be used to construct the ii-V-I progression in any other major key. Table 8-1 provides the three chords in this progression for each major key in Western music.

Table 8-1 The three tetrachords that define the ii-V-I progression for each major key.

Key    ii      V      I
A      Bm7     E7     Amaj7
A#     Cm7     F7     A#maj7
B      C#m7    F#7    Bmaj7
C      Dm7     G7     Cmaj7
C#     D#m7    G#7    C#maj7
D      Em7     A7     Dmaj7
D#     Fm7     A#7    D#maj7
E      F#m7    B7     Emaj7
F      Gm7     C7     Fmaj7
F#     G#m7    C#7    F#maj7
G      Am7     D7     Gmaj7
G#     A#m7    D#7    G#maj7

Note. Each row provides the three chords in the progression for one musical key. The name of the major key is provided in the first column.
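
The construction just described is mechanical enough to state compactly in code. The following minimal Python sketch (my own illustration, not code from the book; the function name and note spellings are assumptions) derives the three tetrachords of Table 8-1 for any major key from pitch-class arithmetic:

```python
# Pitch-classes ordered as in Table 8-1; sharps are used throughout, as in the book.
PITCH_CLASSES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

def ii_V_I(key: str) -> dict:
    """Return the ii, V, and I tetrachords (lists of pitch-class names) for a major key."""
    tonic = PITCH_CLASSES.index(key)
    pc = lambda semitones: PITCH_CLASSES[(tonic + semitones) % 12]
    return {
        "ii": [pc(2), pc(5), pc(9), pc(0)],   # minor seventh built on scale degree 2
        "V": [pc(7), pc(11), pc(2), pc(5)],   # dominant seventh built on scale degree 5
        "I": [pc(0), pc(4), pc(7), pc(11)],   # major seventh built on scale degree 1
    }

print(ii_V_I("C"))
# {'ii': ['D', 'F', 'A', 'C'], 'V': ['G', 'B', 'D', 'F'], 'I': ['C', 'E', 'G', 'B']}
```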

8.1.3 The ii-V-I Progression Problem

I am interested in training networks to generate the ii-V-I progression in any key. When presented with one chord, the network’s task is to generate a representation of the next chord in the progression. For example, consider the ii-V-I progression in the key of C, which involves the Dm7, G7, and Cmaj7 chords. I want to train a network so that when Dm7 is presented to its input units it responds with a representation of G7 in its output units. Similarly, when G7 is presented to its input units, the network generates Cmaj7 in its output units.

I want analogous behaviour from the network for the other 11 possible musical keys. Each key involves defining two input/output pairs, one involving the minor seventh and the dominant seventh chords, the other involving the dominant seventh and the major seventh chords. I never use a major seventh chord as an input pattern; when properly trained the network will never generate a minor seventh chord as a response. The entire training set consists of 24 different input/output pattern pairs.
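
The pairing scheme itself is easy to make concrete. Here is a small sketch (hypothetical names, with only three keys shown for brevity) of how the 24 input/output pairs are assembled from Table 8-1:

```python
# Each key contributes two pairs: ii -> V and V -> I. The I chord is never an input.
PROGRESSIONS = {  # a subset of Table 8-1
    "C": ("Dm7", "G7", "Cmaj7"),
    "G": ("Am7", "D7", "Gmaj7"),
    "F": ("Gm7", "C7", "Fmaj7"),
}

pairs = []
for key, (ii, V, I) in PROGRESSIONS.items():
    pairs.append((ii, V))  # stimulus: ii chord; response: V chord
    pairs.append((V, I))   # stimulus: V chord; response: I chord

print(len(pairs))  # 2 pairs per key; 24 once all 12 keys are included
```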

The input and output chords for the ii-V-I progression problem can be encoded in a number of different ways. Of particular interest in the current chapter is whether the choice of encoding affects the complexity of the network required to learn the progression. Before exploring the results of training networks on the ii-V-I progression problem, let us first discuss the importance of encoding, and how there are several different approaches to encoding tetrachords that are worthy of exploration.

8.2 The Importance of Encodings

8.2.1 Readiness-To-Hand

In Being and Time (Heidegger, 1927/1962), philosopher Martin Heidegger proposed that part of an agent’s engagement with the world involves using equipment. Equipment consists of entities experienced in terms of the potential actions or experiences that they make available. Heidegger also argued that a key property of equipment was readiness-to-hand. Readiness-to-hand means that equipment itself is imperceptible to us when being used; we experience the effects of equipment but not equipment itself. In other words, if we are aware of the existence of a tool, then the tool is poorly designed (Dourish, 2001; Norman, 1998, 2002, 2004). The invisibility of artifacts—the readiness-to-hand of equipment—provides evidence of good design.

8.2.2 Solutions by Design

Readiness-to-hand is not only relevant to the design of artifacts but is also important to theories of problem solving. In cognitive science, problem solving is typically described as searching a problem space (Newell & Simon, 1972). The amount of time required to search through a problem space to find a route to the problem’s solution reflects a problem’s difficulty. The longer the search, the harder the problem. Crucially, search complexity depends in part upon the manner in which states of knowledge about the problem are encoded. If a problem is encoded using one representational scheme, then its solution may require a long and difficult search. However, if the same problem is encoded in a different format, then its difficulty can be drastically reduced. With the proper encoding, a problem’s solution exhibits readiness-to-hand: the solution is immediately apparent, and the process of searching for the solution is so trivial that it becomes invisible (Simon, 1969).

One theme of the current chapter, then, is to explore artificial neural networks for jazz progressions from the perspective of efficient design. In particular, it is possible to use many different encodings of the same musical problem. Even though the musical problem remains constant, changing its encoding can make it much more difficult—or much easier—for a network to learn.

8.3 Four Encodings of the ii-V-I Problem

In designing a training set for teaching a network the ii-V-I progression, one must decide how to represent tetrachords both as stimuli and as responses. Ideally, the choice of representation would be “theory neutral” (Pylyshyn, 1984): regardless of our choice of representation, the results of training a network on the task would be the same. Not surprisingly, though, this ideal situation does not arise: different choices of how to represent tetrachords for the network lead to very different simulation results.

Let us first describe four plausible methods for representing tetrachords to networks that must learn the ii-V-I progression. Later in the chapter, we will present results that clearly show that these choices are not theory neutral.

8.3.1 Pitch-Class Encoding

Most of the networks that are described earlier in this book employ a pitch-class representation, which is the first kind of encoding to consider for the ii-V-I progression. This representation only requires 12 units. Each unit represents the presence or absence of one of the possible pitch-classes in Western music.

One major advantage of pitch-class representation is its simplicity: a very small number of input and output units are required to represent any of the different tetrachords that can occur in the progression. A pitch-class representation of the ii-V-I problem requires only 12 input units to represent an input tetrachord, and the same number of output units to represent the tetrachord response generated by the network.

In pitch-class encoding, as we have seen in earlier chapters, a tetrachord stimulus is represented by turning on the four input units that represent the chord’s component pitch-classes, and by turning all of the other eight input units off. For the ii-V-I problem, a network can use the same encoding to represent its tetrachord responses in the output units.
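
A minimal sketch of this encoding (my own illustration, not code from the book) shows how little machinery it requires:

```python
PITCH_CLASSES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

def encode_pitch_classes(notes: list[str]) -> list[int]:
    """Encode a tetrachord as a 12-unit binary pattern: 1 = pitch-class present."""
    pattern = [0] * 12
    for note in notes:
        pattern[PITCH_CLASSES.index(note)] = 1
    return pattern

print(encode_pitch_classes(["D", "F", "A", "C"]))  # Dm7
# [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0]  (units for A, C, D, and F are on)
```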

8.3.2 Pitch Encoding

One straightforward way to consider the various chords in a progression is to consider them as being in root position. In root position, all notes in a chord appear in their natural positions in the scale to which they belong. For instance, in root position the lowest note of a tetrachord is the chord’s root: the lowest note of Dm7 is D; the lowest note of G7 is G, and so on.

One consequence of having each tetrachord in root position is that there is a marked similarity in chord “shape,” which is the spacing between adjacent notes in the chord. Tetrachords of the same type (minor seventh, dominant seventh, or major seventh) have very similar shape: four notes that are evenly spaced, as they are stacked upon each other on the staff.

One can imagine that the input units used for pitch-class encoding are the keys of a small piano. Figure 8-1 illustrates the mapping between the input units and the piano keyboard. However, this mapping reveals a possible disadvantage of pitch-class representation: by adopting this encoding, we lose the similarity of shape between different chords of the same type. That is, because of the keyboard’s small size, different note spacings—different chord inversions—are required to fit tetrachords onto it.

Figure 8-1 The mapping between input units used for pitch-class encoding and a piano keyboard.

Figure 8-2 illustrates this issue using a keyboard to represent four different minor seventh chords, each belonging to the ii-V-I progression in a different key. To fit each of these chords onto the small keyboard, different chord shapes are required. For example, Figure 8-2 shows that Am7 can be fit on this keyboard in root position (the A is the lowest note, which is the leftmost note coloured grey in the illustration). In contrast, Cm7 must be fit using its third inversion (C is the second lowest note), Dm7 must be fit using its second inversion (D is the second highest note), and Gm7 must be fit using its first inversion (G is the highest note).

Figure 8-2 The keyboard layout of four different minor seventh tetrachords.

In order to create a representation that preserves tetrachord shape, I must abandon the assumption of octave equivalence, and adopt an encoding that explicitly indicates that two different notes (e.g., middle C and the C an octave higher) are distinct pitches even though they belong to the same pitch-class. I did this for the first triad classification network described in Chapter 6 (see Figure 6-2). Pitch encoding is an alternative to pitch-class encoding, and abandons the octave equivalence assumption. In pitch encoding, each input unit represents the presence or absence of a particular pitch, and not of a pitch-class, as is shown in Figure 8-3.

In our use of pitch encoding for the ii-V-I problem, the highest key of the progression is G♯, and the highest note is C♯6 (the highest note in the D♯7 tetrachord for this key). Similarly, the lowest key of the progression is A, so the lowest note used is A3 (the lowest note in the Amaj7 tetrachord for this key). As a result, our pitch encoding of chords uses 29 input units to represent all of the pitches from A3 to C♯6.

Figure 8-3 The mapping between input units used for pitch encoding and a piano keyboard.
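
The pitch encoding can be sketched in the same way; here MIDI note numbers (A3 = 57, C♯6 = 85) are an assumption of the sketch, not notation used in the book:

```python
LOW, HIGH = 57, 85  # MIDI numbers for A3 and C#6: 29 pitches in all

def encode_pitches(midi_notes: list[int]) -> list[int]:
    """Encode a tetrachord as a 29-unit binary pattern, one unit per pitch."""
    pattern = [0] * (HIGH - LOW + 1)
    for note in midi_notes:
        pattern[note - LOW] = 1
    return pattern

# Dm7 in root position starting at D4: D4, F4, A4, C5 (MIDI 62, 65, 69, 72)
print(encode_pitches([62, 65, 69, 72]))
```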

8.3.3 Pitch Encoding of Inversions

Pitch encoding can represent tetrachords in root position. However, as soon as octave equivalence is abandoned, other versions of the ii-V-I progression problem become possible. For instance, a pianist might prefer inversions of the chords that reduce the hand and finger movement required to move from one chord to the next. If one uses the second inversion of every dominant seventh chord in the progression, then a “lower action” version of the progression emerges. The second inversion of a dominant seventh chord is created by taking the two lowest notes of the chord’s root position and raising each an octave. Figure 8-4 provides a version of the ii-V-I progression in which each of the dominant seventh chords is inverted in this way.

Figure 8-4 The ii-V-I progression for each possible key.

How does inverting the middle chord of the ii-V-I progression enable lower action movement for a pianist? Figure 8-5 illustrates voice leading—that is, finger movements from one chord to the next—for the ii-V-I progression in the key of C to shed light on this issue. The top three keyboards in Figure 8-5 illustrate voice leading when the dominant seventh chord is in root position. The arrows indicate finger movements from chord to chord. Note that because the middle chord is in root position, substantial movement from chord to chord is required: each finger moves to a different key to play the next chord, and the hand must move up and then back down along the keyboard.

Figure 8-5 Voice leading for two versions of the ii-V-I progression.

The lower half of Figure 8-5 shows that if the middle chord is played in second inversion form, much less movement is required. The hand stays at the same position along the keyboard, and moving from one chord to the next only requires changing the position of two fingers. Two fingers press the same keys in successive chords for this version of the progression! In short, an alternative approach to encoding the ii-V-I progression problem is to use pitch encoding, but also to take advantage of its flexibility by presenting dominant seventh chords in their second inversion form. One consequence of this is that slightly fewer processing units are required; all of the tetrachords can be encoded using 24 input units, with the lowest unit representing A3 and the highest unit representing G♯5.

8.3.4 Lead Sheet Encoding

All of the encodings described above represent each pitch-class or each pitch in a tetrachord. As a result, all involve activating four processing units and turning all of the remaining processors off. However, there are many other ways to represent tetrachords, and some of these representations are not concerned with detailing each note in a chord. For instance, one popular approach to teaching adults how to play piano (Houston, 2004) attempts to simplify music reading by eliminating traditional musical notation of chords. Instead, chords are represented in lead sheet notation: they are written as a combination of the name of one note (the chord’s root) and additional symbols that indicate the type of chord. For instance, if one uses lead sheet notation for the ii-V-I progression in the key of C, the chords are written simply as “Dm7,” “G7,” and “Cmaj7.”

Figure 8-6 Lead sheet encoding of tetrachords.

A lead sheet encoding can be easily created for an artificial neural network that is to learn the ii-V-I progression. This encoding is very simple, and only requires 15 processors, as is illustrated in Figure 8-6. Three of these processors indicate a chord’s type, where only three chord types (m7, 7, maj7) are required in the ii-V-I progression problem. The remaining 12 processors represent the chord’s root pitch using pitch-class encoding. For example, Figure 8-6 demonstrates how one can represent the Dm7 tetrachord by only activating two units: the unit that represents that the chord is a minor seventh and the unit that indicates that the chord’s root is the pitch-class D.
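
A sketch of this encoding (again my own illustration) makes clear why so few units are active:

```python
CHORD_TYPES = ["m7", "7", "maj7"]  # the three chord types in the ii-V-I problem
PITCH_CLASSES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

def encode_lead_sheet(root: str, chord_type: str) -> list[int]:
    """Encode a chord as 3 chord-type units followed by 12 root pitch-class units."""
    pattern = [0] * 15
    pattern[CHORD_TYPES.index(chord_type)] = 1   # which of the three chord types
    pattern[3 + PITCH_CLASSES.index(root)] = 1   # the chord's root pitch-class
    return pattern

print(encode_lead_sheet("D", "m7"))  # Dm7: exactly two of the 15 units are on
```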

8.3.5 Implications

The sections above have discussed four different methods for encoding stimuli (and responses) for the ii-V-I progression problem. With these possible encodings of the ii-V-I progression described, we can now investigate the effect of problem encoding on network learning. Does problem representation affect network complexity? Does problem encoding alter the amount of training required for a network to solve the ii-V-I progression problem?

8.4 Complexity, Encoding, and Training Time

Figure 8-7 A perceptron trained on the ii-V-I progression task.

8.4.1 Task

All of the networks described in this section learn the ii-V-I progression problem using one of the encodings discussed in Section 8.3. The question of interest concerns the effect of the various encodings: Does one encoding require a more complex network, or a greater amount of training, than another?

8.4.2 Network Architecture

In order to explore the effect of different encodings of the ii-V-I problem, pilot studies were used to determine the simplest network capable of solving the problem under each encoding. Many of the musical networks reported in earlier chapters used value units, in part because value unit networks are generally easier to interpret. However, the pilot studies indicated that networks of value units are challenged by the pitch-class encoding of the ii-V-I problem: a value unit perceptron could not learn this version of the problem, and a multilayer perceptron with seven hidden value units was required. The three other encodings were learned by a value unit perceptron. More importantly, all four encodings of the ii-V-I problem can be learned by perceptrons whose output units are integration devices that use the sigmoid-shaped logistic activation function. For the remainder of Section 8.4 we will consider how perceptrons that use integration devices as output units fare with the different encodings of the ii-V-I problem. Figure 8-7 illustrates one such network.

8.4.3 Training

All of the networks described below are trained with a gradient descent rule using the Rosenblatt program (Dawson, 2005). Each network is a perceptron whose output units use the logistic activation function. The only difference between a perceptron trained on one encoding of the ii-V-I progression problem and a perceptron trained on another encoding of this problem is the number of input and output units in the network. Perceptrons trained on a pitch-class encoding required 12 input and 12 output units. Perceptrons trained on a pitch encoding of non-inverted chords required 29 input and 29 output units. Perceptrons trained on a pitch encoding of inverted chords required 24 input and 24 output units. Perceptrons trained on a lead sheet encoding of the problem required 15 input and 15 output units.

When I train any perceptron on the ii-V-I progression problem, the learning rate is 0.50, and connection weights are randomly initialized to values in the range from −0.1 to 0.1. All output unit biases (θ) are set to zero, and do not change during learning. A network learns until it converges on a solution to the problem, where (as in previous chapters) convergence is defined as generating a hit for every output unit on every training pattern. As will be seen below, an integration device perceptron can learn any encoding of the ii-V-I problem very quickly.
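
The Rosenblatt software is Dawson’s own program; as an independent sketch of the same training regime (NumPy assumed, with the learning rate, weight initialization, fixed zero biases, and hit criterion as stated above), a logistic-output perceptron can be trained as follows:

```python
import numpy as np

def logistic(net):
    return 1.0 / (1.0 + np.exp(-net))

def train_perceptron(X, Y, lr=0.50, seed=0):
    """X: patterns x input units; Y: patterns x output units (both 0/1 arrays)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-0.1, 0.1, size=(X.shape[1], Y.shape[1]))  # biases fixed at 0
    epochs = 0
    while True:
        epochs += 1
        for x, y in zip(X, Y):
            a = logistic(x @ W)
            # gradient descent on squared error for logistic output units
            W += lr * np.outer(x, (y - a) * a * (1.0 - a))
        out = logistic(X @ W)
        hits = np.where(Y == 1, out >= 0.90, out <= 0.10)  # convergence criterion
        if hits.all():
            return W, epochs
```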

8.4.4 Effect of Encoding

An integration device perceptron can learn each of the four encodings of the ii-V-I progression problem. As a result, none of the encodings makes this problem so complex that a more complicated network is required to solve it. However, there are definite effects of encoding on the amount of training that is required before a network converges on a solution. To explore these effects, one can train 20 different “subjects”—different networks—on each encoding of the ii-V-I progression problem. In this small experiment, the independent variable is the encoding of the problem and the dependent variable is the number of epochs of training required before the network converges. Table 8-2 provides the average number of epochs required (with standard deviations) for each of the four encoding conditions in this experiment.

Table 8-2 The mean number of epochs required for a network to converge (with standard deviations) for perceptrons trained using four different encodings of the ii-V-I progression problem.

Type of input encoding    Mean      SD
Pitch-class               726.75    1.02
Pitch (no inversion)      929.55    2.14
Pitch (inversion)         256.20    2.02
Lead sheet                118.05    0.22
Welch two-sample t-tests reveal that the difference between any pair of means in Table 8-2 is significant at p < 0.001. The smallest value of t is for the comparison between the means for the pitch encoding of inverted chords and the lead sheet encoding (t = 304.63, df = 19.47). The largest value of t is for the comparison between the pitch encoding of non-inverted chords and the lead sheet encoding (t = 1687.28, df = 19.42). In short, choice of encoding has a significant effect on the amount of training required for networks to converge. Pitch encoding of non-inverted chords results in the slowest learning, suggesting that it provides the most complicated representation of the problem. Changing this encoding to pitch-class encoding significantly speeds up learning, and replacing pitch-class encoding with pitch encoding of inverted chords produces another significant speedup. Finally, a further significant reduction in training occurs when lead sheet encoding is used. Networks learn the lead sheet encoding of the problem with slightly less than one eighth of the training required when pitch encoding of non-inverted chords is used.

8.5 Interpreting a Pitch-Class Perceptron

8.5.1 Integration Device Activity

Many of the networks that we have considered in earlier chapters used value units, which employ the Gaussian activation function. The simulations described in Section 8.4 revealed that perceptrons whose output units are integration devices are able to solve the ii-V-I progression problem. An integration device converts net input into activity using the sigmoid-shaped logistic function. Let us take a moment to consider the general properties of an integration device, and then use these properties to help interpret the internal structure of one of the perceptrons discussed in Section 8.4.

When the goal of a simulation is to interpret the internal structure of a network, value units have certain advantages. The Gaussian activation function responds to a particular subset of input properties. We have seen in earlier chapters that one can identify these properties by determining what input signals cancel out a value unit’s µ, causing it to turn on.

In contrast, integration devices do not detect specific features that cause them to activate. Instead, they serve as devices that weigh evidence. Every signal coming into an integration device is either evidence in favour of turning on or evidence in favour of turning off. The activity of an integration device reflects the net effect of all of the accumulated evidence. It does not turn on when specific features are present. Instead, it turns on when enough positive evidence has accumulated.

There are two different perspectives for considering the meaning of an integration device’s activity. The first is the digital perspective. We ordinarily train output integration devices either to turn on or to turn off. This requires the net input to an integration device to be either sufficiently high or sufficiently low. When the bias (θ) of an integration device is zero, as is the case for all of the Section 8.4 perceptrons, its net input must be 2.20 or higher for it to turn on (i.e., to generate activity of 0.90 or higher). Similarly, if θ is zero, turning an integration device off (i.e., generating activity of 0.10 or less) requires a net input of −2.20 or lower.
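
The 2.20 figure follows from inverting the logistic equation. With θ = 0 and activity a = 1/(1 + e^−net):

```latex
a = \frac{1}{1 + e^{-\mathrm{net}}} \geq 0.90
\iff e^{-\mathrm{net}} \leq \tfrac{1}{9}
\iff \mathrm{net} \geq \ln 9 \approx 2.20
```

The off case is symmetric, which gives the −2.20 bound.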

While training typically leads us to consider integration devices as being digital (i.e., as turning either on or off), we are not limited to the digital perspective. The continuous value of an integration device’s activity ranges between zero and one, and is highly informative. In particular, an integration device’s activity can be interpreted as a conditional probability (Dawson & Dupuis, 2012; Dawson et al., 2009; Dawson & Gupta, 2017; McClelland, 1998). Thus a second, analogue, perspective on interpreting integration device activity involves viewing activity as representing probability.

For instance, when an output integration device generates activity of 0.70, this means that there is a 70% chance that it will be “rewarded” given the current set of cues that have been presented to the input units (Dawson et al., 2009; Dawson & Gupta, 2017). We will see below that interpreting integration device networks benefits from considering such networks from the digital viewpoint as well as from the probabilistic perspective.

8.5.2 Network Interpretation

Let us consider the perceptron that solves the ii-V-I progression problem when pitch-class encoding of inputs and outputs is employed (Figure 8-7). This network’s knowledge of the ii-V-I progression is stored in its connection weights. Figure 8-8 illustrates the connection weights from the 12 input units to a single output unit, the one that represents the pitch-class A. This pattern of connection weights is present for each output unit. The only difference between output units is that the bars are systematically assigned different pitch-class labels. For instance, the connection weights for the A♯ output unit can be plotted using exactly the same graph as Figure 8-8, but with different labels. For the A♯ output unit, the leftmost bar is associated with G♯ (instead of A), and the remaining bars are labelled A, A♯, B, and so on up to G. This implies that if we can explain the Figure 8-8 weights, then a functionally equivalent explanation applies to each of the other 11 output units.

Figure 8-8 The connection weights from the 12 input units to the output unit representing the pitch-class A.

In previous chapters graphs of the connection weights that fed into value units revealed particular musical properties. The connection weights illustrated in Figure 8-8 feed into an integration device but do not reveal any obvious musical pattern. How then are these weights systematically used to turn the A output unit on when needed in the ii-V-I progression problem?

Let us first consider the A output unit from the digital perspective. What properties cause this unit to turn on? The ii-V-I progression problem consists of 24 different input tetrachords. Of these 24 patterns, eight cause the A unit to turn on, while the remaining 16 cause it to turn off.

Table 8-3 The eight patterns in the ii-V-I training set that cause the A output unit to activate when signals are sent through the weights illustrated in Figure 8-8.

Chord    Component pitch-classes    Net input
Am7      A, C, E, G                 15.09
C7       A#, C, E, G                9.97
A7       A, C#, E, G                9.05
Em7      B, D, E, G                 6.94
Cm7      A#, C, D#, G               2.82
F#m7     A, C#, E, F#               2.31
E7       B, D, E, G#                2.20
F7       A, C, D#, F                2.20

Note. Each row provides information about a particular chord. The first column names the chord, the next column provides the pitch-classes that make up the chord, and the final column provides the net input sent to the output unit when the chord is presented to the network.

Table 8-3 provides the features of the eight input patterns that turn the A output unit on. For each pattern, it provides the name of the input chord, the chord’s four component pitch-classes, and the net input for the A output unit that is associated with each of these chords. The net input is simply the sum of the four weights associated with a chord’s four input pitch-classes.

Not surprisingly, the net input column of Table 8-3 consists of values that are greater than or equal to 2.20, the net input required for an integration device to produce activity of at least 0.90 when θ = 0. In other words, each of these eight input chords is associated with four connection weights whose sum is sufficient to activate the A output unit.

An inspection of the component pitch-classes in Table 8-3 provides an indication of why each of its rows is associated with high net inputs. For instance, six of these eight input patterns include the pitch-class E, which has the highest connection weight value by far in Figure 8-8. The two chords that do not include E (Cm7 and F7) include some combination of the pitch-classes A, D♯, and G, which are the other three connection weights with positive values.

In short, one account of the connection weights in Figure 8-8 is essentially combinatorial. The pitch-class input units are assigned these weights because 1) the included weights produce high enough net input (2.20 or greater) for the eight chords that turn the A output unit on, and 2) the other weights produce low enough net input (−2.20 or lower) for the 16 chords that turn this output unit off.

Of the 12 connection weights depicted in Figure 8-8, eight are negative. Signals sent through a negative weight decrease the A unit’s activity. Does this mean that some of the pitch-classes associated with negative weights are absent from Table 8-3? An inspection of the table indicates that this is not so. All 12 input pitch-classes appear at least once, although some (in particular A, C, E, and G) occur more frequently than others. This suggests that the individual weights in Figure 8-8 might be better understood from a probabilistic perspective. Perhaps strongly positive weights are associated not with pitch-classes that definitely turn the output unit on, but with pitch-classes that are probably on when the output unit turns on.

To explore this possibility, let us consider the probability structure of the ii-V-I progression problem in the context of the A output unit. There are 24 different input patterns in this problem; eight of them cause the A output unit to turn on and 16 cause the A output unit to turn off.

Table 8-4 provides the number of times that each input pitch-class belongs to a pattern that turns the A output unit either on or off. For instance, the first row of Table 8-4 indicates that four of the input patterns that cause the A output unit to turn on include the input pitch-class A. Similarly, the first row indicates that four of the input patterns that cause the A output unit to turn off also include the input pitch-class A.

Table 8-4 also includes a “Conditional Probability” column. This column indicates the probability that a particular input pitch-class unit is on given that the input chord turns the A output unit on. Thus, in the first row of the table, the conditional probability is P(Input A = 1 | Output A = 1). This value is equal to 0.5 for the input pitch-class A because four of the eight patterns that turn the A output unit on include this pitch-class. Similarly, the conditional probability for input pitch-class A♯ is equal to 0.25, because only two of the eight patterns that turn the A output unit on include A♯.

Armed with a contingency table like Table 8-4, one would typically perform additional computations using Bayes’ theorem to determine the reverse conditional probability, that the A output unit is on when a particular input pitch-class is used (i.e., P(Output A = 1 | Input A = 1)). However, for this particular contingency table that probability is identical to the one reported in the “Conditional Probability” column of the table.

With this knowledge of the probabilities of the ii-V-I problem, we can now explore network structure from a probabilistic perspective. Table 8-4 includes the weight of the connection between each input pitch-class unit and the A output unit. An inspection of these weights indicates that they seem related to the conditional probabilities in the table. That is, negative weights tend to be associated with lower probabilities, while positive weights tend to be associated with higher probabilities. The correlation between Table 8-4’s “Conditional Probability” column and its column of connection weights is very high (r = 0.820).

As noted earlier, the “Conditional Probability” column in Table 8-4 can be interpreted as providing the probability that the A output unit is on given that a particular pitch-class input unit is turned on. The logistic activation function can also be used to estimate this probability. This is accomplished by sending a signal from only one input unit at a time into the output unit and examining the output unit’s response. (This is equivalent to computing the logistic function with the input unit’s weight as input, assuming that θ = 0). The last column in Table 8-4 provides these probability estimates, which are even more strongly correlated with the “Conditional Probability” column than are the connection weights (r = 0.898). Clearly, the connection weights from Figure 8-8 encode the probability structure of the ii-V-I progression problem. In general, the more likely it is for an input unit to be involved in turning on an output unit, the larger will be the connection weight between the two.

However, while this story accounts for most of the network’s structure, it is incomplete. This is why the correlations reported above are not perfect. Connection weights also have values that permit the network to deal with special cases.

Table 8-4 The probability structure of the ii-V-I progression problem in the context of the A output unit whose connection weights were presented in Figure 8-8.

Input          Output A:   Output A:   Conditional    Weight    Logistic
pitch-class    On          Off         probability              of weight
A              4           4           0.5            2.81      0.94
A#             2           6           0.25           −2.31     0.09
B              2           6           0.25           −3.86     0.02
C              4           4           0.5            −0.15     0.46
C#             2           6           0.25           −6.19     0.00
D              2           6           0.25           −1.64     0.16
D#             2           6           0.25           2.81      0.94
E              6           2           0.75           9.97      1.00
F              1           7           0.125          −3.27     0.04
F#             1           7           0.125          −4.28     0.01
G              5           3           0.625          2.47      0.92
G#             1           7           0.125          −2.27     0.09

Note. Each row provides the structure related to a single input unit. The first column names the input unit. The second column indicates the number of patterns for which the input unit is on and the output unit for A is on, while the third column indicates the number of patterns for which the input unit is on and the output unit for A is off. The fourth column converts the preceding two columns into a conditional probability. The fifth column provides the weight between the input unit and the A output unit, while the sixth column converts that weight into activity using the logistic activation function.
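
One can check the table’s last two columns directly. The following sketch (NumPy assumed; values transcribed from Table 8-4) recomputes the logistic of each weight and the two correlations. Because the tabled weights are rounded, the correlations come out close to, though not exactly at, the values reported in the text:

```python
import numpy as np

# Columns of Table 8-4, in pitch-class order from A to G#.
cond_prob = np.array([0.5, 0.25, 0.25, 0.5, 0.25, 0.25,
                      0.25, 0.75, 0.125, 0.125, 0.625, 0.125])
weights = np.array([2.81, -2.31, -3.86, -0.15, -6.19, -1.64,
                    2.81, 9.97, -3.27, -4.28, 2.47, -2.27])

logistic_of_weight = 1.0 / (1.0 + np.exp(-weights))

print(np.round(logistic_of_weight, 2))        # reproduces the final column
print(np.corrcoef(cond_prob, weights)[0, 1])  # approximately 0.82, as reported
print(np.corrcoef(cond_prob, logistic_of_weight)[0, 1])
# the text reports 0.898 from unrounded values; the rounded table gives a lower figure
```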

For instance, consider the input unit that represents D♯. This input unit has a healthy positive connection to the output unit for A, even though D♯ has a low probability of turning A on (see Table 8-4). Given this low probability, why is this connection weight so strong?

The reason is the role of D♯ in the context of the three other pitch-classes that are also present to form a tetrachord. D♯ is present in six different chords that do not turn the A output unit on. For these six chords, D♯’s healthy positive weight is not an issue. This is because the other three pitch-classes in each chord have strong negative weights (producing, on average, a net input of −7.82). This is more than enough to cancel out the positive signal of D♯ and turn the A output off. For the two chords in which D♯ is present and the A output turns on, the other three pitch-classes are not positive enough (average net input = −0.30). The positive weight of D♯ is required to turn the A output unit on in these cases. D♯ has been assigned its strong weight to deal with these two cases.

In other words, the perceptron has learned weights that reflect the overall probability structure of the ii-V-I problem (a structure that simply considers the relationship between each input unit and an output unit without considering other input units). However, it then adjusts these weights so that the network generates the correct response in particular contexts (i.e., particular combinations of input signals) that would lead to incorrect responses if probability structure were the only consideration.

8.5.3 The Tonal Hierarchy

The tonal hierarchy is a key finding from Carol Krumhansl’s research on music cognition (Krumhansl, 1990a), and played an important role in some of the key-finding perceptrons discussed in Chapter 5. The tonal hierarchy reflects the differing importance of the various pitch-classes in the context of a particular musical key. For example, in the key of A major the tonic (the pitch-class A) receives the highest rating. The next highest ratings are given to the pitch-classes at the third and fifth positions of the A major scale (C♯ and E). Lower ratings are given to the remaining four pitch-classes of the scale (for A major these are B, D, F♯, and G♯).

The weights illustrated in Figure 8-8 also reflect the differing importance of various pitch-classes in a different context: in this figure, the context is the likelihood of activating the A output unit. This suggests that Figure 8-8 could be interpreted in a fashion similar to Krumhansl’s (1990a) tonal hierarchy. What is the relationship between the tonal hierarchy and the weights illustrated in Figure 8-8? To answer this question, the correlation between the Figure 8-8 weights and the tonal hierarchy for each major key was computed. The results are presented in Table 8-5.

Table 8-5 reveals a definite relationship between Krumhansl’s (1990a) tonal hierarchy and the Figure 8-8 weights. In particular, there is a very high correlation between the weights and the tonal hierarchy for the key of E major, which is a perfect fifth away from A. A smaller, but still healthy, correlation exists between the weights and the tonal hierarchy for the key of A major. Recall that the weights illustrated in Figure 8-8 are found for all of the output units in the Figure 8-7 perceptron, but are associated with different input units. As a result, the pattern of correlations reported in Table 8-5 is found for the connection weights that feed into the other 11 output units. The general finding for each output unit is a very high correlation with the tonal hierarchy a perfect fifth away from the output unit’s pitch-class, and a high correlation with the tonal hierarchy that matches the output unit’s pitch-class.

Table 8-5 The correlations between each of Krumhansl’s tonal hierarchies for major keys and the connection weights illustrated in Figure 8-8.

Major key of       Correlation between tonal
tonal hierarchy    hierarchy and Output A weights
A                  0.41
A#                 −0.32
B                  0.29
C                  −0.09
C#                 −0.32
D                  0.06
D#                 −0.10
E                  0.73
F                  −0.46
F#                 −0.29
G                  0.16
G#                 −0.06
On the one hand, the relationship discovered between connection weights and the tonal hierarchy is unexpected: the perceptron learned a problem quite different from the probe tone task used to reveal the tonal hierarchy. It is surprising, though satisfying, to see strong similarities to tonal hierarchies emerge from this network’s internal structure.

On the other hand, the particular relationships revealed in Table 8-5 make perfect sense in the context of the probabilities provided in Table 8-4. For instance, the pitch-class that is most likely to be involved in turning the A output unit on is E. The probability relationships of Table 8-4 emerge quite naturally from the tonal hierarchy, given that the chords in the ii-V-I progression problem are all defined in particular major keys. Furthermore, within each major key all three chords (the minor seventh, the dominant seventh, and the major seventh) include pitch-classes that are a perfect fifth apart. Nevertheless, while the connection weights in Figure 8-8 strongly relate to the tonal hierarchy for E major, they are not perfectly correlated. How might differences between the two be reflected in musical structure?

Tonal hierarchies have been used to explore spatial relationships between different musical keys. One computes the similarity between two different keys by calculating the correlation between their respective tonal hierarchies. One then uses multidimensional scaling (MDS) to produce a map in which similar keys are closer to one another than are dissimilar keys (Krumhansl, 1990a; Krumhansl & Kessler, 1982). Krumhansl and Kessler found that this analysis arranges the major keys according to the circle of perfect fifths.

A similar analysis can be performed on the connection weights of the ii-V-I perceptron by comparing the similarities of the connection weights that feed into different output units. The correlations between each possible pair of sets of 12 connection weights are computed, and MDS is performed on this similarity data. This analysis arranges output units in a map; output units that have similar connection weight structures will be located close to one another. Figure 8-9 presents the two-dimensional solution when MDS is performed on connection weight similarities. This solution accounts for 28.92% of the variance in the original distances, which is a statistically significant fit (F = 26.034, df = 1, 64, p < 0.001). Unlike the tonal hierarchy analyses (Krumhansl & Kessler, 1982), this MDS solution does not arrange output units according to the circle of perfect fifths. Instead, two other organizational principles emerge. First, output units that represent pitch-classes a tritone apart (e.g., B and F, or A♯ and E) fall at nearly the same location in the map. Second, these pairs of tritones are near other pairs that are a semitone away. For instance, the nearest neighbours to the F-B pair are the F♯-C pair and the A♯-E pair, each of which contains a pitch-class that is either a semitone higher or lower than either F or B.

The two-dimensional MDS solution in Figure 8-9 reveals some intriguing regularities. However, an analysis that takes out more than two dimensions provides a better fit to the data. Figure 8-10 illustrates the first three dimensions of a five-dimensional solution for the output unit similarities. This solution accounts for 87.90% of the variance in the original distance data, which is a statistically significant fit (F = 464.97, df = 1, 64, p < 0.001).

Figure 8-10 indicates that the third dimension of this solution pulls the tritone pairs vertically apart from one another in the space. However, it is still clear that in this higher-dimensional solution pitch-classes a tritone apart are still located near one another in the space. The arrangement of points in Figure 8-10 corresponds quite nicely to the solution plotted in Figure 8-9. In fact, Figure 8-9 depicts what would be seen if one looked down on Figure 8-10 from above.

Figure 8-9 The two-dimensional MDS solution from the analysis of the similarities between output unit weights.

Figure 8-10 The first three dimensions of a five-dimensional MDS solution for the analysis of output unit weight similarities.
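
The MDS analysis itself is straightforward to replicate in outline. Here is a sketch assuming scikit-learn’s MDS (the book does not name its software, and W below is a random stand-in for the trained weight matrix):

```python
import numpy as np
from sklearn.manifold import MDS

# Row i of W would hold the 12 connection weights feeding output unit i.
# A random stand-in is used here; the real analysis uses the trained weights.
W = np.random.default_rng(0).normal(size=(12, 12))

similarity = np.corrcoef(W)        # correlations between output units' weight vectors
dissimilarity = 1.0 - similarity   # convert similarities to distances
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissimilarity)  # one 2-D point per output unit (cf. Figure 8-9)
print(coords.shape)  # (12, 2)
```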

Why might tritone structure emerge from the ii-V-I progression problem? One answer to this question may be that the ii-V-I progression requires dominant seventh chords; these chords in turn make jazz’s tritone substitution possible. Jazz uses chord substitutions to provide musical variety to the changes. In chord substitutions, one replaces a chord in a progression with another, musically related, chord. In tritone substitution, a dominant seventh chord in one key is replaced with the dominant seventh chord from a key that is a tritone away. For example, the ii-V-I progression in the key of A major uses the E7 chord. Under tritone substitution, it is replaced with A♯7.

Tritone substitution is possible because dominant seventh chords a tritone apart contain the same tritone. That is, they include the same two pitch-classes that are a tritone apart. This makes the two chords harmonically similar and permits them to be substituted for one another (Tymoczko, 2008). This harmonic similarity might also be the source for the tritone regularities in Figures 8-9 and 8-10. Even though the ii-V-I progression problem is not defined using tritone substitution, all of the possible dominant seventh chords are used. Their harmonic similarities—or the possibility of tritone substitution—are reflected in the structure of the two MDS solutions.

This suggests that it might be interesting to explore elaborations of the ii-V-I progression problem. For instance, one could define a version that explicitly defines tritone substitution in the training set. Would such a network exhibit similar structure to the one that we have been analyzing? Another kind of elaboration of the ii-V-I is a chord progression known as the Coltrane changes. Though related to the ii-V-I, the Coltrane changes are notoriously difficult to play. Can a network learn the Coltrane changes, using the various encodings introduced earlier in the chapter? If so, does the increased complexity of the Coltrane changes require us to use a more complicated network?

8.6 The Coltrane Changes

8.6.1 Extending the ii-V-I

The ii-V-I progression plays a dominant role in jazz. Shanahan and Broze compiled a corpus of 1200 jazz standards from published lead sheets (Broze & Shanahan, 2013; Shanahan & Broze, 2012). They analyzed this corpus to identify the five most common three-chord progressions in the lead sheets. The ii-V-I was by far the most prevalent, accounting for over 42% of the 7366 instances of these sequences.

Consider one jazz standard that employs the ii-V-I, “Tune Up.” It famously appeared on the album Blue Haze recorded by Miles Davis for the Prestige label in sessions that took place in 1953 and 1954. Davis is credited as being the composer of “Tune Up” on this album, but it was actually composed by Eddie “Cleanhead” Vinson. Vinson was a prominent blues singer, saxophonist, and bandleader (Nisenson, 2000). Vinson, Davis, and “Tune Up” are all linked to another seminal jazz figure, saxophonist and composer John Coltrane. Coltrane was a member of Vinson’s band in 1948. Coltrane also belonged to the Miles Davis Quintet between the years 1955 and 1957, as well as between 1958 and 1960 (Porter, 1998; Thomas, 1975). While in the Davis quintet, Coltrane was involved in performances and recordings of “Tune Up” (DeVito & Porter, 2008).

While jazz is founded on core harmonic patterns such as basic chord progressions, it always pushes this core in new directions. Coltrane was a master of this pursuit. His instructor at the Granoff School in Philadelphia, Dennis Sandole, reported that he and Coltrane investigated many advanced harmonic concepts that served as the foundation for Coltrane’s landmark compositions (Demsey, 1991). Some of Coltrane’s harmonic experiments led to a jazz progression now known as the Coltrane changes. This progression was unveiled in his influential 1960 album Giant Steps, where it appears in two famous pieces, “Giant Steps” and “Countdown.” The Coltrane changes also appear in several other pieces that Coltrane composed around this time (Demsey, 1991). The title “Countdown” pays homage to the Vinson–Davis classic “Tune Up”; Demsey shows that the harmonic structure of “Countdown” is systematically linked to that of “Tune Up.” Indeed, the structure of Coltrane’s changes can be explained as a particular elaboration of the ii-V-I progression (Demsey, 1991). The Coltrane changes add four new chords to the ii-V-I progression. Two of these added chords serve as lead-ins to the V chord in the ii-V-I, while the other two added chords are lead-ins to the I chord in the ii-V-I.

Importantly, the relationship between the roots of the V chord and its two lead-ins is a musical interval of a major third; the same relationship holds among the roots of the I chord and its two lead-ins (Demsey, 1991). The circle of perfect fifths and the four circles of major thirds can be used to create a map of the Coltrane changes chord roots for any key. Figure 8-11 provides this map. The inner circle of pitch-classes in this figure is organized around the circle of perfect fifths. Each of these pitch-classes is then attached to a circle of major thirds, which forms the outer ring of pitch-classes. These circles of major thirds provide the roots of the lead-in chords.

Figure 8-11 A map of the Coltrane changes’ chord roots created by combining the circle of perfect fifths (the inner ring of pitch-classes) with the circles of major thirds.

To illustrate the use of this map, consider the Coltrane changes for the key of C major. The ii-V-I progression for C major is Dm7-G7-Cmaj7. The Coltrane changes elaborate this sequence by adding two lead-in chords for G7 and two lead-in chords for Cmaj7. Figure 8-12 presents the part of Figure 8-11 used for the key of C major, naming each chord and providing the order in which the chords are played. From Figure 8-12 one can see that the Coltrane changes for C major are Dm7-D♯7-G♯maj7-B7-Emaj7-G7-Cmaj7. Note that this progression begins with the first chord of the ii-V-I, and ends with the ii-V-I’s final two chords.

Figure 8-12 The portion of the Figure 8-11 map that provides the Coltrane changes for the key of C major.

Figure 8-11 can be used to determine the seven chords of the Coltrane changes for any major key; one simply finds the key’s root in the inner circle and builds a version of Figure 8-12 up from this root. Table 8-6 provides the complete set of Coltrane changes, determined by applying this method to each of the 12 major keys. Note that the ii, V, and I columns of Table 8-6 also provide the ii-V-I progression for these major keys.

Table 8-6 The Coltrane changes for each major musical key.

Major key   ii      1st lead-in   1st lead-in   2nd lead-in   2nd lead-in   V      I
                    for V         for I         for V         for I
A           Bm7     C7            Fmaj7         G#7           C#maj7        E7     Amaj7
A#          Cm7     C#7           F#maj7        A7            Dmaj7         F7     A#maj7
B           C#m7    D7            Gmaj7         A#7           D#maj7        F#7    Bmaj7
C           Dm7     D#7           G#maj7        B7            Emaj7         G7     Cmaj7
C#          D#m7    E7            Amaj7         C7            Fmaj7         G#7    C#maj7
D           Em7     F7            A#maj7        C#7           F#maj7        A7     Dmaj7
D#          Fm7     F#7           Bmaj7         D7            Gmaj7         A#7    D#maj7
E           F#m7    G7            Cmaj7         D#7           G#maj7        B7     Emaj7
F           Gm7     G#7           C#maj7        E7            Amaj7         C7     Fmaj7
F#          G#m7    A7            Dmaj7         F7            A#maj7        C#7    F#maj7
G           Am7     A#7           D#maj7        F#7           Bmaj7         D7     Gmaj7
G#          A#m7    B7            Emaj7         G7            Cmaj7         D#7    G#maj7

Note. Each row provides the sequence of chords that define this progression for a particular musical key; the column labels indicate the role of each chord.
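
The regularity in Table 8-6 can be captured in a few lines. The following sketch (my own; the semitone offsets from the tonic are read off the C major example in Figure 8-12) regenerates any row of the table:

```python
PITCH_CLASSES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

# Semitone offset from the tonic and chord type for each of the seven chords,
# in playing order: ii, the four lead-ins, V, I (read off the C major row above).
OFFSETS_AND_TYPES = [(2, "m7"), (3, "7"), (8, "maj7"), (11, "7"),
                     (4, "maj7"), (7, "7"), (0, "maj7")]

def coltrane_changes(key: str) -> list[str]:
    tonic = PITCH_CLASSES.index(key)
    return [PITCH_CLASSES[(tonic + offset) % 12] + kind
            for offset, kind in OFFSETS_AND_TYPES]

print(coltrane_changes("C"))  # ['Dm7', 'D#7', 'G#maj7', 'B7', 'Emaj7', 'G7', 'Cmaj7']
print(coltrane_changes("A"))  # ['Bm7', 'C7', 'Fmaj7', 'G#7', 'C#maj7', 'E7', 'Amaj7']
```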

8.6.2 The Coltrane Changes Problem

With Table 8-6 in hand, I can now define a new chord progression problem using the Coltrane changes. As was the case with the earlier simulations using the ii-V-I chords, I present one chord from the progression to a network, and it generates the next chord in the progression. This requires six input/output pairs for any major key. The problem involves learning these chord pairings for each major key, producing a training set composed of 72 different patterns (six for each major key).

8.6.3 Encodings

As was the case in our study of the ii-V-I progression problem, we can explore a number of different encodings for the Coltrane changes problem: pitch-class, pitch without inversions, pitch with inversions, and lead sheet. Interestingly, the structure of the Coltrane changes produces some challenges for some encodings that were not present when the simpler ii-V-I problem was encoded. For instance, consider some additional design decisions that are required when pitch encoding without inversions is used. Unlike the ii-V-I progression, the Coltrane changes can cover a very wide piano keyboard. The number of input units required to represent this width depends on where the various chords are placed on the keyboard, because the same chord can occur at different places on the keyboard depending on the musical key in which the changes are defined. This issue is addressed by stipulating that every tetrachord in the training set has at least one pitch in the first octave of the represented keyboard. That is, every chord is shifted down the keyboard until its lowest note falls in the range from A3 (the A below middle C) to G♯4 (the G♯ above middle C). Using this encoding, the non-inverted forms of all of the chords used in the Coltrane changes require 23 input units. The lowest input unit corresponds to A3, and the highest input unit corresponds to G5.
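
The downward shift just described amounts to subtracting octaves. A sketch (MIDI numbering assumed, as before):

```python
LOW_OCTAVE_TOP = 68  # G#4 in MIDI; the first represented octave runs from A3 (57) to G#4 (68)

def shift_down_into_range(midi_notes: list[int]) -> list[int]:
    """Drop a chord by whole octaves until its lowest note lies in the first octave."""
    notes = list(midi_notes)
    while min(notes) > LOW_OCTAVE_TOP:
        notes = [n - 12 for n in notes]
    return notes

print(shift_down_into_range([74, 77, 81, 84]))  # D5 F5 A5 C6 -> [62, 65, 69, 72] (D4 F4 A4 C5)
```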

A different complication arises when encoding the Coltrane changes using pitch representations of chord inversions. There are two basic issues: first, which chord inversions should be used, and second, where should they be placed on the represented keyboard? Using voice leading as a guide to the choice of inversions, I selected the set of chord forms summarized in Table 8-7. I then used two different methods to place the chords on the “keyboard,” producing two different versions of the training set. In the first, I defined the chord patterns for the lowest major key on the keyboard, and then shifted this set of patterns up the keyboard to define the Coltrane changes for all of the other major keys. This requires 23 input units to represent any input/output pairing; the lowest pitch represented is A3 (the A below middle C) and the highest is G5. Unlike the encoding of non-inverted chords, this method represents the same chord in different octaves for different keys. However, each of these different representations of the same chord involves a different form of the chord (i.e., a different inversion). In the second method, I proceeded as with the pitch encoding of non-inverted chords: every chord was shifted downward so that at least one of its notes belongs to the lowest octave of the input pitches.

Table 8-7 The various chord forms used to achieve efficient voice leading for the Coltrane changes.

Chord                  Chord form
ii                     First inversion
First lead-in for V    Root position
First lead-in for I    Second inversion
Second lead-in for V   First inversion
Second lead-in for I   Third inversion
V                      Third inversion
I                      First inversion
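
The chord forms in Table 8-7 can be generated mechanically from root-position tetrachords using the standard definition of inversion, in which the nth inversion raises the chord's n lowest notes by an octave. The sketch below (my own illustration, not the author's procedure) does exactly that.

    def invert(pitches, n):
        """Return the nth inversion of a chord given as ascending MIDI numbers
        (n = 0 leaves the chord in root position)."""
        pitches = sorted(pitches)
        for _ in range(n):
            pitches = sorted(pitches[1:] + [pitches[0] + 12])  # bass up an octave
        return pitches

    # Dm7 in root position (D4, F4, A4, C5) and, per Table 8-7, the first
    # inversion used for the ii chord.
    dm7 = [62, 65, 69, 72]
    print(invert(dm7, 1))      # [65, 69, 72, 74], i.e., F4, A4, C5, D5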

8.7 Learning the Coltrane Changes

8.7.1 Relative Complexity

Section 8.4 reported the results of training artificial neural networks on the ii-V-I progression problem using a variety of encodings. Although we discovered that choice of encoding affected the amount of training required for a network to converge, the major finding of that section was that a very simple network—a perceptron with integration devices in its output layer—could learn any version of the ii-V-I progression problem. We have seen that the Coltrane changes elaborate the ii-V-I progression, and that the Coltrane changes are more difficult for musicians to perform or to improvise over. Are they more difficult for networks to learn?

The following sections report the results of training networks on various encodings of the Coltrane changes, and show that choice of encoding can have an important effect on network complexity. In general, though, all of these simulations point to one general conclusion: the Coltrane changes are indeed more complicated than the ii-V-I progression. This is because an integration device perceptron was never capable of learning a solution to the Coltrane changes, regardless of the choice of encoding. The Coltrane changes require using a more complex network.

8.7.2 Pitch-Class Encoding

In all of the simulations reported in this section, I attempt to discover the simplest network capable of learning the Coltrane changes. With pitch-class encoding, a multilayer perceptron is required. This perceptron uses value units for its output units, and requires nine hidden value units in order to converge. Before training, its connection weights are randomly initialized in the range from −0.1 to 0.1, and each µ is initialized to zero. The learning rate is 0.01, and each µ is modified by training. The order of pattern presentation is randomized for each epoch of training.
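
For readers unfamiliar with value units, the sketch below shows the kind of activation function involved. Its Gaussian form follows Dawson's earlier descriptions of value units (Dawson, 2004, 2005); treat the exact expression as an assumption rather than a detail given in this chapter.

    import math

    def value_unit(weights, inputs, mu=0.0):
        """A value unit: a Gaussian activation centred on mu, so the unit
        turns on only when its net input lands close to mu."""
        net = sum(w * x for w, x in zip(weights, inputs))
        return math.exp(-math.pi * (net - mu) ** 2)

    # With mu = 0, a net input of 0 gives activation 1.0, while a net input
    # of 1 gives only about 0.04: activity falls off sharply on both sides.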

A simulation experiment was conducted in which 25 different networks were trained until convergence was achieved; convergence involved generating a “hit” for every output unit and every training pattern. In this simulation experiment convergence was achieved after a mean of 5299.12 epochs of training (SD = 2774.75). The fastest convergence was obtained after only 1771 epochs of training, while the slowest convergence required 13,044 epochs.

8.7.3 Pitch Encoding without Inversions

When I use pitch encoding to encode the Coltrane changes chords in their root form, a much simpler network is able to learn the problem: a perceptron that uses value units in the output layer. A simulation experiment was conducted in which 25 different networks of this type were trained to convergence. The networks were initialized and trained in an identical fashion to the multilayer perceptron described in Section 8.7.2, with the exception that a learning rate of 0.1 was used. On average, convergence was obtained after 301.08 epochs of training (SD = 61.81). The fastest convergence was obtained after 168 epochs, while the slowest convergence required 380 epochs of training.

While a perceptron learns this version of the Coltrane changes, the fact that this perceptron uses value units instead of integration devices indicates that this encoding of the Coltrane changes is still more complicated than the ii-V-I progression problem. This is because a value unit makes a more complicated carving of a pattern space, making two parallel cuts through it instead of just one (Dawson, 2004, 2005, 2008).

8.7.4 Pitch Encoding with Inversions

As noted in Section 8.6.3, I studied two different versions of the inverted Coltrane changes. Let us first consider the version in which chords were not all shifted to have at least one note in the first octave of the pitch representation. This version of the problem is very difficult in comparison to the non-inverted version of the Coltrane changes. A multilayer perceptron with 11 hidden units is required; all of the output units and all of the hidden units of this network are value units. This network is initialized in the same way as those discussed above, and it is trained with a learning rate of 0.01. A simulation experiment in which 25 different networks were trained revealed that convergence was obtained after an average of 5237.68 epochs of training (SD = 1487.00). This amount of training is not significantly different from the amount required by the networks trained on the pitch-class representation of the problem (t = 0.0976, df = 36.735, p = 0.9228). It is, however, significantly greater than the amount of training required for the perceptron given the non-inverted encoding (t = −16.5849, df = 24.083, p < 0.001).

The version of the Coltrane changes in which inverted chords were shifted downward to start in the first octave of inputs also requires a multilayer perceptron that is built with value units and contains 11 hidden units. A simulation experiment revealed that this network took much longer to converge than did the other version of the inverted Coltrane changes. On average, 8299.92 epochs of training were required to reach convergence (SD = 3446.14). This amount of training is significantly greater than the amount required by the other version of the inverted chords (t = −4.0794, df = 32.638, p < 0.001).

It appears, then, that using inverted chords makes learning the Coltrane changes a much harder task. First, a more complicated network is required. Second, more training is required. One likely reason for this result is that inverting the chords requires networks to learn more causal links between chord forms than are required when chords are not inverted. In addition, shifting the chords down so that they all start from the same octave makes the task even more difficult, likely because this shift disrupts causal relations between chords even further.

8.7.5 Lead Sheet Encoding

The lead sheet encoding of the Coltrane changes leads to the most efficient learning of this progression. As with the pitch representation of non-inverted chords, a value unit perceptron learns the lead sheet version of the problem. A simulation experiment in which 25 of these networks were trained revealed that on average convergence was achieved after 72.36 epochs of training (SD = 1.89). This is a significantly smaller amount of training than is required by the perceptron presented with the non-inverted chords (t = 18.4924, df = 24.045, p < 0.001).

In summary, all of the results described above support two general conclusions. First, the Coltrane changes are more difficult than the ii-V-I progression because, regardless of encoding, they cannot be learned by an integration device perceptron. Second, choice of encoding of the Coltrane changes has a marked effect on network learning. This choice determines both network complexity and the amount of training required to achieve convergence.

8.8 Interpreting a Coltrane Perceptron

8.8.1 Coltrane Causality

How does an artificial neural network represent its knowledge of the Coltrane changes? To answer this question let us interpret the internal structure of a value unit perceptron that learns this progression using lead sheet encoding. Before examining the network, it will be useful to understand the causal structure that links chords in the Coltrane changes, which we can infer from examining Table 8-6.

First, because lead sheet encoding separates chord types from chord roots, let us consider the causal links between chord types in the Coltrane changes. There are only three relationships between chord types. First, a minor seventh chord always causes the next chord to be a dominant seventh. Second, a dominant seventh chord always causes the next chord to be a major seventh. Third, a major seventh chord always causes the next chord to be a dominant seventh. Let us next consider causal links between chord roots. These causal links are mediated by chord type, but we can ignore this context for the time being.

First, a chord root can cause the next chord root to be a minor second or one semitone higher. For instance, the first row of Table 8-6 shows that the first transition for the Coltrane changes in the key of A major is from a chord with the root of B to a chord with the root of C.

Second, a chord root can cause the next chord root to be a perfect fourth or five semitones higher. For example, Table 8-6 shows that this happens three times in the key of A major: C causes F, G♯ causes C♯, and E causes A.

Third, a chord root can cause the next chord root to be a minor third or three semitones higher. For instance, Table 8-6 shows that this happens twice in the key of A major, because F causes G♯, and C♯ causes E.

When causal links involving chord types and causal links involving chord roots are considered in combination, very systematic causal relations emerge in the Coltrane changes. First, causal links between specific chords are unique. For instance, C7 always precedes Fmaj7. An examination of Table 8-6 indicates that any given chord precedes only one specific chord.

Second, this property means that there are chains of chord sequences that are repeated in different keys of the Coltrane changes. One example chain is C7 – Fmaj7 – G♯7 – C♯maj7. This sequence of chords appears in the first row of Table 8-6, where C7 is the first lead-in for the V chord in the key of A major. The same sequence is also found in the fifth row of Table 8-6, where C7 is the second lead-in for the V chord in the key of C♯ major.

In short, the Coltrane changes can be described as a set of systematic and unique causal links in which the occurrence of one chord in a network’s input units necessarily causes the occurrence of a specific chord in the output units. In order to “know” the Coltrane changes, a network must adjust its connection weights in such a way as to realize these causal links. In the next section, we discover how a value unit perceptron accomplishes this.
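
This unique-successor property can be checked mechanically. The following self-contained sketch (using the same assumed root steps and chord types as the earlier training-set sketch) confirms that, across all 12 keys, every stimulus chord maps to exactly one response chord.

    ROOT_STEPS = [1, 5, 3, 5, 3, 5]
    CHORD_TYPES = ["m7", "7", "maj7", "7", "maj7", "7", "maj7"]

    successors = {}
    for tonic in range(12):
        roots = [(tonic + 2) % 12]
        for step in ROOT_STEPS:
            roots.append((roots[-1] + step) % 12)
        chords = list(zip(roots, CHORD_TYPES))
        for stimulus, response in zip(chords, chords[1:]):
            successors.setdefault(stimulus, set()).add(response)

    # Every chord that ever appears as a stimulus has exactly one successor.
    assert all(len(s) == 1 for s in successors.values())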

8.8.2 Network Structure

The network to interpret is a value unit perceptron trained on the lead sheet encoding of the Coltrane changes. This particular perceptron is initialized in the same fashion as was described earlier in this chapter, and is trained with a learning rate of 0.1. However, in order to facilitate network interpretation, I hold the µ of each output unit at zero throughout training. As a result, for an output unit to turn on, its net input must be near zero in value. The network converges after 93 epochs of training. Table 8-8 presents the resulting connection weights.

8.8.3 Network Causality

Table 8-8 The connection weights for a perceptron that has learned the Coltrane changes in lead sheet notation.

Input unit |    m7 |    D7 |  maj7 |      A |     A# |      B |      C |     C# |      D |     D# |      E |      F |     F# |      G |     G#
µ          |  0.00 |  0.00 |  0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00
m7         | −0.98 |  0.07 |  1.04 |   0.23 |  −0.79 |   0.22 |   0.24 |  −0.25 |  −0.79 |   0.79 |  −0.25 |   0.23 |   0.24 |  −0.24 |  −0.22
D7         | −1.06 | −1.23 | −0.19 |   2.10 |   1.45 |   2.09 |   2.12 |  −2.09 |  −1.78 |   1.78 |  −2.11 |   2.12 |   2.13 |  −2.11 |  −2.09
maj7       | −1.03 |  0.07 |  1.09 |   1.11 |  −1.76 |   1.10 |   1.12 |  −1.11 |   1.42 |  −1.43 |  −1.12 |   1.12 |   1.13 |  −1.12 |  −1.10
A          | −0.36 | −0.07 |  0.19 |   0.78 |  0.76* |   0.78 | −1.12* |  −0.74 |  1.78* |   0.26 |  −0.74 |   0.77 |   0.77 |  −0.77 |  −0.78
A#         | −0.24 | −0.07 |  0.19 |   0.77 |  −0.26 | −0.17* |   0.75 |  1.11* |  −0.26 | −1.77* |  −0.74 |   0.77 |   0.77 |  −0.77 |  −0.78
B          | −0.24 | −0.07 |  0.19 |   0.78 |  −0.26 |   0.78 | −0.20* |  −0.74 | −1.41* |   0.27 |  2.11* |   0.77 |   0.77 |  −0.77 |  −0.78
C          | −0.25 | −0.07 |  0.19 |   0.77 |  −0.26 |   0.78 |   0.75 |  0.19* |  −0.26 |  1.42* |  −0.75 | −2.11* |   0.77 |  −0.77 |  −0.78
C#         | −0.27 | −0.07 |  0.19 |   0.78 |  −0.25 |   0.79 |   0.75 |  −0.74 |  0.76* |   0.26 |  1.11* |   0.77 | −2.12* |  −0.77 |  −0.78
D          | −0.23 | −0.07 |  0.19 |   0.78 |  −0.26 |   0.79 |   0.75 |  −0.74 |  −0.26 | −0.75* |  −0.74 | −1.11* |   0.77 |  2.11* |  −0.78
D#         | −0.25 | −0.07 |  0.19 |   0.77 |  −0.26 |   0.78 |   0.75 |  −0.74 |  −0.26 |   0.27 |  0.19* |   0.77 | −1.12* |  −0.77 |  2.09*
E          | −0.22 | −0.07 |  0.19 | −2.10* |  −0.25 |   0.78 |   0.75 |  −0.74 |  −0.26 |   0.27 |  −0.74 | −0.18* |   0.77 |  1.11* |  −0.78
F          | −0.32 | −0.07 |  0.19 |   0.78 | −1.44* |   0.78 |   0.75 |  −0.74 |  −0.26 |   0.27 |  −0.75 |   0.77 | −0.19* |  −0.77 |  1.10*
F#         | −0.23 | −0.07 |  0.19 | −1.10* |  −0.26 | −2.09* |   0.75 |  −0.74 |  −0.26 |   0.26 |  −0.75 |   0.77 |   0.77 |  0.19* |  −0.78
G          | −0.26 | −0.07 |  0.19 |   0.77 |  1.76* |   0.78 | −2.11* |  −0.74 |  −0.26 |   0.27 |  −0.75 |   0.77 |   0.77 |  −0.76 |  0.17*
G#         | −0.26 | −0.07 |  0.19 | −0.18* |  −0.25 | −1.09* |   0.75 |  2.09* |  −0.26 |   0.26 |  −0.74 |   0.77 |   0.77 |  −0.76 |  −0.78

Note. Each row corresponds to an input source (µ or an input unit) and each column corresponds to an output unit. The µ of each output unit was held at zero throughout training (see Section 8.8.2). Unique connection weights from input to output are marked with an asterisk.

How do the connection weights in Table 8-8 represent the Coltrane changes? They do so by instantiating all of the specific causal relationships that were introduced in the previous section. First, consider the causal relationships that link an input chord type to an output chord type. These relationships are enforced by the weights presented in the first three columns of Table 8-8. In the Coltrane changes, an input minor seventh chord causes an output dominant seventh chord. Two aspects of the connection weights bring this condition to life. First, the connection between the input unit for m7 and the output unit for D7 has a weight of 0.07. Second, the connection between every chord root input unit and the output unit for D7 has a weight of −0.07. As a result, when the m7 unit is activated at the same time as a chord root input unit, the signals from the two input units to the D7 output unit cancel out to zero, turning this output unit on.

A dominant seventh chord can also be activated by a major seventh chord. The network accomplishes this in exactly the same way as was described in the preceding paragraph: note that the connection weight from the maj7 input unit to the D7 output unit is also equal to 0.07. Its signal, when combined with the signal from a chord root input unit, produces a net input of zero that again turns the D7 output unit on.

The network uses the same connection weight logic to activate the maj7 output unit when the D7 input unit is activated. The connection weight from this particular input unit to this particular output unit is −0.19. The connection weight between any chord root input unit and the maj7 output unit is 0.19. As a result, the D7 input unit will combine with any input chord root unit to produce a net input of zero that activates the maj7 output unit.

Importantly, the network also assigns connection weights to the first three columns of Table 8-8 in such a way that output units do not turn on when they are supposed to be off. For example, an input chord never turns on the m7 output unit, because the minor seventh is the chord that starts the Coltrane changes in a given key. Note that no combination of input signals in the first column of Table 8-8 will produce a net input of zero. Similarly, the connection weights from the “wrong” chord types to either the D7 or the maj7 output units are extreme enough never to be cancelled by a signal coming from any input chord root unit.

Let us now turn to considering how the weights in Table 8-8 handle causal links involving chord roots. To do so let us consider the output unit for pitch-class A. The connection weights that feed into this output unit are presented in the fourth column of the table. This output unit is activated by three different causal links between chords: E7 – Amaj7, F♯maj7 – A7, and G♯m7 – A7.

The first clue as to how these causal links are instantiated by a perceptron comes from examining the connection weights in Table 8-8 from each of the chord root units to the A output unit. All but three of these weights are approximately 0.77. The three exceptions are the pitch-classes involved in the three causal relationships described above. The weight from E is −2.10, the weight from F♯ is −1.10, and the weight from G♯ is −0.18.

The second clue to chord root causality comes from the relationship between these three weights and the weights from the three chord type input units to the A output unit. First, the weight from the D7 input unit to A is 2.10. This exactly cancels the signal coming from the E input unit. In other words, when D7 and E are both activated, the A output unit will turn on.

A similar relationship holds for the other two input chord types. The weight from the maj7 input unit to the A output unit is 1.10, which cancels out the signal from the F♯ input unit. As well, the weight from the m7 input unit to the A output unit is 0.23, which essentially cancels out the signal coming from the G♯ input unit.

In short, for the A output unit to turn on, two particular input units must be activated at the same time: one a chord type unit, the other a unit representing the chord root. The pairing of chord type and chord root is exactly the pairing required by the causal relationships that turn this output unit on.
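
A small worked example makes this cancellation concrete. The sketch below uses only the approximate weights quoted above for the A output unit, treats every other chord root weight as 0.77, and takes µ to be zero, as Section 8.8.2 stipulates; the Gaussian activation is the value unit form assumed earlier.

    import math

    TO_A = {"m7": 0.23, "D7": 2.10, "maj7": 1.10,   # chord type input units
            "E": -2.10, "F#": -1.10, "G#": -0.18}   # the three exceptional roots

    def a_unit(chord_type, root):
        """Activation of the A output unit (a value unit with mu = 0)."""
        net = TO_A[chord_type] + TO_A.get(root, 0.77)
        return math.exp(-math.pi * net ** 2)

    print(round(a_unit("D7", "E"), 2))      # 1.0  -- E7 causes Amaj7
    print(round(a_unit("maj7", "F#"), 2))   # 1.0  -- F#maj7 causes A7
    print(round(a_unit("m7", "G#"), 2))     # 0.99 -- G#m7 causes A7
    print(round(a_unit("D7", "F#"), 2))     # 0.04 -- a wrong pairing stays off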

An examination of the remaining columns in Table 8-8 reveals the same connection weight logic. In each column, all but three of the weights from the chord root units have the same value. The three that have unique values come from the input chord roots involved in turning that particular output unit on, and these unique values serve to cancel the signal coming from a particular chord type unit. The unique values are marked with asterisks in the table. Note that this pattern of marked weights is very systematic, tracing out three diagonals through the table. This pattern reflects the systematic relationships between input and output chord roots when the musical intervals between these roots are considered (see Section 8.8.1).

8.9 Strange Circles and Coltrane Changes

8.9.1 Circles of Major Thirds

When I use lead sheet encoding for a chord progression problem, a chord’s type and a chord’s root are encoded separately. This provides a network with an opportunity to determine an output chord’s type and root independently. The analysis of the Coltrane changes perceptron in Section 8.8 indicates that the network does not take this independent path. Instead, the network makes explicit the specific causal relationships between pairs of chords. For example, instead of asserting that 7 – maj7 and C – F, the network makes explicit the more specific relationship that C7 – Fmaj7. This claim is supported by the fact that turning on any chord type output unit, or any chord root output unit, requires combining signals from both a chord type input unit and a chord root input unit.

This approach to solving the Coltrane changes points toward a new direction for representing the structure of this chord progression. Figure 8-11 combined the circle of perfect fifths with the four circles of major thirds to generate a map of the Coltrane changes in any key. An alternative approach to generating the Coltrane changes in a particular key is to use only the circles of major thirds.

Two aspects of the network interpretation inform this approach, which is described in detail below. First, a particular input chord (i.e., a combination of chord type and chord root) always causes a particular output chord. Second, a particular input chord never causes an output chord whose root belongs to the same circle of major thirds as the root of the input chord. For example, in Table 8-8, output chords with the root A are only caused by input chords with the roots E, F♯, or G♯ (the marked cells in the A column). Output chords with the root A are never caused by input chords with the roots C♯ or F, which belong to the same circle of major thirds as A.

This suggests that the Coltrane changes are a sequence of chords in which one chord (associated with one circle of major thirds) can only cause a subsequent chord that is associated with a different circle of major thirds. Interestingly, this means that there is a very simple algorithm that uses three different circles of major thirds to generate the seven chords of the Coltrane changes in a particular major key (Figure 8-13).

Figure 8-13 Using three circles of major thirds to define the Coltrane changes for the key of C major.

Figure 8-13 uses three different circles of major thirds to generate the Coltrane changes for the key of C major. The first row of the figure provides the three circles of major thirds; their orientation is critical. The first circle is used once to generate the minor seventh chord that begins the progression. The chord that is played is at the top of the circle, and is pointed to by an arrow. The second circle is used to generate the dominant seventh chord. It is first used to select the D♯7 chord in the top row of the figure, again a chord pointed to by an arrow. The third circle is used to generate the major seventh chord. It is first used to select the G♯maj7 chord in the top row of the figure.

Importantly, to generate the remaining chords in the progression, one first rotates each of the second and third circles of major thirds counter-clockwise by 120°. This brings two new chords to the top of these circles, as illustrated in the second row of the figure. These chords are then played in succession; the second circle is used to generate the next dominant seventh chord (B7) and the third circle is used to generate the next major seventh chord (Emaj7). Then these two circles are rotated counter-clockwise by 120° once again. This brings the final two chords to the top of these circles, as is shown in the third row of the figure.
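
A compact sketch of this procedure for the key of C major follows. The circle contents are read off the chords named above; a deque rotation stands in for each 120° counter-clockwise turn (the mapping from rotation direction to list order is my assumption).

    from collections import deque

    circle1 = deque(["D", "F#", "A#"])   # used once, for the opening ii chord
    circle2 = deque(["D#", "B", "G"])    # supplies the dominant seventh chords
    circle3 = deque(["G#", "E", "C"])    # supplies the major seventh chords

    progression = [circle1[0] + "m7"]
    for _ in range(3):
        progression += [circle2[0] + "7", circle3[0] + "maj7"]
        circle2.rotate(-1)               # bring the next pitch class to the top
        circle3.rotate(-1)

    print(progression)
    # ['Dm7', 'D#7', 'G#maj7', 'B7', 'Emaj7', 'G7', 'Cmaj7']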

The three circles of major thirds illustrated in Figure 8-13 can be used to generate the Coltrane changes for two other musical keys as well. If one takes each circle in the top row and rotates it counter-clockwise by 120°, then the circles will generate the Coltrane changes for the key of G♯ major if the preceding algorithm is used. If the circles in the top row of Figure 8-13 are each rotated counter-clockwise by 240° before starting, then they will produce the Coltrane changes for the key of E major.

With different combinations of circles of major thirds, one can produce the Coltrane changes for any major key. The specific combinations are provided in Figure 8-14. The first row of this figure provides three circles oriented to generate the Coltrane changes for the key of C major (as was illustrated in Figure 8-13); rotating each of these circles counter-clockwise once or twice orients them for generating the progression for G♯ major and E major respectively. The second row is oriented to produce the progression for A♯ major, and can be rotated to accommodate the keys of F♯ major and D major. The third row is oriented for B major, and can be rotated to produce the chord sequence for G major and D♯ major. The final row is oriented for C♯ major, and can be rotated to produce the Coltrane changes for A major and F major.

Figure 8-14 is interesting because it shows that each circle of major thirds appears three times, each time in a different column. As well, each circle of major thirds is missing from only one of the four rows in the figure. Figure 8-14 is also of interest because it points in certain directions related to jazz composing. First, because the Coltrane changes comprise seven chords, the first circle of major thirds is only used to generate one chord. In other words, while the Coltrane changes define two lead-in chords to the V chord of the ii-V-I, as well as two lead-in chords to the I chord of this progression, they do not define lead-in chords for the ii chord. Figure 8-14 provides a strong motivation for elaborating the Coltrane changes by adding two new chords, both minor seventh chords that lead in to the ii. The chords in question are provided by the two unused pitch-classes in the first column of the figure. Second, the fact that each row of Figure 8-14 is missing a circle of major thirds leads one to consider adding it to provide up to three new chords for the progression. Some musical exploration is required to determine where in Figure 8-14 one would insert this new source of chords, as well as to determine the kind of chord to associate with this additional component.

Figure 8-14 The circles of major thirds for generating the Coltrane changes in any key.

8.10 Summary and Implications

At the start of this book, I introduced artificial neural networks as artifacts primarily used for pattern classification. That is, they arrange input patterns as points in a space (either a pattern space or a hidden unit space), and output units carve this space into decision regions. If a pattern falls into one decision region, the network generates one kind of response (i.e., one kind of “pattern name”); if it falls into a different decision region, a different response is generated.

In earlier chapters, I have demonstrated that pattern classification is a general ability that can be applied very neatly to a variety of musical problems. For example, I have used it to identify scale tonics, scale modes, musical keys, and chord types. The current chapter has shown a further flexible use of pattern classification in which the response generated by a network to an input chord is a special name: the name of another chord. This permits a network to represent chord progressions in its internal structure. I demonstrated this ability by training networks on two different chord sequences, the ii-V-I progression and the Coltrane changes.

In addition to demonstrating this ability, this chapter also explored the importance of how one encodes network stimuli and responses. One of the main results obtained in the current chapter was that the choice of encoding had an enormous impact on problem complexity. For the ii-V-I progression problem, I discovered that encoding did not affect network complexity: all versions of this problem could be learned by an integration device perceptron. However, the choice of encoding did affect the amount of training required for a network to discover a solution to this problem. For the Coltrane changes, choice of encoding not only affected learning speed but also determined network complexity. Some versions of this problem could be solved by a value unit perceptron, while others required multilayer networks of value units that included as many as 11 hidden units.

While a main purpose of the current chapter was simply to illustrate the importance of encoding choices, it is important to keep in mind the implications of such choices. Obviously, problem difficulty is dictated by problem encoding. What encoding, then, should one choose for one's networks? It might be very tempting to explore a variety of different and plausible encodings, and then to choose the one that generates the simplest networks. In some cases, this might very well be the appropriate strategy. However, other factors must also be considered when choosing an encoding. For example, perhaps the goal of a network is to provide insight into the formal regularities that govern a specific musical problem. In this case, the encoding that leads to the simplest network may not be the most appropriate, because that encoding may cause certain musical regularities to disappear. We saw earlier in this chapter that one key element of the musical theory of chord progressions is voice leading. The lead sheet notation described in this chapter generates simple networks, but this encoding hides essential properties related to voice leading. Therefore, if one is interested in using networks to explore regularities of voice leading, then the encoding that leads to the simplest network may not be the most appropriate.

As another example, perhaps the goal of training a musical network is to discover representations that serve as the basis for musical cognition. In this case, we may not be searching for the encoding that produces the simplest networks. Instead, we might be searching for the encoding that generates the greatest similarity between various measures of network performance and structure and measures of performance of human listeners in a musical cognition experiment.

From the perspective of musical cognition, human listeners are “black boxes.” This is because we cannot directly observe the internal structures and processes that mediate musical cognition. Instead, we can only infer these internal properties from observations of external behaviour. This process of inference is known as reverse engineering. By observing human responses to musical stimuli in a variety of clever experimental situations, we attempt to discover the structures, processes, or algorithms inside the black box.

Reverse engineering is hard enough because we cannot directly see inside the black box. A second issue that makes reverse engineering challenging is that each input/output or stimulus/response pairing that we can observe can be mediated by more than one process. There is a many-to-one mapping from possible structures, processes, or algorithms to input/output relations (Dawson, 2013). As a result, we might believe that one process is responsible for mediating observed behaviour, but a very different process might actually be responsible. Therefore, we require some special observations useful for validating one theory about what is inside the black box as opposed to another. Fortunately, black boxes will generate some observable behaviours that are side effects of the processes inside the black box. These side effects—called artifacts by Dawson (2013)—can provide critical information for theory validation (Pylyshyn, 1980, 1984).

For instance, one consequence of representing a problem in a particular format might be that some instances of the problem can be solved quickly, while other instances are more difficult to solve. In performing mental arithmetic, for example, one might expect that if numbers were represented mentally in columns then addition problems that require carrying digits from one column to another would take longer than problems that did not require this operation. One can collect relative complexity evidence (Pylyshyn, 1984) to investigate artifacts of this type. With relative complexity evidence, one varies the nature of problems presented to a system, and then explores the relationship between the properties of the problems and the time required to solve them.

A related type of data provides intermediate state evidence (Pylyshyn, 1984). This kind of evidence presumes that information processing inside the black box requires a number of different processing stages, and that each stage might represent intermediate results in a different format. To collect intermediate state evidence, one attempts to determine the number and nature of these intermediate results. For example, when researchers determined that items in short-term memory were confused with similar sounding items (Conrad, 1964) and not with items with similar meaning, this suggested that an intermediate memory store used an acoustic encoding (Waugh & Norman, 1965).

A particular type of data, called error evidence (Pylyshyn, 1984), is very well suited to determine intermediate states. When extra demands are placed on a system’s resources, it may not function as designed, and its internal workings are likely to become more evident (Simon, 1969). This is not just because the overtaxed system makes errors in general, but because these errors are often systematic, and their systematicity reflects the underlying representation. For example, one study (Yaremchuk & Dawson, 2005) investigated a multilayer perceptron trained to identify tetrachord types. When some of its hidden units were removed, the network only made very specific errors: it failed to identify tetrachords as being major when, and only when, they were in their second inversion form. This suggested that the role of the missing hidden units was to permit the network to deal with this rather specialized type of input.

What is the relationship between relative complexity evidence, intermediate state evidence, error evidence, and choice of encoding? In many cases, researchers are specifically interested in using artificial neural networks to serve as models of human musical cognition (Griffith & Todd, 1999; Todd & Loy, 1991). In this case, establishing the validity of the model likely requires collecting all three types of evidence, not only from the human subjects but also from the neural network model. The hope would be to find a close relation between the evidence collected from the human subjects and the evidence collected from the neural network model. Importantly, this match is likely to be highly related to choice of encoding. In other words, a music cognition researcher may not be interested in seeking the encoding that leads to the simplest network, but instead in seeking the encoding that leads to the best match between subject and model.
