Exploring Composer Attribution in Motet Cycles Using Machine Learning
Introduction and General Methodology
The Music Information Retrieval (MIR) community has in recent years been performing important work applying methodologies based on systematic automated analysis and machine learning to music. There is excellent opportunity for collaboration between MIR researchers and musicologists, as the domain expertise and insight of musicologists and theorists are invaluable to the study of music, and MIR techniques offer new ways of analysing music systematically in quantities and with a scope that would be infeasible using only traditional manual methodologies. Automated computational analyses can also lead to initially unintuitive but potentially meaningful insights that might not otherwise have been considered, as they can cast a broader analytical net than can reasonably be employed in manual analyses, and because their automated nature helps guard them against potentially biased expectations of any given corpus. This document is intended to provide a few brief samples of how certain MIR approaches might be fruitfully applied to important musicological projects such as the Motet Cycles Edition (MCE).[1]
In particular, this work focuses on the automated extraction and analysis of musical ‘features,’ each of which encapsulates a single piece of statistical information about a musical score. Each feature presents a summary description about some musical element of the score that can be individually extracted in a consistent way across multiple scores, and that can be represented as a single number. For example, one might imagine the three following simple features: 1) the ‘range’ of a score might be formulated as a feature that measures the difference in semitones between the lowest and highest notes in the score; 2) the ‘note density’ of a score could be measured in terms of the average number of note attacks in the piece per beat; and 3) a ‘contrary motion’ feature could measure the fraction of movements between voices that are contrary.
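To make the notion of a feature concrete, the first two of these hypothetical examples might be computed roughly as follows. The note representation used here is an invented stand-in chosen purely for illustration, not jSymbolic's actual encoding:

```python
# Sketch of two of the three example features, computed from a toy score
# represented as a list of notes (MIDI pitch numbers plus onset times in
# beats). This representation is hypothetical, used only to show how each
# feature reduces a score to a single number.

def pitch_range(notes):
    """Range: difference in semitones between lowest and highest pitches."""
    pitches = [n["pitch"] for n in notes]
    return max(pitches) - min(pitches)

def note_density(notes, total_beats):
    """Note density: average number of note attacks per beat."""
    return len(notes) / total_beats

# Toy score: an ascending C major arpeggio, one note per beat.
score = [
    {"pitch": 60, "onset": 0.0},  # C4
    {"pitch": 64, "onset": 1.0},  # E4
    {"pitch": 67, "onset": 2.0},  # G4
    {"pitch": 72, "onset": 3.0},  # C5
]

print(pitch_range(score))        # 12 semitones (an octave)
print(note_density(score, 4.0))  # 1.0 attacks per beat
```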
Features can be most interesting when many hundreds or thousands of different features, capturing diverse musical aspects of a score, are systematically compared across many hundreds or thousands of scores. To do this, it must be possible to extract features from scores quickly and consistently, something that is made possible by software that calculates features from digital scores encoded in symbolic file formats such as MIDI, MusicXML, MEI, **kern and so on.
The particular experiments described below were performed using features extracted by the jSymbolic[2] software, which is open-source and freely available.[3] The current release version 2.2 of jSymbolic can automatically extract 1497 features from a given digital score, relating to a broad variety of musical qualities, including: general pitch statistics; melody and horizontal intervals; chords and vertical intervals; texture; rhythm; instrumentation; and dynamics. jSymbolic has previously been used in a range of musicological research projects, such as investigations of regional styles,[4] the origins of the madrigal,[5] Renaissance compositional style[6] and genre in popular music.[7]
Once features are extracted from all the music in a corpus, they can be explored and compared across pieces to reveal patterns that might not be evident using traditional techniques. One of the essential advantages of this approach is that it permits the direct comparison of musical content in consistently measured ways, not just across sample pieces, but across all pieces in a corpus.
Although feature comparisons can certainly be done meaningfully either manually or using statistical analysis software, another particularly effective class of feature-based approaches is to use machine learning techniques to ‘train’ models that can automatically learn to differentiate between different ‘classes,’ or groups of interest, using a ‘supervised’ process. The features serve as the percepts of the ‘models’ (or ‘classifiers’) trained by these machine learning algorithms, and the output of a trained classifier is the ‘predicted’ class of any digital score it is given to classify (or, more precisely, the features extracted from that digital score). In effect, a classifier is trained using supervised machine learning in much the same way an independent student might be trained: they are given music (or, in the case of classifiers, features extracted from music), they are told what the music is (its class) and they are left to infer how emergent musical patterns map to the classes of interest based on the exemplars they are given to train on. In the particular case of the experiments described below, the class is the composer of a given score, which means that the classifiers are trained to automatically recognize statistically meaningful identifying patterns in the features that (if all goes well) delineate the musical styles of the composers whose work they are trained on.
For this to work, a machine learning algorithm needs to be trained with enough exemplars of each composer’s music to be able to recognize the musical characteristics underlying the composer’s style in general. Too few training exemplars can mean that non-representative variations between individual pieces can occlude the more meaningful general stylistic characteristics that need to be modelled. This can be a problem in early music research, since there are often only a limited number of extant securely attributed pieces to train on. So, in such cases, one must consider the output of classifiers with a grain of salt; they may not definitively prove the attribution of any piece with statistical certainty, but then neither do traditional approaches in most cases. Ultimately, when there is limited training data available, the output of classifiers should be considered as a meaningful piece of evidence in a larger constellation of evidence also derived from manual analyses, historical documents and so on.
Before proceeding to the details of the two sets of experiments described below, it should be mentioned that they used only 801 and 552, respectively, of the 1497 features in the full jSymbolic catalogue. This is because the other features are either not relevant to the corpus under consideration (e.g. features based on instrumentation or dynamics) or are sensitive to editorial or encoding inconsistencies found in the digital corpus we used, which was drawn from different sources that used potentially different practices for digitizing their music. The full jSymbolic feature catalogue is only appropriate when all the music in the corpus being studied is encoded using a consistent workflow. Cumming et al. discuss such issues in more detail.[8]
Experiment Set 1: Ave domine Iesu Christe
The first set of experiments carried out in connection with the MCE project involved a focus on the style of Loyset Compère and an investigation of the eight motets of the Ave domine Iesu Christe cycle (in Librone 1),[9] whose attribution to Compère is debated.[10] A corpus of 48 digital scores securely attributed to Compère and 1,783 digital scores securely attributed to other Renaissance composers was assembled from various sources,[11] for use as training data. This music all consisted of motets or mass movements, and was encoded as MIDI. jSymbolic features were then extracted from this corpus and used to train a binary ‘Compère’ vs. ‘not Compère’ classifier. In particular, the SMO support vector machine[12] implementation in the open-source Weka machine learning and data mining package[13] was used to train classifiers, with a linear kernel and default hyper-parameters. The ‘not Compère’ music consisted of works by Agricola, Brumel, Busnoys, Févin, Weerbeke, Palestrina, Du Fay, Isaac, Obrecht, Japart, Mouton, Martini, Ockeghem, Regis, Tinctoris, Josquin, Daser, de Orto, Pipelare, La Rue and Victoria.[14]
The resulting trained model was in effect able to, given a previously unseen digital score, classify it as being either likely by Compère or likely not by Compère. More specifically, a classification of ‘not Compère’ meant that it was closer in style to one of the other composers, or, if the generalization during training was successful, a broad amalgamation of styles characteristic of Renaissance music not in the particular style of Compère. In a 10-fold ‘cross-validation’ experiment to examine the efficacy of the features in building a successful classifier with the secure training data, the classifiers were able to correctly identify a piece known to be by Compère 58.3% of the time, and correctly identify a piece known to not be by Compère 99.8% of the time. This uneven split is not unexpected, since far more ‘not Compère’ than ‘Compère’ training data were available; the classifiers tended to err on the side of ‘not Compère,’ so if a given musical item is classified as ‘Compère’ it quite likely is by Compère (since only 0.2% of items not by Compère were misidentified as being by him during cross-validation), but if it is classified as ‘not Compère,’ then there is more uncertainty (since 41.7% of items known to be by Compère were misidentified as not being by him).
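The general shape of such a cross-validation experiment can be sketched as follows. scikit-learn's linear SVM stands in here for Weka's SMO implementation (an assumption for illustration), and randomly generated data stand in for the real jSymbolic features; the two per-class recall values computed at the end correspond to the two accuracies quoted above:

```python
# Sketch of a binary 'Compère' vs. 'not Compère' cross-validation set-up.
# X and y are synthetic stand-ins for the real jSymbolic feature data;
# LinearSVC approximates Weka's SMO with a linear kernel.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))         # 200 pieces x 50 features
y = np.array([1] * 20 + [0] * 180)     # 1 = 'Compère'; heavily imbalanced
X[y == 1] += 0.8                       # give the minority class a signal

clf = LinearSVC(C=1.0)                 # linear kernel, default-style settings
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
pred = cross_val_predict(clf, X, y, cv=cv)

# Per-class accuracy: recall for 'Compère' and for 'not Compère'.
print("Compère recall:    ", recall_score(y, pred, pos_label=1))
print("not-Compère recall:", recall_score(y, pred, pos_label=0))
```

Note that, as in the experiment described above, class imbalance in the training data tends to bias such a classifier towards the majority class, so the two recall values generally differ.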
Keeping this context in mind, a new classifier was then trained on the entire secure corpus, and the previously unseen Ave domine Iesu Christe items were processed individually by this trained classifier. This resulted in the classifications shown in Table 1.
Table 1. Trained classifier output on each motet, predicting whether it should be attributed to Compère or not.
| Motet | Classifier Output |
| --- | --- |
| Ave domine Iesu Christe verbum patris | Compère |
| Ave domine Iesu Christe laus angelorum | not Compère |
| Ave domine Iesu Christe lumen caeli | Compère |
| Ave domine Iesu Christe vita dulcis | Compère |
| Salve salvator mundi | Compère |
| Adoramus te Christe | Compère |
| Parce domine | Compère |
| Da pacem domine | Compère |
It can be seen that seven of the eight motets were identified as ‘Compère’ and only one as ‘not Compère.’ Given the low probability of a false ‘Compère’ classification and the much higher probability of a false ‘not Compère’ classification, as discussed above, this presents fairly convincing evidence that the cycle is indeed by Compère. However, there is an important caveat: the classifier was only explicitly aware of the music of the twenty-two composers it was trained on, and it is possible that the motet cycle was actually composed by some other unknown composer who had a compositional style statistically more similar to that of Compère than to the styles of any of the other twenty-one composers the classifier was trained on. While this possibility cannot be discounted with certainty, of course, it is perhaps unlikely enough for one to still strongly suspect that the cycle is indeed by Compère, in the absence of musicological evidence to the contrary.
One disadvantage of approaches based on machine learning is that it is not always transparent how particular classifications are made. Classifiers tend to base their output on complex patterns of interactions between dozens or hundreds of features that are hard to decode in human-interpretable ways. Fortunately, it is possible to use statistical analysis techniques to examine the feature data itself, in ways that reveal which features statistically separate classes. For example, one can use a standard statistical metric called ‘information gain’ to rank the features that seem to best delineate classes; the following are the top fifteen ranked jSymbolic features that individually provide the greatest information gain in discriminating music securely attributed to Compère from that securely attributed to the other twenty-one composers experimented with, starting with the features with the highest information gain:[15]
- Initial Time Signature
- Mean Complete Rest Duration
- Longest Complete Rest
- Median Complete Rest Duration
- Complete Rests Fraction
- Voice Separation
- Average Number of Independent Voices
- Median Partial Rest Duration
- Relative Size of Melodic Intervals in Lowest Line
- Metrical Diversity
- Voice Overlap
- Mean Rhythmic Value Run Length
- Relative Note Density of Highest Line
- Maximum Number of Independent Voices
- Voice Equality – Number of Notes
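The kind of ranking underlying this list can be sketched as follows. scikit-learn's mutual information estimator is used here as an approximate stand-in for Weka's information gain evaluator (mutual information with the class label is the quantity information gain estimates, after discretization), and the feature data are synthetic:

```python
# Sketch of ranking features by (approximate) information gain, using
# mutual information with the class label. Synthetic data: feature 2 is
# made informative about the class, the rest are noise.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.array([1] * 20 + [0] * 180)
X[y == 1, 2] += 2.0                    # feature 2 separates the classes

names = [f"feature_{i}" for i in range(X.shape[1])]
gain = mutual_info_classif(X, y, random_state=0)
ranking = sorted(zip(names, gain), key=lambda p: p[1], reverse=True)
for name, g in ranking:
    print(f"{name}: {g:.3f}")          # feature_2 ranks first
```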
Although analyses such as this can provide important insights, they also have limitations. For example, the information gain analysis above only evaluates features based on the insights they provide individually, and in practice it is quite rare that any individual feature is effective alone. For example, only considering the initial time signature of a piece, the feature with the highest information gain in connection with the Compère experiment, would result in a very poor classifier. As noted above, effective classifiers look at the complex mutual interplay between many features at once, and a feature that individually provides little information gain could potentially in fact be very powerful when considered in conjunction with certain other features, and vice versa.
So, how can one get a sense of which features are most important in a broader, more holistic way? One way is to employ a methodology that looks for groups of features that together are particularly effective at delineating classes of interest. Some approaches of this kind, such as genetic algorithms or forward-backward selection, employ heuristic or stochastic search strategies that look for good, but not necessarily optimal, feature groups. The results of such analyses highlight features that may or may not be highly discriminative when considered alone, but which are powerful when considered in conjunction with one another. One relatively fast but effective approach of this kind is Weka’s CfsSubsetEval algorithm, which selects a feature subset by maximizing predictive ability while also minimizing the redundancy between features. Such an analysis highlights the following (unranked) features in the Compère vs. not Compère data:
- Pitch Class Histogram 6
- Folded Fifths Pitch Class Histogram 5
- Importance of High Register
- Melodic Thirds
- Vertical Interval Histogram 5
- Vertical Interval Histogram 16
- Wrapped Vertical Interval Histogram 10
- Chord Duration
- Initial Time Signature
- Metrical Diversity
- Mean Rhythmic Value Run Length
- Variability in Rhythmic Value Run Lengths
- Mean Complete Rest Duration
- Beat Histogram Tempo Standardized 51
- Beat Histogram Tempo Standardized 140
- Relative Note Density of Highest Line
- Relative Size of Melodic Intervals in Lowest Line
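The idea behind correlation-based feature selection can be sketched in simplified form as follows. This is an illustrative reimplementation of the CFS 'merit' heuristic with a greedy forward search, not Weka's exact CfsSubsetEval algorithm, and it is run on synthetic data:

```python
# Simplified correlation-based feature subset selection in the spirit of
# Weka's CfsSubsetEval: prefer subsets whose features correlate strongly
# with the class but weakly with one another.
import numpy as np

def cfs_merit(X, y, subset):
    """CFS 'merit' of a feature subset."""
    k = len(subset)
    rcf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return rcf
    rff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                   for i, a in enumerate(subset) for b in subset[i + 1:]])
    return (k * rcf) / np.sqrt(k + k * (k - 1) * rff)

def cfs_forward_select(X, y):
    """Greedy forward search: add the feature that most improves merit."""
    remaining = list(range(X.shape[1]))
    chosen, best = [], 0.0
    while remaining:
        merit, j = max((cfs_merit(X, y, chosen + [j]), j) for j in remaining)
        if merit <= best:
            break
        chosen.append(j)
        remaining.remove(j)
        best = merit
    return chosen

# Toy demo: feature 0 tracks the class, feature 2 is a near-copy of it,
# features 1 and 3 are noise; the search should start from an informative
# feature and ignore the pure noise columns.
rng = np.random.default_rng(0)
y = np.array([0] * 100 + [1] * 100)
X = rng.normal(size=(200, 4))
X[:, 0] += 2 * y
X[:, 2] = X[:, 0] + rng.normal(0, 0.05, size=200)
print(cfs_forward_select(X, y))
```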
Once again, each of these features is described in the on-line jSymbolic manual. Notice that there is partial, but not complete, overlap between the information gain list and the CfsSubsetEval list, as one might expect from the discussion above.
Although these are by no means the only features considered by the actual classifier used to classify the unattributed music (it considered all 801 extracted features), the lists above do give a sense of some of the features that are particularly mutually important statistically. An essential next step, of course, would be to consider these features in the context of scholarly expertise; some features might only be statistical anomalies with little musicological salience, while others can highlight areas of importance that might not otherwise have been considered.
It can be especially useful to manually explore the feature data relating to those particular features that have been highlighted by statistical analyses such as those introduced above. For example, both approaches above identified the ‘Relative Size of Melodic Intervals in Lowest Line’ feature as being important, but in what particular way did Compère’s music differ in this respect from the other composers under consideration? Did he tend to use larger melodic intervals, smaller intervals, or perhaps some more sophisticated pattern? To help explore this, one can look at results for individual items, or one can look at statistical summaries such as means and variances. An investigation of the particular feature patterns indicative of Compère’s style relative to the other composers is left to future work.
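As a sketch of this kind of exploration, one might compare a highlighted feature's summary statistics across the two groups. The values below are synthetic placeholders, since the real feature data are not reproduced here:

```python
# Comparing a single feature's distribution between two groups of scores.
# The numbers are synthetic stand-ins for real jSymbolic feature values.
import numpy as np

rng = np.random.default_rng(0)
compere = rng.normal(0.9, 0.1, size=48)     # hypothetical values, 48 pieces
others = rng.normal(1.1, 0.2, size=1783)    # hypothetical values, 1,783 pieces

for label, vals in [("Compère", compere), ("others", others)]:
    print(f"{label}: mean={vals.mean():.3f}, std={vals.std():.3f}")
```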
Experiment Set 2: Gaude flore virginali, Natus sapientia and Nativitas tua
The second set of MCE experiments involved training a classifier that could distinguish between various potential candidate composers of the motets in the Gaude flore virginali (Munich 3154), Natus sapientia (Munich 3154),[16] and Nativitas tua (Librone 1)[17] groups. The associated training data consisted of 660 motets and mass movements securely attributed to Busnoys, Weerbeke, Martini, Ockeghem, Josquin, Compère and La Rue. This reduced set of composers, compared to the twenty-two composers trained on in Experiment Set 1, was chosen so as to focus on composers with a greater probability of composing at least one of these three groups of motets in particular. In Experiment Set 1, the goal was to have training data representing music not by a single composer (Compère); here, the goal is instead to investigate possible attribution to each of the reduced set of composers. Reducing the list of candidates facilitates the work of the classifier, since this results in fewer classes to model, something that is particularly important here given the very limited number of available training exemplars for certain composers.
During a cross-validation experiment, the classifiers were able to distinguish between the music of the seven composers 77.1% of the time. More details of the cross-validation results are shown in Table 2.
Table 2. ‘Confusion matrix’ indicating how the models classified securely attributed scores during cross-validation. The rows correspond to the true attribution of the items, and the columns correspond to the predictions of the classifiers during cross-validation. The numbers in each cell indicate the number of items securely attributed to the row’s composer that were classified as being by the column’s composer. Entries on the diagonal correspond to correct classifications. For example, looking at the first row, it can be seen that 43 of the 69 items by Busnoys were correctly classified as being by him, 11 were incorrectly classified as being by Ockeghem, etc.
| | Busnoys | Weerbeke | Martini | Ockeghem | Josquin | Compère | La Rue |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Busnoys | 43 | 0 | 9 | 11 | 5 | 1 | 0 |
| Weerbeke | 0 | 7 | 1 | 0 | 0 | 0 | 2 |
| Martini | 9 | 0 | 100 | 1 | 8 | 0 | 5 |
| Ockeghem | 12 | 0 | 5 | 75 | 4 | 0 | 2 |
| Josquin | 4 | 2 | 8 | 7 | 91 | 2 | 17 |
| Compère | 2 | 2 | 3 | 0 | 1 | 19 | 5 |
| La Rue | 0 | 2 | 3 | 1 | 16 | 1 | 174 |
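The per-composer accuracies and misclassification patterns that follow can be recomputed directly from the numbers in Table 2; the following sketch reproduces the matrix and derives those figures:

```python
# Recomputing the cross-validation statistics from Table 2's confusion
# matrix (rows = true composer, columns = predicted composer).
import numpy as np

composers = ["Busnoys", "Weerbeke", "Martini", "Ockeghem",
             "Josquin", "Compère", "La Rue"]
cm = np.array([
    [43,  0,   9, 11,  5,  1,   0],   # Busnoys
    [ 0,  7,   1,  0,  0,  0,   2],   # Weerbeke
    [ 9,  0, 100,  1,  8,  0,   5],   # Martini
    [12,  0,   5, 75,  4,  0,   2],   # Ockeghem
    [ 4,  2,   8,  7, 91,  2,  17],   # Josquin
    [ 2,  2,   3,  0,  1, 19,   5],   # Compère
    [ 0,  2,   3,  1, 16,  1, 174],   # La Rue
])

overall = cm.trace() / cm.sum()
per_class = cm.diagonal() / cm.sum(axis=1)
print(f"overall accuracy: {overall:.1%}")              # 77.1%
for name, acc in zip(composers, per_class):
    print(f"{name}: {acc:.1%}")

# Fraction of misclassified La Rue items assigned to Josquin:
larue, josquin = composers.index("La Rue"), composers.index("Josquin")
mis = cm[larue].sum() - cm[larue, larue]
print(f"{cm[larue, josquin] / mis:.1%}")               # 69.6%
```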
It can be seen from Table 2 that the classifiers were better at identifying certain composers than others. For example, 88.3% of the La Rue scores were correctly classified, but the Weerbeke scores had a classification accuracy of only 70.0%. This is as might be expected, since there were far more available training instances by La Rue than by Weerbeke (197 vs. 10, in this most divergent example), which means that there was less training data that could be used to accurately model Weerbeke’s style compared to La Rue’s.
It is also notable that certain kinds of misclassification were more common than others. For example, 69.6% of all misclassified La Rue scores were erroneously assigned to Josquin, and 42.5% of all misclassified Josquin scores were erroneously assigned to La Rue. This sort of pattern in incorrect classifications is as one might expect, given the relative similarity between the styles of La Rue and Josquin.
Next, a classification model was trained on the complete training data and used to classify the previously unseen motets making up the Natus sapientia cycle, the Gaude flore virginali cycle and the group of five motets beginning with Nativitas tua,[18] with the results shown in Table 3.
Table 3. Attributions for each motet as predicted by the trained classifier.
| From Munich 3154 | Classifier Prediction |
| --- | --- |
| Gaude flore virginali: Gaude flore virginali | Martini |
| Gaude flore virginali: Gaude sponsa cara dei | Martini |
| Gaude flore virginali: Gaude splendens vas virtutum | Martini |
| Gaude flore virginali: Gaude nexu voluntatis | Martini |
| Gaude flore virginali: Gaude mater miserorum | Martini |
| Gaude flore virginali: Gaude virgo mater pura | Martini |
| Natus sapientia: Natus sapientia | Josquin |
| Natus sapientia: Cito derelictus | Josquin |
| Natus sapientia: Hora prima ductus est | Josquin |
| Natus sapientia: Crucifige clamitant | Josquin |
| Natus sapientia: Iugo est crucis conclavatus | La Rue |
| Natus sapientia: Iesus dominus exspiravit | Josquin |
| Natus sapientia: Fortitudo latuit | Josquin |
| Natus sapientia: Datur sepulturae corpus | Josquin |

| From Librone 1 | Classifier Prediction |
| --- | --- |
| Nativitas tua sancta dei genitrix | Ockeghem |
| O redemptor totius populi | Martini |
| Gaude Maria virgo | Martini |
| Exultabit cor meum | Ockeghem |
| Timete dominum omnes sancti | Martini |
The Gaude flore virginali motets were all attributed to Martini, the Nativitas tua motets in Librone 1 were attributed to either Martini or Ockeghem and the Natus sapientia motets were attributed to Josquin (except Iugo est crucis conclavatus, which was attributed to La Rue).
As always with experiments involving limited amounts of training data like this (from a machine learning perspective), one should be hesitant to give too much authority to any single motet classification. This caution is reinforced by the limited 77.1% classification accuracy observed during cross-validation. In aggregate, however, the results can become more meaningful. Although the results above certainly do not provide definitive proof that the Gaude flore virginali cycle can be securely attributed to Martini, for example, this result would meaningfully reinforce other musicological evidence supporting him as the composer of the cycle. The consistency of the six Gaude flore virginali classifications also suggests that it is unlikely that any of the other six composers tested for composed the cycle.
As with Experiment Set 1, it is important to emphasize that any or all of these motets could be by composers other than the seven tested for here; if this is indeed the case, then the predicted classification of a given motet indicates the composer of the seven who is closest in style (based on the features) to the true composer of the motet. Since it would be musicologically surprising for Josquin to have composed the Natus sapientia cycle, for example, one might suspect that this cycle is perhaps by some composer other than the seven tested for, and that that unknown composer had a style similar to that of Josquin (or at least more similar to Josquin’s style than to that of any of the other six composers tested for).
Given the consistency of the predicted attributions for the Gaude flore virginali and Natus sapientia motets, it seems likely (although not certain) that each cycle is by a single composer, as one might expect, even if not necessarily Martini and Josquin respectively. Single authorship seems less certain for the five motets from Librone 1, however, since their classifications were more ambiguously split between two composers. That said, it is also possible that these five motets are in fact by a single composer other than the seven tested for, one with stylistic traits in common with both Ockeghem and Martini.
One may also use these results to consider the issue of whether the Gaude flore virginali and Natus sapientia cycles are both by the same composer, as they are copied sequentially in Munich 3154. Given the consistency of the classifications for these two cycles individually (6/6 Martini and 7/8 Josquin, respectively), and given the stylistic dissimilarity (relatively speaking) between these two composers, these results suggest that the two cycles reflect substantially different compositional styles, even if they may not actually be by Martini and Josquin in particular, respectively. So, these results provide reasonably strong evidence that the Gaude flore virginali and Natus sapientia cycles are not by the same composer.
As a side note, it is possible that the five Librone 1 motets may be by the same composer as the Gaude flore virginali motets (since 3/5 of the Librone 1 motets were also classified as being by Martini), but the evidence for this is more tentative, partly because of the mixed classification of the Librone 1 motets, and partly because there is less musicological evidence to suspect this to be the case.
Conclusions and Future Research
The two sets of experiments described above provide illustrative examples of the sorts of insights that can be gained using automatically extracted features, machine learning and statistical analysis. Of course, such results must always be interpreted in the broader context of historical evidence and musicological and music theoretical domain expertise, as noted above. They do, however, add additional evidence of a kind not traditionally available to music scholarship that can be considered meaningfully along with other evidence, and can highlight intriguing and sometimes unexpected areas for further investigation.
At a methodological level, there are a number of ways this research could be expanded and improved upon. More training exemplars by those composers already considered would strengthen the ability of classifiers to generalize and model their styles, and would lessen sensitivity to potentially non-representative characteristics observed in individual pieces. Adding training music by more composers would also broaden the scope of the analysis. Of course, there are real ceilings to the extent to which the training data can be expanded, as there are only so many extant securely attributed pieces from the period in question.
Using digital scores universally reencoded with consistent editorial and encoding workflows would bring the advantage of making it possible to include more features than could be properly studied here. The music used in these experiments was drawn from different sources, and was thus susceptible to editorial and encoding inconsistencies, which meant that only the “safe” jSymbolic features could be used (i.e. the ones not sensitive to such inconsistencies). Consistently prepared training and test data would permit the study of a broader range of features, and thus a broader range of musical characteristics.
There is also room for more sophisticated machine learning and statistical analysis than that employed here. There are a variety of machine learning algorithms available beyond the SVMs that were used, some of which are more interpretable, and there are also approaches not based on the supervised training approach employed here (e.g. similarity metrics based on unsupervised clustering) that could offer useful insights. Multivariate analysis techniques are also available that would allow a more sophisticated investigation of which features in particular delineate the various compositional styles under consideration, and how, both of which are areas of musicological interest.
[1] This edition has been realized within the project ‘Polifonia sforzesca’ at the Schola Cantorum Basiliensis, Musik-Akademie Basel (https://www.fhnw.ch/plattformen/polifonia-sforzesca/), and can be found on the ‘Edition’ page of ‘Gaffurius Codices Online’ (https://www.gaffurius-codices.ch/s/portal/page/editions).
[2] Cory McKay, Julie Cumming, and Ichiro Fujinaga, ‘jSymbolic 2.2: Extracting Features from Symbolic Music for Use in Musicological and MIR Research’, in Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR 2018, Paris, France, September 23–27, 2018), ed. Emilia Gómez et al., 348–54.
[3] The web page for the jSymbolic software is: http://jmir.sourceforge.net/index_jSymbolic.html.
[4] Maria Elena Cuenca and Cory McKay, ‘Exploring Musical Style in the Anonymous and Doubtfully Attributed Mass Movements of the Coimbra Manuscripts: A Statistical and Machine Learning Approach’, in Journal of New Music Research (2021), DOI: 10.1080/09298215.2020.1870505.
[5] Julie Cumming and Cory McKay, ‘Revisiting the Origins of the Italian Madrigal Using Machine Learning’ (Paper presentation), Medieval and Renaissance Music Conference, Maynooth, Maynooth University, 5–8 July, 2018.
[6] Cory McKay, Tristano Tenaglia, Julie Cumming and Ichiro Fujinaga, ‘Using Statistical Feature Extraction to Distinguish the Styles of Different Composers’ (Paper presentation), Medieval and Renaissance Music Conference, Prague, Charles University, 4–8 July, 2017.
[7] Cory McKay, ‘Automatic Music Classification with jMIR’ (PhD diss., McGill University, 2010). Available in McGill University Library Digital Collections.
[8] Julie Cumming, Cory McKay, Jonathan Stuchbery and Ichiro Fujinaga, ‘Methodologies for Creating Symbolic Corpora of Western Music Before 1600’, in Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR 2018, Paris, France, September 23–27, 2018), 491–98; http://ismir2018.ircam.fr/doc/pdfs/46_Paper.pdf.
[9] Archivio della Veneranda Fabbrica del Duomo di Milano, Sez. Musicale, Librone 1 (olim MS 2269; hereafter Librone 1), fols. 162v–170r.
[10] See Daniele V. Filippi’s Introduction to MCE 2.
[11] The well-known Josquin Research Project (https://josquin.stanford.edu) was the source of many of the scores.
[12] The support vector machine algorithm in particular was chosen because of its relative efficacy in dealing with situations involving small numbers of training instances and large numbers of features per instance, as was the case here.
[13] The Weka home page is: https://www.cs.waikato.ac.nz/ml/weka/.
[14] Palestrina and Victoria were included in the ‘not Compère’ training data despite being of later generations than the other composers primarily to broaden and diversify the training data; more securely attributed digitized symbolic music is available for them than for many earlier composers, and it is important to have enough training exemplars available so as to not train a model overfitted to individual pieces.
[15] Definitions of individual features are available in the jSymbolic manual: http://jmir.sourceforge.net/manuals/jSymbolic_manual/home.html.
[16] Munich, Bayerische Stadtbibliothek, Mus. MS 3154 (‘Leopold Codex’; Munich 3154), fols. 38v–43r and fols. 43v–48v.
[18] These five motets probably do not represent pieces that originated as one motet cycle. The Motet Cycles Database suggests that Exultabit cor meum (2p. Admirabile est nomen tuum) and Timete dominum omnes sancti (2p. Domine dilexi decorum) are related to each other (they in fact share an incipit). Nevertheless, all five motets are in the same D-modus and they are copied between motet cycles. For this reason, and due to some stylistic similarities with Gaude flore virginali, they have been grouped together in this analytical experiment.