A Similarity Scale for Content-Based Music IR

Donald Byrd

School of Informatics and School of Music, Indiana University

February 2003; last rev. early June 2007

Music Information Retrieval (music IR)--more technically, content-based music IR--addresses a huge range of tasks. This scope is rarely acknowledged, which is unfortunate because some tasks make very different demands than others, and some are vastly more difficult than others. The main factor, in my view, is just how similar the relevant documents in the collection to be searched really are to the "query". This paper is an attempt to clarify matters.

In the tables below, the relationship categories are ordered by decreasing similarity between the query and the relevant documents.

Not surprisingly, this ordering also ranks music-information-retrieval tasks from easiest to most difficult; in fact, a strong case can be made that Category 1 -- while by no means easy -- is not even a music-IR problem. A related table and graph, and an interesting discussion of the range of music-IR tasks, appear in Typke et al. (2005). Also see Casey and Slaney (2006), which includes a graph and discussion of the range of music-IR tasks from a perspective that is closer to mine.

Music involves much more complex structural relationships than most text or other media; among the reasons are the facts that music is a performing art and that the vast majority of music is, in a broad sense, polyphonic. See Vellucci (1997) for an extensive discussion of the issues from the perspective of library science. Librarians and text-IR researchers often speak of known-item searches, where the user is trying to locate a copy of a document they already know about, as an important special case. The complexity of musical relationships makes it difficult even to say which of these tasks are known-item searches. Category 1 certainly is; the last few categories clearly are not. Other than that, I leave the question to the reader. A related and important question is what the phrase "same music" means. This is not easy, but the best answer may be to appeal to the concept of a musical work, something the emerging library standard FRBR depends on. For a brief discussion, see Tillett (2004).

The categories whose descriptions start in boldface are what mainstream content-based IR systems focus on.

 

Detailed audio characteristics in common

1. Same music, arrangement, performance venue, session, performance, & recording
   Basic representation: Audio
   Example systems: Shazam ("IPR" version), Audible Magic, MusicDNS(?)
   Comment: Via audio fingerprint. Current systems are both very accurate and very fast, even with large collections. (See note below.)

2a, b. Same music, arrangement, performance venue, session, performance; different recording. a: Play back original recording & re-record. b: Different original recording.
   Basic representation: Audio
   Example systems: (2a) Shazam (public version)
   Comment: Via audio fingerprint. (2a) Same comment as for Category 1, though with somewhat less accuracy. (See note below.)

3. Same music, arrangement, performance venue, session; different performance, recording
   Basic representation: Audio
   Example systems: none(?)
   Comment: E.g., retakes.

4. Same music, arrangement, performance venue; different session, performance, recording
   Basic representation: Audio
   Example systems: none(?)

 

 

No detailed audio characteristics in common

5. Same music, arrangement; different performance venue, session, etc.
   Basic representation: Audio, events
   Example systems: Foote: ARTHUR
   Comment: There's an analogous situation in notation, with nothing different; this is the notation equivalent of Category 1.

6. Same music, different arrangement; or different but closely related music, e.g., conservative variations (Mozart, etc.), alternate takes, most covers and remixes, minor revisions
   Basic representation: any
   Example systems: C-Brahms, Greenstone/Meldex, Musipedia, Pickens et al./OMRAS, Themefinder, etc.
   Comment: Current monophonic systems are good, especially with events or notation; polyphonic systems are fair to good. (See note below.)

7. Different & less closely related music: freer variations (Brahms, much jazz, etc.), wilder covers, extensive revisions
   Basic representation: any
   Example systems: none(?)
   Comment: A serious AI problem. Current systems are poor. (See note below.)

8. Music in the same genre (form or style), etc.
   Basic representation: any
   Example systems: Cuidado, SOMeJB, Tzanetakis(?)
   Comment: Agreement even among human experts is limited.

9. Music influenced by other music
   Basic representation: any
   Example systems: none(?)
   Comment: Agreement even among human experts is limited.

10. No detectable relationship
   Basic representation: any
   Example systems: (none possible)

 

Notes:

Category 1: The two recordings being compared here are in fact identical. The problem that the "IPR" version of Shazam (intended for use by record companies and other music rights owners), Audible Magic, and other audio-fingerprinting systems attempt to solve is to recognize this identity accurately and efficiently, even in a huge collection of music. Clearly this is an audio-only situation.
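To illustrate why fingerprint lookup can be both accurate and fast even for huge collections, here is a minimal sketch of landmark hashing over spectral peaks -- my own illustration, not the actual algorithm of Shazam or any other system named above. Peak extraction from audio is assumed to have been done elsewhere; each track is just a time-sorted list of (time, frequency) peaks, and all function names are hypothetical:

```python
from collections import defaultdict

def fingerprint(peaks, fan_out=3):
    """Pair each peak with the next few peaks, hashing (freq1, freq2, time delta).
    The hash ignores absolute time, so it survives shifts in where the
    query starts within the recording."""
    landmarks = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            landmarks.append(((f1, f2, t2 - t1), t1))
    return landmarks

def build_index(tracks):
    """Invert landmark hashes over a collection: hash -> [(track_id, anchor_time)]."""
    index = defaultdict(list)
    for track_id, peaks in tracks.items():
        for h, t in fingerprint(peaks):
            index[h].append((track_id, t))
    return index

def best_match(index, query_peaks):
    """Vote on (track, time-offset) pairs; a genuine match piles many
    votes onto one consistent offset."""
    votes = defaultdict(int)
    for h, t_query in fingerprint(query_peaks):
        for track_id, t_db in index.get(h, []):
            votes[(track_id, t_db - t_query)] += 1
    if not votes:
        return None, 0
    (track_id, _offset), score = max(votes.items(), key=lambda kv: kv[1])
    return track_id, score
```

The key point for speed is that matching is hash lookup plus counting, not a scan of the collection; the key point for accuracy is that random hash collisions scatter their votes across many offsets, while the true match concentrates them on one.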

Category 2: Two distinct situations are possible. Category 2a essentially means playing back the original recording and recording the playback; this is what the better-known version of Shazam available to individuals is designed for. It does the same thing as the "IPR" version, but in the presence of noise introduced by the environment and the transmission channel; for example, music played on a jukebox in a crowded bar and transmitted to a server via a mobile phone’s low-fidelity microphone. Category 2b, on the other hand, involves comparing different original recordings of the same performance; this situation is much less well-known, but it's by no means contrived. One example would be different mixes of the same studio "take". Also, there are probably rock concerts for which dozens or even hundreds of recordings exist, albeit nearly all low fidelity (and illegal, except for bands like the Grateful Dead that allow them!). Finally, there are surviving examples of early recordings made simultaneously on two or more masters fed by different microphones. (This leads -- accidentally -- to stereo recordings. At least one, a 1932 Duke Ellington session, has been released commercially in such a version.) There's an analogous situation with events, but practical applications are likely to be rare.

Categories 6 and 7: The boundary between these two important categories is difficult to draw. Johnny Cash's cover of the Nine Inch Nails song "Hurt" is clearly in Category 6, as are most of Mozart's Variations on Ah! Vous Dirai-je, Maman (a.k.a. "Twinkle, Twinkle, Little Star"); Jose Feliciano's version of The Doors' "Light My Fire" is definitely in Category 7. But what about instrumental versions that closely follow the melody, harmony, and form of a song (e.g., George Winston's solo piano version of "Light My Fire" or the Kronos Quartet cover of "Purple Haze"), or vice-versa (the song "Stranger in Paradise", based on Borodin's Polovetsian Dances)? Finally, one of the two Guess Who versions of "Light My Fire" available via iTunes in early 2007 is extremely similar to the original, except that it reduces the very salient 5-min. instrumental interlude to just a few seconds. Is that a "conservative" cover or not?
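Monophonic systems in Category 6 typically work on event data with transposition-invariant approximate matching. As a rough sketch of the general idea -- my own illustration, not any particular system's algorithm -- one can compare sequences of melodic intervals with edit distance, so that a transposed but otherwise identical melody scores as a perfect match, while a conservative variation scores a small distance. Pitches here are MIDI note numbers:

```python
def intervals(pitches):
    """Successive pitch differences; invariant under transposition."""
    return [b - a for a, b in zip(pitches, pitches[1:])]

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming, O(len(a)*len(b))."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution/match
        prev = cur
    return prev[len(b)]

def melodic_distance(query_pitches, doc_pitches):
    """Transposition-invariant melodic dissimilarity."""
    return edit_distance(intervals(query_pitches), intervals(doc_pitches))
```

A measure this crude obviously cannot draw the Category 6/7 boundary by itself -- freer variations change the interval sequence wholesale -- which is one way of seeing why Category 7 is so much harder.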

Categories 8 and 9: Doing well on these tasks undoubtedly requires human-level intelligence.

Acknowledgements

Michael Casey's comments on an earlier version of this document clarified my thinking considerably and led to major improvements in it. Tim Crawford made his own version of my original table with some thought-provoking differences. Jeremy Pickens pointed out a number of things that needed clarification. In addition, Ed Wolf and other members of my spring 2007 Music Representation and Retrieval seminar made some very helpful comments. My thanks to all of them.

References

  • Casey, Michael, & Slaney, Malcolm (2006). Song Intersection by Approximate Nearest Neighbor Search. In Proceedings of the 7th International Conference on Music Information Retrieval (ISMIR 2006), Victoria, Canada, pp. 144-149.
  • Tillett, Barbara (2004). What is FRBR?: A Conceptual Model for the Bibliographic Universe. Retrieved April 10, 2007 from http://www.loc.gov/cds/FRBR.html
  • Typke, Rainer, Wiering, Frans, & Veltkamp, Remco C. (2005). A Survey of Music Information Retrieval Systems. In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), London, England, pp. 153-160.

  • Vellucci, Sherry (1997). Bibliographic Relationships in Music Catalogs. Lanham, MD: Scarecrow Press.

Comments to: donbyrd(at)indiana.edu
Copyright 2005-07, Donald Byrd