Organised Evaluation in (Music) Information Retrieval: TREC and MIREX

Anna Pienimäki
Department of Computer Science, University of Helsinki
Abstract. Evaluation is an important component of the information retrieval field. In the area of text retrieval, the annual TREC evaluation contest provides a venue where evaluation can be done in an organised fashion. Following the example of TREC, both the video and music retrieval communities have started to build their own evaluation contests, addressing mostly content-based retrieval problems. In this paper we briefly present the history and development of organised evaluation in (music) information retrieval. We discuss the organisation and the main properties of TREC and introduce the MIREX evaluation contest designed for evaluating music information retrieval methods. Moreover, we discuss in more detail how such evaluation contests are organised in practice, using MIREX as an example.
1 Introduction
Evaluation is an important but often slightly neglected component of the information retrieval (IR) field. Organised evaluation campaigns, or contests, for text retrieval methods have become more common within the last 10 years in the form of TREC (Text REtrieval Conference). However, multimedia retrieval communities in particular, such as the music information retrieval (MIR) community, still lack both a common methodology and evaluation data for evaluating, in an organised fashion, methods developed for answering content-based information needs.

Organised evaluation is evaluation that is done in a centralised fashion, using common test collections to evaluate and compare the results of several retrieval algorithms performing the same task. Organised evaluation has two important advantages: comparability of the evaluated methods and access to realistically sized test collections built, maintained, and funded by the central organisations. Moreover, it offers a forum for discussing how the compared methods differ and how they could be developed further.

In this paper we discuss the history and development of organised evaluation in (M)IR, concentrating on the organisational side of such contests. In Section 2 we introduce the concept of content-based MIR and how it differs from text-based IR. In Section 3 we give a brief overview of the history and development of evaluation in IR and introduce TREC and its main features. The MIR evaluation contest MIREX is discussed in more detail in Section 4. Section 5 concludes the paper.
2 An Introduction to MIR
MIR is a relatively new research area that addresses the question of finding and retrieving music documents, such as audio files, MIDI files, and sheet music. Even though the concept of MIR literally accommodates both metadata-based and content-based retrieval methods, it has been widely used as a synonym for content-based music information retrieval. In MIR the documents are searched and browsed using musical parameters as keys. Depending on the application, you may search for a song by humming its melody into a microphone connected to your retrieval system, or browse a music collection using the similarity of sound, that is, instrumentation and tempo, as your similarity measure.

MIR methods can be divided into two main categories: audio and symbolic methods. Audio retrieval mainly uses digital signal processing methods and addresses the problem of finding music based on its audio content. Common subtasks for audio retrieval methods are, for instance, beat and tempo tracking, frequency estimation, and classification based on these features. Symbolic MIR is based on the score instead of the audio, and its most widely known application is the so-called query-by-humming retrieval method [Ghias et al. (1995)]. In query-by-humming the music is retrieved by humming a small fragment of a melody, which is then converted into symbolic form for the actual retrieval task. Thus, symbolic MIR relies heavily on the similarity of scores.

When comparing text-based IR and MIR, there are several technical differences between the retrieval tasks. Obviously, audio MIR has little to do with text-based IR when it comes to methodology. Symbolic MIR, however, is somewhat closer to text retrieval, and thus certain text retrieval methods have been used quite successfully for symbolic MIR purposes as well. Compared to text, however, music is technically more demanding as data. Firstly, as music is almost always polyphonic, it has more dimensions. Moreover, music is transposition-invariant, meaning that a melody should be recognised as the same regardless of the key in which it is played, and it has a rhythm that needs to be taken into account as well. Therefore text-based retrieval methods have proven insufficient for solving most symbolic MIR problems.
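To make the contrast with text retrieval concrete, the following sketch (a minimal illustration in Python, not taken from MIREX or any particular system) matches monophonic melodies by their pitch-interval sequences; because intervals are unchanged under transposition, the comparison is transposition-invariant. It deliberately ignores rhythm and polyphony, the very properties that make realistic symbolic MIR harder than plain string matching.

```python
# A minimal sketch (not from MIREX or any particular system): transposition-
# invariant matching of monophonic melodies given as sequences of MIDI pitches.

def intervals(pitches):
    """Successive pitch intervals. Two melodies that differ only by
    transposition yield exactly the same interval sequence."""
    return [b - a for a, b in zip(pitches, pitches[1:])]

def contains_melody(document, query):
    """True if the query melody occurs in the document melody, regardless of
    the key (transposition) in which it appears. Rhythm is ignored."""
    doc, q = intervals(document), intervals(query)
    return any(doc[i:i + len(q)] == q for i in range(len(doc) - len(q) + 1))

theme_c = [60, 62, 64, 60]                 # C D E C
theme_d = [62, 64, 66, 62]                 # the same melody transposed up
print(contains_melody([55, 57] + theme_d + [59], theme_c))   # True
```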
3 A short history of organised evaluation in IR

3.1 Early evaluation collections
Even though the most well-known organised evaluation framework, TREC, dates back only to 1992, the first attempts at organised evaluation were carried out as early as the late 1950s and early 1960s by Cyril Cleverdon and his colleagues at the Cranfield College of Aeronautics. These experiments are known nowadays as the Cranfield paradigm. The first round of the Cranfield paradigm consisted of testing four indexing systems and attracted widespread attention. However, critical examination of the methodology showed that the research design had had at least a partial effect on the results. To answer his critics, Cleverdon devised a second set of experiments for which he built a test collection consisting of 1,400 documents and 279 queries (some references, such as Voorhees and Harman (2005), give the number of queries as 225) [Rasmussen (2003)]. This Cranfield II collection was subsequently used by other groups for evaluation purposes [Voorhees and Harman (2005)].

In the following 25 years several other test collections, such as NPL (11,429 documents) and UKCIS (27,361 documents), were built [Rasmussen (2003)]. Even though these collections were valuable for individual evaluation tasks, they lacked certain necessary features. Namely, they were rarely used in an organised fashion for comparing the results of two or more research groups, and they were rather small compared to real-life tasks and collections [Voorhees and Harman (2005)]. Moreover, the data consisted mainly of document surrogates, such as titles and abstracts, instead of full-text documents [Rasmussen (2003)].

3.2 The birth of organised evaluation: TREC
In 1990 NIST (the National Institute of Standards and Technology) started to build a large test collection for use in evaluating text retrieval technology developed as part of the Defense Advanced Research Projects Agency (DARPA) TIPSTER project. In the following year NIST proposed that this collection should be made available to the research community as an organised evaluation contest, which became TREC. The four main goals of TREC were (and still are):

– To encourage research in text retrieval based on large test collections
– To increase communication among industry, academia, and government by creating an open forum for the exchange of research ideas
– To speed the transfer of technology from research labs into commercial products by demonstrating substantial improvements in retrieval methodologies on real-world problems
– To increase the availability of appropriate evaluation techniques for use by industry and academia, including the development of new evaluation techniques more applicable to current systems

The first TREC was held in November 1992 with two tracks (Ad Hoc and Routing) and 25 participating groups. Even though most of the participating groups in TREC-1 concentrated mainly on the necessary rebuilding of their basic systems to handle the huge scale-up in collection size, the door was finally open for group discussions and collective decisions on what improvements were needed for future TRECs [Voorhees and Harman (2005); Harman (2002)].

TREC-2 was held in a similar fashion to TREC-1 only nine months later, with the same tasks. Even though most of the participants of TREC-1 participated in TREC-2 and had had time to develop their methods, the results of TREC-2 were not comparable to those of TREC-1, due to the changed topics, that is, the queries used. In TREC-3, new tracks (Interactive and Spanish) were added to the list of tracks. These formed the basis of TREC as it is nowadays (Table 1).
Table 1. TREC tracks and participants 1992–2003 (TREC-1 to TREC-12). The rows list the tracks run over the years (Ad Hoc, Routing, Interactive, Spanish, Confusion, Database merging, Filtering, Chinese, NLP, Speech, Cross-language, High precision, Very large corpus, Query, Question answering, Web, Video, Novelty, Genome, HARD, and Robust), and the columns give the number of participating groups per track in each TREC. The total number of participants grew from 25 in TREC-1 and 31 in TREC-2 to 93 in TREC-11 and TREC-12 (per-track counts are not reproduced here).
TREC started as, and has mostly remained, a purely text-based retrieval evaluation contest. In 2001 and 2002 TREC hosted a track for video content retrieval, but due to a very high participation rate and little overlap between video and text retrieval techniques, the track was spun off as a separate evaluation contest called TRECVID [Voorhees and Harman (2005)].
4 The making of MIREX
The MIR community, which was born in the late 1990s, started discussing the need for evaluation in the early 2000s. The evaluation task was commonly considered too different from the text retrieval evaluation tasks, and thus the MIR community decided to design its own evaluation contest addressing the questions that are relevant specifically for evaluating MIR methods. These discussions and workshops, held at the JCDL, SIGIR and ISMIR conferences, led to the birth of MIREX (Music Information Retrieval Evaluation eXchange).
4.1 MIREX – something new, something borrowed
The first tentative step towards organised evaluation in the area of MIR was taken in 2004 during the 5th ISMIR (International Conference on Music Information Retrieval), in the form of the ISMIR 2004 Audio Description Contest (ADC)³, organised by the hosting organisation of ISMIR 2004, Universitat Pompeu Fabra, Barcelona. The first and only ADC consisted of five audio tasks, such as Genre Classification and Tempo Induction.

At the same time, Prof. Downie at the University of Illinois at Urbana-Champaign had started to work on a testbed project based on the earlier discussions. The project was called the International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL), and it was first introduced in Tucson, Arizona, at a side meeting of the Joint Conference on Digital Libraries (JCDL 2004). Prof. Downie's idea was to include both symbolic and metadata tasks in the evaluation procedure. After the Audio Description Contest, Prof. Downie's IMIRSEL team started a discussion on the evaluation tasks, which led to the first MIREX contest⁴, held at the 6th ISMIR in London and organised by the IMIRSEL team.

In 2006 MIREX (again organised by the IMIRSEL team) was held for the second time, at the 7th ISMIR in Victoria, Canada. The second MIREX⁵ closely resembled MIREX 2005 with one important exception: the statistical significance of the results was measured. This addition was suggested by one of the TREC personnel, Ellen Voorhees, who had participated in the preliminary discussions on MIREX already in 2002. While some of the tasks have changed since the first Audio Description Contest and new tasks have been added to the task list, the core of MIREX has stayed relatively stable even though the number of participants has dropped (Table 2).

In MIREX 2005 and 2006 most of the test data was provided by the task leaders, and it was hand-annotated either by the task leaders or their colleagues. In two tasks, however, a ground truth was built using human evaluators. These tasks were Audio Music Similarity and Retrieval (2006) and Symbolic Melodic Similarity (2005 and 2006). For MIREX 2006 the IMIRSEL team built an evaluation tool called Evalutron6000 (Figure 1), which will also be used in future MIREX contests.

Even though MIREX has taken TREC as its model, which can be seen in the organisation and the evaluation measures used, it has MIR-specific challenges that have not been solved thus far. The biggest issue is the lack of realistically sized (audio) test collections, due to copyright issues. Even though Naxos⁶ has kindly offered its database for the use of the research community, it contains only certain types of music and thus cannot be used for genre classification or artist identification tasks. Moreover, since the contest is run by a single research project instead of a national institute, the lack of funding prevents the purchase of a large collection of popular music. However, two collections, USPOP and USCRAP, have been purchased as CDs and donated to the community by research laboratories.

³ http://ismir2004.ismir.net/ISMIR Contest.html
⁴ http://www.music-ir.org/mirex2005/index.php/Main Page
⁵ http://www.music-ir.org/mirexwiki/index.php/MIREX 2006
⁶ http://www.naxos.com/
Table 2. ADC and MIREX tasks and participants 2004–2006. The tasks run across the ISMIR 2004 Audio Description Contest (ADC), MIREX 2005, and MIREX 2006 include (Audio) Genre Classification, (Audio) Artist Identification, (Audio) Melody Extraction, (Audio) Tempo Induction/Extraction, Rhythm Classification, Audio Drum Detection, Audio Key Finding, Audio Onset Detection, Symbolic Genre Classification, Symbolic Melodic Similarity, Symbolic Key Finding, Audio Beat Tracking, Audio Cover Song Identification, Audio Music Similarity and Retrieval, QBHS, and Score Following; the table gives the number of participants per task. The total number of participants was 12+? for the ADC, 58 for MIREX 2005, and 30 for MIREX 2006 (per-task counts are not reproduced here).
Fig. 1. The Evalutron6000 software for human evaluations.
In the area of symbolic MIR the situation with test collections is somewhat different. There are reasonably, though not realistically, sized MIDI collections that are copyright free. However, the quality of the MIDI files varies dramatically, and these collections grow only slowly.

4.2 Organising a MIREX task – endless hours of fun?
The author of this paper was asked to co-organise one of the MIREX 2006 tasks, namely Symbolic Melodic Similarity (SMS), with Rainer Typke (University of Utrecht, The Netherlands), based on earlier discussions with Mr. Typke and Prof. Downie about the organisation of the task and the publicity of the test data. The practical issues of organising an evaluation task for an evaluation contest are discussed below, based on the experiences gained in MIREX 2006.

Planning an evaluation task Before the task leaders can start planning an evaluation task, they need to agree on the task with the IMIRSEL team. In practice, this is done by assuring the IMIRSEL team that the leaders have access to suitable test data or are willing to build the collection along the way. When the IMIRSEL team agrees on the task, the task leaders start a wiki page on which they write down the description of the task, the requirements for the participating algorithms (including the interfaces), and an overall description of the test collection and queries to be used in the task. The task leaders also start an email list for potential participants and other interested members of the research community, through which the list members may share their opinions and give suggestions concerning the task.

In MIREX 2006, the SMS task consisted of three separate subtasks: a monophonic RISM UK collection task, a polyphonic karaoke collection task, and a polyphonic mixed collection task. In the monophonic⁷ subtask both the query and the data were monophonic, and the test collection consisted of 15,000 short incipits of melodies. There were 6 queries, half of which were written (and thus quantised) and the other half hummed/whistled. In both of the polyphonic tasks, the query was monophonic and the data polyphonic. The karaoke collection consisted of 1,000 karaoke files and the mixed collection of 10,000 general polyphonic music files. There were 5 karaoke queries, 3 of which were hummed/whistled and 2 written. Moreover, there were 6 mixed polyphonic queries, 3 hummed/whistled and 3 written.

Compared to its counterpart, Audio Music Similarity and Retrieval, the SMS list was very passive, and the potential participants merely expressed their interest in the task in general. Therefore, the task leaders needed to formulate the task completely by themselves, with only a few suggestions from the other members of the list.
⁷ A song is monophonic if only one note is playing at a time.
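The "requirements for the participating algorithms (including the interfaces)" mentioned above essentially fix how a submission is invoked and what it must output. The actual MIREX 2006 calling convention is not reproduced here; the sketch below is a hypothetical example of such an interface, in which a submission is a command-line program that receives a query file, a list of collection files, and a result count, and prints its ranked matches.

```python
# Hypothetical illustration only: a minimal submission skeleton of the kind an
# interface specification might require (this is NOT the actual MIREX 2006
# calling convention). The program receives a query file, a file listing the
# collection, and N, and prints the N best matches, most similar first.
import sys

def similarity(query_path: str, candidate_path: str) -> float:
    """Placeholder for the participant's melodic similarity measure."""
    return 0.0  # replace with a real comparison of the two symbolic files

def main(argv):
    query_path, collection_list, n = argv[1], argv[2], int(argv[3])
    with open(collection_list) as f:
        candidates = [line.strip() for line in f if line.strip()]
    ranked = sorted(candidates,
                    key=lambda path: similarity(query_path, path),
                    reverse=True)
    for path in ranked[:n]:
        print(path)  # one matching file per line, best match first

if __name__ == "__main__":
    main(sys.argv)
```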
Preparing the data and the queries When the task was finalised, the task leaders were asked to prepare the data and the queries for the IMIRSEL team. In practice this meant approximately 80–100 hours of searching through and listening to MIDI files. To make the task sensible, each query needed to have at least two matches in the data, which complicated finding suitable queries, especially for the karaoke collection due to its very small size. When a suitable query was found, it was either written or hummed/whistled and encoded in MIDI. Both the karaoke and RISM UK collections were used as a whole; thus, there was no need to prepare the data itself in any other fashion. The mixed collection was supposed to be a random selection of 10,000 MIDI files from the Internet. However, to ensure that matches existed in the collection, we needed first to add all the matches to the collection by hand and then take a random selection to form the rest of the collection. In this phase, all duplicates were removed.

Running the algorithms When the data and queries were ready, the algorithms submitted by the participants were run for the first time and the 10 best matches were pooled for the evaluation procedure. Since many of the polyphonic pieces were over two minutes long, we asked the participants to return only the exact locations of the matches. The algorithms had severe problems with the MIDI data (due to the quality of both the data and the algorithms), and therefore some of the results needed to be cut into snippets by hand. This was done by the IMIRSEL team.

Creating the ground truth The ground truth was created by human evaluators. Because of the location of the IMIRSEL team and the rather strict rules imposed by the US government concerning the use of human evaluators, the number of potential evaluators was cut very low (21 evaluators for the SMS task). Therefore, each of the evaluators needed to evaluate 13 queries and 15 candidates per query, which in practice took 2–7 hours per evaluator. In the evaluation procedure each evaluator was asked to compare each of the results in his/her result subset to the query and rate them using two rating systems. The first rating system had three options: not similar, somewhat similar, and very similar. The second rating system was a continuous scale from 0 to 10.
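The pooling step described above can be illustrated with a short sketch (an assumption about the data layout, not the IMIRSEL code): the top-10 lists of every submitted algorithm are merged into one de-duplicated candidate set per query, and it is this pooled set that the human evaluators judge against the query.

```python
# A sketch (assumed data layout, not the IMIRSEL code): pool the top-10 result
# lists of all submitted algorithms into one de-duplicated candidate set per
# query; this pooled set is what the human evaluators judge against the query.

def pool_candidates(results_by_algorithm, depth=10):
    """results_by_algorithm: {algorithm: {query_id: [doc ids, best first]}}.
    Returns {query_id: sorted list of unique candidates drawn from the top
    `depth` documents of every algorithm's ranking}."""
    pools = {}
    for runs in results_by_algorithm.values():
        for query_id, ranking in runs.items():
            pools.setdefault(query_id, set()).update(ranking[:depth])
    return {query_id: sorted(pool) for query_id, pool in pools.items()}

runs = {
    "algo_A": {"q1": ["d3", "d7", "d1"]},
    "algo_B": {"q1": ["d7", "d9", "d2"]},
}
print(pool_candidates(runs))  # {'q1': ['d1', 'd2', 'd3', 'd7', 'd9']}
```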
Preparing and publishing the results When the ground truth was completed, the results of the human evaluators were compared to the result lists returned by each algorithm, and 10 different evaluation measures were calculated for each algorithm:

– ADR = Average Dynamic Recall
– NRGB = Normalised Recall at Group Boundaries
– AP = Average Precision (non-interpolated)
– PND = Precision at N Documents
– Fine(1) = Sum of fine-grained human similarity decisions (0–10)
– PSum(1) = Sum of human broad similarity decisions: NS=0, SS=1, VS=2
– WCsum(1) = 'World Cup' scoring: NS=0, SS=1, VS=3 (rewards Very Similar)
– SDsum(1) = 'Stephen Downie' scoring: NS=0, SS=1, VS=4 (strongly rewards Very Similar)
– Greater0(1) = NS=0, SS=1, VS=1 (binary relevance judgement)
– Greater1(1) = NS=0, SS=0, VS=1 (binary relevance judgement using only Very Similar)

(1) Normalised to the range of 0 to 1.
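As an illustration of the judgement-based measures above, the following sketch computes the Fine score and the five broad-judgement scores for one algorithm's results on one query. The data layout and the normalisation by the maximum attainable score are assumptions for the sake of the example, not the actual MIREX scripts.

```python
# A sketch of the judgement-based measures listed above. NS/SS/VS are the broad
# judgements (not/somewhat/very similar); the weights come from the list, and
# each score is divided by its maximum attainable value, which is one plausible
# reading of "normalised to the range of 0 to 1" (the actual MIREX scripts may
# normalise differently).

WEIGHTS = {
    "PSum":     {"NS": 0, "SS": 1, "VS": 2},
    "WCsum":    {"NS": 0, "SS": 1, "VS": 3},   # 'World Cup' scoring
    "SDsum":    {"NS": 0, "SS": 1, "VS": 4},   # 'Stephen Downie' scoring
    "Greater0": {"NS": 0, "SS": 1, "VS": 1},   # binary: SS or VS is relevant
    "Greater1": {"NS": 0, "SS": 0, "VS": 1},   # binary: only VS is relevant
}

def broad_scores(judgements):
    """judgements: list of 'NS'/'SS'/'VS' labels for the candidates one
    algorithm returned for one query. Returns each measure scaled to [0, 1]."""
    scores = {}
    for name, w in WEIGHTS.items():
        top = max(w.values()) * len(judgements)
        scores[name] = sum(w[j] for j in judgements) / top if top else 0.0
    return scores

def fine_score(ratings):
    """Fine: mean of the 0-10 slider ratings, rescaled to [0, 1]."""
    return sum(ratings) / (10 * len(ratings)) if ratings else 0.0

print(broad_scores(["VS", "SS", "NS", "VS"]))   # e.g. PSum = 5/8 = 0.625
print(fine_score([8.5, 6.0, 1.0, 9.0]))         # 24.5/40 = 0.6125
```

The rank-based measures in the list (ADR, NRGB, AP, and precision at N documents) additionally take the order of the returned documents into account and are not sketched here.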
MIREX 2007 Already when planning MIREX 2006 there was discussion about giving up the competitive nature of the contest and making it more collaborative and comparative, instead of merely announcing the winners of each task. For this reason, there were no rewards in MIREX 2006. In the MIREX discussion panel held at ISMIR 2006, the participants wanted to keep MIREX this way in 2007 as well. The main complaints concerned the enormous amount of time required for building the ground truth. To make MIREX 2007 even more collaborative and less competitive, a couple of new ideas were presented at the panel. One of them was to combine several tasks into a continuum. In this type of task, each combination of algorithms submitted to the subtasks would be evaluated and the best combination would be announced. In this way, evaluation of individual algorithms would make way for evaluation of larger systems built from those algorithms.
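The continuum proposal amounts to treating every combination of algorithms submitted to a chain of subtasks as one larger system and evaluating each such combination end to end. A toy sketch of the bookkeeping involved is given below; the subtask names and the evaluation function are hypothetical placeholders.

```python
# Toy sketch of the proposed "continuum" evaluation: every combination of the
# algorithms submitted to a chain of subtasks is treated as one larger system
# and evaluated as a whole. The subtask names and the evaluation function are
# hypothetical placeholders.
from itertools import product

submissions = {
    "transcription":      ["trans_A", "trans_B"],
    "melodic_similarity": ["sim_X", "sim_Y", "sim_Z"],
}

def evaluate_pipeline(combination):
    """Placeholder: run the chained system built from `combination` on the
    test queries and return a single effectiveness score."""
    return 0.0  # replace with a real end-to-end evaluation

def best_combination():
    """Score every combination (here 2 x 3 = 6 candidate systems) and return
    the best one together with its score."""
    scored = [(combo, evaluate_pipeline(combo))
              for combo in product(*submissions.values())]
    return max(scored, key=lambda item: item[1])

print(best_combination())
```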
5 Conclusion
Evaluation is an important part of the (M)IR field. Unlike the text retrieval community, the multimedia IR research community still lacks reasonably sized and publicly available test collections. However, the first attempts to evaluate retrieval methods in an organised fashion have been promising in the fields of video and music information retrieval.

Organising an evaluation contest is hard (often volunteer) work with no other reward than the general acceptance of the research community. However, a completed task is a reward in itself. Even though there were minor difficulties in organising MIREX 2006, mainly the lack of communication between the IMIRSEL team and the task leaders and the quality of the data used, the contest itself proved to be successful. There are still problems to be tackled in MIREX 2007 and in the field in general, such as the copyright issues and making the evaluation contest more collaborative and less competitive.
Bibliography
Ghias, A., J. Logan, D. Chamberlin, and B. C. Smith (1995). Query by humming – musical information retrieval in an audio database. In ACM Multimedia 95, pp. 231–236.

Harman, D. K. (2002). The development and evolution of TREC and DUC. In K. Oyama, E. Ishida, and N. Kando (Eds.), Proceedings of the Third NTCIR Workshop. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings3/index.html.

Rasmussen, E. (2003). Evaluation in information retrieval. In J. S. Downie (Ed.), The MIR/MDL Evaluation Project White Paper Collection, 3rd ed. http://www.music-ir.org/evaluation/wp.html.

Voorhees, E. M. and D. K. Harman (2005). The text retrieval conference. In E. M. Voorhees and D. K. Harman (Eds.), TREC: Experiment and Evaluation in Information Retrieval, pp. 3–20. The MIT Press.