Selected References
Most of the references I give here are for natural language and speech processing.
General / Misc
- A nice introduction is this online book:
Sinclair, J. Corpus and Text - Basic Principles. In Wynne, M. (ed.)
Developing Linguistic Corpora: a Guide to Good Practice. Oxford: Oxbow
Books, 2005, 1-16 (http://ahds.ac.uk/linguistic-corpora/)
- The huge LDC catalogue (too huge?):
http://www.ldc.upenn.edu/annotation/
- There is an ACL workshop dedicated to the subject, the Linguistic Annotation Workshop (LAW).
Corpus creation
- Another reference, more speech-oriented, by Olivier Baude:
Baude, O. 2007. Contribution des corpus oraux à la linguistique de
corpus: une démarche réflexive intégrée. Talk given at the Journées de
Linguistique de Corpus (Lorient).
Annotation formats/schemes, standards
- This overview by Nancy Ide is a good starting point:
Ide, N. Annotation Science: From Theory to Practice and Use (Invited
Talk). Data Structures for Linguistic Resources and Applications:
Proceedings of the Biennial GLDV Conference, 2007
(http://www.cs.vassar.edu/~ide/papers/GLDV.pdf)
- Annotation graphs:
Bird, S. & Liberman, M. A Formal Framework for Linguistic
Annotation (revised version). CoRR, 2000, cs.CL/0010033, pp 23-60
(http://arxiv.org/abs/cs/0010033)
Methodology
- An interesting article giving results with various methodologies
(pre-annotation, use of tools, training, etc.):
Dandapat, S.; Biswas, P.; Choudhury, M. & Bali, K. Complex
Linguistic Annotation - No Easy Way Out! A Case from Bangla and Hindi
POS Labeling Tasks. Proceedings of the Third ACL Linguistic Annotation
Workshop, 2009 (http://www.aclweb.org/anthology/W/W09/W09-3002.pdf)
- The MEDIA corpus methodology:
Bonneau-Maynard, H.; Rosset, S.; Ayache, C.; Kuhn, A. & Mostefa, D.
Semantic Annotation of the French Media Dialog Corpus. InterSpeech,
2005 (ftp://tlp.limsi.fr/public/IS052010.PDF)
- An article we wrote last year, focusing on the importance of the
guidelines:
Fort, K.; Ehrmann, M. & Nazarenko, A. Towards a Methodology for
Named Entities Annotation. Proceedings of the 3rd ACL Linguistic
Annotation Workshop (LAW III), 2009
(http://www.aclweb.org/anthology/W/W09/W09-3025.pdf)
- Guidelines, again:
Nédellec, C.; Bessières, P.; Bossy, R.; Kotoujansky, A. & Manine, A.-P.
Annotation Guidelines for Machine Learning-Based Named Entity
Recognition in Microbiology. Proceedings of the Data and Text Mining in
Integrative Biology workshop, held jointly with ECML/PKDD, M. Hilario and
C. Nédellec (Eds.), pp. 40-54, Berlin, Germany, September 2006.
(http://www.ecmlpkdd2006.org/ws-dtib.pdf)
- Annotating using ontologies:
Cimiano, P. & Handschuh, S. Ontology-based linguistic annotation.
Proceedings of the ACL 2003 workshop on Linguistic annotation,
Association for Computational Linguistics, 2003, 14-21
(http://dx.doi.org/10.3115/1119296.1119299)
Evaluation of manual annotation
- The reference article on inter-annotator agreement:
Artstein, R. & Poesio, M. Inter-coder agreement for computational
linguistics. Computational Linguistics, MIT Press, 2008, 34, 555-596
(http://dx.doi.org/10.1162/coli.07-034-R2)
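To give a concrete idea of what an inter-annotator agreement coefficient measures, here is a minimal sketch of Cohen's kappa for two annotators labelling the same items. It is only an illustration: the function name `cohen_kappa` and the toy POS labels are my own assumptions, and the article above covers many other coefficients and the caveats that apply to each.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    Hypothetical helper written for illustration only.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)

    # Observed agreement: proportion of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected (chance) agreement, computed from each annotator's own
    # label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used one and the same label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example (made-up labels): six tokens tagged N or V by two annotators.
annotator_1 = ["N", "V", "N", "N", "V", "N"]
annotator_2 = ["N", "V", "V", "N", "V", "N"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy example the annotators agree on 5 of 6 items (observed agreement about 0.83), chance agreement is 0.5, and kappa is therefore about 0.67.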
Annotation tools
In theory, there are many annotation tools.
In practice, very few are really available and usable.
- Transcriber:
Annotation (transcription) tool for speech and more generally audio and video
annotation. Freely available.
http://trans.sourceforge.net/
- GATE: well-known annotation tool, designed for automatic annotation.
Manual annotation is possible, with interesting features (annotate all,
for example), but the tool is difficult to get into. Freely available.
http://gate.ac.uk/
- GLOZZ: a new tool, designed for discourse annotation, but usable for
any kind of annotation, including (complex) relations. Documentation is
available only as videos. Freely available for research.
http://www.glozz.org/
- Knowtator (Protégé plugin): allows annotation using ontologies. Freely
available.
http://knowtator.sourceforge.net/index.shtml
- Callisto: especially for time annotations (does not allow for the
annotation of other relations). Freely available.
http://callisto.mitre.org/
- MMAX2: interesting tool, allowing for the annotation of relations, but
its UI is difficult to use. Freely available.
http://mmax2.sourceforge.net/
- Advene: a Python tool dedicated to video annotation. Many
functionalities, easy to use, ergonomic. Freely available.
http://liris.cnrs.fr/advene/