OK, I'm going to /try/ to be a better LJer.
This morning I had the somewhat dubious honour of being called on as an XML "expert" to sit in on a meeting about the Britten-Pears archive. I was asked by the professor (I guess on the strength of work on CURSUS) who had, in turn, been asked by one of the other lecturers who's got funding to "modernise" this archive. Basically, they want to integrate several different thematic databases to present a single interface. I got to tell them that "interface" doesn't necessarily have to mean "user interface" and that they could present an XML-based API quite easily. I also found myself obliged to agree that restricting your use of element names in an XML schema - using attributes in preference - is a good idea. I did a diagram like this:
<!-- workable: -->
<piece>
<title>Billy Budd</title>
<genre>Opera</genre>
</piece>
<letter>
<from>Benjamin Britten</from>
<to>Peter Pears</to>
</letter>
<!-- better: -->
<record type="piece">
<field type="title">Billy Budd</field>
<field type="genre">Opera</field>
</record>
<record type="letter">
<field type="from">Benjam Britten</field>
<field type="to">Peter Pears</field>
</record>
The idea being that the second method is better for SAX processors (though I didn't actually say "SAX"), increases homogeneity and encourages better XPath (and I actually /did/ say "XPath"!).
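To illustrate the XPath point (a sketch only, using the hypothetical names from the diagram above): with the attribute-based scheme, a single expression can address the same kind of field across every record type:

//record/field[@type='from']

whereas the element-based scheme needs a separate expression (and a separate stylesheet template) for each document type: //letter/from, //piece/title and so on.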
Anyway, the professor seemed quite pleased with our performance and even quite optimistic about the project. But it's nothing to do with me from now on...
And one other thing: I'm going to be a student again!
I had a meeting/interview about my research just now which was /really/ useful. We thrashed out my ideas and came up with a good starting point: I'm going to investigate the concepts involved in encoding music symbolically and assess existing practices. And I'm not supposed to concern myself yet with where exactly it will lead. Apparently this is the best thing to do, and it makes sense - how can I know /now/ where my initial research will lead? It would be pointless to try to state now what I think my outcomes will be.
So that all seems quite sorted (to me). Now all I have to do is learn to live without food for three years!
So I'm going to look at principles and practices of symbolic music encoding.
Practices are many and varied. It seems that every other paper on the subject is a proposal for a new encoding method, and they all begin by detailing the problems with existing systems. Common complaints are along the lines of "it's no good for my C17th lute tablature", "it won't allow me to encode variation in the sources" and "it pre-supposes the kinds of uses I should want to make of it", but they are all genuinely serious - they really believe that the plethora of existing methods is no good. Can this really be true? Have all the experts who've put so much thought into this problem been hopelessly single-minded?
So what would I want from a music encoding system? I'm in no position to say what such systems should be able to do yet - give me a year or so and I may be able to begin to answer that question. But one thing I think is true at the moment is that it should allow me to preserve information from manuscript sources and to record variation between sources for a work. One thing to avoid, of course, is the assumption that any version is the "original" or "authoritative" version and that others are merely corruptions - someone will always come along 5 years later and say that you were wrong! Probably because there often /isn't/ an authoritative version. Think, for example, of Chopin always playing his preludes slightly differently, or of the medieval chant repertory.
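Something like this might do it (a sketch only - I've invented the element names, though TEI's critical apparatus markup works on a similar principle of parallel readings):

<note>
  <reading source="autograph" pitch="A" duration="minim"/>
  <reading source="first-edition" pitch="G" duration="minim"/>
</note>

Neither reading is privileged; each source's version simply sits alongside the others.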
And exactly what is the nature of the notation I'm trying to encode? Take, for example, ties and barlines. In a modern performance edition of a piece of C16th polyphony, the music will be structured into a metre, using barlines and tying notes across them. The C16th notation had no such convention - so which to encode depends on what I believe the nature of the notation to be: is it a textual source (and therefore should I encode the notation as near to the original manuscript as possible) or is it a description of sonic events (and therefore would a tie across a barline be /identical/ to a single note existing in unmeasured score)?
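The two beliefs lead to quite different encodings. Sketching with invented markup:

<!-- as a textual source: two written symbols joined by a tie -->
<note pitch="A" duration="minim" tie="start"/>
<barline/>
<note pitch="A" duration="crotchet" tie="end"/>

<!-- as sonic events: one sound, one element, no barline -->
<note pitch="A" duration="dotted-minim"/>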
Lots to think about. Once I'm registered I'm going to go to the library and pore over notation - tonnes of it - and from as many traditions as possible. Ties and barlines are peculiar to Western music, so should I really limit myself to just that facet of musical phenomena?
Oh, yes. Look what I did today:
#!/usr/bin/env python
import sys, os, re, time, pxssh, wx

APACHE_HOST_USER = "root"
APACHE_HOST_PASSWD = "password" # root password of server on which Apache runs
APACHE_HOST_NAME = "localhost"  # host name of server on which Apache runs
APACHE_ACCESS_LOG = "/var/log/apache2/access.log"
APACHE_ERROR_LOG = "/var/log/apache2/error.log"
POLL_INTERVAL = 60 * 1 # poll the log once a minute

# matches common log format lines: IP address, date, request and status code
LOG_REGEX = re.compile(r"([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) - - \[([^\]]*)\] \"([^\"]*)\" ([0-9]{3})")

class popup(wx.Frame):
    """A small window reporting one access log entry."""
    def __init__(self, ip_addr, date, uri, status):
        wx.Frame.__init__(self, None, wx.ID_ANY, title="Apache Log", size=(500, 80))
        self.info = wx.StaticText(self, -1, "IP address: %s; date: %s; uri: \"%s\"; status: %s" % \
                (ip_addr, date, uri, status), wx.Point(10, 10))
        self.dismiss = wx.Button(self, 10, "Close", wx.Point(10, 30))
        wx.EVT_BUTTON(self, 10, self.OnClick)
        self.Show(True)

    def OnClick(self, event):
        self.Close(True)

def main():
    """
    The daemon. Examines the last line of the Apache access log. Displays a
    window if the IP address is different from the last one stored.
    """
    LAST_IP_ADDR = ""
    while 1:
        s = pxssh.pxssh()
        if not s.login(APACHE_HOST_NAME, APACHE_HOST_USER, APACHE_HOST_PASSWD):
            pass # login failed; try again next time round
        else:
            s.sendline("tail -1 %s" % APACHE_ACCESS_LOG)
            s.prompt()
            for line in s.before.split("\n"):
                m = LOG_REGEX.match(line)
                if m is not None:
                    ip_addr = m.group(1)
                    date = m.group(2)
                    uri = m.group(3)
                    status = m.group(4)
                    if ip_addr != LAST_IP_ADDR:
                        app = wx.App()
                        entry_info_window = popup(ip_addr, date, uri, status)
                        app.MainLoop() # blocks until the window is dismissed
                        LAST_IP_ADDR = ip_addr
                    break
            s.logout()
        time.sleep(POLL_INTERVAL)

if __name__ == "__main__":
    # forking code from the Python Cookbook:
    # http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/66012
    # do the UNIX double-fork magic, see Stevens' "Advanced
    # Programming in the UNIX Environment" for details (ISBN 0201563177)
    try:
        pid = os.fork()
        if pid > 0:
            # exit first parent
            sys.exit(0)
    except OSError, e:
        print >>sys.stderr, "fork #1 failed: %d (%s)" % (e.errno, e.strerror)
        sys.exit(1)

    # decouple from parent environment
    os.chdir("/")
    os.setsid()
    os.umask(0)

    # do second fork
    try:
        pid = os.fork()
        if pid > 0:
            # exit from second parent, print eventual PID before exiting
            print "Daemon PID %d" % pid
            sys.exit(0)
    except OSError, e:
        print >>sys.stderr, "fork #2 failed: %d (%s)" % (e.errno, e.strerror)
        sys.exit(1)

    # start the daemon main loop
    main()
Someone on a Python forum asked if it would be possible to write a program which monitored your Apache access log and displayed a message on your desktop whenever someone was viewing your site. It seems that the answer is "yes".
I wrote this in my notebook on the train on Thursday. Thought I'd just put it on here for safekeeping:
- How much information to encode?
- Consider that text from a manuscript is encoded using normal ASCII characters and therefore information about the original glyph is lost. Does this matter? How important are the variations of forms of the letter 'a'?
- One method of preserving the glyphs/graphical information may be to store a facsimile of the MS and encode the position on the MS where elements came from; this is then a hybrid visual/semantic encoding (see the sketch after this list).
- Encoding decisions based on the nature of the information to encode
- Examining the nature of musical notations is a very important part of the work; a comprehensive understanding is necessary for any further decisions.
- This is why I was thinking about tied notes: what is the nature of the notation when a note is tied? Is it more important that the pitch lasts for a particular duration, or that there are two symbols which have a tie mark between them? In some respects it's only a restriction of post-C18th Western music.
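Here's the hybrid visual/semantic idea from the first note as a sketch (invented element names; the coordinates would record where on the facsimile each element was transcribed from):

<facsimile id="f1" image="ms-page-3.png"/>
<note pitch="A" duration="minim" facs="f1" x="312" y="88"/>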
I hope everyone missed me on #alug today. I was in Oxford! For the Oxford Text Archive's 30th anniversary conference. Well, I was there for half of it anyway; the four-hour train journey would have meant getting up far too early to attend the morning half of the conference as well.
But the talks I did hear were interesting. Here are some of the notes I made:
Julia Flanders talked on "Historicizing Humanities Computing". Technological advances since the formation of the OTA 30 years ago have allowed the practice of text encoding to move from being an experiment to being a paradigm - instead of its being purely for its own sake or interest, it's now possible for ordinary scholars to use text encoding techniques as just another resource. But also, older marked-up text that now seems primitive was, she suggested, not considered limited at the time it was produced - it was, as is often the case with historical artefacts, simply what encoders at the time wanted to do. Related to this, she showed some examples of older marked-up documents in which the mark up was obviously aimed at being read by human scholars and not really suitable for computer processing. She went on to ask if today's marked-up text, in the light of technologies such as XQuery, XSLT and SVG, has become one step removed from the scholars, and asked whether or not this could be considered progress. Questions were asked about how important the size of a collection of marked-up documents is.
Claire Warwick talked about documentation and commenting - something which we all know we should be doing! One important point she raised was that a lot of people are guilty of either not using existing standards or of customising them on the justification that "my project is unique". With text encoding it quite often is the case that, for example, TEI needs customising before you can use it effectively, but you should still, she argues, make sure that you document your customisations. She also discussed another important contributor to the apparent lack of documentation: most people working with text encoding standards are creators of documents, not users, and so users' needs have not always been fully considered - it's often difficult for people who design a system to foresee how users will interact with it and to realise which bits are only "obvious" to the designer. She went on to talk about the sustainability of resources and argued that, just because you've deposited your text in the OTA - your funded period is over, you've achieved your stated goals and backed up your text - it doesn't mean that you've accounted for the sustainability of your work. Sustainability, she argued, is achieved through maintenance of archives - both the back-end and the interface. Of course, funding maintenance work (which is by its very nature indeterminate in length) is very difficult to do. She also mentioned that contemporary users are very literate in, and critical of, Web-presented archives - it only takes the smallest defect for them to dismiss an archive (or many other sites, in fact) as neglected. Finally, she suggested that documentation should be considered a first-class deliverable of archival projects - not merely an appendage.
Peter Robinson was very entertaining and spoke under the inflammatory title of "Why the OTA is obsolete (or should be..)"; he was getting at the fact that scholars should be willing to relinquish ownership of their transcribed texts and begin to work collaboratively on text encoding. Interestingly, he was arguing for a very distributed model in which each contributor has their own copy of the encoded text, with their own corrections/variations, and users can view dynamic editions which aggregate all these distributed sources. It is one step removed from common Wikis (in which the user-edited documents exist in only one instance). He acknowledged the main difficulties in this: that scholars would be unwilling to give up ownership of their work; that the technology has not yet been applied to this problem (though doing so is not impossible with today's tools); and that a method will be needed to identify texts accurately.
Willard McCarty (whose book, "Humanities Computing" [Palgrave, 2005], I read recently) gave a retrospective on his involvement with issues around text encoding and humanities computing: the formation of the OTA, the founding of the HUMANIST mailing list and his coining of the term "humanities computing".
The conference finished with a panel discussion on the future of electronic text archives which was introduced by Yorick Wilks (Professor of AI and Linguistics at Sheffield) with a brief talk about the semantic web and how text encoders were partly responsible for its formation and would be crucial to its further development. He argued that the logic and trust layers of the semantic web which interest technologists must be grounded on annotated text which is the domain of text encoders.
The discussion went on to talk about Google - a lot. The suitability of Google's PageRank algorithm for text archives was discussed - is it useful for people who want to do linguistic analysis on text data or is it merely a keyword indexer? And will it index all the important documents in an archive or will its popularity-by-link-count method limit the visibility of documents? Use of Google for the OTA was considered in two possible implementations: either giving the archive to the Google Scholar project or using the Google API. I was quite surprised to learn that the OTA are still working on user interaction tools for their archive - surely a good indexer (like Swish-e) and a good XML publishing framework (like, um, well, I know of an all right one) would provide enough to be getting on with?
Another interesting topic discussed was the idea of "stand off" mark up. That is, mark up which is not embedded in the text (or in which the text is not embedded!) but which is kept separate. Applications include: meta-data markup, encoding conflicting hierarchies for the same document (e.g. semantic structure and visual structure) and possibly allowing collaborative transcriptions. One other thing that interested me was the idea that a marked up text should be considered a sort of secondary source - it's an interpretation or reading, not a substitute for the original.
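A sketch of the stand-off idea (invented element names): the base text is stored once, untouched, and the annotations point into it by character offsets, which is how conflicting hierarchies - or several collaborators - can mark up the same text without treading on each other:

<text id="t1">It was a dark and stormy night</text>

<!-- kept in a separate annotation document; start/end are character offsets -->
<annotation target="t1" start="0" end="2" type="pronoun"/>
<annotation target="t1" start="9" end="13" type="adjective"/>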
Anyway, it was all quite interesting and I'm now on the train home. It was nice to see JamesC again. And I met the chap who had been trying out Pycoon and assured him that one day it would be usable by people other than me!
Looking through the bumph we were given, I was very impressed with the state of computing services at Oxford - it really puts UEA to shame. They seem to take it seriously: it's treated as an important, working department, staffed by intelligent academics and providing a real service to the core work of the University. At UEA, by contrast, all we seem to have is a succession of failed, ill-conceived projects based on useless proprietary software and a misguided belief in over-commercialised buzz-words.
Oh yes, and ergates, if you ever read this: I went to the Botanical Gardens to try to find "the bench" from His Dark Materials, but they close at 17:00 so I couldn't get in. (Who closes a garden!?)
OK, so this is what I was talking about: I've been asked to add some content to the Music School's pages on UEA's new content management system. And it seems that, yet again, they've wasted money on another useless product to fulfil an unnecessary role (which could have been accomplished with a free product like Zope). It seems that people have been complaining that it's difficult to use and, having had my "training", I can see why - it seems to allow for lots of working concepts (like workflows) which UEA don't want to use but can't hide, and which just end up cluttering the interface.
As well as being potentially difficult to use, it's also bad for UEA's public Web presence as it's broken all the URLs! They've decided to call the server www1.uea.ac.uk, which looks really bad. Then it prepends a whole load of guff to the beginning of the paths which, as well as looking bad, makes it difficult to give people URLs over the phone. But, worse than all this, the "article" abstractions in the CMS only allow user-defined titles and don't allow you to specify what the "filename" should be. And do you know what it uses for the "filenames"? NUMBERS!! Yuk! (I've found a hack for this, though: you can make everything using the "directory" abstraction and avoid "articles".)
However, despite all this, why does UEA need a content management system anyway? Their excuse is that they want to homogenise the UEA's web pages, but why is that a good thing? My colleague described it as "empire building" and I think I'm inclined to agree. UEA isn't a corporation - it's a university. It's supposed to encourage free thinking and individualism, not restrict people into neat little boxes.
It's going to be another UEA IT disaster!
On a lighter note: I played in a recording of a piece for air horns yesterday. It was teh loud! (But we did have ear plugs.)
I wrote this in my notebook last night:
Computational linguistics really has an unfair advantage over algorithmic music analysis (apart from having a more succinct name): computers have an inherent, or at least much more fundamental, understanding of text than they do of music. It's easy to perform lexicographic operations on text because the semantics of the symbols which it comprises (i.e. letters) are built in to the most fundamental programming libraries.
Music, on the other hand, has no such near-binary digital representation; its encoding must be carried out on a much higher level and its semantics must be meticulously explained to the computer. So what would be a good method for this? Could a binary representation for musical symbols be used where the lexicographic information (e.g. sort order) is practically built in? Unfortunately the pitch 'A' has more potential attributes than the letter 'A'. For example, in any meaningful context a pitch has a duration, a timbre, and may well have an order of importance within its tonal context. A better method would be encoding entities which can have multiple properties (and resigning ourselves to the fact that a high-level, resource-intensive data representation method is inevitable) and writing specialised lexicographic functions to go with them.
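A quick sketch of what I mean (all the names are placeholders, not a proposal):

class Note(object):
    """An encoded musical entity with multiple properties."""
    def __init__(self, pitch, octave, duration):
        self.pitch = pitch        # e.g. "A"
        self.octave = octave      # e.g. 4
        self.duration = duration  # in crotchets, e.g. 2.0 = minim

def pitch_height(note):
    """A specialised 'lexicographic' function: orders notes by pitch."""
    chromatic = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
    return note.octave * 12 + chromatic[note.pitch]

melody = [Note("E", 5, 1.0), Note("A", 4, 2.0), Note("C", 5, 0.5)]
melody.sort(key=pitch_height)  # free for letters; hand-written for notes

For the letter 'A' this ordering comes built in to every string library; for the pitch 'A' we have to write (and argue about) pitch_height ourselves.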
Linguists are also given semantic structure in their manuscripts: words, sentences, paragraphs, etc. Musicians don't get quite so much help here either. Should we only encode the information given in the MS and spend time writing algorithms which try to identify phrases, antecedent-consequent phrases, harmonic progressions, modulations and significant structural boundaries? Or should we encode what we believe to be the semantic structure? The result, in this case, would very clearly be a reading or interpretation of the score - not an attempt to reproduce the original. But it may allow other (more interesting?) work to be done if this semantic information is already given.
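Encoding the interpretation might look something like this (invented markup): the phrase elements are my reading, layered over the notes the manuscript actually gives:

<phrase function="antecedent">
  <note pitch="G" duration="crotchet"/>
  <note pitch="A" duration="crotchet"/>
  <note pitch="B" duration="minim"/>
</phrase>
<phrase function="consequent">
  <note pitch="B" duration="crotchet"/>
  <note pitch="A" duration="crotchet"/>
  <note pitch="G" duration="minim"/>
</phrase>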
What exactly is a note? The dictionary gives three definitions of note (in the musical sense):
"1. Written sign representing pitch & duration of a musical sound; key of a pianoforte etc.; single tone of definite pitch made by musical instrument, voice, etc."
The Oxford Companion adds to this definition, explaining that the second sense is often not used and, further, that American musical terminology rejects the third meaning as well - in this argument I think I side with the Americans. A note is only the written representation of a musical sound, not the sound itself or the listener's perception or comprehension of the sound.
In the light of my supposition that elements of musical information should be represented as tree elements - i.e. they can't be represented as characters - would there be any sense in a content-based tagging style? Or would an attribute-based tagging style make more sense?
Of course, an attribute-based tagging style has advantages for processing applications (particularly SAX-based tools), but it may also be syntactically more appropriate here because the markup /is/ the data - it's not, as in conventional text markup, a semantic addition to pre-existing encoded information. Consider that a <note> element may be the finest level of granularity; it doesn't have any inherent content - only attributes. To use a content-based scheme implies that there is (or was) digital content which could exist (or did exist) independently of the markup. This isn't the case.
<!-- content based: -->
<note>A</note>
<!-- attribute based: -->
<note pitch="A" />
I was inducted as a new postgraduate student today! Yay! Apparently faculty inductions are new at UEA, which seems a bit strange as they actually told us some quite important things.
But, as well as important things, they also told us some interesting things, like about the Dialogues weekly seminar series, which is inter-disciplinary and where you can give "experimental" papers, which is cool. I was quite disappointed to find that there seemed to be no linguistics students there - it would have been fun to discuss computational techniques in linguistics and compare them to music. Oh well, I suppose there must be some research that goes on in LLT.
Oh, and we had a free lunch!