I hope everyone missed me on #alug today? I was in Oxford, for the Oxford Text Archive's 30th anniversary conference. Well, I was there for half of it, anyway; the four-hour train journey would have meant getting up far too early to attend the morning half of the conference as well.
But the talks I did hear were interesting. Here are some of the notes I made:
Julia Flanders talked on "Historicizing Humanities Computing". Technological advances since the formation of the OTA 30 years ago have allowed the practice of text encoding to move from being an experiment to being a paradigm - instead of being pursued purely for its own sake or interest, it's now possible for ordinary scholars to use text encoding techniques as just another resource. But also, seemingly primitive older marked-up text was, she suggested, not limited at the time it was produced - it was, as is often the case with historical artefacts, simply what encoders at the time wanted to do. Related to this, she showed some examples of older marked-up documents in which the mark up was obviously aimed at being read by human scholars and not really suitable for computer processing. She went on to ask whether today's marked-up text, in the light of technologies such as XQuery, XSLT and SVG, has become one step removed from the scholars, and whether or not this could be considered progress. Questions were asked about how important the size of a collection of marked-up documents is.
Claire Warwick talked about documentation and commenting - something which we all know we should be doing! One important point she raised was that a lot of people are guilty of either not using existing standards or customising them, on the justification that "my project is unique". With text encoding it quite often is the case that, for example, TEI needs customising before you can use it effectively, but you should still, she argues, make sure that you document your customisations. She also discussed another important contributor to the apparent lack of documentation: most people applying text encoding standards are creators of documents, not end users, and so users' needs have not always been fully considered - it's often difficult for people who design a system to foresee how users will interact with it and to realise which bits are only "obvious" to the designer. She went on to talk about the sustainability of resources and argued that just because you've deposited your text in the OTA - your funded period is over, you've achieved your stated goals and backed up your text - it doesn't mean that you've accounted for the sustainability of your work. Sustainability, she argued, is achieved through maintenance of archives - both the back-end and the interface. Of course, funding maintenance work (which is by its very nature indeterminate in length) is very difficult to do. She also mentioned that contemporary users are very literate in and critical of Web-presented archives - it only takes the smallest defect for them to dismiss an archive (or many other sites, in fact) as neglected. Finally, she suggested that documentation should be considered a first-class deliverable of archival projects - not merely an appendage.
Peter Robinson was very entertaining, speaking under the inflammatory title "Why the OTA is obsolete (or should be..)". His point was that scholars should be willing to relinquish ownership of their transcribed texts and begin to work collaboratively on text encoding. Interestingly, he was arguing for a very distributed model in which each contributor has their own copy of the encoded text, with their own corrections and variations, and users can view dynamic editions which aggregate all these distributed sources. It is one step removed from common wikis (in which the user-edited document exists in only one instance). He acknowledged the main difficulties in this: that scholars would be unwilling to give up ownership of their work; that the technology has not yet been applied to this problem (though doing so is not impossible with today's tools); and that a method will be needed to identify texts accurately.
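To get the idea straight in my own head, here's a toy sketch in Python of what such a dynamic edition might do - collating every contributor's reading of each segment of a text. All the names and data here are my own invention, not anything Robinson actually showed:

    # Each contributor keeps their own transcription; a "dynamic edition" is
    # built on demand by collating the variant readings segment by segment.
    sources = {
        "transcriber_a": {"seg1": "Whan that Aprill with his shoures soote"},
        "transcriber_b": {"seg1": "Whan that Aprille with his shoures soote"},
    }

    def dynamic_edition(sources):
        """Collate every contributor's reading of each segment."""
        edition = {}
        for contributor, segments in sources.items():
            for seg_id, reading in segments.items():
                edition.setdefault(seg_id, {}).setdefault(reading, []).append(contributor)
        return edition

    for seg_id, readings in dynamic_edition(sources).items():
        for reading, contributors in readings.items():
            print(seg_id, contributors, ":", reading)

The point being that no single copy is "the" text: the edition is just an aggregation over whatever distributed sources you choose to trust.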
Willard McCarty (whose book, "Humanities Computing" [Palgrave, 2005], I read recently) gave a retrospective on his involvement with issues around text encoding and humanities computing: the formation of the OTA, the founding of the HUMANIST mailing list and his coining of the term "humanities computing".
The conference finished with a panel discussion on the future of electronic text archives, which was introduced by Yorick Wilks (Professor of AI and Linguistics at Sheffield) with a brief talk about the semantic web and how text encoders were partly responsible for its formation and would be crucial to its further development. He argued that the logic and trust layers of the semantic web which interest technologists must be grounded on annotated text, which is the domain of text encoders.
The discussion went on to talk about Google - a lot. The suitability of Google's PageRank algorithm for text archives was discussed - is it useful for people who want to do linguistic analysis on text data, or is it merely a keyword indexer? And will it index all the important documents in an archive, or will its popularity-by-link-count method limit the visibility of some documents? Use of Google for the OTA was considered in two possible forms: either giving the archive to the Google Scholar project, or building the OTA's search on the Google API. I was quite surprised to learn that the OTA are still working on user interaction tools for their archive - surely a good indexer (like Swish-e) and a good XML publishing framework (like, um, well, I know of an all right one) would provide enough to be getting on with?
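For what it's worth, the essence of what I mean by "a good indexer" is nothing more exotic than an inverted index - a map from words to the documents containing them - which is also why keyword search alone doesn't buy you linguistic analysis. A toy Python sketch (the filenames and contents are invented):

    # A toy inverted index: map each word to the set of documents containing
    # it. This is the core of what a keyword indexer does - fine for lookup,
    # but it says nothing linguistic about the texts themselves.
    from collections import defaultdict

    docs = {
        "beowulf.xml": "hwaet we gardena in geardagum",
        "chaucer.xml": "whan that aprill with his shoures soote",
    }

    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.split():
            index[word].add(name)

    print(sorted(index["whan"]))  # ['chaucer.xml']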
Another interesting topic discussed was the idea of "stand off" mark up. That is, mark up which is not embedded in the text (or in which the text is not embedded!) but which is kept separate. Applications include: metadata markup, encoding conflicting hierarchies for the same document (e.g. semantic structure and visual structure) and possibly allowing collaborative transcriptions. One other thing that interested me was the idea that a marked-up text should be considered a sort of secondary source - it's an interpretation or reading, not a substitute for the original.
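The idea is easier to see in code than in prose: the annotations live in their own structure and point into an untouched base text by character offsets, so two conflicting hierarchies never have to nest inside each other. A little Python sketch (the text, offsets and layer names are all my own invention):

    # Stand-off markup: the base text is never modified; annotations point
    # into it by character offsets. The "visual" line below cuts across the
    # second "semantic" sentence - an overlap that ordinary nested XML tags
    # could not express.
    text = "Call me Ishmael. Some years ago I went to sea."

    annotations = [
        {"layer": "semantic", "type": "sentence", "start": 0,  "end": 16},
        {"layer": "semantic", "type": "sentence", "start": 17, "end": 47},
        {"layer": "visual",   "type": "line",     "start": 0,  "end": 32},
    ]

    def spans(layer):
        """Resolve one layer's annotations against the base text."""
        return [(a["type"], text[a["start"]:a["end"]])
                for a in annotations if a["layer"] == layer]

    print(spans("semantic"))  # [('sentence', 'Call me Ishmael.'), ('sentence', 'Some years ago I went to sea.')]
    print(spans("visual"))    # [('line', 'Call me Ishmael. Some years ago')]

It's also easy to see from this how the collaborative transcription idea fits: several people can annotate the same base text without ever touching each other's files.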
Anyway, it was all quite interesting and I'm now on the train home. It was nice to see JamesC again. And I met the chap who had been trying out Pycoon and assured him that one day it would be usable by people other than me!
Looking through the bumph we were given, I was very impressed with the state of computing services at Oxford - it really puts UEA to shame. They seem to take it seriously, treating it as an important, working department staffed by intelligent academics and providing a real service to the core work of the University, whereas at UEA all they seem to have is a succession of failed, ill-conceived projects based on useless proprietary software and a misguided belief in over-commercialised buzzwords.
Oh yes, and ergates, if you ever read this: I went to the Botanical Gardens to try to find "the bench" from His Dark Materials, but they close at 17:00 so I couldn't get in. (Who closes a garden!?)