Selected Publications and Presentations

[Dubin et al., 2003]
D. Dubin, C. M. Sperberg-McQueen, A. Renear, and C. Huitfeldt. A logic programming environment for document semantics and inference. Literary and Linguistic Computing, 18(2):225-233, 2003. (This is a corrected version of an article that appeared in 18:1 pp. 39-47).

Markup licenses inferences about a text. But the information warranting such inferences may not be entirely explicit in the syntax of the markup language used to encode the text. This paper describes a Prolog environment for exploring alternative approaches to representing facts and rules of inference about structured documents. It builds on earlier work proposing an account of how markup licenses inferences, and of what is needed in a specification of the meaning of a markup language. Our system permits an analyst to specify facts and rules of inference about domain entities and properties as well as facts about the markup syntax, and to construct and test alternative approaches to translation between representation layers. The system provides a level of abstraction at which the performative or interpretive meaning of the markup can be explicitly represented in machine-readable and executable form.

[Dubin et al., 2003]
D. S. Dubin, A. Renear, and C. M. Sperberg-McQueen. Addressing obstacles to the retrieval of structured documents. Technical Report UIUCLIS--2003/1+EPRG, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, Champaign, IL, 2003.

The potential for document markup, such as SGML and XML, to support information retrieval is receiving considerable attention. However, the generally underdetermined and implicit nature of even the most basic semantic relationships expressed by SGML/XML markup is an obstacle to its effective exploitation, as is the assumption of "semantic transparency." A project to develop an adequate machine-readable formalism for expressing markup semantics is described, along with specific applications to retrieval problems. A Prolog environment supporting inferences and queries is also used to generate document abstractions based on a formalized semantics for markup vocabularies.

[Dubin, 2003]
D. Dubin. Object mapping for markup semantics. In B. T. Usdin, editor, Proceedings of Extreme Markup Languages 2003, Montreal, Quebec, August 2003.

The BECHAMEL system is a knowledge representation and inference environment for expressing and testing semantic rules and constraints for markup languages. Written in Prolog, the system provides predicates for processing the syntactic structures that emerge from an SGML/XML parser, defining object classes, instantiating object instances, assigning values to properties, and establishing relationships between or among object instances. BECHAMEL uses Prolog's built-in capabilities to derive inferences from these facts. Part of the ongoing development of BECHAMEL involves experimenting with strategies for mapping syntactic relations to object relations and properties. This paper describes the current strategy, based on a blackboard model. Advantages of this approach include context-free rules and the potential to exploit parallel processing for scalability. It has the drawback, however, of not permitting evidence to be described in ways people are likely to find natural or familiar. By using the current approach to produce formal accounts of the semantics of popular markup languages, we hope to learn a great deal about the ways markup syntax typically cues semantic relationships. That advance in our understanding will inform the development of more usable languages for object mapping.

[Renear and Dubin, 2003]
A. Renear and D. Dubin. Towards identity conditions for digital documents. In S. Sutton, editor, Proceedings of the 2003 Dublin Core Conference, Seattle, WA, October 2003. University of Washington.

By "identity conditions" we mean a method for determining whether an object x and an object y are the same object. Identity conditions are arguably an essential feature of any rigorously developed conceptual framework for information modeling. Surprisingly, the concept of same document, which is fundamental to many aspects of library and information science, and to digital libraries in particular, has received little systematic analysis. As a result, not only is the concept of a document itself under-theorized, but progress on a number of important practical problems has been hindered. We review the importance of document identity conditions, demonstrate problems with current approaches, and discuss the general form a solution must take.

[Renear and Salo, 2003]
A. Renear and D. Salo. Electronic books and the open eBook publication structure. In W. E. Kasdorf, editor, The Columbia Guide to Digital Publishing, chapter 11, pages 455-520. Columbia University Press, New York, 2003.

Electronic books, or e-books, will soon be a major part of electronic publishing. This chapter introduces the notion of electronic books, reviewing their history, the advantages they promise, and the difficulties in predicting the pace and nature of e-book development and adoption. It then analyzes some of the critical problems facing both individual publishers and the industry as a whole, drawing on our current understanding of fundamental principles and best practice in information processing and publishing. In the context of this analysis the Open eBook Forum Publication Structure, a widely used XML-based content format, is presented as a foundation for high-performance electronic publishing.

[Renear et al., 2003]
A. Renear, C. Phillippe, P. Lawton, and D. Dubin. An XML document corresponds to which FRBR Group 1 entity? In B. T. Usdin, editor, Proceedings of Extreme Markup Languages 2003, Montreal, Quebec, August 2003.

The FRBR (Functional Requirements for Bibliographic Records), released by the International Federation of Library Associations and Institutions in 1998, generalizes and refines current practices and theory in library cataloging, presenting a compelling natural ontology of entities, attributes, and relationships for representing the "bibliographic universe." The FRBR framework is extremely influential and increasingly accepted as a conceptual foundation for cataloging practice and technology in libraries and elsewhere. XML documents, as defined in the W3C XML 1.0 specification, are now an important part of this bibliographic universe, and it is natural to ask to which of FRBR's "Group 1" entities the XML document corresponds. Curiously, there seem to be conflicting arguments for assigning the XML document to either of the two plausible entity categories: manifestation and expression. We believe these difficulties illuminate both the nature of the FRBR entities and the nature of markup. We explore a conjecture that an XML document has a double aspect and that whether it is a FRBR manifestation or a FRBR expression depends upon context and intention. Such a double-aspected nature would not only be consistent with previous arguments that the meaning of XML markup varies in "illocutionary force" according to context of use, but might also help resolve an old puzzle in the humanities computing community as to whether markup is "part of" the text (Buzzetti 2002). However, there are alternative resolutions to explore as well, and we seem to still be some distance from a full understanding of the issues.

[Renear et al., 2003]
A. Renear, D. Dubin, C. M. Sperberg-McQueen, and C. Huitfeldt. XML semantics and digital libraries. In C. C. Marshall, G. Henry, and L. Delcambre, editors, Proceedings of the Third ACM/IEEE-CS Joint Conference on Digital Libraries, pages 303-305, Los Alamitos, CA, 2003. IEEE.

The lack of a standard formalism for expressing the semantics of an XML vocabulary is a major obstacle to the development of high-function interoperable digital libraries. XML document type definitions (DTDs) provide a mechanism for specifying the syntax of an XML vocabulary, but there is no comparable mechanism for specifying the semantics of that vocabulary --- where semantics simply means the basic facts and relationships represented by the occurrence of XML constructs. A substantial loss of functionality and interoperability in digital libraries results from not having a common machine-readable formalism for expressing these relationships for the XML vocabularies currently being used to encode content. Recently a number of projects and standards have begun taking up related topics. We describe the problem and our own project.

[Sperberg-McQueen, 2003]
C. M. Sperberg-McQueen. Logic grammars and XML Schema. In B. T. Usdin, editor, Proceedings of Extreme Markup Languages 2003, Montreal, Quebec, August 2003.

[Dubin et al., 2002]
D. Dubin, A. Renear, C. M. Sperberg-McQueen, and C. Huitfeldt. A logic programming environment for document semantics and inference. Presented at ALLC/ACH, Tübingen, Germany, July 2002.

Recently Sperberg-McQueen and others have argued that markup functions by licensing inferences about a text. They remark, however, that the information warranting such inferences may not be entirely explicit in the syntax of the markup language used to encode the text. In order to adequately represent such inferences (the "meaning of markup") the Sperberg-McQueen group developed techniques for expressing in predicate logic (i) the facts signalled by the encoding of a particular document instance and (ii) the logical relationships commonly understood to exist and license further inferences. A Prolog database was used to demonstrate the effectiveness of this approach. The present paper builds directly on this previous work, and reflects new results which provide more rigorous and explanatory layers of abstraction and progress in understanding problems with "deictic" expressions, domains of variables, etc. But the fundamental new result presented is the completion of an integrated working system with an entirely new and substantially redesigned Prolog database at its core. This Prolog database has been redesigned to better reflect the theoretical results and to increase functionality, flexibility, and performance. The system permits an analyst to specify facts about the markup syntax (e.g., generic identifiers and attribute values) separately from facts and rules of inference about semantic entities and properties. The system provides a level of abstraction at which the performative or interpretive meaning of the markup can be explicitly represented in machine-readable and executable form. Inferences can then be drawn regarding document components, including problematic structures, such as those participating in overlapping hierarchies.
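
The separation this abstract describes, between facts about markup syntax and rules of inference about semantic properties, can be illustrated with a minimal sketch. The system itself is written in Prolog; the Python below is only an analogy and is not the authors' code. The node identifiers, the element names, the `lang` attribute, and the inheritance rule are all assumptions chosen for illustration.

```python
# Layer 1: facts about the markup syntax (hypothetical node ids and names).
ELEMENT = {"n1": "text", "n2": "p", "n3": "quote"}        # node -> generic identifier
PARENT = {"n2": "n1", "n3": "n2"}                        # child node -> parent node
ATTRIBUTE = {("n1", "lang"): "en", ("n3", "lang"): "de"}  # (node, attribute) -> value


# Layer 2: a rule of inference about a semantic property.
def language_of(node):
    """Infer a node's language: its own lang value, else its nearest ancestor's."""
    while node is not None:
        value = ATTRIBUTE.get((node, "lang"))
        if value is not None:
            return value
        node = PARENT.get(node)  # climb one level; None once past the root
    return None


for node, name in ELEMENT.items():
    print(f"The content of <{name}> ({node}) is in language {language_of(node)}")
```

Here the paragraph `n2` carries no `lang` attribute of its own, so its language is inferred from its parent; that fact is licensed by the markup without being explicit in the syntax at that node.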

[Dubin, 2002]
D. Dubin. Standards and information. In J. R. Schement, editor, Encyclopedia of Communication and Information, volume 3, pages 965-967. Macmillan, New York, 2002.

[Renear et al., 2002]
A. Renear, D. Dubin, C. M. Sperberg-McQueen, and C. Huitfeldt. Towards a semantics for XML markup. In R. Furuta, J. I. Maletic, and E. Munson, editors, Proceedings of the 2002 ACM Symposium on Document Engineering, pages 119-126, McLean, VA, November 2002. Association for Computing Machinery.

Although structured document standards provide mechanisms for specifying, in machine-readable form, the syntax of a markup language, there is no comparable mechanism for specifying the semantics of an SGML or XML vocabulary. That is, there is no way to characterize the meaning of markup so that the facts and relationships represented by the occurrence of its constructs can be explicitly, comprehensively, and mechanically identified. This has serious practical and theoretical consequences. On the positive side, SGML/XML constructs can be assigned arbitrary semantics and used in application areas not foreseen by the original designers. On the less positive side, both content developers and application engineers must rely upon prose documentation, or, worse, conjectures about the intention of the markup language designer --- a process that is time-consuming, error-prone, incomplete, and unverifiable, even when the language designer properly documents the language. In addition, the lack of a substantial body of research in markup semantics means that digital document processing is undertheorized as an engineering application area. Although there are some related projects underway (XML Schema, RDF, the Semantic Web) which provide relevant results, none of these projects directly and comprehensively addresses the core problems of structured document semantics. This proposal characterizes the specific problems that motivate the need for a formal semantics for SGML/XML, and describes an ongoing research project --- the BECHAMEL Markup Semantics Project --- that is attempting to develop such a semantics.

[Sperberg-McQueen et al., 2002]
C. M. Sperberg-McQueen, A. Renear, C. Huitfeldt, and D. Dubin. Skeletons in the closet: Saying what markup means. Presented at ALLC/ACH, Tübingen, Germany, July 2002.

Our immediate area of concern is the problem of providing a clear, explicit account of the meaning and interpretation of markup. Scores of projects in humanities computing and elsewhere assume implicitly that markup is meaningful, and use its meaning to govern the processing of the data. While a complete account of the "meaning of markup" may seem daunting, at least part of this project appears manageable: explaining how to determine the set of inferences about a document which are "licensed", implicitly or explicitly, by its markup. However, it proves remarkably difficult to find, in the literature, any straightforward account of how one can go about interpreting markup in such a way as to draw all and only the correct inferences from it. This paper describes a concrete realization of one part of a model proposed earlier, and outlines some of the problems encountered in specifying the inferences licensed by commonly used DTDs. We focus here on the development of a notation for expressing what we call "sentence skeletons", or "skeleton sentences". These are sentences, either in English or some other natural language or in some formal notation, for expressing the meaning of constructs in a markup language. They are called sentence skeletons, rather than full sentences, because they have blanks at various key locations; a system for automatic interpretation of marked-up documents will generate actual sentences by filling in the blanks in the sentence skeletons with appropriate values from the documents themselves. We describe theoretical and practical problems arising in using sentence skeletons to say what the markup in some commonly used DTDs actually means, in a way that allows software to generate the correct inferences from the markup and to exploit the information.
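
The idea of filling blanks in sentence skeletons can be sketched roughly as follows. This is not the notation developed in the paper; the skeleton wording, the element names `title` and `foreign`, and the `lang` attribute are hypothetical, and the blanks are written as Python format fields purely for convenience.

```python
import xml.etree.ElementTree as ET

# Hypothetical skeletons: one template per element type, with named blanks.
# {content} is filled from element content, other blanks from attribute values.
SKELETONS = {
    "title": "The title of the document is '{content}'.",
    "foreign": "The phrase '{content}' is in the language '{lang}'.",
}


def fill_skeletons(xml_text):
    """Return one sentence for each element that has a skeleton, in document order."""
    root = ET.fromstring(xml_text)
    sentences = []
    for elem in root.iter():
        skeleton = SKELETONS.get(elem.tag)
        if skeleton is None:
            continue  # no skeleton defined for this element type
        values = dict(elem.attrib)
        values["content"] = "".join(elem.itertext()).strip()
        sentences.append(skeleton.format(**values))
    return sentences


DOC = "<doc><title>On Markup</title><p>See <foreign lang='la'>ad hoc</foreign> rules.</p></doc>"
for sentence in fill_skeletons(DOC):
    print(sentence)
```

In this sketch an element that lacks an attribute named in its skeleton raises a `KeyError`, which hints at one of the practical problems: a skeleton must state where each blank's value comes from when it is not given locally.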

[Sperberg-McQueen et al., 2002]
C. M. Sperberg-McQueen, D. Dubin, C. Huitfeldt, and A. Renear. Drawing inferences on the basis of markup. In B. T. Usdin and S. R. Newcomb, editors, Proceedings of Extreme Markup Languages 2002, Montreal, Quebec, August 2002.

Various authors have sketched out proposals for identifying the meaning, or guiding the automated interpretation, of markup, sometimes with the goal of using the information expressed by markup to guide the extraction of information from documents and using it to populate reasoning engines. We describe one approach to the problems of building a system to perform such a task.

[Renear, 2001]
A. Renear. Raising the bar: Text encoding from a logical point of view. CLIP 2001: Computers, Literature, Philology, Gerhard-Mercator University, Duisburg, Germany, December 2001.

[Sperberg-McQueen et al., 2001]
C. M. Sperberg-McQueen, C. Huitfeldt, and A. Renear. Practical extraction of meaning from markup. Paper delivered at ACH/ALLC 2001, New York, 2001.

[Renear, 2000]
A. Renear. The descriptive/procedural distinction is flawed. Markup Languages: Theory and Practice, 2(4):411-420, 2000.

The traditional distinction between descriptive and procedural markup is flawed; it conflates two different dimensions, mood and domain, which in fact can vary independently. An adequate markup taxonomy must, among other things, incorporate distinctions such as those developed in contemporary "speech-act theory". This will substantially complicate, although in interesting ways, the development of an adequate theory of markup semantics, as formalization will require modal operators and additional axiomatic relationships. In addition, these reflections reveal that there are foundational issues in markup theory that are not yet resolved, in particular the precise relationship between markup and text.

[Sperberg-McQueen et al., 2000]
C. M. Sperberg-McQueen, C. Huitfeldt, and A. Renear. Meaning and interpretation of markup. Markup Languages: Theory and Practice, 2(3):215-234, 2000.

SGML and XML markup allows the user to make (licenses) certain inferences about passages in the marked-up material; in particular, markup signals the occurrence of specific features in a document. Some features are distributed, and their occurrences are logically non-countable (italic font is a simple example); others are non-distributed (paragraphs and other standard textual structures, for example). Formally, the inferences licensed by markup may be expressed as open sentences, whose blanks are to be filled by the contents of an element, by an attribute value, by an individual token of an attribute value, etc. The task of interpreting the meaning of the markup at a particular location in a document may then be formulated as finding the set of inferences about that location which may be drawn on the basis of the markup in the document. Several different approaches to this problem are outlined; one measure of their relative complexity is the complexity of the expressions which are allowed to fill the slots in the open sentences which formally specify the meaning of a markup language.

[Ide and Sperberg-McQueen, 1997]
Nancy M. Ide and C. M. Sperberg-McQueen. Toward a unified docuverse: Standardizing document markup and access without procrustean bargains. In C. Schwartz and M. Rorvig, editors, Proceedings of the 60th Annual Meeting of the American Society for Information Science, pages 347-360, Medford, NJ, 1997. Information Today, Inc.

[Renear et al., 1996]
A. Renear, E. Mylonas, and D. Durand. Refining our notion of what text really is: The problem of overlapping hierarchies. In Susan Hockey and Nancy Ide, editors, Research in Humanities Computing 4, pages 263-280. Oxford University Press, Oxford, 1996.

We examine the claim that 'text is an ordered hierarchy of content objects' (OHCO); this thesis was affirmed by the authors, and others, in the late 1980s and has been associated with certain approaches to text processing and the encoding of literary texts. First we discuss the nature of this claim and its connection with the history of text processing and text encoding standardization projects such as SGML and the Text Encoding Initiative. We then describe how the experience of the text encoding community, as represented and codified in the TEI Guidelines, has raised difficulties for this thesis. Next we consider two progressively weaker versions of this thesis formulated in response to these difficulties. Ultimately we find that no version appears to be free from counterexample. Although none of these formulations proves to be theoretically sound, they are nonetheless methodologically illuminating as each generalizes actual encoding practices, making explicit certain assumptions that, even though they have been fundamental to the working methodologies of most text encoding projects, have never been explicitly articulated, let alone explained or defended. The counterexamples to the different versions of the OHCO thesis also arise in actual encoding projects -- so although our focus is theoretical it is grounded in the methodology and problems of contemporary encoding practices. The problems discussed here not only have implications for text encoding and our understanding of the nature of textual communication, but also raise fundamental issues in the logic and methodology of the humanities.
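
A counterexample of the kind discussed here can be made concrete with a small sketch. Assuming, purely for illustration, that each structural unit is recorded as a half-open character span `(start, end)`, two units fit in a single hierarchy only if they are disjoint or one contains the other. The verse-line and sentence offsets below are invented, not taken from the paper.

```python
def overlaps(a, b):
    """True if spans a and b intersect but neither contains the other."""
    (a_start, a_end), (b_start, b_end) = a, b
    intersect = a_start < b_end and b_start < a_end
    a_in_b = b_start <= a_start and a_end <= b_end
    b_in_a = a_start <= b_start and b_end <= a_end
    return intersect and not (a_in_b or b_in_a)


def conflicts(first, second):
    """List every pair of spans, one from each hierarchy, that cannot nest."""
    return [(a, b) for a in first for b in second if overlaps(a, b)]


# Hypothetical offsets: two verse lines, and two sentences where the first
# sentence runs past the end of the first verse line.
VERSE_LINES = [(0, 10), (10, 20)]
SENTENCES = [(0, 14), (14, 20)]

print(conflicts(VERSE_LINES, SENTENCES))
```

Any pair reported by `conflicts` is a place where the two analyses of the same text cannot be combined into one tree of properly nested elements.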

[Sperberg-McQueen and Burnard, 1994]
C. M. Sperberg-McQueen and L. Burnard, editors. Guidelines for Text Encoding and Interchange (TEI P3). ACH/ALLC/ACL Text Encoding Initiative, Chicago, Oxford, 1994.

[Spring and Dubin, 1992]
M. B. Spring and D. Dubin. Hands-on PostScript. Hayden Books, Carmel, IN, 1992. (Published in Polish translation by Intersoftland of Warsaw as PostScript od A do Z).

[Sperberg-McQueen, 1991]
C. M. Sperberg-McQueen. Text in the electronic age: Textual study and text encoding, with examples from medieval texts. Literary and Linguistic Computing, 6(1):34-46, 1991.

This paper discusses characteristic problems in designing methods of encoding texts in machine-readable form for textual study. Any electronic representation of a text embodies specific ideas of what is important in that text. A well-developed encoding scheme is thus in some sense a theory of the texts it is intended to mark up. The paper describes, with examples, the theory implicit in the work of the Text Encoding Initiative (TEI), a project to develop guidelines for the encoding of machine-readable texts. Any machine-readable representation of texts must use markup, but no finite vocabulary of markup items can be complete, since neither the set of textual features worth marking nor the set of texts to be studied is finite. Any useful markup scheme must therefore be extensible. Additionally, a markup scheme must allow several discrete views of texts. Texts are both linguistic and physical objects. They have simultaneously a linear, a hierarchical, and a directed-graph structure. They refer to objects in real or fictive universes. Texts, finally, are cultural and thus historical objects: a useful encoding scheme must be able to represent textual variation, parallel texts, and the gradual accretion of interpretation and commentary with which human culture adorns venerated texts.

[DeRose et al., 1990]
S. J. DeRose, D. Durand, E. Mylonas, and A. H. Renear. What is text, really? Journal of Computing in Higher Education, 1(2):3-26, 1990. (Reprinted in ACM SIGDOC Asterisk Journal of Computer Documentation 21:3, 1997, pp. 1-24).

The way in which text is represented on a computer affects the kinds of uses to which it can be put by its creator and by subsequent users. The electronic document model currently in use is impoverished and restrictive. The authors argue that text is best represented as an ordered hierarchy of content objects (OHCO), because that is what text really is. This model conforms with emerging standards such as SGML and contains within it advantages for the writer, publisher, and researcher. The authors then describe how the hierarchical model can allow future use and reuse of the document as a database, hypertext, or network.

[Sperberg-McQueen and Burnard, 1990]
C. M. Sperberg-McQueen and L. Burnard, editors. TEI P1: Guidelines for the Encoding and Interchange of Machine Readable Texts. ACH-ALLC-ACL Text Encoding Initiative, Chicago/Oxford, 1990.

[Coombs et al., 1987]
J. H. Coombs, A. H. Renear, and S. J. DeRose. Markup systems and the future of scholarly text processing. Communications of the Association for Computing Machinery, 30(11):933-947, 1987.

Markup practices can affect the move toward systems that support scholars in the process of thinking and writing. Whereas procedural and presentational markup systems retard that movement, descriptive markup systems accelerate the pace by simplifying mechanical tasks and allowing the authors to focus their attention on the content.