MULTI-DIMENSIONAL DOCUMENT DESIGN

Allen Seifert - Information Strategies, Inc. November, 1993
I. INTRODUCTION:

The world of data and document design is undergoing significant changes, challenging many of the concepts that have long dominated the publishing and information dissemination industry. The primary forces driving these changes are the growth of automation in editorial support and end-user delivery. This paper describes some of these changes and suggests document design approaches to deal effectively with them.

II. THE LINK BETWEEN DOCUMENT STRUCTURE AND PROCESSING:

Data captured and organized into "document" files, whether SGML or other formats, is intended for some type of processing by computer software. Indeed, in the case of complex data tagged in SGML, it may be said that the only direct objective is to create input to some kind of software program.

Historically, this reality was masked by the fact that nearly all complex data, SGML or otherwise, was intended only for processing by composition software for the creation of page images. Because the processing strategy inherent in this type of software program (linear composition) and end-use of the data (paper pages) was nearly always the same, it was not recognized as a major variable in data design. With the growth of editorial automation and electronic delivery of information, however, there are at least four major processing environments for which document data must be designed.

  1. Page-image composition for paper or fixed image delivery.
  2. Editorial user session support including access and editorial transaction processing.
  3. Media-based (CD-ROM) mastering and end-user electronic delivery.
  4. Interactive (on-line) service-based mastering and end-user electronic delivery.

These differing processing and end-use environments are important for two primary reasons:

  1. The ways in which users locate and use information in each environment are sufficiently different from the others to warrant quite different delivery software approaches.
  2. The delivery software tools considered most capable in each environment operate sufficiently differently that data designed for one may not work properly in the others and may be difficult and expensive to convert so that it does.

These differences are critical to the effective planning and design of document data for inclusion in a multi-use document repository. Instead of creating data structures and tagging schemes for a single, fixed software processing and end-use environment, as has been the case, document designers must now understand and satisfy the needs and limitations of every environment in which their data will or may be processed.

 

INFORMATION ENVIRONMENTS, THEIR CHARACTERISTICS AND REQUIREMENTS:

 

A. PAGE-IMAGE COMPOSITION FOR PAPER OR FIXED IMAGE DELIVERY.

Composition, the baseline for most data design, is a linear process in which software, starting at a fixed entry point, sequentially creates output data to drive a typesetting device. It is significant that composition, still the most often encountered processing environment, is also the most forgiving of variations in data design. Accordingly, a data designer moving from a composition-centered environment to any other will likely encounter new and more stringent constraints.

The following paragraphs describe some of the major data-related attributes of page composition:

  1. The Assumption of Linearity (the "Burma-Shave" effect):
  2. Manual End Use:
  3. Visual Resolution of Information Structures:

 

B. EDITORIAL USER ACCESS AND SESSION SUPPORT PROCESSING:

Authoring and maintenance of publishable data has historically been accomplished via simple text editing or word processing software. The model for this work has permitted data and its editorial processing to be largely separate with little or no linkage during data design.

With the advent of SGML and SGML-sensitive editorial software, however, powerful (and expensive) software must be configured to support authoring transactions by authors who generate the highly complex data forms required for modern information delivery modes. This creates a new data usage environment that, while largely unfamiliar to traditional data designers, is heavily dependent on data structure for its operation. Indeed, editorial users comprise a totally new class of information client whose needs must be considered in the design of systems and data.

Editorial requirements and the software that supports them are often unique and demanding. The design of data structures that will guide the authoring and updating process must take these requirements and their attendant software into account if the overall environment is to achieve full productivity.

Among the major data-relevant characteristics of the editorial support environment are the following:

  1. Non-Linear Processing:
  2. Software Involvement in End-use:

    Unlike paper pages or fixed images, editorial processes require that software stay involved with both data and user through final completion of every action associated with the data. Indeed, in any electronic delivery environment, access to both data and data usage resources is available ONLY through software. Put in simple terms, the data designer's task is not complete until every action related to final consumption of the data is accomplished.

    Although seldom recognized as significant, this difference between pagination and electronic information processing, including editorial support, is major and highly relevant to data design.

  3. Unpredictable Processing Patterns:

Unlike pagination processing in which all software activities, once started, follow a predictable linear path, editorial support not only addresses data in a "hop-scotch" pattern, but also may require different processes for the same data depending on the user's requirements. Whatever the nature of the processing required, document data is the major ingredient and variable. Unless data structures are highly definitive, this high level of processing unpredictability can make it difficult, complex and potentially costly for the programmers developing and integrating editorial support environments to assure that their software routines will always operate correctly.

While a certain incidence of this condition is an unavoidable cost of using high-level editorial support software, it can be minimized by rigorous data design that avoids a phenomenon called structural ambiguity wherever possible. Structural ambiguity is the condition that results when software must search and collect additional information about an information element or structure in order to identify and process it properly. For example, if software is directed to a "heading" data element that requires different processing for different uses but contains no differentiating parameters, then the element is structurally ambiguous. Facing this, the software programmer must attempt to resolve the ambiguity by writing software to search the data around the tag to ascertain its context and appropriate treatment. The more complex and far-reaching the required search, the more structurally ambiguous the element is said to be.

Establishing and adhering to some basic rules in data design can minimize structural ambiguity. A rule set of this type might state, for example, that any element in the data must be capable of full software identification:

While not every data structure will be capable of adhering to this level of design rule, it is important for designers to determine and document the optimum level possible for each class of data structure and ensure that it is part of the design effort and subsequent administration function.
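The "heading" example above can be sketched in a few lines of Python. This is an illustration only: well-formed XML stands in for SGML so that the standard library parser can read it, and the element names and the "level" attribute are invented for the example rather than taken from any actual design.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: XML stands in for SGML, and every element and
# attribute name here is invented for illustration.
AMBIGUOUS = """
<doc>
  <chapter><heading>Scope</heading>
    <section><heading>Definitions</heading></section>
  </chapter>
</doc>
"""

EXPLICIT = """
<doc>
  <chapter><heading level="chapter">Scope</heading>
    <section><heading level="section">Definitions</heading></section>
  </chapter>
</doc>
"""


def classify_by_context(root):
    """Resolve each heading by searching the data around it."""
    # ElementTree keeps no parent pointers, so a parent map must be built
    # first. This extra pass over the data is the cost of the ambiguity.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    return [(h.text, parent_of[h].tag) for h in root.iter("heading")]


def classify_by_attribute(root):
    """Identify each heading from the element itself."""
    return [(h.text, h.get("level")) for h in root.iter("heading")]


ambiguous_result = classify_by_context(ET.fromstring(AMBIGUOUS))
explicit_result = classify_by_attribute(ET.fromstring(EXPLICIT))
print(ambiguous_result)
print(explicit_result)
```

Both functions reach the same answer, but the first must look beyond the element to get it. The further such a search has to reach, the more structurally ambiguous the element is.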

 

C. MEDIA-BASED (CD-ROM) MASTERING AND END-USER ELECTRONIC DELIVERY:

The interactive delivery of data on electronic media, such as CD-ROM, is perhaps the most exciting of all recent developments in the information world. These powerful new information tools free the user from the necessity to navigate through stacks of documents looking for desired data.

Along with these new capabilities, however, comes a new series of considerations, not all of which are positive. Some of these considerations affect the data design, authoring and preparation environment. Electronic media-based delivery shares the unique characteristics of editorial support described above, as well as several additional ones described in the following paragraphs:

  1. Hardware Involvement in End-use:

    In delivery modes such as CD-ROM, user hardware also plays an important part in determining the optimum mode of delivery and, thereby, impacts a number of key data design features. User device characteristics such as processor speed, screen size and resolution, and storage device speed must be considered by data content and structure designers when they design information access, grouping and display.

    Because it is difficult to mandate minimum delivery device capabilities in today's market, careful planning is needed at the design phase to ensure that data design does not render users with low-capability delivery devices unable to fully use the final information products.

  2. Heavy Penalty for Linear Navigation of Information:

In short, electronic display devices can't display as much data as fits on a paper page, don't display with the same level of resolution and are often maddeningly slow in sequential browsing from screen to screen. This is the "dark side" of electronic delivery technology's ability to instantly link users from indices or tables of contents, directly into the referenced data content.

These shortcomings put heavy pressure on information authors to provide the necessary data links to allow users to go directly to what they wish to see, whether from index to content or content to related content. While the former of these two traversal paths (index to content) can be accommodated via linking an external index to "entry points" in the data content, the latter (content to related content) is more subtle.

Having entered the content at the most detailed subject level provided in an external index, users often find that what they actually want to read does not display with the opening screen. They are forced to "traverse" to the appropriate point which, because most display screens hold only about 2,000 characters (versus 5,000 to 6,000 on the average paper page), may be several screens forward or back in the file. If this traversal must be accomplished by "browsing" from screen to screen looking for the desired point, the user pays a heavy time penalty (up to 600 percent) due to the relatively slow access speed of CD-ROM data storage devices.
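The screen and page capacities quoted above imply that one paper page spreads across roughly three screens. A small calculation makes this concrete. It uses only the character counts given in the text; the function name is invented for the example, and the time penalty itself is not modeled.

```python
import math

SCREEN_CHARS = 2_000          # screen capacity cited in the text
PAGE_CHARS = (5_000, 6_000)   # paper page range cited in the text


def screens_needed(chars, screen_chars=SCREEN_CHARS):
    """Number of screens required to show a given number of characters."""
    return math.ceil(chars / screen_chars)


screens_per_page = [screens_needed(c) for c in PAGE_CHARS]
print(screens_per_page)  # [3, 3]
```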

To avoid this kind of usage penalty, data design must provide the basis for intra-content traversal without the need to browse. For example, some large horizontal data structures such as tables might be redesigned as vertical, list-like structures with a machine-generated, table of contents-like structure attached to the top-level entry element. Such a data structure, containing links between the entry point and each stub element, would allow the delivery software to display a list of stub values to a user upon entry into the structure, then jump to the appropriate area within the structure upon the user's choice.
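One way to picture such a structure is the following minimal sketch. It assumes a table can be treated as a list of rows, each with a stub (row label) and named cells. The class, function and field names, the identifier scheme and the sample values are all invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Row:
    stub: str    # the row label, or "stub" value
    cells: dict  # column name -> cell value


def to_vertical(rows):
    """Restructure a table as list-like entries plus a generated contents list."""
    contents = []  # machine-generated (stub, entry id) links
    entries = []
    for n, row in enumerate(rows, start=1):
        entry_id = f"entry-{n}"  # hypothetical identifier scheme
        contents.append((row.stub, entry_id))
        entries.append(
            {"id": entry_id, "stub": row.stub, "items": list(row.cells.items())}
        )
    return {"contents": contents, "entries": entries}


def jump(structure, stub):
    """Go straight to the entry for the chosen stub, with no browsing."""
    target = dict(structure["contents"])[stub]
    return next(e for e in structure["entries"] if e["id"] == target)


# Invented sample data.
table = [
    Row("Alpha", {"Rate": "5", "Term": "10"}),
    Row("Beta", {"Rate": "7", "Term": "12"}),
]
structure = to_vertical(table)
stub_list = [stub for stub, _ in structure["contents"]]  # shown on entry
chosen = jump(structure, "Beta")
print(stub_list)
print(chosen["items"])
```

The contents list is what the delivery software would display on entry into the structure; the jump replaces screen-by-screen browsing with a single link traversal.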

 

D. REMOTE, ON-LINE SERVICE DELIVERY:

This environment presents all of the limitations encountered in media-based environments as described above, with one additional limitation. In feeding data to a remote service such as Mead Data Central or Dialog, an information provider must support delivery software over which it has little control or influence. This is important because customers of the remote service are likely to blame the information provider, not the delivery service, if they encounter problems or shortcomings in the delivery. Accordingly, it is to the provider's advantage to exert as much control over each remote delivery service as possible. Because the exchanged data files are the only contact between the information provider and the services it feeds, control of the data is of paramount importance in leveraging the specific capabilities and constraints of each system.

Data designers must be sensitive to this need in the way they represent data structures in SGML. Tables, for example, are often incapable of effective display by on-line services and, accordingly, should be structured for easy conversion to simpler formats.
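As an illustration of what "easy conversion to simpler formats" might mean, the sketch below flattens a small table into plain labeled lines that a text-only display could show. The function name, the output layout and the sample values are assumptions made for the example, not a format required by any particular service.

```python
def table_to_lines(header, rows):
    """Flatten a table into one stub line followed by 'label: value' lines."""
    lines = []
    for row in rows:
        stub, *values = row  # the first column is treated as the stub
        lines.append(stub)
        for name, value in zip(header[1:], values):
            lines.append(f"  {name}: {value}")
    return lines


# Invented sample data.
header = ["Item", "Rate", "Term"]
rows = [["Alpha", "5", "10"], ["Beta", "7", "12"]]
flat = table_to_lines(header, rows)
print("\n".join(flat))
```

The conversion is this simple only because every cell can be tied to exactly one stub and one column label. A table whose meaning depends on visual layout would not convert so easily.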

 

III. DESIGN CONCEPTS FOR ELECTRONIC PROCESSING:

This section outlines some basic data design concepts that can be effectively used in dealing with the problems of heterogeneous usage environments as described above.

  1. Maximum Addressability through Nesting:

    When designing data structures for editorial or end-use processing, logical structures should be nested to the maximum degree consistent with prudent SGML design concepts. This strategy ensures that software used to process and deliver the data can easily identify and access logical data structures with a single action. The oft-encountered tendency to represent logically nested data elements as physical "siblings", while workable in a composition environment, seriously complicates delivery support software and limits available functionality.

  2. Avoid use of "Default" Elements:
  3. Avoid Representing Large Data/display Structures as complex "Tables":
  4. Don't ask too much of your text data:
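The nesting point can be illustrated with a short Python sketch. As before, well-formed XML stands in for SGML, and the element names and sample content are invented. The same content is shown once as physical siblings and once as nested sections.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: XML stands in for SGML, and all names are invented.
SIBLINGS = """
<doc>
  <heading>Scope</heading><para>First.</para><para>Second.</para>
  <heading>Terms</heading><para>Third.</para>
</doc>
"""

NESTED = """
<doc>
  <section><heading>Scope</heading><para>First.</para><para>Second.</para></section>
  <section><heading>Terms</heading><para>Third.</para></section>
</doc>
"""


def section_from_siblings(root, title):
    """Scan the siblings, collecting paragraphs until the next heading."""
    collected, inside = [], False
    for element in root:
        if element.tag == "heading":
            inside = element.text == title
        elif inside:
            collected.append(element.text)
    return collected


def section_from_nesting(root, title):
    """Address the whole logical structure with a single lookup."""
    section = root.find(f"section[heading='{title}']")
    return [] if section is None else [p.text for p in section.findall("para")]


from_siblings = section_from_siblings(ET.fromstring(SIBLINGS), "Scope")
from_nesting = section_from_nesting(ET.fromstring(NESTED), "Scope")
print(from_siblings)
print(from_nesting)
```

With siblings, the software must infer where the section ends. With nesting, the section is a single addressable unit that can be retrieved, moved or displayed as a whole.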

 

IV. CONCLUSION:

As automation drives the information industry to new heights of complexity and productivity, design of data, the raw material of this evolution, must keep pace with advances in capture, management and delivery. If we evolve our ability to process and deliver information, yet forget that our ways of building data must evolve with it, we will forever limit the value of our efforts. Accordingly, data design should be an open and live subject in every step we take toward the brave new world of electronic information.
