The Text Outline Project:
some preliminary thoughts about requirements
(reqs v. 1)

Before attempting to follow these thoughts about requirements, it will be very helpful to consult the outline example (1.3 MB).  Reviewing MS Word's outlining feature, and its ability to toggle between outline view and normal view, will help make the following much clearer.

It will also be necessary to review this glossary first.

Collation Project software planning

The first Textop subproject to launch will probably be the Collation Project, because (at least according to the current plan) it will generate the outline other projects will use.

Note also that we want to have internationalization (q.v.) built into the software from the very start.  That's why there will have to be at least three ways to interface with the database: a Content Contribution Interface, a Markup and Translation Interface, and a Display Interface.  For at least some of these interfaces, a series of columns is displayed, containing such data as the outline, the chunks at a node of the outline, a text, metadata for a chunk, and search results.  In general, what the user does within one column is constantly affecting what is displayed in another column.

Content Contribution Interface

It will help to review the screenshot first.

Very generally, this interface allows a contributor to divide a text into chunks and then put the chunks into an outline, which is itself editable.  The outline, the chunks found at a node of the outline, the text, and the metadata for a selected chunk might all live in different columns of the Content Contribution Interface.  Each column should be collapsible and resizable.

In column 1, the leftmost, is The Outline, although at any time only a small part of it can be displayed.  Only node headings, not chunks, are displayed here.  The outline is displayed in only one specific language at a time.  It should be possible to edit node headings using a click-and-drag interface.  Changes to the outline are trackable via a “Recent Changes” log, as in MediaWiki.  Since node headings and their positions must be subject to constant change, each node should have a unique ID (usually hidden, of course, from the user).  There might, perhaps, be a displayable node history and a forum of some sort attached to the IDs.
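
To make the point about node IDs concrete, here is a minimal sketch of an outline whose nodes keep a stable identifier while their headings and positions change, with every change appended to a "Recent Changes" log.  The names Node, Outline, and recent_changes are hypothetical, chosen only for illustration; nothing here is a settled design.

```python
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    node_id: int                      # stable; never reused or changed
    heading: str                      # free to change at any time
    parent_id: Optional[int] = None   # position in the outline; also free to change


class Outline:
    """Hypothetical in-memory outline; a real system would use a database."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.nodes = {}
        self.recent_changes = []      # (action, node_id, detail) tuples, oldest first

    def add(self, heading, parent_id=None):
        node = Node(next(self._ids), heading, parent_id)
        self.nodes[node.node_id] = node
        self.recent_changes.append(("add", node.node_id, heading))
        return node.node_id

    def rename(self, node_id, heading):
        self.nodes[node_id].heading = heading
        self.recent_changes.append(("rename", node_id, heading))

    def move(self, node_id, new_parent_id):
        self.nodes[node_id].parent_id = new_parent_id
        self.recent_changes.append(("move", node_id, new_parent_id))
```

Anything attached to a node (chunks, a history, a forum) would refer to node_id, so renaming or moving the node leaves those attachments intact.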

In column 2 are The Chunks that are found at the node currently highlighted in column 1.  When someone clicks on a node in the outline, then the chunks associated with that node are displayed (this might be just the default behavior).  This column might have various useful bits of information for the content contributor, e.g., node history and node forum might be displayed here, as well as an account of what properly falls under the node, or the node's definition.  Also, when one clicks on a text reference displayed under a chunk (see outline example), this should (by default?) center the display in the next (third) column, the text, on the part of the text from which the chunk is taken.

In column 3 is The Text; for example, Hobbes’ Leviathan.  Actually, there should be three text displays (editor view, markup view, and normal view), but the editor view, which displays the text in an easily readable and selectable format, is displayed by default in the Content Contribution Interface.  (The other views will be discussed below.)  When an editor selects some part of the text (that has not already been chunked), new metadata fields appear in the rightmost column.  The display should show clearly and elegantly how the text has already been chunked; so, for example, if Ch. 14, Para. 3 has been made into a chunk, then that paragraph should have brackets, or colors, or some other clear "marker" showing that it has been chunked already.  It should also be possible to click on that "marker" and bring up the metadata for that chunk, as well as the place in the outline where that chunk currently resides.  Note, however, that it should be possible to file the same chunk, and overlapping parts of the text (i.e., chunks that share some but not all sentences), in different parts of the outline.  This must be borne in mind in designing how markers are displayed.  We might set an upper limit on the number of places under which any sentence can be filed (e.g., three).  Also, there should be a Recent Changes function for the text, which displays a list of chunks that have been created, changed, or deleted recently.  Note that, probably, a chunk should not be considered created unless it has a specified function and summary (at least).
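
The marker display and the possible filing limit both come down to knowing, for each sentence, which chunks contain it and under how many nodes it ends up filed.  The sketch below shows one way to compute that.  The function names, the plain dictionaries, and the sentence and node IDs used with them are all assumptions made for illustration; the limit of three is only the example figure mentioned above.

```python
from collections import defaultdict

MAX_FILINGS_PER_SENTENCE = 3  # hypothetical cap, taken from the "e.g., three" above


def coverage(chunks):
    """Map each sentence ID to the sorted list of chunk IDs that contain it.

    `chunks` maps a chunk ID to the list of sentence IDs in that chunk.
    A sentence listed under two chunk IDs needs an overlapping marker.
    """
    covered = defaultdict(list)
    for chunk_id, sentence_ids in sorted(chunks.items()):
        for sid in sentence_ids:
            covered[sid].append(chunk_id)
    return dict(covered)


def over_limit(chunks, filings, limit=MAX_FILINGS_PER_SENTENCE):
    """Return the sentence IDs filed under more than `limit` distinct nodes.

    `filings` maps a chunk ID to the set of node IDs the chunk is filed under.
    """
    nodes_per_sentence = defaultdict(set)
    for chunk_id, sentence_ids in chunks.items():
        for sid in sentence_ids:
            nodes_per_sentence[sid].update(filings.get(chunk_id, ()))
    return sorted(s for s, nodes in nodes_per_sentence.items() if len(nodes) > limit)
```

The editor view could call coverage() to decide how to draw markers, and the filing step could refuse a drop when over_limit() would become non-empty.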

Bear in mind that contributors should be able to change what sentences are included in a particular chunk.  For example, a chunk might begin life consisting of just the first sentence of a paragraph; then someone might drag the second and third sentences of the paragraph into the chunk.  Moreover, we might want to say that the same set of sentences can be assigned to different parts of the outline under different summaries (that, at least, has proven to be an occasionally useful thing to do in the outline of the Leviathan)--unless we want to say, as we might, that the same chunk might have different summaries, and it is the summaries that are filed under a node, and not chunks.  Further thought is necessary here.  At the very least, though, we can say that chunks need to be editable, in which case a unique identifier of a chunk needs to be created.  The chunk's function and summary are then associated with that unique identifier, rather than with some sentences; in that case, which sentences are assigned to a chunk is itself a piece of metadata.
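
A minimal sketch of that last idea follows: the chunk is a record with its own identifier, and the list of sentences is just one more editable field.  The class name Chunk, the field names, and the sentence ID format ("lev.14.3.1") are hypothetical.  The is_created check reflects the suggestion above that a chunk needs at least a function and a summary.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Chunk:
    function: str        # e.g. "argument", "explanation", "description"
    summary: str
    sentence_ids: list   # which sentences belong to the chunk: itself metadata
    chunk_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def is_created(self):
        """A chunk counts as created only once it has a function and a summary."""
        return bool(self.function.strip() and self.summary.strip())
```

Dragging two more sentences into the chunk would then amount to extending sentence_ids; chunk_id, and everything filed against it, stays the same.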

In column 4, the rightmost, are The Metadata.  Under the current conception of the tool, the contributor selects part of the text, and upon doing so, the software generates a set of metadata fields attached to the selected sentences.  Technically speaking, what is stored is not the selected sentences themselves but a unique identifier of the sentences, based on the markup of the text; this allows corresponding sentences in other languages to be displayed.  Then some metadata fields pop up in column 4, and the contributor specifies the function of the selection (argument, explanation, description, etc.) and summarizes it in a sentence.  (Again, see outline example.)  The software automatically generates the text reference based on what text is selected (e.g.: "Hobbes, Lev XXVII 3").  Finally, the contributor saves the metadata and drags an icon, representing the chunk (including its metadata), from the rightmost (fourth) column to the leftmost (first) column, dropping it into the outline at the appropriate place(s).  Note that it should be easy to move chunks around from node to node in the second column, and that the same chunk should be able to exist in several places (but perhaps a limited number of places) in the outline.
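
As an illustration of the automatically generated text reference, here is a small sketch that builds a string like "Hobbes, Lev XXVII 3" from an author, an abbreviated title, a chapter number, and a paragraph number.  The exact format (Roman numeral chapter, Arabic paragraph) is inferred from that single example and is an assumption; the function names are hypothetical.

```python
ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
         (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
         (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]


def to_roman(number):
    """Convert a positive integer to a Roman numeral."""
    if number < 1:
        raise ValueError("Roman numerals start at 1")
    parts = []
    for value, symbol in ROMAN:
        while number >= value:
            parts.append(symbol)
            number -= value
    return "".join(parts)


def text_reference(author, short_title, chapter, paragraph):
    """Build a reference in the assumed 'Author, Title CHAPTER paragraph' form."""
    return f"{author}, {short_title} {to_roman(chapter)} {paragraph}"
```

For example, text_reference("Hobbes", "Lev", 27, 3) yields "Hobbes, Lev XXVII 3".  Other texts will no doubt need other citation conventions.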

It's possible that, instead of making a separate (fourth) column for the metadata, each chunk's metadata should be displayed in some sort of popup or hovering JavaScript bubble.

Markup and Translation Interfaces

The initial version of a text used in the system should be whatever scholars regard as the canonical version, in its original language.  Other editions and translations must be includable in the database, marked up according to the same convention, so that the markup schemes for a text and for all its editions and translations are identical (or mappable, anyway); so, for example, if there is some tag identifying a sentence in the original, there must be an identical tag in an isomorphic position identifying a corresponding sentence in the translation.

Markup Interface

So there needs to be a markup interface, displayed to people working "behind the scenes" to mark up the text with tags indicating parts, chapters, paragraphs, and so forth--in other words, marking the structure, locations, and interrelations of the parts of the text.  Some of this markup should be generated automatically, but it will no doubt need to be edited by hand.  Then, when a translation is added, it is linked to the original edition and marked up automatically using the original markup as a template.  This will make it possible for people viewing the outline and the chunks in it to toggle back and forth between languages.  This will have many nice effects; for one thing, it will make it possible to display a text in some desired language even if there is no summary of the chunk, yet, in that language.

For purposes of marking up translations of a text, there should be two mutually-tracking columns, the first containing the original text and its markup, the second containing the translation and whatever automatically-generated markup was added to it.  A markup editor then scrolls through the translation, with the original text window scrolling along (by default) to the same positions according to the default markup, and the markup person (who reads both languages) does copyediting, making sure that the same tags are used for corresponding chapters, paragraphs, and sentences.  (Given the vagaries of translation, whole sentences will probably have to be the smallest level of markup granularity.)  Finally, a utility should check that the tagging scheme of the translation is identical to the original.  It should not be possible to start chunking any text, original or translation, until the markup is perfect--since the rest of the software requires that text locations are set and not changeable.
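
The checking utility mentioned above might, in its simplest form, compare the ordered lists of structural tags in the two texts.  The sketch below assumes the markup has already been flattened into a sequence of tag IDs such as "ch14.p3.s1"; that flattening, the ID format, and the function names are all hypothetical.

```python
def markup_mismatches(original, translation):
    """Compare two ordered tag sequences and return a list of problems.

    An empty list means the translation's tagging matches the original's.
    """
    problems = []
    original_tags, translation_tags = set(original), set(translation)
    for tag in original:
        if tag not in translation_tags:
            problems.append(f"missing in translation: {tag}")
    for tag in translation:
        if tag not in original_tags:
            problems.append(f"not in original: {tag}")
    # Same set of tags, but the sequences still differ.
    if not problems and list(original) != list(translation):
        problems.append("tags differ in order or count")
    return problems


def may_start_chunking(original, translation):
    """Chunking is allowed only when no mismatches remain."""
    return not markup_mismatches(original, translation)
```

The list of problems could be shown to the markup editor alongside the two mutually-tracking columns.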

Translation Interface

All parts of all interfaces should be displayable in about a dozen common languages to begin with.  But this means that there have to be tools for closely comparing and translating text from one language to another.  In what follows we will briefly describe some of these tools.

It may be helpful to see internationalization for relevant policy considerations.  As argued there, it will probably be necessary for there to be one "master version" of the outline (but not necessarily of all other content), in English.  The reason for this is that, without a master version, there will be one outline per language, which will not achieve the remarkable and unprecedented benefits of tearing down language barriers.  Remember that outline nodes will have unique identifiers that are independent of position or header wording.  This means that the same node can be assigned multiple translations, and that the reordering and renaming of nodes in the English master version will not "break" versions in other languages.

So, first, there should be an outline translation utility.  The basic functionality of this utility is that it should display the English adjacent to the target language.  It should also track and display nodes according to a triage system: new and higher-level added and changed nodes are placed in a queue for translation first; minor changes at lower levels are placed further down in the queue.  As long as the queue can be worked on by multiple people interchangeably, as this system should allow, it seems that, with enough volunteer translators, the outline can be kept up-to-date in all languages nearly simultaneously.
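
The triage system could be sketched as a priority queue.  In the sketch below, the three kinds of change ("new", "changed", "minor"), their relative ranking, and the rule that shallower nodes come first within a kind are all assumptions; the text above fixes only the general idea that new and higher-level nodes go first.

```python
import heapq
import itertools


class TranslationQueue:
    """Hypothetical triage queue for outline nodes awaiting translation."""

    KIND_RANK = {"new": 0, "changed": 1, "minor": 2}   # lower rank is served first

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker: first in, first out

    def push(self, node_id, kind, depth):
        """Queue a node; `depth` is its level in the outline (1 = top level)."""
        entry = (self.KIND_RANK[kind], depth, next(self._seq), node_id)
        heapq.heappush(self._heap, entry)

    def pop(self):
        """Return the node ID that should be translated next."""
        return heapq.heappop(self._heap)[-1]

    def __len__(self):
        return len(self._heap)
```

Because each pop() hands out exactly one node, several translators could draw from the same queue without duplicating work, provided the real system makes that step atomic.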

Second, consider that, regarding the Collation Project, there is a fundamental decision to be made whether different languages will have all the same chunks, or instead whether the same text will be chunked differently for different languages.  Given consistent markup, as described above, different languages can have the same chunking.  As argued in internationalization, probably, all languages should have the same chunking.  In that case, clearly, the way a text is chunked should depend on how it is chunked in the original language.  And in that case, it would be helpful to have a tool that allowed translators to compare the same chunk in different languages, displaying the chunk in both languages, the chunk summary in the original language, and a prompt for a translation into the target language.  As with the outline translation utility, untranslated summaries (or summaries changed in the original language) should be placed in order of priority, or triaged.  But note that this is not strictly necessary--one could go through a text in the target language without consulting the original, if (as seems likely) there is no reason for summaries to be translated rather than written from scratch.

Third, there should be a utility that compares markup of different translations of the same text, so that the markup is isomorphic.  The outlining of a translation should not be permitted to proceed until the markup has been proven to be isomorphic to the original.

Display Interface

The Display Interface is the one presented to the user (e.g., a person who is using Textop to do a comparative study of texts), not the contributor.

Searching

The user is likely to use search as much as anything.  It should be possible to select what to search: the outline headings, the summaries, the sentences, and/or the other metadata; and selected subsets of these (e.g., 18th century philosophical texts in French; only arguments; only one particular text).  How search results are displayed might or might not differ depending on the type of search; certainly in any case search should be configurable.  One main search result display would first place results in the context of the outline, and would be multi-step, as follows.  First, in the left-hand column, after the user types in search terms, the search results appear below the search box.  Second, one can click on a result and then the parts of the outline where the result is found appear in the next column.  The outline should be displayed in such a way that a large number of results are displayed as part of the same outline, with non-matching nodes being collapsed and invisible.  Third, the user clicks a node and in the third column the chunks associated with that node are displayed.  Alternatively, if texts were included in the search, the third column automatically centers the displayed chunks on the first matching text.  But perhaps this method of search would be most appropriate for searching the outline itself.
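
The idea of selecting what to search can be sketched very simply.  The function below searches only the fields the user has chosen and can optionally be restricted to one chunk function (e.g., only arguments).  It uses plain case-insensitive substring matching over dictionaries, which is an assumption made for brevity; a real system would presumably use a proper full-text index.  The field names are hypothetical.

```python
def search(chunks, terms, fields=("summary",), function=None):
    """Return the IDs of chunks in which every term occurs in the chosen fields.

    `chunks` is a list of dicts with an "id" key plus searchable fields.
    `function`, if given, keeps only chunks with that function.
    """
    wanted = [term.lower() for term in terms]
    hits = []
    for chunk in chunks:
        if function is not None and chunk.get("function") != function:
            continue
        haystack = " ".join(str(chunk.get(f, "")) for f in fields).lower()
        if all(term in haystack for term in wanted):
            hits.append(chunk["id"])
    return hits
```

The returned chunk IDs could then be mapped to their nodes so that results are shown in the context of the outline, as described above.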

Another type of search method would first display lists of matching summaries or of whole chunks.  Then the user could click on a summary or a chunk, and then its "first" location in the outline would be displayed, with its surrounding outline context.  At the same time, the context of the summary or chunk in the text it is taken from would be displayed.

Yet another type of search method would, after many texts had been added to the database, display matching texts, based on whether the search terms are found in their summaries.  Then the user might open up a given text and, at the same time, be shown a list of places where the search terms are found in the text, together with a summary of the chunks in which the search terms are found.

As you can see, this whole system, while complex, would offer tremendous and unprecedented text searching power.  If well designed, it is the search function that would make Textop an invaluable research tool.  And none of this is to mention the possibilities inherent in the semantic markup of classic texts--and Textop would be a natural venue for such markup to be done.

Outline browsing

Sometimes users will simply want to drill down through the outline to find topics they are interested in, particularly when they don't have a name or a clear idea of what they should be searching for.  The basic functionality here is that one clicks one part of a header, such as a plus sign at its left side, to view the child nodes of a node, and one clicks on the header text itself to view (in the next column over) the chunks that live at that node.  It should also be possible to collapse all chunks and view just the summaries and sources.

As there will be many thousands if not millions of outline headers, obviously not all of the outline will be able to be displayed at once.  Hence, the outline browser should probably re-center the outline based on the most recent click, and (at the user's option) close open nodes that are not parents or children of the current node.  Furthermore, the user might for simplicity's sake want to hide all nodes except a given node (which might be buried deep) and its children.
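
Deciding which nodes stay open after a click can be sketched as follows.  The function takes a mapping from each node ID to its parent's ID (None for top-level nodes) and returns the current node, its chain of ancestors, and its direct children.  Reading "parents or children" as ancestors plus direct children is an interpretation, and the function name and data layout are hypothetical.  The sketch assumes the outline is a tree with no cycles.

```python
def nodes_to_keep_open(parent_of, current):
    """Return the set of node IDs that remain open around `current`."""
    keep = {current}
    node = current
    # Walk up to the top of the outline, keeping every ancestor.
    while parent_of.get(node) is not None:
        node = parent_of[node]
        keep.add(node)
    # Keep the direct children of the current node.
    keep.update(child for child, parent in parent_of.items() if parent == current)
    return keep
```

Every open node not in the returned set would be closed, if the user has chosen that option.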

Furthermore, as with the Digital Universe's Universal Navigator, the user should be able to select "bottom row favorites," except that in this case they will be top level (or left side) favorites.  In either case, in a large system, one needs to be able to compile a list of favorites just as one can compile favorites in a Web browser.

Note also that special browsing tools might be associated with the part of the outline that takes the form of a chronology or timeline, although in principle this doesn't seem to be any different from the rest of the outline.

Other kinds of browsing

Another handy way to browse the database would be to view all the chunk summaries for a single given text.  Since the text would have already been marked up into parts, chapters, and so forth, a handy display tool could be written that would automatically compile a summary of the entire text.  Students would have free "Cliff Notes" for a huge number of public domain texts, not just the usual suspects.
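
Such a display tool could be little more than a sort and a grouping.  The sketch below assumes each chunk record carries a chapter number, a paragraph number, and a summary; those field names and the indented plain-text layout are hypothetical.

```python
def compile_summary(chunks):
    """Assemble chunk summaries into an outline of the text, in reading order."""
    by_chapter = {}
    ordered = sorted(chunks, key=lambda c: (c["chapter"], c["paragraph"]))
    for chunk in ordered:
        by_chapter.setdefault(chunk["chapter"], []).append(chunk["summary"])
    lines = []
    for chapter, summaries in by_chapter.items():
        lines.append(f"Chapter {chapter}")
        lines.extend(f"  {summary}" for summary in summaries)
    return "\n".join(lines)
```

Texts divided into parts as well as chapters would need one more level of grouping, but the approach would be the same.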

Summary browsing might be particularly useful for proprietary texts that the project has summarized, but the chunks of which are not viewable: at least the summary would be viewable, which would help scholars to determine if buying or otherwise obtaining the text is worthwhile.

How multiple languages affect browsing

The display interface should make it easy to switch languages.  Having identically marked-up translations will make it possible for the right chunks to appear under the right nodes, regardless of what language the user has chosen as default, and regardless of what language the original text was written in.

Note that, as long as a translation of a text has been marked up, even if summaries have not been written for the translation, it will be possible to display the sentences themselves in the proper part of the outline.  And that could be useful.  But, while the outline is growing, there are bound to be parts of the outline, the summaries, and chunks, depending on the language displayed, that are available in one language and not others.  The user, therefore, should be given a choice of languages.  For example, it should be possible for a user to tell the system to display French if available, and failing that English, and failing that German.  Moreover, this needs to be true of every part of the system.  So if a newly-created part of the outline itself is available only in English, and a user's default language is French, then those node headers should be displayed in English.  But for clarity it should be obvious from a glance what is available only in a non-preferred language; for example, the English-only node header would be (say) colored red for a Parisian.
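
The language preference rule can be sketched in a few lines.  The function below walks the user's ordered list of preferred languages and returns the first one for which the item (a node header, a summary, a sentence) exists, along with a flag saying whether a non-preferred language had to be used, so that the display can color it.  The function name, the language codes, and the dictionary layout are assumptions.

```python
def pick_language(available, preferences):
    """Choose the best available version of an item.

    `available` maps language codes to text; `preferences` lists language
    codes in the user's order of preference.  Returns a tuple
    (language, text, is_fallback), or None if nothing matches.
    """
    for rank, language in enumerate(preferences):
        if language in available:
            return language, available[language], rank > 0
    return None
```

A user who prefers French, then English, then German would pass ["fr", "en", "de"]; an English-only node header would come back with is_fallback set to True and could be shown in red.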

General notes

So this tool is about the creation, search, and display of texts analyzed into chunks and organized into an outline, but also viewable in their original form.  So what kind of tool is that?  It's a tool for collating texts and viewing the results.

Note that, for the sake of usability, it might be convenient to display the same types of data always in the same column.  In fact, instead of hard-wiring particular "displays," as the above might be taken to imply, one might instead want to let the user have all possible columns displayed at the same time (at least in principle), opening and closing columns as needed or not.  This way the display is maximally user-friendly because it is designed by the user.  Some users, for example, might want to search through the outline when they are summarizing a text; others might find the search column unhelpful while summarizing, because they don't put their chunks immediately into the outline while summarizing.

Also note that scholars may well wish to include several different same-language editions of the same work in the database.  This is important because bits of text in one edition sometimes do not appear in another edition.  The implications of this for the system are not examined above, but obviously it would complicate matters even further.

Adapting and augmenting this software for the other subprojects

The discussion above concerns the requirements only for the Collation Project.  According to our current reasoning, since it is that project that will generate the outline used by the other projects, the Collation Project will be launched first, and the other projects will be launched only after the Collation Project is well under way.

It is worth thinking now about how the other projects might make use of the software framework described above.

It seems clear that for the Analytical Dictionary Project as well as the Event Summary Project, the additions could be relatively minimal.  The system would of course have to distinguish between a text chunk, a dictionary entry, and an event summary, but they would all be assigned to particular nodes.  Beyond that, a dictionary entry and an event summary might essentially work like wiki pages.

But we need only to look at the special requirements of the fourth planned project, the Debate Guide Project, to be impelled to think twice about setting up just a simple text editor or wiki.  New projects usually inherently carry with them new requirements and significant system changes.  In the case of the debate guide, the differences are obvious.  A debate guide will set opposing arguments on specific questions side-by-side.  It might also allow the guide writers to elaborate points and sub-points, so that the entire debate guide can take an outline form.  Sub-questions might be linked to particular nodes.  Note that if a node at which a particular debate guide entry lived were deleted, then the software should prompt the person editing the outline to assign the debate guide to another entry, or it should in some other way make sure that people are not disoriented by broken and missing links.  The point is that adding new functionality to the original Collation Project system will complicate the original system, so coders building the system in the first place should be aware of the necessity of making the code maximally extensible.
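
The point about deleted nodes can be illustrated with a small guard.  In the sketch below, deleting a node that still holds debate guide entries fails unless a destination node is named, in which case the entries are moved there first.  The function name, the dictionary from node IDs to lists of entry IDs, and the use of an exception in place of an interactive prompt are all assumptions.

```python
def delete_node(entries_at, node_id, reassign_to=None):
    """Delete `node_id`, refusing to orphan any entries that live there.

    `entries_at` maps node IDs to lists of entry IDs and is modified in place.
    """
    entries = entries_at.get(node_id, [])
    if entries and (reassign_to is None or reassign_to == node_id):
        raise ValueError(
            f"node {node_id!r} still holds {len(entries)} entries; "
            "choose another node to reassign them to"
        )
    if entries:
        entries_at.setdefault(reassign_to, []).extend(entries)
    entries_at.pop(node_id, None)
```

In the real interface the exception would become a prompt asking the outline editor where the entries should go.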

If we allow ourselves to think freely about the ideal system for the Analytical Dictionary Project, then we might produce a whole set of different software proposals.  A simple wiki page would be only one.  Another would be a collaborative wiki-like database, in which, under a given concept heading, such as "affection," the contributor would be prompted for different pre-defined classes of data, such as words, idioms, slang, and jargon; then, for each of these, what distinguishes them from other items that fall under the heading, what connotations they have, their cultural associations, some representative quotations, and so forth.  Then project designers could massage this data into something very interesting and usable.

On second thought, the Event Summary Project, too, would seem to have some obvious special requirements.  To make an event summary maximally useful to the reader, there should be some mechanism whereby the latest-added news is displayable somehow: the reader might, for instance, be able to highlight edits that were made since the last visit, or to aggregate them somehow at the top of an article for a quick update on a developing story.  But event summarizers (the contributors) would, then, want to distinguish mere copyedits and rewordings of relatively old news from substantive new additions.  So there should be a way to mark, or otherwise generate information about, the difference between a "copyedit" and a "substantive addition."  Furthermore, since event summaries concern ongoing, complex events, which are not neatly distinguished into nicely separate bundles in advance, there needs to be a way to separate a single summary into others; the best way to achieve that is not obvious.
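
The "quick update" idea can be sketched by tagging each edit with a kind and filtering on it.  In the sketch below the contributor marks each edit as "copyedit" or "substantive" when saving; the Edit record, its field names, and the function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Edit:
    when: datetime
    kind: str    # "copyedit" or "substantive", chosen by the contributor
    text: str


def updates_since(edits, last_visit):
    """Return the text of substantive edits made after `last_visit`, newest first."""
    fresh = [e for e in edits if e.kind == "substantive" and e.when > last_visit]
    fresh.sort(key=lambda e: e.when, reverse=True)
    return [e.text for e in fresh]
```

The resulting list could be aggregated at the top of an event summary, or used to highlight the corresponding passages in place.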

We could, simply because it seemed convenient, use a wiki for all of these projects.  But participants and users and the world at large will thank us if we create specially-designed tools that achieve the very specific requirements of the tasks we set out to accomplish.

As you can see, we're talking about a very complex piece of software.  Coders, are you up to the challenge?
