The Collation Project: Summary

The idea of an outline of texts

Imagine a scholar working through a digital copy of Thomas Hobbes' Leviathan, dividing it into chunks of approximately one paragraph in length, with each chunk distinguished by its linguistic function (e.g., assertion, argument, explanation).  Imagine the scholar labelling the chunks by function, summarizing them, and placing them into a single outline, reordered, beginning with the most abstract topics (e.g., Metaphysics) and working down to applied topics (e.g., Political Philosophy and Philosophy of Law).  For example, consider this text from Leviathan, Chapter XVII, Paragraph 4:

For if we could suppose a great multitude of men to consent in the observation of justice, and other laws of nature, without a common power to keep them all in awe, we might as well suppose all mankind to do the same; and then there neither would be, nor need to be, any civil government or Commonwealth at all, because there would be peace without subjection.

This chunk of text might be filed under Political Philosophy most broadly, then (drilling down to a specific place in the outline): The State; The goals and functions of the State; The size of the State; Anarchy; and finally Desirability of Anarchy.  Just this sort of outline was attempted for the philosophical parts of the Leviathan here.
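The filing step just described can be pictured as a small data structure. What follows is only a minimal sketch in Python, not a design for the actual tool: the names `Chunk`, `OutlineNode`, and `file` are hypothetical, and the "argument" label and one-line summary are illustrative guesses at what a scholar might assign to this passage.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A roughly paragraph-sized passage, labelled by linguistic function."""
    source: str    # where the passage comes from
    function: str  # e.g. "assertion", "argument", "explanation"
    summary: str   # the scholar's short summary of the passage


@dataclass
class OutlineNode:
    """One heading in the outline, holding sub-headings and filed chunks."""
    title: str
    children: dict = field(default_factory=dict)
    chunks: list = field(default_factory=list)

    def file(self, path, chunk):
        """File a chunk under a path of headings, creating headings as needed."""
        node = self
        for title in path:
            node = node.children.setdefault(title, OutlineNode(title))
        node.chunks.append(chunk)
        return node


outline = OutlineNode("Outline")

# Illustrative label and summary (assumptions, not taken from the project).
chunk = Chunk(
    source="Leviathan, Chapter XVII, Paragraph 4",
    function="argument",
    summary="If men kept the laws of nature without a common power, "
            "no civil government would be needed.",
)

# The path of headings named in the example above.
path = [
    "Political Philosophy",
    "The State",
    "The goals and functions of the State",
    "The size of the State",
    "Anarchy",
    "Desirability of Anarchy",
]

leaf = outline.file(path, chunk)
print(leaf.title, len(leaf.chunks))
```

Chunks from other works filed under the same path would land in the same `chunks` list, which is what makes side-by-side comparison possible.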

Imagine the scholar repeating this process for, say, the fifty most influential works of philosophy, with all the chunks of text summarized and collated into the same outline.  This would take a single scholar on the order of 5-10 years, depending on diligence.  The result, however, would be an unprecedented, highly interesting, and highly useful resource (as is explained further below).  But no scholar could do this single-handedly for all the important texts of philosophy; there are just too many of them.  Only a group of collaborators could do that.

Imagine, if you will, the entire corpus of scholarly texts--at least public domain texts--on every subject and from every nation, accessible from the same outline, using a specially designed piece of software that lives on the Web.  This project, and the outline (as we will call the result), would require a massive collaborative effort.  The result would, however, more than justify the effort, and as this project summary argues, the idea richly deserves your involvement.

First, however, it will help clarify the proposal to describe the tool that would be used to construct and manage the resource.

The tool

Our initial conception of the tool that will support this project calls for a Web interface that is divided into four columns: the outline, the chunks, the text, and metadata (i.e., annotations).  The columns would be divided like this: 

This is just an initial conception.  There are many other necessary features, and software engineers are invited to help design the optimum open source system for the project.

The scholarly problems the outline would solve

The proposal to outline a vast assortment of scholarly texts in this way is extremely ambitious.  What could motivate and justify such an enormous amount of work?  Why think that anyone will ever be impelled to work on such a project?

Even a relatively incomplete outline would have the potential to revolutionize scholarship.  First, consider how research in various fields could be made more efficient, offering unprecedented opportunities for finding and connecting information:

The larger impact on the practice of education and scholarship itself

Considering the above examples, one might take the position that, while an integrated outline of scholarly texts would be a very handy research tool indeed, the mere convenience of the tool might not be enough to justify the enormous labor that would go into it.  But this project, successfully prosecuted, would not merely make scholarship more convenient; it would be revolutionary, and research and schooling would never be the same again.

The outline would mean the end of academic provincialism: the comments of various theorists, from different disciplines, all appearing in the same place would make it abundantly clear what interdisciplinary reading needs to be done.  The artificial lines between disciplines come crashing down when one sees exactly how others, in other fields, are commenting on one's pet topic.  If contemporary papers are properly collated into the outline, conference organizers can see better who to invite for a specific purpose--not just the people who happen to work in the organizer's own field, and who publish in the same journals in which the organizer publishes.  And so forth.

Even more significantly, the outline would create a worldwide scholarly conversation, and this for a perhaps unexpected reason.  Assuming that a canonical text, in its original language, is marked up so that translations can be marked up in exactly the same way, scholars from around the world can look at the same outline in their own language.  Now consider one obvious filter option: if a text has not been translated into your language, then display it in its original language, or in another language that you know.  Suppose as a scholar you "live" in a certain area of the outline, and some untranslated (say) Polish texts appear in that area; then you have an excellent reason to get those texts translated, and when you do, suddenly you and others like you are joined to a global conversation that simply did not exist previously.  By having a single outline for all the world's works, in all languages, collating small chunks of all texts into a single very fine-grained outline, we essentially bring researchers of all languages together--and the wisdom of humanity is explosively enriched.
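The language filter mentioned above might work roughly as follows. This is a sketch under stated assumptions, not the project's design: the function name `pick_version`, the use of language codes, and the rule "try the reader's languages in order of preference, then fall back to the original" are all hypothetical choices, and the texts are placeholders.

```python
def pick_version(versions, original_language, preferred_languages):
    """Choose which version of a chunk to display to a reader.

    versions: dict mapping a language code to the chunk's text in that language
    original_language: language code of the original text
    preferred_languages: the reader's languages, most preferred first
    Returns a (language, text) pair.
    """
    for lang in preferred_languages:
        if lang in versions:
            return lang, versions[lang]
    # Nothing available in a language the reader knows: show the original.
    return original_language, versions[original_language]


# A Polish text with a German translation but no English one (placeholder texts).
versions = {"pl": "<Polish original>", "de": "<German translation>"}

reads_english_and_german = pick_version(versions, "pl", ["en", "de"])
reads_english_only = pick_version(versions, "pl", ["en"])
print(reads_english_and_german)
print(reads_english_only)
```

The second reader is shown the untranslated Polish original, which is exactly the situation that gives a scholar a reason to commission a translation.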

Learning and education, too, would be revolutionized.  A student who is assigned to write a paper on a specific topic, using one or two main source texts, can instantly find where a specific passage is located in the outline, and then consult other texts from a wide assortment of thinkers on that precise point.  And if the outline had indexed all public domain texts the student might possibly be expected to cite--as in a history, literature, philosophy, or classics class--the student is saved trips to the library stacks.  Of course, precisely the same point applies to scholars as well.  Scholarly research to find credible sources for specific facts, theories, quotes, etc., previously requiring long trips to the library, becomes nearly trivial for everyone.

Teachers who are designing reading lists, who know what general topics they wish to teach but who have not selected specific texts for those topics, would be able to find in the outline enormous numbers of texts relevant to the topic, and make more interesting, fine-grained selections of readings.  Moreover, the central availability of public domain texts, and perhaps of proprietary texts as well (under some creative arrangement), would make the construction of ad hoc anthologies, sometimes called "course packets," trivial.  The availability and the appropriateness of readings increase, and thus so does the quality of education.

Why this project is both unprecedented and feasible

The outline would be a brand new kind of reference work.  Some people we have spoken with about this project have said that it has such obvious merit that it is puzzling that no one has ever done it before.  But the reason seems simple.  Before the advent of personal computers, shuffling large amounts of text was prohibitively expensive and labor-intensive.  The Scriptorium of the Oxford English Dictionary project might, perhaps, be an example; but a dictionary has a simple structure.  It is hard to imagine anyone exerting the effort to maintain an enormous outline, which is constantly expanding and changing, as the one envisioned here would have to be, without a computer.

But why did it not happen in the 1980s, when computers became commonplace?  One reason perhaps is that a quorum of texts had not yet been created.  The task of typing in or scanning texts was, you might recall, very far from trivial at the time.  Project Gutenberg was well under way, however.  So why not collate the texts that were already available?  The reason here is surely that few people would have understood at the time some basic principles of collaborative content production that wikis have made obvious, and the understanding of how to construct the tools to manage such a project was still relatively primitive.  How to organize the project surely would have been a major sticking point.

So, in fact, the time is exactly right to start this project.  In recent years, Wikipedia has demonstrated to the world what can be accomplished by people editing each other's text in a highly asynchronous, distributed collaboration.  The Open Content Alliance and the Amazon and Google book-scanning projects have many book-lovers salivating over the impending digitization of entire libraries.  Finally, the unreliability or irresponsibility of some collaborative content communities, like Wikipedia, the Blogosphere, and MySpace, has very many academics and professionals interested in "taking back the Web" for more serious, scholarly purposes--which is one reason that the Digital Universe, an expert-managed general information project, is growing in popularity.  All of these developments together will help persuade some plugged-in graduate students, professors, and others that the Text Collation Project is feasible.  We hope a quorum will be thus persuaded.

In short, until 2006, it simply was not feasible to integrate the intellectual products of humanity in the way imagined here.  Now it is.  You can think of the outline as The Book of the World, a collective production under development for many thousands of years, but not fully integrated or indexed until recently.

Skeptics are very right to ask how the project might come about.  We begin by asking a small number of people, from academe and the open source software community, to serve on an Advisory Committee.  The Advisory Committee and others involved discuss, amend, and give their go-ahead for the project.  Their names are then used, collectively, to help publicize the project and to get coders and participants involved.  Mailing lists are used to discuss the design of the supporting community and of the software.  When requirements are settled upon, a pre-existing open source software package is selected for adaptation, or a new software system is started from scratch.  In none of the plans described so far are there any unreasonable expectations; the project would resemble any number of collaborative projects online.

A number of coders have told us that the system contemplated does not seem to pose any unusual or insurmountable problems.  Scalability is a concern, but it is a concern with many collaborative content projects; more on scalability below.

So suppose that, some months from now, a usable beta version of the software is ready for testing.  This seems reasonable simply because the essential functionality required from the software--which involves shuffling text around--is not especially complicated.  The next step is to select, from within the community of people who have expressed interest, a small number to work on collating a few books to begin with.  It might help for all the books to be in one area, and in a fundamental area, so we might begin with (say) a half-dozen philosophy books.  Experts on those books are recruited to manage work done by graduate students (unless the experts want to do the work themselves).  We fully expect that a quorum of graduate students will be highly motivated to participate, because this sort of summarizing, comparing, and organizing is precisely the sort of work that graduate students typically do.  Their use of the software leads to new requirements, and some elaborations of project policies; once those things are ready, though, a broad-based effort can be initiated.  Editorial committees in various disciplines will be started that will oversee (but not necessarily direct in a top-down fashion) summarizing and outlining work.

We estimate that, for a person familiar with a text, its subject matter, and the outline as constructed so far, about four pages of an average-sized book can be "collated" (selected, summarized, and put in an expanding outline) per hour.  Suppose there are twenty distinct fields with fifty seminal books each, a book being an average of 200 pages.  That is a total of 1,000 books, or 200,000 pages, and some 50,000 person hours would be needed to get through those basic texts.  If there were 100 people involved in doing the work, each doing an average of two hours per week, they would collate 800 pages per week, and the entire job (of collating these basic, public domain texts) would take 250 weeks, or a little under five years.  To finish in a little over a year (62.5 weeks) would take 800 person hours per week: for example, 400 people at two hours per week each, or 100 people at eight.
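The estimate can be worked through with a few lines of Python, using the figures given above as defaults. The function name `estimate` and its parameter names are illustrative only.

```python
def estimate(fields=20, books_per_field=50, pages_per_book=200,
             pages_per_hour=4, people=100, hours_per_person_per_week=2):
    """Work out the size of the collation job and how long it would take."""
    books = fields * books_per_field
    pages = books * pages_per_book
    person_hours = pages / pages_per_hour
    hours_per_week = people * hours_per_person_per_week
    pages_per_week = hours_per_week * pages_per_hour
    weeks = pages / pages_per_week
    return {
        "books": books,
        "pages": pages,
        "person_hours": person_hours,
        "pages_per_week": pages_per_week,
        "weeks": weeks,
    }


# 100 people at two hours per week each.
baseline = estimate()
print(baseline)

# The same job with each person contributing eight hours per week.
intensive = estimate(hours_per_person_per_week=8)
print(intensive["weeks"])
```

Varying `people` or `hours_per_person_per_week` shows how strongly the schedule depends on the level of participation.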

We believe that we will easily be able to get a quorum of interested participants, because of the exciting and nearly self-evident benefits of the product--especially for lawyers and law students, philosophers, historians, and many others whose work involves careful reading and comparison of texts.  So, in fact, we might have more interested participants than we can easily manage.  It might turn out that we (the scholarly community the world over) can collate many thousands of books per year.

Then, with working software and enough participation to make rapid progress toward something really useful--both of which are not unreasonable to expect--there is excellent reason to think the project is feasible.

Replies to critical questions

Still, one would be very right to have objections and reservations, because of the sheer ambition of the project proposed.  Let us address a few of these critical questions here.

Surely the world's scholarly works contain far too much text for one outline.

Even if the project were to collate a relatively small number of works, it would still be a fascinating and highly useful reference.

Of course, one will not be able to view the entire outline at once, any more than one can view every page of every book in the library at once.  But this obvious observation does not seem to pose any show-stopping technical impediment.  There are bound to be many members and partner institutions, so that bandwidth, and component databases, can be spread around.

Moreover, the database will be filterable and presentation-neutral.  It will be possible to construct websites using the data, or subsets of the data, that do not resemble an outline visible on one page at all, but instead, say, a whole series of pages like Yahoo!'s directory, that are arranged hierarchically.
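To make "filterable and presentation-neutral" concrete, here is a minimal sketch in which the data is a flat list of records that is first filtered and then regrouped into one directory-style page per heading. The record fields, the function names `filter_records` and `directory_pages`, and the two sources marked hypothetical are assumptions made for illustration, not part of any agreed design.

```python
from collections import defaultdict

# Each record notes where a chunk is filed in the outline (illustrative data).
records = [
    {"path": ("Political Philosophy", "The State"),
     "source": "Leviathan", "language": "en"},
    {"path": ("Political Philosophy", "The State"),
     "source": "Hypothetical Polish text", "language": "pl"},
    {"path": ("Metaphysics",),
     "source": "Hypothetical treatise", "language": "en"},
]


def filter_records(records, **criteria):
    """Keep only the records whose fields match every given criterion."""
    return [r for r in records
            if all(r.get(key) == value for key, value in criteria.items())]


def directory_pages(records):
    """Group records into one page per heading, as in a hierarchical directory.

    Returns a dict mapping a heading path (a tuple) to its sub-headings and to
    the records filed exactly at that heading.
    """
    pages = defaultdict(lambda: {"subheadings": set(), "entries": []})
    for record in records:
        path = record["path"]
        for depth in range(len(path)):
            # Register each heading on the page of its parent heading.
            pages[path[:depth]]["subheadings"].add(path[depth])
        pages[path]["entries"].append(record)
    return pages


english_only = filter_records(records, language="en")
pages = directory_pages(english_only)
print(sorted(pages[()]["subheadings"]))
print([r["source"] for r in pages[("Political Philosophy", "The State")]["entries"]])
```

The same records could just as easily be rendered as a single expandable outline; nothing in the data commits a site to one presentation.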

Surely the outline will not be nearly as useful if it does not include proprietary works.  But Google's experience surely shows that you will not be able to include many proprietary works.  Doesn't this mean that the project in its finest imaginable form is not actually feasible?

This is a good point, and perhaps the best answer is simply, "Yes."  It is worth making three qualifications, however.  First, it is entirely possible that there will be fee-based versions of the resource, for use for example at research institutions, that do include the full text of most proprietary works.  Second, it appears prima facie legal to include summaries of texts, if not the texts themselves.  Again, the texts themselves might be available for a fee--something that publishers ought to appreciate.  Third, even supposing that the project remains always restricted to public domain works, the usefulness of the project in the older disciplines, such as the humanities, is still enormous and obvious.

How is it decided what the outline should look like?  Who makes the hard decisions?

They should be left up to the people who are experts in the subject matter in question (Editorial Committees).  There will surely, in any case, have to be a lot of negotiation and debate both within and between committees--which might itself prove to be very enlightening.

While there is no reason that there cannot be parallel markup of texts in multiple languages, how can the outline itself be built in multiple languages?  Surely people working just in English will want to make edits to the outline that are incompatible with what people working just in Russian will want.

This is discussed on the internationalization page.  This is definitely a hard problem, and unfortunately there is no ideal solution--but it is a problem that needs to be solved one way or another, or there is no project.  The simplest and perhaps the best approach is to create a single master copy of the outline, in a single language--and, as more of the scholarly community the world over knows English than any other language, English would have to be the common language of outline production.  On a regular basis, then, the English outline would be translated into other languages.  This would make it necessary for anyone adding to the outline, when doing work in another language, to know and work in English--which, for many, will be a problematic requirement.  There are other options, but it's not clear that they are any better.  Suppose that each language develops its own outline; then language becomes a barrier when a common project holds the promise for a new international intellectual community; besides, then we replicate work needlessly.  Suppose instead that there is a single outline, but that people can contribute to the outline in their own language.  But, unless we should be content with a veritable Babel, all work on the outline would have to be simultaneously translated into all approved languages as soon as it is done, and resolving disputes would be very difficult indeed, or impossible, for those who have no common language and inadequate resources for translation.  See internationalization for elaboration of these and other points.

What is likely is that the project will adopt English as the common language of the outline.  It would remain to be seen whether people in non-English-speaking countries who find this policy intolerable will start similar projects in their own languages.  Perhaps the world will not be poorer for this variety, as long as there is one project that can serve as a central meeting-point for the global scholarly community.
