Methodology
     
OUR GOAL

In our planning for the new archive web site, we had to make decisions regarding functionality and presentation. Our goal was to capture the information from each issue and re-present that content in the most useful way, while retaining the aesthetic integrity of each issue.

FULL TEXT VERSUS CONTENT SUMMARIES

Since we needed a digital format for the web, flatbed digital scanning was the obvious way to capture the information from each document. However, we had hoped to have a fully searchable archive, meaning that every word in every article could be indexed by our search engine. Although this would have maximized the functionality of the archive, it would have entailed an extremely complex and labor-intensive production process. Such a process would have required the conversion of every page into both image data and editable text data. The error-checking for the OCR (optical character recognition) program required to do this would alone have been an overwhelming workload. In addition, we would have had to reproduce the layout of each issue as faithfully as possible. Having neither the time nor the money to accomplish such an extensive task, we made a compromise.

Instead of relying on the full text of the articles, we decided to compile a database of all the relevant elements of each page of every issue. In other words, each issue has a corresponding "content summary" in the archive database. A content summary may include such elements as article headlines, authors' names, cartoons, photographs, drawings, and poems. Advertisements and most announcements were excluded.

In order to enhance the usability of the content summaries, we had to develop specific strategies for the content extraction process. Minor spelling errors were corrected and abbreviations were written out. All poetry titles were followed by "[A Poem]" and story titles by "[A Story]." Important photographs were included, especially when they were not connected with a story. Unsyndicated cartoons were also included, as long as they were clearly signed. When issue elements did not have a specific title, such as editorials or letters to the editor, we excerpted (in brackets) the sentence or sentences that summarized the author's message. When a headline was not specific, we often included keywords to make the content summary more descriptive. Although this was a somewhat arbitrary and imperfect process, it made the materials a great deal more searchable. For instance, to the headline "Bard is Victorious!" we might have added [Women's Volleyball Defeats New Jersey Technical Institute], or to "Fostering Understanding" we might have appended [Red Hook-Bard BRIDGES Program]. Whenever possible, articles were given detailed keywords. However, some articles may have escaped our scrutiny.
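The bracketing conventions described above can be illustrated with a minimal sketch. The function name `summary_title` and its parameters are hypothetical and were not part of our production process, which was carried out by hand; the sketch only shows how a headline, an optional "[A Poem]" or "[A Story]" tag, and optional bracketed keywords combine into one summary line.

```python
def summary_title(title, kind=None, keywords=None):
    """Build one content-summary line from a headline.

    kind: "poem" or "story" appends the bracketed tag named in the text.
    keywords: optional descriptive text, appended in brackets.
    (Hypothetical helper for illustration only.)
    """
    tags = {"poem": "[A Poem]", "story": "[A Story]"}
    parts = [title.strip()]
    if kind in tags:
        parts.append(tags[kind])
    if keywords:
        parts.append(f"[{keywords.strip()}]")
    return " ".join(parts)


# Example taken from the text above.
print(summary_title(
    "Bard is Victorious!",
    keywords="Women's Volleyball Defeats New Jersey Technical Institute",
))
```

Keeping every editorial addition inside brackets means a reader can always tell the original headline from the text we supplied.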

The content summary serves several important functions. Firstly, it acts as a substitute for the entire issue, providing the reader with the content of the issue at a glance. Therefore, the user can preview an issue online before they download its actual digital copy. Secondly, when one downloads an issue file, it will contain its own content summary, for quick reference when browsing through the issue's pages. Finally, in addition to being a research aid, the collection of content summaries is a database of keywords which can be indexed and retrieved by a search engine. This provides the user with limited, though very useful, search capabilities.
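To show how a collection of content summaries can act as a searchable keyword database, here is a small sketch of an inverted index. It is not a description of the search engine actually used by the archive; the issue identifiers (`issue-a`, `issue-b`) and the function names are assumptions made for illustration, and the two summary lines reuse the examples given earlier.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")


def build_index(summaries):
    """Map each lowercase word to the set of issue ids whose summary contains it."""
    index = defaultdict(set)
    for issue_id, lines in summaries.items():
        for line in lines:
            for word in WORD.findall(line.lower()):
                index[word].add(issue_id)
    return index


def search(index, query):
    """Return the issue ids whose summaries contain every word of the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical issue ids; the summary lines come from the examples in the text.
summaries = {
    "issue-a": ["Bard is Victorious! [Women's Volleyball Defeats New Jersey Technical Institute]"],
    "issue-b": ["Fostering Understanding [Red Hook-Bard BRIDGES Program]"],
}
index = build_index(summaries)
print(search(index, "volleyball"))
```

Because only the summaries are indexed, a query can match nothing beyond the headlines and bracketed keywords, which is why the added keywords matter so much for retrieval.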

ADOBE'S PORTABLE DOCUMENT FORMAT

As mentioned previously, the issues themselves need to be downloaded from the web site to the user's hard drive. The necessity of downloading has to do with the large size of the issue files. Each issue file contains between 4 and 40 very large images. Although we have taken every measure to reduce the size of these files, even the smallest files are still too large to be accessed efficiently over the web. Most files range in size from 1 MB to 3.5 MB. However, some large issues exceed 10 MB. Though these long downloads may be an inconvenience, we believe that we have employed the most cost-effective archiving method for the highest quality presentation.

A technological breakthrough in document distribution by Adobe made this particular archiving approach feasible. Every issue file in the archive is saved as an Adobe PDF (Portable Document Format) file. The format achieves exactly what it claims: maximum portability. This means that the user need not worry about platform compatibility, operating system version, font availability, or expensive software. PDF files are viewable with Adobe Acrobat Reader, which is distributed free of charge by Adobe. Click on this link to download Acrobat Reader 4.0.

The PDF format offers the essential advantage of excellent image compression; most issue files are a mere one-third of their original size. For example, twelve 11 in. x 17 in. bitmap images at 400 dpi resolution can be compressed to a manageable size of 3.3 MB. As a result of this compression technology, readers are able to view reproductions of the newspapers that are very similar to the originals.
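The arithmetic behind the twelve-page example can be sketched as follows. The calculation assumes that "bitmap" means a 1-bit (black and white) image stored with no compression and that 1 MB is one million bytes; neither assumption is stated in the text, and the function name `bitmap_bytes` is hypothetical.

```python
def bitmap_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Uncompressed size in bytes of a scanned page (assumes 1 bit per pixel by default)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


pages = 12
raw = pages * bitmap_bytes(11, 17, 400)  # twelve 11 in. x 17 in. pages at 400 dpi
print(raw / 1_000_000)  # 44.88 (MB, uncompressed, under the assumptions above)
```

Under these assumptions the raw images would occupy roughly 45 MB, so the 3.3 MB figure reflects a much larger reduction than one-third when measured against uncompressed data. The one-third figure quoted above is presumably measured against a different baseline, such as scans already saved in a compressed format, which is not specified here.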

Adobe Acrobat Reader allows the user to navigate through the issues with ease. Users can scroll through the images and zoom in on areas that catch their interest. Though the issues are very readable on-screen, another advantage of PDF files is enhanced printability. Since PDF is based on the same imaging model as PostScript (the language many printers use to interpret and execute print jobs), printed output is highly consistent with its digital, on-screen counterpart. Therefore, users can print out high-resolution copies of complete issues for easy reading and future reference.

ORGANIZATIONAL STRUCTURE OF THE ARCHIVE

As we began to manage over one thousand issues, we learned the value of consistent organization very quickly. We created detailed spreadsheets for all of the publications, recording important information about every issue as well as entire years of publication. The full Archive Catalogue, in Microsoft Excel format, can be downloaded by clicking on this link. This collection of spreadsheets was an essential guide in every stage of our process.

Unfortunately, our own structure could not meet all the organizational demands of the project. Often, we had to decide how to handle organizational errors made by newspaper staff of the past. In order to maintain the structural integrity of the Catalogue, we applied particular rules of organization. For example, since the date of publication was the most consistently accurate organizational feature of the documents, we chose to categorize issues by date and year.

When there were inconsistencies in organization, we supplemented the Archive Catalogue with data that would improve the intelligibility and organization of the records. In the case of misnumbered issues, we maintained a sequential numbering system for ease of reference. However, for the sake of accuracy, we included within parentheses the issue number information as it appeared on the issues themselves. When issue information was false or missing, we were forced to associate artificial information with that issue. Any information we supplied ourselves is contained within brackets. A common task was to estimate, based on a regular publication schedule, the approximate date for an issue which had not been dated by the newspaper staff. When an undated issue fell within a period of irregular publication, we estimated its date based on the most recent events mentioned within the issue. We applied a similar tactic when two issues shared the same date.
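The conventions above (sequential numbers, printed numbers in parentheses, supplied information in brackets, dates estimated from a regular schedule) can be sketched in a few lines. This is only an illustration: the function names, the label layout, the weekly interval, and the sample date are all assumptions, and the Catalogue itself was compiled by hand in spreadsheets.

```python
from datetime import date, timedelta


def estimate_date(previous_date, interval_days=7):
    """Estimate an undated issue's date as one publication interval after the previous issue.

    The weekly default is an assumption; the actual schedule varied by publication.
    """
    return previous_date + timedelta(days=interval_days)


def catalogue_label(sequence_no, printed_no=None, printed_date=None, estimated_date=None):
    """Format a catalogue entry label (hypothetical layout).

    The printed issue number goes in parentheses when it differs from the
    sequential number; a date we supplied ourselves goes in brackets.
    """
    label = f"No. {sequence_no}"
    if printed_no is not None and printed_no != sequence_no:
        label += f" ({printed_no})"
    if printed_date is not None:
        label += f", {printed_date.isoformat()}"
    elif estimated_date is not None:
        label += f", [{estimated_date.isoformat()}]"
    return label


# Hypothetical example: a misnumbered, undated issue following one dated 1950-03-01.
guess = estimate_date(date(1950, 3, 1))
print(catalogue_label(5, printed_no=4, estimated_date=guess))  # No. 5 (4), [1950-03-08]
```

The schedule-based estimate only applies to periods of regular publication; for irregular periods the date had to be inferred from events mentioned in the issue, which a simple rule like this cannot capture.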

The Archive Catalogue is the first comprehensive attempt to organize detailed information about all the extant student newspapers from Bard College. We have tried our best to apply a consistent structure to a body of information which itself varied in organizational consistency. Despite its limitations, we hope that the Catalogue provides researchers and enthusiasts with a helpful guide to this unique collection of Bard's historic materials.