[Update: one of my students, Sarah Weissman, has published a fabulous overview of her group’s web archiving project and the problems they encountered over on the Library of Congress’s Signals blog.]
This semester we’re doing large-scale web archiving in my iSchool Information Access in the Humanities course. To get a sense of what I mean by “large scale,” our data budget using the Internet Archive’s subscription-based Archive-It software is 1,024 GB and includes up to 12,000,000 documents (we likely won’t capture nearly that many documents, but part of the learning experience is getting a sense of how much information gets captured with just a handful of seed sites per group as starting points for the crawler). A shout-out to the fabulous Lori Donovan at the Internet Archive who has been working closely with our class. Here’s a copy of the assignment.
Web Archiving Assignment
Due: 28 November
Presentations: 5 December
In this project you’ll harvest and preserve a humanities web-based collection. Your group will scope the collection, troubleshoot media file format issues, create metadata, and deal with robots.txt files and copyright issues, as well as learn about the architecture of the web. Using specialized open-source software to harvest content, you’ll create a topically based collection of websites that is then permanently hosted at the Internet Archive. The crawler software, Heritrix, captures pages from the live web; the archived pages are then viewable through the Wayback Machine. The service also includes specialized search tools that allow for full-text and metadata searching.
A number of other graduate library or archival programs use Archive-It in the classroom, including University of Michigan’s iSchool, UNC-Chapel Hill’s SILS program, University of South Florida’s iSchool, Clayton State University’s Archival Studies program, and NYU’s Moving Image Archiving and Preservation Program, to name a few. Their collections are built around themes such as Alternative Energy Sources, Digital Tools for Human Rights Awareness, and the 2011 Wisconsin Union protests, among others. To get a sense of what others have done, you can search by collecting organization, collection, or specific URL.
Useful resources and links:
*Archive-It log-in page
*UMD’s Archive-It Collections
*“Preservation is Cultural Literacy” (my Huffington Post article, which includes a discussion of web archiving in the K-12 classroom)
1.) A 4-5 page double-spaced document (Times New Roman, 12-point font, with 1″ margins), plus images and appendices. In terms of genre, this document is a cross between a report and a reflective essay. You should address the following:
*Description of and rationale for your web archive collection. What is the theme or topic of your collection, and how did you arrive at it?
*What are the 7-10 seeds that make up your collection?
*How did you scope your collection? Did you have to make any scoping adjustments along the way?
*What did you choose to capture for each site or seed: the entire site, one or more directories, or one or more subdomains? (Be sure to attend to the syntax of your seed URLs to make sure you’re capturing what you intend.)
*How did you make these decisions? Before making your final selections, please read the “appraisal and selection” section of Jinfang Niu’s “An Overview of Web Archiving” in D-Lib Magazine. Take note of the various approaches to appraisal Niu identifies: selection by domain (such as .gov or .edu), by topic or event, or by media type and genre. Niu also distinguishes between value-based sampling and random or statistical sampling.
*What types of content were archived in the course of your crawls? Images? Video? Form- and database-driven content? PDFs? Study your post-crawl reports to get a quantitative sense of the types and numbers of files that were captured.
*What major rendering problems did you encounter, and how did you troubleshoot them? What other technical issues did you run into (e.g., crawl traps, robots.txt files, etc.)?
*What are some of the major takeaways from this project? What did you learn, and what surprised you? Remember the Internet Archive’s motto: “The Web is a Mess.” How was the truth of this statement brought home to you?
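The scoping questions above turn largely on seed URL syntax: whether a seed ends in a trailing slash, names a directory, or names a bare host changes what a prefix-based crawler considers in scope. As a rough sketch of the general idea (this is an illustration, not Heritrix’s actual SURT-based scoping rules, and the URLs are made up):

```python
from urllib.parse import urlparse

def in_scope(seed: str, url: str) -> bool:
    """Illustrative prefix-based scoping: a seed ending in '/' is
    treated as a directory scope; any other seed scopes to its host."""
    if seed.endswith("/") and urlparse(seed).path not in ("", "/"):
        # Directory-style seed: capture only URLs under that directory
        return url.startswith(seed)
    # Host-style seed: capture everything on the same host
    return urlparse(url).netloc == urlparse(seed).netloc

# A directory-scoped seed captures its own subtree but not siblings:
print(in_scope("https://example.org/blog/", "https://example.org/blog/post1"))  # True
print(in_scope("https://example.org/blog/", "https://example.org/about"))       # False
```

This is why the assignment asks you to attend to seed syntax: two seeds that differ only by a trailing path segment can produce crawls of very different sizes.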
2.) Include screenshots, charts, and appendices as needed.
3.) 5-7 PowerPoint slides that summarize your thematic collection and your experience using the Archive-It service. Include screenshots, statistics, and technical issues encountered along the way. You can derive your slides directly from your written report (i.e., there may be considerable redundancy or duplication between them).
You will submit a printed report to me, but you should also cross-post as much as possible to the class blog to share with your classmates. I’d also strongly encourage you to blog about your experiences as you begin to experiment with the Archive-It software and tools.
*Seed sites: 7-10 total
*Production crawls: 4-5 (think carefully about the frequency of your crawls and when you want to schedule them; please note that our data budget only permits five production crawls per group)
*Unlimited test crawls
*Dublin Core collection-level metadata: Title, Subject, Description, Creator, Date
*Dublin Core seed-level metadata: at your discretion.
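The five required collection-level elements can be sketched as a simple record. The values below are purely illustrative (an invented collection title, group name, and date), not drawn from any real Archive-It collection:

```python
# Hypothetical collection-level Dublin Core record with the five
# required elements; all values are invented for illustration.
collection_metadata = {
    "Title": "Maryland Food Culture Blogs",
    "Subject": "Food habits -- Maryland",  # drawn from a controlled vocabulary where possible
    "Description": "Blogs documenting regional food culture, crawled in Fall 2012.",
    "Creator": "Information Access in the Humanities, Group 3",
    "Date": "2012-11",  # a machine-sortable date format such as W3CDTF is a common convention
}

print(sorted(collection_metadata))
```

Seed-level metadata can reuse the same element names at whatever level of detail your group finds useful.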
Tips and suggestions:
*Do a test crawl before your first production crawl (and subsequent test crawls as needed)
*Carefully study your post-crawl reports and learn from them
*Pay attention to the amount of data and documents you’re archiving by studying your post-crawl reports and collection home page
*Browse your archived content through the Wayback Machine and check for rendering issues
*Study the Quality Assurance (QA) post-crawl report and run patch crawls as needed
*Attend to robots.txt issues
*Study how metadata elements have been used in other Archive-It collections
*Remember that you are archiving for future generations as well as users in the present. How does the question of (future) audience shape your collection description? See my Huffington Post blog entry for observations on how teachers and students in the K-12 web-archiving program have approached this task.
*When choosing metadata subject terms, research and review existing controlled vocabularies, ontologies, and classification systems for relevance (e.g., the Getty Art and Architecture Thesaurus). See what creators of other collections have done.
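To understand what a robots.txt file actually blocks before you troubleshoot a crawl, you can test paths against its rules with Python’s standard-library `urllib.robotparser`. The rules and user-agent string below are made-up examples, not a real site’s policy:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: all crawlers are barred from /private/
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check which URLs a given crawler user-agent may fetch
print(rp.can_fetch("examplebot", "https://example.org/public/page.html"))    # True
print(rp.can_fetch("examplebot", "https://example.org/private/secret.html")) # False
```

If a crawl report shows pages blocked by robots.txt, a quick check like this tells you whether the block applies to your crawler’s user agent or to everyone.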