Preservation for all: the future of government documents and the “digital FDLP” puzzle
A presentation at the Ohio GODORT spring 2011 meeting (by invitation). Friday, June 3, 2011 at the State Library of Ohio.
Agenda: library principles and best practices; case studies: Everyday Electronic Materials (EEMs) “Water droplets”; Collaboration: delicious, state agency databases “Reservoirs”; reflection of projects based on principles
Published on: Mar 4, 2016
James R. Jacobs
firstname.lastname@example.org
lockss-usdocs.stanford.edu
Ohio GODORT Spring Meeting, Friday June 3, 2011, 10 - 11:30
agenda
• library principles and best practices
• case studies:
• Everyday Electronic Materials (EEMs) “Water droplets”
• Archive-it “Oceans”
• lockss-usdocs “Waterfalls”
• Collaboration: delicious, state agency databases “Reservoirs”
• reflection of projects based on principles
• [[slides available at slideshare.net/freegovinfo]]
Introduction and agenda
I’d like to thank Ohio GODORT President Andrea Peakovic, Tom Adamich, Audrey Hall, Sarajean Petite and the other organizers for inviting me to talk with you today and facilitating a smooth and comfortable trip. I hope it’ll be worth your while :-)
We’re at the very beginning of the digital era, where tools, policies, best practices, etc. are all in flux. In many ways, we’re at an age where new metaphors are needed to describe what it is that we as librarians do on a daily basis.
Librarians ... ... Explore ... Collect ... Describe ... Share ... Preserve
... But basically, we explore, collect, describe, share and preserve the world of information.
mood survey
• how many of you have:
• run across a site or a PDF or a database that you wanted to add to your collection?
• sent in a fugitive document to the GPO?
• created subject guides/pathfinders to govt information sites?
• have (or want to have) a digital repository?
I’ve answered yes to each of these, and that’s what has led me to work on and advocate for digital documents collections and the digital FDLP.
Stanford University Library has been a federal depository library since 1895. Given that collections are still a vital aspect of libraries and govt document collections are becoming hybrid paper/digital, the question becomes: how do government documents librarians deal with the shift to digital formats and continue to build robust collections that serve our local user communities? Printing out or downloading digital documents to our desktops doesn’t even begin to answer that question. When everything’s ephemeral, how do we collect, organize, give access to, and preserve government documents?
Today my talk will be centered around what I’m calling the “digital FDLP.” I use this term not as an accepted project name but simply as a metaphor to contextualize the work that I’m doing within the historic work of the Federal Depository Library Program (FDLP). I must add a disclaimer at the outset: although GPO is participating in the lockss-usdocs program, the work that I’m doing is largely outside of the GPO/FDLP structure. No paperwork or MOU has been signed, so the work has no official status.
principles
• Forward democratic ideals
• Serve public interest / public access / public control / public preservation
• Serve the information needs of the community
• Forward the long-term institutional viability of libraries
• Promote and leverage collective action
So with that disclaimer out of the way, I’d like to talk about a few case studies in digital government information. But I’d like to talk about more than simply the technical aspects of these projects. Because of the nature and history of libraries as memory organizations, we also must deal with the social aspects and impacts on our practices. The social aspects of libraries are our fundamental raison d’être.
So in thinking about the projects I’m about to show, I kept coming back to the fundamental principles of libraries. Those principles are, at the end of the day, the criteria for judging whether a project, a workflow, or an investment of institutional energy should be considered a success, and whether your practices can be seen as “best.”
So I’d like to first talk about library ideals. Running through this checklist helps me evaluate my work or a specific project. For instance, if I’m evaluating a project that seems to be valuable, but it uses proprietary software, or control of the content is not in the hands of the library (a trusted non-commercial entity!), or the goal of the project is profit over public interest, then I have questions about that project.
So as a reminder, I’d first like to enumerate some of the library principles or values that I use as a checklist as I go about my work. If you have others, please let me know:
Are you:
--forwarding democratic ideals?
--serving public interest / public access / public control / public preservation?
--serving the information needs of the community?
--forwarding the long-term institutional life of libraries?
--promoting and leveraging collective action?
These are the principles that we as documents librarians (and librarians in general!) hold dear. Best practices = practices in which these principles are embedded.
Actions in support of values:
--libraries as memory organizations
--local control of collections (print libraries resist attack and are self-healing. E.g., just yesterday I got an email from the American Samoa bar assn. The Samoan national library had lost much of its collections in the 2009 tsunami, including the entire run of the Samoan Pacific Law Journal, and they found that Stanford had holdings of the journal. So we’re trying to figure out a way to help them rebuild their collections. U of HI Manoa is another case where collections were lost due to natural disaster and the FDLP community helped to rebuild them.)
--distributed system to meet local needs (spread responsibility for content among various locations and administrations)
--public interest (affirms FDLP libraries’ role in ensuring permanent public access!)
--value of library community
--forward democratic ideals
--a community, sharing preservation responsibilities
While I talk about the following projects, please keep these principles and ideals in mind.
EEMs
• Everyday Electronic Materials
• serendipitous collection
• Collecting the Web a drop at a time
• Flickr photo by Elle Is Oneirataxic. Attribution-NonCommercial-ShareAlike 2.0 Generic Creative Commons license
EEMs, or Everyday Electronic Materials, is a Mellon-funded project to build infrastructure and a workflow to support the collection, description, preservation and public access of digital objects by bibliographers and subject specialists.
EEMs are those digital materials that are distributed by posting on Web sites, or through email notification to scholars and bibliographers; those items that selectors come across in the course of doing their everyday work.
The project has been a successful collaboration between Public Services, Technical Services, and Digital Library Systems and Services. My colleague Katherine Kott wrote a report on the project for the fall 2010 CNI meeting. The results she laid out, which may be adopted and adapted for use by other libraries, include:
--A clear framework for managing copyright issues associated with digital material distributed via the Web, and for applying access policies that are consistent with redistribution rights
--Training events and material for selectors and technical services staff
--A Web-based tool to support selector and staff processing of EEMs via a lightweight workflow
--Integration with the current integrated library system (ILS) and traditional ILS-based processes
--Integration with other components of Stanford’s digital library infrastructure, including its preservation repository, discovery systems and “digital stacks” delivery environment
**From Katherine Kott’s CNI report on the project. See end slide for citation.
Subject specialist workflow:
1. identify the document (only PDF at this time)
2. drag the URL of the doc to the EEMs browser widget
3. determine copyright status. Request permission to harvest/preserve if need be
4. describe the document (title, author, rights status, comments)
5. submit to the acq and cataloging workflow.
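The five workflow steps above can be sketched as a small record plus a routing function. This is only an illustration and not the actual EEMs tool: the class name `EEMRecord`, the function `next_step`, and the rights-status values are all hypothetical. Only the described fields (title, author, rights status, comments) and the PDF-only restriction come from the workflow itself.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class EEMRecord:
    """One candidate document, described with the fields from step 4."""
    url: str
    title: str
    author: str
    # Hypothetical values: "public_domain", "permission_granted", "permission_needed"
    rights_status: str
    comments: str = ""


def is_pdf(url: str) -> bool:
    """Step 1: only PDFs are handled at this time (checked by file extension)."""
    return urlparse(url).path.lower().endswith(".pdf")


def next_step(record: EEMRecord) -> str:
    """Return the next action a selector would take for this record."""
    if not is_pdf(record.url):
        return "reject: only PDF is supported at this time"
    if record.rights_status == "permission_needed":
        return "request permission to harvest/preserve"
    if not record.title:
        return "describe the document"
    return "submit to acquisitions and cataloging"
```

A federal document in the public domain with a title filled in would go straight to acquisitions and cataloging, while a copyrighted item waits on a permission request first.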
Agencies tracked for EEMs
• Bureau of Land Management CA field office: http://www.blm.gov/ca/st/en/info/publications.html
• Department of Justice: http://www.justice.gov/05publications/05_3_a.html
• Bureau of Ocean Energy Management, Regulation and Enforcement (BOEMRE) (including Minerals Management Service): http://www.boemre.gov/
• NOAA: http://www.noaa.gov/
• National Cancer Institute: http://www.cancer.gov/
• National Institutes of Health: http://www.nih.gov/
• USDA: http://www.usda.gov/wps/portal/usda/usdahome
• OMB: http://www.whitehouse.gov/omb/
• **Harvesting with archive-it:
• EPA: http://www.epa.gov/
• GAO: http://gao.gov/
• Census current industrial reports: http://www.census.gov/manufacturing/cir/index.html
My use of the EEMs workflow and tool grew out of 2 other projects that I’d like to briefly mention: delicious IAdeposit and lostdocs.freegovinfo.info.
The delicious project was an attempt to build a collaborative effort toward collecting digital fugitive documents (those documents that should be distributed via the FDLP but are not). The original intent of that project was that anyone could tag documents of interest with the delicious tag IAdeposit and a group of us would periodically upload those tagged documents to the Internet Archive’s US documents collection (http://www.archive.org/details/USGovernmentDocuments). I continue to upload documents when I come across them, but unfortunately there hasn’t been enough structure to the project for many people to take it up and run with it.
A second, more successful project is Lostdocs.freegovinfo.info. Lostdocs is a blog and community effort that tracks fugitive document submissions to the GPO in order to provide a public listing of fugitive documents.
Through the work of the lostdocs blog, I’ve been able to target 8 agencies that generally are the worst offenders in terms of fugitive documents:
-- Bureau of Land Management CA field office (can also check OR and WA field offices) http://www.blm.gov/ca/st/en/info/publications.html
-- Department of Justice http://www.justice.gov/05publications/05_3_a.html
-- Bureau of Ocean Energy Management, Regulation and Enforcement (BOEMRE) (including Minerals Management Service) http://www.boemre.gov/
-- National Oceanic and Atmospheric Administration (NOAA) http://www.noaa.gov/
-- National Cancer Institute http://www.cancer.gov/
-- National Institutes of Health http://www.nih.gov/
-- USDA http://www.usda.gov/wps/portal/usda/usdahome
-- OMB http://www.whitehouse.gov/omb/
We also found that 3 other agencies that were top fugitive offenders published too many documents to make the EEMs workflow feasible. So I’m harvesting the following 3 agencies with Archive-it:
-- EPA http://www.epa.gov/
-- GAO
-- Census current industrial reports http://www.census.gov/manufacturing/cir/index.html
I have a staff person working 1-2 hrs per week on this project. She checks the agency publications page for new publications; checks the CGP (http://catalog.gpo.gov) to see if the document has made it into the catalog of govt
EEM: http://searchworks.stanford.edu/view/8707790
Through the EEMs workflow, we’ve been able to collect over 300 documents like this one (notice the Stanford PURL), preserve them locally in the Stanford Digital Repository (SDR) and give access to them through our catalog, searchworks. Think what we could do if 100 libraries instituted this workflow. Collectively, we could cover all federal agencies to assure that no document within scope of the FDLP falls through the cracks and becomes fugitive.
Archive-it
• collecting the Web in bulk
• Archive-it.org
• Fotopedia image by Marcus Revertegat. Creative Commons Attribution 3.0 Unported license.
Archive-it is a subscription service from the Internet Archive. It’s an easy collection building tool whereby you give the software a list of urls (called “seeds”), schedule the crawler to harvest the seeds, and then give public access to the content collected. It’s a good way to contextualize or make sense of the ocean of content on the open Web.
Since 2007:
Documents Crawled: 41,176,614
Data Archived: 3,683.4 GB
SULAIR archive-it home: http://www.archive-it.org/home/SSRG
What I’m collecting with Archive-It:
-- CRS Reports
-- FOIA
-- Fugitive US documents (shout-out to lostdocs.freegovinfo.info)
-- Bay Area governments
-- Climate change and environmental policy
-- G-20
-- CA Dept of Education curriculum and instruction
-- US budget
-- FRUS
Collection seeds https://archive-it.org/public/collection.html?id=1078
Archive-it UI:
-- building collections of urls or seeds
-- metadata creation: we get assistance from a cataloger to do Dublin Core metadata at the collection and seed level. Archive-it allows for metadata at the document level, but we have not done that.
-- crawl reports
search and discover http://snipurl.com/crs-energyefficiency
access (full text search at IA, archive-it site, databases page, embeddable search form, Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) + plans to index collection/seed metadata in SUL catalog)
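OAI-PMH is the access route here that another library's catalog would most likely use to pull collection and seed metadata. As a minimal sketch, the function below builds a harvesting request URL. The verb `ListRecords` and the `oai_dc` (Dublin Core) metadata prefix are defined by the OAI-PMH specification; the base URL `https://example.org/oai` and the set name are placeholders, not the service's actual endpoint, and the helper name `list_records_url` is made up for illustration.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional selective harvesting of a single set within the repository.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Placeholder endpoint; substitute the repository's documented base URL.
print(list_records_url("https://example.org/oai"))
```

A harvester would fetch this URL, parse the XML response, and follow any resumption token the repository returns until the record list is exhausted.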
Collaboration
• Delicious IAdeposit
• State agency databases: wikis.ala.org/godort/index.php/State_Agency_Databases
• Seeding wikipedia with digital documents. See “Using Wikipedia to Extend Digital Collections” D-Lib. www.dlib.org/dlib/may07/lally/05lally.html
• Yosemite’s Hetch Hetchy reservoir flickr photo by Random Curiosity. Attribution-NonCommercial-ShareAlike 3.0 Creative Commons license.
I’ve already mentioned the delicious IAdeposit project, but I’d like to also briefly mention the far more successful collaborative GODORT project on (what else!?) state agency databases, organized by my freegovinfo colleague Daniel Cornwall and kept up to date by 50+ volunteers. This project has been a huge success in collecting various public databases at the state level. I have to give a shoutout here to Audrey Hall, reference librarian in Government Information Services at the State Library of Ohio. Audrey’s built a robust listing for OH state agencies (http://wikis.ala.org/godort/index.php/Ohio). Thanks Audrey!!
State Agency Databases: http://wikis.ala.org/godort/index.php/State_Agency_Databases
LOCKSS-USDOCS
• Targeted Web collection and distributed preservation
• Lots of Copies Keep Stuff Safe
• lockss-usdocs.stanford.edu
• Flickr waterfall picture by discordia1967. That’s actually me at Hanakapi`ai falls in Kauai :-)
LOCKSS is ...
• Distributed Digital Preservation System
• Open source peer to peer (P2P) software
• Standards based: OAIS, OpenURL, HTTP, WARC
• content migrator to new formats as required “on the fly” at point of access
• bits and bytes are continually audited and repaired
LOCKSS (Lots of Copies Keep Stuff Safe) began at Stanford in 1999. The LOCKSS software was built to solve the problem of long-term preservation of digital content. It is an open-source distributed digital preservation system based on open standards: OAIS, OpenURL, HTTP, and WARC (the web archive file format). Originally LOCKSS was focused on journal literature (and today CLOCKSS is going strong with 81 libraries and 30 journal publishers participating!), but over the last 10 years it has been used by other projects focusing on government information, theses and dissertations, numeric data etc.
The goal of LOCKSS is to spread out the economic cost of digital preservation and use off-the-shelf hardware, so that libraries and content publishers can easily and affordably create, preserve, and archive local electronic collections, and readers can access archived and newly published content transparently at its original URLs.
Think of LOCKSS boxes as digitally distributed bookshelves! The software runs on the Linux OS.
Besides the LOCKSS global network and CLOCKSS, the software is used by other projects focusing on government information, theses and dissertations, numeric data etc., showing LOCKSS is flexible, reliable, efficient and highly scalable.
-- The Alabama Digital Preservation Network (ADPN)
-- Arizona State Library, Archives and Public Records Persistent Digital Archives and Library System (PeDALS)
-- Council of Prairie and Pacific University Libraries (COPPUL) Consortium
-- Data Preservation Alliance for the Social Sciences (Data-PASS)
-- Digital Commons - Berkeley Electronic Press
-- MetaArchive Cooperative Project
LOCKSS is funded by...
• LOCKSS alliance library members
• LOCKSS has received funding and in-kind support from:
• Andrew W. Mellon Foundation
• National Science Foundation
• Library of Congress
• The UK’s Joint Information Systems Committee
• Sun Microsystems
• HP Labs
• Intel Research Berkeley
• Stanford University Libraries and Academic Information Resources
• Stanford Computer Science Dept.
• Harvard Computer Science Dept.
Sustainable funding is always an issue. LOCKSS is primarily funded by libraries participating in the LOCKSS alliance and has also received major funding and in-kind support from several other organizations. Non-LOCKSS alliance members may participate in LOCKSS-USDOCS for a token support fee of $1250/yr, or $750/yr if they bring an additional library into the program.
A web is stronger and more viable than a silo.
LOCKSS Permission Statement
Digital Dissemination of Access Content Packages
Interest has been expressed by customers and partners in creating their own digital collections of content. This includes building digital collections by accepting digital files and metadata disseminated by GPO’s Federal Digital System (FDsys). GPO will make available Access Content Packages (ACP’s) to download and store on local systems. For instance, for Federal depository libraries (FDL’s), this may assist in efforts to build digital collections at their libraries. GPO will continue to maintain responsibility for managing Federal content within scope of the Federal Depository Library Program (FDLP), and providing free and permanent access to this information.
LOCKSS system has permission to collect, preserve, and serve this Archival Unit.
http://www.gpo.gov/fdsys/bulkdata/FR/resources/lockss.html
How does LOCKSS work?
There are 2 parts to the LOCKSS software: harvest and content collection; and content checking and replication.
1) Any site that gives LOCKSS permission to harvest can be collected by the LOCKSS harvester, the state of the art in Web harvesting!
Next slide for #2 ...
2) And this is the cool part: LOCKSS goes through a process of checking and polling all digital content in all of the LOCKSS boxes on a network. If 1 box has content that is different from all of the other boxes, the software will fix the content, assuring that all content in the whole network is exactly the same. It is for all intents and purposes injecting stem cells into the network to replicate and fix content that’s become corrupted over time.
That’s it. LOCKSS is elegant in its simplicity and proven effective in keeping digital content safely preserved over time. It follows the Unix maxim of “doing one thing, doing it well.” In the digital world, this is as close to perfect as one could get.
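The checking-and-repair idea can be illustrated with a toy majority vote over content hashes. This is a deliberate simplification and not the actual LOCKSS polling protocol, which is considerably more involved; the function names and the in-memory dictionary standing in for a network of boxes are assumptions made purely for illustration.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Fingerprint one box's copy of a document."""
    return hashlib.sha256(content).hexdigest()


def poll_and_repair(boxes: dict) -> list:
    """Compare every box's copy and overwrite copies that disagree with the majority.

    `boxes` maps a box name to the bytes it holds for one document.
    Returns the names of the boxes that were repaired.
    """
    votes = Counter(digest(content) for content in boxes.values())
    winner, count = votes.most_common(1)[0]
    if count <= len(boxes) / 2:
        return []  # no clear majority, so leave every copy untouched
    good_copy = next(c for c in boxes.values() if digest(c) == winner)
    repaired = []
    for name, content in list(boxes.items()):
        if digest(content) != winner:
            boxes[name] = good_copy  # replace the damaged copy
            repaired.append(name)
    return repaired


# Three boxes hold the same document; one copy has been corrupted.
network = {"box_a": b"Federal Register", "box_b": b"Federal Register", "box_c": b"Federal Regis#er"}
print(poll_and_repair(network))
```

The point of the sketch is the design choice: no single box is treated as the master copy, so damage or tampering at one site is outvoted and repaired by the others.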
LOCKSS-USDOCS
• LOCKSS for US Documents
• Replicates FDLP in the digital environment
• “digital deposit” (for more on “digital deposit,” see http://freegovinfo.info/taxonomy/term/3)
• Tamper evident
So now you can see why some of us in the documents world are so excited about LOCKSS and why we decided to implement LOCKSS-USDOCS.
LOCKSS-USDOCS replicates key aspects of the FDLP (a network of 1250 libraries supporting access to and long-term preservation of govt documents) in the digital environment, and furthers the concept of “digital deposit,” an essential component of the digital FDLP.
In the paper environment, the decentralized FDLP is a tamper evident system. When someone tried to alter or withdraw a paper document from the system, the librarians were alerted. They had a chance to react, and frequently persuaded the government to take different actions. Using the LOCKSS software we are re-implementing a tamper evident preservation system for digital documents. Rather than a central silo on a .gov server, digital govt documents reside on 36 servers at 36 different libraries (and counting!).
Preserving
• GPOaccess content (1991 - 2007) harvested from http://bulk.resource.org/gpo.gov/
• All current and future FDsys collections http://www.gpo.gov/fdsys/browse/collectiontab.action
LOCKSS-USDOCS is ...
Federal Register, Code of Federal Regulations, Congressional Record, congressional bills, congressional reports, US Code, Public & Private Laws, Public Papers of the President, historic Supreme Court decisions, US Statutes at Large, GAO Reports, US Budget ... and more!!
That’s 40 collections plus bulk data repositories for the Federal Register and the Code of Federal Regulations.
The process is simple:
-- Join the project and the discussion list
-- Set up a LOCKSS box with at least a 3TB hard drive (some libraries recycle older hardware, some run in a virtual server environment, some purchase new boxes from a vendor who has worked closely with the LOCKSS staff to build to LOCKSS specifications)
-- Sit back and watch it fill up
LOCKSS-USDOCS participants http://snipurl.com/lockss-usdocs-partners
36 libraries and counting, including 10 regional depository libraries -- but none in OH! The closest are the Indiana State Library and the University of Kentucky!! We’re looking for more, especially regionals but also other types of libraries (law, special, public etc.) and libraries outside the US.
What’s next for LOCKSS-USDOCS?
• More participants
• Expand collections
• Make project participant-driven
Essential Titles: http://www.fdlp.gov/collections/building-collections/135-essential-titles-list?start=1
Ag. Stats, Census, Condition of Education, County/City data book, FRUS, Occupational Outlook Handbook, StatAb, Treaties in Force
In the 2008 Blue Ribbon Task Force on Sustainable Digital Preservation and Access, Abby Smith Rumsey wrote, “Access to valuable digital materials tomorrow depends upon preservation actions taken today; and, over time, access depends on ongoing and efficient allocation of resources to preservation.”
Do these projects ...
• ... forward democratic ideals?
• ... serve public interest / public access / public control / public preservation?
• ... serve the information needs of the community?
• ... forward the long-term institutional viability of libraries?
• ... promote and leverage collective action?
Librarians ... ... Explore ... Collect ... Describe ... Share ... Preserve
If you’re like me, you have a passion for documents. You want to explore, collect, describe, share, and preserve government information. Each format has unique properties and raises unique issues for doing these things. But format is beside the point. We must continue to explore, collect, describe, share and preserve government documents.
“...let us save what remains: not by vaults and locks which fence them from the public eye and use in consigning them to the waste of time, but by such a multiplication of copies, as shall place them beyond the reach of accident.”
-- Thomas Jefferson, February 18, 1791
Digital Strategy:
These projects are part of an overall digital strategy. It runs the gamut from saving individual documents, to small collaborative projects (IADeposit), to large-scale harvesting and preservation efforts. I want to stress that this strategy DOES NOT preclude paper documents. In many respects our historic collections are what drive our raison d’être.
Farmington Plan Redux (collaborative collections):
The technological tools are there. But there’s a real, critical need for a “Farmington Plan redux.” The Farmington Plan, which lasted from 1948 - 1972, was an innovative ARL program of collaborative collection development whereby subscribing libraries would have responsibility for collecting and cataloging research materials in certain subject and/or linguistic areas and would then distribute records (in the form of cards) to the National Union Catalog. We need the same kind of plan in the FDLP. ASERL’s proposed “collections of excellence” starts to do that, but only focuses on historic collections.
Thanks!
Thanks everyone!
Further reading
• Preservation for all: LOCKSS-USDOCS and our digital future. James Jacobs and Victoria Reich, Stanford University Libraries. Documents to the People (DttP) Volume 38:3 (Fall 2010). http://freegovinfo.info/system/files/lockssusdocs-dttp38%283%29.pdf
• Everyday Electronic Materials in Policy and Practice. Katherine Kott. Coalition for Networked Information (CNI) project briefing. Fall 2010. http://www.cni.org/tfms/2010b.fall/Abstracts/PB-everyday-kott.html
• A Guide to Distributed Digital Preservation. K. Skinner and M. Schultz, Eds. (Atlanta, GA: Educopia Institute, 2010). http://www.metaarchive.org/GDDP
• Several technical articles on LOCKSS at D-Lib Magazine www.dlib.org
• “Digital Deposit” http://freegovinfo.info/taxonomy/term/3
• http://lockss-usdocs.stanford.edu