A substantial amount of still-valuable research data is preserved only on paper or outdated digital media. For example, the Forest Service’s Fort Valley Experimental Forest still has paper-based research data from over 100 years ago. Bringing such data into the digital 21st century can take a substantial amount of work, variously labeled “data rescue” (see, for example, this USGS fact sheet) or “data archaeology”. Unfortunately, even when a data collection reflects a significant expenditure of money, it does not always make sense to attempt to excavate a particular site. After all, funding for data archaeology is limited and needs to be spent carefully so that projects can demonstrate modern value. On this page, we provide some insights on deciding whether to initiate a data archaeology project and how to execute such a project.
We follow our own advice. For example, when we were considering whether to start a dig at the Penobscot Experimental Forest, we determined that:
- There were sufficient metadata
- Some of the paper posed a health hazard (mold)
- The content was relevant to current work and was actively used by the current research team
- The data were intermixed with administrative content that constituted permanent records of interest to the National Archives and Records Administration (NARA)
For additional information about preferred archival formats and specifications, see the NARA recommendations from 2014. Even where that guidance is outdated, it can be a good starting place.
How to convert paper-based materials to electronic files
- Prioritize
It can be costly to convert paper-based materials to electronic files, so prioritizing ensures the most important data/files are converted first. Here are some things to consider when prioritizing:
- Do you have sufficient metadata (who, what, when, why, where, how, etc.) to make the content useful?
- Are the media fragile and in danger of not being readable?
- Are the media in an old format (e.g., punch cards) that will require additional work?
- Is the content important (e.g., relevant to a current study) and/or frequently requested?
- Are you in danger of losing important information about the files because a scientist is retiring or leaving?
- Are these materials permanent Forest Service records?
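The checklist above can be turned into a rough ranking. The following is only a minimal sketch: the `Collection` fields, the equal weighting of the questions, and the example collections are all hypothetical choices made for illustration, not an official scoring method.

```python
from dataclasses import dataclass


@dataclass
class Collection:
    """Yes/no answers to the checklist questions for one set of materials."""
    name: str
    sufficient_metadata: bool  # who, what, when, why, where, how are known
    fragile_media: bool        # media may soon be unreadable
    old_format: bool           # e.g., punch cards; implies additional work
    important_content: bool    # relevant to a current study or often requested
    expert_leaving: bool       # a scientist who knows the files is leaving
    permanent_record: bool     # permanent Forest Service record


def priority_score(c: Collection) -> int:
    """Count the 'yes' answers that argue for converting this collection soon.

    Equal weighting is an assumption of this sketch. old_format is left out
    of the score because it signals extra effort rather than extra value.
    """
    reasons = [
        c.sufficient_metadata,
        c.fragile_media,
        c.important_content,
        c.expert_leaving,
        c.permanent_record,
    ]
    return sum(reasons)


def rank(collections: list[Collection]) -> list[Collection]:
    """Order collections from highest to lowest score (ties keep input order)."""
    return sorted(collections, key=priority_score, reverse=True)


# Hypothetical example collections.
candidates = [
    Collection("punch-card inventory", False, False, True, False, False, False),
    Collection("moldy plot sheets", True, True, False, True, True, False),
]
for c in rank(candidates):
    note = " (old format: plan for additional work)" if c.old_format else ""
    print(f"{c.name}: {priority_score(c)}{note}")
```

A real project would likely weight the questions differently, for example by treating missing metadata as a reason not to proceed at all.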
- Prepare
Having the right tools and the right staff can make this job easier. Here are some recommendations:
Tools
- Scanner
- High resolution is a must (minimum 600 dpi or significantly higher for slides)
- Ability to auto feed (optional, but could be important)
- Output files must be in an archival format (e.g., JPG, TIFF, PDF)
- Optical character recognition (OCR) software
- Enables a computer to “read” the scanned text/data
- Can help make documents 508 compliant
- Helps convert scanned data to a useable form (doesn’t work well for hand-written data in most cases)
Staff
- Meticulous
- Good organizational skills
- Subject matter knowledge extremely helpful
- Digitize
How to digitize materials varies based on the content type. Here are some recommendations:
- Pictures / Slides
- Scan printed photos: 600 dpi (grayscale recommended for black/white)
- Scan slides or negatives: minimum of 2400 dpi
- Scan only 1 picture or slide per file
- Documents / Maps / Other Files
- Scan at 600 dpi (grayscale recommended for black/white)
- Scan multiple pages of a single document as 1 file
- Data
- Scan at 600 dpi (grayscale recommended), run OCR software on the scan, then verify the accuracy of the OCR output
- Consider hiring someone to hand-enter data
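One possible way to verify OCR accuracy is to hand-enter a small sample of the scanned data and compare it with the OCR output. The sketch below uses Python's standard `difflib` module for that comparison; the function names and the sample text are hypothetical, and it assumes both versions have the same number of lines.

```python
import difflib


def ocr_similarity(ocr_text: str, reference_text: str) -> float:
    """Return a similarity ratio between 0 and 1 (1 means identical text)."""
    return difflib.SequenceMatcher(None, ocr_text, reference_text).ratio()


def differing_lines(ocr_text: str, reference_text: str) -> list[tuple[int, str, str]]:
    """List (line number, OCR line, reference line) for every line that differs.

    Assumes both texts have the same number of lines; extra lines in the
    longer text are ignored by zip().
    """
    pairs = zip(ocr_text.splitlines(), reference_text.splitlines())
    return [(i, o, r) for i, (o, r) in enumerate(pairs, start=1) if o != r]


# Hypothetical sample: the OCR misread the digit "1" as the letter "l".
ocr_sample = "plot,dbh\n1,12.4\n2,l5.0"
hand_entered = "plot,dbh\n1,12.4\n2,15.0"

print(f"similarity: {ocr_similarity(ocr_sample, hand_entered):.2f}")
for line_no, ocr_line, ref_line in differing_lines(ocr_sample, hand_entered):
    print(f"line {line_no}: OCR read {ocr_line!r}, expected {ref_line!r}")
```

If the sample shows frequent mismatches, hand entry of the full data set may be the more reliable option.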
*Files and filenames should be as simple and transparent as possible. Folders can be used to break files into meaningful categories and help keep filenames shorter.
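To make the recommendations in this step concrete, the sketch below looks up the minimum resolution for each content type, works out the resulting image size in pixels, and builds a short file path in which the folder carries the category. The resolutions come from the list above; the content-type keys, the `.tif` default, and the folder-per-category naming scheme are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Minimum scan resolutions (dpi) from the recommendations above.
MIN_DPI = {
    "photo": 600,
    "slide": 2400,
    "negative": 2400,
    "document": 600,
    "map": 600,
    "data": 600,
}


def pixel_dimensions(width_in: float, height_in: float, content_type: str) -> tuple[int, int]:
    """Image size in pixels for an original of the given size in inches."""
    dpi = MIN_DPI[content_type]
    return round(width_in * dpi), round(height_in * dpi)


def scan_path(content_type: str, short_name: str, extension: str = "tif") -> PurePosixPath:
    """Build a path whose folder names the category so the filename stays short.

    The folder-per-category layout is a hypothetical scheme, not a requirement.
    """
    return PurePosixPath(content_type) / f"{short_name}.{extension}"


# A 4 x 6 inch print scanned at 600 dpi.
print(pixel_dimensions(4, 6, "photo"))   # (2400, 3600)
# One slide per file, stored under a "slide" folder.
print(scan_path("slide", "1925_plot3"))  # slide/1925_plot3.tif
```

Knowing the pixel dimensions in advance can help estimate storage needs before scanning a large collection.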
- Archive
To properly archive data or other types of files, proper documentation (metadata) is needed if the information is to be useful in the future. Here are some examples of the type of information needed based on content type:
- Pictures / Slides
- Description, which should include what and where
- When the photo was taken
- Photographer (if known)
- Documents / Maps / Other Files
- Description
- Author(s)
- When written
- Data
- Description of data (needs to include complete description of each variable)
- Who collected the data
- Why data were collected
- Where data were collected
- Quality of the data
*Important data/files, once electronic, should ideally be archived in an electronic data repository. It is important to think of stability, long-term preservation, discovery, and access capabilities when choosing where the data/files reside. Consider submitting data to the Forest Service Research Data Archive.
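Before submitting files to a repository, it can help to check each item's metadata against the lists above. The following is a minimal sketch of such a check: the field names and the example record are hypothetical, and the photographer is left out of the required fields because the list marks it "if known".

```python
# Required metadata per content type, following the lists above.
# Field names are hypothetical labels chosen for this sketch.
REQUIRED_FIELDS = {
    "picture": ["description", "date_taken"],
    "document": ["description", "authors", "date_written"],
    "data": [
        "description",   # must describe every variable
        "collected_by",
        "purpose",       # why the data were collected
        "location",      # where the data were collected
        "quality",
    ],
}


def missing_fields(content_type: str, record: dict) -> list[str]:
    """Return the required fields that are absent or empty in the record."""
    return [f for f in REQUIRED_FIELDS[content_type] if not record.get(f)]


# Hypothetical record for a data file that lacks a quality statement.
record = {
    "description": "Tree diameter by plot; dbh in inches",
    "collected_by": "Field crew",
    "purpose": "Long-term growth study",
    "location": "Experimental forest, compartment 12",
    "quality": "",
}
print(missing_fields("data", record))  # ['quality']
```

An empty result means the record covers every field in the list for that content type; it does not by itself show that the descriptions are adequate.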
For more information on archiving and data management, contact the archive team.