Where in the World Is the Hidden Document Data?

December 2005
Most people don't know that some software applications store additional information in the documents they create, beyond what the writer enters. This additional information is often called hidden data because it's generated or updated by the application and can't be edited or viewed by the writer. The hidden data can reveal who has opened the document, who has revised it, and what printers have printed it.

These features are fine for collaborative publishing, where a trail of revisions is useful to multiple authors working together. But such hidden data can be a problem when authors in government agencies transfer document files across agency boundaries. The hidden data might contain information that the agency would prefer not to share. Yet presidential directives require these agencies to share more information than ever before, so withholding the documents isn't an option for preventing these so-called data spills.

Data spills are a form of unintended disclosure, and the data involved may or may not be classified. When a document crosses agency boundaries, unintended disclosure can be a problem because computer forensic experts can trace the document back to its original source and see the name of everyone who authored or edited it, which may not be desirable.

Because of their wide use, Microsoft Office applications are the de facto standard for creating documents, spreadsheets, and presentations in both the government and the private sector. Microsoft has evolved these applications to make them more user-friendly, functional, and interoperable. As a result, third-party vendors have implemented additional searchable file attributes (e.g., author, keyword, creation date), also known as metadata. Unfortunately, this growing body of metadata also increases the complexity of Microsoft Office implementations and introduces the potential for inadvertently leaving information behind in files.

Evaluating Identification Tools

"MITRE was asked to evaluate commercial off-the-shelf [COTS] tools that can identify hidden data in documents and reduce risk to an acceptable level," says Lora Voas, principal information security engineer. "Applications such as Microsoft Word, Excel, and PowerPoint are popular for collaboration, and their designs are not generally driven by security considerations." Tools that are perfect for most users, in other words, may cause concerns for those with special security or confidentiality needs. These concerns are not limited to Microsoft products.

"The problem [for nondisclosure] is that Microsoft tends to embed as much information as possible to make the data backwards-compatible," says Darien Kindlund, senior information security engineer. "Even if you authored a document in a newer version of the application, an older version would be able to read and view that document just as easily as the newer ones. So you get a lot more information than you see."

Previously Deleted Content

Previously deleted content may still linger after the author deletes text or crops data from an inserted image. While not necessarily viewable on screen or in print, previously deleted content resides in areas of the file that can be searched, in some cases with a plain text editor such as Notepad.
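As a rough illustration of how such remnants can surface, the sketch below scans a file's raw bytes for human-readable text runs, which is essentially what an analyst does when opening a document in a plain text editor. It is a simplified illustration, not part of MITRE's toolkit, and the file name is hypothetical.

```python
import re
import sys

# Minimal sketch: scan a file's raw bytes for human-readable text runs,
# much as an analyst might eyeball a document in a plain text editor.
# The default file name is hypothetical; real sanitization requires far
# more than a string scan.

def printable_runs(path, min_len=8):
    with open(path, "rb") as f:
        data = f.read()
    # ASCII runs (legacy binary Office formats store much text this way)
    for match in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
        yield match.group().decode("ascii")
    # UTF-16LE runs, also common inside Word binary streams
    for match in re.finditer(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len, data):
        yield match.group().decode("utf-16-le")

if __name__ == "__main__":
    for run in printable_runs(sys.argv[1] if len(sys.argv) > 1 else "report.doc"):
        print(run)
```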
Another example is an embedded object that is dragged off a slide in PowerPoint. By default, you can't see the object on your monitor, but it's still there. Objects can also be cropped and resized, either intentionally or unintentionally. For instance, a large image such as an aerial photograph may be copied and pasted into PowerPoint or Word and then cropped to fit the document. The information that you think is cropped out still resides in the file and can be revealed with the right tool.

People can also use these products to intentionally hide information. Extending the previous example, someone can resize an image to the size of a dot, making it almost impossible to detect. Hidden data can also be placed inside overlapping objects that have been grouped.

Not Enough Human Reviewers

The classical model used for reviewing cross-domain information sharing is known as the reliable human review (RHR) method. In RHR, authors submit content to a publisher, who aggregates the content into a report and then submits it to a reviewer for release across different domains. But there aren't enough human reviewers for the increasing number of documents, and unfortunately, artificial intelligence and natural language processing are not yet up to the task of removing the human from the loop entirely.

"When MITRE looked at different COTS products to identify hidden data, we found that the majority of the products don't adequately solve this problem," says Kindlund. "Many times, COTS products will leverage libraries that were written by Microsoft to access this content and identify all the individual structures in the file. The problem is that the information revealed by the Microsoft libraries isn't necessarily complete, and it's further subject to change if these libraries are ever upgraded. Ultimately, Microsoft can control what you can and cannot see, so you might miss something."

HOFFA Tool Analyzes File Structures

"We developed a prototype tool that analyzes these file structures," says Kindlund. "It is part of the Heuristic Office File Format Analysis toolkit, or HOFFA for short. The tool evaluates COTS products or open-source applications that attempt to meet this need. The tool is not a total solution because it's too low-level for a typical analyst or end user, but it gets us one step closer in the right direction. At this stage, we're looking at potential interim solutions that try to convert the file into something more readable and more usable for the analysis and the sanitization part of this process."

The HOFFA tool is used to determine whether a COTS product can correctly parse and interpret each "file system layer" within a native Microsoft Office (MSO) file. Each MSO file can be thought of as an onion that potentially contains layers of obfuscated, hidden data. The HOFFA tool peels away these file system layers, analyzing the content within each layer. Standard testing involves constructing "test onions" with HOFFA, where target data is stored at different layers, and then feeding each onion to the COTS product to see whether it can identify the target data.
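HOFFA itself is a MITRE prototype, but the open-source olefile library gives a feel for peeling the outermost layer of that onion: the OLE compound-file container that binary Office documents are built on. The sketch below enumerates every storage and stream in a document, including ones that never render on screen; the flagged names and the file name are illustrative assumptions, not HOFFA's actual heuristics.

```python
import olefile  # pip install olefile; open-source OLE compound file parser

# Minimal sketch of "peeling" the outermost layer of a binary Office file:
# enumerate every storage and stream in the OLE container, including ones
# that never render on screen. The names flagged below are illustrative
# guesses at where embedded content may live, not HOFFA's actual rules.
SUSPECT_NAMES = {"ObjectPool", "\x01Ole10Native", "\x01CompObj"}

def list_layers(path):
    if not olefile.isOleFile(path):
        raise ValueError("not an OLE compound file: %s" % path)
    ole = olefile.OleFileIO(path)
    try:
        for entry in ole.listdir(streams=True, storages=True):
            path_str = "/".join(entry)
            kind = "stream" if ole.get_type(path_str) == olefile.STGTY_STREAM else "storage"
            flag = " <-- possible embedded content" if SUSPECT_NAMES & set(entry) else ""
            print(kind, path_str, flag)
    finally:
        ole.close()

list_layers("briefing.ppt")  # hypothetical file name
```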
Some COTS tools work partially. "We found a couple of solutions that seem to work well for certain specific cases," explains Kindlund. "For example, if your environment is only dealing with PowerPoint files, or if you are not concerned with preserving the actual file format, you can convert that file into a more readable format such as HTML or plain text. The problem with that method is that it's not completely usable on the other side of the cross-domain boundary." That means the recipient won't be able to edit the content further or use the same application.

Says Kindlund: "We found that many COTS products try to preserve the file format across domains and sanitize the information as best they can. But because the COTS tools use the internal libraries that Microsoft develops, you won't be able to see all the information."

To aid commercial vendors who design products for sanitizing Microsoft Office documents, MITRE produced taxonomy documents and corresponding "white lists" for the Microsoft Office suite. The taxonomy documents provide a comprehensive list of the hidden data and metadata types stored within Microsoft Office files (i.e., Word, PowerPoint, and Excel). White lists specify which data types are allowed to remain within a file.
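The white-list idea can be sketched in a few lines: extract a file's standard document properties and flag anything not explicitly allowed. The sketch below uses the open-source olefile library's summary-property reader; the ALLOWED set and file name are invented for illustration and bear no relation to MITRE's actual white lists.

```python
import olefile

# Minimal sketch of a white-list check: read the standard summary
# properties of a binary Office file and flag any populated field that
# is not explicitly allowed. The ALLOWED set is a hypothetical policy;
# MITRE's real white lists are far more detailed.
ALLOWED = {"title", "create_time", "num_pages"}

def audit(path):
    ole = olefile.OleFileIO(path)
    try:
        meta = ole.get_metadata()  # parses the SummaryInformation streams
        for name in meta.SUMMARY_ATTRIBS + meta.DOCSUM_ATTRIBS:
            value = getattr(meta, name, None)
            if value and name not in ALLOWED:
                print("disallowed property: %s = %r" % (name, value))
    finally:
        ole.close()

audit("report.doc")  # hypothetical file name
```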
Because applications such as word processors and spreadsheets aren't the only ones that carry hidden data, MITRE has now been asked to analyze Portable Document Format (PDF) files using a similar methodology.

—by David Van Cleave