Network Representations Support Powerful Data Analysis

Sarah Piekut, Lowell Rosen, and Daniel Venese

Some environments give investigators time to normalize data and correct anomalies, building a data warehouse before they analyze anything. Time-critical scenarios involving crime and terrorism do not permit this luxury. Investigators must exploit the data in its native form while capturing imprecise and partial relationships, knowing that not all of the information will be trustworthy. Criminals alter official records, such as driver's licenses or visa applications, to conceal their true identities. Even trustworthy databases may represent data elements such as names and addresses differently. It is important to retain all reported "facts," even redundant and contradictory ones, since investigators cannot be sure which will prove true or useful. Nothing can be discarded, because something seemingly trivial might be the key to solving the crime. How do investigators organize this pile of information so they can analyze it and get to the critical clues?

One approach to combining multisource information is a network representation that describes many types of entities and emphasizes the relationships between them, such as pairing a name with an address. Such networks provide greater flexibility than conventional databases, facilitate analyses of structural patterns, and reveal associations that otherwise would be difficult to detect. In conventional data analysis systems, the data capture and exploitation strategy is planned in advance. Network representations, by contrast, exploit whatever information is available, which matters when the types of entities and their relationships are hard to predict. MITRE has explored the use of network representations for several sponsors, developing new approaches and information management capabilities not yet available through commercial products.
Building a Network Representation

A network is a "least common denominator" data model for expressing relationship information. This structure can capture and portray complex and often obscure relationships, such as suspicious connections between people. Formally, a network structure equates to a graph representation: a set of vertices and edges. In graph data model terminology, entity instances are the vertices (or nodes) and relationships are the edges (or links). This type of representation emphasizes structural patterns, e.g., "this address is related to this name."

Attributes can be associated with each edge and vertex. An address might carry its coordinates or an indication of whether it is commercial or residential. A person's name might include a list of aliases. The relationship of name to address might include the dates when the name was associated with the address and a rating of the confidence in the relationship. A name or address might have an attribute indicating it is five hops away from a suspect organization. A network representation can be enriched over time with more instances of edges and vertices and with new attributes that improve the understanding of them.

The implementation of a network representation (such as a network repository) can employ a variety of technologies. Links can be stored in a relational database or as a series of flat files. Attributes could be specified using a data dictionary or a data language, such as Extensible Markup Language (XML) or Resource Description Framework (RDF). A common semantic understanding is important for matching field names and attribute values across multiple data sources. For example, the field named "make" in one database may be the same as "vehicle_make" in another. Attribute values also require a common understanding, e.g., the automobile make attribute "Chevy" is the same as "Chevrolet."
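The vertex, edge, and attribute model described above, together with the field-name and value matching, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Network` class, the `normalize` helper, the alias tables, and the sample name and address are hypothetical and are not drawn from any MITRE system.

```python
from dataclasses import dataclass, field

# Hypothetical alias tables, for illustration only; a real system would
# draw these from a shared data dictionary.
FIELD_ALIASES = {"make": "vehicle_make"}
VALUE_ALIASES = {"vehicle_make": {"chevy": "Chevrolet"}}


def normalize(record):
    """Map source field names and attribute values onto a common vocabulary."""
    out = {}
    for name, value in record.items():
        canonical = FIELD_ALIASES.get(name, name)
        aliases = VALUE_ALIASES.get(canonical, {})
        out[canonical] = aliases.get(str(value).lower(), value)
    return out


@dataclass
class Network:
    vertices: dict = field(default_factory=dict)  # vertex id -> attributes
    edges: list = field(default_factory=list)     # (source, target, attributes)

    def add_vertex(self, vid, **attrs):
        # Merge rather than overwrite, so attributes can be added over time.
        self.vertices.setdefault(vid, {}).update(attrs)

    def add_edge(self, source, target, **attrs):
        # Every reported link is kept, even redundant or contradictory ones.
        self.edges.append((source, target, attrs))


net = Network()
net.add_vertex("name:J. Doe", kind="name", aliases=["John Doe"])
net.add_vertex("addr:12 Elm St", kind="address", residential=True)
net.add_edge("name:J. Doe", "addr:12 Elm St",
             first_seen="2001-03", last_seen="2003-11", confidence=0.6)

vehicle = normalize({"make": "Chevy"})  # field and value mapped to common terms
```

Storing edges as an append-only list is a deliberate choice in this sketch: it mirrors the requirement to retain every reported relationship rather than collapsing duplicates at load time.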
Integration of information from multiple data sources requires a degree of preprocessing, normalization, and standardization. Depending on time and resource constraints, this can be accomplished in phases. Raw information might be loaded immediately, while derived and calculated attributes are added over time. Building a persistent network involves linking records or objects, even individual data elements, from disparate data sources that were not intended to be used together.

Once constructed, a network repository supports powerful data analysis operations that complement those of a traditional data mart. Analysis includes concepts such as adjacency (how far one node is from another) and graph topology (tree, cycle, etc.). You can also use network clustering to find subgraphs of related nodes and links. Subgraph matching tools find similar patterns of relationships among groups. We have applied these techniques effectively to government data sets containing tens of millions of records.

However, there are still challenges to address in constructing investigative applications and exploiting a network repository for storage of node/link instances. For example, applications must be able to handle a large, evolving set of attribute and link types, captured according to different standards and dealing with complex entity relationships. Distance estimates can still be misleading when entities have huge numbers of relationships. Thus, analysis within a domain context is needed to define the criteria to sever the low-value linkages. Finally, the sheer size of the network (often many terabytes or larger) can challenge implementers, who must choose among many options for storing it.

MITRE continues to address these challenges and look for better ways to manage, visualize, and analyze information in a number of work programs and in our own research projects.
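The adjacency and clustering operations discussed above, and the way a heavily connected entity can distort distance estimates, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the method used on the data sets mentioned in the article: connected components stand in as the simplest possible form of clustering, the `max_degree` cutoff is just one assumed criterion for severing low-value linkages, and all names and links are invented.

```python
from collections import defaultdict, deque


def build_adjacency(links):
    """Turn (a, b) link pairs into an undirected adjacency map."""
    adj = defaultdict(set)
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def hops(adj, start, goal, max_degree=None):
    """Breadth-first hop count from start to goal, or None if unreachable.

    If max_degree is set, paths are not extended through intermediate
    nodes that have more links than that (an assumed pruning rule).
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        neighbors = adj.get(node, ())
        if max_degree is not None and node != start and len(neighbors) > max_degree:
            continue  # do not route through a hub
        for nxt in neighbors:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def components(adj):
    """Group nodes into connected subgraphs (a very simple clustering)."""
    seen, groups = set(), []
    for root in adj:
        if root in seen:
            continue
        group, stack = set(), [root]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adj[node] - seen)
        groups.append(group)
    return groups


# Invented links: "hub" stands for an entity tied to many unrelated people.
links = [("alice", "addr1"), ("bob", "addr1"), ("bob", "phone1"),
         ("carol", "phone1"), ("alice", "hub"), ("carol", "hub"),
         ("dave", "hub"), ("erin", "hub"), ("frank", "addr2")]
adj = build_adjacency(links)

direct = hops(adj, "alice", "carol")                 # shortest path runs via "hub"
pruned = hops(adj, "alice", "carol", max_degree=3)   # same query, hub not traversed
clusters = components(adj)
```

Comparing `direct` with `pruned` shows how a single highly connected node can make two entities look closer than the more specific links suggest. What counts as a hub, and whether it should be severed at all, depends on the domain.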
For more information, please contact Sarah Piekut or Daniel Venese using the employee directory.

Page last updated: July 30, 2004