Computer Vision Offers New Tools for Searching the Video Explosion

January 2015
Topics: Intelligence Analysis, Data Management, Artificial Intelligence, Multimedia Information Systems, Image Processing, Pattern Recognition, Cloud Computing
With all the video footage that churns through millions of cameras every day, finding specific information is a burdensome, manual task. MITRE researchers have developed technology to help users find those needles in a haystack, whether the user is an intelligence analyst looking at video of a road in Iraq or a police officer looking at a street corner in New York City. Finding valuable information quickly can save lives and solve crimes.
"According to research from IBM, 90 percent of all the data in the world today has been created within the last two years," says Dr. Mikel Rodriguez. "And a tremendous portion of this growth is in video." He and his team in the MITRE Computer Vision Group are working on various technologies to make use of the explosion of video data. "Some of this growth comes from traditional sources, such as surveillance cameras, aerial video platforms, and commercial video from space. But there's also been a massive increase in pixels generated from 'unstructured' video sources such as cellphones and wearable cameras."
In response to our sponsors' most critical needs, Rodriguez and his research team are advancing the state of the art in "computer vision," a sub-field of artificial intelligence. For example, they are developing algorithms that enable computers to "see" by automatically extracting information from visual data (e.g., images, video, 3-D scans, infrared).
Deep-learning Algorithms Push Discoveries
"Computers can't come close to matching a human's ability to solve complex problems, but they can reduce some of the time-consuming drudgery so that people can focus their energies on the most important tasks," Rodriguez says.
The team has been developing new computer vision capabilities by applying recent advances in algorithms that employ deep learning and convolutional neural networks, which are loosely inspired by theories of how the human brain works. Deep learning systems are typically built from many layers of interconnected nodes. Over time, these systems learn to represent visual data by adjusting the strengths of the connections between nodes as they observe each new example of an object or action in a video.
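The article doesn't describe MITRE's networks, but the core building block of any convolutional architecture can be sketched in a few lines. The following toy example (plain numpy, with a hand-picked edge-detecting kernel standing in for a learned one) shows how one convolution-ReLU-pooling layer turns raw pixels into a smaller grid of feature responses; a real system stacks many such layers and learns the kernel weights from data.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Nonlinearity: keep positive responses, zero out the rest."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Downsample by keeping the strongest response in each size x size patch."""
    h, w = x.shape
    h, w = h - h % size, w - w % size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

# One layer: pixels -> filtered responses -> nonlinearity -> pooled features.
image = np.random.rand(8, 8)
edge_kernel = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])  # responds to vertical edges
features = max_pool(relu(conv2d(image, edge_kernel)))
print(features.shape)  # (3, 3)
```

In a trained network, hundreds of such kernels per layer are adjusted during learning, which is the "rearranging of connection strengths" described above.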
Along with the ever-increasing processing power of computers, these algorithms are now able to recognize objects and events automatically. "Essentially, we're doing the same thing with video that Google does with words," Rodriguez says. "Only images are much more complicated than words."
Putting Solutions in Users' Hands
One example of a successful computer vision technology that we are transitioning to our sponsors is Content-based Retrieval and Access (COBRA). This program helps computers automatically recognize objects and events of interest within millions of frames of aerial surveillance video. It first applies video stabilization and tracking algorithms to recognize motion. Then it employs object detection and classification—for example, identifying whether movement comes from a car or a person. After analyzing the 3-D structure of the scene, COBRA stitches frames together into a brief video summary. This process both reduces analysts’ workload and decreases the time it takes to review large quantities of video—thus increasing the speed with which an agency can act on information.
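COBRA's actual algorithms are not public, but the motion-recognition and summarization steps can be illustrated with a deliberately simple stand-in: after stabilization, pixels that change between consecutive frames indicate motion, and frames containing motion are kept for the summary. Everything below (function names, thresholds, the toy "vehicle") is a hypothetical sketch, not COBRA's implementation.

```python
import numpy as np

def detect_motion(prev_frame, curr_frame, threshold=0.2):
    """Flag pixels whose brightness changed by more than `threshold`
    between two already-stabilized frames (crude motion detection)."""
    diff = np.abs(curr_frame.astype(float) - prev_frame.astype(float))
    return diff > threshold

def summarize(frames, threshold=0.2, min_pixels=5):
    """Keep only frames that contain motion -- a toy 'video summary'."""
    keep = []
    for prev, curr in zip(frames, frames[1:]):
        if detect_motion(prev, curr, threshold).sum() >= min_pixels:
            keep.append(curr)
    return keep

# A static 16x16 scene with a small bright 'vehicle' moving across row 8.
frames = [np.zeros((16, 16)) for _ in range(4)]
for t, frame in enumerate(frames):
    frame[8, t * 3:t * 3 + 3] = 1.0

summary = summarize(frames)
print(len(summary))  # motion found between every consecutive pair: 3
```

A production system replaces frame differencing with stabilization, tracking, and learned object classifiers, but the payoff is the same: analysts review a short summary instead of hours of mostly static footage.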
One of MITRE's other computer vision research projects, called Holodeck, was inspired by the Boston Marathon bombing and the challenges law enforcement faced in searching through thousands of frames from cellphone, security, and ATM video footage.
In the months following the bombing, the team began thinking about better ways to find relevant data within a large quantity of unstructured (and sometimes "uncooperative") sources such as cellphones. Unlike with traditional sources of imagery—for which a user usually knows the platform, author, location, and time at which a video was taken—unstructured video data lacks most of this metadata.
"If you're trying to reconstruct what happened, being able to easily see when and where a camera was pointed in the context of the scene is very helpful," Rodriguez explains. "We realized that to make full use of such information, we needed to rely almost entirely on the pixels themselves."
The process begins by matching low-level features against a reference database of 500 million geotagged photos. This helps reveal each video's location and its position relative to other videos. Ultimately, using a virtual reality headset (manufactured by Oculus), a Holodeck user can step into an immersive world—say, on Boylston Street in Boston—and look around. The user then sees small video camera icons, each positioned where the video was actually shot. When the user reaches out and touches an icon, the video plays in a window in the context of the virtual reality world.
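The feature-matching step described above can be sketched as a nearest-neighbor search: a descriptor computed from a query frame is compared against descriptors of geotagged reference photos, and the closest match suggests where the camera was. The descriptors, geotags, and distance measure below are illustrative assumptions; real systems use engineered or learned image descriptors and approximate search to scale to hundreds of millions of photos.

```python
import numpy as np

def geolocate(query_desc, ref_descs, ref_geotags):
    """Return the geotag of the reference descriptor nearest to the
    query (Euclidean distance) -- a toy stand-in for matching against
    a large geotagged photo database."""
    dists = np.linalg.norm(ref_descs - query_desc, axis=1)
    best = int(np.argmin(dists))
    return ref_geotags[best], float(dists[best])

# Hypothetical 8-D descriptors for three reference photos.
rng = np.random.default_rng(0)
ref_descs = rng.normal(size=(3, 8))
ref_geotags = [(42.3503, -71.0810),   # Boylston Street, Boston
               (40.7580, -73.9855),   # Times Square, New York
               (33.3128, 44.3615)]    # Baghdad

# A slightly noisy view of reference photo 0 should match back to it.
query = ref_descs[0] + rng.normal(scale=0.01, size=8)
tag, dist = geolocate(query, ref_descs, ref_geotags)
print(tag)  # (42.3503, -71.081)
```

Once each video is localized this way, its camera icon can be placed at the recovered position in the virtual scene.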
"What tools such as COBRA and Holodeck do is really data triage," Rodriguez adds. "They cut down on the low level, mind-numbing work of watching hours of video and provide visual information within context, so people can do what they do best—solve problems."
A History of Finding Needles
The computer vision work fits in well with a research program that pioneered "audio hot-spotting," a technology that mines audio files to find words of interest, speakers, or key sound effects (e.g., laughter, applause, and mechanical noises). It also fits with our related cutting-edge research on full motion video and hyperspectral imaging.
According to Marin Halper, who oversees much of this research, "MITRE has established itself as a center for world-class research in computer vision and hyperspectral imaging. We're excited to explore how the intersection of those two domains may uniquely address some of our sponsors’ most technically complex problems."
—by Bill Eidson