Many companies and municipalities are saddled with hundreds or thousands of hours of video and limited ways to turn it into usable data. Voxel51 offers a machine learning-based option that chews through video and labels it, not just with simple image recognition but with an understanding of motions and objects over time.
Annotating video is an important task for a lot of industries, the best known of which is certainly autonomous driving. But it’s also important in robotics, in the service and retail industries, in reviewing police encounters (now that body cams are becoming commonplace) and so on.
It’s done in a variety of ways, from humans literally drawing boxes around objects in every frame and writing what’s in them to more advanced approaches that automate much of the process, even running in real time. But the general rule is that these approaches work frame by frame.
A single frame is great if you want to tell how many cars are in an image, or whether there’s a stop sign, or what a license plate reads. But what if you need to tell whether someone is walking or stepping out of the way? What about whether someone is waving or throwing a rock? Are people in a crowd going to the right or left, generally? This kind of thing is difficult to infer from a single frame, but looking at just two or three in succession makes it clear.
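To make the point concrete, here is a toy sketch in Python of why a second frame matters. It is not Voxel51's method, only an illustration: the frames are tiny made-up grids of brightness values, and the function names (`centroid_x`, `horizontal_direction`) and the threshold are assumptions invented for this example.

```python
def centroid_x(frame, threshold=128):
    """Mean column index of pixels brighter than threshold, or None if there are none."""
    cols = [x for row in frame for x, value in enumerate(row) if value > threshold]
    return sum(cols) / len(cols) if cols else None


def horizontal_direction(frames, threshold=128):
    """Classify motion across frames as 'left', 'right', 'still' or 'unknown'."""
    xs = [centroid_x(frame, threshold) for frame in frames]
    xs = [x for x in xs if x is not None]
    if len(xs) < 2:
        # A single frame shows where the object is, not where it is going.
        return "unknown"
    shift = xs[-1] - xs[0]
    if shift > 0:
        return "right"
    if shift < 0:
        return "left"
    return "still"


# Hypothetical 2x4 frames: a bright object one column further right in frame_b.
frame_a = [[0, 255, 0, 0],
           [0, 255, 0, 0]]
frame_b = [[0, 0, 255, 0],
           [0, 0, 255, 0]]

one_frame_answer = horizontal_direction([frame_a])
two_frame_answer = horizontal_direction([frame_a, frame_b])
```

With only `frame_a` the function has to give up, while the pair of frames is enough to say the object is heading right. Real systems track far richer features than a brightness centroid, but the principle is the same.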
That fact is what startup Voxel51 is leveraging to take on the established competition in this space. Video-native algorithms can do some things that single-frame ones can’t, and where they do overlap, the former often do it better.
Voxel51 emerged from computer vision work done by its co-founders, CEO Jason Corso and CTO Brian Moore, at the University of Michigan. Moore took Corso’s computer vision class, and eventually the two found they shared a desire to take ideas out of the lab.