While many of the high-level features in TRECVID are static, such as "chair," "doorway," and "classroom," some of the features intrinsically involve motion that cannot be captured by analyzing a single keyframe. I spent a considerable amount of time exploring the use of optical flow information to detect features such as "person riding a bicycle" and "person playing soccer."
The first approach I pursued was dense optical flow, using the pyramidal Lucas-Kanade method as implemented in OpenCV. Unfortunately, the results were not promising, and the computation time was substantial. One reason for the lackluster results could be the amount of motion in the background, which might wash out the motion pertinent to the feature being detected.
I next tried sparse optical flow, again using OpenCV's pyramidal Lucas-Kanade implementation, this time with OpenCV's cvGoodFeaturesToTrack method to select the pixels at which to compute flow. I built orientation histograms from the resulting flow vectors and concatenated these with our original bag-of-words histogram vectors. The results were somewhat more promising, although ultimately we determined that the modest accuracy gains did not justify the increase in computation time.
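The fusion step described above can be sketched in pure NumPy. The bin count and magnitude weighting here are illustrative assumptions rather than the original configuration: flow vectors are binned by orientation, and the normalized histogram is appended to the bag-of-words vector.

```python
import numpy as np

def orientation_histogram(flow, n_bins=8):
    """Histogram of flow-vector orientations, weighted by magnitude.
    flow: (N, 2) array of (dx, dy) displacements.
    The bin count and magnitude weighting are illustrative choices."""
    angles = np.arctan2(flow[:, 1], flow[:, 0])   # orientation in [-pi, pi)
    mags = np.linalg.norm(flow, axis=1)           # weight by motion strength
    hist, _ = np.histogram(angles, bins=n_bins,
                           range=(-np.pi, np.pi), weights=mags)
    total = hist.sum()
    return hist / total if total > 0 else hist

def fuse_with_bow(bow_vec, flow, n_bins=8):
    """Concatenate the motion histogram onto the bag-of-words vector."""
    return np.concatenate([bow_vec, orientation_histogram(flow, n_bins)])
```

The appeal of this representation is that it summarizes a shot's motion in a fixed-length vector, so the existing bag-of-words classifier pipeline can consume it unchanged.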
One reason I think the optical flow information was not very helpful is the large variation within the TRECVID dataset. For instance, "person playing soccer" may involve a zoomed-out shot of players on a soccer field or a close-up of a person juggling a soccer ball, among other possibilities. The motion patterns from different shots of the same feature may thus be very different.