MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

Recent works have begun to discover significant limitations in video-language grounding, suggesting that state-of-the-art techniques commonly overfit to hidden dataset biases. In this work, we present MAD (Movie Audio Descriptions). MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video and exhibits a significant reduction in the currently diagnosed biases for video-language grounding datasets. MAD enables a novel and more challenging version of video-language grounding, where short temporal moments must be accurately grounded in diverse long-form videos.
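Grounding accuracy in this kind of task is commonly scored with temporal intersection-over-union (tIoU) between a predicted moment and the annotated one. The abstract does not state which metric MAD uses, so the sketch below is only an illustration of that common measure; the function name `temporal_iou` and the `(start, end)` tuple layout in seconds are assumptions made for this example.

```python
def temporal_iou(pred, gt):
    """Return the IoU of two (start, end) intervals given in seconds.

    Hypothetical helper for illustration; not taken from the MAD codebase.
    """
    # Overlap is zero when the intervals do not intersect.
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # A 10-second prediction that overlaps a 10-second annotation by 5 seconds.
    print(temporal_iou((10.0, 20.0), (15.0, 25.0)))
```

In long-form video, a short annotated moment makes this score unforgiving: a prediction that is off by a few seconds can share little or no overlap with the target.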

Learning to Cut by Watching Movies

We propose a new method and pipeline for recommending cuts in video editing. Our method uses already edited content (movies) to learn, via contrastive learning, the patterns that separate plausible cuts from implausible ones. We set up a new task and a set of baselines to benchmark video cut generation. To demonstrate our model in real-world applications, we conduct human studies on a collection of unedited videos. The results show that our model produces better cuts than random and alternative baselines.
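The contrastive idea mentioned above can be sketched with a generic InfoNCE-style loss. The abstract does not specify the loss or how pairs are formed, so everything here is an assumption for illustration: the anchor is taken to be an embedding of the clip before a cut, the positive an embedding of the clip that follows it in the edited movie, and the negatives embeddings of clips that would make implausible cuts. The names `info_nce` and `temperature` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (illustrative, not the paper's loss).

    The loss is small when the anchor is more similar to the positive
    than to any of the negatives.
    """
    a = normalize(anchor)
    # Cosine similarities scaled by the temperature; the positive comes first.
    logits = [dot(a, normalize(positive)) / temperature]
    logits += [dot(a, normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


if __name__ == "__main__":
    plausible = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
    implausible = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
    print(plausible < implausible)
```

Under these assumptions, ranking candidate cut points at inference time would amount to scoring each candidate pair by its similarity and keeping the highest-scoring ones.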

BAOD: Budget-Aware Object Detection

We study object detection from a novel perspective that takes annotation budget constraints into account, a setting we call Budget-Aware Object Detection (BAOD). Given a fixed budget, we propose a strategy for building a diverse and informative dataset that can be used to optimally train a hybrid-supervised detector (one that combines weak and full supervision). We show that one can achieve the performance of a strongly supervised detector on PASCAL-VOC 2007 while saving 12.8% of its original annotation budget.

RefineLoc: Iterative Refinement for Weakly-Supervised Action Localization

RefineLoc is a weakly-supervised temporal action localization method. It refines its predictions iteratively: at every iteration it estimates snippet-level pseudo ground truth and then trains on it. Our method achieves results competitive with the state of the art in weakly-supervised temporal localization. Additionally, our iterative refinement process significantly improves the performance of two state-of-the-art methods, setting a new state of the art on THUMOS14.
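The estimate-then-train loop described above can be illustrated with a toy example. This is not RefineLoc's implementation: the thresholding rule, the `step` update that stands in for a real training pass, and the names `pseudo_labels` and `refine` are all assumptions chosen to keep the sketch self-contained.

```python
def pseudo_labels(scores, threshold=0.5):
    """Turn per-snippet foreground scores into hard 0/1 pseudo ground truth."""
    return [1 if s >= threshold else 0 for s in scores]


def refine(scores, iterations=3, step=0.5, threshold=0.5):
    """Toy refinement loop (illustrative only).

    Each iteration estimates pseudo labels from the current scores, then
    moves the scores part of the way toward those labels. The update is a
    stand-in for training a model on the pseudo ground truth.
    """
    for _ in range(iterations):
        labels = pseudo_labels(scores, threshold)
        scores = [s + step * (y - s) for s, y in zip(scores, labels)]
    return scores


if __name__ == "__main__":
    initial = [0.6, 0.4, 0.9, 0.1]  # hypothetical snippet-level scores
    print(refine(initial))
```

The toy also shows the main risk of such loops: snippets that start on the wrong side of the threshold are reinforced rather than corrected, so the quality of the first estimate matters.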