Understanding movies and their structural patterns is a crucial task to decode the craft of video editing. While previous works have developed tools for general analysis such as detecting characters or recognizing cinematography properties at the shot level, less effort has been devoted to understanding the most basic video edit, the Cut. We construct a large-scale dataset called MovieCuts, which contains more than 170K video clips labeled among ten cut types.
Recent works have begun to discover significant limitations in video-language grounding, suggesting that state-of-the-art techniques commonly overfit to hidden dataset biases. In this work, we present MAD (Movie Audio Descriptions). MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video and exhibits a significant reduction in the currently diagnosed biases for video-language grounding datasets. MAD enables a novel and more challenging version of video-language grounding, where short temporal moments must be accurately grounded in diverse long-form videos.
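To make "accurately grounded" concrete, the sketch below computes temporal intersection-over-union (tIoU) between a predicted moment and an annotated moment. This abstract does not name an evaluation metric, so using tIoU here is an assumption; the function name `temporal_iou` and the (start, end) tuple convention in seconds are hypothetical.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds.

    Assumption: this metric is illustrative only; the abstract above
    does not state which metric is used.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the two segments do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0
```

In a long-form video, a short moment covers a tiny fraction of the timeline, so a prediction that is off by only a few seconds can already have little or no overlap with the annotated moment.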
We propose a new method and pipeline for recommending video editing cuts. Our method uses already edited content (movies) to learn to distinguish plausible from implausible cuts via contrastive learning. We set up a new task and a set of baselines to benchmark video cut generation. To demonstrate our model in real-world applications, we conduct human studies on a collection of unedited videos. The results show that our model produces better cuts than random and alternative baselines.
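As a rough illustration of contrasting plausible against implausible cuts, the sketch below computes an InfoNCE-style loss for one cut taken from an edited movie (the positive) against several candidate cuts that were not used (the negatives). The abstract does not specify the loss, so InfoNCE, the scalar scores, the function name `info_nce_loss`, and the `temperature` parameter are all assumptions made for illustration.

```python
import math


def info_nce_loss(pos_score, neg_scores, temperature=0.1):
    """InfoNCE-style loss for one plausible cut versus implausible ones.

    Assumption: each score is a similarity produced by some hypothetical
    model for a candidate cut; higher means "more plausible".
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Log-sum-exp with max subtraction for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # Negative log-probability assigned to the positive (real) cut.
    return log_sum_exp - logits[0]
```

The loss is small when the real cut scores well above the alternatives and grows when an implausible cut outscores it, which is the signal that would push a model toward ranking real cuts first.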
We study the problem of object detection from a novel perspective in which annotation budget constraints are taken into consideration, appropriately coined Budget Aware Object Detection (BAOD). When provided with a fixed budget, we propose a strategy for building a diverse and informative dataset that can be used to optimally train a hybrid-supervised (weak and full supervision combined) detector. We show that one can achieve the performance of a strongly supervised detector on PASCAL-VOC 2007 while saving 12.8% of its original annotation budget.
RefineLoc is a weakly-supervised temporal action localization method. RefineLoc uses an iterative refinement approach by estimating and training on snippet-level pseudo ground truth at every iteration. Our method shows competitive results with the state-of-the-art in weakly-supervised temporal localization. Additionally, our iterative refinement process significantly improves the performance of two state-of-the-art methods, setting a new state-of-the-art on THUMOS14.
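The iterative refinement loop described above can be sketched with a toy example: at each iteration, snippet-level pseudo ground truth is estimated from the current scores, and the scores are then nudged toward those labels as a stand-in for a training step. This is not the RefineLoc implementation; the thresholding rule, the update rule, and the names `make_pseudo_labels`, `refine`, `lr`, and `threshold` are all hypothetical.

```python
def make_pseudo_labels(scores, threshold=0.5):
    """Estimate snippet-level pseudo ground truth by thresholding scores."""
    return [1 if s >= threshold else 0 for s in scores]


def refine(scores, iterations=3, lr=0.5, threshold=0.5):
    """Toy refinement loop: re-estimate pseudo labels, then 'train' on them.

    Assumption: moving each score toward its pseudo label stands in for
    a real training step on a localization model.
    """
    history = []
    for _ in range(iterations):
        labels = make_pseudo_labels(scores, threshold)
        history.append(labels)
        # Stand-in for training: move each score a fraction lr toward its label.
        scores = [s + lr * (y - s) for s, y in zip(scores, labels)]
    return scores, history
```

For example, `refine([0.6, 0.4], iterations=1)` assigns pseudo labels `[1, 0]` and moves the scores apart, so each pass yields more confident snippet-level predictions for the next one.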