GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension (IJCAI 2024)

Jiafeng Liang1, Shixin Jiang1, Zekun Wang1, Haojie Pan3, Zerui Chen1, Zheng Chu1, Ming Liu1,2*, Ruiji Fu3, Zhongyuan Wang3, Bing Qin1,2
1Harbin Institute of Technology, Harbin, China 2Peng Cheng Laboratory, Shenzhen, China 3Kuaishou Technology, Beijing, China

If you are interested in our dataset, you can register your information through the Google Form below, and we will send the dataset to your email address.
Google Form



Overview of the GUIDE dataset.


GUIDE consists of 560 task queries, each containing an average of 6.2 task-related videos. These instructional videos are divided into specific steps with timestamps and text descriptions (yellow area). Additionally, each task contains a set of guideline steps representing a common pattern shared by all task-related videos (purple area).
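To make this layout concrete, here is a minimal sketch (in Python) of what one task record could look like. All field names and values are illustrative assumptions, not the dataset's actual schema:

# Hypothetical layout of one GUIDE task record; field names are
# illustrative only and do not reflect the released schema.
task_record = {
    "task_query": "how to make a paper airplane",
    "guideline_steps": [                      # task-level common pattern (purple area)
        "Prepare a sheet of paper",
        "Fold the paper into shape",
        "Adjust the wings",
    ],
    "videos": [
        {
            "video_id": "video_001",
            "specific_steps": [               # video-level steps (yellow area)
                {
                    "description": "Take out a square sheet of paper",
                    "start_sec": 3.0,         # timestamp boundaries
                    "end_sec": 11.5,
                    "guideline_step_idx": 0,  # associated guideline step
                },
            ],
        },
    ],
}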

Abstract

There are substantial instructional videos on the Internet, which provide tutorials for completing various tasks. Existing instructional video datasets only focus on specific steps at the video level and lack experiential guidelines at the task level, which can lead to beginners struggling to learn new tasks due to the lack of relevant experience. Moreover, the specific steps without guidelines are trivial and unsystematic, making it difficult to provide a clear tutorial.

To address these problems, we present the GUIDE (Guideline-Guided) dataset, which contains 3.1K videos of 560 instructional tasks in 8 domains related to our daily life. Specifically, we annotate each instructional task with a guideline, representing a common pattern shared by all task-related videos. On this basis, we annotate systematic specific steps, including their associated guideline steps, specific step descriptions, and timestamps.

Our proposed benchmark consists of three sub-tasks to evaluate the comprehension ability of models: (1) Step Captioning: models have to generate captions for specific steps from videos. (2) Guideline Summarization: models have to mine the common pattern in task-related videos and summarize a guideline from them. (3) Guideline-Guided Captioning: models have to generate captions for specific steps under the guidance of the guideline. We evaluate a wide range of foundation models on GUIDE and perform in-depth analysis. Given the diversity and practicality of GUIDE, we believe it can serve as a better benchmark for instructional video comprehension.

Dataset Construction Pipeline

Video Collection

In this stage, we aim to collect a large number of high-quality instructional videos. To ensure the wide applicability of GUIDE, we collect videos from 560 different instructional tasks across the 8 most common domains in our daily life. We require annotators to collect videos containing explicit instructional steps and clearly defined time boundaries between these steps. To further enhance the practicality of the dataset, we also require that the collected videos include detailed subtitles, i.e., each step is accompanied by a corresponding voice explanation.
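As a rough illustration of these criteria, the sketch below checks a hypothetical video record for subtitles and well-formed, non-overlapping step boundaries. The record fields and the function itself are assumptions for exposition, not the authors' actual collection tooling:

def meets_collection_criteria(video: dict) -> bool:
    """Hypothetical quality check mirroring the stated criteria:
    explicit steps, clear time boundaries, and detailed subtitles."""
    steps = video.get("specific_steps", [])
    if not steps or not video.get("subtitles"):
        return False
    prev_end = 0.0
    for step in sorted(steps, key=lambda s: s["start_sec"]):
        # Boundaries must be well-formed and non-overlapping.
        if step["start_sec"] < prev_end or step["end_sec"] <= step["start_sec"]:
            return False
        prev_end = step["end_sec"]
    return True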

Automatic Annotation

The automatic annotation framework contains two stages: Specific Steps Generation and Guideline Steps Generation.
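A minimal sketch of this two-stage idea is shown below, using the OpenAI Python SDK with GPT-3.5-turbo (the model named in the Manual Annotation section). The prompts and function names are illustrative assumptions, not the prompts used to build GUIDE:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chat(prompt: str) -> str:
    # Single-turn call to GPT-3.5-turbo; the prompts below are illustrative.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Stage 1: Specific Steps Generation -- summarize one video's subtitles
# into a numbered list of specific steps.
def generate_specific_steps(subtitles: str) -> str:
    return chat(
        "Summarize these instructional video subtitles into a numbered "
        "list of specific steps:\n" + subtitles
    )

# Stage 2: Guideline Steps Generation -- abstract the common pattern
# shared by the specific steps of all videos under the same task.
def generate_guideline_steps(per_video_steps: list[str]) -> str:
    return chat(
        "These are step lists from several videos of the same task. "
        "Summarize their common pattern as guideline steps:\n"
        + "\n---\n".join(per_video_steps)
    )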

Manual Annotation

We employ an expert in each domain to adjust all guideline steps, and require the annotators to refine the specific steps generated by GPT-3.5-turbo and to annotate the timestamps of steps by watching the videos.

GUIDE Dataset

Comparison of GUIDE and other video datasets with annotations

GUIDE covers 1) videos from various domains, 2) many step annotations per video, and 3) high-quality step captions written by human annotators and GPT-3.5-turbo.

Task Definition

Step Captioning
The step captioning task evaluates the models' capability to understand the procedural temporal knowledge of an instructional video. In this task, models have to generate a set of instructional step captions.
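For illustration, generated step captions could be scored against reference captions with a simple overlap measure. The toy function below is an assumption for exposition only, not the benchmark's official metric (captioning benchmarks typically report standard metrics such as BLEU-style scores):

from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Toy unigram-overlap score: fraction of candidate tokens
    that also appear in the reference (with clipped counts)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    total = sum(cand.values())
    hits = sum(min(n, ref[w]) for w, n in cand.items())
    return hits / total if total else 0.0

print(unigram_precision("fold the paper in half",
                        "fold the paper along the middle"))  # 0.6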
Guideline Summarization
The guideline summarization task evaluates the models' capability to analyze correlations across videos. In this task, models have to mine the common pattern in task-related videos and summarize a guideline from them.
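As a naive illustration of common-pattern mining (not the method used in the paper), the sketch below keeps a step from the first video only if a lexically similar step appears in most of the other task-related videos:

def summarize_guideline(videos_steps: list[list[str]],
                        min_ratio: float = 0.5) -> list[str]:
    """Naive common-pattern mining for illustration only: keep a step if
    similar steps occur in at least `min_ratio` of the other videos."""
    def similar(a: str, b: str) -> bool:
        # Jaccard overlap of word sets as a crude similarity proxy.
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(len(wa | wb), 1) > 0.5

    guideline = []
    for step in videos_steps[0]:
        support = sum(any(similar(step, s) for s in other)
                      for other in videos_steps[1:])
        if len(videos_steps) == 1 or support / (len(videos_steps) - 1) >= min_ratio:
            guideline.append(step)
    return guideline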
Guideline-Guided Captioning
To explore the impact of guidelines on step captioning, we propose the guideline-guided captioning task. In this task, models have to generate specific step captions under the guidance of the guideline.
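One simple way to condition generation on a guideline is to prepend it to the model input. The prompt-building sketch below is a hypothetical example, not the prompt used in the benchmark:

def build_guided_prompt(guideline_steps: list[str], subtitles: str) -> str:
    """Illustrative prompt: the task-level guideline conditions the
    generation of video-level specific step captions."""
    guide = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(guideline_steps))
    return (
        "Task guideline:\n" + guide +
        "\n\nUsing the guideline above, write a caption for each "
        "specific step in this video:\n" + subtitles
    )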

Video Category Distribution

The videos and text queries are collected from the Kuaishou platform. There is a wide variety of categories for GUIDE videos. The most frequent categories are "Hobbies and Crafts", "Food and Entertaining", and "Home and Garden".

Dataset Case

We evaluate three video foundation models on GUIDE: VideoChat, Video-LLaMA, and mPLUG-Owl. We evaluate four language foundation models on GUIDE: GPT-3.5-turbo, GPT-4, Flan-T5-XXL, and Vicuna-13B.
Comparison of foundation models and ground-truth annotation for step captioning, guideline summarization, and guideline-guided captioning. Green, yellow, and red text denote ‘correct’, ‘partially correct’, and ‘wrong’, respectively.

Citation

Please cite our paper if you use our dataset and/or method in your projects.

@misc{liang2024guideguidelineguideddatasetinstructional,
      title={GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension},
      author={Jiafeng Liang and Shixin Jiang and Zekun Wang and Haojie Pan and Zerui Chen and Zheng Chu and Ming Liu and Ruiji Fu and Zhongyuan Wang and Bing Qin},
      year={2024},
      eprint={2406.18227},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.18227},
}