We provide a large-scale lecture video dataset consisting of artificially-generated lectures and the corresponding ground-truth fragmentation, for the purpose of evaluating lecture video fragmentation techniques.
To create this dataset, 1498 speech transcript files (generated automatically by ASR software) were taken from VideoLectures.NET, the world's biggest academic online video repository. These transcripts correspond to lectures from various fields of science, such as computer science, mathematics, medicine, and politics. To create the synthetic lectures, each transcript was randomly split into fragments whose durations range between 4 and 8 minutes. Each synthetic lecture was then assembled by stitching together exactly 20 randomly selected fragments. The released dataset includes 300 such artificially-generated lectures. Each lecture has a mean duration of about 120 minutes, so the dataset contains about 600 hours of artificially-generated lectures in total.

Every pair of consecutive fragments in these lectures originally comes from different videos; consequently, the point in time where two such fragments are joined is a known ground-truth fragment boundary. All these boundaries form the dataset's ground truth. We stress that we do not generate the corresponding video files for the artificially-generated lectures (only the transcripts), and one should not attempt to reverse-engineer the dataset creation process in order to exploit the visual modality for detecting fragments in this dataset.
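For illustration, the following Python sketch outlines one possible way such synthetic lectures could be assembled from timestamped ASR transcripts. It is not the actual generation code used for the released dataset: the Segment class, the function names, and the sampling scheme (drawing each fragment from a distinct source lecture, which trivially guarantees that consecutive fragments come from different videos) are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


# Hypothetical minimal representation of one ASR transcript segment:
# start/end times (in seconds) within the source lecture plus the spoken text.
@dataclass
class Segment:
    start: float
    end: float
    text: str


def split_into_fragments(segments, min_len=4 * 60, max_len=8 * 60, rng=random):
    """Randomly split one transcript into fragments lasting 4 to 8 minutes."""
    fragments, current = [], []
    frag_start = segments[0].start if segments else 0.0
    target = rng.uniform(min_len, max_len)
    for seg in segments:
        current.append(seg)
        if seg.end - frag_start >= target:
            fragments.append(current)
            current = []
            frag_start = seg.end
            target = rng.uniform(min_len, max_len)
    return fragments  # any under-length remainder is simply dropped here


def assemble_synthetic_lecture(fragments_by_source, n_fragments=20, rng=random):
    """Stitch n_fragments fragments, each taken from a different source lecture.

    fragments_by_source maps a source-lecture id to its list of fragments;
    returns the stitched transcript plus the ground-truth boundary times.
    """
    sources = rng.sample(list(fragments_by_source), n_fragments)
    stitched, boundaries, offset = [], [], 0.0
    for i, src in enumerate(sources):
        fragment = rng.choice(fragments_by_source[src])
        base = fragment[0].start
        for seg in fragment:
            # Re-time every segment so the synthetic lecture runs continuously.
            stitched.append(Segment(offset + seg.start - base,
                                    offset + seg.end - base,
                                    seg.text))
        offset = stitched[-1].end
        if i < n_fragments - 1:
            boundaries.append(offset)  # join point between two source lectures
    return stitched, boundaries
```

Under these assumptions, the boundary list returned for each synthetic lecture is exactly the ground truth against which a fragmentation method's predicted boundaries would be compared.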