Group input images into clusters
Transcribe audio or YouTube videos into text
Analyze video motion and attention with a neural model