Submission · sayonsom

When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

Text-to-video diffusion models have enabled open-ended video synthesis, but they often fail to generate the number of objects specified in a prompt.

Source: git url
Created: 4/12/2026, 6:52:47 AM
Entrypoint: python main.py
Source ref: https://github.com/H-EmbodVis/NUMINA

Runs (5)

Failed · d888e915 · sha256:a803f1e15e23… · 3.3s · 4/13/2026, 7:31:53 PM
Failed · 36ace5cd · sha256:a803f1e15e23… · 3.6s · 4/13/2026, 3:04:24 PM
Failed · f6a5a4c9 · sha256:a803f1e15e23… · 3.4s · 4/13/2026, 2:47:23 PM
Failed · 8f2cf32c · 4/13/2026, 2:09:54 PM
Failed · 68d29c88 · sha256:a803f1e15e23… · 750.2s · 4/12/2026, 6:52:48 AM