Cosmos-Reason1, released by NVIDIA, is a family of multimodal large language models designed for physical common sense and embodied reasoning. It includes two models, Cosmos-Reason1-8B and Cosmos-Reason1-56B, which perceive the physical world from visual input and generate natural-language responses after long chain-of-thought reasoning. Those responses range from explanatory insights to embodied decisions.
Physical common sense: Understands space, time, and basic physical laws, and judges whether events are plausible.
Embodied reasoning: Generates reasonable decisions and action plans for embodied agents such as robots and autonomous vehicles.
Long-chain thinking: Provides a detailed reasoning process, which makes decisions more transparent and interpretable.
Multimodal input processing: Accepts video input, combines the visual information with language instructions, and generates natural-language responses.
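To illustrate how a long-chain response can be consumed downstream, the following is a minimal sketch that separates the reasoning trace from the final answer. It assumes the model wraps its output in `<think>` and `<answer>` tags; that tag format, along with the names `ParsedResponse` and `split_response`, is an assumption made for this example and is not taken from the text above.

```python
import re
from typing import NamedTuple, Optional


class ParsedResponse(NamedTuple):
    reasoning: Optional[str]  # text inside <think>...</think>, if present
    answer: str               # text inside <answer>...</answer>, or a fallback


def split_response(text: str) -> ParsedResponse:
    """Split a model response into its reasoning trace and final answer.

    The <think>/<answer> tag format is an assumed convention for this sketch.
    """
    think = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)

    reasoning = think.group(1).strip() if think else None
    if answer:
        final = answer.group(1).strip()
    else:
        # No answer tag: treat everything outside the reasoning block as the answer.
        final = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return ParsedResponse(reasoning, final)


if __name__ == "__main__":
    raw = "<think>The cup is tilted past its rim, so the water spills.</think><answer>B</answer>"
    parsed = split_response(raw)
    print("Reasoning:", parsed.reasoning)
    print("Answer:", parsed.answer)
```

Keeping the reasoning trace as a separate field lets an application log or display it for inspection while acting only on the final answer.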
Hierarchical ontology: Defines physical common sense as a hierarchy covering space, time, and fundamental physics.
Two-dimensional ontology: Describes embodied reasoning along two axes, covering four key reasoning capabilities across five types of embodied agents.
Multimodal architecture: Uses a decoder-only multimodal architecture to process video and text input.
Four-stage training:
Vision pre-training: Aligns the visual and text modalities.
General supervised fine-tuning (SFT): Improves the model's performance on general vision-language tasks.
Physical AI SFT: Strengthens physical common sense and embodied reasoning capabilities.
Physical AI reinforcement learning: Further improves reasoning ability using rule-based rewards.
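A reward computed by fixed rules, as used in the reinforcement learning stage above, can be sketched for a multiple-choice question as follows. This is an illustrative example only: the `<think>`/`<answer>` tag format, the A-D option letters, the function name `rule_based_reward`, and the `format_weight` value are all assumptions made for this sketch, not the actual training recipe.

```python
import re

# Assumed answer format: a single option letter A-D inside <answer> tags,
# optionally parenthesised, e.g. "<answer>(B)</answer>".
ANSWER_RE = re.compile(r"<answer>\s*\(?([A-Da-d])\)?\s*</answer>")


def rule_based_reward(response: str, correct_option: str,
                      format_weight: float = 0.2) -> float:
    """Score a response with deterministic rules instead of a learned reward model.

    accuracy: 1.0 if the extracted option matches the ground truth, else 0.0.
    format:   format_weight if the response has both a reasoning block and a
              parseable answer, else 0.0. The weight is a hypothetical value.
    """
    match = ANSWER_RE.search(response)
    has_reasoning = "<think>" in response and "</think>" in response

    accuracy = 1.0 if match and match.group(1).upper() == correct_option.upper() else 0.0
    format_bonus = format_weight if (match is not None and has_reasoning) else 0.0
    return accuracy + format_bonus


if __name__ == "__main__":
    good = "<think>The ball rolls downhill, so it moves left.</think><answer>B</answer>"
    wrong = "<think>The ball stays still.</think><answer>A</answer>"
    print(rule_based_reward(good, "B"))
    print(rule_based_reward(wrong, "B"))
```

Because the score depends only on string matching against a known answer, it is cheap to compute and verifiable, which is the main appeal of rule-based rewards over a learned reward model.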
Robot operation: Helps robots understand task goals and generate operation plans.
Autonomous driving: Processes road video and supports safe driving decisions.
Intelligent monitoring: Detects abnormal behavior in video in real time and raises alerts.
Virtual reality/augmented reality: Generates interactive responses based on input from the virtual environment.
Education and training: Assists teaching by explaining physical phenomena or operating procedures.
Cosmos-Reason1 is aimed at advancing physical AI applications across multiple fields, particularly robotics, autonomous driving, and intelligent monitoring.