The Multimodal Node is a versatile reasoning engine capable of processing and generating standard text, as well as understanding other media types such as images and videos.

Typical Usage

The Multimodal node is a primary point of LLM invocation within your workflow. It is designed to process various content types, analyzing text, image, or video inputs and generating intelligent text outputs based on your specific prompts and configuration. This node connects your data with the reasoning capabilities of large language models.

Example Configuration

Multimodal Node Clean Workflow
  1. Connection: The node can be connected to a ‘Start’ node or to any other node for input or output.
  2. Configuration: The configuration panel allows for detailed setup of the model’s behavior.

Configuration Details

Multimodal Config Top Section

Core Settings

  • Title: Give your node a descriptive name (e.g., “Analyze Receipt Image”).
  • Description: Add a brief summary of what this node does for documentation purposes.
  • LLM: Select the specific Large Language Model to power this node (e.g., GIDR LLM 2).
  • Reasoning level: Controls the depth of the model’s analysis. Options include Disable (for supported models), Low (faster), Medium (balanced), and High (thorough). Note: The “Disable” option is model-dependent and may not be available for all LLMs (e.g., GIDR LLM 2).
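The core settings above can be pictured as a small configuration object. The sketch below is purely illustrative: the dict keys, the `is_valid_reasoning_level` helper, and the level names as lowercase strings are assumptions, not the product's actual export format or API.

```python
# Hypothetical node configuration mirroring the Core Settings above.
# Keys and values are illustrative assumptions, not the product's schema.
node_config = {
    "title": "Analyze Receipt Image",
    "description": "Extracts totals and line items from a receipt photo.",
    "llm": "GIDR LLM 2",
    "reasoning_level": "medium",  # disable (model-dependent) / low / medium / high
}

def is_valid_reasoning_level(level: str, model_supports_disable: bool) -> bool:
    """Check a reasoning level, remembering that 'disable' is model-dependent."""
    levels = {"low", "medium", "high"}
    if model_supports_disable:
        levels.add("disable")
    return level in levels
```

For a model like GIDR LLM 2 that does not support disabling reasoning, `is_valid_reasoning_level("disable", model_supports_disable=False)` would reject the setting.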

Multimodal Config Middle Section

Prompts & Conversation History

  • Prompt: The main instruction for the AI (e.g., “Please provide a detailed analysis of the input”). You can mix static text with dynamic variables.
  • No. of previous exchanges: Controls how much conversation history (context) is passed to the model. ‘0’ means no history (stateless).
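To make the interaction between prompt variables and the ‘No. of previous exchanges’ setting concrete, here is a minimal sketch of how they might combine into a model request. The function name, message shape, and `{user_name}`-style placeholder syntax are assumptions for illustration, not the product's internals.

```python
# Illustrative only: fill prompt variables and trim conversation history.
def build_messages(prompt_template, variables, history, n_previous):
    """Substitute variables into the prompt and keep only the last
    n_previous (user, assistant) exchanges; 0 means stateless."""
    prompt = prompt_template.format(**variables)
    kept = history[-n_previous:] if n_previous > 0 else []
    messages = []
    for user_msg, assistant_msg in kept:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": prompt})
    return messages

msgs = build_messages(
    "Please analyze the receipt for {user_name}.",
    {"user_name": "Alice"},
    history=[("hi", "hello"), ("thanks", "welcome")],
    n_previous=1,
)
# With n_previous=1, only the most recent exchange is kept before the new prompt.
```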

Multimodal Config Bottom Section

Advanced Options

  • Skip if no image: Automatically bypasses this node if the input does not contain image data. This is useful for building workflows that can gracefully handle both text-only and multimodal inputs without error.
  • Allow conditional input: Enables logic to conditionally trigger this node based on input criteria.
  • Get input by reference: Advanced setting to process inputs via URL reference rather than direct value. This is typically used for handling large documents and video files. Note: this feature is only supported by Google Gemini LLMs.
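The reference-versus-inline choice can be sketched as a simple decision rule. Everything below is an assumption for illustration (the threshold, the payload shape, the `build_media_part` name); the actual wire format is not documented here.

```python
# Illustrative sketch of "Get input by reference": large files are passed
# as a URL for the backend to fetch, instead of being sent inline.
INLINE_LIMIT_BYTES = 7 * 1024 * 1024  # assumed cutoff, matching the 7 MB image limit

def build_media_part(url, size_bytes, by_reference):
    if by_reference or size_bytes > INLINE_LIMIT_BYTES:
        # Pass only the URL; the (Gemini-only) backend retrieves the file itself.
        return {"type": "reference", "url": url}
    return {"type": "inline", "url": url, "size": size_bytes}

part = build_media_part("https://example.com/video.mp4", 200_000_000, by_reference=True)
```

A small image under the limit would instead be sent inline, which avoids an extra fetch on the backend.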

Input Limits (Gemini)

  • Images:
    • Capacity: max 7 MB per file
    • Supported formats: PNG, JPEG, WEBP, HEIC, HEIF
  • Documents:
    • Capacity: max 1,000 pages per file; max 50 MB
    • Supported formats: PDF, plain text
  • Video:
    • Capacity: ~45 min (with audio); ~1 hour (without audio)
    • Supported formats: FLV, MOV, MPEG, MP4, WEBM, WMV, 3GPP
  • Audio:
    • Capacity: ~8.4 hours
    • Supported formats: AAC, FLAC, MP3, M4A, MPEG, MPGA, WAV, OGG
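A workflow can pre-validate uploads against the limits above before invoking the node. This is a hedged sketch: the `validate` helper, the lowercase format names, and the rule table are illustrative, and only the image and document size limits from this page are encoded.

```python
# Illustrative pre-check against the Gemini input limits listed above.
LIMITS = {
    "image": {
        "max_bytes": 7 * 1024 * 1024,  # 7 MB
        "formats": {"png", "jpeg", "webp", "heic", "heif"},
    },
    "document": {
        "max_bytes": 50 * 1024 * 1024,  # 50 MB
        "formats": {"pdf", "txt"},
    },
}

def validate(media_type, fmt, size_bytes):
    """Return (ok, reason) for a proposed upload."""
    rule = LIMITS.get(media_type)
    if rule is None:
        return False, f"unknown media type: {media_type}"
    if fmt.lower() not in rule["formats"]:
        return False, f"unsupported format: {fmt}"
    if size_bytes > rule["max_bytes"]:
        return False, "file too large"
    return True, "ok"

ok, reason = validate("image", "PNG", 5 * 1024 * 1024)
```

Rejecting oversized or unsupported files up front is cheaper than letting the model call fail downstream.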
  • Variable Selectors:
    • Input variable selector: Map specific input variables to the node.
    • Prompt variable selector: Inject variables (e.g., user name, date) directly into your prompt.