Multimodal Node

The Multimodal Node is a versatile reasoning engine capable of processing and generation standard text as well as understanding other media types like images & videos.

Typical Usage

The Multimodal node is a primary point for LLM invocation within your workflow. It is designed to process various content types, analyzing text, image, or video inputs and generate intelligent text outputs based on your specific prompts and configuration. This node connects your data with the reasoning capabilities of large language models.

Example Configuration

Connection: The node can be connected to a ‘Start’ node or any of the other nodes for input or output.
Configuration: The configuration panel allows for detailed setup of the model’s behavior.

Configuration Details

Core Settings

Title: Give your node a descriptive name (e.g., “Analyze Receipt Image”).
Description: Add a brief summary of what this node does for documentation purposes.
LLM: Select the specific Large Language Model to power this node (e.g., GIDR LLM 2).
Reasoning level: Controls the depth of the model’s analysis. Options include Disable (for supported models), Low (faster), Medium (balanced), and High (thorough). Note: The “Disable” option is model-dependent and may not be available for all LLMs (e.g., GIDR LLM 2).

Prompts & Conversation History

Prompt: The main instruction for the AI (e.g., “Please provide a detailed analysis of the input”). You can mix static text with dynamic variables.
No. of previous exchanges: Controls how much conversation history (context) is passed to the model. ‘0’ means no history (stateless).

Advanced Options

Skip if no image: Automatically bypasses this node if the input does not contain image data. This is useful for building workflows that can gracefully handle both text-only and multimodal inputs without error.
Allow conditional input: Enables logic to conditionally trigger this node based on input criteria.
Get input by reference: Advanced setting to process inputs via URL reference rather than direct value. This is typically used for handling large documents and video files.

This feature is ONLY supported by Google Gemini LLMs.

Input Limits (Gemini)

Media Type	Capacity & Size Limits	Supported Formats
Images	• Max 7 MB	PNG, JPEG, WEBP, HEIC, HEIF
Documents	• Max 1,000 pages per file • Max 50 MB	PDF, Plain Text
Video	• ~45 min (with audio) • ~1 hour (without audio)	FLV, MOV, MPEG, MP4, WEBM, WMV, 3GPP
Audio	• ~8.4 hours	AAC, FLAC, MP3, M4A, MPEG, MPGA, WAV, OGG

Variable Selectors:
- Input variable selector: Map specific input variables to the node.
- Prompt variable selector: Inject variables (e.g., user name, date) directly into your prompt.

Introduction

Organization & Teams

GIDRs & Gidgets

End User

Multimodal Node

Typical Usage

Example Configuration

Configuration Details

Core Settings

Prompts & Conversation History

Advanced Options

Input Limits (Gemini)

Introduction

Organization & Teams

GIDRs & Gidgets

End User

​Typical Usage

​Example Configuration

​Configuration Details

​Core Settings

​Prompts & Conversation History

​Advanced Options

​Input Limits (Gemini)

Typical Usage

Example Configuration

Configuration Details

Core Settings

Prompts & Conversation History

Advanced Options

Input Limits (Gemini)