Agent Execution Examples
A technical reference for implementing SwiftyClip MCP tool calls in autonomous agent workflows. These examples demonstrate how Large Language Models (LLMs) can orchestrate complex video editing pipelines through a series of atomic, verifiable tool operations.
The Model Context Protocol (MCP) allows agents to treat video processing as a standard input/output problem. By exposing SwiftyClip's core engine as a set of tools, agents like Claude Code or GitHub Copilot can perform high-level reasoning over video content without needing to handle raw binary streams or low-level AVFoundation APIs directly.
1. Ingest, Score, and Render Top Segments
The fundamental workflow for most video agents involves a full linear pipeline. This starts with asset discovery and ingestion, followed by deep analysis, and terminates in the generation of new media assets. When an agent receives a request to "process" a file, it must first establish a project context within SwiftyClip.
The clip.ingest tool is non-blocking at the transport layer but kicks off a heavy background indexing process. The agent records the returned project ID and uses it as the primary key for all subsequent operations; this state management is handled entirely within the agent's context window.
Once ingestion is confirmed, the agent calls clip.scoreSegments. This triggers SwiftyClip's internal multimodal analysis (combining audio energy, visual continuity, and transcript sentiment) to identify potential highlights. The agent then performs an "argmax" over the resulting score array to select the most impactful moments.
"Use swiftyclip_mcp to process ~/Movies/pod.mp4. Score segments and render the top 3 to ~/Desktop/."
Initial Ingest Request
{
"jsonrpc": "2.0",
"method": "tools/call",
"params": {
"name": "clip.ingest",
"arguments": {
"path": "~/Movies/pod.mp4"
}
},
"id": "req_1"
}
Server Response
{
"content": [
{
"type": "text",
"text": "{\n \"projectId\": \"proj_82f1x92\",\n \"status\": \"indexed\"\n}"
}
]
}
What happens next:
The agent parses the JSON response and stores proj_82f1x92. It then executes clip.scoreSegments(projectId: "proj_82f1x92"). Upon receiving a list of segments with confidence scores ranging from 0.0 to 1.0, the agent sorts them descending and picks the first three. It then sends three parallel clip.render calls, specifying unique filenames for each export (e.g., clip_1.mp4, clip_2.mp4).
The agent monitors the render queue and provides a final summary to the user only after the file system confirms the existence of all three new files.
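The score-sort-render loop described above can be sketched as agent-side glue code. This is a minimal illustration, not SwiftyClip's actual client API: the call_tool helper and the shape of the scoreSegments payload are assumptions based on the responses shown in this section.

```python
# Hypothetical sketch of the agent-side selection logic. Assumes a
# call_tool() helper that wraps MCP "tools/call" requests and that
# clip.scoreSegments returns a JSON list of {"id", "score"} objects.
import json

def render_top_segments(call_tool, project_id, n=3, out_dir="~/Desktop"):
    """Score all segments, pick the n highest, and render each one."""
    raw = call_tool("clip.scoreSegments", {"projectId": project_id})
    segments = json.loads(raw)  # e.g. [{"id": "seg_1", "score": 0.91}, ...]

    # "Argmax" step: sort descending by confidence and keep the top n.
    top = sorted(segments, key=lambda s: s["score"], reverse=True)[:n]

    for i, seg in enumerate(top, start=1):
        call_tool("clip.render", {
            "projectId": project_id,
            "segmentId": seg["id"],
            "outputPath": f"{out_dir}/clip_{i}.mp4",
        })
    return [seg["id"] for seg in top]
```

Keeping the selection logic in the agent (rather than asking the server for "the best 3") is what makes the pipeline auditable: every render call names an explicit segment ID.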
2. Semantic Reranking by Custom Query
Standard algorithmic scoring is often insufficient for niche content. An agent's true power lies in its ability to understand the *content* of the video through the transcript and visual metadata. In this example, the agent retrieves the raw segments and uses its internal LLM reasoning to determine which segments are topically relevant to a specific user query.
This "Semantic Reranking" step happens entirely within the agent's logic. It essentially acts as a filter between the raw SwiftyClip data and the final output instructions. The agent reads the transcripts provided by clip.listSegments and matches them against the concept of "building an AI startup."
"Find segments in project X about building an AI startup. Rerank by that semantic topic, render top 2."
Segment Listing Request
{
"jsonrpc": "2.0",
"method": "tools/call",
"params": {
"name": "clip.listSegments",
"arguments": {
"projectId": "proj_X",
"includeTranscripts": true
}
},
"id": "req_2"
}
Response with Metadata
{
"content": [
{
"type": "text",
"text": "[{\"id\": \"seg_1\", \"transcript\": \"We needed to raise a seed round...\"}, {\"id\": \"seg_2\", \"transcript\": \"The GPU cost was our biggest hurdle...\"}]"
}
]
}
What happens next:
The agent analyzes the transcript strings. Segment 1 is identified as "Fundraising" and Segment 2 as "Infrastructure," both of which are core to the "AI startup" theme. The agent then maps these segment IDs back to clip.render instructions.
This illustrates how SwiftyClip serves as the "arms and legs" of the agent, executing the heavy lifting of video extraction while the LLM provides the "brain" for topical selection.
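In practice the reranking happens inside the LLM's reasoning, not in deterministic code. Still, the control flow can be made visible with a crude stand-in: the sketch below scores segments by word overlap with the query, a deliberately naive proxy for the semantic judgment the agent actually performs. All function and field names here are illustrative assumptions.

```python
# Illustrative stand-in for the agent's internal reranking step.
# A real agent reasons over the transcripts directly; simple word
# overlap approximates "topical relevance" just to show the shape
# of the filter between clip.listSegments output and render calls.
def rerank_by_query(segments, query, top_k=2):
    """Order segments by naive word overlap with the query."""
    query_words = set(query.lower().split())

    def relevance(seg):
        transcript_words = set(seg["transcript"].lower().split())
        return len(transcript_words & query_words)

    return sorted(segments, key=relevance, reverse=True)[:top_k]
```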
3. Scheduling a Multi-Platform Campaign
Scheduling requires the agent to understand time, platform constraints, and project continuity. For a "TikTok style" request, the agent must ensure vertical aspect ratios (9:16) and high-impact captioning are applied during the render phase.
The agent calculates a distribution of timestamps (e.g., today at 9am, tomorrow at 9am, etc.) and performs a series of tool calls that link the rendering artifact to a scheduling queue. This moves SwiftyClip from a utility tool to a comprehensive content management system.
"Schedule 7 shorts from project Y — one per day, 9am ET, captioned for TikTok."
Scheduling Call (Day 1)
{
"jsonrpc": "2.0",
"method": "tools/call",
"params": {
"name": "clip.schedule",
"arguments": {
"projectId": "proj_Y",
"segmentId": "seg_77",
"publishAt": "2026-04-23T09:00:00-04:00",
"platform": "tiktok",
"style": "dynamic_bold"
}
},
"id": "req_3"
}
Confirmation Response
{
"content": [
{
"type": "text",
"text": "{\n \"scheduleId\": \"sched_a1b2c3\",\n \"confirmed\": true,\n \"queuePosition\": 1\n}"
}
]
}
What happens next:
The agent repeats this call six more times, incrementing the date for each request. It also verifies that the styling preset dynamic_bold is supported by the server before finalizing the schedule. The agent effectively operates as a high-level scheduler, managing the temporal logistics of a marketing campaign.
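Computing the seven publishAt values is a small but easy-to-fumble step (the ET offset depends on daylight saving time). A standard-library sketch, assuming the start date from the example call above:

```python
# Sketch of how the agent might compute seven daily 9am ET publish
# timestamps. Uses only the standard library; zoneinfo handles the
# EDT/EST offset automatically, so the agent never hardcodes -04:00.
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

def daily_publish_times(start, days=7, hour=9, tz="America/New_York"):
    """Return ISO-8601 timestamps, one per day at the given local hour."""
    eastern = ZoneInfo(tz)
    first = datetime(start.year, start.month, start.day, hour, tzinfo=eastern)
    return [(first + timedelta(days=i)).isoformat() for i in range(days)]
```

Each returned string can be dropped directly into the publishAt field of a clip.schedule call.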
4. Extraction of Transcripts for Archival
Not all workflows result in a video file. Sometimes, the goal is purely informational. By using the clip.transcribe tool, an agent can extract structured text from an audio or video file to populate a database, generate a blog post, or create a README file.
This separates the "understanding" part of the engine from the "rendering" part. The agent can process a 60-minute talk and receive a JSON object containing every word spoken, which it can then compress into a summary for the user.
"Import ~/Audio/talk.mp3 and return the full transcript as plain text."
Transcription Request
{
"jsonrpc": "2.0",
"method": "tools/call",
"params": {
"name": "clip.transcribe",
"arguments": {
"projectId": "proj_99",
"format": "vtt",
"diarization": true
}
},
"id": "req_4"
}
Raw Text Response
{
"content": [
{
"type": "text",
"text": "WEBVTT\n\n00:00:00.000 --> 00:00:05.000\nSpeaker 1: Welcome to the symposium..."
}
]
}
What happens next:
The agent takes the VTT (Web Video Text Tracks) format and performs a local transformation. It might strip the timestamps to create a clean blog post draft or use the timestamps to create a "chapters" section for a YouTube description. The flexibility of the MCP interface allows the agent to treat video metadata as a first-class citizen in its reasoning chain.
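The "strip the timestamps" transformation is simple enough to sketch locally. This assumes the cue format shown in the response above (HH:MM:SS.mmm --> HH:MM:SS.mmm lines separating text blocks); it is an illustration, not a full WebVTT parser.

```python
# Sketch of the agent's local post-processing: drop the WEBVTT header
# and timing cue lines, keeping only the spoken text (with speaker
# labels preserved). Not a full WebVTT parser -- no NOTE/STYLE blocks.
import re

def vtt_to_plain_text(vtt):
    """Return transcript text with WEBVTT header and timings removed."""
    cue_timing = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> ")
    kept = []
    for line in vtt.splitlines():
        if line.strip() in ("WEBVTT", ""):
            continue
        if cue_timing.match(line):
            continue
        kept.append(line)
    return "\n".join(kept)
```

The inverse choice (keeping the timestamps) supports the "chapters" use case: group cues by topic and emit the start time of each group.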
5. Intelligent Deduplication and Selection
Algorithmic highlight detection often produces redundant results. For example, a speaker might repeat a key point, or two overlapping segments might both score highly. A human editor would see this as redundancy; an agent can do the same by comparing the semantic embeddings of the segment transcripts.
In this example, the agent is tasked with a "dedupe pass." It doesn't just take the top 5; it takes the top 5 *unique* ideas. This requires a two-step process: first, gathering the metadata, and second, applying a selection heuristic that prioritizes diversity over raw score.
"Some of the scored segments overlap. Keep only the best 5 with no topical overlap."
Full Metadata Request
{
"jsonrpc": "2.0",
"method": "tools/call",
"params": {
"name": "clip.listSegments",
"arguments": {
"projectId": "proj_Z",
"includeScores": true,
"includeTranscripts": true
}
},
"id": "req_5"
}
Overlap Response
{
"content": [
{
"type": "text",
"text": "[{\"id\": \"seg_A\", \"start\": 10, \"end\": 20, \"score\": 0.9}, {\"id\": \"seg_B\", \"start\": 12, \"end\": 22, \"score\": 0.85}]"
}
]
}
What happens next:
The agent detects that seg_A and seg_B have an 80% overlap in time. Since seg_A has a higher score, the agent marks seg_B for exclusion. It repeats this analysis for the entire project, ensuring the final render queue contains only the most distinct and valuable clips. This prevents the user from receiving three variations of the same joke or point in their export folder.
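The temporal half of this dedupe pass is mechanical and can be sketched directly (the semantic half, comparing transcript embeddings, stays inside the LLM). The greedy heuristic below walks segments in score order and keeps one only if it overlaps every already-kept segment by less than a threshold; the field names mirror the response above, and the 0.5 threshold is an assumption.

```python
# Sketch of greedy temporal deduplication over scored segments.
# overlap_ratio measures intersection relative to the shorter segment,
# so a short clip fully contained in a long one counts as 100% overlap.
def dedupe_segments(segments, max_overlap=0.5, keep=5):
    """Keep the highest-scoring segments with limited pairwise overlap."""
    def overlap_ratio(a, b):
        inter = min(a["end"], b["end"]) - max(a["start"], b["start"])
        shorter = min(a["end"] - a["start"], b["end"] - b["start"])
        return max(inter, 0) / shorter

    kept = []
    for seg in sorted(segments, key=lambda s: s["score"], reverse=True):
        if all(overlap_ratio(seg, k) < max_overlap for k in kept):
            kept.append(seg)
        if len(kept) == keep:
            break
    return kept
```

On the example response, seg_A (10-20, score 0.9) is kept first; seg_B (12-22, score 0.85) overlaps it by 8 seconds out of 10, an 80% ratio, so it is excluded.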