If we started SwiftyClip over today, here's what we'd change
An honest engineering retrospective on the 10-hour build sprint that produced SwiftyClip v1.0 — what we'd keep, what we'd ditch, and what Apple Intelligence changes for 2026.
What we built in 10 hours
The origin of SwiftyClip was a frantic, hyper-focused 10-hour build sprint. In that window, we moved from a blank Xcode project to a functional v1.0 that successfully automated the extraction of viral-ready clips from long-form video. The technical foundation was aggressive: we opted for Swift 6 with strict concurrency enabled from the first line of code. This was not a choice made out of academic purity, but out of a desperate need for runtime stability in a multi-threaded media processing environment. We knew that debugging race conditions in AVFoundation during a sprint would be fatal to our timeline.
By the time the timer hit zero, we had shipped a core engine supported by 67 unit tests, ensuring that our transcription and scene detection logic was resilient. We didn't just build an app; we built a protocol-first architecture. Version 1.0.3 introduced 9 Model Context Protocol (MCP) tools and 8 agentic features that allowed LLMs to interact directly with the video timeline. This rapid evolution from a simple utility to a developer-extensible platform was only possible because we prioritized the underlying data structures over the UI. We focused on the MCP implementation as the primary interface, treating the macOS status bar menu as just one of many potential consumers.
Looking back at the commit history from those first ten hours, it is clear that we traded visual polish for architectural integrity. We spent more time on our transcription relay and segment scoring heuristics than on the onboarding flow. This "engine-first" approach meant that while the initial UX was spartan, the reliability of the clipping process was high. We weren't just guessing where the "hooks" in a video were; we were calculating audio energy peaks and linguistic sentiment shifts with a precision that usually takes months to tune. This 10-hour sprint laid the groundwork for a system that could eventually handle gigabytes of video without breaking a sweat.
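To make the segment-scoring idea concrete, here is a minimal sketch of the kind of heuristic described above. All names, weights, and fields are illustrative, not the shipped implementation:

```swift
import Foundation

// Hypothetical sketch of an early segment-scoring heuristic: blend a
// normalized audio-energy peak with a sentiment-shift magnitude to
// rank candidate "hook" segments. Weights are placeholders, not tuned.
struct Segment {
    let start: TimeInterval
    let duration: TimeInterval
    let audioEnergy: Double     // normalized RMS energy, 0...1
    let sentimentShift: Double  // magnitude of sentiment change, 0...1
}

func hookScore(_ segment: Segment,
               energyWeight: Double = 0.6,
               sentimentWeight: Double = 0.4) -> Double {
    energyWeight * segment.audioEnergy + sentimentWeight * segment.sentimentShift
}

func topCandidates(_ segments: [Segment], count: Int) -> [Segment] {
    segments.sorted { hookScore($0) > hookScore($1) }.prefix(count).map { $0 }
}
```

The point is not the specific weights — it is that a scorer this simple is already testable, which is how 67 unit tests fit into a 10-hour sprint.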
What we'd keep: The Apple-first stack
The most successful decision we made was doubling down on an on-device, Apple-first stack. In an era where many startups are just wrappers around expensive, high-latency cloud APIs, we stayed local. Using AVFoundation and CoreML natively on macOS gave us a performance floor that no web-based competitor could touch. The "drop a video" UX remains our North Star. It is minimalist, frictionless, and leverages the system-level drag-and-drop capabilities that users expect from a first-class Mac application. By processing everything on the user's silicon, we eliminated the privacy concerns and latency issues that plague cloud-based video tools.
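For illustration, the "drop a video" entry point can be sketched with SwiftUI's native drop support. The view name and accepted extensions here are hypothetical:

```swift
import SwiftUI

// Minimal sketch of a "drop a video" entry point using SwiftUI's
// built-in drop handling; names and the extension list are illustrative.
struct DropZoneView: View {
    @State private var queuedVideos: [URL] = []

    var body: some View {
        Text(queuedVideos.isEmpty ? "Drop a video" : "\(queuedVideos.count) queued")
            .frame(width: 320, height: 180)
            .dropDestination(for: URL.self) { urls, _ in
                // Keep only video files; all processing stays on-device.
                let videos = urls.filter {
                    ["mov", "mp4", "m4v"].contains($0.pathExtension.lowercased())
                }
                queuedVideos.append(contentsOf: videos)
                return !videos.isEmpty
            }
    }
}
```

Because the drop target just enqueues URLs, the same queue can be fed by the CLI or an MCP tool — the GUI is one consumer among several.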
Being MCP-native from day one is another architectural pillar we would never abandon. By exposing our internal pipeline stages as tools, we didn't just build a tool for users; we built a tool for other agents. This foresight allowed us to integrate into the burgeoning ecosystem of AI-driven workflows effortlessly. When we talk about Agentic Workflows, we are talking about a system where SwiftyClip isn't just a destination, but a reliable capability in a larger chain. This "headless-first" mentality allowed us to support CLI usage and external automation before we even had a proper settings panel.
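As a simplified, framework-agnostic sketch — not the shipped MCP server — exposing a pipeline stage as a tool comes down to a name, an input schema, and a handler. All identifiers below are illustrative:

```swift
import Foundation

// Illustrative sketch of exposing a pipeline stage as an MCP-style
// tool: a name, a JSON-schema-shaped input description, and a handler.
struct ToolDescriptor {
    let name: String
    let description: String
    let inputSchema: [String: Any]
    let handler: ([String: Any]) throws -> String
}

// Hypothetical "extract_clips" tool wrapping the clipping stage.
let extractClipsTool = ToolDescriptor(
    name: "extract_clips",
    description: "Extract viral-ready clips from a source video.",
    inputSchema: [
        "type": "object",
        "properties": [
            "path": ["type": "string"],
            "maxClips": ["type": "integer"]
        ],
        "required": ["path"]
    ],
    handler: { args in
        // In the real pipeline this would invoke the clipping engine;
        // here it just acknowledges the request.
        let path = args["path"] as? String ?? "?"
        return "queued clip extraction for \(path)"
    }
)
```

Once every stage is described this way, the status bar menu, the CLI, and an LLM agent are all just callers of the same registry.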
Furthermore, starting with Swift 6 strict concurrency was painful in the first hour but a godsend by the tenth. The compiler-enforced thread safety meant that as we added more complex features like live capture and background rendering, we never hit the wall of "mysterious crashes" that often plague media apps. It forced us to think about data ownership and isolation early, preventing the "singleton soup" architecture that often emerges during rapid prototyping. The security model afforded by this isolation—keeping user media on-device—is a core value proposition that we would continue to defend. Native primitives for gradients and menus ensured that the app remained lightweight and responsive, avoiding the resource bloat of cross-platform frameworks.
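The "data ownership and isolation" point can be shown with a minimal sketch: mutable pipeline state lives in an actor rather than a shared singleton, so strict concurrency makes cross-thread access explicit. The type is hypothetical:

```swift
import Foundation

// Minimal sketch of the isolation pattern strict concurrency pushes
// you toward: mutable state lives inside an actor, so the compiler
// rejects unsynchronized access from capture or render tasks.
actor RenderQueue {
    private var pending: [URL] = []

    func enqueue(_ clip: URL) {
        pending.append(clip)
    }

    func next() -> URL? {
        pending.isEmpty ? nil : pending.removeFirst()
    }

    var count: Int { pending.count }
}
```

Every caller has to `await` queue access, which is exactly the friction that prevents "singleton soup" from forming during a sprint.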
What we'd ditch or rebuild
Not every early decision was a winner. The most glaring technical debt we accrued was our ad-hoc TCP loopback bridge used for communicating between the CLI and the main app process. It was a quick fix to get the MCP server talking to the rendering engine, but it introduced unnecessary overhead and potential port collisions. If we were starting today, we would use XPC + SMAppService from the very first commit. XPC provides a much more robust, system-native way to handle inter-process communication with built-in sandbox support. Using SMAppService would have simplified our background agent management significantly, moving us away from manual launch agent manipulation.
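A hedged sketch of what "XPC + SMAppService from the first commit" would look like. The plist name, Mach service name, and XPC protocol below are hypothetical:

```swift
import Foundation
import ServiceManagement

// Hypothetical XPC protocol between the app and its background agent.
@objc protocol ClipRenderingXPC {
    func render(clipAt path: String, reply: @escaping (Bool) -> Void)
}

func registerBackgroundAgent() throws {
    // SMAppService replaces manual LaunchAgent plist installation;
    // the named plist is bundled in Contents/Library/LaunchAgents.
    let agent = SMAppService.agent(plistName: "com.swiftyclip.agent.plist")
    if agent.status != .enabled {
        try agent.register()
    }
}

func connectToAgent() -> NSXPCConnection {
    // The Mach service name must match the one declared in the plist.
    let connection = NSXPCConnection(machServiceName: "com.swiftyclip.agent")
    connection.remoteObjectInterface = NSXPCInterface(with: ClipRenderingXPC.self)
    connection.resume()
    return connection
}
```

Compared with a TCP loopback bridge, this gives sandbox-aware transport with no ports to collide and no sockets to leak.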
Similarly, our initial reliance on stdio for MCP transport was a mistake of convenience. While stdio is the easiest way to get started with MCP, it is fragile in a production environment where logs can pollute the stream. We would build an HTTP+SSE (Server-Sent Events) transport layer from day one. This would have provided a more stable, debuggable interface for third-party integrations and allowed for a cleaner separation of concerns between the transport and the protocol logic. We spent far too many hours debugging why a random print statement in a sub-module was crashing the protocol handler.
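The framing layer of such a transport is small. This sketch shows only the SSE wire format a server would emit for each MCP message — server wiring is omitted, and the function name is illustrative:

```swift
import Foundation

// Sketch of the SSE framing an HTTP transport would emit for MCP
// messages, instead of writing JSON-RPC to stdout where a stray
// print() can corrupt the stream.
func sseFrame(event: String, data: String) -> String {
    // Per the SSE format, each data line is prefixed with "data: "
    // and an event is terminated by a blank line.
    let dataLines = data
        .split(separator: "\n", omittingEmptySubsequences: false)
        .map { "data: \($0)" }
        .joined(separator: "\n")
    return "event: \(event)\n\(dataLines)\n\n"
}
```

Because log output goes to a separate channel rather than the protocol stream, a chatty sub-module can no longer crash the handler.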
On the algorithmic side, our early scorer heuristics were too simplistic. We relied on basic keyword density and audio volume for too long. We jumped to MLX-based embeddings for semantic hook detection much later than we should have. If we restarted, MLX would be the primary analysis engine. The ability to run massive embedding models locally on Apple Silicon is a superpower, and we should have leaned into it from the start to provide more "intelligent" clip suggestions. Our early versions were "smart," but they weren't "aware" in the way our current MLX-driven pipeline is. We should have embraced vector-based analysis for hook detection on day one, allowing us to find emotional "vibes" rather than just loud noises.
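The core of vector-based hook detection is just similarity in embedding space. This sketch assumes embeddings already exist (e.g. produced by an MLX model); the threshold and function names are placeholders:

```swift
import Foundation

// Sketch of vector-based hook detection: compare a transcript
// segment's embedding against embeddings of known "hook" phrases.
func cosineSimilarity(_ a: [Double], _ b: [Double]) -> Double {
    precondition(a.count == b.count, "embedding dimensions must match")
    let dot = zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
    let magA = sqrt(a.reduce(0) { $0 + $1 * $1 })
    let magB = sqrt(b.reduce(0) { $0 + $1 * $1 })
    guard magA > 0, magB > 0 else { return 0 }
    return dot / (magA * magB)
}

// A segment "looks like a hook" if it sits close to any reference
// hook embedding; the 0.8 threshold is a placeholder, not tuned.
func isLikelyHook(_ segment: [Double], references: [[Double]],
                  threshold: Double = 0.8) -> Bool {
    references.contains { cosineSimilarity(segment, $0) >= threshold }
}
```

This is what "vibes rather than loud noises" means in practice: proximity to exemplars in embedding space instead of a raw volume threshold.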
Apple Intelligence changes everything
The landscape of 2026 is fundamentally different from when we first sketched the SwiftyClip architecture. The arrival of FoundationModels as a first-class, free on-device API via Apple Intelligence has shifted the "build vs. buy" calculus. In v1.0, we spent significant effort building custom summarization and title generation logic. Today, we would lean entirely on the SystemLanguageModel for these tasks. It is optimized for the hardware, consumes less power, and is already integrated with the system's privacy-preserving compute infrastructure. By offloading these tasks to the OS, we could have focused even more on the unique aspects of video reframing and scene detection.
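As a hedged sketch of what that delegation looks like — assuming the FoundationModels API shapes (`SystemLanguageModel`, `LanguageModelSession`) on macOS 26+, with prompt wording and the fallback purely illustrative:

```swift
import FoundationModels

// Sketch of delegating clip-title generation to the on-device system
// model. The prompt and fallback heuristic are illustrative.
func suggestClipTitle(forTranscript transcript: String) async throws -> String {
    let model = SystemLanguageModel.default
    guard case .available = model.availability else {
        // Fall back to a naive heuristic when Apple Intelligence is
        // unavailable on this machine.
        return String(transcript.prefix(60))
    }
    let session = LanguageModelSession(
        instructions: "You write short, punchy titles for social video clips."
    )
    let response = try await session.respond(
        to: "Write a title for a clip with this transcript: \(transcript)"
    )
    return response.content
}
```

Checking availability first matters: the system model can be absent or disabled, so the custom path stays as a fallback rather than the default.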
Caption rewriting, clip-title generation, and metadata extraction should have been delegated to these system-level models from launch. Instead of bundling our own weights, we could have shipped a smaller binary by utilizing the pre-installed models on macOS 26+. This isn't just about efficiency; it's about the quality of the output. The system models are tuned for the specific nuances of the user's language and context, providing a level of personalization we can't easily replicate with a generic local model. The integration with Private Cloud Compute also means we could have offered even more powerful analysis without sacrificing our local-first privacy promise.
We would also pivot from Whisper-based transcription to the native SpeechAnalyzer. The native Speech framework on macOS 26+ has reached a point of parity in accuracy while offering superior performance and lower memory overhead. By moving to the native API, we would gain better support for multi-speaker identification and more granular time-stamping without the need for custom C++ bindings. The goal has always been to be the best "Mac-native" tool, and that means embracing the platform's evolving intelligence capabilities. Finally, the new ImagePlayground and Vision frameworks would have changed how we handle B-roll, using system intelligence to suggest contextually relevant imagery for every scene.
Agent-first product design
The single most important architectural decision we made—and one we would lean into even harder—was exposing every single pipeline stage as an MCP tool. This agent-first design philosophy is what separates SwiftyClip from a traditional video editor. In a restart, we would extend this even further. We wouldn't just have tools for "extract clips" or "generate captions"; we would expose our billing and entitlements as tools. Imagine an agent that can check your remaining credits, suggest a tier upgrade based on your usage, and then execute the upgrade itself. This level of transparency and programmatic access is what modern power users demand.
We would also implement preset management as a series of tools. Instead of a complex UI for configuring clip styles, users should be able to query, create, and modify presets via a standardized protocol. This would allow for "AI-generated styles" where an LLM analyzes the video content and then uses an MCP tool to create a custom caption preset that matches the aesthetic of the footage. Everything that can be done via the UI should be doable via the protocol. This removes the "GUI bottleneck" and allows SwiftyClip to be integrated into massive, automated media factories.
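To ground this, here is a sketch of the payload side of preset-as-protocol: a Codable preset an LLM could create via a hypothetical "create_preset" tool rather than through a settings UI. All field names are illustrative:

```swift
import Foundation

// Sketch of a caption preset an agent could create over the protocol
// instead of through the settings UI. Field names are illustrative.
struct CaptionPreset: Codable, Equatable {
    var name: String
    var fontName: String
    var fontSize: Double
    var highlightColorHex: String
    var maxWordsPerLine: Int
}

// Decoding the JSON payload an agent would send as tool arguments.
func decodePreset(fromToolArguments json: Data) throws -> CaptionPreset {
    try JSONDecoder().decode(CaptionPreset.self, from: json)
}
```

Because the preset is plain Codable data, the same type round-trips through the UI, the CLI, and the tool surface without translation layers.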
Scheduled clips and automated batch processing should also have been tools from the start. We initially built these as internal background tasks, but making them part of the MCP surface would have allowed users to build complex, multi-app automation chains. By making the app "headless-first," the GUI becomes a secondary, optional convenience rather than a required bottleneck. In a restart, we would build a "context provider" for MCP that feeds information about the user's project history directly to the LLM, allowing agents to be true collaborators rather than just automated scripts.
The pricing math
Pricing is often the hardest variable to get right, but our $149 lifetime license was the right call for the prosumer market. It signaled quality and longevity in a market saturated with monthly subscriptions. We would keep the lifetime option, as it builds a loyal user base that feels like they are investors in the product's future. However, if we were starting over, we would tighten the Free tier caps significantly. The goal is to provide enough value to prove the concept, but not enough to solve the user's entire problem for free.
The initial Free tier was too generous, allowing users to process too much video before hitting a paywall. This delayed our feedback loop on whether the product was truly "must-have." By narrowing the Free tier to a single "Starter" experience, we would make the upgrade path more obvious and urgent. We want users to feel the value immediately, but also understand that high-performance local processing is a premium capability that warrants a fair price.
We would also introduce a "Compute-Only" tier earlier. With the rise of Apple Intelligence, some users just want the orchestration and the UX, while they provide their own API keys for certain cloud-based refinements. This tier would have been a great way to capture the "early adopter" developer market who wants maximum control over their stack. Finally, the $149 price point positions SwiftyClip alongside professional tools like Final Cut Pro. In a restart, we might even experiment with an even higher "Professional" tier that includes custom MCP tool development support or dedicated integration consulting for enterprises.
What we'd do differently about marketing
The "build it and they will come" philosophy is a dangerous myth. We focused 99% of our energy on the Swift code and 1% on telling people why it mattered. If we did it again, we would build a waitlist and an automated email sequence before writing the first line of the ClippingEngine. A waitlist isn't just for hype; it's for validation. It tells you if the problem you are solving is actually one people are willing to pay for. We also should have leaned into "build in public" more aggressively.
Every challenge we faced with AVFoundation and every breakthrough with MCP should have been a short-form video or a thread. This wouldn't just have been marketing; it would have been documentation. It would have positioned us as the experts in on-device AI video processing before the app even hit v1.0. People don't just buy software; they buy into the journey of the people building it. Marketing for a technical tool like SwiftyClip isn't about slogans; it's about proof.
We should have had a gallery of "Generated by SwiftyClip" examples from dozens of diverse creators on day one. Next time, the marketing pipeline will be just as "strictly concurrent" as the engineering pipeline. We would prioritize case studies that show real-world ROI—how much time a creator saved or how much their engagement grew. Lastly, we would have leveraged the developer community much earlier. By releasing parts of our MCP implementation as open-source, we could have had a small army of advocates helping us find bugs and suggesting features before the general launch.