Chrome shipped an MCP server as part of DevTools last September that gives any AI agent structured access to web app content.

I wanted the same thing for Windows apps — so I built it.

What lvt Does

lvt (Live Visual Tree) is an open-source CLI tool that gives AI agents structured visibility into any running Windows application.

lvt inspecting File Explorer

Point it at any app and it returns a unified element tree — every control, its type, text, bounds, and a stable ID — as JSON or XML that an agent can reason about directly.

# Get Notepad's visual tree as JSON
lvt --name notepad

# Capture annotated screenshot with element IDs
lvt --name notepad --screenshot out.png

# Scope to a subtree
lvt --name myapp --element e5 --depth 3
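To work with that JSON programmatically, an agent can flatten the tree into an ID index, so a reference like "e14" resolves to an element in constant time. The field names below (id, type, text, bounds, children) are my assumption for illustration; check the repo for the actual schema.

```python
# Sketch: index an lvt-style element tree by stable ID.
# Field names (id, type, text, bounds, children) are illustrative
# assumptions, not lvt's documented schema.

def index_by_id(node, index=None):
    """Flatten a nested element tree into {id: node}."""
    if index is None:
        index = {}
    index[node["id"]] = node
    for child in node.get("children", []):
        index_by_id(child, index)
    return index

# A hypothetical tree shaped like the fields described above.
tree = {
    "id": "e0", "type": "Window", "text": "Untitled - Notepad",
    "bounds": [0, 0, 800, 600],
    "children": [
        {"id": "e1", "type": "Edit", "text": "",
         "bounds": [0, 30, 800, 570], "children": []},
    ],
}

elements = index_by_id(tree)
print(elements["e1"]["type"])  # Edit
```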

Why Not UIA?

The standard approach to Windows automation is UI Automation (UIA). It’s the accessibility layer that everything else builds on — screen readers, test frameworks, automation tools.

UIA was designed for accessibility, not AI agents. That design brings performance and fidelity tradeoffs that matter for agents:

  • It’s an accessibility projection, not the real tree
  • Properties get flattened or lost in translation
  • Hierarchy doesn’t match what developers wrote
  • Many apps have limited UIA implementations

Every agent-driven Windows automation approach I’ve seen is one of:

  1. UIA-based — inherits the accessibility-layer tradeoffs above
  2. Screenshot + vision — expensive, fragile, and unable to reason about structure
  3. Both combined — stacking the two doesn’t eliminate either one’s limitations

The Third Path: Direct Framework Introspection

lvt talks directly to each framework’s native tree:

  • Win32/ComCtl — direct window enumeration with control type enrichment
  • WinUI 3 — XAML Diagnostics API
  • System XAML (UWP) — XAML Diagnostics API

No abstraction layer. No accessibility tax. The actual visual tree, as the framework sees it.

This means:

  • Real property values, not accessibility projections
  • Correct hierarchy that matches the code
  • Faster enumeration
  • Element names and types that developers recognize

What Agents Can Do With This

Precise control references: Elements get stable IDs (e0, e1, e2…) that agents can use directly. “What’s in e14?” “Click e7.” No more guessing from screenshots.

Structural reasoning: Agents can understand app layout as a tree, not just pixels. “Find the button inside the toolbar” becomes a tree query, not visual pattern matching.
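As a sketch of what such a tree query could look like against lvt’s JSON: the field names (type, children, id) are illustrative assumptions, not the tool’s documented schema.

```python
# Sketch: "find the button inside the toolbar" as a tree query.
# Field names (id, type, children) are illustrative assumptions.

def find_all(node, predicate):
    """Yield every node in the tree matching the predicate."""
    if predicate(node):
        yield node
    for child in node.get("children", []):
        yield from find_all(child, predicate)

def buttons_inside_toolbars(root):
    """Buttons that are descendants of a ToolBar node."""
    for toolbar in find_all(root, lambda n: n["type"] == "ToolBar"):
        yield from find_all(toolbar, lambda n: n["type"] == "Button")

tree = {
    "id": "e0", "type": "Window", "children": [
        {"id": "e1", "type": "ToolBar", "children": [
            {"id": "e2", "type": "Button", "children": []},
            {"id": "e3", "type": "ComboBox", "children": []},
        ]},
        {"id": "e4", "type": "Button", "children": []},  # not in a toolbar
    ],
}

print([b["id"] for b in buttons_inside_toolbars(tree)])  # ['e2']
```

The query walks structure, not pixels, so it keeps working when the app is themed, resized, or rendered at a different DPI.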

Annotated screenshots: lvt can capture screenshots with element IDs overlaid. Agents can correlate visual content with the structured tree for targeted follow-up.

Annotated screenshot showing element IDs

Mixed-framework apps: A WinUI 3 app hosted in Win32 chrome is fully decomposed from the top-level window down through every XAML element.

The Bigger Picture

This is a working implementation of a capability agents badly need: a structured, semantic representation of what’s on screen.

The end goal is a DOM tree for Windows — one unified tree spanning every UI framework. lvt takes a pragmatic path there: it works today, on real apps, by talking directly to each framework’s native tree.

It’s also foundational for device-context intelligence. When an agent needs to understand “what’s the user looking at right now” or “what does ‘this button’ refer to,” lvt is the component that answers those questions.

Status

Working today:

  • Win32
  • WinUI 3
  • System XAML (UWP)
  • ComCtl enrichment

On the roadmap:

  • WPF
  • WinForms
  • MAUI
  • WebView2 (Chrome DevTools Protocol bridge)

The tool is MIT-licensed. I’ve also included an agent skill so GitHub Copilot CLI and other agents can use it immediately.

Try It

Repo: github.com/asklar/lvt

Install the GitHub Copilot CLI skill:

/plugin install asklar/lvt

Want to discuss? Connect with me on LinkedIn.
