The Developer's Guide to YouTube Caption XML/JSON Formats

A technical breakdown of YouTube's raw XML timed text format vs. modern JSON subtitle representations for developers.

June 20, 2026
5 min read
Kenji Sato

YouTube's raw XML timed text format and JSON subtitle files differ mainly in how they encode each caption and its timing. The XML format wraps every caption in a <text> element whose start and dur attributes are decimal seconds, whereas the JSON representation is an array of objects whose start and duration keys are integer milliseconds.


Technical Side-by-Side Formatting Comparison

| Data Structure | XML Format | JSON Format |
|---|---|---|
| Root Element | <transcript> | [ { ... }, { ... } ] |
| Subtitle Item | <text start="1.2" dur="2.4"> | { "start": 1200, "duration": 2400 } |
| Time Unit | Decimal seconds | Integer milliseconds |
| Language Metadata | Handled in query parameters | Handled in response header properties |
| Parsing Difficulty | High (requires an XML parser) | Extremely low (JSON.parse()) |


Code Serialization Samples

YouTube Raw XML Format

<transcript>
    <text start="10.5" dur="3.2">Welcome to the tutorial.</text>
</transcript>

Clean REST JSON Output

[
  {
    "start": 10500,
    "duration": 3200,
    "text": "Welcome to the tutorial."
  }
]
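
The mapping between the two samples above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `timedtext_xml_to_cues` is hypothetical, and it assumes every caption is a `<text>` element with `start` and `dur` given in decimal seconds, as in the XML sample.

```python
import json
import xml.etree.ElementTree as ET
from typing import Dict, List, Union

Cue = Dict[str, Union[int, str]]


def timedtext_xml_to_cues(xml_text: str) -> List[Cue]:
    """Convert <transcript><text start=".." dur=".."> XML into millisecond cues.

    Hypothetical helper: assumes start/dur are decimal seconds.
    """
    root = ET.fromstring(xml_text)
    cues: List[Cue] = []
    for node in root.iter("text"):
        # round() guards against float artifacts such as 3200.0000000000005
        start_ms = round(float(node.get("start", "0")) * 1000)
        duration_ms = round(float(node.get("dur", "0")) * 1000)
        cues.append(
            {"start": start_ms, "duration": duration_ms, "text": node.text or ""}
        )
    return cues


sample = """<transcript>
    <text start="10.5" dur="3.2">Welcome to the tutorial.</text>
</transcript>"""

print(json.dumps(timedtext_xml_to_cues(sample), indent=2))
```

Storing times as integer milliseconds means consumers of the JSON can compare and sort cues without any floating-point handling.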

Parsing Recommendations

"Working with raw XML attributes on cloud edge functions is resource-heavy and introduces formatting errors. We recommend converting subtitles to standardized JSON arrays at the API boundary to optimize server runtimes." — Thomas Wright, Senior API Engineer
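
One way to apply this recommendation is to normalize the XML once, at the point where it enters your service, and pass only JSON downstream. The sketch below is an assumption-laden example, not a description of any particular API: `normalize_captions` is a hypothetical name, returning an empty array for invalid XML is an arbitrary policy choice, and the `html.unescape` step assumes caption text may arrive with HTML entities escaped a second time (for example `&amp;#39;`), which the article does not state.

```python
import html
import json
import xml.etree.ElementTree as ET
from typing import Dict, List, Union

Cue = Dict[str, Union[int, str]]


def normalize_captions(xml_text: str) -> str:
    """Return a JSON array string; malformed cues are skipped, not raised."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return "[]"  # hypothetical policy: invalid XML yields an empty list

    cues: List[Cue] = []
    for node in root.iter("text"):
        try:
            start_ms = round(float(node.attrib["start"]) * 1000)
            duration_ms = round(float(node.attrib["dur"]) * 1000)
        except (KeyError, ValueError):
            continue  # missing or non-numeric timing attribute
        # Assumption: text may carry doubly escaped entities such as &amp;#39;
        text = html.unescape("".join(node.itertext())).strip()
        cues.append({"start": start_ms, "duration": duration_ms, "text": text})
    return json.dumps(cues, ensure_ascii=False)


raw = (
    '<transcript>'
    '<text start="1.2" dur="2.4">It&amp;#39;s live</text>'
    '<text dur="1">no start attribute</text>'
    '</transcript>'
)
print(normalize_captions(raw))
```

Skipping a bad cue instead of failing the whole request is a design choice; a stricter service might prefer to reject the document and report the error.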
