In this article, we’re going to walk through how to add closed captions to HLS video streams using AWS Transcribe and MediaConvert, all orchestrated with an event-driven architecture on AWS using EventBridge, SQS, and Lambda functions.

This is a common requirement for video streaming platforms, online learning systems, or any application where accessibility matters, and let’s be honest, it should always matter!

“The goal is simple: automatically generate closed captions for every uploaded video and embed them directly into the HLS stream, so any player shows a native CC toggle without any extra work on the frontend.”

The Problem We’re Solving 🎯

If you’ve already got a video processing pipeline that transcodes uploads into HLS for adaptive bitrate streaming (and if you haven’t, check out our previous article on securing video with CloudFront signed URLs), the next question is: how do we add captions?

Without captions, you’re leaving out:

Users with hearing difficulties.
Viewers watching in noisy environments (trains, coffee shops).
Anyone struggling with strong accents or unfamiliar terminology.
Users in situations where they can’t play audio (late at night, in meetings, etc.).

We need a way to:

Automatically transcribe the audio from uploaded videos.
Generate WebVTT caption files from the transcription (these are later passed into MediaConvert when the HLS output is generated).
Embed those captions into the HLS manifest so players show a native CC button.
Handle failures gracefully — a video without captions is better than no video at all.
Do all of this asynchronously without blocking the admin or content creator.

Let’s dive into how we achieve this with an event-driven pipeline using AWS Transcribe and MediaConvert.

Our Example

Continuing from our previous article, we’re still working with our fictitious company, ‘Gilmore Movies and Series’ where users can sign up and purchase individual movies or series episodes.

We already have a pipeline that transcodes uploaded videos into HLS format using MediaConvert. Now we want to layer on automatic closed captions so that every episode gets a native CC toggle in the player, no extra frontend work needed.

💡 Note: All code examples are for discussion only and can be further productionised.

Architecture Overview 🏗️

The solution extends our existing video processing pipeline with two new phases before MediaConvert runs. Here’s the high-level flow:

┌───────────────┐      ┌─────────────┐      ┌──────────────────────┐
│  Content      │─────▶│  Amazon S3  │─────▶│  S3 Event            │
│  Creator      │      │  (Upload)   │      │  Notification        │
│  uploads .mp4 │      │             │      │  (suffix filter)     │
└───────────────┘      └─────────────┘      └────────┬─────────────┘
                                                     │
                                                     ▼
                                          ┌──────────────────────┐
                                          │  SQS Queue           │
                                          │  (transcribe-        │
                                          │   orchestrator)      │
                                          └──────────┬───────────┘
                                                     │
                              Phase 1                ▼
                     ┌───────────────────────────────────────────┐
                     │  Lambda: Transcribe Orchestrator          │
                     │  - Validates video path & extension       │
                     │  - Starts AWS Transcribe job              │
                     │  - Tags job with original video key       │
                     └───────────────────────┬───────────────────┘
                                             │
                                             ▼
                                  ┌──────────────────────┐
                                  │  AWS Transcribe      │
                                  │  (async job)         │
                                  │  Output: .vtt file   │
                                  └──────────┬───────────┘
                                             │
                              Phase 2        ▼
                     ┌───────────────────────────────────────────┐
                     │  EventBridge Rule                         │
                     │  source: aws.transcribe                   │
                     │  detail: COMPLETED | FAILED               │
                     │  filter: job name prefix "job-"           │
                     └───────────────────────┬───────────────────┘
                                             │
                                             ▼
                     ┌───────────────────────────────────────────┐
                     │  Lambda: Caption Processor                │
                     │  - Renames .vtt to deterministic path     │
                     │  - Cleans up temp files                   │
                     │  - Sends message to video-processing SQS  │
                     └───────────────────────┬───────────────────┘
                                             │
                              Phase 3        ▼
                     ┌───────────────────────────────────────────┐
                     │  SQS Queue (video-processing)             │
                     └───────────────────────┬───────────────────┘
                                             │
                                             ▼
                     ┌───────────────────────────────────────────┐
                     │  Lambda: Video Processor                  │
                     │  - Checks for captions.vtt                │
                     │  - Adds CaptionSelectors if found         │
                     │  - Creates MediaConvert job               │
                     └───────────────────────┬───────────────────┘
                                             │
                                             ▼
                                  ┌──────────────────────┐
                                  │  AWS MediaConvert    │
                                  │  HLS + embedded CC   │
                                  └──────────┬───────────┘
                                             │
                                             ▼
                                  ┌──────────────────────┐
                                  │  S3 — HLS Output     │
                                  │  video.m3u8          │
                                  │  video_captions.m3u8 │
                                  │  + segments          │
                                  └──────────────────────┘

The flow works like this:

Upload: Content creator/admin/producer uploads a video file (.mp4, .mov, etc.) to S3.
S3 Event: An S3 event notification (with suffix filters for video extensions only) sends a message to the transcribe-orchestrator SQS queue.
Transcribe Orchestrator: A Lambda function validates the upload and starts an AWS Transcribe job asynchronously.
Transcribe: The async transcription job produces a WebVTT (.vtt) caption file (a text file with the closed captions and video timings).
EventBridge: When the Transcribe job completes (or fails), EventBridge routes the event to the caption-processor Lambda.
Caption Processor: Renames the .vtt file to a deterministic path and sends a message to the existing video-processing SQS queue.
Video Processor: The existing Lambda checks for captions, adds them to the MediaConvert job settings, and kicks off transcoding.
MediaConvert: Produces HLS output with an embedded WebVTT subtitle rendition in the manifest.
Player: Any HLS-compatible player (hls.js, Safari native) discovers the subtitle track and shows a CC toggle automatically in the video player.

💡 This is event-driven architecture in action! Each component does one thing, communicates via events, and the pipeline flows naturally from upload to playable captioned video.

We call this choreography rather than orchestration, and it is covered in this great course by James Eastham here: https://www.studyfromexperts.com/courses/serverless-integration-patterns/

Why Event-Driven? 🔄

Before we get into the implementation, it’s worth talking about why we chose an event-driven architecture (EDA) for this pipeline rather than a synchronous orchestration approach.

In our previous article, we had a straightforward flow: S3 upload → SQS → Lambda → MediaConvert. Adding Transcribe means we now have a sequential dependency, i.e., Transcribe must complete before MediaConvert can embed the captions. You might be tempted to reach for Step Functions here, but for this use case, a simple event-driven chain works well:

EventBridge as the backbone: AWS Transcribe natively emits Transcribe Job State Change events to EventBridge when a job completes or fails. We don’t need to poll, we don’t need a state machine, we just listen for the event and react.

Loose coupling: Each Lambda function has a single responsibility. The transcribe orchestrator doesn’t know about MediaConvert. The caption processor doesn’t know about HLS encoding. The video processor doesn’t know about Transcribe. They communicate through events and messages.

Graceful degradation: If Transcribe fails, the caption processor still triggers MediaConvert; the video just won’t have captions. No human intervention needed, no pipeline stuck in a failed state. As admins, we get a notification that the CC element failed.

Idempotency: Each step can be retried independently. Re-uploading a video restarts the entire chain. The pipeline naturally handles re-processing.

💡 This is a key principle of EDA: design for failure. Every step should handle the case where the previous step didn’t produce the expected output.

Phase 1: Starting the Transcribe Job 🎙️

The first new component is the transcribe orchestrator Lambda. It receives S3 event notifications via SQS (filtered to video file extensions only) and starts an AWS Transcribe job.
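Before the orchestrator can start a job, it has to unwrap the S3 notification from each SQS record. Here is a minimal sketch of that step, assuming the S3 notification is delivered straight to SQS (no SNS envelope in between). The names `extractVideoKeys` and `decodeS3Key` are our own illustrative helpers, not part of any AWS SDK:

```typescript
// Minimal local types for the parts of the SQS and S3 payloads we read.
interface SqsRecord {
  body: string;
}
interface S3EventRecord {
  s3: { bucket: { name: string }; object: { key: string } };
}

// S3 URL-encodes object keys in event notifications and encodes spaces as "+".
function decodeS3Key(rawKey: string): string {
  return decodeURIComponent(rawKey.replace(/\+/g, ' '));
}

// Collect every object key found in a batch of SQS records.
function extractVideoKeys(records: SqsRecord[]): string[] {
  const keys: string[] = [];
  for (const record of records) {
    const body = JSON.parse(record.body) as { Records?: S3EventRecord[] };
    // s3:TestEvent messages carry no Records array, so default to empty.
    for (const s3Record of body.Records ?? []) {
      keys.push(decodeS3Key(s3Record.s3.object.key));
    }
  }
  return keys;
}
```

Each key returned here would then go through validation before a Transcribe job is started for it.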

The Transcribe Orchestrator Lambda

tsx

import {
  TranscribeClient,
  StartTranscriptionJobCommand,
} from '@aws-sdk/client-transcribe';
import { config } from '@config';
import { logger } from '@shared/logger';

const transcribeClient = new TranscribeClient({});
const bucketName = config.get('bucketName');

const ALLOWED_VIDEO_EXTENSIONS = ['.mp4', '.mov', '.avi', '.mkv', '.webm'];

async function startTranscriptionJob(inputKey: string): Promise<string> {
  const lastSlashIndex = inputKey.lastIndexOf('/');
  const folderPath = inputKey.substring(0, lastSlashIndex);

  // Deterministic job name based on the video path + timestamp for uniqueness
  // Transcribe job names must be unique, so we include a timestamp
  const sanitisedPath = folderPath.replace(/\//g, '-');
  const jobName = `job-${sanitisedPath}-${Date.now()}`;

  logger.info('Starting Transcribe job', { jobName, inputKey });

  await transcribeClient.send(
    new StartTranscriptionJobCommand({
      TranscriptionJobName: jobName,
      LanguageCode: 'en-US', // You could override this with your own language of course
      Media: {
        MediaFileUri: `s3://${bucketName}/${inputKey}`,
      },
      OutputBucketName: bucketName,
      OutputKey: `${folderPath}/captions/`, // Temporary location
      Subtitles: {
        Formats: ['vtt'],
        OutputStartIndex: 0,
      },
      // Tag with the original video key so the caption processor
      // knows which video this transcription belongs to
      Tags: [
        { Key: 'videoKey', Value: inputKey },
      ],
    }),
  );

  return jobName;
}

A few things to call out here:

✔️ Job naming: We prefix all jobs with job- so our EventBridge rule can filter for only our jobs. If you’re running multiple applications in the same AWS account, this prevents cross-talk (you can add any prefix for your use case that would prevent clashes).

✔️ Tagging: We tag the Transcribe job with the original video S3 key. This is how the caption processor (Phase 2) knows which video folder to place the captions in, so there’s no need to reverse-engineer the path from the Transcribe output.

✔️ Subtitle format: We request WebVTT output specifically because MediaConvert can ingest WebVTT files directly as caption inputs. SRT would also work, but WebVTT is the native format for HLS subtitle renditions.

✔️ Output path: Transcribe writes to a captions/ subfolder temporarily. The caption processor will rename it to a deterministic path later.
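One thing worth guarding against in the job naming: Transcribe job names only allow the characters 0-9, a-z, A-Z, ".", "_" and "-", and are capped at 200 characters. The following is a sketch of a slightly more defensive name builder. `buildJobName` is a hypothetical helper of ours, and passing the timestamp in as a parameter is purely to make it easy to test:

```typescript
const JOB_PREFIX = 'job-';
const MAX_JOB_NAME_LENGTH = 200;

function buildJobName(inputKey: string, timestamp: number): string {
  const folderPath = inputKey.substring(0, inputKey.lastIndexOf('/'));
  // Replace anything outside [0-9a-zA-Z._-] (including "/") with "-".
  const sanitisedPath = folderPath.replace(/[^0-9a-zA-Z._-]/g, '-');
  const suffix = `-${timestamp}`;
  // Trim only the path portion so the prefix and timestamp always survive.
  const room = MAX_JOB_NAME_LENGTH - JOB_PREFIX.length - suffix.length;
  return `${JOB_PREFIX}${sanitisedPath.slice(0, room)}${suffix}`;
}
```

In the orchestrator you would call it as `buildJobName(inputKey, Date.now())`. Because the video key travels in the job tag rather than in the name, truncating the path here doesn't lose any information the caption processor needs.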

Validation and Idempotency

The orchestrator reuses the same validation logic as the existing video processor, checking file extensions, validating the S3 path structure, and skipping files that have already been processed:

tsx

function isVideoFile(key: string): boolean {
  const extension = key.substring(key.lastIndexOf('.')).toLowerCase();
  return ALLOWED_VIDEO_EXTENSIONS.includes(extension);
}

function validateVideoPath(key: string): boolean {
  // Ensure path matches: content/series/{seriesId}/video/{episodeId}/{filename}
  const pattern = /^content\/series\/[^/]+\/video\/[^/]+\/[^/]+$/;
  return pattern.test(key);
}

The orchestrator also checks whether captions are already up to date by comparing the timestamps of the source video and the existing captions.vtt, and skips Transcribe entirely if the captions are current. This means re-deploying or reprocessing doesn’t waste money on unnecessary transcription jobs.
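The freshness check itself can be a small pure function. This is a sketch only: it assumes the two timestamps come from the LastModified values of S3 HEAD requests on the video and on captions.vtt, and `captionsAreCurrent` is an illustrative name rather than code from our pipeline:

```typescript
interface ObjectInfo {
  lastModified: Date;
}

// Captions count as current only when they exist and were written
// strictly after the source video was last uploaded.
function captionsAreCurrent(
  video: ObjectInfo,
  captions: ObjectInfo | undefined,
): boolean {
  if (!captions) return false;
  return captions.lastModified.getTime() > video.lastModified.getTime();
}
```

If this returns true, the orchestrator can skip Transcribe and hand the video straight to the video-processing queue.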

Phase 2: Processing the Transcribe Output 📝

When the Transcribe job completes, AWS automatically emits an event to EventBridge. We set up a rule to catch it and route it to our caption processor Lambda.

The EventBridge Rule (CDK)

tsx

import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';

const transcribeCompletionRule = new events.Rule(this, 'TranscribeCompletionRule', {
  ruleName: `${stageName}-transcribe-completion`,
  eventPattern: {
    source: ['aws.transcribe'],
    detailType: ['Transcribe Job State Change'],
    detail: {
      TranscriptionJobStatus: ['COMPLETED', 'FAILED'],
      TranscriptionJobName: [{ prefix: 'job-' }],
    },
  },
});

transcribeCompletionRule.addTarget(
  new targets.LambdaFunction(captionProcessorLambda),
);

This is one of the things I love about EventBridge! The filtering is declarative. We only react to Transcribe jobs that:

Have completed or failed (not IN_PROGRESS)
Have our job- prefix (so we ignore jobs from other applications)

No polling, no cron jobs, no wasted compute. The event arrives, and we react.
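For unit tests it can help to mirror the rule's pattern in plain code. Below is a minimal sketch of the event shape we rely on, plus a local matcher. The `TranscribeEventBridgeEvent` interface only lists the fields this pipeline reads, and `matchesCompletionRule` is a hypothetical test helper; in production EventBridge does the filtering, not this function:

```typescript
// Only the fields this pipeline reads from the EventBridge envelope.
interface TranscribeEventBridgeEvent {
  source: string;
  'detail-type': string;
  detail: {
    TranscriptionJobName: string;
    TranscriptionJobStatus: string;
  };
}

// Mirrors the rule: source, detail type, terminal status, and our prefix.
function matchesCompletionRule(event: TranscribeEventBridgeEvent): boolean {
  const { TranscriptionJobName, TranscriptionJobStatus } = event.detail;
  return (
    event.source === 'aws.transcribe' &&
    event['detail-type'] === 'Transcribe Job State Change' &&
    ['COMPLETED', 'FAILED'].includes(TranscriptionJobStatus) &&
    TranscriptionJobName.startsWith('job-')
  );
}
```

Keeping a matcher like this next to the CDK rule makes it easier to spot when the two drift apart.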

The Caption Processor Lambda

This Lambda does three things: rename the Transcribe output to a deterministic path, clean up temporary files, and send a message to the existing video-processing queue.

tsx

import {
  TranscribeClient,
  GetTranscriptionJobCommand,
} from '@aws-sdk/client-transcribe';
import {
  S3Client,
  CopyObjectCommand,
  DeleteObjectCommand,
} from '@aws-sdk/client-s3';
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
import { config } from '@config';
import { logger } from '@shared/logger';

const transcribeClient = new TranscribeClient({});
const s3Client = new S3Client({});
const sqsClient = new SQSClient({});
const bucketName = config.get('bucketName');
const videoProcessingQueueUrl = config.get('videoProcessingQueueUrl');

export async function handler(event: TranscribeEventBridgeEvent) {
  const { TranscriptionJobName, TranscriptionJobStatus } = event.detail;

  if (TranscriptionJobStatus !== 'COMPLETED') {
    logger.warn('Transcription job did not complete successfully', {
      jobName: TranscriptionJobName,
      status: TranscriptionJobStatus,
    });
    // On failure, still trigger MediaConvert without captions
    // so the video is at least available for playback
    const videoKey = await getVideoKeyFromJob(TranscriptionJobName);
    if (videoKey) {
      await sendToVideoProcessingQueue(videoKey);
    }
    return;
  }

  // Get job details to find output location and original video key
  const jobDetails = await transcribeClient.send(
    new GetTranscriptionJobCommand({
      TranscriptionJobName: TranscriptionJobName,
    }),
  );

  const subtitleUri =
    jobDetails.TranscriptionJob?.Subtitles?.SubtitleFileUris?.[0];
  const videoKey = getVideoKeyFromTags(jobDetails.TranscriptionJob?.Tags);

  if (!subtitleUri || !videoKey) {
    logger.error('Missing subtitle URI or video key', {
      TranscriptionJobName,
    });
    return;
  }

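  // Parse the S3 key from the URI. Transcribe may report an HTTPS path-style
  // URI rather than s3://, so take everything after "/<bucket>/"
  const bucketMarker = `/${bucketName}/`;
  const sourceKey = subtitleUri.substring(
    subtitleUri.indexOf(bucketMarker) + bucketMarker.length,
  );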

  // Derive the deterministic target path
  const videoFolder = videoKey.substring(0, videoKey.lastIndexOf('/'));
  const targetKey = `${videoFolder}/captions.vtt`;

  // Copy to deterministic name
  await s3Client.send(
    new CopyObjectCommand({
      Bucket: bucketName,
      CopySource: `${bucketName}/${sourceKey}`,
      Key: targetKey,
    }),
  );

  // Clean up temporary Transcribe output
  await s3Client.send(
    new DeleteObjectCommand({ Bucket: bucketName, Key: sourceKey }),
  );

  logger.info('Caption file processed, triggering MediaConvert', {
    targetKey,
    videoKey,
  });

  // Send message to video-processing queue to trigger MediaConvert
  await sendToVideoProcessingQueue(videoKey);
}
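The handler above leans on a few helpers that aren't shown. As one example, here is a minimal sketch of `getVideoKeyFromTags`, assuming the tags arrive as the Key/Value pairs we attached in Phase 1. The local `Tag` type stands in for the SDK's tag shape so the snippet is self-contained:

```typescript
// Stand-in for the SDK's tag shape: both fields may be undefined.
interface Tag {
  Key?: string;
  Value?: string;
}

// Return the original video key stored in the "videoKey" tag, if present.
function getVideoKeyFromTags(tags: Tag[] | undefined): string | undefined {
  return tags?.find((tag) => tag.Key === 'videoKey')?.Value;
}
```

A `getVideoKeyFromJob` helper for the failure path could fetch the job details and pass its tags through this same function.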

Why the Rename?

Transcribe generates output files with the job name in the filename, something like:

content/series/{seriesId}/video/{episodeId}/captions/job-content-series-abc-video-ep1-1714400000.vtt

That’s not great for MediaConvert to reference. We rename it to a deterministic path:

content/series/{seriesId}/video/{episodeId}/captions.vtt

Now the video processor can always check for captions.vtt at a known location, with no need to list S3 objects or parse job names.
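The path handling behind the rename can be sketched as two pure functions. This is illustrative only: `s3KeyFromUri` and `captionsTargetKey` are our own hypothetical helpers, and the URI formats handled (s3://, HTTPS path-style, HTTPS virtual-hosted-style) are an assumption about what the job details may contain, so check against a real job in your account:

```typescript
// Extract the object key from an S3 URI in one of three common shapes.
function s3KeyFromUri(uri: string, bucket: string): string | undefined {
  const s3Prefix = `s3://${bucket}/`;
  if (uri.startsWith(s3Prefix)) return uri.slice(s3Prefix.length);

  const match = /^https:\/\/([^/]+)\/(.+)$/.exec(uri);
  if (!match) return undefined;
  const host = match[1];
  const path = decodeURIComponent(match[2]);
  // Path-style: https://s3.<region>.amazonaws.com/<bucket>/<key>
  if (path.startsWith(`${bucket}/`)) return path.slice(bucket.length + 1);
  // Virtual-hosted-style: https://<bucket>.s3.<region>.amazonaws.com/<key>
  if (host.startsWith(`${bucket}.`)) return path;
  return undefined;
}

// Deterministic captions location: same folder as the video.
function captionsTargetKey(videoKey: string): string {
  const videoFolder = videoKey.substring(0, videoKey.lastIndexOf('/'));
  return `${videoFolder}/captions.vtt`;
}
```

Returning undefined for an unrecognised URI lets the caller log and bail out instead of copying from a wrong key.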

Graceful Degradation

Notice the failure handling: if Transcribe fails, we still send a message to the video-processing queue. The video processor will check for captions.vtt, find it doesn’t exist, and create the MediaConvert job without captions. The video is still playable, it just won’t have a CC button.

This is a deliberate design choice. Captions are important, but a failed transcription shouldn’t block the entire video from being available. Log the failure, emit a metric, and move on.

Phase 3: MediaConvert with Embedded Captions 🎬

This is where it all comes together. The existing video processor Lambda is modified to detect captions and pass them into the MediaConvert job as parameters.

Checking for Captions

Before creating the MediaConvert job, we do a simple S3 HEAD request to check if the caption file exists:

tsx

async function captionsExist(videoKey: string): Promise<boolean> {
  const folderPath = videoKey.substring(0, videoKey.lastIndexOf('/'));
  const captionsKey = `${folderPath}/captions.vtt`;

  try {
    await s3Client.send(
      new HeadObjectCommand({
        Bucket: bucketName,
        Key: captionsKey,
      }),
    );
    return true;
  } catch {
    return false;
  }
}
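The bare catch above treats every error as "no captions", including permission or throttling errors. If you would rather only degrade on a genuine miss, one option is to classify the error first. This is a sketch that assumes the SDK error exposes a `name` of 'NotFound' or a 404 in `$metadata.httpStatusCode`; `isNotFoundError` is an illustrative helper:

```typescript
// True only when the error looks like "object does not exist".
function isNotFoundError(error: unknown): boolean {
  if (typeof error !== 'object' || error === null) return false;
  const candidate = error as {
    name?: unknown;
    $metadata?: { httpStatusCode?: unknown };
  };
  return (
    candidate.name === 'NotFound' ||
    candidate.$metadata?.httpStatusCode === 404
  );
}
```

In the catch block you could then return false when `isNotFoundError(error)` is true and rethrow otherwise, so SQS retries the message. Whether a retry beats publishing without captions is a design choice for your platform. Also bear in mind that S3 may answer a HEAD on a missing key with 403 rather than 404 when the role lacks s3:ListBucket on the bucket.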

Adding CaptionSelectors to the MediaConvert Input

This is the key integration point. When captions exist, we add a CaptionSelectors block to the MediaConvert input that tells it where to find the WebVTT file:

tsx

async function createMediaConvertJob(
  inputKey: string,
): Promise<string | undefined> {
  const lastSlashIndex = inputKey.lastIndexOf('/');
  const folderPath = inputKey.substring(0, lastSlashIndex);
  const outputPath = `s3://${bucketName}/${folderPath}/`;

  const hasCaptions = await captionsExist(inputKey);
  const captionsKey = `${folderPath}/captions.vtt`;

  logger.info('Creating MediaConvert job', {
    inputKey,
    outputPath,
    hasCaptions,
  });

  // Build input with optional caption selectors
  const input: MediaConvertInput = {
    FileInput: `s3://${bucketName}/${inputKey}`,
    AudioSelectors: {
      'Audio Selector 1': {
        DefaultSelection: 'DEFAULT',
      },
    },
    VideoSelector: {},
  };

  // If captions exist, tell MediaConvert where to find them
  if (hasCaptions) {
    input.CaptionSelectors = {
      'Captions Selector 1': {
        SourceSettings: {
          SourceType: 'WEBVTT',
          FileSourceSettings: {
            SourceFile: `s3://${bucketName}/${captionsKey}`,
          },
        },
      },
    };
  }

  // ... rest of job creation
}

The CaptionSelectors block is the bridge between Transcribe’s output and MediaConvert’s input. It says: “Hey MediaConvert, there’s a WebVTT file at this S3 path, so treat it as a caption source.”

Adding CaptionDescriptions to Each Output

For each HLS output variant (1080p, 720p, 480p), we add a CaptionDescriptions block that tells MediaConvert to include the captions in that output:

tsx

// Build caption descriptions for each output (only if captions exist)
const captionDescriptions = hasCaptions
  ? [
      {
        CaptionSelectorName: 'Captions Selector 1',
        DestinationSettings: {
          DestinationType: 'WEBVTT' as const,
          WebvttDestinationSettings: {
            StylePassthrough: 'DISABLED' as const,
          },
        },
        LanguageCode: 'ENG',
        LanguageDescription: 'English',
      },
    ]
  : undefined;

And in the HLS group settings, we tell MediaConvert to insert caption language mappings into the manifest:

tsx

const jobSettings: CreateJobCommandInput = {
  Role: mediaConvertRoleArn,
  Settings: {
    Inputs: [input],
    OutputGroups: [
      {
        Name: 'Apple HLS',
        OutputGroupSettings: {
          Type: 'HLS_GROUP_SETTINGS',
          HlsGroupSettings: {
            SegmentLength: 6,
            MinSegmentLength: 0,
            Destination: outputPath,
            ManifestDurationFormat: 'INTEGER',
            SegmentControl: 'SEGMENTED_FILES',
            // These two settings embed the caption track in the manifest
            CaptionLanguageSetting: hasCaptions ? 'INSERT' : undefined,
            CaptionLanguageMappings: hasCaptions
              ? [
                  {
                    LanguageCode: 'ENG',
                    LanguageDescription: 'English',
                    CaptionChannel: 1,
                  },
                ]
              : undefined,
          },
        },
        Outputs: [
          {
            // 1080p output
            NameModifier: '_1080p',
            VideoDescription: {
              Height: 1080,
              Width: 1920,
              CodecSettings: {
                Codec: 'H_264',
                H264Settings: {
                  Bitrate: 5000000,
                  RateControlMode: 'CBR',
                },
              },
            },
            AudioDescriptions: [
              {
                AudioSourceName: 'Audio Selector 1',
                CodecSettings: {
                  Codec: 'AAC',
                  AacSettings: { Bitrate: 128000, SampleRate: 48000 },
                },
              },
            ],
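            // Attach captions to this output. MediaConvert's captions docs describe
            // sidecar formats such as WebVTT as living in their own captions-only
            // output within the group, so confirm this placement with a test job.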
            CaptionDescriptions: captionDescriptions,
          },
          // ... 720p and 480p outputs follow the same pattern
          // Each gets CaptionDescriptions: captionDescriptions
        ],
      },
    ],
  },
};

What MediaConvert Produces

When the job completes, the S3 output looks like this:

content/series/{seriesId}/video/{episodeId}/
├── captions.vtt                  # WebVTT captions from Transcribe
├── video.m3u8                    # HLS master playlist
├── video_1080p.m3u8              # 1080p variant playlist
├── video_1080p_00001.ts          # 1080p segments
├── video_720p.m3u8               # 720p variant playlist
├── video_720p_00001.ts           # 720p segments
├── video_480p.m3u8               # 480p variant playlist
├── video_480p_00001.ts           # 480p segments
└── video_captions.m3u8           # Subtitle rendition playlist

The master playlist (video.m3u8) now includes an EXT-X-MEDIA:TYPE=SUBTITLES tag pointing to video_captions.m3u8. Any HLS-compatible player that reads this manifest will automatically discover the subtitle track and show a CC button.
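A quick way to confirm the subtitle rendition made it into the manifest is to parse the master playlist in a smoke test. Below is a minimal sketch: `findSubtitleRenditions` is a hypothetical helper, and the sample playlist is hand-written for illustration rather than copied from real MediaConvert output:

```typescript
interface SubtitleRendition {
  language?: string;
  name?: string;
  uri?: string;
}

// Pull LANGUAGE, NAME and URI out of each EXT-X-MEDIA subtitle line.
function findSubtitleRenditions(masterPlaylist: string): SubtitleRendition[] {
  // Match a quoted attribute that follows ":" or "," so that LANGUAGE
  // does not accidentally match ASSOC-LANGUAGE.
  const attr = (line: string, key: string): string | undefined =>
    new RegExp(`[:,]${key}="([^"]*)"`).exec(line)?.[1];

  return masterPlaylist
    .split(/\r?\n/)
    .filter((line) => line.startsWith('#EXT-X-MEDIA:') && /TYPE=SUBTITLES/.test(line))
    .map((line) => ({
      language: attr(line, 'LANGUAGE'),
      name: attr(line, 'NAME'),
      uri: attr(line, 'URI'),
    }));
}

// Hand-written sample for illustration only.
const samplePlaylist = [
  '#EXTM3U',
  '#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English",LANGUAGE="eng",URI="video_captions.m3u8"',
  '#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080,SUBTITLES="subs"',
  'video_1080p.m3u8',
].join('\n');

const renditions = findSubtitleRenditions(samplePlaylist);
```

An empty array here after a successful job would be a useful signal that the video was published without captions.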

No frontend changes are needed for the player itself: hls.js and Safari’s native HLS support both handle this automatically.

Loop Prevention — A Critical Design Consideration 🛡️

Here’s something that will bite you if you’re not careful: S3 event notifications fire on every PutObject. That means when Transcribe writes a .vtt file to S3, or when MediaConvert writes .m3u8 and .ts files, those writes could re-trigger the pipeline and create an infinite loop.

We prevent this with multiple layers of defence:

Layer 1 — S3 suffix filtering: The S3 event notification only fires for video file extensions (.mp4, .mov, .avi, .mkv, .webm). All derivative files (.vtt, .m3u8, .ts, .json) are ignored at the notification level.

Layer 2 — Separate SQS queues: The transcribe-orchestrator queue receives S3 events. The video-processing queue only receives messages from the caption-processor Lambda. MediaConvert’s output landing in S3 doesn’t trigger any queue.

Layer 3 — Lambda validation (defence in depth): Even if a non-video file somehow reaches a Lambda, it validates the file extension before processing.

tsx

const IGNORED_EXTENSIONS = ['.vtt', '.m3u8', '.ts', '.json', '.srt'];

function isIgnoredFile(key: string): boolean {
  const extension = key.substring(key.lastIndexOf('.')).toLowerCase();
  return IGNORED_EXTENSIONS.includes(extension);
}

The chain looks like this:

Video upload (.mp4) → S3 event fires → suffix filter matches → transcribe-orchestrator queue ✅
Transcribe writes .vtt → S3 event fires → suffix filter rejects .vtt → nothing happens ✅
EventBridge catches Transcribe completion → caption-processor → video-processing queue ✅
MediaConvert writes .m3u8/.ts → S3 event fires → suffix filter rejects → nothing happens ✅

No loops. Each file type has exactly one path through the system.
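The chain above is easy to pin down in a unit test. Here is a small sketch that classifies object keys the way the Lambda-level validation does. `routeForKey` is an illustrative helper, and the extension list simply repeats the one used earlier in the article:

```typescript
const VIDEO_EXTENSIONS = ['.mp4', '.mov', '.avi', '.mkv', '.webm'];

type Route = 'transcribe-orchestrator' | 'ignored';

// Only source video uploads enter the pipeline; everything else is dropped.
function routeForKey(key: string): Route {
  const dotIndex = key.lastIndexOf('.');
  const extension = dotIndex === -1 ? '' : key.substring(dotIndex).toLowerCase();
  return VIDEO_EXTENSIONS.includes(extension)
    ? 'transcribe-orchestrator'
    : 'ignored';
}
```

Asserting on a handful of representative keys (an upload, a .vtt, an .m3u8, a .ts segment) gives you a cheap regression test against accidental loops.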

Gotchas and Industry Patterns ⚠️

Before we wrap up, there are two important gotchas that you’ll run into when working with AWS Transcribe at scale. These are things that aren’t immediately obvious from the documentation but will save you time and money.

Gotcha 1: The 2GB File Size Limit

AWS Transcribe has a 2GB file size limit for batch transcription jobs (alongside a 4-hour maximum duration). This might sound generous, but it’s not hard to hit:

A 30-minute 1080p video at a reasonable bitrate can easily be 1-2GB
A 60-minute 4K lecture? Forget about it, you’re well over the limit
Raw camera footage before any compression? Almost certainly too large

If you try to submit a file larger than 2GB, the Transcribe job will fail. And since we’ve designed for graceful degradation, the video will still be processed by MediaConvert, just without captions. Not ideal.
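Rather than letting the job fail, the orchestrator can check the object size up front. This is a minimal sketch that assumes the size comes from the S3 event or a HEAD request, and that "2GB" means 2 GiB. The binary interpretation is our assumption, so use a lower threshold if you want a safety margin:

```typescript
// Assumption: the 2GB limit is interpreted as 2 GiB.
const TRANSCRIBE_MAX_BYTES = 2 * 1024 * 1024 * 1024;

// True when the source file is too large to submit to Transcribe directly.
function exceedsTranscribeSizeLimit(contentLengthBytes: number): boolean {
  return contentLengthBytes > TRANSCRIBE_MAX_BYTES;
}
```

When this returns true you can skip Transcribe and log it, or take a different route for oversized files.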

Gotcha 2: The Low-Quality Proxy Pattern

This is how the industry handles the 2GB limit, and it’s a pattern worth knowing about.

The key insight: Transcribe only needs the audio track. It doesn’t care about video quality at all. A 480p version of a video has the exact same audio as the 4K version.

So the standard approach is:

Transcode a low-quality proxy of the video: think 360p or even 240p, low bitrate, just enough to carry the audio track clearly.
Run Transcribe against the proxy — this tiny file is well under 2GB even for hours-long content.
Throw away the proxy video — it was only ever needed for the audio.
Pass the resulting captions into MediaConvert with the original high-quality video.

Original 4K video (8GB) ───┐
                           │
Low-quality proxy (50MB) ──┼──▶ Transcribe ──▶ captions.vtt
  (240p, low bitrate)      │
                           │
Original 4K video (8GB) ───┼──▶ MediaConvert ──▶ HLS (1080p/720p/480p + captions)
                           │
Low-quality proxy ─────────┘──▶ 🗑️ Delete

The proxy video is a throwaway artefact. Its only purpose is to get the audio into Transcribe under the 2GB limit. The resulting WebVTT captions are then paired with the original high-quality source video in MediaConvert.

In practice, you’d add a preliminary MediaConvert job (or use FFmpeg in a Lambda/container) to produce the proxy:

tsx

// Example: Creating a low-quality proxy for Transcribe
const proxyJobSettings = {
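  Inputs: [
    {
      FileInput: `s3://${bucketName}/${originalVideoKey}`,
      // Select the source audio explicitly, since the audio is the whole point of the proxy
      AudioSelectors: { 'Audio Selector 1': { DefaultSelection: 'DEFAULT' } },
    },
  ],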
  OutputGroups: [
    {
      OutputGroupSettings: {
        Type: 'FILE_GROUP_SETTINGS',
        FileGroupSettings: {
          Destination: `s3://${bucketName}/${folderPath}/proxy/`,
        },
      },
      Outputs: [
        {
          VideoDescription: {
            Height: 240,
            Width: 426,
            CodecSettings: {
              Codec: 'H_264',
              H264Settings: {
                Bitrate: 200000, // Very low bitrate as we only need the audio
                RateControlMode: 'CBR',
              },
            },
          },
          AudioDescriptions: [
            {
              CodecSettings: {
                Codec: 'AAC',
                AacSettings: { Bitrate: 64000, SampleRate: 44100 },
              },
            },
          ],
          ContainerSettings: { Container: 'MP4' },
        },
      ],
    },
  ],
};

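A 60-minute 4K video at 8GB becomes a 240p proxy of roughly 120MB at the bitrates above (200 kbps video plus 64 kbps audio over 3,600 seconds). That is well under the 2GB limit, and Transcribe should produce essentially the same captions because the speech content is unchanged.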
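As a rough check on proxy size, constant-bitrate output is simply bitrate multiplied by duration. Here is a small sketch of that arithmetic. `estimateSizeMB` is an illustrative helper, it ignores container overhead, and it uses decimal megabytes:

```typescript
// Estimated file size in MB for constant-bitrate video plus audio.
function estimateSizeMB(
  durationMinutes: number,
  videoBitsPerSecond: number,
  audioBitsPerSecond: number,
): number {
  const totalBits =
    (videoBitsPerSecond + audioBitsPerSecond) * durationMinutes * 60;
  return totalBits / 8 / 1e6;
}

// 60 minutes at 200 kbps video + 64 kbps audio
const proxySizeMB = estimateSizeMB(60, 200_000, 64_000);
```

With the proxy settings shown earlier this comes out at about 118.8 MB for an hour of content, so even a four-hour file stays below 500 MB.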

💡 For our ‘Gilmore Movies and Series’ platform, most episodes are under 30 minutes and uploaded at 1080p, so we’re comfortably under the 2GB limit. But if you’re building a platform that handles longer content or higher resolutions, the proxy pattern is essential.

Security and Signed URLs 🔐

One of the nice things about embedding captions in the HLS manifest is that no changes are needed to the signed URL infrastructure we covered in the previous article.

The existing wildcard signed URL policy covers all files in the episode’s video folder:

content/series/{seriesId}/video/{episodeId}/*

The captions.vtt file, the video_captions.m3u8 subtitle playlist, and all segment files are within this wildcard scope. The HLS player fetches them using the same signed URL policy as the video segments.

No extra signed URLs for caption files. No side-loading of tracks. No additional API endpoints. The captions are just part of the HLS stream.

Cost Considerations 💰

AWS Transcribe pricing is approximately $0.024 per minute of audio. Here’s what that looks like at scale:

Video Duration	Cost per Video	100 Videos	1,000 Videos
5 minutes	$0.12	$12.00	$120.00
15 minutes	$0.36	$36.00	$360.00
30 minutes	$0.72	$72.00	$720.00
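The figures in the table are just rate × minutes × videos, so you can plug in your own numbers. This is a small sketch using the approximate $0.024 per minute rate quoted above; check current pricing for your region and tier. It works in thousandths of a dollar to avoid floating-point rounding surprises:

```typescript
// $0.024 per minute, expressed in thousandths of a dollar.
const RATE_THOUSANDTHS_PER_MINUTE = 24;

// Total Transcribe cost in USD, formatted to two decimal places.
function transcribeCostUsd(minutesPerVideo: number, videoCount: number): string {
  const thousandths =
    minutesPerVideo * RATE_THOUSANDTHS_PER_MINUTE * videoCount;
  return (thousandths / 1000).toFixed(2);
}
```

For example, `transcribeCostUsd(15, 100)` returns "36.00", matching the 15-minute row for 100 videos.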

The additional infrastructure costs are negligible:

Two new Lambda functions at ~100-200ms execution each are well within free tier.
One new SQS queue — minimal cost.
EventBridge events — $1.00 per million events.

If you’re using the low-quality proxy pattern, add the cost of the proxy MediaConvert job (a few cents per video at 240p).

Wrapping Up 📝

We’ve covered a lot of ground in this article. The key takeaways:

EventBridge is your friend — AWS services emit events natively, so lean into event-driven patterns rather than polling or orchestration.
Design for failure — every step should handle the case where the previous step didn’t produce the expected output.
Embed captions in HLS — it’s the industry standard, players handle it natively, and it avoids the complexity of side-loading caption files.
Know the 2GB limit — and use the low-quality proxy pattern when your content exceeds it.
Keep it loosely coupled — each Lambda does one thing, communicates via events, and doesn’t know about the rest of the pipeline.

I hope you found this article useful. If you have any questions or feedback, feel free to reach out!
