Serverless Integration Patterns
Course Description
Gain practical knowledge for implementing integration patterns with AWS serverless services. This course goes far beyond what is in the Enterprise Integration Patterns book.
Technologies Covered
About the Instructor
James Eastham
James Eastham is a Serverless Developer Advocate at Datadog, an AWS Community Builder, and a Microsoft MVP. He has answered phones in front-line support, administered databases, and developed modern applications.
Course Curriculum (32 modules)
Introduction
In this video you will learn what to expect from a course on advanced serverless integration patterns, going beyond basic enterprise integration patterns to cover pitfalls and production considerations for point-to-point messaging, publish/subscribe, and sagas. You’ll use a pizza restaurant backend as the consistent sample application, with TypeScript code provided on GitHub to demonstrate concepts that can be applied in other languages. The course assumes you already know core AWS serverless services (Lambda, SQS, SNS, DynamoDB, EventBridge, Step Functions) and focuses on patterns rather than service tutorials. You’ll also see how the repository is organized by concept, how to use a provided Postman collection to call three API endpoints, and which CDK JSON settings to update for OpenTelemetry export (exporter endpoint, headers, and required regions).
Point-To-Point
In this video, you will learn what a point-to-point integration channel is and why queues are used to connect one producer to one consumer asynchronously. Using a pizza restaurant example, you’ll see how an order management service and a kitchen service avoid tight coupling from direct HTTP calls by sending orders through a queue, which helps handle outages, burst traffic, and prevents a single bad order from blocking processing. You’ll also cover recovery from downtime, observability considerations, error scenarios, and the use of dead letter queues. The module maps an AWS serverless architecture with API Gateway fronting an Order Service Lambda that stores orders in DynamoDB and sends messages to SQS, a Kitchen Lambda consuming those messages, and an asynchronous request-reply pattern where the kitchen sends acknowledgements to a separate queue processed by an acknowledgement Lambda.
Dead Letter Queue and Batch Processing
In this video, you will learn how to prevent a single “poison pill” message from blocking processing in an SQS queue by using a dead letter queue (DLQ) and partial batch failure handling with AWS Lambda. You’ll see how AWS CDK configuration can solve much of this without extra application code, including setting an SQS event source with reportBatchItemFailures enabled so only failed messages in a batch are retried, and configuring batch size and scaling behavior. You’ll also learn how maxReceiveCount moves messages to a DLQ after repeated failures (e.g., three tries) to handle transient vs. message-related errors. Finally, you’ll review TypeScript Lambda code using Lambda Powertools to wrap per-message processing in try/catch, delete validation-error messages, and rethrow other errors to trigger retries and DLQ routing.
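The per-message try/catch described above can be sketched as follows. This is a minimal, illustrative handler, not the course repo's code: the `ValidationError` class, the record shapes, and the `processOrder` logic are assumptions, and the real handler would use the AWS Lambda SQS event types and Lambda Powertools.

```typescript
// Sketch of a partial-batch SQS handler. Validation failures (poison
// pills) are swallowed so SQS deletes the message; transient failures
// are reported back so Lambda retries only those records, and repeated
// failures eventually route to the DLQ via maxReceiveCount.
type SqsRecord = { messageId: string; body: string };
type SqsEvent = { Records: SqsRecord[] };
type BatchResponse = { batchItemFailures: { itemIdentifier: string }[] };

class ValidationError extends Error {}

function processOrder(body: string): void {
  const order = JSON.parse(body); // a malformed body throws here
  if (!order.orderId) throw new ValidationError("orderId missing");
  // ...business logic would run here...
}

export function handler(event: SqsEvent): BatchResponse {
  const batchItemFailures: { itemIdentifier: string }[] = [];
  for (const record of event.Records) {
    try {
      processOrder(record.body);
    } catch (err) {
      if (err instanceof ValidationError) {
        // Poison pill: treat as handled so it is deleted, not retried.
        continue;
      }
      // Other errors: report the ID so only this record is retried.
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return { batchItemFailures };
}
```

Returning `{ batchItemFailures: [] }` tells Lambda the whole batch succeeded; this response shape only takes effect when `reportBatchItemFailures` is enabled on the event source mapping.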
Hexagonal Architecture in Lambda
In this video you will learn why separating infrastructure code from application/business logic in AWS Lambda is critical for maintainable, testable serverless systems, using a ports-and-adapters (hexagonal) style. The handler may be large due to configuration, observability, message processing, tracking, and error handling, but the core kitchen service logic (validating an order and saving to DynamoDB) should remain a small, pure function. This separation enables simple unit tests for business logic and targeted tests for processing mechanics, including verifying partial batch failures and DLQ behavior by mocking transient downstream errors and asserting returned batch item failures (e.g., one failure/one success). The key pattern is wrapping each message’s processing in a try-catch and, on failure, triggering the messaging system’s DLQ mechanism, which can be configuration-only with SQS and Lambda.
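A minimal ports-and-adapters sketch of the idea above (the interface, types, and names are illustrative, not the course's exact code): the business rule is a small pure function that depends on a port, so unit tests can use an in-memory adapter and never touch AWS.

```typescript
// Port: how the core logic persists orders, with no DynamoDB details.
interface OrderStore {
  save(order: { orderId: string; items: string[] }): void;
}

// Core business logic: validate the order, then persist via the port.
export function processKitchenOrder(
  order: { orderId: string; items: string[] },
  store: OrderStore
): { success: boolean; reason?: string } {
  if (order.items.length === 0) {
    return { success: false, reason: "order has no items" };
  }
  store.save(order);
  return { success: true };
}

// In-memory adapter for unit tests; a DynamoDB adapter would implement
// the same interface in the deployed Lambda.
export class InMemoryOrderStore implements OrderStore {
  saved: string[] = [];
  save(order: { orderId: string; items: string[] }): void {
    this.saved.push(order.orderId);
  }
}
```

The Lambda handler then becomes a thin adapter that unwraps the SQS record and calls `processKitchenOrder`, keeping the handler's batch/DLQ mechanics testable separately from the business rules.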
Understanding SQS Ordering and FIFO
In this video you will learn how ordering works in SQS and when you actually need it in serverless systems. You’ll compare Standard vs FIFO queues, including the trade-offs of FIFO (higher cost and lower throughput) and why order processing often doesn’t require guaranteed ordering, unlike stock trading, chat messages, or financial ledgers. You’ll also see that FIFO requires extra message configuration, using a message group ID to enforce ordering within a constraint (such as per customer or loyalty tier) and a deduplication ID (such as order ID), with FIFO supporting automatic deduplication.
Structuring Cloud Events for Messaging
In this video you will learn how to structure advanced asynchronous messages by wrapping raw SQS records in an envelope using the CloudEvents specification. You’ll see how the kitchen service handler parses the SQS message body as a CloudEvent of a custom type (pizza order), and what key fields the envelope includes: a unique message ID, source, message type, data type, and the actual order data. You’ll also learn why using an envelope (instead of sending a plain JSON payload) makes systems easier to evolve through versioning and schema design, which is covered in more depth later in the course, and that the producer side creates the same CloudEvent when publishing the order message.
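A sketch of the envelope round-trip, assuming field names from the CloudEvents 1.0 specification (the `pizza.order` type string and helper names are illustrative, not from the course repo):

```typescript
// CloudEvents-style envelope around an arbitrary payload.
interface CloudEvent<T> {
  specversion: "1.0";
  id: string;          // unique message ID
  source: string;      // producing service
  type: string;        // message type, e.g. a pizza order event
  datacontenttype: string;
  time: string;
  data: T;             // the actual order data
}

// Producer side: wrap the payload before sending it to SQS.
export function wrapOrderEvent<T>(
  data: T,
  type: string,
  source: string,
  id: string
): CloudEvent<T> {
  return {
    specversion: "1.0",
    id,
    source,
    type,
    datacontenttype: "application/json",
    time: new Date().toISOString(),
    data,
  };
}

// Consumer side: parse the SQS body back into the envelope before
// touching the payload.
export function unwrap<T>(body: string): CloudEvent<T> {
  return JSON.parse(body) as CloudEvent<T>;
}
```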
Asynchronous Request Response
In this video you will learn how to implement an asynchronous request-reply pattern between an order service and a kitchen service using SQS, avoiding synchronous HTTP calls while preserving resilience and scalability. The order service publishes an order message wrapped in a CloudEvents envelope that includes a reply-to URL (an acknowledgement SQS queue URL passed via environment variables), and the kitchen service sends an acknowledgement message back to that queue if reply-to is set. Another Lambda in the order service polls the acknowledgement queue and updates the order status. You also learn to correlate requests and responses using the order ID as the correlation ID, or generating a unique ID and storing it (e.g., in DynamoDB) when no natural identifier exists.
Idempotent Message Processing
In this video you will learn why idempotency is essential when using at-least-once delivery systems like SQS, dead letter queues, and message reprocessing, where the same message can be received multiple times. You’ll see an example where the kitchen gets the same order twice and must avoid creating duplicate orders by processing only one. The approach uses DynamoDB with the order ID as the idempotency key and a conditional write that only inserts the record if the order ID does not already exist. If DynamoDB throws a conditional check failed exception (meaning the order already exists), the handler returns success so the system state remains the same. This sets up why idempotency matters even more when handling queue backlogs caused by errors or sudden message spikes.
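The conditional-write rule above can be sketched behind a port. In the course's AWS version this would be a DynamoDB `PutItem` with an `attribute_not_exists(orderId)` condition, catching `ConditionalCheckFailedException`; here an in-memory stand-in (all names illustrative) shows just the idempotency logic.

```typescript
// Port: insert-if-absent semantics, the core of the idempotency check.
interface IdempotencyStore {
  insertIfAbsent(orderId: string): boolean; // false → already processed
}

// In-memory stand-in for the DynamoDB conditional write.
export class InMemoryIdempotencyStore implements IdempotencyStore {
  private seen = new Set<string>();
  insertIfAbsent(orderId: string): boolean {
    if (this.seen.has(orderId)) return false;
    this.seen.add(orderId);
    return true;
  }
}

export function handleOrderMessage(
  orderId: string,
  store: IdempotencyStore,
  process: (orderId: string) => void
): "processed" | "duplicate" {
  if (!store.insertIfAbsent(orderId)) {
    // Equivalent of catching ConditionalCheckFailedException: return
    // success without reprocessing, so system state stays the same.
    return "duplicate";
  }
  process(orderId);
  return "processed";
}
```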
Managing DLQ backlogs
In this video you will learn that while SQS queues and dead-letter queues make failures feel “durable” and easy to handle, recovery can overload downstream systems if you blindly redrive a large DLQ backlog into the main queue. You’ll see how to control recovery by setting a custom DLQ redrive velocity (e.g., 10 messages/second) and by limiting Lambda throughput using SQS event source maximum concurrency, batch size, and Lambda reserved concurrency, so downstream APIs or databases aren’t overwhelmed even though Lambda can scale. You’ll also learn to monitor failure and lag by alarming on DLQ depth (e.g., more than 1 message) and on the age of the oldest message in the main queue (e.g., over 5 minutes) using CloudWatch/CDK to detect outages, errors, or sudden traffic spikes.
Observing Point to Point Message Channels
In this video you will learn how to use OpenTelemetry and an observability backend to see end-to-end cause-and-effect across an SQS-driven, multi-service pizza ordering flow by linking spans in a single trace (order receiver → kitchen processor → acknowledgement processor) via parent-child relationships, which are especially useful for point-to-point messaging. You’ll see how to manually propagate trace context by injecting it into SQS message attributes (or alternatively into the message body) on the producer side, then extracting it on the consumer side and starting new spans as children of the extracted context. You’ll also learn how telemetry is exported from Lambda using the Rust-based Rotel Lambda extension (local OTLP endpoint that flushes after the function returns) and OpenTelemetry Lambda layers that wrap handlers to auto-capture AWS metadata, with notes on batch processing and span reparenting.
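The inject/extract step can be sketched as below. The course uses the OpenTelemetry SDK's propagators for this; this hand-rolled helper for the W3C `traceparent` format only illustrates what ends up in the SQS message attributes, and the attribute carrier shape is an assumption.

```typescript
// Producer side: inject the current trace context into the message
// attributes before sending to SQS (W3C traceparent format:
// version-traceId-spanId-flags).
export function injectTraceContext(
  traceId: string,
  spanId: string,
  attributes: Record<string, string>
): void {
  attributes["traceparent"] = `00-${traceId}-${spanId}-01`;
}

// Consumer side: extract the context so new spans can be started as
// children of the producer's span.
export function extractTraceContext(
  attributes: Record<string, string>
): { traceId: string; spanId: string } | undefined {
  const header = attributes["traceparent"];
  if (!header) return undefined;
  const [, traceId, spanId] = header.split("-");
  return { traceId, spanId };
}
```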
Point to Point Wrap Up
In this video you will learn how a seemingly simple point-to-point integration pattern using a queue becomes more complex when you consider partial batches, partial failures, poison pill messages, sudden load spikes, and outages, and why recovery planning matters as much as initial queue configuration. You will also learn the role of dead letter queues, reprocessing, and idempotency, and how to understand an asynchronous application’s behavior using parent-child relationships and span reparenting to connect queue processing into an end-to-end trace across services. The video closes by previewing the next module on publish/subscribe and the deeper considerations for modern serverless applications.
Introducing Publish Subscribe
In this video you will learn how the publish-subscribe integration pattern helps a single producer notify many independent downstream services (e.g., notifications, loyalty, delivery, analytics) when an order is confirmed, without the producer knowing who consumes the message, making systems easier to evolve by adding new subscribers later. You will see how this differs from point-to-point messaging and why pub/sub can cause problems if you ignore key design considerations. The video outlines an AWS implementation where an order acknowledgement Lambda publishes the same event to both SNS and Amazon EventBridge to compare approaches: EventBridge is recommended for cross-domain, multi-team integration via a central event bus, while SNS can fit within a single domain using specific topics. It also recommends placing an SQS queue at each service boundary for durability, resilience, and throughput control, since SNS/EventBridge retry delivery but can eventually drop messages.
Publish/Subscribe in Action
In this video you will learn what code changed to implement a publish/subscribe integration pattern in the module three pub/sub folder, including new downstream services (analytics, delivery, loyalty, notifications). The main update is in the order acknowledgement flow: after acknowledging the order in the database, you now publish a CloudEvent to both an SNS topic and an EventBridge event bus (to demonstrate both approaches), creating spans and setting technical and business attributes for better telemetry and order-level tracing. You’ll also see that downstream Lambda handlers remain similar to the prior point-to-point approach, unwrapping CloudEvents and extracting trace context, while combining pub/sub with an SQS queue between SNS and Lambda for durability, throughput control, and dead-letter handling. Finally, the CDK creates the SNS topic, EventBridge bus, and an SQS subscription using raw message delivery to avoid nested wrapping.
Inbox Pattern with DynamoDB
In this video you will learn how to handle duplicate message delivery from EventBridge and SNS by implementing idempotency with the inbox pattern. You store each inbound event in a DynamoDB “inbox” table using a unique identifier (such as the CloudEvents eventId or an orderId) and a conditional write so duplicates fail and are skipped. Messages are marked pending, processed once, then updated as processed. A DynamoDB TTL automatically expires inbox items (example: seven days), which helps prevent repeated near-term duplicates while still allowing reprocessing later if a bug is fixed. You also see why this can be smarter than relying only on an SQS queue, which would process the same message again if it reappears later.
Anti-Corruption Layers
In this video you will learn why an anti-corruption layer is essential in publish/subscribe systems to keep your system evolvable. When you consume events in a data format you don’t control (like an Order Confirmed event), you should first translate that payload into a structure you do control, then drive your business logic from that internal model. This concentrates external-schema handling in one place, lets you drop irrelevant fields, and enables early validation with clear, support-friendly errors when required fields (like orderId or customerId) are missing. With this approach, if the upstream service changes its schema, you update only the translation layer while your business logic stays the same, avoiding widespread code changes.
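A minimal sketch of that translation layer (field names and error messages are illustrative): the external payload is validated and mapped to an internal model in one place, so business logic only ever sees the internal shape.

```typescript
// Internal model: a structure this service controls.
interface InternalOrderConfirmed {
  orderId: string;
  customerId: string;
}

// Anti-corruption layer: translate the external Order Confirmed payload
// (whose schema we don't own) into the internal model, validating early
// with clear, support-friendly errors.
export function translateOrderConfirmed(
  external: Record<string, unknown>
): InternalOrderConfirmed {
  const orderId = external["orderId"];
  const customerId = external["customerId"];
  if (typeof orderId !== "string" || orderId.length === 0) {
    throw new Error("Order Confirmed event is missing orderId");
  }
  if (typeof customerId !== "string" || customerId.length === 0) {
    throw new Error("Order Confirmed event is missing customerId");
  }
  // Irrelevant external fields are dropped here; if the upstream schema
  // changes, only this function needs updating.
  return { orderId, customerId };
}
```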
Observability for Publish Subscribe
In this video you will learn why publish-subscribe makes cause-and-effect hard to understand and how to use OpenTelemetry to discover downstream consumers when event schemas change. Instead of relying on often-outdated documentation, you’ll use production telemetry by injecting trace context (traceparent/tracestate) into the message body (e.g., CloudEvents), which is necessary for systems like EventBridge that don’t support message attributes and is portable across messaging technologies. On the consumer side, you’ll extract this context and use span links—rather than parent-child traces—to connect independent traces without creating huge, misleading waterfall graphs or artificially long end-to-end traces. You’ll see how observability queries can list which services consume a specific event type so teams can be notified before breaking changes.
A Triad of Publish Subscribe Problems
In this video you will learn about three problem scenarios that can occur in point-to-point and publish/subscribe integrations: sensitive customer delivery addresses leaking into CloudWatch logs and being exposed to too many downstream services, safely evolving event structures over time (e.g., adding a customer tier field to an Order Confirmed event for the loyalty service), and supporting time-based notifications where the kitchen service needs to be alerted about upcoming scheduled orders 30 minutes in advance rather than only when the order is confirmed.
Thick vs Thin and Passing Sensitive Data
In this video you will learn how to structure microservice events using thick (event-carried state transfer) versus thin (notification) events, including the tradeoffs between schema coupling and runtime callbacks. You’ll see why publishing only identifiers (orderId, customerId) via SNS can better support evolvability for services like loyalty, notifications, and analytics that don’t need full payload data, and why each service should own its own domain data (email/phone, loyalty info). You’ll also learn a pattern for sharing PII like delivery addresses by encrypting the address at publish time with an AWS KMS key, sending only the encrypted field through EventBridge, and granting encrypt permissions to the order acknowledgement processor and decrypt permissions only to the delivery processor so other consumers can’t access the address.
Versioning your Events
In this video you will learn how to safely evolve event-driven systems by wrapping messages in the CloudEvents envelope and versioning event types to handle breaking schema changes (e.g., changing an order amount from a number to a currency/value object). You’ll see key CloudEvents fields (id, source, time, datacontenttype, subject, dataschema, replyto, traceparent/tracestate) and how the type field (e.g., order.confirmed.v1) enables consumers to route and translate payloads via an anti-corruption layer, reject unsupported versions, and send them to a dead-letter queue until handlers are added. It explains best practice of keeping only two versions in flight, optionally adding a deprecation date, and using standardized messaging telemetry to identify which services still consume older versions. It also covers why CloudEvents can be useful even when using Amazon EventBridge’s native wrapper.
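Routing on the versioned `type` field can be sketched as below (the handler bodies and return values are illustrative; in practice each handler would translate the payload through an anti-corruption layer):

```typescript
// One handler per in-flight event version; best practice is to keep at
// most two versions registered at a time.
type Handler = (data: unknown) => string;

const handlers: Record<string, Handler> = {
  "order.confirmed.v1": (_data) => "handled v1",
  "order.confirmed.v2": (_data) => "handled v2",
};

export function route(eventType: string, data: unknown): string {
  const handler = handlers[eventType];
  if (!handler) {
    // Unsupported version: the caller would send this message to a
    // dead-letter queue until a handler for the new version is added.
    throw new Error(`Unsupported event type: ${eventType}`);
  }
  return handler(data);
}
```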
Passage of Time Events
In this video you will learn about “passage of time” events—events triggered by time rather than user actions (e.g., an order due in 30 minutes or a customer inactive for 30 days). You’ll compare simple polling (like a daily scheduled Lambda scan) with scheduling a one-time future event using Amazon EventBridge Scheduler. The video shows how, after an order is acknowledged, you can create a scheduled delivery alert (only for delivery orders), wrap it in the CloudEvents specification, inject and later extract trace context, and delete the schedule after it fires. It also explains why span links (not parent-child spans) are important for tracing delayed events without creating extremely long traces, and notes CloudEvents remain useful even when no trace ID exists.
The Outbox Pattern
In this video you will learn how the outbox pattern improves resilience in event-driven, serverless systems when you must update database state and publish an event without risking mismatches (for example, an order acknowledged by a downstream service but not actually stored, or stored but never published). You’ll see how to write both the entity update and the event payload together within a transactional boundary, then use a separate outbox publisher process to publish only confirmed updates. The walkthrough shows an AWS implementation using DynamoDB by storing an _outbox field on the same item (“fat outbox”) and using DynamoDB Streams to trigger a Lambda that filters stream records for inserts/updates containing _outbox, publishes to a reply queue (or EventBridge), and then marks the record processed by removing _outbox, with trace context propagated via OpenTelemetry.
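The stream-side filter for the “fat outbox” can be sketched as below, using a simplified stand-in for DynamoDB Streams records (the real handler would read `dynamodb.NewImage` in DynamoDB's attribute-value format; the `_outbox` field name follows the description):

```typescript
// Simplified stand-in for a DynamoDB Streams record.
type StreamRecord = {
  eventName: "INSERT" | "MODIFY" | "REMOVE";
  newImage: Record<string, unknown>;
};

// Pick out only the events that still need publishing: inserts/updates
// whose item carries an _outbox field.
export function pendingOutboxEvents(records: StreamRecord[]): unknown[] {
  return records
    .filter((r) => r.eventName === "INSERT" || r.eventName === "MODIFY")
    .filter((r) => r.newImage["_outbox"] !== undefined)
    .map((r) => r.newImage["_outbox"]);
}
// After publishing, the publisher removes _outbox from the item, so the
// resulting MODIFY stream record is filtered out and the event is not
// published twice.
```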
Two Layer Idempotency
In this video you will learn that idempotency needs two layers: infrastructure-level and business-logic-level. Using a loyalty points service, you see how an order-confirmed event is translated and then processed so points are awarded only once per order. Business logic idempotency uses rules (for example, a DynamoDB conditional write in an “award points if not already awarded” function) to prevent duplicates even when two different events arrive for the same order, such as when a user double-clicks and generates distinct event IDs. Infrastructure idempotency uses a DynamoDB inbox pattern with TTL (for example, seven days) to store processed event IDs and short-circuit re-deliveries of the same message (such as duplicate SNS/EventBridge deliveries), returning success quickly to avoid unnecessary Lambda execution cost.
Choreography vs Orchestration
In this video you will learn the difference between choreographed integrations (point-to-point and pub/sub services reacting independently to events) and orchestration, where a central workflow coordinates multiple microservices to implement a business process like a pizza order. You’ll see why orchestration improves observability and handles failures better by using the saga pattern with compensating actions (e.g., refund payment and release inventory if the kitchen can’t fulfill the order). The video explains that while AWS Step Functions can orchestrate workflows, you shouldn’t embed business logic in Amazon States Language because it ties core logic to a proprietary format. Instead, you’ll focus on using AWS Lambda Durable Functions to implement sagas in real code, keeping business logic portable and separated from infrastructure, and preparing it for production and observability.
The Saga Pattern in Action
In this video you will learn how to deploy and test a saga-based order workflow and observe it end-to-end. You run a CDK deploy in the Module 6 saga folder, use a Postman collection to call the create order API, and see 202 Accepted responses because the order is only accepted, not yet confirmed. You then inspect OpenTelemetry traces in your observability provider to follow the durable execution flow: create order, reserve inventory, charge payment, notify the kitchen via SQS and downstream services, and publish an order-confirmed event to EventBridge with KMS encryption. By using a special customer ID that forces payment failure, you see error tracing and compensating actions (release inventory, cancel order) and confirm the final order status is failed. You also view durable execution history in the AWS Lambda console to inspect inputs/outputs per step and understand why orchestration helps control critical business logic.
Sagas with Lambda Durable Functions
In this video you will learn how to implement the saga pattern using AWS Lambda Durable Functions by walking through the Module 6 code sample. An API order receiver validates and creates an order, then asynchronously invokes a durable-function saga (passing trace context) via the Lambda SDK, returning a fast 202 response; the CDK setup highlights the requirement to invoke a durable function via a version alias, not $LATEST. Inside the saga orchestrator, the workflow is composed of checkpointed steps with optional retry policies: create the order in DynamoDB, reserve inventory, take payment, notify the kitchen using wait-for-callback with a correlation/callback ID over SQS, then publish an order-confirmed event. You also see how compensating actions are collected after each successful step and executed in reverse on business failure (e.g., declined payment) or exceptions.
Distributed Tracing for Sagas
In this video you will learn why distributed traces can become disconnected in durable function orchestrators: after checkpoints and callbacks, the Lambda is reinvoked as a new invocation, creating a new trace context each time, so initial execution, downstream calls, and compensations appear as separate traces. You’ll see how to fix this by capturing the original trace context/trace parent in a durable step (so it runs once and is stored in state), then reusing it on subsequent reinvocations to create links and start spans with the same context, producing one continuous trace linked back to the original API call. You’ll also explore the tradeoffs between modeling downstream services as parent-child spans versus span links, using what’s most useful for debugging and considering timing (fast workflows vs long delays that would create very long traces).
The Claim Check Pattern
In this video you will learn how the claim check pattern helps you send data through AWS messaging services that have payload size limits by storing the full payload (for example, an uploaded image) in S3 and publishing only an S3 key/location in the message so the consumer can fetch it. You’ll also see how the same approach can support thin events by putting extra context in S3 to reduce direct callbacks to the originating service, improving resilience by shifting load to highly available S3. Finally, it shows how IAM permissions on the S3 object can restrict sensitive data access so multiple services can consume the event while only authorized services can retrieve the additional context.
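The check-in/redeem flow can be sketched with a store port (an in-memory map stands in for S3 here; the interface and function names are illustrative, not the course's exact ports):

```typescript
// Port: where full payloads are parked; S3 in the real implementation.
interface ClaimCheckStore {
  put(key: string, payload: string): void;
  get(key: string): string | undefined;
}

export class InMemoryClaimCheckStore implements ClaimCheckStore {
  private objects = new Map<string, string>();
  put(key: string, payload: string): void { this.objects.set(key, payload); }
  get(key: string): string | undefined { return this.objects.get(key); }
}

// Producer side: park the large payload and publish only a thin message
// carrying the key, keeping the event under messaging size limits.
export function checkIn(
  store: ClaimCheckStore,
  key: string,
  payload: string
): { claimCheckKey: string } {
  store.put(key, payload);
  return { claimCheckKey: key };
}

// Consumer side: redeem the key for the full payload. IAM permissions
// on the real S3 object control which consumers may do this.
export function redeem(
  store: ClaimCheckStore,
  message: { claimCheckKey: string }
): string {
  const payload = store.get(message.claimCheckKey);
  if (payload === undefined) throw new Error("claim check expired or missing");
  return payload;
}
```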
Implementing the Claim Check Pattern
In this video you will learn how to deploy and verify the claim check pattern in the module seven sample by running CDK deploy, sending delivery order requests via the provided Postman collection, and inspecting traces in your observability tool. You’ll see the saga store the event payload in S3, generate a temporary pre-signed URL (default 7,200 seconds), and publish an EventBridge event containing that URL (or optionally the raw S3 key), then watch the delivery processor fetch the payload via the pre-signed URL and continue processing. The code walkthrough covers the saga publish path, the delivery processor’s claim check retrieval via a ports-and-adapters interface with an S3 implementation, and CDK infrastructure for a temporary claim check bucket with lifecycle expiration and IAM grants that control which processors can read sensitive/PII data. The main motivation highlighted is handling large payload limits while keeping events thin and access controlled.
Patterns for User Notifications in Async Systems
In this video you will learn how to notify a user about progress in an event-driven, serverless workflow where the initial order API returns 202 and the rest (inventory, payment, kitchen, events) happens asynchronously. You’ll compare three patterns for sending order status back to the user: polling (return a status URL and the frontend repeatedly GETs until it returns 200), callbacks (a third-party developer supplies a callback URL and your system POSTs updates on status changes), and WebSockets (a persistent bidirectional connection where a Lambda pushes status updates to active client connections). You’ll also learn a heuristic for choosing polling vs WebSockets: use polling for predictable sub-second end-to-end latency, and WebSockets for highly variable durations or chat-like interactions to avoid excessive polling load.
Polling, Webhooks & Websockets in Action
In this video you will learn how to run CDK Deploy for the module-eight Sync/Async app, retrieve the API and WebSocket endpoints from CloudFormation, and test three callback patterns using tools like Postman. You will first create an order and use the returned status URL to poll repeatedly for updates. Next, you will add a callbackUrl (using a Webhook.site URL) to receive an asynchronous webhook with the order status, and you will confirm it via telemetry in the Webhook Dispatcher service with traces linked back to the original saga. Finally, you will connect to the WebSocket endpoint (with /prod and a customerId query param), create orders, and see real-time status messages broadcast via a DynamoDB Streams–driven Status Change Broadcaster that queries active connections and uses the API Gateway Management API to push updates.
Implementing Polling, Webhooks & Websockets
In this video you will learn three ways to get results from an asynchronous, event-driven AWS serverless backend back to a client: polling, callbacks, and WebSockets. Polling is implemented by returning a dynamically built status URL from the order receiver Lambda using an HTTP_API_URL environment variable and the order ID. Callbacks pass a callback URL through downstream services; when the order is confirmed, a webhook is queued to SQS and a dispatcher Lambda reads SQS messages and makes an HTTP request to the callback URL, with the queue protecting the main workflow from third-party slowness. WebSockets use an API Gateway WebSocket API with connect/disconnect Lambdas that store connection records in DynamoDB, and a status-change broadcaster triggered by DynamoDB Streams that detects status changes and posts messages to all customer connections. It also notes that simpler notifications (like email) may be enough, and suggests polling for fast, consistent completion and WebSockets for high variability or frequent updates.
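The polling pattern's producer side described above can be sketched as a small URL builder (the `/orders/{id}/status` path shape is an assumption; the base URL would come from the `HTTP_API_URL` environment variable in the Lambda):

```typescript
// Build the status URL returned from the order receiver, which the
// frontend polls until the order reaches a terminal state.
export function buildStatusUrl(baseUrl: string, orderId: string): string {
  // Normalize a trailing slash so the joined path is always valid.
  return `${baseUrl.replace(/\/$/, "")}/orders/${encodeURIComponent(orderId)}/status`;
}
```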
Course Wrap Up
In this video you will learn what was covered throughout the course on building modern serverless applications: core integration patterns like point-to-point and publish/subscribe, how to go beyond the basics with resilience, understanding, error handling, and production best practices, and additional patterns often used in serverless systems such as sagas, claim check, and combining synchronous and asynchronous flows to return data to the front end. You’re encouraged to avoid common mistakes that add hidden complexity, and you’re invited to reach out with questions via social media or email, or to inquire about limited one-to-one consulting help.