Problem

We had an API endpoint whose JSON response size grew with the amount of data requested. Since it ran on AWS Lambda behind API Gateway, we knew unbounded responses were unsafe. Lambda has a hard response payload limit (6 MB for synchronous invocations), so pagination was already the right direction.

The confusing part showed up while testing that pagination work. Responses we expected to be close to the limit looked much smaller in Postman. Then a slightly larger response from the same endpoint failed completely with 502 Bad Gateway or 413 Payload Too Large.

It felt like the limit was moving. A response we expected to be around 5.9 MB could appear as a tiny transfer in the client, while a response around 6.1 MB failed before we saw a body at all. The mistake was comparing the compressed size Postman showed us with the raw size Lambda was enforcing.

First, the Lambda boundary

Lambda does not care how small the response might become after a proxy compresses it. It has to receive the response from the application first. If the response crossing that boundary is too large, Lambda rejects it.

In our stack, the Flask app returned JSON through Lambda. At this point the body was still raw JSON because we had not enabled app-level compression. Lambda was not shrinking it. Lambda was only enforcing the response payload limit.

That is the first byte count to keep in your head: raw bytes produced by the app and handed back through Lambda. This is the byte count that can trigger 502 or 413 before the client receives anything useful.
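
A minimal sketch of measuring that first byte count, assuming the body is plain UTF-8 JSON. The helper names and the 6 * 1024 * 1024 interpretation of "6 MB" are illustrative assumptions, and the serializer your app really uses (for example Flask's compact JSON output) may produce a slightly different size than `json.dumps` with default settings.

```python
import json

# Assumption: "6 MB" is treated as 6 * 1024 * 1024 bytes. Check the AWS
# documentation for the exact enforced figure and any envelope overhead.
LAMBDA_RESPONSE_LIMIT_BYTES = 6 * 1024 * 1024


def raw_json_size(payload) -> int:
    """Return the size in bytes of the payload serialized as UTF-8 JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def fits_lambda_limit(payload, limit: int = LAMBDA_RESPONSE_LIMIT_BYTES) -> bool:
    """True when the uncompressed JSON body is within the given limit."""
    return raw_json_size(payload) <= limit
```

Logging `raw_json_size(...)` inside the handler gives a number you can compare directly with the limit, independent of anything the client reports.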

Then, compression negotiation

HTTP compression is negotiated with headers. A client sends Accept-Encoding to say what it supports, such as gzip, deflate, or br. The server answers with Content-Encoding when it actually compresses the response.
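
The negotiation can be sketched in a few lines of standard-library Python. This is a simplified illustration, not how API Gateway or any particular framework implements it: `accepts_gzip` and `encode_body` are hypothetical helpers, only gzip is handled, and the `*` wildcard is ignored.

```python
import gzip


def accepts_gzip(accept_encoding) -> bool:
    """Rough check for gzip in an Accept-Encoding header value."""
    if not accept_encoding:
        return False
    for part in accept_encoding.split(","):
        token, _, params = part.strip().partition(";")
        if token.strip().lower() != "gzip":
            continue
        # An explicit "q=0" means the client refuses this encoding.
        params = params.strip().lower()
        if params.startswith("q="):
            try:
                return float(params[2:]) > 0
            except ValueError:
                return False
        return True
    return False


def encode_body(body: bytes, accept_encoding):
    """Return (body, headers); gzip only when the client advertised it."""
    if accepts_gzip(accept_encoding):
        return gzip.compress(body), {"Content-Encoding": "gzip"}
    return body, {}
```

A request without Accept-Encoding gets the body back unchanged and without a Content-Encoding header, which is why the same endpoint can show very different transfer sizes depending on the client.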

Postman sends an Accept-Encoding header that includes gzip by default. Python requests also sends one by default, so SDKs built on it usually ask for compressed responses unless they override that header. That means normal clients may already be asking the gateway for gzip.

API Gateway can honor that request and compress the response, but only after Lambda has returned successfully. That ordering is the whole trap.

Client:      Accept-Encoding: gzip, deflate, br
App:         returns raw JSON unless app-level compression is enabled
Lambda:      enforces raw response payload limit
API Gateway: compresses only after Lambda returns successfully
Client:      receives compressed bytes and decompresses automatically

The gotcha in testing

Once pagination was in progress, this created a misleading test result. A near-limit response could succeed, then API Gateway compressed it, and Postman showed a much smaller transfer size. A slightly larger raw response failed before API Gateway ever got a chance to compress it.

That ordering creates the split behavior we saw in pagination testing.

Raw response below limit

~5.9 MB raw -> Lambda returns successfully -> API Gateway compresses -> Postman shows ~100 KB transferred

Raw response above limit

~6.1 MB raw -> Lambda rejects the response -> API Gateway never compresses -> Client sees 502 or 413

That is what made the bug feel strange. Before hitting the limit, the response suddenly looked dramatically smaller because we were seeing compressed transfer bytes. After adding only a little more data, the request crossed Lambda's raw 6 MB boundary and broke before compression could happen. It felt like the limit was moving, but we were really switching between two different byte counts.

Measure the bytes at the boundary that enforces the limit, not only the bytes your client reports after compression.
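
The two byte counts can be reproduced with a toy model of the path. This is a sketch under stated assumptions: `simulate_path` is a hypothetical function, the limit is passed in so the example can run with small payloads, and gzip from the standard library stands in for whatever the gateway actually does.

```python
import gzip
import json


def simulate_path(raw_body: bytes, lambda_limit: int, client_accepts_gzip: bool = True):
    """Return (status, bytes_seen_by_client) for one response.

    The limit is checked against the raw body first. Compression only
    happens afterwards, mirroring the ordering described above.
    """
    if len(raw_body) > lambda_limit:
        # Rejected at the boundary: compression never gets a chance to run.
        return "rejected", None
    if client_accepts_gzip:
        return "ok", len(gzip.compress(raw_body))
    return "ok", len(raw_body)


if __name__ == "__main__":
    rows = [{"id": i, "status": "ok"} for i in range(1000)]
    body = json.dumps(rows).encode("utf-8")
    limit = len(body)  # scaled-down stand-in for the real limit
    print(simulate_path(body, limit))         # at the limit: succeeds, small transfer
    print(simulate_path(body + b" ", limit))  # one byte over: rejected
```

Repetitive JSON compresses well, so the successful case reports far fewer bytes than the raw body, while a body one byte larger never reaches the compression step.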

Fix

Pagination stayed the durable fix. The endpoint needed bounded pages or cursors so each response stayed safely below the raw Lambda limit. Compression is useful, but it should not become permission to return unbounded JSON.
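
One way to bound each page is to budget by serialized bytes rather than by item count. The sketch below is illustrative only: `build_page`, the integer offset cursor, the `next_cursor` field, and the 64-byte envelope allowance are all assumptions, and a real endpoint would more likely page at the database query with an opaque cursor.

```python
import json

# Assumption: a conservative allowance for the {"items": [...], "next_cursor": ...}
# wrapper, so the estimate never undercounts the final body.
ENVELOPE_OVERHEAD_BYTES = 64


def build_page(items, cursor=0, max_bytes=1_000_000):
    """Return one page of items whose JSON size stays within max_bytes.

    A single item larger than max_bytes is still returned on its own so
    that the cursor always advances.
    """
    page = []
    size = ENVELOPE_OVERHEAD_BYTES
    index = cursor
    while index < len(items):
        # +2 accounts for the ", " separator json.dumps puts between items.
        item_size = len(json.dumps(items[index]).encode("utf-8")) + 2
        if page and size + item_size > max_bytes:
            break
        page.append(items[index])
        size += item_size
        index += 1
    next_cursor = index if index < len(items) else None
    return {"items": page, "next_cursor": next_cursor}
```

The client keeps requesting with the returned `next_cursor` until it is `None`. Budgeting on raw bytes keeps the guarantee independent of whether compression is applied later.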

We also added app-level compression with flask-compress. The important change was not "more compression" in the abstract. It was moving compression earlier, before the response crossed the Lambda boundary.

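from flask import Flask
from flask_compress import Compress

app = Flask(__name__)
# Compress responses inside the app, before they cross the Lambda boundary.
# Flask-Compress only compresses when the request's Accept-Encoding allows it.
Compress(app)

@app.get("/api/results")
def results():
    # build_large_json_response() is the application's own data-loading code.
    # Flask serializes the returned dict to JSON.
    return {"items": build_large_json_response()}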

Gateway compression can still be useful for transfer size and latency on successful responses. It just cannot protect Lambda from the raw response size limit, because it runs too late in the path.

Best default: Paginate or return a cursor so each response is bounded.
Tactical layer: Compress in the app when compressible JSON still needs to be returned directly.
For very large results: Move to async export, object storage, or a streaming-style workflow.

What changed in practice

On a representative JSON response around 2.65 MB raw, compression changed both transfer size and latency. The exact numbers will vary by payload shape and network path, but the direction is the useful part.

Setup                         Request               Latency  Transferred
Gateway compression only      No Accept-Encoding    2.7 s    2.65 MB
Gateway compression only      With Accept-Encoding  2.0 s    330 KB
App plus gateway compression  No Accept-Encoding    2.7 s    2.65 MB
App plus gateway compression  With Accept-Encoding  1.8 s    281 KB

The 2.65 MB to 330 KB case is roughly an 87 percent transfer-size reduction. The important distinction is not just 330 KB versus 281 KB. It is where the compression occurs. App-level compression can shrink compressible JSON before the response crosses the Lambda boundary. Gateway compression cannot rescue a response Lambda already rejected.
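
The reduction figure is simple arithmetic on the measurements above. The snippet assumes 1 MB = 1000 KB, so 2.65 MB is taken as 2650 KB; `reduction_percent` is just an illustrative helper.

```python
def reduction_percent(raw_kb: float, transferred_kb: float) -> float:
    """Percentage by which the transferred size is smaller than the raw size."""
    return (1 - transferred_kb / raw_kb) * 100


RAW_KB = 2650  # 2.65 MB, assuming 1 MB = 1000 KB

gateway_only = reduction_percent(RAW_KB, 330)      # about 87.5
app_plus_gateway = reduction_percent(RAW_KB, 281)  # about 89.4
```

The two setups differ by only about two percentage points in transfer size, which is why the location of the compression step matters more than the ratio.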

Debug these separately

  • Raw response bytes produced by the application.
  • Transferred bytes after Content-Encoding.
  • Whether the request sent Accept-Encoding.
  • Whether API Gateway or the app applied compression.
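
A small helper can put these signals side by side for a single request. This is a sketch, not a complete diagnostic: `summarize_exchange` is a hypothetical function that takes plain header mappings and the decoded body from whatever HTTP client you use. Content-Length may be missing (for example with chunked transfer), and headers alone cannot tell you whether the app or the gateway applied the encoding, so that still has to come from configuration or logs.

```python
def summarize_exchange(request_headers, response_headers, decoded_body: bytes) -> dict:
    """Collect the byte counts and compression headers for one exchange."""
    # Header names are case-insensitive, so normalize before looking them up.
    req = {k.lower(): v for k, v in request_headers.items()}
    resp = {k.lower(): v for k, v in response_headers.items()}
    content_length = resp.get("content-length")
    return {
        "sent_accept_encoding": req.get("accept-encoding"),
        "content_encoding": resp.get("content-encoding", "identity"),
        # Bytes on the wire, when the server reported them.
        "transferred_bytes": int(content_length) if content_length else None,
        # Bytes after the client decoded the body.
        "decoded_bytes": len(decoded_body),
    }
```

When `transferred_bytes` is much smaller than `decoded_bytes`, the client is showing you the compressed size, and the decoded size is the one to compare against the Lambda limit.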

Production lesson

When a serverless endpoint returns large JSON, measure both raw bytes and transferred bytes. Browser tools, Postman, and HTTP clients often show the compressed transfer size, which can make a response look smaller than the payload your runtime actually had to return.

A good checklist is simple: know whether clients send Accept-Encoding, know the raw payload size, know where compression is applied, keep Lambda responses below the raw limit, paginate large results, and use app-level compression when returning compressible data directly. If the response can keep growing, move it out of the synchronous response path instead of tuning around a limit.