Forge Engineering

AMP Remote Write HTTP 400: The CloudWatch Metric That Solved It

Fri, 03 Jul 2026 00:00:00 GMT

AMP said 400. CloudWatch said why.

We turned on Amazon Managed Service for Prometheus (AMP) remote write from our Quarkus services on ECS (Micrometer snapshots, Snappy-compressed Prometheus remote write, SigV4 signing), and everything mostly worked.

Mostly.

Every service logged "AMP remote write failed with status 400" on a ~60 second cadence. Not constantly. Not only at startup. Intermittently, across the fleet, continually.

HTTP 400 with an empty body is a frustrating place to start.

This post is how we narrowed it down, why the obvious suspects weren't the culprit, and the CloudWatch metric that finally made the failure mode obvious.

The setup

Forge pushes application metrics in-process:

Micrometer with the Prometheus v1 registry (/q/metrics for local scrape, MetricSnapshots for export)
Remote write to AMP over HTTPS with Content-Encoding: snappy and AWS SigV4 (aps signing name)
One AMP workspace shared by every ECS service in the environment (auth, actor, document, notification, audit, BFF, …)
GraalVM native images on Fargate, private subnets, no collector sidecar

The architecture is deliberate: fewer moving parts, right-sized tasks, and metrics and traces exported directly from the app.

When remote write misbehaves, the failure is somewhere in a short chain: scrape → encode → compress → sign → POST.

What we ruled out first

Wrong URL

A previous deploy had pointed at the wrong path suffix. AMP's real endpoint is:

https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>/api/v1/remote_write

Not .../remote_write alone (that returned 404).

We verified the CloudFormation export, the ECS task environment variable, and a SigV4-signed POST with an empty Snappy payload to the correct URL. That returned 200. So URL and IAM were fine.
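A startup check along the following lines might have caught the wrong path suffix before the first push. It is a minimal sketch, not our production code: the class name AmpEndpointCheck, the method name, and the assumption that workspace IDs start with "ws-" are illustrative.

```java
import java.net.URI;
import java.util.regex.Pattern;

/** Fail-fast sanity check for the configured remote-write URL (illustrative only). */
final class AmpEndpointCheck {

    // Path shape from the endpoint above; the "ws-" workspace ID pattern is an assumption.
    private static final Pattern REMOTE_WRITE_PATH =
            Pattern.compile("^/workspaces/ws-[A-Za-z0-9-]+/api/v1/remote_write$");

    static boolean looksLikeRemoteWriteUrl(String url) {
        URI uri;
        try {
            uri = URI.create(url);
        } catch (IllegalArgumentException e) {
            return false; // not even a syntactically valid URI
        }
        String host = uri.getHost();
        String path = uri.getPath();
        return "https".equals(uri.getScheme())
                && host != null
                && host.startsWith("aps-workspaces.")
                && host.endsWith(".amazonaws.com")
                && path != null
                && REMOTE_WRITE_PATH.matcher(path).matches();
    }

    public static void main(String[] args) {
        // "ws-example" is a made-up workspace ID used only for this demo.
        String good = "https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-example/api/v1/remote_write";
        String bad = "https://aps-workspaces.us-west-2.amazonaws.com/remote_write";
        System.out.println(looksLikeRemoteWriteUrl(good)); // true
        System.out.println(looksLikeRemoteWriteUrl(bad));  // false
    }
}
```

A check like this only validates the shape of the URL. It says nothing about IAM or whether the workspace exists, so the signed probe request is still worth keeping.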

SigV4 and transport

The same signing code path used in production (AwsSignedHttpTransport) worked from a JVM probe on a developer machine against the live workspace. Empty payload and full payload both succeeded when credentials and URL were correct.

"A bad metric family"

We scraped live exposition text from a running native task (more on that pattern below), loaded it into a JVM integration test, encoded it with the same remote-write encoder, and POSTed to AMP.

200. Full payload accepted.

So the metric content at scrape time was not inherently toxic. The encoder and Snappy path on JVM were fine.

That left something about the native periodic push path or fleet-wide ingestion behaviour — not a single broken histogram hiding in auth-service.

The pivot: CloudWatch `DiscardedSamples`

AMP exposes vended CloudWatch metrics under the AWS/Prometheus namespace. One of them is DiscardedSamples, with a Reason dimension explaining why samples were dropped.

We listed metrics for our workspace:

aws cloudwatch list-metrics \
  --namespace "AWS/Prometheus" \
  --metric-name "DiscardedSamples" \
  --region us-west-2

The only reason showing material volume:

Reason = new-value-for-timestamp

Querying that dimension:

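# Example window: the last hour (GNU date syntax; adjust on BSD/macOS)
START="$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)"
END="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

aws cloudwatch get-metric-statistics \
  --namespace "AWS/Prometheus" \
  --metric-name "DiscardedSamples" \
  --dimensions \
    Name=Workspace,Value=ws-xxxxxxxx \
    Name=Reason,Value=new-value-for-timestamp \
  --start-time "$START" --end-time "$END" \
  --period 300 \
  --statistics Sum \
  --region us-west-2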

We saw hundreds of discards per five-minute bucket, sustained, fleet-wide.

That is not a signing bug. It is not Snappy corruption.

In Prometheus remote write semantics, new-value-for-timestamp means: AMP already has a sample for this exact time series (same metric name and label set) at this exact timestamp, but the new sample has a different value.

In Prometheus, a series is defined by the metric name and its complete label set. If two writers produce identical labels, the backend treats them as one time series regardless of which process produced them.

Same series. Same millisecond. Different number. Rejected.
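The rule can be modelled in a few lines. This is a toy sketch of the check, not AMP's implementation: ToyStore, SeriesKey, and Outcome are hypothetical names, and the handling of an identical re-send (accepted as a no-op) is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the duplicate-sample check. Not AMP's actual implementation. */
final class DuplicateSampleDemo {

    enum Outcome { ACCEPTED, DUPLICATE_IGNORED, NEW_VALUE_FOR_TIMESTAMP }

    /** A series is identified by metric name plus its complete label set. */
    record SeriesKey(String name, Map<String, String> labels) {
        SeriesKey {
            labels = Map.copyOf(labels); // defensive copy; Map equality ignores insertion order
        }
    }

    /** Hypothetical in-memory store: series -> (timestamp -> value). */
    static final class ToyStore {
        private final Map<SeriesKey, Map<Long, Double>> samples = new HashMap<>();

        Outcome write(SeriesKey series, long timestampMillis, double value) {
            Map<Long, Double> points = samples.computeIfAbsent(series, k -> new HashMap<>());
            Double existing = points.putIfAbsent(timestampMillis, value);
            if (existing == null) {
                return Outcome.ACCEPTED;
            }
            // Assumption: an identical re-send is harmless; only a different value is rejected.
            return existing.doubleValue() == value
                    ? Outcome.DUPLICATE_IGNORED
                    : Outcome.NEW_VALUE_FOR_TIMESTAMP;
        }
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore();
        SeriesKey threads = new SeriesKey("jvm_threads_live_threads", Map.of());
        long t = 1_700_000_000_000L;
        System.out.println(store.write(threads, t, 21)); // ACCEPTED
        System.out.println(store.write(threads, t, 14)); // NEW_VALUE_FOR_TIMESTAMP
    }
}
```

The second write is rejected purely because the key (name, labels, timestamp) already holds a different value. Nothing about the payload itself is malformed.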

Once we saw that, the intermittent 400s and the ~95% push success rate (success counter climbing, failure counter occasionally ticking) both made sense: most pushes partially or fully conflicted with other writers or stale timestamps; some requests failed hard at the HTTP layer.

The actual bug: six services, one series

Each ECS service exports the same Micrometer binders: jvm_*, process_*, system_*, HTTP server metrics, and our own forge_observability_amp_push_* counters.

We had no common tags — no service, no instance, no pod.

So from AMP's perspective:

auth-service   → jvm_threads_live_threads = 21  @ T
actor-service  → jvm_threads_live_threads = 14  @ T
document-service → jvm_threads_live_threads = 19 @ T

That is one series (jvm_threads_live_threads with identical labels) receiving different values at the same timestamp from different tasks, every push interval, forever.

Classic multi-writer collision.

It is the same class of problem as running multiple Prometheus replicas remote-writing the same targets without external labels — except we did it with six application services and one workspace.

Why the JVM probe didn't catch it

The probe used a single scraped exposition file from one task. One writer. AMP happily ingested it.

The failure mode only appears when multiple tasks push overlapping label sets to one workspace on the same schedule.

Why `hasScrapeTimestamp()` wasn't the story

We also suspected stale per-datapoint scrape timestamps from the registry. On JVM Micrometer scrape, hasScrapeTimestamp() was false for every data point in our tests. The old encoder fell back to System.currentTimeMillis() per sample anyway.

Fleet collision fit the CloudWatch evidence better than stale scrape timestamps.

We still normalized to one batch push timestamp per remote-write request — correct semantics for periodic push export — but the fix that cleared discards was disambiguating series.

The fix

1. Common tags (the real fix)

In forge-kit forge-metrics, a CDI MeterFilter:

MeterFilter.commonTags(List.of(           // commonTags takes an Iterable<Tag>
    Tag.of("service", applicationName),   // quarkus.application.name
    Tag.of("instance", hostName)          // ECS task hostname; override via config
));

After deploy, AMP queries looked like:

forge_observability_amp_push_success_total
jvm_threads_live_threads{service="auth-service"}
count by (service) (jvm_memory_used_bytes)

Six services, six distinct label sets. Immediately after we deployed the common tags, new DiscardedSamples{Reason="new-value-for-timestamp"} buckets dropped to zero across the workspace, while historical buckets retained the earlier collisions. No other code change was needed to clear the discards.
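To see why two tags are enough, it helps to count distinct series identities before and after. The sketch below uses only the Java standard library rather than Micrometer; seriesId, withCommonTags, and the "task-a" style instance names are hypothetical, chosen for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Shows how common tags turn colliding writers into distinct series (illustrative only). */
final class CommonTagsDemo {

    /** Canonical series identity: metric name plus labels sorted by key. */
    static String seriesId(String metric, Map<String, String> labels) {
        return metric + new TreeMap<String, String>(labels).toString();
    }

    /** Returns the meter's labels with service and instance added, as common tags would. */
    static Map<String, String> withCommonTags(Map<String, String> labels, String service, String instance) {
        Map<String, String> tagged = new TreeMap<>(labels);
        tagged.put("service", service);
        tagged.put("instance", instance);
        return tagged;
    }

    /** Counts the distinct series identities in a list. */
    static int distinct(List<String> seriesIds) {
        return new HashSet<>(seriesIds).size();
    }

    public static void main(String[] args) {
        String metric = "jvm_threads_live_threads";
        Map<String, String> none = Map.of();

        // Three writers, no common tags: every writer produces the same identity.
        List<String> before = List.of(
                seriesId(metric, none), seriesId(metric, none), seriesId(metric, none));

        // Same writers with service and instance tags applied.
        List<String> after = List.of(
                seriesId(metric, withCommonTags(none, "auth-service", "task-a")),
                seriesId(metric, withCommonTags(none, "actor-service", "task-b")),
                seriesId(metric, withCommonTags(none, "document-service", "task-c")));

        System.out.println("before: " + distinct(before) + " series"); // before: 1 series
        System.out.println("after: " + distinct(after) + " series");   // after: 3 series
    }
}
```

The instance tag matters as soon as a service runs more than one task: two replicas of auth-service with only a service tag would collide with each other in exactly the same way.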

2. Richer failure logging

AmpMetricsExporter had only logged status 400. We added Snappy body size, snapshot count, and AMP response body on failure so the next incident doesn't require guesswork.
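A failure message carrying those fields could be assembled roughly like this. It is a sketch under assumptions, not the actual AmpMetricsExporter code: the class name, field names, message wording, and the 512-character cap on the response body are all made up for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Builds a diagnostic line for a failed remote-write push (illustrative only). */
final class PushFailureLog {

    // Assumption: cap the echoed response body so log lines stay bounded.
    private static final int MAX_BODY_CHARS = 512;

    static String describe(int status, int snappyBytes, int snapshotCount, byte[] responseBody) {
        String body = responseBody == null
                ? ""
                : new String(responseBody, StandardCharsets.UTF_8).trim();
        if (body.isEmpty()) {
            body = "<empty>"; // make an empty body explicit instead of logging nothing
        } else if (body.length() > MAX_BODY_CHARS) {
            body = body.substring(0, MAX_BODY_CHARS) + "...";
        }
        return "AMP remote write failed: status=" + status
                + " snappyBytes=" + snappyBytes
                + " snapshots=" + snapshotCount
                + " body=" + body;
    }

    public static void main(String[] args) {
        // Example values only; they are not taken from a real incident.
        System.out.println(describe(400, 1834, 212, null));
    }
}
```

Logging "<empty>" explicitly is a small design choice: it distinguishes "the server sent no body" from "we forgot to log the body".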

3. Batch push timestamp

Remote-write encoding now stamps every sample in a WriteRequest with one wall-clock time at export. That is standard push-agent behaviour and avoids mixed timestamps within a single push. It did not replace the need for service / instance tags.
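The idea is simply to read the clock once per push and reuse that value for every sample. A minimal sketch, assuming hypothetical Reading and Sample types that stand in for the real Micrometer snapshots and remote-write protobuf messages:

```java
import java.util.ArrayList;
import java.util.List;

/** Stamps every sample in one push with a single timestamp (illustrative only). */
final class BatchTimestampDemo {

    record Reading(String series, double value) {}

    record Sample(String series, double value, long timestampMillis) {}

    /** Applies one batch timestamp to all readings instead of reading the clock per sample. */
    static List<Sample> stampBatch(List<Reading> readings, long batchTimestampMillis) {
        List<Sample> samples = new ArrayList<>(readings.size());
        for (Reading r : readings) {
            samples.add(new Sample(r.series(), r.value(), batchTimestampMillis));
        }
        return samples;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis(); // read the clock once per push
        List<Reading> readings = List.of(
                new Reading("jvm_threads_live_threads", 21),
                new Reading("process_uptime_seconds", 360.5));
        stampBatch(readings, now).forEach(System.out::println);
    }
}
```

Passing the timestamp in as a parameter, rather than calling the clock inside the method, also makes the encoder easy to test deterministically.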

How we verified ingestion (not just "no errors")

An absence of HTTP failures is necessary but not sufficient. We confirmed data was stored and queryable:

QUERY_ENDPOINT="https://aps-workspaces.us-west-2.amazonaws.com/workspaces/${WORKSPACE_ID}/api/v1/query"

awscurl --service aps --region us-west-2 \
  -X POST "${QUERY_ENDPOINT}" \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'query=forge_observability_amp_push_success_total'

Success counters for all six workloads, each with service and instance labels. Failure metric absent (never incremented since deploy). CloudWatch discards at zero post-rollout.

That is the bar: PromQL returns your series with the labels you expect, and DiscardedSamples stays quiet.

Appendix: debugging when you can't reach `/q/metrics`

Our laptops can't hit task private IPs in a NAT-less VPC. To compare native exposition with the JVM encoder, we used a one-off Fargate task pattern:

Publish a tiny private ECR image (busybox + wget) — Docker Hub and public ECR aren't reachable from the VPC without NAT.
Run a task in the cluster that wgets http://<task-private-ip>:8080/q/metrics (same path the target group uses for health checks).
Pull logs into a local file and feed it to a JVM probe (FORGE_AMP_METRICS_FILE=...).

We also briefly added a same-SG ingress on 8080 so the scrape task could reach the service ENI — then revoked it after debugging. That rule is not part of normal infra; ALB → task rules stay as designed.

This tooling was valuable for one INT session. We don't plan to keep the scrape image or scripts in the application repo long term — the durable lessons are the CloudWatch reason dimension, the multi-writer tag discipline, and the PromQL verification queries.

Takeaways

AMP HTTP 400 with an empty body sends you down a long hallway. DiscardedSamples by Reason is often the door.
new-value-for-timestamp on a shared workspace screams label collision — multiple writers, identical series, aligned push intervals.
Micrometer MeterFilter.commonTags with low-cardinality service (and instance when you scale replicas) is not optional when many processes remote-write to one Prometheus-compatible backend.
Single-service probes prove encoding, not fleet behaviour. Reproduce collisions with multi-writer thinking.
Confirm consumption with PromQL and CloudWatch ingestion metrics — not only "the error log stopped."

If you're wiring Micrometer → AMP on ECS, check your tags before you tune Snappy JNI or second-guess your SigV4 implementation.

The platform was fine. We were just telling six services to sign the same name on the same line.

The software nobody plans to build - but every successful team eventually does...

Fri, 26 Jun 2026 00:00:00 GMT

Every software company starts with one product.

The product customers buy.

The product investors care about.

The product the roadmap revolves around.

But given enough time, almost every successful engineering team finds itself building something else.

Not because they planned to.

Because they have to.

The second product

It doesn't have a marketing website.

Customers never ask for it by name.

Nobody demos it to investors.

Yet it quietly grows alongside the business.

It's your engineering foundations.

The authentication services.

The deployment pipelines.

The infrastructure.

The observability stack.

The audit logging.

The notification system.

The release processes.

The operational tooling.

The architectural conventions.

Individually, none of these things are your product.

Collectively, they're what allow your product to scale.

It happens one decision at a time

Nobody sets out to spend months building engineering foundations.

Instead, they make perfectly reasonable decisions.

"We'll automate deployments later."

"This service just needs its own authentication for now."

"We'll improve the monitoring once we've got more customers."

"We'll standardise this after the next release."

Every one of those decisions is commercially rational.

Product delivery has to come first.

But each compromise adds another piece to a second codebase that the business never intended to own.

The hidden backlog

Earlier in my career, while CTO at a UK insurance startup, I estimated that around 60% of our engineering backlog wasn't product work at all.

It was engineering foundations.

Improving CI/CD.

Standardising infrastructure.

Strengthening security.

Adding observability.

Operational tooling.

Developer experience.

Performance and security testing.

None of those stories generated revenue directly.

But every one of them made future product delivery faster, safer and more predictable.

Eventually we realised something important:

We weren't just building an insurance platform anymore.

We were also building the engineering foundations that made the insurance platform possible.

The Groundhog Day problem

Over the last 20+ years I've worked across startups, consultancies and enterprise engineering teams.

Different industries.

Different products.

Different company sizes.

Yet I kept rebuilding remarkably similar engineering foundations.

Authentication.

Audit.

Notifications.

Infrastructure.

Deployment pipelines.

Observability.

Release management.

Security controls.

Developer tooling.

Different implementations.

The same problems.

After a while it started to feel like Groundhog Day.

Every organisation was independently solving problems that thousands of engineering teams had already solved before.

Engineering foundations are inevitable

This isn't an argument against building engineering foundations.

They're essential.

Every successful software company eventually needs them.

The question is simply when and how you build them.

Do you invest months (or years) constructing them incrementally while trying to deliver product?

Or do you begin with mature engineering foundations already in place and let your team focus on the capabilities that actually differentiate your business?

That's a very different starting position.

Final thoughts

I've come to believe that one of the biggest hidden costs in software engineering isn't technical debt.

It's repeatedly rebuilding the same engineering foundations.

Not because they're unique.

But because every organisation assumes it has to start from scratch.

That realisation is ultimately what led me to build Forge Platform.

After spending more than two decades repeatedly building the same operational capabilities across startups and enterprise programmes, I wanted to create something that lets engineering teams begin with mature foundations already in place - so more of their time is spent building the product they're actually in business to create.

If that resonates, you can learn more at forgeplatform.software.

What "production-ready" actually means - and why most teams discover it too late.

Sat, 06 Jun 2026 00:00:00 GMT

"Production-ready" is one of the most misused phrases in software engineering.

It usually means:

it runs
it deploys
it works on the happy path

But in real systems, production readiness is not about functionality.

It's about behaviour under failure, change, and scale.

The difference between working and production-ready

A system is not production-ready just because:

it can be deployed

It is production-ready when:

it can fail safely
it can be observed
it can be redeployed without service interruption
it behaves consistently under load
it can be operated by people who did not build it

Most early-stage systems do not meet this bar.

Not because teams are careless - but because these properties are usually added after the system exists.

The problem with "we'll add it later"

In practice, "later" becomes:

after customers arrive
after scale pressure begins
after incidents expose gaps
after engineering velocity slows

At that point, the system is no longer neutral.

It has opinions:

about structure

about deployment

about observability

about service boundaries

And those opinions are expensive to change.

Where teams actually spend their time

Across multiple environments I've worked in - from startups to large AWS-based enterprise systems - a consistent pattern appears:

Engineering effort splits into two categories:

domain / product requirements and features
engineering foundation and operational work

In many early systems, the work that goes into engineering foundations (deployments, versioning, build and test standards, pipelines, and their ongoing optimisation) becomes a dominant and usually hidden cost.

At one startup, my estimate was that "technical" stories accounted for the majority of backlog creation over time, eclipsing feature development.

This is not an edge case.

This is how systems evolve.

Why this is so hard to avoid

Most teams don't consciously choose to neglect operational maturity.

The problem is that product work is always visible, while engineering foundations are largely invisible.

A new feature can be demonstrated to customers, investors, and stakeholders. It can be tied directly to revenue, growth, or market validation. Improvements to deployment pipelines, observability, security controls, or operational tooling rarely have that luxury. Their value is indirect, preventative, and often only becomes obvious when something goes wrong.

As a result, engineering teams are under constant pressure to prioritise business-driven outcomes over engineering excellence. Every sprint presents another feature request, customer commitment, sales opportunity, or roadmap deadline competing for attention.

Over time, small compromises accumulate:

Deployment processes remain partially manual because "we'll automate it later."
Monitoring exists, but not at the depth needed to diagnose production issues quickly.
Security controls are good enough for today's customers, but not tomorrow's.
Operational knowledge lives in people's heads rather than in systems and documentation.

None of these decisions are unreasonable in isolation. In fact, most are rational responses to commercial pressure.

The challenge is that operational maturity compounds in exactly the same way technical debt does. The cost of postponing it is often hidden until growth, scale, compliance requirements, or a production incident suddenly expose the gap.

By that point, fixing the foundations is competing with an even larger backlog, a larger customer base, and a business that has become increasingly dependent on systems that were never designed for the level of demand being placed on them.

The real definition of production-ready

A more accurate definition is:

A system is production-ready when its operational properties are designed, not discovered.

That includes:

observability as a first-class concern
consistent service structure and bounded contexts
predictable deployment behaviour
explicit failure handling patterns
security and access boundaries defined early

The uncomfortable truth

Most teams don't lack capability.

They lack a reusable starting point.

So they rebuild production-readiness repeatedly, instead of inheriting it once.

The shift that matters

The real architectural question is not:

"How do we make this production-ready?"

It is:

"Why are we rebuilding production readiness every time?"

This is the problem space I've been focused on with Forge: creating a reusable foundation so teams don't rediscover production-readiness under pressure.

The real startup killer isn't product - it's building platform foundations from scratch.

Fri, 08 May 2026 00:00:00 GMT

There's a pattern I've seen repeat a few times over 20 years building software systems in both startup and enterprise environments.

Most early-stage teams believe they are building a product.

In reality, they are building two things at once:

Their actual product
An internal platform they didn't intend to build

And the second one quietly consumes man-years of engineering time.

The hidden tax on early-stage teams

At some point, every startup hits the same phase:

The MVP works
The first customers arrive
Engineering velocity starts to slow
"Just one more service" becomes a platform discussion

And suddenly the backlog shifts.

Not because the product changed - but because the foundations underneath it needed to evolve (without the wheels falling off!).

Based on greenfield experience, I'd guesstimate that up to 60% of engineering effort can be focused on platform foundations, delivery lifecycle and operational processes, not domain features or core business differentiators.

That ratio is not unusual.

It is normal.

And it is also destructive.

The repeated mistake

Across organisations, I've seen the same systems rebuilt repeatedly:

CI/CD pipelines
Infrastructure-as-code structures
Observability setups
Security and access control patterns
Service templates and API conventions
Deployment and rollback strategies

Each time:

slightly different
slightly inconsistent
always re-learned under pressure
and generally whilst trying to keep production lights on

The irony is that none of these are domain-specific.

They are global engineering concerns.

Yet every team reinvents them.

Why this keeps happening

It's not incompetence.

It's timing.

Early-stage teams optimise for:

speed
product delivery
survival

So they defer platform thinking until it becomes unavoidable.

At which point:

the system is already live
constraints are already baked in
rewrites are expensive

So they rebuild under pressure instead of designing with intention.

The real cost

The cost isn't just engineering time.

It's:

delayed product delivery
inconsistent system behaviour
increased operational risk
premature senior hiring in platform roles
architectural fragmentation across services (and teams)

Most importantly:

It shifts engineering from building product value to managing internal complexity.

What changes when you design it once

When these concerns are treated as a reusable foundation instead of one-off decisions:

teams ship faster
systems stay consistent
operational overhead drops
architecture stops diverging across services
engineering focus returns to product domain logic

This is the problem I've been formalising into an opinionated bootstrap approach with Forge Platform.

It enables production-grade microservices from day one with enterprise foundations at startup speed.

Not because teams cannot build these things.

But because they keep building them repeatedly, under pressure, in slightly different ways, everywhere.

The question I keep coming back to

If every team ends up rebuilding the same foundations…

Why are we still rebuilding them every time?

If you have any questions or are interested in finding out more, it would be great to hear from you.