<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Forge Engineering</title>
    <link>https://forgeplatform.software/blog/</link>
    <atom:link href="https://forgeplatform.software/blog/feed.xml" rel="self" type="application/rss+xml" />
    <description>Opinionated engineering notes on building production SaaS platforms - from runtime architecture and platform engineering to the realities of running software at scale.</description>
    <language>en</language>
    <lastBuildDate>Fri, 03 Jul 2026 00:00:00 GMT</lastBuildDate>
    <item>
      <title>AMP Remote Write HTTP 400: The CloudWatch Metric That Solved It</title>
      <link>https://forgeplatform.software/blog/when-amp-returns-400-check-discarded-samples/</link>
      <guid isPermaLink="true">https://forgeplatform.software/blog/when-amp-returns-400-check-discarded-samples/</guid>
      <pubDate>Fri, 03 Jul 2026 00:00:00 GMT</pubDate>
      <description>Intermittent Amazon Managed Prometheus remote-write failures looked like encoding or signing bugs. The real signal was CloudWatch DiscardedSamples — and the fix was embarrassingly familiar if you&apos;ve operated Prometheus at scale.</description>
      <content:encoded><![CDATA[<p>AMP said 400. CloudWatch said why.</p>
<p>We turned on Amazon Managed Prometheus (AMP) remote write from our Quarkus services on ECS — Micrometer snapshots, Snappy-compressed Prometheus remote write, SigV4 signing — and everything <em>mostly</em> worked.</p>
<p>Mostly.</p>
<p>Every service logged <code>AMP remote write failed with status 400</code> on a ~60 second cadence. Not constantly. Not only at startup. Intermittently, across the fleet, continually.</p>
<p>HTTP 400 with an empty body is a frustrating place to start.</p>
<p>This post is how we narrowed it down, why the obvious suspects weren&#39;t the culprit, and the CloudWatch metric that finally made the failure mode obvious.</p>
<h2>The setup</h2>
<p>Forge pushes application metrics in-process:</p>
<ul>
<li><strong>Micrometer</strong> with the Prometheus v1 registry (<code>/q/metrics</code> for local scrape, <code>MetricSnapshots</code> for export)</li>
<li><strong>Remote write</strong> to AMP over HTTPS with <code>Content-Encoding: snappy</code> and AWS SigV4 (<code>aps</code> signing name)</li>
<li><strong>One AMP workspace</strong> shared by every ECS service in the environment (auth, actor, document, notification, audit, BFF, …)</li>
<li><strong>GraalVM native</strong> images on Fargate, private subnets, no collector sidecar</li>
</ul>
<p>The architecture is deliberate: fewer moving parts, right-sized tasks, and metrics and traces exported directly from the app.</p>
<p>When remote write misbehaves, the failure is somewhere in a short chain: scrape → encode → compress → sign → POST.</p>
<h2>What we ruled out first</h2>
<h3>Wrong URL</h3>
<p>A previous deploy had pointed at the wrong path suffix. AMP&#39;s real endpoint is:</p>
<pre><code class="language-text">https://aps-workspaces.&lt;region&gt;.amazonaws.com/workspaces/&lt;ws-id&gt;/api/v1/remote_write
</code></pre>
<p>Not <code>.../remote_write</code> alone (that returned 404).</p>
<p>We verified the CloudFormation export, the ECS task environment variable, and a <strong>SigV4-signed POST with an empty Snappy payload</strong> to the correct URL. That returned <strong>200</strong>. So URL and IAM were fine.</p>
<h3>SigV4 and transport</h3>
<p>The same signing code path used in production (<code>AwsSignedHttpTransport</code>) worked from a JVM probe on a developer machine against the live workspace. Empty payload and full payload both succeeded when credentials and URL were correct.</p>
<h3>&quot;A bad metric family&quot;</h3>
<p>We scraped live exposition text from a running native task (more on that pattern below), loaded it into a JVM integration test, encoded it with the same remote-write encoder, and POSTed to AMP.</p>
<p><strong>200. Full payload accepted.</strong></p>
<p>So the metric <em>content</em> at scrape time was not inherently toxic. The encoder and Snappy path on JVM were fine.</p>
<p>That left something about the <strong>native periodic push path</strong> or <strong>fleet-wide ingestion behaviour</strong> — not a single broken histogram hiding in auth-service.</p>
<h2>The pivot: CloudWatch <code>DiscardedSamples</code></h2>
<p>AMP exposes <a href="https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-CW-usage-metrics.html">vended CloudWatch metrics</a> under the <code>AWS/Prometheus</code> namespace. One of them is <strong><code>DiscardedSamples</code></strong>, with a <strong><code>Reason</code> dimension</strong> explaining why samples were dropped.</p>
<p>We listed metrics for our workspace:</p>
<pre><code class="language-bash">aws cloudwatch list-metrics \
  --namespace &quot;AWS/Prometheus&quot; \
  --metric-name &quot;DiscardedSamples&quot; \
  --region us-west-2
</code></pre>
<p>The only reason showing material volume:</p>
<pre><code class="language-text">Reason = new-value-for-timestamp
</code></pre>
<p>Querying that dimension:</p>
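<pre><code class="language-bash"># Example window: the last hour, in UTC. Uses GNU date syntax; adjust for macOS/BSD.
START=&quot;$(date -u -d &#39;1 hour ago&#39; +%Y-%m-%dT%H:%M:%SZ)&quot;
END=&quot;$(date -u +%Y-%m-%dT%H:%M:%SZ)&quot;

aws cloudwatch get-metric-statistics \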
  --namespace &quot;AWS/Prometheus&quot; \
  --metric-name &quot;DiscardedSamples&quot; \
  --dimensions \
    Name=Workspace,Value=ws-xxxxxxxx \
    Name=Reason,Value=new-value-for-timestamp \
  --start-time &quot;$START&quot; --end-time &quot;$END&quot; \
  --period 300 \
  --statistics Sum \
  --region us-west-2
</code></pre>
<p>We saw <strong>hundreds of discards per five-minute bucket</strong>, sustained, fleet-wide.</p>
<p>That is not a signing bug. It is not Snappy corruption.</p>
<p>In Prometheus remote write semantics, <strong><code>new-value-for-timestamp</code></strong> means: AMP already has a sample for this <strong>exact time series</strong> (same metric name and label set) at this <strong>exact timestamp</strong>, but the new sample has a <strong>different value</strong>.</p>
<p>In Prometheus, a series is defined by the metric name and its complete label set. If two writers produce identical labels, the backend treats them as one time series regardless of which process produced them.</p>
<p>Same series. Same millisecond. Different number. Rejected.</p>
<p>Once we saw that, the intermittent 400s and the ~95% push success rate (success counter climbing, failure counter occasionally ticking) both made sense: most pushes partially or fully conflicted with other writers or stale timestamps; some requests failed hard at the HTTP layer.</p>
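<p>The rule behind <code>new-value-for-timestamp</code> can be modelled in a few lines of Java. This is a toy sketch, not AMP&#39;s implementation: <code>DuplicateSampleSketch</code> and its methods are hypothetical names, and it assumes only what is stated above (a series is its metric name plus complete label set, and a different value at an already-stored timestamp is refused).</p>
<pre><code class="language-java">import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class DuplicateSampleSketch {
    // Toy stand-in for a Prometheus-compatible backend, keyed by series and timestamp.
    private final Map&lt;String, Double&gt; samples = new HashMap&lt;&gt;();

    // A series is identified by metric name plus its complete label set (sorted for stability).
    static String seriesKey(String metric, Map&lt;String, String&gt; labels) {
        return metric + new TreeMap&lt;&gt;(labels);
    }

    // Returns false when the series already holds a different value at this timestamp.
    boolean append(String metric, Map&lt;String, String&gt; labels, long timestampMs, double value) {
        String key = seriesKey(metric, labels) + &quot;@&quot; + timestampMs;
        Double existing = samples.putIfAbsent(key, value);
        return existing == null || existing == value;
    }

    public static void main(String[] args) {
        DuplicateSampleSketch store = new DuplicateSampleSketch();
        long t = 1_700_000_000_000L;
        // Two writers, no distinguishing labels: the second value collides with the first.
        System.out.println(store.append(&quot;jvm_threads_live_threads&quot;, Map.of(), t, 21));
        System.out.println(store.append(&quot;jvm_threads_live_threads&quot;, Map.of(), t, 14));
        // A distinguishing label makes it a separate series, so the same value is accepted.
        System.out.println(store.append(&quot;jvm_threads_live_threads&quot;, Map.of(&quot;service&quot;, &quot;actor-service&quot;), t, 14));
    }
}
</code></pre>
<p>Run as written, the three calls print <code>true</code>, <code>false</code>, <code>true</code>: the unlabelled second writer is refused, and the labelled one is not.</p>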
<h2>The actual bug: six services, one series</h2>
<p>Each ECS service exports the same Micrometer binders: <code>jvm_*</code>, <code>process_*</code>, <code>system_*</code>, HTTP server metrics, and our own <code>forge_observability_amp_push_*</code> counters.</p>
<p>We had <strong>no common tags</strong> — no <code>service</code>, no <code>instance</code>, no <code>pod</code>.</p>
<p>So from AMP&#39;s perspective:</p>
<pre><code class="language-text">auth-service   → jvm_threads_live_threads = 21  @ T
actor-service  → jvm_threads_live_threads = 14  @ T
document-service → jvm_threads_live_threads = 19 @ T
</code></pre>
<p>That is <strong>one series</strong> (<code>jvm_threads_live_threads</code> with identical labels) receiving <strong>different values at the same timestamp</strong> from different tasks, every push interval, forever.</p>
<p>Classic multi-writer collision.</p>
<p>It is the same class of problem as running multiple Prometheus replicas remote-writing the same targets without external labels — except we did it with six application services and one workspace.</p>
<h3>Why the JVM probe didn&#39;t catch it</h3>
<p>The probe used a <strong>single</strong> scraped exposition file from <strong>one</strong> task. One writer. AMP happily ingested it.</p>
<p>The failure mode only appears when <strong>multiple tasks</strong> push <strong>overlapping label sets</strong> to <strong>one workspace</strong> on the same schedule.</p>
<h3>Why <code>hasScrapeTimestamp()</code> wasn&#39;t the story</h3>
<p>We also suspected stale per-datapoint scrape timestamps from the registry. On JVM Micrometer scrape, <strong><code>hasScrapeTimestamp()</code> was false for every data point</strong> in our tests. The old encoder fell back to <code>System.currentTimeMillis()</code> per sample anyway.</p>
<p>Fleet collision fit the CloudWatch evidence better than stale scrape timestamps.</p>
<p>We still normalized to <strong>one batch push timestamp per remote-write request</strong> — correct semantics for periodic push export — but the fix that cleared discards was <strong>disambiguating series</strong>.</p>
<h2>The fix</h2>
<h3>1. Common tags (the real fix)</h3>
<p>In forge-kit <code>forge-metrics</code>, a CDI <code>MeterFilter</code>:</p>
<pre><code class="language-java">// MeterFilter.commonTags takes an Iterable&lt;Tag&gt;; Tags.of builds one from key/value pairs.
MeterFilter.commonTags(Tags.of(
    &quot;service&quot;, applicationName,   // quarkus.application.name
    &quot;instance&quot;, hostName          // ECS task hostname; override via config
));
</code></pre>
<p>After deploy, AMP queries looked like:</p>
<pre><code class="language-promql">forge_observability_amp_push_success_total
jvm_threads_live_threads{service=&quot;auth-service&quot;}
count by (service) (jvm_memory_used_bytes)
</code></pre>
<p>Six services, six distinct label sets. Immediately after deploying the common tags, new <code>DiscardedSamples{Reason=&quot;new-value-for-timestamp&quot;}</code> buckets dropped to zero across the workspace while historical buckets retained the earlier collisions. No other code changes were required.</p>
<h3>2. Richer failure logging</h3>
<p><code>AmpMetricsExporter</code> had only logged <code>status 400</code>. We added <strong>Snappy body size</strong>, <strong>snapshot count</strong>, and <strong>AMP response body</strong> on failure so the next incident doesn&#39;t require guesswork.</p>
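<p>As a rough illustration of the kind of log line this produces, here is a minimal sketch. It is not the <code>AmpMetricsExporter</code> code: the class name, method name, message format, and the 512-character truncation limit are all assumptions, and the numbers passed in <code>main</code> are made up.</p>
<pre><code class="language-java">public class PushFailureLogSketch {
    // Assumed cap so an unexpectedly large response body cannot flood the logs.
    private static final int MAX_BODY_CHARS = 512;

    // Builds one log line carrying the context that a bare &quot;status 400&quot; lacks.
    static String describe(int status, int snappyBytes, int snapshotCount, String responseBody) {
        String body = (responseBody == null || responseBody.isBlank())
                ? &quot;&lt;empty&gt;&quot;
                : responseBody.strip();
        if (body.length() &gt; MAX_BODY_CHARS) {
            body = body.substring(0, MAX_BODY_CHARS) + &quot;...&quot;;
        }
        return String.format(
                &quot;AMP remote write failed with status %d (snappyBytes=%d, snapshots=%d, body=%s)&quot;,
                status, snappyBytes, snapshotCount, body);
    }

    public static void main(String[] args) {
        // Illustrative values only.
        System.out.println(describe(400, 18_432, 212, &quot;&quot;));
    }
}
</code></pre>
<p>Marking an empty body explicitly, rather than printing nothing, keeps &quot;the service sent no explanation&quot; distinguishable from &quot;we forgot to log it&quot;.</p>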
<h3>3. Batch push timestamp</h3>
<p>Remote-write encoding now stamps every sample in a <code>WriteRequest</code> with <strong>one wall-clock time</strong> at export. That is standard push-agent behaviour and avoids mixed timestamps within a single push. It did not replace the need for <code>service</code> / <code>instance</code> tags.</p>
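<p>The idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the forge-kit encoder: the <code>Sample</code> record and <code>stampBatch</code> method are hypothetical, and the real code works on Micrometer snapshots rather than this toy type.</p>
<pre><code class="language-java">import java.util.List;

public class BatchTimestampSketch {
    // Hypothetical sample shape for illustration only.
    record Sample(String series, double value, long timestampMs) {}

    // Applies one wall-clock reading to every sample in a single push.
    static List&lt;Sample&gt; stampBatch(List&lt;Sample&gt; samples, long batchTimestampMs) {
        return samples.stream()
                .map(s -&gt; new Sample(s.series(), s.value(), batchTimestampMs))
                .toList();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis(); // read once per push, not once per sample
        List&lt;Sample&gt; batch = stampBatch(List.of(
                new Sample(&quot;jvm_threads_live_threads&quot;, 21, 0L),
                new Sample(&quot;jvm_memory_used_bytes&quot;, 1.5e8, 0L)), now);
        batch.forEach(System.out::println);
    }
}
</code></pre>
<p>Reading the clock once and passing it in also makes the stamping step easy to test with a fixed timestamp.</p>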
<h2>How we verified ingestion (not just &quot;no errors&quot;)</h2>
<p>The absence of HTTP failures is necessary but not sufficient. We confirmed data was <strong>stored and queryable</strong>:</p>
<pre><code class="language-bash">QUERY_ENDPOINT=&quot;https://aps-workspaces.us-west-2.amazonaws.com/workspaces/${WORKSPACE_ID}/api/v1/query&quot;

awscurl --service aps --region us-west-2 \
  -X POST &quot;${QUERY_ENDPOINT}&quot; \
  -H &#39;Content-Type: application/x-www-form-urlencoded&#39; \
  -d &#39;query=forge_observability_amp_push_success_total&#39;
</code></pre>
<p>Success counters for all six workloads, each with <code>service</code> and <code>instance</code> labels. Failure metric absent (never incremented since deploy). CloudWatch discards at zero post-rollout.</p>
<p>That is the bar: <strong>PromQL returns your series with the labels you expect</strong>, and <strong>DiscardedSamples stays quiet</strong>.</p>
<h2>Appendix: debugging when you can&#39;t reach <code>/q/metrics</code></h2>
<p>Our laptops can&#39;t hit task private IPs in a NAT-less VPC. To compare native exposition with the JVM encoder, we used a one-off Fargate task pattern:</p>
<ol>
<li>Publish a tiny <strong>private ECR</strong> image (<code>busybox</code> + <code>wget</code>) — Docker Hub and public ECR aren&#39;t reachable from the VPC without NAT.</li>
<li>Run a task in the cluster that <code>wget</code>s <code>http://&lt;task-private-ip&gt;:8080/q/metrics</code> (same path the target group uses for health checks).</li>
<li>Pull logs into a local file and feed it to a JVM probe (<code>FORGE_AMP_METRICS_FILE=...</code>).</li>
</ol>
<p>We also briefly added a <strong>same-SG ingress on 8080</strong> so the scrape task could reach the service ENI — then <strong>revoked</strong> it after debugging. That rule is not part of normal infra; ALB → task rules stay as designed.</p>
<p>This tooling was valuable for one INT session. We don&#39;t plan to keep the scrape image or scripts in the application repo long term — the durable lessons are the <strong>CloudWatch reason dimension</strong>, the <strong>multi-writer tag discipline</strong>, and the <strong>PromQL verification queries</strong>.</p>
<h2>Takeaways</h2>
<ol>
<li><strong>AMP HTTP 400 with an empty body</strong> sends you down a long hallway. <strong><code>DiscardedSamples</code> by <code>Reason</code></strong> is often the door.</li>
<li><strong><code>new-value-for-timestamp</code></strong> on a shared workspace screams <strong>label collision</strong> — multiple writers, identical series, aligned push intervals.</li>
<li><strong>Micrometer <code>MeterFilter.commonTags</code></strong> with low-cardinality <code>service</code> (and <code>instance</code> when you scale replicas) is not optional when many processes remote-write to one Prometheus-compatible backend.</li>
<li><strong>Single-service probes prove encoding</strong>, not fleet behaviour. Reproduce collisions with multi-writer thinking.</li>
<li><strong>Confirm consumption</strong> with PromQL and CloudWatch ingestion metrics — not only &quot;the error log stopped.&quot;</li>
</ol>
<p>If you&#39;re wiring Micrometer → AMP on ECS, check your tags before you tune Snappy JNI or second-guess your SigV4 implementation.</p>
<p>The platform was fine. We were just telling six services to sign the same name on the same line.</p>
]]></content:encoded>
      <category>devops</category>
      <category>aws</category>
      <category>observability</category>
      <category>prometheus</category>
      <category>micrometer</category>
    </item>
    <item>
      <title>The software nobody plans to build - but every successful team eventually does...</title>
      <link>https://forgeplatform.software/blog/the-software-nobody-plans-to-build/</link>
      <guid isPermaLink="true">https://forgeplatform.software/blog/the-software-nobody-plans-to-build/</guid>
      <pubDate>Fri, 26 Jun 2026 00:00:00 GMT</pubDate>
      <description>Every software company starts with one product. Given enough time, almost every successful team ends up building a second one they never planned for.</description>
      <content:encoded><![CDATA[<p>Every software company starts with one product.</p>
<p>The product customers buy.</p>
<p>The product investors care about.</p>
<p>The product the roadmap revolves around.</p>
<p>But given enough time, almost every successful engineering team finds itself building something else.</p>
<p>Not because they planned to.</p>
<p>Because they have to.</p>
<h2>The second product</h2>
<p>It doesn&#39;t have a marketing website.</p>
<p>Customers never ask for it by name.</p>
<p>Nobody demos it to investors.</p>
<p>Yet it quietly grows alongside the business.</p>
<p>It&#39;s your engineering foundations.</p>
<p>The authentication services.</p>
<p>The deployment pipelines.</p>
<p>The infrastructure.</p>
<p>The observability stack.</p>
<p>The audit logging.</p>
<p>The notification system.</p>
<p>The release processes.</p>
<p>The operational tooling.</p>
<p>The architectural conventions.</p>
<p>Individually, none of these things are your product.</p>
<p>Collectively, they&#39;re what allow your product to scale.</p>
<h2>It happens one decision at a time</h2>
<p>Nobody sets out to spend months building engineering foundations.</p>
<p>Instead, they make perfectly reasonable decisions.</p>
<p>&quot;We&#39;ll automate deployments later.&quot;</p>
<p>&quot;This service just needs its own authentication for now.&quot;</p>
<p>&quot;We&#39;ll improve the monitoring once we&#39;ve got more customers.&quot;</p>
<p>&quot;We&#39;ll standardise this after the next release.&quot;</p>
<p>Every one of those decisions is commercially rational.</p>
<p>Product delivery has to come first.</p>
<p>But each compromise adds another piece to a second codebase that the business never intended to own.</p>
<h2>The hidden backlog</h2>
<p>Earlier in my career, while CTO at a UK insurance startup, I estimated that around <strong>60% of our engineering backlog</strong> wasn&#39;t product work at all.</p>
<p>It was engineering foundations.</p>
<p>Improving CI/CD.</p>
<p>Standardising infrastructure.</p>
<p>Strengthening security.</p>
<p>Adding observability.</p>
<p>Operational tooling.</p>
<p>Developer experience.</p>
<p>Performance and security testing.</p>
<p>None of those stories generated revenue directly.</p>
<p>But every one of them made future product delivery faster, safer and more predictable.</p>
<p>Eventually we realised something important:</p>
<p>We weren&#39;t just building an insurance platform anymore.</p>
<p>We were also building the engineering foundations that made the insurance platform possible.</p>
<h2>The Groundhog Day problem</h2>
<p>Over the last 20+ years I&#39;ve worked across startups, consultancies and enterprise engineering teams.</p>
<p>Different industries.</p>
<p>Different products.</p>
<p>Different company sizes.</p>
<p>Yet I kept rebuilding remarkably similar engineering foundations.</p>
<p>Authentication.</p>
<p>Audit.</p>
<p>Notifications.</p>
<p>Infrastructure.</p>
<p>Deployment pipelines.</p>
<p>Observability.</p>
<p>Release management.</p>
<p>Security controls.</p>
<p>Developer tooling.</p>
<p>Different implementations.</p>
<p>The same problems.</p>
<p>After a while it started to feel like Groundhog Day.</p>
<p>Every organisation was independently solving problems that thousands of engineering teams had already solved before.</p>
<h2>Engineering foundations are inevitable</h2>
<p>This isn&#39;t an argument against building engineering foundations.</p>
<p>They&#39;re essential.</p>
<p>Every successful software company eventually needs them.</p>
<p>The question is simply <strong>when</strong> and <strong>how</strong> you build them.</p>
<p>Do you invest months (or years) constructing them incrementally while trying to deliver product?</p>
<p>Or do you begin with mature engineering foundations already in place and let your team focus on the capabilities that actually differentiate your business?</p>
<p>That&#39;s a very different starting position.</p>
<h2>Final thoughts</h2>
<p>I&#39;ve come to believe that one of the biggest hidden costs in software engineering isn&#39;t technical debt.</p>
<p>It&#39;s repeatedly rebuilding the same engineering foundations.</p>
<p>Not because they&#39;re unique.</p>
<p>But because every organisation assumes it has to start from scratch.</p>
<p>That realisation is ultimately what led me to build Forge Platform.</p>
<p>After spending more than two decades repeatedly building the same operational capabilities across startups and enterprise programmes, I wanted to create something that lets engineering teams begin with mature foundations already in place - so more of their time is spent building the product they&#39;re actually in business to create.</p>
<p>If that resonates, you can learn more at <a href="https://forgeplatform.software/">forgeplatform.software</a>.</p>
]]></content:encoded>
      <category>saas</category>
      <category>aws</category>
      <category>software-engineering</category>
    </item>
    <item>
      <title>What &quot;production-ready&quot; actually means - and why most teams discover it too late.</title>
      <link>https://forgeplatform.software/blog/what-production-ready-actually-means/</link>
      <guid isPermaLink="true">https://forgeplatform.software/blog/what-production-ready-actually-means/</guid>
      <pubDate>Sat, 06 Jun 2026 00:00:00 GMT</pubDate>
      <description>Production readiness isn&apos;t about whether a system runs. It&apos;s about how it behaves under failure, change, and scale - and most teams discover the gap too late.</description>
      <content:encoded><![CDATA[<p>&quot;Production-ready&quot; is one of the most misused phrases in software engineering.</p>
<p>It usually means:</p>
<ul>
<li>it runs</li>
<li>it deploys</li>
<li>it works on the happy path</li>
</ul>
<p>But in real systems, production readiness is not about functionality.</p>
<p>It&#39;s about behaviour under failure, change, and scale.</p>
<h2>The difference between working and production-ready</h2>
<p>A system is not production-ready merely because:</p>
<ul>
<li>it can be deployed</li>
</ul>
<p>It is production-ready when:</p>
<ul>
<li>it can fail safely</li>
<li>it can be observed</li>
<li>it can be redeployed without service interruption</li>
<li>it behaves consistently under load</li>
<li>it can be operated by people who did not build it</li>
</ul>
<p>Most early-stage systems do not meet this bar.</p>
<p>Not because teams are careless - but because these properties are usually added after the system exists.</p>
<h2>The problem with &quot;we&#39;ll add it later&quot;</h2>
<p>In practice, &quot;later&quot; becomes:</p>
<ul>
<li>after customers arrive</li>
<li>after scale pressure begins</li>
<li>after incidents expose gaps</li>
<li>after engineering velocity slows</li>
</ul>
<p>At that point, the system is no longer neutral.</p>
<p>It has opinions:</p>
<ul>
<li>about structure</li>
<li>about deployment</li>
<li>about observability</li>
<li>about service boundaries</li>
</ul>
<p>And those opinions are expensive to change.</p>
<h2>Where teams actually spend their time</h2>
<p>Across multiple environments I&#39;ve worked in - from startups to large AWS-based enterprise systems - a consistent pattern appears:</p>
<p>Engineering effort splits into two categories:</p>
<ul>
<li>domain / product requirements and features</li>
<li>engineering foundation and operational work</li>
</ul>
<p>In many early systems, the work that goes into engineering foundations - such as deployments, versioning, pipelines, and build and test standards and optimisations - becomes a dominant and usually hidden cost.</p>
<p>At one startup, my estimate was that &quot;technical&quot; stories accounted for the majority of backlog creation over time, eclipsing feature development.</p>
<p>This is not an edge case.</p>
<p>This is how systems evolve.</p>
<h2>Why this is so hard to avoid</h2>
<p>Most teams don&#39;t consciously choose to neglect operational maturity.</p>
<p>The problem is that product work is always visible, while engineering foundations are largely invisible.</p>
<p>A new feature can be demonstrated to customers, investors, and stakeholders. It can be tied directly to revenue, growth, or market validation. Improvements to deployment pipelines, observability, security controls, or operational tooling rarely have that luxury. Their value is indirect, preventative, and often only becomes obvious when something goes wrong.</p>
<p>As a result, engineering teams are under constant pressure to prioritise business-driven outcomes over engineering excellence. Every sprint presents another feature request, customer commitment, sales opportunity, or roadmap deadline competing for attention.</p>
<p>Over time, small compromises accumulate:</p>
<ul>
<li>Deployment processes remain partially manual because &quot;we&#39;ll automate it later.&quot;</li>
<li>Monitoring exists, but not at the depth needed to diagnose production issues quickly.</li>
<li>Security controls are good enough for today&#39;s customers, but not tomorrow&#39;s.</li>
<li>Operational knowledge lives in people&#39;s heads rather than in systems and documentation.</li>
</ul>
<p>None of these decisions are unreasonable in isolation. In fact, most are rational responses to commercial pressure.</p>
<p>The challenge is that operational maturity compounds in exactly the same way technical debt does. The cost of postponing it is often hidden until growth, scale, compliance requirements, or a production incident suddenly expose the gap.</p>
<p>By that point, fixing the foundations is competing with an even larger backlog, a larger customer base, and a business that has become increasingly dependent on systems that were never designed for the level of demand being placed on them.</p>
<h2>The real definition of production-ready</h2>
<p>A more accurate definition is:</p>
<blockquote>
<p>A system is production-ready when its operational properties are designed, not discovered.</p>
</blockquote>
<p>That includes:</p>
<ul>
<li>observability as a first-class concern</li>
<li>consistent service structure and bounded contexts</li>
<li>predictable deployment behaviour</li>
<li>explicit failure handling patterns</li>
<li>security and access boundaries defined early</li>
</ul>
<h2>The uncomfortable truth</h2>
<p>Most teams don&#39;t lack capability.</p>
<p>They lack a reusable starting point.</p>
<p>So they rebuild production-readiness repeatedly, instead of inheriting it once.</p>
<h2>The shift that matters</h2>
<p>The real architectural question is not:</p>
<blockquote>
<p>&quot;How do we make this production-ready?&quot;</p>
</blockquote>
<p>It is:</p>
<blockquote>
<p>&quot;Why are we rebuilding production readiness every time?&quot;</p>
</blockquote>
<p>This is the problem space I&#39;ve been focused on with Forge: creating a reusable foundation so teams don&#39;t rediscover production-readiness under pressure.</p>
]]></content:encoded>
      <category>startup</category>
      <category>devops</category>
      <category>microservices</category>
      <category>aws</category>
    </item>
    <item>
      <title>The real startup killer isn&apos;t product - it&apos;s building platform foundations from scratch.</title>
      <link>https://forgeplatform.software/blog/the-real-startup-killer/</link>
      <guid isPermaLink="true">https://forgeplatform.software/blog/the-real-startup-killer/</guid>
      <pubDate>Fri, 08 May 2026 00:00:00 GMT</pubDate>
      <description>Most early-stage teams think they&apos;re building one product. They&apos;re actually building two - and the second one quietly consumes man-years of engineering time.</description>
      <content:encoded><![CDATA[<p>There&#39;s a pattern I&#39;ve seen repeat a few times over 20 years building software systems in both startup and enterprise environments.</p>
<p>Most early-stage teams believe they are building a product.</p>
<p>In reality, they are building two things at once:</p>
<ul>
<li>Their actual product</li>
<li>An internal platform they didn&#39;t intend to build</li>
</ul>
<p>And the second one quietly consumes man-years of engineering time.</p>
<h2>The hidden tax on early-stage teams</h2>
<p>At some point, every startup hits the same phase:</p>
<ul>
<li>The MVP works</li>
<li>The first customers arrive</li>
<li>Engineering velocity starts to slow</li>
<li>&quot;Just one more service&quot; becomes a platform discussion</li>
</ul>
<p>And suddenly the backlog shifts.</p>
<p>Not because the product changed - but because the foundations underneath it needed to evolve (without the wheels falling off!).</p>
<p>Based on greenfield experience, I&#39;d guesstimate that up to 60% of engineering effort can be focused on platform foundations, delivery lifecycle and operational processes, not domain features or core business differentiators.</p>
<p>That ratio is not unusual.</p>
<p>It is normal.</p>
<p>And it is also destructive.</p>
<h2>The repeated mistake</h2>
<p>Across organisations, I&#39;ve seen the same systems rebuilt repeatedly:</p>
<ul>
<li>CI/CD pipelines</li>
<li>Infrastructure-as-code structures</li>
<li>Observability setups</li>
<li>Security and access control patterns</li>
<li>Service templates and API conventions</li>
<li>Deployment and rollback strategies</li>
</ul>
<p>Each time:</p>
<ul>
<li>slightly different</li>
<li>slightly inconsistent</li>
<li>always re-learned under pressure</li>
<li>and generally whilst trying to keep production lights on</li>
</ul>
<p>The irony is that none of these are domain-specific.</p>
<p>They are global engineering concerns.</p>
<p>Yet every team reinvents them.</p>
<h2>Why this keeps happening</h2>
<p>It&#39;s not incompetence.</p>
<p>It&#39;s timing.</p>
<p>Early-stage teams optimise for:</p>
<ul>
<li>speed</li>
<li>product delivery</li>
<li>survival</li>
</ul>
<p>So they defer platform thinking until it becomes unavoidable.</p>
<p>At which point:</p>
<ul>
<li>the system is already live</li>
<li>constraints are already baked in</li>
<li>rewrites are expensive</li>
</ul>
<p>So they rebuild under pressure instead of designing with intention.</p>
<h2>The real cost</h2>
<p>The cost isn&#39;t just engineering time.</p>
<p>It&#39;s:</p>
<ul>
<li>delayed product delivery</li>
<li>inconsistent system behaviour</li>
<li>increased operational risk</li>
<li>premature senior hiring in platform roles</li>
<li>architectural fragmentation across services (and teams)</li>
</ul>
<p>Most importantly:</p>
<blockquote>
<p>It shifts engineering from building product value to managing internal complexity.</p>
</blockquote>
<h2>What changes when you design it once</h2>
<p>When these concerns are treated as a reusable foundation instead of one-off decisions:</p>
<ul>
<li>teams ship faster</li>
<li>systems stay consistent</li>
<li>operational overhead drops</li>
<li>architecture stops diverging across services</li>
<li>engineering focus returns to product domain logic</li>
</ul>
<p>This is the problem I&#39;ve been formalising into an opinionated bootstrap approach with Forge Platform.</p>
<p>It enables production-grade microservices from day one with enterprise foundations at startup speed.</p>
<p>Not because teams cannot build these things.</p>
<p>But because they keep building them repeatedly, under pressure, in slightly different ways, everywhere.</p>
<h2>The question I keep coming back to</h2>
<p>If every team ends up rebuilding the same foundations…</p>
<p>Why are we still rebuilding them every time?</p>
<p>If you have any questions or are interested in finding out more, it would be great to hear from you.</p>
]]></content:encoded>
      <category>startup</category>
      <category>devops</category>
      <category>microservices</category>
    </item>
  </channel>
</rss>
