You join a team or engagement and inherit the codebase. The task is simple - make things right, fix the pipelines, the usual. Except much of this infrastructure is already in production. The documentation is non-existent, multiple consultants and teams have taken turns on it, and what you’re looking at is a monster of GitLab repos provisioning AWS infrastructure via CDK. You start questioning every life choice that led you here.

I don’t think this is a unique situation. Let’s talk through some of these not-so-greenfield situations. What I list below isn’t the result of some carefully evolved design - it’s survival and quick fixes to keep the thing running so the real cleanup can go on the ever-growing list. You will get to that list, I am sure of it. Just not yet. These are from recent engagements. I owe Claude Sonnet for naming the patterns because I genuinely lack that bone in me.

The context

Before diving into the patterns, let’s review the setup we are working with:

  • AWS CDK with Python for infrastructure modeling
  • GitLab CI/CD
  • Multi-account deployment

The “make things right” part of the engagement usually includes:

  • Multi-region expansion - adding us-east-1, us-east-2 to existing us-west-2 deployments
  • Zero downtime. You cannot break existing production infrastructure, especially when the customer team has few people who can manage a disruption.

These repos were built by different consulting teams at different times with different constraints. My job is/was to make them work together better. The zero downtime part is non-negotiable.

Pattern 1: The Shared Stack Savior

You have environment-specific CDK deployments that work fine in us-west-2, and now you need to add regions without touching the existing production deployments.

Whoever had the bright idea to copy the infrastructure across different folders as an easy means of managing region-level change deserves a long look in the mirror. With the current set of AI tools and coding harnesses, it would have been fairly easy to set this right. I do want to give them the benefit of the doubt, though - many teams and customers do not know the extent of the infrastructure configurations they need to manage, or where they will need to deploy.

# my-infrastructure/
#   ├── prod/us-west-2/infrastructure/  (production - don't touch!)
#   ├── dev/us-west-2/infrastructure/
#   └── qa/us-west-2/infrastructure/

# Each has its own app.py with hardcoded paths
# prod/us-west-2/infrastructure/app.py
from aws_cdk import App, Environment
from infrastructure.platform_stack import PlatformStack

app = App()
PlatformStack(app, 'PlatformStack-prod-usw2',
              env=Environment(region='us-west-2'))
app.synth()

What I ended up doing was creating a shared infra/ directory with parameterized stacks, leaving every existing deployment alone:

# my-infrastructure/
#   ├── prod/us-west-2/  (unchanged)
#   ├── dev/us-west-2/   (unchanged)
#   ├── qa/us-west-2/    (unchanged)
#   └── infra/           (new, shared)
#       └── infrastructure/
#           ├── app.py   (parameterized)
#           ├── config/
#           │   ├── dev/us-east-1.yaml
#           │   ├── dev/us-east-2.yaml
#           │   └── qa/us-east-1.yaml
#           └── infrastructure/platform_stack.py

# infra/infrastructure/app.py
import os, yaml
from aws_cdk import App, Environment
from infrastructure.platform_stack import PlatformStack

app = App()

# env and region come from context (cdk --context) or environment variables,
# not from positional argv. Keeps the CLI honest and the CI pipeline clean.
env    = app.node.try_get_context("env")    or os.environ["ENV"]
region = app.node.try_get_context("region") or os.environ["AWS_REGION"]

with open(f"config/{env}/{region}.yaml") as f:
    config = yaml.safe_load(f)

PlatformStack(app, f'PlatformStack-{env}-{config["region_code"]}',
              config=config,
              env=Environment(region=region))
app.synth()
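
The config files themselves stay small. The only key app.py reads directly is region_code; everything else is whatever PlatformStack consumes, so the second key below is illustrative:

# config/dev/us-east-1.yaml
region_code: use1          # short code used in stack names, matching the usw2 convention
vpc_cidr: "10.20.0.0/16"   # illustrative - real keys depend on what the stack reads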

Existing deploys keep running the way they always did: cd prod/us-west-2/infrastructure && cdk deploy. New regions go through cd infra/infrastructure && cdk deploy --context env=dev --context region=us-east-1 (or the CI job sets ENV and AWS_REGION and runs cdk deploy). Nothing in the old tree has to move for this to work, and that’s the entire point. The old stacks are someone’s production. You don’t touch them to go faster somewhere else.
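
The CI-variable flavor of that is a job that just sets the two variables and deploys. The job name and the --require-approval flag are my additions, and role assumption is omitted here (Patterns 2 and 3 deal with that):

deploy-dev-us-east-1:
  stage: deploy
  variables:
    ENV: dev
    AWS_REGION: us-east-1
  script:
    - cd infra/infrastructure
    - cdk deploy --require-approval never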

One note before you write this off as duplication - you could technically run a one-time cdk import pass and pull the existing resources into the new shared stack so everything ends up in one place. I chose not to. The existing infra was baseline VPC components that nobody wanted to re-plan against, and my time on the engagement was about two weeks in and out. Enough to straighten things out a little and move on, not enough to re-home production state. If you have the runway and the appetite for a cdk diff review on live VPCs, cdk import is the cleaner long-term answer.
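
For the record, the mechanics of that route look roughly like this - a sketch, assuming you first add a config/prod/us-west-2.yaml for the shared app to read:

cd infra/infrastructure
# the shared stack has to model the live resources before import will accept them
cdk synth --context env=prod --context region=us-west-2
# interactively map each construct to its existing physical resource
cdk import --context env=prod --context region=us-west-2 PlatformStack-prod-usw2
# a clean diff afterwards is the signal that state was re-homed without drift
cdk diff --context env=prod --context region=us-west-2 PlatformStack-prod-usw2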

Pattern 2: The Monolithic Pipeline Explosion

One .gitlab-ci.yml file, 500+ lines, with repeated job definitions for every environment, account, and region. Every new deployment target means another 50 lines of copy-paste YAML.

This one always ends up being a people problem as much as a YAML problem. The single-file layout persists because it is easy to read top to bottom and easy for folks who haven’t touched GitLab CI internals to modify. Any cleanup I propose must not feel like I’m making it harder for them to own the pipeline after I leave.

# .gitlab-ci.yml - the monolith
deploy-dev-product-us-west-2:
  stage: deploy
  script:
    - aws sts assume-role --role-arn arn:aws:iam::123456789012:role/gitlab-role
    - export AWS_ACCOUNT_ID=123456789012
    - export AWS_DEFAULT_REGION=us-west-2
    - export ENVIRONMENT=dev
    - cdk deploy --context env=dev --context product=product
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'

deploy-dev-revenue-us-west-2:
  stage: deploy
  script:
    - aws sts assume-role --role-arn arn:aws:iam::234567890123:role/gitlab-role
    - export AWS_ACCOUNT_ID=234567890123
    # ... identical to above, different account and product

Pull the repetition into a template and split the job definitions into a separate include:

# .gitlab/.gitlab-ci-base.yml
.deploy-template:
  stage: deploy
  variables:
    # account-specific bits come in via per-job variables
    ROLE_ARN: "arn:aws:iam::${AWS_ACCOUNT_ID}:role/gitlab-role"
  script:
    - aws sts assume-role --role-arn $ROLE_ARN --role-session-name gitlab-$CI_JOB_ID
    - cdk deploy --context env=$ENVIRONMENT --context product=$PRODUCT

# .gitlab/.gitlab-ci-deploy-jobs.yml
deploy-dev-product:
  extends: .deploy-template
  variables:
    ENVIRONMENT: dev
    PRODUCT: product
    AWS_ACCOUNT_ID: "123456789012"
    AWS_DEFAULT_REGION: us-west-2
  rules:
    - !reference [.rules-dev-deploy, rules]

deploy-dev-revenue:
  extends: .deploy-template
  variables:
    ENVIRONMENT: dev
    PRODUCT: revenue
    AWS_ACCOUNT_ID: "234567890123"
    AWS_DEFAULT_REGION: us-west-2
  rules:
    - !reference [.rules-dev-deploy, rules]
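
# .gitlab/.gitlab-ci-rules.yml - the when-to-deploy logic, in one place
# (contents reconstructed; the original file wasn't shown)
.rules-dev-deploy:
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
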
# .gitlab-ci.yml
include:
  - local: ".gitlab/.gitlab-ci-base.yml"
  - local: ".gitlab/.gitlab-ci-rules.yml"
  - local: ".gitlab/.gitlab-ci-deploy-jobs.yml"

One caveat worth calling out: GitLab extends and !reference are not the same thing, and mixing them in the same job without thinking is how you end up chasing ghost overrides for an afternoon. extends merges, !reference pastes. Pick one per axis and stick to it.
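
A throwaway illustration of the difference, with job names that are mine:

.base:
  script:
    - echo "setup"

merged-job:            # extends deep-merges jobs; the script here replaces .base's entirely
  extends: .base
  script:
    - echo "run"       # runs only "run"

pasted-job:            # !reference pastes the referenced value where it appears
  script:
    - !reference [.base, script]
    - echo "run"       # runs "setup", then "run"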

One more thing - the refactor above is the right first step, but it is not the end of the road. GitLab offers a few reuse mechanisms, and picking the right one depends on where your mess lives:

  • Within a single repo - stick with extends + include: local:. This is what the example above shows. Cheap, no extra moving parts, and the person inheriting the repo after you can grep the whole pipeline without leaving the project. Good default.
  • Multiple repos with similar stages (assume role, plan, apply, notify) - lift those into CI/CD components and reference them by version tag from each repo’s .gitlab-ci.yml (see the sketch after this list). Components are the GA reusable-pipeline primitive. They support typed inputs, are versioned, and can be published to the CI/CD Catalog for discovery. The docs explicitly recommend components over copy-paste include: across repos, and recommend pinning to a commit SHA or release tag rather than ~latest.
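
What referencing a component looks like from a consumer repo - the catalog path, component name, and inputs below are made up:

# consumer repo's .gitlab-ci.yml
include:
  - component: gitlab.example.com/platform/cicd-components/cdk-deploy@1.2.0
    inputs:
      environment: dev
      aws_account_id: "123456789012"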

A couple of things I would not do on a short engagement:

  • Do not convert one repo’s pipeline straight into components. Components are a library investment. If only one repo would ever consume it, you have added a second repo, a tagging process, and a review flow for zero benefit. Wait for the second and third consumer to show up.
  • Do not pin consumers to ~latest “just to keep things simple.” The docs are explicit about this being an availability and supply-chain risk. Tag a version, pin to it, and upgrade on purpose.

If I had the full engagement to do over and there were 5+ repos in play, the progression I would recommend is: consolidate to extends + include: local: inside each repo first (what the example shows), then lift the two or three job templates that are actually identical across repos into a component, and stop there. Parent-child only if anyone actually asks for concurrency or a cleaner UI on a single big pipeline. Everything else is premature abstraction on a two-week engagement.

Pattern 3: The Bootstrap Tax

Related to but distinct from the monolithic pipeline problem. Once you have the job structure right, look at what every single job is doing in its setup steps - in this case, a before_script. On one engagement, every job started with something like this:

before_script:
  - apk add --update nodejs npm git bash
  - apk --no-cache add curl
  - npm install -g aws-cdk
  - cdk --version
  - apk add --no-cache python3 py3-pip
  - apk add aws-cli
  - aws --version
  - unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
  - export $(aws sts assume-role-with-web-identity ... )
  - aws configure set aws_access_key_id "$ACCESS_KEY_ID" --profile core-account
  # ... 8 more aws configure lines

Every plan job, every deploy job, every destroy job. Re-installing Node, re-installing CDK, re-installing AWS CLI, re-assuming the role. Twelve minutes of setup per job, multiplied by dozens of jobs per pipeline. The apk add aws-cli line in particular was hiding an expat symbol bug on the Alpine 3.22 base image that would start failing silently whenever the upstream package got bumped.

The right fix is to bake Node, Python, CDK, and AWS CLI into the shared CI base image that the runner uses, and move the OIDC role assumption into a reusable .aws-auth template that every job extends once. Job time drops from minutes of setup to seconds.
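
A sketch of that template, assuming GitLab’s id_tokens OIDC flow and a per-job ROLE_ARN variable. The token name and audience are mine, and this exports env vars rather than writing the CLI profiles the original used:

# .gitlab/.gitlab-ci-auth.yml
.aws-auth:
  id_tokens:
    AWS_OIDC_TOKEN:
      aud: https://gitlab.example.com
  before_script:
    - unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
    - >
      export $(aws sts assume-role-with-web-identity
      --role-arn "$ROLE_ARN"
      --role-session-name "gitlab-$CI_JOB_ID"
      --web-identity-token "$AWS_OIDC_TOKEN"
      --duration-seconds 3600
      --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]'
      --output text
      | awk '{print "AWS_ACCESS_KEY_ID=" $1, "AWS_SECRET_ACCESS_KEY=" $2, "AWS_SESSION_TOKEN=" $3}')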

The part worth calling out is that you often do not own the base image. It belongs to a platform team, usually one you do not sit with and whose backlog you are not on. The fix stops being a commit and becomes a conversation. If you cannot get the base image changed inside your timebox, the second-best move is cache: node_modules/ plus short-circuiting the CDK install if cdk --version already works on the cached runner. It is ugly but it ships. The real fix still needs to happen, just not by you.
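
If you are stuck with the stopgap, something like this captures most of the win. The project-local npm prefix is my addition - it makes the “global” CDK install land somewhere the runner can actually cache:

.cdk-cache:
  variables:
    npm_config_prefix: "$CI_PROJECT_DIR/.npm-global"
  cache:
    key: cdk-toolchain
    paths:
      - .npm-global/
  before_script:
    - export PATH="$CI_PROJECT_DIR/.npm-global/bin:$PATH"
    # skip the reinstall entirely when the cached binary still works
    - cdk --version || npm install -g aws-cdk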

This pattern does not look like an architecture problem. It is a pipeline performance problem, and it is almost always the first thing people inheriting a CDK-on-GitLab setup notice because their feedback loop is painful. Worth fixing early because every other pattern in this post is faster to validate once your pipeline stops taking half an hour to tell you your change worked.

A few things that kept coming up

A couple of observations that are less pattern and more theme:

  • The existing infrastructure usually isn’t wrong. It was right for the constraints that existed when it was written. The team that shipped the 500-line gitlab-ci knew it was bad. They also had a deadline and a business to run. Going in with the assumption that you are smarter than the people before you is a fast way to break production and learn humility.

  • Hybrid approaches beat rewrites almost every time on these engagements. A new infra/ folder that lives alongside the old prod/us-west-2/ directories ships faster, blast-radius is smaller, and the reviewers can actually hold both versions in their head. A clean-slate rewrite never ships on time, and when it does, the team that owns the old thing doesn’t trust it.

That’s really it. Not every engagement is greenfield. Most of the interesting work is in the ugly middle.