SST + Nomad
The repo with the code is here.
Things that require further figuring out:
- How to expose only the Traefik service to the outside world without relying on the VPS provider to, well, provide a private network
- How to properly install and configure the initial Nomad setup without a lot of manual work
- CI/CD integration
- Vault integration and where to host it
$5 VPS
Get yourself a VPS with at least 1 GB of RAM (preferably a bit more) and a private network. I know it is possible to provision one with SST from a myriad of providers, but I wanted a more general setup that could be used with any VPS provider or even a bare metal server.
DNS
For DNS, I'll be using Cloudflare because it integrates neatly with Traefik, which will be our reverse proxy, load balancer, and TLS certificate provisioner. You can use any other DNS provider, but you'll have to adjust the Traefik configuration accordingly. You can read more here, here, and here.
Pretend that `10.11.12.13` is your server's public IP and `example.com` is your domain. Then create A records for your domain, pointing to your server's IP address:

```
A example.com          10.11.12.13
A nomad.example.com    10.11.12.13
A database.example.com 10.11.12.13
```
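To sanity-check that the records resolve before moving on (a quick check, assuming `dig` is installed; `nslookup` works just as well):

```shell
# Each of these should resolve; if you proxy through Cloudflare
# you'll see Cloudflare IPs instead of your server's own
dig +short example.com
dig +short nomad.example.com
dig +short database.example.com
```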
Get a `CF_ZONE_API_TOKEN` for Traefik to use Cloudflare's API for DNS challenges to issue TLS certificates, specifying the zone `example.com`. You can do that here. Save the token.
Installing things
Install Nomad, then follow the Linux post-installation steps to install the CNI plugins. Do not install the `consul-cni` plugin.
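For reference, a rough install sketch, assuming a Debian/Ubuntu amd64 box; the CNI plugins version below is just an example, so check the releases page for the current one (Docker Engine is assumed to be installed separately for the `docker` driver):

```shell
# Nomad from the official HashiCorp apt repository
wget -O - https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" \
  | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt-get update && sudo apt-get install -y nomad

# Reference CNI plugins go into /opt/cni/bin
curl -L -o cni-plugins.tgz https://github.com/containernetworking/plugins/releases/download/v1.5.1/cni-plugins-linux-amd64-v1.5.1.tgz
sudo mkdir -p /opt/cni/bin
sudo tar -C /opt/cni/bin -xzf cni-plugins.tgz
```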
Configuring Nomad
I'm sorry for the lack of syntax highlighting for HCL -_-
SSH into your server, go to `/etc/nomad.d`, do `vim nomad.hcl`, and paste this:
```hcl
data_dir  = "/opt/nomad/data"
bind_addr = "0.0.0.0"

server {
  enabled          = true
  bootstrap_expect = 1
}

plugin "docker" {
  config {
    volumes {
      enabled = true
    }
  }
}

plugin "containerd-driver" {
  config {
    containerd_runtime = "io.containerd.runc.v2"
  }
}

client {
  enabled = true
  servers = ["127.0.0.1"]

  host_network "private" {
    cidr = "10.0.0.2/32"
  }
}

acl {
  enabled = true
}
```
Where `10.0.0.2/32` is your server's private IP address. You can get it by running `ip a` and looking for the interface named something like `enp7s0`.
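Something like this, for example (the interface name and address will be different on your machine):

```shell
# List IPv4 addresses; the private one usually sits in the 10.0.0.0/8 or 192.168.0.0/16 range
ip -4 addr show
# Or look at a specific interface
ip -4 addr show dev enp7s0
```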
Here we are:
- binding Nomad to all interfaces so that it can be accessed from outside
- setting the number of expected Nomad servers to 1
- enabling Docker plugin
- allowing the use of volumes in Docker containers
- switching to the 2nd version of the containerd runtime
- enabling the client and pointing it at the server on 127.0.0.1, since the client doesn't need to be reachable from the outside
- setting up a private host network for the client; this is the network that services will use to communicate with each other
- enabling ACL, meaning that we'll have to authenticate with Nomad to do anything
Enable and start Nomad: `systemctl enable --now nomad`
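Just to sanity-check that the agent actually started:

```shell
systemctl status nomad --no-pager
# Or follow the logs if something looks off
journalctl -u nomad -f
```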
Perform the ACL bootstrapping via `nomad acl bootstrap`. You should get something like this:
```shell
Accessor ID  = faacbd2a-1085-8552-5e14-1bc604d95ace
Secret ID    = 3f30403d-f5a3-00ff-b00f-bd256721b867
Name         = Bootstrap Token
Type         = management
Global       = true
Create Time  = 2024-10-16 13:08:38.082016962 +0000 UTC
Expiry Time  = <none>
Create Index = 14
Modify Index = 14
Policies     = n/a
Roles        = n/a
```
Do `export NOMAD_TOKEN=<secret_id>`.

Then do `nomad acl token create -name="frontend" -type="management"`; this will be used to authenticate with the Nomad UI.

And then do `nomad acl token create -name="sst" -type="management"`; this one is for SST to be able to interact with Nomad remotely.

Write them all down.
Visit `http://10.11.12.13:4646/ui`, you should see Nomad's UI. Authenticate with the `frontend` token.
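You can also poke the HTTP API directly to confirm the token works; `X-Nomad-Token` is the header the Nomad API expects:

```shell
# Should return a JSON list of jobs (just [] for now) instead of a permission error
curl -s -H "X-Nomad-Token: <frontend_secret_id>" http://10.11.12.13:4646/v1/jobs
```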
Traefik Host Configuration
Create the folders `/opt/letsencrypt` and `/opt/traefik`. In `/opt/traefik/dynamic-config.yml` put this:
```yml
http:
  routers:
    nomad:
      rule: "Host(`nomad.example.com`)"
      entryPoints:
        - websecure
      service: nomad
      tls:
        certResolver: myresolver
  services:
    nomad:
      loadBalancer:
        servers:
          - url: "http://10.11.12.13:4646"
```
Don't forget to replace `nomad.example.com` with your domain.
First we will create just a Traefik container, which will be responsible for routing traffic to our services and issuing TLS certificates. Since we won't have TLS before we have Traefik, we will use HTTP for now and rotate the tokens later, once HTTPS is configured.
First SST Interaction
Actually no, first execute `nomad var put nomad/jobs/traefik cf_dns_api_token=<cf_zone_api_token>`
Now init SST somehow, add the `nomad` provider via `sst add nomad`, and change `home` to `"local"`.
You should have something like this:
```tsx
...
app(input) {
  return {
    name: "sst-nomad-thing",
    removal: input?.stage === "production" ? "retain" : "remove",
    home: "local",
    providers: { nomad: "2.3.3" }
  }
},
...
```
Create a `.env` file, put this inside:

```env
NOMAD_URL=http://10.11.12.13:4646
NOMAD_TOKEN=<nomad-sst-secret-id>
```
Create a folder named `.nomad` inside your project, and inside it create a file named `traefik.nomad` with the following content:
hclvariable "NOMAD_URL" { type = string } job "traefik" { group "traefik-group" { network { mode = "host" port "http" { static = 80 } port "http_secure" { static = 443 } port "database" { static = 5432 } } service { name = "traefik" provider = "nomad" } task "traefik-task" { driver = "docker" config { image = "traefik" ports = ["http", "http_secure", "database"] volumes = ["/opt/letsencrypt:/letsencrypt", "/opt/traefik:/traefik"] args = [ "--api.dashboard=false", "--api.insecure=true", "--entrypoints.web.address=:${NOMAD_PORT_http}", "--entrypoints.web.http.redirections.entrypoint.to=websecure", "--entrypoints.web.http.redirections.entrypoint.scheme=https", "--entrypoints.websecure.address=:${NOMAD_PORT_http_secure}", "--entrypoints.websecure.http.tls=true", "--entrypoints.database.address=:${NOMAD_PORT_database}", "--providers.nomad=true", "--providers.nomad.endpoint.address=${NOMAD_URL}", "--providers.nomad.exposedByDefault=false", "--accesslog=true", "--log.level=DEBUG", "--certificatesresolvers.myresolver.acme.dnschallenge=true", "--certificatesresolvers.myresolver.acme.dnschallenge.provider=cloudflare", "--certificatesresolvers.myresolver.acme.storage=/letsencrypt/acme.json", "--providers.file.filename=/traefik/dynamic-config.yml" ] } env { NOMAD_URL = var.NOMAD_URL } template { data = <<EOF {{- with nomadVar "nomad/jobs/traefik" -}} CF_DNS_API_TOKEN = {{.cf_dns_api_token}} {{- end -}} EOF destination = "secrets/env" env = true } identity { env = true change_mode = "restart" } } } }
Oof, that's a lot. Let's break it down:
- variables passed to jobs are required to be defined separately; here we have only one variable, `NOMAD_URL`, which is needed for Traefik to know where to find Nomad so that it can use Nomad for service discovery
- we have a job named `traefik` with a group named `traefik-group`; the names are arbitrary
- we set host networking mode, meaning that Traefik will be able to bind to ports 80, 443, and 5432 on the host
- we defined a service named `traefik`; `provider = "nomad"` is super important there, without it service discovery won't work
- we defined a task named `traefik-task`, which is a Docker container
- we are binding ports 80, 443, and 5432 to the host
- we are mounting `/opt/letsencrypt` and `/opt/traefik` into the container
- we are passing `NOMAD_URL` to the container as an environment variable with the value from the job variable
- we are templating (official term) the `secrets/env` file with the Cloudflare DNS API token which we stored in Nomad variables earlier
- I'm not sure exactly what `identity` does; my best guess is that it injects the task's workload identity token into the environment (as `NOMAD_TOKEN`) so the task can authenticate itself with the ACL-protected Nomad API
Also we are passing a bunch of arguments to Traefik:

- `--api.dashboard=false` - we don't need the dashboard
- `--api.insecure=true` - we don't need the API to be secure because inter-service communication won't leave the host (although it will be done with the server's public IP -_-)
- `--entrypoints.web.address=:${NOMAD_PORT_http}`, `--entrypoints.websecure.address=:${NOMAD_PORT_http_secure}`, `--entrypoints.database.address=:${NOMAD_PORT_database}` - defining entrypoints for Traefik using variables provided by Nomad
- `--entrypoints.web.http.redirections.entrypoint.to=websecure`, `--entrypoints.web.http.redirections.entrypoint.scheme=https` - redirecting HTTP to HTTPS
- `--providers.nomad=true`, `--providers.nomad.endpoint.address=${NOMAD_URL}`, `--providers.nomad.exposedByDefault=false` - enabling the Nomad provider, setting the address of the Nomad server, and making sure services won't be exposed by default
- `--accesslog=true`, `--log.level=DEBUG` - enabling access logs and setting the log level to debug
- `--certificatesresolvers.myresolver.acme.dnschallenge=true` - enabling the DNS challenge for ACME
- `--certificatesresolvers.myresolver.acme.dnschallenge.provider=cloudflare` - setting Cloudflare as the DNS challenge provider
- `--certificatesresolvers.myresolver.acme.storage=/letsencrypt/acme.json` - setting the path to store ACME certificates
- `--providers.file.filename=/traefik/dynamic-config.yml` - setting the path inside the container to the dynamic configuration file which we created earlier
Add this function to `sst.config.ts`:
```tsx
const getEnvVariables = () => {
  const nomadUrl = process.env.NOMAD_URL
  if (!nomadUrl) throw new Error("NOMAD_URL is not set")

  const nomadToken = process.env.NOMAD_TOKEN
  if (!nomadToken) throw new Error("NOMAD_TOKEN is not set")

  return {
    nomadUrl,
    nomadToken
  }
}
```
This is how I do it, you can do it however you want, but you need to have access to `NOMAD_URL` and `NOMAD_TOKEN` in your code.
Update the `run` function:
```tsx
// readFileSync needs to be imported at the top of sst.config.ts:
// import { readFileSync } from "node:fs"
async run() {
  const { nomadUrl, nomadToken } = getEnvVariables()

  const nomadProvider = new nomad.Provider("NomadProvider", {
    address: nomadUrl,
    skipVerify: true,
    secretId: nomadToken
  })

  const traefik = new nomad.Job(
    "Traefik",
    {
      jobspec: readFileSync(".nomad/traefik.nomad", "utf-8"),
      hcl2: {
        vars: {
          NOMAD_URL: nomadUrl
        }
      }
    },
    {
      provider: nomadProvider
    }
  )
}
```
Seems self-explanatory: we are creating a Nomad provider, then a Traefik job, then passing the `NOMAD_URL` variable to the job. `skipVerify` is set to `true` because we don't have TLS configured yet.
Perform `env $(cat .env | xargs) sst deploy`, visit the Nomad UI at `http://10.11.12.13:4646/ui`, and you should see a `traefik` job running.
Once it's healthy, check the Traefik logs to make sure everything is OK, then check `/opt/letsencrypt/acme.json`; it should be populated with a certificate for `nomad.example.com`.

If you can visit `https://nomad.example.com/ui` and see, well, the UI, then everything is working fine.
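If you prefer the terminal over the UI for this, roughly (the allocation ID is a placeholder you can grab from the job status output):

```shell
# Find the Traefik allocation and tail its logs
nomad job status traefik
nomad alloc logs -f <alloc-id> traefik-task

# The ACME storage should now mention your domain
sudo grep nomad.example.com /opt/letsencrypt/acme.json
```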
Change `skipVerify` to `false` in the `nomadProvider`.
Hardening (lmao) the Nomad
Now we need to rotate the `sst` and `frontend` tokens.

Do `nomad acl token list`, you should see something like this:
```shell
Name             Type        Global  Accessor ID                           Expired
Bootstrap Token  management  true    f4ab3e26-ce1d-11e6-3d9a-238db337c10a  false
frontend         management  false   0d60c989-a5b4-874a-2940-43e7549a060c  false
sst              management  false   0b0f1d6b-85e4-d654-4635-7775dcbe43db  false
```
If you get access denied, do `export NOMAD_TOKEN=<bootstrap_token>`.
Delete both tokens by their accessor IDs:

```shell
root@sst-nomad-thing:~# nomad acl token delete 0d60c989-a5b4-874a-2940-43e7549a060c
Successfully deleted 0d60c989-a5b4-874a-2940-43e7549a060c policy!
root@sst-nomad-thing:~# nomad acl token delete 0b0f1d6b-85e4-d654-4635-7775dcbe43db
Successfully deleted 0b0f1d6b-85e4-d654-4635-7775dcbe43db policy!
```
Recreate them as we did before, update `NOMAD_TOKEN` in the `.env` file with the new token, and also change `NOMAD_URL` to `https://nomad.example.com` while you're at it.
Now we can transport secrets to the server over an encrypted connection, since we have HTTPS figured out.
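For example, the `nomad` CLI on your own machine can now talk to the cluster through Traefik over TLS; `NOMAD_ADDR` and `NOMAD_TOKEN` are the standard environment variables it reads:

```shell
export NOMAD_ADDR=https://nomad.example.com
export NOMAD_TOKEN=<new-sst-secret-id>
# Variables (secrets) now travel over HTTPS instead of plain HTTP
nomad var put nomad/jobs/traefik cf_dns_api_token=<cf_zone_api_token>
```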
If you do `env $(cat .env | xargs) sst refresh` now, you'll get a 403 error even though we updated the token.

That's because the token is cached, so we need to do `env $(cat .env | xargs) sst deploy`. It will error out too, but run `env $(cat .env | xargs) sst deploy` again, and everything should be fine.
Deploying Services
Echo Service
Create an `echo.nomad` file in the `.nomad` folder in the project root, and put this inside:
hclvariable "POSTGRES_USER" { type = string } variable "POSTGRES_PASSWORD" { type = string } variable "POSTGRES_DATABASE" { type = string } variable "DOMAIN" { type = string } job "echo" { group "echo-group" { count = 3 network { mode = "bridge" port "http" { to = -1 host_network = "private" } } service { name = "echo" provider = "nomad" port = "http" tags = [ "http-echo", "traefik.enable=true", "traefik.http.routers.http-echo.rule=Host(`${var.DOMAIN}`)", "traefik.http.routers.http-echo.entrypoints=websecure", "traefik.http.routers.http-echo.tls.certresolver=myresolver", "traefik.http.services.http-echo.loadbalancer.server.port=${NOMAD_PORT_http}" ] check { name = "HTTP Echo Health" type = "tcp" interval = "10s" timeout = "2s" } } task "echo-task" { driver = "docker" config { image = "hashicorp/http-echo" ports = ["http"] args = ["-text=DATABASE_URL: ${DATABASE_URL}\n\nCURRENT_PORT: ${NOMAD_PORT_http}", "-listen=:${NOMAD_PORT_http}"] } template { data = <<EOF {{- range nomadService "postgres" }} DATABASE_URL=postgres://${var.POSTGRES_USER}:${var.POSTGRES_PASSWORD}@{{ .Address }}:{{ .Port }}/${var.POSTGRES_DATABASE} {{- end }} EOF destination = "secrets/env" env = true } } } }
Mostly the same as the Traefik job, but with some differences:

- here we are creating 3 instances of the service, which will be load balanced (selected randomly)
- the `hashicorp/http-echo` image is used to create a simple HTTP server which returns whatever we pass to it, in this case the database URL and the port the current instance is running on
- we are templating the `secrets/env` file with the database URL, which we will use in the `echo` service
- `range nomadService "postgres"` is Nomad's way of service discovery; it finds a service named `postgres` (comes next) for us and gives us access to its `{{ .Address }}` and `{{ .Port }}`
Traefik labels:

- `traefik.enable=true` - enabling Traefik for this service
- ``traefik.http.routers.http-echo.rule=Host(`${var.DOMAIN}`)`` - setting the rule for Traefik to route traffic to this service when the request is made to the domain specified in the variable
- `traefik.http.routers.http-echo.entrypoints=websecure` - setting the entrypoint for this service to `websecure`, which is the HTTPS entrypoint
- `traefik.http.services.http-echo.loadbalancer.server.port=${NOMAD_PORT_http}` - setting the port to which Traefik should route the traffic, different for each instance and provided by Nomad
Postgres Service
Create a `postgres.nomad` file in the `.nomad` folder, and put this inside:
hclvariable "POSTGRES_PASSWORD" { type = string } variable "POSTGRES_USER" { type = string } variable "POSTGRES_DATABASE" { type = string } variable "DOMAIN" { type = string } job "postgres" { group "postgres-group" { network { mode = "bridge" port "database" { to = -1 host_network = "private" } } service { name = "postgres" provider = "nomad" port = "database" tags = [ "database", "traefik.enable=true", "traefik.tcp.routers.db.rule=HostSNI(`database.${var.DOMAIN}`)", "traefik.tcp.routers.db.tls=true", "traefik.tcp.routers.db.entrypoints=database", "traefik.tcp.routers.db.tls.certresolver=myresolver", "traefik.tcp.services.db.loadbalancer.server.port=${NOMAD_PORT_database}" ] } task "postgres-task" { driver = "docker" config { image = "docker.io/postgres" ports = ["database"] volumes = ["/opt/nomad/data/postgres:/var/lib/postgresql/data"] } env { POSTGRES_PASSWORD = var.POSTGRES_PASSWORD POSTGRES_USER = var.POSTGRES_USER POSTGRES_DB = var.POSTGRES_DATABASE PGPORT = "${NOMAD_PORT_database}" } } } }
Same as the `echo` service, but over TCP.
Second SST Interaction
Add these to the `.env` file:

```env
POSTGRES_PASSWORD=super-secret
POSTGRES_USER=oofer
POSTGRES_DB=boofer
DOMAIN=domain.com
```
Don't forget to replace `domain.com` with your domain.
Update the `getEnvVariables` function:
```ts
const getEnvVariables = () => {
  const nomadUrl = process.env.NOMAD_URL
  if (!nomadUrl) throw new Error("NOMAD_URL is not set")

  const nomadToken = process.env.NOMAD_TOKEN
  if (!nomadToken) throw new Error("NOMAD_TOKEN is not set")

  const domain = process.env.DOMAIN
  if (!domain) throw new Error("DOMAIN is not set")

  const postgresPassword = process.env.POSTGRES_PASSWORD
  if (!postgresPassword) throw new Error("POSTGRES_PASSWORD is not set")

  const postgresUser = process.env.POSTGRES_USER
  if (!postgresUser) throw new Error("POSTGRES_USER is not set")

  const postgresDatabase = process.env.POSTGRES_DB
  if (!postgresDatabase) throw new Error("POSTGRES_DB is not set")

  return {
    nomadUrl,
    nomadToken,
    domain,
    postgresPassword,
    postgresUser,
    postgresDatabase
  }
}
```
And its call:

```ts
const {
  nomadUrl,
  nomadToken,
  domain,
  postgresPassword,
  postgresUser,
  postgresDatabase
} = getEnvVariables()
```
Add the new jobs to the `run` function:
```ts
const echo = new nomad.Job(
  "Echo",
  {
    jobspec: readFileSync(".nomad/echo.nomad", "utf-8"),
    hcl2: {
      vars: {
        POSTGRES_PASSWORD: postgresPassword,
        POSTGRES_USER: postgresUser,
        POSTGRES_DATABASE: postgresDatabase,
        DOMAIN: domain
      }
    }
  },
  {
    provider: nomadProvider
  }
)

const postgres = new nomad.Job(
  "Postgres",
  {
    jobspec: readFileSync(".nomad/postgres.nomad", "utf-8"),
    hcl2: {
      vars: {
        POSTGRES_PASSWORD: postgresPassword,
        POSTGRES_USER: postgresUser,
        POSTGRES_DATABASE: postgresDatabase,
        DOMAIN: domain
      }
    }
  },
  {
    provider: nomadProvider
  }
)
```
Do `env $(cat .env | xargs) sst deploy`. Visit the UI, you should see the jobs; wait for them to be healthy, then visit `https://example.com`, and you should see something like this:
```shell
DATABASE_URL: postgres://oofer:super-secret@10.11.12.13:24847/boofer

CURRENT_PORT: 28721
```
Running `openssl s_client -connect database.example.com:5432` should show that TLS is working, indicating that we have an encrypted connection to the database from the outside.
Refresh the page a couple of times, you should see different ports in the `CURRENT_PORT` field.
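Or from the terminal (each response should come back from a different instance):

```shell
for i in 1 2 3 4 5; do
  curl -s https://example.com | grep CURRENT_PORT
done
```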
And I guess that's it. Now we have a working Nomad setup with Traefik, and we have a way to do rolling updates, run multiple instances of services, and connect them to each other.