PulseCheck Journal

June 7, 2026 • 60 min read

PulseCheck

TL;DR

Disclaimer:
I wrote this TL;DR section at the very end, after completing the journal. Some of the conclusions mentioned here were not part of the original plan from the beginning. This project changed shape multiple times throughout development, and some decisions were made only after I had worked through later sub-iterations.
In case something feels inconsistent while reading the journal, that’s probably the reason.

PulseCheck is a multi-user service monitoring dashboard. It periodically pings configured HTTP endpoints and records their health status, response time, and availability. For third-party services, it uses their official status page APIs, if supported, to monitor availability and display information about ongoing incidents, scheduled maintenance, and whatnot.

PulseCheck is a live service anyone can use for free, no strings attached. If you manage your own infrastructure like VPSes, self-host websites and services, or use third-party services and LLMs, PulseCheck is for you. Anything you want to keep an eye on for availability, you can monitor with PulseCheck from a single place.

Try out PulseCheck
PulseCheck Repository

On the other hand, I use PulseCheck as my permanent DevOps lab. The application runs through multiple layers: a dev layer with Docker Compose for development, a local production layer with K3d for pre-production testing, a real production layer with K3s on a VPS running a single-node cluster, and an additional four-node Kind cluster that I use as a playground for multi-node Kubernetes, security, and operational experiments.

For each layer, I have separate OAuth applications, so I can spin up all layers at once without any conflicts. That was the architecture I wanted.

Any experiments I do in the Kind layer stay local and do not affect production.

In the future, I might integrate PulseCheck with more advanced DevOps concepts. Production could grow into a multi-node deployment. I might integrate technologies such as Terraform, Jenkins, and more.

It’s a journey.

In short, PulseCheck is a reference environment.

Explanation:
“reference environment” — a stable, real deployment that I keep using as the foundation for ongoing experimentation. Not a project with a finish line. Not a sandbox that I throw away. A persistent, real environment that doubles as a lab.

The idea came from my virtual homelab. I already maintain a lab for security research, attack chains, defense, malware testing, and analysis. I keep documentation for the techniques I use in the lab. PulseCheck is the equivalent lab environment for DevOps.

I developed PulseCheck in iterations. You can check out each iteration and what was done during it in this journal.

PulseCheck Iteration 1
PulseCheck Iteration 2 - Coming Soon

Design

Disclaimer:
Things that come under this design section were part of my initial plan for the project. Some things might have changed as I moved forward with the project. In some parts, the design decisions and actual implementation might be different because this is a journal of the development process, not a documentation of the final state. Some ideas changed, some evolved, and some were replaced entirely as the project moved forward. For most of the changes, I mention them in the implementation section under each sub-iteration.

This diagram shows the basic flow of the application.

Architecture Diagram

Services

It’s a multi-service application, and each service runs in a separate Docker container.

Service	Tech	Purpose
frontend	React + Vite + Nginx	Dashboard UI showing service status
api	Python FastAPI	REST endpoints for CRUD + serving data
worker	Python	Pings endpoints on intervals, records results
postgres	PostgreSQL (official image)	Stores services, check history, incidents
redis	Redis (official image)	Caches latest status per service

Frontend (React + Vite + Nginx)

Port: 3000 (dev), 80 (production via Nginx)
Responsibility: Dashboard UI
Talks to: API only
Key views:
- Dashboard — all services grouped by category, live status, uptime %, latency, mini charts
- Service detail — response time chart (24h), uptime bar (30d), incident history
- Incidents — timeline of all incidents across services
- Add service — form with name, URL, category, interval, status page link, thresholds

API (Python FastAPI)

Port: 8000
Responsibility: REST API serving data to frontend, CRUD for services
Talks to: PostgreSQL, Redis
Auth: Google OAuth + GitHub OAuth
Key endpoints:
- GET /api/services — list current user’s monitored services
- POST /api/services — add a service to monitor
- PUT /api/services/{id} — update a service
- DELETE /api/services/{id} — remove a service
- GET /api/services/{id}/checks — check history for a service
- GET /api/services/{id} — service detail with uptime stats
- GET /api/incidents — list all incidents for current user’s services
- GET /api/health — API health check
- GET /api/auth/google — Google OAuth flow
- GET /api/auth/github — GitHub OAuth flow

Worker (Python)

Responsibility: Background process that runs health checks
Talks to: PostgreSQL, Redis, external URLs
Behavior:
- Runs continuously, respecting each service’s individual check interval
- For each service, sends HTTP GET to the endpoint
- Records: status code, response time (ms), timestamp
- Determines status: “up” (expected status code + below degraded threshold), “degraded” (responding but slow), “down” (error/timeout/unexpected status)
- If status transitions (up -> down, up -> degraded, down -> up, degraded -> up): creates/resolves incident
- Caches latest status per service in Redis for fast dashboard loads
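A minimal sketch of how per-service intervals could be respected, assuming a priority queue keyed by each service's next due time. The `ScheduledCheck` class and `pop_due` function are hypothetical names for illustration, not the worker's actual code.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ScheduledCheck:
    due_at: float                             # next time this service should be pinged
    service_id: str = field(compare=False)
    interval_s: int = field(compare=False)    # mirrors services.check_interval (seconds, > 0)


def pop_due(queue: list, now: float) -> list:
    """Return ids of services due at `now` and reschedule each one."""
    due = []
    while queue and queue[0].due_at <= now:
        item = heapq.heappop(queue)
        due.append(item.service_id)
        # Re-queue the service one interval into the future.
        heapq.heappush(queue, ScheduledCheck(now + item.interval_s, item.service_id, item.interval_s))
    return due


queue = [ScheduledCheck(0.0, "api", 30), ScheduledCheck(0.0, "blog", 60)]
heapq.heapify(queue)
first = pop_due(queue, now=0.0)     # both services are due at start
second = pop_due(queue, now=30.0)   # only the 30-second service is due again
third = pop_due(queue, now=60.0)    # both are due again
```

A heap keeps the lookup for "what is due next" cheap even with many services, but a simple loop over all services would work just as well at small scale.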

PostgreSQL

Port: 5432
Database schema:

users:
  id              UUID PRIMARY KEY
  email           VARCHAR UNIQUE NOT NULL
  name            VARCHAR NOT NULL
  auth_provider   VARCHAR NOT NULL        -- "google" or "github"
  created_at      TIMESTAMP

services:
  id                  UUID PRIMARY KEY
  user_id             UUID REFERENCES users(id)
  name                VARCHAR NOT NULL        -- "Claude API"
  url                 VARCHAR NOT NULL        -- "https://api.anthropic.com"
  category            VARCHAR                 -- "AI Tools" (user-defined, free text)
  check_interval      INTEGER DEFAULT 30      -- seconds
  status_page_url     VARCHAR                 -- "https://status.anthropic.com" (optional)
  expected_status     INTEGER DEFAULT 200     -- what status code means "up"
  timeout_ms          INTEGER DEFAULT 5000    -- after this, consider it down
  degraded_threshold_ms INTEGER DEFAULT 1000  -- above this = degraded
  created_at          TIMESTAMP

checks:
  id              UUID PRIMARY KEY
  service_id      UUID REFERENCES services(id)
  status_code     INTEGER                     -- HTTP status code (null if timeout)
  response_time   INTEGER                     -- milliseconds
  status          VARCHAR NOT NULL            -- "up", "down", "degraded"
  checked_at      TIMESTAMP

incidents:
  id              UUID PRIMARY KEY
  service_id      UUID REFERENCES services(id)
  type            VARCHAR NOT NULL            -- "downtime" or "degraded"
  started_at      TIMESTAMP NOT NULL
  resolved_at     TIMESTAMP                   -- null if ongoing
  checks_failed   INTEGER DEFAULT 0           -- consecutive failures count

Redis

Port: 6379
Stores:
- Latest status per service — service:{id}:status -> JSON with current status, last check time, response time
- Used for fast dashboard loads without hitting PostgreSQL for current status

Other Tech Decisions

Helm: Used from iteration 1 for K8s deployments
CI/CD: GitHub Actions
Local K8s: Kind for local cluster
Production K8s: K3s on VPS
Domain: It could be pulsecheck.kavindujayarathne.com, or I might use its own domain dedicated to this project. I haven’t decided yet (will finalize once the VPS is ready)
Development: Everything containerized with Docker Compose plus volume mounts for hot reload

Commit Strategy -> Based on iterations

As I kept going with the idea, I realized that this kind of project can go far beyond just Docker and Kubernetes automation. There are a lot of DevOps concepts out there, and this project has the potential to slowly evolve into a full showcase of those.

But I can’t integrate everything at once. And that’s not the main purpose either. That’s one side of it.

The other thing is, I don’t want to just dump the code into GitHub with a single “Initial commit”. I want this project to feel more alive and descriptive than that.

So the best approach is to build this iteration by iteration. The first iteration focuses on deploying this as a working solution on a VPS, showcasing my DevOps skills through Docker and Kubernetes automation, and handling the full deployment locally with Kind while using K3s on the VPS (in prod).

Each iteration is broken down into smaller sub-iterations. While working on the first main iteration, I complete each sub-iteration and commit it to GitHub. That’s always better than dumping the entire codebase with a single “Initial commit”.

Once the first main iteration is complete, I move on to the next one. Depending on the scope, each main iteration can also include multiple sub-iterations.

Multi-user model

The initial idea was to put this out as a single-user dashboard. That means I add all the services I want to monitor and deploy it. While it’s live, people can access it, but it only shows what I’ve added: services that are useful to me and my infrastructure. No one other than me can add services or change anything. It’s basically my personal system monitoring dashboard out there, live.

At that point, there’s no real value for other people. So why even make it public? I’d rather keep it local and monitor my own services privately.

That changed my mind. I decided to make it a multi-user model, so anyone can turn it into their own personal dashboard by adding whatever they want to monitor. Now it makes sense to make it public.

Users can list all their services, and they can define their own categorization for the services they add. I’m not adding any predefined categories or anything. That’s up to each individual using it as their personal monitoring dashboard.

For authentication, I’m using Google OAuth and GitHub OAuth. It’s simple, and I don’t have to deal with handling security myself. I also thought about using magic links, but that would require setting up a mail service. So the easiest and cost-free option is OAuth, and I’ll stick with that.

Data Flow

1. Worker runs continuously
2. For each service, checks if it's time to ping (based on service's check_interval)
3. Sends HTTP GET to service URL (with configured timeout)
4. Determines status:
   - Response received + expected status code + response_time < degraded_threshold -> "up"
   - Response received + expected status code + response_time >= degraded_threshold -> "degraded"
   - Error / timeout / unexpected status code -> "down"
5. Writes check result to PostgreSQL
6. Updates latest status in Redis
7. Compares with previous status:
   - up -> down: create "downtime" incident
   - up -> degraded: create "degraded" incident
   - down/degraded -> up: resolve open incident
8. Frontend loads dashboard
9. API reads latest status from Redis (fast) + historical data from PostgreSQL on demand
10. Frontend renders dashboard grouped by user-defined categories
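Steps 4 and 7 above can be sketched as two small pure functions. This is an illustration under assumptions: the function names and the returned action strings are hypothetical, and only the transitions listed in step 7 are handled.

```python
from typing import Optional


def classify(status_code: Optional[int], response_time_ms: Optional[int],
             expected_status: int = 200, degraded_threshold_ms: int = 1000) -> str:
    """Map one check result to "up", "degraded", or "down" (step 4)."""
    if status_code is None or response_time_ms is None:
        return "down"                      # error or timeout: no response at all
    if status_code != expected_status:
        return "down"                      # responded, but with an unexpected status
    if response_time_ms < degraded_threshold_ms:
        return "up"
    return "degraded"                      # expected status, but too slow


def incident_action(previous: str, current: str) -> Optional[str]:
    """Decide what to do with incidents on a status transition (step 7)."""
    if previous == "up" and current == "down":
        return "open:downtime"
    if previous == "up" and current == "degraded":
        return "open:degraded"
    if previous in ("down", "degraded") and current == "up":
        return "resolve"
    return None                            # no transition listed in the design
```

For example, `classify(200, 1000)` returns `"degraded"` because the threshold comparison is `>=`, matching the rule in step 4.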

Docker Compose and Kubernetes

We don’t need Kubernetes for this project. Docker Compose alone would be enough on a VPS. But because this whole project is meant to be a DevOps playground and experimental environment, I decided to run a single-node Kubernetes cluster on a VPS with K3s.

All local development happens with Docker Compose. Once the app is ready to be deployed, I use Kind locally to test the Kubernetes setup, then deploy it to the single-node Kubernetes cluster (K3s) running on my VPS.

Network (Docker Compose)

All services share a single Docker network (pulsecheck-net). Services reference each other by container name:

API connects to postgres:5432 and redis:6379
Worker connects to postgres:5432 and redis:6379
Frontend proxies API requests to api:8000

Network (Kubernetes)

I have not defined the Kubernetes networking yet. The main network logic is the same, but it has to be expressed differently because Docker Compose and Kubernetes model networking in different ways. I’m gonna handle this later in another sub-iteration, when I test the Kubernetes environment locally with Kind.

Development approach

All development happens in containers. Nothing is installed on the host. Docker Compose with volume mounts gives hot reload. All code changes reflect instantly without rebuilding images. Anyone can clone the repo and run docker compose up without installing language runtimes.

Deployment path

Local-first, then production:

Local development: Docker Compose
Local Kubernetes: Kind cluster
Production Kubernetes: K3s on a VPS
Domain: pulsecheck.kavindujayarathne.com
CI/CD: GitHub Actions first (lives in the repo, no extra server).
Helm charts: Introduced from Sub-Iteration 10, used for all K8s deployments

Frontend pages (designed upfront)

Login page — Google + GitHub OAuth buttons
Dashboard — services grouped by category, status dots, uptime %, latency, mini response time charts, “X of Y operational” summary, recent incidents
Service detail — response time chart (24h/7d/30d), uptime bar (30d), incident history
Incidents page — timeline across all services, filterable by service/type/status
Add/edit service form — name, URL, category, interval, thresholds, status page link

Redis usage (designed upfront)

Key pattern: service:{id}:status -> JSON with current status, last check time, response time. Purpose: fast dashboard loads without querying PostgreSQL for current state every time.
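A small sketch of the key pattern and cached value, using only the standard library. The key format comes from the design above; the JSON field names (`status`, `checked_at`, `response_time_ms`) are assumptions for illustration.

```python
import json
from datetime import datetime, timezone


def status_key(service_id: str) -> str:
    """Build the Redis key from the pattern service:{id}:status."""
    return f"service:{service_id}:status"


def encode_status(status: str, checked_at: datetime, response_time_ms: int) -> str:
    """Serialize the cached value; the JSON field names are assumptions."""
    return json.dumps({
        "status": status,
        "checked_at": checked_at.isoformat(),
        "response_time_ms": response_time_ms,
    })


key = status_key("3f2a")
value = encode_status("up", datetime(2026, 6, 7, 12, 0, tzinfo=timezone.utc), 142)
```

With a Redis client such as redis-py, the worker would then write the pair with something like `client.set(key, value)` and the API would read it back and `json.loads` it.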

Implementation

There are three valid approaches. I can select one of them to move forward with the implementation.

The first one is the backend-first approach. Here, I finish the backend side first and then move to the UI at the end.

The second one is the frontend-first approach. Here, I build the UI against mocked data and then wire the real backend behind it.

The third approach is called vertical slices (Agile style). Here, what happens is I pick one feature at a time and build it top to bottom. As an example, I can simply take the “user can add a service” feature and do the DB + API + UI for just that feature. Then we can move to the next feature likewise.

I’m going with the first one: backend first, then the UI.

Main Iteration 1: Core App + Docker/Kubernetes Automation

Sub-Iteration 1: Project Structure with Docker Compose Setup

In this sub-iteration, I got all five containers running locally via Docker Compose with placeholder code.

Completed tasks under 1st sub-iteration:

Sub-Iteration 2: Network Segmentation in Docker Compose

In the design section, I had defined a flat network. All 5 containers were on a single Docker bridge network (pulsecheck-net). Every container could communicate with every other container. Network segmentation was not part of the initial design, but I identified it as a needed improvement later.

Having a flat network during local development is not a bad thing. It is easier sometimes. But having a segmented network is better because it is closer to the production build, and I can cut down some unnecessary interactions between containers.

As an example:

The frontend container has no business touching the database
The frontend container has no business touching Redis

Interactions like these are unnecessary.

So I decided to segment the network properly.

Network segmentation rules:

Public-facing: frontend, api
Internal only: worker, postgres, redis
Rules:
- Frontend can reach api only
- Api can reach postgres and redis
- Worker can reach postgres and redis, and can reach external URLs on the internet
- Postgres and Redis do not initiate outbound traffic and are not reachable from anything public
- Frontend cannot reach postgres, redis, or worker directly

Implementation of network segmentation:

I added two networks: web and data. Both frontend and api are attached to the web network. Both have host port mappings in development. Api, worker, postgres, and redis are attached to the data network. Internal communication happens by container name. Postgres and Redis have no port mappings.

Container names resolve via Docker’s embedded DNS:

Api connects to postgres:5432 and redis:6379
Worker connects to postgres:5432 and redis:6379
Frontend is not required to reach any other container directly during development (the browser calls the api via the host port)

Completed tasks under 2nd sub-iteration:

Sub-Iteration 3: Database Schema and Migrations

Migration doesn’t mean moving data between databases here. It means changing the structure (schema) of the database.

Like:

Adding a table is a migration
Adding a column to a table is a migration
Renaming a column is also a migration

Only the structure evolves while keeping the data as is.
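To make the idea concrete, here is a self-contained sketch that applies two structural changes while the existing row survives. It uses Python's built-in SQLite instead of PostgreSQL and plain SQL instead of Alembic, purely so it runs anywhere. The table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Starting structure with one row of real data.
conn.execute("CREATE TABLE services (id TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO services VALUES ('s1', 'Claude API')")

# Migration 1: add a column to an existing table.
conn.execute("ALTER TABLE services ADD COLUMN category TEXT")

# Migration 2: add a new table.
conn.execute("CREATE TABLE checks (id TEXT PRIMARY KEY, service_id TEXT NOT NULL)")

columns = [row[1] for row in conn.execute("PRAGMA table_info(services)")]
rows = conn.execute("SELECT id, name, category FROM services").fetchall()
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```

The original row is still there after both changes. Only the new `category` column is empty for it.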

In this sub-iteration, I used SQLAlchemy as the ORM (Object-Relational Mapper). It makes the Python code cleaner. The other thing that I used here was Alembic to make table creation automatic across every environment.

Sometimes, these tools have some trade-offs. As an example:

ORMs could cause performance overhead
Things could get messier with complex queries

So sometimes it is necessary to drop down to raw SQL.

Database Schema:

Changes from the original schema in the design section to what I implemented:

TIMESTAMP -> TIMESTAMP WITH TIME ZONE on all timestamp columns
ON DELETE CASCADE added to all foreign keys (user_id, service_id)
NOT NULL added to all foreign key columns (services.user_id, checks.service_id, incidents.service_id)
NOT NULL DEFAULT now() added to created_at (users, services) and checked_at (checks)
NOT NULL added to check_interval, expected_status, timeout_ms, degraded_threshold_ms, checks_failed

users:
  id              UUID PRIMARY KEY
  email           VARCHAR UNIQUE NOT NULL
  name            VARCHAR NOT NULL
  auth_provider   VARCHAR NOT NULL                      -- "google" or "github"
  created_at      TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()

services:
  id                    UUID PRIMARY KEY
  user_id               UUID REFERENCES users(id) ON DELETE CASCADE NOT NULL
  name                  VARCHAR NOT NULL                -- "Claude API"
  url                   VARCHAR NOT NULL                -- "https://api.anthropic.com"
  category              VARCHAR                         -- "AI Tools" (user-defined, free text)
  check_interval        INTEGER NOT NULL DEFAULT 30     -- seconds
  status_page_url       VARCHAR                         -- "https://status.anthropic.com" (optional)
  expected_status       INTEGER NOT NULL DEFAULT 200    -- what status code means "up"
  timeout_ms            INTEGER NOT NULL DEFAULT 5000   -- after this, consider it down
  degraded_threshold_ms INTEGER NOT NULL DEFAULT 1000   -- above this = degraded
  created_at            TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()

checks:
  id              UUID PRIMARY KEY
  service_id      UUID REFERENCES services(id) ON DELETE CASCADE NOT NULL
  status_code     INTEGER                               -- HTTP status code (null if timeout)
  response_time   INTEGER                               -- milliseconds
  status          VARCHAR NOT NULL                      -- "up", "down", "degraded"
  checked_at      TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now()

incidents:
  id              UUID PRIMARY KEY
  service_id      UUID REFERENCES services(id) ON DELETE CASCADE NOT NULL
  type            VARCHAR NOT NULL                      -- "downtime" or "degraded"
  started_at      TIMESTAMP WITH TIME ZONE NOT NULL
  resolved_at     TIMESTAMP WITH TIME ZONE              -- null if ongoing
  checks_failed   INTEGER NOT NULL DEFAULT 0            -- consecutive failures count
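The effect of ON DELETE CASCADE and NOT NULL on the foreign keys can be demonstrated with a cut-down version of the schema. This sketch uses Python's built-in SQLite rather than PostgreSQL so it is runnable as-is, and it keeps only the key columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when asked; PostgreSQL does so by default.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE users (id TEXT PRIMARY KEY);
    CREATE TABLE services (
        id      TEXT PRIMARY KEY,
        user_id TEXT NOT NULL REFERENCES users(id) ON DELETE CASCADE
    );
    CREATE TABLE checks (
        id         TEXT PRIMARY KEY,
        service_id TEXT NOT NULL REFERENCES services(id) ON DELETE CASCADE
    );
    INSERT INTO users VALUES ('u1');
    INSERT INTO services VALUES ('s1', 'u1');
    INSERT INTO checks VALUES ('c1', 's1');
""")

# Deleting the user removes their services, which removes those services' checks.
conn.execute("DELETE FROM users WHERE id = 'u1'")
remaining_services = conn.execute("SELECT COUNT(*) FROM services").fetchone()[0]
remaining_checks = conn.execute("SELECT COUNT(*) FROM checks").fetchone()[0]

# NOT NULL on the foreign key rejects a service that belongs to nobody.
try:
    conn.execute("INSERT INTO services VALUES ('s2', NULL)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```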

Completed tasks under 3rd sub-iteration:

Sub-Iteration 4: OAuth authentication with GitHub and Google

I already mentioned that I’m using Google and GitHub OAuth for login in the PulseCheck application.

I took a few more decisions in this sub-iteration.

For token delivery and storage, I use HttpOnly cookies. I have already written an article about these cookie flags. If you are interested in that, check out the article.

With that:

The browser handles token attachment automatically on every request
The token is never accessible to JavaScript, so an XSS payload cannot read it
In production (Nginx proxy, same origin), it works out of the box with zero extra config
The only downside (dual-port complexity in development) is solvable

I had a small issue when using HttpOnly cookie. In PulseCheck, the frontend and api run on separate ports during development. The React dev server runs on port 3000, and the FastAPI server runs on port 8000. They need separate ports because each has its own hot-reload server.

In production, Nginx serves both through one port, so this problem doesn’t exist. But in development, the browser sees two different origins, which makes cookie handling complicated.

I had two options to solve this.

I could either use loose cookie settings in development (SameSite=None, Secure=False) and tighten them in production (SameSite=Lax, Secure=True), or configure a server.proxy rule in vite.config.js so the Vite dev server proxies /api requests to http://api:8000, making the browser see everything as same-origin. The first option is also fragile, because current browsers such as Chrome reject SameSite=None cookies that are not marked Secure.

With the second option, when the frontend calls /api/services, the Vite dev server intercepts it and forwards it to http://api:8000/api/services behind the scenes. The browser only ever talks to port 3000, so cookies work as if everything is on the same server.

I decided to go with the second option, which is using a dev proxy to solve the dual-port cookie problem.

It mirrors how production works (Nginx proxy), needs no environment-specific cookie config, and removes the need for VITE_API_URL in development.

The next decision that I took was about JWT expiration.

I selected the (1h + 7d) option for PulseCheck.

Initially, I was about to select 24h. But then users would have to log in every day. As someone who uses dozens of services, I know how annoying it is to log in again and again within a short time period.

The (1h + 7d) option covers the security aspect and, at the same time, doesn’t require users to log in every day.

In this way, login gets two tokens:

Access token (1h) — used for API calls
Refresh token (7d) — used only to get a new access token

The access token expires after 1h, after which the frontend automatically calls /api/auth/refresh with the refresh token and gets a new access token (1h). The user doesn’t notice anything.

This keeps working until the refresh token expires (7 days). Then the user has to log in again.

In this way, if the access token somehow gets stolen, it is only valid for 1h. The refresh token is harder to steal because it is only sent on one specific endpoint. It is not sent on every API call. This is how it covers the security aspect.

Once the cookies have expired, the API returns 401 Unauthorized when the frontend tries to call /api/services, and the frontend then redirects the user to the login page.
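The access/refresh lifetimes can be sketched with a hand-rolled HS256 token built from the standard library. This is an illustration only, not the project's code: a real service would normally use a maintained JWT library, and the `kind` claim, the `SECRET` value, and the function names are all assumptions.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

SECRET = b"dev-only-secret"            # hypothetical; load from configuration in practice
ACCESS_TTL = 60 * 60                   # 1h
REFRESH_TTL = 7 * 24 * 60 * 60         # 7d


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(header: str, payload: str) -> bytes:
    return hmac.new(SECRET, f"{header}.{payload}".encode(), hashlib.sha256).digest()


def issue(user_id: str, kind: str, now: int) -> str:
    """Create a signed token of the given kind ("access" or "refresh")."""
    ttl = ACCESS_TTL if kind == "access" else REFRESH_TTL
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps({"sub": user_id, "kind": kind, "exp": now + ttl}).encode())
    return f"{header}.{payload}.{_b64(_sign(header, payload))}"


def verify(token: str, kind: str, now: int) -> Optional[str]:
    """Return the user id if the token is authentic, unexpired, and of the right kind."""
    try:
        header, payload, signature = token.split(".")
        if not hmac.compare_digest(_sign(header, payload), _unb64(signature)):
            return None
        claims = json.loads(_unb64(payload))
    except ValueError:
        return None
    if claims.get("kind") != kind or claims.get("exp", 0) <= now:
        return None
    return claims["sub"]


def refresh(refresh_token: str, now: int) -> Optional[str]:
    """Exchange a valid refresh token for a fresh access token."""
    user_id = verify(refresh_token, "refresh", now)
    return issue(user_id, "access", now) if user_id else None


access = issue("u1", "access", now=0)
refresh_token = issue("u1", "refresh", now=0)
renewed = refresh(refresh_token, now=ACCESS_TTL)   # access token has just expired
```

After one hour the access token no longer verifies, but the refresh token still yields a new one. After seven days `refresh` returns `None`, which corresponds to the user having to log in again.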

Endpoints Structure

Endpoint structure was organized this way inside ./api.

main.py — creates the app, includes routers
auth.py — /api/auth/* (login, callback, logout, refresh, me)
deps.py — shared functions used by multiple routers
services.py — /api/services/* (Sub-Iteration 5)
incidents.py — /api/incidents/* (Sub-Iteration 5)

The initial design listed only these two auth-related endpoints:

GET /api/auth/google
GET /api/auth/github

This is the updated OAuth endpoint structure in this sub-iteration:

GET /api/auth/github — redirects the user to GitHub’s login page
GET /api/auth/github/callback — GitHub redirects here with the code, exchanges it, sets the cookie, and redirects to the frontend
GET /api/auth/google — same pattern
GET /api/auth/google/callback — same pattern
POST /api/auth/logout — clears the cookie
POST /api/auth/refresh — exchanges the refresh token for a new access token (called automatically by the frontend when the access token expires)
GET /api/auth/me — returns current user info (the frontend needs this to know who is logged in)

Protected vs non-protected endpoints:

During this sub-iteration, I built the FastAPI dependency for protected routes, which resolves the current user from the access token.

Protected endpoints require a logged-in user. Adding Depends(get_current_user) to an endpoint makes it protected. Non-protected endpoints are accessible to anyone.

It works by reading the access_token cookie from the incoming request, decoding and verifying the JWT signature, looking up the user in the database, and passing the User object to the endpoint. If anything fails at any step (no cookie, invalid token, expired token, user not found), it rejects the request with a 401 Unauthorized response and the endpoint code never runs.
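The steps above can be sketched without the framework. In this illustration the token decoder and the user lookup are passed in as plain callables, and `Unauthorized`, `User`, and the parameter names are hypothetical. In FastAPI the same logic would sit behind `Depends(get_current_user)` and signal failure with a 401 response.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


class Unauthorized(Exception):
    """Stands in for a 401 Unauthorized response."""
    status_code = 401


@dataclass
class User:
    id: str
    email: str


def get_current_user(
    cookies: Mapping[str, str],
    decode_token: Callable[[str], Optional[str]],   # returns user id, or None if invalid/expired
    find_user: Callable[[str], Optional[User]],     # database lookup by id
) -> User:
    token = cookies.get("access_token")
    if token is None:
        raise Unauthorized("no cookie")
    user_id = decode_token(token)
    if user_id is None:
        raise Unauthorized("invalid or expired token")
    user = find_user(user_id)
    if user is None:
        raise Unauthorized("user not found")
    return user


# Stub collaborators for the example.
USERS = {"u1": User("u1", "me@example.com")}
decode = {"good": "u1", "ghost": "u404"}.get
current = get_current_user({"access_token": "good"}, decode, USERS.get)
```

Because every failure path raises before returning, the endpoint body only ever runs with a real `User` in hand.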

Non-protected:

/api/health — monitoring tools and health checks need to reach this without auth
/api/auth/github, /api/auth/github/callback — cannot require login on the login flow itself
/api/auth/google, /api/auth/google/callback — same reason
/api/auth/logout — intentionally unprotected so that even a user with an expired token can cleanly log out instead of getting a 401 error
/api/auth/refresh — needs to work when the access token has expired, which is the whole point of refreshing

Protected:

/api/auth/me — returns the current user’s info, only makes sense if someone is logged in
All service and incident endpoints (Sub-Iteration 5): they need user_id to scope queries, and without knowing who is calling there is no way to return the right data

Completed tasks under 4th sub-iteration:

Sub-Iteration 5: API Endpoints for Services and Incidents

In this sub-iteration, I added schemas.py for request and response models, and handled the services and incidents API endpoints with services.py and incidents.py.

api/schemas.py — Pydantic schemas defining the shape of every request body and every response. FastAPI uses them to validate incoming JSON before any handler runs and to strip outgoing responses to exactly the declared shape, which doubles as a security boundary against accidentally leaking internal fields.
api/services.py — Services router with the full CRUD surface (list, create, read, update, delete) plus discover/validate-url endpoints for the status-page flow. Every endpoint declares its request schema via the function parameter and its response schema via response_model, so the validation contract lives in the route signature itself.
api/incidents.py — Incidents router exposing a filtered listing endpoint: scoped to the authenticated user’s services and optionally narrowed by service_id and time range.

main.py was modified to register the new routers.

Endpoints delivered in this sub-iteration

Method	Endpoint	Purpose
GET	/api/services	List current user’s services
POST	/api/services	Add a service
GET	/api/services/{id}	Service detail with 24h/7d/30d uptime stats
PUT	/api/services/{id}	Update a service (owner only)
DELETE	/api/services/{id}	Remove a service (owner only)
GET	/api/services/{id}/checks	Check history with pagination
GET	/api/incidents	List incidents, filterable by service_id, type, status

Users can modify only their own data because all queries are scoped by user_id. I also tested every endpoint with a real JWT against the running containers.
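A minimal sketch of what scoping by user_id means at the query level, using Python's built-in SQLite so it runs standalone. The helper names are hypothetical. The point is that the owner's id is part of every WHERE clause, so a request for someone else's row simply matches nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE services (id TEXT PRIMARY KEY, user_id TEXT NOT NULL, name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO services VALUES (?, ?, ?)",
    [("s1", "alice", "Blog"), ("s2", "bob", "Shop")],
)


def list_services(conn: sqlite3.Connection, user_id: str) -> list:
    """Return only the ids of services owned by `user_id`."""
    rows = conn.execute(
        "SELECT id FROM services WHERE user_id = ? ORDER BY id", (user_id,)
    )
    return [row[0] for row in rows]


def rename_service(conn: sqlite3.Connection, user_id: str, service_id: str, name: str) -> bool:
    """Update a service only if it belongs to `user_id`; report whether a row changed."""
    cursor = conn.execute(
        "UPDATE services SET name = ? WHERE id = ? AND user_id = ?",
        (name, service_id, user_id),
    )
    return cursor.rowcount == 1


alice_sees = list_services(conn, "alice")
hijack_worked = rename_service(conn, "alice", "s2", "Hacked")   # bob's service
owner_worked = rename_service(conn, "bob", "s2", "Store")
```

An endpoint built this way can answer a failed update with 404, which avoids revealing whether the id exists for another user.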

All the PulseCheck endpoints I have created so far

Sub-Iteration	Method	Endpoint	Purpose
1	GET	/api/health	API health check
4	GET	/api/auth/github	Redirect to GitHub login
4	GET	/api/auth/github/callback	GitHub OAuth callback, sets JWT cookies
4	GET	/api/auth/google	Redirect to Google login
4	GET	/api/auth/google/callback	Google OAuth callback, sets JWT cookies
4	POST	/api/auth/logout	Clear auth cookies
4	POST	/api/auth/refresh	Exchange refresh token for new access token
4	GET	/api/auth/me	Return current user info
5	GET	/api/services	List current user’s services
5	POST	/api/services	Add a service
5	GET	/api/services/{id}	Service detail with 24h/7d/30d uptime stats
5	PUT	/api/services/{id}	Update a service (owner only)
5	DELETE	/api/services/{id}	Remove a service (owner only)
5	GET	/api/services/{id}/checks	Check history with pagination
5	GET	/api/incidents	List incidents, filterable by service_id, type, status

Completed tasks under 5th sub-iteration:

Pydantic schemas for request and response models
GET /api/services (list current user’s services)
POST /api/services (add a service, associated with current user)
PUT /api/services/{id} (update a service, only if owned by current user)
DELETE /api/services/{id} (remove a service, only if owned by current user)
GET /api/services/{id} (service detail with uptime stats for 24h, 7d, 30d)
GET /api/services/{id}/checks (check history for the service)
GET /api/incidents (list incidents across current user’s services, filterable by service, type, and status)
All queries scoped by user_id to enforce data isolation
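The 24h/7d/30d uptime stats could be computed along these lines. This is a sketch under an explicit assumption: a "degraded" check still counts as available, and only "down" counts against uptime. The journal does not state which definition the API actually uses.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

WINDOWS = {
    "24h": timedelta(hours=24),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}


def uptime_percent(checks: list, now: datetime, window: timedelta) -> Optional[float]:
    """Share of checks in the window that were not "down", as a percentage."""
    cutoff = now - window
    recent = [status for checked_at, status in checks if checked_at >= cutoff]
    if not recent:
        return None                      # no data in this window
    available = sum(1 for status in recent if status != "down")
    return round(100 * available / len(recent), 2)


now = datetime(2026, 1, 31, tzinfo=timezone.utc)
checks = [
    (now - timedelta(hours=1), "up"),
    (now - timedelta(hours=2), "down"),
    (now - timedelta(days=3), "up"),
    (now - timedelta(days=5), "degraded"),
    (now - timedelta(days=20), "down"),
]
stats = {label: uptime_percent(checks, now, window) for label, window in WINDOWS.items()}
```

In the real API this aggregation would more likely happen in SQL over the checks table, but the arithmetic is the same.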

Sub-Iteration 6: Worker Health Check Logic

In Sub-Iteration 6, I replaced the placeholder worker with real health check logic. The worker is a background process that continuously monitors all configured services, writing results to both PostgreSQL (for permanent history) and Redis (for fast dashboard reads). It handles the full lifecycle, detecting when services go down or become degraded, creating incidents on status transitions, tracking consecutive failures, and resolving incidents on recovery. This is the component that makes PulseCheck a real monitoring tool rather than just a CRUD app with a list of URLs.
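The incident lifecycle described here (open on failure, count consecutive failures, resolve on recovery) can be sketched as a small state holder. The class and field names are hypothetical apart from `checks_failed`, `started_at`, and `resolved_at`, which mirror the schema. How a degraded incident that turns into a full outage should be handled is not specified in the journal, so this sketch simply keeps counting on the open incident.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    type: str                          # "downtime" or "degraded"
    started_at: int
    resolved_at: Optional[int] = None  # None while ongoing
    checks_failed: int = 0             # consecutive non-"up" checks


class IncidentTracker:
    """Tracks incidents for one service from a stream of check statuses."""

    def __init__(self) -> None:
        self.open: Optional[Incident] = None
        self.history: list = []

    def record(self, status: str, at: int) -> None:
        if status == "up":
            if self.open is not None:          # recovery: resolve the open incident
                self.open.resolved_at = at
                self.open = None
        elif self.open is None:                # first failure: open a new incident
            kind = "downtime" if status == "down" else "degraded"
            self.open = Incident(type=kind, started_at=at, checks_failed=1)
            self.history.append(self.open)
        else:                                  # still failing: count it
            self.open.checks_failed += 1


tracker = IncidentTracker()
for at, status in enumerate(["up", "down", "down", "down", "up", "degraded"]):
    tracker.record(status, at)
```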

Completed tasks under 6th sub-iteration:

Sub-Iteration 7: Frontend Dashboard and Pages

This is where the UI side of PulseCheck comes together. I structured all the main pages here: login, dashboard, service detail, incidents, and the add/edit service form. I also wired up the auth context, protected routes, and the shared layout. Tailwind CSS v4 and React Router v7 are the main pieces on the frontend stack.

Completed tasks under 7th sub-iteration:

Sub-Iteration 8: Improve Monitoring Capabilities

This is an additional sub-iteration. I hit a wall in the previous one: I ran into several issues that I hadn’t anticipated at the beginning.

First, I wanted to monitor LLMs as well as my personal infrastructure. But that’s complex. I cannot just ping https://chatgpt.com or https://claude.ai and learn whether those models are up, down, or degraded. I would have to send synthetic requests to their API endpoints, and that costs tokens.
Second, even for normal HTTP pings, many websites have bot protection. Such websites often answer automated requests with a 403 status code, so simple HTTP pings cannot monitor them reliably.

Thus, I had to use some strategies to make this useful. Even though this project doubles as my permanent lab environment for DevOps practices, from the beginning, I wanted to build something useful alongside that goal.

In the 8th sub-iteration, I tried to solve some of those limitations.

I found something interesting. Most official status pages for different services use the same underlying mechanism or status page provider to host their status pages. I don’t have an exact idea where it originally comes from, but I found some publicly exposed endpoints related to those status pages that provide monitoring status information about those services.

This does not support every service’s status page, but most major ones support those API endpoints.

I added a new monitoring type called status page. With this monitoring type, I used those API endpoints to fetch monitoring information about services. Even the official status pages themselves are backed by this data. I can get all the updates immediately with zero waiting time, so I don’t have to probe LLMs and similar services just to figure out their status.

Common endpoints that I found:

/api/v2/summary.json
/api/v2/components.json
/api/v2/incidents.json

summary.json contains all the available information. We can jump into each section separately by calling the other endpoints individually.
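To make the idea concrete, here is a minimal Python sketch of how a status page check could consume summary.json. The field names (status.indicator, status.description, components[].status, incidents[].name) are assumptions about the payload shape these endpoints commonly return, and the function names and sample payload are made up for illustration. It is a simplified sketch, not the exact parser in the repository.

```python
import json
from urllib.request import urlopen


def fetch_summary(base_url: str, timeout: float = 10.0) -> dict:
    """Download <base_url>/api/v2/summary.json and decode it as JSON."""
    url = f"{base_url.rstrip('/')}/api/v2/summary.json"
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize(payload: dict) -> dict:
    """Reduce a summary payload to the fields a dashboard would display.

    All keys read here are assumed field names; missing keys fall back
    to neutral defaults instead of raising.
    """
    status = payload.get("status", {})
    components = payload.get("components", [])
    incidents = payload.get("incidents", [])
    return {
        "indicator": status.get("indicator", "unknown"),
        "description": status.get("description", ""),
        "degraded_components": [
            c.get("name", "")
            for c in components
            if c.get("status") != "operational"
        ],
        "open_incidents": [i.get("name", "") for i in incidents],
    }


# Hypothetical payload, trimmed to the assumed fields.
SAMPLE = {
    "status": {"indicator": "minor", "description": "Partially Degraded Service"},
    "components": [
        {"name": "API", "status": "degraded_performance"},
        {"name": "Website", "status": "operational"},
    ],
    "incidents": [{"name": "Elevated error rates"}],
}

result = summarize(SAMPLE)
```

In this sketch, fetch_summary does the network call and summarize stays a pure function, so the parsing logic can be tested without hitting any real status page.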

I used these endpoints to fully handle this part of the project. This is how I gave it actual value.

Now, for users’ own infrastructure, such as things hosted on self-managed servers (VPSes) and personal infrastructure, they can use HTTP pings. For third-party platforms that users rely on daily, such as Claude, ChatGPT, and the components related to those services, they can monitor them with the status page monitoring type using the public endpoints mentioned above.

It took me a considerable amount of time to research and finish this part because it involved a lot of work.

Completed tasks under 8th sub-iteration:

Sub-Iteration 9: Production-ready Docker Setup

The next iteration is about deploying this with Kind (local Kubernetes). That’s a production simulation that I handle locally. This sub-iteration was about getting things ready for that.

Before deploying, production Docker hardening is a must. So I wanted to create different stages in the Dockerfiles.

I used multi-stage Dockerfiles for /frontend, /api, and /worker. A multi-stage build is a Docker technique where a single Dockerfile defines multiple isolated build stages, so the final image only includes what’s necessary for production.

Changed files:

./frontend/Dockerfile

# Dev stage — Vite dev server with hot reload
FROM node:22-alpine AS dev
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
EXPOSE 3000
CMD ["npm", "run", "dev"]

# Builder stage — produces dist/ for production
FROM node:22-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Prod stage — Nginx serving static files
FROM nginx:alpine AS prod
RUN apk add --no-cache curl
COPY --from=builder /app/dist /usr/share/nginx/html
COPY nginx.conf /etc/nginx/conf.d/default.conf
EXPOSE 80
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -f http://localhost/health || exit 1
CMD ["nginx", "-g", "daemon off;"]

./api/Dockerfile

# Builder stage — installs Python dependencies into an isolated user prefix
FROM python:3.12-slim AS builder
WORKDIR /app
COPY api/requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# Dev stage — full image with reload, used by docker-compose.yml
FROM python:3.12-slim AS dev
RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*
COPY --from=builder /root/.local /root/.local
ENV PATH=/root/.local/bin:$PATH \
    PYTHONUNBUFFERED=1
WORKDIR /app
COPY db/ ./db/
COPY parsers/ ./parsers/
COPY api/ .
RUN chmod +x entrypoint.sh
EXPOSE 8000
CMD ["./entrypoint.sh", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]

# Prod stage — same runtime, no reload
FROM python:3.12-slim AS prod
RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*
COPY --from=builder /root/.local /root/.local
ENV PATH=/root/.local/bin:$PATH \
    PYTHONUNBUFFERED=1
WORKDIR /app
COPY db/ ./db/
COPY parsers/ ./parsers/
COPY api/ .
RUN chmod +x entrypoint.sh
EXPOSE 8000
CMD ["./entrypoint.sh", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

./worker/Dockerfile

# Builder stage — installs Python dependencies into an isolated user prefix
FROM python:3.12-slim AS builder
WORKDIR /app
COPY worker/requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# Dev stage — used by docker-compose.yml with source mounts for hot reload
FROM python:3.12-slim AS dev
COPY --from=builder /root/.local /root/.local
ENV PATH=/root/.local/bin:$PATH \
    PYTHONUNBUFFERED=1
WORKDIR /app
COPY db/ ./db/
COPY parsers/ ./parsers/
COPY worker/ .
CMD ["python", "main.py"]

# Prod stage — same runtime, immutable image
FROM python:3.12-slim AS prod
COPY --from=builder /root/.local /root/.local
ENV PATH=/root/.local/bin:$PATH \
    PYTHONUNBUFFERED=1
WORKDIR /app
COPY db/ ./db/
COPY parsers/ ./parsers/
COPY worker/ .
CMD ["python", "main.py"]

docker-compose.yml

services:
  postgres:
    image: postgres:16-alpine
    container_name: pulsecheck-postgres
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DB}
    volumes:
      - postgres-data:/var/lib/postgresql/data
    networks:
      - data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    container_name: pulsecheck-redis
    networks:
      - data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  api:
    build:
      context: .
      dockerfile: api/Dockerfile
      target: dev
    container_name: pulsecheck-api
    ports:
      - "8000:8000"
    volumes:
      - ./api:/app
      - ./db:/app/db
      - ./parsers:/app/parsers
    environment:
      DATABASE_URL: ${DATABASE_URL}
      REDIS_URL: ${REDIS_URL}
      SECRET_KEY: ${SECRET_KEY}
      GOOGLE_CLIENT_ID: ${GOOGLE_CLIENT_ID}
      GOOGLE_CLIENT_SECRET: ${GOOGLE_CLIENT_SECRET}
      GITHUB_CLIENT_ID: ${GITHUB_CLIENT_ID}
      GITHUB_CLIENT_SECRET: ${GITHUB_CLIENT_SECRET}
      OAUTH_REDIRECT_BASE_URL: ${OAUTH_REDIRECT_BASE_URL}
      FRONTEND_URL: ${FRONTEND_URL}
    networks:
      - web
      - data
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/api/health"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 10s
    restart: unless-stopped

  worker:
    build:
      context: .
      dockerfile: worker/Dockerfile
      target: dev
    container_name: pulsecheck-worker
    volumes:
      - ./worker:/app
      - ./db:/app/db
      - ./parsers:/app/parsers
    environment:
      DATABASE_URL: ${DATABASE_URL}
      REDIS_URL: ${REDIS_URL}
    networks:
      - data
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    restart: unless-stopped

  frontend:
    build:
      context: ./frontend
      dockerfile: Dockerfile
      target: dev
    container_name: pulsecheck-frontend
    ports:
      - "3000:3000"
    volumes:
      - ./frontend/src:/app/src
    networks:
      - web
    depends_on:
      api:
        condition: service_healthy
    restart: unless-stopped

volumes:
  postgres-data:

networks:
  web:
    driver: bridge
  data:
    driver: bridge

docker-compose.prod.yml

services:
  api:
    build:
      target: prod
    ports: !reset []
    volumes: !reset []

  worker:
    build:
      target: prod
    volumes: !reset []

  frontend:
    build:
      target: prod
    ports: !override
      - "80:80"
    volumes: !reset []
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

Completed tasks under 9th sub-iteration:

Multi-stage Dockerfile for frontend (build stage producing static files, runtime stage serving via Nginx)
Multi-stage Dockerfile for api producing a smaller runtime image without build tools
Multi-stage Dockerfile for worker matching the api pattern
frontend/nginx.conf with /api reverse proxy to the api container and a /health endpoint
Health check added for frontend (Nginx based)
api/entrypoint.sh updated to exec passed args so dev and prod targets can pass different uvicorn flags
docker-compose.prod.yml override using !reset and !override tags so dev mounts and ports are actually cleared in prod
Verified production stack runs locally via docker compose -f docker-compose.yml -f docker-compose.prod.yml up

Sub-Iteration 10: Local Kubernetes Deployment via kind (multi-node showcase)

This is an important stage. From the beginning until the 9th sub-iteration, I kept saying that my idea was to use Kind as the local production layer, where I would test the production deployment locally before PulseCheck gets deployed to the real production environment.

But as I worked through the 10th sub-iteration, I started to feel like I was going a bit off track with that idea. The real production environment is supposed to run on a single-node cluster. But here, with Kind, I was practicing on a four-node cluster.

I was playing with a more advanced setup with HPA, multi-node scheduling, and a lot more than the real production environment. Real production has a different shape because it is a single-node cluster, and that is fundamentally different from a multi-node cluster.

I had to change my plan immediately. It became obvious that this sub-iteration with kind did not fit as the local production testing layer. A local production testing layer should be something much closer to the real production environment. Instead, I decided to make this the experimental layer.

I decided to deploy PulseCheck with multiple layers. I develop PulseCheck with Docker Compose. That is the development layer.

Then there are multiple cluster environments ahead.

The Kind cluster is the experimental layer, which I mostly use for lab purposes. I run a local production layer on a single-node cluster to test application deployments before PulseCheck gets deployed to the real production environment with K3s. For that local production layer, I use K3d, which is the containerized version of K3s.

Finally, the real production layer runs on a single-node K3s cluster hosted on a VPS.

I programmed PulseCheck to run on all three Kubernetes environments (Kind, K3d, and K3s) simultaneously without any conflicts. It has separate GitHub and Google OAuth applications for each layer, including the Docker Compose development layer.

GitHub OAuth Applications

So from this point forward, both the 10th and 11th sub-iterations are about the Kind layer. This is where I talk about how I handled and experimented with a multi-node Kubernetes cluster locally.

Up to this point, I was using Docker Compose. This is where PulseCheck starts running on a Kind cluster with Helm charts.

One thing worth mentioning before jumping into the cluster itself is that I use the same Helm chart across all Kubernetes layers. Kind, K3d, and K3s all deploy from the same chart. The environment is swapped through values files, not by maintaining separate templates for each layer.

That was an important design decision because it means the artifact I test in Kind and K3d is the same artifact that eventually lands in production. Multi-node-specific features, production settings, hostnames, TLS configuration, replica counts, and similar things are controlled through values files. The chart itself remains the same.

Why do I keep a separate experimental multi-node Kind cluster instead of deploying the real production environment as a multi-node cluster?

I simply do not have the resources to run a multi-node cluster in production. But with Kind, I can simulate nodes locally because Kind uses Docker containers as Kubernetes nodes. That allows me to practice handling a multi-node cluster and experiment with things that would otherwise require multiple real machines.

Here, I built a 4-node cluster with Kind.

cluster nodes

The control-plane is the brain of the cluster. I limited it only to its essential components. That’s how it is supposed to be anyway. Additionally, there are 3 worker nodes. One is dedicated to the Ingress controller, and the other two are for app pods.

Previously, I had handled the Ingress controller inside the control-plane. But then I thought about the security implications and other disadvantages that could come with that, so I moved it to a dedicated worker node. That became the front-facing node, and it routes all the traffic.

That’s how I ended up with 3 worker nodes. Otherwise, I would have had only 2 worker nodes, and the Ingress controller would have shared a node with the control plane.

I used ingress-nginx inside the worker node that I dedicated to the Ingress controller. It is standard, well-documented, and has great community support. That’s why I selected it.

I could have selected Traefik as well. That’s what comes by default with K3s, which is the Kubernetes environment that I’m going to set up on the VPS. But ingress-nginx was my preferred choice. Maybe this could change later, but currently that’s what I’m using, and that’s what I’m planning to use in production as well.

While I was testing the cluster with Kind, the most memorable rabbit hole was a silent cluster bring-up failure. I had labelled the edge worker node-role.kubernetes.io/edge= because that’s the convention kubectl get nodes uses for the ROLES column. Kind happily started all four containers, but kubectl get nodes showed only three. Digging into journalctl -u kubelet inside the failing container revealed that the kubelet was crash-looping with a label-validation error: the NodeRestriction admission plugin (enabled by default in kubeadm clusters) refuses to let kubelets self-assign labels in the kubernetes.io/* namespace. That restriction exists for a good reason: a compromised worker shouldn’t be able to relabel itself as a control-plane node and attract sensitive workloads.

Lesson: Kubernetes defaults are security-conservative by design, and the right move is to work with them. I switched to a custom role=edge key for both the label and taint, which matched the project’s “use restricted-namespace features only when justified” stance.
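For reference, the rule that tripped me up can be approximated in a few lines of Python. This is a simplified sketch based on my reading of the Kubernetes documentation for kubelet node labels, not the kubelet’s actual validation code. The function name is made up, and the allowed prefixes and keys are reproduced from the docs as I understand them.

```python
# Domains whose label namespaces are restricted for kubelet self-labelling.
RESTRICTED_DOMAINS = ("kubernetes.io", "k8s.io")

# Namespaces the kubelet may still use inside the restricted domains.
ALLOWED_PREFIXES = ("kubelet.kubernetes.io", "node.kubernetes.io")

# Individual keys that are explicitly allowed.
ALLOWED_KEYS = {
    "beta.kubernetes.io/arch",
    "beta.kubernetes.io/instance-type",
    "beta.kubernetes.io/os",
    "failure-domain.beta.kubernetes.io/region",
    "failure-domain.beta.kubernetes.io/zone",
    "kubernetes.io/arch",
    "kubernetes.io/hostname",
    "kubernetes.io/os",
    "node.kubernetes.io/instance-type",
    "topology.kubernetes.io/region",
    "topology.kubernetes.io/zone",
}


def _in_domain(namespace: str, domain: str) -> bool:
    """True if namespace equals domain or is a subdomain of it."""
    return namespace == domain or namespace.endswith("." + domain)


def kubelet_may_set(label_key: str) -> bool:
    """Approximate whether a kubelet may self-assign this label key."""
    namespace, sep, _name = label_key.rpartition("/")
    if not sep:
        return True  # un-namespaced keys such as "role" are unrestricted
    if not any(_in_domain(namespace, d) for d in RESTRICTED_DOMAINS):
        return True  # custom domains are unrestricted
    if label_key in ALLOWED_KEYS:
        return True
    return any(_in_domain(namespace, p) for p in ALLOWED_PREFIXES)


rejected = kubelet_may_set("node-role.kubernetes.io/edge")  # the label I tried first
accepted = kubelet_may_set("role")  # the custom key I switched to
```

Under these assumptions, node-role.kubernetes.io/edge falls inside the restricted kubernetes.io domain without matching any allowed prefix or key, while the plain role key is not restricted at all.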

This is worth remembering. CoreDNS landed on the control plane by default; I patched it onto the app workers with hostname anti-affinity so a single node failure can’t take out cluster DNS. NetworkPolicies were rewritten from “open by default with restrictions” to “default-deny everything, then allow exactly what’s needed” (frontend <-> api, api <-> postgres+redis, worker <-> postgres+redis+external HTTP/HTTPS). Pod security got the full treatment: non-root, drop-all-capabilities, no-privilege-escalation, seccomp: RuntimeDefault, and a read-only root filesystem with explicit emptyDir for the few writable paths. Making this work required retrofitting the Dockerfiles to add a non-root app user and switching the frontend to nginx-unprivileged on port 8080, which is one of those “only became necessary once Kubernetes started enforcing it” changes that I’d otherwise have skipped.

The other important thing to mention here is that I changed the Dockerfiles for api, worker, and frontend in this sub-iteration so the containers no longer run as root. The last time I had changed those Dockerfiles was in the 9th sub-iteration.

Completed tasks under 10th sub-iteration:

Sub-Iteration 11: Local Kubernetes Operations with kind (multi-node showcase)

This sub-iteration is the second part of the previous sub-iteration. Here, I played with some operational configurations of the multi-node cluster as if I were preparing it for production, even though the real production environment for PulseCheck is a single-node cluster.

It’s hypothetical. If I were to deploy PulseCheck in this multi-node cluster in production, I would have handled it this way.

I fixed a couple of issues here.

I got hit by a race-condition issue. On a fresh Helm install, the api and worker pods showed 2-4 restarts during cold start. Frontend, postgres, and redis didn’t.

The reason is that helm install submits every pod to Kubernetes at once, and they all start in parallel. Postgres needs about 10-15 seconds to finish initialization and start accepting connections on port 5432.

During that window, the api and worker (which both open database connections at startup) try to connect, get connection refused, raise an exception, and crash. Kubernetes restarts the crashed container; if postgres still is not ready, the new container crashes too.

The loop continues for two to four cycles until postgres becomes ready. From that point onward, the pods run fine, but the restart count stays as a permanent record of the rough startup.

Frontend doesn’t have this problem because Nginx serves static files and never connects to the database. Postgres and redis don’t have this problem because they are not waiting on anything.

[Screenshots: restarting pods; error logs in the API pod; error logs in the worker pod]

As the fix, I added init containers to the api and worker pods. An init container is a container that Kubernetes runs before the main container, and the main container is not started until every init container has exited with status 0. That’s exactly the gating behavior this race condition needed.

Each of the two affected pods got two init containers:

wait-for-postgres runs until pg_isready -h postgres ...; do sleep 2; done using the postgres image
wait-for-redis runs the same loop pattern with redis-cli -h redis ping using the redis image

By the time the main api or worker container starts, both pg_isready and redis-cli ping have already returned successfully, so the application’s first DB connection succeeds immediately and the pod never crashes.
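The gating idea behind those init containers is just "retry until the dependency answers, then exit 0". Here is a minimal Python sketch of that loop using a plain TCP connect. It is only an illustration of the pattern: the real init containers use pg_isready and redis-cli ping, which check more than an open port, and the function name and parameters are hypothetical.

```python
import socket
import time


def wait_for_port(host: str, port: int, interval: float = 2.0,
                  timeout: float = 60.0) -> bool:
    """Poll host:port until a TCP connection succeeds or timeout expires.

    Returns True as soon as one connection attempt succeeds and False if
    the deadline passes first. An init container would exit 0 in the
    first case and non-zero in the second.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)  # same role as "sleep 2" in the shell loop
```

A call such as wait_for_port("postgres", 5432) would block the caller until the database port accepts connections, which is the same ordering guarantee the init containers give the main container.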

Frontend, postgres, and redis don’t get init containers because they don’t have startup dependencies.

The wait-for-redis container on the api and worker is defensive. Redis comes up in about two seconds, so the race rarely happens for redis in practice, but keeping the same gating pattern for both dependencies makes things consistent.

Init containers inherit the pod’s existing non-root security context, so no special hardening was needed.

After the fix, a fresh make cluster-down && make cluster-up shows every pod at RESTARTS: 0 on the first try.

After the fix:

[Screenshot: output after the fix]

I fixed the uneven pod spread across the two worker nodes that I dedicated for app pods.

I have two worker nodes dedicated for app pods, but the app pods had not been spread evenly across the workers. I had 5 pods in total. One worker node had 4 pods, while the other worker node had only 1 pod.

I fixed that with topologySpreadConstraints, and I added maxSkew: 1.

The trick was using a shared labelSelector across all my chart’s pods, because the Kubernetes scheduler only spreads replicas within the same ReplicaSet by default. With the shared selector, the constraint applies across all my workloads together:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/instance: {{ .Release.Name }}

I added this to all five workload templates. After rolling the deployments, the 4-1 split became 2-3 (postgres stays pinned to its node by its PVC, but the other four pods balanced across both app workers).
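The skew arithmetic is easy to check by hand. As a small Python sketch, skew here means the difference between the busiest and the emptiest node among the pods matched by the selector. The node names are placeholders and the helper names are made up.

```python
def max_skew(pods_per_node: dict) -> int:
    """Difference between the most and least loaded topology domains."""
    counts = list(pods_per_node.values())
    return max(counts) - min(counts)


def satisfies(pods_per_node: dict, allowed_skew: int = 1) -> bool:
    """True if the distribution respects the given maxSkew."""
    return max_skew(pods_per_node) <= allowed_skew


# Hypothetical node names; the pod counts are the ones from this cluster.
before = {"app-worker-1": 4, "app-worker-2": 1}
after = {"app-worker-1": 2, "app-worker-2": 3}

skew_before = max_skew(before)  # 4 - 1 = 3, violates maxSkew: 1
skew_after = max_skew(after)    # 3 - 2 = 1, within maxSkew: 1
```

Because I set whenUnsatisfiable: ScheduleAnyway, the scheduler treats the constraint as a preference rather than a hard rule, so a pod can still be scheduled even when a perfectly even spread is not possible.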

The other major thing that I handled in this sub-iteration was the worker liveness/readiness probe. The other four containers already had liveness and readiness probes from sub-iteration 10:

api — HTTP probe on /api/health
frontend — HTTP probe on /health (nginx endpoint)
postgres — exec probe running pg_isready
redis — exec probe running redis-cli ping

The worker was the only one left without probes because the worker didn’t have an HTTP endpoint to probe. The other four containers all had a natural surface to probe.

I had a few ways to achieve this.

I could have added an in-process HTTP server just to give probes something to hit. But I didn’t want the complexity of adding new code, a new port, and everything that comes with them.

I went with the simpler answer: a heartbeat file.

The worker code defines HEARTBEAT_FILE = Path("/tmp/healthy") and calls HEARTBEAT_FILE.touch() once before the loop starts and again at the end of each tick. That updates the file’s mtime every ~10 seconds (TICK_INTERVAL).

The probe on the container side runs:

[ $(($(date +%s) - $(stat -c %Y /tmp/healthy 2>/dev/null || echo 0))) -lt N ]

which is just “current epoch minus the file’s mtime epoch, is that less than N seconds.”

Liveness uses N=60, and readiness uses N=30.

If the worker is alive and looping, the file is always fresh, and the probe exits 0. If the loop hangs or the file stops updating, the diff grows past the threshold, the probe exits non-zero, and Kubernetes restarts the container (liveness) or marks it not-ready (readiness).
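Both halves of the heartbeat pattern fit in a short Python sketch. HEARTBEAT_FILE and TICK_INTERVAL come from the worker as described above. The beat and is_fresh helpers are illustrative names, and is_fresh is a Python rendering of the shell freshness check for explanation only. The real probe runs the shell one-liner.

```python
import time
from pathlib import Path
from typing import Optional

HEARTBEAT_FILE = Path("/tmp/healthy")
TICK_INTERVAL = 10  # seconds between worker ticks


def beat(path: Path = HEARTBEAT_FILE) -> None:
    """Update the heartbeat file's mtime (creating the file if needed)."""
    path.touch()


def is_fresh(path: Path, max_age_seconds: int,
             now: Optional[float] = None) -> bool:
    """Mirror the probe: current time minus mtime is below the threshold.

    A missing file is treated as mtime 0, like the `|| echo 0` fallback
    in the shell version, so it always counts as stale.
    """
    try:
        mtime = path.stat().st_mtime
    except FileNotFoundError:
        mtime = 0.0
    current = time.time() if now is None else now
    return (current - mtime) < max_age_seconds
```

With N=60 for liveness and N=30 for readiness, a worker ticking every ~10 seconds can miss a couple of ticks before readiness fails and several more before liveness restarts it.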

/tmp is already an emptyDir mount because of readOnlyRootFilesystem: true, so no extra volume work was needed.

So, three lines of Python, twelve lines of YAML, and no new server.

The HPA work (handling horizontal pod autoscaling) was also one of the interesting parts that I had within this sub-iteration. Kind doesn’t come with metrics-server. Without metrics, there is no way horizontal pod autoscaling would work. It’s like trying to handle HPA without the trigger.

metrics-server collects CPU and memory usage from the nodes and exposes it through the Kubernetes metrics API. The HPA controller reads those numbers and decides when to scale. So I had to add a Helm install step for metrics-server into the cluster bootstrap before I could write the HPA template.

Then I added the HPA, generated synthetic load against /api/health with ab, and tested replicas scaling from one to two to three in real time. They also got spread across both app workers thanks to the topology constraint.

After the load stopped, the five-minute stabilization window held, then the HPA scaled down one pod per minute exactly as configured.
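The scaling decision itself follows the formula from the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a small tolerance of 1. Below is a minimal Python sketch of that calculation. The 70% CPU target and the 1 to 3 replica bounds are example values for illustration, not necessarily what my HPA template uses.

```python
import math


def desired_replicas(current: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 3,
                     tolerance: float = 0.1) -> int:
    """Apply the documented HPA formula, then clamp to the replica bounds.

    The stabilization window and scale-down policy are not modeled here;
    they only delay when a lower result is actually applied.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target, no scaling
    desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Example values: average CPU utilization in percent against a 70% target.
scale_up_from_one = desired_replicas(1, 140, 70)   # ceil(1 * 2.0)
scale_up_from_two = desired_replicas(2, 100, 70)   # ceil(2 * 1.43)
scale_down_idle = desired_replicas(3, 10, 70)      # ceil(3 * 0.14), clamped to min
```

The formula alone would drop an idle deployment straight to the minimum. The five-minute stabilization window and the one-pod-per-minute policy are what turn that into the gradual scale-down I observed.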

So it worked out in this multi-node Kind cluster. But in the real single-node production environment, HPA will essentially never fire because traffic is tiny and there is only one node. So it is mostly here for the showcase, as I mentioned.

I usually like to automate things. I wrote some Makefile automation to cover most of the repetitive things in the cluster, which made things easier.

This Makefile is intentionally scoped only to this multi-node Kind cluster. I will create separate automation for the local production cluster with K3d and the real production cluster with K3s in sub-iteration 12 under Makefile.prod.

I also completed some other minor things in this sub-iteration.

Completed tasks under 11th sub-iteration:

Sub-Iteration 12: VPS Deployment with K3s

This was the biggest sub-iteration so far. This is where PulseCheck made the jump from local Kubernetes environments to a real production deployment on a VPS. That meant preparing the Helm charts for production (values-production.yaml for the VPS and a smaller values-production-local.yaml override for K3d testing), hardening the VPS, installing K3s, layering origin shielding through Cloudflare and the provider firewall, setting up SSH-tunneled kubectl, refactoring the repository into helm/ and k8s/ directories by command, bootstrapping cluster infrastructure (ingress-nginx, cert-manager, and Let’s Encrypt), deploying the application, and verifying HTTPS end to end.

A couple of specific notes.

On the K3s install, I disabled Traefik (ingress-nginx handles routing) and ServiceLB (there is no need for LoadBalancer Services on a single bare-metal node), and opened the kubeconfig for non-root read access:

curl -sfL https://get.k3s.io | sh -s - \
  --disable=traefik \
  --disable=servicelb \
  --write-kubeconfig-mode=644

I also pointed pulsecheck.com at the loopback in /etc/hosts so Google OAuth accepts it (.local is reserved for mDNS and gets rejected). This was also the first sub-iteration where I grouped the completed tasks into sections.

Completed tasks under 12th sub-iteration:

Pre-VPS preparation

Setting up VPS and K3s install

Inbound firewall configured to allow only 22 (SSH), 80 (HTTP), 443 (HTTPS), and ICMP from the public internet; all other inbound dropped at the network boundary
K3s installed via the official installer (curl -sfL https://get.k3s.io | sh -) with --disable=traefik --disable=servicelb --write-kubeconfig-mode=644. Traefik off so ingress-nginx handles routing; servicelb off because LoadBalancer Services are unused on bare-metal single node; kubeconfig mode 644 for non-root read

Origin shielding (Cloudflare proxy + VPS provider firewall)

DNS A record for pulsecheck.kavindujayarathne.com pointed at the VPS and proxied through Cloudflare (orange cloud), so DNS resolves to Cloudflare edge IPs rather than the origin
VPS provider firewall inbound rules for TCP 80 and TCP 443 narrowed to Cloudflare’s published IPv4 + IPv6 ranges (15 v4 + 7 v6 CIDRs from https://www.cloudflare.com/ips-v4 and ips-v6)
Cloudflare SSL/TLS mode initially set to Flexible during cert-manager bootstrap, then switched to Full (strict) once the Let’s Encrypt cert was issued. Always Use HTTPS enabled afterward
Origin shielding verified: direct nc -zv -w 10 <vps-ip> 443 from a non-Cloudflare source times out, while the same probe via Cloudflare succeeds
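The firewall rule above boils down to a CIDR membership test on the source address. Here is a small Python sketch of that check using the standard ipaddress module. The two ranges are only a sample I believe appear on Cloudflare’s published lists, and the URLs above remain the authoritative source. The function name is made up.

```python
import ipaddress

# Sample subset for illustration only; load the full, current lists from
# Cloudflare's published ips-v4 and ips-v6 pages in real use.
SAMPLE_ALLOWED_CIDRS = ["173.245.48.0/20", "2400:cb00::/32"]


def is_allowed(source_ip: str, cidrs: list) -> bool:
    """True if source_ip falls inside any of the allowed CIDR ranges.

    Membership tests between IPv4 and IPv6 objects simply return False,
    so a mixed list of v4 and v6 ranges is safe to iterate.
    """
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


from_edge = is_allowed("173.245.48.10", SAMPLE_ALLOWED_CIDRS)
from_elsewhere = is_allowed("8.8.8.8", SAMPLE_ALLOWED_CIDRS)
```

This is the logic the provider firewall applies to TCP 80 and 443: traffic from a Cloudflare edge address passes, and everything else is dropped before it reaches the VPS.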

Remote cluster access

Set up SSH local-port-forward tunnel for laptop kubectl access, forwarding local port 16443 over SSH to the cluster API server inside the VPS
Fetched K3s kubeconfig (/etc/rancher/k3s/k3s.yaml) via scp to the laptop as ~/.kube/pulsecheck-prod.yaml (chmod 600), with the server URL rewritten to https://127.0.0.1:16443
Kept the prod kubeconfig separate from ~/.kube/config; reaching prod requires an explicit KUBECONFIG=~/.kube/pulsecheck-prod.yaml
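The server URL rewrite in the kubeconfig is a one-line substitution. As a rough Python sketch, assuming the fetched file contains a single server: line pointing at the default K3s API address, it could look like this. The helper name and the trimmed sample text are hypothetical.

```python
import re


def rewrite_server(kubeconfig_text: str, new_server: str) -> str:
    """Replace the value of every `server:` line with new_server."""
    pattern = re.compile(r"(?m)^([ \t]*server:[ \t]*)\S+[ \t]*$")
    return pattern.sub(lambda m: m.group(1) + new_server, kubeconfig_text)


# Trimmed, hypothetical excerpt of a kubeconfig as fetched from the VPS.
SAMPLE = (
    "clusters:\n"
    "- cluster:\n"
    "    server: https://127.0.0.1:6443\n"
    "  name: default\n"
)

rewritten = rewrite_server(SAMPLE, "https://127.0.0.1:16443")
```

After the rewrite, kubectl talks to local port 16443, and the SSH tunnel forwards that traffic to the API server inside the VPS.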

Repo structure refactor

Adopted rule that helm/ holds anything consumed by helm install -f and k8s/ holds cluster-creation tool configs plus raw manifests applied via kubectl apply -f (split by command, not by topic)
Moved k8s/ingress-nginx-values.yaml to helm/ingress-nginx/values-kind.yaml via git mv
Created per-chart subdirectories under helm/ (helm/ingress-nginx/, helm/cert-manager/) so each upstream chart’s values stay isolated from the user-authored helm/pulsecheck/ chart package
Updated path references in Makefile (INGRESS_VALUES) and README.md (Project Structure section)

Cluster infrastructure (bootstrap)

Created helm/ingress-nginx/values-production.yaml: hostPort 80/443 binding (no LoadBalancer on bare-metal single node), CF-Connecting-IP trust via real-ip-header + proxy-real-ip-cidr populated with the same Cloudflare CIDRs the VPS provider firewall uses, single replica, resource limits
Created helm/cert-manager/values.yaml: crds.enabled: true so the chart installs the CRDs, single replica per component (controller, webhook, cainjector, startupapicheck), resource limits
Created k8s/cert-manager/clusterissuer.yaml with two ClusterIssuers (letsencrypt-staging and letsencrypt-prod) using the HTTP-01 solver via the nginx ingress class
Added bootstrap-prod target in Makefile.prod that runs the one-time cluster infrastructure install (helm install ingress-nginx + helm install cert-manager + kubectl apply ClusterIssuers); safe to re-run via helm upgrade --install
Bootstrap executed against the VPS K3s cluster via make -f Makefile.prod bootstrap-prod; all components reached Ready

App deployment

Added --platform linux/amd64 to the build-images target in Makefile.prod so GHCR images match the VPS architecture (local-prod-build left untouched since it runs on the laptop)
App deployed against the VPS K3s cluster via make -f Makefile.prod deploy-prod; all pods reached Running 1/1

End-to-end TLS

Switched Cloudflare SSL/TLS mode from Flexible to Full (strict)
Enabled Cloudflare Always Use HTTPS
Verified end-to-end via curl -v https://pulsecheck.kavindujayarathne.com: HTTP/2 200, response body is the pulsecheck SPA HTML
Verified origin cert via openssl s_client -connect localhost:443 -servername pulsecheck.kavindujayarathne.com: issuer Let’s Encrypt, subject CN matches the production hostname