Problem
The payment processing logic lived alongside Rails controllers, background jobs, webhooks, and admin tooling in a single process. At peak load, a slow ActiveRecord query (often the ledger reconciliation job) would exhaust the Puma thread pool and stall payment authorisations. The on-call rotation was being woken at least twice a week. With Black Friday eight weeks out, the pressure was acute.
Attempts to fix it in-place had produced a tangle of conditional logic and feature flags. The codebase had grown defensive rather than clean.
Solution
- Mapped all code paths touching payment state using automated dependency analysis and manual tracing.
- Defined the bounded context for a standalone payment service: authorise, capture, refund, void — nothing else.
- Built a Go microservice for this context. Go's concurrency model (goroutines) gave us natural fan-out for parallel payment provider calls.
- Implemented a strangler fig proxy: the Rails app routed payment requests to the new service via gRPC, with a feature flag controlling the cutover per-endpoint.
- Established dual-write to both the old and new ledger tables during the migration window, with a reconciliation job running automated checks.
- Rolled out to 1%, 5%, 25%, 100% of traffic over two weeks. Rolled back twice to fix edge cases found at 5% before proceeding.
Impact
- Payment API throughput increased 3× on the same infrastructure.
- On-call payment incidents dropped 40% in the first month post-cutover.
- Black Friday peak was handled without incident — the team reported it as the first uneventful peak in three years.
- The Go service became the template for two subsequent extractions the team undertook independently after the engagement ended.