Case studies

What the work actually looks like.

Four engagements — each with a real problem, an honest account of the approach, and quantified outcomes. Clients are anonymised where required by NDA.

Fintech · Payment Infrastructure

Zero-downtime payment core extraction ahead of Black Friday

A Series B fintech had a Rails monolith handling all payment processing. As transaction volume grew, a single slow database query could stall the entire application. The engineering team knew it needed to change, but couldn't afford downtime — ever.

Ruby on Rails · Go · PostgreSQL · Redis · gRPC · 6-week engagement
3×
Throughput improvement
40%
Reduction in payment incidents
0
Minutes of planned downtime
P99 <80ms
API latency post-migration

Problem

The payment processing logic lived alongside Rails controllers, background jobs, webhooks, and admin tooling in a single process. At peak load, a slow ActiveRecord query (often the ledger reconciliation job) would exhaust the Puma thread pool and stall payment authorisations. The on-call rotation was being woken at least twice a week. With Black Friday eight weeks out, the pressure was acute.

Attempts to fix it in-place had produced a tangle of conditional logic and feature flags. The codebase had grown defensive rather than clean.

Solution

  • Mapped all code paths touching payment state using automated dependency analysis and manual tracing.
  • Defined the bounded context for a standalone payment service: authorise, capture, refund, void — nothing else.
  • Built a Go microservice for this context. Go's concurrency model (goroutines) gave us natural fan-out for parallel payment provider calls.
  • Implemented a strangler fig proxy: the Rails app routed payment requests to the new service via gRPC, with a feature flag controlling the cutover per-endpoint.
  • Established dual-write to both the old and new ledger tables during the migration window, with a reconciliation job running automated checks.
  • Rolled out to 1%, 5%, 25%, 100% of traffic over two weeks. Rolled back twice to fix edge cases found at 5% before proceeding.
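The per-endpoint, percentage-based cutover above can be sketched as follows. This is a minimal illustration with hypothetical names (the real routing lived in the Rails proxy layer): a request is sent to the new service when its stable hash bucket falls below the endpoint's rollout percentage, so a given payment always takes the same path as the ramp moves from 1% to 100%.

```typescript
type Endpoint = "authorise" | "capture" | "refund" | "void";

// Hypothetical rollout table: percent of traffic on the new Go service.
const rollout: Record<Endpoint, number> = {
  authorise: 5,
  capture: 5,
  refund: 1,
  void: 0,
};

// FNV-1a: a cheap, deterministic string hash. Determinism matters —
// the same payment ID must land in the same bucket on every request.
function bucket(id: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < id.length; i++) {
    h ^= id.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) % 100; // 0..99
}

function useNewService(endpoint: Endpoint, paymentId: string): boolean {
  return bucket(paymentId) < rollout[endpoint];
}
```

Because bucketing is deterministic, raising a percentage only adds traffic to the new path; no request flaps between implementations mid-ramp.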

Impact

  • Payment API throughput increased 3× on the same infrastructure.
  • On-call payment incidents dropped 40% in the first month post-cutover.
  • Black Friday peak was handled without incident — the team reported it as the first uneventful peak in three years.
  • The Go service became the template for two subsequent extractions the team undertook independently after the engagement ended.

SaaS · Real-Time Analytics

From hourly batch jobs to a sub-second event stream

A B2B SaaS platform served dashboard analytics to customers from hourly batch reports. Customers complained that the data felt stale. Sales was losing deals to competitors with real-time dashboards. The engineering team had been told to "make it real-time" without a clear path forward.

Go · React.js · Kafka · TimescaleDB · PostgreSQL · WebSockets · 10-week engagement
99.6%
Reduction in data latency
<2s
Dashboard refresh latency
3×
Faster feature iteration post-launch
+18 NPS
Dashboard satisfaction score

Problem

The existing pipeline ran scheduled cron jobs that pulled from the main application database, aggregated in memory, and wrote summary rows to a reporting schema. With data touching multiple services before reaching the dashboard, freshness could lag by 65–90 minutes at peak.

Attempts to reduce the cron interval had introduced database lock contention. The team's instinct was to "just run it every 5 minutes" — which would have made the contention worse, not better.

Solution

  • Introduced an event bus (Kafka) as the authoritative stream for all user activity events emitted by the application.
  • Built a Go consumer service that maintained running aggregates in memory, flushing to TimescaleDB on a configurable interval (default: 5s).
  • Replaced polling React components with a WebSocket subscription layer — dashboards updated on push, not pull.
  • Preserved the batch pipeline in parallel during the migration window and used it to validate the stream aggregates.
  • Designed a back-fill capability so customers could query historical data from TimescaleDB without a separate data warehouse.
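The consumer's core loop — accumulate in memory, flush on an interval — can be sketched as below. This is an illustration with hypothetical names (the production service was written in Go against Kafka and TimescaleDB); events increment in-memory counters keyed by customer and metric, and a periodic flush emits the window's totals downstream and resets it.

```typescript
interface ActivityEvent {
  customerId: string;
  metric: string; // e.g. "page_view"
  value: number;
}

class RunningAggregator {
  private counts = new Map<string, number>();

  // sink stands in for the TimescaleDB write in the real service.
  constructor(private sink: (key: string, total: number) => void) {}

  consume(e: ActivityEvent): void {
    const key = `${e.customerId}:${e.metric}`;
    this.counts.set(key, (this.counts.get(key) ?? 0) + e.value);
  }

  // Called on the configurable flush interval (default 5s in the
  // engagement): emit every running total, then start a fresh window.
  flush(): void {
    for (const [key, total] of this.counts) this.sink(key, total);
    this.counts.clear();
  }
}
```

Keeping the flush interval configurable is what lets freshness (here, seconds) trade off against write volume to the time-series store.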

Impact

  • Data latency dropped from ~75 minutes to under 2 seconds — a 99.6% reduction.
  • The clear separation between event emission and aggregation made the codebase significantly easier to reason about.
  • The engineering team shipped three new dashboard widgets in the six weeks following handoff — compared to one in the prior quarter.
  • Customer dashboard NPS increased 18 points in the first post-launch survey.

Consumer App · Mobile Platform

Unifying two native apps into a single React Native platform

A consumer marketplace had separate iOS and Android teams — separate codebases, separate release cadences, and a product backlog that required every feature to be built twice. Leadership wanted parity without the cost, but the teams were nervous about performance regressions.

React Native · React.js · TypeScript · iOS · Android · Expo · 12-week engagement
60%
Codebase size reduction
2×
Faster feature delivery
<300ms
Cold start time (down from 900ms)
4.6★
App store rating maintained

Problem

The iOS app (Swift) and Android app (Kotlin) had diverged significantly over four years. Feature flags didn't line up, edge cases were platform-specific, and QA had to run full regression cycles on both platforms for every release. New engineers took 6–8 weeks to become productive because they needed to learn both codebases.

A previous attempt at React Native migration had been abandoned after performance concerns — specifically, scroll jank in the main feed, which was the product's core surface.

Solution

  • Profiled the original React Native spike to identify the root cause of the jank: an unvirtualised FlatList with excessive re-renders from a poorly structured Redux store.
  • Implemented a FlashList-based feed with windowed rendering, memoised item components, and a Zustand store scoped to local interaction state.
  • Adopted a "React Native shell, native modules for the seams" architecture: critical animations and camera interactions kept as native modules wrapped in a clean TypeScript interface.
  • Screen-by-screen migration over 10 weeks, with automated visual regression tests comparing native and RN renders side by side at each milestone.
  • Retained platform-specific entry points (AppDelegate / MainActivity) to preserve deep linking, push notification, and store compliance without abstraction.
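The memoisation that tamed the feed re-renders can be sketched as a custom comparator, here with a hypothetical prop shape. Each FlashList item is wrapped in `React.memo` with a comparator that compares only the fields that affect pixels and deliberately ignores unstable callback identities, so a parent re-render doesn't cascade into every visible row.

```typescript
interface FeedItemProps {
  id: string;
  title: string;
  likeCount: number;
  onPress: (id: string) => void; // identity may change on every render
}

// Passed as the second argument to React.memo(FeedItem, feedItemPropsEqual).
// Returning true means "props are equal — skip the re-render".
function feedItemPropsEqual(prev: FeedItemProps, next: FeedItemProps): boolean {
  return (
    prev.id === next.id &&
    prev.title === next.title &&
    prev.likeCount === next.likeCount
    // onPress deliberately excluded: either wrap it in useCallback
    // upstream, or accept that a new identity should not force paint.
  );
}
```

The same effect can be had by stabilising callbacks with `useCallback` and using plain shallow comparison; an explicit comparator just makes the re-render contract visible in one place.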

Impact

  • Shared TypeScript codebase replaced ~120,000 lines of platform-specific Swift and Kotlin with ~48,000 lines of shared React Native — a 60% reduction.
  • Cold start improved from 900ms to under 300ms after bundle splitting and lazy screen loading.
  • Feature delivery accelerated 2× in the first quarter after migration; the team shipped their biggest product update in two years within three months of handoff.
  • App store ratings held at 4.6★ across both platforms with no performance regression reports.

Financial Services · Legacy Stabilisation

Eliminating cascading failures in a critical Perl reporting system

A mid-market financial services firm depended on a Perl-based regulatory reporting system written years before the current team joined. The original authors had long since left. Documentation was sparse. The system had developed a pattern of cascading failures that caused multi-hour outages twice a quarter — each one requiring manual data correction under time pressure.

Perl · Oracle · Shell scripting · Linux · 8-week engagement
99.97%
Availability (up from 99.1%)
0
Cascade failures in 6 months post-engagement
100%
Code path coverage via characterisation tests
4hr → 12min
Incident resolution time

Problem

The system ran nightly batch processes that generated regulatory reports from Oracle views. When any step failed — due to a database timeout, a malformed input record, or a resource limit — the error was swallowed and the next step would proceed with incomplete data. The corruption propagated silently until a human noticed an anomaly in the final output, hours later.

The engineering team (none of whom wrote Perl) was reluctant to touch the code at all. The system was critical enough that the wrong change had regulatory consequences.

Solution

  • Built a full inventory of the codebase using static analysis and dynamic execution tracing — 47 modules, 312 subroutines, 18 distinct data flows.
  • Wrote a characterisation test suite in Perl (using Test::More) that documented the current behaviour of every public subroutine. This created a safety net before any changes were made.
  • Replaced bare die calls and swallowed errors with a structured error propagation model: each step returned a typed result, and the orchestrator would halt and page on non-recoverable errors.
  • Added structured logging (JSON to syslog) to every data flow step — the first time the system had machine-readable logs in its entire operational history.
  • Wrote runbooks for the three most common failure modes, including SQL to query the exact state of the Oracle views at any point in a failed run.

Impact

  • Zero cascade failures in the six months following the engagement — down from two per quarter.
  • System availability improved from 99.1% to 99.97%.
  • Incident resolution time dropped from ~4 hours (manual data archaeology) to 12 minutes (structured log query + runbook).
  • The characterisation test suite gave the internal team confidence to make their first proactive changes to the system — adding a new report type — within three months of handoff.

Have a similar challenge?

Every system is different — but the methodology is consistent. Let's talk about yours.

Discuss your project