Toxiproxy: Simulate Network Failures for Robust Testing
Testing how applications behave under adverse network conditions is essential for building resilient systems. Toxiproxy is a lightweight, programmable TCP proxy designed to simulate network failures (latency, bandwidth restrictions, disconnects, and more) so you can validate how services react before those failures happen in production. This article covers what Toxiproxy is, why it matters, how it works, practical examples, integration patterns, best practices, limitations, and alternatives.
What is Toxiproxy?
Toxiproxy is an open-source tool originally created by Shopify. It sits between a client and a server as a TCP proxy, allowing you to inject faults (“toxics”) into the network path. Each toxic represents a specific failure mode (latency, limited bandwidth, connection resets, etc.) and can be applied in isolation or combined to create complex failure scenarios. Toxiproxy exposes an HTTP control API, with client libraries available in multiple languages, so tests can configure and manipulate network conditions programmatically.
Key fact: Toxiproxy operates at the TCP level and supports creating, updating, and removing network fault behaviors on demand.
Why use Toxiproxy?
- Validate resilience: Confirm that retry logic, timeouts, circuit breakers, and fallback behavior work as intended.
- Prevent regressions: Integrate Toxiproxy into CI to detect when changes weaken fault tolerance.
- Reproduce real-world issues: Simulate intermittent or partial failures that are hard to replicate otherwise.
- Test microservices interactions: Introduce failures between services without touching service code or infrastructure.
- Low friction: Lightweight and easy to run locally, in CI, or alongside test environments.
How Toxiproxy works — core concepts
- Proxy: A TCP listener that forwards traffic to an upstream server (the real service).
- Proxy endpoints: Each proxy has a listen address (client connects here) and an upstream address (target service).
- Toxics: Fault-injection primitives attached to a proxy. Types include latency, bandwidth, slow_close, timeout, and more.
- Control API: HTTP endpoints for creating proxies and toxics; client libraries for programmatic control.
- Direction: Each toxic is attached to one stream of the connection: upstream (client→server) or downstream (server→client).
Common toxic types
- latency: Adds a fixed delay (and optional jitter) to all data passing through the proxy.
- bandwidth: Limits throughput, simulating slow connections.
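- timeout: Stops all data from getting through and closes the connection after the configured timeout; with a timeout of 0, the connection stays open and data is dropped until the toxic is removed.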
- slow_close: Delays closing the connection to simulate graceful shutdown issues.
- limit_data: Limits total bytes before the connection is closed.
- reset_peer: Resets the TCP connection ("connection reset by peer"), either immediately or after an optional timeout.
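- slicer: Slices data into smaller chunks, optionally adding a delay between them. (Data corruption is not among the built-in toxics; flipping bits would require a custom toxic.)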
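To make the toxic types concrete, here is a minimal sketch of a helper that builds the JSON body sent to the control API. The helper name `buildToxic` and the lookup table are illustrative, not part of Toxiproxy. The attribute names and units reflect the Toxiproxy README as I understand it, so verify them against the version you run.

```js
// Attribute names per toxic type (assumed from the Toxiproxy README).
const TOXIC_ATTRIBUTES = {
  latency: ['latency', 'jitter'],   // milliseconds
  bandwidth: ['rate'],              // KB/s
  slow_close: ['delay'],            // milliseconds
  timeout: ['timeout'],             // milliseconds; 0 drops data without closing
  reset_peer: ['timeout'],          // milliseconds before the reset
  limit_data: ['bytes'],            // total bytes before the connection closes
  slicer: ['average_size', 'size_variation', 'delay'], // bytes, bytes, microseconds
};

// buildToxic (hypothetical helper) validates attribute names and returns the
// body for POST /proxies/{proxy}/toxics.
function buildToxic(type, attributes, { name, stream = 'downstream', toxicity = 1.0 } = {}) {
  const allowed = TOXIC_ATTRIBUTES[type];
  if (!allowed) {
    throw new Error(`unknown toxic type: ${type}`);
  }
  for (const key of Object.keys(attributes)) {
    if (!allowed.includes(key)) {
      throw new Error(`unknown attribute for ${type}: ${key}`);
    }
  }
  if (stream !== 'upstream' && stream !== 'downstream') {
    throw new Error(`stream must be "upstream" or "downstream", got: ${stream}`);
  }
  // Fall back to a "<type>_<stream>" name when the caller does not supply one.
  return { name: name ?? `${type}_${stream}`, type, stream, toxicity, attributes };
}
```

For example, `buildToxic('latency', { latency: 200, jitter: 50 }, { name: 'latency_down' })` yields a body that can be posted as-is, while a misspelled attribute fails fast in the test instead of being sent to the server.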
Example usage
Below are practical examples showing how to run Toxiproxy and use it in tests.
- Run Toxiproxy (Docker):
docker run --rm -p 8474:8474 -p 8666:8666 shopify/toxiproxy
- HTTP control API: http://localhost:8474
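- Example proxy: clients connect to localhost:8666, and Toxiproxy forwards to your service at 127.0.0.1:9000. When Toxiproxy runs in a container, 127.0.0.1 refers to the container itself, so point the upstream at an address the container can actually reach.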
- Create a proxy with curl:
curl -s -X POST http://localhost:8474/proxies -d '{"name":"example","listen":"0.0.0.0:8666","upstream":"127.0.0.1:9000"}' -H "Content-Type: application/json"
- Add a 200 ms latency toxic:
curl -s -X POST http://localhost:8474/proxies/example/toxics -d '{"name":"latency_down","type":"latency","attributes":{"latency":200},"stream":"downstream"}' -H "Content-Type: application/json"
- Remove the toxic:
curl -s -X DELETE http://localhost:8474/proxies/example/toxics/latency_down
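- Programmatic control (Node.js example snippet):
```js
// Call shapes below follow this article's original snippet; verify them against
// the README of the toxiproxy-node-client version you install.
const Toxiproxy = require('toxiproxy-node-client');
const client = new Toxiproxy('http://localhost:8474');

async function example() {
  // Listen on 8666 and forward to the real service on 9000.
  const proxy = await client.createProxy('svc', '0.0.0.0:8666', '127.0.0.1:9000');
  await proxy.addToxic({
    name: 'latency',
    type: 'latency',
    stream: 'downstream',
    attributes: { latency: 500 }, // milliseconds
  });
  try {
    // Run tests…
  } finally {
    // Remove the toxic even if the tests throw.
    await proxy.removeToxic('latency');
  }
}
```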
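If you would rather not depend on a client library, the same three curl calls can be wrapped in a few lines of JavaScript. This is a sketch under assumptions: `createToxiproxyClient` is a hypothetical name, it relies on the global `fetch` available in Node 18 and later, and it treats a 204 status as "no response body".

```js
// Thin wrapper over the HTTP control API used in the curl examples.
// fetchImpl is injectable so the wrapper can be exercised without a server.
function createToxiproxyClient(baseUrl = 'http://localhost:8474', fetchImpl = globalThis.fetch) {
  async function request(method, path, body) {
    const res = await fetchImpl(`${baseUrl}${path}`, {
      method,
      headers: { 'Content-Type': 'application/json' },
      body: body === undefined ? undefined : JSON.stringify(body),
    });
    if (!res.ok) {
      throw new Error(`${method} ${path} failed with status ${res.status}`);
    }
    // Assumption: deletes answer 204 with an empty body.
    return res.status === 204 ? null : res.json();
  }

  return {
    createProxy: (name, listen, upstream) =>
      request('POST', '/proxies', { name, listen, upstream }),
    addToxic: (proxy, toxic) =>
      request('POST', `/proxies/${proxy}/toxics`, toxic),
    removeToxic: (proxy, toxicName) =>
      request('DELETE', `/proxies/${proxy}/toxics/${toxicName}`),
    deleteProxy: (proxy) =>
      request('DELETE', `/proxies/${proxy}`),
  };
}
```

Injecting `fetchImpl` is a deliberate choice: unit tests can pass a stub and assert on the requests, while integration tests use the default and talk to a real Toxiproxy instance.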
Integration patterns
- Local development: Run Toxiproxy with developers’ local stacks to test failure handling before pushing code.
- CI pipelines: Spin up Toxiproxy in the test stage and run automated resilience tests (e.g., integration tests that assert retry behavior).
- Component tests: Use Toxiproxy in Docker Compose setups so service A talks to service B via Toxiproxy.
- Chaos testing: Use Toxiproxy as part of broader chaos experiments to focus on network-related failures.
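- Service virtualization: When downstream services are unavailable or costly, put Toxiproxy in front of a stand-in (such as a mock server) to emulate degraded responses. Toxiproxy only forwards traffic to an upstream; it cannot generate responses on its own.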
Test design tips
- Start small: Test a single toxic first (e.g., latency) to validate specific behavior.
- Combine toxics: Layer latency + bandwidth limits to simulate congested networks.
- Vary duration and intensity: Use both short bursts and prolonged degradation to test transient and enduring failures.
- Assert observable behavior: Check retries, timeouts, metrics emitted, and whether fallbacks were used.
- Clean up: Ensure tests remove toxics and proxies to avoid cross-test interference.
- Time determinism: Use deterministic parameters where possible to avoid flaky tests (e.g., set fixed latency without jitter during CI).
Best practices
- Use environment isolation: Run Toxiproxy per-test or per-suite so state is predictable.
- Keep production configs separate: Don’t run toxics against production services.
- Monitor side effects: Tests that inject heavy faults can create misleading alerts; silence or tag monitoring during tests.
- Combine with load testing: Observe how failures manifest under load, not just single-request tests.
- Automate rollback: Ensure toxics are removed even on test failures (use fixtures or finally blocks).
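The rollback advice above can be captured in a small wrapper. This is a sketch, not an established API: `withToxic` is a hypothetical helper, and `client` stands for any object exposing `addToxic(proxy, toxic)` and `removeToxic(proxy, name)` that return promises.

```js
// withToxic (hypothetical helper) applies a toxic, runs the test body, and
// always removes the toxic afterwards. The toxic must carry an explicit name
// so it can be removed by that name.
async function withToxic(client, proxy, toxic, body) {
  await client.addToxic(proxy, toxic);
  try {
    return await body();
  } finally {
    // Runs on success and on failure, so a failing test cannot leak the toxic
    // into the next test.
    await client.removeToxic(proxy, toxic.name);
  }
}
```

A test would then read `await withToxic(client, 'example', latencyToxic, async () => { /* assertions */ })`, keeping setup and teardown in one place.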
Limitations and caveats
- Layer: Toxiproxy works at the TCP layer; it cannot inspect or modify application-layer semantics (HTTP headers, JSON payloads) unless you build additional logic around it.
- Not a full network emulator: For very low-level network behaviors (e.g., packet-level reordering at sub-TCP granularity) or complex routing topologies, more advanced network emulators or kernel-level tools may be needed.
- Platform behavior: Some clients and protocols may recover differently depending on TCP stack behavior, making results vary by OS or runtime.
- Resource constraints: Running many proxies/toxics under heavy load may consume CPU and memory; measure overhead in tests.
Alternatives and complementary tools
- tc (Linux Traffic Control): Powerful kernel-level traffic shaping for advanced scenarios.
- netem: Kernel module commonly used with tc for latency, loss, duplication, and reordering.
- Istio/Envoy fault injection: For service meshes, apply HTTP/gRPC-level faults.
- Chaos engineering tools: Gremlin, Chaos Mesh — broader system-failure experiments beyond networking.
- WireMock / mock servers: For simulating downstream behavior at the application layer (HTTP), not at the TCP level.
Comparison (simple):
Tool | Layer | Strength |
---|---|---|
Toxiproxy | TCP | Lightweight, programmable, ideal for microservice integration tests |
tc/netem | Kernel/TCP/IP | Low-level, powerful, system-wide control |
Envoy/Istio | HTTP/gRPC/TCP | Integrates with service meshes, richer routing/faulting |
WireMock | HTTP | Application-layer response simulation |
Sample test scenarios
- Intermittent latency: Add random latency spikes and verify exponential backoff and retry caps.
- Connection resets: Use reset_peer to ensure clients handle abrupt disconnects gracefully.
- Slow upstream: Combine bandwidth limit + latency to validate streaming or large-file transfers.
- Partial failures: Apply toxics only downstream to simulate responses arriving slowly while requests get through quickly.
- Failover validation: Introduce failures on the primary upstream and verify that the system correctly fails over to the secondary.
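Scenarios such as intermittent latency and connection resets mostly exercise the client's retry policy. As a point of reference, here is a minimal sketch of exponential backoff with a retry cap. `retryWithBackoff` is an illustrative policy written for this article, not part of Toxiproxy, and its option names are assumptions. The injectable `sleep` lets a test record the delays instead of waiting for them.

```js
// retryWithBackoff (illustrative) retries a failing async operation, doubling
// the delay after each failure up to maxDelayMs, and gives up after maxAttempts.
async function retryWithBackoff(operation, {
  maxAttempts = 4,
  baseDelayMs = 100,
  maxDelayMs = 2000,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
} = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts) break; // retry cap reached
      // Delays grow as base, 2*base, 4*base, ... capped at maxDelayMs.
      await sleep(Math.min(baseDelayMs * 2 ** (attempt - 1), maxDelayMs));
    }
  }
  throw lastError;
}
```

In a Toxiproxy test, the operation would be a request sent through the proxy while a reset_peer or latency toxic is active, and the assertions would check the number of attempts and the recorded delays.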
Troubleshooting
- No effect on traffic: Verify that the client connects to the Toxiproxy listen address, not directly to the upstream.
- Flaky tests: Reduce jitter, increase timeouts in test harness, or seed deterministic randomness.
- Overhead concerns: Measure CPU and memory usage of Toxiproxy under load; consider running with fewer concurrent proxies or increasing host resources.
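The "no effect on traffic" case can often be diagnosed from the proxy listing returned by the control API. The sketch below assumes that `GET /proxies` returns an object keyed by proxy name, where each entry has `listen`, `enabled`, and `toxics` fields; check that shape against your Toxiproxy version. `diagnoseProxy` is a hypothetical helper.

```js
// diagnoseProxy (hypothetical helper) inspects a parsed GET /proxies response
// and lists likely reasons why toxics are not affecting client traffic.
function diagnoseProxy(proxies, name, clientPort) {
  const proxy = proxies[name];
  if (!proxy) {
    return [`proxy "${name}" does not exist`];
  }
  const problems = [];
  if (!proxy.enabled) {
    problems.push('proxy is disabled');
  }
  // The port is the last colon-separated part, e.g. "[::]:8666" -> 8666.
  const listenPort = Number(proxy.listen.split(':').pop());
  if (listenPort !== clientPort) {
    problems.push(`client uses port ${clientPort}, but the proxy listens on ${listenPort}`);
  }
  if (!proxy.toxics || proxy.toxics.length === 0) {
    problems.push('no toxics are attached');
  }
  return problems;
}
```

An empty result means the proxy looks correctly wired, so the next thing to check is whether the client really resolves to the proxy address rather than a cached or hard-coded upstream address.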
Conclusion
Toxiproxy is a practical, developer-friendly tool for simulating network failures at the TCP level. It fills an important niche between high-level application mocks and low-level OS network emulators, enabling reproducible and automated resilience testing. When used thoughtfully—isolated from production, integrated into CI, and combined with good test design—Toxiproxy helps teams catch and fix brittle error-handling code before it affects users.