This guide is intended to help folks who want to understand more about how Consul works from a code perspective, or who are thinking about contributing to Consul. For a high-level overview of Consul's design, please see the Consul Architecture Guide as a starting point.
Consul is designed around the concept of a Consul Agent. The agent is deployed as a single Go binary and runs on every node in a cluster.
A small subset of agents, usually 3 to 7, run in server mode and participate in the Raft Consensus Protocol. The Consul servers hold a consistent view of the state of the cluster, including the service catalog and the health state of services and nodes, as well as other items like the contents of Consul's key/value store. An agent in server mode has a superset of the client capabilities described next.
All the remaining agents in a cluster run in client mode. Applications on client nodes use their local agent to register services, to discover other services, and to interact with the key/value store. For discovery and key/value queries, the agent internally sends RPC requests to one of the Consul servers to get the information. For example, none of the key/value data is stored on any client agent; it is always fetched on the fly from a Consul server.
Both client and server mode agents participate in a Gossip Protocol, which provides two important mechanisms. First, it allows agents to learn about all the other agents in the cluster just by joining initially with a single existing member of the cluster. This allows clients to discover new Consul servers. Second, the gossip protocol provides a distributed failure detector, whereby the agents in the cluster randomly probe each other at regular intervals. Because of this failure detector, Consul can run health checks locally on each agent and just send edge-triggered updates when the state of a health check changes, confident that if the agent dies altogether then the cluster will detect that. This makes Consul's health checking design very scalable compared to systems built around central polling.
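To make the idea of edge-triggered updates concrete, here is a minimal sketch in Go. The `EdgeReporter` type and its `send` callback are hypothetical names invented for this illustration and are not part of Consul's code base. The sketch only shows the principle: a check can run locally as often as needed, but an update leaves the agent only when the result changes.

```go
package main

// Status is a simplified health check result.
type Status string

const (
	Passing  Status = "passing"
	Critical Status = "critical"
)

// EdgeReporter forwards a check result only when it differs from the last
// one it reported. Hypothetical type, for illustration only.
type EdgeReporter struct {
	last map[string]Status              // last reported status per check ID
	send func(checkID string, s Status) // stand-in for the update sent to the servers
}

// NewEdgeReporter returns a reporter that calls send on every state change.
func NewEdgeReporter(send func(string, Status)) *EdgeReporter {
	return &EdgeReporter{last: make(map[string]Status), send: send}
}

// Observe records the result of one local check run and reports whether an
// update was sent.
func (r *EdgeReporter) Observe(checkID string, s Status) bool {
	if prev, ok := r.last[checkID]; ok && prev == s {
		return false // state unchanged: nothing to send
	}
	r.last[checkID] = s
	r.send(checkID, s)
	return true
}
```

With this sketch, observing the sequence passing, passing, passing, critical, critical, passing for one check results in three calls to `send`: the first observation and the two transitions.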
There are many other aspects of Consul that are well-covered in Consul's Internals Guides.
The components in this section are shared between Consul agents in client and server modes.
|command/agent||This contains the actual CLI command implementation for the
|agent||This is where the agent object is defined, and the top level
|agent/config||This has all the user-facing configuration processing code, as well as the internal configuration structure that's used by the agent.|
|agent/checks||This has implementations for the different health check types.|
|agent/ae, agent/local||These are used together to power the agent's Anti-Entropy Sync Back process to the Consul servers.|
|agent/router, agent/pool||These are used for routing RPC queries to Consul servers and for connection pooling.|
|agent/structs||This has definitions of all the internal RPC protocol request and response structures.|
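As a rough illustration of what an anti-entropy sync back to the servers involves, the sketch below computes the difference between an agent's local services and the servers' view of that node. It assumes the agent's local state is treated as the source of truth. `SyncPlan` and `PlanSync` are hypothetical names, and the string "definitions" stand in for full service definitions; this is not the code in `agent/ae` or `agent/local`.

```go
package main

import "sort"

// SyncPlan lists what a hypothetical agent would push to, or remove from,
// the servers' catalog for its own node.
type SyncPlan struct {
	Register   []string // service IDs missing from or outdated in the catalog
	Deregister []string // service IDs the catalog has but the agent does not
}

// PlanSync compares the agent's local services with the catalog's view.
// Both maps go from service ID to an opaque definition string.
func PlanSync(local, catalog map[string]string) SyncPlan {
	var p SyncPlan
	for id, def := range local {
		if got, ok := catalog[id]; !ok || got != def {
			p.Register = append(p.Register, id)
		}
	}
	for id := range catalog {
		if _, ok := local[id]; !ok {
			p.Deregister = append(p.Deregister, id)
		}
	}
	// Sort for deterministic output; map iteration order is random in Go.
	sort.Strings(p.Register)
	sort.Strings(p.Deregister)
	return p
}
```

Running the comparison periodically, rather than only on local changes, is what lets the two views converge again after a missed or failed update.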
The components in this section are only used by Consul servers.
|agent/consul||This is where the Consul server object is defined, and the top-level
|agent/consul/fsm, agent/consul/state||These components make up Consul's finite state machine (updated by the Raft consensus algorithm), which is backed by the state store (based on immutable radix trees). All updates to Consul's consistent state are handled by the finite state machine, and all read queries to the Consul servers are serviced by the state store's data structures.|
|agent/consul/autopilot||This contains a package of functions that provide Consul's Autopilot features.|
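The split between the finite state machine and the state store in the table above can be sketched as follows. This is a simplified model, not Consul's implementation: `Command`, `StateStore`, and `FSM` are hypothetical types, and a plain map behind a lock stands in for the immutable radix trees.

```go
package main

import "sync"

// Command is a hypothetical replicated log entry.
type Command struct {
	Key   string
	Value string
}

// StateStore holds the data that read queries are served from. A map under
// a lock is a stand-in for the immutable radix trees.
type StateStore struct {
	mu    sync.RWMutex
	index uint64 // index of the last applied log entry
	kv    map[string]string
}

// NewStateStore returns an empty store.
func NewStateStore() *StateStore {
	return &StateStore{kv: make(map[string]string)}
}

// Get serves a read directly from the state store and also returns the
// index of the last applied write.
func (s *StateStore) Get(key string) (string, uint64, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.kv[key]
	return v, s.index, ok
}

// FSM is the only writer to the state store in this sketch.
type FSM struct{ store *StateStore }

// Apply is meant to be called once per committed log entry, in log order.
func (f *FSM) Apply(index uint64, c Command) {
	f.store.mu.Lock()
	defer f.store.mu.Unlock()
	f.store.kv[c.Key] = c.Value
	f.store.index = index
}
```

Keeping a single write path through `Apply` is what lets every server that applies the same log reach the same state, while reads never need to go through the log.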
There are several other top-level packages used internally by Consul as well as externally by other applications.
|acl||This supports the underlying policy engine for Consul's ACL system.|
|command||This contains a sub-package for each of Consul's CLI command implementations.|
|snapshot||This has implementation details for Consul's snapshot archives.|
|watch||This has implementation details for Consul's watches, used both internally to Consul and by the [watch CLI command](https://www.consul.io/docs/commands/watch.html).|
|website||This has the full source code for consul.io. Pull requests can update the source code and Consul's documentation all together.|
This section addresses some frequently asked questions about Consul's architecture.
When you query Consul for information about a service, such as via the DNS interface, the agent will always make an internal RPC request to a Consul server that will query the consistent state store. Even though an agent might learn that another agent is down via gossip, that won't be reflected in service discovery until the current Raft leader server perceives that through gossip and updates the catalog using Raft. You can see an example of where these layers are plumbed together here - https://github.com/hashicorp/consul/blob/v1.0.5/agent/consul/leader.go#L559-L602.
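The following sketch models only the rule described above: every server may observe a gossip event, but only the current leader turns it into a catalog update. `MemberEvent`, `Server`, and `HandleMemberEvent` are hypothetical names for illustration. In Consul the write would be replicated through Raft rather than assigned to a local map.

```go
package main

// MemberEvent is a hypothetical gossip notification about one node.
type MemberEvent struct {
	Node  string
	Alive bool
}

// Server models just enough of a Consul server to show who updates the
// catalog. Catalog maps a node name to the health that service discovery
// queries would see.
type Server struct {
	IsLeader bool
	Catalog  map[string]string
}

// HandleMemberEvent reports whether the catalog was changed.
func (s *Server) HandleMemberEvent(ev MemberEvent) bool {
	if !s.IsLeader {
		return false // followers observe gossip but do not write to the catalog
	}
	status := "critical"
	if ev.Alive {
		status = "passing"
	}
	if s.Catalog[ev.Node] == status {
		return false // already up to date
	}
	s.Catalog[ev.Node] = status // stand-in for a write replicated through Raft
	return true
}
```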
Consul's blocking queries make a best-effort attempt to wait for new information, but they may return the same results as the initial query under some circumstances. First, queries are limited to 10 minutes max, so if they time out they will return. Second, due to Consul's prefix-based internal immutable radix tree indexing, there may be modifications to higher-level nodes in the radix tree that cause spurious wakeups. In particular, waiting on things that do not exist is not very efficient, but not very expensive for Consul to serve, so we opted to keep the code complexity low and not try to optimize for that case. You can see the common handler that implements the blocking query logic here - https://github.com/hashicorp/consul/blob/v1.0.5/agent/consul/rpc.go#L361-L439. For more on the immutable radix tree implementation, see https://github.com/hashicorp/go-immutable-radix/ and https://github.com/hashicorp/go-memdb, and the general support for "watches".
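A blocking query can be thought of as "wait until the index moves past the one I already have, or until a timeout". The sketch below models that with a channel that is closed on every write. `Table` and `BlockingIndex` are hypothetical names and this is not the handler linked above. Note that any write to the table wakes every waiter, which is a coarse analogue of the spurious wakeups described above, and that a timeout returns an index no newer than the one the caller passed in.

```go
package main

import (
	"sync"
	"time"
)

// Table is a hypothetical stand-in for one indexed table in the state store.
type Table struct {
	mu      sync.Mutex
	index   uint64
	changed chan struct{} // closed and replaced on every write
}

// NewTable returns an empty table at index 0.
func NewTable() *Table {
	return &Table{changed: make(chan struct{})}
}

// Write bumps the index and wakes all current waiters.
func (t *Table) Write() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.index++
	close(t.changed)
	t.changed = make(chan struct{})
}

// BlockingIndex waits until the table index exceeds minIndex or the timeout
// elapses, then returns the index it last observed.
func (t *Table) BlockingIndex(minIndex uint64, timeout time.Duration) uint64 {
	deadline := time.NewTimer(timeout)
	defer deadline.Stop()
	for {
		t.mu.Lock()
		idx, ch := t.index, t.changed
		t.mu.Unlock()
		if idx > minIndex {
			return idx
		}
		select {
		case <-ch:
			// Something was written; loop and re-check the index.
		case <-deadline.C:
			return idx // timed out: no newer index was observed
		}
	}
}
```

A caller that gets back an index equal to the one it sent should simply issue the query again, which is also how clients are expected to handle a real blocking query that returns unchanged results.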
No. These are always fetched via an internal RPC request to a Consul server. The agent doesn't do any caching, and if you want to be able to fetch these values even if there's no cluster leader, then you can use a more relaxed consistency mode. You can see an example where the `/v1/kv/<key>` HTTP endpoint on the agent makes an internal RPC call here - https://github.com/hashicorp/consul/blob/v1.0.5/agent/kvs_endpoint.go#L56-L90.
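As a small client-side illustration, the sketch below reads a key through the local agent's HTTP API using only the Go standard library. The helper names `kvURL` and `fetchKV` are made up for this example. It assumes the agent listens on an address such as `127.0.0.1:8500`, and it uses the `stale` query parameter to request the relaxed consistency mode, which allows a server to answer even when there is no leader.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// kvURL builds the agent URL for a key. When stale is true, the request asks
// for the relaxed "stale" consistency mode.
func kvURL(agentAddr, key string, stale bool) string {
	u := url.URL{Scheme: "http", Host: agentAddr, Path: "/v1/kv/" + key}
	if stale {
		u.RawQuery = "stale"
	}
	return u.String()
}

// fetchKV returns the raw JSON body of the response. Every call results in a
// request to the agent; nothing is cached by this helper.
func fetchKV(client *http.Client, agentAddr, key string, stale bool) ([]byte, error) {
	resp, err := client.Get(kvURL(agentAddr, key, stale))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```

For example, `fetchKV(http.DefaultClient, "127.0.0.1:8500", "app/config", true)` would issue a stale read for the key `app/config` against a local agent, if one is running.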
We strongly recommend running the Consul agent on each node in a cluster. Even the key/value store benefits from having agents on each node. For example, when you lock a key it's done through a session, which has a lifetime that's by default tied to the health of the agent as determined by Consul's gossip-based distributed failure detector. If the agent dies, the session will be released automatically, allowing some other process to quickly see that and obtain the lock without having to wait for an open-ended TTL to expire. If you are using Consul's service discovery features, the local agent runs the health checks for each service registered on that node and only needs to send edge-triggered updates to the Consul servers (because gossip will determine if the agent itself dies). Most attempts to avoid running an agent on each node end up re-solving problems that Consul's design already solves when the agent is deployed as intended.
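The relationship between node health, sessions, and locks can be modeled in a few lines. The `Locks` type below is a hypothetical in-memory model written for this guide, not Consul's session implementation. `NodeFailed` plays the role of the failure detector declaring an agent dead.

```go
package main

// Locks is a hypothetical in-memory model of sessions and key locks.
type Locks struct {
	sessionNode map[string]string // session ID -> node whose health it is tied to
	holder      map[string]string // key -> session ID holding the lock
}

// NewLocks returns an empty model.
func NewLocks() *Locks {
	return &Locks{
		sessionNode: make(map[string]string),
		holder:      make(map[string]string),
	}
}

// CreateSession ties a new session to the health of a node.
func (l *Locks) CreateSession(id, node string) {
	l.sessionNode[id] = node
}

// Acquire tries to lock key for the given session and reports success.
func (l *Locks) Acquire(key, session string) bool {
	if _, ok := l.sessionNode[session]; !ok {
		return false // unknown or invalidated session
	}
	if cur, held := l.holder[key]; held {
		return cur == session
	}
	l.holder[key] = session
	return true
}

// NodeFailed invalidates every session tied to the node and releases the
// locks those sessions held.
func (l *Locks) NodeFailed(node string) {
	for id, n := range l.sessionNode {
		if n != node {
			continue
		}
		delete(l.sessionNode, id)
		for key, s := range l.holder {
			if s == id {
				delete(l.holder, key)
			}
		}
	}
}
```

In this model a second process can take over the lock as soon as `NodeFailed` runs for the first holder's node, with no TTL involved.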