SNMP vs Streaming Telemetry for network monitoring

Monitoring the health of the network is a key requirement for building a resilient network. If the normal state of the network cannot be measured and established, then fault conditions, particularly subtler issues like reduced link bandwidth or higher-than-normal latency, can be very hard to detect and diagnose.

This blog post discusses SNMP, NETCONF and gNMI: how they are used for network monitoring, and where they differ. All of these protocols can also set and retrieve configuration, but this post focuses on the status, or telemetry, aspect.

Historically, SNMP has been the primary source of network monitoring data. SNMP is a pull-based protocol, where an SNMP client queries an SNMP agent on a networking device, requesting one or more OIDs (Object Identifiers), each of which represents an individual value. To describe which OIDs are valid for a particular SNMP agent, vendors provide a set of MIBs (Management Information Bases), which lay out a hierarchical tree of the data values available via SNMP. There are a number of standard MIBs that most network vendors implement, along with vendor-specific MIBs whose layout is particular to the vendor's product. The vendor-specific MIBs usually provide more detailed information.

With SNMP, collecting data over time requires the SNMP client to poll the SNMP agent periodically. For example, to determine the rate at which packets are traversing a network interface, an SNMP client needs to query the counter OID at least twice, take the difference between the two readings, and divide by the time between queries. Some vendors provide OIDs that give a rate-based measurement rather than raw counters, but this value still needs to be polled periodically to monitor how the rate changes over time.
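
The counter arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration rather than a real SNMP client: the function name, the sample values, and the assumption that the counter wraps at most once between polls are all mine.

```python
def counter_rate(prev_value, prev_time, curr_value, curr_time, counter_bits=64):
    """Return the average per-second rate between two counter samples."""
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter rolled over between polls (assumes a single wrap).
        delta += 2 ** counter_bits
    return delta / elapsed


# Two hypothetical polls of an interface packet counter, 30 seconds apart.
rate = counter_rate(prev_value=1_000_000, prev_time=0.0,
                    curr_value=1_045_000, curr_time=30.0)
print(f"{rate:.1f} packets/s")  # prints "1500.0 packets/s"
```

The wrap handling matters most for 32-bit counters on fast links, which can roll over between polls if the polling interval is too long.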

SNMP can also alert on events using SNMP Traps or Informs. Both are packets pushed from the SNMP agent to an SNMP trap receiver, and both contain the set of OIDs that describe the event. The difference is acknowledgement: a Trap is sent only once, because the sending agent has no way to determine whether it was received, whereas the trap receiver acknowledges each Inform it receives. If an Inform is not acknowledged, the SNMP agent resends it, typically up to a configured number of retries.
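
To make the Trap/Inform difference concrete, here is a toy simulation in Python. It sends no real SNMP packets: `deliver` is a hypothetical callable standing in for "transmit and wait for an acknowledgement", and the bounded retry count is an assumption about how an agent might be configured.

```python
def send_trap(deliver, message):
    """Fire and forget: transmit once and never learn whether it arrived."""
    deliver(message)
    return 1  # number of transmissions


def send_inform(deliver, message, max_retries=3):
    """Retransmit until the receiver acknowledges or the retries run out.

    `deliver` returns True when an acknowledgement came back in time.
    """
    attempts = 0
    for _ in range(1 + max_retries):
        attempts += 1
        if deliver(message):
            return attempts, True
    return attempts, False


# A hypothetical lossy link that drops the first two packets.
outcomes = iter([False, False, True])
attempts, acked = send_inform(lambda msg: next(outcomes), "linkDown ifIndex=3")
print(attempts, acked)  # prints "3 True"
```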

Security-wise, SNMP is a mixed bag. Earlier versions (SNMPv1 and SNMPv2c) do not provide any authentication or encryption of data; instead they rely on plain-text community strings to restrict access to particular sets of OIDs, and they expect to be transported over private, secure networks.

SNMPv3, the most recent version, does provide authentication and encryption, along with the ability to restrict access to particular groups of OIDs. This makes it more suitable for more public networks, but it can be very complex to configure. In practice, many deployments simply use a single user and encryption password.

Streaming Telemetry comes in many flavours, from high-speed protocols like sFlow, NetFlow, and IPFIX, which are aimed at capturing network flow metadata, to more general-purpose protocols like NETCONF and gNMI. This post focuses on the latter two protocols.

Rather than using a set of MIBs, both of these protocols use a language called YANG to specify a number of data models that can be queried to retrieve the data. A single value (referred to as a leaf) can be retrieved, or an entire tree of data. YANG incorporates the concept of references, so parts of the data tree can link to other parts where that makes sense. One example is VLANs, where an entry could link to its parent interface.
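
As a rough illustration of leaves, subtrees, and references, the Python sketch below uses a nested dictionary as a stand-in for a YANG-modelled data tree. The node names are made up for this example and are not taken from any published YANG model; real paths also use key predicates rather than plain path segments.

```python
# A hypothetical data tree: containers are dicts, leaves are plain values.
tree = {
    "interfaces": {
        "interface": {
            "eth0": {"name": "eth0", "oper-status": "up", "mtu": 1500},
        }
    },
    "vlans": {
        "vlan": {
            # "parent-interface" acts like a reference to an interface name.
            "100": {"id": 100, "parent-interface": "eth0"},
        }
    },
}


def get_path(node, path):
    """Walk the tree one path segment at a time; return a leaf or a subtree."""
    for key in path.strip("/").split("/"):
        node = node[key]
    return node


# Retrieve a single leaf.
print(get_path(tree, "/interfaces/interface/eth0/oper-status"))

# Retrieve an entire subtree.
print(get_path(tree, "/interfaces/interface/eth0"))

# Follow the reference from the VLAN to its parent interface.
parent = get_path(tree, "/vlans/vlan/100/parent-interface")
print(get_path(tree, f"/interfaces/interface/{parent}")["mtu"])
```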

NETCONF and gNMI devices provide a list of the YANG models they support, along with the versions of those models, as part of a capabilities structure. This is distinct from SNMP, which requires the SNMP client to already know which MIBs the underlying agent supports.
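
For NETCONF, the capabilities arrive in the hello message a device sends when a session opens, and YANG modules can appear there as capability URIs carrying module and revision parameters. The sketch below parses such a message using only the Python standard library. The hello message is a hand-written sample, and the second module in it is invented for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlsplit

NETCONF_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"

# A hand-written sample hello message (not captured from a real device).
HELLO = """<hello xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <capabilities>
    <capability>urn:ietf:params:netconf:base:1.0</capability>
    <capability>urn:ietf:params:xml:ns:yang:ietf-interfaces?module=ietf-interfaces&amp;revision=2014-05-08</capability>
    <capability>urn:example:vendor:system?module=example-system&amp;revision=2023-01-15</capability>
  </capabilities>
</hello>"""


def yang_modules(hello_xml):
    """Return {module name: revision} for capabilities that name a module."""
    root = ET.fromstring(hello_xml)
    modules = {}
    for cap in root.iter(f"{{{NETCONF_NS}}}capability"):
        query = parse_qs(urlsplit(cap.text.strip()).query)
        if "module" in query:
            modules[query["module"][0]] = query.get("revision", [None])[0]
    return modules


for name, revision in yang_modules(HELLO).items():
    print(name, revision)
```

Capabilities without a module parameter, such as the base protocol capability, are skipped because they describe protocol features rather than data models.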

NETCONF is primarily accessed over SSH (Secure Shell) or a TLS connection, and can be authenticated using certificates, username/password pairs, or SSH key pairs. Normally, the data is encoded in XML, but JSON can also be used in some implementations.

gNMI uses gRPC as its transport, and uses either username/password pairs or certificates for authentication. gRPC in turn runs over HTTP/2, which uses TLS for host authentication and encryption.

As well as supporting one-off retrievals of telemetry data, gNMI also supports long-lived connections, which can be configured to push telemetry data at a specific frequency, or to push it whenever the specified data changes. This reduces the processing load on the telemetry client by shifting more of the operational work to the telemetry agent.
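
The two push styles behave quite differently for the same underlying data. The Python sketch below simulates them over a list of made-up readings; it does not use a gNMI library. The function names loosely follow gNMI's SAMPLE and ON_CHANGE stream modes, but the code only illustrates the idea.

```python
def sample_mode(readings, interval):
    """Push the current value every `interval` seconds, changed or not."""
    return [(t, v) for t, v in readings if t % interval == 0]


def on_change_mode(readings):
    """Push a value only when it differs from the last one pushed."""
    updates, last = [], object()  # sentinel that never equals a reading
    for t, v in readings:
        if v != last:
            updates.append((t, v))
            last = v
    return updates


# Hypothetical link state observed once per second: (timestamp, value).
readings = [(0, "up"), (1, "up"), (2, "up"), (3, "down"),
            (4, "down"), (5, "up"), (6, "up")]

print(sample_mode(readings, interval=2))
print(on_change_mode(readings))
```

With a two-second sample interval the receiver gets four updates and first sees the link down at t=4, while the on-change stream sends three updates and reports each transition at the second it happened (t=3 and t=5).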

gNMI and NETCONF also support dial-back connections, where the device is configured to make an outbound connection to a telemetry receiver. This can be helpful for navigating across firewalls and NAT gateways.

Given the advantages of streaming telemetry over SNMP, it is clear why the industry is moving towards it. In most networks, though, SNMP will still be around for older networking equipment, and for adjacent peripherals like UPSes and RPC/PDUs, which do not have the processing power to provide NETCONF or gNMI support.

SNMP (particularly SNMPv3) still provides adequate monitoring for status information that changes rarely, such as device MAC addresses, serial numbers, and other mostly static information, while streaming telemetry shines at providing more scalable access to rapidly changing data like network transmission rates.