ITDK 2024-02
April 23rd, 2024 by Matthew LuckieCAIDA has released the 2024-02 Internet Topology Data Kit (ITDK), the 24th ITDK in a series published over the past 14 years. In the year since the 2023-03 release, CAIDA has expanded its Ark platform with both hardware and software vantage points (VPs), and re-architected the ITDK probing software. We have been busy modernizing the software to enable us to collect ITDK snapshots more regularly, as well as annotate the router-level Internet topology graph with more features.
For IPv4, the ITDK probing software is based primarily around two reliable alias resolution techniques. The first, MIDAR, probes for IPID behavior that suggests that responses from different IP addresses had IPID values derived from a single counter, and thus the addresses are assigned to the same router. This inference is challenging because of the sheer number of router addresses observed in macroscopic Internet topologies, and the IPID value is held in a 16-bit field, requiring sophisticated probing techniques to identify distinct counters. The second, iffinder, probes for common source IP addresses in responses to probes sent to different target IP addresses.
In the past few months, we have replaced the MIDAR and iffinder probing component on the Ark VPs to use alias resolution primitives present in scamper (specifically, the midarest, midardisc, and radargun primitives). We used the recently released scamper python module, and 902 lines of python, which executes on a single machine at CAIDA to coordinate the probing from many VPs.
The following table provides statistics illustrating the growth of the ITDK over the past year, driven by the expansion of Ark VPs. Overall, we increased the number of Ark VPs providing topology data from 93 to 142, the number of addresses probed from 2.6 to 3.6M, doubled the number of VPs that we use for alias resolution probing, and found aliases for 50% more addresses than a year ago. Note that we use the term “node” to distinguish between our router inferences, and the actual routers themselves. By definition all routers have at least two IP addresses; our “nodes with at least two IPs” are the subset of routers we were able to observe with that property.
2023-03 | 2024-02 | |
Input: | ||
Number of addresses probed: | 2.64M | 3.58M |
Number of ark VPs: | 93 | 142 |
Number of countries: | 37 | 52 |
Alias resolution: | ||
Number of ark VPs for MIDAR: | 55 | 101 |
Number of ark VPs for iffinder: | 46 | 101 |
Output: | ||
Nodes with at least two IPs: | 75,660 | 107,976 |
Addresses in nodes with at least two IPs: | 284,479 | 425,964 |
For 2024-02, we also evaluated the gains provided by SNMPv3 probing, following a paper published in IMC 2021 that showed many routers return a unique SNMP Engine ID in response to a SNMPv3 request; the basic idea is that different IP addresses returning the same SNMPv3 Engine ID are likely aliases. Of the 3.58M addresses we probed, 672K returned an SNMPv3 response. We inferred that IP addresses belonged to the same router when they return the same SNMP Engine ID, the size of the engine ID was at least 4 bytes, the number of engine boots was the same, and the router uptime was the same; we did not use the other filters in section 4.4 of the IMC paper. This inferred 47,770 nodes with at least two IPs, many of which were shared with existing nodes found with MIDAR + iffinder. In total, when we combined MIDAR, iffinder, and SNMP probing, we obtained a graph with 124,857 nodes with at least two IPs, covering 515,524 addresses. We are including both the MIDAR + iffinder and MIDAR + iffinder + SNMP graphs in ITDK 2024-02.
Our ITDK also includes an IPv6 graph derived from speedtrap, which infers that IPv6 addresses belong to the same router if the IPID values in fragmented IPv6 responses appear to be derived from a single counter, and a graph derived from speedtrap and SNMP. For IPv6, the gains provided by SNMP are more significant, as the effectiveness of the IPv6 IPID as an alias inference vector wanes. Of the 929K IPv6 addresses we probed, 68K returned an SNMPv3 response.
2023-03 | 2024-02 | |
Input: | ||
Number of addresses probed: | 592K | 929K |
Number of ark VPs: | 36 | 54 |
Number of countries: | 18 | 25 |
Speedtrap output: | ||
Nodes with at least two IPs: | 4,945 | 4,129 |
Addresses in nodes with at least two IPs: | 12,638 | 10,886 |
Speedtrap + SNMP output: | ||
Nodes with at least two IPs: | – | 8,935 |
Addresses in nodes with at least two IPs: | – | 35,164 |
Beyond the alias resolution, the nodes are also annotated with their bdrmapIT-inferred operator (expressed as an ASN) as well as an inferred geolocation. We look up the PTR records of all router IP addresses with zdns, following CNAMEs where they exist, and provide these names as part of the ITDK. For router geolocation, we used a combination of DNS-based heuristics, IXP geolocation (routers connected to an IXP are likely located at that IXP), and Maxmind GeoLite2.
We inferred DNS-based geolocation heuristics using RTT measurements from 148 Ark VPs in 52 countries to constrain Hoiho, which automatically infers naming conventions in PTR records as regular expressions, and covered 819 different suffixes (e.g., ^.+\.([a-z]+)\d+\.level3\.net$ and ^.+\.([a-z]{3})\d+\.[a-z\d]+\.cogentco\.com$ extract geolocation hints in hostnames for Level3 and Cogent in the above figure). There is no dominant source of geohint observed in these naming conventions; 443 (54.1%) embedded IATA airport codes (e.g. IAD, WAS for the Washington D.C. area), 310 (37.9%) embedded place names (e.g. Ashburn for Ashburn, VA, US), 87 (10.6%) embedded the first six characters of a CLLI code (e.g. ASBNVA for Ashburn), and 12 (1.4%) embedded locodes (e.g. USQAS for Ashburn, VA, US). Interestingly, the operators that used CLLI and locodes had conventions that were more congruent with observed RTT values than operators that used IATA codes or place names. For the nodes in the ITDK, hoiho provided a geolocation inference for 127K, IXP provided a geolocation inference for 14K, and maxmind covered the remainder. The rules we inferred are usable via CAIDA’s Hoiho API.
ITDKs older than one year are publicly available, and ITDK 2024-02 is available to researchers and CAIDA members, after completing a simple form for access.
Acknowledgment: We are grateful to all of the Ark hosting sites, MaxMind’s freely available geolocation database, and academic research access to Iconectiv’s CLLI database to support this work.