Archive for the 'Data Collection' Category

Help save the Internet: Install the new Spoofer client (v1.1.0)!

Sunday, December 18th, 2016 by Josh Polterock

The greatest security vulnerability of the Internet (TCP/IP) architecture is the lack of source address validation, i.e., any sender may put a fake source address in a packet, and the destination-based routing protocols that glue together the global Internet will get that packet to its intended destination. Attackers exploit this vulnerability by sending many (millions of) spoofed-source-address packets to services on the Internet they wish to disrupt (or take offline altogether). Attackers can further leverage intermediate servers to amplify such packets into even larger packets that will cause greater disruption for the same effort on the attacker’s part.

Although the IETF recommended best practices to mitigate this vulnerability by configuring routers to validate that source addresses in packets are legitimate, compliance with such practices (BCP38 and BCP84) are notoriously incentive-incompatible. That is, source address validation (SAV) can be a burden to a network who supports it, but its deployment by definition helps not that network but other networks who are thus protected from spoofed-source attacks from that network. Nonetheless, any network who does not deploy BCP38 is “part of the DDoS problem”.

Over the past several months, CAIDA, in collaboration with Matthew Luckie at the University of Waikato, has upgraded Rob Beverly’s original spoofing measurement system, developing new client tools for measuring IPv4 and IPv6 spoofing capabilities, along with services that provide reporting and allow users to opt-in or out of sharing the data publicly. To find out if your network provider(s), or any network you are visiting, implements filtering and allow IP spoofing, point your web browser at and install our simple client.

This newly released spoofer v1.1.0 client has implemented parallel probing of targets, providing a 5x increase in speed to complete the test, relative to v.1.0. Among other changes, this new prober uses scamper instead of traceroute when possible, and has improved display of results. The installer for Microsoft Windows now configures Windows Firewall.

For more technical details about the problem of IP spoofing and our approach to measurement, reporting, notifications and remediation, see the slides from Matthew Luckie’s recent slideset, “Software Systems for Surveying Spoofing Susceptibility”, presented to the Australian Network Operators Group (AusNOG) in September 2016.

The project web page reports recently run tests from clients willing to share data publicly, test results classified by Autonomous System (AS) and by country, and a summary statistics of IP spoofing over time. We will enhance these reports over the coming months.

This material is based on research sponsored by the Department of Homeland Security (DHS) Science and Technology Directorate, Homeland Security Advanced Research Projects Agency, Cyber Security Division (DHS S&T/HSARPA/CSD) BAA HSHQDC-14-R-B0005, and the Government of United Kingdom of Great Britain and Northern Ireland via contract number D15PC00188. Views should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Department of Homeland Security, the U.S. Government, or the Government of United Kingdom of Great Britain and Northern Ireland.

The Remote Peering Jedi

Friday, November 11th, 2016 by Josh Polterock

During the RIPE 73 IXP Tools Hackathon, Vasileios Giotsas, working with collaborators at FORTH/University of Crete, AMS-IX, University College, London, and NFT Consult, created the Remote Peering Jedi Tool to provide a view into the remote peering ecosystem. Given a large and diverse corpus of traceroute data, the tool detects and localizes remote peering at Internet Exchange Points (IXP).

To make informed decisions, researchers and operators desire to know who has remote peering at the various IXPs. For their RIPE hackathon project, the group created a tool to automate the detection using average RTTs from the RIPE Atlas’ massive corpus of traceroute paths. The group collected validation data from boxes inside the three large IXPs to compare to RTTs estimated via Atlas. The data suggests possible opportunities for Content Distribution Networks (CDN) to improve services for smaller IXPs. The project results also offer insights into how to interpret some of the information in PeeringDB. The project further examined how presence-informed RTT geolocation can contribute to identifying the location of resources. These results help reduce the problem space by exploiting the fact that the IP space of a given AS can appear where the AS has presence.

For more details, you can watch Vasileios’ presentation of the Remote Peering Jedi Tool. Or, visit the remote peering portal to see the tool in action.


CAIDA’s 2015 Annual Report

Tuesday, July 19th, 2016 by kc

[Executive summary and link below]

The CAIDA annual report summarizes CAIDA’s activities for 2015, in the areas of research, infrastructure, data collection and analysis. Our research projects span Internet topology, routing, security, economics, future Internet architectures, and policy. Our infrastructure, software development, and data sharing activities support measurement-based internet research, both at CAIDA and around the world, with focus on the health and integrity of the global Internet ecosystem. The executive summary is excerpted below:

Mapping the Internet. We continued to pursue Internet cartography, improving our IPv4 and IPv6 topology mapping capabilities using our expanding and extensible Ark measurement infrastructure. We improved the accuracy and sophistication of our topology annotation capabilities, including classification of ISPs and their business relationships. Using our evolving IP address alias resolution measurement system, we collected curated, and released another Internet Topology Data Kit (ITDK).

Mapping Interconnection Connectivity and Congestion.
We used the Ark infrastructure to support an ambitious collaboration with MIT to map the rich mesh of interconnection in the Internet, with a focus on congestion induced by evolving peering and traffic management practices of CDNs and access ISPs, including methods to detect and localize the congestion to specific points in networks. We undertook several studies to pursue different dimensions of this challenge: identification of interconnection borders from comprehensive measurements of the global Internet topology; identification of the actual physical location (facility) of an interconnection in specific circumstances; and mapping observed evidence of congestion at points of interconnection. We continued producing other related data collection and analysis to enable evaluation of these measurements in the larger context of the evolving ecosystem: quantifying a given ISP’s global routing footprint; classification of autonomous systems (ASes) according to business type; and mapping ASes to their owning organizations. In parallel, we examined the peering ecosystem from an economic perspective, exploring fundamental weaknesses and systemic problems of the currently deployed economic framework of Internet interconnection that will continue to cause peering disputes between ASes.

Monitoring Global Internet Security and Stability. We conduct other global monitoring projects, which focus on security and stability aspects of the global Internet: traffic interception events (hijacks), macroscopic outages, and network filtering of spoofed packets. Each of these projects leverages the existing Ark infrastructure, but each has also required the development of new measurement and data aggregation and analysis tools and infrastructure, now at various stages of development. We were tremendously excited to finally finish and release BGPstream, a software framework for processing large amounts of historical and live BGP measurement data. BGPstream serves as one of several data analysis components of our outage-detection monitoring infrastructure, a prototype of which was operating at the end of the year. We published four other papers that either use or leverage the results of internet scanning and other unsolicited traffic to infer macroscopic properties of the Internet.

Future Internet Architectures. The current TCP/IP architecture is showing its age, and the slow uptake of its ostensible upgrade, IPv6, has inspired NSF and other research funding agencies around the world to invest in research on entirely new Internet architectures. We continue to help launch this moonshot from several angles — routing, security, testbed, management — while also pursuing and publishing results of six empirical studies of IPv6 deployment and evolution.

Public Policy. Our final research thrust is public policy, an area that expanded in 2015, due to requests from policymakers for empirical research results or guidance to inform industry tussles and telecommunication policies. Most notably, the FCC and AT&T selected CAIDA to be the Independent Measurement Expert in the context of the AT&T/DirecTV merger, which turned out to be as much of a challenge as it was an honor. We also published three position papers each aimed at optimizing different public policy outcomes in the face of a rapidly evolving information and communication technology landscape. We contributed to the development of frameworks for ethical assessment of Internet measurement research methods.

Our infrastructure operations activities also grew this year. We continued to operate active and passive measurement infrastructure with visibility into global Internet behavior, and associated software tools that facilitate network research and security vulnerability analysis. In addition to BGPstream, we expanded our infrastructure activities to include a client-server system for allowing measurement of compliance with BCP38 (ingress filtering best practices) across government, research, and commercial networks, and analysis of resulting data in support of compliance efforts. Our 2014 efforts to expand our data sharing efforts by making older topology and some traffic data sets public have dramatically increased use of our data, reflected in our data sharing statistics. In addition, we were happy to help launch DHS’ new IMPACT data sharing initiative toward the end of the year.

Finally, as always, we engaged in a variety of tool development, and outreach activities, including maintaining web sites, publishing 27 peer-reviewed papers, 3 technical reports, 3 workshop reports, 33 presentations, 14 blog entries, and hosting 5 workshops. This report summarizes the status of our activities; details about our research are available in papers, presentations, and interactive resources on our web sites. We also provide listings and links to software tools and data sets shared, and statistics reflecting their usage. sources. Finally, we offer a “CAIDA in numbers” section: statistics on our performance, financial reporting, and supporting resources, including visiting scholars and students, and all funding sources.

For the full 2015 annual report, see

Toward a Congestion Heatmap of the Internet

Friday, June 3rd, 2016 by Amogh Dhamdhere

In the past year, we have made substantial progress on a system to measure congestion on interdomain links between networks. This effort is part of our NSF-funded project on measuring interdomain connectivity and congestion. The basic nugget of our technique is to send TTL-limited probes from a vantage point (VP) within a network, toward the near and the far end of an interdomain (border) link of that network, and to monitor diurnal patterns in the near and far-side time series. We refer to this method as “Time-Series Latency Probing”, or TSLP. Our hypothesis is that a persistently elevated RTT to the far end of the link, but no corresponding RTT elevation to the near side, is a signal of congestion at the interdomain link.

It turns out that identifying interdomain links from a VP inside a network is surprisingly challenging, for several reasons: lack of standard IP address assignment practices for inter domain links; unadvertised address space by ISPs; and myriad things that can go wrong with traceroute measurements (third-party addresses, unresponsive routers). See our paper at the 2014 Internet Measurement Conference (IMC) for a description of these issues. To overcome those challenges and identify network borders from within a network, we have developed bdrmap, an active measurement tool to accurately identify interdomain links between networks. A paper describing the bdrmap algorithms is currently under submission to IMC 2016.

Our second major activity in the last year has been to develop a backend system that manages TSLP probing from our set of distributed vantage points, collects and organizes data, and presents that data for easy analysis and visualization. A major goal of the backend system is to be adaptive, i.e., the probing state should adapt to topological and routing changes in the network. To this end, we run the bdrmap topology discovery process continuously on each VP. Every day, we process completed bdrmap runs from each monitor and add newly discovered interdomain links or update the probing state for existing links (i.e., destinations we can use to probe those links, and the distance of those links from our VP). We then push updated probing lists to the monitor. This adaptive process ensures that we always probe a relatively current state of thousands of interdomain links visible from our VPs.

Third, we have greatly expanded the scale of our measurement system. We started this project in 2014 with an initial set of approximately ten VPs in 5-6 access networks mostly in the United States. We are now running congestion measurements from over sixty Archipelago VPs in 39 networks and 26 countries around the world. Our Ark VPs have sufficient memory and compute power to run both the border mapping process and the TSLP probing without any issues. However, when we looked into porting our measurements to other active measurement platforms such as Bismark or the FCC’s measurement infrastructure operated by SamKnows, we found that the OpenWRT-based home routers were too resource-constrained to run bdrmap and TSLP directly. To overcome this challenge, we developed a method to move the bulk of the resource-intensive processing from the VPs to a central controller at CAIDA, so the VP only has to run an efficient probing engine (scamper) with a small memory footprint and low CPU usage. We have deployed a test set of 15 Bismark home routers in this type of remote configuration, with lots of help from the folks at the Bismark Project. Our next target deployment will be a set of >5000 home routers that are part of the FCC-SamKnows Measuring Broadband America infrastructure.

A fourth major advance we have made in the last year is in visualization and analysis of the generated time series data. We were on the lookout for a time series database to store, process and visualize the TSLP data. After some initial experimentation, we found influxDB to be well-suited to our needs, due to its ability to scale to millions of time series, scalable and usable read/write API, and SQL-like querying capability. We also discovered Grafana, a graphing frontend that integrates seamlessly with the influxDB database to provide interactive querying and graphing capability. Visualizing time series plots from a given VP to various neighbor networks and browsing hundreds of time series plots is now possible with a few mouse clicks on the Grafana UI. The figure below shows RTT data for 7 interdomain links between a U.S. access provider and a content provider over the course of a week. This graph took a few minutes to produce with influxDB and Grafana; previously this data exploration would have taken hours using data stored in standard relational databases.



As the cherry on the cake, we have set up the entire system to provide a near real-time view of congestion events. TSLP data is pulled off our VPs and indexed into the influxDB database within 30 minutes of being generated. Grafana provides an auto-refresh mode wherein we can set up a dashboard to periodically refresh when new data is available. There is no technical barrier to shortening the 30-minute duration to an arbitrarily short duration, within reason. The figure below shows a pre-configured dashboard with the real-time congestion state of interdomain links from 5 large access networks in the US to 3 different content providers/CDNs (network names anonymized). Several graphs on that dashboard show a diurnal pattern that signals evidence of congestion on the interdomain link. While drawing pretty pictures and having everything run faster is certainly satisfying, it is neither the goal nor the most challenging aspect of this project. A visualization is only as good as the data that goes into it. Drawing graphs was the easy part; developing a sustainable and scalable system that will keep producing meaningful data was infinitely more challenging. We are delighted with where we are at the moment, and look forward to opening up the data exploration interface for external users.


So what happens next? We are far from done here. We are currently working on data analysis modules for time series data with the goal of producing alarms, automatically and without human intervention, that indicate evidence of congestion. Those alarms will be input to a reactive measurement system that we have developed to distribute on-demand measurement tasks to VPs. We envision different types of reactive measurement tasks, e.g., confirming the latency-based evidence of congestion by launching probes to measure loss rate, estimating the impact on achievable throughput by running NDT tests, or estimating potential impacts to user Quality of Experience (QoE). The diagram below shows the various components of the measurement system we are developing. The major piece that remains is continuous analysis of the TSLP data, generating alarms, and pushing on-demand measurements to the reactive measurement system. Stay tuned!


The team: Amogh Dhamdhere, Matthew Luckie, Alex Gamero-Garrido, Bradley Huffaker, kc claffy, Steve Bauer, David Clark

1st CAIDA BGP Hackathon brings students and community experts together

Thursday, February 18th, 2016 by Josh Polterock

We set out to conduct a social experiment of sorts, to host a hackathon to hack streaming BGP data. We had no idea we would get such an enthusiastic reaction from the community and that we would reach capacity. We were pleasantly surprised at the response to our invitations when 25 experts came to interact with 50 researchers and practitioners (30 of whom were graduate students). We felt honored to have participants from 15 countries around the world and experts from companies such as Cisco, Comcast, Google, Facebook and NTT, who came to share their knowledge and to help guide and assist our challenge teams.

Having so many domain experts from so many institutions and companies with deep technical understanding of the BGP ecosystem together in one room greatly increased the kinetic potential for what we might accomplish over the course of our two days.


Recent collections added to DatCat

Monday, September 29th, 2014 by Paul Hick

As announced in the CAIDA blog “Further Improvements to the Internet Data Measurement Catalog (DatCat)” of August 26, 2014, the new Internet Data Measurement Catalogue DatCat is now operational. New entries by the community are welcome, and about a dozen have been added so far. We plan to advertise new and interesting entries on a regular basis with a short entry in this blog. This is the first contribution in this series.

Added on July 31, 2014, was the collection “DNS Zone Files”.;
contributed 2014-07-31 by Tristan Halvorson:

This collection contains Zone files with NS and A records for all new (2013 and later) TLDs.

ICANN has opened up the TLD creation process to a large number of new registries with a centralized service for downloading all of this new data. Each TLD has a separate zone file, and each zone file contains entries for every registered domain. This data collection contains step-by-step instructions to acquire this data directly from the registries through ICANN. This method only works for TLDs released during 2013 or later.

DHS S&T PREDICT PI Meeting, Marina del Rey, CA

Friday, June 6th, 2014 by Josh Polterock

On 28-29 May 2014, DHS Science and Technology Directorate (S&T) held a meeting of the Principal Investigators of the PREDICT (Protected Repository for the Defense of Infrastructure Against Cyber Threats) Project, an initiative to facilitate the accessibility of computer and network operational data for use in cybersecurity defensive R&D. The project is a three-way partnership among government, critical information infrastructure providers, and security development communities (both academic and commercial), all of whom seek technical solutions to protect the public and private information infrastructure. The primary goal of PREDICT is to bridge the gap between producers of security-relevant network operations data and technology developers and evaluators who can leverage this data to accelerate the design, production, and evaluation of next-generation cybersecurity solutions.

In addition to presenting project updates, each PI presented on a special topic suggested by Program Manager Doug Maughan. I presented some reflective thoughts on 10 Years Later: What Would I Have done Differently? (Or what would I do today?). In this presentation, I revisited my 2008 top ten list of things lawyers should know about the Internet to frame some proposed forward-looking strategies for the PREDICT project in 2014.

Also noted at the meeting, DHS recently released a new broad agency announcement (BAA) that will contractually require investigators contribute into PREDICT any data created or used in testing and evaluation of the funded work (if the investigator has redistribution rights, and subject to appropriate disclosure control).

CAIDA Delivers More Data To the Public

Wednesday, February 12th, 2014 by Paul Hick

As part of our mission to foster a collaborative research environment in which data can be acquired and shared, CAIDA has developed a framework that promotes wide dissemination of our datasets to researchers. We classify a dataset as either public or restricted based on a consideration of privacy issues involved in sharing it, as described in our data sharing framework document Promotion of Data Sharing (

Public datasets are available for downloaded from our public dataserver ( subject to conditions specified in our Acceptable Use Agreement (AUA) for public data ( CAIDA provides access to restricted datasets conditionally to qualifying researchers of academic and CAIDA-member institutions agreeing to a more restrictive AUA (

In January 2014 we reviewed our collection of datasets in order to re-evaluate their classification. As a result, as of February 1, we have converted several popular restricted CAIDA datasets into public datasets, including most of one of our largest and most popular data collections: topology data from the (now retired) skitter measurement infrastructure (operational between 1998 and 2008), and its successor, the Archipelago (or Ark) infrastructure (operational since September 2007). We have now made all IPv4 measurements older than two years (which includes all skitter data) publicly available. In addition to the raw data, this topology data includes derived datasets such as the Internet Topology Data Kits (ITDKs). Further, to encourage research on IPv6 deployment, we made our IPv6 Ark topology and performance measurements, from,December 2008 up to the present, publicly available as a whole. We have added these new public data to the existing category of public data sets, which includes AS links data inferred from traceroute measurements taken by skitter and Ark platforms.

Several other datasets remain under consideration for public release, so stay tuned. For an overview of all datasets currently provided by CAIDA (both public and restricted) see our data overview page (

Support for this data collection and sharing provided by DHS Science and Technology Directorate’s PREDICT project via Cooperative Agreement FA8750-12-2-0326 and NSF’s Computing Research Infrastructure Program via CNS-0958547.



(Re)introducing the Internet Measurement Data Catalog (DatCat)

Monday, October 7th, 2013 by Josh Polterock

In 2002, we began to create a metadata catalog where operators and other data owners could index their datasets. We were motivated by several goals we hoped the catalog would enable: data providers sharing data with researchers; researchers finding data to support specific research questions; promoting reproducibility of scientific results using Internet data; and correlating heterogeneous measurement data to analyze macroscopic Internet trends. This last goal was perhaps the most ambitious: we imagined a scenario where enough data would be richly enough indexed that the metadata itself would reveal macroscopic trends about Internet (traffic) characteristics, e.g., average packet size over time, average fraction of traffic carried via HTTP, without even needing to touch the underlying traffic data (netflow or pcap files).

To support this variety of uses of the catalog, we developed a rich metadata model that supported extremely precise descriptions of indexed data sets. For a given data set, one could specify: a description of a collection of files with similar purpose; scholarly paper, articles or publications that make use of the data; descriptions of the files containing the actual data and its format; the package format used for download; contact information; location of the data; a list of keywords; the size of the files/collection; the geographic, network, and logical location of the data; the platform used to collect the data; the start time, end time, and duration; and free form user notes. We allowed searching on any of these fields.

The catalog allows the user to not only index data but also flexibly group data sets into collections, link collections to entries describing the tools used to collect the data, and link collections to publications that used the data. We considered many corner cases and implemented our complex metadata model in an industrial strength relational database. We released the Internet Measurement Data Catalog (DatCat) in June of 2006, prepopulated with our own data sets and introduced via a hands-on workshop where we helped create scripts to assist other researchers in indexing their own data for contribution to the catalog.

In retrospect, we over-complicated the data model and the process of data indexing. Having undertaken data collection for years ourselves, we were familiar with the jargon used to describe precise characteristics of data and the variety of scenarios in which users collect Internet data. We tried to cover each and every possible case. We overshot. The result was a cumbersome and time-consuming interface. Based on feedback from those who took the time to contribute to DatCat, it became clear that we needed to streamline the submission interface. Further, we had built the original service atop an expensive, proprietary database that incurred unnecessary licensing costs.

In August 2011, NSF’s SDCI program provided support for three additional tasks building on what we learned: (1) reduce the burden on those contributing data via a streamlined interface and tools for easier indexing, annotation and navigation of relevant data; (2) convert from use of a proprietary database backend (Oracle) to a completely open source solution (Postgresql); and (3) expand DatCat’s relevance to the cybersecurity and other research communities via forums.

The new database objects have drastically fewer required fields so that contributors can more easily enter new dataset collections. The new streamlined collections require only collection name, short description, and summary fields. We have the new DatCat web site back online serving with the new open-source Postgresql database backend and streamlined interface. Also, we developed a public forums interface to hold discussions of data sharing issues and to answer frequently asked questions regarding the DatCat and the information it contains.

We hope that DatCat evolves to become a lightweight mechanism supporting operators and researchers who want to announce the availability and existence of datasets relevant to (cybersecurity) research. It could also assist NSF PIs with the new requirement that every proposal must include a data management plan for documenting types of data, data and metadata standards, policies for access, sharing, and provisions for re-use, re-distribution, and derivative works and location of archives. Finally, we hope the DatCat service will facilitate collaboration among cybersecurity and network researchers and operators around the world.

We now invite you to take a(nother) look at the Internet Measurement Data Catalog (DatCat). Please point your browser at, browse the catalog, run a few searches, crawl the keywords, create an account, and index your favorite dataset. Please send any questions or feedback to info at datcat dot org.

Targeted Serendipity: the Search for Storage

Wednesday, April 4th, 2012 by Josh Polterock

On the heels of our recent press release regarding fresh publications that  make use of the UCSD Network Telescope data, we would like to take a moment to thank the institutions that have helped preserve this data over the last eight years. Though we recently received an NSF award to enable  near-real-time sharing of this data as well as improved classification, the award does not cover the cost to maintain this historic archive. At current UCSD rates, the 104.66 TiB would cost us approximately $40,000 per year to store. This does not take into account the metadata we have collected which adds roughly 20 TB to the original data.  As a result, we had spent the last several months indexing this data in preparation for deleting it forever.

Then, last month, I had the opportunity to attend the Security at the Cyberborder Workshop in Indianapolis. This workshop focused on how the NSF-funded IRNC networks might (1) capture and articulate technical and policy cybersecurity considerations related to international research network connections, and (2) capture opportunities and challenges for the those connections to foster cybersecurity research.  I did not expect to find a new benefactor for storage of our telescope data at the workshop though, in fact, I did.