Archive for the 'Data Collection' Category

Recent collections added to DatCat

Monday, September 29th, 2014 by Paul Hick

As announced in the CAIDA blog “Further Improvements to the Internet Data Measurement Catalog (DatCat)” of August 26, 2014, the new Internet Data Measurement Catalogue DatCat is now operational. New entries by the community are welcome, and about a dozen have been added so far. We plan to advertise new and interesting entries on a regular basis with a short entry in this blog. This is the first contribution in this series.

Added on July 31, 2014, was the collection “DNS Zone Files”.;
contributed 2014-07-31 by Tristan Halvorson:

This collection contains Zone files with NS and A records for all new (2013 and later) TLDs.

ICANN has opened up the TLD creation process to a large number of new registries with a centralized service for downloading all of this new data. Each TLD has a separate zone file, and each zone file contains entries for every registered domain. This data collection contains step-by-step instructions to acquire this data directly from the registries through ICANN. This method only works for TLDs released during 2013 or later.

DHS S&T PREDICT PI Meeting, Marina del Rey, CA

Friday, June 6th, 2014 by Josh Polterock

On 28-29 May 2014, DHS Science and Technology Directorate (S&T) held a meeting of the Principal Investigators of the PREDICT (Protected Repository for the Defense of Infrastructure Against Cyber Threats) Project, an initiative to facilitate the accessibility of computer and network operational data for use in cybersecurity defensive R&D. The project is a three-way partnership among government, critical information infrastructure providers, and security development communities (both academic and commercial), all of whom seek technical solutions to protect the public and private information infrastructure. The primary goal of PREDICT is to bridge the gap between producers of security-relevant network operations data and technology developers and evaluators who can leverage this data to accelerate the design, production, and evaluation of next-generation cybersecurity solutions.

In addition to presenting project updates, each PI presented on a special topic suggested by Program Manager Doug Maughan. I presented some reflective thoughts on 10 Years Later: What Would I Have done Differently? (Or what would I do today?). In this presentation, I revisited my 2008 top ten list of things lawyers should know about the Internet to frame some proposed forward-looking strategies for the PREDICT project in 2014.

Also noted at the meeting, DHS recently released a new broad agency announcement (BAA) that will contractually require investigators contribute into PREDICT any data created or used in testing and evaluation of the funded work (if the investigator has redistribution rights, and subject to appropriate disclosure control).

CAIDA Delivers More Data To the Public

Wednesday, February 12th, 2014 by Paul Hick

As part of our mission to foster a collaborative research environment in which data can be acquired and shared, CAIDA has developed a framework that promotes wide dissemination of our datasets to researchers. We classify a dataset as either public or restricted based on a consideration of privacy issues involved in sharing it, as described in our data sharing framework document Promotion of Data Sharing (

Public datasets are available for downloaded from our public dataserver ( subject to conditions specified in our Acceptable Use Agreement (AUA) for public data ( CAIDA provides access to restricted datasets conditionally to qualifying researchers of academic and CAIDA-member institutions agreeing to a more restrictive AUA (

In January 2014 we reviewed our collection of datasets in order to re-evaluate their classification. As a result, as of February 1, we have converted several popular restricted CAIDA datasets into public datasets, including most of one of our largest and most popular data collections: topology data from the (now retired) skitter measurement infrastructure (operational between 1998 and 2008), and its successor, the Archipelago (or Ark) infrastructure (operational since September 2007). We have now made all IPv4 measurements older than two years (which includes all skitter data) publicly available. In addition to the raw data, this topology data includes derived datasets such as the Internet Topology Data Kits (ITDKs). Further, to encourage research on IPv6 deployment, we made our IPv6 Ark topology and performance measurements, from,December 2008 up to the present, publicly available as a whole. We have added these new public data to the existing category of public data sets, which includes AS links data inferred from traceroute measurements taken by skitter and Ark platforms.

Several other datasets remain under consideration for public release, so stay tuned. For an overview of all datasets currently provided by CAIDA (both public and restricted) see our data overview page (

Support for this data collection and sharing provided by DHS Science and Technology Directorate’s PREDICT project via Cooperative Agreement FA8750-12-2-0326 and NSF’s Computing Research Infrastructure Program via CNS-0958547.



(Re)introducing the Internet Measurement Data Catalog (DatCat)

Monday, October 7th, 2013 by Josh Polterock

In 2002, we began to create a metadata catalog where operators and other data owners could index their datasets. We were motivated by several goals we hoped the catalog would enable: data providers sharing data with researchers; researchers finding data to support specific research questions; promoting reproducibility of scientific results using Internet data; and correlating heterogeneous measurement data to analyze macroscopic Internet trends. This last goal was perhaps the most ambitious: we imagined a scenario where enough data would be richly enough indexed that the metadata itself would reveal macroscopic trends about Internet (traffic) characteristics, e.g., average packet size over time, average fraction of traffic carried via HTTP, without even needing to touch the underlying traffic data (netflow or pcap files).

To support this variety of uses of the catalog, we developed a rich metadata model that supported extremely precise descriptions of indexed data sets. For a given data set, one could specify: a description of a collection of files with similar purpose; scholarly paper, articles or publications that make use of the data; descriptions of the files containing the actual data and its format; the package format used for download; contact information; location of the data; a list of keywords; the size of the files/collection; the geographic, network, and logical location of the data; the platform used to collect the data; the start time, end time, and duration; and free form user notes. We allowed searching on any of these fields.

The catalog allows the user to not only index data but also flexibly group data sets into collections, link collections to entries describing the tools used to collect the data, and link collections to publications that used the data. We considered many corner cases and implemented our complex metadata model in an industrial strength relational database. We released the Internet Measurement Data Catalog (DatCat) in June of 2006, prepopulated with our own data sets and introduced via a hands-on workshop where we helped create scripts to assist other researchers in indexing their own data for contribution to the catalog.

In retrospect, we over-complicated the data model and the process of data indexing. Having undertaken data collection for years ourselves, we were familiar with the jargon used to describe precise characteristics of data and the variety of scenarios in which users collect Internet data. We tried to cover each and every possible case. We overshot. The result was a cumbersome and time-consuming interface. Based on feedback from those who took the time to contribute to DatCat, it became clear that we needed to streamline the submission interface. Further, we had built the original service atop an expensive, proprietary database that incurred unnecessary licensing costs.

In August 2011, NSF’s SDCI program provided support for three additional tasks building on what we learned: (1) reduce the burden on those contributing data via a streamlined interface and tools for easier indexing, annotation and navigation of relevant data; (2) convert from use of a proprietary database backend (Oracle) to a completely open source solution (Postgresql); and (3) expand DatCat’s relevance to the cybersecurity and other research communities via forums.

The new database objects have drastically fewer required fields so that contributors can more easily enter new dataset collections. The new streamlined collections require only collection name, short description, and summary fields. We have the new DatCat web site back online serving with the new open-source Postgresql database backend and streamlined interface. Also, we developed a public forums interface to hold discussions of data sharing issues and to answer frequently asked questions regarding the DatCat and the information it contains.

We hope that DatCat evolves to become a lightweight mechanism supporting operators and researchers who want to announce the availability and existence of datasets relevant to (cybersecurity) research. It could also assist NSF PIs with the new requirement that every proposal must include a data management plan for documenting types of data, data and metadata standards, policies for access, sharing, and provisions for re-use, re-distribution, and derivative works and location of archives. Finally, we hope the DatCat service will facilitate collaboration among cybersecurity and network researchers and operators around the world.

We now invite you to take a(nother) look at the Internet Measurement Data Catalog (DatCat). Please point your browser at, browse the catalog, run a few searches, crawl the keywords, create an account, and index your favorite dataset. Please send any questions or feedback to info at datcat dot org.

Targeted Serendipity: the Search for Storage

Wednesday, April 4th, 2012 by Josh Polterock

On the heels of our recent press release regarding fresh publications that  make use of the UCSD Network Telescope data, we would like to take a moment to thank the institutions that have helped preserve this data over the last eight years. Though we recently received an NSF award to enable  near-real-time sharing of this data as well as improved classification, the award does not cover the cost to maintain this historic archive. At current UCSD rates, the 104.66 TiB would cost us approximately $40,000 per year to store. This does not take into account the metadata we have collected which adds roughly 20 TB to the original data.  As a result, we had spent the last several months indexing this data in preparation for deleting it forever.

Then, last month, I had the opportunity to attend the Security at the Cyberborder Workshop in Indianapolis. This workshop focused on how the NSF-funded IRNC networks might (1) capture and articulate technical and policy cybersecurity considerations related to international research network connections, and (2) capture opportunities and challenges for the those connections to foster cybersecurity research.  I did not expect to find a new benefactor for storage of our telescope data at the workshop though, in fact, I did.


Internet Censorship Revealed Through the Haze of Malware Pollution

Wednesday, March 28th, 2012 by Josh Polterock

We were happy to see the coverage of UCSD’s press release describing two papers we recently published, introducing new methods and applications for analyzing dark net data (aka “Internet background radiation” or IBR).  The first paper, “Analysis of Country-wide Internet Outages Caused by Censorship”, presented by author Alberto Dainotti last November at IMC 2011, focused on using IBR in conjunction with other data sources to reveal previously unreported aspects of the disruptions seen during the uprisings of early 2011 in Egypt and Libya. The second paper, “Extracting benefit from harm: using malware pollution to analyze the impact of political and geophysical events on the Internet”, published in ACM SIGCOMM CCR (January 12), used IBR data observed by UCSD’s network telescope to characterize Internet outages caused by natural disasters. In both cases the analysis of this (mostly malware-generated) background traffic contributed to our understanding of events unrelated to the malware itself. Our press release was picked up by several online publications, including The Wall Street Journal Blog, ACM TechnewsCommunications of the ACM Web siteSpacedailyPhysorgTom’s GuideProduct Design & DevelopmentNewswiseDomain-bEurekAlertEurasia reviewSecurity-today.comEverything San DiegoSpacewar Cyber War.

The papers are also available on CAIDA’s publications page.

data collection and reporting requirements for broadband stimulus recipients

Thursday, November 12th, 2009 by kc

No one was more surprised than I to see data collection requirements in the NTIA’s Notice of Funds Availability (NOFA) for the Rural Utilities Service’s (RUS) Broadband Initiatives Program (BIP) and the Broadband Technology Opportunities Programs (BTOP):


Proposal for ICANN/RIR scenario planning exercise

Monday, May 25th, 2009 by kc

Internet infrastructure economics research”, and how to do reasonable examples of it, has come up a lot lately, so i’m posting a brief description of an academic+icann community workshop i’ve been recommending for a few years, which has yet to happen, and (I still believe) is long past due, and specifically more important than passing policies, especially emergency ones to allow IP address markets with no supporting research on the impact on security and stability of the Internet, and even at the risk of killing IPv6 altogether.]


DatCat and DITL (day-in-the-life) data used in classroom curriculum — anonymization revisited

Friday, January 23rd, 2009 by kc

I was delighted to see Sid Faber and Tim Shimeall co-teaching a “Network situational awareness” course at Carnegie-Mellon University last semester, using DatCat and DITL data, they even put the class projects online. Not only did some of the students use DITL data (contributed by Japanese academics), as well as Internet2’s netflow data, but they used DatCat to find both data sets. To quote Sid,

“About three weeks into the class, we finally got across one of the key features to the students: we were looking at how things really work on the internet, not just a theoretical discussion of RFCs. The data sets were invaluable, but we had challenges dealing with anonymization, sampling, and the overall volume of the data sets — kind of understandable for the first offering of the course.”


proposition: International Bureau of Internet Statistics

Friday, January 9th, 2009 by kc

Last month I submitted two proposals to the National Cyber Leap Year call for input from the U.S. Networking Information Technology Research and Development (NITRD) Program. I submitted two ideas, the International Bureau of Internet Statistics, and Cooperative Measurement and Modeling of Open Networked Systems (COMMONS, a two-year old idea). The Bureau of Internet Statistics still strikes some as batty, but over the holidays I caught up on some panicky OECD state-of-malware-landscape papers on how uninformed we are and how little data we have, while the only concrete recommendation in the “ITU’s study on the financial aspects of network security: malware and spam” report was

Although the financial aspects of malware and spam are increasingly documented, serious gaps and inconsistencies exist in the available information. This sketchy information base also complicates finding meaningful and effective responses. For this reason, more systematic efforts to gather more reliable information would be highly desirable.