DatCat and DITL (day-in-the-life) data used in classroom curriculum — anonymization revisited
January 23rd, 2009 by kc

I was delighted to see Sid Faber and Tim Shimeall co-teaching a “Network situational awareness” course at Carnegie Mellon University last semester using DatCat and DITL data; they even put the class projects online. Not only did some of the students use DITL data (contributed by Japanese academics) as well as Internet2’s netflow data, but they also used DatCat to find both data sets. To quote Sid,
“About three weeks into the class, we finally got across one of the key features to the students: we were looking at how things really work on the internet, not just a theoretical discussion of RFCs. The data sets were invaluable, but we had challenges dealing with anonymization, sampling, and the overall volume of the data sets — kind of understandable for the first offering of the course.”
Sid also repeated something I’ve heard too many times from the research community: the anonymization scheme for the Internet2 flow data, which masks 11 bits of the IP address, makes the data nearly useless for the research they were trying to do. In particular, Muley & Allen’s project on implementing traffic management could only identify blocks of traffic from networks, not individual computers. Nor could they do any network profiling, e.g., counting web servers, identifying client gateways, or finding un-NAT’d clients. The data also indicated substantial asymmetric routing (the data shows only one direction of a conversation on the given link), which may be an artifact of the monitor setup, of the 1/100 packet sampling, or of actual traffic patterns. Apparently other students, working with data sets anonymized with prefix-preserving schemes, had much more success.
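For readers who haven’t wrestled with these schemes, here is a minimal Python sketch of the difference. It is my own illustration, not the code Internet2 or the students used; I am assuming the 11 masked bits are the low-order host bits (consistent with the students only being able to identify blocks of traffic, not individual computers), and the secret key and example addresses are made up.

```python
# Toy contrast between bit-masking and prefix-preserving anonymization.
# Illustrative sketch only; not Internet2's or CAIDA's actual tooling.

import hashlib
import ipaddress

MASK_BITS = 11                      # assumed: low 11 bits zeroed, i.e. /21 granularity
SECRET_KEY = b"example-secret"      # hypothetical key for the toy prefix-preserving scheme


def mask_low_bits(addr: str, bits: int = MASK_BITS) -> str:
    """Zero the low `bits` bits, collapsing every host in a /(32-bits) block."""
    ip = int(ipaddress.IPv4Address(addr))
    return str(ipaddress.IPv4Address(ip & ~((1 << bits) - 1)))


def prefix_preserving(addr: str, key: bytes = SECRET_KEY) -> str:
    """Crypto-PAn-style toy mapping: bit i is flipped by a pseudorandom function of
    the first i original bits, so addresses sharing a real prefix share an
    anonymized prefix of the same length."""
    ip = int(ipaddress.IPv4Address(addr))
    out = 0
    for i in range(32):
        prefix = ip >> (32 - i)                 # the i most-significant original bits
        digest = hashlib.sha256(key + prefix.to_bytes(4, "big") + bytes([i])).digest()
        flip = digest[0] & 1                    # pseudorandom bit derived from that prefix
        orig_bit = (ip >> (31 - i)) & 1
        out = (out << 1) | (orig_bit ^ flip)
    return str(ipaddress.IPv4Address(out))


if __name__ == "__main__":
    a, b = "192.0.2.17", "192.0.2.200"          # two hosts in the same /24
    # Masking erases the host bits: both map to the same /21 block address,
    # so individual servers and clients can no longer be told apart.
    print(mask_low_bits(a), mask_low_bits(b))
    # The prefix-preserving mapping keeps the two hosts distinct while still
    # hiding the real prefix behind a pseudorandom one.
    print(prefix_preserving(a), prefix_preserving(b))
```

The upshot: under the masking scheme, two hosts in the same block become indistinguishable, which is exactly why the students could not count web servers or find un-NAT’d clients; the prefix-preserving scheme keeps hosts distinct but maps prefixes to pseudorandom values, which frustrates questions that need to know which real organization a prefix belongs to (see below).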
On the other hand, a prefix-preserving anonymization scheme would make it impossible to investigate other questions, such as estimating traffic flow between organizations attached to Internet2. Internet2 is in a difficult position, trying to accommodate as many researcher needs as possible, but with no guiding set of priorities or concrete questions from the network research community to influence its data collection and curation architectures. One could argue that Internet2 should be investing resources in gathering this list of questions, but one could also argue such a list should come from the network research community, or network operators, or funding agencies, or lately, law enforcement.
We did brainstorm a set of questions researchers would like answered about the Internet at a January 2008 workshop in support of DITL2008, but we mostly learned from that exercise that we were not going to be able to collect data to answer most of the questions on the list. Emile also set up a Google Moderator page to invite others to vote on which empirical questions about the Internet they think are important, but we have not announced it widely yet. The reality is that Internet data is so scarce, and the pressure on researchers to publish papers so strong, that researchers often just look around for data sets that might lead to a publication, rather than investing effort in coming up with the most important questions, ascertaining what data is needed to study them, and then trying to overcome all the legal, financial, and logistical obstacles to getting that data. So we muddle through, studying a system whose opaqueness gives the subprime mortgage industry a run for its money. Only our system is already obviously riddled with corruption and worse. Our time will come.
[Disclosure: I’ve spent the last year on Internet2’s Research Advisory Council trying to advise Internet2 management that they must prioritize dealing with data privacy issues so that they can reasonably support their network science mission. It’s slow going at Internet2, though: in about 14 months we’ve gotten as far as starting another committee, which Internet2 promises to announce on their web site soon.]