Carna botnet scans confirmed
May 13th, 2013 by Alistair King and Alberto Dainotti

On March 17, 2013, the authors of an anonymous email to the “Full Disclosure” mailing list announced that during the previous year they had probed the entire IPv4 Internet. They claimed they did so using a botnet (named the “carna” botnet) created by infecting machines that were vulnerable because they used default login/password pairs (e.g., admin/admin). The botnet instructed each of these machines to execute a portion of the scan and then transfer the results to a central server. The authors also published a detailed description of how they operated, along with 9TB of raw logs of the scanning activity.
Online magazines and newspapers reported the news, which triggered some debate in the research community about the ethical implications of using such data for research purposes. A more fundamental question received less attention: since the authors went out of their way to remain anonymous, and the only data available about this event is the data they provide, how do we know this scan actually happened? If it did, how do we know that the resulting data is correct?
Since we could not find any third-party validation of this event, we looked for evidence in the traffic captured at the UCSD Network Telescope (a large darknet). From this traffic we selected probing packets consistent with the default nmap host probe (consisting of four packet types) that the carna botnet used. The visualization below shows the total number of such probes observed at the telescope in one-day bins over 2012 (blue line). While these probes may have been generated by any host on the Internet, the large increase visible between April and September 2012 matches the logs distributed by the authors of the botnet (red line), providing evidence of this scanning activity.
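As a rough illustration of this selection step, the sketch below (not our actual analysis pipeline) classifies captured packets against the four default nmap host-probe types and bins the matches by day. The input file name darknet.pcap is a hypothetical placeholder, and scapy is used only for brevity; processing telescope traffic at this scale would require a much faster reader.

```python
# Hedged sketch: count packets matching the four default nmap
# host-probe types, binned by UTC day. Illustrative only.
from collections import Counter
from datetime import datetime, timezone

from scapy.all import PcapReader, IP, ICMP, TCP

def probe_type(pkt):
    """Return which nmap host-probe type a packet matches, else None."""
    if ICMP in pkt:
        if pkt[ICMP].type == 8:           # (i) ICMP echo request
            return "icmp-echo"
        if pkt[ICMP].type == 13:          # (ii) ICMP timestamp request
            return "icmp-tstamp"
    elif TCP in pkt:
        flags = int(pkt[TCP].flags)
        if pkt[TCP].dport == 80 and flags == 0x10:   # (iii) bare ACK to port 80
            return "tcp-ack-80"
        if pkt[TCP].dport == 443 and flags == 0x02:  # (iv) bare SYN to port 443
            return "tcp-syn-443"
    return None

daily_counts = Counter()
with PcapReader("darknet.pcap") as pcap:  # hypothetical capture file
    for pkt in pcap:
        if IP in pkt and probe_type(pkt) is not None:
            day = datetime.fromtimestamp(float(pkt.time), tz=timezone.utc).date()
            daily_counts[day] += 1

for day in sorted(daily_counts):
    print(day, daily_counts[day])
```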
We also found that some of the raw logs of the carna botnet erroneously reported that a large number of IPs in our darknet were active, and specifically accepting connections on TCP port 80 (darknet IP addresses are inactive by definition, and thus accept no connections). A preliminary analysis suggests that this measurement error is likely due to the presence of HTTP proxies in some of the networks that hosted scanning bots. The default nmap host probe sends four different packets to solicit a response from the target: (i) an ICMP echo request, (ii) an ICMP timestamp request, (iii) a TCP ACK to port 80, and (iv) a TCP SYN to port 443. For darknet addresses that the carna logs report as inactive, we observed all four of these packets, but for the addresses misreported as active, packets of type (iii) never reached the telescope. We suspect that these packets were intercepted by HTTP proxies whose replies caused the bots to falsely report the target IP address as listening on TCP port 80.
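The consistency check this implies is simple, and the sketch below expresses it directly. The data structures (seen_probes, carna_active) are hypothetical stand-ins for the parsed telescope captures and carna host-probe logs, not actual data formats.

```python
# Hedged sketch of the consistency check: a darknet IP that carna
# reports as active, but for which the telescope never saw the
# type-(iii) TCP-ACK-to-80 probe, suggests an interposed HTTP proxy.
ALL_PROBES = {"icmp-echo", "icmp-tstamp", "tcp-ack-80", "tcp-syn-443"}

def suspect_proxy(ip, seen_probes, carna_active):
    """Flag a darknet IP whose missing port-80 ACK probe plus an
    'active' verdict in the carna logs points to proxy interception."""
    seen = seen_probes.get(ip, set())
    return carna_active.get(ip, False) and "tcp-ack-80" not in seen

# Example: carna says 192.0.2.7 was listening on port 80, but the
# telescope never saw the type-(iii) probe destined for it.
seen_probes = {"192.0.2.7": ALL_PROBES - {"tcp-ack-80"}}
carna_active = {"192.0.2.7": True}
print(suspect_proxy("192.0.2.7", seen_probes, carna_active))  # True
```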
Assuming these bots probed the rest of the IPv4 Internet in the same way they probed the portion of the darknet we observe, about 3% of the host probe logs and port scan logs of the carna botnet could be affected by this particular problem. The maps and animations the authors published appear unaffected by this issue because they were based on ICMP pings and actual (application-layer) responses from the target hosts.
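The extrapolation itself is just a proportion, as the snippet below shows; the counts are illustrative placeholders, not our measured figures.

```python
# Illustrative arithmetic only: the fraction of darknet targets
# misreported as active is taken as an estimate of the fraction of
# all carna host-probe logs affected, under the assumption that the
# bots probed the darknet and the rest of IPv4 space the same way.
darknet_targets_probed = 1_000_000       # hypothetical count
darknet_misreported_active = 30_000      # hypothetical count

affected_fraction = darknet_misreported_active / darknet_targets_probed
print(f"estimated affected share of logs: {affected_fraction:.1%}")  # 3.0%
```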
We have only briefly investigated the carna botnet scan, but there are clearly epistemological issues related to any potential scientific use of the data published by the botnet authors. There are even more complex ethical issues related to using this data set, as well as with its original collection. We have previously mentioned efforts to provide ethical guidance to Internet researchers; the debate continues and this data set will likely become an interesting part of it.
May 14th, 2013 at 10:01 pm
It’s heartening to learn that ethical (and legal) concerns are at the forefront of researchers’ concerns here. Transparency and accountability are key applications of the ethical principle Respect for Law and Public Interest, and researchers cannot properly assess the risks and benefits of their activities if data quality is not assessed. We will continue to see more situations such as this, which exploit gaps in oversight and guidance surrounding the ethical and legal issues around secondary use of “public information” posted on the Internet for research purposes. To this end, we will spin up dialogue and recommendations among community stakeholders at the upcoming CREDS Workshop (Cyber-security Research Ethics Dialogue and Strategy) at the IEEE Security & Privacy Symposium in a few weeks (www.caida.org/workshops/creds), which is motivated in part by the community’s earlier work in developing the Menlo Report – Ethical Principles Guiding Information and Communication Technology Research (http://www.cyber.st.dhs.gov/wp-content/uploads/2011/12/MenloPrinciplesCORE-20110915-r560.pdf). I also have a forthcoming paper that this Guerrilla Scan incident has rendered more timely (“How to Throw the Race to the Bottom: Ethical & Legal Concerns with Using Internet Data for ICT Security Research”).
May 15th, 2013 at 2:28 am
I have downloaded the 9TB data set, and my students are using it for research and analysis. We have sampled the data and it looks real, at least as far as our sampling shows.