|
CMU-CS-08-157
Computer Science Department
School of Computer Science, Carnegie Mellon University
CMU-CS-08-157
Traffic Analysis for Network Security
using Learning Theory and Streaming Algorithms
Shobha Venkataraman
September 2008
Ph.D. Thesis
CMU-CS-08-157.pdf
Keywords: Network security, machine learning, learning theory,
traffic analysis, streaming algorithms
A recurring problem in network traffic analysis is to automatically distinguish
legitimate traffic from malicious or spurious traffic. This problem arises in several
guises in network security (e.g., spam mitigation, worm detection), and is, at core,
a machine learning or data mining problem. However, traffic analysis for network
security has many fundamental challenges that are not present in typical machine
learning or data mining problems, and a blackbox application of classical algorithms
may not address these challenges adequately. For example, many standard machine
learning algorithms may not scale to the volume and diversity of network traffic,
or perform well in the presence of a malicious adversary who aims to evade detection.
It is, therefore, necessary to design algorithms that meet these challenges, and
provide formal guarantees on how well they have been met by the algorithms and the
extent to which they can be met by any algorithm.
In this thesis, we consider four problems in network security with these challenges,
and we use tools from computational learning theory and streaming algorithm design
to address them. In each of these four problems, the difference between the malicious
traffic and normal traffic is characterized by a specific structure of the traffic
distributions: temporal structure, structure in content, structure in communication patterns
of hosts and network structure given by host IP addresses. We present both efficient
algorithms as well as fundamental lower bounds for these problems:
- In the stepping-stones problem, we use the temporal structure of the traffic – in
particular, the inter-packet timing delays &ndash' to identify pairs of streams that are
likely to be stepping-stones. We provide algorithms with strong upper bounds
on the number of packets they need to observe, to detect attacks with given false
positive and false negative rates. We also present lower bounds showing how
an adversary, with sufficient chaff, could evade any detection mechanism that is
based only on the timing delays between packets.
- When generating signatures for exploits with pattern-extraction techniques, the
content structure of network traffic is used to identify packets that are likely to
be worms. A sequence of prior work has alternately developed pattern-extraction
algorithms for signature-generation, and attacks on these pattern-extraction algorithms.
We present lower bounds showing how any pattern-extraction algorithm
could be misled, in the presence of an adversary with sufficient control over the
malicious data.
- We present efficient streaming algorithms to identify superspreaders, which are
sources that contact many distinct destinations in a short time period. The
communication structure of most hosts on Internet makes finding superspreaders of
interest to security applications, as they are likely indicators of worms, scanning,
or other malicious activity. Our experimental results on real network traces show
that our algorithms are substantially more efficient than earlier approaches.
- Finally, we study the problem of tracking regions of the IP address space that
send malicious traffic. In the first part of our work, we focus on spam traffic, and
explore whether the history and structure of IP addresses could be used to distinguish
spammers from senders of legitimate mail. In the second part, we design
online algorithms that, with low space requirements, can dynamically track IP
prefixes that originate the malicious traffic, and provide a near-optimal prediction
of IP addresses that send malicious traffic and normal traffic. Our experimental
results demonstrate that our algorithm finds prefixes that are orders of magnitude
more accurate than fixed commonly-used IP prefixes.
148 pages
|