Tomahawk
Tomahawk Test

Networking and Application Performance Testing
Although there are many possible ways to quantify the performance of a network device, experience has shown that a few measurements for characterizing NIPS performance are invaluable. These are:
  • Throughput is the total bandwidth, in megabits per second (Mbps), that can pass through a NIPS.
  • Network latency is the amount of time, in microseconds, that it takes for a packet to pass one-way through the NIPS.
  • Application latency is the time that the NIPS adds to the completion of an application level task. For example, if a file copy takes 10 seconds to complete on a network without a NIPS, and 11 seconds to complete on a network with a NIPS, then the application latency is 1 second or 10%.
  • Connections per second measures how many TCP connections per second can be cleanly set up and torn down when a NIPS is deployed in a network.
All of these measures can be highly dependent on the traffic mix that is running through the device at the time the measurement is taken. Since traffic mix can vary widely from one network to another, it is worthwhile to spend a little time discussing this aspect of testing.

Traffic Mix

When testing switches, it is usually sufficient to test at extremes and assume that the device will work well in between. For instance, you can perform a common test for switch performance on both the smallest and largest packets accepted by the switch (e.g., 64 and 1518 bytes). If performance is acceptable at these extremes, it is likely to be acceptable in-between.

This strategy works well because only one degree of freedom typically affects switch performance, namely packet size. In contrast, NIPS products are designed to inspect traffic at the application layer (layer 7). The network performance of a NIPS is a function of traffic ordering, payload contents, protocol, and many other factors. With this many variables, enumerating all the possible extremes becomes impractical because the number of combinations multiplies with each additional variable.
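
A minimal sketch of why enumerating extremes stops being practical follows. The factor names and values are hypothetical, chosen only to illustrate how the number of test cases multiplies; they are not taken from any real test plan.

```python
from itertools import product

# Hypothetical test dimensions (illustrative values only).
factors = {
    "packet_size": [64, 1518],
    "protocol": ["HTTP", "FTP", "UDP"],
    "payload": ["text", "binary", "compressed"],
    "ordering": ["in_order", "reordered"],
}

def count_cases(dimensions):
    """Count the test cases needed to cover every combination of values."""
    return sum(1 for _ in product(*dimensions.values()))

# A switch test only varies packet size; a NIPS test varies every factor.
switch_cases = count_cases({"packet_size": factors["packet_size"]})
nips_cases = count_cases(factors)
print(switch_cases, nips_cases)
```

Each additional factor multiplies the total, which is why a handful of representative mixes is used instead.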

As a consequence, TippingPoint chooses to evaluate NIPS performance with a few well-chosen mixes that are representative of different environments and extremes. If a NIPS performs well in all these environments, it is likely to perform well in most environments. An alternative is to gather a packet trace from the target environment for the NIPS and use it in performance testing, in addition to the extremes.

Latency and Throughput

To measure network performance, we use Tomahawk to replay the various traffic mixes and simultaneously measure network latency. Our specific mixes include HTTP, FTP, and UDP traffic. We encourage you to use a packet trace gathered from your own environment. A guide to running the tests using these traces can be found in Appendix B.

A provided tool (qatool) contacts the qatcld daemon on each traffic server every 2 seconds to collect statistics from the traffic server's NICs. Qatool tallies these statistics and computes the aggregate throughput of the NIPS in Mbps.
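
The following sketch shows one way an aggregate throughput figure could be derived from per-NIC byte counters sampled 2 seconds apart. It is not qatool's implementation: the function name, the counter dictionaries, and the NIC labels are assumptions made for illustration.

```python
def aggregate_throughput_mbps(prev_bytes, curr_bytes, interval_s=2.0):
    """Sum per-NIC byte-counter deltas and convert the total to Mbps.

    prev_bytes and curr_bytes map a NIC label to a cumulative byte count
    (hypothetical format; the real tool's statistics may differ).
    """
    total_bytes = sum(curr_bytes[nic] - prev_bytes[nic] for nic in curr_bytes)
    return total_bytes * 8 / interval_s / 1e6

# Two samples taken 2 seconds apart (made-up numbers).
prev = {"server1:eth0": 0, "server2:eth0": 0}
curr = {"server1:eth0": 125_000_000, "server2:eth0": 125_000_000}
mbps = aggregate_throughput_mbps(prev, curr)
print(mbps)
```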

To measure latency, a network measurement tool such as a SmartBits or Agilent can be used, if available. If such a tool is not available, the Linux utility ping may be used. In that case, it is important to run ping between two otherwise idle traffic servers, since the processing associated with Tomahawk can significantly affect the measured latency. Note also that ping measures round-trip time; assuming a symmetric path, one-way latency is half the reported value.

Regardless of what tool is used to measure latency, the switches in the test jig can introduce latency. For this reason, you should measure the baseline latency and throughput of the system with the NIPS replaced by a wire. This procedure also verifies that everything in the jig is working correctly.

For example, suppose the baseline ping time is 200 microseconds and the baseline throughput (reported by qatool) is 3 Gbps. When the NIPS is placed inline, the throughput drops to 1.36 Gbps and the ping time increases to 420 microseconds. This data indicates that the aggregate throughput of the NIPS is 1.36 Gbps and the one-way latency is (420-200)/2 = 110 microseconds.
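
The calculation in this example can be written as a small helper. The function name is hypothetical, and the sketch assumes the added delay is the same in both directions.

```python
def nips_one_way_latency_us(baseline_rtt_us, inline_rtt_us):
    """One-way latency added by the NIPS, from two ping round-trip times.

    Subtracting the baseline removes the latency of the switches in the
    jig; halving converts round-trip time to one-way time.
    """
    return (inline_rtt_us - baseline_rtt_us) / 2

latency = nips_one_way_latency_us(200, 420)
print(latency)  # 110.0
```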

Connections per Second

Tomahawk can be used to evaluate the maximum connection setup rate of a NIPS. To perform this test, we created a script that opened, then immediately closed, a connection to a web server 1000 times. We used tcpdump to capture the traffic generated by this script. The result is a PCAP that contains 1000 TCP connections, each with a full 3-way handshake (SYN, SYN-ACK, ACK) and 3-way shutdown (FIN-ACK, FIN-ACK, ACK).

We use Tomahawk to replay this PCAP 250 times on each of the traffic servers in the test jig. The result is 250,000 connections per traffic server. When played through a switch, each traffic server completes its run in about 7 seconds, which corresponds to about 35,700 connections/second. With 3 traffic servers, 750,000 unique connections are created at a rate of about 107,100 connections/second. Each traffic server uses a different IP address range to ensure that all TCP 4-tuples are unique during the course of a run.
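
The connection-rate arithmetic can be checked with a short sketch. The function name is an assumption; the inputs are the figures quoted in the text.

```python
def connection_rate(conns_per_pcap, replays, servers, run_seconds):
    """Return (total connections, connections per second) for one run."""
    total = conns_per_pcap * replays * servers
    return total, total / run_seconds

total_1, rate_1 = connection_rate(1000, 250, servers=1, run_seconds=7)
total_3, rate_3 = connection_rate(1000, 250, servers=3, run_seconds=7)
print(total_1, round(rate_1))
print(total_3, round(rate_3))
```

The unrounded results (roughly 35,714 and 107,143 connections/second) agree with the rounded figures in the text, which triple the per-server value of 35,700.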

If the NIPS is placed inline, completing the replay of all the connections may take longer. For instance, when an IntruShield 2600 is placed inline, the test with 1 traffic server takes 56.8 seconds to complete (a connection rate of 4405 connections/second). With 2 traffic servers it takes 106 seconds to complete (a connection rate of 4716 connections/second).

Recursive Web Retrieval

Our third performance test measures the impact of the NIPS on applications, in particular, web browsing. NIPS are often deployed to protect web servers. This test measures the performance impact of such a NIPS on clients browsing a protected server. Our test uses the Linux tool wget to recursively retrieve many files from an Apache web server. As a baseline, we measure how long it takes to retrieve the files through the switch alone. We repeat the process, retrieving through the switch plus the NIPS. The relative difference is the impact of the NIPS. For example, if the baseline retrieval takes 10 seconds and the NIPS retrieval takes 12 seconds, the NIPS has a 20% impact on performance ((12-10)/10 = 20%).

In the test jig, server1 runs the web server. To measure the baseline, we execute the client (wget) on server3. We then run the same client on server2 to measure the impact of the NIPS. The Linux utility time is used to measure performance. Each measurement is taken three times, and the median measurement is used.
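
A minimal sketch of the measurement procedure: take the median of three timings for each path and compute the relative difference. The function name and the sample timings are made up for illustration.

```python
from statistics import median

def relative_impact(baseline_runs, nips_runs):
    """Relative slowdown of the NIPS path, using the median of each set."""
    base = median(baseline_runs)
    with_nips = median(nips_runs)
    return (with_nips - base) / base

# Three wall-clock timings (seconds) per path; illustrative values only.
impact = relative_impact([10.2, 10.0, 9.9], [12.0, 12.3, 11.8])
print(f"{impact:.0%}")
```

Using the median rather than the mean keeps a single slow or fast outlier from skewing the result.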

NFS File Copy

NIPS are often deployed in internal LANs to segment the network into security zones. Our fourth performance test measures the impact of the NIPS on LANs by measuring its impact on network file system operations.

As with the recursive web retrieval test, we measure the wall clock time needed to complete a file copy operation with and without the NIPS. The relative difference is the impact of the NIPS.

In the test jig (figure 2), server1 runs the NFS server. The exported file system is mounted on server2 and server3. We execute the command time cat filename > /dev/null on both server2 and server3 three times, unmounting and remounting the file system between measurements to ensure that no part of the file is cached locally.

As before, the difference in the execution time with and without the NIPS inline is the performance impact of NIPS.

Security Performance Testing
Evaluating the security performance of a NIPS consists of two parts: verifying that the NIPS can accurately and reliably detect and block attacks, and verifying that the NIPS does not block legitimate traffic.

Vendors often confuse the issue by highlighting their different mechanisms for detecting attacks. These mechanisms include signatures, protocol anomaly, application anomaly, and behavioral or statistical anomaly. Regardless of how the NIPS detects the attack, the goal of a NIPS is simple: to block the attack. We use the term filter to denote the attack detection function, independent of the detection mechanism.

There are many considerations when evaluating a NIPS's filter set, including:
  • Attack Recognition is what attacks the NIPS purportedly detects and blocks.
  • Repeatability is whether the NIPS can block the attack each and every time the attack is sent through the NIPS.
  • Evasion Resistance measures the NIPS's ability to detect attacks when an attacker attempts to modify the attack to make it more difficult for an IDS/IPS to detect. Evasions include simple changes to an exploit and the use of automated IDS evasion tools such as fragroute [2] and whisker [7].

Security Performance: Attack Recognition

Experimentally determining the attack recognition capabilities of a NIPS is difficult. Since there are tens of thousands of known vulnerabilities and an infinite number of attacks, a sampling methodology must be used where a modest number of attacks are played through the NIPS. A typical test runs anywhere from 10 to 100 randomly selected attacks through the NIPS, with the assumption that, if the vendor provides good coverage for those attacks, they are likely to cover other attacks well. The problem is that a good, random sample of attacks is hard to find.

Commercial tools such as Blade IDS Informer [4] or scanning tools such as Nessus [5] are often used to supply the attacks. Unfortunately, using these tools to evaluate NIPS coverage gives misleading results. Scanning tools like Nessus are designed to detect systems that might be vulnerable to attack. The technology uses "banner scraping" to determine the version of software running on the server being scanned. For example, such a tool might request a web page and determine from the HTTP response header that the server is running IIS 5.0, which is known to be vulnerable to a certain attack. The scanner would then report that the server is vulnerable. Since no attack has been launched, it would be wrong for the NIPS to block the traffic (blocking it would mean blocking legitimate traffic).

Commercial tools that launch known attacks are also of limited value because the vendors have access to the attacks and can ensure that their filter set detects every attack. Consider the following scenario. NIPS vendors X and Y run 100 attacks from Blade in the lab. Vendor X discovers that their product misses 50 application anomaly attacks. In response, vendor X creates 50 signature-based filters for the missed attacks. Meanwhile, vendor Y's product only misses 2, and in analyzing the missed attacks, Y finds that they are invalid.

A customer now repeats the same test for X and Y in the customer lab and finds that X outperforms Y: X blocks 100% of the attacks, Y blocks 98%. The customer erroneously concludes that X is better than Y, not knowing that the 2 missed attacks are invalid.

So how can attack coverage be evaluated? NSS Labs [8] evaluated each product using a secret set of 120 attacks that were unknown to the vendors before the test. After the test was run, each vendor was given a chance to respond to the missed attacks, either proving that they were invalid or adding coverage. If an attack was found to be invalid, it was counted as a false positive against the vendors that (incorrectly) detected it. This methodology is quite good, provided enough valid, important attacks can be gathered to get a statistically significant sample. Since gathering those attacks is a tremendous amount of work, a second method is to simply analyze the coverage of the filter set by comparing it against a list of significant threats, such as those published by SANS [9].
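
The scoring rule described above can be sketched as follows, applied to the vendor X and Y scenario from the earlier paragraphs. The function name and the numbering of attacks are assumptions for illustration, not part of any published methodology.

```python
def score(blocked, attacks, invalid):
    """Return (coverage over valid attacks, false positives on invalid ones)."""
    valid = attacks - invalid
    coverage = len(blocked & valid) / len(valid)
    false_positives = len(blocked & invalid)
    return coverage, false_positives

attacks = set(range(100))      # 100 test attacks, identified by number
invalid = {98, 99}             # the two attacks later shown to be invalid
x_blocked = set(range(100))    # vendor X blocks everything, valid or not
y_blocked = set(range(98))     # vendor Y blocks only the valid attacks

x = score(x_blocked, attacks, invalid)
y = score(y_blocked, attacks, invalid)
print(x, y)
```

Once invalid attacks are excluded, both vendors cover 100% of the valid attacks, and X is charged with two false positives.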

Despite these constraints, customers will want to see a NIPS in action, blocking attacks. The software included with this document contains 16 different attacks that can be used to test whether a NIPS can block attacks. The software uses Tomahawk to replay network traffic between two network cards on the same machine (eth0 and eth1 in figure 1). The packets arrive at the NIPS interfaces in the same order as they would in the live network. If the NIPS blocks a packet, Tomahawk resends the packet up to a configurable number of times (typically 10 retries).
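
The retry behavior can be illustrated with a small simulation. This is not Tomahawk's code: `send` is a hypothetical callable that returns True when the packet is observed on the receiving interface, and the two stand-in functions below model a blocking NIPS and a lossy wire.

```python
def replay_packet(send, packet, max_retries=10):
    """Send a packet, retrying on loss.

    Returns (arrived, attempts). If the packet never arrives after the
    first attempt plus max_retries resends, it is presumed blocked.
    """
    attempts = 0
    for _ in range(1 + max_retries):
        attempts += 1
        if send(packet):
            return True, attempts
    return False, attempts

def blocking_nips(packet):
    """Stand-in for a NIPS that drops every copy of the packet."""
    return False

lossy_calls = []
def lossy_wire(packet):
    """Stand-in for a wire that loses the first two copies."""
    lossy_calls.append(packet)
    return len(lossy_calls) >= 3

blocked_result = replay_packet(blocking_nips, b"attack")
lossy_result = replay_packet(lossy_wire, b"benign")
print(blocked_result, lossy_result)
```

Retrying distinguishes a deliberate block from ordinary packet loss: a lost packet eventually gets through, while a blocked one never does.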

This methodology provides independent verification that the NIPS blocks the attack. It can also be used to replay the same attack many thousands of times.

Security Performance: Repeatability

Modern network attacks can generate significant load on a NIPS and a network. It is important that a NIPS block an attack each and every time an attack is seen, regardless of the background traffic. We use repeatability testing to check this functionality.

The idea of repeatability testing is simple: replay a set of attacks a large number of times at high speed. If a NIPS blocks the attack once, it should block it every time. If a NIPS misses an attack once, it should miss it every time.
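
A sketch of how repeatability results might be summarized, assuming the outcome of each replay has been recorded as blocked or not blocked. The function, the attack names, and the sample data are hypothetical.

```python
def repeatability_report(results):
    """Classify each attack by how consistently it was blocked.

    results maps an attack name to a list of booleans, one per replay
    (True means the NIPS blocked that replay).
    """
    report = {}
    for attack, outcomes in results.items():
        blocked = sum(outcomes)
        if blocked == len(outcomes):
            report[attack] = "always blocked"
        elif blocked == 0:
            report[attack] = "always missed"
        else:
            report[attack] = f"inconsistent ({blocked}/{len(outcomes)} blocked)"
    return report

results = {
    "attack_a": [True] * 1000,
    "attack_b": [False] * 1000,
    "attack_c": [True] * 997 + [False] * 3,
}
report = repeatability_report(results)
for name, verdict in report.items():
    print(name, verdict)
```

Only the "inconsistent" category indicates a repeatability failure; an attack that is always missed is a coverage gap instead.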

An important detail is that each attack should appear to come from a different host. Many NIPS block a flow once an attack is identified within the flow. Flows are identified by IP source and destination addresses and TCP/UDP port numbers (the so-called "host-port quadruple"). If the same quadruple is used in multiple attacks, the repeatability test is weakened considerably because once the NIPS correctly identifies the attack in the flow, it will drop every packet associated with that quadruple. This can give the appearance that the NIPS is correctly identifying each subsequent attack without actually testing the NIPS's ability to do so.

The Tomahawk toolset can replay the 16 provided attacks 1000 times in 32 seconds over a bare wire (about 500 attacks/second for each traffic server). Each attack is given a unique host/port quadruple, avoiding the problem described above.
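
One simple way to give every replayed attack its own quadruple is to assign each a distinct source address. The sketch below is an assumption about how this could be done, not a description of Tomahawk's address rewriting; the address ranges and ports are arbitrary.

```python
from ipaddress import IPv4Address

def unique_quadruples(count, src_base="10.0.0.1", dst_ip="10.1.0.1",
                      src_port=1025, dst_port=80):
    """Yield `count` distinct (src_ip, src_port, dst_ip, dst_port) tuples.

    Each attack gets its own source host, so a NIPS that blocks one flow
    cannot block later attacks merely by matching the same quadruple.
    """
    base = int(IPv4Address(src_base))
    for i in range(count):
        yield (str(IPv4Address(base + i)), src_port, dst_ip, dst_port)

# 16 attacks replayed 1000 times each.
quads = list(unique_quadruples(16 * 1000))
print(quads[0], quads[-1])
```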

A passing grade should be given to a NIPS that exhibits perfect repeatability.

Security Performance: Evasion Resistance

Our final test checks the NIPS's ability to resist evasions. Attackers modify their attacks to make IDS/IPS detection more difficult. Automated IDS evasion tools such as fragroute generate a particularly important class of evasions.

Fragroute is designed to exploit ambiguities in the IP and TCP protocols. To use it, you run the command

     fragroute -f script target

where script is a text file describing the evasions to use, and target is the IP address of the victim under attack. Fragroute modifies the kernel routing table so that all traffic destined for the target is passed through fragroute, where the evasions specified in script are applied.
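
As an illustration, the sketch below writes a small evasion script to a temporary file and builds the corresponding command line without running it. The directives (ip_frag, order, print) are believed to match the fragroute manual page but should be checked against the installed version; the target address is a documentation-range placeholder, and the helper function is hypothetical.

```python
import shlex
import tempfile

# Example evasion script: fragment IP packets into 8-byte pieces,
# send the fragments in random order, and print each packet.
EVASION_SCRIPT = """\
ip_frag 8
order random
print
"""

def build_fragroute_command(script_text, target):
    """Write the script to a temp file and return the fragroute argv list."""
    with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as f:
        f.write(script_text)
        script_path = f.name
    return ["fragroute", "-f", script_path, target]

cmd = build_fragroute_command(EVASION_SCRIPT, "192.0.2.10")
print(shlex.join(cmd))
```

The command is only printed here; running fragroute requires root privileges and changes the host's routing for the target.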

Once fragroute is installed, an attack can be launched at the victim. Our attack is based on HTTP, and fetches a URL used by the Nimda worm [10]. It is important that a NIPS correctly block the attack, independent of the script used. It is not very important that a NIPS correctly identify the attack once it has been subject to evasion, since the evasion itself does not occur naturally and is therefore reason enough to block the traffic.

A passing grade should be given to a NIPS that can detect and block the valid evasions that can be generated by fragroute. It is possible to use fragroute to generate traffic that is sufficiently modified that the intended victim cannot decode it. In this case, the attack could never succeed. For this reason, it is important to verify that any homegrown evasions allow the attack to continue to work. The accompanying software includes 12 evasions that are known to work.

Conclusions

The tests above can be combined to test different aspects of the NIPS simultaneously. For example:
  • Blocking under load.
    The throughput and repeatability tests can be run simultaneously. The NIPS should be able to maintain high repeatability even with background traffic running.
  • Network performance while blocking.
    The application performance tests (NFS and HTTP) can be run on a pair of servers while a third runs the repeatability tests to generate attacks. Given a moderate attack load (e.g., 500 attacks/second), the performance of the applications should be relatively unaffected.
  • Manageability while under attack.
    It is important that a NIPS remain manageable while the network is under attack. You can use the repeatability test on multiple servers to generate a large number of attacks, and then try to manage the NIPS using the management system.
Testing NIPS is a complex and challenging problem involving both network and security performance testing. The tools and methodologies described in this paper represent a large step forward in the creation of fair criteria for evaluating NIPS products. Despite this, more research and development needs to be done to improve the methodologies and tools. In particular, the problem of fairly evaluating attack coverage remains an open issue.

Copyright © 2004. All rights reserved.