This is a (labelled!) snippet of the data used for the experiments in the journal paper "Detection and Threat Prioritization of Pivoting Attacks in Large Networks", which has been accepted for publication in IEEE Transactions on Emerging Topics in Computing (IEEE TETC). If you wish to read the paper, you can retrieve it here.
The dataset we disclose was collected in the major department of a large organization, and has been anonymized for privacy reasons. Nevertheless, the included network data can still be useful for experiments on pivoting activities, and for any other task involving network traffic analysis.
The dataset contains network traffic data in the form of network flows collected in a large organization over an entire working day. These flows represent the communications among the internal hosts of the monitored network environment -- that is, the dataset only contains internal-to-internal network traffic. For privacy reasons, the true IP addresses of the hosts have been anonymized: they are now represented as integers.
All flow samples included in the dataset are associated with a binary label that denotes whether the corresponding flow is part of a pivoting activity or not. The labelling procedure was performed and verified manually.
The dataset is provided as a single compressed .tar.gz file of ~1.5GB. When extracted, it yields a single .CSV file of ~6GB containing nearly 75M network flows. The following is the list of features captured by our network flow collector and included in the dataset.
Feature Name | Description | Type |
---|---|---|
(Number) | the number of the flow in the .CSV file | int |
src | the source IP address of the flow (anonymized) | int |
dst | the destination IP address of the flow (anonymized) | int |
p_src | the port used by the source host | int |
p_dst | the port used by the destination host | int |
b_in | the destination-to-source bytes sent | int |
b_out | the source-to-destination bytes sent | int |
d | the duration of the flow | float |
timestamp | the timestamp at which the communication began | timestamp |
pkts_in | the destination-to-source packets sent | int |
pkts_out | the source-to-destination packets sent | int |
proto | the communication protocol used | int |
is_pivoting | the label specifying if the flow is part of a pivoting activity | bool |
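
Because the extracted .CSV is ~6GB, loading it in one go may not fit in memory on every machine. The following is a minimal sketch, assuming the file has a header row whose column names match the table above and that `is_pivoting` is stored as a boolean-like value (neither is confirmed here). It streams the file in chunks with pandas and counts the flows per label. The function name `count_labels` and the rows in `SAMPLE` are purely illustrative and synthetic; they are not taken from the dataset.

```python
import io

import pandas as pd

# Synthetic rows for illustration only; the real values and the timestamp
# format in the dataset may differ (assumption).
SAMPLE = """,src,dst,p_src,p_dst,b_in,b_out,d,timestamp,pkts_in,pkts_out,proto,is_pivoting
0,10,20,50000,3389,1200,800,1.5,2017-01-01 09:00:00,10,8,6,True
1,20,30,50001,445,300,500,0.2,2017-01-01 09:00:01,4,5,6,False
2,30,40,50002,53,100,60,0.01,2017-01-01 09:00:02,1,1,17,False
"""


def count_labels(csv_source, chunksize=1_000_000):
    """Stream a flow CSV in chunks and count flows per is_pivoting value."""
    counts = {}
    # usecols keeps only the label column in memory for each chunk.
    reader = pd.read_csv(csv_source, usecols=["is_pivoting"], chunksize=chunksize)
    for chunk in reader:
        for label, n in chunk["is_pivoting"].value_counts().items():
            key = bool(label)
            counts[key] = counts.get(key, 0) + int(n)
    return counts


if __name__ == "__main__":
    # Replace io.StringIO(SAMPLE) with the path to the extracted CSV file.
    print(count_labels(io.StringIO(SAMPLE), chunksize=2))
```

The same chunked loop can be reused for any per-chunk aggregation; only the columns passed to `usecols` need to change.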
Feel free to augment the dataset with additional features that can be derived from the ones we provide (such as the ratio of incoming/outgoing bytes).
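As an illustration of such a derived feature, the sketch below computes the incoming/outgoing byte ratio from the `b_in` and `b_out` columns. The column name `byte_ratio` and the helper `add_byte_ratio` are hypothetical, and leaving the ratio undefined (NaN) when `b_out` is zero is one possible design choice rather than a recommendation from the paper.

```python
import pandas as pd


def add_byte_ratio(flows):
    """Return a copy of `flows` with a b_in / b_out ratio column added."""
    out = flows.copy()
    b_out = out["b_out"].astype("float64")
    # Flows with no outgoing bytes get NaN instead of an infinite ratio.
    out["byte_ratio"] = out["b_in"] / b_out.where(b_out > 0)
    return out


if __name__ == "__main__":
    # Synthetic values for illustration only.
    demo = pd.DataFrame({"b_in": [1200, 300, 0], "b_out": [800, 0, 0]})
    print(add_byte_ratio(demo))
```

Keeping the NaN values explicit lets downstream code decide whether to drop those flows or impute a value.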
Most of the pivoting activities captured in the dataset involved the remote use of the "terminal" host, which was controlled by the "pivoter" host through some third-party application (such as one based on the Windows Remote Desktop Protocol). We remark that these activities are the "normal" pivoting activities that occurred in the monitored organization -- they are not malicious, and hence represent benign and rare events.
If you wish to use the dataset, feel free to write an email to giovanni.apruzzese@uni.li mentioning "Pivoting Dataset" in the subject, and remembering to specify your institution in the email body!
If you use the dataset, you are kindly invited to cite our paper. Also, if you obtain this dataset through other means, we invite you to inform us so that we can update the list of institutions that have used it.