By Ildiko Pete and Yi Ting Chua
The availability of publicly accessible datasets plays an essential role in the advancement of cybercrime and cybersecurity as a field. There has been increasing effort to understand how datasets are created, classified, shared, and used by scholars. However, there has been very few studies that address the usability of datasets.
As part of an ongoing project to improve the accessibility of cybersecurity and cybercrime datasets, we conducted a case study that examined and assessed the datasets offered by the Cambridge Cybercrime Centre (CCC). We examined two stages of the data sharing process: dataset sharing and dataset usage. Dataset sharing refers to three steps: (1) informing potential users of available datasets, (2) providing instructions on application process, and (3) granting access to users. Dataset usage refers to the process of querying, manipulation and extracting data from the dataset. We were interested in assessing users’ experiences with the data sharing process and discovering challenges and difficulties when using any of the offered datasets.
To this end, we reached out to 65 individuals who applied for access to the CCC’s datasets and are potentially actively using the datasets. The survey questionnaire was administered via Qualtrics. We received sixteen responses, nine of which were fully completed. The responses to open-ended questions were transcribed, and then we performed thematic analysis.
As a result, we discovered two main themes. The first theme is users’ level of technological competence, and the second one is users’ experiences. The findings revealed generally positive user experiences with the CCC’s data sharing process and users reported no major obstacles with regards to the dataset sharing stage. Most participants have accessed and used the CrimeBB dataset, which contains more than 48 million posts. Users also expressed that they are likely to recommend the dataset to other researchers. During the dataset usage phase, users reported some technical difficulties. Interestingly, these technical issues were specific, such as version conflicts. This highlights that users with a higher level of technical skills also experience technical difficulties, however these are of different nature in contrast to generic technical challenges. Nonetheless, the survey shown the CCC’s success in sharing their datasets to a sub-set of cybercrime and cybersecurity researchers approached in the current study.
Ildiko Pete presented the preliminary findings on 12thAugust at CSET’19. Click here to access the full paper.
the Rapid 7 Datasets provide a great insight into the murky world of vulnerable devices out there in the wild.
https://github.com/tg12/rapid7_OSINT
Re ANY crime dataset:
Because of
the diversity and vulnerability of persons gathering these datasets, and
the extensive complexity of those variables being observed, and
the inescapable propensity of that observation to change the observed variable, and the impossibly huge resulting size of the datasets, (if they are to be usefully granular)… …I object that the only thing that can be generically said about those using a dataset produced about criminal activity is that it is TOO PROBABLY the ghost image of those users/creators own minds that will ever be reliably found there.