In their talk at the June 2017 RISCS community meeting, Thomas Gross and Kovila Coopamootoo (both Newcastle) discussed the results of a small grant project on evidence-based methods in cyber security. RISCS was founded to pursue an evidence-based approach, but the extent to which the community applies such methods to a high standard has remained an open question.
The motivation for the project stemmed from a workshop Gross and Coopamootoo ran in the summer of 2016 at IFIP’s privacy and identity summer school. The surprises they encountered in performing a systematic literature review of the submitted papers led them to extend these evaluations into a broader investigation of the research space.
The pair began with a systematic literature review focusing on papers in the field of human factors in security and privacy. Based on several research questions they defined, they ran a search query on Google Scholar and ended up with 1,157 papers to review, most of which came from the SOUPS conference. They narrowed this list to 146 using inclusion and exclusion criteria that limited the search to studies with human participants that lent themselves to quantitative evaluation. Of these, only 19 were eligible for quantitative meta-analysis. The qualitative analysis revealed that, among the main themes, authentication, mostly to do with passwords, was the most frequent topic, followed by privacy.
The researchers sought to answer a number of questions about these articles. First, were the studies replicating existing methods, and could they themselves be reproduced in the future? Were the papers internally valid? How important was the effect each paper reported, and what was its magnitude?
A large percentage neither used nor adapted existing methods. Of the papers whose results could be reproduced in the future, most described their methodology and measurement apparatus quite well. When it came to assessing the validity of the papers' findings, details were missing in a large proportion, and 79% did not report the magnitude of the effect they found or explain why it was important.
In the quantitative analysis, the researchers' goal was to determine the state of play of human factors research in cyber security in terms of quantitative properties. What kinds of effect sizes are usually found? What confidence intervals do we get? The researchers did not actually carry out a meta-analysis, which would mean focusing on particular effects and examining how they combine across multiple papers, though they did use tools created for that purpose.
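To make the idea of combining effects across papers concrete: a standard fixed-effect meta-analysis pools per-study effect sizes weighted by the inverse of their variances, and the pooled estimate comes with its own confidence interval. A minimal sketch, using made-up illustrative numbers rather than data from the review:

```python
import math

# Hypothetical per-study effect sizes (Cohen's d) and their variances.
# These numbers are illustrative only, not figures from Gross and
# Coopamootoo's sample.
studies = [
    {"d": 0.42, "var": 0.031},
    {"d": 0.15, "var": 0.022},
    {"d": 0.58, "var": 0.045},
]

# Fixed-effect model: weight each study by the inverse of its variance,
# so more precise studies contribute more to the pooled estimate.
weights = [1.0 / s["var"] for s in studies]
pooled_d = sum(w * s["d"] for w, s in zip(weights, studies)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

# 95% confidence interval for the pooled effect.
ci_low = pooled_d - 1.96 * pooled_se
ci_high = pooled_d + 1.96 * pooled_se
print(f"pooled d = {pooled_d:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

Note how the pooled interval is narrower than any single study's would be: combining studies is precisely what makes weak individual effects interpretable, which is why the lack of standardised reporting matters.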
The researchers began by coding the papers – that is, identifying the evidence within them that supports quality indicators and quantitative reasoning – in order to identify the papers suitable for a detailed view. This proved frustrating, because only 8% of the papers explicitly reported effect sizes, although the researchers could infer them for about half of the rest from the reported means and standard deviations. Gross and Coopamootoo found that 33% had small effect sizes that were quite situational; it was unknown whether the effect would exist in real life.
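Inferring an effect size from reported means and standard deviations, as described above, typically means computing a standardised mean difference such as Cohen's d with a pooled standard deviation. A minimal sketch with hypothetical group statistics (the source does not give the researchers' exact procedure):

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d for two independent groups, using the pooled
    standard deviation as the standardiser."""
    pooled_sd = math.sqrt(
        ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    )
    return (mean1 - mean2) / pooled_sd

# Hypothetical values a paper might report for a treatment and a
# control group of 30 participants each.
d = cohens_d(mean1=4.1, sd1=1.2, n1=30, mean2=3.5, sd2=1.1, n2=30)
print(f"Cohen's d = {d:.2f}")
# Conventional benchmarks: ~0.2 small, ~0.5 medium, ~0.8 large.
```

This is why reporting means, standard deviations, and sample sizes matters even when a paper omits effect sizes: the effect size remains recoverable, whereas a bare p-value gives a later meta-analyst nothing to work with.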
The takeaways from this work:
- We have a replication crisis. There is very little reuse of validated methods and measures, and there were no replication studies in the entire sample, even though authors generally described their research well enough for it to be reproduced.
- Only 12% fully fulfil the American Psychological Association guidelines for standardised reporting of experiments on human beings. This could be done much better, and improving this aspect would improve the state of the field.
- Reporting on quantitative aspects such as effect sizes is weak. There would be considerable benefit in including parameter estimation in such research; it would make doing meta-analyses easier and substantially improve the state of the field.
In answer to a question, the researchers noted that they had looked for a correlation with the venues in which the papers originally appeared, comparing main conferences like USENIX and the PET Symposium with more specialised ones like SOUPS and the LASER Workshop, and found no substantive differences between the two sub-samples. However, they did begin to see differences within individual conferences.