Responses to our analyses of animal drug/toxicology tests, and continued defence of animal drug testing
Following the publication of each of our three complementary papers in 2013, 2014 and 2015, we wrote to dozens of representatives of pharmaceutical companies, regulators and other stakeholders, requesting feedback, hoping thereby to build on our work and open a dialogue on this important issue, which has ethical implications for the animals used as well as for human users of pharmaceuticals. Disappointingly, responses were scant, and almost all were formulaic and polite but not engaging. The Association of the British Pharmaceutical Industry (ABPI) voiced some concerns over various attributes of the data set we used [12], but our substantial, published response constituted a full rebuttal [13]. Perhaps belatedly, the UK’s National Centre for the 3Rs (NC3Rs), despite its initially dismissive stance, announced in the summer of 2016 its own collaborative project with the ABPI to analyse industry data [14]. We naturally welcome this, provided, of course, that it is done transparently and objectively, and preferably with independent oversight. Its eagerly awaited report was expected in late 2018, but had still not been announced at the time of writing.
In the meantime, some advocates of animal drug tests have continued to argue that these tests have utility, citing some of the few previous reports suggesting that this might be the case. This claim must be addressed, because it is not supported by those papers. One of these reports [2], as we have already discussed in our work, did not estimate specificity, without which the evidential weight that the animal models contribute to the likelihood of human toxicity or non-toxicity (which is precisely what we need to know) cannot be calculated. As the authors of the cited study themselves acknowledged, “A more complete evaluation of this predictivity aspect will be an important part of a future prospective survey.” Another such cited report [15] showed predictivity for humans in some therapeutic areas to be over 90%, yet it also showed many other areas in which results from animal studies failed to correlate significantly with human observations; these were overlooked. Importantly, this analysis also utilised Likelihood Ratios (LRs), and the author argued why this is superior and necessary, much as we did in our own papers. Our rationale for using LRs (in place at the inception of our analyses, before any data were analysed, and in common with the aforementioned study) was, simply, that LRs are much more appropriate and inclusive: they incorporate sensitivity and specificity, both of which are necessary to derive the true value of the results of any test, and they are superior to Predictive Values (PVs) because they do not depend on the prevalence of adverse effects. We discussed this in detail in our papers, and others have specifically supported this approach [16].
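To make this distinction concrete, the standard textbook definitions are as follows (summarised here for reference; they are not reproduced from any of the papers discussed):

\[
\mathrm{PLR} = \frac{\mathrm{sensitivity}}{1-\mathrm{specificity}}, \qquad
\mathrm{NLR} = \frac{1-\mathrm{sensitivity}}{\mathrm{specificity}}, \qquad
\mathrm{iNLR} = \frac{1}{\mathrm{NLR}},
\]

whereas, by Bayes’ theorem, both PVs incorporate the prevalence \(p\) of the adverse outcome:

\[
\mathrm{PPV} = \frac{p\cdot\mathrm{sens}}{p\cdot\mathrm{sens} + (1-p)(1-\mathrm{spec})}, \qquad
\mathrm{NPV} = \frac{(1-p)\cdot\mathrm{spec}}{(1-p)\cdot\mathrm{spec} + p\,(1-\mathrm{sens})}.
\]

The LRs are functions of sensitivity and specificity alone, while both PVs contain \(p\); consequently, a sufficiently low prevalence can, by itself, yield a high NPV, regardless of the intrinsic qualities of the test.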
Other recent published analyses of drug toxicology data
Two studies similar to our own have been published in the past year. Given our interest in this area, and the ethical and scientific importance of the issue, we wish to add to the discussion and debate by highlighting areas with which we agree and which we welcome, as well as some issues we have with those papers and their conclusions.
Monticello et al.
A study not limited to, but relying chiefly on, PVs was published by Monticello et al. in November 2017 [17]. While we welcome and appreciate the authors’ attempt to elucidate this controversial and opaque issue, we believe their conclusion that “These results support the current regulatory paradigm of animal testing in supporting safe entry to clinical trials and provide context for emerging alternate models” must be addressed.
In our opinion, there are several important caveats. Perhaps the most salient is that, while the authors report both PVs and LRs, they focus almost exclusively on the Negative Predictive Value (NPV) to support their conclusion. This is puzzling, given the nature of these statistical metrics and their respective qualities and shortcomings, and especially so given that the authors specifically discuss some of them before ultimately overlooking them. For instance, even though they admit that LRs “are not influenced by clinical positive prevalence” (which is why, some assert, they may be superior), this does not prevent the authors from going on to concentrate on the PVs, which are influenced by toxicity prevalence.
As mentioned above, we argued in some detail in our analyses why LRs should be used in preference to PVs [9, 10, 11, 13]. There is plentiful support for this in the literature. In brief, experts assert that LRs are the “optimal choice”, are “more informative than PVs”, and are “the single most powerful indicator of diagnostic usefulness”, because they incorporate sensitivity and specificity and are independent of prevalence, which must be taken into account to estimate the value of a test (see [18,19,20,21,22,23,24]).
Monticello et al. themselves accept that their emphasis on a high NPV is “…largely based on the low clinical positive prevalence observed in our database and in the literature, which can be attributed to the fact that compounds entering clinical development have typically cleared many safety hurdles via extensive in silico, in vitro, and in vivo lead optimization screening activities.” Yet the authors seem to overlook the contribution of these screening activities when they conclude that it is not the screening, but the lack of toxicity in animal tests, that predicts a lack of toxicity clinically, to the degree that they support the current paradigm centred on animal testing. What also challenges their conclusion (even taking the authors’ stance and sidestepping the LRs to concentrate on the PVs) is that their calculated Positive PVs (PPVs) were relatively low: a reported mean of just 36%, even when the low-scoring ‘other’ organ category was excluded. The authors chose to highlight two impressive values out of the 36 reported, both for non-human primates (NHPs), in the nervous system and gastrointestinal categories. We must question how this can “support the current regulatory paradigm of animal testing”. Animal tests are not purported merely to “support safe entry to clinical trials” by predicting which drugs might not be toxic to humans; they are also purported to serve as an efficient means of detecting which drugs might be harmful.
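To illustrate why a high NPV, on its own, cannot support that conclusion, consider a deliberately extreme, hypothetical example (the numbers are ours, for illustration only, and are not drawn from Monticello et al.’s data): a completely uninformative test, with sensitivity and specificity both 0.5, applied where the prevalence of clinical toxicity is 5%:

\[
\mathrm{NPV} = \frac{0.95 \times 0.5}{0.95 \times 0.5 + 0.05 \times 0.5} = \frac{0.475}{0.500} = 0.95, \qquad
\mathrm{iNLR} = \frac{0.5}{1-0.5} = 1.0.
\]

Such a test has an NPV of 95% purely because toxicity is rare, while its iNLR of 1.0 correctly reveals that it contributes no evidential weight whatsoever.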
When one examines the LRs in Monticello et al.’s analysis instead of the PVs (see our argument above), a clearer picture emerges. The reported inverse Negative LRs (iNLRs) are very low indeed (sometimes less than 1.0, and often barely greater than unity), which suggests that the animal tests provide no evidential weight to the probability that a drug will show no toxicity in humans. This is precisely the salient finding we reported in our papers [9,10,11], and it underpins our argument that the animal tests are not fit for purpose. They report a mean iNLR of just 1.5–1.6, and a mean Positive LR (PLR) of 2.9. These are low LR values, indicating that very little evidential weight is provided by the animal tests to the probability of human toxicity or its absence. They also report similarly poor iNLRs for rodents, dogs and monkeys, as we found. In short, in many ways they actually repeat and reinforce our findings, in accordance with their statement in section 2.7 of their Methods that, “As a general rule, a test is considered ‘diagnostic’ in predicting a positive outcome when the LR+ is >10 or for predicting a negative outcome when the iLR- is > 10.” Of their 36 possible results, only two PLRs/LR+ met the authors’ acknowledged ‘diagnostic’ threshold of ≥ 10, and none of the iNLRs/iLR- did so. In fact, 30 of the iLR- values were ≤ 2, with most of these at or around unity; i.e., they provided no evidential weight at all. In other words, by the definition and criteria that they cite, the animal tests, based on their data and their analysis, cannot be considered diagnostic or predictive.
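A brief, hypothetical calculation shows just how little a mean iNLR of 1.5 shifts one’s confidence. The odds form of Bayes’ theorem (a standard result, with an assumed, purely illustrative pre-test probability) gives:

\[
\text{post-test odds} = \text{pre-test odds} \times \mathrm{LR}.
\]

If, say, a drug entering animal testing already has an 80% probability of being non-toxic in humans (pre-test odds of 4:1), a clean animal result with an iNLR of 1.5 raises this only to post-test odds of 6:1, i.e., a probability of about 86%. A genuinely ‘diagnostic’ iNLR of 10, by the authors’ own criterion, would instead give odds of 40:1, a probability of about 98%.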
We appreciate that the authors acknowledge some important points about this area of science generally, as well as some limitations of their study. As we did in our own work, they report that past efforts to analyse the value of animal tests have been “limited”, and accept that the tests rest on “historical precedence” and an assumption of value. With regard to their analysis, they accept that their data involved just 182 drugs (compared with our > 3200, for example), and that they looked only at animal test/Phase I concordance, not including later-phase clinical trials, in which more drugs fail. Their study also used a few broad categories for adverse drug reactions (ADRs), which favours their hypothesis relative to more numerous and more stringent classifications; and they combined mice and rats as ‘one effective species’, even though mice and rats often show significant differences in toxicity [11]. Finally, they reported no conflicts of interest, but thanked almost 20 biopharmaceutical companies in their acknowledgements, and have affiliations with nine companies. While we do not suggest any impropriety, some might argue that they could have an interest in justifying their industry’s and companies’ historic and current use of animals in drug testing.
Clark and Steger-Hartmann
This was an analysis of more than 3000 drugs, based on data in Elsevier’s comprehensive PharmaPendium database [25, 26]. The authors took a similar approach to our own, using LRs to determine the diagnostic power of tests in animals to inform human toxicity, and concluded that their study confirmed our own salient finding: “…the lack of these [adverse] events in nonclinical [animal] studies was found to not be a good predictor of safety in humans, thus partly confirming the findings of Bailey et al. (2014)” [citing one of our series of three papers].
Confirmation of our salient finding is of the utmost importance, for two reasons. First, though we sought no validation of our own approach and publications, and have always had the utmost confidence in them, some stakeholders with opposing opinions on the value of animal-based drug testing were intent on denigrating our work. Second, no matter how well any animal test might (hypothetically) predict human toxicity, it is the absence of toxicity in animals that is the critical factor for the progression of a new drug into clinical (human) trials. As we continue to argue, if animal tests fail in this crucial respect, as they appear to do, this not only means those tests are not fit for their overall purpose (identifying safe and effective human drugs); it must also have repercussions for the pharmaceutical industry and its regulators, and for how they approach drug testing generally.
This paper also confirmed our other main finding: that adverse reactions in animal tests are, in fact, also likely to occur in humans (though, importantly, often not in a similar manner). Crucially, however, we have interpreted the consequences of this aspect differently. Both the authors of this paper and ourselves found this aspect to be very variable, with no clear pattern in terms of types of toxic effect or types of drug; we therefore concluded that it cannot be considered particularly relevant or reliable. Clark and Steger-Hartmann, however, provided some examples of where animals did predict human toxicity, but did not show, or weigh these against, areas where this predictive aspect was lower, non-existent or negative. Indeed, some of the examples they provided were only just over the statistical threshold they themselves had set. Consequently, we believe that while both their data and our own support their conclusion that “The animal-human translation of many key observations is confirmed as being predictive”, they do not support their conclusion that their study “…confirmed the general predictivity of animal safety observations for humans”. This is compounded by very poorly predictive observations that can only be considered serious, such as death, convulsions, movement disorders and liver disorders.