David was perhaps best known in the forensic science community for his work on statistical models for the calculation of likelihood ratios, including his work with Prof Colin Aitken on multivariate kernel density models, and for his 2005 book Introduction to Statistics for Forensic Scientists. He had a number of hobbies outside of academia, including restoring houses while he lived in them (he had a very high tolerance for living in a building site, and guests were always welcome to share in the experience). He was welcoming and gracious. When I last visited him in June 2016 he was unwell, and a couple of days after I left he was diagnosed with a brain tumour. After surgery and chemotherapy his health improved and he was optimistic, but ultimately the disease defeated him. My condolences to his family and friends.

 I presented an Introduction to the likelihood ratio framework for the evaluation of forensic evidence at the Keeping up with Forensic Science Conference, organized by Cook County Public Defender Office, Forensic Science Division, and hosted by Loyola University School of Law, Chicago, IL, USA.

L&I’s main argument against the use of an LR is a “straw man argument”.

“Transparent implementation of the likelihood ratio framework is actually the solution to the problem.”

• March 2018

PhD opportunity: If you would like to work with me, please contact me asap about preparing an application for an internally funded PhD Scholarship. Anyone can apply, but the funding only covers the full cost of tuition for UK/EU students. This amount would be deducted from the tuition fees for non-UK/EU students. The next call closes 30 April 2018.

• February 2018

Postdoc opportunity: If you would like to work with me (and are not working in the UK and are not a UK citizen), please contact me asap about preparing an application for a Newton International Fellowship. The next call closes 27 March 2018.

• January 2018

Postdoc opportunity: If you would like to work with me (and have spent less than 12 months in the UK during the 3 years before the submission deadline), please contact me asap about preparing an application for a Marie Skłodowska-Curie Fellowship. The next call opens 12 April 2018 and closes 12 September 2018. Aston University has hosted 24 recipients of this fellowship since 2014.

 A press release from the National Institute of Standards and Technology (NIST) could potentially impede progress toward improving the analysis of forensic evidence and the presentation of forensic analysis results in courts in the United States and around the world.

• October 2017

Statistical expert evidence 1.75 times more likely to survive an admissibility challenge than non-statistical evidence

Very sadly, Dr Bryan Found died of a heart attack on Sunday 23 October 2016.

I considered him a friend and he was one of my favourite people on the planet. We mostly saw each other at conferences in Australia and around the world, and I once had the honour of running a workshop at his lab. I last saw Bryan at the end of July in Phoenix AZ, and we had too little time to spend together then. Bryan was Chief Scientist at Victoria State Police and was dedicated to improving forensic science. He was particularly well known for his work on empirical validation of forensic analysis of handwriting and signatures. I have fond memories of him throwing knitted ring-tailed possum finger puppets to bemused audience members at a conference just under a year ago in Brazil. He is going to be greatly missed by myself and by many others. My condolences to his family and friends.

 INTERPOL survey of the use of speaker identification by law enforcement agencies published

• 20152016

Image below retrieved 18 February 2015.

Throughout 2015 and 2016, Morrison (2011) “Measuring the validity and reliability of forensic likelihood-ratio systems” was ranked as the most cited paper published in Science & Justice within the previous 5 years. As of 24 January 2017, Scopus ranks it as the 6th most cited paper ever published in Science & Justice.

A New Paradigm for the Evaluation of Forensic Evidence
and its implementation in forensic voice comparison
Workshop on Quantifying the Weight of Forensic Evidence
National Institute of Standards and Technology (NIST)
May 2016

Aitken (2018) states that “Score-based approaches have been used for ... speech recognition” and that scores are “based on the similarity of pairwise scores rather than the similarity and rarity of features.” In fact, in the field of forensic speaker recognition the scores used are not similarity-only scores, but scores that take account of both similarity and typicality.

Marquis et al (2017) [What is the error margin of your signature analysis? Forensic Science International, 281, e1–e8] ostensibly presents a model of how to respond to a request from a court to state an “error margin” for a conclusion from a forensic analysis. We interpret the court’s request as an explicit request for meaningful empirical validation to be conducted and the results reported. Marquis et al (2017), however, recommends a method based entirely on subjective judgement and does not subject it to any empirical validation. We believe that much resistance to the adoption of the likelihood ratio framework is not to the idea of assessing the relative probabilities (or likelihoods) of the evidence under prosecution and defence hypotheses per se, but to what is perceived to be unwarranted subjective assignment of those probabilities. In order to maximize transparency, replicability, and resistance to cognitive bias, we recommend the use of methods based on relevant data, quantitative measurements, and statistical models. If the method is based on subjective judgement, the output should be empirically calibrated. Irrespective of the basis of the method, its implementation should be empirically validated under conditions reflecting those of the case at hand.

In 2015 the Criminal Practice Directions (CPD) on admissibility of expert evidence in England & Wales were revised. They emphasised the principle that “the court must be satisfied that there is a sufficiently reliable scientific basis for the evidence to be admitted”. The present paper aims to assist courts in understanding from a scientific perspective what would be necessary to demonstrate the validity of testimony based on forensic voice comparison. We describe different technical approaches to forensic voice comparison that have been used in the United Kingdom, and critically review the case law on their admissibility. We conclude that courts have been inconsistent in their reasoning. In line with the CPD, we recommend that courts enquire as to whether forensic practitioners have made use of data and analytical methods that are appropriate and adequate for the case under consideration, and that courts require forensic practitioners to empirically demonstrate the level of performance of their forensic voice comparison system under conditions reflecting those of the case under consideration.

A revised, updated, and expanded edition of Morrison (2010) “Forensic voice comparison”. It introduces forensic speech science in a relatively non-technical way, assuming a reader who has no prior knowledge of the subject. As with the previous edition, the revised edition provides an introduction to forensic voice comparison and to speaker recognition by laypeople (e.g., earwitnesses). Compared to the previous edition, the revised edition has a heavier focus on automatic approaches to forensic voice comparison. The revised edition also includes coverage of other areas of forensic speech science, particularly disputed utterance analysis.

In a 2017 New South Wales case, a forensic practitioner conducted a forensic voice comparison using a Gaussian mixture model - universal background model (GMM-UBM). The practitioner did not report the results of empirical tests of the performance of this system under conditions reflecting those of the case under investigation. The practitioner trained the model for the numerator of the likelihood ratio using the known-speaker recording, but trained the model for the denominator of the likelihood ratio (the UBM) using high-quality audio recordings, not recordings which reflected the conditions of the known-speaker recording. There was therefore a difference in the mismatch between the numerator model and the questioned-speaker recording versus the mismatch between the denominator model and the questioned-speaker recording. In addition, the practitioner did not calibrate the output of the system. The present paper empirically tests the performance of a replication of the practitioner’s system. It also tests a system in which the UBM was trained on known-speaker-condition data and which was empirically calibrated. The performance of the former system was very poor, and the performance of the latter was substantially better.
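
As a rough illustration of how a GMM-UBM system arrives at a score, the sketch below uses one-dimensional toy mixtures. The function names, model parameters, and frame values are all hypothetical and are not taken from the case or the paper; real systems work with multivariate features, and the resulting score would still need to be calibrated before it could be interpreted as a likelihood ratio.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def gmm_log_likelihood(x, components):
    """Log density of x under a mixture given as (weight, mean, sd) tuples."""
    return math.log(sum(w * gaussian_pdf(x, m, s) for w, m, s in components))


def gmm_ubm_score(frames, speaker_gmm, ubm):
    """Mean per-frame log likelihood ratio: speaker model versus UBM."""
    total = sum(
        gmm_log_likelihood(x, speaker_gmm) - gmm_log_likelihood(x, ubm)
        for x in frames
    )
    return total / len(frames)


# Hypothetical toy models (illustrative values only).
speaker_gmm = [(0.5, 1.0, 0.5), (0.5, 2.0, 0.5)]  # numerator: known-speaker model
ubm = [(0.5, 0.0, 1.5), (0.5, 3.0, 1.5)]          # denominator: background model

# Hypothetical feature values from the questioned-speaker recording.
questioned_frames = [1.1, 1.8, 1.4, 2.2]
score = gmm_ubm_score(questioned_frames, speaker_gmm, ubm)
print(f"uncalibrated score (mean log LR per frame): {score:.3f}")
```

The score depends on both models, so a denominator model trained under conditions that differ from those of the numerator model changes the score for reasons unrelated to speaker identity.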

When strength of forensic evidence is quantified using sample data and statistical models, a concern may be raised as to whether the output of a model overestimates the strength of evidence. This is particularly the case when the amount of sample data is small, and hence sampling variability is high. This concern is related to concern about precision. This paper describes, explores, and tests three procedures which shrink the value of the likelihood ratio or Bayes factor toward the neutral value of one. The procedures are: (1) a Bayesian procedure with uninformative priors, (2) use of empirical lower and upper bounds (ELUB), and (3) a novel form of regularized logistic regression. As a benchmark, they are compared with linear discriminant analysis, and in some instances with non-regularized logistic regression. The behaviours of the procedures are explored using Monte Carlo simulated data, and tested on real data from comparisons of voice recordings, face images, and glass fragments.
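
The shrinkage idea can be seen in a toy discrete setting. The sketch below is not any of the three procedures tested in the paper; it assumes a simple count-based feature, invented counts, and uniform Beta(1, 1) priors, and compares a plug-in ratio of relative frequencies with a ratio of posterior predictive probabilities.

```python
from fractions import Fraction


def plug_in_lr(k_same, n_same, k_diff, n_diff):
    """Ratio of observed relative frequencies (no shrinkage).

    Undefined when k_diff is zero.
    """
    return Fraction(k_same, n_same) / Fraction(k_diff, n_diff)


def uniform_prior_bf(k_same, n_same, k_diff, n_diff):
    """Ratio of posterior predictive probabilities under Beta(1, 1) priors.

    With a uniform prior, the predictive probability of the feature after
    observing k occurrences in n samples is (k + 1) / (n + 2).
    """
    p_same = Fraction(k_same + 1, n_same + 2)
    p_diff = Fraction(k_diff + 1, n_diff + 2)
    return p_same / p_diff


# Hypothetical counts: the feature occurs in 9 of 10 same-source samples
# and in 1 of 10 different-source samples.
small_plug_in = plug_in_lr(9, 10, 1, 10)
small_bf = uniform_prior_bf(9, 10, 1, 10)

# The same proportions with ten times as much data.
large_plug_in = plug_in_lr(90, 100, 10, 100)
large_bf = uniform_prior_bf(90, 100, 10, 100)

print(small_plug_in, small_bf)                # 9 versus 5
print(large_plug_in, float(large_bf))         # 9 versus about 8.27
```

With the small sample the Bayes factor is pulled noticeably toward 1, and the pull weakens as the amount of data grows.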

Score based procedures for the calculation of forensic likelihood ratios are popular across different branches of forensic science. They have two stages, first a function or model which takes measured features from known-source and questioned-source pairs as input and calculates scores as output, then a subsequent model which converts scores to likelihood ratios. We demonstrate that scores which are purely measures of similarity are not appropriate for calculating forensically interpretable likelihood ratios. In addition to taking account of similarity between the questioned-origin specimen and the known-origin sample, scores must also take account of the typicality of the questioned-origin specimen with respect to a sample of the relevant population specified by the defence hypothesis. We use Monte Carlo simulations to compare the output of three score based procedures with reference likelihood ratio values calculated directly from the fully specified Monte Carlo distributions. The three types of scores compared are: 1. non-anchored similarity-only scores; 2. non-anchored similarity and typicality scores; and 3. known-source anchored same-origin scores and questioned-source anchored different-origin scores. We also make a comparison with the performance of a procedure using a dichotomous “match”/“non-match” similarity score, and compare the performance of 1 and 2 on real data.
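
The difference between similarity-only scores and scores that also reflect typicality can be sketched with a deliberately simple model. The code below assumes a univariate feature, normally distributed source means, normally distributed within-source variation, and one measurement per item; the standard deviations and measurement values are hypothetical and are not the simulation settings used in the paper.

```python
import math

SIGMA_B = 1.0  # between-source standard deviation (hypothetical)
SIGMA_W = 0.1  # within-source standard deviation (hypothetical)


def similarity_only_score(x, y):
    """Score that looks only at how close the two measurements are."""
    return -abs(x - y)


def log_normal_pdf(x, var):
    """Log density of a zero-mean normal distribution with variance var."""
    return -0.5 * (math.log(2.0 * math.pi * var) + x * x / var)


def log_lr(x, y):
    """Log likelihood ratio computed directly from the assumed model.

    Same source: x and y share a source mean, so they are jointly normal
    with covariance SIGMA_B**2. Different source: x and y are independent.
    """
    v = SIGMA_B ** 2 + SIGMA_W ** 2
    c = SIGMA_B ** 2
    det = v * v - c * c
    quad = (v * x * x - 2.0 * c * x * y + v * y * y) / det
    log_same = -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad
    log_diff = log_normal_pdf(x, v) + log_normal_pdf(y, v)
    return log_same - log_diff


typical_pair = (0.0, 0.25)   # near the centre of the population
atypical_pair = (3.0, 3.25)  # equally close, but far out in the tail

for x, y in (typical_pair, atypical_pair):
    print(similarity_only_score(x, y), math.exp(log_lr(x, y)))
```

Both pairs receive the same similarity-only score, yet under the assumed model the likelihood ratio for the atypical pair is far larger, which is the information a similarity-only score discards.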

A press release from the National Institute of Standards and Technology (NIST) could potentially impede progress toward improving the analysis of forensic evidence and the presentation of forensic analysis results in courts in the United States and around the world. “NIST experts urge caution in use of courtroom evidence presentation method” was released on October 12, 2017, and was picked up by the phys.org news service. It argues that, except in exceptional cases, the results of forensic analyses should not be reported as “likelihood ratios”. The press release, and the journal article by NIST researchers Steven P. Lund & Hari Iyer on which it is based, identifies some legitimate points of concern, but makes a strawman argument and reaches an unjustified conclusion that throws the baby out with the bathwater.

In a 2012 case in New South Wales, Australia, the identity of a speaker on several audio recordings was in question. Forensic voice comparison testimony was presented based on an auditory-acoustic-phonetic-spectrographic analysis. No empirical demonstration of the validity and reliability of the analytical methodology was presented. Unlike the admissibility standards in some other jurisdictions (e.g., US Federal Rule of Evidence 702 and the Daubert criteria, or England & Wales Criminal Practice Directions 19A), Australia’s Unified Evidence Acts do not require demonstration of the validity and reliability of analytical methods and their implementation before testimony based upon them is presented in court. The present paper reports on empirical tests of the performance of an acoustic-phonetic-statistical forensic voice comparison system which exploited the same features as were the focus of the auditory-acoustic-phonetic-spectrographic analysis in the case, i.e., second-formant (F2) trajectories in /o/ tokens and mean fundamental frequency (f0). The tests were conducted under conditions similar to those in the case. The performance of the acoustic-phonetic-statistical system was very poor compared to that of an automatic system.

In the debate as to whether forensic practitioners should assess and report the precision of the strength of evidence statements that they report to the courts, I remain unconvinced by proponents of the position that only a subjectivist concept of probability is legitimate. I consider this position counterproductive for the goal of having forensic practitioners implement, and courts not only accept but demand, logically correct and scientifically valid evaluation of forensic evidence. In considering what would be the best approach for evaluating strength of evidence, I suggest that the desiderata be (1) to maximise empirically demonstrable performance; (2) to maximise objectivity in the sense of maximising transparency and replicability, and minimising the potential for cognitive bias; and (3) to constrain and make overt the forensic practitioner’s subjective-judgement based decisions so that the appropriateness of those decisions can be debated before the judge in an admissibility hearing and/or before the trier of fact at trial. All approaches require the forensic practitioner to use subjective judgement, but constraining subjective judgement to decisions relating to selection of hypotheses, properties to measure, training and test data to use, and statistical modelling procedures to use (decisions which are remote from the output stage of the analysis) will substantially reduce the potential for cognitive bias. Adopting procedures based on relevant data, quantitative measurements, and statistical models, and directly reporting the output of the statistical models will also maximise transparency and replicability. A procedure which calculates a Bayes factor on the basis of relevant sample data and reference priors is no less objective than a frequentist calculation of a likelihood ratio on the same data. In general, a Bayes factor calculated using uninformative or reference priors will be closer to a value of 1 than a frequentist best estimate likelihood ratio. The bound closest to 1 based on a frequentist best estimate likelihood ratio and an assessment of its precision will also, by definition, be closer to a value of 1 than the frequentist best estimate likelihood ratio. From a practical perspective, both procedures shrink the strength of evidence value towards the neutral value of 1. A single-value Bayes factor or likelihood ratio may be easier for the courts to handle than a distribution. I therefore propose, as a potential practical solution, the use of procedures which account for imprecision by shrinking the calculated Bayes factor or likelihood ratio towards 1, the choice of the particular procedure being based on empirical demonstration of performance.

This article provides a primer on forensic voice comparison (aka forensic speaker recognition), a branch of forensic science in which the forensic practitioner analyzes a voice recording in order to provide an expert opinion that will help the trier-of-fact determine the identity of the speaker. The article begins with an explanation of ways in which human speech varies within and between speakers. It then discusses different technical approaches that forensic practitioners have used to compare voice recordings, and frameworks of reasoning that practitioners have used for evaluating the evidence and reporting its strength. It then discusses procedures for empirical validation of the performance of forensic voice comparison systems. It also discusses the potential influence of contextual bias and ways to reduce this. Building on this scientific foundation, the article then offers analysis, commentary, and recommendations on how courts evaluate the admissibility of forensic voice comparison testimony under the Daubert and Frye standards. It reviews past rulings such as U.S. v. Angleton, 269 F.Supp 2nd 892 (S.D. Tex. 2003) that found expert testimony based on the spectrographic approach inadmissible under Daubert. The article also offers a detailed analysis of the evidence presented in the recent Daubert hearing in U.S. v. Ahmed, et al. 2015 EDNY 12-CR-661, which included testimony based on the newer automatic approach. The scientific testimony proffered in Ahmed is used to illustrate the issues courts are likely to face when considering the admissibility of forensic voice comparison testimony in the future. The article concludes with a discussion of how proponents of forensic voice comparison testimony might meet a reasonably rigorous application of the Daubert standard and thereby ensure that such testimony is sufficiently trustworthy to be used in court.

This letter comments on the report “Forensic science in criminal courts: Ensuring scientific validity of feature-comparison methods” recently released by the President’s Council of Advisors on Science and Technology (PCAST). The report advocates a procedure for evaluation of forensic evidence that is a two-stage procedure in which the first stage is “match”/“non-match” and the second stage is empirical assessment of sensitivity (correct acceptance) and false alarm (false acceptance) rates. Almost always, quantitative data from feature-comparison methods are continuously-valued and have within-source variability. We explain why a two-stage procedure is not appropriate for this type of data, and recommend use of statistical procedures which are appropriate.
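
To make the limitation concrete, the following is a sketch of the only likelihood ratio values a two-stage “match”/“non-match” procedure can support, written in terms of the sensitivity and false alarm rate mentioned above. The symbols $H_{\text{same}}$ and $H_{\text{diff}}$ for the same-source and different-source hypotheses are notation introduced here, not taken from the report or the letter.

```latex
\[
\mathrm{LR}_{\text{match}}
  = \frac{P(\text{match} \mid H_{\text{same}})}{P(\text{match} \mid H_{\text{diff}})}
  = \frac{\text{sensitivity}}{\text{false alarm rate}},
\qquad
\mathrm{LR}_{\text{non-match}}
  = \frac{1 - \text{sensitivity}}{1 - \text{false alarm rate}}
\]
```

Every comparison declared a “match” therefore receives the same value, however close the two items are and however typical or atypical their features are in the relevant population.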

Currently, the standard approach to forensic voice comparison in China is the aural-spectrographic approach. Internationally, this approach has been the subject of much criticism. The present paper describes what we believe is the first forensic voice comparison analysis presented to a court in China in which a numeric likelihood ratio was calculated using relevant data, quantitative measurements, and statistical models, and in which the validity and reliability of the analytical procedures were empirically tested under conditions reflecting those of the case under investigation. The hypotheses addressed were whether the female speaker on a recording of a mobile telephone conversation was a particular individual, or whether it was that individual’s younger sister. Known speaker recordings of both these individuals were recorded using the same mobile telephone as had been used to record the questioned-speaker recording, and customised software was written to perform the acoustic and statistical analyses.

This article should be open access. If for any reason you can’t access it at the SPECOM site

There is increasing pressure on forensic laboratories to validate the performance of forensic analysis systems before they are used to assess strength of evidence for presentation in court. Different forensic voice comparison systems may use different approaches, and even among systems using the same general approach there can be substantial differences in operational details. From case to case, the relevant population, speaking styles, and recording conditions can be highly variable, but it is common to have relatively poor recording conditions and mismatches in speaking style and recording conditions between the known- and questioned-speaker recordings. In order to validate a system intended for use in casework, a forensic laboratory needs to evaluate the degree of validity and reliability of the system under forensically realistic conditions. The present paper is an introduction to a Virtual Special Issue consisting of papers reporting on the results of testing forensic voice comparison systems under conditions reflecting those of an actual forensic voice comparison case. A set of training and test data representative of the relevant population and reflecting the conditions of this particular case has been released, and operational and research laboratories are invited to use these data to train and test their systems. The present paper includes the rules for the evaluation and a description of the evaluation metrics and graphics to be used. The name of the evaluation is: forensic_eval_01
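
One metric widely used for validating likelihood-ratio systems is the log-likelihood-ratio cost (Cllr). The sketch below shows how it is computed; that it corresponds exactly to the metric definitions in the evaluation rules is an assumption, and the likelihood ratio values are invented for illustration.

```python
import math


def cllr(same_source_lrs, diff_source_lrs):
    """Log-likelihood-ratio cost.

    Same-source comparisons are penalised for small likelihood ratios and
    different-source comparisons for large ones. A system that always
    outputs a likelihood ratio of 1 scores exactly 1; lower is better.
    """
    same_term = sum(math.log2(1.0 + 1.0 / lr) for lr in same_source_lrs)
    diff_term = sum(math.log2(1.0 + lr) for lr in diff_source_lrs)
    return 0.5 * (same_term / len(same_source_lrs)
                  + diff_term / len(diff_source_lrs))


# Hypothetical test results (illustrative values only).
uninformative = cllr([1.0, 1.0], [1.0, 1.0])
well_behaved = cllr([100.0, 50.0], [0.01, 0.02])
misleading = cllr([0.01], [100.0])

print(uninformative, well_behaved, misleading)
```

Values above 1 indicate a system whose output is, on average, worse than reporting no evidence at all.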

The present letter to the editor is one in a series of publications discussing the formulation of hypotheses (propositions) for the evaluation of strength of forensic evidence. In particular, the discussion focusses on the issue of what information may be used to define the relevant population specified as part of the different-speaker hypothesis in forensic voice comparison. The previous publications in the series are: Hicks et al. (2015); Morrison et al. (2016); Hicks et al. (2017). The latter letter to the editor mostly resolves the apparent disagreement between the two groups of authors. We briefly discuss one outstanding point of apparent disagreement, and attempt to correct a misinterpretation of our earlier remarks. We believe that at this point there is no actual disagreement, and that both groups of authors are calling for greater collaboration in order to reduce the likelihood of future misunderstandings.

Hicks et al. (2015) propose that forensic speech scientists not use the accent of the speaker of questioned identity to refine the relevant population. This proposal is based on a lack of understanding of the realities of forensic voice comparison. If it were implemented, it would make data-based forensic voice comparison analysis within the likelihood ratio framework virtually impossible. We argue that it would also lead forensic speech scientists to present invalid unreliable strength of evidence statements, and not allow them to conduct the tests that would make them aware of this problem.

The present paper introduces the Science & Justice virtual special issue on measuring and reporting the precision of forensic likelihood ratios: whether this should be done, and if so how. The focus is on precision (aka reliability) as opposed to accuracy (aka validity). The topic is controversial and different authors are expected to express a range of nuanced opinions. The present paper frames the debate, explaining the underlying problem and referencing classes of solutions proposed in the existing literature. The special issue will consist of a number of position papers, responses to those position papers, and replies to the responses.

We argue that forensic practitioners should empirically assess and report the precision of their likelihood ratios. Once the practitioner has specified the prosecution and defence hypotheses they have adopted, including the relevant population they have adopted, and has specified the type of measurements they will make, their task is to empirically calculate an estimate of a likelihood ratio which has a true but unknown value. We explicitly reject the competing philosophical position that the forensic practitioner’s likelihood ratio should be based on subjective personal probabilities. Estimates of true but unknown values are based on samples and are subject to sampling uncertainty, and it is standard practice to report the degree of precision of such estimates. We discuss the dangers of not reporting precision to the courts, and the problems with an alternative approach which instead reports a verbal expression corresponding to a pre-specified range of likelihood ratio values. Reporting precision as an interval requires an arbitrary choice of coverage, e.g., a 95% or a 99% credible interval. We outline a normative framework which a trier of fact could use to make non-arbitrary use of the results of forensic practitioners’ empirical calculations of likelihood ratios and their precision.

A survey was conducted of the use of speaker identification by law enforcement agencies around the world. A questionnaire was circulated to law enforcement agencies in the 190 member countries of INTERPOL. 91 responses were received from 69 countries. 44 respondents reported that they had speaker identification capabilities in house or via external laboratories. Half of these came from Europe. 28 respondents reported that they had databases of audio recordings of speakers. The clearest pattern in the responses was that of diversity. A variety of different approaches to speaker identification were used: The human-supervised-automatic approach was the most popular in North America, the auditory-acoustic-phonetic approach was the most popular in Europe, and the spectrographic/auditory-spectrographic approach was the most popular in Africa, Asia, the Middle East, and South and Central America. Globally, and in Europe, the most popular framework for reporting conclusions was identification/exclusion/inconclusive. In Europe, the second most popular framework was the use of verbal likelihood ratio scales.

The new paradigm for the evaluation of the strength of forensic evidence includes: (1) the use of the likelihood-ratio framework; (2) the use of relevant data, quantitative measurements, and statistical models; (3) empirical testing of validity and reliability under conditions reflecting those of the case under investigation; and (4) transparency as to decisions made and procedures employed. The present paper illustrates the use of the new paradigm to evaluate strength of evidence under conditions reflecting those of a real forensic-voice-comparison case. The offender recording was from a landline telephone system, had background office noise, and was saved in a compressed format. The suspect recording included substantial reverberation and ventilation system noise, and was saved in a different compressed format. The present paper includes descriptions of the selection of the relevant hypotheses, sampling of data from the relevant population, simulation of suspect and offender recording conditions, and acoustic measurement and statistical modelling procedures. The present paper also explores the use of different techniques to compensate for the mismatch in recording conditions. It also examines how system performance would have differed had the suspect recording been of better quality.

In a forensic-voice-comparison case, one speaker (A) was standing a short distance away from another speaker (B) who was talking on a mobile telephone. Later, speaker A moved closer to the telephone. Shortly thereafter, there was a section of speech where the identity of the speaker was in question, with the prosecution claiming that it was speaker A and the defense claiming it was speaker B. All material for training a forensic-voice-comparison system could be extracted from this single recording, but there was a near-far mismatch: Training data for speaker A were mostly far, training data for speaker B were near, and the disputed speech was near. Based on the conditions of this case we demonstrate a methodology for handling forensic casework using relevant data, quantitative measurements, and statistical models to calculate likelihood ratios. A procedure is described for addressing the degree of validity and reliability of a forensic-voice-comparison system under such conditions. Using a set of development speakers we investigate the effect of mismatched distances to the microphone and demonstrate and assess three methods for compensation.

A group of approaches for calculating forensic likelihood ratios first calculates scores which quantify the degree of difference or the degree of similarity between pairs of samples, then converts those scores to likelihood ratios. In order for a score-based approach to produce a forensically interpretable likelihood ratio, however, in addition to accounting for the similarity of the questioned sample with respect to the known sample, it must also account for the typicality of the questioned sample with respect to the relevant population. The present paper explores a number of score-based approaches using different types of scores and different procedures for converting scores to likelihood ratios. Monte Carlo simulations are used to compare the output of these approaches to true likelihood-ratio values calculated on the basis of the distribution specified for a simulated population. The inadequacy of approaches based on similarity-only or difference-only scores is illustrated, and the relative performance of different approaches which take account of both similarity and typicality is assessed.

We present a disputed-utterance analysis using relevant data, quantitative measurements, and statistical models to calculate likelihood ratios. The acoustic data were taken from an actual forensic case in which the amount of data available to train the statistical models was small and the data point from the disputed word was far out on the tail of one of the modelled distributions. A procedure based on single multivariate Gaussian models for each hypothesis led to an unrealistically high likelihood ratio value with extremely poor reliability, but a procedure based on Hotelling’s T² statistic and a procedure based on calculating a posterior predictive density produced more acceptable results. The Hotelling’s T² procedure attempts to take account of the sampling uncertainty of the mean vectors and covariance matrices due to the small number of tokens used to train the models, and the posterior-predictive-density analysis integrates out the values of the mean vectors and covariance matrices as nuisance parameters. Data scarcity is common in forensic speech science and we argue that it is important not to accept extremely large calculated likelihood ratios at face value, but to consider whether such values can be supported given the size of the available data and modelling constraints.
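
A univariate simplification shows why a posterior predictive density tempers extreme values when training data are scarce. The sketch below assumes a single acoustic dimension, a non-informative reference prior on the mean and variance (under which the posterior predictive distribution is a Student's t), and invented token values; the paper itself worked with multivariate data, and none of the numbers here come from the case.

```python
import math
import statistics


def log_gaussian_plug_in(x, data):
    """Log density at x under a Gaussian using the sample mean and sd."""
    m = statistics.mean(data)
    s = statistics.stdev(data)
    z = (x - m) / s
    return -0.5 * z * z - math.log(s) - 0.5 * math.log(2.0 * math.pi)


def log_posterior_predictive(x, data):
    """Log posterior predictive density at x under a reference prior.

    Integrating out the unknown mean and variance gives a Student's t
    distribution with n - 1 degrees of freedom, centred on the sample
    mean, with scale s * sqrt(1 + 1/n).
    """
    n = len(data)
    nu = n - 1
    m = statistics.mean(data)
    scale = statistics.stdev(data) * math.sqrt(1.0 + 1.0 / n)
    z = (x - m) / scale
    return (math.lgamma((nu + 1) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * math.log(nu * math.pi) - math.log(scale)
            - 0.5 * (nu + 1) * math.log1p(z * z / nu))


# Hypothetical training tokens for one hypothesis (only five of them).
tokens = [1.0, 1.2, 0.9, 1.1, 0.8]
tail_point = 2.0  # a disputed token far out on the tail

plug_in = log_gaussian_plug_in(tail_point, tokens)
predictive = log_posterior_predictive(tail_point, tokens)
print(f"plug-in log density: {plug_in:.2f}")
print(f"predictive log density: {predictive:.2f}")
```

Because the plug-in Gaussian assigns a vanishingly small density to the tail point, a likelihood ratio with that density in its denominator becomes enormous; the heavier-tailed predictive density gives a far less extreme value.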

Lennard (2013) [Fingerprint identification: how far have we come? Aus J Forensic Sci. doi:10.1080/00450618.2012.752037] proposes that the numeric output of statistical models should not be presented in court (except ‘if necessary’/‘if required’). Instead, he argues in favour of an ‘expert opinion’ which may be informed by a statistical model but which is not itself the output of a statistical model. We argue that his proposed procedure lacks the transparency, the ease of testing of validity and reliability, and the relative robustness to cognitive bias that are the strengths of a likelihood-ratio approach based on relevant data, quantitative measurements, and statistical models, and that the latter is therefore preferable.

In this paper it is argued that one should not attempt to directly assess whether a forensic analysis technique is scientifically acceptable. Rather one should first specify what one considers to be appropriate principles governing acceptable practice, then consider any particular approach in light of those principles. This paper focuses on one principle: the validity and reliability of an approach should be empirically tested under conditions reflecting those of the case under investigation using test data drawn from the relevant population. Versions of this principle have been key elements in several reports on forensic science, including forensic voice comparison, published over the last four-and-a-half decades. The aural-spectrographic approach to forensic voice comparison (also known as “voiceprint” or “voicegram” examination) and the currently widely practiced auditory-acoustic-phonetic approach are considered in light of this principle (these two approaches do not appear to be mutually exclusive). Approaches based on data, quantitative measurements, and statistical models are also considered in light of this principle.

In forensic-voice-comparison casework a common scenario is that the suspect’s voice is recorded directly using a microphone in an interview room but the offender’s voice is recorded via a telephone system. Acoustic-phonetic approaches to forensic voice comparison often include analysis of vowel formants, and the second formant is often assumed to be relatively robust to telephone-transmission effects. This study assesses the effects of telephone transmission on the performance of formant-trajectory-based forensic-voice-comparison systems. The effectiveness of both human-supervised and fully-automatic formant tracking is investigated. Human-supervised formant tracking is generally considered to be more accurate and reliable but requires a substantial investment of human labor. Measurements were made of the formant trajectories of /iau/ tokens in a database of recordings of 60 female speakers of Chinese using one human-supervised and five fully-automatic formant trackers. Measurements were made under high-quality, landline-to-landline, mobile-to-mobile, and mobile-to-landline conditions. High-quality recordings were treated as suspect samples and telephone-transmitted recordings as offender samples. Discrete cosine transforms (DCT) were fitted to the formant trajectories and likelihood ratios were calculated on the basis of the DCT coefficients. For each telephone-transmission condition the formant-trajectory system was fused with a baseline mel-frequency cepstral-coefficient (MFCC) system, and performance was assessed relative to the baseline system. 
The systems based on human-supervised formant measurement always outperformed the systems based on fully-automatic formant measurement; however, in conditions involving mobile telephones neither the former nor the latter type of system provided meaningful improvement over the baseline system, and even in the other conditions the high cost in skilled labor for human-supervised formant-trajectory measurement is probably not warranted given the relatively good performance that can be obtained using other less-costly procedures.
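The step of fitting discrete cosine transforms to formant trajectories can be sketched as follows. This is only an illustrative sketch, not the code used in the study: the function names, the use of a least-squares fit, the time normalisation to the interval 0 to 1, and the choice of four coefficients are all assumptions made here for the example.

```python
import numpy as np

def dct_basis(n_points, n_coef):
    """Cosine basis sampled at the midpoints of n_points equal time steps."""
    t = (np.arange(n_points) + 0.5) / n_points   # normalised time in (0, 1)
    k = np.arange(n_coef)                        # k = 0 is the constant term
    return np.cos(np.pi * np.outer(t, k))        # shape (n_points, n_coef)

def fit_dct(trajectory, n_coef=4):
    """Least-squares DCT coefficients for one formant trajectory (Hz)."""
    y = np.asarray(trajectory, dtype=float)
    basis = dct_basis(len(y), n_coef)
    coefs, *_ = np.linalg.lstsq(basis, y, rcond=None)
    return coefs

# Hypothetical F2 trajectory: a 1500 Hz mean plus one half-cosine of 200 Hz.
t = (np.arange(20) + 0.5) / 20
f2 = 1500.0 + 200.0 * np.cos(np.pi * t)
print(fit_dct(f2))
```

The zeroth coefficient captures the mean level of the trajectory and the higher coefficients capture its shape; the coefficient vector, rather than the raw track, is then what a statistical model would take as input.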

Acoustic-phonetic approaches to forensic voice comparison often include human-supervised measurement of vowel formants, but the reliability of such measurements is a matter of concern. This study assesses the within- and between-supervisor variability of three sets of formant-trajectory measurements made by each of four human supervisors. It also assesses the validity and reliability of forensic-voice-comparison systems based on these measurements. Each supervisor’s formant-trajectory system was fused with a baseline mel-frequency cepstral-coefficient system, and performance was assessed relative to the baseline system. Substantial improvements in validity were found for all supervisors’ systems, but some supervisors’ systems were more reliable than others.

Logistic-regression calibration and fusion are potential steps in the calculation of forensic likelihood ratios. The present paper provides a tutorial on logistic-regression calibration and fusion at a practical conceptual level with minimal mathematical complexity. A score is log-likelihood-ratio-like in that it indicates the degree of similarity of a pair of samples while taking into consideration their typicality with respect to a model of the relevant population. A higher-valued score provides more support for the same-origin hypothesis over the different-origin hypothesis than does a lower-valued score; however, the absolute values of scores are not interpretable as log likelihood ratios. Logistic-regression calibration is a procedure for converting scores to log likelihood ratios, and logistic-regression fusion is a procedure for converting parallel sets of scores from multiple forensic-comparison systems to log likelihood ratios. Logistic-regression calibration and fusion were developed for automatic speaker recognition and are popular in forensic voice comparison. They can also be applied in other branches of forensic science; a fingerprint/fingermark example is provided.
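The idea of converting scores to log likelihood ratios can be illustrated with a minimal sketch. It is not the software that accompanies the tutorial: the function names, the use of Newton's method, and the equal weighting of the same-origin and different-origin training sets (so that the fitted log odds can be read as a natural-log likelihood ratio) are assumptions made for this example.

```python
import numpy as np

def fit_calibration(scores_same, scores_diff, n_iter=50):
    """Fit log LR = a * score + b by class-weighted logistic regression."""
    scores_same = np.asarray(scores_same, dtype=float)
    scores_diff = np.asarray(scores_diff, dtype=float)
    s = np.concatenate([scores_same, scores_diff])
    y = np.concatenate([np.ones(len(scores_same)), np.zeros(len(scores_diff))])
    # Each class gets total weight 0.5, i.e. effective prior odds of 1,
    # so the fitted log posterior odds equal the log likelihood ratio.
    w = np.where(y == 1, 0.5 / len(scores_same), 0.5 / len(scores_diff))
    X = np.column_stack([s, np.ones_like(s)])
    theta = np.zeros(2)
    for _ in range(n_iter):                      # Newton's method
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (w * (p - y))
        hess = X.T @ (X * (w * p * (1.0 - p))[:, None])
        theta -= np.linalg.solve(hess + 1e-12 * np.eye(2), grad)
    return theta[0], theta[1]                    # slope a, intercept b

def scores_to_log_lr(scores, a, b):
    """Apply the fitted shift and scale; output is a natural-log LR."""
    return a * np.asarray(scores, dtype=float) + b

# Synthetic training scores (hypothetical, for illustration only).
rng = np.random.default_rng(0)
same = rng.normal(1.0, 1.0, 2000)
diff = rng.normal(-1.0, 1.0, 2000)
a, b = fit_calibration(same, diff)
print(a, b, scores_to_log_lr([0.5], a, b))
```

For these synthetic scores the true log likelihood ratio is twice the score, so the fitted slope should come out near 2 and the intercept near 0. Fusion follows the same pattern with one slope per system and a single intercept. Perfectly separated training scores would make this unregularised fit diverge.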

The onset + offset model of vowel inherent spectral change has been found to be effective for vowel-phoneme identification, and not to be outperformed by more sophisticated parametric-curve models. This suggests that if only simple cues such as initial and final formant values are necessary for signaling phoneme identity, then speakers may have considerable freedom in the exact path taken between the initial and final formant values. If the constraints on formant trajectories are relatively lax with respect to vowel-phoneme identity, then with respect to speaker identity there may be considerable information contained in the details of formant trajectories. Differences in physiology and idiosyncrasies in the use of motor commands may mean that different individuals produce different formant trajectories between the beginning and end of the same vowel phoneme. If within-speaker variability is substantially smaller than between-speaker variability then formant trajectories may be effective features for forensic voice comparison. This chapter reviews a number of forensic-voice-comparison studies which have used different procedures to extract information from formant trajectories. It concludes that information extracted from formant trajectories can lead to a high degree of validity in forensic voice comparison (at least under controlled conditions), and that a whole-trajectory approach based on parametric curves outperforms an onset + offset model.

In this paper we report on a study which demonstrates the importance of using non-contemporaneous test data in evaluating the validity and reliability of forensic-voice-comparison systems. We test four different systems: one MFCC GMM-UBM, one vowel formant-trajectory based, one nasal spectra based, and the fusion of the three systems. Each system is tested on the same set of test recordings, including same-speaker and different-speaker pairs. In one condition, the same-speaker pairs are from contemporaneous (within-session) recordings and in the other they are from non-contemporaneous (between-session) recordings. Within-session testing always overestimated the performance of the systems compared to between-session testing.

Acoustic-phonetic approaches to forensic voice comparison often include analysis of vowel formants. Such methods typically depend on human-supervised formant measurement, which is often assumed to be relatively reliable and relatively robust to telephone-transmission-channel effects, but which requires substantial investment of human labor. Fully-automatic formant trackers require minimal human labor but are usually not considered reliable. This study assesses the effect of variability within three sets of formant-trajectory measurements made by four human supervisors on the validity and reliability of forensic-voice-comparison systems in a high-quality v high-quality recording condition. Measurements were made of the formant trajectories of /iau/ tokens in a database of recordings of 60 female speakers of Chinese. The study also assesses the validity of forensic-voice-comparison systems including a human-supervised and five fully-automatic formant trackers under landline-to-landline, mobile-to-mobile, and mobile-to-landline conditions, each of these matched with the same condition and mismatched with the high-quality condition. In each case the formant-trajectory systems were fused with a baseline mel-frequency cepstral-coefficient (MFCC) system, and performance was assessed relative to the baseline system. The human-supervised systems always outperformed the fully-automatic formant-tracker systems, but in some conditions the improvement was marginal and the cost of human-supervised formant-trajectory measurement probably not warranted.

Defining the relevant population to sample is an important issue in data-based implementation of the likelihood-ratio framework for forensic voice comparison. We present a logical argument that because an investigator or prosecutor only submits suspect and offender recordings for forensic analysis if they sound sufficiently similar to each other, the appropriate defense hypothesis for the forensic scientist to adopt will usually be that the suspect is not the speaker on the offender recording but is a member of a population of speakers who sound sufficiently similar that an investigator or prosecutor would submit recordings of these speakers for forensic analysis. We propose a procedure for selecting background, development, and test databases using a panel of human listeners, and empirically test an automatic procedure inspired by the above. Although the automatic procedure is not entirely consistent with the logical arguments and human-listener procedure, it serves as a proof of concept for the importance of database selection. A forensic-voice-comparison system using the automatic database-selection procedure outperformed systems with random database selection.

GLOTTEX is a software package which extracts information about voice source properties, including estimates of properties related to physical structures of the vocal folds. It has been proposed that the output of GLOTTEX can be used as part of a forensic-voice-comparison system. We test this using manually labeled segments from a database of voice recordings of 60 female Chinese speakers. Performance was assessed relative to a baseline MFCC GMM-UBM system. GMM-UBM systems based on features extracted by GLOTTEX were combined with the baseline system using logistic-regression fusion. System performance was assessed in three channel conditions: high-quality v high-quality, mobile-to-landline v mobile-to-landline, and mobile-to-landline v high-quality. Substantial improvements over the baseline system were not observed.

This paper presents a preliminary analysis of the disputed utterance in Bain v R [2009] NZSC 16. A likelihood ratio is calculated as a strength-of-evidence statement with respect to the question: What is the probability of getting the acoustic properties of the disputed utterance if Bain had said “I shot the prick” versus if he had said “I can’t breathe”? In particular, an acoustic and statistical analysis is conducted on the first segment of the second word to estimate the probability of getting the acoustics of this segment if it were a postalveolar fricative versus if it were a palatal fricative. The validity of the system is tested and ways to improve the analysis are discussed.

A protocol for the collection of databases of audio recordings for forensic-voice-comparison research and practice is described. The protocol fulfills the following requirements: (1) The database contains at least two non-contemporaneous recordings of each speaker. (2) The database contains recordings of each speaker using different speaking styles which are typical of speaking styles found in casework, and which are elicited as natural speech. (3) The database is usable for research and casework involving recording- and transmission-channel mismatch. The protocol includes three speaking tasks: (1) an informal telephone conversation, (2) an information exchange task over the telephone, and (3) a pseudo-police-style interview. Technical issues are also discussed.

In R v T the Court concluded that the likelihood-ratio framework should not be used for the evaluation of evidence except ‘where there is a firm statistical base’. The present paper argues that the Court’s opinion is based on misunderstandings of statistics and of the likelihood-ratio framework for the evaluation of evidence. The likelihood-ratio framework is a logical framework and not itself dependent on the use of objective measurements, databases, and statistical models. The ruling is analysed from the perspective of the new paradigm for forensic-comparison science: the use of the likelihood-ratio framework for the evaluation of evidence; a strong preference for the use of objective measurements, databases representative of the relevant population, and statistical models; and empirical testing of the validity and reliability of the forensic-comparison system under conditions reflecting those of the case at trial.

An acoustic-phonetic forensic-voice-comparison system extracted information from the formant trajectories of tokens of Standard Chinese /iau/. When this information was added to a generic automatic forensic-voice-comparison system, which did not itself exploit acoustic-phonetic information, there was a substantial improvement in system validity but a decline in system reliability.

A procedure for comparing the performance of humans and machines on speaker recognition and on forensic voice comparison is proposed and demonstrated. The procedure is consistent with the new paradigm for forensic-comparison science (use of the likelihood-ratio framework and testing of the validity and reliability of the results). The use of the procedure is demonstrated using a small database of Swedish voice recordings.

Throughout 2015 and 2016 this was ranked as the most cited paper published in Science & Justice within the previous 5 years.

There has been a great deal of concern recently about validity and reliability in forensic science. This paper reviews for a broad target audience metrics of validity and reliability (accuracy and precision) which have been applied in forensic voice comparison and which are potentially applicable in other branches of forensic science. The metric of validity is the log likelihood-ratio cost (Cllr), and the metric of reliability is an empirical estimate of credible intervals. A revised procedure for the calculation of credible intervals is introduced.
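As an illustration of the validity metric, the log-likelihood-ratio cost can be computed from a set of test likelihood ratios whose true origin is known. The sketch below uses the standard definition of Cllr with equal weight on the same-origin and different-origin comparisons; the function and variable names are hypothetical, and the revised credible-interval procedure from the paper is not reproduced here.

```python
import numpy as np

def cllr(lr_same, lr_diff):
    """Log-likelihood-ratio cost from test likelihood ratios (not logs).

    lr_same: LRs from comparisons known to be same-origin
    lr_diff: LRs from comparisons known to be different-origin
    """
    lr_same = np.asarray(lr_same, dtype=float)
    lr_diff = np.asarray(lr_diff, dtype=float)
    # Same-origin LRs are penalised for being small,
    # different-origin LRs for being large.
    penalty_same = np.mean(np.log2(1.0 + 1.0 / lr_same))
    penalty_diff = np.mean(np.log2(1.0 + lr_diff))
    return 0.5 * (penalty_same + penalty_diff)

# A system that always answers LR = 1 gives no information: Cllr = 1.
print(cllr([1.0, 1.0], [1.0, 1.0]))
# LRs that point strongly in the correct direction give a Cllr near 0.
print(cllr([100.0, 1000.0], [0.01, 0.001]))
```

Lower values indicate better performance. A value above 1 indicates a system whose output is worse than always reporting a likelihood ratio of 1.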

Two procedures for the calculation of forensic likelihood ratios were tested on the same set of acoustic-phonetic data. One procedure was a multivariate kernel density procedure (MVKD) which is common in acoustic-phonetic forensic voice comparison, and the other was a Gaussian mixture model–universal background model (GMM-UBM) which is common in automatic forensic voice comparison. The data were coefficient values from discrete cosine transforms fitted to second-formant trajectories of /aI/, /eI/, /ou/, /au/, and /OI/ tokens produced by 27 male speakers of Australian English. Scores were calculated separately for each phoneme and then fused using logistic regression. The performance of the fused GMM-UBM system was much better than that of the fused MVKD system, both in terms of accuracy (as measured using the log-likelihood-ratio cost, Cllr) and precision (as measured using an empirical estimate of the 95% credible interval for the likelihood ratios from the different-speaker comparisons).

Logistic regression is a popular procedure for calibration and fusion of likelihood ratios in forensic voice comparison and automatic speaker recognition. The availability of multiple recordings of each speaker in the database used for calculation of calibration/fusion weights allows for different procedures for calculating those weights. Two procedures are compared, one using pooled data and the other using mean values from each speaker-comparison pair. The procedures are tested using an acoustic-phonetic and an automatic forensic-voice-comparison system. The mean procedure has a tendency to result in better accuracy, but the pooled procedure always results in better precision of the likelihood-ratio output.

As part of the Expert Evidence series the 100-page Forensic Voice Comparison chapter is aimed first at lawyers, judges, police officers, and potential jury members; however, it is hoped that this chapter will also be of interest to forensic scientists, phoneticians / speech scientists, speech-processing engineers, and students of all these disciplines. It introduces forensic voice comparison in a relatively non-technical way, assuming a reader who has no prior knowledge of the subject. The focus is on the understanding of concepts and the provision of basic knowledge.

“Morrison has a very nice writing style and I think he has phrased some of the fundamental matters in a way that is more clearly put than I have ever seen. I think he has done a masterly job.”

The issues of validity and reliability are important in forensic science. Within the likelihood-ratio framework for the evaluation of forensic evidence, the log-likelihood-ratio cost (Cllr) has been applied as an appropriate metric for evaluating the accuracy of the output of a forensic-voice-comparison system, but there has been little research on developing a quantitative metric of precision. The present paper describes two procedures for estimating the precision of the output of a forensic-comparison system: a non-parametric estimate and a parametric estimate of its 95% credible interval. The procedures are applied to estimate the precision of a basic automatic forensic-voice-comparison system presented with different amounts of questioned-speaker data. The importance of considering precision is discussed.

An acousticphonetic forensic-voice-comparison system was constructed using the time-averaged formant values of tokens of 61 male Chinese speakers’ /i/, /e/, and /a/ monophthongs as input. Likelihood ratios were calculated using amultivariate kernel density formula. A separate set of likelihood ratios was calculated for each vowel phoneme, and these were then fused and calibrated using linear logistic regression. The system was tested via cross-validation. The validity and reliability of the results were assessed using the log-likelihood-ratio-cost function (Cllr, a measure of accuracy) and an empirical estimate of the credible interval for the likelihood ratios from different-speaker comparisons (ameasure of precision). The credible interval was calculated on the basis of two independent pairs of samples for each different-speaker comparison pair.

We are in the midst of a paradigm shift in the forensic comparison sciences. The new paradigm can be characterised as quantitative data-based implementation of the likelihood-ratio framework with quantitative evaluation of the reliability of results. The new paradigm was widely adopted for DNA profile comparison in the 1990s, and is gradually spreading to other branches of forensic science, including forensic voice comparison. The present paper first describes the new paradigm, then describes the history of its adoption for forensic voice comparison over approximately the last decade. The paradigm shift is incomplete and those working in the new paradigm still represent a minority within the forensic-voice-comparison community.

In their recent introduction to forensic linguistics, Coulthard & Johnson (2007) include a portrayal of the likelihood-ratio framework for the evaluation of forensic comparison evidence (pp. 203–207). This portrayal includes a number of inaccuracies. The present letter attempts to correct these inaccuracies.

Non-contemporaneous speech samples from 27 male speakers of Australian English were compared in a forensic likelihood-ratio framework. Parametric curves (polynomials and discrete cosine transforms) were fitted to the formant trajectories of the diphthongs /aI/, /eI/, /oU/, /aU/, and /OI/. The estimated coefficient values from the parametric curves were used as input to a generative multivariate-kernel-density formula for calculating likelihood ratios expressing the probability of obtaining the observed difference between two speech samples under the hypothesis that the samples were produced by the same speaker versus under the hypothesis that they were produced by different speakers. Cross-validated likelihood-ratio results from systems based on different parametric curves were calibrated and evaluated using the log-likelihood-ratio cost function (Cllr). The cross-validated likelihood ratios from the best-performing system for each vowel phoneme were fused using logistic regression. The resulting fused system had a very low error rate, thus meeting one of the requirements for admissibility in court.

A traditional-style phonetic-acoustic forensic-speaker-recognition analysis was conducted on Australian English /o/ recordings. Different parametric curves were fitted to the formant trajectories of the vowel tokens, and cross-validated likelihood ratios were calculated using a single-stage generative multivariate kernel density formula. The outputs of different systems were compared using Cllr, a metric developed for automatic speaker recognition, and the cross-validated likelihood ratios were calibrated using a procedure developed for automatic speaker recognition. Calibration ameliorated some likelihood-ratio results which had offered strong support for a contrary-to-fact hypothesis.

A likelihood-ratio-based forensic speaker discrimination was conducted using the mean formant frequencies of Standard Chinese /i/ and /y/ tokens produced by 64 male speakers. The speech data were relatively forensically realistic in that they were relatively extemporaneous, were recorded over the telephone, and were from three non-contemporaneous recording sessions. A multivariate-kernel-density formula was used to calculate cross-validated likelihood ratios comparing all possible same-speaker and different-speaker combinations across sessions. Results were comparable with those previously obtained with laboratory speech in other languages. In general, greater strength of evidence was obtained for recording sessions separated by one week than for recording sessions separated by one month.

Incorrect versions of Figures 3 and 4 were printed in the paper version. These have been corrected in the online version.

Earlier studies have indicated that information regarding speaker identity can be extracted from the dynamic spectral properties of diphthongs. Some studies have conducted likelihood-ratio analyses based on simple models of the dynamic formant properties of diphthongs (e.g., dual-target model), and others have used more sophisticated polynomial curve fitting models but have not conducted likelihood-ratio analyses. The present study examines the strength of evidence which can be produced by a likelihood-ratio analysis based on the coefficients of polynomial curves fitted to the formant trajectories of Australian English /aI/ tokens. A cubic polynomial model offers a substantial improvement over the dual-target model.

Matlab function also available at American Institute of Physics Electronic Physics Auxiliary Publication Service (EPAPS): E-JASMAN-123-001801

The following article may also be of interest: Gorshi, S., Vaseghi, S., Yan, Q. (2008) Cross-entropic comparison of formants of British, Australian and American English accents. Speech Communication, 50, 564–579. http://dx.doi.org/10.1016/j.specom.2008.03.013

A multinomial logistic regression function is now available in the Matlab Statistics Toolbox. I have provided versions of some of the sample software making use of this function. T. M. Nearey’s software allows for more control over the specification of the logistic regression model; in particular, it allows one to specify diphone-biassed models. Matlab is required to run the software. Zipped files which include Nearey's software are password protected. Contact me to get the password. See also Logistic regression Software above.

Logistic regression software to run analyses of Bion, Escudero, Morrison (2008) data [not part of the tutorial paper, but provides examples of different (simpler?) drivers for the logistic regression function]: download
Nearey version