Biometric authentication: Multicue data fusion

Various research studies have suggested that no single modality can provide an adequate solution for high-security applications. These studies agree that it is vital to use multiple modalities, such as visual, infrared, acoustic, and chemical sensors. The problem of combining the power of several classifiers is of great importance to various applications. In many remote sensing, pattern recognition, and multimedia applications, it is not uncommon for different channels or sensors to facilitate the recognition of an object. In addition, for many applications with very high-dimensional feature data, fusion of multiple modalities provides some computational relief by dividing feature vectors into several lower-dimensional vectors before integrating them for the final decision (i.e., the divide-and-conquer principle). Consequently, it is very important to develop intelligent and sophisticated techniques for combining information from different sensors [1].

To cope with the limitations of individual biometrics, researchers have proposed using multiple biometric traits concurrently for verification. Such systems are commonly known as multimodal verification systems [183]. By using multiple biometric traits, systems gain more immunity to intruder attack. For example, it is more difficult for an impostor to impersonate another person using both audio and visual information simultaneously. Multicue biometrics also helps improve system reliability. For instance, while background noise has a detrimental effect on the performance of voice biometrics, it does not have any influence on face biometrics. Conversely, although the performance of face recognition systems depends greatly on lighting conditions, lighting does not have any effect on voice quality (see Problem 1 for a numerical example). As a result, audio and visual (AV) biometrics has attracted a great deal of attention in recent years. However, multiple biometrics should be used with caution because catastrophic fusion may occur if the biometrics are not properly combined; such fusion occurs when the performance of an ensemble of combined classifiers is worse than that of any of the individual classifiers.

Biometric pattern recognition systems must be computationally efficient for real-time processing, and VLSI DSP architecture has made it economically feasible to support such intelligent processing in real time. Neural networks (NNs) are particularly suitable for such real-time sensor fusion and recognition because they can easily adapt in response to incoming data and take the special characteristics of individual sensors into consideration. This chapter proposes and evaluates a novel neural network architecture for the effective and efficient fusion of signals from multiple modalities. Taking the perspective of treating the information pertaining to each sensor as a local expert, hierarchical NNs offer a very attractive architectural solution for multi-sensor information fusion. A hierarchical NN comprising many local classification experts and an embedded fusion agent (i.e., a gating network) will be developed for the architecture and algorithm design. In this context, the notion of mixture-of-experts (MOE) offers an instrumental tool for combining information from multiple local experts (see Figure 6.3). For effective and efficient local experts, the decision-based neural network (DBNN) is adopted as the local expert classification module.
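To make the MOE idea concrete, the following is a minimal sketch in Python/NumPy of how a gating network can blend the outputs of several local experts. The names (`expert_scores`, `gate_logits`) and the fixed gating values are illustrative assumptions; the DBNN experts and EM-trained gating parameters developed in this chapter are more elaborate, with the gating outputs depending on the input pattern.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def moe_fuse(expert_scores, gate_logits):
    """Combine local-expert scores with a gating network's weights.

    expert_scores: array of shape (K,), one confidence score per expert.
    gate_logits:   array of shape (K,), unnormalized gating outputs.
    Returns the fused score sum_i g_i * s_i, where g = softmax(gate_logits).
    """
    gates = softmax(np.asarray(gate_logits, dtype=float))
    return float(np.dot(gates, expert_scores))

# Toy usage: two experts (e.g., audio and visual); gating favors the first.
print(moe_fuse(expert_scores=[0.9, 0.4], gate_logits=[1.0, 0.0]))
```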
The proposed hierarchical NN can effectively incorporate the powerful expectation-maximization (EM) algorithm for adaptive training of (1) the discriminant function in the classification modules and (2) the gating parameters in the fusion network. This chapter also shows why such a hierarchical NN is not only cost-effective in terms of computation but also functionally superior in terms of recognition performance. In addition to the fusion of data collected from multiple sensors, it is possible to fuse the scores of multiple samples from a single sensor. This chapter details a novel approach to computing the optimal weights for fusing scores, based on the score distribution of independent samples and prior knowledge of the score statistics. Evaluations of this multi-sample fusion technique on speaker verification, face verification, and audio and visual (voice plus face) biometric authentication are reported.

10.2 Sensor Fusion for Biometrics

Sensor fusion is an information processing technique (see [66, 125]) through which information produced by several sources can be optimally combined. The human brain is a good example of a complex, multi-sensor fusion system; it receives five different signals: sight, hearing, taste, smell, and touch from five different sensors: eyes, ears, tongue, nose, and skin. Typically, it fuses signals from these sensors for decision making and motor control. The human brain also fuses signals at different levels for different purposes. For example, humans recognize objects by both seeing and touching them; humans also communicate by watching the speaker's face and listening to his or her voice at the same time. All of these phenomena suggest that the human brain is a flexible and complicated fusion system.

Research in sensor fusion can be traced back to the early 1980s [17, 348]. Sensor fusion can be applied in many ways, such as detecting the presence of an object, recognizing an object, and tracking an object. This chapter focuses on sensor fusion for verification purposes. Information can be fused at two different levels: the feature level and the decision level. Decision-level fusion can be further divided into abstract fusion and score fusion. These fusion techniques are discussed in the following two subsections.

10.2.1 Feature-Level Fusion

In feature-level fusion, data from different modalities are combined at the feature level before being presented to a pattern classifier [60]. One possible approach is to concatenate the feature vectors derived from different modalities [60], as illustrated in Figure 10.1. The dimensionality of the concatenated vectors, however, is sometimes too large for a reliable estimation of a classifier’s parameters, a problem known as the curse of dimensionality. Although dimensionality reduction techniques, such as PCA or LDA, can help alleviate the problem [60, 260], these techniques rely on the condition that data from each class contain only a single cluster. Classification performance might be degraded when the data from individual classes contain multiple clusters. Moreover, systems based on feature-level fusion are not very flexible because the system needs to be retrained whenever a new sensor is added. It is also important to synchronize different sources of information in feature-level fusion, which may introduce implementation difficulty in AV fusion systems.
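As a rough illustration of this approach, the sketch below (Python with scikit-learn; the feature arrays and dimensionalities are made up for the example) concatenates audio and visual feature vectors and then applies PCA to reduce the dimensionality of the fused vectors before classification.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical per-sample features from two modalities (100 samples each).
rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(100, 39))   # e.g., MFCC-based features
visual_feats = rng.normal(size=(100, 64))  # e.g., face-image features

# Feature-level fusion: concatenate the modality features per sample.
fused = np.hstack([audio_feats, visual_feats])   # shape (100, 103)

# Reduce the dimensionality of the fused vectors to mitigate the
# curse of dimensionality before training a single classifier.
pca = PCA(n_components=20)
fused_low = pca.fit_transform(fused)             # shape (100, 20)

# fused_low would then be presented to one pattern classifier.
```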

10.2.2 Decision-Level Fusion

Unlike feature-level fusion, decision-level fusion attempts to combine the decisions made by multiple modality-dependent classifiers (see Figure 10.2). This fusion approach solves the curse of dimensionality problem by training modality-dependent classifiers separately. Combining the outputs of the classifiers, however, is an important issue. The architecture of the classifiers can be identical but the input features are different (e.g., one uses audio data as input and the other uses video data). Alternatively, different classifiers can work on the same features and their decisions are combined. There are also systems that use a combination of these two types.

The two types of decision fusion are abstract fusion and score fusion. In the former, the binary decisions made by the individual classifiers are combined, as shown in Figure 10.2(a); in the latter, the scores (confidences) of the classifiers are combined, as in Figure 10.2(b).

In abstract fusion, the binary decisions can be combined by majority voting or by using AND and OR operators. In majority voting, the final decision is based on the number of votes cast by the individual classifiers [113, 182]. However, this voting method may fail to reach a decision when there is an even number of classifiers and half of them disagree with the other half.

Varshney [359] proposed using logical AND and OR operators for fusion. In the AND fusion, the final decision is not reached until all the decisions made by the classifiers agree. This type of fusion is very strict and therefore suitable only for systems that require low false acceptance. However, it has difficulty when the decisions made by different sensors are not consistent, which is a serious problem in multiclass applications. Unlike the AND fusion, the final decision in the OR fusion is made as soon as one of the classifiers makes a decision. This type of fusion is suitable only for systems that can tolerate a loose security policy (i.e., allowing high false acceptance error). The OR fusion suffers from the same problem as the voting method when the decisions of the individual classifiers do not agree with one another.
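These three abstract-fusion rules are simple enough to sketch directly. The following Python snippet (the function names are ours, not from the cited works) applies majority voting and the AND and OR rules to a list of binary accept/reject decisions.

```python
def majority_vote(decisions):
    """Accept iff more than half of the classifiers accept.

    decisions: list of booleans (True = accept). With an even number of
    classifiers, a tie falls through to reject here; as noted above, the
    tie case is genuinely ambiguous.
    """
    return sum(decisions) * 2 > len(decisions)

def and_fusion(decisions):
    """Accept only if all classifiers accept (strict: low false acceptance)."""
    return all(decisions)

def or_fusion(decisions):
    """Accept as soon as any classifier accepts (loose: high false acceptance)."""
    return any(decisions)

votes = [True, True, False]
print(majority_vote(votes), and_fusion(votes), or_fusion(votes))
# -> True False True
```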

In score fusion, the scores of modality-specific classifiers are combined and the final score is used to make a decision (see Figure 10.2(b)). Typically, the outputs of the modality-specific classifiers are linearly combined through a set of fusion weights [182]. The final score is obtained from

$$ s = \sum_{i=1}^{K} w_i s_i $$

where K is the number of modalities or experts, {wi} are a set of fusion weights, and {si} are the scores obtained from the K modalities. This kind of fusion is also referred to as the sum rule  [6, 182].

Scores can be interpreted as a posteriori probabilities in the Bayesian framework. Assuming that the scores from different modalities are statistically independent, the final score can be obtained by using the product rule [6, 182]:

$$ s = \prod_{i=1}^{K} s_i $$

To account for the discriminative power and reliability of each modality, a set of weights can be introduced as follows:

$$ s = \prod_{i=1}^{K} s_i^{w_i} $$

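A compact sketch of these three score-fusion rules (sum, product, and weighted product) in Python follows; the scores and weights are placeholders chosen purely for illustration.

```python
import numpy as np

def sum_rule(scores, weights):
    """Weighted sum rule: s = sum_i w_i * s_i."""
    return float(np.dot(weights, scores))

def product_rule(scores):
    """Product rule: s = prod_i s_i (assumes independent modalities)."""
    return float(np.prod(scores))

def weighted_product_rule(scores, weights):
    """Weighted product rule: s = prod_i s_i ** w_i."""
    return float(np.prod(np.power(scores, weights)))

# Toy example: audio score 0.8, visual score 0.6; audio weighted higher.
scores, weights = [0.8, 0.6], [0.7, 0.3]
print(sum_rule(scores, weights))              # 0.74
print(product_rule(scores))                   # 0.48
print(weighted_product_rule(scores, weights))
```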
It has been stated that the independence assumption is unrealistic in many situations. The assumption does hold, however, for some applications. For example, in AV verification systems, facial and speech features are largely independent. Therefore, fusion of audio and visual data at the score level is a possible solution for reducing verification error.

The fusion weights wi can be either non-adaptive or adaptive. Non-adaptive weights are learned from training data and kept fixed during recognition. For example, in Potamianos and Neti [284] and Sanderson and Paliwal [330], the fusion weights were estimated by minimizing the misclassification error on a held-out set; in Pigeon et al. [277], the parameters of a logistic regression model were estimated from the dispersion between the means of the speakers' and impostors' scores. The non-adaptive weights, however, may not be optimal under mismatched conditions. Adaptive weights, on the other hand, are estimated from the observed data during recognition, for example, according to the signal-to-noise ratio [239], the degree of voicing [260], or the degree of error present in each modality [371].
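As a toy illustration of adaptive weighting (not the specific scheme of [239]; the mapping from SNR to weight here is an assumption made only for this example), one might let each modality's weight grow with its estimated signal-to-noise ratio and then renormalize:

```python
import numpy as np

def snr_adaptive_weights(snr_db):
    """Map per-modality SNR estimates (in dB) to normalized fusion weights.

    Illustrative only: negative SNR estimates are clipped to zero and the
    remainder normalized, so a noisier modality contributes less to the
    fused score.
    """
    w = np.clip(np.asarray(snr_db, dtype=float), 0.0, None)
    total = w.sum()
    # Fall back to equal weights if every modality is estimated as pure noise.
    return w / total if total > 0 else np.full(len(w), 1.0 / len(w))

# E.g., noisy audio (5 dB) vs. clean video (20 dB): video dominates.
print(snr_adaptive_weights([5.0, 20.0]))  # -> [0.2 0.8]
```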

Another important approach to adapting fusion weights is based on the trainable properties of neural networks. For example, Brunelli and Falavigna [37] proposed a person identification system based on acoustic and visual features. In particular, two classifiers based on acoustic features and three based on visual ones provide data for an integration module whose performance was evaluated. A novel technique for the integration of multiple classifiers at a hybrid rank/measurement level was introduced using HyperBF networks. This research showed that the performance of the integrated system was superior to that of the acoustic and visual subsystems.

The linear combiners described above assume that the combined scores obtained from different classes are linearly separable. When this assumption cannot be met, the scores obtained from d experts can be treated as d-dimensional vectors, and a binary classifier (e.g., a support vector machine, multi-layer perceptron, decision tree, Fisher's linear discriminant, or Bayesian classifier) can be trained on a held-out set to classify these vectors [22, 51, 96]. Experimental results showed that SVMs and Bayesian classifiers achieve about the same performance and outperform the other candidate classifiers.
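A minimal sketch of this idea with scikit-learn follows (the client/impostor score vectors are synthetic, fabricated for illustration): each sample is a d-dimensional vector of expert scores, and an SVM trained on held-out scores separates client vectors from impostor vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Synthetic 2-expert score vectors: clients tend to score high, impostors low.
client_scores = rng.normal(loc=[0.8, 0.7], scale=0.1, size=(200, 2))
impostor_scores = rng.normal(loc=[0.3, 0.4], scale=0.1, size=(200, 2))

X = np.vstack([client_scores, impostor_scores])
y = np.hstack([np.ones(200), np.zeros(200)])  # 1 = client, 0 = impostor

# Train a binary classifier on the held-out score vectors.
clf = SVC(kernel="rbf").fit(X, y)

# Fuse a new pair of expert scores by classifying the score vector.
print(clf.predict([[0.75, 0.65]]))  # expected: [1.] (accept)
```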