Facial-recognition technology is already being used for applications ranging from unlocking phones to identifying potential criminals. Despite advances, it has still come under fire for racial bias: many algorithms that successfully identify white faces still fail to properly do so for people of color. Last week the National Institute of Standards and Technology (NIST) published a report showing how 189 face-recognition algorithms, submitted by 99 developers across the globe, fared at identifying people from different demographics.

Along with other findings, NIST’s tests revealed that many of these algorithms were 10 to 100 times more likely to inaccurately identify a photograph of a black or East Asian face, compared with a white one. In searching a database to find a given face, most of them picked incorrect images among black women at significantly higher rates than they did among other demographics.

This report is the third part of the latest assessment to come out of a NIST program called the Face Recognition Vendor Test (FRVT), which assesses the capabilities of different face-recognition algorithms. “We intend for this to be able to inform meaningful discussions and to provide empirical data to decision makers, policy makers and end users to know the accuracy, usefulness, capabilities [and] limitations of the technology,” says Craig Watson, an Image Group manager at NIST. “We want the end users and policy decision makers to see those results and decide for themselves.” Scientific American spoke with Watson about how his team conducted these evaluations.

[An edited transcript of the interview follows.]

What is the Face Recognition Vendor Test program?

It’s a core algorithm test of face-recognition capabilities. Part one looked at one-to-one verification accuracy: How well can algorithms take two images and tell you if they are the same person or not? An application would be like your phone: When you go to open your phone, if you’re using face recognition, you present your face to the phone. It says, “Are you the person that can access this phone or not?”
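
As a rough illustration of the one-to-one decision described here, the sketch below compares two face photos by turning each into a feature vector and thresholding their similarity. The embed function, the cosine-similarity measure and the threshold value are illustrative assumptions, not any tested vendor's actual method.

```python
import numpy as np

def embed(image: np.ndarray) -> np.ndarray:
    """Stand-in for a face-recognition model that turns a face image into a
    fixed-length feature vector (template). Hypothetical, not a real model."""
    raise NotImplementedError

def verify(image_a: np.ndarray, image_b: np.ndarray, threshold: float = 0.6) -> bool:
    """One-to-one verification: decide whether two photos show the same person
    by thresholding the similarity of their templates."""
    a, b = embed(image_a), embed(image_b)
    # Cosine similarity between the two templates.
    score = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return score >= threshold  # True means "same person", False means "not"
```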

Then part two looked at one-to-many identification. That’s searching an unknown image against a gallery. And if there is a match in the gallery, can the algorithm return that accurately? One-to-many searches can be used for access control to a facility: Ideally, someone would walk in and present their biometric. It would be compared with those of the people who are allowed access, and then they would automatically be granted access. It’s also used by law enforcement—searching, potentially, a criminal database to find out if someone’s in that database or not. What I would point out with that one is that everything that comes back from the algorithm typically goes to a human review.
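
The one-to-many case can be sketched in the same spirit: the probe is scored against every enrolled template in a gallery, and the strongest candidates above a threshold are returned, typically for a human to review. The gallery layout, scoring function and threshold here are assumptions for illustration only.

```python
from __future__ import annotations
import numpy as np

def identify(probe: np.ndarray,
             gallery: dict[str, np.ndarray],
             threshold: float = 0.6,
             top_k: int = 5) -> list[tuple[str, float]]:
    """One-to-many identification: score a probe template against every
    enrolled template in the gallery and return the strongest candidates.
    In a law-enforcement setting, these candidates go to a human reviewer."""
    scores = []
    for subject_id, template in gallery.items():
        score = float(np.dot(probe, template) /
                      (np.linalg.norm(probe) * np.linalg.norm(template)))
        scores.append((subject_id, score))
    # Keep only candidates above the decision threshold, strongest first.
    candidates = [(sid, s) for sid, s in scores if s >= threshold]
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]
```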

And then this part three is looking at demographic differentials for both one-to-one and one-to-many applications [to see if] the algorithms perform differently across different demographics in the data set.

What were the results in part three?

What we report are two types of errors: false positives and false negatives. A false positive is when an algorithm says that two photos are the same person when, in fact, they’re not. A false negative is when an algorithm says two photos are not the same person, when, in fact, they are. If you’re trying to access your phone, and you present your face, and it doesn’t give you access, that’s a false negative. In that case, it’s maybe an inconvenience—you can present a second time, and then you get access to your phone. And a false positive, if you’re doing access control to a facility, it’s a concern to the system owner because a false positive would allow people into the facility who shouldn’t be allowed. And then if you get into the law-enforcement perspective, it’s putting candidates on a list who possibly shouldn’t be on it.
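
Below is a small sketch of how those two error types would be tallied from a batch of labeled one-to-one trials, assuming each trial records the ground truth and the algorithm's match decision; the data format is hypothetical, not the report's.

```python
from __future__ import annotations

def error_rates(trials: list[tuple[bool, bool]]) -> tuple[float, float]:
    """Tally the two error types from one-to-one trials.

    Each trial is (same_person, algorithm_said_match):
      false positive -> different people, but the algorithm says "match"
      false negative -> same person, but the algorithm says "no match"
    Returns (false_positive_rate, false_negative_rate)."""
    impostor = [match for same, match in trials if not same]
    genuine = [match for same, match in trials if same]
    fpr = sum(impostor) / len(impostor) if impostor else 0.0
    fnr = sum(1 for match in genuine if not match) / len(genuine) if genuine else 0.0
    return fpr, fnr
```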

One of the things we found is that the majority of the algorithms submitted exhibit some level of demographic differential. We found that false positives are generally higher than the false negatives. These differentials exist on some level across most algorithms but actually not all of them. In one-to-one, there’s a really broad range of performance. Some algorithms have significantly—up to 100 times—more errors in certain demographics versus others. And that’s kind of the worst case. But there’s also the lower end [of errors], where algorithms perform better. So the real point here is there’s really a broad variance in performance. We strongly encourage everyone to know your algorithm, know your data and know your application when making decisions.
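
The kind of differential described here, one group seeing up to 100 times more false positives than another, can be expressed as a ratio of per-group false-positive rates. The sketch below assumes trials tagged with a demographic label; it is a simplified stand-in for NIST's analysis, not the report's actual methodology.

```python
from __future__ import annotations
from collections import defaultdict

def demographic_differential(trials: list[tuple[str, bool, bool]]) -> dict:
    """Compare false-positive rates across demographic groups.

    Each trial is (group, same_person, algorithm_said_match). Only impostor
    pairs (different people) can produce false positives. The worst-to-best
    ratio is the kind of figure behind "up to 100 times more errors"."""
    impostor_matches = defaultdict(list)
    for group, same, match in trials:
        if not same:
            impostor_matches[group].append(match)
    rates = {g: sum(m) / len(m) for g, m in impostor_matches.items() if m}
    nonzero = [r for r in rates.values() if r > 0]
    ratio = max(nonzero) / min(nonzero) if nonzero else None
    return {"per_group_fpr": rates, "worst_to_best_ratio": ratio}
```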

Algorithms developed in Asian countries seemed to fare better with nonwhite faces. What did the report say about that?

What the report was specifically talking about there was that algorithms developed in Asian countries didn’t have those demographic differentials for Asian faces. And what that indicates is there’s some promise that the data the algorithms are trained with could improve these performances. We don’t know, specifically, what each algorithm was trained on. We’re just making some level of assumption that the ones developed in Asian countries are trained with more Asian faces than most of the other algorithms.

So why haven’t developers in the U.S. trained their algorithms on more diverse faces?

When you get into these deep-learning and convolutional neural networks, you need large amounts of data and access to those data. That may not be trivial.

Where did NIST obtain photographs and data for these tests?

We have other agency sponsors that provide large volumes of anonymous operational data. In this particular test, we have four data sets. We have domestic mug-shot images that the FBI provided, application photos for immigration benefits, visa-application photos the Department of State provided and border-crossing photos for travelers entering the U.S. from [the Department of Homeland Security]. And, I would point out, those data go through human-subject review and legal review and privacy review before they’re shared with NIST.

These are very large volumes of data. In this case, it was a little more than 18 million images of a little more than eight million subjects that allowed us to do this testing. And those data come with various metadata—such as, for the FBI mug shots, a race category of black or white. And we can use those metadata, then, to do these demographic-differential analyses. For the Department of Homeland Security data, we had country of birth, and we use that as a proxy for race, which let us divide the data into basically seven different categories of regions around the world. Then we also get age and sex with most of the data, which allows us to do this analysis.
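
Here is a sketch of how such metadata might be turned into demographic cohorts for the analysis, using country of birth as a region proxy along with sex and a coarse age band. The region categories and field names are illustrative, not the ones NIST used.

```python
from __future__ import annotations

# Illustrative only: a proxy mapping from country of birth to a world region.
# The actual categories NIST used are not reproduced here.
REGION_BY_COUNTRY = {
    "CHN": "East Asia", "IND": "South Asia", "NGA": "Africa",
    "POL": "Europe", "MEX": "Central America", "USA": "North America",
}

def cohort(record: dict) -> tuple[str, str, str]:
    """Bucket one image record into a demographic cohort from its metadata:
    (region proxy from country of birth, sex, coarse age band)."""
    region = REGION_BY_COUNTRY.get(record["country_of_birth"], "Other")
    age_band = f"{(record['age'] // 10) * 10}s"   # e.g. age 27 -> "20s"
    return region, record["sex"], age_band
```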

Those data are sequestered here in the U.S., and we do not share them. What we do is develop an [application program interface (API)] to drive the test. So we own all the hardware here at NIST. We compile the driver on this end, and it links to their software, and then we run it on our hardware. That API is just about controlling how the load gets distributed across our hardware—how we access the images. So it’s about control of the test on this end, and control of the data, too.
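
Conceptually, the arrangement Watson describes amounts to a fixed interface that each vendor implements and a NIST-owned driver that calls it, so the sequestered images are read and distributed only on NIST hardware. The sketch below is a loose Python analogy for that contract; the real FRVT API differs in language and detail.

```python
from __future__ import annotations
from abc import ABC, abstractmethod

class FaceRecognitionAlgorithm(ABC):
    """Hypothetical shape of the interface a vendor would implement. The
    NIST-side driver loads the vendor's library, calls these methods on NIST
    hardware, and decides how images are read and how work is distributed,
    so the sequestered images never leave NIST's systems."""

    @abstractmethod
    def create_template(self, image_bytes: bytes) -> bytes:
        """Convert one face image into the vendor's proprietary template."""

    @abstractmethod
    def compare(self, template_a: bytes, template_b: bytes) -> float:
        """Return a similarity score for two templates (one-to-one)."""

def run_one_to_one_trials(algo: FaceRecognitionAlgorithm,
                          pairs: list[tuple[bytes, bytes]]) -> list[float]:
    """A minimal stand-in for the test driver: push image pairs through the
    vendor's code and collect scores for later accuracy analysis."""
    return [algo.compare(algo.create_template(a), algo.create_template(b))
            for a, b in pairs]
```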