Published: 04/04/2024
Tackling AI data bias with synthetic identities
Artificial intelligence and machine learning are among the most disruptive technologies ever seen, promising to unlock unprecedented levels of growth and innovation. However, flaws in the data that underpins AI models can lead to inaccurate, and potentially harmful, results – especially when the data relates to people. Synthetic identities can mitigate these challenges.
Though still in its infancy, AI is evolving rapidly, with new use cases emerging daily – and businesses are reacting fast. According to a 2023 survey by PwC, 73% of US companies have already invested in AI solutions, while 58% of companies said they will prioritize investments in AI in the next 12 months.1
The ability of AI models to analyze large data sets, identify trends, and automate complex tasks is redefining how these companies and organizations operate and make decisions, leading to greater operational efficiency, cost savings, and profitability – for those who get it right. However, therein lies a critical challenge: AI models are only as good as the data they are trained on.
Statistical distortions – also known as data bias – in the data set can lead to skewed results and inaccuracies. They can be caused by several factors, such as unbalanced data sets, historical biases in the data collection process, or flawed data-sampling methods, all of which are then learned and generalized by the AI model.
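As a minimal illustration of the first of these factors, an unbalanced data set can often be surfaced with a simple frequency check over its demographic metadata. The sketch below is a generic Python example; the labels are hypothetical:

```python
from collections import Counter

def group_shares(labels):
    """Return each group's share of the data set,
    a quick way to spot under-represented demographics."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Hypothetical age metadata for a face-image data set
age_bands = ["18-30", "18-30", "18-30", "31-50", "51+"]
print(group_shares(age_bands))
# {'18-30': 0.6, '31-50': 0.2, '51+': 0.2} -> "18-30" dominates
```

Historical and sampling biases are harder to catch this way, because they distort which examples end up in the data rather than how many of each group there are.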
When it comes to human data, these distortions can be a real problem because they can cause the AI model to favor certain groups or outcomes over others, resulting in biased and potentially unfair decisions. Especially in AI applications that use human images or video data, such distortions can be disastrous, perpetuating and even amplifying existing social biases. This calls into question the fairness and effectiveness of AI-driven decisions, making this a priority issue for the immediate future.
The challenge of addressing data bias
The EU AI Act, a legal framework designed to regulate the use of AI in the European Union, seeks to address this issue by classifying AI applications into different risk categories and establishing appropriate requirements to ensure their safe, transparent, and compliant use. Although the act has not yet come into force, mitigating the risk of harm caused by statistical distortions will be a key component.
It is not an easy problem to solve, however. One of the biggest challenges in addressing data bias lies in the inherent limitations of human data sets. Particularly in image-processing AI models, such as those using convolutional neural networks (CNNs), bias can manifest itself in subtle and complex ways. This is because the models are designed to identify patterns, which can inadvertently construct indirect representations of characteristics such as age, gender, or ethnicity, even when these features are not explicitly tagged in the training data. Not only is this form of indirect bias hard to detect, it is also extremely difficult to correct, requiring extensive, diverse data sets for thorough analysis. The time and cost of producing the thousands of new images needed to test for bias and achieve a more fairly representative data set are simply prohibitive.
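One common diagnostic for this kind of indirect encoding is a linear "probe": a simple classifier trained to predict a protected attribute from a model's frozen feature embeddings. If the probe performs well above chance, the embeddings encode the attribute even though it was never tagged. The sketch below uses random stand-in arrays in place of real CNN features, so it only illustrates the mechanics:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-ins: in practice, `embeddings` would come from a frozen CNN,
# and `attribute` would be a protected characteristic such as age band.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 128))
attribute = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, attribute, test_size=0.3, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy well above 0.5 would indicate the features indirectly
# encode the attribute; random stand-in data stays near chance.
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```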
Synthetic identities can help solve this problem.
What are synthetic identities?
Synthetic identities are artificially generated personas that simulate a wide range of human characteristics to represent a broad spectrum of human diversity without corresponding to real people. These personas can be used not only to detect bias in a data set, but also to test the actual AI model for bias and remove this bias entirely.
This has several advantages over real-life data sets:
- Time and cost savings: Using synthetic identities to create personas eliminates the need for extensive data collection efforts – such as recruiting human subjects and data tagging – saving significant time and money.
- Greater privacy: Synthetic identities sidestep the privacy concerns associated with the use of personal data, while also addressing any moral complications that may arise from perpetuating biases in the data set.
- Enhanced security: Because the data used by synthetic identities doesn’t correspond to real people, the risk of exposing sensitive personal information is greatly reduced.
- Improved accuracy and fairness: AI models trained on synthetic data can achieve higher levels of accuracy and fairness than is possible with human data, thus more effectively reflecting the diversity of real-world populations.
- Compliance: The use of synthetic identities aligns with existing and emerging regulations, such as the EU AI Act, as they inherently avoid personal data and privacy issues.
Now, secunet has developed a unique method of testing AI models for bias, using a large number of photorealistic synthetic identities to reflect the vast diversity of the human population. By replacing original identities in the test data with synthetic counterparts, the AI model’s recognition capabilities are rigorously tested against a wide range of profiles. This process makes it possible to test every conceivable combination of human characteristics – a feat that is practically impossible using human data – thus preventing the emergence of new biases. And if any bias is detected, the AI model is retrained with new synthetic identities until it shows no detectable bias.
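secunet has not published the internals of its method, so the following is only an illustrative sketch of the audit step described above: score a recognition model separately on each demographic group of synthetic identities and flag any group whose accuracy deviates too far from the mean. The `model.evaluate` interface and the data layout are assumptions made for the example:

```python
def audit_bias(model, synthetic_test_sets, tolerance=0.02):
    """Score a recognition model per group of synthetic identities
    and flag groups whose accuracy strays from the overall mean.

    synthetic_test_sets: {group_name: (images, labels)} -- illustrative.
    """
    scores = {
        group: model.evaluate(images, labels)  # assumed to return accuracy
        for group, (images, labels) in synthetic_test_sets.items()
    }
    mean_acc = sum(scores.values()) / len(scores)
    flagged = {g: s for g, s in scores.items()
               if abs(s - mean_acc) > tolerance}
    return scores, flagged

# The retraining loop described above would then repeat: while any
# group is flagged, generate fresh synthetic identities for it,
# retrain the model, and audit again.
```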
However, secunet’s solution goes one step further by also addressing the security and robustness of the AI model’s recognition performance. The technology allows for testing under a variety of environmental conditions, such as different weather or lighting scenarios, to evaluate an AI application’s ability to recognize relevant characteristics in these settings, thus identifying and mitigating potential security risks.
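As a rough illustration of what such environmental testing could involve (not secunet’s actual technique), the sketch below uses Pillow to generate lighting variants of a test image; a real harness would run the model on every variant of every synthetic identity and compare recognition scores across conditions:

```python
from itertools import product
from PIL import Image, ImageEnhance

def simulate_lighting(image):
    """Yield brightness/contrast variants of a test image,
    a crude stand-in for varied lighting conditions."""
    for brightness, contrast in product([0.4, 1.0, 1.6], [0.7, 1.0, 1.3]):
        variant = ImageEnhance.Brightness(image).enhance(brightness)
        variant = ImageEnhance.Contrast(variant).enhance(contrast)
        yield (brightness, contrast), variant

# Placeholder image; in practice, each synthetic identity image
# would be passed through the model under every condition.
img = Image.new("RGB", (224, 224), "gray")
for (b, c), variant in simulate_lighting(img):
    pass  # e.g. model.predict(variant) in a real test harness
```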
In short, this solution not only ensures the fairness and non-discrimination of the AI models, but also enhances their security and robustness. The impact of this on building trust and confidence in AI applications cannot be overstated.
Synthetic identities in action
The potential use cases for this technology are vast and varied, going beyond the obvious application in biometric identification. A prime example is its potential to enhance the safety of workers on construction sites by using AI-powered cameras that can detect whether personnel are adhering to safety measures, such as wearing safety helmets. Because the model has been trained with synthetic identities, it can accurately identify individuals in different scenarios, whether it’s a sunny day or a rainy evening.
However, the true power of this solution becomes apparent when the unexpected occurs. Consider a scenario where a group of children stumbles upon a construction site and starts playing with unmanned equipment. An AI model trained exclusively on human data might fail to promptly detect and respond to such an unusual occurrence due to a lack of similar imagery in the model’s training data. Naturally, it is neither feasible nor ethical to collect images of children on a construction site for training purposes. This is where synthetic identities prove invaluable in making AI systems more secure and reliable, ultimately preventing potential disasters.
Enhancing security to build trust in AI models
As AI becomes increasingly embedded in every aspect of our lives – from the workplace to our homes, and everywhere in between – ensuring public trust in these systems remains a crucial yet challenging goal. The EU AI Act will address some of the public concerns and provide a framework that ensures AI is used safely and responsibly. However, the responsibility doesn’t rest solely with regulatory bodies; organizations, businesses, and developers also play a pivotal role in shaping the perception and effectiveness of AI.
By transitioning from human data to synthetic identities, they can make a tangible and moral commitment toward enhancing the safety and responsibility of AI practices. This shift not only aligns with regulatory standards, but also demonstrates a proactive approach to addressing some of the most pressing concerns surrounding AI, including privacy, bias, and security. As this practice becomes more widespread, it will contribute to building a future where AI is not only trusted, but also recognized as a force for good.
Key takeaways
- AI is advancing quickly, but bias in training data threatens its effectiveness and fairness.
- Synthetic identities offer a practical solution to data bias, enhancing AI accuracy, fairness, and privacy.
- Utilizing synthetic identities in AI models can improve reliability and build public trust across various applications.
1. 2024 AI Business Predictions, PwC, 2023