Adversarial attacks in AI

Posted by Venkatesh Subramanian on October 20, 2024 · 4 mins read

In an earlier post we discussed multiple facets of Responsible AI and Data, and the risks thereof. In this post we look at attacks that perturb images with noise to alter a model's predictions, with some real-life examples, a simple demonstration, and a mitigation approach.

Adversarial attacks in AI typically introduce malicious input data, called adversarial examples, that trick an AI model into giving incorrect predictions despite appearing normal to the human eye. Such attacks can fool models into making biased decisions in scenarios such as law enforcement, finance, and healthcare, affecting the bias and fairness pillar of Responsible AI (RAI).

Let’s say there is an AI system that matches faces against a criminal database and flags individuals for police investigation. An attacker can subtly alter photos of a target individual via digital edits so that they match the profile of a known criminal in the database (false positives).
Likewise, an attacker can apply subtle modifications to the image of an actual criminal so that the AI system does not flag them at all (false negatives).
Such attacks can also bias law enforcement to treat different groups of people differently. For example, many reports have suggested that people of color are treated more unfairly by AI systems, as these systems amplify existing societal biases in the data.

Another example was reported by MIT, where hackers used innocuous stickers on the road to send autonomous Tesla cars careening into the oncoming traffic lane, a huge safety violation!

The transparency and explainability pillar of RAI, which demands model interpretability, can also be exploited: adversarial attackers can use a model's explanations to understand its weaknesses.

As an example, let’s take a clean image of a cat.
Before the adversarial attack, the AI system predicts the correct label, ‘cat’.
Next we apply a technique called the Fast Gradient Sign Method (FGSM), which generates adversarial examples by adding a small amount of noise (or perturbation) to the original image. The noise is generated from the gradients of the model’s loss function with respect to the input image. The attack uses a factor called epsilon that controls the intensity of this perturbation. Attackers usually start with a small epsilon and keep increasing it until it is large enough to fool the model, yet still imperceptible to the human eye. Once this noise is added to the cat image, the AI model incorrectly classifies it as some other entity.

An FGSM attack function in PyTorch looks as below. Input parameters: the image to be perturbed, epsilon (the intensity of the perturbation), and the gradient of the loss with respect to the input image (i.e., the sensitivity of the prediction to small changes in the original image).

import torch

def fgsm_attack(image, epsilon, data_grad):
    sign_data_grad = data_grad.sign()  # Get the sign of the gradients
    perturbed_image = image + epsilon * sign_data_grad  # Apply perturbation
    perturbed_image = torch.clamp(perturbed_image, 0, 1)  # Keep pixel values in [0, 1]
    return perturbed_image
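
To see the attack end to end, here is a minimal sketch of how data_grad can be obtained via backpropagation and how the epsilon sweep might look. The tiny stand-in classifier and the placeholder image and label are hypothetical; in practice you would use your own trained model and real inputs.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in classifier; any trained PyTorch image classifier would do.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
model.eval()

image = torch.rand(1, 3, 32, 32)  # placeholder image with pixel values in [0, 1]
label = torch.tensor([3])         # placeholder index for the true class ('cat')

image.requires_grad = True                   # track gradients w.r.t. the input pixels
loss = F.cross_entropy(model(image), label)  # loss for the correct label
model.zero_grad()
loss.backward()                              # populates image.grad
data_grad = image.grad.data

for epsilon in [0.01, 0.05, 0.1]:            # increase epsilon until the model is fooled
    perturbed = fgsm_attack(image, epsilon, data_grad)
    print(epsilon, model(perturbed).argmax(dim=1).item())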

Details of FGSM attacks can be found here for the technically inclined.

Other than FGSM, multiple other techniques exist to generate such examples:

  1. PGD (Projected Gradient Descent): An iterative, stronger version of FGSM (a brief sketch follows below).
  2. Deepfool: Minimizes perturbations while still making the model misclassify the input.
  3. C&W attack (Carlini & Wagner): Optimization attack that minimizes the difference between adversarial and original samples.
  4. Boundary attack: Starts from an adversarial point and gradually reduces the perturbation.

Note that these attacks are typically more compute-intensive than the simpler FGSM.
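
For the curious, below is a hedged sketch of the PGD idea from item 1 above, built on the same gradient-sign step used by FGSM. It repeats small steps and projects the result back into an epsilon-ball around the original image; the step size alpha and the iteration count are illustrative choices, not prescribed values.

import torch
import torch.nn.functional as F

def pgd_attack(model, image, label, epsilon, alpha=0.01, num_iter=10):
    original = image.detach()
    perturbed = original.clone()
    for _ in range(num_iter):
        perturbed.requires_grad = True
        loss = F.cross_entropy(model(perturbed), label)
        model.zero_grad()
        loss.backward()
        # Take a small FGSM-style step using the sign of the gradient
        perturbed = perturbed.detach() + alpha * perturbed.grad.data.sign()
        # Project back into the epsilon-ball around the original image
        perturbed = torch.max(torch.min(perturbed, original + epsilon), original - epsilon)
        perturbed = torch.clamp(perturbed, 0, 1)  # keep valid pixel range
    return perturbed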

As part of adversarial testing, you can use these methods to proactively generate adversarial sample images and then use those images to retrain the model so it becomes more robust against this class of attacks.
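
As one hedged sketch of what that retraining loop (often called adversarial training) might look like, reusing the fgsm_attack function from above: the model, optimizer, and train_loader here are hypothetical placeholders for your own training setup, and the 50/50 weighting of clean and adversarial loss is just one common choice.

import torch.nn.functional as F

def adversarial_training_epoch(model, train_loader, optimizer, epsilon=0.05):
    model.train()
    for images, labels in train_loader:
        # Generate adversarial versions of the current batch using FGSM
        images.requires_grad = True
        loss = F.cross_entropy(model(images), labels)
        model.zero_grad()
        loss.backward()
        adv_images = fgsm_attack(images, epsilon, images.grad.data).detach()

        # Train on a mix of clean and adversarial examples
        optimizer.zero_grad()
        clean_loss = F.cross_entropy(model(images.detach()), labels)
        adv_loss = F.cross_entropy(model(adv_images), labels)
        (0.5 * clean_loss + 0.5 * adv_loss).backward()
        optimizer.step()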

