
Adversarial Attacks on DNNs - Literature Review, Part 1

Background

An adversarial attack is a technique for fooling a machine learning model by adding small perturbations to its input data. The alterations are often imperceptible to the human eye but can cause the model to make incorrect predictions.

The phenomenon is so pervasive that building models robust to malicious attacks has become a critical task in deep learning. In this report, we discuss the basic concept of adversarial attacks and introduce some common methods for generating adversarial examples.

White-box and Black-box Attack

There are two types of adversarial attacks: white-box and black-box. In a white-box attack, the attacker has full access to the model, including its architecture and parameters. In a black-box attack, the attacker has no knowledge of the model's internals and can only interact with it through queries. Because they exploit full knowledge of the model, white-box attacks are generally more powerful; in practice, however, black-box attacks are the more realistic threat.

Fast Gradient Sign Method (FGSM)

FGSM is a simple yet effective white-box attack method, first introduced by Goodfellow et al. in 2015. They observed that different ML models trained on different datasets are vulnerable to the same adversarial examples, which suggests that such examples expose fundamental blind spots shared by DL models.

It was concluded that the linear behavior of models in high-dimensional spaces can be exploited to generate adversarial examples. Denote the weight vector as w and the adversarial example as x'. The model classifies x' by computing the dot product w^T x' = w^T x + w^T η, where η is the perturbation. Simply letting η = ϵ sign(w) increases the activation by ϵmn, where m is the average magnitude of the weights; the more dimensions n the input has, the more likely the model is to misclassify the perturbed input.
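Spelling out why the effect scales with the input dimensionality n (same notation as above):

w^T \eta = \epsilon \, w^T \text{sign}(w) = \epsilon \sum_i |w_i| \approx \epsilon m n

The activation shift grows linearly with n while the per-element change stays bounded by ϵ, so in high dimensions many tiny changes add up to a large effect on the output.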

However, nonlinear model families suffer from this vulnerability as well. In the original paper, a picture of a giant panda (denoted x) was correctly classified by an ImageNet-trained Inception model, yet after a small perturbation was added, the model misclassified the corrupted image (x') as a gibbon with high confidence. The perturbation η can be calculated as follows:

\eta = \epsilon \, \text{sign}\left(\nabla_x J(\theta, x, y)\right)

where ϵ is the magnitude of the perturbation and the loss function J(θ, x, y) is the cross-entropy loss between the true label y and the predicted label. The adversarial example is then given by x' = x + η.

The following code snippet demonstrates how to generate adversarial examples using FGSM in PyTorch:

import torch

# Assumes `device` is defined elsewhere, e.g. device = "cuda" if torch.cuda.is_available() else "cpu"
def fgsm_attack(model, loss, images, labels, eps):
    images = images.to(device)
    labels = labels.to(device)
    images.requires_grad = True          # track gradients w.r.t. the input pixels

    outputs = model(images)

    model.zero_grad()
    cost = loss(outputs, labels)
    cost.backward()                      # populates images.grad

    # FGSM step: move each pixel by eps in the direction that increases the loss
    attack_images = images + eps * images.grad.sign()
    attack_images = torch.clamp(attack_images, 0, 1)  # keep pixels in the valid [0, 1] range

    return attack_images

The fgsm_attack function takes the model, the loss function, a batch of images and labels, and the perturbation magnitude eps, and returns the adversarial examples. After the backward pass, images.grad holds the gradient of the loss with respect to the input images; images.grad.sign() takes its element-wise sign, which is scaled by eps and added to the images to produce the adversarial examples.
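A minimal usage sketch, assuming a pretrained classifier and a standard data loader (model and test_loader are placeholders, not from the original post):

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
criterion = nn.CrossEntropyLoss()

model.eval()  # a pretrained classifier, assumed to be defined elsewhere
for images, labels in test_loader:  # test_loader is a hypothetical DataLoader
    adv_images = fgsm_attack(model, criterion, images, labels, eps=0.007)
    preds = model(adv_images).argmax(dim=1)
    print((preds == labels.to(device)).float().mean().item())  # accuracy under attack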

Similarly, the linear-behavior argument extends to softmax regression and, by extension, to essentially all deep learning models. For example, an MNIST classifier can lose its ability to distinguish 3s from 7s after a human-imperceptible perturbation is added.

Finally, the paper also proposed a defense mechanism called adversarial training, which augments the training data with adversarial examples. Training with an adversarial objective function based on FGSM is an effective way to regularize the model and significantly improve its robustness against adversarial attacks. Adversarial training also reduces overfitting and improves the generalization of the maxout networks used in their experiments.
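The adversarial objective is, roughly, a weighted mix of the clean loss and the loss on the FGSM-perturbed input (the paper uses α = 0.5):

\tilde{J}(\theta, x, y) = \alpha J(\theta, x, y) + (1 - \alpha)\, J\left(\theta,\; x + \epsilon\,\text{sign}\left(\nabla_x J(\theta, x, y)\right),\; y\right)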

Back to one of the phenomena mentioned earlier: adversarial examples transfer across different models. Softmax and RBF networks are both vulnerable to adversarial attacks, and adversarial examples generated for one can also fool the other; the paper suggests, however, that this cross-model fooling only occurs when ϵ is so large that no meaningful information can be extracted from the input. Two problems remain open: why adversarial training works, and why averaging multiple models can sometimes eliminate adversarial examples.

DeepXplore

DeepXplore performs white-box adversarial attacks on autonomous driving models. Pei et al. (2019) proposed a systematic approach to generate adversarial examples that expose the vulnerabilities of image-based models, which can in turn be used to refine the DL models.

Since the emergence of autonomous driving, the reliability of its underlying algorithms has been a major concern; repeated news of accidents caused by autonomous driving has raised public concern about its safety. Nvidia DAVE-2, a widely used DNN-based steering model, turns out to be easily fooled by a darker version of the original image. The adversarial examples generated by DeepXplore can cause the model to make incorrect predictions, which can be dangerous in real-world scenarios.

DeepXplore introduces a new metric called neuron coverage to measure how thoroughly a set of test inputs exercises a DNN. Neuron coverage is defined as the percentage of neurons activated (i.e., whose output exceeds a threshold) by at least one test case. The higher the neuron coverage, the more of the model's internal logic has been exercised by the tests. The goal of DeepXplore is to maximize both neuron coverage and the differential behavior between models, exposing vulnerabilities that are often overlooked by traditional DNN testing methods.

Gradient ascent is used to generate test cases. As its name suggests, gradient ascent is the opposite of gradient descent: gradient descent treats the input as constant and the weights as variables, repeatedly updating the weights to minimize the loss; gradient ascent treats the weights, namely the weights of the models under attack, as constant and the input as the variable, repeatedly updating the input to maximize an objective. Practically, DeepXplore feeds the same input to two (or more) DNN models and searches for an input that maximizes both the difference between their outputs and the neuron coverage.
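A minimal sketch of gradient ascent on the input, assuming two hypothetical PyTorch models model_a and model_b that output steering angles (a simplified illustration, not the original DeepXplore implementation):

import torch

def ascend_input(model_a, model_b, image, steps=20, step_size=0.01):
    """Nudge `image` so the two models' predictions diverge; the weights stay frozen."""
    x = image.clone().detach().requires_grad_(True)
    for _ in range(steps):
        diff = (model_a(x) - model_b(x)).abs().sum()   # differential behavior to maximize
        grad, = torch.autograd.grad(diff, x)
        # ascent step on the input, keeping pixels in a valid range
        x = (x + step_size * grad.sign()).clamp(0, 1).detach().requires_grad_(True)
    return x.detach()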

Joint optimization is used to maximize both neuron coverage and the output difference. The two objectives are combined into a single objective function, which is then maximized with gradient ascent.

Mathematically, the two objectives are:

obj = \text{loss}_1 - \lambda \cdot \text{loss}_2

where loss_1 and loss_2 are the loss terms of the two models, and λ is a hyperparameter that brings the two terms onto the same scale.

And f(x) = \sum_{i=1}^{n} \text{dnn}_i(x)

where dnn_i(x) is the output of the i-th targeted neuron of the DNN given input x, and f serves as the neuron-coverage objective: maximizing it pushes uncovered neurons above the activation threshold.

The following code snippet (Python 2 / Keras, adapted from the DeepXplore reference implementation) demonstrates how DeepXplore works with autonomous driving models:

for _ in xrange(args.seeds):
    gen_img = preprocess_image(random.choice(img_paths))
    orig_img = gen_img.copy()
    # first check if input already induces differences
    angle1, angle2, angle3 = model1.predict(gen_img)[0], model2.predict(gen_img)[0], model3.predict(gen_img)[0]
    if angle_diverged(angle1, angle2, angle3):
        # the seed image already triggers differential behavior: record coverage and move on
        update_coverage(gen_img, model1, model_layer_dict1, args.threshold)
        update_coverage(gen_img, model2, model_layer_dict2, args.threshold)
        update_coverage(gen_img, model3, model_layer_dict3, args.threshold)

        averaged_nc = (neuron_covered(model_layer_dict1)[0] + neuron_covered(model_layer_dict2)[0] +
                       neuron_covered(model_layer_dict3)[0]) / float(
            neuron_covered(model_layer_dict1)[1] + neuron_covered(model_layer_dict2)[1] +
            neuron_covered(model_layer_dict3)[1])

        gen_img_deprocessed = draw_arrow(deprocess_image(gen_img), angle1, angle2, angle3)
        continue

    # if all turning angles are roughly the same, try to make them diverge
    orig_angle1, orig_angle2, orig_angle3 = angle1, angle2, angle3
    layer_name1, index1 = neuron_to_cover(model_layer_dict1)
    layer_name2, index2 = neuron_to_cover(model_layer_dict2)
    layer_name3, index3 = neuron_to_cover(model_layer_dict3)

    # construct joint loss function: push the target model's output away from the others
    if args.target_model == 0:
        loss1 = -args.weight_diff * K.mean(model1.get_layer('before_prediction').output[..., 0])
        loss2 = K.mean(model2.get_layer('before_prediction').output[..., 0])
        loss3 = K.mean(model3.get_layer('before_prediction').output[..., 0])
    elif args.target_model == 1:
        pass  # omitted for brevity (symmetric to the case above, with model2 as the target)
    elif args.target_model == 2:
        pass  # omitted for brevity (symmetric to the case above, with model3 as the target)

    loss1_neuron = K.mean(model1.get_layer(layer_name1).output[..., index1])
    loss2_neuron = K.mean(model2.get_layer(layer_name2).output[..., index2])
    loss3_neuron = K.mean(model3.get_layer(layer_name3).output[..., index3])
    layer_output = (loss1 + loss2 + loss3) + args.weight_nc * (loss1_neuron + loss2_neuron + loss3_neuron)

    # for adversarial image generation
    final_loss = K.mean(layer_output)

    # we compute the gradient of the input picture wrt this loss
    grads = normalize(K.gradients(final_loss, input_tensor)[0])

    # this function returns the loss and grads given the input picture
    iterate = K.function([input_tensor], [loss1, loss2, loss3, loss1_neuron, loss2_neuron, loss3_neuron, grads])

    # we run gradient ascent for 20 steps
    for iters in xrange(args.grad_iterations):
        loss_value1, loss_value2, loss_value3, loss_neuron1, loss_neuron2, loss_neuron3, grads_value = iterate(
            [gen_img])
        if args.transformation == 'light':
            grads_value = constraint_light(grads_value)  # constrain the gradients to a lighting change
        elif args.transformation == 'occl':
            grads_value = constraint_occl(grads_value, args.start_point,
                                          args.occlusion_size)  # constrain the gradients to an occlusion patch
        elif args.transformation == 'blackout':
            grads_value = constraint_black(grads_value)  # constrain the gradients to small black squares

        gen_img += grads_value * args.step
        angle1, angle2, angle3 = model1.predict(gen_img)[0], model2.predict(gen_img)[0], model3.predict(gen_img)[0]

        if angle_diverged(angle1, angle2, angle3):
            update_coverage(gen_img, model1, model_layer_dict1, args.threshold)
            update_coverage(gen_img, model2, model_layer_dict2, args.threshold)
            update_coverage(gen_img, model3, model_layer_dict3, args.threshold)

            averaged_nc = (neuron_covered(model_layer_dict1)[0] + neuron_covered(model_layer_dict2)[0] +
                           neuron_covered(model_layer_dict3)[0]) / float(
                neuron_covered(model_layer_dict1)[1] + neuron_covered(model_layer_dict2)[1] +
                neuron_covered(model_layer_dict3)[1])

            gen_img_deprocessed = draw_arrow(deprocess_image(gen_img), angle1, angle2, angle3)
            orig_img_deprocessed = draw_arrow(deprocess_image(orig_img), orig_angle1, orig_angle2, orig_angle3)
            break

For each of the seed images, DeepXplore first checks whether the input already induces different outputs. If so, it updates the neuron coverage. Otherwise, it constructs a joint loss function according to the target model, computes the loss and gradients, and runs gradient ascent for up to args.grad_iterations steps. Once the turning angles diverge, it updates the neuron coverage and saves the adversarial example to disk; the new image is annotated with the turning angles of the three models for comparison.

There are some hyperparameters worth mentioning. weight_diff and weight_nc balance the two objectives; step is the step size for gradient ascent; grad_iterations is the number of gradient ascent iterations; and transformation can be set to light, occl, or blackout to apply different domain-specific transformations to the input image.

The researchers found thousands of adversarial examples corresponding to conditions that can occur in the real world, including altered lighting, a single occluding rectangle, and many tiny black squares, all of which cause the models to make incorrect predictions. For PDF and Android malware detection models, DeepXplore additionally integrates real-life constraints on the inputs. The results show that DeepXplore can effectively expose the vulnerabilities of the models and improve their robustness against adversarial attacks.

They further discussed the benefits of using neuron coverage as a metric for DNN testing. First, it is more informative than traditional code coverage, which measures the percentage of code executed by a test suite: code coverage of a DNN can easily reach 100% while neuron coverage remains much lower, indicating that the model's logic has not been fully exercised. Note that to compute neuron coverage, the output of each neuron must be normalized to the range [0, 1] according to this formula:

X_{\text{norm}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}},

where X_min and X_max are the minimum and maximum output values of that layer, respectively.
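A minimal sketch of this bookkeeping, assuming hypothetical helpers that expose per-layer activations as NumPy arrays (simplified compared with the update_coverage/neuron_covered functions in the snippet above):

import numpy as np

def update_coverage_simple(layer_outputs, covered, threshold=0.25):
    """Mark neurons whose scaled output exceeds the threshold as covered.

    layer_outputs: dict mapping layer name -> activation array for one input
    covered: dict mapping (layer name, neuron index) -> bool
    """
    for name, out in layer_outputs.items():
        scaled = (out - out.min()) / (out.max() - out.min() + 1e-12)  # normalize to [0, 1]
        for idx in range(scaled.shape[-1]):
            if scaled[..., idx].mean() > threshold:
                covered[(name, idx)] = True

def neuron_coverage(covered):
    """Fraction of neurons activated by at least one test case."""
    return sum(covered.values()) / float(len(covered))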

Secondly, neuron coverage can reveal the activation patterns of different classes of inputs. For example, on MNIST, inputs of the same digit produce activation patterns that overlap with each other far more than with those of other digits. This indicates that the model has learned class-specific features and that neuron coverage can distinguish between different classes of inputs.

To measure the performance of DeepXplore, two metrics are used: the neuron coverage achieved and the execution time needed to find adversarial examples. The choice of hyperparameters is crucial to performance and varies with the DNN under test. The number of training samples, neurons, and training epochs also affects the results: the larger these are, the faster DeepXplore finds adversarial examples, suggesting that a more knowledgeable model is, if anything, easier to attack.

Finally, augmenting training with the examples generated by DeepXplore improves model accuracy by 1-3% on average, and the same machinery can be used to detect data contamination, i.e., incorrect labels in the training data.

Improving Transferability of Adversarial Examples with Input Diversity

In this section, we introduce a black-box attack strategy based on input diversity. As its name implies, the objective of this method is to make adversarial examples transferable to other models by enriching the diversity of the inputs used during their generation.

Variants of FGSM

Zou et al. (2020) summarized the motivation for this method: iterative white-box attacks are effective against their source model but transfer poorly to other models, and thus have a low black-box success rate because they overfit the source model; single-step attacks, on the other hand, learn less from the source model, so they have a lower white-box success rate but transfer somewhat better. The authors proposed a method that achieves high success rates in both white-box and black-box attacks.

The method is based on FGSM, specifically the Iterative FGSM (I-FGSM). The I-FGSM is a multi-step variant of FGSM, where the perturbation is updated multiple times to maximize the loss function. The perturbation is calculated as follows:

\begin{aligned} \eta_0 &= 0 \\ \eta_{t+1} &= \text{Clip}_{\epsilon}\left(\eta_t + \alpha \cdot \text{sign}\left(\nabla_x J(\theta, x + \eta_t, y)\right)\right) \end{aligned}

where α is the step size, ϵ is the magnitude of the perturbation, and Clip_ϵ clips the perturbation so that it stays within the ϵ-ball around the original input. The adversarial example is then given by x' = x + η_T, where T is the number of iterations.
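A minimal PyTorch sketch of the I-FGSM update, assuming a classifier model and cross-entropy loss (a simplified illustration, not the paper's code):

import torch
import torch.nn.functional as F

def i_fgsm(model, x, y, eps=0.03, alpha=0.005, steps=10):
    """Iteratively perturb x to maximize the loss, keeping the perturbation within the eps-ball."""
    eta = torch.zeros_like(x)
    for _ in range(steps):
        x_adv = (x + eta).clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        eta = torch.clamp(eta + alpha * grad.sign(), -eps, eps)  # Clip_eps
        eta = torch.clamp(x + eta, 0, 1) - x                     # keep pixels valid
    return x + eta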

Momentum Iterative FGSM (MI-FGSM) is another variant of I-FGSM that incorporates momentum into the perturbation update to discourage overfitting. A hyperparameter μ is introduced to control the momentum, and the perturbation is updated as follows:

\begin{aligned} g_{t+1} &= \mu \cdot g_t + \frac{\nabla_x J(\theta, x + \eta_t, y)}{\|\nabla_x J(\theta, x + \eta_t, y)\|_1} \\ \eta_{t+1} &= \text{Clip}_{\epsilon}\left(\eta_t + \alpha \cdot \text{sign}(g_{t+1})\right) \end{aligned}

where g_t is the accumulated gradient at iteration t. The momentum term is updated with the L1-normalized gradient of the loss at each iteration, which stabilizes the update direction and improves the transferability of the adversarial examples.

The larger μ is, the more momentum is carried into the perturbation update, which helps escape poor local optima (points where the gradient vanishes even though the loss is not at its optimum) and improves transferability. However, a very large μ can also lead to overfitting.
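The momentum update can be sketched by modifying the loop above (again a simplified illustration; mu and the other names follow the notation in the formulas, not the paper's code):

import torch
import torch.nn.functional as F

def mi_fgsm(model, x, y, eps=0.03, alpha=0.005, steps=10, mu=1.0):
    eta = torch.zeros_like(x)
    g = torch.zeros_like(x)  # accumulated (momentum) gradient
    for _ in range(steps):
        x_adv = (x + eta).clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        g = mu * g + grad / (grad.abs().sum() + 1e-12)           # L1-normalized gradient plus momentum
        eta = torch.clamp(eta + alpha * g.sign(), -eps, eps)
        eta = torch.clamp(x + eta, 0, 1) - x
    return x + eta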

To enhance input diversity, random transformations are applied to the input at each iteration; a simple instantiation randomly resizes the image and pads it back to a fixed size. Diverse Input Iterative FGSM (DI2-FGSM) is identical to I-FGSM except for these random transformations. The perturbation is calculated as follows:

\eta_{t+1} = \text{Clip}_{\epsilon}\left(\eta_t + \alpha \cdot \text{sign}\left(\nabla_x J(\theta, \text{Transform}(x + \eta_t; p), y)\right)\right)

where Transform(·; p) applies a random transformation to the input with probability p. Mathematically,

\text{Transform}(x; p) = \begin{cases} \text{Transform}(x) & \text{with probability } p \\ x & \text{with probability } 1 - p \end{cases}

When p = 0, DI2-FGSM reduces to I-FGSM; when p = 1, every step sees only transformed inputs, which makes the white-box attack less effective. It is therefore important to find a balance between the two extremes.
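A sketch of the diverse-input transformation, assuming the commonly used random resize-and-pad (the exact transformation and the size bounds here are illustrative assumptions):

import random
import torch
import torch.nn.functional as F

def diverse_input(x, p=0.5, low=299, high=330):
    """With probability p, randomly resize an NCHW batch x and zero-pad it to `high` x `high`."""
    if random.random() >= p:
        return x
    size = random.randint(low, high - 1)
    resized = F.interpolate(x, size=(size, size), mode="nearest")
    pad = high - size
    top, left = random.randint(0, pad), random.randint(0, pad)
    return F.pad(resized, (left, pad - left, top, pad - top), value=0)

The transformed image, not the raw one, is fed to the model when computing the gradient, while the perturbation update itself is still applied to the original input.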

Momentum DI2-FGSM (M-DI2-FGSM) combines DI2-FGSM and MI-FGSM, which further enhances the transferability of adversarial examples:

g_{t+1} = \mu \cdot g_t + \frac{\nabla_x J(\theta, \text{Transform}(x + \eta_t; p), y)}{\|\nabla_x J(\theta, \text{Transform}(x + \eta_t; p), y)\|_1}
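Putting the pieces together, the combined update is the MI-FGSM loop with the gradient taken on the transformed input (again a simplified sketch that reuses the helper defined above):

import torch
import torch.nn.functional as F

def m_di2_fgsm(model, x, y, eps=0.03, alpha=0.005, steps=10, mu=1.0, p=0.5):
    eta = torch.zeros_like(x)
    g = torch.zeros_like(x)
    for _ in range(steps):
        x_adv = (x + eta).clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(diverse_input(x_adv, p)), y)  # gradient on the transformed input
        grad, = torch.autograd.grad(loss, x_adv)
        g = mu * g + grad / (grad.abs().sum() + 1e-12)
        eta = torch.clamp(eta + alpha * g.sign(), -eps, eps)
        eta = torch.clamp(x + eta, 0, 1) - x
    return x + eta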

Applications

It is suggested that when an adversarial example remains adversarial for multiple networks, it is more likely to transfer to other networks as well. Attacking an ensemble of networks is therefore a good strategy for improving the transferability of adversarial examples.
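A common way to do this, sketched below, is to average the losses of several source models and ascend that averaged loss; the three-model list is an illustrative assumption:

import torch.nn.functional as F

def ensemble_loss(models, x_adv, y):
    """Average cross-entropy over several source models; an example that fools all of them tends to transfer better."""
    return sum(F.cross_entropy(m(x_adv), y) for m in models) / len(models)

# usage sketch: plug into any of the iterative attacks above, e.g.
# loss = ensemble_loss([model_a, model_b, model_c], x_adv, y)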

The single-model experiments show that M-DI2-FGSM is more effective than the other methods in the black-box setting while retaining a high success rate in the white-box setting.

As for ensemble attacks, they reach a higher overall success rate than single-model attacks and than the other methods. In the white-box setting, DI2-FGSM and M-DI2-FGSM have a slightly lower success rate than their non-diverse counterparts, mainly because ensemble attacks are more difficult than single-model attacks. This gap can be narrowed by reducing the transformation probability p, increasing the number of iterations N, or using a smaller step size α, all of which let the adversarial examples learn more from the target models.

Further experiments show that as p increases, the black-box success rate increases while the white-box success rate decreases; as N increases, both success rates increase; and as α decreases, both success rates increase. Additionally, black-box attacks turn out to be more effective than white-box attacks in these experiments.

Hyperparameters are crucial to the success rate of adversarial examples, and the optimal values vary with the target model. For example, if the target model is very different from the source model, a larger p and N and a smaller α are recommended.

Conclusion

In this report, we have discussed the basic concept of adversarial attacks and introduced some common methods for generating adversarial examples. The Fast Gradient Sign Method (FGSM) is a basic white-box attack that generates adversarial examples by adding small perturbations to the input. DeepXplore is a systematic approach for generating adversarial examples that expose the vulnerabilities of image-based models. To transfer adversarial examples to other models, momentum and the diverse-input strategy are introduced to balance the trade-off between white-box and black-box attacks.

The next literature review will focus on more advanced and specific adversarial attack methods, such as DeepBillboard, SINVAD, and input validation enhancement.

References

[1] Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and Harnessing Adversarial Examples. ICLR 2015. https://ai.google/research/pubs/pub43405

[2] Pei, K., Cao, Y., Yang, J., & Jana, S. (2019). DeepXplore. GetMobile, 22(3), 36–38. https://doi.org/10.1145/3308755.3308767

[3] Zou, J., Pan, Z., Qiu, J., Liu, X., Rui, T., & Li, W. (2020). Improving the Transferability of Adversarial Examples with Resized-Diverse-Inputs, Diversity-Ensemble and Region Fitting. In Lecture notes in computer science (pp. 563–579). https://doi.org/10.1007/978-3-030-58542-6_34