Warp Bug: gradcheck_tape Fails for handle_contact_pairs

Hey everyone,

We've encountered a couple of issues with the gradcheck_tape function when using handle_contact_pairs in Warp, and we wanted to share the details and see if anyone else has run into this or has insights.

Bug Description

After determining contact points for overlapping static spheres using wp.sim.collide, the gradcheck_tape function is failing in two scenarios:

  1. Mismatch between AD (Automatic Differentiation) and FD (Finite Difference) gradients when number_of_instances is less than 3.
  2. Warp CUDA error 700 (illegal memory access) when number_of_instances is 3 or greater.

Let's dive into each of these issues with more detail.

1. AD and FD Gradient Mismatch with number_of_instances < 3

When dealing with fewer than three instances, there's a discrepancy between the gradients calculated using Automatic Differentiation (AD) and Finite Difference (FD) methods. This mismatch causes the gradcheck_tape function to fail. Specifically, the error manifests as significant differences in gradient values for contact-related outputs like contact_point1, contact_offset0, contact_offset1, and contact_normal.

gradcheck_tape verifies gradient calculations by comparing two results: the AD gradients that Warp computes from the recorded tape, and FD gradients obtained by perturbing the inputs and re-evaluating the outputs. A large difference between the two points to a problem in the AD implementation, in the FD evaluation, or in how the function is being used. In this case, the FD gradients come out as exactly zero while the AD gradients are non-zero, so the check fails for the contact-related outputs.
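
To make the comparison concrete, here is a minimal pure-Python sketch of the kind of check gradcheck_tape performs. It is not Warp's implementation: central_difference, check_gradient, and the tolerance are hypothetical names and values chosen for illustration. The second case shows the general pattern seen in this report, where the output does not respond to the perturbation, so FD is exactly zero while the claimed gradient is not. It makes no claim about why that happens in Warp.

```python
def central_difference(f, x, eps=1e-4):
    """Approximate df/dx numerically with a symmetric perturbation."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


def check_gradient(f, claimed_grad, x, atol=1e-3):
    """Compare a claimed (AD-style) gradient against the FD estimate."""
    ad = claimed_grad(x)
    fd = central_difference(f, x)
    return abs(ad - fd) <= atol, ad, fd


# Consistent pair: f(x) = x^2 with gradient 2x.
consistent = check_gradient(lambda x: x * x, lambda x: 2.0 * x, 0.5)

# Inconsistent pair: the output ignores the input, so FD is zero,
# but the claimed gradient is non-zero.
inconsistent = check_gradient(lambda x: 1.0, lambda x: 0.8, 0.5)

print("consistent:  ", consistent)
print("inconsistent:", inconsistent)
```

A zero FD column therefore means only that the re-evaluated output did not change under perturbation; by itself it does not say whether the AD or the FD side is wrong.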

Here’s a snippet from the error log that highlights the issue:

Input      | Output            | Max Abs Error          | AD at MAE  | FD at MAE | Max Rel Error | Pass |
---------------------------------------------------------------------------------------------------------
body_q     | contact_point1    | 8.000e-01 at (2, 0, 6) | 8.000e-01  | 0.000e+00 | 8.000e+07     | FAIL |
body_q     | contact_offset0   | 1.000e+00 at (0, 0, 0) | 1.000e+00  | 0.000e+00 | 1.000e+08     | FAIL |
body_q     | contact_offset1   | 1.000e+00 at (0, 0, 0) | -1.000e+00 | 0.000e+00 | 1.000e+08     | FAIL |
body_q     | contact_normal    | 1.000e+01 at (0, 0, 0) | -1.000e+01 | 0.000e+00 | 1.000e+09     | FAIL |

As the table shows, the FD gradient is zero in every failing row, while the AD gradient ranges from 0.8 to 10 in magnitude, with the largest absolute error on contact_normal. The very large relative errors are what you would expect when the reference value is zero. Until this mismatch is explained, we can't tell which gradients to trust for these outputs, which is a problem for any optimization or learning process that relies on them.
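
The relative errors in the table can be reproduced by dividing the absolute error by a small floor when FD is zero. The floor of 1e-8 and the max(|FD|, floor) form below are inferred from the reported numbers, not taken from Warp's source, so treat this as a sanity check on the table rather than a description of gradcheck_tape.

```python
# (output, AD at MAE, FD at MAE, reported max rel error), copied from the log.
rows = [
    ("contact_point1", 0.8, 0.0, 8.0e7),
    ("contact_offset0", 1.0, 0.0, 1.0e8),
    ("contact_offset1", -1.0, 0.0, 1.0e8),
    ("contact_normal", -10.0, 0.0, 1.0e9),
]


def relative_error(ad, fd, floor=1e-8):
    """Assumed form: |AD - FD| / max(|FD|, floor)."""
    return abs(ad - fd) / max(abs(fd), floor)


for name, ad, fd, reported in rows:
    computed = relative_error(ad, fd)
    print(f"{name:16s} computed={computed:.3e} reported={reported:.3e}")
```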

2. Warp CUDA Error 700 with number_of_instances >= 3

Moving on to the second issue, when the number_of_instances reaches three or more, we encounter a Warp CUDA error 700, which indicates an illegal memory access. This error typically arises when a kernel attempts to read from or write to a memory location that it doesn't have permission to access. This is a critical error that halts the execution and needs immediate attention.

In our context, the error is reported while launching the handle_contact_pairs kernel, which suggests that the kernel reads or writes outside a valid allocation. A common cause of this class of error is incorrect indexing or a missing bounds check in the kernel, but we have not confirmed that here. The message "an illegal memory access was encountered" tells us only that an out-of-bounds or otherwise invalid read or write occurred, not where. Any results produced after such an error cannot be trusted.

Here’s the relevant snippet from the error log:

Warp CUDA error 700: an illegal memory access was encountered (in function cuda_context_check, /builds/omniverse/warp/warp/native/warp.cu:2093)
Traceback (most recent call last):
  File "/home/user/Coding/test_warp/reproduce_memory_allocation_error_minimal.py", line 21, in <module>
    gradcheck_tape(tape=tape)
  File "/home/user/Coding/test_warp/venv/lib/python3.10/site-packages/warp/autograd.py", line 322, in gradcheck_tape
    success = gradcheck(
  ...
RuntimeError: CUDA error detected: 700
Error launching kernel: handle_contact_pairs on device cuda:0
Warp CUDA error 700: an illegal memory access was encountered (in function free_device_async, /builds/omniverse/warp/warp/native/warp.cu:653)

The traceback shows that gradcheck_tape calls gradcheck, and that the error is detected while launching the handle_contact_pairs kernel on cuda:0. A second error 700 is then reported from free_device_async, which is consistent with the CUDA context already being in an error state. We still need to find where the illegal memory access happens and why. Candidates include a miscalculated memory offset or launch dimensions that don't match the array sizes.
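
We have not located the faulty access, so the following is only a generic illustration of this failure class, written in plain Python. ContactBuffer and try_write are hypothetical names and do not correspond to anything in Warp. The sketch shows a fixed-capacity buffer whose writer checks the index before writing and counts rejected requests. On a GPU, the same write without the guard would touch memory past the allocation.

```python
class ContactBuffer:
    """Fixed-capacity buffer, loosely mimicking a preallocated device array."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.points = [None] * capacity
        self.written = 0    # slots actually filled
        self.requested = 0  # slots requested, including rejected ones

    def try_write(self, point):
        """Claim the next slot; refuse the write if it would be out of bounds."""
        index = self.requested
        self.requested += 1
        if index >= self.capacity:
            return False  # guard: without it, a GPU kernel would write out of bounds
        self.points[index] = point
        self.written += 1
        return True


buffer = ContactBuffer(capacity=2)
results = [buffer.try_write((0.0, 0.0, 0.1 * i)) for i in range(3)]
print(results, "requested:", buffer.requested, "written:", buffer.written)
```

Comparing requested against capacity after a run is a cheap way to detect that a buffer was sized too small for the workload.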

Minimal Example to Reproduce the Errors

To help reproduce these issues, I've created a minimal example using Python and Warp. This example sets up a simple simulation with overlapping static spheres and then runs gradcheck_tape to verify the gradient calculations.

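import warp as wp
import warp.sim
from warp.autograd import gradcheck_tape

# Check for CUDA errors after each launch so the failing kernel is reported.
wp.config.verify_cuda = True

# Values below 3 reproduce the AD/FD mismatch; 3 or more reproduce CUDA error 700.
number_of_instances = 3
model_builder = wp.sim.ModelBuilder()
for _i in range(number_of_instances):
    # Two overlapping spheres per instance, each attached to its own body.
    # Each instance gets its own collision group.
    model_builder.add_shape_sphere(body=model_builder.add_body(), radius=0.1, pos=[0.0, 0.0, 0.1], collision_group=_i)
    model_builder.add_shape_sphere(body=model_builder.add_body(), radius=0.1, pos=[0.0, 0.0, 0.0], collision_group=_i)

model = model_builder.finalize(requires_grad=True)
model.ground = False  # no ground plane, so only sphere-sphere contacts remain
state = model.state(requires_grad=True)

# Record the collision pass on a tape so its gradients can be checked.
tape = wp.Tape()
with tape:
    wp.sim.collide(model, state, requires_grad=True)

# Compare AD gradients from the tape against finite differences.
gradcheck_tape(tape=tape)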

This code snippet creates a basic Warp simulation with a specified number of sphere instances. Each instance consists of two spheres, ensuring there's potential for contact and collision. The model_builder is used to create these shapes and bodies, and the model and state objects are initialized with gradient tracking enabled (requires_grad=True). This is crucial because we want to check the gradients computed during the collision handling.

Inside the with tape: block, we record the operations performed by wp.sim.collide. Warp's tape keeps track of the recorded operations so that it can later compute gradients through them. The wp.sim.collide function detects collisions and computes the contact information between the spheres. By setting requires_grad=True, we ensure that the gradients are tracked through this function.
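
For readers unfamiliar with tape-based differentiation, here is a toy pure-Python sketch of the idea: each operation records a backward step, and the steps are replayed in reverse to accumulate gradients. Var and ToyTape are hypothetical names for illustration only. Warp's wp.Tape records kernel launches and is implemented differently.

```python
class Var:
    """A scalar value with an accumulated gradient."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0


class ToyTape:
    """Records one backward closure per operation, replayed in reverse."""

    def __init__(self):
        self._backward_steps = []

    def sub(self, a, b):
        out = Var(a.value - b.value)

        def backward():
            a.grad += out.grad
            b.grad -= out.grad

        self._backward_steps.append(backward)
        return out

    def mul(self, a, b):
        out = Var(a.value * b.value)

        def backward():
            a.grad += b.value * out.grad
            b.grad += a.value * out.grad

        self._backward_steps.append(backward)
        return out

    def backward(self, loss):
        loss.grad = 1.0  # seed d(loss)/d(loss)
        for step in reversed(self._backward_steps):
            step()


z_a, z_b = Var(0.05), Var(0.0)
tape = ToyTape()
diff = tape.sub(z_a, z_b)
loss = tape.mul(diff, diff)  # (z_a - z_b)^2
tape.backward(loss)
print(loss.value, z_a.grad, z_b.grad)
```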

Finally, gradcheck_tape(tape=tape) is called to verify the correctness of the gradients. This function compares the gradients computed using automatic differentiation with those computed using finite differences. If there's a significant discrepancy, it indicates a potential issue in the gradient computation, which is exactly what we're seeing with the errors we've described.

By running this minimal example, you should be able to reproduce the issues we're facing: the AD/FD gradient mismatch when number_of_instances is less than 3, and the CUDA error 700 when it's 3 or greater. This makes it easier to debug and identify the root cause of the problem.
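
Since an illegal memory access leaves the CUDA context of the process unusable, it helps to run each value of number_of_instances in a fresh process when sweeping over several values. The sketch below does that with the standard library. It assumes, hypothetically, that the repro script has been changed to read number_of_instances from a NUMBER_OF_INSTANCES environment variable. That variable, run_case, and sweep are not part of Warp or of the original script.

```python
import os
import subprocess
import sys

CUDA_ERROR_MARKER = "CUDA error 700"  # text as it appears in the log in this report


def run_case(argv, number_of_instances, timeout=600):
    """Run one repro in its own process and summarize the outcome."""
    env = dict(os.environ, NUMBER_OF_INSTANCES=str(number_of_instances))
    result = subprocess.run(
        argv, env=env, capture_output=True, text=True, timeout=timeout
    )
    output = result.stdout + result.stderr
    return {
        "returncode": result.returncode,
        "cuda_error_700": CUDA_ERROR_MARKER in output,
    }


def sweep(argv, counts):
    """Map each instance count to the outcome of an isolated run."""
    return {n: run_case(argv, n) for n in counts}
```

A call such as sweep([sys.executable, "reproduce_memory_allocation_error_minimal.py"], range(1, 6)) would then show at which instance count the behavior switches from a gradient mismatch to the CUDA error.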

More Extensive Example: Separate Rigid Body Spheres

To provide more context, this issue was discovered while working on a simple optimization problem where the goal is to minimize the squared distance between two rigid body spheres. While the gradients appear to guide the optimizer correctly, the FD gradients are all zero, and gradcheck_tape fails. This raises a key question: Is wp.sim.collide intended to be used in this manner, or should it only be used in combination with an integrator that resolves collisions?

The setup is a small simulation in which each instance contains two spheres that can move independently along the z-axis. The objective is the squared signed distance between the two spheres, computed from the contact data that wp.sim.collide produces. An optimizer adjusts the sphere positions iteratively using the gradients computed by Warp, so everything depends on the distance and its gradients being calculated accurately.
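
As a reference point for what consistent gradients look like on this kind of objective, here is an idealized pure-Python version. It assumes the signed distance between two equal spheres on the z-axis is |z_a - z_b| - 2 * radius, which is a textbook simplification and not Warp's contact representation. The radius (0.1), learning rate (0.01), and step count (20) are taken from the script in this post, and the function names are hypothetical.

```python
def signed_distance(z_a, z_b, radius=0.1):
    """Idealized sphere-sphere signed distance along z (negative when overlapping)."""
    return abs(z_a - z_b) - 2.0 * radius


def loss(z_a, z_b):
    return signed_distance(z_a, z_b) ** 2


def analytic_grad(z_a, z_b):
    """Gradient of the squared signed distance with respect to (z_a, z_b)."""
    sd = signed_distance(z_a, z_b)
    sign = 1.0 if z_a >= z_b else -1.0
    return 2.0 * sd * sign, -2.0 * sd * sign


def fd_grad(z_a, z_b, eps=1e-4):
    """Central finite differences of the same loss."""
    d_a = (loss(z_a + eps, z_b) - loss(z_a - eps, z_b)) / (2.0 * eps)
    d_b = (loss(z_a, z_b + eps) - loss(z_a, z_b - eps)) / (2.0 * eps)
    return d_a, d_b


def optimize(z_a, z_b, lr=0.01, steps=20):
    """Plain gradient descent; returns final positions and the loss history."""
    history = [loss(z_a, z_b)]
    for _ in range(steps):
        g_a, g_b = analytic_grad(z_a, z_b)
        z_a, z_b = z_a - lr * g_a, z_b - lr * g_b
        history.append(loss(z_a, z_b))
    return z_a, z_b, history


ad = analytic_grad(0.05, 0.0)
fd = fd_grad(0.05, 0.0)
final_a, final_b, history = optimize(0.05, 0.0)
print("AD:", ad, "FD:", fd)
print("loss:", history[0], "->", history[-1])
```

In this idealized model the FD gradients are non-zero and agree with the analytic ones, which is the behavior we expected gradcheck_tape to confirm.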

The gradients do seem to work in practice, in that the optimizer moves the spheres as intended. However, the failure of gradcheck_tape indicates a disconnect between the AD gradients and the FD gradients. The optimization may be converging even though the underlying gradient calculations, or the FD evaluation used to check them, are not entirely correct.

Here's where the question of intended usage comes into play. If wp.sim.collide is primarily designed to be used within a larger simulation loop that includes collision resolution and integration, then using it in isolation like this might expose edge cases or limitations that aren't typically encountered. On the other hand, if it's meant to provide accurate contact information even in static scenarios, then the failure of gradcheck_tape points to a genuine issue that needs to be addressed.

Here’s the code for this more extensive example, which should give you a clearer picture of how we're using wp.sim.collide:

import numpy as np
import warp as wp
import warp.sim
from warp.autograd import gradcheck_tape
from warp.optim import SGD

wp.config.mode = "debug"
wp.config.verify_cuda = True


@wp.kernel
def compute_signed_distance(
        shape_contact_tids: wp.array(dtype=wp.int32),
        rigid_contact_broad_shape0: wp.array(dtype=wp.int32),
        rigid_contact_broad_shape1: wp.array(dtype=wp.int32),
        shape_body: wp.array(dtype=int),
        shapes_per_instance: int,
        body_q: wp.array(dtype=wp.transform),
        contact_point0: wp.array(dtype=wp.vec3),
        contact_point1: wp.array(dtype=wp.vec3),
        contact_normal: wp.array(dtype=wp.vec3),
        contact_thickness: wp.array(dtype=wp.float32),
        # outputs
        distance: wp.array(dtype=wp.float32),
) -> None:
    tid = wp.tid()
    shape_tid = shape_contact_tids[tid]
    if shape_tid == -1:
        return

    shape0_idx = rigid_contact_broad_shape0[tid]
    shape1_idx = rigid_contact_broad_shape1[tid]
    shape_0_instance_idx = shape0_idx // shapes_per_instance
    shape_1_instance_idx = shape1_idx // shapes_per_instance
    assert shape_0_instance_idx == shape_1_instance_idx  # noqa: S101

    rigid_a = shape_body[shape0_idx]
    x_wb_a = wp.transform_identity()
    if rigid_a >= 0:
        x_wb_a = body_q[rigid_a]
    p_a_world = wp.transform_point(x_wb_a, contact_point0[shape_tid])

    rigid_b = shape_body[shape1_idx]
    x_wb_b = wp.transform_identity()
    if rigid_b >= 0:
        x_wb_b = body_q[rigid_b]
    p_b_world = wp.transform_point(x_wb_b, contact_point1[shape_tid])

    diff = p_a_world - p_b_world
    d = wp.dot(diff, contact_normal[shape_tid])
    signed_distance = d - contact_thickness[shape_tid]
    wp.atomic_add(arr=distance, i=shape_0_instance_idx, value=wp.pow(signed_distance, 2.0))


@wp.kernel
def update_body_q(positions: wp.array(dtype=wp.float32), body_q: wp.array(dtype=wp.transform)) -> None:
    tid = wp.tid()
    body_q[tid] = wp.transform(wp.vec3(0., 0., positions[tid]), wp.quat_identity())


number_of_instances = 2
model_builder = wp.sim.ModelBuilder(gravity=0.0)
for _i in range(number_of_instances):
    model_builder.add_shape_sphere(body=model_builder.add_body(), radius=0.1, pos=[0.0, 0.0, 0.01], collision_group=_i)
    model_builder.add_shape_sphere(body=model_builder.add_body(), radius=0.1, pos=[0.0, 0.0, 0.0], collision_group=_i)

model = model_builder.finalize(requires_grad=True)
model.ground = False
state = model.state(requires_grad=True)

rng = np.random.default_rng(42)
pos = rng.uniform(low=[0.01, 0.0], high=[0.1, 0.0], size=(number_of_instances, 2)).flatten()
object_z = wp.array(pos, dtype=wp.float32, requires_grad=True)
distance = wp.zeros((number_of_instances,), dtype=wp.float32, requires_grad=True)

optimizer = SGD(lr=0.01, params=[object_z])
for _ in range(20):
    distance.zero_()
    tape = wp.Tape()
    with tape:
        wp.launch(
            kernel=update_body_q,
            dim=(number_of_instances * 2,),
            inputs=[object_z],
            outputs=[state.body_q]
        )
        wp.sim.collide(model, state, requires_grad=True)
        wp.launch(
            kernel=compute_signed_distance,
            dim=model.rigid_contact_tids.shape,
            inputs=[
                model.rigid_contact_tids,
                model.rigid_contact_broad_shape0,
                model.rigid_contact_broad_shape1,
                model.shape_body,
                2,
                state.body_q,
                model.rigid_contact_point0,
                model.rigid_contact_point1,
                model.rigid_contact_normal,
                model.rigid_contact_thickness,
            ],
            outputs=[distance],
        )

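    # Seed the output gradient with ones and backpropagate to object_z.
    distance.grad.fill_(1.0)
    tape.backward()
    optimizer.step([object_z.grad])

    # Report AD/FD mismatches without aborting the optimization loop.
    gradcheck_tape(tape=tape, raise_exception=False)
    # Reset the recorded gradients before the next iteration.
    tape.zero()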

In this example, we iteratively adjust the positions of the spheres to minimize their squared signed distance. The compute_signed_distance kernel calculates this distance from the contact information provided by wp.sim.collide, and the optimizer then uses the gradients to update the sphere positions. The key point is that we're using wp.sim.collide as a component in a larger optimization loop, which might be revealing certain limitations or bugs.

System Information

  • Ubuntu 22.04
  • Warp 1.8.0
  • Python 3.10.12

We hope this detailed explanation and the provided examples will help in understanding and resolving these issues. Any insights or suggestions would be greatly appreciated!