We first describe the computations needed for determining how a simulated object moves and deforms before explaining how we structure the data to perform these computations efficiently on the GPU while allowing for topology changes. While the discussion focuses on an elasticity problem, our solution can be applied to other simulations that make use of a graph-like structure.
Deformation
We wish to solve the continuum equations of elasticity for a dynamic system:
$$\begin{aligned} \rho \ddot{\varvec{u}} = \nabla \cdot \varvec{\sigma }+ \varvec{f}^{ext}\text {,} \end{aligned}$$
(1)
where \(\rho \) represents mass density, \(\varvec{u}\) the displacement field, \(\varvec{\sigma }\) the stress and \(\varvec{f}^{ext}\) any external forces. Note that since we use a linear elasticity model, \(\varvec{\sigma }\) only depends on \(\varvec{u}\). After discretization into a set of nodes and linearization, we obtain a system of equations
$$\begin{aligned} \varvec{M}\ddot{\varvec{u}} = \varvec{K}\varvec{u}+ \varvec{f}\text {,} \end{aligned}$$
(2)
with \(\varvec{M}\) the mass matrix, \(\varvec{K}\) the stiffness matrix, \(\varvec{u}\) the vector of nodal displacements, and \(\varvec{f}\) the vector of external nodal forces.
For a surgery simulation with haptic interactions, stability is essential. We thus choose to use an implicit dynamic solver, which remains stable for large time steps. Additionally, unlike with explicit solvers, the time step is not constrained by the smallest element size, which is difficult to control when arbitrary cutting is allowed. Following the method of [25], we obtain
$$\begin{aligned} \left( \varvec{M}- \Delta t^2 \varvec{K}\right) \Delta {\dot{\varvec{u}}} = \Delta t \left( \varvec{f}^{elastic}_0 + \varvec{f}^{ext}_0 \right) \text {,} \end{aligned}$$
(3)
where \(\Delta t\) is the length of the time step, \(\Delta {\dot{\varvec{u}}}\) the change in velocity during that time step – the quantity we are trying to determine – and \(\varvec{f}_0 = \varvec{f}^{elastic}_0 + \varvec{f}^{ext}_0\) the total forces on the object nodes at the beginning of the time step. From the solution to Eq. 3, we can compute the new nodal displacements as
$$\begin{aligned} \varvec{u}= \varvec{u}_0 + \Delta t ({\dot{\varvec{u}}}_0 + \Delta {\dot{\varvec{u}}}) \text {,} \end{aligned}$$
(4)
where \(\varvec{u}_0\) and \(\dot{\varvec{u}}_0\) are respectively the nodal displacements and velocities at the beginning of the time step.
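As an illustration, the per-node update of Eq. 4 maps onto a trivially parallel GPU kernel. The following minimal CUDA sketch assumes flat arrays u, v and dv (hypothetical names) holding the displacements, velocities and velocity changes for every degree of freedom.

```cuda
// Sketch of the state update of Eq. (4), one thread per degree of freedom.
// u, v, dv are flat arrays of size n = 3 * numNodes (x, y, z per node).
__global__ void updateState(float* u, float* v, const float* dv,
                            float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        v[i] += dv[i];      // u_dot = u_dot_0 + delta u_dot
        u[i] += dt * v[i];  // u = u_0 + dt * (u_dot_0 + delta u_dot)
    }
}
```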
The entries in \(\varvec{K}\) are determined by the relationships between nodes, whether they are connected through elements in mesh-based methods or through their influence in meshless methods. Since we are constantly changing these connections by cutting, \(\varvec{K}\) also changes constantly, making the use of a precomputed system matrix impossible. Additionally, because the computation of \(\varvec{K}\) is expensive, we prefer to use a matrix-free method, such as conjugate gradient (CG), for solving the linear system. In that case, the main computation becomes the multiplication of \(\varvec{K}\) with an arbitrary vector of nodal displacements (or corrections) at each iteration of the solver (see Algorithm 1).
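A minimal CUDA sketch of such a matrix-free solver loop is given below; it is not the exact Algorithm 1. The helpers applyOperator, dotGPU, axpyGPU and xpayGPU are assumed GPU routines: applyOperator corresponds to the element loop of Algorithm 2, and the others are standard vector kernels or cuBLAS calls.

```cuda
#include <cuda_runtime.h>

// Assumed GPU helpers (not part of the paper's code):
void  applyOperator(float* q, const float* p, int n);    // q = (M - dt^2 K) p
float dotGPU(const float* x, const float* y, int n);      // returns <x, y>
void  axpyGPU(float* y, float a, const float* x, int n);  // y += a * x
void  xpayGPU(float* y, float b, const float* x, int n);  // y  = x + b * y

// x receives the solution (the velocity change), b is the right-hand side of
// Eq. (3); r, p, q are preallocated work vectors. All arrays live on the GPU.
void solveCG(float* x, const float* b, float* r, float* p, float* q,
             int n, int maxIter, float tol2)
{
    cudaMemset(x, 0, n * sizeof(float));                            // x0 = 0
    cudaMemcpy(r, b, n * sizeof(float), cudaMemcpyDeviceToDevice);  // r = b
    cudaMemcpy(p, r, n * sizeof(float), cudaMemcpyDeviceToDevice);  // p = r
    float rr = dotGPU(r, r, n);              // only scalars reach the CPU
    for (int it = 0; it < maxIter && rr > tol2; ++it) {
        applyOperator(q, p, n);              // the K p product dominates
        float alpha = rr / dotGPU(p, q, n);
        axpyGPU(x,  alpha, p, n);            // x += alpha * p
        axpyGPU(r, -alpha, q, n);            // r -= alpha * q
        float rrNew = dotGPU(r, r, n);
        xpayGPU(p, rrNew / rr, r, n);        // p = r + beta * p
        rr = rrNew;
    }
}
```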
As for the mass matrix term \(\varvec{M}\), we lump the object’s mass on the nodes, resulting in a diagonal matrix. Its product with a vector can thus be reduced to an element-wise multiplication that can be added to the left-hand side of Eq. 3.
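Concretely, with a diagonal \(\varvec{M}\), the whole left-hand-side product of Eq. 3 can be evaluated with one element-wise kernel once \(\varvec{K}\varvec{p}\) is available. The following sketch uses hypothetical array names: mass stores one lumped entry per degree of freedom and kp stores the already-computed \(\varvec{K}\varvec{p}\).

```cuda
// With a lumped (diagonal) mass matrix, the left-hand side of Eq. (3) reduces
// to an element-wise operation once kp = K p has been computed:
//   q = M p - dt^2 (K p)
__global__ void applyLhs(float* q, const float* mass, const float* p,
                         const float* kp, float dt2, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        q[i] = mass[i] * p[i] - dt2 * kp[i];
}
```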
The computation of the matrix-vector product in Algorithm 1, in particular the \(\varvec{K}\varvec{p}\) product, is driven by the structure that connects the nodes together. For example, with finite elements, the domain may be subdivided into tetrahedra, where each tetrahedron connects four nodes. With the EFG method, which we use in the present work, each cubic integration element combines a set of about eight nodes. However, the computation method is the same regardless of how the elements are formed. Algorithm 2 describes how we compute \(\varvec{K}\varvec{p}\) by looping over the elements, computing their force density, and distributing it to their nodes.
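One possible GPU realization of this element loop assigns one thread per integration element and scatters each element's contribution to its nodes with atomic additions, as in the sketch below. Here elementForce is a stand-in for the per-element stiffness evaluation (not shown), and nodeIds/rowSize follow the fixed-width connectivity layout described in "Changing connectivity data".

```cuda
// Sketch of the K p product (Algorithm 2): one thread per integration element.
// elementForce() stands in for the per-element stiffness evaluation based on
// the stored shape-function derivatives; it is assumed, not shown.
__device__ float3 elementForce(int e, int j, const int* nodes, int count,
                               const float3* p);

__global__ void applyK(float3* kp, const float3* p,
                       const int* nodeIds, const int* rowSize,
                       int maxRow, int numElements)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numElements) return;

    const int* nodes = nodeIds + e * maxRow;  // node indices of element e
    int count = rowSize[e];

    for (int j = 0; j < count; ++j) {
        // Force density contributed by element e to its j-th node for the
        // current vector p of displacement corrections.
        float3 f = elementForce(e, j, nodes, count, p);
        // Several elements share each node, hence the atomic accumulation.
        atomicAdd(&kp[nodes[j]].x, f.x);
        atomicAdd(&kp[nodes[j]].y, f.y);
        atomicAdd(&kp[nodes[j]].z, f.z);
    }
}
```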
The rest of the conjugate gradient consists of relatively simple vector operations, such as additions and scalar products, which can easily be carried out on the GPU. As pointed out and implemented by [18], for each iteration, only two scalars need to be transmitted to the CPU to determine whether to continue iterating.
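One way to realize these reductions so that only the scalar results cross the bus is a cuBLAS dot product in host pointer mode, as in the sketch below (a variant of the dotGPU helper assumed earlier, with the cuBLAS handle passed explicitly; handle creation and error checking are omitted).

```cuda
#include <cublas_v2.h>

// Dot product with the result returned to the host: with the default (host)
// pointer mode, cublasSdot blocks until the scalar is available on the CPU,
// so only one float per reduction crosses the bus.
float dotGPU(cublasHandle_t handle, const float* x, const float* y, int n)
{
    float result = 0.0f;
    cublasSdot(handle, n, x, 1, y, 1, &result);
    return result;
}
```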
The operations presented in this section represent the most intensive part of the simulation and are thus the ones that need to be targeted for a GPU implementation. However, they depend on data structures that are modified every time a topology change occurs in the simulated model and must be implemented in a way that is not overly penalized by these changes.
Changing connectivity data
We now describe how the element connectivity data is encoded so that it can be efficiently modified on the GPU. As a general rule for GPU computing (also for optimal cache access on the CPU), we want pieces of data that will be accessed together to be located close to each other in memory.
We perform topology changes using the method presented in [26] and map the surface onto the physical model as in [27]. There are four sets of data, illustrated in Fig. 1, that are very large and need to be updated every frame:
1. The connectivity graph that assigns a set of nodes to each element; for example, element I would be \(\left\{ i \,\big |\, i \in \{a, b, d, e\}\right\} \)
2. The shape function value for each element-node pair; for element I, \(\left\{ \phi_i^I \,\big |\, i \in \{a, b, d, e\} \right\} \)
3. The shape function derivative for each element-node pair; for element I, \(\left\{ \nabla \phi_i^I \,\big |\, i \in \{a, b, d, e\} \right\} \)
4. The weights of the mapping between surface vertices and nodes; for example, vertex 3 would be \(\left\{ w_i^3 \,\big |\, i \in \{d, e\} \right\} \).
The connectivity between nodes and surface vertices uses the same graph as the connectivity between nodes and elements, and thus requires little additional data to update.
Figure 2 illustrates which operations of the simulation are performed on the CPU and which are performed on the GPU. It also shows what data need to be transferred for each of these operations. Names in bold are the 2D arrays discussed in this section.
Two considerations guided our choice of data structure for storing the connectivity, shape functions and mapping: avoid constantly reallocating memory on the GPU, and group many small memory transfers into larger batches to avoid the relatively large latency associated with each transfer.
These four data sets may be viewed as two-dimensional arrays. For example, the element-node connectivity has a row for each element, containing the identifiers of the nodes that are connected to that element. Unlike in mesh-based methods, the number of nodes per element may vary, so the rows have uneven lengths.
A memory-efficient way to store such a 2D array, as done for example by [16], would be to have a linear array containing all rows contiguously, with a second array indicating where each row starts in the larger one, and optionally a third one storing the length of each row. However, with such a structure, if a row increases in length from one frame to the next, all subsequent values in the larger array would need to be shifted. That would be almost equivalent to updating the entire data structure, in addition to having to reallocate memory for it. The cost of a new allocation as well as an entire copy would be noticeable in an application where fast update rates are required.
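For reference, such a compact layout (which we do not use) can be sketched as follows, with illustrative field names:

```cuda
// Compact layout: all rows stored back to back. Growing one row forces
// shifting (and possibly reallocating) everything that follows it.
struct CompactConnectivity {
    int* nodeIds;   // all rows concatenated
    int* rowStart;  // index in nodeIds where each element's row begins
    int* rowSize;   // optional: number of nodes in each row
};
// j-th node of element e: nodeIds[rowStart[e] + j]
```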
To avoid having to reallocate memory, even after a change in size for some rows, our solution is to set a maximum row size and allocate a single block that is overall slightly larger than what is strictly needed. To keep track of the actual number of items in each row, we also add a vector that lists the current size of each row. This two-array structure is illustrated in Fig. 3. Since no elements are added during the simulation, the row size and data block arrays will never need to be resized or reallocated.
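A sketch of this padded layout, with illustrative field names, is given below; row e of the element-node connectivity simply starts at offset e * maxRow:

```cuda
// Padded fixed-width layout (Fig. 3): one block of numElements * maxRow
// entries plus one current-size counter per row. Neither array ever needs
// to be reallocated when the connectivity changes.
struct PaddedConnectivity {
    int  maxRow;    // maximum number of nodes per element, fixed at allocation
    int* nodeIds;   // numElements * maxRow entries; row e starts at e * maxRow
    int* rowSize;   // current number of valid entries in each row
};
// j-th node of element e (j < rowSize[e]): nodeIds[e * maxRow + j]
```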
This structure is somewhat less flexible than the compact array, because the number of nodes attached to an element can never exceed the allocated maximum row size. However, the criterion that determines whether we need to add nodes in the neighborhood of an element is whether these nodes are coplanar (see [26] for details). The cases where this condition cannot be satisfied when choosing at least 8 neighbors are theoretically rare and will be discussed further in the “Limitations” section.
During a single time step, only a small fraction of all elements will have a new set of neighbor nodes. Similarly for surface changes, only a relatively small number of vertices will need a new mapping onto the volume. To avoid copying the entire data structure and the overhead of many small memory transfers, we divide the data arrays into batches, which can be copied one at a time to GPU memory. The batches are all part of the same memory allocation; only the memory transfers are affected by this division. This reduces the total amount of data to be copied every time step while keeping the number of memory transfers low.
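The per-batch upload can then be reduced to copying the touched sub-ranges of the existing allocation, as in the following sketch (the per-batch dirty flags are assumed to be maintained by the topology-change code):

```cuda
#include <cuda_runtime.h>

// Upload only the batches whose rows changed during this time step.
// hostData and devData use the same padded layout on the CPU and the GPU;
// batchDirty is assumed to be filled in by the cutting/remeshing code.
void uploadDirtyBatches(int* devData, const int* hostData,
                        const bool* batchDirty, int numBatches,
                        int batchSizeInts)
{
    for (int b = 0; b < numBatches; ++b) {
        if (!batchDirty[b]) continue;
        size_t offset = (size_t)b * batchSizeInts;
        cudaMemcpy(devData + offset, hostData + offset,
                   batchSizeInts * sizeof(int), cudaMemcpyHostToDevice);
    }
}
```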