
Citation: Jingyu Li, Genyang Tang, Tianyue Hu (2010). Optimization of a precise integration method for seismic modeling based on graphic processing unit. Earthq Sci 23(4): 387-393. doi: 10.1007/s11589-010-0736-4
As the exploration of petroleum resources continues to develop, both the depth of subsurface targets and the required precision of exploration methods are growing, causing a rapid increase in data volume and, with it, in the cost of geophysical data processing. To better understand the propagation and imaging of seismic waves in complex subsurface structures, it is essential to develop modeling methods of higher precision for forward modeling and inversion of seismic wavefields. This places higher demands on the capacity of computing devices.
For the past 20 years the performance of CPUs has grown in step with Moore's law, and parallel algorithms for multi-CPU systems have been widely applied. CPU manufacturers have also developed multi-core processors; Intel released a six-core CPU in 2009. Nevertheless, the fundamental architecture of the CPU, with its relatively small number of processing units, has not changed. Although supercomputers harness thousands of CPUs and PC-clusters are routinely set up for parallel computing, the contradiction remains between the essentially serial character of CPU processing and the huge volume of geophysical data, whose processing consists of relatively simple and repetitive calculation steps.
On the other hand, the demand for rendering of complex graphics has driven the rapid development of the graphic processing unit (GPU). Although the clock frequency of a GPU's processing units is lower than a CPU's, a GPU carries many more processing units, which gives it much better performance than a CPU of similar cost. The floating-point throughput of cutting-edge GPUs has reached 10^12 FLOPS, exceeding that of a six-core CPU by two orders of magnitude (Zhang et al, 2009).
In this paper we take the precise integration method (PIM) as an example to explore the application of GPU-based parallel computing to difference modeling methods. We optimized the PIM program for the GPU to fully exploit its computational efficiency, and to illustrate the significance of GPU parallel computing for the practical production of petroleum exploration.
The precise integration method (Tang et al, 2007; Yang et al, 2009) is a wave-equation solver designed for complex subsurface structures. It alleviates the numerical dispersion and strict stability criteria of the finite difference method (FDM). The basic idea is to apply a difference approximation in the spatial domain, converting the wave equation into a system of ordinary differential equations in time, and then to solve this system with an integration based on the matrix exponential. This solution is theoretically exact in the time domain, which removes the temporal numerical dispersion of the FDM and increases the overall accuracy. The whole spatial domain can be divided into several sub-domains; we use a single-point sub-domain to reduce the data bandwidth, trading some accuracy for efficiency. With less data, higher stability, and the capability to handle heterogeneous media and complex boundary conditions, the method performs well for complex subsurface structures and structures with relief surfaces.
For a 2D isotropic medium in the x-z plane, the acoustic wave equation can be written as
$$\frac{\partial^2 u}{\partial t^2} = v^2\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial z^2}\right), \qquad (1)$$
where u is the pressure perturbation propagating in the medium and v is the acoustic velocity. For simplicity the source term is omitted. All sources in this study are explosive, so the problem is purely two-dimensional.
Discretizing the spatial coordinates in equation (1), we get the second-order difference equation
$$\frac{\mathrm{d}^2 u_{i,j}}{\mathrm{d} t^2} = v_{i,j}^2\left(\frac{u_{i+1,j} - 2u_{i,j} + u_{i-1,j}}{\Delta x^2} + \frac{u_{i,j+1} - 2u_{i,j} + u_{i,j-1}}{\Delta z^2}\right), \qquad (2)$$
where $u_{i,j}$ and $v_{i,j}$ are the pressure and velocity at grid point $(i, j)$, and $\Delta x$ and $\Delta z$ are the grid intervals.
This is a five-point sub-domain format, and the grid configuration is shown in Figure 1.
In the time domain, consider the system of equations
$$M\ddot{U}(t) + C\dot{U}(t) + KU(t) = F(t), \qquad U(0) = U_0,\ \ \dot{U}(0) = \dot{U}_0, \qquad (3)$$
where M is the mass matrix, C the damping matrix, K the stiffness matrix, and F the equivalent load; $U_0$ and $\dot{U}_0$ are the initial displacement and velocity. Introducing the state vector $V = (U^{\mathrm{T}}, \dot{U}^{\mathrm{T}})^{\mathrm{T}}$, equation (3) can be rewritten as
$$\dot{V}(t) = HV(t) + f(t), \qquad (4)$$
where
$$H = \begin{pmatrix} 0 & I \\ -M^{-1}K & -M^{-1}C \end{pmatrix}, \qquad f = \begin{pmatrix} 0 \\ M^{-1}F \end{pmatrix}. \qquad (5)$$
Equation (4) has the following analytic solution in the time domain:
$$V(t) = \mathrm{e}^{Ht}V(0) + \int_0^t \mathrm{e}^{H(t-\tau)}f(\tau)\,\mathrm{d}\tau. \qquad (6)$$
Considering constant time steps $t_n = n\Delta t$, introducing $T(\Delta t) = \mathrm{e}^{H\Delta t}$, and taking $f$ to be a constant $f_n$ over each interval $[t_n, t_{n+1}]$, we get the explicit scheme from equation (6):
$$V_{n+1} = T(\Delta t)\,V_n + H^{-1}\left[T(\Delta t) - I\right]f_n. \qquad (7)$$
For the acoustic wave equation we have M = I and C = 0, so equation (5) reduces to
$$H = \begin{pmatrix} 0 & I \\ -K & 0 \end{pmatrix}, \qquad f = \begin{pmatrix} 0 \\ F \end{pmatrix}.$$
After obtaining V, we take U from its first half; the elements of U are the solution values at the grid locations (i, j), i.e., $U = \{u_{i,j}\}$. Under the single-point sub-domain, the second-order difference form of equation (1) becomes
$$\frac{\mathrm{d}^2 u_{i,j}}{\mathrm{d} t^2} + \omega_{i,j}^2 u_{i,j} = F_{i,j}, \qquad \omega_{i,j}^2 = 2v_{i,j}^2\left(\frac{1}{\Delta x^2} + \frac{1}{\Delta z^2}\right), \qquad (8)$$
where the neighboring-point terms are treated as the equivalent load
$$F_{i,j} = v_{i,j}^2\left(\frac{u_{i+1,j} + u_{i-1,j}}{\Delta x^2} + \frac{u_{i,j+1} + u_{i,j-1}}{\Delta z^2}\right).$$
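Under the single-point sub-domain, H of equation (5) reduces to a 2×2 matrix per grid point, so $T(\Delta t) = \mathrm{e}^{H\Delta t}$ can be evaluated in closed form instead of numerically. The closed form below is the standard exponential of a harmonic-oscillator matrix, a step implied by (though not written out in) the derivation:

$$H_{i,j} = \begin{pmatrix} 0 & 1 \\ -\omega_{i,j}^2 & 0 \end{pmatrix}, \qquad \mathrm{e}^{H_{i,j}\Delta t} = \begin{pmatrix} \cos\omega_{i,j}\Delta t & \omega_{i,j}^{-1}\sin\omega_{i,j}\Delta t \\ -\omega_{i,j}\sin\omega_{i,j}\Delta t & \cos\omega_{i,j}\Delta t \end{pmatrix}.$$

Eliminating the velocity component between two successive applications of this propagator yields the two-step recursion in u given below.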
To keep the scheme explicit, we introduce a weight coefficient $\theta$ for the equivalent load within each time step:
$$f_n = \theta F^{\,n} + (1-\theta)F^{\,n-1}, \qquad (9)$$
where $0 \le \theta \le 1$ and $F^{\,n}$ is the equivalent load of equation (8) evaluated with the wavefield at time step n.
Taking $u_{i,j}^n$ as the value of $u_{i,j}$ at time step n, we get the iterative solution
$$u_{i,j}^{n+1} = 2\cos(\omega_{i,j}\Delta t)\,u_{i,j}^{n} - u_{i,j}^{n-1} + \frac{2\left[1 - \cos(\omega_{i,j}\Delta t)\right]}{\omega_{i,j}^{2}}\,f_n,$$
where $\omega_{i,j}$ is defined in equation (8) and $f_n$ is the weighted equivalent load of equation (9).
The initial conditions are expressed as
$$u_{i,j}^{0} = u_{i,j}^{-1} = 0,$$
corresponding to a quiescent medium before the source is introduced.
In this iterative solution the value at the current time step is expressed through the values at the previous two time steps. Under the single-point sub-domain, the stability conditions of the second- and fourth-order difference equations are less restrictive than the FDM stability condition $v\Delta t/\Delta x \le 1$, so PIM has an advantage in stability. We use the second-order difference equations in the following numerical tests.
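To make the scheme concrete, the following serial sketch implements one PIM time step under the single-point sub-domain, mirroring the CPU flow of Figure 2a. It is a minimal sketch, assuming a square mesh ($\Delta x = \Delta z = h$), the recursion reconstructed above with $\theta = 1$ (load taken at step n), and illustrative array names rather than the authors' actual code:

```cpp
// Minimal serial sketch of one PIM time step (single-point sub-domain).
// Assumes dx = dz = h; u_prev/u_cur/u_next hold time levels n-1, n, n+1.
#include <cmath>
#include <vector>

void pim_step(const std::vector<float>& v,        // velocity model, nx*nz
              const std::vector<float>& u_prev,   // wavefield at step n-1
              const std::vector<float>& u_cur,    // wavefield at step n
              std::vector<float>& u_next,         // output: step n+1
              int nx, int nz, float h, float dt)
{
    for (int j = 1; j < nz - 1; ++j) {
        for (int i = 1; i < nx - 1; ++i) {
            int k = j * nx + i;
            float w2 = 4.0f * v[k] * v[k] / (h * h);   // omega^2 = 4 v^2 / h^2
            float c  = std::cos(std::sqrt(w2) * dt);
            // Equivalent load from the four neighbors (five-point stencil)
            float F = v[k] * v[k] / (h * h) *
                      (u_cur[k - 1] + u_cur[k + 1] + u_cur[k - nx] + u_cur[k + nx]);
            // Time-exact recursion:
            // u^{n+1} = 2 cos(w dt) u^n - u^{n-1} + 2(1 - cos(w dt))/w^2 * F
            u_next[k] = 2.0f * c * u_cur[k] - u_prev[k] + 2.0f * (1.0f - c) / w2 * F;
        }
    }
    // Source injection and absorbing boundaries would be applied here.
}
```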
For source implementation we simply add the source function in the iterative solution, using the same calculation format as in source-free areas. For absorbing boundaries we adopt the perfectly matched layer (PML) absorbing boundary condition for the second-order seismic wave equation (Komatitsch and Tromp, 2003).
Under the compute unified device architecture (CUDA), GPU-based parallel programs can be combined well with high-level programming languages such as C and FORTRAN, and a number of high-performance GPU-based parallel algorithms have been developed in scientific computing.
Unlike the strategy of dividing a model into several partitions for a PC-cluster, CUDA is designed around a two-level thread management system. The whole model is mapped onto a grid, which contains numerous blocks, each with no more than 512 threads. During processing, the program sends all threads of a block to a streaming multiprocessor (SM), which assigns each thread to a streaming processor (SP). Every SM can have up to eight active blocks at the same time; a GPU contains dozens of SMs, and each SM has eight SPs. All the assignment of blocks and threads is completed automatically by the GPU (NVIDIA, 2008). The advantage of this management system is that programmers need not care about the actual number of SMs and SPs or write specialized programs for each device, as they must for PC-clusters; the same code runs on every CUDA-capable GPU. In other words, the program is more portable.
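The two-level hierarchy appears directly in the launch syntax. Below is a minimal, self-contained sketch (the kernel and its purpose are illustrative) showing how 16×16 blocks of threads tile a 2D model, with the block-to-SM scheduling left entirely to the hardware:

```cpp
// Minimal sketch of CUDA's two-level thread hierarchy: a grid of blocks,
// each block holding at most 512 threads. The kernel name is hypothetical.
#include <cuda_runtime.h>

__global__ void tagMeshUnits(int* owner, int nx, int nz) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // x index of mesh unit
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // z index of mesh unit
    if (i < nx && j < nz)
        owner[j * nx + i] = blockIdx.y * gridDim.x + blockIdx.x; // owning block
}

int main() {
    int nx = 1024, nz = 512, *d_owner;
    cudaMalloc(&d_owner, nx * nz * sizeof(int));
    dim3 block(16, 16);                              // 256 threads per block
    dim3 grid((nx + block.x - 1) / block.x,          // enough blocks to cover
              (nz + block.y - 1) / block.y);         // the whole model
    tagMeshUnits<<<grid, block>>>(d_owner, nx, nz);  // block/thread scheduling
    cudaDeviceSynchronize();                         // is done by the hardware
    cudaFree(d_owner);
    return 0;
}
```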
The most suitable strategy for GPU program design is to divide the model into very small pieces, each matched to one of a large number of threads, keeping the calculation steps of each thread simple. The numerical tests below show that this strategy is very efficient, especially for larger models.
Figure 2a shows the kernel calculation flow of the CPU-based PIM program. Starting from it, we applied several steps to port the method to the GPU. All data for the calculation are stored in the global memory of the GPU, which is accessible from the CPU and from all GPU threads, since there is no data interchange between different threads. We assign every mesh unit of the model to a thread, so the total number of threads equals the number of mesh units, including the absorbing boundaries. The number of threads per block is limited to 512, while the number of blocks in each dimension of the grid is limited only to 65 535; we therefore set the block size to 16×16 and let the actual model size determine the number of blocks in the grid. Frequently used constants, such as the time-independent part of the source function, the absorbing boundary factors, and the weight coefficient, are stored in the constant memory of the GPU, which is cached for faster access.
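The following is a minimal sketch of such a kernel, a GPU port of the serial step above under the same assumptions (square mesh, load taken at step n); the kernel name, constant-memory variables, and argument names are illustrative, not the authors' actual code:

```cpp
// Sketch of a PIM update kernel: one thread per mesh unit, all wavefields in
// global memory, frequently used constants in cached constant memory.
#include <cuda_runtime.h>

__constant__ float c_dt;       // time step
__constant__ float c_inv_h2;   // 1 / h^2 for a square mesh (dx = dz = h)

__global__ void pimKernel(const float* __restrict__ v,      // velocity model
                          const float* __restrict__ u_prev, // step n-1
                          const float* __restrict__ u_cur,  // step n
                          float* __restrict__ u_next,       // step n+1
                          int nx, int nz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || i >= nx - 1 || j < 1 || j >= nz - 1) return;

    int k = j * nx + i;
    float v2 = v[k] * v[k];
    float w2 = 4.0f * v2 * c_inv_h2;            // omega^2 = 4 v^2 / h^2
    float c  = cosf(sqrtf(w2) * c_dt);
    float F  = v2 * c_inv_h2 *                  // equivalent load from the
               (u_cur[k - 1] + u_cur[k + 1] +   // four neighboring units
                u_cur[k - nx] + u_cur[k + nx]);
    u_next[k] = 2.0f * c * u_cur[k] - u_prev[k] + 2.0f * (1.0f - c) / w2 * F;
}
```

Each thread reads its four neighbors from global memory and writes a single output value, so no inter-thread communication is needed within a time step.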
Figure 2b shows the kernel calculation flow of the GPU-based parallel PIM program. All GPU tasks are located inside the time-step loop (gray background in Figure 2b); there is no data interchange between host and device during this repetitive procedure, which increases efficiency. For instruction-level optimization the program adopts an asynchronous technique: while the GPU is calculating, the CPU writes the data collected from the previous shot into the resulting SEG-Y file. This asynchronous technique further increases efficiency.
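A sketch of this overlap using the standard CUDA stream API is shown below; the shot-level helper routines are hypothetical placeholders for the program's actual kernels and SEG-Y writer:

```cpp
// Overlapping GPU computation with host-side SEG-Y output via a CUDA stream.
#include <cuda_runtime.h>

// Hypothetical helpers standing in for the program's actual routines.
void launchPimTimeLoop(int shot, cudaStream_t s);   // enqueues all kernels
void collectShotGather(int shot, cudaStream_t s);   // device-to-host copy
void writeSegyGather(int shot);                     // host-side SEG-Y output

void processShots(int nshots) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int shot = 0; shot < nshots; ++shot) {
        launchPimTimeLoop(shot, stream);   // GPU starts computing this shot
        if (shot > 0)
            writeSegyGather(shot - 1);     // CPU writes last shot meanwhile
        cudaStreamSynchronize(stream);     // wait for the kernels to finish
        collectShotGather(shot, stream);   // then fetch this shot's gather
    }
    cudaStreamDestroy(stream);
}
```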
We test the relation between program efficiency and model size. All the following numerical results report only the kernel calculation time, i.e., they exclude initialization. The test environment includes an Intel Core i7 CPU (clock frequency 3.2 GHz), 4 GB RAM, and an NVIDIA GTX 260 GPU (24 SMs, 192 SPs, SP clock frequency 1 242 MHz, 896 MB of GPU memory).
The test model is a constant-velocity model with one shot in the center; the source is a minimum-phase Ricker wavelet, the number of time steps is 1 000, and the time interval of the calculation is 1 ms. Table 1 shows the model sizes (equal to the total number of threads) and the calculation times on both CPU and GPU.
[Table 1. Model size (total number of threads) and CPU/GPU calculation times for tests Nos. 1-7]
Test No. 1 uses only one block of the GPU. Since 'collecting the shot gather' requires transferring data between host and device, the GPU program takes longer than the CPU here. Test No. 2 has as many blocks as the test device has SMs, so there is only one active block per SM. Test No. 3 is eight times as large as test No. 2, which makes all blocks active on the test device; only models at least this large run the device at full load. The GPU program shows a clear advantage in tests Nos. 4-7, where the acceleration ratio (the ratio of CPU time to GPU time) is steady at approximately 40. Figure 3 shows how the acceleration ratio changes with model size. In tests Nos. 6 and 7 the program has to reduce the size of temporary data because of the relatively small GPU memory, which increases the data-transfer time and slows down the GPU performance.
We carry out two case studies of forward modeling to examine the application of PIM on both CPU and GPU platforms. Case 1 images small-scale low-velocity caverns in a high-velocity layer; case 2 is a large-scale complex structure of practical value.
Case 1 Caverns
The model shown in Figure 4 is a three-layer model with four small caverns in the high-velocity layer (Li and Zhao, 2006). The largest cavern, on the left, is 40 m×40 m; the two caverns in the center are both 30 m×30 m; and the smallest, on the right, is 20 m×20 m. We calculate 60 shots altogether with a shot interval of 50 m. Receivers are placed to the right of each shot in an off-end spread; there are 100 receivers with a receiver interval of 10 m, and offsets range from 100 m to 1 090 m. The mesh unit is 5 m×5 m, and we adopt a 15 Hz minimum-phase Ricker wavelet as the source signature. The same procedure is run by the PIM programs on both CPU and GPU.
The stacked sections from both programs are shown in Figure 5. Although calculated on different platforms, the GPU and CPU results show no difference, which confirms that the GPU platform increases time efficiency without loss of accuracy. Figure 6 is the post-stack Kirchhoff time migration section (migrated with the velocities of the original model). PIM clearly delineates the three-layer subsurface structure and accurately locates the four caverns.
Case 2 Kuche geological model
Figure 7 shows the Kuche velocity model, 16 km wide and 6.8 km deep. We calculate 75 shots with a shot interval of 160 m and the same number of receivers on either side of a split spread; the maximum offset on either side is 2 km. A 15 Hz minimum-phase Ricker wavelet is employed as the source, and the first shot is located 2 km to the right of the top-left corner of the model (CDP 400 in Figure 7).
We test different grid configurations on the two devices. For the CPU program we use 200 receivers with a receiver interval of 20 m and mesh the model into 20 m×10 m units. For the GPU program we use 400 receivers with a receiver interval of 10 m and mesh the model into 5 m×5 m units, a higher precision than for the CPU program.
Because of the different grid configurations, the computational cost of the GPU program is eight times that of the CPU program; with an acceleration ratio of approximately 40, the GPU program is nevertheless still faster. Figure 8 shows the stacked sections from the CPU and GPU programs with their different grid configurations. The post-stack time migration and post-stack depth migration sections are shown in Figures 9 and 10.
Because the grid is finer for the GPU program, the slight numerical dispersion is no longer apparent in its result; the GPU program provides clearer and more detailed information, and special subsurface structures in the deeper zone can be positioned more accurately. We can therefore exploit the efficiency of the GPU program to increase precision by setting a smaller grid size for the model.
This paper introduces the GPU-parallel optimization of a precise integration method for seismic modeling. The numerical tests on the simple model and the geological model show that GPU parallel calculation retains the theoretical advantages of PIM: small data bandwidth, higher precision, and better stability. Meanwhile, with proper program design and optimization, GPU parallel calculation delivers excellent time efficiency, with an acceleration ratio as high as about 40. This will greatly reduce the cost of time and equipment in practical production.
This study demonstrates the advantage of GPU parallel calculation for one seismic modeling method. Future improvements of GPUs include higher SP clock frequencies, higher-precision arithmetic, larger memory, multi-GPU parallel calculation, and so on. More GPU-oriented parallel algorithms will be developed under the two-level thread management system. Development of both devices and algorithms will be a strong driving force for computational technology in petroleum exploration.
This study was supported by the National Natural Science Foundation of China (Nos. 40974066 and 40821062) and by the National Basic Research Program of China (No. 2007CB209602).
Komatitsch D and Tromp J (2003). A perfectly matched layer absorbing boundary condition for the second-order seismic wave equation. Geophys J Int 154(1): 146-153. doi: 10.1046/j.1365-246X.2003.01950.x
Li J and Zhao Q (2006). Collection of Pictures of Physical Models for Petroleum Exploration Seismology. Petroleum Industry Press, Beijing, 99.
NVIDIA (2008). CUDA Technical Training Volume I: Introduction to CUDA Programming. http://developer.nvidia.com/object/gpucomputing.html, 2010-05-04.
Tang G, Hu T and Yang J (2007). Applications of a precise integration method in forward seismic modeling. The 77th SEG Annual International Meeting, Expanded Abstract, 2 130-2 134.
Yang J, Liu T, Tang G and Hu T (2009). Modeling seismic wave propagation within complex structures. Appl Geophys 6(1): 30-41. doi: 10.1007/s11770-009-0005-2
Zhang S, Chu Y, Zhao K and Zhang Y (2009). High Performance GPU Calculating: CUDA. China Water Power Press, Beijing, 6 (in Chinese)