To evaluate the MPI_XSTAR computing performance, we calculated a grid of XSTAR models over the two-dimensional $N_{\rm H}$–$\xi$ parameter space, sampling the column density $N_{\rm H}$ with 9 logarithmic intervals and the ionization parameter $\xi$ with 6 logarithmic intervals, and assuming a constant gas density $n$ and turbulent velocity $v_{\rm turb}$. We assumed a spherical geometry with the covering fraction given in Table 1. The chemical composition is assumed to be solar elemental abundances (Grevesse et al., 1996). The initial gas temperature used here is typical for AGNs (Bianchi et al., 2005; Nicastro et al., 1999). The parameters adopted for the MPI_XSTAR benchmarks, including the ranges and interval sizes, are listed in Table 1. We also employed the spectral energy distribution (SED) described in Danehkar et al. (2017) as the central source of ionizing radiation, with a typical AGN luminosity $L$ measured between 1 and 1000 Ryd. We ran MPI_XSTAR on the Harvard ODYSSEY cluster, which consists of about 60,000 cores with an average of 4 GB of memory per core, runs the CentOS v6.5 implementation of the Linux operating system, and schedules jobs using the SLURM v16.05 resource manager. To measure the speedup and efficiency, we submitted MPI_XSTAR jobs with a single CPU and with multiple CPUs (2 to 54).
Parameter | Value | Interval Size |
$N_{\rm H}$ (cm$^{-2}$) | | |
$\xi$ (erg cm s$^{-1}$) | | |
$n$ (cm$^{-3}$) | | - |
$v_{\rm turb}$ (km s$^{-1}$) | | - |
Covering fraction | | - |
Abundances | solar | - |
$T_{\rm init}$ (K) | | - |
$L$ (erg s$^{-1}$) | | - |
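To make the layout of this grid concrete, the short Python sketch below builds a logarithmically spaced two-dimensional grid of the same shape (9 intervals in $N_{\rm H}$ and 6 intervals in $\xi$). The numerical bounds are hypothetical placeholders, not the values adopted in Table 1, and the counting assumes the usual convention that $n$ intervals correspond to $n+1$ grid values per axis.

```python
import numpy as np

# A minimal sketch of the benchmark grid layout. The bounds below are
# hypothetical placeholders; the ranges and interval sizes actually
# adopted are those listed in Table 1.
NH_MIN, NH_MAX, NH_INTERVALS = 1e20, 1e24, 9   # column density (cm^-2), assumed bounds
XI_MIN, XI_MAX, XI_INTERVALS = 1e0, 1e6, 6     # ionization parameter (erg cm s^-1), assumed bounds

# n logarithmic intervals give n + 1 grid values per axis.
nh_grid = np.logspace(np.log10(NH_MIN), np.log10(NH_MAX), NH_INTERVALS + 1)
xi_grid = np.logspace(np.log10(XI_MIN), np.log10(XI_MAX), XI_INTERVALS + 1)

# Each (N_H, xi) pair is one independent XSTAR run that MPI_XSTAR can
# distribute over the available MPI processes.
models = [(nh, xi) for nh in nh_grid for xi in xi_grid]
print(f"{len(models)} XSTAR models ({len(nh_grid)} x {len(xi_grid)} grid points)")
```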
The speedup of a parallel computation with $N$ processors is defined as follows:

$$S_N = \frac{T_1}{T_N},$$

where $T_1$ is the running time of the serial execution on a single processor and $T_N$ is the running time of the parallel execution on $N$ processors.
The efficiency in using the computing resources for a parallel computation with $N$ processors is defined as follows:

$$\epsilon_N = \frac{S_N}{N},$$

which equals unity for an ideal (linear) speedup, $S_N = N$.
Table 2 lists the running time, the speedup, and the efficiency of MPI_XSTAR with 1 to 54 CPUs. It can be seen that the running times of the parallel executions are significantly shorter than that of the serial execution. It took around 18 hours to compute the XSTAR grid models with 32 and 54 CPUs, whereas it took about 10 days using a single CPU ($N = 1$). Although the speedup increases with the number of processors ($N$), it does not reach the ideal speedup ($S_N = N$). We also notice that the efficiency $\epsilon_N$ decreases with an increasing number of processors.
$N$ (CPUs) | Running Time (hr:min:sec) | $S_N$ | $\epsilon_N$ |
1 | 254:07:25 | 1.00 | 1.00 |
2 | 127:23:58 | 1.99 | 0.99 |
4 | 87:09:47 | 2.92 | 0.73 |
8 | 41:35:41 | 6.11 | 0.76 |
16 | 35:18:21 | 7.20 | 0.45 |
32 | 17:42:15 | 14.35 | 0.45 |
54 | 18:13:30 | 13.93 | 0.26 |
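As an illustrative cross-check of the definitions above, the short Python snippet below recomputes $S_N$ and $\epsilon_N$ from the wall-clock times of Table 2; the timings are copied verbatim from the table, and only the parsing helper is new.

```python
# Recompute speedup S_N = T_1 / T_N and efficiency eps_N = S_N / N
# from the wall-clock times (hours:minutes:seconds) listed in Table 2.
timings = {
    1: "254:07:25", 2: "127:23:58", 4: "87:09:47", 8: "41:35:41",
    16: "35:18:21", 32: "17:42:15", 54: "18:13:30",
}

def to_hours(hms: str) -> float:
    """Convert an 'H:MM:SS' string to hours."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h + m / 60.0 + s / 3600.0

t1 = to_hours(timings[1])
for n, hms in timings.items():
    tn = to_hours(hms)
    speedup = t1 / tn
    efficiency = speedup / n
    print(f"N = {n:2d}: T_N = {tn:7.2f} h, S_N = {speedup:5.2f}, eps_N = {efficiency:4.2f}")
```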
The performance results for MPI_XSTAR as a function of the number of processors are shown in Fig. 1, including the running time $T_N$, the speedup $S_N$, and the efficiency $\epsilon_N$. As seen in the figure, the speedup and efficiency do not scale linearly with the number of processors. This is because the running time of each XSTAR process varies greatly with the physical conditions ($N_{\rm H}$ and $\xi$), so the individual runs are far from identical in length. We notice that the running time of a parallel execution is limited by the longest running time of the XSTAR program for the given physical parameters. In our benchmark, individual XSTAR runs took between 25 seconds and 17.5 hours, depending on the column density and the ionization parameter used as input parameters. Since the parallel running time of multiple XSTAR runs cannot be shorter than the longest single XSTAR run, there should not be much difference between the parallel executions with $N = 32$ and $N = 54$. However, as seen in Table 2, the parallel computation with 54 CPUs is roughly half an hour longer than that with 32 CPUs. This is because each node of the Harvard ODYSSEY cluster used in our benchmark contains 32 cores, so two nodes were required for the run with 54 CPUs. The inter-node communication overhead makes two-node parallel computing (more than 32 CPUs) slightly slower than single-node parallel computing (32 CPUs or fewer).
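A simple worked bound follows from the definitions above. Writing $t_i$ for the running time of the $i$-th individual XSTAR run, the parallel wall-clock time satisfies $T_N \ge \max_i t_i$ (neglecting scheduling and communication overhead), so

$$S_N = \frac{T_1}{T_N} \le \frac{T_1}{\max_i t_i} \approx \frac{254.1\ \mathrm{hr}}{17.5\ \mathrm{hr}} \approx 14.5,$$

which is close to the measured $S_{32} \approx 14.35$; the 32-CPU execution is therefore already near the ceiling set by the longest grid point.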
As the longest single XSTAR run restricts the parallel running time of MPI_XSTAR, it prevents us from achieving a perfect speedup ($S_N = N$). If the internal Fortran 77 routines of the XSTAR program were parallelized according to one of the standard parallel programming models (MPI or OpenMP), an ideal speedup might be achievable. Nevertheless, despite the low computing efficiency of MPI_XSTAR, it provides a major improvement for constructing photoionization grid models for spectroscopic fitting tools such as XSPEC and ISIS. For example, the photoionization table model with the settings listed in Table 1 can now be produced in about 18 hours using a parallel execution with 32 CPUs, rather than in 10 days using a serial execution.
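The saturation of the speedup can also be illustrated with a small scheduling sketch. The Python snippet below uses synthetic task durations spanning roughly tens of seconds to many hours and assumes a simple first-free-worker assignment; both the durations and the task count (70, assuming the 9 × 6 interval grid corresponds to 10 × 7 points) are illustrative assumptions, not the actual MPI_XSTAR scheduler or benchmark timings.

```python
import heapq
import random

# Sketch: distribute heterogeneous XSTAR-like run times over N workers,
# assuming each task is handed to the first worker that becomes free.
# The durations are synthetic, drawn to span seconds to many hours.
random.seed(42)
task_hours = [10 ** random.uniform(-2.2, 1.24) for _ in range(70)]  # ~25 s to ~17.5 h

def parallel_time(tasks, n_workers):
    """Finish time when tasks are handed, in order, to the next free worker."""
    workers = [0.0] * n_workers          # current finish time of each worker
    heapq.heapify(workers)
    for t in tasks:
        earliest = heapq.heappop(workers)
        heapq.heappush(workers, earliest + t)
    return max(workers)

serial = sum(task_hours)
for n in (1, 2, 4, 8, 16, 32, 54):
    tn = parallel_time(task_hours, n)
    print(f"N = {n:2d}: T_N ~ {tn:6.1f} h, S_N ~ {serial / tn:5.2f}")
# The simulated T_N flattens once it approaches max(task_hours), no matter
# how many additional workers are added.
```

Regardless of the exact durations chosen, the simulated running time flattens once it reaches the longest single task, reproducing the plateau seen in Table 2.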