3 Computing Performance

To evaluate the MPI_XSTAR computing performance, we calculated a grid of $ 9 \times 6$ XSTAR models over the two-dimensional $ N_{\rm H}$-$ \xi$ parameter space, sampling the column density at 9 logarithmic points with an interval size of $ 0.5$ from $ \log N_{\rm H}=20$ to $ 24$ (cm$ ^{-2}$), and the ionization parameter at 6 logarithmic points with an interval size of $ 1$ from $ \log\xi=0$ to $ 5$ (erg cm s$ ^{-1}$), assuming a gas density of $ \log n=12$ (cm$ ^{-3}$) and a turbulent velocity of $ v_{\rm turb}=100$ km s$ ^{-1}$. We assumed a spherical geometry with a covering fraction of $ C_{f} =\Omega / 4 \pi= 0.4$. The chemical composition was assumed to be solar ( $ A_{\rm Fe} = 1$; Grevesse et al., 1996). The initial gas temperature of $ T_{\rm init}=10^6$ K used here is typical of AGNs (Bianchi et al., 2005; Nicastro et al., 1999). The parameters used for the MPI_XSTAR benchmarks are listed in Table 1. We also employed a spectral energy distribution (SED) described in Danehkar et al. (2017) as the central source of ionizing radiation, with a typical luminosity of $ L_{\rm ion}=10^{44}$ erg s$ ^{-1}$ (between 1 and 1000 Ryd). We ran MPI_XSTAR on the Harvard ODYSSEY cluster, which consists of 60,000 cores with an average of 4 GB of memory per core, runs the CentOS v6.5 implementation of the Linux operating system, and schedules jobs using the SLURM v16.05 resource manager. To calculate the speedup and efficiency, we submitted MPI_XSTAR jobs with a single CPU and with multiple CPUs (2 to 54).
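The $ 9 \times 6$ grid described above can be sketched as follows; this is a minimal illustration of how the 54 $(\log N_{\rm H}, \log\xi)$ sampling points follow from the ranges and interval sizes in Table 1 (variable names are illustrative, not part of MPI_XSTAR):

```python
# Hypothetical sketch of the benchmark's 9 x 6 parameter grid.
# Ranges and interval sizes (0.5 dex and 1 dex) are taken from Table 1.
log_nh = [20.0 + 0.5 * i for i in range(9)]  # log N_H = 20, 20.5, ..., 24
log_xi = [0.0 + 1.0 * j for j in range(6)]   # log xi  = 0, 1, ..., 5
grid = [(nh, xi) for nh in log_nh for xi in log_xi]
print(len(grid))  # 54 XSTAR models, one per (log N_H, log xi) pair
```

Note that the grid comprises exactly 54 models, which matches the largest processor count (54) used in the benchmarks below.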

Table 1: Parameters used for MPI_XSTAR benchmark model.

Parameter                                  Value                Interval Size

$ \log N_{\rm H}$ (cm$ ^{-2}$)             $ 20 \cdots 24$      $ 0.5$
$ \log \xi$ (erg cm s$ ^{-1}$)             $ 0 \cdots 5$        $ 1.0$
$ \log n$ (cm$ ^{-3}$)                     $ 12$                -
$ v_{\rm turb}$ (km s$ ^{-1}$)             $ 100$               -
$ C_{f}=\Omega / 4 \pi$                    $ 0.4$               -
$ A_{\rm Fe}$                              $ 1.0$               -
$ T_{\rm init}$ ($ 10^{4}$ K)              $ 100$               -
$ L_{\rm ion}$ ($ 10^{38}$ erg s$ ^{-1}$)  $ 1.0\times 10^{6}$  -

Notes. Logarithmic interval sizes are listed for the column density ($N_{\rm H}$) and the ionization parameter ($\xi=L_{\rm ion}/n_{\rm H} r^2$).

The speedup $ \mathcal {S}(N)$ of a parallel computation with $ N$ processors is defined as follows:

$\displaystyle \mathcal{S}(N) \equiv \frac{\mathcal{T}(1)}{\mathcal{T}(N)},$ (1)

where $ \mathcal{T}(i)$ is the running time of a parallel execution with $ i$ processors, so $ \mathcal{T}(1)$ corresponds to a serial execution. The speedup for a single processor ($ N = 1$) is defined to be $ \mathcal{S}(1)=1$. Ideally, $ \mathcal{S}(N)\approx N$, which is referred to as linear speedup.

The efficiency $ \mathcal {E}(N)$ in using the computing resources for a parallel computation with $ N$ processors is defined as follows:

$\displaystyle \mathcal{E}(N) \equiv \frac{\mathcal{S}(N)}{N} = \frac{\mathcal{T}(1)}{N \times \mathcal{T}(N)}.$ (2)

The efficiency typically lies between zero and one; an efficiency greater than one indicates a so-called superlinear speedup. Since $ \mathcal{S}(1)=1$, the efficiency for a single processor ($ N = 1$) is $ \mathcal{E}(1)=1$.
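Equations (1) and (2) can be applied directly to the hh:mm:ss running times reported below. The following sketch (helper names are our own, not part of MPI_XSTAR) reproduces two entries of Table 2:

```python
# Minimal sketch of Eqs. (1) and (2): S(N) = T(1)/T(N) and E(N) = S(N)/N,
# applied to hh:mm:ss running times such as those in Table 2.
def to_hours(hms: str) -> float:
    """Convert an hh:mm:ss string to hours."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h + m / 60 + s / 3600

def speedup(t1: float, tn: float) -> float:
    return t1 / tn                  # Eq. (1)

def efficiency(t1: float, tn: float, n: int) -> float:
    return speedup(t1, tn) / n      # Eq. (2)

t1 = to_hours("254:07:25")          # serial running time, T(1)
print(round(speedup(t1, to_hours("127:23:58")), 2))        # 1.99 for N = 2
print(round(efficiency(t1, to_hours("18:13:30"), 54), 2))  # 0.26 for N = 54
```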

Table 2 lists the running time, the speedup, and the efficiency of MPI_XSTAR with 1 to 54 CPUs. It can be seen that the running time $ \mathcal {T}(N)$ of the parallel executions is significantly shorter than that of the serial execution: producing the XSTAR grid models took around 18 hours with 32 and 54 CPUs, compared with about 10 days using a single CPU ($ N = 1$). Although the speedup $ \mathcal {S}(N)$ increases with the number of processors ($ N$), it does not reach the ideal speedup ( $ \mathcal{S}(N)\approx N$). We also note that the efficiency $ \mathcal {E}(N)$ decreases as the number of processors ($ N$) increases.

Figure 1: From top to bottom, the running time $ \mathcal {T}(N)$, the speedup $ \mathcal {S}(N)$, and the efficiency $ \mathcal {E}(N)$ as a function of the number of processors ($ N$) for a benchmark photoionization model with the parameters listed in Table 1. The running time $ \mathcal {T}(N)$ is in seconds.
\includegraphics[width=3.in, trim = 45 30 0 0, clip, angle=0]{figures/fig1_pixstar_benchmark.ps}

Table 2: Running time, speedup and efficiency of MPI_XSTAR

 $ N$    $ \mathcal {T}(N)$    $ \mathcal {S}(N)$    $ \mathcal {E}(N)$

  1      254:07:25             1.00                  1.00
  2      127:23:58             1.99                  0.99
  4      87:09:47              2.92                  0.73
  8      41:35:41              6.11                  0.76
 16      35:18:21              7.20                  0.45
 32      17:42:15              14.35                 0.45
 54      18:13:30              13.93                 0.26

Notes. The running time $\mathcal{T}(N)$ is given in hours, minutes, and seconds (hh:mm:ss).

The performance results for MPI_XSTAR as a function of the number of processors $ N$ are shown in Fig. 1, including the running time $ \mathcal {T}(N)$, the speedup $ \mathcal {S}(N)$, and the efficiency $ \mathcal {E}(N)$. As seen in the figure, the speedup and efficiency do not scale linearly with the number of processors. This is because the running time of each XSTAR process varies greatly with the physical conditions ( $ \log N_{\rm H}$ and $ \log \xi$), so the individual runs are far from identical in duration. The running time of a parallel execution is therefore limited by the longest-running XSTAR process for the given physical parameters. In our benchmark, each XSTAR run took between 25 seconds and 17.5 hours, depending on the column density $ \log N_{\rm H}$ and the ionization parameter $ \log \xi$ used as input parameters. Since the parallel running time of multiple XSTAR runs cannot fall below the maximum running time of a single XSTAR run, there should be little difference between the parallel executions with $ N=32$ and $ 54$. However, as seen in Table 2, the parallel execution with $ N=54$ is roughly half an hour longer than that with $ N=32$. This is because each node of the Harvard ODYSSEY cluster used in our benchmark has 32 cores, so the execution with $ N=54$ required 2 nodes. The inter-node communication overhead makes two-node parallel computing (more than 32 CPUs) slightly slower than single-node parallel computing (32 CPUs or fewer).
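The bound discussed above can be illustrated with a short scheduling sketch. It assumes a simple static round-robin assignment of runs to workers (an assumption for illustration; MPI_XSTAR's actual dispatch strategy may differ), and shows that the parallel running time, i.e. the makespan, can never drop below the longest single XSTAR run:

```python
# Hedged sketch: assign task running times to N workers round-robin and
# take the makespan (the largest per-worker total). Whatever N is, the
# makespan is bounded below by the longest single task.
def parallel_time(task_hours, n_workers):
    worker_load = [0.0] * n_workers
    for i, t in enumerate(task_hours):
        worker_load[i % n_workers] += t   # round-robin assignment
    return max(worker_load)               # makespan = slowest worker

# Illustrative task times spanning the observed range (25 s to 17.5 h).
tasks = [25 / 3600, 0.5, 1.0, 2.0, 5.0, 17.5]
print(parallel_time(tasks, 6) >= max(tasks))  # True: bounded by 17.5 h
```

With one worker the makespan is the serial total; with many workers it converges to the 17.5-hour maximum, mirroring why $N=32$ and $N=54$ give nearly identical running times in Table 2.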

As the execution time of each individual XSTAR run restricts the parallel running time of MPI_XSTAR, it prevents us from achieving a perfect speedup ( $ \mathcal{S}(N)\approx N$). If the internal Fortran 77 routines of the XSTAR program were themselves parallelized with one of the standard parallel computing interfaces (MPI or OpenMP), an ideal speedup might be achievable. Nevertheless, despite its modest computing efficiency, MPI_XSTAR provides a major improvement for constructing photoionization grid models for spectroscopic fitting tools such as XSPEC and ISIS. For example, the photoionization table model with the settings listed in Table 1 can now be produced in 18 hours using a parallel execution with 32 CPUs, rather than 10 days using a serial execution.

Ashkbiz Danehkar