This repository also contains the GPU-accelerated version of Quantum ESPRESSO.
This version is tested against PGI (now nvfortran) compilers v. >= 17.4.
The configure script checks for the presence of a PGI compiler and of a few
CUDA libraries. For this reason, a path pointing to the CUDA Toolkit must be present.
A template for the configure command is:
./configure --with-cuda=XX --with-cuda-runtime=YY --with-cuda-cc=ZZ --enable-openmp [ --with-scalapack=no ]
where XX is the location of the CUDA Toolkit (in HPC environments this is
often $CUDA_HOME), YY is the version of the CUDA Toolkit, and ZZ
is the compute capability of the card.
If you have no idea what these numbers are, you may give a try to the
get_device_props.py script in the dev-tools directory. An example using Slurm is:
$ module load cuda
$ cd dev-tools
$ salloc -n1 -t1
[...]
salloc: Granted job allocation xxxx
$ srun python get_device_props.py
[...]
Compute capabilities for dev 0: 6.0
Compute capabilities for dev 1: 6.0
Compute capabilities for dev 2: 6.0
Compute capabilities for dev 3: 6.0

If all compute capabilities match, configure QE with:

./configure --with-cuda=$CUDA_HOME --with-cuda-cc=60 --with-cuda-runtime=9.2
It is generally a good idea to disable ScaLAPACK when running small test cases, since the serial GPU eigensolver can outperform the parallel CPU eigensolver in many circumstances.
From time to time PGI links to the wrong CUDA libraries and fails, reporting
a problem in GOmp (GNU OpenMP). The solution to this
problem is removing cudatoolkit from the
LD_LIBRARY_PATH before compiling.
Serial compilation is also supported.
By default, GPU support is active. The following message will appear at the beginning of the output
GPU acceleration is ACTIVE.
GPU acceleration can be switched off by setting the following environment variable:
$ export USEGPU=no
The current GPU version passes all 186 tests with both parallel and serial
compilation. The testing suite should only be used to check the correctness of
the compilation; for parallel builds, make run-tests-pw-parallel
should be used.
Variables allocated on the device must end with _d.
Subroutines and functions replicating an algorithm on the GPU must end with _gpu.
Modules must end with _gpum.
Files with duplicated source code must end with _gpu.f90.
PW functionalities are ported to the GPU by duplicating the subroutines and the functions that operate on CPU variables. The number of arguments should not change, but input and output data may refer to device variables when applicable.
Bifurcations in code flow happen at runtime with commands similar to
use control_flags, only : use_gpu
[...]
if (use_gpu) then
   call subroutine_gpu(arg_d)
else
   call subroutine(arg)
end if
At each bifurcation point it should be possible to remove the call to the accelerated routine without breaking the code. Note however that calling both the CPU and the GPU version of a subroutine in the same place may break the code execution.
[ DISCLAIMER STARTS ] What is described below is not the method that will be integrated in the final release. Nonetheless it happens to be a good approach for:
1) simplifying the alignment of this fork with the main repository, 2) debugging, 3) tracing the evolution of memory paths as the CPU version evolves, 4) (in the future) reporting on the set of global variables that should be kept to guarantee a certain speedup.
For example, this simplified the integration of the changes that took place to modernize the I/O. [ DISCLAIMER ENDS ]
Global GPU data are tightly linked to global CPU data. One cannot allocate global variables on the GPU manually. The global GPU variables follow the allocation and deallocation of the CPU ones. This is an automatic mechanism enforced by the managed memory system. In what follows, I will refer to duplicated GPU variables as "duplicated variable" and to the equivalent CPU variable as "parent variable".
Global variables in modules are synchronized through calls to subroutines
named using_xxx, xxx being the name of the variable
in the module globally accessed by multiple subroutines.
This subroutine accepts one argument that replicates the role of the intent attribute.
Acceptable values are:

0: the variable will only be read (equal to intent(in)),
1: the variable will be read and written (equal to intent(inout)),
2: the variable will only be (entirely) updated (equal to intent(out)).
Function and subroutine calls having global variables in their argument
list should be guarded by calls to
using_xxx with the appropriate argument.
Obviously, calls with argument 0 and 1 must always be prepended to the guarded call.
The actual allocation of a duplicated variable happens when using_xxx_d
is called and the parent variable is allocated.
Deallocation happens when
using_xxx_d(2) is called and the CPU variable
is not allocated.
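The allocation, deallocation, and lazy-synchronization rules above can be modeled with a short sketch. The following is an illustrative Python model of the mechanism, not QE code: the class, the method names using_cpu/using_gpu (stand-ins for using_xxx/using_xxx_d), and the intent codes are all hypothetical.

```python
IN, INOUT, OUT = 0, 1, 2  # the three "intent-like" argument values

class DuplicatedVar:
    """Toy model of a CPU "parent variable" and its duplicated GPU copy."""

    def __init__(self):
        self.cpu = None          # parent (CPU) data; None = not allocated
        self.gpu = None          # duplicated (GPU) data
        self.cpu_stale = False   # CPU copy flagged "out of date"
        self.gpu_stale = False   # GPU copy flagged "out of date"

    def allocate_cpu(self, data):
        self.cpu = list(data)

    def deallocate_cpu(self):
        self.cpu = None

    def using_cpu(self, intent):
        """Model of using_xxx: guard a CPU-side access."""
        if intent in (IN, INOUT) and self.cpu_stale:
            self.cpu = list(self.gpu)       # synchronous copy, device -> host
            self.cpu_stale = False
        if intent in (INOUT, OUT):
            self.gpu_stale = True           # GPU copy is now out of date
            self.cpu_stale = False

    def using_gpu(self, intent):
        """Model of using_xxx_d: guard a GPU-side access."""
        if self.cpu is None:
            if intent == OUT:               # using_xxx_d(2) with no parent:
                self.gpu = None             # deallocation follows the parent
            return
        if self.gpu is None:                # allocation follows the parent
            self.gpu = [0] * len(self.cpu)
            self.gpu_stale = True
        if intent in (IN, INOUT) and self.gpu_stale:
            self.gpu = list(self.cpu)       # synchronous copy, host -> device
            self.gpu_stale = False
        if intent in (INOUT, OUT):
            self.cpu_stale = True
            self.gpu_stale = False
```

In this model a copy happens only when the accessed side was flagged stale by a previous call, which is why calls with argument 0 or 1 must come before any read of the guarded variable.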
Data synchronization (done with synchronous copies, i.e. the overloaded cudamemcpy)
happens when either the CPU or the GPU memory is found to be flagged
"out of date" by a previous call to one of these subroutines.
Calls to using_xxx_d should only happen in GPU functions/subroutines.
This rule can be waived if the call is protected by ifdefs.
This is useful if you are lazy and a global variable is updated only a few times.
An example of this is the g vectors, which are set in a few places (at
initialization, after a scaling of the Hamiltonian, etc.) and are used
everywhere in the code.
Finally, there are global variables that are only updated by subroutines residing inside the same module. The allocation and the update of the duplicated counterpart become trivial and are simply done at the same time as for the CPU variable. At the time of writing this constitutes an exception to the general rule, but it is actually the result of the efforts made in the last year to modularize the code, and it is probably the correct method to deal with duplicated data in the code.
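By contrast with the lazy flag-based scheme, a variable owned by a single module can refresh its duplicated copy eagerly, at update time, with no staleness flags at all. A minimal sketch of that scheme (illustrative Python with hypothetical names, not QE code):

```python
class ModuleOwnedVar:
    """Toy model of a global variable updated only inside its own module."""

    def __init__(self):
        self.cpu = None   # parent (CPU) data
        self.gpu = None   # duplicated (GPU) data

    def set(self, data):
        # The single, module-local entry point that writes the variable:
        # both copies are updated together, so no "out of date" bookkeeping
        # is ever needed.
        self.cpu = list(data)
        self.gpu = list(self.cpu)   # stand-in for a host -> device copy
```

Because every write funnels through the module's own setter, readers on either side can always assume both copies are current.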