Dendro
5.01
Dendro means tree in Greek. The Dendro library is a large-scale (demonstrated on 262K cores of ORNL's Titan) distributed-memory adaptive octree framework. The main goal of Dendro is to perform large-scale multiphysics simulations efficiently on modern supercomputers. Dendro provides efficient parallel data structures and algorithms for variational (finite element) and finite difference methods on 2:1-balanced, arbitrarily adaptive octrees, enabling simulations ranging from binary black hole mergers to blood flow in the human body, with applications spanning relativity, astrophysics, and biomedical engineering.
Collection of Generic Parallel Functions: Sorting, Partitioning, Searching, ...
Classes

class Mpi_datatype
    An abstract class used for communicating messages using user-defined datatypes. The user must implement the static member function "value()" that returns the MPI_Datatype corresponding to this user-defined datatype. A minimal specialization sketch is given after this list.
class Mpi_datatype< _T< double > >
class Mpi_datatype< _T< float > >
class Mpi_datatype< bool >
    A template specialization of the abstract class Mpi_datatype. This can be used for communicating messages of type "bool".
class Mpi_datatype< Node_Type >
    A template specialization of the abstract class "Mpi_datatype" for communicating messages of type "ot::TreeNode".
class Mpi_datatype< ot::Node >
    A template specialization of the abstract class "Mpi_datatype" for communicating messages of type "ot::Node".
class Mpi_datatype< ot::TreeNode >
    A template specialization of the abstract class "Mpi_datatype" for communicating messages of type "ot::TreeNode".
class Mpi_datatype< Point >
class Mpi_datatype< std::complex< T > >
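Per the brief above, a user-defined type only needs a specialization of Mpi_datatype that exposes a static value() member returning its MPI_Datatype. The following is a minimal sketch of such a specialization for a hypothetical Particle struct; the header name "parUtils.h" and the choice of MPI_Type_contiguous over MPI_BYTE are assumptions, not part of this documentation.

```cpp
#include <mpi.h>
#include "parUtils.h"   // assumed header providing the par:: templates documented here

// Hypothetical user-defined type; not part of Dendro.
struct Particle {
  double pos[3];
  int id;
};

namespace par {

// Specialization following the documented contract: expose a static value()
// that returns the MPI_Datatype corresponding to Particle.
template <>
class Mpi_datatype<Particle> {
 public:
  static MPI_Datatype value() {
    static bool first = true;
    static MPI_Datatype datatype;
    if (first) {
      first = false;
      // One common choice: treat the struct as a contiguous block of bytes;
      // an MPI struct type built from the individual members would also work.
      MPI_Type_contiguous(sizeof(Particle), MPI_BYTE, &datatype);
      MPI_Type_commit(&datatype);
    }
    return datatype;
  }
};

}  // namespace par
```

With such a specialization in place, vectors of Particle can be handed to the wrappers and sorting/partitioning routines below, assuming they obtain the datatype through Mpi_datatype<T>::value().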
Functions

template<typename T>
int Mpi_Isend (T *buf, int count, int dest, int tag, MPI_Comm comm, MPI_Request *request)

template<typename T>
int Mpi_Issend (T *buf, int count, int dest, int tag, MPI_Comm comm, MPI_Request *request)

template<typename T>
int Mpi_Recv (T *buf, int count, int source, int tag, MPI_Comm comm, MPI_Status *status)

template<typename T>
int Mpi_Irecv (T *buf, int count, int source, int tag, MPI_Comm comm, MPI_Request *request)

template<typename T>
int Mpi_Gather (T *sendBuffer, T *recvBuffer, int count, int root, MPI_Comm comm)

template<typename T, typename S>
int Mpi_Sendrecv (T *sendBuf, int sendCount, int dest, int sendTag, S *recvBuf, int recvCount, int source, int recvTag, MPI_Comm comm, MPI_Status *status)

template<typename T>
int Mpi_Bcast (T *buffer, int count, int root, MPI_Comm comm)

template<typename T>
int Mpi_Scan (T *sendbuf, T *recvbuf, int count, MPI_Op op, MPI_Comm comm)

template<typename T>
int Mpi_Reduce (T *sendbuf, T *recvbuf, int count, MPI_Op op, int root, MPI_Comm comm)

template<typename T>
int Mpi_Allreduce (T *sendbuf, T *recvbuf, int count, MPI_Op op, MPI_Comm comm)

template<typename T>
int Mpi_Alltoall (T *sendbuf, T *recvbuf, int count, MPI_Comm comm)

template<typename T>
int Mpi_Allgatherv (T *sendbuf, int sendcount, T *recvbuf, int *recvcounts, int *displs, MPI_Comm comm)

template<typename T>
int Mpi_Allgather (T *sendbuf, T *recvbuf, int count, MPI_Comm comm)

template<typename T>
int Mpi_Alltoallv_sparse (T *sendbuf, int *sendcnts, int *sdispls, T *recvbuf, int *recvcnts, int *rdispls, MPI_Comm comm)

template<typename T>
int Mpi_Alltoallv_dense (T *sendbuf, int *sendcnts, int *sdispls, T *recvbuf, int *recvcnts, int *rdispls, MPI_Comm comm)

template<typename T>
int Mpi_Alltoallv_Kway (T *sbuff_, int *s_cnt_, int *sdisp_, T *rbuff_, int *r_cnt_, int *rdisp_, MPI_Comm c)

template<typename T>
int scatterValues (std::vector< T > &in, std::vector< T > &out, DendroIntL outSz, MPI_Comm comm)
    Redistributes an STL vector, preserving the relative ordering of the elements.

template<typename T>
int maxLowerBound (const std::vector< T > &keys, const std::vector< T > &searchList, std::vector< T > &results, MPI_Comm comm)
    A parallel search function.

template<typename T>
unsigned int defaultWeight (const T *a)

template<typename T>
int partitionW (std::vector< T > &vec, unsigned int(*getWeight)(const T *), MPI_Comm comm)
    A parallel weighted partitioning function. In our implementation, we do not impose any restriction on the input or the number of processors. This function can be used with an odd number of processors as well. Some processors can pass an empty vector as input. The relative ordering of the elements is preserved.

template<typename T>
int concatenate (std::vector< T > &listA, std::vector< T > &listB, MPI_Comm comm)
    A parallel concatenation function. listB is appended (globally) to listA and the result is stored in listA. A useful application of this function is when listA and listB are individually sorted (globally) and the smallest element in listB is greater than the largest element in listA, and we want to create a merged list that is sorted.

template<typename T>
int sampleSort (std::vector< T > &in, std::vector< T > &out, std::vector< double > &stats, MPI_Comm comm)
    A parallel sample sort implementation. In our implementation, we do not impose any restriction on the input or the number of processors. This function can be used with an odd number of processors as well. Some processors can pass an empty vector as input. If the total number of elements in the vector (globally) is fewer than 10*p^2, where p is the number of processors, then bitonic sort is used instead of sample sort. A parallel bitonic sort is also used to sort the samples within the sample sort algorithm. Hence, the complexity of the algorithm is O(n/p log n/p) + O(p log p), where n is the global length of the vector and p is the number of processors.

int splitComm2way (bool iAmEmpty, MPI_Comm *new_comm, MPI_Comm orig_comm)
    Splits a communication group into two depending on the value of iAmEmpty on each processor. Both groups are sorted in the ascending order of their ranks in the old comm.

int splitComm2way (const bool *isEmptyList, MPI_Comm *new_comm, MPI_Comm orig_comm)
    Splits a communication group into two depending on the values in isEmptyList. Both groups are sorted in the ascending order of their ranks in the old comm. All processors must call this function with the same 'isEmptyList' array.

int splitCommUsingSplittingRank (int splittingRank, MPI_Comm *new_comm, MPI_Comm orig_comm)

unsigned int splitCommBinary (MPI_Comm orig_comm, MPI_Comm *new_comm)
    Splits a communication group into two: the first with a power-of-2 number of processors and the second with the remainder. The first group is sorted in ascending order of the ranks in the old comm; the second group is sorted in descending order of the ranks in the old comm.

unsigned int splitCommBinaryNoFlip (MPI_Comm orig_comm, MPI_Comm *new_comm)
    Splits a communication group into two: the first with a power-of-2 number of processors and the second with the remainder. Both groups are sorted in ascending order of the ranks in the old comm.

template<typename T>
void MergeLists (std::vector< T > &listA, std::vector< T > &listB, int KEEP_WHAT)
    Merges lists A and B, retaining either the low or the high values in list A.

template<typename T>
void MergeSplit (std::vector< T > &local_list, int which_keys, int partner, MPI_Comm comm)
    The main operation in the parallel bitonic sort algorithm. This implements the compare-split operation.

template<typename T>
void Par_bitonic_sort_incr (std::vector< T > &local_list, int proc_set_size, MPI_Comm comm)

template<typename T>
void Par_bitonic_sort_decr (std::vector< T > &local_list, int proc_set_size, MPI_Comm comm)

template<typename T>
void Par_bitonic_merge_incr (std::vector< T > &local_list, int proc_set_size, MPI_Comm comm)

template<typename T>
void bitonicSort_binary (std::vector< T > &in, MPI_Comm comm)
    An implementation of parallel bitonic sort that expects the number of processors to be a power of 2. However, unlike most implementations, it does not require the length of the vector (either locally or globally) to be a power of 2 or even, and each processor can call this with a different number of elements. However, 'in' must have at least 1 element on each processor.

template<typename T>
void bitonicSort (std::vector< T > &in, MPI_Comm comm)
    An implementation of parallel bitonic sort that does not require the number of processors to be a power of 2; the number of processors can even be odd. It also does not require the length of the vector (either locally or globally) to be a power of 2 or even, and each processor can call this with a different number of elements. However, 'in' must have at least 1 element on each processor. This recursively calls the function bitonicSort_binary, followed by a special parallel merge.

template<typename T>
void parallel_rank (const T *in, unsigned int sz, DendroIntL *out, MPI_Comm comm)
Collection of Generic Parallel Functions: Sorting, Partitioning, Searching,...
void par::bitonicSort (std::vector< T > & in, MPI_Comm comm)

An implementation of parallel bitonic sort that does not require the number of processors to be a power of 2; the number of processors can even be odd. It also does not require the length of the vector (either locally or globally) to be a power of 2 or even, and each processor can call this with a different number of elements. However, 'in' must have at least 1 element on each processor. This recursively calls the function bitonicSort_binary, followed by a special parallel merge.

    in - the vector to be sorted
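As a usage illustration (a minimal sketch, not taken from the Dendro sources): each rank fills a local vector of arbitrary non-zero size and calls par::bitonicSort. The header name "parUtils.h" is assumed, and the element type is presumed to need a par::Mpi_datatype specialization and a less-than comparison.

```cpp
#include <mpi.h>
#include <cstdlib>
#include <vector>
#include "parUtils.h"   // assumed header exposing the par:: functions documented here

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Each rank may hold a different number of elements, but at least one.
  std::vector<double> in(100 + 10 * rank);
  srand(rank + 1);
  for (std::size_t i = 0; i < in.size(); ++i) {
    in[i] = static_cast<double>(rand()) / RAND_MAX;
  }

  // After the call, the concatenation of the local vectors in rank order
  // is globally sorted.
  par::bitonicSort(in, MPI_COMM_WORLD);

  MPI_Finalize();
  return 0;
}
```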
void par::bitonicSort_binary (std::vector< T > & in, MPI_Comm comm)

An implementation of parallel bitonic sort that expects the number of processors to be a power of 2. However, unlike most implementations, it does not require the length of the vector (either locally or globally) to be a power of 2 or even, and each processor can call this with a different number of elements. However, 'in' must have at least 1 element on each processor.

    in - the vector to be sorted
int par::concatenate (std::vector< T > & listA, std::vector< T > & listB, MPI_Comm comm)

A parallel concatenation function. listB is appended (globally) to listA and the result is stored in listA. A useful application of this function is when listA and listB are individually sorted (globally) and the smallest element in listB is greater than the largest element in listA, and we want to create a merged list that is sorted.

    listA - a distributed vector; the result is stored in listA. listA must not be empty on any of the calling processors.
    listB - another distributed vector that is appended to listA. listB can be empty on some of the calling processors. listB will be cleared within the function.
    comm - the communicator
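A minimal usage sketch under the stated constraints (header name assumed as before): listA is globally sorted and non-empty on every rank, and every element of listB exceeds every element of listA.

```cpp
#include <mpi.h>
#include <vector>
#include "parUtils.h"   // assumed header

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, npes;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &npes);

  // listA: globally sorted values 0 .. 10*npes-1, non-empty on every rank.
  std::vector<int> listA;
  for (int i = 0; i < 10; ++i) listA.push_back(10 * rank + i);

  // listB: globally sorted values that are all larger than every element of listA.
  std::vector<int> listB;
  for (int i = 0; i < 5; ++i) listB.push_back(10 * npes + 5 * rank + i);

  par::concatenate<int>(listA, listB, MPI_COMM_WORLD);
  // listA now holds this rank's piece of the merged, globally sorted list;
  // listB has been cleared by the call.

  MPI_Finalize();
  return 0;
}
```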
int par::maxLowerBound (const std::vector< T > & keys, const std::vector< T > & searchList, std::vector< T > & results, MPI_Comm comm)

A parallel search function.

    keys - locally sorted, unique list of keys
    searchList - globally sorted, unique list; no processor may call this function with an empty list
    results - the maximum lower bound in searchList for the corresponding key
    comm - MPI communicator
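A small illustrative program (assumptions as before): searchList holds multiples of 10 distributed in rank order, and the result for each key is the largest multiple of 10 not exceeding it.

```cpp
#include <mpi.h>
#include <vector>
#include "parUtils.h"   // assumed header

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // searchList: globally sorted, unique, non-empty on every rank
  // (rank r holds the multiples of 10 in [100*r, 100*(r+1))).
  std::vector<unsigned int> searchList;
  for (int i = 0; i < 10; ++i) searchList.push_back(100u * rank + 10u * i);

  // keys: locally sorted, unique.
  std::vector<unsigned int> keys;
  keys.push_back(100u * rank + 7u);
  keys.push_back(100u * rank + 55u);

  std::vector<unsigned int> results;
  par::maxLowerBound<unsigned int>(keys, searchList, results, MPI_COMM_WORLD);
  // results[i] is the largest element of searchList that is <= keys[i]
  // (here 100*rank and 100*rank + 50).

  MPI_Finalize();
  return 0;
}
```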
void par::MergeLists (std::vector< T > & listA, std::vector< T > & listB, int KEEP_WHAT)

Merges lists A and B, retaining either the low or the high values in list A.

    listA - input list, and where the output is stored
    listB - second input list
    KEEP_WHAT - determines whether to retain the high or the low values from A and B; one of KEEP_HIGH or KEEP_LOW

Merging the two lists when their sizes are not the same is a bit involved. The key condition is that all elements less than max(min(A), min(B)) are retained by the KEEP_LOW processor, and similarly all elements larger than min(max(A), max(B)) are retained by the KEEP_HIGH processor.

The reason is that, on the KEEP_LOW side,

    max(min(A), min(B)) > min(A) > max(A-)

and similarly, on the KEEP_HIGH side,

    min(max(A), max(B)) < max(A) < min(A+)

which guarantees that the merged lists remain bitonic.
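A tiny sketch of how the compare-split step might call this routine; the exact spelling and scope of the KEEP_LOW / KEEP_HIGH constants mentioned above are assumed, as is the header name.

```cpp
#include <vector>
#include "parUtils.h"   // assumed header; also assumed to provide KEEP_HIGH / KEEP_LOW

// Illustrative compare-split half: after a rank and its partner have exchanged
// their local lists (as MergeSplit does), the rank that should keep the smaller
// keys calls MergeLists with KEEP_LOW, while the partner uses KEEP_HIGH on its
// own copy of the same data.
void keepLowHalf(std::vector<int>& mine, std::vector<int>& fromPartner) {
  par::MergeLists<int>(mine, fromPartner, KEEP_LOW);
}
```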
void par::MergeSplit (std::vector< T > & local_list, int which_keys, int partner, MPI_Comm comm)

The main operation in the parallel bitonic sort algorithm. This implements the compare-split operation.

    local_list - the input vector
    which_keys - one of KEEP_HIGH or KEEP_LOW
    partner - the processor with which to merge and split
    comm - the communicator
int par::Mpi_Allgather (T * sendbuf, T * recvbuf, int count, MPI_Comm comm)
int par::Mpi_Allgatherv (T * sendbuf, int sendcount, T * recvbuf, int * recvcounts, int * displs, MPI_Comm comm)
int par::Mpi_Allreduce (T * sendbuf, T * recvbuf, int count, MPI_Op op, MPI_Comm comm)
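The typed wrappers mirror the corresponding MPI calls with the MPI_Datatype argument removed; presumably the datatype is supplied internally via par::Mpi_datatype<T>::value(). A minimal sketch (the header name and the availability of the DendroIntL type through it are assumptions):

```cpp
#include <mpi.h>
#include <vector>
#include "parUtils.h"   // assumed header; DendroIntL assumed to be declared via this include

// Computes the global number of elements across all ranks (illustrative helper).
DendroIntL globalSize(const std::vector<double>& localVec, MPI_Comm comm) {
  DendroIntL localSz = static_cast<DendroIntL>(localVec.size());
  DendroIntL globalSz = 0;
  // Same arguments as MPI_Allreduce, minus the MPI_Datatype.
  par::Mpi_Allreduce<DendroIntL>(&localSz, &globalSz, 1, MPI_SUM, comm);
  return globalSz;
}
```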
int par::Mpi_Alltoall (T * sendbuf, T * recvbuf, int count, MPI_Comm comm)
int par::Mpi_Alltoallv_dense (T * sendbuf, int * sendcnts, int * sdispls, T * recvbuf, int * recvcnts, int * rdispls, MPI_Comm comm)
int par::Mpi_Alltoallv_sparse (T * sendbuf, int * sendcnts, int * sdispls, T * recvbuf, int * recvcnts, int * rdispls, MPI_Comm comm)
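A sketch of the usual counts/displacements setup before calling the personalized all-to-all wrappers (the same pattern applies to Mpi_Alltoallv_dense). Everything outside the documented signatures, including MPI_Alltoall-style semantics for the count exchange, is an assumption.

```cpp
#include <mpi.h>
#include <vector>
#include "parUtils.h"   // assumed header

// Send a variable number of doubles to every other rank and return what we receive.
// sendcnts[i] is the number of elements destined for rank i; sendbuf is ordered by rank.
std::vector<double> exchangeAll(std::vector<double>& sendbuf,
                                std::vector<int>& sendcnts, MPI_Comm comm) {
  int npes;
  MPI_Comm_size(comm, &npes);

  // Let every rank know how much it will receive from us (one int per rank).
  std::vector<int> recvcnts(npes, 0);
  par::Mpi_Alltoall<int>(sendcnts.data(), recvcnts.data(), 1, comm);

  // Displacements are exclusive prefix sums of the counts.
  std::vector<int> sdispls(npes, 0), rdispls(npes, 0);
  for (int i = 1; i < npes; ++i) {
    sdispls[i] = sdispls[i - 1] + sendcnts[i - 1];
    rdispls[i] = rdispls[i - 1] + recvcnts[i - 1];
  }

  std::vector<double> recvbuf(rdispls[npes - 1] + recvcnts[npes - 1]);
  par::Mpi_Alltoallv_sparse<double>(sendbuf.data(), sendcnts.data(), sdispls.data(),
                                    recvbuf.data(), recvcnts.data(), rdispls.data(),
                                    comm);
  return recvbuf;
}
```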
int par::Mpi_Bcast (T * buffer, int count, int root, MPI_Comm comm)
int par::Mpi_Gather (T * sendBuffer, T * recvBuffer, int count, int root, MPI_Comm comm)
int par::Mpi_Reduce (T * sendbuf, T * recvbuf, int count, MPI_Op op, int root, MPI_Comm comm)
int par::Mpi_Scan (T * sendbuf, T * recvbuf, int count, MPI_Op op, MPI_Comm comm)
int par::Mpi_Sendrecv (T * sendBuf, int sendCount, int dest, int sendTag, S * recvBuf, int recvCount, int source, int recvTag, MPI_Comm comm, MPI_Status * status)
void par::Par_bitonic_merge_incr (std::vector< T > & local_list, int proc_set_size, MPI_Comm comm)
void par::Par_bitonic_sort_decr (std::vector< T > & local_list, int proc_set_size, MPI_Comm comm)
void par::Par_bitonic_sort_incr (std::vector< T > & local_list, int proc_set_size, MPI_Comm comm)
int par::partitionW (std::vector< T > & vec, unsigned int(*getWeight)(const T *), MPI_Comm comm)

A parallel weighted partitioning function. In our implementation, we do not impose any restriction on the input or the number of processors. This function can be used with an odd number of processors as well. Some processors can pass an empty vector as input. The relative ordering of the elements is preserved.

    vec - the input vector
    getWeight - function pointer to compute the weight of each element; if you pass NULL, every element gets a weight equal to 1
    comm - the communicator
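A sketch of repartitioning with a user-supplied weight callback matching the documented function-pointer signature. The Cell type and weighting policy are illustrative only, and a par::Mpi_datatype specialization for the element type (as sketched earlier) is presumably required.

```cpp
#include <mpi.h>
#include <vector>
#include "parUtils.h"   // assumed header

// Hypothetical element type; not part of Dendro.
struct Cell {
  int id;
  bool refined;
};

// Weight callback with the documented signature unsigned int(*)(const T*).
// Here, refined cells count twice as much (illustrative policy).
unsigned int cellWeight(const Cell* c) {
  return c->refined ? 2u : 1u;
}

void rebalance(std::vector<Cell>& cells, MPI_Comm comm) {
  // Redistribute so that the total weight is balanced across ranks while
  // preserving the relative ordering of the elements.
  par::partitionW<Cell>(cells, cellWeight, comm);

  // Passing NULL instead gives every element a weight of 1 (plain load balancing):
  // par::partitionW<Cell>(cells, NULL, comm);
}
```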
int par::sampleSort (std::vector< T > & in, std::vector< T > & out, std::vector< double > & stats, MPI_Comm comm)

A parallel sample sort implementation. In our implementation, we do not impose any restriction on the input or the number of processors. This function can be used with an odd number of processors as well. Some processors can pass an empty vector as input. If the total number of elements in the vector (globally) is fewer than 10*p^2, where p is the number of processors, then bitonic sort is used instead of sample sort. A parallel bitonic sort is also used to sort the samples within the sample sort algorithm. Hence, the complexity of the algorithm is O(n/p log n/p) + O(p log p), where n is the global length of the vector and p is the number of processors.

    in - the input vector
    out - the output vector
    comm - the communicator
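A minimal usage sketch (header name assumed; the contents of the stats vector are not documented here, so it is simply passed through):

```cpp
#include <mpi.h>
#include <cstdlib>
#include <vector>
#include "parUtils.h"   // assumed header

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Local input; sizes may differ across ranks and may even be zero on some ranks.
  std::vector<long> in(1000 + rank);
  srand(rank + 1);
  for (std::size_t i = 0; i < in.size(); ++i) in[i] = rand();

  std::vector<long> out;       // receives this rank's piece of the sorted sequence
  std::vector<double> stats;   // filled by the call, per the signature above
  par::sampleSort<long>(in, out, stats, MPI_COMM_WORLD);
  // The concatenation of 'out' across ranks (in rank order) is globally sorted.

  MPI_Finalize();
  return 0;
}
```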
int par::scatterValues (std::vector< T > & in, std::vector< T > & out, DendroIntL outSz, MPI_Comm comm)

Redistributes an STL vector, preserving the relative ordering of the elements.

    in - the input vector
    out - the output vector; memory for this vector will be allocated within the function
    outSz - the local size of the output vector on the calling processor
    comm - the communicator
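A sketch that rebalances a distributed vector to (nearly) equal local sizes. The choice of outSz is illustrative, and presumably the outSz values must sum to the global input size across ranks; header name and DendroIntL availability are assumptions as before.

```cpp
#include <mpi.h>
#include <vector>
#include "parUtils.h"   // assumed header

// Redistribute 'in' so that every rank ends up with roughly the same number of
// elements, while preserving the global ordering (illustrative helper).
std::vector<float> balance(std::vector<float>& in, MPI_Comm comm) {
  int rank, npes;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &npes);

  // Global input size via the typed Allreduce wrapper documented above.
  DendroIntL localSz = static_cast<DendroIntL>(in.size());
  DendroIntL globalSz = 0;
  par::Mpi_Allreduce<DendroIntL>(&localSz, &globalSz, 1, MPI_SUM, comm);

  // Even split; the last rank absorbs the remainder.
  DendroIntL outSz = globalSz / npes;
  if (rank == npes - 1) outSz += globalSz % npes;

  std::vector<float> out;   // allocated inside scatterValues
  par::scatterValues<float>(in, out, outSz, comm);
  return out;
}
```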
int par::splitComm2way (bool iAmEmpty, MPI_Comm * new_comm, MPI_Comm orig_comm)

Splits a communication group into two depending on the value of iAmEmpty on each processor. Both groups are sorted in the ascending order of their ranks in the old comm.

    iAmEmpty - flag that determines which of the two groups the calling processor is placed into
    orig_comm - the comm group that needs to be split
    new_comm - the new comm group
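A sketch of one natural use: moving ranks that currently hold no data into their own communicator so that the remaining ranks can continue collective work without them (header name assumed).

```cpp
#include <mpi.h>
#include <vector>
#include "parUtils.h"   // assumed header

// Returns a communicator containing only the ranks whose emptiness flag matches
// this rank's flag, ordered by their ranks in the original communicator.
MPI_Comm dropEmptyRanks(const std::vector<double>& localData, MPI_Comm comm) {
  bool iAmEmpty = localData.empty();
  MPI_Comm newComm;
  par::splitComm2way(iAmEmpty, &newComm, comm);
  return newComm;
}
```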
int par::splitComm2way (const bool * isEmptyList, MPI_Comm * new_comm, MPI_Comm orig_comm)

Splits a communication group into two depending on the values in isEmptyList. Both groups are sorted in the ascending order of their ranks in the old comm. All processors must call this function with the same 'isEmptyList' array.

    isEmptyList - flags (of length equal to the number of processors) that determine whether each processor is active or not
    orig_comm - the comm group that needs to be split
    new_comm - the new comm group
unsigned int par::splitCommBinary (MPI_Comm orig_comm, MPI_Comm * new_comm)

Splits a communication group into two: the first with a power-of-2 number of processors and the second with the remainder. The first group is sorted in ascending order of the ranks in the old comm; the second group is sorted in descending order of the ranks in the old comm.

    orig_comm - the comm group that needs to be split
    new_comm - the new comm group
unsigned int par::splitCommBinaryNoFlip (MPI_Comm orig_comm, MPI_Comm * new_comm)

Splits a communication group into two: the first with a power-of-2 number of processors and the second with the remainder. Both groups are sorted in ascending order of the ranks in the old comm.

    orig_comm - the comm group that needs to be split
    new_comm - the new comm group