CUTLASS: CUDA Templates for Linear Algebra Subroutines and Solvers

| File | Description |
| aligned_buffer.h | AlignedBuffer is a container for trivially copyable elements suitable for use in unions and shared memory | 
| arch.h | Defines tags for architecture-specific configurations | 
| array.h | Statically sized array of elements that accommodates all CUTLASS-supported numeric types and is safe to use in a union | 
| array_subbyte.h | Statically sized array specialized for sub-byte element types (smaller than one byte) and safe to use in a union | 
| batched_reduction.h | Implements a software-pipelined efficient batched reduction. D = alpha * Reduction(A) + beta * C | 
| batched_reduction_traits.h | Defines structural properties of complete batched reduction. D = alpha * Reduction(A) + beta * C | 
| command_line.h | Utility for parsing command line arguments | 
| complex.h | Defines complex-valued numeric types and arithmetic usable in host and device code | 
| conversion_op.h | Functor performing conversion operations used by epilogues | 
| coord.h | A Coord is a coordinate of arbitrary rank into a tensor or matrix | 
| core_io.h | Helpers for printing cutlass/core objects | 
| cutlass.h | Basic include for CUTLASS | 
| include/cutlass/util/debug.h | Debugging and logging functionality | 
| tools/util/include/cutlass/util/debug.h | Debugging utilities for CUTLASS code | 
| default_epilogue_complex_tensor_op.h | Epilogue for threadblock scoped complex GEMMs using Tensor Ops | 
| default_epilogue_simt.h | Epilogue for threadblock scoped GEMMs using SIMT | 
| default_epilogue_tensor_op.h | Epilogue for threadblock scoped GEMMs using Tensor Ops | 
| default_epilogue_volta_tensor_op.h | Epilogue for threadblock scoped GEMMs using Tensor Ops on Volta | 
| default_epilogue_wmma_tensor_op.h | Epilogue for threadblock scoped GEMMs using WMMA Tensor Ops | 
| default_gemm.h | Default kernel-level GEMM definitions combine threadblock-scoped matrix multiply-add with the appropriate threadblock-scoped epilogue | 
| default_gemm_configuration.h | Definitions for GEMM structures | 
| default_gemm_splitk_parallel.h | Default kernel-level GEMM definitions combine threadblock-scoped matrix multiply-add with the appropriate threadblock-scoped epilogue | 
| default_gemv.h | Default kernel-level batched GEMV definitions | 
| default_gemv_core.h | Defines basic properties needed by CTA-level batched GEMV assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes | 
| default_mma.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K | 
| default_mma_core.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes | 
| default_mma_core_simt.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes | 
| default_mma_core_sm50.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes | 
| default_mma_core_sm70.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes | 
| default_mma_core_sm75.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes | 
| default_mma_core_wmma.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes | 
| default_mma_tensor_op.h | Default warp-level GEMM operators selected by data type, size, and layouts of operands | 
| default_mma_wmma_tensor_op.h | Default warp-level GEMM operators selected by data type, size, and layouts of operands | 
| default_thread_map_simt.h | Defines the default thread map used by epilogues of SIMT GEMMs | 
| default_thread_map_tensor_op.h | Defines the default thread map used by epilogues of Tensor Op GEMMs | 
| default_thread_map_volta_tensor_op.h | Defines the default thread map used by epilogues of Volta Tensor Op GEMMs | 
| default_thread_map_wmma_tensor_op.h | Defines the default thread map used by epilogues of WMMA Tensor Op GEMMs | 
| device_dump.h | C++ interface to dump fragments and shared memory contents for debugging | 
| device_kernel.h | Template for generic CUTLASS kernel | 
| device_memory.h | C++ interface to CUDA device memory management functions | 
| direct_epilogue_tensor_op.h | Epilogue for tensor operations | 
| distribution.h | This header contains a class to parametrize a statistical distribution function | 
| epilogue.h | Epilogue for threadblock scoped GEMMs using Tensor Ops | 
| epilogue_base.h | Base class for epilogues of threadblock scoped GEMMs | 
| epilogue_workspace.h | Epilogue for threadblock scoped GEMMs | 
| exceptions.h | C++ exception semantics for CUDA error codes | 
| fast_math.h | Math utilities | 
| fragment_iterator_complex_tensor_op.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation | 
| fragment_iterator_simt.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation | 
| fragment_iterator_tensor_op.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation | 
| fragment_iterator_volta_tensor_op.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation | 
| fragment_iterator_wmma_tensor_op.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation | 
| functional.h | Define basic numeric operators with specializations for Array<T, N>. SIMD-ize where possible | 
| include/cutlass/gemm/device/gemm.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K | 
| include/cutlass/gemm/gemm.h | Defines common types used for all GEMM-like operators | 
| include/cutlass/gemm/kernel/gemm.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K | 
| tools/util/include/cutlass/util/reference/device/gemm.h | Reference implementation for GEMM in device-side code | 
| tools/util/include/cutlass/util/reference/device/kernel/gemm.h | Reference implementation for GEMM in device-side code | 
| tools/util/include/cutlass/util/reference/device/thread/gemm.h | Reference implementation for GEMM in device-side code | 
| tools/util/include/cutlass/util/reference/host/gemm.h | Reference implementation for GEMM in host-side code | 
| device/gemm_batched.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K | 
| kernel/gemm_batched.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K | 
| include/cutlass/gemm/device/gemm_complex.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K | 
| tools/util/include/cutlass/util/reference/host/gemm_complex.h | Reference implementation for complex-valued GEMM in host-side code | 
| gemm_pipelined.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K | 
| device/gemm_splitk_parallel.h | Template for GEMM performing a reduction over K partitions in parallel | 
| kernel/gemm_splitk_parallel.h | Template for GEMM performing a reduction over K partitions in parallel | 
| gemv.h | Template for a threadblock-scoped GEMV kernel | 
| gemv_batched_strided.h | |
| half.h | Defines a class for using IEEE half-precision floating-point types in host or device code | 
| host_reorder.h | Reorder data from the host side | 
| host_tensor.h | HostTensor manages both host and device memory allocations for a tensor | 
| inner_product.h | Reference implementation of inner product operations in host-side code | 
| integer_subbyte.h | Defines a class for using integer types smaller than one byte in host or device code | 
| interleaved_epilogue.h | Epilogue for threadblock scoped GEMMs using Tensor Ops | 
| kernel_launch.h | Defines structures and helpers to launch CUDA kernels within CUTLASS | 
| layout.h | Defines layout functions used by TensorRef and derived classes | 
| library.h | CUTLASS Library is an object-oriented approach to managing operations implemented by CUTLASS | 
| linear_combination.h | Functor performing linear combination operations used by epilogues | 
| linear_combination_clamp.h | Functor performing linear scaling operations used by epilogues. Values are clamped before converting to the output element type | 
| linear_combination_relu.h | Functor performing linear combination followed by a ReLU operation, used by epilogues. Values are clamped before converting to the output element type | 
| manifest.h | Manifest of CUTLASS Library | 
| layout/matrix.h | Defines layout functions used by TensorRef and derived classes | 
| thread/matrix.h | Defines a matrix object intended for storing data in registers and operations within a CUDA thread | 
| matrix_coord.h | Defines a canonical coordinate for rank=2 matrices offering named indices | 
| matrix_shape.h | Defines a Shape template for matrix tiles | 
| matrix_traits.h | Defines properties of matrices used to denote layout and operands to GEMM kernels | 
| memory.h | Architecture-specific operators on memory | 
| memory_sm75.h | Architecture-specific operators on memory added for SM75 | 
| arch/mma.h | Templates exposing architecture support for multiply-add operations | 
| gemm/thread/mma.h | Templates exposing architecture support for warp-level multiply-add operations | 
| gemm/warp/mma.h | Templates exposing architecture support for warp-level multiply-add operations | 
| mma_base.h | Template for a double-buffered threadblock-scoped GEMM kernel | 
| mma_complex_tensor_op.h | Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores | 
| mma_pipelined.h | Template for a double-buffered threadblock-scoped GEMM kernel | 
| mma_simt.h | Templates implementing warp-level matrix multiply-accumulate operations | 
| mma_simt_policy.h | Describes the lane policy used by warp-level matrix multiply operators targeting SIMT instructions | 
| mma_simt_tile_iterator.h | Defines iterators used by warp-level matrix multiply operators targeting SIMT instructions | 
| mma_singlestage.h | Template for a double-buffered threadblock-scoped GEMM kernel | 
| arch/mma_sm50.h | Matrix multiply | 
| gemm/thread/mma_sm50.h | Templates exposing architecture support for multiply-add operations | 
| arch/mma_sm60.h | Matrix multiply | 
| gemm/thread/mma_sm60.h | Templates exposing architecture support for multiply-add operations | 
| arch/mma_sm61.h | Matrix multiply | 
| gemm/thread/mma_sm61.h | Templates exposing architecture support for multiply-add operations | 
| mma_sm70.h | Matrix multiply | 
| mma_sm75.h | Matrix multiply for SM75 | 
| mma_tensor_op.h | Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores | 
| mma_tensor_op_policy.h | Policy describing implementation details of warp-level GEMM targeting Tensor Cores | 
| mma_tensor_op_sm70.h | Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores | 
| mma_tensor_op_tile_iterator.h | Defines iterators used by warp-level matrix multiply operations targeting Tensor Cores | 
| mma_tensor_op_tile_iterator_sm70.h | Defines iterators used by warp-level matrix multiply operations targeting Tensor Cores | 
| mma_tensor_op_tile_iterator_wmma.h | Defines iterators used by warp-level matrix multiply operations targeting Tensor Cores via the WMMA API | 
| mma_tensor_op_wmma.h | Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores | 
| numeric_conversion.h | Boost-like numeric conversion operator for CUTLASS numeric types | 
| numeric_types.h | Top-level include for all CUTLASS numeric types | 
| output_tile_thread_map.h | Metaprogram for determining the mapping of output elements to threads for epilogue tiles | 
| pitch_linear.h | Defines layout functions used by TensorRef and derived classes for pitch-linear memory | 
| pitch_linear_thread_map.h | Templates implementing how threads are mapped to a given tile | 
| platform.h | C++ features that may be otherwise unimplemented for CUDA device functions | 
| predicate_vector.h | Defines container classes and iterators for managing a statically sized vector of boolean predicates | 
| predicated_tile_access_iterator.h | Templates calculating the address and predicates to the load of tiles from pitch-linear rank=2 tensors | 
| predicated_tile_access_iterator_2dthreadtile.h | Templates calculating the address and predicates to the load of tiles from pitch-linear rank=2 tensors | 
| epilogue/threadblock/predicated_tile_iterator.h | Defines the predicated tile iterator used by epilogues of threadblock scoped GEMMs to store output tiles to global memory | 
| transform/threadblock/predicated_tile_iterator.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors | 
| predicated_tile_iterator_2dthreadtile.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors | 
| real.h | Defines a trait for extracting the underlying real-valued type of a numeric type | 
| reduce.h | Defines basic thread level reduction with specializations for Array<T, N> | 
| reduce_split_k.h | Kernel performing a reduction over densely packed tensors in global memory | 
| reduction_op.h | Functor performing reduction operations used by epilogues | 
| reduction_operators.h | Defines reduction operators used by kernels that reduce densely packed tensors in global memory | 
| regular_tile_access_iterator.h | Templates implementing address computation for storing tiles of pitch-linear rank=2 tensors | 
| regular_tile_access_iterator_pitch_linear.h | Templates implementing address computation for storing tiles of pitch-linear rank=2 tensors | 
| regular_tile_access_iterator_tensor_op.h | Templates implementing address computation for storing tiles of pitch-linear rank=2 tensors | 
| regular_tile_iterator.h | Templates implementing storing of tiles from pitch-linear rank=2 tensors | 
| regular_tile_iterator_pitch_linear.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors | 
| regular_tile_iterator_pitch_linear_2dthreadtile.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors | 
| regular_tile_iterator_tensor_op.h | Templates implementing storing of tiles from pitch-linear rank=2 tensors | 
| regular_tile_iterator_tensor_op_sm70.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors | 
| relatively_equal.h | Defines a function for testing whether two values are approximately equal within a relative tolerance | 
| semaphore.h | Implementation of a CTA-wide semaphore for inter-CTA synchronization | 
| shared_load_iterator.h | Defines an iterator for loading accumulator fragments from shared memory during the epilogue of threadblock scoped GEMMs | 
| simd.h | Templates exposing SIMD operators | 
| simd_sm60.h | Templates exposing SIMD operators for SM60 | 
| simd_sm61.h | Templates exposing SIMD operators for SM61 | 
| simt_policy.h | Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of SimtOp instructions, of which a row-oriented slice is visible per iteration | 
| subbyte_reference.h | Provides a mechanism for packing and unpacking elements smaller than one byte | 
| tensor.h | Defines layout functions used by TensorRef and derived classes for common 4-D and 5-D tensor formats | 
| device/tensor_compare.h | Functions for comparing tensors held in device memory | 
| host/tensor_compare.h | Functions for comparing tensors held in host memory | 
| tensor_coord.h | Defines a canonical coordinate for rank=4 tensors offering named indices | 
| tensor_copy.h | Functions for copying between tensor views in host memory | 
| device/kernel/tensor_elementwise.h | Kernels implementing element-wise operations on tensors in device memory | 
| host/tensor_elementwise.h | Element-wise operations on tensors in host memory | 
| device/tensor_fill.h | Functions for filling tensors in device memory with constant or randomly generated values | 
| host/tensor_fill.h | Functions for filling tensors in host memory with constant or randomly generated values | 
| device/kernel/tensor_foreach.h | Kernel applying a functor to each element of a tensor in device memory | 
| device/tensor_foreach.h | Launches a kernel that applies a functor to each element of a tensor in device memory | 
| host/tensor_foreach.h | Applies a functor to each element of a tensor in host memory | 
| tensor_norm.h | Functions for computing norms of tensors | 
| tensor_op_multiplicand_sm70.h | Defines layout functions for Tensor Op multiplicand operands targeting SM70 | 
| tensor_op_multiplicand_sm75.h | Defines layout functions for Tensor Op multiplicand operands targeting SM75 | 
| tensor_op_policy.h | Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of TensorOp instructions, of which a row-oriented slice is visible per iteration | 
| tensor_ref.h | Defines a structure containing strides, bounds, and a pointer to tensor data | 
| tensor_view.h | Defines a structure containing strides and a pointer to tensor data | 
| tensor_view_io.h | Defines operators for writing TensorView objects to output streams | 
| gemm/threadblock/threadblock_swizzle.h | Implements several possible threadblock-swizzling functions mapping blockIdx to GEMM problems | 
| reduction/threadblock_swizzle.h | Defines functors for mapping blockIdx to partitions of the batched reduction computation | 
| tile_iterator_simt.h | Defines warp-scoped tile iterators used by epilogues of SIMT GEMMs | 
| tile_iterator_tensor_op.h | Defines warp-scoped tile iterators used by epilogues of Tensor Op GEMMs | 
| tile_iterator_volta_tensor_op.h | Defines warp-scoped tile iterators used by epilogues of Volta Tensor Op GEMMs | 
| tile_iterator_wmma_tensor_op.h | Defines warp-scoped tile iterators used by epilogues of WMMA Tensor Op GEMMs | 
| transpose.h | Basic copy routines for tensor views | 
| type_traits.h | Type traits for common CUDA types | 
| vector.h | Defines layout functions used for rank=1 vectors | 
| volta_tensor_op_policy.h | Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of TensorOp instructions, of which a row-oriented slice is visible per iteration | 
| wmma.h | Templates exposing architecture support for warp matrix multiply-add (WMMA) operations | 
| wmma_array.h | Statically sized array of WMMA fragments | 
| wmma_ptx.h | Templates exposing warp matrix multiply-add (WMMA) operations | 
| wmma_sm70.h | Matrix multiply | 
| wmma_sm72.h | Matrix multiply | 
| wmma_sm75.h | Matrix multiply | 
| wmma_tensor_op_policy.h | Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of TensorOp instructions, of which a row-oriented slice is visible per iteration | 
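
The sketches below illustrate a few of the most commonly used headers in this list. They are minimal, unvalidated examples based on the public CUTLASS 2.x interfaces; sizes, values, and helper names such as `run_sgemm` are placeholders introduced here for illustration only.

half.h and numeric_types.h provide numeric types usable in both host and device code. A small host-side sketch of `cutlass::half_t` arithmetic (the values are arbitrary):

```cpp
#include <iostream>
#include "cutlass/numeric_types.h"

int main() {
  // IEEE FP16 value usable in host or device code.
  cutlass::half_t x = cutlass::half_t(2.25f);
  cutlass::half_t y = x * cutlass::half_t(2.0f) + cutlass::half_t(0.5f);

  // Convert back to float for printing.
  std::cout << float(y) << std::endl;   // prints 5 (2.25 * 2 + 0.5)
  return 0;
}
```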
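coord.h, matrix_coord.h, and include/cutlass/gemm/gemm.h define the small coordinate types that appear throughout the API. A brief sketch of how they compose; the sizes below are arbitrary:

```cpp
#include "cutlass/coord.h"
#include "cutlass/matrix_coord.h"
#include "cutlass/gemm/gemm.h"

void coord_examples() {
  // Rank-2 coordinate with named row/column accessors.
  cutlass::MatrixCoord extent(128, 64);
  int rows = extent.row();        // 128
  int cols = extent.column();     // 64

  // GEMM problem shape (M, N, K) with named accessors.
  cutlass::gemm::GemmCoord problem(128, 64, 256);
  int k = problem.k();            // 256

  // Coordinates support element-wise arithmetic.
  cutlass::MatrixCoord offset(8, 8);
  cutlass::MatrixCoord sum = extent + offset;   // (136, 72)

  (void)rows; (void)cols; (void)k; (void)sum;
}
```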
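array.h and functional.h are intended to be used together: functional.h specializes the basic arithmetic functors for `Array<T, N>` so they apply element-wise to register-held arrays. A sketch under that assumption (the `axpy` helper is introduced here for illustration):

```cpp
#include "cutlass/cutlass.h"
#include "cutlass/array.h"
#include "cutlass/functional.h"

CUTLASS_HOST_DEVICE
cutlass::Array<float, 4> axpy(float alpha,
                              cutlass::Array<float, 4> const &x,
                              cutlass::Array<float, 4> const &y) {
  // multiply_add is specialized for Array<T, N>, applying the
  // fused multiply-add element-wise across the array.
  cutlass::multiply_add<cutlass::Array<float, 4>> mad;

  cutlass::Array<float, 4> alpha_vec;
  alpha_vec.fill(alpha);

  return mad(alpha_vec, x, y);   // alpha * x + y, element-wise
}
```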
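The tools/util headers host_tensor.h and tensor_fill.h are commonly combined as below. This sketch assumes the CUTLASS 2.x HostTensor interface; the extent is chosen arbitrarily:

```cpp
#include "cutlass/numeric_types.h"
#include "cutlass/layout/matrix.h"
#include "cutlass/util/host_tensor.h"
#include "cutlass/util/reference/host/tensor_fill.h"

int host_tensor_example() {
  // Rank-2 tensor of half-precision elements in column-major layout.
  // HostTensor allocates matching host and device buffers.
  cutlass::HostTensor<cutlass::half_t, cutlass::layout::ColumnMajor> A({128, 64});

  // Fill the host allocation with a constant, then copy it to the device.
  cutlass::reference::host::TensorFill(A.host_view(), cutlass::half_t(1.0f));
  A.sync_device();

  // A.device_data() / A.device_ref() can now be passed to CUTLASS kernels.
  return 0;
}
```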
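For the device-level GEMM template in include/cutlass/gemm/device/gemm.h, the sketch below runs a single-precision, column-major GEMM with all remaining template parameters left at their CUTLASS 2.x defaults. The pointers, leading dimensions, and problem size are supplied by the caller:

```cpp
#include "cutlass/gemm/device/gemm.h"

// Single-precision GEMM, all operands column-major; accumulator type,
// operator class, architecture, and tile shapes take their defaults.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C / D

cutlass::Status run_sgemm(int M, int N, int K,
                          float alpha, float const *A, int lda,
                          float const *B, int ldb,
                          float beta, float *C, int ldc) {
  Gemm gemm_op;

  // D = alpha * A * B + beta * C, computed in place (D aliases C here).
  Gemm::Arguments args({M, N, K},
                       {A, lda}, {B, ldb},
                       {C, ldc}, {C, ldc},
                       {alpha, beta});

  return gemm_op(args);   // launches the kernel on the default stream
}
```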
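The same device-level template can target Tensor Cores by specifying the operator class, architecture, tile shapes, and an epilogue functor from linear_combination.h explicitly. The parameter ordering and tile sizes below follow common CUTLASS 2.x SM75 configurations but have not been validated here and should be treated as an assumption:

```cpp
#include "cutlass/gemm/device/gemm.h"
#include "cutlass/epilogue/thread/linear_combination.h"

// FP16 inputs, FP32 accumulation and output, targeting SM75 Tensor Cores.
using TensorOpGemm = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::RowMajor,      // A
    cutlass::half_t, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::RowMajor,                // C / D
    float,                                           // accumulator
    cutlass::arch::OpClassTensorOp,                  // use Tensor Cores
    cutlass::arch::Sm75,                             // target architecture
    cutlass::gemm::GemmShape<128, 128, 32>,          // threadblock tile
    cutlass::gemm::GemmShape<64, 64, 32>,            // warp tile
    cutlass::gemm::GemmShape<16, 8, 8>,              // instruction shape
    cutlass::epilogue::thread::LinearCombination<
        float, 4, float, float>>;                    // D = alpha*accum + beta*C
```

Instantiation and launch then follow the same Arguments-based pattern shown in the single-precision sketch above.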