Update on "Support different NSE in batches of CSR and CSC tensors"

This PR enables batched CSR/CSC tensors that batches may have different NSE counts. For instance, with the current master we have ```python >>> a = torch.tensor([[[1, 2], [3, 4]], [[0, 12], [21, 0]]]) >>> a.to_sparse_csr() Traceback (most recent call last): File "<stdin>", line 1, in <module> RuntimeError: Expect the same number of specified elements per batch. ``` because the NSE of the first and second batches are different, 4 and 2, respectively. This PR implements a strided-to-sparse-CSR/CSC conversion algorithm that supports CSR/CSC batches with different NSE counts. For instance: ```python >>> a = torch.tensor([[[1, 2], [3, 4]], [[0, 12], [21, 0]]]) >>> b = a.to_sparse_csr() >>> b tensor(crow_indices=tensor([[0, 2, 4], [0, 1, 2]]), col_indices=tensor([[0, 1, 0, 1], [1, 0, 0, 0]]), values=tensor([[ 1, 2, 3, 4], [12, 21, 0, 0]]), size=(2, 2, 2), nnz=4, layout=torch.sparse_csr) >>> b[0] tensor(crow_indices=tensor([0, 2, 4]), col_indices=tensor([0, 1, 0, 1]), values=tensor([1, 2, 3, 4]), size=(2, 2), nnz=4, layout=torch.sparse_csr) >>> b[1] tensor(crow_indices=tensor([0, 1, 2]), col_indices=tensor([1, 0]), values=tensor([12, 21]), size=(2, 2), nnz=2, layout=torch.sparse_csr) ``` that is, if the NSE of a batch is smaller than the maximum NSE over all batches, the corresponding rows in `col_indices`/`values` are padded with zeros as placeholders. Algorithms on batched CSR/CSC tensors must not access the padded parts of these tensors, that is, the algorithms should use the last element of the corresponding `crow_indices` row as the NSE value rather than the value of `.values().shape[0]` that holds the maximum NSE over all batches. Performance-wise, the strided-to-sparse-CSR/CSC conversion algorithms in master and in this PR, are roughly equivalent: ```python # master branch: n [2]: a = torch.rand(10, 10, 1000, 1000) In [3]: a = torch.where(a==0, 0.1, a) # required for master, optional for the PR In [4]: %timeit a.to_sparse_csr() 2.25 s ± 9.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [5]: a_cuda = a.cuda() In [6]: %timeit a_cuda.to_sparse_csr() 55.2 ms ± 6.95 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) ``` ```python # this PR In [2]: a = torch.rand(10, 10, 1000, 1000) In [3]: a = torch.where(a==0, 0.1, a) # required for master, optional for the PR In [4]: %timeit a.to_sparse_csr() 2.12 s ± 2.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [5]: a_cuda = a.cuda() In [6]: %timeit a_cuda.to_sparse_csr(); torch.cuda.synchronize() 47.2 ms ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) ``` The performance of `to_sparse_csr()` on CUDA tensors increased by 15% with this PR. A strided-to-sparse-BSR/BSC conversion with variable NSE support will be implemented as a follow-up. [ghstack-poisoned]
pytorch · Sep 19, 2022 · 18f4bfc · 18f4bfc
2 parents 90fe9e9 + 60375a1
commit 18f4bfc
Show file tree

Hide file tree

Showing 562 changed files with 16,463 additions and 14,294 deletions.
diff --git a/.circleci/docker/build.sh b/.circleci/docker/build.sh
@@ -379,7 +379,7 @@ docker build \
        --build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \
        --build-arg "KATEX=${KATEX:-}" \
        --build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \
-       --build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx900;gfx906}" \
+       --build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx906}" \
        --build-arg "IMAGE_NAME=${IMAGE_NAME}" \
        --build-arg "UCX_COMMIT=${UCX_COMMIT}" \
        --build-arg "UCC_COMMIT=${UCC_COMMIT}" \

diff --git a/.circleci/docker/common/install_cudnn.sh b/.circleci/docker/common/install_cudnn.sh
@@ -4,7 +4,13 @@ if [[ ${CUDNN_VERSION} == 8 ]]; then
     # cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
     mkdir tmp_cudnn && cd tmp_cudnn
     CUDNN_NAME="cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive"
-    curl -OLs  https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.2/local_installers/11.5/${CUDNN_NAME}.tar.xz
+    if [[ ${CUDA_VERSION:0:4} == "11.7" ]]; then
+        CUDNN_NAME="cudnn-linux-x86_64-8.5.0.96_cuda11-archive"
+        curl -OLs https://ossci-linux.s3.amazonaws.com/${CUDNN_NAME}.tar.xz
+    else
+        curl -OLs  https://developer.download.nvidia.com/compute/redist/cudnn/v8.3.2/local_installers/11.5/${CUDNN_NAME}.tar.xz
+    fi
+
     tar xf ${CUDNN_NAME}.tar.xz
     cp -a ${CUDNN_NAME}/include/* /usr/include/
     cp -a ${CUDNN_NAME}/include/* /usr/local/cuda/include/

diff --git a/.circleci/docker/ubuntu-cuda/Dockerfile b/.circleci/docker/ubuntu-cuda/Dockerfile
@@ -118,6 +118,7 @@ COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
 
 # Install CUDNN
 ARG CUDNN_VERSION
+ARG CUDA_VERSION
 COPY ./common/install_cudnn.sh install_cudnn.sh
 RUN if [ "${CUDNN_VERSION}" -eq 8 ]; then bash install_cudnn.sh; fi
 RUN rm install_cudnn.sh

diff --git a/.circleci/scripts/windows_cudnn_install.sh b/.circleci/scripts/windows_cudnn_install.sh
@@ -18,7 +18,7 @@ case ${CUDA_VERSION} in
         ;;
     11.7)
         # Use cudnn8.3 with hard-coded cuda11.5 version
-        cudnn_file_name="cudnn-windows-x86_64-8.3.2.44_cuda11.5-archive"
+        cudnn_file_name="cudnn-windows-x86_64-8.5.0.96_cuda11-archive"
         ;;
     *)
         echo "CUDA_VERSION: ${CUDA_VERSION} not supported yet"

diff --git a/.github/ci_commit_pins/torchdynamo.txt b/.github/ci_commit_pins/torchdynamo.txt
@@ -1 +1 @@
-ffd056dc1510bdfecafb689ed87601055694f3e6
+41c44bc1d080d6cf063419a4166732b983b84eef
diff --git a/.github/ci_commit_pins/vision.txt b/.github/ci_commit_pins/vision.txt
@@ -1 +1 @@
-a67cc87a33a3f713aebf5299bdeb2672c98e0bc5
+a4f53308b2d0f1aa9191686e326f45c26053f686
diff --git a/.github/ci_commit_pins/xla.txt b/.github/ci_commit_pins/xla.txt
@@ -1 +1 @@
-e0dcc3171c8024ab288551d105fba24fbfae7332
+307af4313d2b0b0236618ef837959a41068cc272
diff --git a/.github/merge_rules.yaml b/.github/merge_rules.yaml
@@ -3,6 +3,7 @@
   - .jenkins/caffe2/*
   - aten/src/ATen/core/interned_strings.h
   - docs/source/onnx.rst
+  - docs/source/onnx*
   - docs/source/scripts/onnx/**
   - scripts/onnx/**
   - test/jit/test_export_modes.py
@@ -15,6 +16,8 @@
   - torch/csrc/jit/serialization/onnx.*
   - torch/csrc/onnx/**
   - torch/onnx/**
+  - third_party/onnx
+  - caffe2/python/onnx/**
   approved_by:
   - BowenBao
   - abock
@@ -323,6 +326,7 @@
   - '*'
   approved_by:
   - pytorch/metamates
+  - mruberry
   mandatory_checks_name:
   - Facebook CLA Check
   - Lint

diff --git a/.github/scripts/generate_binary_build_matrix.py b/.github/scripts/generate_binary_build_matrix.py
@@ -13,7 +13,7 @@
 from typing import Dict, List, Tuple, Optional
 
 
-CUDA_ARCHES = ["10.2", "11.3", "11.6", "11.7"]
+CUDA_ARCHES = ["10.2", "11.6", "11.7"]
 
 
 ROCM_ARCHES = ["5.1.1", "5.2"]

diff --git a/.github/scripts/generate_ci_workflows.py b/.github/scripts/generate_ci_workflows.py
@@ -207,15 +207,6 @@ class OperatingSystem:
     ),
 ]
 WINDOWS_BINARY_SMOKE_WORKFLOWS = [
-    BinaryBuildWorkflow(
-        os=OperatingSystem.WINDOWS,
-        package_type="wheel",
-        build_configs=generate_binary_build_matrix.generate_wheels_matrix(
-            OperatingSystem.WINDOWS,
-            arches=["11.3"],
-            python_versions=["3.7"]),
-        branches="master",
-    ),
     BinaryBuildWorkflow(
         os=OperatingSystem.WINDOWS,
         package_type="libtorch",

diff --git a/.github/workflows/_linux-test.yml b/.github/workflows/_linux-test.yml
@@ -117,6 +117,7 @@ jobs:
           NUM_TEST_SHARDS: ${{ matrix.num_shards }}
           PR_BODY: ${{ github.event.pull_request.body }}
           SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
+          SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
           SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }}
           DOCKER_IMAGE: ${{ inputs.docker-image }}
           XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}
@@ -171,6 +172,7 @@ jobs:
             -e PR_LABELS \
             -e MAX_JOBS="$(nproc --ignore=2)" \
             -e SCCACHE_BUCKET \
+            -e SCCACHE_S3_KEY_PREFIX \
             -e XLA_CUDA \
             -e XLA_CLANG_CACHE_S3_BUCKET_NAME \
             --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \

diff --git a/.github/workflows/_mac-build.yml b/.github/workflows/_mac-build.yml
@@ -33,6 +33,21 @@ on:
         default: "3.8"
         description: |
           The python version to be used. Will be 3.8 by default
+      test-matrix:
+        required: false
+        type: string
+        description: |
+          An option JSON description of what test configs to run later on. This
+          is moved here from the Linux test workflow so that we can apply filter
+          logic using test-config labels earlier and skip unnecessary builds
+
+    outputs:
+      test-matrix:
+        value: ${{ inputs.test-matrix }}
+        description: An optional JSON description of what test configs to run later on.
+      build-outcome:
+        value: ${{ jobs.build.outputs.build-outcome }}
+        description: The outcome of the build step. This is used to influence test filtering logic later on.
 
     secrets:
       MACOS_SCCACHE_S3_ACCESS_KEY_ID:
@@ -52,6 +67,8 @@ jobs:
       AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
       AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
       BUILD_ENVIRONMENT: ${{ inputs.build-environment }}
+    outputs:
+      build-outcome: ${{ steps.build.outcome }}
     steps:
       # [see note: pytorch repo ref]
       - name: Checkout PyTorch
@@ -90,21 +107,31 @@ jobs:
         with:
           github-token: ${{ secrets.GITHUB_TOKEN }}
 
+      # Apply the filter logic to the build step too if the test-config label is already there
+      - name: Select all requested test configurations (if the test matrix is available)
+        id: filter
+        uses: ./.github/actions/filter-test-configs
+        with:
+          github-token: ${{ secrets.GITHUB_TOKEN }}
+          test-matrix: ${{ inputs.test-matrix }}
+
       - name: Build
+        if: steps.filter.outputs.is-test-matrix-empty == 'False' || inputs.test-matrix == ''
+        id: build
         env:
           OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
         run: |
           echo "CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname "$(which conda)")/../"}" >> "${GITHUB_ENV}"
           ${CONDA_RUN} .jenkins/pytorch/macos-build.sh
 
       - name: Archive artifacts into zip
-        if: inputs.build-generates-artifacts
+        if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped'
         run: |
           zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .pytorch-test-times.json
 
       - name: Store PyTorch Build Artifacts on GHA
         uses: actions/upload-artifact@v2
-        if: inputs.build-generates-artifacts
+        if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped'
         with:
           name: ${{ env.BUILD_ENVIRONMENT }}
           retention-days: 14
@@ -114,7 +141,7 @@ jobs:
       - name: Upload sccache stats to GHA
         uses: actions/upload-artifact@v2
         # Only if sccache is installed, see above
-        if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }}
+        if: ${{ (github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository) && steps.build.outcome != 'skipped' }}
         with:
           name: sccache-stats-${{ inputs.build-environment }}-runattempt${{ github.run_attempt }}-${{ steps.get-job-id.outputs.job-id }}
           retention-days: 14

diff --git a/.github/workflows/_mac-test.yml b/.github/workflows/_mac-test.yml
@@ -33,16 +33,38 @@ on:
         description: secret acess key for test stats upload
 
 jobs:
+  # This needs to be run right before the test starts so that it can gather the
+  # latest labels from the PR
+  filter:
+    runs-on: [self-hosted, linux.large]
+    outputs:
+      test-matrix: ${{ steps.filter.outputs.test-matrix }}
+      is-test-matrix-empty: ${{ steps.filter.outputs.is-test-matrix-empty }}
+    steps:
+      - name: Checkout PyTorch
+        uses: pytorch/pytorch/.github/actions/checkout-pytorch@master
+        with:
+          fetch-depth: 1
+          submodules: false
+
+      - name: Select all requested test configurations
+        id: filter
+        uses: ./.github/actions/filter-test-configs
+        with:
+          github-token: ${{ secrets.GITHUB_TOKEN }}
+          test-matrix: ${{ inputs.test-matrix }}
+
   test:
-    # Don't run on forked repos.
-    if: github.repository_owner == 'pytorch'
+    needs: filter
+    # Don't run on forked repos or empty test matrix
+    if: github.repository_owner == 'pytorch' && needs.filter.outputs.is-test-matrix-empty == 'False'
     # For setup-miniconda, see https://github.com/conda-incubator/setup-miniconda/issues/179
     # Also ensure that we always run with the right architecture
     defaults:
       run:
         shell: arch -arch ${{ inputs.arch }} bash -e -l {0}
     strategy:
-      matrix: ${{ fromJSON(inputs.test-matrix) }}
+      matrix: ${{ fromJSON(needs.filter.outputs.test-matrix) }}
       fail-fast: false
     runs-on: ${{ matrix.runner }}
     timeout-minutes: 240