Densely Connected Convolutional Networks

Apart from the academic report, this readme mainly discusses technical issues encountered and handled.

Building the latest TensorFlow against CUDA 12.6 and cuDNN 9

We ran into significant problems with TensorFlow 2.4: intolerable silent model compilation, poor default gradient optimization, easy-to-hit NaN pitfalls, confusing programming interfaces, and so on.

Therefore, we moved to TensorFlow 2.10 or later, which offers a much better out-of-the-box experience. While TF 2.10 is capable of doing the job, we wanted to explore how TensorFlow has evolved since it abandoned Windows CUDA support (just after 2.10). To harness the full power of the GPU, the newest compatible driver is preferred. The Tesla T4 has a compute capability of 7.5, which is well supported by recent releases: CUDA 12.6 and cuDNN 9.

Following the build guide, we installed clang 17 (with llvm and lld), gcc 14 (with libstdc++) with C++17 enabled, and Bazel 6.5.0. The Bazel version cannot be changed easily: 6.5.0 modified some key APIs (label, refer to the 6th bullet point) that TensorFlow uses widely, so switching to either an older or a newer version invalidates the build files that TensorFlow provides.

git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow

Run the configure script.

chmod +x ./configure
./configure

Through trial and error, we found that the most recent TensorFlow build supports CUDA 12.6.1 and cuDNN 9.5.0, each one minor version behind the newest release (as of 11/15/24).

Then call Bazel to build. We did not know much about Bazel, but it behaved like CMake, so we treated it as a build tool that builds against the given configuration.

bazel build //tensorflow/tools/pip_package:wheel --repo_env=WHEEL_NAME=tensorflow --config=cuda --config=cuda_wheel

During build, a problem might emerge:

__float128 is not supported on this target

This error is raised while building CUDA-related files. Normally it should not happen, because the __STRICT_ANSI__ guard, which is enabled by specifying C++17, should have disabled the float128 code; but as this thread says, GCC is the main cause of this error. Since float128 is barely needed in normal use of CUDA and TensorFlow, we can (temporarily) disable the related code in the standard library by hand:

// /usr/include/c++/14.2.1/limits
...
#if defined(_GLIBCXX_USE_FLOAT128)
// We either need Q literal suffixes, or IEEE double.
#if ! defined(__STRICT_ANSI__) || defined(_GLIBCXX_DOUBLE_IS_IEEE_BINARY64)
...

To:

#if ! defined(_GLIBCXX_USE_FLOAT128)
// We either need Q literal suffixes, or IEEE double.
#if ! defined(__STRICT_ANSI__) || defined(_GLIBCXX_DOUBLE_IS_IEEE_BINARY64)

Simply inverting the macro check prevents the float128-related code from being compiled. The build then succeeded. Since the change is meant to be temporary, restore the header after the build. Installing the built package with pip finishes the process.

pip install bazel-bin/tensorflow/tools/pip_package/wheel_house/tensorflow-version-tags.whl

import tensorflow as tf
tf.__version__

should return 2.19.0.
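Beyond the version string, it can be worth confirming that the wheel was really built with CUDA and can see the GPU. The following is a minimal sketch, not part of the original build steps: `describe_build` is a hypothetical helper name, and the keys returned by `tf.sysconfig.get_build_info()` may differ between builds, so they are read with `.get`.

```python
import tensorflow as tf


def describe_build() -> dict:
    """Collect version and CUDA build details of the installed TensorFlow."""
    info = tf.sysconfig.get_build_info()  # dict-like; keys may vary by build
    return {
        "tf_version": tf.__version__,
        "is_cuda_build": tf.test.is_built_with_cuda(),
        "cuda_version": info.get("cuda_version"),    # None if the key is absent
        "cudnn_version": info.get("cudnn_version"),  # None if the key is absent
        "gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }


report = describe_build()
for key, value in report.items():
    print(f"{key}: {value}")
```

An empty `gpus` list on a machine with a working driver would suggest that the wheel or the runtime libraries do not match the installed CUDA version.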

TensorFlow's Changes Regarding XLA

XLA

In recent versions of TensorFlow, it is encouraged (and mostly the default) to statically build and train the model with XLA (Accelerated Linear Algebra), a technology that simplifies and merges computational graphs to reduce the number of ops executed on compute-intensive hardware.

Generating a computational graph involves compilation, which is a major field of study in computer science. XLA uses LLVM as its compilation backend (lexer, parser, IR, etc.) and defines an operation set called StableHLO:

func.func @main(
  %image: tensor<28x28xf32>,
  %weights: tensor<784x10xf32>,
  %bias: tensor<1x10xf32>
) -> tensor<1x10xf32> {
  %0 = "stablehlo.reshape"(%image) : (tensor<28x28xf32>) -> tensor<1x784xf32>
  %1 = "stablehlo.dot"(%0, %weights) : (tensor<1x784xf32>, tensor<784x10xf32>) -> tensor<1x10xf32>
  %2 = "stablehlo.add"(%1, %bias) : (tensor<1x10xf32>, tensor<1x10xf32>) -> tensor<1x10xf32>
  %3 = "stablehlo.constant"() {value = dense<0.0> : tensor<1x10xf32>} : () -> tensor<1x10xf32>
  %4 = "stablehlo.maximum"(%2, %3) : (tensor<1x10xf32>, tensor<1x10xf32>) -> tensor<1x10xf32>
  "func.return"(%4): (tensor<1x10xf32>) -> ()
}

It's just like a mix of assembly (regarding register and function notations) and ONNX (direct representation of inputs, operations and outputs).
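To make the listing easier to read, here is a NumPy sketch of the same computation, one line per StableHLO op. It only mirrors what the ops compute (reshape, dot, add, maximum); it says nothing about how XLA compiles or fuses them, and the random inputs are placeholders.

```python
import numpy as np


def main(image: np.ndarray, weights: np.ndarray, bias: np.ndarray) -> np.ndarray:
    flat = image.reshape(1, 784)                    # stablehlo.reshape: 28x28 -> 1x784
    logits = flat @ weights                         # stablehlo.dot: 1x784 . 784x10 -> 1x10
    shifted = logits + bias                         # stablehlo.add
    zeros = np.zeros((1, 10), dtype=np.float32)     # stablehlo.constant
    return np.maximum(shifted, zeros)               # stablehlo.maximum, i.e. a ReLU


# Placeholder inputs with the shapes declared in the StableHLO signature.
rng = np.random.default_rng(0)
image = rng.standard_normal((28, 28)).astype(np.float32)
weights = rng.standard_normal((784, 10)).astype(np.float32)
bias = rng.standard_normal((1, 10)).astype(np.float32)

out = main(image, weights, bias)
print(out.shape)  # (1, 10)
```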

TensorFlow

On TensorFlow's side, there are some noticeable changes. Behaviourally, during the first training and validation epoch (training and testing use different computational graphs), the program compiles the model with XLA, introducing a noticeable lag. Programmatically, XLA is insensitive to the random seeds (this page does not seem to have an English counterpart, so translate it if you want to read it), so stateless randomness has to be introduced. More generally, TensorFlow also strongly recommends stateless RNGs (Random Number Generators) over the stateful tf.random functions.

Consider a stateless dropout:

import tensorflow as tf
from keras.layers import Layer


class Dropout(Layer):
    def __init__(self, rate: float, seed: list[int]):
        super().__init__()
        self.rate = rate
        # Shape-[2] integer seed, e.g. [1, 0].
        self.seed = seed

    def call(self, x, training=False):
        if training:
            # Public alias of the stateless dropout op.
            return tf.nn.experimental.stateless_dropout(
                x, rate=self.rate, seed=self.seed
            )
        return x

With a fixed global seed such as [1, 0], results stay as consistent as possible across runs. However, consistency might not be guaranteed across different hardware.
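One consequence is worth spelling out: a stateless op is fully determined by its seed and input shape, so a layer that holds one fixed seed draws the same dropout mask on every call. A possible way around this, sketched below as an assumption rather than as what this repository does, is to mix a step counter into the seed with `tf.random.experimental.stateless_fold_in`. The names `BASE_SEED`, `dropout_at_step` and `step` are hypothetical.

```python
import tensorflow as tf

BASE_SEED = tf.constant([1, 0], dtype=tf.int32)  # shape-[2] global seed


def dropout_at_step(x: tf.Tensor, step: int, rate: float = 0.5) -> tf.Tensor:
    """Stateless dropout whose mask depends only on (BASE_SEED, step)."""
    # Mix the step counter into the base seed to obtain a new shape-[2] seed.
    step_seed = tf.random.experimental.stateless_fold_in(BASE_SEED, step)
    return tf.nn.experimental.stateless_dropout(x, rate=rate, seed=step_seed)


x = tf.ones([4, 8])
a = dropout_at_step(x, step=0)
b = dropout_at_step(x, step=0)  # same seed and step: identical mask
c = dropout_at_step(x, step=1)  # different step: a different seed is used

print(bool(tf.reduce_all(a == b)))  # True
```

Runs remain reproducible because the mask at a given step depends only on the base seed and the step number.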

Datasets

The original paper uses CIFAR-10, CIFAR-100 and SVHN for small-image training and benchmarks. The CIFAR datasets are directly available from TensorFlow (through Keras):

import keras

cifar = 10  # choose 10 or 100
if cifar == 10:
    (x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
elif cifar == 100:
    (x_train, y_train), (x_test, y_test) = keras.datasets.cifar100.load_data()

SVHN is available on Hugging Face, uploaded by the UFLDL lab at Stanford (official):

from datasets import load_dataset

# https://huggingface.co/datasets/ufldl-stanford/svhn
ds_train, ds_test = load_dataset(
    "ufldl-stanford/svhn", "cropped_digits", split=["train", "test"]
)

Gradient Checkpointing

DenseNets are memory-hungry because each layer concatenates all preceding feature maps into its input, so a naive implementation scatters many intermediate copies of these features across device memory. Storing the feature maps contiguously lets them be read as a whole at a significantly lower cost, and the intermediate results no longer need to be kept: with a technique called gradient checkpointing, the forward pass of a critical (memory-hungry) section is rerun during backprop instead of keeping its activations in memory.
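The memory growth can be estimated with a back-of-the-envelope channel count. The sketch below assumes a dense block whose input has `k0` channels and whose layers each add `k` channels (the growth rate). It compares keeping every layer's concatenated input as a separate copy against storing each feature map once. The function names and the values `k0=24`, `k=12`, `num_layers=16` are illustrative assumptions, not measurements from this repository.

```python
def layer_input_channels(k0: int, k: int, num_layers: int) -> list[int]:
    """Channels of the concatenated input seen by each layer of a dense block."""
    return [k0 + k * i for i in range(num_layers)]


def naive_channels_stored(k0: int, k: int, num_layers: int) -> int:
    """Total channels kept if every concatenated input is its own copy."""
    return sum(layer_input_channels(k0, k, num_layers))


def shared_channels_stored(k0: int, k: int, num_layers: int) -> int:
    """Total channels kept if each feature map is stored exactly once."""
    return k0 + k * num_layers


k0, k, num_layers = 24, 12, 16  # illustrative values only
naive = naive_channels_stored(k0, k, num_layers)
shared = shared_channels_stored(k0, k, num_layers)
print(naive, shared)  # 1824 216
```

The naive count grows quadratically with the number of layers while the shared count grows linearly, which is the gap that recomputation aims to close.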

In a nutshell, gradient checkpointing trades some extra computation time for a large saving in memory, especially in this DenseNet case.

from functools import partial
from tensorflow import recompute_grad


class SingleDenseBlock(Layer):
    ...
    def _call(self, x, training):
        y = self.bottleneck_layer(x, training=training)
        y = self.dense_layer(y, training=training)
        return y

    def _me_call(self, x, training):
        return recompute_grad(partial(self._call, training=training))(x)

    def call(self, x, training=False):
        if self.memory_efficient:
            return self._me_call(x, training)
        return self._call(x, training)
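A quick sanity check is that wrapping a function in `tf.recompute_grad` leaves its gradients unchanged, since only the way intermediate activations are obtained during backprop differs. The sketch below uses a small stand-in function without trainable weights; `block` and `input_gradient` are hypothetical names, and the real dense block would take their place.

```python
import tensorflow as tf


def block(x: tf.Tensor) -> tf.Tensor:
    """Stand-in for a memory-hungry section (hypothetical, no trainable weights)."""
    return tf.nn.relu(tf.sin(x) * tf.square(x) + 1.0)


def input_gradient(fn, x: tf.Tensor) -> tf.Tensor:
    """Gradient of sum(fn(x)) with respect to x."""
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.reduce_sum(fn(x))
    return tape.gradient(loss, x)


x = tf.constant([[0.5, -1.0, 2.0]])
grad_plain = input_gradient(block, x)
# Same function, but its forward pass is rerun when gradients are requested.
grad_ckpt = input_gradient(tf.recompute_grad(block), x)

print(bool(tf.reduce_all(tf.abs(grad_plain - grad_ckpt) < 1e-6)))
```

The memory saving itself does not show up in a toy example like this; it only becomes visible with large activations, for instance when profiling the full model on the GPU.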
