cryptoint 20250414

D. J. Bernstein

## Introduction

cryptoint is an almost-header-only library providing functions for
comparisons, bit extractions, etc. on `{int,uint}{8,16,32,64}`, while
trying to protect against compilers introducing timing variations.

Advantages over previous library functions for basic constant-time
operations:

* cryptoint provides more functions.
* cryptoint provides better protection against current compiler
  "optimizations". (This does not mean the protection is a guarantee.
  Make sure to apply further tests such as TIMECOP to the compiled code.)
* All cryptoint functions, after compilation with some common compilers
  for some common architectures, have been verified using
  [saferewrite](https://pqsrc.cr.yp.to/downloads.html)
  to match reference implementations for all inputs.

Often applications have their own ad-hoc constant-time code. Advantages
of rewriting those to use the centralized cryptoint functions:

* More likely to generate code that's
  [actually constant-time](https://blog.cr.yp.to/20240803-clang.html).
* Less code to fix in response to whatever further damage is caused by
  compiler "optimizations".
* Often less CPU time, although it's rare for this to matter.
* Better testing.

cryptoint is used inside
[SUPERCOP](https://bench.cr.yp.to/supercop.html),
[lib25519](https://lib25519.cr.yp.to),
[libmceliece](https://lib.mceliece.org),
[libntruprime](https://libntruprime.cr.yp.to),
and
[OpenSSH](https://github.com/openssh/openssh-portable/blob/master/sntrup761.c).

## Usage

To use (e.g.) `crypto_int64` in your own package, simply copy
`crypto_int64.h` and `int64_optblocker.c` into that package. Compilation
recommendations:

* Use `gcc` or `clang`. (Porting to other compilers should be a simple
  matter of compiling with `-D__attribute__(x)=`; however, tests have
  been carried out only with `gcc` and `clang`.)

* Compile all code with `-fwrapv`. (This disables some compiler
  "optimizations" that often trigger bugs in integer arithmetic. These
  "optimizations" have very little effect on performance.)

* Compile `*optblocker.c` separately: don't manually merge
  `*optblocker.c` into other files; don't use the `-flto` option in
  compiling `*optblocker.c`.

`crypto_{int,uint}64.h` define types `crypto_{int,uint}64` and the
following API functions:

| usage                                           | meaning |
| -----                                           | ----- |
| `z = crypto_{int,uint}64_load(ptr)`             | little-endian load |
| `crypto_{int,uint}64_store(ptr,z)`              | little-endian store |
| `z = crypto_{int,uint}64_load_bigendian(ptr)`   | big-endian load |
| `crypto_{int,uint}64_store_bigendian(ptr,z)`    | big-endian store |
| `z = crypto_int64_positive_mask(x)`             | `z = -(x > 0) |
| `z = crypto_int64_positive_01(x)`               | `z =  (x > 0) |
| `z = crypto_int64_negative_mask(x)`             | `z = -(x < 0) |
| `z = crypto_int64_negative_01(x)`               | `z =  (x < 0) |
| `z = crypto_int64_topbit_mask(x)`               | `z = -(x < 0) |
| `z = crypto_int64_topbit_01(x)`                 | `z =  (x < 0) |
| `z = crypto_uint64_topbit_mask(x)`              | `z = -(x >> 63) |
| `z = crypto_uint64_topbit_01(x)`                | `z =  (x >> 63) |
| `z = crypto_{int,uint}64_nonzero_mask(x)`       | `z = -(x != 0) |
| `z = crypto_{int,uint}64_nonzero_01(x)`         | `z =  (x != 0) |
| `z = crypto_{int,uint}64_zero_mask(x)`          | `z = -(x == 0) |
| `z = crypto_{int,uint}64_zero_01(x)`            | `z =  (x == 0) |
| `z = crypto_{int,uint}64_unequal_mask(x,y)`     | `z = -(x != y) |
| `z = crypto_{int,uint}64_unequal_01(x,y)`       | `z =  (x != y) |
| `z = crypto_{int,uint}64_equal_mask(x,y)`       | `z = -(x == y) |
| `z = crypto_{int,uint}64_equal_01(x,y)`         | `z =  (x == y) |
| `z = crypto_{int,uint}64_smaller_mask(x,y)`     | `z = -(x < y) |
| `z = crypto_{int,uint}64_smaller_01(x,y)`       | `z =  (x < y) |
| `z = crypto_{int,uint}64_leq_mask(x,y)`         | `z = -(x <= y) |
| `z = crypto_{int,uint}64_leq_01(x,y)`           | `z =  (x <= y) |
| `z = crypto_{int,uint}64_min(x,y)`              | `z = (x < y) ? x : y` |
| `z = crypto_{int,uint}64_max(x,y)`              | `z = (x > y) ? x : y` |
| `crypto_{int,uint}64_minmax(&x,&y)`             | in-place `(x,y) = (min,max)` |
| `z = crypto_{int,uint}64_bottombit_mask(x)`     | `z = -(x & 1) |
| `z = crypto_{int,uint}64_bottombit_01(x)`       | `z =  (x & 1) |
| `z = crypto_{int,uint}64_shlmod(x,j)`           | `z = x << (j&63)` |
| `z = crypto_{int,uint}64_shrmod(x,j)`           | `z = x >> (j&63)` |
| `z = crypto_{int,uint}64_bitmod_mask(x,j)`      | `z = -((x >> (j&63)) & 1)` |
| `z = crypto_{int,uint}64_bitmod_01(x,j)`        | `z =  ((x >> (j&63)) & 1)` |
| `z = crypto_{int,uint}64_ones_num(x)`           | `z =` number of bits set in `x` (0 through 64) |
| `z = crypto_{int,uint}64_bottomzeros_num(x)`    | `z =` number of low-order 0 bits in `x` (0 through 64) |

There are also `bitinrangepublicpos` functions tested by `test.c` and
used internally. These are not part of the API; use `bitmod` instead.

Notes on the split between `mask` functions and `01` functions:

* The `01` functions are aligned with C's convention of representing
  true as `1` and false as `0`. For example, `x < y` in C, like
  `x < y ? 1 : 0`, means `1` if `x` is smaller than `y`, else `0`. You
  can replace this with `crypto_int64_smaller_01(x,y)` if `x` and `y`
  are 64-bit signed integers.
* The `mask` functions are aligned with a convention of representing
  true as `-1` and false as `0`, which works well with logic
  instructions. For example, you can rewrite the variable-time code
  `x < y ? u : v` (meaning `u` if `x < y`, else `v`) as
  `v ^ ((u ^ v) & crypto_int64_smaller_mask(x,y))`. For comparison,
  `v + (u - v) * crypto_int64_smaller_01(x,y)` would rely on
  multiplication taking constant time, but on some platforms
  multiplication takes variable time.

Beware that the `mask` convention, like any other use of negative
integers, isn't compatible with unsigned integer extension. For example,
conversion from `uint8` to `uint64` will convert `-1` to `255` rather
than to `-1`. This is an argument for using `int` rather than `uint`. On
the other hand, C allows compilers to damage the correctness of `int`
code in various ways that aren't allowed for `uint`. Compiling with
`-fwrapv`, as recommended above, disables some of that damage.

## Internals

cryptoint has two main defenses against timing variations being
introduced by compiler "optimizations".

The first defense is `optblocker`, which is a global `volatile` variable
containing 0. The usage of `optblocker` in cryptoint is designed to
systematically hide 1-bit data paths from compilers.

The second defense is assembly. This would be safest as separate `.s`
files, but the usability constraint of having only two files per size
(one `.h` file, one `optblocker.c` file) forces cryptoint to use inline
assembly instead. Currently cryptoint has assembly implementations of
various functions for

* `amd64` (64-bit AMD and Intel, aka `x86_64`),
* `arm64` (64-bit ARM, aka `aarch64`),
* `arm32` (32-bit ARM, not in Thumb mode), and
* `sparc32` (32-bit SPARC, still used in space applications).

These are selected automatically for `gcc` and `clang` using tests for
`__GNUC__`, `__x86_64__`, etc. Other platforms fall back automatically
to portable code using `optblocker`.

From an auditing perspective, reviewing cryptoint means checking code
for many different functions, and the usage of assembly makes this work
more difficult:

* Assembly implementations are separate for each size, and separate
  for each targeted platform: e.g., `crypto_int*_negative_mask` has
  not just portable code but also 16 assembly implementations (4
  sizes for each of `amd64`, `arm64`, `arm32`, `sparc32`).
* Assembly is generally less readable than C, making bugs more
  likely to escape the notice of authors and reviewers. Assembly also
  depends on quirks of the targeted instruction sets.
* Inline assembly has its own quirks, with a more complicated
  interface than the function ABI.

cryptoint takes the following steps to reduce the risk of bugs:

* cryptoint includes a new `readasm` tool that generates inline assembly
  from an easier-to-read format. (See `functions` for the source code in
  this format; `crypto*.h` are automatically generated.) This improves
  auditability. Also, the converter generates register annotations,
  avoiding some common classes of inline-assembly bugs.
* All cryptoint functions are subjected to a battery of conventional
  unit tests via `cryptoint/test.c` in SUPERCOP. Various functions are
  also indirectly tested via tests of implementations that use cryptoint.
* All cryptoint functions are also integrated into saferewrite, which,
  after compilation, uses symbolic execution and Z3 to check equivalence
  to reference implementations. This has been run with various compilers
  for `amd64`, `arm64`, `arm32`, `mips64`, `sparc32`, and `x86`; see
  `saferewrite-results`.

## Implementation notes

The portable version of `SIGNED_negative_mask` could instead use
`X >>= (N-1) ^ SIGNED_optblocker`.

The use of `optblocker` inside `bitinrangepublicpos` is meant to protect
against the compiler causing trouble if `S` is a compile-time constant
`63` (or in general 1 below the width), although this is unnecessary for
the current applications of `bitinrangepublicpos`.

The assembly implementations of `shlmod` and `shrmod` assume that shift
instructions take constant time. The portable implementation does not
make this assumption. This is important: for example, compilers for
32-bit platforms will typically produce `int64` shift code that takes
different time for different shift distances.

Various assembly implementations assume that conditional instructions,
such as conditional moves, take constant time.

The `amd64` implementation of `TYPE_nonzero_01` could use `set` instead
of `cmov`.

There are more ways to implement `TYPE_ones_num`. The ending could use
`*0x...10101`, but this would rely on multiplication instructions taking
constant time, which, as noted above, isn't true on some platforms. For
`__SSE4_2__` one could use `popcnt`. For `arm64` one could use `cnt` for
`cssc`, or NEON `cnt`.

For `TYPE_bottomzeros_num`, one could use `tzcnt` for `amd64` with
`bmi1`, or `ctz` for `arm64` with `cssc`.