Article by Ayman Alheraki on June 4 2026 12:30 PM
The advent of large-scale quantum computers poses an existential threat to current public-key cryptography (RSA, ECC). Shor's algorithm could factor large numbers and compute discrete logarithms in polynomial time, breaking the security of our digital infrastructure.
In response, NIST (National Institute of Standards and Technology) concluded a multi-year selection process. In 2024, they standardized several Post-Quantum Cryptography (PQC) algorithms. Among them, Module-Lattice-Based Cryptography stands out as the most balanced and versatile foundation.
Standardized algorithms (2024):
FIPS 203 (ML-KEM) – Module-Lattice-Based Key-Encapsulation Mechanism (formerly CRYSTALS-Kyber). Used for secure key exchange.
FIPS 204 (ML-DSA) – Module-Lattice-Based Digital Signature Algorithm (formerly CRYSTALS-Dilithium).
These algorithms are considered the successors to ECDH and ECDSA/RSA.
Lattice-based cryptography relies on the hardness of problems in high-dimensional lattices. The core assumption is the Module Learning With Errors (MLWE) problem.
Given a secret vector s and an error term e (small random noise), an attacker receives:
b = A·s + e (mod q)
Where:
A is a publicly known matrix of polynomials (module structure).
q is a large modulus (e.g., 3329 for ML-KEM).
s is the secret key.
e is the error.
Why is this hard? Without the error e, solving for s is simple linear algebra. With the error, the problem becomes NP-hard even for quantum computers. Meanwhile, the legitimate user who knows s can easily eliminate the noise.
Older lattice designs were either:
Ring-LWE (very fast but with more algebraic structure, potentially weaker).
General Lattice LWE (most secure but huge key sizes).
Module-LWE strikes the optimal balance:
Security similar to general lattices.
Performance similar to Ring-LWE.
Flexible parameters (rank k of the module).
For ML-KEM (Kyber), the module rank is k = 2 or k = 3 or k = 4, providing different security levels.
Let's break down how Module-Lattice works in practice for key encapsulation.
Sample a random matrix A from the module.
Sample small secret vector s and error e.
Compute t = A·s + e.
Public Key: (A, t)
Secret Key: s (and sometimes A)
Sample small random r and errors e1, e2.
Compute u = Aᵀ·r + e1 (ciphertext part 1)
Compute v = tᵀ·r + e2 + (message) (ciphertext part 2)
The shared secret is derived by hashing v.
Using secret s, the receiver computes v - sᵀ·u which (due to algebra) cancels out the lattice components, leaving the message plus small noise. The noise is then removed, and the shared secret is recovered.
Modern C++ (C++17/20/23) provides powerful features for writing both high-performance and safe cryptographic code.
Polynomial arithmetic over rings (e.g., Z_q[x]/(x^n + 1)).
Number Theoretic Transform (NTT) – For fast polynomial multiplication (O(n log n) instead of O(n²)).
Sampling – Discrete Gaussian or centered binomial distributions.
Hashing – SHA-3/SHAKE (standardized for PQC).
template<std::unsigned_integral T>class Poly {private: std::vector<T> coeffs; size_t n; // degree (e.g., 256) T q; // modulus (e.g., 3329) public: Poly(size_t degree, T modulus) : n(degree), q(modulus), coeffs(degree, 0) {} // NTT-friendly multiplication using convolution Poly multiply_ntt(const Poly& other) const { // In real implementation: convert to NTT domain, pointwise multiply, inverse NTT Poly result(n, q); // Simplified fallback (schoolbook for demonstration) for (size_t i = 0; i < n; ++i) { for (size_t j = 0; j < n - i; ++j) { result.coeffs[i + j] = (result.coeffs[i + j] + coeffs[i] * other.coeffs[j]) % q; } } return result; } // Add error vector (for MLWE) void add_noise(std::span<const T> error) { for (size_t i = 0; i < n; ++i) { coeffs[i] = (coeffs[i] + error[i]) % q; } } const T* data() const { return coeffs.data(); } size_t size() const { return n; }};
std::span – Safe, non-owning views for key material.
std::array – Compile-time fixed-size buffers (good for NTT twiddle factors).
constexpr – Precompute constants (e.g., roots of unity) at compile time.
concepts – Enforce type safety for integer moduli.
std::bit_cast – Type-punning for side-channel resistance.
struct ML_KEM_512_Params { static constexpr size_t k = 2; // module rank static constexpr size_t n = 256; // polynomial degree static constexpr size_t q = 3329; // modulus static constexpr size_t key_bytes = 32; // AES-256 key length};
template<typename Params>class ML_KEM { using PolyVec = std::array<Poly<uint16_t>, Params::k>; public: struct PublicKey { PolyVec A; PolyVec t; }; struct SecretKey { PolyVec s; // secret vector }; SecretKey keygen(PublicKey& pk) { // 1. Generate matrix A deterministically from seed // 2. Sample secret s and error e using binomial distribution // 3. Compute t = A·s + e // 4. Return secret key return SecretKey{}; }};
Cryptography is ultimately about constant-time execution – no branches or memory access patterns that depend on secret data. CPU registers are critical here.
Accessing RAM leaks timing information via cache misses. Modern CPUs have:
L1 cache: ~32KB, 4 cycles latency.
Registers: ~hundreds of bytes, 0 cycles latency, no cache side channel.
1. NTT butterflies (in registers):
// ARM64 NEON or x86 AVX2 – using 256-bit registers (YMM)// Each YMM register holds 16 x 16-bit coefficients (q=3329 fits in 16 bits)
__m256i butterfly(__m256i a, __m256i b, __m256i omega) { // All operations stay in registers – no memory access __m256i b_omega = _mm256_mullo_epi16(b, omega); // b * omega __m256i u = _mm256_add_epi16(a, b_omega); __m256i v = _mm256_sub_epi16(a, b_omega); return _mm256_permute2f128_si256(u, v, 0x20);}
2. Polynomial pointwise multiplication (SIMD registers):
Load 16 coefficients into YMM0.
Load 16 coefficients into YMM1.
Multiply (staying in registers).
Store back to memory only at the end.
3. Sampling randomness (CPU RNG instructions):
Modern x86 CPUs have RDRAND and RDSEED:
// Direct register-level randomnessuint64_t random_value;if (_rdrand64_step(&random_value)) { // random_value is now in a register (rax) // Use it for binomial noise generation}
The golden rule: If a secret touches memory, you may have lost. Good practice:
// BAD: secret-dependent memory accessuint8_t secret_lookup(const uint8_t* table, size_t secret_index) { return table[secret_index]; // Cache attack!}
// GOOD: secret stays in registers, use CMOV/maskinguint8_t constant_time_select(uint8_t a, uint8_t b, bool secret_cond) { uint8_t mask = -static_cast<uint8_t>(secret_cond); // all 1s if true return (a & ~mask) | (b & mask); // Lives in register, no branch}
x86-64: 16 general-purpose registers (RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP, R8-R15) + 16 AVX-512 vector registers (ZMM0-ZMM15).
ARM64: 31 general-purpose registers (X0-X30) + 32 NEON vector registers (V0-V31).
For ML-KEM-512, the entire polynomial multiplication can be performed without spilling to L1 cache – fitting in 16 vector registers.
// Portable (auto-vectorized by GCC/Clang -O2)void poly_add_portable(uint16_t* a, const uint16_t* b, size_t n) { for (size_t i = 0; i < n; ++i) { a[i] += b[i]; // Compiler may use SIMD registers automatically }}
// Explicit register control with intrinsics (AVX2)void poly_add_avx2(uint16_t* a, const uint16_t* b, size_t n) { for (size_t i = 0; i < n; i += 16) { __m256i va = _mm256_loadu_si256((__m256i*)(a + i)); __m256i vb = _mm256_loadu_si256((__m256i*)(b + i)); __m256i vsum = _mm256_add_epi16(va, vb); _mm256_storeu_si256((__m256i*)(a + i), vsum); }}
Modulo by q=3329 is expensive. Good implementations use Montgomery reduction or Barrett reduction – both implemented with shifts and multiplies, staying in registers:
// Barrett reduction for q=3329 (constant-time, no division)inline uint16_t barrett_reduce(uint32_t a) { const uint32_t q = 3329; const uint32_t v = ((1ULL << 26) + q/2) / q; // precomputed uint32_t t = ((uint64_t)a * v) >> 26; t = a - t * q; return t; // Now 0 <= t < 2q, ready for next operation}
| Concern | Mitigation |
|---|---|
| Timing attacks | All loops fixed iteration count; no secret-dependent branches |
| Stack clearing | secure_zero_memory using volatile or memset_s |
| Compiler optimizations | Use volatile or asm barriers to prevent dead-code removal |
| Register spilling | Mark critical functions __attribute__((target("avx2"))) |
| Power analysis | Randomize execution order (harder in C++; consider assembly) |
void secure_clear(std::span<uint8_t> secret) { volatile uint8_t* p = secret.data(); for (size_t i = 0; i < secret.size(); ++i) { p[i] = 0; // volatile prevents optimization } // Also zero registers if possible (x86: vxorps ymm0, ymm0, ymm0)}
// AVX2
constexpr uint16_t Q = 3329;constexpr size_t N = 256;
class KyberPoly { alignas(32) std::array<uint16_t, N> coeffs; public: void ntt() { /* In-place NTT using registers */ } void multiply(const KyberPoly& other) { // Pointwise multiplication in NTT domain for (size_t i = 0; i < N; i += 16) { __m256i va = _mm256_load_si256((__m256i*)(coeffs.data() + i)); __m256i vb = _mm256_load_si256((__m256i*)(other.coeffs.data() + i)); __m256i prod = _mm256_mullo_epi16(va, vb); _mm256_store_si256((__m256i*)(coeffs.data() + i), prod); } }};
int main() { KyberPoly a, s, e; // ... sampling, NTT conversion ... KyberPoly t = a.multiply(s); // t = a·s + e (simplified) // t is now the public key component return 0;}
Module-Lattice-Based Cryptography (ML-KEM, ML-DSA) is the future of secure communications. Its implementation in Modern C++ leverages:
Zero-cost abstractions (templates, spans) for compile-time polymorphism.
SIMD intrinsics and CPU registers for constant-time, high-speed polynomial arithmetic.
Memory-safe patterns (RAII, spans) to prevent leaks.
The tight integration with CPU register files – from general-purpose registers for loop counters to vector registers for NTT butterflies – is what makes lattice crypto possible on constrained devices and high-performance servers alike.
As quantum computers advance, every developer should understand these primitives. The transition is not optional; it is a matter of when, not if. Start experimenting with liboqs (Open Quantum Safe) or KEM-CSIKE today.
"The lattice is not just a mathematical object – it is a register-friendly, parallelizable, post-quantum fortress."