The proposed herein is a scalable high-radix (i.e.,
$2^m$2m) Montgomery Modular (MM) Multiplication circuit replacing the integer multiplications in each iteration of the Montgomery MM algorithm (related to the product of
$m$m bits of the multiplier and the multiplicand) with carry-save compressions and completely eliminating costly multiplications. Furthermore, the proposed Montgomery MM decomposes the multiplicand itself using a radix of
$2^w$2w with
$w\geq 2m$w≥2m, thereby achieving a scalable design, which can deliver an issue latency of one cycle and a cycle (count) latency of
$O(N^2/(wmp))$O(N2/(wmp)) where
$p$p denotes the number of available processing elements, each of which is designed to complete the above iteration by computing in part the product of
$w$w bits of the multiplicand and
$m$m bits of the multiplier. The area complexity of the proposed Montgomery MM is
$O(wmp)$O(wmp), and thus, the Area-Latency-Product complexity is
$O(N^{2})$O(N2).