Speeding Up Modular Exponentiation with DeepCover Secure Microcontrollers

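This application note describes how to improve modular exponentiation speed by more than 50% when using MAXQ microcontrollers that have a modular arithmetic accelerator (MAA).

Introduction
Modular exponentiation, $a^e \bmod m$, is a common operation in many cryptographic functions. The modular arithmetic accelerator (MAA) in MAXQ microcontrollers can perform this operation with moduli of up to 2048 bits. It is easy to load the memory areas with a, e, and m and then start the operation.
When the modulus is the product of two or more primes, we can use a result of the Chinese remainder theorem (CRT) to reduce execution time by performing two smaller modular exponentiations instead of one large one. Specifically, we use Garner's algorithm for this.

Description
In a typical RSA decryption operation, we recover the plain text (pt) from the cipher text (ct) by computing $\text{pt} = \text{ct}^d \bmod n$, where d and n make up the private key. The value d is the decryption exponent, and n is the product of the primes p and q. Usually p and q have the same length, and n, pt, and ct have twice that number of bits. For example, if p and q are 1024 bits long, then n will be a 2048-bit number about 60% of the time.
The CRT reduces our exponentiation to the following equations: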
Let $c_1 = \text{ct}^d \bmod p$, $c_2 = \text{ct}^d \bmod q$, $m_1 = p\,(p^{-1} \bmod q)$, and $m_2 = q\,(q^{-1} \bmod p)$. Then

$$\text{pt} = (c_1 + m_1 (c_2 - c_1)) \bmod n$$

or

$$\text{pt} = (c_2 + m_2 (c_1 - c_2)) \bmod n.$$
Note that the modular exponentiations in the $c_1$ and $c_2$ terms now have half as many bits as the $\text{ct}^d \bmod n$ operation.
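The following short check, written with the symbols defined above, shows why the first recombination formula is valid. The name $x$ for the recombined value is introduced here only for illustration.

$$
\begin{aligned}
x &= c_1 + m_1 (c_2 - c_1), \qquad m_1 = p\,(p^{-1} \bmod q) \\
x &\equiv c_1 \pmod p \qquad \text{because } m_1 \equiv 0 \pmod p \\
x &\equiv c_1 + (c_2 - c_1) = c_2 \pmod q \qquad \text{because } m_1 \equiv 1 \pmod q
\end{aligned}
$$

Because $x$ agrees with $\text{ct}^d$ modulo both $p$ and $q$, and $p$ and $q$ are distinct primes, $x \bmod n$ equals $\text{ct}^d \bmod n$. The second formula follows in the same way with the roles of $p$ and $q$ exchanged.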
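The terms $m_1$ and $m_2$ can both be precomputed. The $p^{-1} \bmod q$ term is some value, say y, such that $p \times y \bmod q = 1$. For example, if p = 11 and q = 17, then $11^{-1} \bmod 17 = 14$, because $11 \times 14 \bmod 17 = 1$. These inverses can be found with the extended Euclidean algorithm or, because p and q are prime, by computing $p^{q-2} \bmod q$. This inversion by modular exponentiation is based on Fermat's little theorem.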
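The Fermat-based inversion can be sketched in portable C as follows. This is a software illustration only, not MAA code: the helper names `mod_pow` and `mod_inverse_prime` are hypothetical, and the sketch assumes moduli below $2^{32}$ so that every intermediate product fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Square-and-multiply modular exponentiation.
   Assumption: mod < 2^32, so products never overflow 64 bits. */
static uint64_t mod_pow(uint64_t base, uint64_t exp, uint64_t mod)
{
    assert(mod > 1 && mod <= 0xFFFFFFFFu);
    uint64_t result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % mod;
        base = (base * base) % mod;
        exp >>= 1;
    }
    return result;
}

/* Inverse of a modulo a prime q by Fermat's little theorem: a^(q-2) mod q.
   Only valid when q is prime and a is not a multiple of q. */
static uint64_t mod_inverse_prime(uint64_t a, uint64_t q)
{
    return mod_pow(a, q - 2, q);
}
```

Calling `mod_inverse_prime(11, 17)` reproduces the inverse used in the example above.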
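The $c_1$ and $c_2$ terms are interesting because ct and d are both twice the size of their moduli, p and q. One thing the MAA cannot do is operate on values larger than the modulus size; we need to reduce both values before we can use the MAA to perform the modular exponentiation. The reduction of the exponent d with respect to p and q can be precomputed. The new exponents are simply $d \bmod (p - 1)$ and $d \bmod (q - 1)$. The reduction of the exponents is also based on Fermat's little theorem.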
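A minimal software sketch of the exponent reduction is shown below, using the 16-bit primes from the numerical example at the end of this note. The helper names are hypothetical, and the equivalence it demonstrates assumes the prime does not divide the base.

```c
#include <assert.h>
#include <stdint.h>

/* Square-and-multiply modular exponentiation.
   Assumption: mod < 2^32, so products never overflow 64 bits. */
static uint64_t mod_pow(uint64_t base, uint64_t exp, uint64_t mod)
{
    assert(mod > 1 && mod <= 0xFFFFFFFFu);
    uint64_t result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % mod;
        base = (base * base) % mod;
        exp >>= 1;
    }
    return result;
}

/* Reduced exponent for a prime modulus: d mod (prime - 1).
   By Fermat's little theorem, a^d and a^(d mod (prime - 1)) are congruent
   modulo the prime whenever the prime does not divide a. */
static uint64_t reduce_exponent(uint64_t d, uint64_t prime)
{
    assert(prime > 2);
    return d % (prime - 1);
}
```

With d = 0x9b111cc9, the reduced exponents for p = 0xe747 and q = 0xc7a5 come out half the size of d, which is what lets the half-width exponentiation run on the accelerator.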
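The reduction of ct for $c_1$ and $c_2$ is done at execution time with a modular multiplication. For example, if ct is 64 bits long and p and q are both 32 bits long, we can perform the following multiplication: $\text{ct} \times 2^{32} \bmod (p \times 2^{32})$. This works because $\text{ct} \times 2^{32} \bmod (p \times 2^{32}) = (\text{ct} \bmod p) \times 2^{32}$. It is a 64-bit modular multiplication, and it is simpler than the equation looks. We place the 64-bit, four-word ct in MAA register a and clear everything in MAA register b. We then set bit 32 of register b, so that the register equals $2^{32}$. In the modulus, we write the bottom two words as zero and copy the value of p into the next two words. We then set MAWS to 64 and perform the modular multiplication. The reduced value we are looking for is in the third and fourth words of the result.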
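The shifted-modulus trick can be modeled in plain C. The sketch below is scaled down to a 32-bit ct and a 16-bit p (the sizes of the numerical example at the end of this note) so that ordinary 64-bit arithmetic suffices; the function name is hypothetical and no MAA registers are involved.

```c
#include <assert.h>
#include <stdint.h>

/* Software model of the reduction: compute ct mod p using only a
   full-width modular multiply, (ct * 2^16) mod (p * 2^16), and then
   take the upper half of the result.
   Assumption: 32-bit ct and nonzero 16-bit p. */
static uint32_t reduce_by_shifted_modulus(uint32_t ct, uint16_t p)
{
    assert(p != 0);
    uint64_t shifted_modulus = (uint64_t)p << 16;   /* p * 2^16 */
    uint64_t multiplier = (uint64_t)1 << 16;        /* 2^16     */
    uint64_t product = ((uint64_t)ct * multiplier) % shifted_modulus;
    return (uint32_t)(product >> 16);               /* upper 16 bits hold ct mod p */
}
```

The lower 16 bits of the product are always zero, which is why reading only the upper words of the result register is enough.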
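To make the rest of the computation easier, we find which of the terms $c_1$ and $c_2$ is larger, and then choose the equation in which we subtract the smaller from the larger, to avoid getting a negative number. We then do a modular multiplication and add the result to $c_1$ (if we multiplied by $m_1$) or to $c_2$ (if we multiplied by $m_2$).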
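Deciding which term is larger means comparing two multi-word numbers. A minimal sketch of such a comparison is given below; it assumes the values are stored as arrays of 32-bit words with the least significant word first, and the function name `compare_words` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compare two multi-word numbers stored least significant word first.
   Returns 1 if a > b, -1 if a < b, and 0 if they are equal. */
static int compare_words(const uint32_t *a, const uint32_t *b, size_t nwords)
{
    assert(a != NULL && b != NULL);
    /* Walk from the most significant word down; the first difference decides. */
    for (size_t i = nwords; i-- > 0; ) {
        if (a[i] > b[i])
            return 1;
        if (a[i] < b[i])
            return -1;
    }
    return 0;
}
```

Lower words are only examined when all higher words are equal, so the comparison does not stop at the top word when the two top words match.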
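
Code Description
Listing 1 shows the initialization of unsigned long-word pointers to each register in the MAA. The hard-coded addresses are those of the MAA in the DeepCover secure microcontroller with secure RISC architecture (MAXQ1103).

Listing 1. Pointers to the MAA registers

    typedef unsigned long int ulong;

    // Long word pointers to MAA memories in the MAXQ1103
    ulong *maa_aw   = (ulong *) 0x8000;
    ulong *maa_bw   = (ulong *) 0x8100;
    ulong *maa_resw = (ulong *) 0x8200;
    ulong *maa_expw = (ulong *) 0x8400;
    ulong *maa_modw = (ulong *) 0x8500;

Listing 2 contains four precomputed constants: piqtp (read as p inverse q times p), qiptq (read as q inverse p times q), dphip (read as d phi of p), and dphiq (read as d phi of q). The constants piqtp and qiptq are the $m_1$ and $m_2$ terms above. The constants dphip and dphiq are the reduced decryption exponents for the $c_1$ and $c_2$ terms above. The two vectors ptp and ptq are used for temporary storage. The other variables, p, q, n, phi, e, d, pt, and ct, describe all the values needed for RSA. The term phi equals (p - 1) × (q - 1).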
The term nwords is the number of words in n, the modulus of the key. This implementation assumes that p and q have exactly 16 × nwords bits and that n has exactly 32 × nwords bits.
The words in these vectors are stored from the least significant word to the most significant word. They are loaded into the MAA registers from low word to high word. To be clear, here is an example of p × q = n using the constants from Listing 2:

    0xf22f213fe34b717b × 0xc9446776b381bfb9 = 0xbe67b781405a57697217c6cfbb2ac6e3

Listing 2. RSA constants

    int nwords = 0x4;
    ulong ptp[0x2];
    ulong ptq[0x2];

    // Complete set of RSA constants
    ulong p[0x2]   = { 0xe34b717b, 0xf22f213f };
    ulong q[0x2]   = { 0xb381bfb9, 0xc9446776 };
    ulong n[0x4]   = { 0xbb2ac6e3, 0x7217c6cf, 0x405a5769, 0xbe67b781 };
    ulong phi[0x4] = { 0x245d95b0, 0xb6a43e19, 0x405a5767, 0xbe67b781 };

    // Keys
    ulong e[0x4] = { 0x00000005, 0x00000000, 0x00000000, 0x00000000 };
    ulong d[0x4] = { 0xb6b1448d, 0xc55031ad, 0x337b791f, 0x9852f934 };

    // Sample plain text and corresponding cipher text
    ulong pt[0x4] = { 0x90abcdef, 0x12345678, 0x90abcdef, 0x12345678 };
    ulong ct[0x4] = { 0xda3c591a, 0xc131ad9d, 0x40a51b30, 0x361958df };

    // The four precomputed values used in the CRT computation
    ulong piqtp[0x4] = { 0x50995949, 0x4d355f7a, 0x907f8cc5, 0x1f0f60bf };
    ulong qiptq[0x4] = { 0x6a916d9b, 0x24e26755, 0xafdacaa4, 0x9f5856c1 };
    ulong dphip[0x2] = { 0x5aeafa31, 0x60dfa6e6 };
    ulong dphiq[0x2] = { 0x6bb43fd5, 0x78c2a47a };

Listing 3 contains the do_crt function, which is called with the number of words in p or q. The routine starts by creating the $c_1$ and $c_2$ terms from above and saves these values in ptp and ptq, respectively. We then determine which of ptp and ptq is larger and call the routine that performs the modular multiplication and addition. This leaves pt in the maa_resw memory.
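Listing 3. The do_crt routine

    void do_crt(int nwords)
    {
        // nwords is the number of 32-bit words in p or in q.
        int i;

        mod_reduction(ct, p, dphip, nwords, ptp);
        mod_reduction(ct, q, dphiq, nwords, ptq);

        // Compare ptp and ptq from the most significant word down.
        // The first word that differs decides which equation to use.
        for (i = nwords - 1; i >= 0; --i) {
            if (ptp[i] > ptq[i]) {
                sum_mul_sub(2*nwords, ptp, ptq, qiptq, n);
                return;
            }
            if (ptp[i] < ptq[i]) {
                sum_mul_sub(2*nwords, ptq, ptp, piqtp, n);
                return;
            }
        }
        // ptp equals ptq: the difference is zero, so either equation works.
        sum_mul_sub(2*nwords, ptq, ptp, piqtp, n);
    }

Listing 4 contains the details of initializing the MAA. Here we clear all 384 words in the MAA, initialize the memory select register, MAMS, and then tell the MAA the position of the most significant bit of the modulus, MAWS.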
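Listing 4. The init_maa routine

    void init_maa(int mod_size)
    {
        int i;

        for (i = 0; i < 384; ++i)
            maa_aw[i] = 0;      // clear the entire MAA
        mams = 0x6420;          // memory select register
        maws = mod_size;        // position of the most significant bit of the modulus
    }

Listing 5 contains the details of the modular reduction and exponentiation for an expression of the form $\text{ptp} = \text{ct}^{\text{dphip}} \bmod p$. The first thing this routine does is reduce ct to half its size by doing a modular multiplication with the shifted modulus p, letting the multiplier be the shift value. These routines use long-word moves instead of bit shifts across multiple words. (Moving a value by one long word is like shifting it 32 times.) When the modular multiplication completes, the reduced answer sits shifted to the left in the maa_resw register.
With ct reduced, we next perform the modular exponentiation to obtain the result this subroutine is supposed to produce, ptp.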
Listing 5. The mod_reduction routine

    void mod_reduction(ulong *ct, ulong *p, ulong *dphip, int nwords, ulong *ptp)
    {
        int i;

        // nwords as passed is the length of p (rather than n).
        // We are going to do a modmul with a shifted p as the modulus.
        // init_maa initializes maws with the correct modulus size.
        init_maa(nwords*64);

        // Reduce ct mod p by doing the modmul
        // ct * 2^(nwords*32) mod (p * 2^(nwords*32)).

        // Load a with ct.
        for (i = 0; i < 2*nwords; ++i)
            maa_aw[i] = ct[i];

        // Load b with 2^(nwords*32), which is simply a bit set.
        maa_bw[nwords] = 1;

        // Load the modulus with p*2^(nwords*32), which is simply a load shifted by nwords.
        for (i = 0; i < nwords; ++i)
            maa_modw[i + nwords] = p[i];

        // This multiply gives us the reduction of ct, with
        // the answer in maa_resw shifted by nwords.
        mact = 0x05;            // mod multiply and start
        while (mact & 1)        // wait for the multiply to finish
            ;

        // Load registers to do ct^dphip mod p.
        // Notice that we are copying the shifted result of maa_resw to maa_aw.
        for (i = 0; i < nwords; ++i) {
            maa_aw[i] = maa_resw[nwords + i];
            maa_bw[i] = 0;
            maa_expw[i] = dphip[i];
            maa_modw[i] = p[i];
        }
        maa_bw[0] = 1;          // the b reg is always 1 for modexp
        maws = 32*nwords;       // the most important step is setting maws to the correct size
        mact = 0x1;             // mod exp and start
        while (mact & 1)
            ;

        // Copy our result to the ptp argument.
        for (i = 0; i < nwords; ++i)
            ptp[i] = maa_resw[i];
    }

Listing 6 describes the function that puts everything together. We compute the equation $\text{pt} = (c_1 + m_1 (c_2 - c_1)) \bmod n$ if $c_2$ is greater than $c_1$, or $\text{pt} = (c_2 + m_2 (c_1 - c_2)) \bmod n$ otherwise. The function starts with the subtraction. Note that we use n as the modulus, so the multiplication and addition are carried out at the full key length. After the subtraction, we move the result from maa_resw to maa_aw, copy our multiplier, $m_1$ or $m_2$ (argument c), into maa_bw, and start the modular multiplication. In the last step, we copy the result of the multiplication from maa_resw to maa_aw, then copy $c_1$ or $c_2$ (argument b) into maa_bw and do the modular addition. When this finishes, our plain text is in maa_resw.
Listing 6. The sum_mul_sub routine

    void sum_mul_sub(int nwords, ulong *a, ulong *b, ulong *c, ulong *n)
    {
        int i;

        // Prepare to subtract b from a.
        for (i = 0; i < nwords/2; ++i) {
            maa_aw[i] = a[i];
            maa_bw[i] = b[i];
            maa_modw[i] = n[i];
        }
        // Clear the upper words of maa_a and maa_b and copy the rest of n.
        for (i = nwords/2; i < nwords; ++i) {
            maa_aw[i] = 0;
            maa_bw[i] = 0;
            maa_modw[i] = n[i];
        }

        // This is a full-size operation.
        // Start the subtraction.
        maws = 32*nwords;
        mact = 0xb;             // subtract and start
        while (mact & 1)
            ;

        // Copy the result over to maa_aw and
        // put our multiplicand into maa_bw.
        for (i = 0; i < nwords; ++i) {
            maa_aw[i] = maa_resw[i];
            maa_bw[i] = c[i];
        }
        mact = 5;               // multiply and start
        while (mact & 1)
            ;

        for (i = 0; i < nwords/2; ++i) {
            maa_aw[i] = maa_resw[i];
            maa_bw[i] = b[i];
        }
        for (i = nwords/2; i < nwords; ++i) {
            maa_aw[i] = maa_resw[i];
            maa_bw[i] = 0;
        }
        mact = 0x9;             // add and start
        while (mact & 1)
            ;
    }

Performance and Conclusion
Tables 1 and 2 give an indication of the speed improvement that can be achieved with the algorithm above. As the modulus size increases, we get a larger reduction in time.
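The C implementation presented here is intended to demonstrate that a speed improvement is possible with this algorithm. The code also shows how to operate the MAA. Many things can be done to make the algorithm faster, including loop unrolling, loop optimization, using compiler-optimized data-move routines, and using assembly language.
The quickest and easiest speed improvement would be to manipulate the MAA's MAMS register, which would eliminate some of the data moves. The MAMS register allows us to rename the memory segments. For simplicity, this was not done in this application note.
In theory, it should be possible to approach a 4-to-1 timing ratio in improved speed by using this algorithm.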
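One common way to arrive at the 4-to-1 figure is sketched below. It rests on an assumed cost model, not on anything stated about the MAA's internals: a modular exponentiation with a $k$-bit modulus and a $k$-bit exponent takes about $k$ modular multiplications, each costing about $k^2$ basic steps, so its time $T(k)$ grows as $k^3$. The constant $c$ is a placeholder.

$$
\begin{aligned}
T(k) &\approx c\,k^3 \\
T_{\text{CRT}}(k) &\approx 2\,T(k/2) = 2c\left(\frac{k}{2}\right)^3 = \frac{c\,k^3}{4} \\
\frac{T(k)}{T_{\text{CRT}}(k)} &\approx 4
\end{aligned}
$$

Under this model the ratio stays below 4 in practice because the reduction of ct, the recombination step, and the data moves add time that the model ignores. That overhead weighs more heavily at small modulus sizes, which is consistent with the ratios in Tables 1 and 2 rising toward 4 as the size grows.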
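Always use the crypto ring as your crypto clock source. This is done by clearing the master crypto source select bit (MCSS, PMR.7) in the power management register (PMR). When decrypting, it is recommended to clear the optimized calculation control bit (OCALC, MACT.4) of the modular arithmetic accelerator control register (MACT) to zero, which disables that feature.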
Table 1. MAXQ1103 at 25MHz with the MAA running from the crypto ring

| Size (bits) | ModExp (ms) | CRT (ms) | Ratio |
| ----------- | ----------- | -------- | ----- |
| 2048 | 549 | 166 | 3.3 |
| 1024 | 82.0 | 29.6 | 2.8 |
| 512 | 14.2 | 7.25 | 2.0 |
| 256 | 3.37 | 3.08 | 1.1 |

Table 2. DeepCover secure microcontroller (MAXQ1050) at 24MHz with the MAA running from the crypto ring

| Size (bits) | ModExp (ms) | CRT (ms) | Ratio |
| ----------- | ----------- | -------- | ----- |
| 2048 | 1760 | 492 | 3.6 |
| 1024 | 244 | 75.7 | 3.2 |
| 512 | 37.1 | 14.3 | 2.6 |
| 256 | 6.80 | 4.49 | 1.5 |
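
Numerical Example of RSA Using the CRT to Recover the Plain Text
This example goes through the steps of constructing the public and private keys for RSA, then taking a sample message, encrypting it, and decrypting it. Then we show how the same encrypted message can be decrypted using the CRT.
We start by finding a couple of primes. Let p = 0xe747 and q = 0xc7a5. Both are 16-bit primes.
This gives us n = p × q = 0xb45d41c3 and phi = (p - 1)(q - 1) = 0xb45b92d8. Both are 32 bits long.
We can choose e = 0x10001 because gcd(e, phi) = 1. This gives us the public key. Our private key is chosen so that e × d mod phi = 1. Using the extended Euclidean algorithm, we compute d = 0x9b111cc9.
We arbitrarily choose our plain text, pt = 0xabcdef12. Our cipher text is $\text{ct} = \text{pt}^e \bmod n$ = 0x87ccfe27. To recover the plain text, we compute $\text{ct}^d \bmod n$. This is a 32-bit modular exponentiation. This is, in principle, the RSA encryption and decryption process.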
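The key pair above can be exercised with a short software sketch. It uses ordinary 64-bit arithmetic instead of the MAA, which works only because this example modulus is 32 bits wide; the function and constant names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Square-and-multiply modular exponentiation.
   Assumption: mod < 2^32, so products never overflow 64 bits. */
static uint64_t mod_pow(uint64_t base, uint64_t exp, uint64_t mod)
{
    assert(mod > 1 && mod <= 0xFFFFFFFFu);
    uint64_t result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % mod;
        base = (base * base) % mod;
        exp >>= 1;
    }
    return result;
}

/* Key values copied from the numerical example in the text. */
static const uint64_t RSA_N = 0xb45d41c3u;
static const uint64_t RSA_E = 0x10001u;
static const uint64_t RSA_D = 0x9b111cc9u;

/* Textbook RSA without padding, for illustration only. */
static uint64_t rsa_encrypt(uint64_t pt) { return mod_pow(pt, RSA_E, RSA_N); }
static uint64_t rsa_decrypt(uint64_t ct) { return mod_pow(ct, RSA_D, RSA_N); }
```

Encrypting a value below n with `rsa_encrypt` and passing the result to `rsa_decrypt` returns the original value, which is the full-width operation the CRT method replaces.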
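Now we will recover the plain text using the Chinese remainder theorem.
Offline, we precompute four constants. The first is piqtp = 0x9e1d261c, which is $(p^{-1} \bmod q) \times p$. The second is qiptq = 0x16401ba8, which is $(q^{-1} \bmod p) \times q$. Both are 32 bits long. You can find the inverses with the extended Euclidean algorithm or by computing $p^{q-2} \bmod q$ and $q^{p-2} \bmod p$. We also need dphip = d mod phi(p) = 0x9b111cc9 mod (0xe747 - 1) = 0x4aab and dphiq = d mod phi(q) = 0x9b111cc9 mod (0xc7a5 - 1) = 0x9a0d. These two exponents are 16 bits long, half the size of n.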
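The offline step can be sketched in C with the extended Euclidean algorithm. The struct and function names below are hypothetical, and the sketch assumes 16-bit primes as in this example so that every product fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Modular inverse of a modulo m by the extended Euclidean algorithm.
   Requires gcd(a, m) == 1. */
static int64_t mod_inverse(int64_t a, int64_t m)
{
    assert(m > 1);
    int64_t old_r = a % m, r = m;
    int64_t old_s = 1, s = 0;      /* invariant: old_s * a == old_r (mod m) */
    while (r != 0) {
        int64_t quotient = old_r / r;
        int64_t tmp = old_r - quotient * r;
        old_r = r;
        r = tmp;
        tmp = old_s - quotient * s;
        old_s = s;
        s = tmp;
    }
    assert(old_r == 1);            /* a and m must be coprime */
    return (old_s % m + m) % m;    /* normalize into [0, m) */
}

typedef struct {
    uint32_t piqtp;  /* (p^-1 mod q) * p */
    uint32_t qiptq;  /* (q^-1 mod p) * q */
    uint16_t dphip;  /* d mod (p - 1)    */
    uint16_t dphiq;  /* d mod (q - 1)    */
} crt_constants;

/* Precompute the four CRT constants for 16-bit primes p and q. */
static crt_constants precompute_crt(uint16_t p, uint16_t q, uint32_t d)
{
    crt_constants k;
    k.piqtp = (uint32_t)(mod_inverse(p, q) * p);
    k.qiptq = (uint32_t)(mod_inverse(q, p) * q);
    k.dphip = (uint16_t)(d % (uint32_t)(p - 1));
    k.dphiq = (uint16_t)(d % (uint32_t)(q - 1));
    return k;
}
```

A useful sanity check on the result is that piqtp + qiptq equals n + 1, since the first constant is 1 modulo q and 0 modulo p while the second is the reverse.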
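The online computation starts by reducing the cipher text mod p. The cipher text is 32 bits long and the modulus is 16 bits long. To do the reduction with the MAA, we shift the modulus p left by 16 bits and multiply the cipher text by $2^{16}$. This looks like 0x87ccfe27 × 0x10000 mod 0xe7470000 = 0x36b00000. Now we shift the answer right by 16 bits, or just grab the upper 16 bits of the word. We need this value to do the 16-bit modular exponentiation $\text{0x36b0}^{\text{dphip}} \bmod p$, which looks like $\text{0x36b0}^{\text{0x4aab}} \bmod \text{0xe747}$ = 0x6425 = ptp.
The second modular reduction, of the cipher text mod q, expands to 0x87ccfe27 × 0x10000 mod 0xc7a50000 = 0x543d0000. Shifting the answer right by 16 bits, or just grabbing the upper word, gives the reduced value, 0x543d. Using this result, we do the modular exponentiation $\text{0x543d}^{\text{dphiq}} \bmod q$, which looks like $\text{0x543d}^{\text{0x9a0d}} \bmod \text{0xc7a5}$ = 0x1671 = ptq.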
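If ptp is greater than ptq, we compute (ptq + (ptp - ptq) × qiptq) mod n; otherwise we compute (ptp + (ptq - ptp) × piqtp) mod n. Remember that piqtp and qiptq are precomputed and are 32 bits long.
We see that ptp is greater than ptq. The difference is ptp - ptq = 0x4db4. The product ((ptp - ptq) × qiptq) mod n = (0x4db4 × 0x16401ba8) mod 0xb45d41c3 is 0xabcdd8a1. Adding ptq, 0x1671, gives us back the plain text, 0xabcdef12, and we are done.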
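The whole online computation for this example can be modeled in a few lines of C. This is a software sketch under the same small-number assumption as the example (16-bit primes, 32-bit modulus), not MAA code; the function and constant names are hypothetical, and the constants are copied from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Square-and-multiply modular exponentiation.
   Assumption: mod < 2^32, so products never overflow 64 bits. */
static uint64_t mod_pow(uint64_t base, uint64_t exp, uint64_t mod)
{
    assert(mod > 1 && mod <= 0xFFFFFFFFu);
    uint64_t result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % mod;
        base = (base * base) % mod;
        exp >>= 1;
    }
    return result;
}

/* Constants copied from the numerical example in the text. */
static const uint64_t CRT_P = 0xe747, CRT_Q = 0xc7a5, CRT_N = 0xb45d41c3u;
static const uint64_t CRT_DPHIP = 0x4aab, CRT_DPHIQ = 0x9a0d;
static const uint64_t CRT_PIQTP = 0x9e1d261cu, CRT_QIPTQ = 0x16401ba8u;

/* Recombination: rebuild pt from ptp = pt mod p and ptq = pt mod q,
   always subtracting the smaller residue from the larger one. */
static uint64_t crt_combine(uint64_t ptp, uint64_t ptq)
{
    if (ptp > ptq)
        return (ptq + ((ptp - ptq) * CRT_QIPTQ) % CRT_N) % CRT_N;
    return (ptp + ((ptq - ptp) * CRT_PIQTP) % CRT_N) % CRT_N;
}

/* Two half-width exponentiations followed by the recombination. */
static uint64_t crt_decrypt(uint64_t ct)
{
    uint64_t ptp = mod_pow(ct % CRT_P, CRT_DPHIP, CRT_P);
    uint64_t ptq = mod_pow(ct % CRT_Q, CRT_DPHIQ, CRT_Q);
    return crt_combine(ptp, ptq);
}
```

The sketch reduces ct with the `%` operator for brevity; on the MAA that step is the shifted-modulus multiplication described earlier.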
