Steps of Salsa20 algorithm and its applications.

Assumptions:

This article assumes the reader knows the following:

High-level cryptographic knowledge
basic bitwise operations
Basic computer architecture knowledge
Stream Cipher Knowledge

Overview of Salsa20

Salsa 20 is a cryptographic secure Pseudo Random Number Generator and a cipher and cannot distinguish true randomness without knowing the internal state. It is designed to replace AES as it claims to have the same level of security with a faster computation time. The research paper that derives this cipher is listed here:

The difference between a PRNG and CPRNG is that a PRNG to an experienced attacker has the potential to find a pattern without an internal state. This is not the case for a CPRNG.
There are even faster models with reduced cycles, but these need to be more secure.

The uniqueness of Salsa20 is that the cipher is secure and fast, which is usually a conflicting tradeoff in cryptography. This can be seen throughout the design of the cipher.

Operations used in Salsa20

32-bit addition
32-bit xor
constant distance 32-bit rotation
- Constant distance 32-bit rotation is when the bits are shifted by a constant amount, like a bitshift. Instead of extending the bits (making the number exponentially larger), the most significant bit will be deleted from its position and appended to the bitstream at the end, making it the least important bit.

The justification is that the operations can build the same logic as any circuit and, therefore, the same security. Another essential feature is that these operations are less prone, if at all prone, to timing attacks, which is another exciting and rich concept.

Overall, some operations, for example, multiplication, can give out what is known as time leaks. This is where the time of the program execution depends on the secret data noticeably. Furthermore, time leaks give hints at what secret data is.

S-boxes are another example of an encryption technique that can give out time leaks (This is a controversial statement as cryptographers are at odds with S-box security). At a minimum, the memory is straining on the upper-level caches in the architecture due to the S-boxes that are several kilobytes large.

How blocks are correlated in Salsa:

The salsa algorithm will expand a 256-bit key and a 64-bit nonce into a 2^70-byte stream. A high-level view of its encryption and decryption process is as follows (after the stream has been created):

It will encrypt a b-byte plaintext by XORing the plaintext with the first b-bytes of the bit stream and then discards the rest of the stream
To decrypt the ciphertext of b-bytes, it will xor the ciphertext with the first b-bytes of the bit stream and discard the rest.

Another note is that the key and nonce have been expanded before the encryption and decryption process.

Another note is that the blocks are independent hashes of the key, nonce, and block counter. Additionally, there is no chaining from one to the next. This allows both parallel computing and random access to the encrypted data.

💻 Chaining is used to encrypt and decrypt significant plaintext inputs by creating a cryptographic chain where the ciphertext is dependent on the previous blocks.

The input to this algorithm is the key and nonce as is, and there is no preprocessing to these elements.

Salsa 20 encryption step by step:

For a block to be generated, the problem space is set up in the following way. First, is that the goal of the salsa20 is to produce a 64-byte block given a key, nonce, and block counter. Furthermore, the stream is comprised of these 64-byte blocks. We will accomplish this goal using the operations listed above.

The core builds a 16-word array(block), where each word is 4 bytes. The block will contain the constant number 0x61707865, the first four words of the key, 0x3320646e, the two nonce words, and two block-counter words, 0x79622d32 constant word, and finally, the last four words of the key, with constant word 0x6b206574. The strings are always interpreted in little-endian form. The reason is that CPUs optimized for little-endian form have a big time leak for big-endian interpretation, while CPUs optimized for big-endian form have a no time leak for little-endian form.

Example Array Generation:

For example, here is the starting array for key (1, 2, 3, 4, 5, . . . , 32), nonce (3, 1, 4, 1, 5, 9, 2, 6), and block 7:

Now we have the starting array, our next step is to start modifying. Firstly, we will modify the below diagonals. These elements are circled in red:

The above diagonals are circled in blue.

We add the above diagonal and the diagonal, then store it in memory. We then rotate the result 7 bits. Finally, we xor into the below diagonal, which will modify the block value. This will give the following result:

Underlined numbers were added, and the word underneath both underlined was the only segment modified.

Next, we will modify the elements below the ones circled in red by adding the diagonal with the below diagonal (in that column) and storing it in memory. We then rotate the result by 9 bits and xor into the below-below diagonal. This gives the following result:

Finally, we go down each column and rotate the last element (excluding the diagonal) that wasn’t edited yet by 13 bits. You can call this the below-below-below diagonal. We then rotate the diagonal by rotating them 18 bits.

Finally, we transpose our block to get the final result of the first round, which in this case, is:

We then repeat this process n times for a Salsa20/n round cipher. This, in the end, is the whole cipher that currently has applications in many elite tech companies, such as Google.