Randomness proof

Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#1

Randomness proof

25 Aug 2023, 14:32

Dear All,

just sharing some weekend thoughts. I am looking for an answer to the question, which may be so simple and obvious, that it's not even discussed, but I'd still want to have some math proof of it, even if it is not being doubted.

Suppose I have a random number generator (RNG) which produces a sequence of pseudorandom numbers: r1, r2, r3, r4,....
The vendor guarantees that the values will appear to be distributed uniformly and pass some common tests (Diehard or similar) that are thrown on such generators to confirm they are well-behaving.

What I want to confirm is the following:

If r1, r2, r3, r4.... is a sequence R of uniformly distributed pseudorandom numbers between 0 and 1, then:

First:

d1(r1), d1(r2), d1(r3), d1(r4),... is also a sequence of uniformly distributed pseudorandom numbers between 0 and 9 (with d1() taking the value of the first digit after comma in the decimal notation of its argument r1);

and

Second:

d1(R) is uncorrelated with d2(R), where d2() is a similar function returning the second digit after comma in the decimal notation of its argument.

Alternatively, if one can show that it is possible to create an RNG that would be passing the tests with its generated sequence, but have a correlated pattern between d1 and d2, that would seriously damage the statement I am trying to confirm.

One can simply take Stata's built-in RNG for experimentation. Below is a fragment of code that produces a sample of d1-d2 and some quick diagnostics on it. And statistically it complies with the statement I am making, but still, I have doubts whether this is always the case.

Any thoughts on this?

Thank you, Sergiy

Code:

clear all version 17.0 set seed 12345678 set obs `=1e6' generate rnd=int(runiform()*100) generate d1=int(rnd/10) generate d2=rnd-d1*10

Here is the heatmap of the resulting d1-d2 distribution:

Correlating d1 vs d2 results in:

Code:

. corr d1 d2 (obs=1,000,000) | d1 d2 -------------+------------------ d1 | 1.0000 d2 | 0.0010 1.0000

And the histogram of d1 by d2 is:
Tags: randomness
Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#2

25 Aug 2023, 15:22

I believe both statements are true.

First let's be clear that we are talking about a uniform distribution on [0, 1), or (0, 1). (If 1 is included, the first assertion will fail because the occurrence of 1.0 in the r sequence will cause a surplus of 0's in the d1(r) sequence--though it will be hardly noticeable.)

d1(r) = floor(10*r). So we sill have d1(r) = i if and only if i <= 10*r <= i+1, for i = 0, ..., 9. Equivalently, i/10 <= r <= (i+1/10), for i = 0,...,9. But because r is uniformly distributed on (0, 1) or [0, 1), the probability that i/10 <= r <= (i+1/10) is exactly 0.1. So the probability of d1(r) = i is 0.10 for all i = 0,...,9. That is, d1(r) is uniformly distributed over the integers between 0 and 9 inclusive.

The second statement can be proved quickly as a consequence. Consider the function d*(r) = the first two digits after the decimal point in r. By an argument that is similar to the argument in the preceding paragraph, replacing 10 by 100 throughout, one sees that d*(r) is uniformly distributed over the integers between 0 and 99, inclusive. So the probability that d*(r) = j is always 0.01 for all integers j from 0 through 99. But d1(r) is the first digit of d*(r), and d2(r) is the second digit of d*(r), and the first two digits of the numbers between 0 and 99 are independent, because all combinations of them occur exactly once. It follows that d1(r) is independent of d2(r).

QED
1 like
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#3

25 Aug 2023, 17:11

Dear Clyde Schechter ,

thank you very much! This answers the question that I've posted, provided that one operates on a 'fair' implementation of the RNG.

If possible, can I follow up on this with a variation?

Suppose the end-user doesn't have access to the vendor's code implementing the RNG, but is free to do black-box testing before using, which in all probability converges to running the Diehard tests. At the same time [asymmetrically] the vendor has access to the end-user's code (that uses the supplied RNG) and knows, that:

- the end user will be using d1(R) and d2(R) for a specific purpose and
- the end user will not use the d3(R), d4(R) etc. (other digits of the random numbers produced by the RNG are not going to be used in the application)
- the vendor has the interest of rigging the RNG() in a certain way (To put in prospective, suppose the vendor is supplying the RNG to a voting commission, that will utilize it to randomize the order in which the candidates appear on the screen during voting for each voter, and the vendor of the RNG has own favorite candidate that they want to move to the earlier position in the list for more than a proportional number of voters, or put that candidate higher than another candidate, which is otherwise posed to win)
- the vendor is having sufficient confidence that the end-user is pretty much the only one who will care about actually testing the RNG, and any other users (if they exist) will not do independent analysis, even if they happen to use the whole values of R, or d3(R), etc.

Hence the question: Is it possible to rig the RNG in such a way, that the RNG still passes the Diehard tests, but is skewed in favor of a particular combination? E.g. perhaps by selectively choosing the remaining digits the vendor could hide the bias originating from the rigged first digits??

I realize that,

1) the answer may depend on the exact structure of the tests, but from the first look none of them addresses specifically this flaw (though I can hardly see where the consequences of this rigging may pop up and I confess I do not completely understand the exact mechanics of all these tests from just their brief description).

2) the answer may depend on the length of the random numbers produced by the RNG in terms of the meaningful digits (but let's assume that it is considerably longer than 2 digits, let's take equivalent of a double [~15 digits] for the moment).

3) the answer may depend on the definition of what we consider "passes" the test. Let's take some naïve approach or go with a convenience assumption here.

The concern originates from recent attacks on various hash schemes, where the content of the file is manipulated to have different content, yet matching the original checksum, provided that there is some slack in the file which can contain arbitrary content. In such an attack a useful executable may be replaced with malicious code, which is shorter in terms of bytes, and the remaining bytes are manipulated to match the original signature. It takes time, but is achievable with modern means for a growing number of hash schemes. Here I am seeing the similarity that if the end-user evaluates the RNG on it's declared functionality (the quality of the generated sequence R), but uses the generator differently (uses specifically d1(R) and d2(R)) can that be exploited by a malicious vendor??

Disclaimer: nothing of what is written here should imply any malicious behavior or intent of any specific organization or individual whatsoever, but is purely a discussion of a statistical process.

Thank you and have a good weekend, Sergiy
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2405
#4

25 Aug 2023, 17:16

I'm curious to follow this thread, as the math is beyond me. I was under the impression that for black-box testing purposes, one can never prove randomness. Rather you can only be convinced that PRNGs only appear to be random if statistically similar to how random number streams should operate, often by passing several tests (e.g., the Diehard suite of tests).
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#5

25 Aug 2023, 17:55

Well, I don't know the answers to #3. My best guess is that, yes, it's possible. That said, I think that doing it by tampering with the high order digits of the random number sequence would be very challenging. Some of the diehard tests are clearly quite sensitive to the high order digits and I think the ruse would be easily exposed were it done that way. However, one could probably have a greater chance of successfully concealing the tampering if it is done in the lower order digits.

As for being able to alter a program while preserving its hash, I have heard it is possible, but I don't know how difficult it is to do: I would imagine it requires a lot of sophistication and would be computationally intensive. Moreover, I would imagine that strategy for tampering could be defeated by using multiple independent hashes. I suspect that simultaneously preserving all of them would prove, even if theoretically possible, beyond anybody's computational powers with existing or foreseeable technologies.

That said, while I'm comfortable with my knowledge of probability, my knowledge of computer security matters is very superficial. So I think you have to take my response here as being "to the best of my knowledge," which may not be good enough for your concerns. I think there are others in the Forum who have a deeper understanding of these issues, and I hope they will chime in.

Leonardo Guizzetti It's not that you can't prove a random number generator produces a truly random sequence. It's that computer-generated random numbers are provably not random. This is due to 1) the use of a deterministic algorithm, and 2) the limits of finite precision arithmetic: eventually the sequence must cycle. The hope is that the cycle length is so long that only a tiny fraction of its beginning will ever be used, so that the sequence is sufficiently random for practical purposes. If one truly needs random numbers, one has to start with a physical process that exhibits random behavior and capture a sequence of random numbers from that. random.org generates such sequences and most of their services are available without charge.
2 likes
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#6

28 Aug 2023, 10:21

Dear Clyde!

Thank you very much for this analysis!

I understand it is not a quick question and may require a lot of in-depth analysis, specifics of the setup, and some assumptions on the significance/likelihood of detection. Furthermore, d1() and d2() are taken as examples here. In practice they could be some other functions of the elements of the random sequence R, but with the same flaw that they don't take the value as a whole, but rather peek into some bits and pieces of it, leaving some opportunity for the content to be manipulated.

I think we can park the issue here as your current response already gives me food for thought on how to continue, but I would certainly be interested in learning more in case anyone wishes to add to this thread later or much later.

As this issue is not Stata-specific by itself, I think I should also ask in the cryptographic forums on this.

And yes, I agree that using multiple hashing schemes for executables adds considerable complexity to trying to tamper with them in this way.

Thank you, Sergiy
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 822
#7

28 Aug 2023, 15:27

As for being able to alter a program while preserving its hash, I have heard it is possible, but I don't know how difficult it is to do: I would imagine it requires a lot of sophistication and would be computationally intensive. Moreover, I would imagine that strategy for tampering could be defeated by using multiple independent hashes. I suspect that simultaneously preserving all of them would prove, even if theoretically possible, beyond anybody's computational powers with existing or foreseeable technologies.

Altering a program while preserving its hash is possible (this is called a collision) but very computationally difficult with an appropriate cryptographic hashing algorithm, and I would be absolutely shocked if there were a way to do this that gives you a program with useful alterations. It is, however, much more straightforward with a checksum. My understanding is that it's fairly easy to get a different executable with the same checksum, so checksums should be used to verify that the data was correctly transmitted, but never to authenticate the data.

The problem with relying on cryptographic hashing alone is that the hash itself is typically transmitted along with the program you want to verify, and is therefore susceptible to a man-in-the-middle attack. We need to verify that the sender of the program and the hash are who we think they are. Programs are typically cryptographically signed, meaning that the hash is encrypted with a private key before transmission, and decrypted with a known public key after receipt. If you then hash the program yourself and it matches the decrypted message, then you know whoever encrypted the message has the private key. Private and public keys are disseminated by trusted certificate authorities, and private keys aren't even transmitted over the open internet; They are sent on a thumb drive via priority mail. Cryptographic signatures don't depend exclusively on the content of the program to validate the sender, so no manipulation of bits in your malicious program will give you the correct result without the correct private key. Breaking a modern cryptographic signature is on the usual cosmic order of cryptographic difficulty. The best strategy for a hacker is the obvious strategy: Find a way to steal the private key.
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment