Combining MC simulations with loops

Jeffrey Michler

Join Date: Jul 2019

Posts: 13
#1

Combining MC simulations with loops

16 Jul 2024, 17:28

I am trying to use Monte Carlo simulations to conduct a sensitivity analysis to potential measurement error in the independent variable in a regression. What I would like to do is add different amounts (x) of noise to the value of the variable and then for every amount x conduct the MC simulation. In the end, I would have a sense of how much results change when x amount of noise was added. For x I was thinking a fraction of the actual value of the independent variable. To provide an implementable working example, imagine I want to use the auto data in Stata and `reg price mpg`. Then, I want to see how sensitive results are to mismeasurement of `mpg` by adding noise to `mpg` based on the distribution of `mpg`. So, do a MC simulation when I `replace mpg = mpg + rnormal(0.1*mean(mpg),0.1*sd(mpg))` at .1 intervals all the way up to 1 (apologies for the abuse of Stata syntax). Then I could see how much noise, as a fraction of the reported mpg, would need to be added before results change. I could obviously write this out as a set of 9 (.1 to 1) different programs but I would like to use a loop to cycle through values from .1 to 1. I am not sure if the best way is to write a loop that changes the value of `x = .1(.1)1` and within the loop have the MC program or if it is better to put the loop into the program. Right now I have the following:

Code:

sysuse auto.dta, clear reg price mpg program myreg, rclass forvalues i = .1(.1)1 { qui: sum mpg local mpg_mean = r(mean) local mpg_std = r(sd) return scalar mpgmn = `mpg_mean'*`i' return scalar mpgsd = `mpg_std'*`i' replace mpg = mpg + rnormal(`mpgmn',`mpgsd') reg price mpg } end * run MC simulations simulate _b _se, reps(100) seed(5762): myreg gen t_mpg =_b_mpg/_se_mpg gen p_mpg = 2*ttail(dfr,abs(t_mpg)) sum _b_mpg _se_mpg p_mpg

As I said, I'm not sure putting the loop inside the program is the right approach. I've tried to follow the advice in this post but to no avail.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30064
#2

16 Jul 2024, 17:42

Put the program inside the loop. But do not use -forvalues i = .1(.1)1-. You will get slightly inaccurate results that way because of precision issues. Instead use:

Code:

forvalues j = 1/10 { local i = `j'/10 // INSERT PROGRAM HERE USING `i' AS THE NOISE PARAMETERT }

The reason for this roundabout way of doing it is that 0.1 has no exact finite representation in binary (just like 1/3 has no exact finite representation in decimal.) So with 0.1(0.1)1 controlling the loop, Stata uses a finite binary number that is close to, but not exactly equal to 0.1 for both the starting point and the increment. The error then builds up with each successive iteration. Doing it as I show, with -forvalues j = 1/10- and then calculating -local i = `j'/10-, you still get inexact results, but the error is sometimes a little high and sometimes a little low but does not accumulate with each iteration since the integer arithmetic involved in incrementing j from 1 to 10 is exact.

The actual difference is very slight here, but with -.1(.1)1- your final iteration does not actually have `i' equal to 1, it has a minutely smaller value. If the difference is minute, roughly 2^-20, why am I making a big deal out of this? Because you should learn to do it the right way so that when a situation arises later in life where the difference ends up large enough to matter, you won't get bitten by this. By the way, this precaution applies not just to Stata but to programming any language on a binary computer: floating point arithmetic is inexact, and sometimes you have to do things in a roundabout way to avoid having errors accumulate to large magnitudes.
Comment

Jeffrey Michler

Join Date: Jul 2019
Posts: 13

17 Jul 2024, 09:44

Thanks Clyde, especially about the inexactitude of floating point arithmetic on binary computers. Something I had no idea about but makes sense.

In terms of placing the program in the loop, the code does not work for me. It breaks and returns `r(1)` saying I broke it somewhere. This is the case regardless of if I put the `simulate` inside or outside the loop. Any ideas where my error is coming from? This is what I am running:

Code:

sysuse auto.dta, clear

forvalues j = 1/10 {
    local i = `j'/10
      
program myreg, rclass
   
    qui: sum         mpg
    local            mpg_mean = r(mean)
    local            mpg_std = r(sd)
    
    return scalar    mpgmn = `mpg_mean'*`i'
    return scalar    mpgsd = `mpg_std'*`i'

    replace          mpg = mpg + rnormal(`mpgmn',`mpgsd')
        
    reg              price mpg
end
}
* run MC simulations
    simulate        _b _se, reps(100) seed(5762): myreg
    
    gen             t_mpg =_b_mpg/_se_mpg
    gen             p_mpg = 2*ttail(dfr,abs(t_mpg))
    
    sum             _b_mpg _se_mpg p_mpg

Comment

George Ford

Join Date: Aug 2014
Posts: 3136

17 Jul 2024, 10:35

you are defining mpgmn and mpqsd as scalars, then treating them as locals.

This ran.

Code:

  program myreg, rclass
  args i
   
    qui: sum         mpg
    local            mpg_mean = r(mean)
    local            mpg_std = r(sd)
    
    local    mpgmn = `mpg_mean'*0.1
    local    mpgsd = `mpg_std'*0.1

    replace          mpg = mpg + rnormal(`mpgmn',`mpgsd')
        
    reg              price mpg
end


* run MC simulations
    simulate        _b _se, reps(100) seed(5762): myreg

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30064
#5

17 Jul 2024, 11:34

But what is shown in #4 operates only for the noise parameter set to 0.1. And if you try to replace that by `i', it will fail, because `i' is defined outside the scope of program myreg. You need to explicitly pass `i' to program myreg as a parameter in order to generalize this and place it in a loop. In addition, when you place this program in a loop, you will stumble over some aspects of the workings of -simulate- that your code does not handle correctly.

First, you should not set the random number seed inside the loop: that causes the same random numbers to be re-used with each value of j. The seed should be set outside the loop.
Next, -simulate- overwrites the data in memory with the simulation results. So when you then try to access the original variables of auto.dta on the second iteration, they are not there! You need to instead save the output of each simulate and then bring auto.dta back into memory at each iteration.
Third, your -simulate- code does not create the variable dfr that you try to use later on. You have to explicitly create it.

So:

Code:

capture program drop myreg program myreg, rclass args noise_parameter qui: sum mpg local mpg_mean = r(mean) local mpg_std = r(sd) local mpgmn = `mpg_mean'*`noise_parameter' local mpgsd = `mpg_std'*`noise_parameter' replace mpg = mpg + rnormal(`mpgmn',`mpgsd') reg price mpg end set seed 5762 forvalues j = 1/10 { local i = `j'/10 tempfile results`j' sysuse auto, clear simulate _b _se dfr=(e(df_r)), saving(`results`j'') reps(100): myreg `i' } clear tempfile building save `building', emptyok forvalues j = 1/10 { use `results`j'', clear gen noise_factor = `j'/10 append using `building' save `"`building'"', replace } rename _eq2_dfr dfr gen t_mpg =_b_mpg/_se_mpg gen p_mpg = 2*ttail(dfr,abs(t_mpg)) sum _b_mpg _se_mpg p_mpg

Finally, looking at the last line of the code, calculating the summary statistics for p-values (variable p_mpg) doesn't really make sense. The mean p-value doesn't have any meaning, nor do the others. More likely you will want to do something like count the number of p-values at each noise factor level that deviate from the "noise-free" p-value by more than a certain amount. Or perhaps count the number that are statistically significant at your desired level and compare that to the noise-free number.
1 like
Comment

Jeffrey Michler

Join Date: Jul 2019
Posts: 13

17 Jul 2024, 11:49

Thanks. I've been going back and forth between using locals and scalers that I didn't even see the mismatch that you pointed out despite staring at my code forever. I now get the program to run but am still getting an error when I put the program into a loop like Clyde suggested.

Code:

sysuse auto.dta, clear
forvalues j = 1/10 {
    local i = `j'/10
      
program myreg, rclass
   
    qui: sum         mpg
    local            mpg_mean = r(mean)
    local            mpg_std = r(sd)
    
    local            mpgmn = `mpg_mean'*`i'
    local            mpgsd = `mpg_std'*`i'

    replace          mpg = mpg + rnormal(`mpgmn',`mpgsd')
        
    reg              price mpg
end
}
* run MC simulations
    simulate        _b _se, reps(100) seed(5762): myreg

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30064
#7

17 Jul 2024, 12:38

Please look carefully at the code offered in #5. In particular:
The definition of the program should be outside the loop.

The -program- needs to take the value of `i' as an argument, and that argument must be passed to it by the -simulate ...: myreg- command.

The program must -use- the auto.dta file at each iteration of the loop because -simulate- overwrites the data in memory.

The -simulate- command should be inside the loop.

You need to add the -saving()- option to your -simulate- command so that all of the simulation results are saved.

Setting the random number seed should be outside the loop.

All of these things are shown in the code in #5, and in bold face, and explained in the text of #5. The business about scalars vs locals that George Ford pointed out in #4 is also important, as the program cannot work properly without that--but as I pointed out in #5, just correcting that is not enough to make the code work inside a loop with -simulate-.
Comment
Jeffrey Michler

Join Date: Jul 2019

Posts: 13
#8

17 Jul 2024, 14:34

Clyde, sorry for my post #6. I must have been writing it at the same time you were writing post #5 and I made my post without seeing your updates/comments contained in post #5. Should have refreshed the page before clicking post since your post #5 answered all the questions I was asking.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30064
#9

17 Jul 2024, 14:38

No problem. This sort of thing happens often on Statalist. Usually when I finish posting something, I check to make sure that my post has turned up as the number in the thread that I was expecting it to be. If I find that other posts have intervened since what I was responding to, I edit my post to indicate that it crossed with whatever other(s) came between, and, if appropriate, I add or modify content to reflect the information in those intervening posts.
Comment
Jeffrey Michler

Join Date: Jul 2019

Posts: 13
#10

13 Aug 2024, 10:43

Thanks to Clyde and George for getting this working for me. I am now trying to work on a variant of this MC simulation where I randomly re-assign "treatment" and then see how results change. Using the auto data, we can think of "treatment" as whether or not the car is foreign. I want to randomly reassign which observations are foreign and domestic ("control") at intervals of 10% of the data set (74 total obs. in the data). For my random assignment in the program, I generate a value for each observation from a uniform distribution. I then assign each a rank from smallest random number value to largest. I then create a variable `num` to hold the number of observations I am going to reassign, which varies as I vary the `noise_parameter`. If `noise_parameter` is set to 0, then I will reassign no observations, if I set it at 1, then I will reassign 7.4 observations which is 10% of the data. I then compare the value of the observation's rank in the randomly drawn distribution to the value of `num`. If the rank is less than the number of observations to be re-assigned I tag it for re-assignment. I finally create a new indicator for foreign/domestic equal to `foreign` if the observations was not tagged for re-assignment and equal to it's opposite value if it was tagged for re-assignment (foreign becomes domestic, domestic becomes foreign). Finally, I regress price on the new "treatment" variable. Once I have the program, I run a loop where I vary the value of the noise parameter from 0 to 10. At 0, no observation is re-assigned. At 10, all observations are re-assigned.

I have followed the logic of what Clyde and George provided but when I run the code I get the error message "Error occurred when simulate executed myreg," which isn't particularly informative. I think my issue is how I am looping through the values of the noise parameter, variously defined as `j`, `i', and `noise_parameter`. I may have followed Clyde's example too slavishly.

Code:

capture program drop myreg program myreg, rclass args noise_parameter gen r_num = runiform() sort r_num gen rank = _n gen num = 7.4*`noise_parameter' gen tag = 1 if rank < num replace tag = 0 if tag == . gen foreign_new = foreign if tag == 0 replace foreign_new = 1 if tag == 1 & foreign == 0 replace foreign_new = 0 if tag == 1 & foreign == 1 reg price foreign_new end set seed 5762 forvalues j = 0/10 { local i = `j' tempfile results`j' sysuse auto, clear simulate _b _se dfr=(e(df_r)), saving(`results`j'') reps(100): myreg `i' } clear tempfile building save `building', emptyok forvalues j = 0/10 { use `results`j'', clear gen noise_factor = `j' append using `building' save `"`building'"', replace } rename _eq2_dfr dfr gen t_foreign =_b_foreign/_se_foreign gen p_foreign = 2*ttail(dfr,abs(t_foreign)) sum _b_foreign _se_foreign p_foreign
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30064
#11

13 Aug 2024, 11:43

The problem is that your -program myreg- creates new variables r_num, rank, tag, num, and foreign_new. The first time -simulate- calls the program, everything is fine. But when -simulate- tries to repeat the call to -myreg-, those variables are still hanging around from the previous call, so the commands to -gen- them throw error messages. You can't -gen- an existing variable.

Now, if you are in a rush to get this working and willing to accumulate technical debt to do so, you can quickly fix this by just adding -drop r_num foreign_new tag rank num- just before the -end- of -program myreg-.

But the right way to fix it is to use tempvars instead of real variables for r_num, rank, tag, num, and foreign_new. In fact, when writing -program-s, one should almost never create real variables in them unless the purpose of the program is specifically to create those new variables. One should instead use tempvars instead, so that Stata can be depended upon to automatically clean them up when the program exits.

By the way, the code -local i = `j'- and subsequently using `i' instead of `j' in the loop is harmless, but kind of pointless. Just refer directly to `j': that makes it obvious that the argument of -myreg- is the loop iterator, and you don't have to scan back through the code to understand that i is just a copy of j.

Last edited by Clyde Schechter; 13 Aug 2024, 11:50.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2411
#12

13 Aug 2024, 12:08

There's a "noisily" option for -simulate- (and similar commands like -bootstrap-) that will cause them to display what's happening at each of their iterations. That option is great for finding errors in the programs called by such commands, and in fact, rather than looking carefully at your code, the first thing I did was to use -noisily-, which produced output that quickly pointed me to the same problem that Clyde found by more elegant means, i.e., brainpower. I'd encourage you to try out -noisily- to see how it might help you in the future:

Code:

simulate _b _se dfr=(e(df_r)), noisily saving(`results`j'') reps(100): myreg `i'
1 like
Comment
Jeffrey Michler

Join Date: Jul 2019

Posts: 13
#13

13 Aug 2024, 13:11

Thanks everyone! The tempvars worked like a charm. And I will start using noisily to debug more before I come to the forums.
Comment

Announcement