Insert values from a cycle into a dataset

Daniela Rodrigues

Join Date: Jan 2021

Posts: 30
#1


21 Oct 2022, 13:08

Hello,

I am trying to run the following loop in Stata:

levelsof clinic_code, local(clinic_code_levels)
foreach p of local clinic_code_levels {
    levelsof month, local(month_levels)
    foreach m of local month_levels {
        conindex nr_events if clinic==`"`p'"' & month==`m', rankvar(person_age) truezero generalized
        noi disp "`p' `m': (" r(CI)-1.96*r(CIse) ", " r(CI)+1.96*r(CIse) ")"
    }
}

Right now I can see all the output in the main window, but with all the different clinics and months I will end up with far too many lines, close to 30k.

Does anyone know if it is possible to automatically insert those values into a new dataset like this:
Clinic | Month | Concentration index | 95% confidence interval (lower) | 95% confidence interval (upper)
`p' | `m' | r(CI) | r(CI)-1.96*r(CIse) | r(CI)+1.96*r(CIse)

Many thanks for your help.

Ken Chui

Join Date: Aug 2014
Posts: 1063

21 Oct 2022, 19:00

Try statsby. Here is an example:

Code:

sysuse nlsw88, clear

* Start with the simplest command without if:
conindex wage, truezero
* Check what is returned
return list

* Set up the statsby command.
* Married and Collgrad can be replaced by clinic and month
statsby conint = r(CI) conse = r(CIse), by(married collgrad) clear: conindex wage, truezero

* Check data
list

Result

Code:

     +--------------------------------------------------+
     | married           collgrad     conint      conse |
     |--------------------------------------------------|
  1. |  Single   Not college grad   .3403089   .0132737 |
  2. |  Single       College grad   .3000356   .0132422 |
  3. | Married   Not college grad   .3052576   .0087177 |
  4. | Married       College grad   .2890811   .0097662 |
     +--------------------------------------------------+

Then, once the results dataset of roughly 30k rows has been built, use the generate command to compute the lower and upper bounds.
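A minimal sketch of that last step, assuming the variable names conint and conse from the statsby call above. The names ll95 and ul95 are chosen here for illustration, and the four rows are typed in from the listing above so that the snippet runs on its own.

```stata
clear
input conint conse
.3403089 .0132737
.3000356 .0132422
.3052576 .0087177
.2890811 .0097662
end

* Normal-approximation 95% bounds: estimate -/+ 1.96 standard errors
generate ll95 = conint - 1.96*conse
generate ul95 = conint + 1.96*conse

list, noobs clean
```

With the real results the input block is not needed; the two generate lines are run directly on the dataset that statsby leaves in memory.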


Daniela Rodrigues

Join Date: Jan 2021

Posts: 30
#3

22 Oct 2022, 12:37

It works perfectly. Thanks very much, Ken Chui.

Last edited by Daniela Rodrigues; 22 Oct 2022, 12:40.
Daniela Rodrigues

Join Date: Jan 2021

Posts: 30
#4

26 Oct 2022, 10:05

Ken Chui,

Can I ask one last question about this, please? I managed to run the code for 6 groups of clinics and months in seconds, but now that I am running it on the whole dataset, it is taking a very long time. Do you know what the +1+2+3+4+5 ..................... 50 .... that appears in the Stata window while the statsby command is running means? Would this mean that only 50 groups out of 30k were processed in half a day?

Many thanks.
Ken Chui

Join Date: Aug 2014

Posts: 1063
#5

26 Oct 2022, 11:08

I believe so. I am not very knowledgeable about CPU time management. It may be worth making a new post about this to see if anyone can help you with that.

Given what is presented here, I'd perhaps try splitting the data into subset files so that you can process them in small batches and save the results batch by batch.
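A rough sketch of that batch idea, using the nlsw88 example data and summarize as a stand-in for conindex so that it runs without extra installs. Treating each level of collgrad as one batch is an assumption made purely for illustration; with the real data a batch might be one month or a block of clinics. The names full, results, batches and b_done are hypothetical.

```stata
sysuse nlsw88, clear
tempfile full results
save `full'

* Each level of collgrad is treated as one batch (illustrative assumption)
levelsof collgrad, local(batches)
local b_done = 0
foreach b of local batches {
    use `full', clear
    keep if collgrad == `b'
    * summarize stands in for conindex here
    statsby mean_wage = r(mean) sd_wage = r(sd), by(married collgrad) clear nodots: summarize wage
    * Add the batches finished so far, then save the accumulated results
    if `b_done' append using `results'
    save `results', replace
    local b_done = 1
}

use `results', clear
list, noobs clean
```

Saving after every batch means an interrupted run loses at most the batch in progress. In practice the results file would be a named file on disk rather than a tempfile, which disappears when the session ends.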
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#6

26 Oct 2022, 11:11

The use of loops over levels of variables containing commands that then use -if- conditions restricting to those levels can be very slow in large data sets. And, internally, -statsby- uses that same approach. By contrast, -runby- speeds up the process considerably and can be used for most of these situations. In your case:

Code:

capture program drop one_clinic_month
program define one_clinic_month
    conindex nr_events, rankvar(person_age) truezero generalized
    gen con_index = r(CI)
    gen ll95 = r(CI) - 1.96*r(CIse)
    gen ul95 = r(CI) + 1.96*r(CIse)
    exit
end

runby one_clinic_month, by(clinic month) verbose

should do the trick. I suggest you try it first with a subset of your data containing only a few clinics and a few months to be sure that program one_clinic_month runs without errors and produces sensible results. (I am not familiar with the -conindex- program, which is not an official Stata command, so I can't be sure that my code is completely compatible with the way it works.) If you are satisfied that it is working properly, then eliminate the -verbose- option (so you won't get thousands of pages of output with the full data set) and add the -status- option (which will give periodic progress reports on how much of the data has been processed and an estimate of the time remaining to completion.)

-runby- is written by Robert Picard and me, and is available from SSC.
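For anyone who wants to try the pattern before touching their own data, here is a small self-contained sketch on the nlsw88 example data. It assumes summarize as a stand-in for -conindex-, and the program name one_group and the variable names are hypothetical. The last two lines of the program, which reduce each group to a single row, are an addition for illustration and are not part of the code above.

```stata
* Install runby from SSC if it is not already available
capture which runby
if _rc ssc install runby

sysuse nlsw88, clear

capture program drop one_group
program define one_group
    * summarize stands in for conindex here (assumption for illustration)
    summarize wage
    gen mean_wage = r(mean)
    gen ll95 = r(mean) - 1.96*r(sd)/sqrt(r(N))
    gen ul95 = r(mean) + 1.96*r(sd)/sqrt(r(N))
    * Keep one row per group (an addition, not in the code above)
    keep married collgrad mean_wage ll95 ul95
    keep in 1
end

* -status- gives periodic progress reports instead of the full verbose log
runby one_group, by(married collgrad) status
list, noobs clean
```

Whatever the program leaves in memory for each group is what -runby- collects, so reducing to one row per group yields a compact results dataset with one observation per by-group.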

Added: Crossed with #5.

Last edited by Clyde Schechter; 26 Oct 2022, 11:16.
Daniela Rodrigues

Join Date: Jan 2021

Posts: 30
#7

31 Oct 2022, 07:01

Many thanks to you both for your input on this. -runby- is now installed in my Stata, and I just checked a couple of clinics and months: it gives the same results. I will now run this program on the whole dataset.

Once again - thank you.

Last edited by Daniela Rodrigues; 31 Oct 2022, 07:26.
Daniela Rodrigues

Join Date: Jan 2021

Posts: 30
#8

31 Oct 2022, 09:14

Clyde Schechter,

I just ran your code against my whole dataset and it has already finished. Fantastic, thanks very much again.

I did get some non-zero "by-group errors" from a particular month onwards. Is there any way to inspect what these errors might be or what might be causing them?

Many thanks.
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#9

31 Oct 2022, 10:27

Those will be combinations of clinic and month that appear in the original data but do not appear in the results. So if you run

Code:

use original_data, clear
keep clinic month
duplicates drop
merge 1:1 clinic month using results_from_runby, keep(master) keepusing(clinic month) nogenerate
list, noobs clean

Stata will show them to you. (Replace original_data and results_from_runby with the actual names of the original data set and the data set containing the results from -runby-.)

I don't know what -conindex- does or how it works. But what you should find when you delve more deeply into the findings here is that for those combinations of clinic and month, the data were in some way unsuitable for -conindex- to run. This kind of thing comes up commonly when -runby- is used with a program that does a regression: there are often -runby-groups that don't have enough observations to carry out the regression. Perhaps it will be something like that. But you'll have to look to see.

To see what the errors actually are, you can use the original data set, and keep only the clinic-month combinations that produced errors, and then re-do -runby-, adding the -verbose- option. That way you will see the error messages that program one_clinic_month threw. So, following the code above, it would be like this:

Code:

merge 1:m clinic month using original_data, keep(match) nogenerate
runby one_clinic_month, by(clinic month) verbose

Last edited by Clyde Schechter; 31 Oct 2022, 10:31.
Daniela Rodrigues

Join Date: Jan 2021

Posts: 30
#10

07 Nov 2022, 07:09

This is very helpful, thank you Clyde Schechter.