Variance over time by industry (variable)

Hugo Rocha

Join Date: Feb 2021
Posts: 288

07 Jun 2022, 09:51

Hi,

I am trying to create two variables that compute the variance of the log of output per worker (ly) over time, by industry (isic) and by type of industry (tech_intensity), but clearly the command is not just sd^2 by isic or tech_intensity. In addition, I am also trying to compute the variance across industries (isic) within the same group (tech_intensity), that is, how ly is dispersed among industries within the same group. Any hints? Thank you very much!

Code:

 * Example generated by -dataex-. To install: ssc install dataex
clear
input int country str2 isic float tech_intensity int year float ly
392 "20" 1 1963   7.26083
710 "16" 1 1963  8.802372
356 "17" 1 1963  6.480821
356 "18" 1 1963  6.298085
566 "15" 1 1963  8.434374
288 "17" 1 1963  7.216023
368 "18" 1 1963  7.340207
356 "21" 1 1963  7.079779
792 "17" 1 1963  7.442109
100 "18" 1 1963         .
364 "19" 1 1963         .
616 "21" 1 1963  7.429392
600 "36" 1 1963         .
 40 "16" 1 1963  10.65649
266 "17" 1 1963         .
278 "22" 1 1963         .
410 "22" 1 1963    7.1299
620 "20" 1 1963  7.622666
760 "18" 1 1963   6.31662
152 "16" 1 1963  10.35778
266 "36" 1 1963         .
 56 "15" 1 1963  8.705056
200 "19" 1 1963         .
620 "21" 1 1963  8.078789
250 "18" 1 1963  7.561381
214 "15" 1 1963   7.09509
600 "18" 1 1963         .
222 "17" 1 1963 8.1795435
578 "36" 1 1963  8.216323
710 "20" 1 1963  7.067297
608 "22" 1 1963  7.425481
100 "20" 1 1963         .
840 "19" 1 1963         .
222 "19" 1 1963         .
508 "18" 1 1963         .
800 "22" 1 1963  6.906117
710 "18" 1 1963  7.492924
894 "17" 1 1963  7.218875
862 "15" 1 1963   8.94442
364 "17" 1 1963  7.422935
792 "19" 1 1963         .
826 "18" 1 1963   7.72704
 40 "22" 1 1963  7.940096
364 "21" 1 1963  7.233703
716 "21" 1 1963         .
100 "16" 1 1963         .
124 "21" 1 1963 9.2917595
528 "20" 1 1963  8.188397
788 "20" 1 1963    7.1653
620 "36" 1 1963  6.033399
528 "22" 1 1963   8.06982
724 "19" 1 1963         .
368 "36" 1 1963  6.826024
368 "22" 1 1963  7.411281
196 "36" 1 1963  7.786526
702 "36" 1 1963  7.383864
528 "16" 1 1963  7.836859
246 "21" 1 1963  8.564982
372 "16" 1 1963  8.608574
152 "20" 1 1963  7.815402
222 "16" 1 1963  9.658947
528 "15" 1 1963  8.183544
392 "18" 1 1963  7.212449
372 "17" 1 1963  7.730071
188 "19" 1 1963         .
788 "15" 1 1963  7.907479
716 "18" 1 1963         .
 32 "18" 1 1963         .
356 "36" 1 1963  6.337507
642 "20" 1 1963         .
388 "20" 1 1963  8.442659
214 "21" 1 1963  8.358867
356 "20" 1 1963  6.141352
642 "19" 1 1963         .
620 "15" 1 1963  7.801026
208 "18" 1 1963  8.031172
752 "18" 1 1963  8.175014
826 "15" 1 1963  8.442614
598 "19" 1 1963         .
400 "16" 1 1963   7.82051
710 "22" 1 1963  8.219606
214 "19" 1 1963         .
410 "18" 1 1963  6.633695
400 "18" 1 1963  6.989285
630 "15" 1 1963         .
376 "15" 1 1963  8.251745
586 "15" 1 1963  7.492325
196 "18" 1 1963  7.739839
170 "21" 1 1963  8.523235
716 "15" 1 1963         .
124 "22" 1 1963  8.976151
340 "18" 1 1963  6.823594
280 "36" 1 1963  8.478176
840 "17" 1 1963  8.879159
 40 "17" 1 1963  7.733911
826 "17" 1 1963  7.907815
792 "21" 1 1963  8.089255
508 "36" 1 1963         .
792 "36" 1 1963  7.732408
840 "15" 1 1963  9.494059
end

Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#2

07 Jun 2022, 10:05

I am trying to create two variables that computes the variance of log of output per worker (ly) over time by industry (isic) and type of industry (tech_intensity)

Code:

by isic tech_intensity, sort: egen wanted = sd(ly)
replace wanted = wanted^2

In addition, I am also trying to compute the variance of industries (isic) within the same group (tech_intensities). That is, how ly is dispersed among industries within the same group.

I don't understand what this means.

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17702
#3

07 Jun 2022, 10:21

Hugo:
I do share Clyde's concern about the clarity of your second query.
Therefore, please consider what follows as a tentative answer:

Code:

. bysort isic (year): egen wanted=sd(ly)
. replace wanted=wanted^2

Kind regards,
Carlo
(Stata 19.0)

Hugo Rocha

Join Date: Feb 2021

Posts: 288
#4

07 Jun 2022, 10:26

Originally posted by Clyde Schechter

Code:

by isic tech_intensity, sort: egen wanted = sd(ly)
replace wanted = wanted^2

I don't understand what this means.

Thank you. For the first line I tried something similar, and I also tried your syntax. But I am getting negative values for the standard deviation for one group.

Code:

tabstat sdly, by(tech_intensity)

Summary for variables: sdly
     by categories of: tech_intensity

tech_intensity |      mean
---------------+----------
             1 | -.1312786
             2 |   .210542
             3 |  .0106954
---------------+----------
         Total | -8.00e-11
--------------------------

Can I try to include year as well to see how the standard deviation changed over time among these industries?

For the second part, what I mean (my apologies if I was not clear) is: within a given group of industries (tech_intensity), how dispersed is ly across the industries (isic) in that group?

Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#5

07 Jun 2022, 10:48

There is no such thing as a negative standard deviation, as you know. So something went seriously wrong in the calculation of sdly. Please post back with a data example that exhibits this problem and show the exact code you used to create the sdly variable.

For the second part, I think I understand now. Try this to get both parts

Code:

// PART 1
by tech_intensity isic, sort: egen variance = sd(ly)
replace variance = variance^2

// PART 2
egen one_isic_obs = tag(isic tech_intensity)
by tech_intensity (isic), sort: egen wanted = sd(cond(one_isic_obs, variance, .))

Note: here wanted is a standard deviation. If you prefer to have that result in the variance metric, just -replace wanted = wanted^2- at the end.
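
Two variations on the code above may be worth sketching; neither comes from the posts themselves. The first adds year to the by-list, which is one way to see how the dispersion changes over time. The second reads "how dispersed is ly across industries" as the spread of the industry means of ly within each tech_intensity group, rather than the spread of the industry variances. That reading is an assumption, and the variable names var_ly_year, mean_ly, one_obs and sd_between are made up for illustration.

```stata
* (a) Variance of ly within each industry-year cell (countries pooled)
by tech_intensity isic year, sort: egen var_ly_year = sd(ly)
replace var_ly_year = var_ly_year^2

* (b) Spread of the industry means of ly within each tech_intensity group
by tech_intensity isic, sort: egen mean_ly = mean(ly)
egen one_obs = tag(tech_intensity isic)    // flags one observation per industry
by tech_intensity, sort: egen sd_between = sd(cond(one_obs, mean_ly, .))
```

In (a), any cell with fewer than two non-missing values of ly gets a missing result, because a standard deviation cannot be computed from a single observation.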

Hugo Rocha

Join Date: Feb 2021
Posts: 288

07 Jun 2022, 11:07

Exactly. Here is the exact code I ran with my standard deviation.

Code:

 by isic tech_intensity, sort: egen sd_tech_intensity = sd(ly)

Code:

 bysort tech_intensity: sum sdly

-------------------------------------------------------------------------------------------------------------------------------
-> tech_intensity = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        sdly |     25,078   -.1312786    .9608647  -6.361836   4.074447

-------------------------------------------------------------------------------------------------------------------------------
-> tech_intensity = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        sdly |     14,931     .210542    1.045799  -7.439419   8.159857

-------------------------------------------------------------------------------------------------------------------------------
-> tech_intensity = 3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        sdly |     13,894    .0106954    .9793396  -7.317303   4.497964

Code:

 tabstat sdly if year<2000, by(tech_intensity)

Last edited by Hugo Rocha; 07 Jun 2022, 11:20.

Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#7

07 Jun 2022, 11:15

Well, your command -by isic tech_intensity, sort: egen sd_tech_intensity = sd(ly)- generates a variable named sd_tech_intensity. But the -sum- and -tabstat- commands you show are operating on some other variable named sdly. So how did you get the variable sdly? And please be sure to include in your response a data example that reproduces this problem.
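
A quick way to see which variable is which can be sketched as follows, assuming both sd_tech_intensity and sdly are still in memory. The -assert- line stops with an error if any non-missing value is negative, which can never happen for a genuine standard deviation.

```stata
* Inspect both variables: storage type, label, and summary statistics
describe sd_tech_intensity sdly
summarize sd_tech_intensity sdly

* A true standard deviation is never negative; this fails loudly if one is
assert sd_tech_intensity >= 0 if !missing(sd_tech_intensity)
```

As a guess only: an overall mean of essentially zero (-8.00e-11) with standard deviations close to 1 is what a standardized variable looks like, for example one created with the std() function of -egen- instead of sd(). Nothing in the thread confirms that this is how sdly was created.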

Hugo Rocha

Join Date: Feb 2021
Posts: 288

07 Jun 2022, 11:24

I honestly do not know how it went. But I deleted all the constructed variables and redid the analysis as follows (based on your syntax), and the problem seems to have disappeared:

Code:

  by isic tech_intensity, sort: egen wanted = sd(ly)

Code:

bysort tech_intensity: sum wanted

-------------------------------------------------------------------------------------------------------------------------------
-> tech_intensity = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      wanted |     49,223    1.237885    .1196244   1.141948   1.692813

-------------------------------------------------------------------------------------------------------------------------------
-> tech_intensity = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      wanted |     27,740    1.279511    .1536367   1.115616   1.569246

-------------------------------------------------------------------------------------------------------------------------------
-> tech_intensity = 3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      wanted |     38,876    1.360191    .1893504   1.151006   1.859497
. tabstat wanted if year<2000, by(tech_intensity)

Summary for variables: wanted
     by categories of: tech_intensity 

tech_intensity |      mean
---------------+----------
             1 |  1.239249
             2 |  1.281243
             3 |  1.365035
---------------+----------
         Total |  1.292305
--------------------------

. tabstat wanted if year>2000, by(tech_intensity)

Summary for variables: wanted
     by categories of: tech_intensity 

tech_intensity |      mean
---------------+----------
             1 |  1.235673
             2 |  1.276771
             3 |  1.350849
---------------+----------
         Total |  1.282882
--------------------------

Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#9

07 Jun 2022, 11:40

Glad to hear it!

But I'm a little distressed about "I honestly do not know how it went." Unless you are doing these analyses just for fun, you should be doing all of this work using do-files, not just typing commands into the Command window. You should also be logging all of your runs. And you should be saving the de-bugged do-files and the logs they produce. You need a complete audit trail of everything you are doing from beginning to end. So the creation of the sdly variable should be somewhere in one of those do-files or log (smcl-) files.

By the way, the audit trail I speak of isn't just for others to review. You need it for yourself too. Imagine that 6 months from now you have to go back to this project, perhaps to add some additional analyses, or change a definition and re-run things. Things like this happen a lot! If you haven't saved all your code and outputs, you will be saddled with starting over from scratch, and it is likely that at that point you won't remember exactly how you did everything, so you will struggle to reproduce the original work, let alone modify it.

Last edited by Clyde Schechter; 07 Jun 2022, 11:47.

Hugo Rocha

Join Date: Feb 2021

Posts: 288
#10

07 Jun 2022, 12:17

I am still not sure where this analysis will lead me in terms of research. Depending on these results, I may dive further into this specific section. Once I get conclusive results, I start making the do-files. You are absolutely right, though. Keeping a record of everything is something I am still struggling with in my research. Last year, I submitted a paper for revision and lost an insane amount of time reproducing the results... it was torture. I think I am learning the hard way...

What are log (smcl) files?

Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#11

07 Jun 2022, 13:24

What are log (smcl) files?

Log files are files created by Stata that basically mirror what you see in the Results window. They contain the commands, and also hold the output generated by those commands. To create a log file, pick a name for the file and run

Code:

log using filename_I_picked, replace

From that point on, everything you run, both the commands and the output you see in Results, including any messages, will be saved in the file you designated. If you do not specify a filename extension, the file type will be .smcl (Stata Markup and Control Language). I generally begin all my do-files with:

Code:

capture log close
log using filename, replace

and at the end of the do-file I put

Code:

log close

To look at the contents of a log file you have already saved, the command is:

Code:

view filename.smcl

This will cause a Viewer window to pop up containing the file, and you can review everything that was done. Note: you have to actually specify the .smcl extension in this command.

The nice thing about the .smcl files is that they contain the formatting in the output: bold face, colors, etc. The drawback is that you can't share them with people who don't have Stata because no other software, as far as I know, can open them. So if you need to share log files, you can instead save them as plain text files by specifying a .txt filename extension with the filename, and by using the -text- option in the -log- command. So -log using filename.txt, text replace-.

Read -help log- for information about other more advanced aspects of using Stata log files.
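
Putting these pieces together, a do-file might be laid out roughly as in the sketch below. The names my_analysis and mydata are hypothetical, and the two analysis lines are simply borrowed from earlier in this thread as a stand-in for whatever the do-file actually does.

```stata
* my_analysis.do (hypothetical file name)
capture log close                        // close any log left open by a failed run
log using my_analysis.txt, text replace  // plain-text log that anyone can open

use mydata, clear                        // hypothetical data set
by isic tech_intensity, sort: egen wanted = sd(ly)
replace wanted = wanted^2

log close
```

Dropping the .txt extension and the -text- option would produce my_analysis.smcl instead.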

By the way, if you're wondering why I start with that -capture log close- command, here's the explanation. When I first write a do-file, it is likely to contain errors. When I try to run it, it will proceed until it breaks. At that point, I need to review what I've done, fix the error(s) and re-run it. Well, when I try to re-run it, and it hits the -log using filename, replace- statement, Stata will halt with an error message telling me that the file is already open. And, of course, it is open because the do-file didn't make it all the way to the -log close- command at the end. So then I would have to type -log close- in the Command window and restart the do-file. To avoid this inconvenience, putting -log close- before the -log using- command would work, except that on the first run, the log isn't already open, so this, too, would throw an error. By using -capture log close-, Stata will close the log file if it is, in fact, open, and just move on quietly if it isn't. Exactly what is needed.

Hugo Rocha

Join Date: Feb 2021

Posts: 288
#12

08 Jun 2022, 14:35

Thank you very much! I am taking a look at it!

Hugo Rocha

Join Date: Feb 2021

Posts: 288
#13

10 Jun 2022, 08:21

I do apologize for continuing this thread, but I forgot to ask something important. In the end, my goal is to plot these standard deviations over time. However, since the panel is unbalanced, there are many years in which many countries and industries (isic) are not present. What would be a good syntax to construct a common sample (of countries and industries), so that I can then plot the standard deviations on this common sample?

Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#14

10 Jun 2022, 11:50

However, since the panel is unbalanced, there are many years in which many countries and industries (isic) are not present. What could be a good syntax to construct a common sample (of countries and industries)

I'm interpreting this as meaning that you want the sample to consist of all and only those country#industry combinations that have an observation in every year that appears in the data set.

Code:

summ year, meanonly
local first = r(min)
local last = r(max)
by country isic (year), sort: keep if year[1] == `first' & year[_N] == `last' ///
    & _N == `last'-`first'+1

should do it. Now, the example data in #1 contains only one year (1963), so it results in everything being kept. But based on your statement that the full data set is unbalanced, I believe this code will retain those and only those country#isic combinations that have an observation in every year. Just be aware that because of the special nature of the example data in #1, this code is not fully tested.

Hugo Rocha

Join Date: Feb 2021

Posts: 288
#15

10 Jun 2022, 12:04

BTW, I started to use the log files today and it is amazing I can see all my commands now! Thank you so much for that!

My goal is to have a common and consistent sample of countries and industries for a set of years. Now that I am checking, many countries and industries are not present before the 1990s. So perhaps I will plot the standard deviation of log value added per worker starting from 1990, while keeping the same countries and industries (this is my main goal: having a common sample). Would that change your syntax, perhaps in the second line?
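
One possible adaptation is sketched below. It is untested and is not an answer given in the thread. It assumes the window should start in 1990, that observations with missing ly should not count as "present", and that the quantity to plot is the standard deviation of ly across the retained country#isic cells within each tech_intensity group and year. The name sd_ly is made up for illustration.

```stata
* Assumption: the common sample starts in 1990
keep if year >= 1990
* Assumption: an observation with missing ly does not count as present
drop if missing(ly)

* Same logic as in #14, applied to the restricted window
summ year, meanonly
local first = r(min)
local last = r(max)
by country isic (year), sort: keep if year[1] == `first' & year[_N] == `last' ///
    & _N == `last'-`first'+1

* One standard deviation per group and year, then one panel per group
collapse (sd) sd_ly = ly, by(tech_intensity year)
twoway line sd_ly year, by(tech_intensity)
```

Because -collapse- replaces the data in memory, it may be safer to wrap the last two lines in -preserve- and -restore-. The `///` line continuation works only inside a do-file, not in the Command window.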