  • Variance over time by industry (variable)

    Hi,

    I am trying to create two variables that compute the variance of the log of output per worker (ly) over time by industry (isic) and type of industry (tech_intensity), but clearly the command is not just sd^2 by isic or tech_intensity. In addition, I am trying to compute the variance across industries (isic) within the same group (tech_intensity), that is, how ly is dispersed among industries within the same group. Any hints? Thank you very much!

    Code:
     * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int country str2 isic float tech_intensity int year float ly
    392 "20" 1 1963   7.26083
    710 "16" 1 1963  8.802372
    356 "17" 1 1963  6.480821
    356 "18" 1 1963  6.298085
    566 "15" 1 1963  8.434374
    288 "17" 1 1963  7.216023
    368 "18" 1 1963  7.340207
    356 "21" 1 1963  7.079779
    792 "17" 1 1963  7.442109
    100 "18" 1 1963         .
    364 "19" 1 1963         .
    616 "21" 1 1963  7.429392
    600 "36" 1 1963         .
     40 "16" 1 1963  10.65649
    266 "17" 1 1963         .
    278 "22" 1 1963         .
    410 "22" 1 1963    7.1299
    620 "20" 1 1963  7.622666
    760 "18" 1 1963   6.31662
    152 "16" 1 1963  10.35778
    266 "36" 1 1963         .
     56 "15" 1 1963  8.705056
    200 "19" 1 1963         .
    620 "21" 1 1963  8.078789
    250 "18" 1 1963  7.561381
    214 "15" 1 1963   7.09509
    600 "18" 1 1963         .
    222 "17" 1 1963 8.1795435
    578 "36" 1 1963  8.216323
    710 "20" 1 1963  7.067297
    608 "22" 1 1963  7.425481
    100 "20" 1 1963         .
    840 "19" 1 1963         .
    222 "19" 1 1963         .
    508 "18" 1 1963         .
    800 "22" 1 1963  6.906117
    710 "18" 1 1963  7.492924
    894 "17" 1 1963  7.218875
    862 "15" 1 1963   8.94442
    364 "17" 1 1963  7.422935
    792 "19" 1 1963         .
    826 "18" 1 1963   7.72704
     40 "22" 1 1963  7.940096
    364 "21" 1 1963  7.233703
    716 "21" 1 1963         .
    100 "16" 1 1963         .
    124 "21" 1 1963 9.2917595
    528 "20" 1 1963  8.188397
    788 "20" 1 1963    7.1653
    620 "36" 1 1963  6.033399
    528 "22" 1 1963   8.06982
    724 "19" 1 1963         .
    368 "36" 1 1963  6.826024
    368 "22" 1 1963  7.411281
    196 "36" 1 1963  7.786526
    702 "36" 1 1963  7.383864
    528 "16" 1 1963  7.836859
    246 "21" 1 1963  8.564982
    372 "16" 1 1963  8.608574
    152 "20" 1 1963  7.815402
    222 "16" 1 1963  9.658947
    528 "15" 1 1963  8.183544
    392 "18" 1 1963  7.212449
    372 "17" 1 1963  7.730071
    188 "19" 1 1963         .
    788 "15" 1 1963  7.907479
    716 "18" 1 1963         .
     32 "18" 1 1963         .
    356 "36" 1 1963  6.337507
    642 "20" 1 1963         .
    388 "20" 1 1963  8.442659
    214 "21" 1 1963  8.358867
    356 "20" 1 1963  6.141352
    642 "19" 1 1963         .
    620 "15" 1 1963  7.801026
    208 "18" 1 1963  8.031172
    752 "18" 1 1963  8.175014
    826 "15" 1 1963  8.442614
    598 "19" 1 1963         .
    400 "16" 1 1963   7.82051
    710 "22" 1 1963  8.219606
    214 "19" 1 1963         .
    410 "18" 1 1963  6.633695
    400 "18" 1 1963  6.989285
    630 "15" 1 1963         .
    376 "15" 1 1963  8.251745
    586 "15" 1 1963  7.492325
    196 "18" 1 1963  7.739839
    170 "21" 1 1963  8.523235
    716 "15" 1 1963         .
    124 "22" 1 1963  8.976151
    340 "18" 1 1963  6.823594
    280 "36" 1 1963  8.478176
    840 "17" 1 1963  8.879159
     40 "17" 1 1963  7.733911
    826 "17" 1 1963  7.907815
    792 "21" 1 1963  8.089255
    508 "36" 1 1963         .
    792 "36" 1 1963  7.732408
    840 "15" 1 1963  9.494059
    end


  • #2
    I am trying to create two variables that compute the variance of the log of output per worker (ly) over time by industry (isic) and type of industry (tech_intensity)
    Code:
    by isic tech_intensity, sort: egen wanted = sd(ly)
    replace wanted = wanted^2
    In addition, I am also trying to compute the variance of industries (isic) within the same group (tech_intensities). That is, how ly is dispersed among industries within the same group.
    I don't understand what this means.



    • #3
      Hugo:
      I do share Clyde's concern about the clarity of your query.
      Therefore, please consider what follows as a tentative answer:
      Code:
      . bysort isic (year): egen wanted=sd(ly)
      . replace wanted=wanted^2
      Kind regards,
      Carlo
      (Stata 19.0)



      • #4
        Originally posted by Clyde Schechter
        Code:
        by isic tech_intensity, sort: egen wanted = sd(ly)
        replace wanted = wanted^2

        I don't understand what this means.
        Thank you. For the first part I had tried something similar, and I have now also tried your syntax, but I am getting negative values for the standard deviation in one group.


        Code:
          tabstat sdly, by(tech_intensity)
        
        Summary for variables: sdly
             by categories of: tech_intensity 
        
        tech_intensity |      mean
        ---------------+----------
                     1 | -.1312786
                     2 |   .210542
                     3 |  .0106954
        ---------------+----------
                 Total | -8.00e-11    
        --------------------------
        Can I try to include year as well to see how the standard deviation changed over time among these industries?

        For the second part, what I mean (my apologies if I was not clear) is: within a given group of industries (tech_intensity), how dispersed is ly across the industries (isic) in that group?





        • #5
          There is no such thing as a negative standard deviation, as you know. So something went seriously wrong in the calculation of sdly. Please post back with a data example that exhibits this problem and show the exact code you used to create the sdly variable.

          For the second part, I think I understand now. Try this to get both parts

          Code:
          // PART 1
          by tech_intensity isic, sort: egen variance = sd(ly)
          replace variance = variance^2
          
          // PART 2
          egen one_isic_obs = tag(isic tech_intensity)
          by tech_intensity (isic), sort: egen wanted = sd(cond(one_isic_obs, variance, .))
          Note: here wanted is a standard deviation. If you prefer to have that result in the variance metric, just -replace wanted = wanted^2- at the end.
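
          On including year (the question raised in #4): a minimal sketch, assuming the goal is one variance per industry per year, computed across the countries observed in that year. The variable name variance_year is hypothetical, and this has not been run on the full data set. Any industry-year cell with a single non-missing observation will get a missing value, because a standard deviation needs at least two observations.

          Code:
          // variance of ly across countries, separately for each industry and year
          by tech_intensity isic year, sort: egen variance_year = sd(ly)
          replace variance_year = variance_year^2

          // sd() cannot return a negative number, so this check should pass silently
          assert variance_year >= 0 if !missing(variance_year)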



          • #6
            Exactly. Here is the exact code I ran with my standard deviation.

            Code:
             by isic tech_intensity, sort: egen sd_tech_intensity = sd(ly)

            Code:
             bysort tech_intensity: sum sdly
            
            -------------------------------------------------------------------------------------------------------------------------------
            -> tech_intensity = 1
            
                Variable |        Obs        Mean    Std. Dev.       Min        Max
            -------------+---------------------------------------------------------
                    sdly |     25,078   -.1312786    .9608647  -6.361836   4.074447
            
            -------------------------------------------------------------------------------------------------------------------------------
            -> tech_intensity = 2
            
                Variable |        Obs        Mean    Std. Dev.       Min        Max
            -------------+---------------------------------------------------------
                    sdly |     14,931     .210542    1.045799  -7.439419   8.159857
            
            -------------------------------------------------------------------------------------------------------------------------------
            -> tech_intensity = 3
            
                Variable |        Obs        Mean    Std. Dev.       Min        Max
            -------------+---------------------------------------------------------
                    sdly |     13,894    .0106954    .9793396  -7.317303   4.497964

            Code:
             tabstat sdly if year<2000, by(tech_intensity)

            Summary for variables: sdly
                 by categories of: tech_intensity

            tech_intensity |      mean
            ---------------+----------
                         1 | -.3831029
                         2 | -.0534197
                         3 | -.3028053
            ---------------+----------
                     Total | -.2697763
            --------------------------
            Thank you very much!
            Last edited by Hugo Rocha; 07 Jun 2022, 11:20.



            • #7
              Well, your command -by isic tech_intensity, sort: egen sd_tech_intensity = sd(ly)- generates a variable named sd_tech_intensity. But the -sum- and -tabstat- commands you show are operating on some other variable named sdly. So how did you get the variable sdly? And please be sure to include in your response a data example that reproduces this problem.
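
              One possible explanation, offered only as a guess because the thread never shows how sdly was created: an overall mean of essentially zero and standard deviations close to 1 are what a standardized (z-score) variable looks like, and -egen- has both an sd() and a std() function. A minimal sketch contrasting the two on the same data (check_sd and check_std are hypothetical names):

              Code:
              egen check_sd  = sd(ly)    // one constant: the standard deviation of ly, never negative
              egen check_std = std(ly)   // standardized ly: mean 0, SD 1, so many values are negative
              summarize check_sd check_std

              If the summary of check_std resembles the sdly output above, the two functions were probably mixed up.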



              • #8
                Originally posted by Clyde Schechter
                Well, your command -by isic tech_intensity, sort: egen sd_tech_intensity = sd(ly)- generates a variable named sd_tech_intensity. But the -sum- and -tabstat- commands you show are operating on some other variable named sdly. So how did you get the variable sdly? And please be sure to include in your response a data example that reproduces this problem.
                I honestly do not know how it happened. But I deleted all the constructed variables and redid the analysis as follows (based on your syntax), and the problem seems to have disappeared:

                Code:
                  by isic tech_intensity, sort: egen wanted = sd(ly)
                Code:
                 bysort tech_intensity: sum wanted
                 -------------------------------------------------------------------------------------------------------------------------------
                -> tech_intensity = 1
                
                    Variable |        Obs        Mean    Std. Dev.       Min        Max
                -------------+---------------------------------------------------------
                      wanted |     49,223    1.237885    .1196244   1.141948   1.692813
                
                -------------------------------------------------------------------------------------------------------------------------------
                -> tech_intensity = 2
                
                    Variable |        Obs        Mean    Std. Dev.       Min        Max
                -------------+---------------------------------------------------------
                      wanted |     27,740    1.279511    .1536367   1.115616   1.569246
                
                -------------------------------------------------------------------------------------------------------------------------------
                -> tech_intensity = 3
                
                    Variable |        Obs        Mean    Std. Dev.       Min        Max
                -------------+---------------------------------------------------------
                      wanted |     38,876    1.360191    .1893504   1.151006   1.859497

                . tabstat wanted if year<2000, by(tech_intensity)

                Summary for variables: wanted
                     by categories of: tech_intensity

                tech_intensity |      mean
                ---------------+----------
                             1 |  1.239249
                             2 |  1.281243
                             3 |  1.365035
                ---------------+----------
                         Total |  1.292305
                --------------------------

                . tabstat wanted if year>2000, by(tech_intensity)

                Summary for variables: wanted
                     by categories of: tech_intensity

                tech_intensity |      mean
                ---------------+----------
                             1 |  1.235673
                             2 |  1.276771
                             3 |  1.350849
                ---------------+----------
                         Total |  1.282882
                --------------------------





                • #9
                  Glad to hear it!

                  But I'm a little distressed about "I honestly do not know how it went." Unless you are doing these analyses just for fun, you should be doing all of this work using do-files, not just typing commands into the Command window. You should also be logging all of your runs. And you should be saving the de-bugged do-files and the logs they produce. You need a complete audit trail of everything you are doing from beginning to end. So the creation of the sdly variable should be somewhere in one of those do-files or log (smcl-) files.

                  By the way, the audit trail I speak of isn't just for others to review. You need it for yourself too. Imagine that 6 months from now you have to go back to this project, perhaps to add some additional analyses, or change a definition and re-run things. Things like this happen a lot! If you haven't saved all your code and outputs, you will be saddled with starting over from scratch, and it is likely that at that point you won't remember exactly how you did everything, so you will struggle to reproduce the original work, let alone modify it.
                  Last edited by Clyde Schechter; 07 Jun 2022, 11:47.



                  • #10
                    I am still not sure where this analysis will lead me in terms of research. Depending on these results, I may dig further into this specific section. Once I get conclusive results, I start making the do-files. You are absolutely right, though: keeping a record of everything is something I still struggle with in my research. Last year, I submitted a paper for revision and lost an insane amount of time reproducing the results... it was torture. I think I am learning the hard way...

                    What are log (smcl) files?



                    • #11
                      What are log (smcl) files?
                      Log files are files created by Stata that basically mirror what you see in the Results window. They contain the commands, and also hold the output generated by those commands. To create a log file, pick a name for the file and run
                      Code:
                      log using filename_I_picked, replace
                      From that point on, everything you run, both the commands and the output you see in Results, including any messages, will be saved in the file you designated. If you do not specify a filename extension, the file type will be .smcl (Stata Markup and Control Language). I generally begin all my do-files with:

                      Code:
                      capture log close
                      log using filename, replace
                      and at the end of the do-file I put
                      Code:
                      log close
                      To look at the contents of a log file you have already saved, the command is:
                      Code:
                      view filename.smcl
                      This will cause a Viewer window to pop up containing the file, and you can review everything that was done. Note: you have to actually specify the .smcl extension in this command.

                      The nice thing about the .smcl files is that they contain the formatting in the output: bold face, colors, etc. The drawback is that you can't share them with people who don't have Stata because no other software, as far as I know, can open them. So if you need to share log files, you can instead save them as plain text files by specifying a .txt filename extension with the filename, and by using the -text- option in the -log- command. So -log using filename.txt, text replace-.

                      Read -help log- for information about other more advanced aspects of using Stata log files.

                      By the way, if you're wondering why I start with that -capture log close- command, here's the explanation. When I first write a do-file, it is likely to contain errors. When I try to run it, it will proceed until it breaks. At that point, I need to review what I've done, fix the error(s) and re-run it. Well, when I try to re-run it, and it hits the -log using filename, replace- statement, Stata will halt with an error message telling me that the log file is already open. And, of course, it is open because the do-file didn't make it all the way to the -log close- command at the end. So then I would have to type -log close- in the Command window and restart the do-file. To avoid this inconvenience, putting -log close- before the -log using- command would work, except that on the first run, the log isn't already open, so this, too, would throw an error. By using -capture log close-, Stata will close the log file if it is, in fact, open, and just move on quietly if it isn't. Exactly what is needed.
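
                      Putting those pieces together, a do-file skeleton might look like the sketch below. The log name variance_by_industry and the data set name mydata are placeholders, and the analysis lines are simply the commands from earlier in this thread. The final -translate- line is an optional extra, not something discussed above: it writes a plain-text copy of the .smcl log for people who do not have Stata.

                      Code:
                      capture log close                         // close any log left open by a failed run
                      log using variance_by_industry, replace   // no extension given, so a .smcl file is written

                      use mydata, clear                         // placeholder data set name
                      by isic tech_intensity, sort: egen wanted = sd(ly)
                      replace wanted = wanted^2

                      log close
                      // optional: plain-text copy of the log for sharing
                      translate variance_by_industry.smcl variance_by_industry.log, replace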



                      • #12
                        Thank you very much! I am taking a look at it!



                        • #13
                          I apologize for continuing this thread, but I forgot to ask something important. In the end, my goal is to plot these standard deviations over time. However, since the panel is unbalanced, there are many years in which many countries and industries (isic) are not present. What could be a good syntax to construct a common sample (of countries and industries), so that I can then plot the standard deviations on that common sample?



                          • #14
                            However, since the panel is unbalanced, there are many years in which many countries and industries (isic) are not present. What could be a good syntax to construct a common sample (of countries and industries)
                            I'm interpreting this as meaning that you want the sample to consist of all and only those country#industry combinations that have an observation in every year that appears in the data set.

                            Code:
                            summ year, meanonly
                            local first = r(min)
                            local last = r(max)
                            by country isic (year), sort: keep if year[1] == `first' & year[_N] == `last' ///
                                & _N == `last'-`first'+1
                            should do it. Now, the example data in #1 contains only one year (1963), so it results in everything being kept. But based on your statement that the full data set is unbalanced, I believe this code will retain those and only those country#isic combinations that have an observation in every year. Just be aware that because of the special nature of the example data in #1, this code is not fully tested.
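
                            The same idea can be restricted to a later starting year. A minimal, untested sketch, assuming a cutoff of 1990 (chosen purely for illustration and stored in the hypothetical local start_year) and exactly one observation per country#isic#year:

                            Code:
                            // the counting logic below assumes one observation per country#isic#year
                            isid country isic year

                            local start_year 1990                 // illustrative cutoff, change as needed
                            keep if year >= `start_year'
                            summ year, meanonly
                            local last = r(max)
                            by country isic (year), sort: keep if year[1] == `start_year' & year[_N] == `last' ///
                                & _N == `last' - `start_year' + 1

                            Note that if no country#isic combination has an observation in every year from the cutoff onward, this leaves an empty data set, so it is worth running -count- afterwards.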



                            • #15
                              BTW, I started to use the log files today and it is amazing I can see all my commands now! Thank you so much for that!

                              My goal is to have a common and consistent sample of countries and industries for a set of years. Now that I am checking, many countries and industries are not present before the 1990s. So perhaps I will plot the standard deviation of log value added per worker starting from 1990 (while keeping the same countries and industries; having a common sample is my main goal). Would that perhaps change the second line of your syntax?
