Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generating Mean Values for a variable based on 2 conditions in Panel Data

    Hi,

    I have panel data of150+ companies for the years 2001-19. I have divided these years between pre-crisis years (2001 to 2006) and post crisis years (2010 to 2019.) The data points contain Geographic Diversification and ROA. I would like to generate the average ROA for the following conditions:
    Average ROA by Company for pre-crisis and post-crisis years separately and where Geographic Diversification>1.4
    I tried the following command but I do not think the results are correct:
    Code:
    egen AvgROApreCrisis = mean(ROA) if inrange(Year,2001,2006) & GeographicDiversificationr>1.4, by(Company)
    Any help in correcting this code would be very greatly appreciated. Thanks a lot.

  • #2
    Code:
    by Company, sort: egen AvgROApreCrisis = mean(cond(inrange(Year, 2001, 2006) & GeographicDiversification>1.4, ROA, .))

    Comment


    • #3
      Thanks Clyde (#2.) I tried the code above but ended up with blank values. I also tried a variants without the GeographicDiversification variable but that too gave me blank values.
      Code:
      by Company, sort: egen AvgROApreCrisis = mean(cond(inrange(Year, 2001, 2006) & GeographicDiversification>1.4, ROA, .))
      (3,534 missing values generated)
      by Company, sort: egen AvgROApreCrisis = mean(cond(inrange(Year, 2001, 2006), ROA, .))
      (3,534 missing values generated)
      I am wondering if I am missing something? (It is quite impossible that all values generated are blank.

      Comment


      • #4
        Well, you don't provide any example data, so I can't test it out and troubleshoot it. But if you are getting all missing values, from that code it would seem to imply that ROA is always missing when YEAR is between 2001 and 2006. Try running
        Code:
        summ ROA if inrange(Year, 2001, 2006)
        and see what you get. If I am right, it will say there are 0 observations and all the other statistics will be missing. And you will then have to figure out why your data are like that.

        If I am wrong, then post back with example data (use the -dataex- command) that reproduces the problem and I'll try to troubleshoot it. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

        Comment


        • #5
          Thank you Clyde (#4.) Really appreciate your help. There is clearly something wrong with my data because, as you mentioned, the command summ ROA gives me no results:

          Code:
          . summ ROA_win05 if inrange(Year, 2001, 2006)
          
              Variable |        Obs        Mean    Std. Dev.       Min        Max
          -------------+---------------------------------------------------------
             ROA_win05 |          0
          I am pasting 38 observations below. Also, I would like to share the variable details i.e.
          ID = Company
          GSCluster = GeographicDiversification
          ROA_win05 = ROA ( - I have winsorized the ROA)

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input int ID byte Year double(GSCluster ROA_win05)
          2  1                  .     .
          2  2                  .     .
          2  3                  .     .
          2  4                  .     .
          2  5                  .     .
          2  6                  .     .
          2  7                  .     .
          2  8                  .     .
          2  9                  .     .
          2 10 .05600153435484741     .
          2 11 .14162706960881938  .036
          2 12  .2014492218323749  -6.2
          2 13 .20436253747814215 -7.92
          2 14                  0   -11
          2 15                  0   -11
          2 16                  0 -2.09
          2 17                  0  -.95
          2 18  .4429715044705054 -.766
          2 19  .6573397035288463 -.847
          4  1                  0     .
          4  2                  0     .
          4  3                  0     .
          4  4                  0  9.85
          4  5                  0  3.74
          4  6                  0  2.42
          4  7                  0  2.57
          4  8                  0  1.78
          4  9                  0  9.76
          4 10                  0  7.02
          4 11                  0  1.12
          4 12                  0  1.57
          4 13                  0  1.52
          4 14                  0   1.9
          4 15                  . -.488
          4 16                  0 -.701
          4 17                  0   -11
          4 18                  0   -11
          4 19                  .     .
          end
          In case you have any insight into what is wrong with the data, please do let me know. Many thanks.

          Comment


          • #6
            In the example data you show, ROA is always missing in years 1 through 3, and sporadically missing in other years as well..

            Evidently, this is not what you expected. You need to review the data management that created this data set. I would start at the beginning: find the raw data files and determine whether ROA (or the variables you calculate ROA from if ROA was not a "given" variable) is also missing in the early years there. If so, you need to communicate with the original provider of that data to find out what is going on and what to do. If the ROA values (or all of the variables you calculate it from) are present, you will have to retrace the code that created your data set from the beginning to see at what point those variables get clobbered or lost.

            Comment


            • #7
              Thank you Clyde (#6.) I agree that ROA is missing sporadically for some years. That is normal in data pertaining to financial statements. While I can try and look for other sources for this ROA data and fill it in, I find it strange that lack of some ROA data points should result in NO results for that data range. For instance, at an aggregate level, following are the winsorized ROA results:

              Code:
              . summ ROA_win05
              
                  Variable |        Obs        Mean    Std. Dev.       Min        Max
              -------------+---------------------------------------------------------
                 ROA_win05 |      2,613     2.74942    5.946121        -11       13.6
              However, if I try to run the summ command for a data range (say 2001 to 2006), I get NIL results eg.

              Code:
              . summ ROA_win05 if inrange(Year, 2001, 2006)
              
                  Variable |        Obs        Mean    Std. Dev.       Min        Max
              -------------+---------------------------------------------------------
                 ROA_win05 |          0
              and similarly, the results are NIL for the data range 2010 to 2019:

              Code:
              . summ ROA_win05 if inrange(Year, 2010, 2019)
              
                  Variable |        Obs        Mean    Std. Dev.       Min        Max
              -------------+---------------------------------------------------------
                 ROA_win05 |          0
              Does that mean that in order to be able to get the results for the code mentioned in #2, I should have ALL data points? Would appreciate any input. Thank you so much.

              Comment


              • #8
                Look, forget about statistics. Common sense says that you can't figure out anything about what happens to ROA in years 2001 to 2006 (or 2010 to 2019) if there are no values of ROA available in those years.

                Working with sporadic missing data raises some issues of its own, but those aren't the problem we're dealing with right now. We're dealing with the problem of having no data at all for the years that are important in your model. That's what you have to fix first.

                Once you've done that, if you want to worry about the sporadic missing data in other years, fine--but it's a completely different issue.

                Comment


                • #9
                  Hi Clyde,

                  Thanks. I totally agree that one cannot work if there is no data. However, that is not my question. My question is that if I use the summ command by itself, I do get some ROA data however if I use the summ command for the full data range (2001-2019), I get nothing. So, I am wondering why the data does not show up for the full time series i.e. "inrange(Year, 2001, 2019)" makes the data.


                  Code:
                  . summ ROA_win05
                  
                      Variable |        Obs        Mean    Std. Dev.       Min        Max
                  -------------+---------------------------------------------------------
                     ROA_win05 |      2,613     2.74942    5.946121        -11       13.6

                  Code:
                  . summ ROA_win05 if inrange(Year, 2001, 2019)
                  
                      Variable |        Obs        Mean    Std. Dev.       Min        Max
                  -------------+---------------------------------------------------------
                     ROA_win05 |          0
                  In case my question is not making sense, please ignore it. Apologies for these basic queries but I am trying to understand the logic.

                  Comment


                  • #10
                    Well, what are the values of the Year variable. I noted that in your example data, the values of Year were numbers like 1, 2, 3,..., not in the 2000's. I imagined that for reasons of confidentiality or secrecy you might have just recoded it that way. BUt maybe in your real data that's also true. If so, there are going to be no Years in the range 2001 to 2009. Run -tab Year, nolabel- to see what you actually have.

                    Comment


                    • #11
                      Hi Clyde - Thanks so much for your help. My data file is too large with multiple fields customised for the different variables. That is the lesson for me is to keep the file tidy and remove unnecessary fields. You are right on the Year parameter. I have corrected it and it is now working fine. Apologies for taking up your time on this forum and thanks a lot.

                      Comment

                      Working...
                      X