  • Average values across variables

    Hi,
    I have 30 variables (one per word) with a value on each day from 2004 to 2022. I want to average the words' values on each day t, so that I end up with a new variable holding each day's average. How would I go about doing this?

  • #2
    It depends on how your data is organized. Can you give us an example?
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------


    • #3
      Code:
      date       cost  cheap  donation  asset  Competitiveadvantage  France  Gold
      1/1/2004   10    70     0         0      0                     23      6
      1/2/2004   15    75     0         6      0                     23      10
      1/3/2004   10    65     0         0      0                     20      9
      1/4/2004   11    76     0         0      0                     24      10
      1/5/2004   0     44     0         0      0                     22      21
      1/6/2004   15    74     0         15     0                     25      10
      1/7/2004   10    67     5         15     0                     22      11
      1/8/2004   16    64     0         14     0                     24      12
      1/9/2004   15    58     3         11     5                     24      9
      1/10/2004  14    79     0         0      5                     25      14
      1/11/2004  14    81     0         18     7                     24      14
      1/12/2004  19    74     0         5      0                     27      9
      1/13/2004  19    70     0         13     2                     24      11
      1/14/2004  17    65     0         10     2                     24      11
      1/15/2004  20    61     0         7      4                     21      10
      1/16/2004  14    63     0         8      0                     19      10
      My data look like this: I have many words, and I want to average their values on each day.


      • #4
        Raul:
        you may want to consider the -rowmean- function available from -egen-.
        Kind regards,
        Carlo
        (Stata 19.0)
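A minimal sketch of this suggestion, using the first three days of the data example in #3. The new variable name `dailymean` is hypothetical, and the sketch assumes the word variables sit next to each other in the dataset so that the range `cost-Gold` picks up all of them. `rowmean()` averages across the listed variables within each observation and ignores missing values:

```stata
clear
input str10 date cost cheap donation asset Competitiveadvantage France Gold
"1/1/2004" 10 70 0 0 0 23 6
"1/2/2004" 15 75 0 6 0 23 10
"1/3/2004" 10 65 0 0 0 20 9
end

* average across the seven word variables within each day (one row per day)
egen dailymean = rowmean(cost-Gold)
list date dailymean
```

Because there is one observation per day, the row mean is the daily average across words.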


        • #5
          Thank you Carlo.

          With my 50 words, I regressed each one on another variable and computed a t-statistic. In the code below, I only want to include those words that had a negative t-statistic. Is there a shortcut for this?
          Code:
          egen UKIS = rmean(varlist)


          • #6
            Raul Athwall, I don't think I understand what you are doing. It is not clear to me whether taking a mean across these variables (as the rowmean function for egen does) makes conceptual sense. Is that really what you want?

            On the other hand, finding the mean of every variable (separately) for each day would not make sense either, since your data example suggests you have just one observation per day.

            Could you please clarify?


            • #7
              I essentially want the mean across the variables on each day. For example, looking at my data on the first day, I want to add up the observations for 'cost', 'cheap', 'donation', etc. and get their average for that particular day, and then do the same for every day in my dataset (2004-2022). It isn't a mean for each variable but an average observation on each day.


              • #8
                Could I ask what the larger purpose is? I am struggling to imagine what this "average observation" would be useful for.


                • #9
                  Essentially, my project is to create an index of Google Trends words and relate it to the UK stock market to see if there is a relationship. I have data on each of these words and have calculated the log daily differences for each word. To create my index, I want the average daily difference across all the words I have chosen on day t.


                  • #10
                    Ah I see, thanks for the explanation. You mentioned you wanted to restrict it to the variables with negative t statistics. Could you show your code for generating those statistics?


                    • #11
                      Code:
                      foreach var of varlist ldiffcost_w-ldiffexpense_w {
                          regress `var' rmrf, robust
                      }
                      
                      Linear regression                               Number of obs     =      3,536
                                                                      F(1, 3534)        =       0.98
                                                                      Prob > F          =     0.3225
                                                                      R-squared         =     0.0003
                                                                      Root MSE          =     .11711
                      
                      ------------------------------------------------------------------------------
                                   |               Robust
                       ldiffcost_w | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                      -------------+----------------------------------------------------------------
                              rmrf |   .1873577   .1893697     0.99   0.323    -.1839274    .5586427
                             _cons |   -.003509   .0019697    -1.78   0.075    -.0073708    .0003528
                      ------------------------------------------------------------------------------
                      Around 30 of them have a negative t-statistic, and these are the only ones I want to include in my average.


                      • #12
                        So from what little I have understood, here is the path I would take:
                        • first, note that the sign of a t statistic is completely determined by the sign of the coefficient itself.
                        • second, we can access the coefficients using the e(b) matrix that is stored after any regression. For a simple regression, the slope coefficient is accessed by e(b)[1,1]
                        So in the loop running the regressions, I would do something like this (assuming you want to collect the names of the dependent variables from the regressions for which the slope coefficient is negative):

                        Code:
                        local negvars
                        foreach var of varlist ldiffcost_w - ldiffexpense_w {
                            regress `var' rmrf, robust
                            if e(b)[1,1] < 0 local negvars `negvars' `var'
                        }
                        and then later, I would use the local macro we created in the egen command described before:
                        Code:
                        egen UKIS = rowmean(`negvars')
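For anyone who wants to try the two steps end to end, here is a self-contained sketch on simulated data. Everything in it except `rmrf` and `UKIS` is hypothetical: the three `ldiff_*` variables are made up, with slopes of -1, +1 and -1 by construction. It uses `_b[rmrf]`, which refers to the slope by name rather than by its position in e(b), and it skips the egen call if no variable qualifies, since `rowmean()` needs a non-empty variable list:

```stata
clear
set seed 12345
set obs 500
gen rmrf = rnormal()

* hypothetical word variables: negative, positive and negative true slopes
gen ldiff_a = -1*rmrf + rnormal()
gen ldiff_b =  1*rmrf + rnormal()
gen ldiff_c = -1*rmrf + rnormal()

* collect the dependent variables whose estimated slope on rmrf is negative
local negvars
foreach var of varlist ldiff_a-ldiff_c {
    quietly regress `var' rmrf, robust
    if _b[rmrf] < 0 local negvars `negvars' `var'
}

* only build the index if at least one variable qualified
if "`negvars'" != "" {
    egen UKIS = rowmean(`negvars')
}
display "`negvars'"
```

Note that the local macro only exists within the do-file (or interactive session) in which it was defined, so the loop and the egen command need to be run together.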


                        • #13
                          Just so I have understood correctly: this isolates the variables with a negative slope coefficient (and hence a negative t-statistic) and puts them into the rowmean function?


                          • #14
                            Yes: this collects the dependent variables (among those in the variable list in the foreach var of varlist ... statement) which have a negative slope coefficient when regressed on rmrf, and puts them into the local macro negvars, so they can be averaged using the rowmean function of the egen command.


                            • #15
                              Whether this is a good idea is another question. For a start, the criterion necessarily lets through many variables whose relationship is weak but happens to qualify as negative. But if this was set for me as an assignment, I wouldn't want to lump words together anyway.
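To illustrate the first point, here is a sketch of one possible stricter criterion. It is not something proposed earlier in the thread, and the 5% two-sided level is an arbitrary assumption: it keeps only the variables whose t-statistic is below the negative critical value, rather than every variable with a negative sign. The data are simulated and the `ldiff_*` names are hypothetical; `ldiff_b` has a true slope of zero, so its estimated slope is negative about half the time but rarely clears the threshold:

```stata
clear
set seed 2022
set obs 500
gen rmrf = rnormal()

* hypothetical word variables: strong negative, zero and positive true slopes
gen ldiff_a = -1*rmrf + rnormal()
gen ldiff_b =  0*rmrf + rnormal()
gen ldiff_c =  1*rmrf + rnormal()

local negvars
foreach var of varlist ldiff_a-ldiff_c {
    quietly regress `var' rmrf, robust
    * keep the variable only if t is below the lower 2.5% critical value
    if _b[rmrf]/_se[rmrf] < -invttail(e(df_r), 0.025) local negvars `negvars' `var'
}
display "`negvars'"
```

Even so, screening on significance and then averaging is itself a form of selection, so this addresses only the weak-relationship concern, not the question of whether the words should be lumped together at all.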
