Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generate variable containing weighed rowmean

    Dear all,

    it might be a wood for the trees kind of problem, but I cannot wrap my head around it.
    I have two datasets, one where the observations are seconds and variables are a score for each second for a person:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double(Sec PersonA) str7 PersonB double PersonC str7 PersonD
    12 2 "4" 2 "4"
    13 4 "4" 4 "4"
    14 4 "2" 4 "2"
    15 4 "4" 4 "4"
    16 4 "3" 4 "3"
    17 3 "3" 3 "3"
    18 3 "2" 3 "2"
    19 3 "1" 3 "1"
    20 3 "2" 3 "2"
    21 5 "3" 5 "3"
    22 5 "4" 5 "4"
    23 5 "5" 5 "5"
    24 5 "5" 5 "5"
    25 4 "5" 4 "5"
    end
    (All values are numeric, I got this wrong in my example).

    What I did before was this: egen mean=rowmean( PersonA- PersonD ). But now it got trickier:


    Another data set contains information about each of these persons, on the basis of which the mean for each second of all persons is to be weighted. In my example data, my goal would be to eaven out values based on kind of car and pet. Unfortunately, every person is a case here, so I can't just match the sets and create a variable for weighting.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str7 PersonID double(car Pet)
    "PersonA" 1 2
    "PersonB" 2 4
    "PersonC" 1 4
    "PersonD" 1 4
    end
    Can anyone think of a smart way to work my way around?The dataset is one of many and contains data for thousands of seconds and many persons.
    As always, thankful for every suggestion and many thanks in advance!

    Best
    N.

  • #2
    I think you're going to have to reshape your dataset with seconds in order to make these calculations. And perhaps reshape back into long after, if there other calculations to be made.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double Sec PersonA PersonB PersonC PersonD
    12 2 4 2 4
    13 4 4 4 4
    14 4 2 4 2
    15 4 4 4 4
    16 4 3 4 3
    17 3 3 3 3
    18 3 2 3 2
    19 3 1 3 1
    20 3 2 3 2
    21 5 3 5 3
    22 5 4 5 4
    23 5 5 5 5
    24 5 5 5 5
    25 4 5 4 5
    end
    
    *reshape
    reshape long Person, i(Sec) j(PersonID) string
    reshape wide Person, i(PersonID) j(Sec)
    ren Pers* Sec*
    ren SeconID PersonID
    replace PersonID ="Person"+PersonID
    * merge with weighting vars
    merge 1:1 PersonID using dataset2.dta
    *drop _merge
    
    *weighting
    * 12 and 25 in this loop are the first and last second in dataset
    * wieghting here as original value*value of car, as example
    forvalues i = 12(1)25 {
    replace Secon`i' = Secon`i'*car
    }

    Comment


    • #3
      Thank you, Jorrit, I will try that! I will have to rename my variables first as they are unfortunately named with random ids without a uniform stem.
      I will report how it went.
      Thanks again!

      Edit: I just did this with my example data to comprehend each step. But I struggle with reshaping the data back to the original set afterwards.
      I dropped the car- and pet-variable first since I see no way keeping them in the second-by-second-set. Then I did:

      reshape long Secon, i(PersonID) j(Sec)
      reshape wide Secon, i(Sec) j(PersonID) string

      The data seems fine, but I want to make sure I didn't make a mistake a can't trace later in the huge dataset... Would that be the correct 'way back'?

      Thank you!

      Last edited by Nadine Bast; 21 Feb 2019, 02:18.

      Comment


      • #4
        I think my data might be to much for Stata... I got this error:

        Data long -> wide
        -----------------------------------------------------------------------------
        Number of obs. 807438 -> 138
        Number of variables 3 -> 5852
        j variable (5851 values) Sekunde -> (dropped)
        xij variables:
        Person_ -> Person_1 Person_2 ... Person_5851
        characteristic contents too long
        The maximum value of the contents is 67,784.
        r(1004);


        Is there any way to work around this?

        Comment


        • #5
          Your code in post #3 is fine.

          As for the error in #4:
          Where Stata complains about the length of the statement "Person_1 Person_2 ... Person_5851" in the reshape command: when expanded for all personID's, this statement exceeds 67,784 characters.
          In some cases this limitation would mean you'd need to cut your dataset in pieces and run the code on each individual piece.
          A simple fix in this specific case is to make var names shorter. You can keep a list of 5851 variable names shorter than those 67,784 characters by doing e.g. below changes.
          Adjust to your variable names and PersonID

          Code:
          *Some fake data
          clear
          set maxvar 10000
          set obs 138
          forvalues i = 1(1)5851 {
          gen Person`i' = runiform()
          }
          gen Sec= _n
          
          *reshape
          reshape long Person, i(Sec) j(PersonID) string
          reshape wide Person, i(PersonID) j(Sec)
          ren Pers* Sec*
          ren SeconID PersonID
          replace PersonID ="Person"+PersonID
          * merge with weighting vars
          merge 1:1 PersonID using dataset2.dta
          
          *weighting
          forvalues i = 12(1)25 {
          replace Secon`i' = Secon`i'*car
          }
          
          *Reshape back
          drop car Pet _merge
          reshape long Secon, i(PersonID) j(Sec)
          ren Secon S
          replace PersonID = subinstr(PersonID, "Person", "P", .)
          reshape wide S, i(Sec) j(PersonID) string
          ren SP* Person_*

          Comment


          • #6
            Thank you, Jorrit!
            I already started to split my data in sets of 450 observations each, which worked but produced 14 sets.

            The error actually occurs already in the first part:
            reshape long Person, i(Sec) j(PersonID) string works but reshape wide Person, i(PersonID) j(Sec) produces the error So I will try to put your code in earlier. Your help is greatly appreciated!

            Comment


            • #7
              I didnt have an issue running the exact code posted in #5. Might be some slight changes to variable names in your dataset?
              Logic remains the same though: make sure that the list of variables that reshape generates does not exceed 67,784 characters. So with 5800 vars, stick to short variable names.

              Comment


              • #8
                Sorry, I have expressed myself unclearly - the code is fine. What I meant is that the error concerning excession of characters occured already with the first code (in
                my long dataset, not the example). It works fine with a smaller part of the data.
                I thought stata complained due to the enormous amount of observations in the long set (80.7000),
                but will try if it works with your suggestion and shorten the names. As each person got a random id of 12 letters each, which I can't change because I need them to match the data later,
                I'm nut sure if shortening the prefix is enough.

                Comment


                • #9
                  Ah, okay. I would agree it would be easiest/safest to retain original ID numbers. So in that case splitting the dataset would be a sensible workaround.
                  There might also be alternative routes to what you are trying to accomplish, but I cant think of them.

                  Comment


                  • #10
                    Nevermind, thanks for all your help!

                    Comment

                    Working...
                    X