Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help needed with generating specific datapoints/observations

    Hello

    I'm working on a big survey dataset and i'm looking for a way of generating specific datapoints based on the information in other variables. This post is sort of continuing on the following posts: https://www.statalist.org/forums/for...riable-problem . Which i got amazing help in regards to how to manipulate a time variable.

    Anyways,

    For example, i have a variable that is showing the time of day a respondent finished answering a survey called "hour" --> if the respondent finishes the survey between 00:00 and 00:59 = 1, 01:00 - 01:59 = 2,..., 23:00 - 23:59 = 24.

    Furthermore, the dataset contains various categorical variables that show what the respondents have answered on the different questions in the survey.

    For example, "How positive or negative are you in regards to what you have just seen?" 1 = very negative, 2 = negative, 3 = neither positive or negative, 4 = positive, 5 = very positive.

    Now, what I would like to do is to take the average score between the different time intervals such that that "score" is shown through all the observations of that time interval

    For example: (q1_average_scores is the variable i want to make)
    ID q1_scores hour q1_average_scores
    1 3 1 3.5
    2 4 1 3.5
    3 5 2 3
    4 1 2 3

    As mentioned, the dataset is quite large and have 10's of thousands more rows. Doing this manually for 24 time periods is a possibility (but, not really), and if I were to do it based on the above example i believe the code would be:

    generate q1_average_scores = 0

    sum q1_scores in 1/2 ---> Which would give me the average score of ID 1 and 2 = 3.5
    replace q1_average_scores = 3.5 in 1/2

    then

    sum q1_scores in 3/4 --> average = 3
    replace q1_average_scores = 3 in 3/4

    The problem, of course, is that with 15 more variables and 24 time intervals the job is tedious and also obviously extremely prone to human error, which i would like to avoid. My question therefore is: is there a way to get the same result using another code that would cross-reference the average value of q1_scores and the respective time interval in "hour" and then put the average score in "q1_average_scores"? I dont know if i'm even explaining myself adequately, but hopefully that is understandable.

    Thanks in advance for any help!

    Tor

  • #2
    Tor:
    I would give the following code a shot:
    Code:
    bysort hour: egen wanted=mean( q1_scores)
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      That worked perfectly, and exactly as intented. Thank you, Carlo, much appreciated!

      Tor

      Comment


      • #4
        Hi again, Carlo (or anyone that might feel compelled to offer a solution)

        I was wondering if it was possible to use the same function in order to make a percentage score instead? More specifically, by just getting the percentage score of all values = 2 OR 3 within a certain time interval

        For example
        ID q1_scores hour % score
        1 2 4 80%
        2 2 4 80%
        3 3 4 80%
        4 2 4 80%
        5 5 4 80%
        As shown above, it would potentially look something like shown in variable "% score", where 80% of the respondents that have answered in the time interval = 4 has given an answer which corresponds to the value 2 or 3.

        Im guessing i am to use a similar bysort code involving IF and OR, which also includes a way of making it a percentage score, but i've been looking at both help bysort, by, sort and egen and haven't found anything that would help.

        Comment


        • #5
          I can't envisage what your data look like. The thread linked in #1 does not help me as it seems to focus only on the handling of dates and times.

          In any case, you should (please) by now -- after 35 posts and 15 months on the list -- be familiar with our request to show data examples using dataex.

          FAQ Advice #12 applies.

          In the absence of a data example. I will answer by analogy. A percent(age) is just one step away from a proportion and a proportion can always be cast as the mean of a (0, 1) indicator.

          This is as example you can run in Stata yourself.

          Code:
          . sysuse auto, clear
          (1978 automobile data)
          
          . egen wanted = mean(100 * cond(rep78 < ., inlist(rep78, 4,5), .)), by(foreign)
          
          . tabdisp foreign, c(wanted)
          
          ----------------------
          Car       |
          origin    |     wanted
          ----------+-----------
           Domestic |   22.91667
            Foreign |   85.71429
          ----------------------
          
          . tab rep78 foreign, col
          
          +-------------------+
          | Key               |
          |-------------------|
          |     frequency     |
          | column percentage |
          +-------------------+
          
              Repair |
              record |      Car origin
                1978 |  Domestic    Foreign |     Total
          -----------+----------------------+----------
                   1 |         2          0 |         2
                     |      4.17       0.00 |      2.90
          -----------+----------------------+----------
                   2 |         8          0 |         8
                     |     16.67       0.00 |     11.59
          -----------+----------------------+----------
                   3 |        27          3 |        30
                     |     56.25      14.29 |     43.48
          -----------+----------------------+----------
                   4 |         9          9 |        18
                     |     18.75      42.86 |     26.09
          -----------+----------------------+----------
                   5 |         2          9 |        11
                     |      4.17      42.86 |     15.94
          -----------+----------------------+----------
               Total |        48         21 |        69
                     |    100.00     100.00 |    100.00

          There is quite a lot to unpack here, but here goes, The core is that a percentage is 100 times a proportion.

          Mixing code and pseudocode, the main idea is


          egen wanted = mean(100 * true-or-false-expression), by(whatever)

          where I should flag that

          egen wanted = 100 * mean(true-or-false-expression), by(whatever)

          may seem equivalent, but that is illegal as Stata code. The trick is writing down the true-or-false expression as code.

          inlist(rep78, 3, 4)

          is true (returning 1) if rep78 is 3 or 4 and false (returning 0) otherwise.

          It is elementary and also fundamental that the mean of a series of 0s and 1s is a proportion just as the mean of

          0, 0, 0, 1, 1, 1, 1, 1, 1, 1

          is 0.7 as those magnificent seven 1s are mixed with three 0s.

          The twist here is that missing values also count as "not 3 or 4", so the code insists on ignoring such values. Know that Stata always ignores missing values as far as possible when calculating means.

          cond(rep78 < ., inlist(rep78, 4,5), .)

          says

          if rep78 < . evaluate inlist(rep78, 4, 5)
          otherwise return missing



          .

          Comment


          • #6
            Hello and thanks for your reply, Nick!

            My apologies, i thought providing a table of example data was sufficient. I will start using the dataex command from now on.

            I'm not quite sure i understand what the specific code i am to use in order to get the percentage value, but I think i somewhat understand the process. Using dataex:

            input long(ID q1_awareness) float hour
            467 5 8
            468 3 9
            469 1 9
            470 4 10
            471 5 10
            472 4 10
            473 3 10
            474 4 10
            475 4 10
            476 4 10
            477 3 10
            478 4 10

            My first step was to find the average value within specific time intervals. The code Carlo provided worked perfectly for this:

            bysort hour: egen wanted=mean(q1_awareness)

            Now, what I think you're trying to tell me is to use a code that creates a variable where the values are 0 or 1. It will be 1 if a criteria is satisfied (for example, the criteria can be that within the time intervals the value in q1_awareness needs to be either 4 or 5), and it will return 0 if the values are anything else than 4 or 5. Furthermore, it takes the mean of the 0 and 1's within the different time intervals and multiplies it by 100 in order to get the actual percentage value. (btw, it does not need to be multiplied by 100 since the information of f.ex 0.31 or 31 will essentially be the same).

            Again, i am not quite sure how to type the specific code needed for this operation, but based on your explanation would something like the below code be correct?

            egen wanted = 100*mean(q1_awareness if q1_awareness == 4 OR 5), by(hour)

            Obviously, this code wont work in STATA, and my challenge is getting the condition term correct.
            Last edited by Tor Haug Anonsen; 12 Jan 2023, 01:58.

            Comment


            • #7
              Thanks for the (slightly incomplete) data example.

              In #5 I was at pains to flag that 100 * mean(something) is illegal as an egen function call, but mean(100 * something) is legal. This is surprising mathematically but from a Stata view isn't difficult to explain: the syntax of egen as shown in the help is strict that a function call must immediately follow the equals sign.

              if qualifiers aren't allowed within function calls. That is why we need true-or-false checks, say using inlist() or inrange().

              The code you want is quite close to my example in #5. As you say, a factor of 100 is utterly a matter of convention and not compulsory, so in this version we will leave it out.

              Being cautious about missing values does no harm.

              Code:
              clear 
              
              input long(ID q1_awareness) float hour
              467 5 8
              468 3 9
              469 1 9
              470 4 10
              471 5 10
              472 4 10
              473 3 10
              474 4 10
              475 4 10
              476 4 10
              477 3 10
              478 4 10
              end 
              
              egen pc45 = mean(cond(q1_awareness < ., inlist(q1_awareness, 4,5), .)), by(hour)
              
              list, sepby(hour)
              
                   +----------------------------------+
                   |  ID   q1_awa~s   hour       pc45 |
                   |----------------------------------|
                1. | 467          5      8          1 |
                   |----------------------------------|
                2. | 468          3      9          0 |
                3. | 469          1      9          0 |
                   |----------------------------------|
                4. | 470          4     10   .7777778 |
                5. | 471          5     10   .7777778 |
                6. | 472          4     10   .7777778 |
                7. | 473          3     10   .7777778 |
                8. | 474          4     10   .7777778 |
                9. | 475          4     10   .7777778 |
               10. | 476          4     10   .7777778 |
               11. | 477          3     10   .7777778 |
               12. | 478          4     10   .7777778 |
                   +----------------------------------+
              See also https://www.stata-journal.com/articl...article=dm0055 -- especially Section 9.


              To maintain my reputation as picky, I flag https://www.statalist.org/forums/help#spelling and that criterion is the singular of criteria.

              Comment


              • #8
                Ah, okey. I understand. Thank you for providing a very good example, which worked perfectly! (and also giving me a lecture in grammar, all though, in my defense, english is not my first language and i do know the plural and singular spelling of the word in Norwegian )

                Comment


                • #9
                  Picky on picky: the grammar here is really Greek, as English singular and plural are copied from the Greek originals. just as with phenomenon and phenomena.

                  But I know absolutely no Norwegian. I can tell you that many English-speaking academics and students get this wrong too.

                  Comment


                  • #10
                    Til

                    Comment

                    Working...
                    X