  • Generating New Variable based on another variable's percentage distribution

    Hi everyone,

    Sorry if my question isn't clear. Below is an output I have created.



            alc |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |         84        3.21        3.21
              2 |         81        3.10        6.31
              3 |         89        3.40        9.72
              4 |         42        1.61       11.32
              5 |         93        3.56       14.88
              6 |         68        2.60       17.48
              7 |         76        2.91       20.39
              8 |         52        1.99       22.38
              9 |         78        2.98       25.36
             10 |         57        2.18       27.54
             11 |         69        2.64       30.18
             12 |         65        2.49       32.67
             13 |         82        3.14       35.81
             14 |        103        3.94       39.75
             15 |         58        2.22       41.97
             16 |         83        3.18       45.14
             17 |         91        3.48       48.62
             18 |         76        2.91       51.53
             19 |         86        3.29       54.82
             20 |         76        2.91       57.73
             21 |        100        3.83       61.55
             22 |        127        4.86       66.41
             23 |        101        3.86       70.28
             24 |         42        1.61       71.88
             25 |         54        2.07       73.95
             26 |         72        2.75       76.70
             27 |        113        4.32       81.03
             28 |        101        3.86       84.89
             29 |         74        2.83       87.72
             30 |        123        4.71       92.43
             31 |         90        3.44       95.87
             32 |         35        1.34       97.21
             33 |         73        2.79      100.00
    ------------+-----------------------------------
          Total |      2,614      100.00


    I was wondering whether there is a simple way to replace each value of alc with its corresponding percentage from the table above. For example, 1 would become 3.21, 2 would become 3.10, 3 would become 3.40, and so on. Is there a command or function that does this, or would I have to replace each value with its percentage one at a time?

    Thank you.


  • #2
    Code:
    assert !missing(alc)
    gen total=_N
    bys alc: gen wanted= (_N/total)*100



    • #3
      Code:
      local total_obs = _N
      by alc, sort: gen wanted = 100*_N/`total_obs'
      Added: Crossed with #2. Our two solutions are conceptually the same. They differ stylistically. However #2, importantly, first checks that alc has no missing values--which is necessary in order for the code to perform correctly. So #2 is better.
      Last edited by Clyde Schechter; 04 Apr 2023, 16:14.



      • #4
        Originally posted by Clyde Schechter
        Code:
        local total_obs = _N
        by alc, sort: gen wanted = 100*_N/`total_obs'
        Added: Crossed with #2. Our two solutions are conceptually the same. They differ stylistically. However #2, importantly, first checks that alc has no missing values--which is necessary in order for the code to perform correctly. So #2 is better.
        Hi,

        Thank you for the quick response. The total for this variable is different from the total number of observations in the entire dataset: "alc" has 2,614 nonmissing observations, while the dataset has 113,290 observations in total. Sorry, I'm new to Stata, so I don't know whether the code you both provided still gives correct results in this case, given that it uses _N.

        Does it not matter?

        Thank you.



        • #5
          With missing values, the code in #2 will not run: it will break at the initial -assert- command. The code in #3 will run and give incorrect results.

          Code:
          summ alc, meanonly
          local denominator `r(N)'
          by alc, sort: gen wanted = 100*_N/`denominator' if !missing(alc)
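To see why the denominator matters, here is a minimal sketch on a made-up data set (the ten values and the variable names pct_all and pct_nonmiss are purely illustrative, not from the original data). It computes the percentage both ways, so the two results can be compared with the Percent column that -tabulate- reports. Note that it starts with -clear-, so it should only be run in a session with no unsaved data.

```stata
// Toy data (hypothetical): 8 nonmissing values of alc plus 2 missing
clear
input alc
1
1
1
1
2
2
3
3
.
.
end

// Denominator as in #3: all observations, including the missing ones
local total_obs = _N
by alc, sort: gen pct_all = 100*_N/`total_obs'

// Denominator as in #5: nonmissing observations only
summ alc, meanonly
local denominator `r(N)'
by alc, sort: gen pct_nonmiss = 100*_N/`denominator' if !missing(alc)

// Compare both versions with the Percent column from -tabulate-
tab alc
list alc pct_all pct_nonmiss, sepby(alc)
```

In this toy example alc == 1 accounts for 4 of the 8 nonmissing observations, so -tabulate- reports 50 percent. pct_nonmiss agrees, whereas pct_all gives 40 because the two missing observations inflate the denominator.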



          • #6
            Originally posted by Clyde Schechter
            With missing values, the code in #2 will not run: it will break at the initial -assert- command. The code in #3 will run and give incorrect results.

            Code:
            summ alc, meanonly
            local denominator `r(N)'
            by alc, sort: gen wanted = 100*_N/`denominator' if !missing(alc)

            When I run that last line of code I get an error message:


            . do "C:\Users\WESTER~1\AppData\Local\Temp\STD00000000.tmp"

            . by alc, sort: gen wanted = 100*_N/`denominator' if !missing(alc)
            unknown function if!missing()
            r(133);

            end of do-file

            r(133);


            Is it something I'm doing wrong?

            Thanks.



            • #7
              When I run the code in #5 with a toy data set that resembles what you describe in #1, it runs with no errors and produces correct results.
              Code:
              . // CREATE TOY DATA SET
              . clear*
              
              . set obs 50
              Number of observations (_N) was 0, now 50.
              
              . set seed 1234
              
              . gen alc = runiformint(1, 33)
              
              . replace alc = . if runiform() < 0.1
              (3 real changes made, 3 to missing)
              
              .
              . // RUN THE CODE
              . summ alc, meanonly
              
              . local denominator `r(N)'
              
              . by alc, sort: gen wanted = 100*_N/`denominator' if !missing(alc)
              (3 missing values generated)
              And if you run the above code yourself, you will see that the results it produces are correct.

              So I cannot reproduce the problem you describe. I also am unable to figure out what you are showing in #6 because
              . by alc, sort: gen wanted = 100*_N/`denominator' if !missing(alc)
              unknown function if!missing()
              r(133);
              is self-contradictory. The error message complains of an unknown function if!missing(), but the command that led to it contains no such function. The critical issue is this: there must be a space between if and !missing(). The error message implies that there isn't one, yet the line of code above it does include a space there. So the error message cannot have arisen from that code unless your Stata installation is somehow corrupted. I would go back and try again.

              As a general rule,* the best way to use code you see on Statalist is to copy and paste it into your do-file editor. If you type it by hand, you risk making mistakes. I suspect that you are in fact doing a lot of hand typing here and making repeated mistakes: when you first tried to run the code in #5, you left out the space between if and !missing(), and then, when you posted back about the problem, you retyped the command and this time included the space, so the command shown no longer matches the error message.

              In any case, all I can definitely say is that the code in #5 when properly copied does not produce any error messages and it does produce correct results.
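One way to confirm that the results are correct is sketched below. It rebuilds the toy data set from this post, reruns the code from #5, and then checks that the percentages, counted once per distinct value of alc, add up to 100. The variable name one_per_value is a hypothetical helper introduced only for this check. The sketch begins with -clear*-, so it should not be run with unsaved data in memory.

```stata
// Rebuild the toy data set and rerun the code from #5
clear*
set obs 50
set seed 1234
gen alc = runiformint(1, 33)
replace alc = . if runiform() < 0.1

summ alc, meanonly
local denominator `r(N)'
by alc, sort: gen wanted = 100*_N/`denominator' if !missing(alc)

// Flag one observation per distinct nonmissing value of alc
// (egen's tag() leaves observations with missing alc untagged by default)
egen byte one_per_value = tag(alc)

// Summing wanted over the flagged observations should give 100,
// apart from floating-point rounding
summ wanted if one_per_value, meanonly
display "Sum of percentages across distinct values: " r(sum)
assert abs(r(sum) - 100) < 0.001

// Visual check: the mean of wanted within each value of alc should
// match the Percent column of the plain -tabulate-
tab alc
tab alc, summarize(wanted)
```

If the -assert- passes silently and the two tables agree, the generated variable reproduces the percentage distribution.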

              *However, code copy/pasted from web sites (including this Forum) and from word processing documents or PDFs can cause problems. They sometimes contain non-printing characters that are used to control formatting and text display. Those non-printing characters are not visible to our eyes, but they are visible to Stata and can cause all sorts of weird problems that are maddeningly difficult to debug. So sometimes you actually have to manually retype code that was copy/pasted to get it to work. But that is a far less common problem than the errors that get introduced by typing code from these sources.
              Last edited by Clyde Schechter; 04 Apr 2023, 22:14.

