Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to replace the value of a variable by the most frequently occured value

    Hi,

    I have a dataset looks like this:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str2 id str8 location
    "1" "A"
    "1" "A"
    "1" "A"
    "1" "A"
    "1" "B"
    "1" "B"
    "1" "C"
    "2" "E"
    "2" "E"
    "2" "C"
    end
    I want to replace the value of the location by the most frequently happened one. For instance, for id 1, "A" appears 4 times, so I want to replace all the locations for id 1 to "A", and "E" to id 2.

    I wonder if anyone knows how to realize this in stata.

    Thank you !

  • #2
    Your question is almost fully specified. The most frequent value is called a mode. What do you expect should happen if there is more than one mode, and what logic do you use to select the mode, if not returning all of them?

    Comment


    • #3
      Originally posted by Leonardo Guizzetti View Post
      Your question is almost fully specified. The most frequent value is called a mode. What do you expect should happen if there is more than one mode, and what logic do you use to select the mode, if not returning all of them?
      Hi Leonardo,

      Yeah, that's a great point. I haven't thought of this case before. I guess I'll simply use anyone of the multiple values. For instance, the first one. Do you have any thoughts on how to realize this in Stata?

      Thank you.

      Comment


      • #4
        In that case,
        Code:
        gen long obs_no = _n
        
        by id location (obs_no), sort: gen occurrences = _N
        gsort id occurrences -obs_no
        by id (occurrences): replace location = location[_N]
        
        list, noobs clean
        Had you asked instead for the largest mode or the smallest mode (in the sense of alphabetical order), this would be simpler: you could just use the -egen, mode()- function.

        Comment


        • #5
          Originally posted by Clyde Schechter View Post
          In that case,
          Code:
          gen long obs_no = _n
          
          by id location (obs_no), sort: gen occurrences = _N
          gsort id occurrences -obs_no
          by id (occurrences): replace location = location[_N]
          
          list, noobs clean
          Had you asked instead for the largest mode or the smallest mode (in the sense of alphabetical order), this would be simpler: you could just use the -egen, mode()- function.
          Hi Clyde,

          Thank you very much for your reply. Actually I was trying to use the command

          Code:
           bysort id (location):egen mode = mode(location)
          For the example it works, but when I changed the data to below where I have multiple modes, it has error like this:

          Warning: at least one group contains all missing values or contains multiple modes. Generating missing values for the mode
          of these groups. Use the missing, maxmode, minmode, or nummode() options to control this behavior.
          (7 missing values generated)

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str2 id str8(location mode)
          "1" "A" "" 
          "1" "A" "" 
          "1" "A" "" 
          "1" "B" "" 
          "1" "B" "" 
          "1" "B" "" 
          "1" "C" "" 
          "2" "C" "E"
          "2" "E" "E"
          "2" "E" "E"
          end
          I wonder if you have any idea how to replace the mode for id 1 with the first mode in alphabetic order based on the current code that I have. In my case, use location="A"?

          Thank you.

          Comment


          • #6
            Like the warning says, -egen, mode()- complains when more than one mode is present because it doesn't know what to do with those unless you explicitly tell it. So it substitutes a missing value in those instances.

            If you do want the first (or last) alphabetic mode, then simply add the option -minmode- (or -maxmode-) to the -egen, mode()- command.

            Code:
            clear
            input str2 id str8 location
            "1" "A"
            "1" "A"
            "1" "A"
            "1" "A"
            "1" "B"
            "1" "B"
            "1" "C"
            "2" "E"
            "2" "E"
            "2" "C"
            "3" "B"
            "3" "A"
            end
            
            bysort id : egen first_alpha_mode = mode(location), minmode
            bysort id : egen last_alpha_mode = mode(location), maxmode
            Result

            Code:
                 +----------------------------------------------------+
                 | id   location   first_alpha_mode   last_alpha_mode |
                 |----------------------------------------------------|
              1. |  1          A                  A                 A |
              2. |  1          A                  A                 A |
              3. |  1          A                  A                 A |
              4. |  1          A                  A                 A |
              5. |  1          B                  A                 A |
              6. |  1          B                  A                 A |
              7. |  1          C                  A                 A |
                 |----------------------------------------------------|
              8. |  2          E                  E                 E |
              9. |  2          E                  E                 E |
             10. |  2          C                  E                 E |
                 |----------------------------------------------------|
             11. |  3          B                  A                 B |
             12. |  3          A                  A                 B |
                 +----------------------------------------------------+

            Comment


            • #7
              Originally posted by Leonardo Guizzetti View Post
              Like the warning says, -egen, mode()- complains when more than one mode is present because it doesn't know what to do with those unless you explicitly tell it. So it substitutes a missing value in those instances.

              If you do want the first (or last) alphabetic mode, then simply add the option -minmode- (or -maxmode-) to the -egen, mode()- command.

              Code:
              clear
              input str2 id str8 location
              "1" "A"
              "1" "A"
              "1" "A"
              "1" "A"
              "1" "B"
              "1" "B"
              "1" "C"
              "2" "E"
              "2" "E"
              "2" "C"
              "3" "B"
              "3" "A"
              end
              
              bysort id : egen first_alpha_mode = mode(location), minmode
              bysort id : egen last_alpha_mode = mode(location), maxmode
              Result

              Code:
              +----------------------------------------------------+
              | id location first_alpha_mode last_alpha_mode |
              |----------------------------------------------------|
              1. | 1 A A A |
              2. | 1 A A A |
              3. | 1 A A A |
              4. | 1 A A A |
              5. | 1 B A A |
              6. | 1 B A A |
              7. | 1 C A A |
              |----------------------------------------------------|
              8. | 2 E E E |
              9. | 2 E E E |
              10. | 2 C E E |
              |----------------------------------------------------|
              11. | 3 B A B |
              12. | 3 A A B |
              +----------------------------------------------------+
              I see. That is exactly what I want.

              Thank you so much.

              Comment

              Working...
              X