Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting observations in string variables

    Hello,

    I’m using Stata BE/17.0 and I want to sum the numbers of distinct diplomas students have. The maximum of diploma a student can have is 3 and each diploma is indicated under the 3 string variables diplo1 diplo2 diplo3. There are 3 possible diplomas (des dep cfms) and I want to create 3 new numeric variables (sum_des sum_dep sum_cfms) that would range from 0 to 3. The "sd" value is “without diploma”, so it would count as a zero.

    I tried a loop and failed miserably. I finally did it more "manually", but I presume there is a better code that would count the number of distinct diplomas each student has. Here what I did anyways:

    Code:
    foreach dip in des dep cfms {
    forvalues num= 1/3{
                gen `dip'`num' = 0
    }
    }
     
    forvalue num = 1/3{
     replace des`num'=1 if diplo`num'=="des"
    }
    gen sum_des= des1+des2+des3
    drop des1 des2 des3
     
    forvalue num = 1/3{
     replace dep`num'=1 if diplo`num'=="dep"
    }
    gen sum_dep= dep1+dep2+dep3
    drop dep1 dep2 dep3
     
    forvalue num = 1/3{
     replace cfms`num'=1 if diplo`num'=="cfms"
    }
    gen sum_cfms= cmfs1+cmfs2+cmfs3
    drop cmfs1 cmfs2 cmfs3
    Here is the sample data:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str34 diplo1 str4(diplo2 diplo3)
    "cfms" " cfms "     ""  
    "sd"   ""     ""  
    "dep"  "dep" "dep"
    "des"  ""     ""  
    "sd"   ""     ""  
    "dep"  "dep"  ""  
    "cfms" ""     ""  
    "des"  ""     ""  
    "des"  "dep"  ""  
    "des"  ""     ""  
    "sd"   ""     ""  
    end

    Thank you. Any input is appreciated.

    N







  • #2
    Code:
    gen `c(obs_t)' student_id = _n
    reshape long diplo, i(student_id)
    foreach x in cfms dep des {
        by student_id (_j), sort: egen `x'_count = total(diplo == "`x'")
    }
    egen all_diplomas = rowtotal(*_count)
    reshape wide
    Note: I suspect in your real data set you already have a student id variable. If so, skip the line that creates student_id and replace student_id throughout by the name of your actual student id variable. It is crucial, however, that each value of that student id variable occurs in only one observation in the data set.

    You may want to consider eliminating the -reshape wide- at the end. I don't know where you're going with this data from here. Most thing in Stata are easier with long data than with wide. But it depends on what you'll be doing next.

    Comment


    • #3
      Here is one way to do it.

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str34 diplo1 str4(diplo2 diplo3)
      "cfms" " cfms "     ""  
      "sd"   ""     ""  
      "dep"  "dep" "dep"
      "des"  ""     ""  
      "sd"   ""     ""  
      "dep"  "dep"  ""  
      "cfms" ""     ""  
      "des"  ""     ""  
      "des"  "dep"  ""  
      "des"  ""     ""  
      "sd"   ""     ""  
      end
      
      gen long id = _n
      save diplo, replace
      
      reshape long diplo , i(id) j(which)
      
      drop if inlist(diplo, "", "sd")
      drop which
      duplicates drop
      bysort id : gen count = _N
      by id: keep if _n == 1
      drop diplo  
      
      merge 1:1 id using diplo
      sort id  
      
      list
      
      
           +--------------------------------------------------------+
           | id   count   diplo1   diplo2   diplo3           _merge |
           |--------------------------------------------------------|
        1. |  1       2     cfms      cfm               Matched (3) |
        2. |  2       .       sd                     Using only (2) |
        3. |  3       1      dep      dep      dep      Matched (3) |
        4. |  4       1      des                        Matched (3) |
        5. |  5       .       sd                     Using only (2) |
           |--------------------------------------------------------|
        6. |  6       1      dep      dep               Matched (3) |
        7. |  7       1     cfms                        Matched (3) |
        8. |  8       1      des                        Matched (3) |
        9. |  9       2      des      dep               Matched (3) |
       10. | 10       1      des                        Matched (3) |
           |--------------------------------------------------------|
       11. | 11       .       sd                     Using only (2) |
           +--------------------------------------------------------+
      Last edited by Nick Cox; 15 Jun 2023, 17:23.

      Comment


      • #4
        Here is another method, that does not require reshaping the data:

        Code:
        local diplomas des dep cfms
        label define DIPLOMAS 0 "sd" 1 "des" 2 "dep" 3 "cfms"
        
        forval i = 1/3 {
            gen _diplo`i' = trim(diplo`i')
            encode _diplo`i', gen(diplo`i'_n) label(DIPLOMAS)
        }
        
        forval i = 1/3 {
            local dip: word `i' of `diplomas'
            egen sum_`dip' = anycount(diplo?_n) , values(`i')
        }
        
        drop diplo?_n _diplo?
        which produces:

        Code:
        . list , noobs sep(0)
        
          +---------------------------------------------------------+
          | diplo1   diplo2   diplo3   sum_des   sum_dep   sum_cfms |
          |---------------------------------------------------------|
          |   cfms     cfms                  0         0          2 |
          |     sd                           0         0          0 |
          |    dep      dep      dep         0         3          0 |
          |    des                           1         0          0 |
          |     sd                           0         0          0 |
          |    dep      dep                  0         2          0 |
          |   cfms                           0         0          1 |
          |    des                           1         0          0 |
          |    des      dep                  1         1          0 |
          |    des                           1         0          0 |
          |     sd                           0         0          0 |
          +---------------------------------------------------------+
        A couple remarks:
        • you seem to have edited the dataex extract by hand, which produced at least one oddity: the value of diplo2 in the first observation was being read as " cfm" since you had declared this to be str4. I fixed the variable declaration line of the dataex so that everything is read properly:
        Code:
        input str4 diplo1 str6(diplo2 diplo3)
        • The step where I create variables that have the trimmed versions of diplo? is necessitated by just one instance, again the same value of diplo2 in the first row, " cfms " which contains leading and trailing spaces. If this was entirely due to your manual editing, and your data does not actually have such spaces, you can modify the code accordingly and work directly with the diplo? variables instead of _diplo? . On the other hand, if this is a feature of your data and you decide to use the solution in #2 instead of mine, do trim the variable diplo there, otherwise the code will not produce the correct result:
        Code:
        by student_id (_j), sort: egen `x'_count = total(trim(diplo) == "`x'")
        Last edited by Hemanshu Kumar; 15 Jun 2023, 21:26.

        Comment


        • #5
          Here is yet another way to do it. This time I fixed the inconsistencies in the data example. thanks to Hemanshu Kumar

          Code:
          clear
          input str34 diplo1 str4(diplo2 diplo3)
          "cfms" "cfms"     ""  
          "sd"   ""     ""  
          "dep"  "dep" "dep"
          "des"  ""     ""  
          "sd"   ""     ""  
          "dep"  "dep"  ""  
          "cfms" ""     ""  
          "des"  ""     ""  
          "des"  "dep"  ""  
          "des"  ""     ""  
          "sd"   ""     ""  
          end
          
          gen count = 0 
          
          foreach d in cfms des dep {
              replace count = count + strpos(diplo1 + diplo2 + diplo3, "`d'") > 0 
          }
          
          list 
          
              +----------------------------------+
               | diplo1   diplo2   diplo3   count |
               |----------------------------------|
            1. |   cfms     cfms                1 |
            2. |     sd                         0 |
            3. |    dep      dep      dep       1 |
            4. |    des                         1 |
            5. |     sd                         0 |
               |----------------------------------|
            6. |    dep      dep                1 |
            7. |   cfms                         1 |
            8. |    des                         1 |
            9. |    des      dep                1 |
           10. |    des                         1 |
               |----------------------------------|
           11. |     sd                         0 |
               +----------------------------------+
          This in turn could be written as just one (awkward) line.

          Comment


          • #6
            See also Section 7 in https://journals.sagepub.com/doi/pdf...867X0900900107 which is

            SJ-9-1 pr0046 . . . . . . . . . . . . . . . . . . . Speaking Stata: Rowwise
            (help rowsort, rowranks if installed) . . . . . . . . . . . N. J. Cox
            Q1/09 SJ 9(1):137--157
            shows how to exploit functions, egen functions, and Mata
            for working rowwise; rowsort and rowranks are introduced

            However, the egen function rowsvals() introduced there, and now in egenmore from SSC, does not have a handle for ignoring sd.

            Comment


            • #7
              Fix to #5

              Code should be

              Code:
              replace count = count + (strpos(diplo1 + diplo2 + diplo3, "`d'") > 0)
              The tell-tale here is that in observation 9 the occurrence of both des and dep should imply a count of 2.
              Last edited by Nick Cox; 16 Jun 2023, 03:34.

              Comment


              • #8
                Thanks a lot Clyde Schechter, Hemanshu Kumar and Nick Cox for this generous input. I will study all your solutions carefully.

                Clyde Schechter the purpose was to have descriptive statistics of all the diplomas awarded, by sex (not included in the extract data). And yes, you're right to suspect I have the student id variable.
                Hemanshu Kumar I have indeed edited the data extract by hand... I always have a hard time chosing the best sample data to simplify my problem and often end up editing.

                Comment


                • #9
                  Thank you all! Hemanshu Kumar your code worked perfectly for my needs.

                  Comment

                  Working...
                  X