Counting observations in string variables

Nicolas Charette

Join Date: Aug 2022

Posts: 23
#1

Counting observations in string variables

15 Jun 2023, 15:38

Hello,

I’m using Stata BE/17.0 and I want to sum the numbers of distinct diplomas students have. The maximum of diploma a student can have is 3 and each diploma is indicated under the 3 string variables diplo1 diplo2 diplo3. There are 3 possible diplomas (des dep cfms) and I want to create 3 new numeric variables (sum_des sum_dep sum_cfms) that would range from 0 to 3. The "sd" value is “without diploma”, so it would count as a zero.

I tried a loop and failed miserably. I finally did it more "manually", but I presume there is a better code that would count the number of distinct diplomas each student has. Here what I did anyways:

Code:

foreach dip in des dep cfms { forvalues num= 1/3{ gen `dip'`num' = 0 } } forvalue num = 1/3{ replace des`num'=1 if diplo`num'=="des" } gen sum_des= des1+des2+des3 drop des1 des2 des3 forvalue num = 1/3{ replace dep`num'=1 if diplo`num'=="dep" } gen sum_dep= dep1+dep2+dep3 drop dep1 dep2 dep3 forvalue num = 1/3{ replace cfms`num'=1 if diplo`num'=="cfms" } gen sum_cfms= cmfs1+cmfs2+cmfs3 drop cmfs1 cmfs2 cmfs3

Here is the sample data:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str34 diplo1 str4(diplo2 diplo3) "cfms" " cfms " "" "sd" "" "" "dep" "dep" "dep" "des" "" "" "sd" "" "" "dep" "dep" "" "cfms" "" "" "des" "" "" "des" "dep" "" "des" "" "" "sd" "" "" end

Thank you. Any input is appreciated.

N
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30165
#2

15 Jun 2023, 17:09

Code:

gen `c(obs_t)' student_id = _n reshape long diplo, i(student_id) foreach x in cfms dep des { by student_id (_j), sort: egen `x'_count = total(diplo == "`x'") } egen all_diplomas = rowtotal(*_count) reshape wide

Note: I suspect in your real data set you already have a student id variable. If so, skip the line that creates student_id and replace student_id throughout by the name of your actual student id variable. It is crucial, however, that each value of that student id variable occurs in only one observation in the data set.

You may want to consider eliminating the -reshape wide- at the end. I don't know where you're going with this data from here. Most thing in Stata are easier with long data than with wide. But it depends on what you'll be doing next.
1 like
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35780

15 Jun 2023, 17:13

Here is one way to do it.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str34 diplo1 str4(diplo2 diplo3)
"cfms" " cfms "     ""  
"sd"   ""     ""  
"dep"  "dep" "dep"
"des"  ""     ""  
"sd"   ""     ""  
"dep"  "dep"  ""  
"cfms" ""     ""  
"des"  ""     ""  
"des"  "dep"  ""  
"des"  ""     ""  
"sd"   ""     ""  
end

gen long id = _n
save diplo, replace

reshape long diplo , i(id) j(which)

drop if inlist(diplo, "", "sd")
drop which
duplicates drop
bysort id : gen count = _N
by id: keep if _n == 1
drop diplo  

merge 1:1 id using diplo
sort id  

list


     +--------------------------------------------------------+
     | id   count   diplo1   diplo2   diplo3           _merge |
     |--------------------------------------------------------|
  1. |  1       2     cfms      cfm               Matched (3) |
  2. |  2       .       sd                     Using only (2) |
  3. |  3       1      dep      dep      dep      Matched (3) |
  4. |  4       1      des                        Matched (3) |
  5. |  5       .       sd                     Using only (2) |
     |--------------------------------------------------------|
  6. |  6       1      dep      dep               Matched (3) |
  7. |  7       1     cfms                        Matched (3) |
  8. |  8       1      des                        Matched (3) |
  9. |  9       2      des      dep               Matched (3) |
 10. | 10       1      des                        Matched (3) |
     |--------------------------------------------------------|
 11. | 11       .       sd                     Using only (2) |
     +--------------------------------------------------------+

Last edited by Nick Cox; 15 Jun 2023, 17:23.

Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1478

15 Jun 2023, 21:16

Here is another method, that does not require reshaping the data:

Code:

local diplomas des dep cfms
label define DIPLOMAS 0 "sd" 1 "des" 2 "dep" 3 "cfms"

forval i = 1/3 {
    gen _diplo`i' = trim(diplo`i')
    encode _diplo`i', gen(diplo`i'_n) label(DIPLOMAS)
}

forval i = 1/3 {
    local dip: word `i' of `diplomas'
    egen sum_`dip' = anycount(diplo?_n) , values(`i')
}

drop diplo?_n _diplo?

which produces:

Code:

. list , noobs sep(0)

  +---------------------------------------------------------+
  | diplo1   diplo2   diplo3   sum_des   sum_dep   sum_cfms |
  |---------------------------------------------------------|
  |   cfms     cfms                  0         0          2 |
  |     sd                           0         0          0 |
  |    dep      dep      dep         0         3          0 |
  |    des                           1         0          0 |
  |     sd                           0         0          0 |
  |    dep      dep                  0         2          0 |
  |   cfms                           0         0          1 |
  |    des                           1         0          0 |
  |    des      dep                  1         1          0 |
  |    des                           1         0          0 |
  |     sd                           0         0          0 |
  +---------------------------------------------------------+

A couple remarks:

you seem to have edited the dataex extract by hand, which produced at least one oddity: the value of diplo2 in the first observation was being read as " cfm" since you had declared this to be str4. I fixed the variable declaration line of the dataex so that everything is read properly:

Code:

input str4 diplo1 str6(diplo2 diplo3)

The step where I create variables that have the trimmed versions of diplo? is necessitated by just one instance, again the same value of diplo2 in the first row, " cfms " which contains leading and trailing spaces. If this was entirely due to your manual editing, and your data does not actually have such spaces, you can modify the code accordingly and work directly with the diplo? variables instead of _diplo? . On the other hand, if this is a feature of your data and you decide to use the solution in #2 instead of mine, do trim the variable diplo there, otherwise the code will not produce the correct result:

Code:

by student_id (_j), sort: egen `x'_count = total(trim(diplo) == "`x'")

Last edited by Hemanshu Kumar; 15 Jun 2023, 21:26.

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35780

15 Jun 2023, 23:28

Here is yet another way to do it. This time I fixed the inconsistencies in the data example. thanks to Hemanshu Kumar

Code:

clear
input str34 diplo1 str4(diplo2 diplo3)
"cfms" "cfms"     ""  
"sd"   ""     ""  
"dep"  "dep" "dep"
"des"  ""     ""  
"sd"   ""     ""  
"dep"  "dep"  ""  
"cfms" ""     ""  
"des"  ""     ""  
"des"  "dep"  ""  
"des"  ""     ""  
"sd"   ""     ""  
end

gen count = 0 

foreach d in cfms des dep {
    replace count = count + strpos(diplo1 + diplo2 + diplo3, "`d'") > 0 
}

list 

    +----------------------------------+
     | diplo1   diplo2   diplo3   count |
     |----------------------------------|
  1. |   cfms     cfms                1 |
  2. |     sd                         0 |
  3. |    dep      dep      dep       1 |
  4. |    des                         1 |
  5. |     sd                         0 |
     |----------------------------------|
  6. |    dep      dep                1 |
  7. |   cfms                         1 |
  8. |    des                         1 |
  9. |    des      dep                1 |
 10. |    des                         1 |
     |----------------------------------|
 11. |     sd                         0 |
     +----------------------------------+

This in turn could be written as just one (awkward) line.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35780
#6

15 Jun 2023, 23:41

See also Section 7 in https://journals.sagepub.com/doi/pdf...867X0900900107 which is

SJ-9-1 pr0046 . . . . . . . . . . . . . . . . . . . Speaking Stata: Rowwise
(help rowsort, rowranks if installed) . . . . . . . . . . . N. J. Cox
Q1/09 SJ 9(1):137--157
shows how to exploit functions, egen functions, and Mata
for working rowwise; rowsort and rowranks are introduced

However, the egen function rowsvals() introduced there, and now in egenmore from SSC, does not have a handle for ignoring sd.
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35780
#7

16 Jun 2023, 03:19

Fix to #5

Code should be

Code:

replace count = count + (strpos(diplo1 + diplo2 + diplo3, "`d'") > 0)

The tell-tale here is that in observation 9 the occurrence of both des and dep should imply a count of 2.

Last edited by Nick Cox; 16 Jun 2023, 03:34.
1 like
Comment
Nicolas Charette

Join Date: Aug 2022

Posts: 23
#8

16 Jun 2023, 08:53

Thanks a lot Clyde Schechter, Hemanshu Kumar and Nick Cox for this generous input. I will study all your solutions carefully.

Clyde Schechter the purpose was to have descriptive statistics of all the diplomas awarded, by sex (not included in the extract data). And yes, you're right to suspect I have the student id variable.
Hemanshu Kumar I have indeed edited the data extract by hand... I always have a hard time chosing the best sample data to simplify my problem and often end up editing.
Comment
Nicolas Charette

Join Date: Aug 2022

Posts: 23
#9

04 Jul 2023, 08:34

Thank you all! Hemanshu Kumar your code worked perfectly for my needs.
Comment

Announcement