Generate new variable of the attached values of existing variable

Asaf Yancu

Join Date: May 2021
Posts: 38

Generate new variable of the attached values of existing variable

12 Jun 2022, 23:45

Hello, all Statalists!
My dataset consists of three variables:
1. id - student ID.
2. test_date.
3. freq - number of tests each student has taken on a certain day.

Here is a sample of this dataset:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long id int test_date float freq
    1 16560 1
    1 16579 1
    1 16589 2
    1 16589 2
    1 16602 1
    1 16615 1
    1 16618 1
  169 15833 1
  169 15840 1
  169 15847 1
  169 15858 2
  169 15858 2
  169 15864 1
  169 15866 1
  169 15871 1
  169 15873 3
  169 15873 3
  169 15873 3
  169 15875 1
  169 15882 1
  169 15885 1
  172 15833 1
  172 15858 1
  172 15864 1
  172 15878 1
  172 15882 1
  172 15885 1
44747 16566 1
44747 16575 1
44747 16582 1
44747 16587 4
44747 16587 4
44747 16587 4
44747 16587 4
44747 16589 1
44747 16594 1
44747 16596 1
44747 16602 1
end
format %td test_date

I would like to generate new variable "subgroup" that assigns each student his\her combination of values from the freq variable.
This should look like this:

Code:

list, sepby(id) noobs

  +-------------------------------------+
  |    id   test_date   freq   subgroup |
  |-------------------------------------|
  |     1   04may2005      1         12 |
  |     1   23may2005      1         12 |
  |     1   02jun2005      2         12 |
  |     1   02jun2005      2         12 |
  |     1   15jun2005      1         12 |
  |     1   28jun2005      1         12 |
  |     1   01jul2005      1         12 |
  |-------------------------------------|
  |   169   08may2003      1        123 |
  |   169   15may2003      1        123 |
  |   169   22may2003      1        123 |
  |   169   02jun2003      2        123 |
  |   169   02jun2003      2        123 |
  |   169   08jun2003      1        123 |
  |   169   10jun2003      1        123 |
  |   169   15jun2003      1        123 |
  |   169   17jun2003      3        123 |
  |   169   17jun2003      3        123 |
  |   169   17jun2003      3        123 |
  |   169   19jun2003      1        123 |
  |   169   26jun2003      1        123 |
  |   169   29jun2003      1        123 |
  |-------------------------------------|
  |   172   08may2003      1          1 |
  |   172   02jun2003      1          1 |
  |   172   08jun2003      1          1 |
  |   172   22jun2003      1          1 |
  |   172   26jun2003      1          1 |
  |   172   29jun2003      1          1 |
  |-------------------------------------|
  | 44747   10may2005      1         14 |
  | 44747   19may2005      1         14 |
  | 44747   26may2005      1         14 |
  | 44747   31may2005      4         14 |
  | 44747   31may2005      4         14 |
  | 44747   31may2005      4         14 |
  | 44747   31may2005      4         14 |
  | 44747   02jun2005      1         14 |
  | 44747   07jun2005      1         14 |
  | 44747   09jun2005      1         14 |
  | 44747   15jun2005      1         14 |
  +-------------------------------------+

Please notice that the dataset consists of more than 200,000 ID's. Thus, macro substitution results in line that is too long.

Appreciate your help!

Last edited by Asaf Yancu; 12 Jun 2022, 23:48.

Tags: data

Øyvind Snilsberg

Join Date: Oct 2021
Posts: 591

13 Jun 2022, 00:19

Code:

tab freq, gen(D)
gen subgroup = ""
foreach v of var D* {
bysort id (`v'): replace `v' = `v'[_N]
replace subgroup = subgroup + substr("`v'",2,.) if `v'
}

Comment

Nick Cox

Join Date: Mar 2014
Posts: 36057

13 Jun 2022, 00:44

Thanks for the clear question and data example.

Here's another way to do it, or strictly two versions of another way to do it. I am assuming no more then 9 tests in one day.

See also https://journals.sagepub.com/doi/pdf...36867X20909698

Code:

bysort id (freq) : gen wanted = strofreal(freq[1])  
by id : replace wanted = cond(freq != freq[_n-1], wanted[_n-1] + strofreal(freq), wanted[_n-1]) if _n > 1 
by id : replace wanted = wanted[_N]

bysort id (freq) : gen WANTED = freq[1] 
by id : replace WANTED = cond(freq != freq[_n-1], 10 * WANTED[_n-1] + freq, WANTED[_n-1]) if _n > 1 
by id : replace WANTED = WANTED[_N]

tabdisp id, c(wanted WANTED)

----------------------------------
       id |     wanted      WANTED
----------+-----------------------
        1 |         12          12
      169 |        123         123
      172 |          1           1
    44747 |         14          14
----------------------------------

Announcement

Generate new variable of the attached values of existing variable

Comment

Comment