Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generate new variable of the attached values of existing variable

    Hello, all Statalists!
    My dataset consists of three variables:
    1. id - student ID.
    2. test_date.
    3. freq - number of tests each student has taken on a certain day.

    Here is a sample of this dataset:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long id int test_date float freq
        1 16560 1
        1 16579 1
        1 16589 2
        1 16589 2
        1 16602 1
        1 16615 1
        1 16618 1
      169 15833 1
      169 15840 1
      169 15847 1
      169 15858 2
      169 15858 2
      169 15864 1
      169 15866 1
      169 15871 1
      169 15873 3
      169 15873 3
      169 15873 3
      169 15875 1
      169 15882 1
      169 15885 1
      172 15833 1
      172 15858 1
      172 15864 1
      172 15878 1
      172 15882 1
      172 15885 1
    44747 16566 1
    44747 16575 1
    44747 16582 1
    44747 16587 4
    44747 16587 4
    44747 16587 4
    44747 16587 4
    44747 16589 1
    44747 16594 1
    44747 16596 1
    44747 16602 1
    end
    format %td test_date


    I would like to generate new variable "subgroup" that assigns each student his\her combination of values from the freq variable.
    This should look like this:

    Code:
    list, sepby(id) noobs
    
      +-------------------------------------+
      |    id   test_date   freq   subgroup |
      |-------------------------------------|
      |     1   04may2005      1         12 |
      |     1   23may2005      1         12 |
      |     1   02jun2005      2         12 |
      |     1   02jun2005      2         12 |
      |     1   15jun2005      1         12 |
      |     1   28jun2005      1         12 |
      |     1   01jul2005      1         12 |
      |-------------------------------------|
      |   169   08may2003      1        123 |
      |   169   15may2003      1        123 |
      |   169   22may2003      1        123 |
      |   169   02jun2003      2        123 |
      |   169   02jun2003      2        123 |
      |   169   08jun2003      1        123 |
      |   169   10jun2003      1        123 |
      |   169   15jun2003      1        123 |
      |   169   17jun2003      3        123 |
      |   169   17jun2003      3        123 |
      |   169   17jun2003      3        123 |
      |   169   19jun2003      1        123 |
      |   169   26jun2003      1        123 |
      |   169   29jun2003      1        123 |
      |-------------------------------------|
      |   172   08may2003      1          1 |
      |   172   02jun2003      1          1 |
      |   172   08jun2003      1          1 |
      |   172   22jun2003      1          1 |
      |   172   26jun2003      1          1 |
      |   172   29jun2003      1          1 |
      |-------------------------------------|
      | 44747   10may2005      1         14 |
      | 44747   19may2005      1         14 |
      | 44747   26may2005      1         14 |
      | 44747   31may2005      4         14 |
      | 44747   31may2005      4         14 |
      | 44747   31may2005      4         14 |
      | 44747   31may2005      4         14 |
      | 44747   02jun2005      1         14 |
      | 44747   07jun2005      1         14 |
      | 44747   09jun2005      1         14 |
      | 44747   15jun2005      1         14 |
      +-------------------------------------+
    Please notice that the dataset consists of more than 200,000 ID's. Thus, macro substitution results in line that is too long.

    Appreciate your help!
    Last edited by Asaf Yancu; 12 Jun 2022, 23:48.

  • #2
    Code:
    tab freq, gen(D)
    gen subgroup = ""
    foreach v of var D* {
    bysort id (`v'): replace `v' = `v'[_N]
    replace subgroup = subgroup + substr("`v'",2,.) if `v'
    }

    Comment


    • #3
      Thanks for the clear question and data example.

      Here's another way to do it, or strictly two versions of another way to do it. I am assuming no more then 9 tests in one day.

      See also https://journals.sagepub.com/doi/pdf...36867X20909698

      Code:
      bysort id (freq) : gen wanted = strofreal(freq[1])  
      by id : replace wanted = cond(freq != freq[_n-1], wanted[_n-1] + strofreal(freq), wanted[_n-1]) if _n > 1 
      by id : replace wanted = wanted[_N]
      
      bysort id (freq) : gen WANTED = freq[1] 
      by id : replace WANTED = cond(freq != freq[_n-1], 10 * WANTED[_n-1] + freq, WANTED[_n-1]) if _n > 1 
      by id : replace WANTED = WANTED[_N]
      
      tabdisp id, c(wanted WANTED)
      
      ----------------------------------
             id |     wanted      WANTED
      ----------+-----------------------
              1 |         12          12
            169 |        123         123
            172 |          1           1
          44747 |         14          14
      ----------------------------------

      Comment

      Working...
      X