Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge from a long format to a wide format

    I have two datasets. One dataset has an individual identifier , and some of the variables in that dataset points to an identifier in another dataset. I want to get the entries from the second dataset as variables in the first dataset. These are the data examples

    DATASET 1:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str15(prim_key hhid) byte stateid int ssuid byte(fs203_coreside_child1 fs203_coreside_child2 fs203_coreside_child3 fs203_coreside_child4 fs203_coreside_child5 fs203_coreside_child6)
    "101000100040101" "101000100040100" 1 1 .e .e .e  .  .  .
    "101000100040102" "101000100040100" 1 1 .e .e .e  .  .  .
    "101000100130101" "101000100130100" 1 1  3 .e .e .e  6  7
    "101000100130102" "101000100130100" 1 1  3 .e .e .e  6  7
    "101000100130109" "101000100130100" 1 1 .e  .  .  .  .  .
    "101000100250106" "101000100250100" 1 1 .e .e .e .e  1  .
    "101000100320101" "101000100320100" 1 1 .e  3  6  7  8  .
    "101000100320102" "101000100320100" 1 1 .e  3  6  7  8  .
    "101000100320109" "101000100320100" 1 1  1 .e .e .e  .  .
    "101000100370102" "101000100370100" 1 1  3  4  5  .  .  .
    "101000100370111" "101000100370100" 1 1 .e  1 .e  6  .  .
    "101000100590101" "101000100590100" 1 1 .e .e  3 .e  4  8
    "101000100590102" "101000100590100" 1 1 .e .e  3 .e  4  8
    "101000100760101" "101000100760100" 1 1  4  5  6  .  .  .
    "101000100760102" "101000100760100" 1 1 .e  3 .e  4  5  6
    "101000101040101" "101000101040100" 1 1 .e .e .e .e  3 .e
    "101000101040102" "101000101040100" 1 1 .e .e .e .e  3 .e
    "101000101330101" "101000101330100" 1 1  .  .  .  .  .  .
    "101000101720101" "101000101720100" 1 1 .e  2 .e  5  6  7
    "101000101890107" "101000101890100" 1 1 .e .e  1  6 .e  .
    "101000102490101" "101000102490100" 1 1 .e .e  3 .e  .  .
    "101000102490102" "101000102490100" 1 1 .e .e  3 .e  .  .
    "101000103350101" "101000103350100" 1 1 .e  3  4  5  6  7
    "101000103350102" "101000103350100" 1 1 .e  3  4  5  6  7
    "101000103550101" "101000103550100" 1 1  2 .e  .  .  .  .
    "101000103550102" "101000103550100" 1 1  3  5  6  7  8  .
    "101000103690101" "101000103690100" 1 1 .e  3 .e .e  .  .
    "101000103690102" "101000103690100" 1 1 .e  3 .e .e  .  .
    "101000104030101" "101000104030100" 1 1  .  .  .  .  .  .
    "101000104030102" "101000104030100" 1 1  3 .e  4  5  6  .
    "101000104240101" "101000104240100" 1 1  3 .e .e  8  .  .
    "101000104240102" "101000104240100" 1 1  3 .e .e  8  .  .
    "101000200050101" "101000200050100" 1 2 .e .e .e  3 .e .e
    "101000200050102" "101000200050100" 1 2 .e .e .e  3 .e .e
    "101000200160109" "101000200160100" 1 2  2  .  .  .  .  .
    "101000200210101" "101000200210100" 1 2 .e .e .e  3 .e  4
    "101000200210102" "101000200210100" 1 2 .e .e .e  3 .e  4
    "101000200210108" "101000200210100" 1 2 .e .e  1 .e .e .e
    "101000200270101" "101000200270100" 1 2  2  3  4  5  6  .
    "101000200290108" "101000200290100" 1 2 .e .e .e  1 .e .e
    "101000200290109" "101000200290100" 1 2  .  .  .  .  .  .
    "101000200340101" "101000200340100" 1 2  3  4  .  .  .  .
    "101000200340102" "101000200340100" 1 2  3  4  .  .  .  .
    "101000200470101" "101000200470100" 1 2 .e .e .e .e  3 .e
    "101000200470102" "101000200470100" 1 2 .e .e .e .e .e  3
    "101000200740101" "101000200740100" 1 2  2  .  .  .  .  .
    "101000200860102" "101000200860100" 1 2  3  4  5  6  7  8
    "101000201190106" "101000201190100" 1 2 .e .e .e .e .e .e
    "101000201360101" "101000201360100" 1 2 .e .e .e .e .e  3
    "101000201360102" "101000201360100" 1 2 .e .e .e .e .e  3
    "101000201360103" "101000201360100" 1 2  .  .  .  .  .  .
    "101000201560108" "101000201560100" 1 2 .e .e .e .e .e  1
    "101000201670110" "101000201670100" 1 2  .  .  .  .  .  .
    "101000202130102" "101000202130100" 1 2 .e .e  3  4  5  6
    "101000202440101" "101000202440100" 1 2 .e .e .e  3  .  .
    "101000202440102" "101000202440100" 1 2 .e .e .e  3  .  .
    "101000203010101" "101000203010100" 1 2 .e .e  3 .e  5 .e
    "101000203010102" "101000203010100" 1 2 .e .e  3 .e  5 .e
    "101000203460101" "101000203460100" 1 2 .e .e  3  4  5  .
    "101000203460102" "101000203460100" 1 2  .  .  .  .  .  .
    "101000300030201" "101000300030200" 1 3 .e .e .e  4  5  .
    "101000300030202" "101000300030200" 1 3 .e  3 .e  4  5  .
    "101000300070105" "101000300070100" 1 3 .e .e  2 .e  .  .
    "101000300140106" "101000300140100" 1 3 .e .e  1 .e .e .e
    "101000300220101" "101000300220100" 1 3  6  2 .e  3  4  5
    "101000300260101" "101000300260100" 1 3  3  4  5  6  7  8
    "101000300260102" "101000300260100" 1 3  3  4  5  6  7  8
    "101000300380205" "101000300380200" 1 3 .e .e .e .e .e  1
    "101000300550105" "101000300550100" 1 3 .e .e .e  .  .  .
    "101000300920101" "101000300920100" 1 3 .e  3  4  5  .  .
    "101000300920102" "101000300920100" 1 3 .e  3  4  5  .  .
    "101000301080101" "101000301080100" 1 3 .e  3  .  .  .  .
    "101000301080102" "101000301080100" 1 3 .e  3  .  .  .  .
    "101000301450201" "101000301450200" 1 3 .e .e .e .e  3  .
    "101000301450202" "101000301450200" 1 3 .e .e .e .e  3  .
    "101000301720106" "101000301720100" 1 3 .e .e .e  1  .  .
    "101000301840101" "101000301840100" 1 3 .e .e  3  4 .e  .
    "101000301840102" "101000301840100" 1 3 .e .e .e .e .e  .
    "101000302140101" "101000302140100" 1 3 .e  3  4  .  .  .
    "101000302140102" "101000302140100" 1 3 .e  3  4  .  .  .
    "101000302250106" "101000302250100" 1 3  .  .  .  .  .  .
    "101000302720104" "101000302720100" 1 3 .e  1 .e .e  .  .
    "101000303240201" "101000303240200" 1 3 .e .e  2  3  .  .
    "101000400040101" "101000400040100" 1 4 .e .e  3 .e  .  .
    "101000400040102" "101000400040100" 1 4 .e .e  3 .e  .  .
    "101000400070107" "101000400070100" 1 4 .e .e .e  1 .e .e
    "101000400070108" "101000400070100" 1 4 .e .e .e .e .e .e
    "101000400120101" "101000400120100" 1 4 .e  3  6  7  8  .
    "101000400120102" "101000400120100" 1 4 .e  3  6  7  8  .
    "101000400120109" "101000400120100" 1 4  1 .e  .  .  .  .
    "101000400140101" "101000400140100" 1 4  3 .e .e  5  6 .e
    "101000400140102" "101000400140100" 1 4  3 .e .e  5  6 .e
    "101000400230102" "101000400230100" 1 4  3 .e  8  9 10  .
    "101000400230111" "101000400230100" 1 4  1 .e .e .e .e .e
    "101000400430201" "101000400430200" 1 4 .e .e .e .e  3  5
    "101000400430202" "101000400430200" 1 4 .e .e .e .e  3  5
    "101000400610101" "101000400610100" 1 4  3  9 .e 11  .  .
    "101000400610102" "101000400610100" 1 4  3 .e  9 11  .  .
    "101000400930101" "101000400930100" 1 4 .e  3  4  5  6  7
    "101000400930102" "101000400930100" 1 4 .e  3  4  5  6  7
    end
    label values stateid stateid_cv
    label def stateid_cv 1 "1 Jammu and Kashmir", modify
    label values fs203_coreside_child1 _vl8884_ind
    label values fs203_coreside_child2 _vl8885_ind
    label values fs203_coreside_child3 _vl8886_ind
    label values fs203_coreside_child4 _vl8887_ind
    label values fs203_coreside_child5 _vl8888_ind
    label values fs203_coreside_child6 _vl8889_ind

    DATASET 2:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str15(prim_key hhid) byte(cv007 cv008 cv009 cv010)
    "101000100040201" "101000100040200" 3 1  8 3
    "101000100040202" "101000100040200" 3 1  9 3
    "101000100040203" "101000100040200" 4 1  0 .
    "101000100040204" "101000100040200" . .  . .
    "101000100040205" "101000100040200" . .  . .
    "101000100130103" "101000100130100" 3 1 15 7
    "101000100130104" "101000100130100" 3 1 12 5
    "101000100130105" "101000100130100" . .  . .
    "101000100130106" "101000100130100" 3 1 12 5 
    "101000100130107" "101000100130100" 3 1 12 5
    "101000100130108" "101000100130100" 3 1  9 3
    "101000100250101" "101000100250100" 3 1 10 4
    "101000100250102" "101000100250100" 3 1 15 7
    "101000100250103" "101000100250100" 4 1  0 .
    "101000100250104" "101000100250100" . .  . .
    "101000100250105" "101000100250100" . .  . .
    "101000100320103" "101000100320100" 3 1 12 5
    "101000100320104" "101000100320100" 3 1  9 3
    "101000100320105" "101000100320100" . .  . .
    "101000100320106" "101000100320100" 3 1 11 4
    "101000100320107" "101000100320100" 3 1 10 4
    "101000100320108" "101000100320100" 3 1 12 4
    "101000100370101" "101000100370100" 3 1  1 5
    "101000100370103" "101000100370100" 3 1 17 8
    "101000100370104" "101000100370100" 3 1 16 5
    "101000100370105" "101000100370100" 3 1 11 4
    "101000100370106" "101000100370100" 3 1 15 7
    "101000100370107" "101000100370100" 3 1 15 5
    "101000100370108" "101000100370100" 4 1  1 1
    "101000100370109" "101000100370100" . .  . .
    "101000100370110" "101000100370100" . .  . .
    "101000100590103" "101000100590100" 4 2  . .
    "101000100590104" "101000100590100" 3 1  8 3
    "101000100590105" "101000100590100" 3 1  6 2
    "101000100590106" "101000100590100" . .  . .
    "101000100590107" "101000100590100" . .  . .
    "101000100590108" "101000100590100" 4 2  . .
    "101000100760103" "101000100760100" 3 1  9 3
    "101000100760104" "101000100760100" 3 1  9 3
    "101000100760105" "101000100760100" 3 1  4 1
    "101000100760106" "101000100760100" 3 1  6 2
    "101000100760107" "101000100760100" 4 2  . .
    "101000101040103" "101000101040100" 3 1 10 4
    "101000101040104" "101000101040100" 3 1 10 3
    "101000101040105" "101000101040100" 3 1  8 3
    "101000101040106" "101000101040100" 3 1  7 2
    "101000101040107" "101000101040100" 3 1  4 2
    "101000101040108" "101000101040100" 3 1  2 1
    "101000101040109" "101000101040100" . .  . .
    "101000101040110" "101000101040100" . .  . .
    "101000101190201" "101000101190200" 3 1  9 3
    "101000101190202" "101000101190200" 3 1 12 5
    "101000101190203" "101000101190200" . .  . .
    "101000101190204" "101000101190200" . .  . .
    "101000101330102" "101000101330100" 4 2  . .
    "101000101330103" "101000101330100" 3 1  9 3
    "101000101330104" "101000101330100" 3 1  4 1
    "101000101330105" "101000101330100" 1 1  1 1
    "101000101330106" "101000101330100" . .  . .
    "101000101490101" "101000101490100" 3 1 10 4
    "101000101490102" "101000101490100" 3 1  8 3
    "101000101490103" "101000101490100" 3 1  6 2
    "101000101490104" "101000101490100" 3 1  4 1
    "101000101490105" "101000101490100" 3 1  4 1
    "101000101720102" "101000101720100" 3 1 10 4
    "101000101720103" "101000101720100" 3 1 10 4
    "101000101720104" "101000101720100" . .  . .
    "101000101720105" "101000101720100" 3 1  8 3
    "101000101720106" "101000101720100" 3 1  8 3
    "101000101720107" "101000101720100" 3 1  9 3
    "101000101720108" "101000101720100" 3 1  8 3
    "101000101890101" "101000101890100" 3 1 17 9
    "101000101890102" "101000101890100" 3 1 13 5
    "101000101890103" "101000101890100" 3 1  2 1
    "101000101890104" "101000101890100" 4 1  0 .
    "101000101890105" "101000101890100" . .  . .
    "101000101890106" "101000101890100" 4 2  . .
    "101000102090101" "101000102090100" 3 1 10 4
    "101000102090102" "101000102090100" 3 1 10 4
    "101000102090103" "101000102090100" 4 1  2 1
    "101000102090104" "101000102090100" 4 1  1 1
    "101000102090105" "101000102090100" . .  . .
    "101000102210101" "101000102210100" 3 1  9 3
    "101000102210102" "101000102210100" 3 1  9 3
    "101000102210103" "101000102210100" . .  . .
    "101000102210104" "101000102210100" . .  . .
    "101000102210105" "101000102210100" . .  . .
    "101000102490103" "101000102490100" 3 1 11 4
    "101000102660101" "101000102660100" 3 1 10 3
    "101000102660102" "101000102660100" 3 1 10 3
    "101000102660103" "101000102660100" 4 1  1 1
    "101000102660104" "101000102660100" . .  . .
    "101000102660105" "101000102660100" . .  . .
    "101000102870201" "101000102870200" 3 1 15 8
    "101000102870202" "101000102870200" 3 1 12 5
    "101000102870203" "101000102870200" 4 1  1 1
    "101000102870204" "101000102870200" . .  . .
    "101000102870205" "101000102870200" . .  . .
    "101000102970201" "101000102970200" 3 1  9 3
    "101000102970202" "101000102970200" 3 1 10 4
    end
    label values cv007 lasi_vl216_cv
    label def lasi_vl216_cv 1 "1 Can read only", modify
    label def lasi_vl216_cv 3 "3 Can both read and write", modify
    label def lasi_vl216_cv 4 "4 Cannot read or write", modify
    label values cv008 lasi_vl251_cv
    label def lasi_vl251_cv 1 "1 Yes", modify
    label def lasi_vl251_cv 2 "2 No", modify
    label values cv010 lasi_vl321_cv
    label def lasi_vl321_cv 1 "1 Less than Primary school (Standard 1-4)", modify
    label def lasi_vl321_cv 2 "2 Primary school completed (Standard 5-7)", modify
    label def lasi_vl321_cv 3 "3 Middle school completed (Standard 8- 9)", modify
    label def lasi_vl321_cv 4 "4 Secondary school/Matriculation completed", modify
    label def lasi_vl321_cv 5 "5 Higher secondary/intermediate/senior secondary completed", modify
    label def lasi_vl321_cv 7 "7 Graduate degree (B.A., B.Sc., B. Com.) completed", modify
    label def lasi_vl321_cv 8 "8 Post-graduate degree or (M.A., M.Sc., M. Com.) above (M.Phil, Ph.D., Post-Doc) completed", modify
    label def lasi_vl321_cv 9 "9 Professional course/degree (B.Ed, BE, B.Tech, MBBS, BHMS, BAMS, B.Pharm, BCS, BCA, BBA, LLB) (BVSc., B. Arch, M.Ed, ME, M.Tech, MD, M.Pharm, MCS, MCA, MBA,LLM, MVSc., M. Arch, MS, CA, CS, CWA) completed", modify


    so the fs203_coreside_child* variables in the first dataset gives me the last digits of the prim_key of the second dataset. I want the cv007 variable values as separate variables for the different prim_keys that the coreside variables give me. Eg 101000100130101 has a fs203_coreside_child5 value of 6 in first dataset, means that in the second dataset, 101000100130106 is my target observation, and I want the cv007 value of that observation as a variable (say cv007_child5) which takes that value . Similarly, the same 101000100130101 prim_key has fs203_coreside_child6 value of 7 in first dataset, so 101000100130107 in the second dataset is the target observation, and the cv007 value from there goes to another variable (saycv007_child6). This needs to be done for all prim_keys for the first dataset.

    I guess there is a reshape problem here, but I am clueless as to how to approach this problem. Any help would be appreciated.

  • #2
    Code:
    use dataset1, clear
    reshape long fs203_coreside_child, i(prim_key)
    by prim_key (fs203_coreside_child), sort: ///
        drop if missing(fs203_coreside_child) & _n > 1
    drop _j
    assert strlen(prim_key) == 15
    gen link = substr(prim_key, 1, 13) + string(fs203_coreside_child, "%02.0f") ///
        if !missing(fs203_coreside_child)
    
    capture frame drop second
    frame create second
    frame second: use dataset2
    frlink m:1 link, frame(second prim_key)
    frget cv007, from(second)
    Note: In the example data, many of the observations with a non-missing fs203_coreside_child nevertheless find no match in the second data set. Perhaps the full data sets will not have this problem.

    Comment


    • #3
      Thanks Clyde.
      There are two things I wanted to clarify. One is the does the reshape not require a j variable? Secondly once I run the code, it gives me the following error.

      Code:
      no; data are mi set
          Use mi reshape to perform reshape on these data.  mi reshape has the same syntax as reshape.
      
          Perhaps you did not type reshape.  In that case, the command you typed calls reshape and it is not appropriate for use with mi data.  Use mi extract or mi xeq to
          select the data on which you want to run the command, which is probably m=0.
      If I try to use -mi extract 0, clear-, then it gives me another error

      Code:
      variable _mi_miss not found
      I have no idea why this could happen, as I did not mi set the data

      Comment


      • #4
        Second question first. No, the -j()- option of the -reshape- command does not have to be specified. If you omit it, as the code I showed does, it is, by default, created as _j. You will notice that I have a -drop _j- command that gets rid of it. I did this because, as far as I can see, there is no need for the -j()- variable in your situation. All it does is tell you which of the fs203_coreside_child* variables the post-reshape observation comes from. But that is, I assume, just arbitrary and of no further importance. It is, of course, possible that I am wrong about that. Perhaps the fs203_coreside_child1 variable lists the firstborn child, and fs203_coreside_child2 the second born, etc. If there is some meaning like that, and if that meaning (e.g. birth order) is information you need for further analysis, then by all means specify a -j()- variable, choosing a name, such as birth_order, that describes what it means. And then just get rid of that -drop _j- command.

        As to the first question, it seems what you have is a corrupted data set. There are a few possibilities.
        1. The data were never -mi set- by anybody, and somebody when creating or modifying the data set before it got to you mistakenly created some variables whose names begin with _mi, which Stata is then interpreting as the result of an attempt to -mi set- the data. If you are sure that the data was never -mi set- by anybody, you can just -drop _mi*- to get rid of those variables (or rename them to something else if they contain useful data that you need) and that should enable you to proceed. But make sure that the data really were never -mi set- by anybody before you do this or you will only get yourself into deeper trouble that you will be unable to undo later.
        2. Somebody else did -mi set- the data, but somehow the variable -mi miss- got lost subsequently. Are -mi_m- and -mi_id- there? If so, what you have is a corrupted -mi set- data set. Ideally, I would reverse that with -mi extract 0-, but you already tried that and Stata won't let you. So I would distinguish two sub-cases here:
          1. There is an -_mi_m- variable in the data set, and there is a subset of the data for which _mi_m == 0, and that subset is the actual original data. In that case I would run -keep if _mi_m == 0- followed by -mi unset, asis-, and then either -drop- or -rename- the _mi* variables, depending on whether they contain useful information or not.
          2. There is no _mi_m variable, or the subset for which _mi_m == 0 is not the actual original data you need to use. In this case there is no way to distinguish the original from the imputed observations, and this data set is, I think, hopeless. You need to go back to the original source of the data and get a fresh data set.

        Comment


        • #5
          Hi Clyde, thanks for the clarification.
          I could not get around the mi set problem. There are no variables that start with _mi, and there is no _mi_m variable. So the only thing that I could do was to export it out as a csv file and import it back in, using descsave to get the variable and value labels in order.
          Now that i have done it, i am faced with another problem. The reshape command now gives me the error
          Code:
          I/O error writing .dta file
              Usually such I/O errors are caused by the disk or file system being full.
          I have 2622 variables and 73,396 observations.

          Comment


          • #6
            I think the size of the dta file was posing issues. The only fix that I can think about is running it for sections of the file and appending all the files together. In that regard, can your code be modified so that
            Code:
            frlink m:1 link, frame(second prim_key)
            keeps only the merged observations?

            Comment


            • #7
              I think the size of the dta file was posing issues.
              I'm skeptical. Even with the long string id variables, in all probability the size of this data set is only going to be order of magnitude 1 GB: that's not very large for contemporary hardware. If your hard drive (or whatever mass storage device you are using) is so full that it cannot save a file of that size, you are going to run into this same problem just a little further on in your workflow. You need to either get another mass storage device to work with, or clear out a lot of inactive files from the one you are working on. But before you do that, just check that your drive really is very close to full. The "I/O error writing .dta file" message is really non-specific, and despite Stata's attempt to explain it to you, it really can pop up whenever anything goes wrong in the I/O process.

              Next, I don't think you should do this anyway. If there are no _mi* variables, but Stata thinks the data was -mi set-, there has to be some reason Stata thinks that. Perhaps running -char dir- will show you some things like _dta[_mi_various]. The fact that at some point in its history somebody -mi set- this data means that your data may well be some mixture of real data and extra copies of observations with imputed values for missing variables. If you try to analyze this data as if it were a simple data set, you will get garbage results. If there were an _mi_m variable, that would have enabled you to identify the real observations from the imputed one. But without that, you cannot distinguish the real from the fake data. This data set is simply not usable. Full stop. Laundering it through a .csv file will cover up the symptoms of the problem and enable you to apply code to it. But the data are wrong, so the results necessarily will be as well. You have to go back to whoever supplied this file to you, explain the problem, and obtain a valid replacement file.

              I'm sorry you're having all these problems. I'd like to be more helpful, but I don't want to guide you down a path that can only lead to trouble.

              Comment


              • #8
                Hi Clyde, thanks for the response. It seems you were right. At some point during the compilation, the data handlers of this dataset used mi to generate some estimates, and then they dropped the -mi- variables without unregistering it first. I was able to get the raw version and your code works. Thanks.
                I would bother you again with something else. After applying your code, I have a dataset which essentially matches `link` to `prim_key` in the first dataset. Now this `prim_key` is a person identifier, while the `link` which we have is a child identifier. Since we could have both spouses being interviewed, we can have duplicates by `link`, which is not an issue. But what I now want is for each unique link, I want to figure out who is the father and mother, and have separate variables for the father's education and mother's education. Here's a data example.

                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input str15(prim_key hhid childid) int cv006 byte(cv007 cv008 cv009 cv010 dm003 dm006 dm007 dm008) int dm005
                "101000100130101" "101000100130100" "101000100130107" 21 3 1 12 5 1 1 12 5 54
                "101000100130101" "101000100130100" "101000100130106" 23 3 1 12 5 1 1 12 5 54
                "101000100130101" "101000100130100" "101000100130103" 29 3 1 15 7 1 1 12 5 54
                "101000100130101" "101000100130100" "101000100130108" 19 3 1  9 3 1 1 12 5 54
                "101000100130102" "101000100130100" "101000100130107" 21 3 1 12 5 2 2  . . 51
                "101000100130102" "101000100130100" "101000100130108" 19 3 1  9 3 2 2  . . 51
                "101000100130102" "101000100130100" "101000100130103" 29 3 1 15 7 2 2  . . 51
                "101000100130102" "101000100130100" "101000100130106" 23 3 1 12 5 2 2  . . 51
                "101000100250106" "101000100250100" "101000100250101" 38 3 1 10 4 2 2  . . 71
                "101000100320101" "101000100320100" "101000100320106" 24 3 1 11 4 1 1 10 4 50
                "101000100320101" "101000100320100" "101000100320108" 19 3 1 12 4 1 1 10 4 50
                "101000100320101" "101000100320100" "101000100320103" 30 3 1 12 5 1 1 10 4 50
                "101000100320101" "101000100320100" "101000100320107" 22 3 1 10 4 1 1 10 4 50
                "101000100320102" "101000100320100" "101000100320108" 19 3 1 12 4 2 1  6 2 48
                "101000100320102" "101000100320100" "101000100320103" 30 3 1 12 5 2 1  6 2 48
                "101000100320102" "101000100320100" "101000100320106" 24 3 1 11 4 2 1  6 2 48
                "101000100320102" "101000100320100" "101000100320107" 22 3 1 10 4 2 1  6 2 48
                "101000100320109" "101000100320100" "101000100320101" 50 3 1 10 4 2 2  . . 69
                "101000100370102" "101000100370100" "101000100370105" 19 3 1 11 4 2 1  8 3 45
                "101000100370102" "101000100370100" "101000100370103" 26 3 1 17 8 2 1  8 3 45
                "101000100370102" "101000100370100" "101000100370104" 24 3 1 16 5 2 1  8 3 45
                "101000100370111" "101000100370100" "101000100370106" 40 3 1 15 7 2 2  . . 69
                "101000100370111" "101000100370100" "101000100370101" 47 3 1  1 5 2 2  . . 69
                "101000100590101" "101000100590100" "101000100590108" 28 4 2  . . 1 2  . . 72
                "101000100590101" "101000100590100" "101000100590104" 31 3 1  8 3 1 2  . . 72
                "101000100590101" "101000100590100" "101000100590103" 35 4 2  . . 1 2  . . 72
                "101000100590102" "101000100590100" "101000100590103" 35 4 2  . . 2 2  . . 70
                "101000100590102" "101000100590100" "101000100590104" 31 3 1  8 3 2 2  . . 70
                "101000100590102" "101000100590100" "101000100590108" 28 4 2  . . 2 2  . . 70
                "101000100760101" "101000100760100" "101000100760106" 10 3 1  6 2 1 2  . . 49
                "101000100760101" "101000100760100" "101000100760104" 16 3 1  9 3 1 2  . . 49
                "101000100760101" "101000100760100" "101000100760105" 11 3 1  4 1 1 2  . . 49
                "101000100760102" "101000100760100" "101000100760104" 16 3 1  9 3 2 2  . . 46
                "101000100760102" "101000100760100" "101000100760103" 24 3 1  9 3 2 2  . . 46
                "101000100760102" "101000100760100" "101000100760105" 11 3 1  4 1 2 2  . . 46
                "101000100760102" "101000100760100" "101000100760106" 10 3 1  6 2 2 2  . . 46
                "101000101040101" "101000101040100" "101000101040103" 30 3 1 10 4 1 1 10 4 61
                "101000101040102" "101000101040100" "101000101040103" 30 3 1 10 4 2 2  . . 57
                "101000101720101" "101000101720100" "101000101720106" 18 3 1  8 3 1 2  . . 61
                "101000101720101" "101000101720100" "101000101720105" 20 3 1  8 3 1 2  . . 61
                "101000101720101" "101000101720100" "101000101720102" 26 3 1 10 4 1 2  . . 61
                "101000101720101" "101000101720100" "101000101720107" 16 3 1  9 3 1 2  . . 61
                "101000101720101" "101000101720100" "101000101720108" 14 3 1  8 3 1 2  . . 61
                "101000101890107" "101000101890100" "101000101890106" 33 4 2  . . 1 1  8 3 69
                "101000101890107" "101000101890100" "101000101890101" 36 3 1 17 9 1 1  8 3 69
                "101000102490101" "101000102490100" "101000102490103" 25 3 1 11 4 1 1  6 2 61
                "101000102490102" "101000102490100" "101000102490103" 25 3 1 11 4 2 2  . . 64
                "101000103350101" "101000103350100" "101000103350108" 16 3 1  9 3 1 1 10 4 54
                "101000103350101" "101000103350100" "101000103350103" 27 3 1 10 4 1 1 10 4 54
                "101000103350101" "101000103350100" "101000103350107" 18 3 1 14 5 1 1 10 4 54
                "101000103350101" "101000103350100" "101000103350106" 21 3 1 14 5 1 1 10 4 54
                "101000103350101" "101000103350100" "101000103350105" 22 3 1  9 3 1 1 10 4 54
                "101000103350101" "101000103350100" "101000103350104" 24 3 1 12 5 1 1 10 4 54
                "101000103350102" "101000103350100" "101000103350108" 16 3 1  9 3 2 2  . . 52
                "101000103350102" "101000103350100" "101000103350104" 24 3 1 12 5 2 2  . . 52
                "101000103350102" "101000103350100" "101000103350105" 22 3 1  9 3 2 2  . . 52
                "101000103350102" "101000103350100" "101000103350103" 27 3 1 10 4 2 2  . . 52
                "101000103350102" "101000103350100" "101000103350106" 21 3 1 14 5 2 2  . . 52
                "101000103350102" "101000103350100" "101000103350107" 18 3 1 14 5 2 2  . . 52
                "101000103550101" "101000103550100" "101000103550102" 54 4 1  5 2 1 1  5 2 76
                "101000103550102" "101000103550100" "101000103550103" 30 3 1 17 8 2 1  5 2 54
                "101000103550102" "101000103550100" "101000103550107" 22 3 1 16 9 2 1  5 2 54
                "101000103550102" "101000103550100" "101000103550108" 18 3 1 10 4 2 1  5 2 54
                "101000103550102" "101000103550100" "101000103550106" 24 3 1 12 5 2 1  5 2 54
                "101000103550102" "101000103550100" "101000103550105" 28 3 1  9 3 2 1  5 2 54
                "101000103690101" "101000103690100" "101000103690103" 26 3 1  9 3 1 1  9 3 60
                "101000103690102" "101000103690100" "101000103690103" 26 3 1  9 3 2 1  9 3 43
                "101000104030102" "101000104030100" "101000104030105" 25 3 1 17 8 2 1  9 3 54
                "101000104030102" "101000104030100" "101000104030104" 28 3 1 18 9 2 1  9 3 54
                "101000104030102" "101000104030100" "101000104030103" 33 3 1 19 8 2 1  9 3 54
                "101000104030102" "101000104030100" "101000104030106" 23 3 1 16 7 2 1  9 3 54
                "101000104240101" "101000104240100" "101000104240103" 38 3 1 19 8 1 1 10 4 58
                "101000104240101" "101000104240100" "101000104240108" 25 3 1 19 9 1 1 10 4 58
                "101000104240102" "101000104240100" "101000104240108" 25 3 1 19 9 2 2  . . 56
                "101000104240102" "101000104240100" "101000104240103" 38 3 1 19 8 2 2  . . 56
                "101000200050101" "101000200050100" "101000200050103" 37 4 2  . . 1 1  5 2 78
                "101000200050102" "101000200050100" "101000200050103" 37 4 2  . . 2 2  . . 65
                "101000200160109" "101000200160100" "101000200160102" 38 4 1  5 1 1 1  5 2 80
                "101000200210101" "101000200210100" "101000200210106" 12 3 1  5 2 1 1 10 4 46
                "101000200210101" "101000200210100" "101000200210105" 15 3 1  5 2 1 1 10 4 46
                "101000200210101" "101000200210100" "101000200210104" 18 3 1  9 3 1 1 10 4 46
                "101000200210101" "101000200210100" "101000200210103" 22 3 1  9 3 1 1 10 4 46
                "101000200210101" "101000200210100" "101000200210107" 10 3 1  3 1 1 1 10 4 46
                "101000200210102" "101000200210100" "101000200210104" 18 3 1  9 3 2 2  . . 46
                "101000200210102" "101000200210100" "101000200210106" 12 3 1  5 2 2 2  . . 46
                "101000200210102" "101000200210100" "101000200210105" 15 3 1  5 2 2 2  . . 46
                "101000200210102" "101000200210100" "101000200210107" 10 3 1  3 1 2 2  . . 46
                "101000200210102" "101000200210100" "101000200210103" 22 3 1  9 3 2 2  . . 46
                "101000200210108" "101000200210100" "101000200210101" 46 3 1 10 4 2 2  . . 78
                "101000200270101" "101000200270100" "101000200270105" 15 3 1  9 3 1 1 10 4 48
                "101000200270101" "101000200270100" "101000200270106" 13 1 1  5 2 1 1 10 4 48
                "101000200270101" "101000200270100" "101000200270103" 20 3 1 12 5 1 1 10 4 48
                "101000200270101" "101000200270100" "101000200270102" 27 3 1 16 7 1 1 10 4 48
                "101000200270101" "101000200270100" "101000200270104" 17 3 1  9 3 1 1 10 4 48
                "101000200290108" "101000200290100" "101000200290101" 35 3 1 12 5 2 2  . . 64
                "101000200340101" "101000200340100" "101000200340104" 13 3 1  9 3 1 1 10 4 46
                "101000200340101" "101000200340100" "101000200340103" 18 3 1 13 5 1 1 10 4 46
                "101000200340102" "101000200340100" "101000200340104" 13 3 1  9 3 2 2  . . 42
                "101000200340102" "101000200340100" "101000200340103" 18 3 1 13 5 2 2  . . 42
                "101000200470101" "101000200470100" "101000200470103" 35 3 1  9 4 1 1 10 4 72
                end
                label values cv007 lasi_vl216_cv
                label def lasi_vl216_cv 1 "1 Can read only", modify
                label def lasi_vl216_cv 3 "3 Can both read and write", modify
                label def lasi_vl216_cv 4 "4 Cannot read or write", modify
                label values cv008 lasi_vl251_cv
                label def lasi_vl251_cv 1 "1 Yes", modify
                label def lasi_vl251_cv 2 "2 No", modify
                label values cv010 lasi_vl321_cv
                label def lasi_vl321_cv 1 "1 Less than Primary school (Standard 1-4)", modify
                label def lasi_vl321_cv 2 "2 Primary school completed (Standard 5-7)", modify
                label def lasi_vl321_cv 3 "3 Middle school completed (Standard 8- 9)", modify
                label def lasi_vl321_cv 4 "4 Secondary school/Matriculation completed", modify
                label def lasi_vl321_cv 5 "5 Higher secondary/intermediate/senior secondary completed", modify
                label def lasi_vl321_cv 7 "7 Graduate degree (B.A., B.Sc., B. Com.) completed", modify
                label def lasi_vl321_cv 8 "8 Post-graduate degree or (M.A., M.Sc., M. Com.) above (M.Phil, Ph.D., Post-Doc) completed", modify
                label def lasi_vl321_cv 9 "9 Professional course/degree (B.Ed, BE, B.Tech, MBBS, BHMS, BAMS, B.Pharm, BCS, BCA, BBA, LLB) (BVSc., B. Arch, M.Ed, ME, M.Tech, MD, M.Pharm, MCS, MCA, MBA,LLM, MVSc., M. Arch, MS, CA, CS, CWA) completed", modify
                label values dm003 rsex_ind
                label def rsex_ind 1 "1 Male", modify
                label def rsex_ind 2 "2 Female", modify
                label values dm006 _vl6022_ind
                label def _vl6022_ind 1 "1 Yes", modify
                label def _vl6022_ind 2 "2 No", modify
                label values dm008 _vl6024_ind
                label def _vl6024_ind 2 "2 Primary Completed (Standard 5-7)", modify
                label def _vl6024_ind 3 "3 Middle Completed (Standard 8- 9)", modify
                label def _vl6024_ind 4 "4 Secondary School/Matriculation completed", modify
                label def _vl6024_ind 5 "5 Higher Secondary/Intermediate/Senior Secondary completed", modify
                As you can see both 101000100130101 and 101000100130102 have the same childids, which is fine since they are spouses so it makes sense. What I now want is for each distinct child id (eg 101000100130107), look for the gender of the prim_key(dm003), and get the value of dm006 as father's education if dm003 =1 & if dm003 =2 then get the value of dm006 as mother's education. In effect, I want , for each childid, their mother and father education. Who their father(or mother) is will be determined by the prim_key and the gender(dm003), but they need to have the same hhid.

                I am not sure how to proceed on this. Any help would be appreciated.

                Comment


                • #9
                  Code:
                  rename dm008 education
                  rename dm003 sex
                  
                  
                  frame put prim_key hhid education if sex == "1 Male":rsex_ind, into(fathers)
                  frame put prim_key hhid education if sex == "2 Female":rsex_ind, into(mothers)
                  
                  frame fathers: duplicates drop
                  frame mothers: duplicates drop
                  
                  frlink m:1 prim_key hhid, frame(fathers)
                  frget fathers_education = education, from(fathers)
                  drop fathers
                  
                  frlink m:1 prim_key hhid, frame(mothers)
                  frget mothers_education = education, from(mothers)
                  drop mothers
                  
                  frame drop fathers
                  frame drop mothers
                  
                  by childid (fathers_education), sort: replace fathers_education = fathers_education[1]
                  by childid (mothers_education), sort: replace mothers_education = mothers_education[1]
                  I find it impossible to work with variable names like cv006 and dm005 and the like. I know that large-scale surveys tend to use names like these, and for good reason. But I can never keep straight which variables mean what. So after I import the data, I always rename the variables I'm going to be working directly with in ways that tell me what they are. That's what the first two lines of my code here are doing. You don't have to do that, of course, if you are comfortable with the original names.

                  Do note that the way your data set is structured, many of the children appear twice: once in their father's observation and again in their mother's. This remains true in the final data set after this code. So if you are going to do analyses of children, you will need to do further data management to reduce the data set to a single observation for each child so that you don't mistakenly double-count children who have two parents in their household.

                  Comment


                  • #10
                    Thanks Clyde. If I am not mistaken, the code here will give me, for each occurrence of a childid, the father's education and the mother's education. So that means that I have, for some cases, two instances of the same child in the dataset. So to do analyses of children, I'll have to drop duplicates by childid (and/or other variables as required). Am I correct in this understanding?

                    Comment


                    • #11
                      Yes.

                      Comment


                      • #12
                        Thanks a lot Clyde. Just to clear something, I had a question. The sex variable was a int type. So when we put it to a frame, why do we treat it like a string? Shouldn't we put
                        Code:
                         frame put prim_key hhid education if sex == 1 :rsex_ind, into(fathers)
                        Does the rsex_ind have something to do with it?

                        Comment


                        • #13
                          We don't treat it like a string. The notation "Male":rsex_ind is a part of Stata syntax that allows you to refer to a value-labeled numeric variable by the value labels rather than by the numeric values. In most contexts in Stata, you can use this notation. The advantage of using it is that the code is transparent: to read and understand it, you don't have know that somebody arbitrarily decided to code Male as 1 and Female as 2 instead of the other way around, or, for that matter, instead of some other numbers.

                          And no, you wouldn't refer to it as 1:rsex_ind. If you try that you will see that Stata calls it an invalid name. You can just say -if sex == 1-, of course. But, again, that code is only readable if you know what numbers were chosen to represent which sex in the data set.

                          Comment


                          • #14
                            Hi Clyde.
                            Thanks for all the help and I apologize for the late follow up. I had stated this earlier
                            As you can see both 101000100130101 and 101000100130102 have the same childids, which is fine since they are spouses so it makes sense. What I now want is for each distinct child id (eg 101000100130107), look for the gender of the prim_key(dm003), and get the value of dm006 as father's education if dm003 =1 & if dm003 =2 then get the value of dm006 as mother's education. In effect, I want , for each childid, their mother and father education. Who their father(or mother) is will be determined by the prim_key and the gender(dm003), but they need to have the same hhid.
                            What I also forgot to mention is the fact that it is possible to have, for the same child, two different mothers(or fathers). So the same childid can be linked to two different prim_keys with the same gender. I noticed that you used a duplicates drop in both the frames, I was wondering if I could just skip that and do a m:m link. I am aware that m:m is a really messy in Stata, so I am skeptical about it.
                            Is it possible to get what I want, or do I have to do some other restructuring?

                            Comment


                            • #15
                              No, absolutely do not do an m:m merge. It will give you data salad.

                              There are ways to handle this situation, but in order to advise you I need to know how you want to handle the situation where the same child has multiple mothers or fathers. After all, you said that the goal was to create variables representing mother's education and father's education. Well, if the child has two mothers, how do you decide which mother's education will be used for that variable? After all, it would only be an occasional coincidence if both mothers have the same level of education. Of perhaps you want the average of the mothers' education levels? Or the highest, or the lowest? Or something else? Same problem for the father's education variable. If you explain how you want to handle that situation, I can try to develop code that will do it for you.

                              Added: Also, I don't see how your data set up even makes it possible to identifywho the two (or more?) mothers/fathers would be. In the data you gave, the parents' IDs are just the values of prim_key associated with the hhid from the child's record in the data set. So does this mean that a child with multiple mothers or fathers will appear multiple times in the data set, with those records having the same hhid but new values for prim_key? Or are there other variables in the data set for this purpose that have not appeared in this thread yet? Whatever the answer is, when posting back, please include new example data that includes some children in this multiple mother or father situation.

                              Also added: What I said about -merge m:m- is correct, but it doesn't apply to the code shown earlier. The code in #9 uses frames, and connects them with -frlink-. The -frlink- command, thankfully, does not allow m:m linkage. If you tried it, Stata would just give you an error message. So some other approach altogether may be needed, the specifics of which depend on the answers to my questions.
                              Last edited by Clyde Schechter; 17 Apr 2025, 09:14.

                              Comment

                              Working...
                              X