Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Mapping variable names to IDs to reshape a dataset with inconsistent variable names

    Dear Statalist members,

    I have a dataset that contains information about individuals. There are two ways to identify an individual: by using a variable id in the format "1", "2", etc. or by using a variable name. The mapping between the name and ID is done in a separate file called corresp.dta. I want to reshape the data to the long format, but my dataset is not consistently organized. For example, in Excel, I have a set of n columns indicating whether the individual (there are n of them) knows the individual displayed in the variable. Because the individual name may contain special characters and could be complicated to read in a variable name, I cannot use the firstrow option when importing the spreadsheet. So, I thought about importing the Excel initial first row that is organized into something like this: "Do you know this individual? [XXXXXX]" as a string observation to 1) isolate the XXXXXX and 2) change it with its corresponding ID based on corresp.dta. However, the problem is that I want to append different files before reshaping, and these files do not always follow the "Do you know this individual? [XXXXXX]" format. In some of them, only "[XXXXXX]" is displayed. Here's a dataex example to help understand my request:

    FILE 1 :

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str36(A B C D)
    "Do you know this indiv? [INDIV AAAA]" "Do you know this indiv? [INDIV YYYY]" "Do you know this indiv? [INDIV XXXX]" "Do you know this indiv? [INDIV ZZZZ]"
    "0"                                    "1"                                    "0"                                    "1"                                  
    "1"                                    "0"                                    "0"                                    "0"                                  
    "0"                                    "1"                                    "0"                                    "0"                                  
    "1"                                    "0"                                    "0"                                    "0"                                  
    end
    FILE 2 :

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str12(E F G H)
    "[INDIV AAAA]" "[INDIV YYYY]" "[INDIV XXXX]" "[INDIV ZZZZ]"
    "0"            "1"            "0"            "0"          
    "1"            "0"            "1"            "1"          
    "0"            "1"            "0"            "1"          
    "1"            "1"            "1"            "0"          
    end

    The process of appending the files would not be successful as the variable names, as they are imported directly from Excel, would be different. My goal is to, for each file to be appended,
    1. simplify the first observation from "Do you know this individual? [INDIV AAAA]" or "[INDIV AAAA]" to just "INDIV AAAA"
    2. replace the name of this first observation with its corresponding ID in the corresp.dta file. Here's a toy example of what it may contain:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str10 name str1 id
    "INDIV AAAA" "1"
    "INDIV YYYY" "2"
    "INDIV XXXX" "3"
    "INDIV ZZZZ" "4"
    end
    3. Replace the variable name with its first observation and adding "knows" as a prefix for the reshape process. Once the files are appended, I will have

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(knows1 knows2 knows3 knows4)
    0 1 0 1
    1 0 0 0
    0 1 0 0
    1 0 0 0
    0 1 0 0
    1 0 1 1
    0 1 0 1
    1 1 1 0
    end
    It may be important to note that the order of the variables is crucial. For example, AAAA must always be before YYYY, who must always be before XXXX, etc. I am trying to use the substr function to remove the square brackets, but since the name of the individual is not the same across variables, I am having trouble figuring it out. I believe step two involves preserving and restoring the names and IDs from corresp.dta using the levelsof command, but for the third step, I have no idea what to do.

    I understand that this is a long and tedious thread, so any suggestions or guidance for any part of the process would be greatly appreciated. Thank you for the help provided by this forum, from a very beginner user!
    Last edited by Adam Sadi; 20 Jan 2023, 05:02.

  • #2
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str36(A B C D)
    "Do you know this indiv? [INDIV AAAA]" "Do you know this indiv? [INDIV YYYY]" "Do you know this indiv? [INDIV XXXX]" "Do you know this indiv? [INDIV ZZZZ]"
    "0"                                    "1"                                    "0"                                    "1"                                  
    "1"                                    "0"                                    "0"                                    "0"                                  
    "0"                                    "1"                                    "0"                                    "0"                                  
    "1"                                    "0"                                    "0"                                    "0"                                  
    end
    tempfile dataset1
    save `dataset1'
    
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str10 name str1 id
    "INDIV AAAA" "1"
    "INDIV YYYY" "2"
    "INDIV XXXX" "3"
    "INDIV ZZZZ" "4"
    end
    tempfile crosswalk_data
    save `crosswalk_data'
    
    clear
    clear frames
    use `dataset1'
    ds _all
    local vbles `r(varlist)'
    rename (`r(varlist)') name#, addnumber
    gen `c(obs_t)' obs_no = _n
    reshape long name, i(obs_no) j(column)
    
    gen begin = strpos(name, "[")
    replace name = substr(name, begin+1, strlen(name)-begin-1) if obs_no == 1
    
    frame create crosswalk
    frame crosswalk: use `crosswalk_data'
    frlink m:1 name, frame(crosswalk)
    frget id, from(crosswalk)
    assert missing(crosswalk) == (obs_no != 1)
    
    replace name = "knows"+id if !missing(id)
    drop begin crosswalk id
    reshape wide
    foreach v of varlist name* {
        rename `v' `=`v'[1]'
    }
    drop in 1
    drop obs_no

    Comment


    • #3
      Dear Clyde : Thank you for doing what I couldn't do in days. Unfortunately, the line

      Code:
      assert missing(crosswalk) == (obs_no != 1)
      was very useful even if I wish it wasn't! It yielded the following message : "assertion is false". Is there any way to ask Stata for more details about the reason why the assertion is false ?

      Comment


      • #4
        At the point where the assertion fails, run:

        Code:
        browse if missing(crosswalk) != (obs_no != 1)
        and Stata will show you the problematic observations. Most likely this will traceable back to some column in your original data set where your corresp.dta file has no entry for the corresponding INDIV.

        Comment


        • #5
          Thank you for your useful insights Clyde. Indeed there are some problems that I did not know about, namely there are first observations with two pairs of square brackets (for instance, the first pair of square brackets had the question label within it), so the code :

          Code:
          gen begin = strpos(name, "[")
          replace name = substr(name, begin+1, strlen(name)-begin-1) if obs_no == 1
          replaced the content of name with what was inside the first pair of brackets, while I needed what was within the second one. I tried adding these lines of code:

          Code:
          gen begin2 = strpos(subinstr(name, "[", "(", 1), "[")
          But then I came to realize that it would also remove the brackets of the observations that only have one pair of them, since I'm looping on several files. Is there a way to add a condition like "if the number of pairs of brackets is equal to two, replace the first pair by parentheses" ?

          Comment


          • #6
            Provide an example including a few cases using dataex.

            Comment


            • #7
              If it is always the case that the [INDIV AAAA] that you are interested in always appears at the end of the string, so that only the last [ and ] in the string are of interest and the ] of interest is the final character in the string, then the following modification to the code will work:
              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input str44(A B C D)
              "Do you know this indiv [qwert]? [INDIV AAAA]" "Do you know this indiv? [INDIV YYYY]" "Do you know this indiv? [INDIV XXXX]" "Do you know this indiv? [INDIV ZZZZ]"
              "0"                                    "1"                                    "0"                                    "1"                                  
              "1"                                    "0"                                    "0"                                    "0"                                  
              "0"                                    "1"                                    "0"                                    "0"                                  
              "1"                                    "0"                                    "0"                                    "0"                                  
              end
              tempfile dataset1
              save `dataset1'
              
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input str10 name str1 id
              "INDIV AAAA" "1"
              "INDIV YYYY" "2"
              "INDIV XXXX" "3"
              "INDIV ZZZZ" "4"
              end
              tempfile crosswalk_data
              save `crosswalk_data'
              
              clear
              clear frames
              use `dataset1'
              ds _all
              local vbles `r(varlist)'
              rename (`r(varlist)') name#, addnumber
              gen `c(obs_t)' obs_no = _n
              reshape long name, i(obs_no) j(column)
              
              gen reversed_name = reverse(name)
              replace name = reverse(substr(reversed_name, 2, strpos(reversed_name, "[")-2)) if obs_no == 1
              
              frame create crosswalk
              frame crosswalk: use `crosswalk_data'
              frlink m:1 name, frame(crosswalk)
              frget id, from(crosswalk)
              assert missing(crosswalk) == (obs_no != 1)
              
              replace name = "knows"+id if !missing(id)
              drop reversed_name crosswalk id
              reshape wide
              foreach v of varlist name* {
                  rename `v' `=`v'[1]'
              }
              drop in 1
              drop obs_no
              If, the position is not always at the very end of the string, but it is always the last set of brackets that encloses the target material, then a slightly different modification (post back) can be written. But if there can be multiple pairs of brackets and there is no clear rule as to which one is going to be the right target, well, then we may have a big problem on our hands. Definitely show new -dataex- output when posting back, and make it as representative of the difficulties in the data as possible.

              Comment

              Working...
              X