Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging and Concatenating Data

    Hello!

    I am trying to organise a dataset for a project.

    the set in question involves 27 variables, with about 27000 rows, not all of which are populated. All 27 variables are populated in unique indices.
    I want to generate a new variable that is a single column with all entries from all 27 variables merged together. Some rows will also end up being unpopulated.

    Approach tried:
    I have implemented a counting for each variable (eg in variable 1, all data is concentrated from row 1-50, in variable 2 data is concentrated from row 51-300, etc.)
    Using cumulative summation I have obtained the start and end index of the data within each variable.
    However quickly fails, if data is not concentrated, or if there is discontinuity from one variable to the next (eg no data in no variable from rows 500-600.

    I am also unsure of how to about using for loops as the data in each variable doesn't span for the same range.

    A brief layout of what I am describing can be seen in the below table
    var1 var2 var3 var4 var5
    left . . . .
    left . . . .
    right . . . .
    other . . . .
    left . . . .
    . right . . .
    . left . . .
    . other . . .
    . right . . .
    . left . . .
    . left . . .
    . . right . .
    . . left . .
    . . left . .
    . . other . .
    . . right . .
    . . . . .
    . . . . .
    . . . right .
    . . . other .
    . . . left .
    . . . left .
    . . . center .
    . . . right .
    . . . other .
    . . . . left
    . . . . other
    . . . . right
    . . . . other

    The target variable should look like this:
    Target
    left
    left
    right
    other
    left
    right
    left
    other
    right
    left
    left
    right
    left
    left
    other
    right
    .
    .
    right
    other
    left
    left
    center
    right
    other
    left
    other
    right
    other
    Thanks in advance
    Odysseas

  • #2
    Assuming that these variables are strings:

    Code:
    gen long obsno=_n
    reshape long var, i(obsno) j(which)
    bys obsno (var): keep if _n==_N
    Otherwise, the last line should be

    Code:
    bys obsno (var): keep if _n==1
    for numerical variables.
    Last edited by Andrew Musau; 28 Jun 2023, 15:34.

    Comment


    • #3
      In addition to Andrew's suggestion (which arrived as I was constructing an answer), I'd also note that what you want provides an instance in which the (relatively obscure) -stack- command could work. What I show below does the stacking you want, but does not deal with any other variables that might be present in your dataset. Note the use of the -dataex- command to prepare and present example data, as prescribed in the StataList FAQ new members are asked to read.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str5(var1 var2 var3) str6 var4 str5 var5
      "left"  ""      ""      ""       ""     
      "left"  ""      ""      ""       ""     
      "right" ""      ""      ""       ""     
      "other" ""      ""      ""       ""     
      "left"  ""      ""      ""       ""     
      ""      "right" ""      ""       ""     
      ""      "left"  ""      ""       ""     
      ""      "other" ""      ""       ""     
      ""      "right" ""      ""       ""     
      ""      "left"  ""      ""       ""     
      ""      "left"  ""      ""       ""     
      ""      ""      "right" ""       ""     
      ""      ""      "left"  ""       ""     
      ""      ""      "left"  ""       ""     
      ""      ""      "other" ""       ""     
      ""      ""      "right" ""       ""     
      ""      ""      ""      ""       ""     
      ""      ""      ""      ""       ""     
      ""      ""      ""      "right"  ""     
      ""      ""      ""      "other"  ""     
      ""      ""      ""      "left"   ""     
      ""      ""      ""      "left"   ""     
      ""      ""      ""      "center" ""     
      ""      ""      ""      "right"  ""     
      ""      ""      ""      "other"  ""     
      ""      ""      ""      ""       "left"
      ""      ""      ""      ""       "other"
      ""      ""      ""      ""       "right"
      ""      ""      ""      ""       "other"
      end
      stack var1 var2 var3 var4 var5, into(target)
      drop if missing(target)
      A "do it yourself" approach is also possible, and would illustrate some technique that might be useful in the particulars of your situation:
      Code:
      gen target = ""
      foreach v of varlist var1 var2 var3 var4 var5 {
        replace `v' = strtrim(`v') // remove stray blanks
        qui replace target = `v' if missing(target) & !missing(`v')
      }
      drop if missing(target)
      Either of these approaches, and more generally your data structure, are potentially dangerous, as they appear to depend on the physical ordering of observations in the data set. A safer approach would take account of some variable that records the desired order of observations.

      Comment


      • #4
        Thank you!

        Comment

        Working...
        X