Merging and Concatenating Data

Odysseas Makariou

Join Date: Jun 2023
Posts: 3

Merging and Concatenating Data

28 Jun 2023, 14:58

Hello!

I am trying to organise a dataset for a project.

the set in question involves 27 variables, with about 27000 rows, not all of which are populated. All 27 variables are populated in unique indices.
I want to generate a new variable that is a single column with all entries from all 27 variables merged together. Some rows will also end up being unpopulated.

Approach tried:
I have implemented a counting for each variable (eg in variable 1, all data is concentrated from row 1-50, in variable 2 data is concentrated from row 51-300, etc.)
Using cumulative summation I have obtained the start and end index of the data within each variable.
However quickly fails, if data is not concentrated, or if there is discontinuity from one variable to the next (eg no data in no variable from rows 500-600.

I am also unsure of how to about using for loops as the data in each variable doesn't span for the same range.

A brief layout of what I am describing can be seen in the below table

var1	var2	var3	var4	var5
left	.	.	.	.
left	.	.	.	.
right	.	.	.	.
other	.	.	.	.
left	.	.	.	.
.	right	.	.	.
.	left	.	.	.
.	other	.	.	.
.	right	.	.	.
.	left	.	.	.
.	left	.	.	.
.	.	right	.	.
.	.	left	.	.
.	.	left	.	.
.	.	other	.	.
.	.	right	.	.
.	.	.	.	.
.	.	.	.	.
.	.	.	right	.
.	.	.	other	.
.	.	.	left	.
.	.	.	left	.
.	.	.	center	.
.	.	.	right	.
.	.	.	other	.
.	.	.	.	left
.	.	.	.	other
.	.	.	.	right
.	.	.	.	other

The target variable should look like this:

Target

left

right

other

left

right

left

other

right

left

right

left

other

right

other

left

center

right

other

left

other

right

other

Thanks in advance
Odysseas

Tags: None

Andrew Musau

Join Date: Oct 2014

Posts: 10284
#2

28 Jun 2023, 15:16

Assuming that these variables are strings:

Code:

gen long obsno=_n reshape long var, i(obsno) j(which) bys obsno (var): keep if _n==_N

Otherwise, the last line should be

Code:

bys obsno (var): keep if _n==1

for numerical variables.

Last edited by Andrew Musau; 28 Jun 2023, 15:34.
2 likes
Comment

Mike Lacy

Join Date: Apr 2014
Posts: 2421

28 Jun 2023, 15:34

In addition to Andrew's suggestion (which arrived as I was constructing an answer), I'd also note that what you want provides an instance in which the (relatively obscure) -stack- command could work. What I show below does the stacking you want, but does not deal with any other variables that might be present in your dataset. Note the use of the -dataex- command to prepare and present example data, as prescribed in the StataList FAQ new members are asked to read.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str5(var1 var2 var3) str6 var4 str5 var5
"left"  ""      ""      ""       ""     
"left"  ""      ""      ""       ""     
"right" ""      ""      ""       ""     
"other" ""      ""      ""       ""     
"left"  ""      ""      ""       ""     
""      "right" ""      ""       ""     
""      "left"  ""      ""       ""     
""      "other" ""      ""       ""     
""      "right" ""      ""       ""     
""      "left"  ""      ""       ""     
""      "left"  ""      ""       ""     
""      ""      "right" ""       ""     
""      ""      "left"  ""       ""     
""      ""      "left"  ""       ""     
""      ""      "other" ""       ""     
""      ""      "right" ""       ""     
""      ""      ""      ""       ""     
""      ""      ""      ""       ""     
""      ""      ""      "right"  ""     
""      ""      ""      "other"  ""     
""      ""      ""      "left"   ""     
""      ""      ""      "left"   ""     
""      ""      ""      "center" ""     
""      ""      ""      "right"  ""     
""      ""      ""      "other"  ""     
""      ""      ""      ""       "left"
""      ""      ""      ""       "other"
""      ""      ""      ""       "right"
""      ""      ""      ""       "other"
end
stack var1 var2 var3 var4 var5, into(target)
drop if missing(target)

A "do it yourself" approach is also possible, and would illustrate some technique that might be useful in the particulars of your situation:

Code:

gen target = ""
foreach v of varlist var1 var2 var3 var4 var5 {
  replace `v' = strtrim(`v') // remove stray blanks
  qui replace target = `v' if missing(target) & !missing(`v')
}
drop if missing(target)

Either of these approaches, and more generally your data structure, are potentially dangerous, as they appear to depend on the physical ordering of observations in the data set. A safer approach would take account of some variable that records the desired order of observations.

Comment

Odysseas Makariou

Join Date: Jun 2023

Posts: 3
#4

29 Jun 2023, 12:23

Thank you!
Comment

Announcement

Merging and Concatenating Data

Comment

Comment

Comment