Merging rows

Gianni Spolver

Join Date: May 2019

Posts: 25
#1

Merging rows

23 May 2019, 03:34

Hi,

I am new to Stata and have a question regarding the merging of rows or cells in Stata. Basically, I have a dataset that looks as follows:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int C_id str130 Targetname str18 Dealmethodofpayment 1 "Komioil OOO" "" 2 "Vattenfall Europe Mining AG" "Cash assumed" . "Vattenfall Europe Generation AG" "" 3 "Sibur Kholding OAO" "Cash" . "" "Liabilities" . "" "Deferred payment" end

For some cases (C_id 2 and 3) I have multiple observations per variable. Now, however, for the cases with multiple observations the values are stored in seperate rows. What I would like to do is to merge all observations per case in one cell. For example for C_id 3 I would want "Cash", "Liabilities" and "deferred payment" in one cell. I wondered whether I could do this through a code since I have over 20000 observations.

Thank you in advance.
Tags: None

William Lisowski

Join Date: Dec 2014
Posts: 10150

23 May 2019, 10:27

Welcome to Statalist. Thank you for providing a useful example of your data using the dataex command.

I think your data must have come from a spreadsheet, and the missing values for C_id, Targetname and Dealmethodofpayment are all meant to be the same as the previous observation. While this works in a spreadsheet, it is very bad in Stata and other statistical languages. So we should start by fixing that.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int C_id str130 Targetname str18 Dealmethodofpayment
1 "Komioil OOO"                     ""                
2 "Vattenfall Europe Mining AG"     "Cash assumed"    
. "Vattenfall Europe Generation AG" ""                
3 "Sibur Kholding OAO"              "Cash"            
. ""                                "Liabilities"     
. ""                                "Deferred payment"
end
replace C_id                = C_id[_n-1]                if missing(C_id)
replace Targetname          = Targetname[_n-1]          if C_id==C_id[_n-1] & missing(Targetname)
replace Dealmethodofpayment = Dealmethodofpayment[_n-1] if C_id==C_id[_n-1] & missing(Dealmethodofpayment)
list, sepby(C_id) abbreviate(20)

Code:

. list, sepby(C_id) abbreviate(20)

     +--------------------------------------------------------------+
     | C_id                        Targetname   Dealmethodofpayment |
     |--------------------------------------------------------------|
  1. |    1                       Komioil OOO                       |
     |--------------------------------------------------------------|
  2. |    2       Vattenfall Europe Mining AG          Cash assumed |
  3. |    2   Vattenfall Europe Generation AG          Cash assumed |
     |--------------------------------------------------------------|
  4. |    3                Sibur Kholding OAO                  Cash |
  5. |    3                Sibur Kholding OAO           Liabilities |
  6. |    3                Sibur Kholding OAO      Deferred payment |
     +--------------------------------------------------------------+

But now that we have that, I don't know what it is you want the results to look like. I think you are asking to reduce the three observations for C_id 3 to a single observation. But I don't know what you mean by you want "Cash" "Liabilities" and "Deferred Payment" in one "cell". Is there some other variable that gives the payment amount, and you want to caclculate the total payment? If so, perhaps this example will start you in a useful direction.

Code:

cls
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte C_id str31 Targetname str16 Dealmethodofpayment float paymentamount
1 "Komioil OOO"                     ""                  .
2 "Vattenfall Europe Mining AG"     "Cash assumed"     10
2 "Vattenfall Europe Generation AG" "Cash assumed"     20
3 "Sibur Kholding OAO"              "Cash"             30
3 "Sibur Kholding OAO"              "Liabilities"      40
3 "Sibur Kholding OAO"              "Deferred payment" 50
end

collapse (sum) paymentamount, by(C_id Targetname)
list, sepby(C_id) abbreviate(20)

Code:

. list, sepby(C_id) abbreviate(20)

     +--------------------------------------------------------+
     | C_id                        Targetname   paymentamount |
     |--------------------------------------------------------|
  1. |    1                       Komioil OOO               0 |
     |--------------------------------------------------------|
  2. |    2   Vattenfall Europe Generation AG              20 |
  3. |    2       Vattenfall Europe Mining AG              10 |
     |--------------------------------------------------------|
  4. |    3                Sibur Kholding OAO             120 |
     +--------------------------------------------------------+

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30154
#3

23 May 2019, 10:28

You can do it this way:

Code:

replace C_id = C_id[_n-1] if missing(C_id) foreach v of varlist Targetname Dealmethodofpayment { by C_id (`v'), sort: replace `v' = `v' + ", " if !missing(`v') & _n != _N by C_id (`v'), sort: replace `v' = `v'[_n-1] + `v' if _n > 1 by C_id: replace `v' = `v'[_N] } by C_id: keep if _n == _N

Added: Crossed with #2 which offers a similar solution. One difference in understanding the question: I assumed you wanted to do this with both Targetname and Dealmethodofpayment, whereas the code in #2 applies only to the latter.
Comment
Gianni Spolver

Join Date: May 2019

Posts: 25
#4

23 May 2019, 22:37

Thanks Clyde and William. Very helpful. My objective for doing this is to create dummy variables for each case_id if Dealmethodofpayment for instance contains the word "cash". Clyde's code provides me the exact result I wanted.

Now, however, I have one extra issue I can not seem to find the answer to on statalist or google in general. The example I provided was for 2 variables. However, in the real dataset I have around 100 variables for which I would like to do this same operation. I tried the following code:

Code:

replace C_id = C_id[_n-1] if missing(C_id) foreach v of varlist _all { by C_id (`v'), sort: replace `v' = `v' + ", " if !missing(`v') & _n != _N by C_id (`v'), sort: replace `v' = `v'[_n-1] + `v' if _n > 1 by C_id: replace `v' = `v'[_N] } by C_id: keep if _n == _N

But I keep getting a "type mismatch" error. From what I understood on the forum some people had this issue because some variables are string, while others were numeric. However, all my variables, except C_id, are string variables. Is there some way I could solve this or at least discover which variable is causing this issue?
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30154

23 May 2019, 23:00

Undoubtedly the variable C_id itself is causing the problem: it is not a string variable, but it is part of _all. So try it this way:

Code:

replace C_id = C_id[_n-1] if missing(C_id)

ds C_id, not
local variables `r(varlist)'
foreach v of varlist `variables' {
    by C_id (`v'), sort: replace `v' = `v' + ", " if !missing(`v') & _n != _N
    by C_id (`v'), sort: replace `v' = `v'[_n-1] + `v' if _n > 1
    by C_id: replace `v' = `v'[_N]
}
by C_id: keep if _n == _N

Comment

Gianni Spolver

Join Date: May 2019

Posts: 25
#6

23 May 2019, 23:27

Works perfectly. Thanks a lot!
Comment

Announcement

Comment

Comment

Comment

Comment

Comment