Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging rows

    Hi,

    I am new to Stata and have a question regarding the merging of rows or cells in Stata. Basically, I have a dataset that looks as follows:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int C_id str130 Targetname str18 Dealmethodofpayment
    1 "Komioil OOO"                     ""                
    2 "Vattenfall Europe Mining AG"     "Cash assumed"    
    . "Vattenfall Europe Generation AG" ""                
    3 "Sibur Kholding OAO"              "Cash"            
    . ""                                "Liabilities"     
    . ""                                "Deferred payment"
    end
    For some cases (C_id 2 and 3) I have multiple observations per variable. Now, however, for the cases with multiple observations the values are stored in seperate rows. What I would like to do is to merge all observations per case in one cell. For example for C_id 3 I would want "Cash", "Liabilities" and "deferred payment" in one cell. I wondered whether I could do this through a code since I have over 20000 observations.

    Thank you in advance.

  • #2
    Welcome to Statalist. Thank you for providing a useful example of your data using the dataex command.

    I think your data must have come from a spreadsheet, and the missing values for C_id, Targetname and Dealmethodofpayment are all meant to be the same as the previous observation. While this works in a spreadsheet, it is very bad in Stata and other statistical languages. So we should start by fixing that.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int C_id str130 Targetname str18 Dealmethodofpayment
    1 "Komioil OOO"                     ""                
    2 "Vattenfall Europe Mining AG"     "Cash assumed"    
    . "Vattenfall Europe Generation AG" ""                
    3 "Sibur Kholding OAO"              "Cash"            
    . ""                                "Liabilities"     
    . ""                                "Deferred payment"
    end
    replace C_id                = C_id[_n-1]                if missing(C_id)
    replace Targetname          = Targetname[_n-1]          if C_id==C_id[_n-1] & missing(Targetname)
    replace Dealmethodofpayment = Dealmethodofpayment[_n-1] if C_id==C_id[_n-1] & missing(Dealmethodofpayment)
    list, sepby(C_id) abbreviate(20)
    Code:
    . list, sepby(C_id) abbreviate(20)
    
         +--------------------------------------------------------------+
         | C_id                        Targetname   Dealmethodofpayment |
         |--------------------------------------------------------------|
      1. |    1                       Komioil OOO                       |
         |--------------------------------------------------------------|
      2. |    2       Vattenfall Europe Mining AG          Cash assumed |
      3. |    2   Vattenfall Europe Generation AG          Cash assumed |
         |--------------------------------------------------------------|
      4. |    3                Sibur Kholding OAO                  Cash |
      5. |    3                Sibur Kholding OAO           Liabilities |
      6. |    3                Sibur Kholding OAO      Deferred payment |
         +--------------------------------------------------------------+
    But now that we have that, I don't know what it is you want the results to look like. I think you are asking to reduce the three observations for C_id 3 to a single observation. But I don't know what you mean by you want "Cash" "Liabilities" and "Deferred Payment" in one "cell". Is there some other variable that gives the payment amount, and you want to caclculate the total payment? If so, perhaps this example will start you in a useful direction.
    Code:
    cls
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte C_id str31 Targetname str16 Dealmethodofpayment float paymentamount
    1 "Komioil OOO"                     ""                  .
    2 "Vattenfall Europe Mining AG"     "Cash assumed"     10
    2 "Vattenfall Europe Generation AG" "Cash assumed"     20
    3 "Sibur Kholding OAO"              "Cash"             30
    3 "Sibur Kholding OAO"              "Liabilities"      40
    3 "Sibur Kholding OAO"              "Deferred payment" 50
    end
    
    collapse (sum) paymentamount, by(C_id Targetname)
    list, sepby(C_id) abbreviate(20)
    Code:
    . list, sepby(C_id) abbreviate(20)
    
         +--------------------------------------------------------+
         | C_id                        Targetname   paymentamount |
         |--------------------------------------------------------|
      1. |    1                       Komioil OOO               0 |
         |--------------------------------------------------------|
      2. |    2   Vattenfall Europe Generation AG              20 |
      3. |    2       Vattenfall Europe Mining AG              10 |
         |--------------------------------------------------------|
      4. |    3                Sibur Kholding OAO             120 |
         +--------------------------------------------------------+

    Comment


    • #3
      You can do it this way:
      Code:
      replace C_id = C_id[_n-1] if missing(C_id)
      
      foreach v of varlist Targetname Dealmethodofpayment {
          by C_id (`v'), sort: replace `v' = `v' + ", " if !missing(`v') & _n != _N
          by C_id (`v'), sort: replace `v' = `v'[_n-1] + `v' if _n > 1
          by C_id: replace `v' = `v'[_N]
      }
      by C_id: keep if _n == _N
      Added: Crossed with #2 which offers a similar solution. One difference in understanding the question: I assumed you wanted to do this with both Targetname and Dealmethodofpayment, whereas the code in #2 applies only to the latter.

      Comment


      • #4
        Thanks Clyde and William. Very helpful. My objective for doing this is to create dummy variables for each case_id if Dealmethodofpayment for instance contains the word "cash". Clyde's code provides me the exact result I wanted.

        Now, however, I have one extra issue I can not seem to find the answer to on statalist or google in general. The example I provided was for 2 variables. However, in the real dataset I have around 100 variables for which I would like to do this same operation. I tried the following code:

        Code:
        replace C_id = C_id[_n-1] if missing(C_id)
        
        foreach v of varlist _all {
            by C_id (`v'), sort: replace `v' = `v' + ", " if !missing(`v') & _n != _N
            by C_id (`v'), sort: replace `v' = `v'[_n-1] + `v' if _n > 1
            by C_id: replace `v' = `v'[_N]
        }
        by C_id: keep if _n == _N
        But I keep getting a "type mismatch" error. From what I understood on the forum some people had this issue because some variables are string, while others were numeric. However, all my variables, except C_id, are string variables. Is there some way I could solve this or at least discover which variable is causing this issue?

        Comment


        • #5
          Undoubtedly the variable C_id itself is causing the problem: it is not a string variable, but it is part of _all. So try it this way:

          Code:
          replace C_id = C_id[_n-1] if missing(C_id)
          
          ds C_id, not
          local variables `r(varlist)'
          foreach v of varlist `variables' {
              by C_id (`v'), sort: replace `v' = `v' + ", " if !missing(`v') & _n != _N
              by C_id (`v'), sort: replace `v' = `v'[_n-1] + `v' if _n > 1
              by C_id: replace `v' = `v'[_N]
          }
          by C_id: keep if _n == _N

          Comment


          • #6
            Works perfectly. Thanks a lot!

            Comment

            Working...
            X