Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Aggregating duplicate observations comprising both string and numeric variables

    I have the following unbalanced panel data set.
    This dataset contains both numeric and string variables. Variables in columns 2-6 are numeric and columns 7-8 are string.
    I want to have unique observations under firmID-Year pair. But as can be seen, obs 2,3,4,5,7,8 are duplicates in terms of firmID-year pair.


    ObsNo. industryID firmID Year Var1 var2 var3 var4
    1 1641 1 2006 12 9.3 USD QW
    2 1641 1 2008 9 21 USD DA
    3 1641 1 2008 11 23 USD CN
    4 3486 2 2009 7 99 USD
    5 3486 2 2009 5.7 45 USD
    6 3486 2 2003 34 21.21 USD YY
    7 1641 3 2004 0.9 0 USD RP
    8 1641 3 2004 55 USD













    I do not want to just drop the duplicate values. I want to aggregate the duplicate observations into a single observation. I tried to use collapse command, but it only considers numeric variables and not string variables. Hence my post here.

    I would like to do the following:
    1. For every numeric variable, I want to take a mean in case of numeric variables
    2. In the case of string variables, I want to keep one of the two values (anyone of two works). If for a string variable, one value is missing and one is not, I want to keep the non-missing value.

    So the aforementioned sample data should look like this-

    ObsNo. industryID firmID Year Var1 var2 var3 var4
    1 1641 1 2006 12 9.3 USD DA
    2 1641 1 2008 10 22 USD DA
    3 3486 2 2009 6.35 72 USD
    4 3486 2 2003 34 21.21 USD YY
    5 1641 3 2004 27.95 0 USD RP









    Plase help. Thanks much!

  • #2
    I tried to use collapse command, but it only considers numeric variables and not string variables.
    That's not correct. This appears to do what you want:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte obsno int industryid byte firmid int year float(var1 var2) str3 var3 str2 var4
    1 1641 1 2006  12   9.3 "USD" "QW"
    2 1641 1 2008   9    21 "USD" "DA"
    3 1641 1 2008  11    23 "USD" "CN"
    4 3486 2 2009   7    99 "USD" ""  
    5 3486 2 2009 5.7    45 "USD" ""  
    6 3486 2 2003  34 21.21 "USD" "YY"
    7 1641 3 2004  .9     0 "USD" "RP"
    8 1641 3 2004  55     . "USD" ""  
    end
    
    collapse (mean) var1 var2 (firstnm) industryid var3 var4, by(firmid year)

    Comment


    • #3
      Thank you! I had incomplete information on the collapse command. What is also great is that the collapse command does not alter data in non-duplicate observations.
      I have one further question, though.
      In my post, I gave a sample dataset to explain my problem. But in my real dataset, I have 621 variables in total. So I would need to specify all of them in the collapse command, which makes it error-prone and time-consuming. Is there a quicker/more efficient way to mention all variables in the collapse command without me having to create two lists, one for taking the mean of numeric variables and one for string variables? Thanks again.

      Comment


      • #4
        Code:
        help ds
        lets you identify string and numeric variables separately.

        Comment

        Working...
        X