Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Place data from one row at the end of the row above if it is a duplicate

    Hi, I'm sure the title of my question is a bit confusing, so let me explain briefly.

    I have a panel data structure with some companies in the many different years.

    Within this panel data there may be duplicates with respect to the columns "gvkey" (which is a company identifyer), "year" and "month". However, I do not want to drop these duplicates directly as they differ in a few other columns, namely "NUMEST", "NUMUP", "CURCODE".

    Nevertheless, for further analysis I need a structure in which I have no duplicates with respect to to "gvkey", "year" and "month".

    So what I want now is the following: I want to keep the data "NUMEST", "NUMUP" and "CURCODE" and not throw them out. So if there is a duplicate (in the sense described above), then I want to put these three variables in the row above, and at the very end of the columns.

    Example (Starting Situation):

    Line 1:
    gvkey year month NUMEST NUMUP CURCODE
    Row X: Apple 2018 06 5 3 EUR
    Row X+1: Apple 2018 06 7 2 GBP

    Example (Final Solution):

    gvkey year month NUMEST NUMUP CURCODE NUMEST2 NUMUP2 CURCODE2

    Row X: Apple 2018 06 5 3 EUR 7 2 GBP

    My data looks like this:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long gvkey float(year month NUMEST) byte NUMUP str3 CURCODE
    1004 1989  3 10  0 "USD"
    1004 1989  3 13  0 "USD"
    1004 1991  3  7  0 "USD"
    1004 1992  3  9  0 "USD"
    1004 1993  3  8  0 "USD"
    1004 1994  3  4  0 "USD"
    1004 1995  3  4  0 "USD"
    1004 1995  3  2  0 "USD"
    1004 1997  3  6  1 "USD"
    1004 1998  3  7  0 "USD"
    1004 1999  3  8  3 "USD"
    1004 2000  3  7  0 "USD"
    1004 2001  3  5  0 "USD"
    1004 2001  3  4  1 "USD"
    1004 2003  3  2  0 "USD"
    1004 2004  3  1  0 "USD"
    1004 2005  3  2  2 "USD"
    1004 2006  3  5  0 "USD"
    1004 2007  3  6  0 "USD"
    1004 2008  3  9  3 "USD"
    1004 2009  3  6  0 "USD"
    1004 2010  3  6  0 "USD"
    1004 2011  3  8  6 "USD"
    1004 2012  3  5  0 "USD"
    1004 2013  3  7  0 "USD"
    1004 2014  3  6  0 "USD"
    1004 2015  3  3  0 "USD"
    1004 2016  3  3  0 "USD"
    1004 2017  3  4  0 "USD"
    1004 2018  3  5  0 "USD"
    1004 2019  3  5  0 "USD"
    1009 1989  8  1  0 "USD"
    1009 1990  8  1  0 "USD"
    1009 1991  8  1  0 "USD"
    1009 1992  8  1  0 "USD"
    1009 1993  8  1  0 "USD"
    1009 1994  8  1  0 "USD"
    1009 1995  8  1  0 "USD"
    1011 1993 10  1  0 "USD"
    1013 1988  8  9  1 "USD"
    1013 1988  8  8  2 "USD"
    1013 1988  8  5  0 "USD"
    1013 1991  8  6  0 "USD"
    1013 1992  8  7  0 "USD"
    1013 1993  8  8  0 "USD"
    1013 1994  8  8  1 "USD"
    1013 1995  8 14  2 "USD"
    1013 1996  8 17  2 "USD"
    1013 1997  8 16  0 "USD"
    1013 1998  8 19  9 "USD"
    1013 1999  8 24 15 "USD"
    1013 2000  8 23  0 "USD"
    1013 2001  8 22  0 "USD"
    1013 2002  8 17  0 "USD"
    1013 2003  8 13  0 "USD"
    1013 2004  8 12  3 "USD"
    1013 2005  8 13  2 "USD"
    1013 2005  8 14  1 "USD"
    1013 2005  8 18  2 "USD"
    1013 2008  8  8  0 "USD"
    1013 2009  8 13 10 "USD"
    1013 2010  7 13  1 "USD"
    1017 1988 12  2  0 "USD"
    end

    Since this is the first time I use dataex in this forum, I hope I did everything right.

    I would be very grateful for any suggestions on how to solve this issue, and please also let me know when you have an easier way to solve this.

    Thank you & have a nice day!

  • #2
    Code:
    sort gvkey year month, stable
    by gvkey year month: gen byte seq = _n
    reshape wide NUMEST NUMUP CURCODE, i(gvkey year month) j(seq)
    It's not clear to me what you want to do if you have two or more observations of the same gvkey, year, and month, but they agree on their values of NUMEST, NUMUP, and CURCODE. In the code shown above, we don't even look into it and we just move the values to the end of the newly created unique record for gvkey year and month. But maybe you would prefer to have those extra variables set to missing value when they just contain the same information as the first observation had. In that case, running -duplicates drop- before the code shown above would accomplish that.

    A note on the -sort- command that begins the code. It is unclear from your post whether or not the current ordering of the multiple observations with common values of gvkey, year, and month is informative or not. In the above code, I assume that it is informative, and the information it provides is preserved in the variable names created: NUMEST2 comes from the second observation, NUMEST1 from the first, etc. If, in fact, that information is not needed, you can remove -stable- from the -sort- command.

    Added: As an aside, it is unlikely you will be able to accomplish panel-data analysis when your date is spread out over separate year and month variables. You will probably need to combine them into a single monthly-date variable using the -monthly()- function.

    Comment


    • #3
      Originally posted by Clyde Schechter View Post
      Code:
      sort gvkey year month, stable
      by gvkey year month: gen byte seq = _n
      reshape wide NUMEST NUMUP CURCODE, i(gvkey year month) j(seq)
      It's not clear to me what you want to do if you have two or more observations of the same gvkey, year, and month, but they agree on their values of NUMEST, NUMUP, and CURCODE. In the code shown above, we don't even look into it and we just move the values to the end of the newly created unique record for gvkey year and month. But maybe you would prefer to have those extra variables set to missing value when they just contain the same information as the first observation had. In that case, running -duplicates drop- before the code shown above would accomplish that.

      A note on the -sort- command that begins the code. It is unclear from your post whether or not the current ordering of the multiple observations with common values of gvkey, year, and month is informative or not. In the above code, I assume that it is informative, and the information it provides is preserved in the variable names created: NUMEST2 comes from the second observation, NUMEST1 from the first, etc. If, in fact, that information is not needed, you can remove -stable- from the -sort- command.

      Added: As an aside, it is unlikely you will be able to accomplish panel-data analysis when your date is spread out over separate year and month variables. You will probably need to combine them into a single monthly-date variable using the -monthly()- function.
      Hello Mr Schechter,

      thank you very much for your time and input!

      I have just noticed the problem you describe regarding the same values for NUMEST, NUMUP and CURCODE and have removed these duplicates.

      The code worked great, thanks for that!

      Regarding the panel data analysis, you have a very good point. I am not yet 100% sure how I will tackle this problem.

      Thank you again and have a nice day! :-)

      Comment

      Working...
      X