need to drop duplicates depending on how many duplicates there are per observation

Sarah Malott

Join Date: Feb 2019
Posts: 2

need to drop duplicates depending on how many duplicates there are per observation

21 Feb 2019, 15:02

I am trying to reshape my data from wide to long, however I need to drop duplicates first. Here is an example of my data of employment by U.S. County:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int year byte qtr str1 disclosure_code str41 area_title str43 agglvl_title long(qtrly_estabs_count month1 month2 month3) float countyid str5 time float dup
2009 1 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 31 1868 1721 1641 1 "20091" 0
2009 2 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 31 1531 1530 1496 1 "20092" 0
2009 3 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 32 1495 1480 1433 1 "20093" 0
2009 4 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 33 1445 1444 1447 1 "20094" 0
2010 1 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 31 1428 1457 1516 1 "20101" 0
2010 2 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 31 1540 1570 1572 1 "20102" 0
2010 3 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 30 1552 1578 1590 1 "20103" 0
2010 4 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 30 1601 1594 1584 1 "20104" 0
2011 1 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 30 1525 1567 1576 1 "20111" 0
2011 2 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 31 1566 1581 1580 1 "20112" 0
2011 3 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 31 1587 1579 1639 1 "20113" 0
2011 4 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 30 1658 1668 1636 1 "20114" 0
2012 1 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 30 1649 1703 1722 1 "20121" 0
2012 2 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 30 1719 1733 1712 1 "20122" 0
2012 3 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 30 1723 1727 1734 1 "20123" 0
2012 4 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 30 1701 1592 1547 1 "20124" 0
2013 1 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 29 1538 1561 1563 1 "20131" 0
2013 2 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 29 1572 1577 1584 1 "20132" 0
2013 3 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 29 1582 1592 1602 1 "20133" 0
2013 4 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 30 1657 1655 1637 1 "20134" 0
2014 1 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 31 1740 1741 1760 1 "20141" 0
2014 2 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 32 1782 1798 1812 1 "20142" 0
2014 3 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 32 1806 1813 1848 1 "20143" 0
2014 4 "" "Abbeville County, South Carolina" "County, NAICS Sector -- by ownership sector" 32 1789 1789 1783 1 "20144" 0
end

This is what I used to generate the dup variable: quietly by countyid time: gen dup = cond(_N==1,0,_n)

However, the issue is that there are duplicates for the same year, quarter, and county with one dupe listing zeros for employment and the other version listing the actual employment. I sorted the data with this: sort countyid time month1 month2 month3 so that the dupe with zeros would appear before the dupe with employment, giving it a dup value of 1. My intention was to do drop if dup == 1 however i discovered that some of the duplicates come in triplets meaning that there are two lines of zeros, with dup values of 1 and 2, and then the actual employment has a dup value of 3. I thought I could try to use gsort to instead sort the duplicates in descending order so I can do drop if dup >1 but I can't seem to get it to work.

I would appreciate any help!

Tags: None

William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

21 Feb 2019, 15:53

Welcome to Statalist.

Since none of your example observations are duplicated and none have zeros for "employment" - which I am guessing are the variables named month1 month2 month3 - this code is a guess, but perhaps will do what you want. It ignores the dup and time variables you created.

Code:

by countyid year quarter (month1 month2 month3): keep if _n==_N

which will keep the final observation of each group of "duplicate" observations (with the same values for countyid year and quarter) where the groups are sorted by ascending values of month1 month2 month3 so all the observations with three zeroes will appear before any observation with a nonzero in at least one of the month variables.

At this point I will say that if your data are messier than you have led us to expect - for example a pair of observations where one has 0 for month1 and month2 and some nonzero value for month3, and the other has nonzero values for month1 and month2 and 0 for month three - you would benefit from reshaping the data to long first, and then doing the deduplication on a monthlyu basis. You will say "but I cannot

Code:

reshape long month, i(countyid year quarter) j(monthnum)

because countyid year quarter have duplicates" to which I reply that for purposes of this reshape, you can just create a separate id variable

Code:

generate fakeid = _n reshape long month, i(fakeid) j(monthnum) rename month employment drop fakeid

and you'll get what you need - data with a single month per observation, and you can then deduplicate by month.

Code:

by countyid year quarter monthnum (employment): keep if _n==_N

Now, about your dates. I assume once you get the data reshaped to long you will eventually create a Stata Internal Format monthly date variable using a command like

Code:

generate date = ym(year, 3*(quarter-1)+monthnum) format date %tm

Having separate values for year and month may be helpful in some circumstances, but having a single numerical value that encompasses them both is what you need to create panel data, and having a numeric version like your quarterly strings - including for example 200912 for December 2009 and 201001 for January 2010 - is not useful for panel data, because 201001 - the next month after 200912 - differs from 200912 by 89 rather than by 1.

Stata's "date and time" variables are complicated and there is a lot to learn. If you have not already read the very detailed Chapter 24 (Working with dates and times) of the Stata User's Guide PDF, do so now. If you have, it's time for a refresher. After that, the help datetime documentation will usually be enough to point the way. You can't remember everything; even the most experienced users end up referring to the help datetime documentation or back to the manual for details. But at least you will get a good understanding of the basics and the underlying principles. An investment of time that will be amply repaid.

All Stata manuals are included as PDFs in the Stata installation (since version 11) and are accessible from within Stata - for example, through the PDF Documentation section of Stata's Help menu.
1 like
Comment

Sergiy Radyakin

Join Date: Apr 2014
Posts: 1867

21 Feb 2019, 15:57

Hello Sarah,

thank you for taking time and effort to put the example of your data in the message. Unfortunately I don't think this short piece has any duplicates, so it's not very illustrative. Perhaps restricting the example to a county where you do have duplicates would be more illustrative.

If you are sure that you have no missings I feel something like this is what can help:

Code:

clear

input str20 city int quarter double unemp
    "Washington" 1 3.2
    "Washington" 2 3.9
    "Washington" 3 5.2
    "Washington" 4 4.7
    "New York" 1 2.1
    "New York" 2 0.0
    "New York" 2 2.9
    "New York" 3 4.1
    "New York" 3 0.0
    "New York" 4 1.9
    "Anchorage" 1 0.0
    "Anchorage" 1 0.0
    "Anchorage" 2 0.0
    "Anchorage" 3 0.1
    "Anchorage" 4 0.1    
end

collapse (max) unemp, by(city quarter)
format unemp %6.1f
list, sepby(city)

Results in:

Code:

     +------------------------------+
     |       city   quarter   unemp |
     |------------------------------|
  1. |  Anchorage         1     0.0 |
  2. |  Anchorage         2     0.0 |
  3. |  Anchorage         3     0.1 |
  4. |  Anchorage         4     0.1 |
     |------------------------------|
  5. |   New York         1     2.1 |
  6. |   New York         2     2.9 |
  7. |   New York         3     4.1 |
  8. |   New York         4     1.9 |
     |------------------------------|
  9. | Washington         1     3.2 |
 10. | Washington         2     3.9 |
 11. | Washington         3     5.2 |
 12. | Washington         4     4.7 |
     +------------------------------+

Best, Sergiy

Comment

Sarah Malott

Join Date: Feb 2019

Posts: 2
#4

21 Feb 2019, 16:05

Thank you so much to you both for your replies. I appreciate the help!
Comment

Announcement

need to drop duplicates depending on how many duplicates there are per observation

Comment

Comment

Comment