Duplicates histogram

Sergiy Radyakin

Join Date: Apr 2014
Posts: 1867

Duplicates histogram

19 Apr 2017, 09:27

Dear All,

I am looking for a fast and [ideally] a one-command solution to building duplicates histogram. Basically, I want a bar-chart version of the duplicates report id output, which may look, for example, like this:

Code:

. duplicates report Id

Duplicates in terms of Id

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |          153             0
        2 |          658           329
        3 |         1737          1158
        4 |         2664          1998
        5 |         2490          1992
        6 |         1164           970
        7 |          588           504
        8 |          232           203
        9 |          144           128
       10 |          100            90
       11 |           33            30
       12 |           24            22
       13 |           13            12
--------------------------------------

So I want a histogram of the observations column over the copies column.

Click image for larger version

Name: hist.png
Views: 1
Size: 24.5 KB
ID: 1384421

This can be done with collapse (see below), but I don't want to destroy (or sort) the dataset for performance reasons, so looking for the fastest way to achieve this.

Code:

preserve

tempvar one cop
generate `one'=1
collapse (count) `cop'=`one', by(Id)
label variable `cop' "Copies"
histogram `cop', d freq

restore

Thanks in advance for all the advice,
Sergiy

Tags: duplicates, histogram, optimization

Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#2

19 Apr 2017, 10:57

How is the histogram related to the output of duplicates? The numbers are not the same.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#3

19 Apr 2017, 11:00

I'd clone duplicates and then change the code if I wanted this. Then you have duplicate duplicates programs (well, almost).
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#4

19 Apr 2017, 15:19

Dear Friedrich,

thank you very much for looking into this. The histogram is plotting almost the same - it plots the number of groups of copies, rather than number of raw observations.
To put it into a practical example, if I had a person level data, with household identifiers, then I can ask questions:
1) how many persons live in the households of size = 1,2,3,4,etc?
2) how many households are of size =1,2,3,4
The difference is whether each item of a group (person of a household) is counted separately, or only the group is contributing. Both make sense and both are interesting.

You are right the histogram is answering question 2 (which is what I actually wanted), and the duplicates report answers 1. However, since there is a simple mapping between the two, I don't really care which one I can obtain. The mapping is dividing in the output of duplicates report, the column observations by the corresponding value in the column copies. So that the max on the histogram is 2664/4=666.

The other case can be plotted with:

Code:

tempvar dup duplicates tag Id, generate(`dup') label variable `dup' "Extra copies" tabulate `dup' histogram `dup', d freq

which relies on duplicates command (still need to add 1 to every point on the x-axis to be exact), which afaik uses sorting inside, and which I wanted to avoid. Not a big deal though.

Dear Nick, thank you for your advice as well. I can certainly do that if no canned alternative exists. Duplicates command is super-useful as it is now, but having it save output matrix to r() for programmatic processing and perhaps graphing the redundancy on a chart could be some ideas for its future development.

Thank you, Sergiy
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#5

19 Apr 2017, 18:16

I'm not a big fan of the duplicates command. You just group observations; the number of duplicates is simply the number of observations in the group. No need to collapse to generate the histogram:

Code:

clear set seed 123312 set obs 3000 gen Id = _n expand int(rgamma(9,.5)) duplicates report Id bysort Id: gen N = _N by Id: gen tag = _n == 1 label variable N "Copies" tab N tag histogram N if tag, d freq
Comment

Announcement

Duplicates histogram

Comment

Comment

Comment

Comment