No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Duplicates histogram

    Dear All,

    I am looking for a fast and [ideally] a one-command solution to building duplicates histogram. Basically, I want a bar-chart version of the duplicates report id output, which may look, for example, like this:

    . duplicates report Id
    Duplicates in terms of Id
       copies | observations       surplus
            1 |          153             0
            2 |          658           329
            3 |         1737          1158
            4 |         2664          1998
            5 |         2490          1992
            6 |         1164           970
            7 |          588           504
            8 |          232           203
            9 |          144           128
           10 |          100            90
           11 |           33            30
           12 |           24            22
           13 |           13            12
    So I want a histogram of the observations column over the copies column.
    Click image for larger version

Name:	hist.png
Views:	1
Size:	24.5 KB
ID:	1384421

    This can be done with collapse (see below), but I don't want to destroy (or sort) the dataset for performance reasons, so looking for the fastest way to achieve this.

    tempvar one cop
    generate `one'=1
    collapse (count) `cop'=`one', by(Id)
    label variable `cop' "Copies"
    histogram `cop', d freq
    Thanks in advance for all the advice,

  • #2
    How is the histogram related to the output of duplicates? The numbers are not the same.


    • #3
      I'd clone duplicates and then change the code if I wanted this. Then you have duplicate duplicates programs (well, almost).


      • #4
        Dear Friedrich,

        thank you very much for looking into this. The histogram is plotting almost the same - it plots the number of groups of copies, rather than number of raw observations.
        To put it into a practical example, if I had a person level data, with household identifiers, then I can ask questions:
        1) how many persons live in the households of size = 1,2,3,4,etc?
        2) how many households are of size =1,2,3,4
        The difference is whether each item of a group (person of a household) is counted separately, or only the group is contributing. Both make sense and both are interesting.

        You are right the histogram is answering question 2 (which is what I actually wanted), and the duplicates report answers 1. However, since there is a simple mapping between the two, I don't really care which one I can obtain. The mapping is dividing in the output of duplicates report, the column observations by the corresponding value in the column copies. So that the max on the histogram is 2664/4=666.

        The other case can be plotted with:
        tempvar dup
        duplicates tag Id, generate(`dup')
        label variable `dup' "Extra copies"
        tabulate `dup'
        histogram `dup', d freq
        which relies on duplicates command (still need to add 1 to every point on the x-axis to be exact), which afaik uses sorting inside, and which I wanted to avoid. Not a big deal though.

        Dear Nick, thank you for your advice as well. I can certainly do that if no canned alternative exists. Duplicates command is super-useful as it is now, but having it save output matrix to r() for programmatic processing and perhaps graphing the redundancy on a chart could be some ideas for its future development.

        Thank you, Sergiy


        • #5
          I'm not a big fan of the duplicates command. You just group observations; the number of duplicates is simply the number of observations in the group. No need to collapse to generate the histogram:

          set seed 123312
          set obs 3000
          gen Id = _n
          expand int(rgamma(9,.5))
          duplicates report Id
          bysort Id: gen N = _N
          by Id: gen tag = _n == 1
          label variable N "Copies"
          tab N tag
          histogram N if tag, d freq