  • #16
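    Code:
    * Load the dataset first, then:
    preserve
    * Make sure there are at least 300 observations to hold the n(300) grid points
    cap set obs 300
    * gen(x d) stores the evaluation points (x) and the density estimates (d)
    kdensity log_output if year == 2018, gen(x1 d1) n(300)
    kdensity log_output if year == 2020, gen(x2 d2) n(300)
    keep x1 d1 x2 d2
    export excel using "/Users/log_output_2018_2020.xlsx", sheet("2018_2020")
    restore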



    • #17
      Andrew Musau Thank you, it worked well!



      • #18
        At the risk of coming late to the party, I'd like to put in a pitch for using a histogram in this situation.

        This code generates simulated data that, I hope, come close to matching the structure of Anne-Claire Jo's confidential data as she describes them in post #11.

        Code:
        *    The fake data are for 100 firms each of which is observed in two years, 2018 and 2020 
        clear
        graph drop _all
        set obs 100
        set seed 87654321 
        gen firm_number = _n
        bysort firm_number: gen log_output2018 = rnormal(1+runiform(), 2 +runiform())
        bysort firm_number: gen log_output2020 = rnormal(2+runiform(), 1.5 +runiform())
        reshape long log_output, i(firm_number) j(year)
        
        describe 
        summarize
        
        save sim_data, replace
        
        *    Before using kdensity, with all of its many assumptions, 
        *    one might want to start with the less sophisticated histogram.  
        *    By making the width of the bars smaller, you approximate a density function,
        *    but, unlike with -kdensity-, there is no interpolation.  
        *    While some features of a histogram are arbitrary (bin width, etc.), 
        *    the frequencies in the bins are real, not estimated.
        
        hist_overlay log_output, over(year)  ///
            xtitle(log_output) frequency addlabels ///
            title("Overlaid frequency distributions" ///
                "constructed by -hist_overlay, over(year)-")  ///
            saving(hist, replace)
        return list
        matlist r(bindata)

        Now to produce the -kdensity- graphs and extract their data, as Anne-Claire Jo requested.

        Andrew Musau pointed out that -twoway (kdensity)- has no -gen()- option. His suggested solution in posts #9 and #16 was to run the standalone -kdensity, gen()- command twice, like this.

        Code:
        set obs 300
        kdensity log_output if year==2018, n(300) gen(kdlog_Q_2018 kd2018)
        kdensity log_output if year==2020, n(300) gen(kdlog_Q_2020 kd2020)
        
        *    To demonstrate that the generated variables do indeed 
        *    represent the interpolated y-values that -kdensity- uses
        *    to construct its kernel density estimates, one can display 
        *    the generated density estimates in a scatter plot like this.
        
        twoway  ///
            (scatter kd2018  kdlog_Q_2018)  ///
            (scatter kd2020  kdlog_Q_2020), ///
            xtitle(log_output) ///
            ytitle(Kernel density estimates)  ///
            title("Kernel densities" ///
                "constructed by -kd if 2018-, -kd if 2020-")  ///
            legend(order(1 "2018" 2 "2020"))  ///
            saving(scat2kd, replace)
        
        des 
        sum 
        
        *    And then compare it to the kernel densities produced 
        *    by the command structure using -twoway- that Ms. Jo first applied.
        
        twoway  ///
            (kdensity log_output if year == 2018)  ///
            (kdensity log_output if year == 2020),  ///
            xtitle(log_output)  ///
            ytitle(Kernel density estimates)  ///
            title("Kernel densities" ///
                "constructed by -tw(kd if 2018) (kd if 2020)-")  ///
            legend(order(1 "2018" 2 "2020"))  ///
            saving(twkd, replace)

        Here are the three graphs:

        [Image: hist.png]

        [Image: scat2kd.png]

        [Image: twkd.png]

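        A side note on the two -kdensity, gen()- calls above: each call picks its own 300 evaluation points, so kdlog_Q_2018 and kdlog_Q_2020 generally hold different x-values. If the exported estimates should line up row by row, one option is to build a single grid and pass it to -kdensity- through at(). The following is only a sketch on simulated stand-in data; the variable names grid, d_pooled, d2018 and d2020 are made up for illustration.

```stata
* Sketch only: simulated data stand in for the real log_output.
clear
set seed 87654321
set obs 200
gen year = cond(_n <= 100, 2018, 2020)
gen log_output = cond(year == 2018, rnormal(1.5, 2.5), rnormal(2.5, 2))

* Make room for a 300-point grid, then let -kdensity- build that grid
* from the pooled data. The pooled density (d_pooled) is not used later.
set obs 300
kdensity log_output, nograph n(300) generate(grid d_pooled)

* Evaluate each year's density at the same grid points.
* With at(), generate() needs only the name for the density variable.
kdensity log_output if year == 2018, nograph at(grid) generate(d2018)
kdensity log_output if year == 2020, nograph at(grid) generate(d2020)

* Both densities share one x variable, so a single -line- call overlays them.
line d2018 d2020 grid, sort xtitle(log_output) ytitle(Kernel density estimates)
```

        Each call still chooses its own default bandwidth from its own subsample; only the evaluation points are shared.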
        Eyeballing the graphs -scat2kd- and -twkd- suggests that the two sets of estimates are very similar, which is reassuring. But regardless of which Stata command one uses to construct them, kernel density estimates can be sensitive to somewhat arbitrary choices of the kernel function and its bandwidth.
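        To get a feel for that sensitivity, one can overlay estimates from the same data at several bandwidths. This is a minimal sketch on simulated data; the bandwidths 0.3, 1 and 2 are arbitrary values chosen only for illustration.

```stata
* Sketch only: one simulated sample, three bandwidths.
clear
set seed 87654321
set obs 100
gen log_output = rnormal(1.5, 2.5)

* Same data and same default kernel; only bwidth() changes.
twoway ///
    (kdensity log_output, bwidth(0.3)) ///
    (kdensity log_output, bwidth(1))   ///
    (kdensity log_output, bwidth(2)),  ///
    xtitle(log_output) ytitle(Kernel density estimates) ///
    legend(order(1 "bwidth(0.3)" 2 "bwidth(1)" 3 "bwidth(2)"))

* The bandwidth -kdensity- would pick by default is returned in r(bwidth).
kdensity log_output, nograph
display "default bandwidth = " r(bwidth)
```

        A small bandwidth tends to give a bumpy curve that chases individual observations, while a large one smooths away real features. The kernel() option can be varied in the same way.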

        In a sense, a histogram is a "non-parametric" depiction of a frequency distribution. As a prelude to estimating a kernel density, the histogram provides concrete frequency counts and an associated visualization that some observers will find a useful complement to the estimated density. The program -hist_overlay- exploits Stata's ability (available since Stata 15) to control the opacity of the color in the histogram bars. So in the histogram above the 2018 data are in light blue, the 2020 data are in light red, and where the two sets of bars overlap the translucent colors combine to produce a purple tint.
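        For readers who want to see the opacity idea without installing anything, a similar picture can be approximated with built-in commands. This is not how -hist_overlay- is implemented, just a sketch on simulated data; the bin width of 1 and the 30% opacity are arbitrary choices, and the color%# syntax requires Stata 15 or later.

```stata
* Sketch only: overlaid translucent histograms with official Stata commands.
clear
set seed 87654321
set obs 200
gen year = cond(_n <= 100, 2018, 2020)
gen log_output = cond(year == 2018, rnormal(1.5, 2.5), rnormal(2.5, 2))

* Use one bin origin and one bin width for both years so the bars line up.
summarize log_output, meanonly
local start = floor(r(min))

twoway ///
    (histogram log_output if year == 2018, frequency start(`start') width(1) color(blue%30)) ///
    (histogram log_output if year == 2020, frequency start(`start') width(1) color(red%30)), ///
    xtitle(log_output) legend(order(1 "2018" 2 "2020"))
```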

        -hist_overlay- is limited to comparing two distributions, because overlaying the colors of more than two histograms would be ugly and unreadable. In this respect, the kernel density visualization is superior, since one could overlay multiple densities without confusion.
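        As an illustration of that last point, here is a sketch that overlays one kernel density per year for however many years the data contain. The data are simulated, with a hypothetical third year (2022) added purely to show more than two curves.

```stata
* Sketch only: three simulated years, one density curve per year.
clear
set seed 12345
set obs 300
gen year = 2018 + 2*mod(_n, 3)                     // 2018, 2020, 2022
gen log_output = rnormal(1 + (year - 2018)/4, 2)

* Build the list of plots and the legend labels from the years in the data.
levelsof year, local(years)
local plots
local labels
local i = 0
foreach y of local years {
    local ++i
    local plots `plots' (kdensity log_output if year == `y')
    local labels `labels' `i' "`y'"
}

twoway `plots', legend(order(`labels')) ///
    xtitle(log_output) ytitle(Kernel density estimates)
```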

        The program -hist_overlay- can be installed from here:

        Code:
        view net describe hist_overlay, from("http://digital.cgdev.org/doc/stata/MO/Misc")
