Twoway graph serset

Andrew Musau

Join Date: Oct 2014
Posts: 10188

#16

26 Feb 2021, 06:30

Code:

*LOAD DATASET
preserve
cap set obs 300
kdensity log_output if year == 2018, gen(x1 d1) n(300)
kdensity log_output if year == 2020, gen(x2 d2) n(300)
keep x1 d1 x2 d2
export excel using "/Users/log_output_2018_2020.xlsx", sheet("2018_2020")
restore


Anne-Claire Jo

Join Date: Feb 2021

Posts: 154
#17

26 Feb 2021, 07:12

Andrew Musau, thank you, it worked well!

Mead Over

Join Date: Sep 2014
Posts: 110

#18

16 Jun 2021, 17:31

At the risk of coming late to the party, I'd like to put in a pitch for using a histogram in this situation.

This code generates simulated data that hopefully come close to matching the structure of Anne-Claire Jo's confidential data as she describes them in her post #11.

Code:

*    The fake data are for 100 firms each of which is observed in two years, 2018 and 2020 
clear
graph drop _all
set obs 100
set seed 87654321 
gen firm_number = _n
bysort firm_number: gen log_output2018 = rnormal(1+runiform(), 2 +runiform())
bysort firm_number: gen log_output2020 = rnormal(2+runiform(), 1.5 +runiform())
reshape long log_output, i(firm_number) j(year)

describe 
summarize

save sim_data, replace

*    Before using kdensity, with all of its many assumptions, 
*    one might want to start with the less sophisticated histogram.  
*    By making the width of the bars smaller, you approximate a density function,
*    but, unlike with -kdensity-, there is no interpolation.  
*    While some features of a histogram are arbitrary (bin width, etc.), 
*    the frequencies in the bins are real, not estimated.

hist_overlay log_output, over(year)  ///
    xtitle(log_output) frequency addlabels ///
    title("Overlaid frequency distributions" ///
        "constructed by -hist_overlay, over(year)-")  ///
    saving(hist, replace)
return list
matlist r(bindata)

Now to produce the -kdensity- graphs and extract their data as Anne-Claire Jo requested.

Andrew Musau pointed out that the -tw (kdensity)- command has no -gen- option. His suggested solution in posts #9 & #16 was to run the standalone -kdensity, gen()- command twice, like this.

Code:

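*    Expand the dataset to 300 observations so that the 300 grid points 
*    requested by n(300) fit into the generated variables.
set obs 300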
kdensity log_output if year==2018, n(300) gen(kdlog_Q_2018 kd2018)
kdensity log_output if year==2020, n(300) gen(kdlog_Q_2020 kd2020)

*    To demonstrate that the generated variables do indeed 
*    represent the interpolated y-values that -kdensity- uses
*    to construct its kernel density estimates, one can display 
*    the generated density estimates in a scatter plot like this.

twoway  ///
    (scatter kd2018  kdlog_Q_2018)  ///
    (scatter kd2020  kdlog_Q_2020), ///
    xtitle(log_output) ///
    ytitle(Kernel density estimates)  ///
    title("Kernel densities" ///
        "constructed by -kd if 2018-, -kd if 2020-")  ///
    legend(order(1 "2018" 2 "2020"))  ///
    saving(scat2kd, replace)

des 
sum 

*    And then compare it to the kernel densities produced 
*    by the command structure using -twoway- that Ms. Jo first applied.

twoway  ///
    (kdensity log_output if year == 2018)  ///
    (kdensity log_output if year == 2020),  ///
    xtitle(log_output)  ///
    ytitle(Kernel density estimates)  ///
    title("Kernel densities" ///
        "constructed by -tw(kd if 2018) (kd if 2020)-")  ///
    legend(order(1 "2018" 2 "2020"))  ///
    saving(twkd, replace)

Here are the three graphs:

[Image: hist.png]

[Image: scat2kd.png]

[Image: twkd.png]

Eyeballing the graphs -scat2kd- and -twkd- suggests that the estimates are very similar, which is reassuring. But regardless of which Stata command one uses to construct the kernel density estimates, such estimates can be sensitive to arbitrary assumptions about the appropriate choice of a kernel and its associated caliper (i.e. "bandwidth").
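To see that sensitivity directly, one can re-estimate the same density under several bandwidths and kernels. The following is only a minimal sketch, not part of the analysis above: the data are a simulated stand-in for the 2018 values, and the bandwidths 0.3, 0.8 and 2 are arbitrary values picked purely for illustration.

```stata
*    Simulated stand-in for the 2018 values (an assumption, not the real data)
clear
set seed 2021
set obs 100
gen log_output = rnormal(1.5, 2.5)

*    Same data, default kernel, three arbitrary bandwidths
twoway  ///
    (kdensity log_output, bwidth(0.3))  ///
    (kdensity log_output, bwidth(0.8))  ///
    (kdensity log_output, bwidth(2)),   ///
    xtitle(log_output)  ///
    ytitle(Kernel density estimates)  ///
    legend(order(1 "bwidth(0.3)" 2 "bwidth(0.8)" 3 "bwidth(2)"))

*    Same data, default bandwidth, three different kernels
twoway  ///
    (kdensity log_output, kernel(epanechnikov))  ///
    (kdensity log_output, kernel(gaussian))      ///
    (kdensity log_output, kernel(rectangle)),    ///
    xtitle(log_output)  ///
    ytitle(Kernel density estimates)  ///
    legend(order(1 "Epanechnikov" 2 "Gaussian" 3 "Rectangle"))

*    Keep the two extreme bandwidths as variables for a numerical comparison
kdensity log_output, bwidth(0.3) n(100) gen(x_narrow d_narrow) nograph
kdensity log_output, bwidth(2)   n(100) gen(x_wide d_wide)     nograph
summarize d_narrow d_wide
```

A small bandwidth tends to give a bumpy curve and a large one a flatter, smoother curve, so the picture a reader takes away can depend heavily on this one choice.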

In a sense, a histogram is a "non-parametric" depiction of a frequency distribution. As a prelude to estimating a kernel density, the histogram provides concrete frequency counts and an associated visualization that some observers will find a useful complement to the estimated density. The program -hist_overlay- exploits Stata's new ability to control the opacity of the color in the histogram bars. So in the histogram above the 2018 data are in light blue, the 2020 data are in light red, and the overlapping segments combine the two translucent colors to produce a purple tint.
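For readers who want to see the translucency idea without installing anything, here is a rough sketch using only the built-in -twoway histogram- and the color%opacity syntax available since Stata 15. It is not the code inside -hist_overlay-, whose internals may differ. The simulated data, the bin width of 1 and the 30% opacity are all assumptions chosen for illustration.

```stata
*    Simulated stand-in for the two years (an assumption, not the real data)
clear
set seed 2021
set obs 200
gen year = cond(_n <= 100, 2018, 2020)
gen log_output = cond(year == 2018, rnormal(1.5, 2.5), rnormal(2.5, 2))

*    Use a common start() and width() so that the two sets of bins line up
summarize log_output, meanonly
local start = floor(r(min))

twoway  ///
    (histogram log_output if year == 2018, frequency  ///
        start(`start') width(1) color(blue%30))  ///
    (histogram log_output if year == 2020, frequency  ///
        start(`start') width(1) color(red%30)),  ///
    xtitle(log_output)  ///
    legend(order(1 "2018" 2 "2020"))
```

Without the shared start() and width(), each histogram would choose its own bins and the bars of the two years would not line up.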

-hist_overlay- is limited to comparing two distributions, because overlaying the colors of more than two histograms would be ugly and unreadable. In this respect, the kernel density visualization is superior, since one could overlay multiple densities without confusion.
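As a small illustration of that last point, the sketch below overlays three kernel densities in one graph. The third year, 2022, is hypothetical and the data are simulated solely for this example.

```stata
*    Simulated data for three years; 2022 is hypothetical
clear
set seed 2021
set obs 300
gen year = 2018 + 2*mod(_n, 3)
gen log_output = rnormal(1 + 0.5*(year - 2018), 2)
tabulate year

*    One -kdensity- plot per year, all on the same axes
twoway  ///
    (kdensity log_output if year == 2018)  ///
    (kdensity log_output if year == 2020)  ///
    (kdensity log_output if year == 2022), ///
    xtitle(log_output)  ///
    ytitle(Kernel density estimates)  ///
    legend(order(1 "2018" 2 "2020" 3 "2022"))
```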

The program -hist_overlay- can be installed from here:

Code:

view net describe hist_overlay, from("http://digital.cgdev.org/doc/stata/MO/Misc")
