descriptive stats by nested group

Guneet Kaur

Join Date: Aug 2014

Posts: 48
#1

descriptive stats by nested group

18 Aug 2015, 13:57

Attached is the format of my data.(SFQ2.dta)

I have n 'customers' who buy one of the 5 'type of food' products multiple 'times' at a different 'price'. I want to be able to tabulate the following:

What is the count of customers and their average price per customer paid for 5 type of food products? I have added an excel sheet example table of what I want.

Attached Files

SFQ2.dta (1.8 KB, 1 view)

SFQexample table.xlsx (10.1 KB, 1 view)
Tags: None

Sergiy Radyakin

Join Date: Apr 2014
Posts: 1867

18 Aug 2015, 15:51

Guneet, your Excel table is probably wrong for item #4.

Code:

use "http://www.statalist.org/forums/filedata/fetch?id=1306758", clear
destring customer foodtype price, replace

tempfile tmp
preserve
  sort foodtype customer
  by foodtype customer: keep if _n==1
  collapse (count)ncust=customer, by(foodtype)
  save `"`tmp'"'
restore
sort foodtype
merge foodtype using `"`tmp'"', keep(ncust)
drop _merge

rename customer purchase
table foodtype, c(mean price min ncust count purchase)

Produces:

Code:

-------------------------------------------------
 foodtype | mean(price)   min(ncust)  N(purchase)
----------+--------------------------------------
        1 |       17.75            2            4
        2 |        9.75            2            4
        3 |          10            2            4
        4 |   17.666666            2            6
        5 |         9.5            2            4
-------------------------------------------------

I don't understand what variable times is. If this is the time indicator (period) then perhaps you want the table separately for different time periods. If it is multiple purchases, then perhaps you want to use this as weights.

Best, Sergiy Radyakin

Comment

Guneet Kaur

Join Date: Aug 2014

Posts: 48
#3

18 Aug 2015, 15:58

Thank you so much Sergiy!!! Yeah entry for customer 4 was wrong.

Can you please tell me why we had to use tempfile function? And when do we use ` sign instead of ' ?
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#4

18 Aug 2015, 16:09

Originally posted by Guneet Kaur View Post

Thank you so much Sergiy!!! Yeah entry for customer 4 was wrong.

1) Can you please tell me why we had to use tempfile function?
2) And when do we use ` sign instead of ' ?

1) I wanted to reduce data from purchase level (customer-item pair) to item level. Tempfile reserves a temporary filename, which can be used by subsequent save and merge commands. Command tempfile allows programs to work on any computer, regardless of where Stata is installed, where your do file is or where your data is. Tempfiles are erased on completion of your program and signify to the person reading your program that the file being saved will in the end be discarded, it is an intermediate product only. Use of tempfiles is a good practice when you need to save something temporarily to the disk (to overcome the present limitation of Stata of 1 dataset in memory). But I envision the same can be done without the use of tempfiles or merges using observation subscripts. There are probably other ways as well.

2) use of quotes is explained here: http://www.stata.com/help.cgi?quotes

Note that your customer ID variable may remain a string, but in general I would advise to use the most basic types for storage, and your customer ID is all numeric. Hence destring at the beginning.
Comment
Guneet Kaur

Join Date: Aug 2014

Posts: 48
#5

18 Aug 2015, 16:58

Thank you Sergiy. I am getting an error at merge stage

merge dnd_bin using `"`tmp'"', keep(ncust)
(note: you are using old merge syntax; see [D] merge for new syntax)
variable _merge already defined

I tried using new syntax for merge but it does not seem to work. I took 1:1 merge syntax and got below error

merge 1:1 dnd_bin using `"`tmp'"', keep(ncust)
ncust: invalid resulttype
results, specified in options assert() and keep(), are the integers 1 through 5, or master (equivalent to 1),
using (2), match (3), match_update (4), or match_conflict (5). The last two arise only when option update is
specified.
r(198);

Will you be able to help? Thanks!!
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

18 Aug 2015, 19:37

Here's another way of doing this

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str8 customer str4 foodtype str5 price float times
"1" "1" "13" 1 
"1" "1" "14" 2 
"2" "2" "14" 1 
"2" "2" "16" 2 
"3" "3" "2" 1 
"3" "3" "4" 2 
"4" "4" "21" 1 
"4" "4" "22" 2 
"4" "4" "31" 3 
"5" "5" "8" 1 
"5" "5" "9" 2 
"6" "1" "21" 1 
"6" "1" "23" 2 
"7" "2" "4" 1 
"7" "2" "5" 2 
"8" "3" "12" 1 
"8" "3" "22" 2 
"9" "4" "9" 1 
"9" "4" "11" 2 
"9" "4" "12" 3 
"10" "5" "10" 1 
"10" "5" "11" 2 
end

* tag first purchase of foodtype by customer
bysort foodtype customer: gen first = _n == 1

* number of customer per food type
by foodtype: egen ncustomer = total(first)

* average price
bysort foodtype: egen avprice = mean(real(price))

* reduce to one observation per foodtype
by foodtype: keep if _n == 1
keep foodtype avprice ncustomer
list

* to get the table in the format in the spreadsheet
destring foodtype, replace
xpose, clear varname
list

Comment

Guneet Kaur

Join Date: Aug 2014

Posts: 48
#7

18 Aug 2015, 19:53

Thank You sooooo much Robert!! very intuitive...It worked. Can you tell me a way in which I do not loose my original database and still get outputs in spreadsheet?
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

19 Aug 2015, 08:21

I'm not sure I understand what you want here. You do not loose the original dataset, you can always reload it from where you started from. If you are asking for a way to combine two datasets the way you can have multiple spreadsheets in Excel, then you can't do that in Stata as there is only one dataset in memory at a time. This does not stop you from combining the information in one dataset however. If your original dataset is called "mydata.dta", then you can do the following, using an unmatched merge:

Code:

use "mydata.dta", clear

* tag first purchase of foodtype by customer
bysort foodtype customer: gen first = _n == 1

* number of customer per food type
by foodtype: egen ncustomer = total(first)

* average price
bysort foodtype: egen avprice = mean(real(price))

* reduce to one observation per foodtype
by foodtype: keep if _n == 1
keep foodtype avprice ncustomer
list

* to get the table in the format in the spreadsheet
destring foodtype, replace
xpose, clear varname
list

* combine table with original data (save under new name!)
merge 1:1 _n using "mydata.dta", nogen
save "mydata+table.dta", replace

Announcement