Fastest/most elegant way to match data for analysis from another file?

William Jakeson

Join Date: Oct 2016

Posts: 9
#1

Fastest/most elegant way to match data for analysis from another file?

28 Oct 2016, 18:35

Im doing some Stata practice on 3 datasets. One data file has transactions with dates, barcodes, store code, etc. Another sheet has what each barcode means and what category and subcategories it is. A third has what each store code means and the attributes for each store. So the analysis is basic and along the lines of at category/sub category is sold the most/least. What is the highest selling categories every month.

So my thought was to load all files with seperate variables and set a for loop identifying all bar codes with the item categories and subcategories and assign a category/sub to each transaction and store attribute. But then I realized this strategy would involve 2 for loops searching through the 1k store list and 200k item list for each transaction in the 1M transaction list.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#2

28 Oct 2016, 18:52

This sounds like a task for the -merge- command, one of the really fundamental data management commands in Stata. You will need to learn this command to be an effective Stata user, so I urge you to set to the task now. The -merge- chapter in the [D] manual has good worked examples that are well worth the time spent studying them.

That you used the word "sheet" to describe a data set suggests to me that either you have your data in spreadsheets, or you are thinking about Stata data sets as if they were spreadsheets. If the former, you need to import them to Stata before you do anything else. If the latter, break the habit. Spreadsheet-based instincts will frequently lead you far astray when doing data management or analysis in Stata. Even when they don't lead you astray, they almost never prove helpful.
Comment
William Jakeson

Join Date: Oct 2016

Posts: 9
#3

28 Oct 2016, 18:57

Originally posted by Clyde Schechter View Post

This sounds like a task for the -merge- command, one of the really fundamental data management commands in Stata. You will need to learn this command to be an effective Stata user, so I urge you to set to the task now. The -merge- chapter in the [D] manual has good worked examples that are well worth the time spent studying them.

That you used the word "sheet" to describe a data set suggests to me that either you have your data in spreadsheets, or you are thinking about Stata data sets as if they were spreadsheets. If the former, you need to import them to Stata before you do anything else. If the latter, break the habit. Spreadsheet-based instincts will frequently lead you far astray when doing data management or analysis in Stata. Even when they don't lead you astray, they almost never prove helpful.

I didn't think merge would work as the first set has transaction numbers and the other ones have category identifiers. e.g. first data row on one sheet might be one transaction with barcode, amount, store number, quantity and the first line on the second set might be item barcode category sub category and the first line in third would be store number, state, avg county income, etc.

I'm not sure I could merge all lines, but I'll definitely learn more about merge.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#4

28 Oct 2016, 19:02

I think you should show examples of each data set so we can be sure we're talking about the same thing. Do that from Stata, not a spreadsheet, and do it by using the -dataex- command. (You can get the -dataex- command by running -ssc install dataex-. The instructions for using it are in -help dataex-.) But from what you described it sounds like the first merge would be an m:1 merge using barcode as the key. And then to bring in the store descriptors it would be another m:1 merge using store as the key. But maybe your data aren't properly set up for that. If that's the case, there's no way to help without seeing an example of the data.
Comment
William Jakeson

Join Date: Oct 2016

Posts: 9
#5

28 Oct 2016, 19:17

Well the first csv would be
Dollar amount, quantity, item id, store id, date

Second csv
item id, product name, category, subcategory, subsub category

Third csv
store id, state, county income
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#6

28 Oct 2016, 19:37

Originally posted by William Jakeson View Post

. . . the analysis is basic and along the lines of at category/sub category is sold the most/least. What is the highest selling categories every month.

You might try something like the following:

1. use "One data file"

2. generate int month = monthly(substr(string(transaction_date, "%tdCY-N-D"), 1, 7), "YM") // After this you can -format month %tdCY-N- if desired

3. contract month barcode, freq(sale_tally) // This is where the one-million transaction list gets reduced to something more manageable

4. merge m:1 barcode using "Another sheet", assert(match using) keep(match) // There's a bug in Stata for these two options used together, so do them in tandem (do -keep if _merge == 3- afterward)

5. preserve

6. contract month category [fweight=sales_tally], freq(items_sold)

7. sort month (items_sold): list month barcode if items_sold == items_sold[_N] // highest-selling category(ies) every month

8. by month: list month barcode if items_sold == items_sold[1] // lowest-selling category(ies) of those categories that had at least one transaction in the month

9. restore

10. contract month subcategory [fweight=sales_tally], freq(items_sold)

11. ditto for 7. and 8. for highest- and lowest-selling subcategories every month

You might be tempted for efficiency's sake to select the highest- or lowest-selling barcodes before the -merge-, but you don't know whether you'd leave behind the ones that, in the aggregate, would tally in the categories or subcategoreis with the monthly sales volume that you're trying to identify.

I don't know what "etc." means in "One data file has transactions with dates, barcodes, store code, etc.", but if you've got a money amount for each transaction, then you'll need to account for it. What's above is for number of items of a given barcode sold, and not, say, dollars' or euros' worth of a given barcode sold per month.

I don't know what you're doing with store attributes, but you could use the same technique (-contract store_attribute- or whatever) before the -merge- to make things more manageable and to avoid looping over observations.
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#7

28 Oct 2016, 19:45

The later posts weren't yet visible where I am, so some of my suggestions would need to be updated accordingly.
Comment
William Jakeson

Join Date: Oct 2016

Posts: 9
#8

30 Oct 2016, 09:54

Originally posted by Joseph Coveney View Post

The later posts weren't yet visible where I am, so some of my suggestions would need to be updated accordingly.

I understand the code and it works well, but the contract command likely will not work as each individual transaction also has a quantity. So each row is one product, but can have negative or positive quantity.

Also, why contract then list if you could simply use tab?
Comment
William Jakeson

Join Date: Oct 2016

Posts: 9
#9

30 Oct 2016, 11:30

Thought of something like
sort category quantitypertransaction
by category: egen totalitems=total(quantitypertransaction)

That seem to work

Last edited by William Jakeson; 30 Oct 2016, 11:40.
Comment

Announcement

Fastest/most elegant way to match data for analysis from another file?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment