Can I convert variable frequency to a new variable that represents frequency range?

Johny Daniel

Join Date: May 2017

Posts: 11
#1

09 May 2017, 09:58

I am using Stata 14.2

I have a categorical variable district_name and another categorical variable store_name in each district. I encoded district to get the number of stores in each district (encode dist_name, gen(dist_n)). Now my new variable looks like this:

Dist_n    Freq (no. of store names)
DistA     10
DistB     1200
DistC     450
DistD     80
DistE     690

Is it possible for me to generate a new variable that represents a range of the frequency of stores in districts? For example:

District_size        N
0 to 99 stores       2
100 to 999 stores    2
1000+ stores         1

Last edited by Johny Daniel; 09 May 2017, 10:19.
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

09 May 2017, 12:47

Try this:

Code:

gen size_group = 1 if freq <= 99
replace size_group = 2 if inrange(freq, 100, 999)
replace size_group = 3 if freq >= 1000 & !missing(freq)
label define size_group 1 "0 to 99" 2 "100 to 999" 3 "1000+"
label values size_group size_group

to create a variable that characterizes each district's number of stores according to your scheme. I'm not sure what your final example represents, but perhaps you want to do

Code:

tab size_group
Johny Daniel

Join Date: May 2017

Posts: 11
#3

09 May 2017, 13:27

I am still a newbie and may have represented the data wrongly. This is what my data look like before encoding:

Code:

tab district

dist        Freq.    Percent    Cum.
DistA       10,686   1.02       1.02
Dist B      10,510   1.01       2.03
Dist C      10,375   0.99       3.02
Dist D      10,259   0.98       4.00

After encoding (encode district, gen(dist_n)):

Code:

tab dist_n

dist_n      Freq.    Percent    Cum.
1           10,686   1.02       1.02
2           10,510   1.01       2.03
3           10,375   0.99       3.02
4           10,259   0.98       4.00
and so on.

When I use

gen size_group = 1 if freq <= 99

I get error r(111): freq not found. I may be wrong, but shouldn't the code be gen size_group = 1 if dist_n <= 99?

The problem with this is that it uses the values of dist_n (1, 2, 3, ...), which have no meaning; it does not give me a new variable that holds the frequency. My goal is to get a range for the number of stores in each district (which is my outcome variable) and tabulate it against other variables such as district education level. So I want to create a table that looks like this:

Dist size       Freq
0 to 1000       110
1001 to 2000    210
2001 to 3000    150
3000+           220

(110 districts with 0 to 1000 registered stores) (210 districts with 1001 to 2000 registered stores) and so on

Last edited by Johny Daniel; 09 May 2017, 13:30.
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#4

09 May 2017, 15:55

"I may be wrong but shouldn't the code be gen size_group = 1 if dist_n <= 99"

Well, at this point I'm pretty confused about what your data set looks like, so I'm hesitant to comment. But as far as I can figure out from what you've shown, dist_n is just a consecutive sequence number that runs from 1 through however many districts you have in your data set in alphabetic order. So what you propose here would classify the districts based on where they ranked in alphabetic order of their names, not by the number of stores they contain.

If your data set contains a variable for the name of the district and another variable with the number of stores, then the code I showed in #2 will work if you simply replace -freq- by the actual name of the variable with the number of stores everywhere it appears. I had misunderstood what you showed in #1 to imply that you had a variable named freq (or maybe Freq) that contains the number of stores for each district.

If you don't have any such variable, then you are starting from someplace different than I imagined. In that case, I suggest you post back and include an example of your data and an explanation of what the variables in it mean. Be sure to use the -dataex- command to do that. Run -ssc install dataex- to get the -dataex- command (if you don't already have it), and then run -help dataex- to read the simple instructions for using it.
Johny Daniel

Join Date: May 2017

Posts: 11
#5

10 May 2017, 07:51

First, Clyde, thanks for taking the time to help me with my Stata question. This is a sample of my data:

District    Shop    Total Employ
A1          X1      80
A1          X1      90
A1          X1      150
A1          Y1      90
A1          Y1      55
A1          Y1      72
A2          X11     19
A2          X11     13
A2          X12     88
A2          X12     213
A2          X13     345
A2          X13     44
A2          X13     79
A2          X14     333

As shown, District A1 has a total of 6 stores and District A2 has 8 stores. There can be more than one of the same store in the district too, for example X1 has three stores in district A1.

So I want to create a new variable, district_size, whose value (not its frequency) for each district is the number of stores in that district. For example:

District    No. of stores
A1          6
A2          8
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#6

10 May 2017, 09:51

Code:

by district, sort: gen size = _N

Your example data was posted in a way that is easy to read by human eyes; and in this case that is all that was needed. But it would have been cumbersome to import into Stata had it been necessary to try out and test some code. As requested previously, in the future always use the -dataex- command to post example data so that those who want to help you can easily create a faithful replica of your Stata example with just a simple copy and paste operation. Doing so will increase your chances of getting a timely helpful response.
Nick Cox

Join Date: Mar 2014

Posts: 35724
#7

10 May 2017, 09:54

Clyde gives excellent advice as always. But it occurs to me that Johny may just be thinking about "data" in a way that Stata doesn't.

Code:

tab District

would show the number of stores. Perhaps that is what is wanted.
Johny Daniel

Join Date: May 2017

Posts: 11
#8

10 May 2017, 10:24

Clyde, thank you so much for the solution. That works for me, and in the future I will take your advice to post using dataex.

Nick, thanks for your comment. But tab district does not allow me to break districts into groups, e.g. districts with 1 to 100 stores, 101 to 1000 stores, etc.
Nick Cox

Join Date: Mar 2014

Posts: 35724
#9

10 May 2017, 10:33

You're changing the question back and forth, which is both OK and confusing. My answer in #7 is a solution to #5.

If you want tabulate to reflect a classification, then indeed the classification must exist beforehand. Clyde's answers in #2 and #6 already answer that, but in reverse order given the lack of clarity about your data. You create a frequency variable first, then reduce it to a classification.

Given a data example as requested in #6 I would be happy to demonstrate.
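In the meantime, here is a minimal sketch of that two-step approach, assuming data shaped like the sample in #5. The variable names (district, shop, employ, size, size_group, one_per_district) and the cut points are illustrative assumptions, not taken from the real data set. Note that after step 1 every store carries its district's count, so a plain tabulate would count stores; the tag() function of egen flags one observation per district so that each district is counted once.

```stata
* Hypothetical data mirroring the sample in #5 (names are assumptions)
clear
input str2 district str3 shop employ
"A1" "X1"  80
"A1" "X1"  90
"A1" "X1"  150
"A1" "Y1"  90
"A1" "Y1"  55
"A1" "Y1"  72
"A2" "X11" 19
"A2" "X11" 13
"A2" "X12" 88
"A2" "X12" 213
"A2" "X13" 345
"A2" "X13" 44
"A2" "X13" 79
"A2" "X14" 333
end

* Step 1: number of stores (observations) in each district
bysort district: gen size = _N

* Step 2: reduce the count to a classification (illustrative cut points)
gen size_group = 1 if size <= 99
replace size_group = 2 if inrange(size, 100, 999)
replace size_group = 3 if size >= 1000 & !missing(size)
label define size_group 1 "0 to 99" 2 "100 to 999" 3 "1000+"
label values size_group size_group

* Flag one observation per district so each district is counted once
egen byte one_per_district = tag(district)
tabulate size_group if one_per_district
```

With this toy input both districts (6 and 8 stores) fall in the first group; with real data the cut points would need to match the ranges you actually want.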
Johny Daniel

Join Date: May 2017

Posts: 11
#10

11 May 2017, 11:05

Nick, I want to give you an example, but I am unable to install dataex in my Stata.

ssc install dataex
checking dataex consistency and verifying not already installed...
cannot write in directory D:\StataAdo\ado\plus\d

I get this error when trying to do so.
Nick Cox

Join Date: Mar 2014

Posts: 35724
#11

11 May 2017, 11:23

Why that does not work is a matter for your local IT support to explain.
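One possible workaround, offered only as a sketch: it assumes the message means Stata cannot write to its PLUS directory and that you have some other folder you are allowed to write to. The path below is a placeholder, not a real location on your machine.

```stata
* Show where Stata currently looks for and installs ado-files
sysdir

* Point PLUS at a folder you can write to (placeholder path; replace it)
sysdir set PLUS "C:/Users/yourname/ado/plus"

* Retry the installation
ssc install dataex
```

A change made with sysdir set lasts only for the current session, so it would have to be repeated (or placed in a profile.do) for the installed command to be found in later sessions.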