Tabulating from very large dataset

Leonard Kalov

Join Date: Jul 2022

Posts: 6
#1

Tabulating from very large dataset

07 Jul 2022, 09:03

Hello,

I have a very large dataset on which I am trying to perform specific analyses, but I am running into obstacles due to my lack of Stata expertise. The dataset provides details about the gender, age, level of education and other characteristics of the respondents. I am trying to calculate the percentage of never-married urban women aged 20-24 who do not own a personal mobile phone and then break this down by various other characteristics, e.g. never-married urban women aged 20-24 who don't own a mobile phone by level of education, by province of residence, etc. Similar tasks would be the percentage of never-married urban women aged 20-24 who own only a smartphone, and the percentage of never-married urban women aged 20-24 with access to the internet. I would then like to break these down by the other characteristics I mentioned.

I was trying to calculate this using the following code but didn't succeed:

gen mobile=0 if sb1q7==0 & sb1q4==1 & region==1 (region==1 is urban residence)
replace mobile=1 if sb1q7=0 & sb1q4==1 & region==1 & sc2q05==3

The process already failed here but if it had succeeded, I would have then gone to do something like this:

gen mobilepercent=mobile*100
format *_100 %6.1f
tab province if sc1q05==13 summarize(mobilepercent) means noobs (sc1q05==13 is education upto highschool graduation)

I don't know how to fit the age range 20-24 into this.

The dataset is very large, with over 800,000 observations. By default dataex outputs only 100 observations of each variable, so the data are not fully described, but people can nonetheless get an idea from these:

age [age of respondents]

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte age
58
46
18
13
12
10
8
6
64
61
34
13
9
6
30
7
4
98
35
35
0
56
44
50
42
16
5
2
67
31
10
15
13
65
57
33
24
2
0
50
26
23
21
29
6
4
2
16
82
44
47
41
17
15
14
8
7
6
5
37
13
12
7
3
1
37
26
48
15
8
5
30
30
47
28
24
20
28
65
30
39
30
2
0
34
10
8
7
5
59
50
20
18
35
33
10
8
6
54
40
end

sb1q7 [marital status of respondent]

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte sb1q7
2
2
1
1
1
1
1
1
2
2
2
1
1
1
2
1
1
3
2
2
1
2
2
2
2
1
1
1
2
3
1
1
1
2
2
2
2
1
1
2
1
1
1
3
1
1
1
1
3
1
2
2
1
1
1
1
1
1
1
2
1
1
1
1
1
2
2
2
1
1
1
2
2
3
1
1
1
1
3
1
2
2
1
1
2
1
1
1
1
2
2
1
1
2
2
1
1
1
2
2
end
label values sb1q7 sb1q7
label def sb1q7 1 "unmarried / never married", modify
label def sb1q7 2 "currently married", modify
label def sb1q7 3 "widow / widower", modify

sb1q4 [gender of respondent]

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte sb1q4
1
2
2
2
2
1
1
1
1
2
2
2
2
1
2
1
1
2
1
2
2
1
2
1
2
1
1
1
2
2
1
2
2
1
2
1
2
1
1
2
1
1
2
2
1
1
1
1
2
2
1
2
1
2
2
2
1
1
2
2
2
2
2
2
2
1
2
2
2
2
1
1
2
2
1
1
2
1
2
2
1
2
2
1
2
1
1
1
1
1
2
2
2
1
2
1
2
1
1
2
end
label values sb1q4 sb1q4
label def sb1q4 1 "male", modify
label def sb1q4 2 "female", modify

sc2q05 [mobile ownership status of respondents]

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte sc2q05
1
1
3
3
3
3
3
3
1
3
2
3
3
3
2
3
3
3
2
2
3
1
3
3
3
1
3
3
1
1
3
1
3
2
1
2
1
3
3
3
2
2
3
1
3
3
3
2
3
3
2
1
1
3
3
3
3
3
3
1
3
3
3
3
3
1
2
1
3
3
3
1
3
1
1
2
3
1
3
3
1
3
3
3
1
3
3
3
3
1
3
3
3
1
2
3
3
3
3
3
end
label values sc2q05 sc2q05
label def sc2q05 1 "mobile phone", modify
label def sc2q05 2 "smart phone", modify
label def sc2q05 3 "none of above", modify

Last edited by Leonard Kalov; 07 Jul 2022, 09:30.
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

07 Jul 2022, 10:00

gen mobile=0 if sb1q7==0 & sb1q4==1 & region==1 (region==1 is urban residence)
replace mobile=1 if sb1q7=0 & sb1q4==1 & region==1 & sc2q05==3

The process already failed here...

Saying something "failed" is not helpful. In what sense did it fail? Did Stata crash? Did it give you an error message--if so what was the error message? Did it produce results that were not what you expected--if so, show the results and explain how they differ from what is wanted (unless it is blatantly obvious to anyone who can read).

The code itself looks correct, except that you cannot have that parenthetical note explaining the region variable at the end. You need to set that off with //. In any case, there is no need to do this in two steps. You can accomplish the same thing with a single line:

Code:

gen mobile = (sc2q05 == 3) if sb1q7 == 0 & sb1q4 == 1 & region == 1

To include an additional condition that age must be between 18 and 24, make it

Code:

gen mobile = (sc2q05 == 3) if sb1q7 == 0 & sb1q4 == 1 & region == 1 & inrange(age, 18, 24)
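
To see how this construction behaves, here is a minimal sketch on made-up data. The variables grp and status are hypothetical stand-ins, not variables from the dataset in this thread. The expression in parentheses evaluates to 1 or 0, and the -if- qualifier leaves every observation outside the restriction as missing. It follows that if no observation satisfies the -if- condition, the new variable is missing everywhere.

```stata
* Made-up data: grp plays the role of the restriction, status the outcome
clear
input byte(grp status)
1 3
1 1
2 3
end

* 1 if status == 3, 0 otherwise, but only where grp == 1; missing elsewhere
gen byte flag = (status == 3) if grp == 1

assert flag == 1 in 1        // inside the restriction, condition true
assert flag == 0 in 2        // inside the restriction, condition false
assert missing(flag) in 3    // outside the restriction
list, noobs clean
```

Because the result is a 0/1 indicator, its mean within any group is the proportion of that group with status == 3.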
Leonard Kalov

Join Date: Jul 2022

Posts: 6
#3

07 Jul 2022, 10:50

Originally posted by Clyde Schechter

Saying something "failed" is not helpful. In what sense did it fail? Did Stata crash? Did it give you an error message--if so what was the error message? Did it produce results that were not what you expected--if so, show the results and explain how they differ from what is wanted (unless it is blatantly obvious to anyone who can read).

The code itself looks correct, except that you cannot have that parenthetical note explaining the region variable at the end. You need to set that off with //. In any case, there is no need to do this in two steps. You can accomplish the same thing with a single line:

Code:

gen mobile = (sc2q05 == 3) if sb1q7 == 0 & sb1q4 == 1 & region == 1

To include an additional condition that age must be between 18 and 24, make it

Code:

gen mobile = (sc2q05 == 3) if sb1q7 == 0 & sb1q4 == 1 & region == 1 & inrange(age, 18, 24)

The following command works:
gen mobile=0 if sb1q7==0 & sb1q4==1 & region==1

But this one yields "(0 real changes made)"
replace mobile=1 if sb1q7==0 & sb1q4==1 & region==1 & sc2q05==3

"tab mobile" also yields "no observations".

Your commands also yield "no observations" for "mobile". The variable isn't being properly created. Of course, 'mobile' should take the value '1' if the never-married urban female respondent aged 18-24 isn't in possession of a phone OR '1' if the never-married urban female respondent aged 18-24 owns either a mobile phone or a smartphone. I have pasted sc2q05's dataex output above.

Last edited by Leonard Kalov; 07 Jul 2022, 10:52.

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#4

07 Jul 2022, 13:08

In the future, please, do not use -dataex- for one variable at a time. Show a -dataex- that includes all the relevant variables. I have pasted yours together to come up with a workable example. (You never posted values for the -region- variable, so I'm just omitting it from the discussion entirely here. I think the same conclusions will apply.) There is nothing wrong with your code, nor with mine. Stata is simply pointing out to you that the distribution of your data doesn't include any observations having the desired combination of values: age between 18 and 24, sb1q7 == 0, sb1q4 == 1, region == 1, and sc2q05 == 3.

If we just look at the example data, we can see that out of the 100 observations shown, there are only 8 with age between 18 and 24. None of those have sb1q7 == 0. Only two of them have sb1q4 == 1.

Code:

. * Example generated by -dataex-. For more info, type help dataex
. clear

. input byte(age sb1q7 sb1q4 sc2q05)

          age     sb1q7     sb1q4    sc2q05
  1. 58 2 1 1
  2. 46 2 2 1
  3. 18 1 2 3
  4. 13 1 2 3
  5. 12 1 2 3
  6. 10 1 1 3
  7.  8 1 1 3
  8.  6 1 1 3
  9. 64 2 1 1
 10. 61 2 2 3
 11. 34 2 2 2
 12. 13 1 2 3
 13.  9 1 2 3
 14.  6 1 1 3
 15. 30 2 2 2
 16.  7 1 1 3
 17.  4 1 1 3
 18. 98 3 2 3
 19. 35 2 1 2
 20. 35 2 2 2
 21.  0 1 2 3
 22. 56 2 1 1
 23. 44 2 2 3
 24. 50 2 1 3
 25. 42 2 2 3
 26. 16 1 1 1
 27.  5 1 1 3
 28.  2 1 1 3
 29. 67 2 2 1
 30. 31 3 2 1
 31. 10 1 1 3
 32. 15 1 2 1
 33. 13 1 2 3
 34. 65 2 1 2
 35. 57 2 2 1
 36. 33 2 1 2
 37. 24 2 2 1
 38.  2 1 1 3
 39.  0 1 1 3
 40. 50 2 2 3
 41. 26 1 1 2
 42. 23 1 1 2
 43. 21 1 2 3
 44. 29 3 2 1
 45.  6 1 1 3
 46.  4 1 1 3
 47.  2 1 1 3
 48. 16 1 1 2
 49. 82 3 2 3
 50. 44 1 2 3
 51. 47 2 1 2
 52. 41 2 2 1
 53. 17 1 1 1
 54. 15 1 2 3
 55. 14 1 2 3
 56.  8 1 2 3
 57.  7 1 1 3
 58.  6 1 1 3
 59.  5 1 2 3
 60. 37 2 2 1
 61. 13 1 2 3
 62. 12 1 2 3
 63.  7 1 2 3
 64.  3 1 2 3
 65.  1 1 2 3
 66. 37 2 1 1
 67. 26 2 2 2
 68. 48 2 2 1
 69. 15 1 2 3
 70.  8 1 2 3
 71.  5 1 1 3
 72. 30 2 1 1
 73. 30 2 2 3
 74. 47 3 2 1
 75. 28 1 1 1
 76. 24 1 1 2
 77. 20 1 2 3
 78. 28 1 1 1
 79. 65 3 2 3
 80. 30 1 2 3
 81. 39 2 1 1
 82. 30 2 2 3
 83.  2 1 2 3
 84.  0 1 1 3
 85. 34 2 2 1
 86. 10 1 1 3
 87.  8 1 1 3
 88.  7 1 1 3
 89.  5 1 1 3
 90. 59 2 1 1
 91. 50 2 2 3
 92. 20 1 2 3
 93. 18 1 2 3
 94. 35 2 1 1
 95. 33 2 2 2
 96. 10 1 1 3
 97.  8 1 2 3
 98.  6 1 1 3
 99. 54 2 1 3
100. 40 2 2 3
101. end

. label values sb1q7 sb1q7

. label def sb1q7 1 "unmarried / never married", modify

. label def sb1q7 2 "currently married", modify

. label def sb1q7 3 "widow / widower", modify

. label values sb1q4 sb1q4

. label def sb1q4 1 "male", modify

. label def sb1q4 2 "female", modify

. label values sc2q05 sc2q05

. label def sc2q05 1 "mobile phone", modify

. label def sc2q05 2 "smart phone", modify

. label def sc2q05 3 "none of above", modify

.
. gen mobile = (sc2q05 == 3) if sb1q7 == 0 & sb1q4 == 1 & inrange(age, 18, 24)
(100 missing values generated)

.
. count if inrange(age, 18, 24)
  8

. count if inrange(age, 18, 24) & sb1q7 == 0
  0

. count if inrange(age, 18, 24) & sb1q4 == 1
  2

.
end of do-file

The example data doesn't have a region variable, but imposing yet another restriction would further restrict the amount of data available for calculating mobile. This small percentage of people who meet even just the age and sex criteria, and the fact that none also meet the marital status criteria, make it seem quite plausible to me that in the full data set you have nobody who meets these restrictions.
To convince yourself that in the full data set there really just aren't any people meeting all these criteria at the same time, run:

Code:

contract sb1q4 sb1q7 sc2q05 region if inrange(age, 18, 24), zero
browse

If the data are correct, then you are just facing a sample that has nobody who meets your criteria. It would make sense to review the data management that created this data set to see if a whole bunch of people were left out, or perhaps one or more of the variables is miscoded.
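
One way to check for miscoding is to look at the numeric codes behind the value labels rather than at the labels themselves. The following is a minimal sketch on a three-observation toy variable that reuses the sb1q7 labels from the -dataex- output in #1; the three data values are made up for illustration.

```stata
* Toy variable carrying the sb1q7 value labels shown in #1
clear
input byte sb1q7
1
2
3
end
label define sb1q7 1 "unmarried / never married" 2 "currently married" 3 "widow / widower"
label values sb1q7 sb1q7

tabulate sb1q7            // shows labels only, not the codes behind them
tabulate sb1q7, nolabel   // shows the underlying numeric codes
label list sb1q7          // shows each code next to its label

* No observation carries the code 0, so a condition on sb1q7 == 0 selects nothing
count if sb1q7 == 0
assert r(N) == 0
```

Running -tabulate, nolabel- or -label list- on each variable used in the -if- condition shows which numeric value to test for.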


Leonard Kalov

Join Date: Jul 2022

Posts: 6
#5

07 Jul 2022, 13:22

Originally posted by Clyde Schechter

In the future, please, do not use -dataex- for one variable at a time. Show a -dataex- that includes all the relevant variables. I have pasted yours together to come up with a workable example. (You never posted values for the -region- variable, so I'm just omitting it from the discussion entirely here. I think the same conclusions will apply.) There is nothing wrong with your code, nor with mine. Stata is simply pointing out to you that the distribution of your data doesn't include any observations having the desired combination of values: age between 18 and 24, sb1q7 == 0, sb1q4 == 1, region == 1, and sc2q05 == 3.

If we just look at the example data, we can see that out of the 100 observations shown, there are only 8 with age between 18 and 24. None of those have sb1q7 == 0. Only two of them have sb1q4 == 1.


The example data doesn't have a region variable, but imposing yet another restriction would further restrict the amount of data available for calculating mobile. This small percentage of people who meet even just the age and sex criteria, and the fact that none also meet the marital status criteria, make it seem quite plausible to me that in the full data set you have nobody who meets these restrictions.
To convince yourself that in the full data set there really just aren't any people meeting all these criteria at the same time, run:

Code:

contract sb1q4 sb1q7 sc2q05 region if inrange(age, 18, 24), zero
browse

If the data are correct, then you are just facing a sample that has nobody who meets your criteria. It would make sense to review the data management that created this data set to see if a whole bunch of people were left out, or perhaps one or more of the variables is miscoded.

As I mentioned, the dataset is extremely large, with over 800k observations, so dataex wouldn't allow me to post every single value of each variable here. There is nothing wrong with the dataset, as you can see from here:

Code:

. count if inrange(age, 18, 24)
  110,980

. count if inrange(age, 18, 24) & sb1q7 == 0
  0

. count if inrange(age, 18, 24) & sb1q4 == 1
  56,339

I should mention that the dataset is publicly available and can be downloaded from the official government website: https://www.pbs.gov.pk/content/pslm-...9-20-microdata

The files plist.dta and roster.dta contain the background of the respondents (gender, age, marital status, urban/rural residence), secc1.dta contains the educational background of the respondents and secc2.dta contains the mobile ownership status of the respondents. The files are to be merged together using the variables hhcode and idc.
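
For readers who want to reproduce the merge, here is a rough sketch of the idea on made-up stand-ins for roster.dta and secc2.dta. The data values are invented, and treating hhcode and idc as a unique person identifier in each file is an assumption that -isid- checks before the merge; the real files may need a different merge type.

```stata
* Made-up stand-in for roster.dta
clear
input long hhcode byte(idc age sb1q4 sb1q7)
1001 1 22 2 1
1001 2 45 1 2
1002 1 21 2 1
end
isid hhcode idc              // errors out if hhcode idc is not unique
tempfile roster
save `roster'

* Made-up stand-in for secc2.dta
clear
input long hhcode byte(idc sc2q05)
1001 1 3
1001 2 1
1002 1 2
end
isid hhcode idc
tempfile secc2
save `secc2'

* Person-level merge on the household and individual identifiers
use `roster', clear
merge 1:1 hhcode idc using `secc2'
tabulate _merge
assert _merge == 3           // every toy observation matched
```

The temporary files only exist while the do-file runs, so the block should be run in one go. With the real files, inspecting -tabulate _merge- before dropping anything shows how many respondents are missing from either file.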

Last edited by Leonard Kalov; 07 Jul 2022, 13:25.
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#6

07 Jul 2022, 13:36

So the very results you show in #5 show why you cannot get results for your mobile variable: there are no people who have age between 18 and 24 and sb1q7 == 0, let alone the other requirements on top of that. So Stata appropriately creates only missing values for the variable mobile, because that is what your code asks it to do. Either your data are the problem, or I am misunderstanding what you are trying to do.

So, my question to you is, what were you expecting Stata to give you? What would you expect Stata to put in the mobile variable for people who do meet all of the age, sex, marital status and region criteria? If it's something other than missing value, say what it is and I can show you how to code for it.
Leonard Kalov

Join Date: Jul 2022

Posts: 6
#7

07 Jul 2022, 15:19

Originally posted by Clyde Schechter

So the very results you show in #5 show why you cannot get results for your mobile variable: there are no people who have age between 18 and 24 and sb1q7 == 0, let alone the other requirements on top of that. So Stata appropriately creates only missing values for the variable mobile, because that is what your code asks it to do. Either your data are the problem, or I am misunderstanding what you are trying to do.

So, my question to you is, what were you expecting Stata to give you? What would you expect Stata to put in the mobile variable for people who do meet all of the age, sex, marital status and region criteria? If it's something other than missing value, say what it is and I can show you how to code for it.

I apologize for the trouble. I now realize that the problem was my misinterpretation: I assumed that because 'tab sb1q7' lists 6 values and the first one is "unmarried/never married", the never-married category could be selected with sb1q7==0, meaning the first category is coded 0. This is incorrect, as you know; using sb1q7==1 fixed all the issues. Here is the code that worked:

gen mobile = (sc2q05 == 3) if sb1q7 == 1 & sb1q4 == 2 & inrange(age, 20, 24) & region == 2
gen mobile_100=mobile*100
format *_100 %6.1f
And now I can successfully calculate however I like:
tab province if sc1q05==13, summarize(mobile_100) means noobs
tab province language, summarize(mobile_100) means noobs

I am now trying to calculate the usage of internet and there are two variables: sc2q08 which is whether the respondent used internet in the last 3 months and sc2q10 which is the frequency of internet usage among those who used internet in the last 3 months. sc2q08 variable contains all 800k respondents bifurcated by whether they used internet in the last 3 months or not, whereas sc2q10 only contains those who used internet in the last 3 months so it has only 116k respondents which is the same number of respondents who said they used internet in the last 3 months in sc2q08.

How do I create a variable which tells me the percentage of never-married [sb1q7==1] urban [region==2] men [sb1q4==1] aged 20-24 who used internet either once a day or once a week? Can you help me with the code?

Code:

. tab sc2q10

    how many |
       times |
did----- use |
   internet? |      Freq.     Percent        Cum.
-------------+-----------------------------------
           0 |        501        0.43        0.43
  once a day |     51,212       44.07       44.50
 once a week |     17,114       14.73       59.22
once a month |      2,677        2.30       61.53
 as required |     44,714       38.47      100.00
-------------+-----------------------------------
       Total |    116,218      100.00

Code:

. tab sc2q08

  did ----- |
        use |
   internet |
during last |
  3 months? |      Freq.     Percent        Cum.
------------+-----------------------------------
        yes |    115,616       13.29       13.29
         no |    754,546       86.71      100.00
------------+-----------------------------------
      Total |    870,162      100.00

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(sc2q08 sc2q10)
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
1 1
2 .
2 .
2 .
1 1
2 .
2 .
2 .
1 1
1 1
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
1 1
2 .
2 .
2 .
2 .
1 1
1 1
2 .
2 .
2 .
2 .
2 .
1 1
2 .
2 .
1 1
2 .
1 4
1 4
1 4
1 4
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
1 1
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
1 1
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
2 .
1 4
1 4
1 4
1 4
2 .
2 .
end
label values sc2q08 sc2q08
label def sc2q08 1 "yes", modify
label def sc2q08 2 " no", modify
label values sc2q10 sc2q10
label def sc2q10 1 "once a day", modify
label def sc2q10 4 "as required", modify
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#8

07 Jul 2022, 15:43

Code:

gen byte wanted = inlist(sc2q10, 1, 2) ///
if sb1q7 == 1 & sb1q4 == 1 & region == 2 & inrange(age, 20, 24)
tab wanted

will do that.

Note: I am guessing from the output of the -tab sc2q10- command you show that once a month is coded as 2 in that variable. Your example data only has values 1 "once a day" and 4 "as required" in that variable, so I cannot be certain how the other values are coded. But on the assumption that the coding goes in consecutive integers in the order shown in the -tab output-, 2 would be correct. If you are not sure yourself, run -label list sc2q10- and Stata will show you.

Leonard Kalov

Join Date: Jul 2022

Posts: 6
#9

07 Jul 2022, 16:08

Originally posted by Clyde Schechter

Code:

gen byte wanted = inlist(sc2q10, 1, 2) ///
if sb1q7 == 1 & sb1q4 == 1 & region == 2 & inrange(age, 20, 24)
tab wanted

will do that.

Note: I am guessing from the output of the -tab sc2q10- command you show that once a month is coded as 2 in that variable. Your example data only has values 1 "once a day" and 4 "as required" in that variable, so I cannot be certain how the other values are coded. But on the assumption that the coding goes in consecutive integers in the order shown in the -tab output-, 2 would be correct. If you are not sure yourself, run -label list sc2q10- and Stata will show you.

Code:

sc2q10:
           1 once a day
           2 once a week
           3 once a month
           4 as required

Can you confirm whether this is what you were trying to do:

Code:

gen byte want = inlist(sc2q10, 1, 3) if sb1q7 == 1 & sb1q4 == 1 & region == 2 & inrange(age, 20, 24)
tab want

Everything seems fine, but now there is a problem weighting this with the "weights" variable. If I do "tab province [iweight=weights], summarize(want_100) means noobs", I get an error which says that I cannot use noninteger frequency weights. Any idea how I can weight all of this?

Here is the output from dataex of "weights":

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float weights
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
247.0509
end

Last edited by Leonard Kalov; 07 Jul 2022, 16:11.


Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#10

07 Jul 2022, 17:05

Code:
sc2q10:
1 once a day
2 once a week
3 once a month
4 as required

Can you confirm whether this is what you were trying to do:

Code:
gen byte want = inlist(sc2q10, 1, 3) if sb1q7 == 1 & sb1q4 == 1 & region == 2 & inrange(age, 20, 24)
tab want

No, that's not what I was trying to do. I got confused when I wrote "I am guessing from the output of the -tab sc2q10- command you show that once a month is coded as 2 in that variable." I meant to say that I guessed once a week is coded as 2. And, as you show now, it is. So the criterion should be -inlist(sc2q10, 1, 2)-, so as to capture those who used once a day or once a week--which is what you asked for. Sorry for the confusion. So the code in #8 should be used as it was shown there.

Concerning the weights, the choice of proper type of weights is tricky. I do not know why you are getting the error message you show with the code you show. While fweights must always be non-negative integers, there is no similar restriction on iweights. I find it hard to believe that you really got that error message with that code. If that really does happen, then please post back with a single -dataex- that reproduces this same problem and also contains all the variables needed to run the problematic command(s), not just one or two--I said before that that isn't very helpful. And I'm not going to paste all those -dataex-'s together yet again. And also copy/paste from your Results window or log file the actual command and the error message into the Forum editor between code delimiters.

Also, before using any kind of weights in Stata you need to be certain that they are the right kind of weights for the data. The website that has the data sets themselves should also have documentation that explains the proper use of the weighting variables in their data. Refer to that before assuming that -iweight- is appropriate. It may or may not be.
Leonard Kalov

Join Date: Jul 2022

Posts: 6
#11

07 Jul 2022, 18:09

Originally posted by Clyde Schechter

No, that's not what I was trying to do. I got confused when I wrote "I am guessing from the output of the -tab sc2q10- command you show that once a month is coded as 2 in that variable." I meant to say that I guessed once a week is coded as 2. And, as you show now, it is. So the criterion should be -inlist(sc2q10, 1, 2)-, so as to capture those who used once a day or once a week--which is what you asked for. Sorry for the confusion. So the code in #8 should be used as it was shown there.

Concerning the weights, the choice of proper type of weights is tricky. I do not know why you are getting the error message you show with the code you show. While fweights must always be non-negative integers, there is no similar restriction on iweights. I find it hard to believe that you really got that error message with that code. If that really does happen, then please post back with a single -dataex- that reproduces this same problem and also contains all the variables needed to run the problematic command(s), not just one or two--I said before that that isn't very helpful. And I'm not going to paste all those -dataex-'s together yet again. And also copy/paste from your Results window or log file the actual command and the error message into the Forum editor between code delimiters.

Also, before using any kind of weights in Stata you need to be certain that they are the right kind of weights for the data. The website that has the data sets themselves should also have documentation that explains the proper use of the weighting variables in their data. Refer to that before assuming that -iweight- is appropriate. It may or may not be.

Using the simple command "tab province want [iweight=weights]" successfully yields the weighted results, but the detailed code isn't working for some reason. Could it be the fact that I am trying to calculate percentages and this interferes somehow with the weighting process? Regardless, here is the code that is problematic:

Code:

gen byte want = inlist(sc2q10, 1, 2) if sb1q7 == 1 & sb1q4 == 1 & region == 2 & inrange(age, 20, 24)
gen want_100=want*100
format *_100 %6.1f
tab province [iweight=weights], summarize(want_100) means noobs

This yields the following:

Code:

may not use noninteger frequency weights

This works perfectly fine if we exclude [iweight=weights].

Here is the output of the simpler command that worked:

Code:

. tab province want [iweight=weights]

                   |         want
          province |         0          1 |     Total
-------------------+----------------------+----------
khyber pakhtunkhwa |   153,364  50,763.67 | 204,127.7
            punjab | 1026772.4  493,517.7 | 1520290.1
             sindh | 592,016.3  315,019.6 | 907,035.9
       balochistan |  84,370.5  30,608.44 | 114,978.9
-------------------+----------------------+----------
             Total | 1856523.2  889,909.5 | 2746432.7

Last edited by Leonard Kalov; 07 Jul 2022, 18:13.
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#12

07 Jul 2022, 18:37

That error message makes no sense after that command. You didn't use frequency weights. You used importance weights, and those don't have to be integers. You can see that because the simpler -tab- command with -iweight- worked.

I can't think of any reason the more complicated command shouldn't work. I have not tried to do it in your example data because you did not put up a -dataex- output that contains all of the variables province, weights, and want_100 that are needed to run that command. Nevertheless, I tried something similar in the auto.dta that comes installed with Stata and I am able to reproduce your problem:

Code:

. tab rep78 [iweight = headroom], summarize(mpg) means noobs
may not use noninteger frequency weights
r(401);

. tab rep78 [iweight = headroom]

     Repair |
record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        3.5        1.69        1.69
          2 |         27       13.04       14.73
          3 |         95       45.89       60.63
          4 |       53.5       25.85       86.47
          5 |         28       13.53      100.00
------------+-----------------------------------
      Total |        207      100.00

The problem is that, surprisingly to me, -tab, summarize()- does not allow iweights, only aweights and fweights (which must be integers). So if these are genuinely supposed to be iweights, then you are not going to be able to do it this simply. What you will need to do instead is:

Code:

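* Weighted sum of the outcome within each province
by province, sort: egen numerator = total(weights*want_100)
* Sum of the weights over observations where want_100 is defined
by province: egen denominator = total(weights * !missing(want_100))
gen weighted_mean = numerator/denominator
egen flag = tag(province)
list province weighted_mean if flag, noobs clean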
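
The same calculation can be tried on auto.dta, continuing the illustration above with headroom standing in for the weight and mpg for the outcome. This is only a sketch of the arithmetic: the denominator sums the weights over observations where the outcome is nonmissing, and the resulting means should agree with the point estimates that -tab, summarize()- reports with aweights, since both compute sum(w*x)/sum(w).

```stata
sysuse auto, clear

* Weighted mean of mpg by rep78, with headroom standing in for the weight
by rep78, sort: egen numerator = total(headroom * mpg)
by rep78: egen denominator = total(headroom * !missing(mpg))
gen weighted_mean = numerator / denominator

* Tag one observation per nonmissing value of rep78 and list the results
egen flag = tag(rep78)
list rep78 weighted_mean if flag, noobs clean

* Same point estimates via analytic weights, which -tab, summarize()- accepts
tab rep78 [aweight = headroom], summarize(mpg) means noobs
```

Agreement of the means says nothing about which weight type is appropriate for the survey; that still has to come from the data documentation.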

All of that said, I must emphasize once again that the use of weights is tricky. -iweight- has no fixed meaning in Stata. Often when people specify -iweight- it is because they don't know what kind of weight they really should be using. The results can be extremely misleading. So, again, I'll remind you to verify in the documentation for the data sets that you don't really need -pweights- here. Just given the kind of variables you have been talking about in this thread, I get the impression that this is survey data you are using. Usually survey data works with -pweight-s. (And also there are often other things that go along with survey data, like strata, and primary or secondary sampling units. If you have these things, then you should be -svyset-ing your data and relying on -svy: tab- . Admittedly, just for means, only the -pweight- is needed, but once you get to any more complicated statistics and start looking at confidence intervals or p-values, if you aren't using all of the sampling parameters, then you are getting wrong answers.)
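
A minimal sketch of the survey route, again on auto.dta with headroom treated as if it were a sampling weight purely for illustration. With real survey data the -svyset- line would also name the primary sampling unit and strata variables; psuvar and stratavar below are hypothetical names, since the thread does not say what they are called in this dataset.

```stata
sysuse auto, clear

* Declare the design. With real survey data this would look more like
*   svyset psuvar [pweight = weights], strata(stratavar)
* where psuvar and stratavar are hypothetical variable names.
svyset [pweight = headroom]

* Design-based mean of mpg within each repair category
svy: mean mpg, over(rep78)
```

When the analysis is restricted to a subgroup, the -subpop()- option of -svy- is generally preferred over an -if- qualifier so that the standard errors account for the full design.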