Create dataset

Simona Ferraro

Join Date: Jan 2017
Posts: 34

23 Jan 2017, 12:41

Dear all,
my apologies if the question is quite simple, but I am not used to creating a dataset and I need your help. I have the values below and I need to have a distribution. I have a range for income, 0-1, where the total income is negative and there are 283468 taxpayers. When I copy and paste them into STATA, they are all strings. How can I create my dataset and have a normal distribution? I hope someone can help me.
Thank you very much

Range income	Total income	N. taxpayers
0-1	- 4 835 516 000	283 468
1-10000	14 002 838 000	2 506 533
10000-20000	71 204 885 000	4 749 939
20000-30000	119 519 432 000	4 784 796
30000-40000	142 509 645 000	4 102 418
40000-50000	129 949 670 000	2 906 925
50000-100000	383 600 971 000	5 641 489
100000-500000	264 531 927 000	1 674 865
>500000	72 251 237 000	56 397


Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

23 Jan 2017, 12:49

Hello Simona,

Welcome to the Stata Forum.

I'm not sure I understood your query correctly. I assume by "create" you mean "input".

If so, you may type:

Code:

. set obs 8
. input range_income total_income n_taxpayers
* Then you may input the values. For "range_income" you may create 8 groups (say, 1 to 8) and afterwards define and use label values.
* Finally, if you wish to have a huge dataset according to the number of taxpayers, you may type:
. expand n_taxpayers
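
As a minimal sketch of the steps above, the table in #1 could be entered as follows, best run from the Do-file Editor. The storage types (byte, double, long) and the label name range_lbl are assumptions of this sketch, not something given in the thread. Note that the table has nine ranges, so the groups here run from 1 to 9, and that total_income is stored as double because the values are too large for long or float to hold exactly. Each data row is typed after a numbered prompt ("1.", "2.", and so on), and "end" closes the input.

Code:

clear
* One row per income range; the spaces used as thousands separators are removed
input byte range_income double total_income long n_taxpayers
1  -4835516000  283468
2  14002838000 2506533
3  71204885000 4749939
4 119519432000 4784796
5 142509645000 4102418
6 129949670000 2906925
7 383600971000 5641489
8 264531927000 1674865
9  72251237000   56397
end

* Attach readable labels to the numeric group codes
label define range_lbl 1 "0-1" 2 "1-10000" 3 "10000-20000" 4 "20000-30000" ///
    5 "30000-40000" 6 "40000-50000" 7 "50000-100000" 8 "100000-500000" 9 ">500000"
label values range_income range_lbl

format total_income %16.0f
list, noobs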

Hopefully that helps.

Last edited by Marcos Almeida; 23 Jan 2017, 12:51.

Best regards,

Marcos
Comment
Simona Ferraro

Join Date: Jan 2017

Posts: 34
#3

23 Jan 2017, 13:04

Thank you for the welcome and for your answer. I am quite new, hence my question.
Yes I meant "input" variables. When I run the first two lines then I have

input range_income total_income n_taxpayers
range_i~e total_i~e n_taxpa~s
1.

Under "1." should I write my values? Because if I click "data editor", I cannot open the window.
Thank you again

Best regards,
Simona
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17850
#4

23 Jan 2017, 23:54

Simona:
the main issue with your dataset rests on the fact that your variables include ranges, not point estimates.
So you have to split the range of your variables and create a new variable for each lower and upper limit of the range.
Obviously, I cannot say if the suggested approach is in line with what you're after.
As far as your second query is concerned, I find it difficult to believe that a normal distribution is realistic for variables like the ones you're dealing with.

Kind regards,
Carlo
(Stata 19.0)
Comment
Simona Ferraro

Join Date: Jan 2017

Posts: 34
#5

24 Jan 2017, 01:37

Dear Carlo,
thank you very much for your answer. I have variables as ranges, but that is how I got the Excel file. You suggest creating a new variable for each lower and upper limit of the range: var1 for 0, var2 for 1, var3 for 10000 and so on. But what should I input as values for 0 and values for 1, given that they have the same income and number of taxpayers?
From that simple data I need to produce a normal distribution, if I can do it.
Thank you

Best regards,
Simona
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17850
#6

24 Jan 2017, 02:08

Simona:
your interpretation of my previous reply is correct.
However, I don't believe that your dataset, as it is, allows any substantive statistical analyses in Stata.
Obviously, I may be mistaken: if you clarify the goal of your research, some more positive replies may come along.
Finally, I do not follow your need to rely on a normal distribution; often, income follows a Gamma distribution (or a skewed one, at any rate).

Kind regards,
Carlo
(Stata 19.0)
Comment
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#7

24 Jan 2017, 03:06

Hello Simona,

As I pointed out in #2,

For "range_income" you may create 8 groups (say, 1 to 8) and afterwards define and use label values.

It is a simple procedure in Stata and you will find it well described in the manual.

Giving you an example, you may type 1 for the first range, 2 for the second, etc. Then you will just need to define a label and apply it.

For this, just type:

Code:

. help label

And see some interesting examples.

You may also wish to use the Variables Manager instead, if you like.

Best regards,

Marcos
Comment

Simona Ferraro

Join Date: Jan 2017
Posts: 34

24 Jan 2017, 04:16

Thank you again Carlo for the quick answer.
I work on income distribution with deductible donations. At the beginning, I need some form of distribution within the income groups: either assume a normal (Gaussian) distribution within each income group, or assume an equal distribution (group members equally spread within the group). I don't know which is more adequate. I also need to split the number of taxpayers between "single" and "married", because the country I study makes this distinction too (the number of taxpayers above is the total: single + married).

Thank you

Best regards,
Simona

Range_income	Total_income	N_taxpayers	Range_income	N_taxpayers (single+married)	Tot_Donations (/1000)
0-1	- 4 835 516 000	283 468	0-1	61 380	827 891
1-10000	14 002 838 000	2 506 533	1-10000	343 305	137 215
10000-20000	71 204 885 000	4 749 939	10000-20000	1 271 314	423 459
20000-30000	119 519 432 000	4 784 796	20000-30000	1 462 863	478 909
30000-40000	142 509 645 000	4 102 418	30000-40000	1 438 834	543 478
40000-50000	129 949 670 000	2 906 925	40000-50000	1 159 082	514 188
50000-100000	383 600 971 000	5 641 489	50000-100000	2 696 888	1 356 600
100000-500000	264 531 927 000	1 674 865	100000-500000	1 090 703	1 304 582
>500000	72 251 237 000	56 397	>500000	47 499	1 135 606

Total	1 192 735 089 000	26 706 830		9 571 868	6 721 927


Range_income	N_taxpayers (single+married)	Donations_paid (/1000)	Range_income	Single	Donation_paid (/1000)
0-1	4 136	880	0-1	2 372	382
1-10000	251 311	51 031	1-10000	211 584	40 467
10000-20000	1 106 972	284 679	10000-20000	844 750	199 328
20000-30000	1 309 756	383 235	20000-30000	776 637	200 222
30000-40000	1 329 616	452 724	30000-40000	658 336	202 929
40000-50000	1 085 995	414 963	40000-50000	423 428	149 040
50000-100000	2 551 674	1 152 886	50000-100000	557 379	242 770
100000-500000	1 044 651	903 827	100000-500000	145 512	140 178
>500000	46 462	544 834	>500000	8 761	153 390

Total	8 730 573	4 189 059	Total	3 628 759	1 328 706
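
One possible way to sketch the "equal distribution" option described above is to expand each income group into individual records and space the incomes evenly between the group's limits. The example below uses only two rows of the table and scales the counts down (one simulated record per 100,000 taxpayers) so that it stays small. The scaling factor and the variable names lower, upper, bracket and n_sim are assumptions made for illustration, not part of the original data.

Code:

clear
* Lower limit, upper limit and number of taxpayers for two of the income groups
input double lower double upper long n_taxpayers
1     10000 2506533
10000 20000 4749939
end

gen byte bracket = _n

* Assumption: keep 1 simulated record per 100,000 taxpayers to limit the dataset size
gen long n_sim = round(n_taxpayers / 100000)

* Turn each group into n_sim identical records
expand n_sim

* Spread incomes evenly inside each group, strictly between its limits
bysort bracket: gen double income = lower + (upper - lower) * (_n - 0.5) / _N

tabstat income, by(bracket) statistics(n min mean max)

Dropping the scaling step and using "expand n_taxpayers" directly would give one record per taxpayer, which for the full table means about 26.7 million observations.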

Comment

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17850
#9

24 Jan 2017, 05:00

Simona:
thanks for further clarifications.
As far as I know, the Statalister who has the widest experience with that stuff is Stephen Jenkins.
While waiting for him to chime in, you may want to search for some of his previous posts (along with related works) and (hopefully) retrieve some suggestions.
Sorry I cannot be more helpful.

Kind regards,
Carlo
(Stata 19.0)
Comment
Simona Ferraro

Join Date: Jan 2017

Posts: 34
#10

24 Jan 2017, 07:39

Thank you very much Carlo.
Of course I have looked through the different topics to find something, mainly on how to input variables properly, because otherwise I end up with a wrong dataset.

Best regards,
Simona
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17850
#11

24 Jan 2017, 07:57

Simona:
the main issue is that you have (for some variables, at least) a sort of grouped data dataset.
Perhaps you may want to calculate a midpoint value whenever you have a range and then input accordingly.
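
A minimal sketch of that idea, using the figures from the table in #1: each range is split into a lower and an upper limit, and a midpoint is computed from them. The variable names are assumptions chosen for illustration. The open-ended top range (>500000) has no upper limit, so it is entered as missing and its midpoint is missing too. The mean income per taxpayer is added only as a comparison, since the table also reports total income; for the first range that total is negative, so the midpoint is a poor summary there.

Code:

clear
* One row per income range: limits, total income and number of taxpayers
input double lower double upper double total_income long n_taxpayers
0      1      -4835516000  283468
1      10000  14002838000 2506533
10000  20000  71204885000 4749939
20000  30000 119519432000 4784796
30000  40000 142509645000 4102418
40000  50000 129949670000 2906925
50000  100000 383600971000 5641489
100000 500000 264531927000 1674865
500000 .      72251237000   56397
end

* Midpoint of each range (missing where the upper limit is missing)
gen double midpoint = (lower + upper) / 2

* Average income per taxpayer implied by the reported totals
gen double mean_income = total_income / n_taxpayers

list lower upper midpoint mean_income n_taxpayers, noobs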

Kind regards,
Carlo
(Stata 19.0)
Comment
Simona Ferraro

Join Date: Jan 2017

Posts: 34
#12

28 Jan 2017, 08:34

Dear all,

I write here again because I found another way to present my dataset. However, I do not know how to input my variables in STATA so that I can work on it.
I have created dist1 = 0.004457153 in STATA with the command ge dist1=0.004457153

Now I have to input all my taxpayers, a great many of them, starting from 1. So the first taxpayer has income 1, the second taxpayer has income "1+dist1", the third taxpayer has the income of the second taxpayer + dist1, and so on (you can see the small Excel table below). As I need as many observations as there are taxpayers, I would like to know how I can do this for all of them. It is just the sum of the previous value and a value which is fixed at 0.004457153.
I know this is quite a simple question for you experts, but I do not know how to do it. I need it for thousands and thousands of taxpayers, but as the computation is always the same, I would really appreciate it if you could explain or suggest how to write the command.
You can see from the table that 2 has income 1+0.004457153; 3 has income 1.00445715+0.004457153; 4 has income 1.00891431+0.004457153 and so on.

Thank you very much
Best regards,
Simona

1	1
2	1.00445715
3	1.00891431
4	1.01337146
5	1.01782861
6	1.02228576
7	1.02674292
8	1.03120007
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36053
#13

28 Jan 2017, 08:46

http://www.statalist.org/forums/help#spelling

Code:

gen double newvar = 1 + (_n - 1) * 0.004457153
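
A short sketch of how that line might be used in full, assuming an empty dataset and only 8 observations for illustration (in practice the number in "set obs" would be the number of taxpayers). The variable name check is hypothetical; it rebuilds the same series as a running sum, each value being the previous one plus the fixed step, to show that the two forms agree.

Code:

clear
* The observations must exist before a variable can be generated
set obs 8

* Closed form: observation _n gets 1 plus (_n - 1) steps of 0.004457153
gen double newvar = 1 + (_n - 1) * 0.004457153

* Equivalent running sum: start at 1, then add the step to the previous value
gen double check = 1 in 1
replace check = check[_n-1] + 0.004457153 in 2/l

format newvar check %10.8f
list, noobs

With 8 observations this should reproduce the eight values listed in #12. Storing the result as double rather than the default float avoids rounding error building up over many observations.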
Comment
Simona Ferraro

Join Date: Jan 2017

Posts: 34
#14

28 Jan 2017, 09:37

Dear Nick Cox,
thank you very much. Finally I understand how to create my dataset. Following the advice above and now your comment, I wrote a loop and I have what I needed.
Thank you to all of you who answered my post.

Best regards,
Simona
Comment

