Determining min and max of a variable

Paris Rira

Join Date: Dec 2022
Posts: 384

Determining min and max of a variable

20 Feb 2023, 17:24

Hi everybody,

There are three variables.
i. year -2010-2020
ii. firms'ID: NPC_FIC
ii. Establisments' ID: ESTAB_ID

I seek the number of firms, mean of establishments, std.dev establishment, min of establishment and max of establishment. In other words, I need to know how many firms exist in this period (2010-2020) and whats mean, min and max of establishments. Surely some firms build more plants and some exist from the market during11 years. Since each year the number of firms (not too much, less than 100) is it possible to conclude and reach a definite number of firms during this period? i.e. I can say there 298,000 firms in the period 2010-2020 and mean of the establishments is 10 ....

Code:

sort year NPC_FIC NPC_FIC

year    NPC_FIC    ESTAB_ID
2010    5.000e+08    -8.146e+09
2010    5.000e+08    -8.146e+09
2010    5.000e+08    -8.146e+09
2010    5.000e+08    794390
2010    5.000e+08    794390
2010    5.000e+08    794390
2010    5.000e+08    794390
2010    5.000e+08    794390
2010    5.000e+08    725800
2010    5.000e+08    725800
2010    5.000e+08    5
2010    5.000e+08    5
2010    5.000e+08    8
2010    5.000e+08    721748
2010    5.000e+08    721748
2010    5.000e+08    766713
2010    5.000e+08    766713
2010    5.000e+08    766713
2010    5.000e+08    766713
2010    5.000e+08    766713
2011    5.000e+08    794390    
2011    5.000e+08    794390    
2011    5.000e+08    794390    
2011    5.000e+08    725800    
2011    5.000e+08    725800    
2011    5.000e+08    5    
2011    5.000e+08    8    
2011    5.000e+08    773377    
2011    5.000e+08    1019118    
2011    5.000e+08    766713    
2011    5.000e+08    766713    
2011    5.000e+08    766713    
2011    5.000e+08    766713    
2011    5.000e+08    766713    
2011    5.000e+08    766713    
2011    5.000e+08    766713    
2011    5.000e+08    766713    
2011    5.000e+08    956207    
2011    5.000e+08    18    
2011    5.000e+08    18    
2012    5.000e+08    725800
2012    5.000e+08    5
2012    5.000e+08    8
2012    5.000e+08    1019118
2012    5.000e+08    766713
2012    5.000e+08    766713
2012    5.000e+08    766713
2012    5.000e+08    766713
2012    5.000e+08    766713
2012    5.000e+08    766713
2012    5.000e+08    766713
2012    5.000e+08    766713
2012    5.000e+08    956207
2012    5.000e+08    18
2012    5.000e+08    18
2012    5.000e+08    732861
2012    5.000e+08    732861
2012    5.000e+08    25
2012    5.000e+08    25
2012    5.000e+08    722107
2013    5.000e+08    725800
2013    5.000e+08    5
2013    5.000e+08    8
2013    5.000e+08    1019118
2013    5.000e+08    956207
2013    5.000e+08    671147
2013    5.000e+08    671147
2013    5.000e+08    732861
2013    5.000e+08    732861
2013    5.000e+08    736025
2013    5.000e+08    25
2013    5.000e+08    25
2013    5.000e+08    659845
2013    5.000e+08    659845
2013    5.000e+08    33
2013    5.000e+08    33
2013    5.000e+08    33
2013    5.000e+08    42
2013    5.000e+08    42
2013    5.000e+08    42

I only brought the first 20 obs for 4 years.

Thank you so much.

Cheers,
Paris

Tags: None

William Lisowski

Join Date: Dec 2014
Posts: 10150

20 Feb 2023, 18:31

If you are going to list numbers that contain identifiers, you should first assign them formats so that they are listed in full precision. And a negative ESTAB_ID? And the same establishment ID repeated in a given firm and year.

And it is not clear what you seek your results to be.

Applied to your sample data, the following example may start you in a useful direction.

Code:

duplicates drop
collapse (count) num=ESTAB_ID, by(NPC_FIC year)
table (year) (result), statistic(frequency) statistic(mean num) statistic(sd num) statistic(min num) statistic(max num)

Code:

. table (year) (result), statistic(frequency) statistic(mean num) statistic(sd num) statistic(mi
> n num) statistic(max num)

--------------------------------------------------------------------------------
        |  Frequency   Mean   Standard deviation   Minimum value   Maximum value
--------+-----------------------------------------------------------------------
year    |                                                                       
  2010  |          1      7                    .               7               7
  2011  |          1      9                    .               9               9
  2012  |          1     10                    .              10              10
  2013  |          1     12                    .              12              12
  Total |          4    9.5             2.081666               7              12
--------------------------------------------------------------------------------

which for your data tells us that the single firm had 7, 9, 10, and 12 establishments in 2010 through 2013.

Please note that if your dataset contains other variables than the three given here, you will want to first

Code:

keep year NPC_FIC ESTAB_ID

for this to work.

Comment

Paris Rira

Join Date: Dec 2022
Posts: 384

21 Feb 2023, 03:27

Hi Prof William,
Thank you for getting back to me and appreciated your amazing path as always.

Although the code did not result in what I seek, it clarifies the problem clearly.
Please have a look :

Code:

 duplicates drop

Duplicates in terms of all variables

(23,460,164 observations deleted)

. collapse (count) num=ESTAB_ID, by(NPC_FIC year )

. table (year) (result), statistic(frequency) statistic(mean num) statistic(sd num) statistic(min num) stat
> istic(max num)

----------------------------------------------------------------------------------------------
                  |  Frequency       Mean   Standard deviation   Minimum value   Maximum value
------------------+---------------------------------------------------------------------------
Ano de refer�ncia |                                                                           
  2010            |    264,930   1.206311              4.68689               1             992
  2011            |    249,401    1.21293             4.728623               1             925
  2012            |    233,582     1.2172             4.715511               1             883
  2013            |    230,601   1.212514             4.476316               1             795
  2014            |    235,743   1.205202             4.272109               1             776
  2015            |    240,192    1.20066              4.12317               1             747
  2016            |    245,070   1.197829              4.01613               1             755
  2017            |    249,428   1.192541              3.72733               1             743
  2018            |    253,236   1.191529             3.617836               1             698
  2019            |    249,077   1.190058             3.403862               1             681
  2020            |    248,960   1.189549             3.505815               1             689
  Total           |  2,700,220   1.201322             4.139758               1             992
----------------------------------------------------------------------------------------------

First of all, I need to know the number of firms, not Establishments.
Secondly, the purpose is not for each separate year, this information is required for summary statistics so I would show all years together not each year separately.
What is in mind is something like this:

Code:

                        number of firms         Mean          Std.dev           Min             Max
  
Number of establishments        119950          1.72          4.21              1              236

Comment

Paris Rira

Join Date: Dec 2022

Posts: 384
#4

21 Feb 2023, 05:49

Originally posted by William Lisowski View Post

If you are going to list numbers that contain identifiers, you should first assign them formats so that they are listed in full precision. And a negative ESTAB_ID? And the same establishment ID repeated in a given firm and year.

In terms of negative ESTAB_ID, I can do nothing, again its administrative dataset, I have no clue why they put negative identifiers.
About repeated establishment ID in a given firm and year, I can say that this is worker-level data, it means that if one id repeats it shows the number of workers and other demographic characteristics of workers which I already drop them because they don't matter in this case. One firm may have several establishments which must appear each year once.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

21 Feb 2023, 09:28

So let us look at your example data, with the duplicates and inconsistent establishment_id removed.

Code:

. list, sepby(year) +------------------------------------+ | npc_fic year estab_id once | |------------------------------------| 1. | 500000000 2010 5 1 | 2. | 500000000 2010 8 1 | 3. | 500000000 2010 721748 1 | 4. | 500000000 2010 725800 1 | 5. | 500000000 2010 766713 1 | 6. | 500000000 2010 794390 1 | |------------------------------------| 7. | 500000000 2011 5 0 | 8. | 500000000 2011 8 0 | 9. | 500000000 2011 18 1 | 10. | 500000000 2011 725800 0 | 11. | 500000000 2011 766713 0 | 12. | 500000000 2011 773377 1 | 13. | 500000000 2011 794390 0 | 14. | 500000000 2011 956207 1 | 15. | 500000000 2011 1019118 1 | |------------------------------------| 16. | 500000000 2012 5 0 | 17. | 500000000 2012 8 0 | 18. | 500000000 2012 18 0 | 19. | 500000000 2012 25 1 | 20. | 500000000 2012 722107 1 | 21. | 500000000 2012 725800 0 | 22. | 500000000 2012 732861 1 | 23. | 500000000 2012 766713 0 | 24. | 500000000 2012 956207 0 | 25. | 500000000 2012 1019118 0 | |------------------------------------| 26. | 500000000 2013 5 0 | 27. | 500000000 2013 8 0 | 28. | 500000000 2013 25 0 | 29. | 500000000 2013 33 1 | 30. | 500000000 2013 42 1 | 31. | 500000000 2013 659845 1 | 32. | 500000000 2013 671147 1 | 33. | 500000000 2013 725800 0 | 34. | 500000000 2013 732861 0 | 35. | 500000000 2013 736025 1 | 36. | 500000000 2013 956207 0 | 37. | 500000000 2013 1019118 0 | +------------------------------------+ .

I have looked carefully at your posts, but I cannot tell in computing your "mean establishments", do you want the mean of
the number of distinct establishments in each firm in a year, so that this data gives you four observations - 6 distinct establishments in 2010, 9 in 2011, 10 in 2012, and 12 in 2013

the number of distinct establishments in each firm over all years, so that this data gives you one observation of 18 distinct establishments for the years shown (but probably more if you had all 11 years of data)

Another question: can the same establishment ID occur in two different firms in the same year? That is, are establishment ID's distinct only within firms, or do they represent a particular establishment regardless of what firm owns it?

And now, a separate comment that I apologize for, because it cannot help but sound critical.

It is never correct to say, when confronted with data that is not consistent with other observations, that you "have no clue why they put negative identifiers" but you then use that data as if know that a negative value does not make a difference to your analysis.

Without understanding your data you have no basis for analyzing it. When you obtain data, your first step is to understand the data from the documentation for that data. It is unfortunate that in school we are given perfectly wonderful data with all the oddities removed, to allow us to focus on the statistical analysis. This does nothing to prepare us for the real world, where data never comes nicely packaged in a tidy Excel spreadsheet put together by a professor who doesn't want to have to spend time answering questions about messy data. Real data is messy and it the responsibility of the analyst to understand it before using it.

The first time I ran a regression in college, it was to fit data from a physics lab experiment to a line. My writeup was returned to me with the notation in red "plot your data first!" I then did so. The data clearly lay not on a line but on a parabola. I have forgotten most of the physics I learned a half-century ago but have never forgotten that lesson in data analysis.

We know from your earlier topic that your administrative data likely was output by Eviews. When you look at the data in Eviews, is the establishment ID negative there as well? Or do we have another case - as in your earlier topic - where the Eviews output of a "Stata dataset" was in some way different from the way the data appeared in Eviews?

Or perhaps negative establishment IDs have a particular meaning that should influence your treatment of these establishments. What does the documentation for the administrative data tell you about establishment ID?
2 likes
Comment
Paris Rira

Join Date: Dec 2022

Posts: 384
#6

21 Feb 2023, 10:07

Prof William,

I really appreciated the time you take to teach us. It's valuable.

You are totally right. I have been tought of pretty complex econometrics models in school and also during my Ph.D. But no one said anything about "cleaning Dirty data", especially in big dataset it becomes a tragedy.
Congrats you have a magnificent memory, you can remember the nature of datasets and how they have been stored! among too many users in Forum. You are more than perfect.

I am trying to answer your question which mainly is about Negative IDs.
Yes they are negative as well in the Eviews format. The point is that as you mentioned "a negative value does not make a difference to your analysis" It does not make the story if the ID is "Joa" or -10.6. All I need is the number of firms and min , max and st,dev of Establsihments.

Originally posted by William Lisowski View Post

I have looked carefully at your posts, but I cannot tell in computing your "mean establishments", do you want the mean of[LIST][*]the number of distinct establishments in each firm in a year, so that this data gives you four observations - 6 distinct establishments in 2010, 9 in 2011, 10 in 2012, and 12 in 2013[*]the number of distinct establishments in each firm over all years, so that this data gives you one observation of 18 distinct establishments for the years shown (but probably more if you had all 11 years of data)

I seek for non of them indeed.
I need to know: How many firms exist in the entire period (not year by year) to say that the min and max of establishments in this period (entire 12 year, not year by year) Such this one in #3
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

21 Feb 2023, 10:44

When you say mean and standard deviation and minimum and maximum of establishments in the 11 year period, you don't say how you count establishments:
1 observation for each firm: distinct establishments over the entire 11 years, so the firm in post #5 has 18 distinct establishments in 4 years, even though in any year it never had more than 12?

11 observations for each firm: distinct establishments in any year.

For a firm that has one different establishment in each year, it is different to say it has 11 establishments than to say it only has 1 at a time.

Do you want to be able to say
firms had up to 234 establishments over the the 11 year period, although not necessarily all in the same year

firms had up to 123 establishments in any year, although perhaps different establishments in different years

On IDs, when you say "It does not make the story if the ID is "Joa" or -10.6." the ID cannot be "Joa" because that is not a number and cannot be stored in a numeric variable in Stata.
1 like
Comment
Paris Rira

Join Date: Dec 2022

Posts: 384
#8

21 Feb 2023, 11:33

Originally posted by William Lisowski View Post

can the same establishment ID occur in two different firms in the same year?

Not at all. IDs are unique for each specific establishment.

Originally posted by William Lisowski View Post

When you say mean and standard deviation and minimum and maximum of establishments in the 11 year period, you don't say how you count establishments:
.

Respond: How many establishments, each firm has the entire 2010-2021. For example, during the 12 years, 10 different firms have only 1 plant. So we can say the min is 1. And again during this period 1000 firms have 200000 plants, so the max is 200000.
The rest span between 1 and 200000, but surely there is not any firm which has more than 200000 establishments. All I am going to present on my paper is it.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

21 Feb 2023, 14:07

Perhaps this does what you want. I altered your example data to have two firms for two years.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long npc_fic int year double estab_id
500000000 2010          5
500000000 2010          8
500000000 2010     721748
500000000 2010     725800
500000000 2010     766713
500000000 2010     794390
500000000 2011          5
500000000 2011          8
500000000 2011         18
500000000 2011     725800
500000000 2011     766713
500000000 2011     773377
500000000 2011     794390
500000000 2011     956207
500000000 2011    1019118
600000000 2010          5
600000000 2010          8
600000000 2010         18
600000000 2010         25
600000000 2010     722107
600000000 2010     725800
600000000 2010     732861
600000000 2010     766713
600000000 2010     956207
600000000 2010    1019118
600000000 2011          5
600000000 2011          8
600000000 2011         25
600000000 2011         33
600000000 2011         42
600000000 2011     659845
600000000 2011     671147
600000000 2011     725800
600000000 2011     732861
600000000 2011     736025
600000000 2011     956207
600000000 2011    1019118
end
// so now we have made up data for two firms in two years

// drop firms that aren't in all 11 years (optional - delete these lines if not needed)
duplicates drop
egen firm_year = tag(npc_fic year)
bysort npc_fic: egen years = total(firm_year)
// change 2 to 11 when running with real data
drop if years<2

// now we don't care about years any longer
keep npc_fic estab_id
duplicates drop

// create 1 observaation per firm with the number of establishments
bysort npc_fic: generate estabs=_N
bysort npc_fic: keep if _n==1

// results
collapse (count) firms=npc_fic (sum) estabs ///
         (mean) mean=estabs (sd) sd=estabs (min) min=estabs (max) max=estabs
list, clean noobs

Code:

. list, clean noobs

    firms   estabs   mean         sd   min   max  
        2       25   12.5   3.535534    10    15

Comment

Paris Rira

Join Date: Dec 2022
Posts: 384

#10

21 Feb 2023, 14:38

You are amazing prof William.
I thought that I could never be able to figure it out. Of course, without your assistance, I could not do that honestly. It was pretty time-consuming one. Thank you for your patience.

Code:

. keep npc_fic estab_id

. duplicates drop

Duplicates in terms of all variables

(26,042,919 observations deleted)

. bysort npc_fic: generate estabs=_N

. bysort npc_fic: keep if _n==1
(146,064 observations deleted)

. 
. // results
. collapse (count) firms=npc_fic (sum) estabs ///
> (mean) mean=estabs (sd) sd=estabs (min) min=estabs (max) max=estabs

. list, clean noobs

     firms   estabs       mean         sd   min    max  
    515014   661078   1.283612   4.249814     1   1099

Comment

Paris Rira

Join Date: Dec 2022

Posts: 384
#11

21 Feb 2023, 16:12

Now, I have another question very similar to this one with only one difference: instead of Establishment, "total export values"
The condition is as before, each firm's ID - npc_fic is unique within a year but they may repeat/disappear over 12 years. Again I need for number of firms, mean , sd , min and max of Total export values
The doll data is:

Code:

year npc_fic total_export 2010 100 1 2011 100 2 2011 80 2 2010 300 3 2011 300 4

Although firm with id, 80 dose not exist in the year 2010, I can say that there are 3 firms and min of total export is 1 , max of total export is 4 and the mean is 2.5.

Thank you so so much.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

#12

21 Feb 2023, 17:26

Note that this is different than the establishments: for establishments you wanted the distibution of the number of distinct establishments for each firm; here you want the distribution of total_export for each year of each firm.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int(year npc_fic) byte total_export
2010 100 1
2011 100 2
2011  80 2
2010 300 3
2011 300 4
end
// each firm appears only once for any year
isid npc_fic year
// mean sd min max for total export
summarize total_export
// number of distinct firms
egen firm = tag(npc_fic)
count if firm

Code:

. // mean sd min max for total export
. summarize total_export

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
total_export |          5         2.4    1.140175          1          4

. // number of distinct firms
. egen firm = tag(npc_fic)

. count if firm
  3

Comment

Paris Rira

Join Date: Dec 2022

Posts: 384
#13

22 Feb 2023, 05:12

Prof William,
I can't thank you enough.
It worked perfectly.
Comment

Announcement

Determining min and max of a variable

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment