Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Determining min and max of a variable

    Hi everybody,

    There are three variables.
    i. year -2010-2020
    ii. firms'ID: NPC_FIC
    ii. Establisments' ID: ESTAB_ID

    I seek the number of firms, mean of establishments, std.dev establishment, min of establishment and max of establishment. In other words, I need to know how many firms exist in this period (2010-2020) and whats mean, min and max of establishments. Surely some firms build more plants and some exist from the market during11 years. Since each year the number of firms (not too much, less than 100) is it possible to conclude and reach a definite number of firms during this period? i.e. I can say there 298,000 firms in the period 2010-2020 and mean of the establishments is 10 ....

    Code:
    sort year NPC_FIC NPC_FIC
    
    year    NPC_FIC    ESTAB_ID
    2010    5.000e+08    -8.146e+09
    2010    5.000e+08    -8.146e+09
    2010    5.000e+08    -8.146e+09
    2010    5.000e+08    794390
    2010    5.000e+08    794390
    2010    5.000e+08    794390
    2010    5.000e+08    794390
    2010    5.000e+08    794390
    2010    5.000e+08    725800
    2010    5.000e+08    725800
    2010    5.000e+08    5
    2010    5.000e+08    5
    2010    5.000e+08    8
    2010    5.000e+08    721748
    2010    5.000e+08    721748
    2010    5.000e+08    766713
    2010    5.000e+08    766713
    2010    5.000e+08    766713
    2010    5.000e+08    766713
    2010    5.000e+08    766713
    2011    5.000e+08    794390    
    2011    5.000e+08    794390    
    2011    5.000e+08    794390    
    2011    5.000e+08    725800    
    2011    5.000e+08    725800    
    2011    5.000e+08    5    
    2011    5.000e+08    8    
    2011    5.000e+08    773377    
    2011    5.000e+08    1019118    
    2011    5.000e+08    766713    
    2011    5.000e+08    766713    
    2011    5.000e+08    766713    
    2011    5.000e+08    766713    
    2011    5.000e+08    766713    
    2011    5.000e+08    766713    
    2011    5.000e+08    766713    
    2011    5.000e+08    766713    
    2011    5.000e+08    956207    
    2011    5.000e+08    18    
    2011    5.000e+08    18    
    2012    5.000e+08    725800
    2012    5.000e+08    5
    2012    5.000e+08    8
    2012    5.000e+08    1019118
    2012    5.000e+08    766713
    2012    5.000e+08    766713
    2012    5.000e+08    766713
    2012    5.000e+08    766713
    2012    5.000e+08    766713
    2012    5.000e+08    766713
    2012    5.000e+08    766713
    2012    5.000e+08    766713
    2012    5.000e+08    956207
    2012    5.000e+08    18
    2012    5.000e+08    18
    2012    5.000e+08    732861
    2012    5.000e+08    732861
    2012    5.000e+08    25
    2012    5.000e+08    25
    2012    5.000e+08    722107
    2013    5.000e+08    725800
    2013    5.000e+08    5
    2013    5.000e+08    8
    2013    5.000e+08    1019118
    2013    5.000e+08    956207
    2013    5.000e+08    671147
    2013    5.000e+08    671147
    2013    5.000e+08    732861
    2013    5.000e+08    732861
    2013    5.000e+08    736025
    2013    5.000e+08    25
    2013    5.000e+08    25
    2013    5.000e+08    659845
    2013    5.000e+08    659845
    2013    5.000e+08    33
    2013    5.000e+08    33
    2013    5.000e+08    33
    2013    5.000e+08    42
    2013    5.000e+08    42
    2013    5.000e+08    42
    I only brought the first 20 obs for 4 years.

    Thank you so much.

    Cheers,
    Paris

  • #2
    If you are going to list numbers that contain identifiers, you should first assign them formats so that they are listed in full precision. And a negative ESTAB_ID? And the same establishment ID repeated in a given firm and year.

    And it is not clear what you seek your results to be.

    Applied to your sample data, the following example may start you in a useful direction.
    Code:
    duplicates drop
    collapse (count) num=ESTAB_ID, by(NPC_FIC year)
    table (year) (result), statistic(frequency) statistic(mean num) statistic(sd num) statistic(min num) statistic(max num)
    Code:
    . table (year) (result), statistic(frequency) statistic(mean num) statistic(sd num) statistic(mi
    > n num) statistic(max num)
    
    --------------------------------------------------------------------------------
            |  Frequency   Mean   Standard deviation   Minimum value   Maximum value
    --------+-----------------------------------------------------------------------
    year    |                                                                       
      2010  |          1      7                    .               7               7
      2011  |          1      9                    .               9               9
      2012  |          1     10                    .              10              10
      2013  |          1     12                    .              12              12
      Total |          4    9.5             2.081666               7              12
    --------------------------------------------------------------------------------
    which for your data tells us that the single firm had 7, 9, 10, and 12 establishments in 2010 through 2013.

    Please note that if your dataset contains other variables than the three given here, you will want to first
    Code:
    keep year NPC_FIC ESTAB_ID
    for this to work.

    Comment


    • #3
      Hi Prof William,
      Thank you for getting back to me and appreciated your amazing path as always.

      Although the code did not result in what I seek, it clarifies the problem clearly.
      Please have a look :
      Code:
       duplicates drop
      
      Duplicates in terms of all variables
      
      (23,460,164 observations deleted)
      
      . collapse (count) num=ESTAB_ID, by(NPC_FIC year )
      
      . table (year) (result), statistic(frequency) statistic(mean num) statistic(sd num) statistic(min num) stat
      > istic(max num)
      
      ----------------------------------------------------------------------------------------------
                        |  Frequency       Mean   Standard deviation   Minimum value   Maximum value
      ------------------+---------------------------------------------------------------------------
      Ano de refer�ncia |                                                                           
        2010            |    264,930   1.206311              4.68689               1             992
        2011            |    249,401    1.21293             4.728623               1             925
        2012            |    233,582     1.2172             4.715511               1             883
        2013            |    230,601   1.212514             4.476316               1             795
        2014            |    235,743   1.205202             4.272109               1             776
        2015            |    240,192    1.20066              4.12317               1             747
        2016            |    245,070   1.197829              4.01613               1             755
        2017            |    249,428   1.192541              3.72733               1             743
        2018            |    253,236   1.191529             3.617836               1             698
        2019            |    249,077   1.190058             3.403862               1             681
        2020            |    248,960   1.189549             3.505815               1             689
        Total           |  2,700,220   1.201322             4.139758               1             992
      ----------------------------------------------------------------------------------------------
      First of all, I need to know the number of firms, not Establishments.
      Secondly, the purpose is not for each separate year, this information is required for summary statistics so I would show all years together not each year separately.
      What is in mind is something like this:
      Code:
                              number of firms         Mean          Std.dev           Min             Max
        
      Number of establishments        119950          1.72          4.21              1              236

      Comment


      • #4
        Originally posted by William Lisowski View Post
        If you are going to list numbers that contain identifiers, you should first assign them formats so that they are listed in full precision. And a negative ESTAB_ID? And the same establishment ID repeated in a given firm and year.
        In terms of negative ESTAB_ID, I can do nothing, again its administrative dataset, I have no clue why they put negative identifiers.
        About repeated establishment ID in a given firm and year, I can say that this is worker-level data, it means that if one id repeats it shows the number of workers and other demographic characteristics of workers which I already drop them because they don't matter in this case. One firm may have several establishments which must appear each year once.

        Comment


        • #5
          So let us look at your example data, with the duplicates and inconsistent establishment_id removed.
          Code:
          . list, sepby(year)
          
               +------------------------------------+
               |   npc_fic   year   estab_id   once |
               |------------------------------------|
            1. | 500000000   2010          5      1 |
            2. | 500000000   2010          8      1 |
            3. | 500000000   2010     721748      1 |
            4. | 500000000   2010     725800      1 |
            5. | 500000000   2010     766713      1 |
            6. | 500000000   2010     794390      1 |
               |------------------------------------|
            7. | 500000000   2011          5      0 |
            8. | 500000000   2011          8      0 |
            9. | 500000000   2011         18      1 |
           10. | 500000000   2011     725800      0 |
           11. | 500000000   2011     766713      0 |
           12. | 500000000   2011     773377      1 |
           13. | 500000000   2011     794390      0 |
           14. | 500000000   2011     956207      1 |
           15. | 500000000   2011    1019118      1 |
               |------------------------------------|
           16. | 500000000   2012          5      0 |
           17. | 500000000   2012          8      0 |
           18. | 500000000   2012         18      0 |
           19. | 500000000   2012         25      1 |
           20. | 500000000   2012     722107      1 |
           21. | 500000000   2012     725800      0 |
           22. | 500000000   2012     732861      1 |
           23. | 500000000   2012     766713      0 |
           24. | 500000000   2012     956207      0 |
           25. | 500000000   2012    1019118      0 |
               |------------------------------------|
           26. | 500000000   2013          5      0 |
           27. | 500000000   2013          8      0 |
           28. | 500000000   2013         25      0 |
           29. | 500000000   2013         33      1 |
           30. | 500000000   2013         42      1 |
           31. | 500000000   2013     659845      1 |
           32. | 500000000   2013     671147      1 |
           33. | 500000000   2013     725800      0 |
           34. | 500000000   2013     732861      0 |
           35. | 500000000   2013     736025      1 |
           36. | 500000000   2013     956207      0 |
           37. | 500000000   2013    1019118      0 |
               +------------------------------------+
          
          .
          I have looked carefully at your posts, but I cannot tell in computing your "mean establishments", do you want the mean of
          • the number of distinct establishments in each firm in a year, so that this data gives you four observations - 6 distinct establishments in 2010, 9 in 2011, 10 in 2012, and 12 in 2013
          • the number of distinct establishments in each firm over all years, so that this data gives you one observation of 18 distinct establishments for the years shown (but probably more if you had all 11 years of data)
          Another question: can the same establishment ID occur in two different firms in the same year? That is, are establishment ID's distinct only within firms, or do they represent a particular establishment regardless of what firm owns it?

          And now, a separate comment that I apologize for, because it cannot help but sound critical.

          It is never correct to say, when confronted with data that is not consistent with other observations, that you "have no clue why they put negative identifiers" but you then use that data as if know that a negative value does not make a difference to your analysis.

          Without understanding your data you have no basis for analyzing it. When you obtain data, your first step is to understand the data from the documentation for that data. It is unfortunate that in school we are given perfectly wonderful data with all the oddities removed, to allow us to focus on the statistical analysis. This does nothing to prepare us for the real world, where data never comes nicely packaged in a tidy Excel spreadsheet put together by a professor who doesn't want to have to spend time answering questions about messy data. Real data is messy and it the responsibility of the analyst to understand it before using it.

          The first time I ran a regression in college, it was to fit data from a physics lab experiment to a line. My writeup was returned to me with the notation in red "plot your data first!" I then did so. The data clearly lay not on a line but on a parabola. I have forgotten most of the physics I learned a half-century ago but have never forgotten that lesson in data analysis.

          We know from your earlier topic that your administrative data likely was output by Eviews. When you look at the data in Eviews, is the establishment ID negative there as well? Or do we have another case - as in your earlier topic - where the Eviews output of a "Stata dataset" was in some way different from the way the data appeared in Eviews?

          Or perhaps negative establishment IDs have a particular meaning that should influence your treatment of these establishments. What does the documentation for the administrative data tell you about establishment ID?

          Comment


          • #6
            Prof William,

            I really appreciated the time you take to teach us. It's valuable.

            You are totally right. I have been tought of pretty complex econometrics models in school and also during my Ph.D. But no one said anything about "cleaning Dirty data", especially in big dataset it becomes a tragedy.
            Congrats you have a magnificent memory, you can remember the nature of datasets and how they have been stored! among too many users in Forum. You are more than perfect.

            I am trying to answer your question which mainly is about Negative IDs.
            Yes they are negative as well in the Eviews format. The point is that as you mentioned "a negative value does not make a difference to your analysis" It does not make the story if the ID is "Joa" or -10.6. All I need is the number of firms and min , max and st,dev of Establsihments.


            Originally posted by William Lisowski View Post
            I have looked carefully at your posts, but I cannot tell in computing your "mean establishments", do you want the mean of[LIST][*]the number of distinct establishments in each firm in a year, so that this data gives you four observations - 6 distinct establishments in 2010, 9 in 2011, 10 in 2012, and 12 in 2013[*]the number of distinct establishments in each firm over all years, so that this data gives you one observation of 18 distinct establishments for the years shown (but probably more if you had all 11 years of data)
            I seek for non of them indeed.
            I need to know: How many firms exist in the entire period (not year by year) to say that the min and max of establishments in this period (entire 12 year, not year by year) Such this one in #3

            Comment


            • #7
              When you say mean and standard deviation and minimum and maximum of establishments in the 11 year period, you don't say how you count establishments:
              • 1 observation for each firm: distinct establishments over the entire 11 years, so the firm in post #5 has 18 distinct establishments in 4 years, even though in any year it never had more than 12?
              • 11 observations for each firm: distinct establishments in any year.
              For a firm that has one different establishment in each year, it is different to say it has 11 establishments than to say it only has 1 at a time.

              Do you want to be able to say
              • firms had up to 234 establishments over the the 11 year period, although not necessarily all in the same year
              • firms had up to 123 establishments in any year, although perhaps different establishments in different years
              On IDs, when you say "It does not make the story if the ID is "Joa" or -10.6." the ID cannot be "Joa" because that is not a number and cannot be stored in a numeric variable in Stata.

              Comment


              • #8
                Originally posted by William Lisowski View Post
                can the same establishment ID occur in two different firms in the same year?
                Not at all. IDs are unique for each specific establishment.

                Originally posted by William Lisowski View Post
                When you say mean and standard deviation and minimum and maximum of establishments in the 11 year period, you don't say how you count establishments:
                .
                Respond: How many establishments, each firm has the entire 2010-2021. For example, during the 12 years, 10 different firms have only 1 plant. So we can say the min is 1. And again during this period 1000 firms have 200000 plants, so the max is 200000.
                The rest span between 1 and 200000, but surely there is not any firm which has more than 200000 establishments. All I am going to present on my paper is it.


                Comment


                • #9
                  Perhaps this does what you want. I altered your example data to have two firms for two years.
                  Code:
                  * Example generated by -dataex-. For more info, type help dataex
                  clear
                  input long npc_fic int year double estab_id
                  500000000 2010          5
                  500000000 2010          8
                  500000000 2010     721748
                  500000000 2010     725800
                  500000000 2010     766713
                  500000000 2010     794390
                  500000000 2011          5
                  500000000 2011          8
                  500000000 2011         18
                  500000000 2011     725800
                  500000000 2011     766713
                  500000000 2011     773377
                  500000000 2011     794390
                  500000000 2011     956207
                  500000000 2011    1019118
                  600000000 2010          5
                  600000000 2010          8
                  600000000 2010         18
                  600000000 2010         25
                  600000000 2010     722107
                  600000000 2010     725800
                  600000000 2010     732861
                  600000000 2010     766713
                  600000000 2010     956207
                  600000000 2010    1019118
                  600000000 2011          5
                  600000000 2011          8
                  600000000 2011         25
                  600000000 2011         33
                  600000000 2011         42
                  600000000 2011     659845
                  600000000 2011     671147
                  600000000 2011     725800
                  600000000 2011     732861
                  600000000 2011     736025
                  600000000 2011     956207
                  600000000 2011    1019118
                  end
                  // so now we have made up data for two firms in two years
                  
                  // drop firms that aren't in all 11 years (optional - delete these lines if not needed)
                  duplicates drop
                  egen firm_year = tag(npc_fic year)
                  bysort npc_fic: egen years = total(firm_year)
                  // change 2 to 11 when running with real data
                  drop if years<2
                  
                  // now we don't care about years any longer
                  keep npc_fic estab_id
                  duplicates drop
                  
                  // create 1 observaation per firm with the number of establishments
                  bysort npc_fic: generate estabs=_N
                  bysort npc_fic: keep if _n==1
                  
                  // results
                  collapse (count) firms=npc_fic (sum) estabs ///
                           (mean) mean=estabs (sd) sd=estabs (min) min=estabs (max) max=estabs
                  list, clean noobs
                  Code:
                  . list, clean noobs
                  
                      firms   estabs   mean         sd   min   max  
                          2       25   12.5   3.535534    10    15

                  Comment


                  • #10
                    You are amazing prof William.
                    I thought that I could never be able to figure it out. Of course, without your assistance, I could not do that honestly. It was pretty time-consuming one. Thank you for your patience.
                    Code:
                    . keep npc_fic estab_id
                    
                    . duplicates drop
                    
                    Duplicates in terms of all variables
                    
                    (26,042,919 observations deleted)
                    
                    . bysort npc_fic: generate estabs=_N
                    
                    . bysort npc_fic: keep if _n==1
                    (146,064 observations deleted)
                    
                    . 
                    . // results
                    . collapse (count) firms=npc_fic (sum) estabs ///
                    > (mean) mean=estabs (sd) sd=estabs (min) min=estabs (max) max=estabs
                    
                    . list, clean noobs
                    
                         firms   estabs       mean         sd   min    max  
                        515014   661078   1.283612   4.249814     1   1099

                    Comment


                    • #11
                      Now, I have another question very similar to this one with only one difference: instead of Establishment, "total export values"
                      The condition is as before, each firm's ID - npc_fic is unique within a year but they may repeat/disappear over 12 years. Again I need for number of firms, mean , sd , min and max of Total export values
                      The doll data is:
                      Code:
                      year     npc_fic   total_export
                      2010      100          1
                      2011      100          2
                      2011      80           2
                      2010      300          3
                      2011      300          4
                      Although firm with id, 80 dose not exist in the year 2010, I can say that there are 3 firms and min of total export is 1 , max of total export is 4 and the mean is 2.5.

                      Thank you so so much.



                      Comment


                      • #12
                        Note that this is different than the establishments: for establishments you wanted the distibution of the number of distinct establishments for each firm; here you want the distribution of total_export for each year of each firm.
                        Code:
                        * Example generated by -dataex-. For more info, type help dataex
                        clear
                        input int(year npc_fic) byte total_export
                        2010 100 1
                        2011 100 2
                        2011  80 2
                        2010 300 3
                        2011 300 4
                        end
                        // each firm appears only once for any year
                        isid npc_fic year
                        // mean sd min max for total export
                        summarize total_export
                        // number of distinct firms
                        egen firm = tag(npc_fic)
                        count if firm
                        Code:
                        . // mean sd min max for total export
                        . summarize total_export
                        
                            Variable |        Obs        Mean    Std. dev.       Min        Max
                        -------------+---------------------------------------------------------
                        total_export |          5         2.4    1.140175          1          4
                        
                        . // number of distinct firms
                        . egen firm = tag(npc_fic)
                        
                        . count if firm
                          3

                        Comment


                        • #13
                          Prof William,
                          I can't thank you enough.
                          It worked perfectly.

                          Comment

                          Working...
                          X