Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Normalizing variables to create an index

    Dear Stata members
    I am trying learn an index which works on the % of textual counts of certain words to total words. However, ultimately I need to create an index based on the below logic
    "The monthly index is the number of articles related to external shocks divided by the total number of articles and is then normalized to have an average value of 100". In another article it is stated as "Subsequently, the authors divided each monthly count by the mean (from 2000 to 2009) of the series and multiplied it by 100 to obtain normalized value index".


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(month COUNT_USA COUNT_VEN COUNT_TWN COUNT_UKR)
    480  1.198969    .0269768  .05619781  .02408478
    481 1.2321326  .018208856  .13607757 .025514543
    482 1.0100458   .01916076  .23037936  .03839656
    483 1.0314224  .011993283  .05822659  .04159042
    484 1.3923223  .011365897  .09688358 .032294527
    485 1.2740495   .02004123  .07413509  .04118616
    486  .9578033  .014826676   .0967898 .008065817
    487   .820109   .04737091 .031365167 .007841292
    488 1.1048642   .03076837 .068083815 .015129738
    489 1.7102947   .04124505  .04268944 .014229812
    490  .9510832   .02780945  .02993788  .01496894
    491  .9818148  .037049614 .031142946 .015571473
    492 1.0246317   .03762772  .14675668  .03668917
    493 1.1720848  .009477775  .04025117   .0483014
    494 1.1910303  .016428005   .1081237  .05766597
    495 1.1368091   .02110786  .23651484 .022888534
    496 1.2756264  .011389522  .08127078  .02955301
    497  1.512445   .01488627 .031407036  .08636934
    498  1.317399 .0031366644  .05504876 .023592325
    499  1.349414   .01558215  .03089996  .04634994
    500  13.22901   .06319591  .09726171  .02244501
    501 13.100196  .068948396  .11903928  .09803235
    502  7.890069   .04071543 .065198496  .02897711
    503  6.047551  .025530556 .068156004  .03786445
    504  4.232385  .027403934  .02239642  .02986189
    506  3.863307  .031856358  .07072636  .02121791
    507  3.840491   .15030675  .01544998  .00772499
    508  4.089164   .05356983  .05158058 .014737307
    509  3.673885  .037016474  .02295157 .007650524
    510  2.954308  .009499382  .04058112  .01623245
    511  3.130244  .035497613 .032226875 .032226875
    512 4.3860188   .04300018  .03000525  .05250919
    513 4.1565065   .06166092  .05547081 .027735405
    514  3.697484   .03830063  .02202805  .09545488
    515  4.328755    .3329811  .04454012  .02227006
    516  4.968297   .27088976  .04305087  .02870058
    517  5.824158   .28083614  .04606172  .02303086
    518   9.00474    .1912364  .05243839  .05243839
    519  6.587385   .08744317  .04202564  .04902991
    520  4.000343  .022842785   .0851547   .0283849
    521  3.549583   .03917196 .065592885 .021864295
    522 3.4111476   .03067579  .00762079  .01524158
    523 3.3566434   .03108003  .05148194 .036772817
    524  2.818476    .0353045  .00744879  .00744879
    525  2.644918   .02965759  .04014721 .033456005
    526 2.8708134   .01805543  .05817759 .029088793
    527  3.160041    .0127421  .14273909  .02253775
    528  2.693432  .023862077  .05884949 .007356186
    529  2.816728   .02465407 .029502876 .014751438
    530  3.165661   .08007092   .0508167  .04355717
    531  3.546311  .017880557   .1105298  .12526712
    532  2.880948   .04987385  .06508063 .014462362
    533 3.3609934   .05969793   .0227704 .007590133
    534    3.1758  .015079772  .03765627 .022593765
    535  3.450163    .1033529 .014394702 .035986755
    536 3.5085096  .023928454  .00727855   .0291142
    537  2.939661   .00858713 .013637913 .013637913
    538  2.974993  .036024135 .030355923  .12901267
    539 2.5848656   .02222857   .0754774   .1886935
    end
    format %tm month



    In the above dataset, COUNT_USA represents the percent of articles for USA and COUNT_VEN represents percent of articles with Veneuzuela. Based on the above statements can we convert these country figures into normalized values to have an average value of 100.

  • #2
    The recipe is surely:

    1. Calculate the mean of a variable over whatever block of observations it refers to. egen can be helpful here.

    2. Scale by 100 / mean to get mean 100.

    Comment


    • #3
      Thanks Nick Cox. Can I ask you for some more help? I need to find the mean by the country during the period 2000m1-2002m12.
      I tried
      egen v1= mean ( COUNT_USA ) if inrange( month , 2000m1, 2002m12) but it is not working.
      So how to get the average values within my data range.

      Also, can you help me to get a single code that can combine your 1 and 2



      Edits: Somewhere I saw ready made egen functions like below, can this be used in my context??
      egen new_var =std( COUNT_TWN ), mean(100)
      Last edited by lal mohan kumar; 20 Jul 2022, 04:17.

      Comment


      • #4
        As you have found, that is not going to work, but this might get you closer:

        Code:
        su COUNT_USA if inrange(month, ym(2000, 1), ym(2002, 12)), meanonly
        
        gen wanted_US = 100 * COUNT_USA / r(mean)

        Or check out the function

        Code:
        help tin()
        An egen solutioncould be

        Code:
        egen wanted_US = mean(cond(tin(month, 2000m1, 2002m12), COUNT_USA, .))
        
        replace wanted_US = 100 * COUNT_USA  / wanted_US
        The code in #1 presumably failed because inrange() doesn't do what you want, but it would produce missing values outside 2000-2002, which are awkward.

        "not working" is a poor problem report. See our FAQ Advice. I once wrote down about 20 versions of what that might mean, which certainly start with

        * my code is not even legal

        * my code worked but the results seem wrong or not what I want.

        Comment


        • #5
          Thanks a lot Nick Cox for the support and instructions (I am sorry for not following the protocols, the code is not working implied, I gave a garbage instruction to Stata and Stata couldnt understand)
          Regarding the commands with respect to egen, this is what Stata throws back

          Code:
          egen wanted_US = mean(cond(tin(month, 2000m1, 2002m12), COUNT_USA, .))
          time-series operators not allowed
          r(101);
          .
          .I tried myself with some changes in code but in vain.

          .


          Comment


          • #6
            You need first to

            Code:
            tsset month
            This is explained in the help for tin(). which I recommended you to read.

            There may be other problems; I didn't try to implement the code.

            Comment


            • #7
              The code in #4, copied in #5 confuses the syntax of -inrange()- and -tin()-. Even after -tsset month-, -egen wanted_US = mean(cond(tin(month, 2000m1, 2002m12), COUNT_USA, .))- will throw a syntax error. The -tin()- function does not take three arguments: it takes only the last two shown. For -tin-, what corresponds to the first argument of -inrange()- is not explicitly specified, as Stata understands it to be the time variable given in the -tsset- command. The correct usage of -tin()- would be:
              Code:
              tsset month
              egen wanted_US = mean(cond(tin(2000m1, 2002m12), COUNT_USA, .))

              Comment


              • #8
                Clyde Schechter is right. Sorry.

                Comment


                • #9
                  Thanks Nick Cox, for the excellent help and support. I didn't check with function -tin- as you instructed, hence I asked you again why the command throws an error. Later I figured it out by reading about tin. The code works really well. Thanks Clyde Schechter for pointing this out by looking at this thread in detail.

                  I have one more doubt in this regard, as I have here 4 countries in the example (COUNT_USA COUNT_VEN COUNT_TWN COUNT_UKR), is there a way to do it in a quick way rather than doing it one by one as follows

                  Code:
                  *1) For USA
                  
                  su COUNT_USA if inrange(month, ym(2000, 1), ym(2002, 12)), meanonly
                  gen wanted_US = 100 * COUNT_USA / r(mean)
                  
                  *2) For Venuezela
                  su COUNT_VEN if inrange(month, ym(2000, 1), ym(2002, 12)), meanonly
                  gen wanted_VEN = 100 * COUNT_VEN/ r(mean)
                  I have some 15 countries and I am afraid that if I do it manually, I may do it erroneously

                  Comment


                  • #10
                    Code:
                    foreach v of var COUNT_* {
                          
                    su `v' if inrange(month, ym(2000, 1), ym(2002, 12)), meanonly  
                    
                    local new = subinstr("`v'", "COUNT", "wanted", 1)    
                    
                    gen `new' = `v' * 100 / r(mean)  
                    
                    }
                    Last edited by Nick Cox; 20 Jul 2022, 11:12.

                    Comment


                    • #11
                      Thanks Nick Cox for the extended help and excellent support. I am delighted to seethe magical command (#9) for replacing repeated commands.

                      Comment

                      Working...
                      X