Normalizing variables to create an index

lal mohan kumar

Join Date: May 2019
Posts: 265

Normalizing variables to create an index

20 Jul 2022, 02:57

Dear Stata members
I am trying learn an index which works on the % of textual counts of certain words to total words. However, ultimately I need to create an index based on the below logic
"The monthly index is the number of articles related to external shocks divided by the total number of articles and is then normalized to have an average value of 100". In another article it is stated as "Subsequently, the authors divided each monthly count by the mean (from 2000 to 2009) of the series and multiplied it by 100 to obtain normalized value index".

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(month COUNT_USA COUNT_VEN COUNT_TWN COUNT_UKR)
480  1.198969    .0269768  .05619781  .02408478
481 1.2321326  .018208856  .13607757 .025514543
482 1.0100458   .01916076  .23037936  .03839656
483 1.0314224  .011993283  .05822659  .04159042
484 1.3923223  .011365897  .09688358 .032294527
485 1.2740495   .02004123  .07413509  .04118616
486  .9578033  .014826676   .0967898 .008065817
487   .820109   .04737091 .031365167 .007841292
488 1.1048642   .03076837 .068083815 .015129738
489 1.7102947   .04124505  .04268944 .014229812
490  .9510832   .02780945  .02993788  .01496894
491  .9818148  .037049614 .031142946 .015571473
492 1.0246317   .03762772  .14675668  .03668917
493 1.1720848  .009477775  .04025117   .0483014
494 1.1910303  .016428005   .1081237  .05766597
495 1.1368091   .02110786  .23651484 .022888534
496 1.2756264  .011389522  .08127078  .02955301
497  1.512445   .01488627 .031407036  .08636934
498  1.317399 .0031366644  .05504876 .023592325
499  1.349414   .01558215  .03089996  .04634994
500  13.22901   .06319591  .09726171  .02244501
501 13.100196  .068948396  .11903928  .09803235
502  7.890069   .04071543 .065198496  .02897711
503  6.047551  .025530556 .068156004  .03786445
504  4.232385  .027403934  .02239642  .02986189
506  3.863307  .031856358  .07072636  .02121791
507  3.840491   .15030675  .01544998  .00772499
508  4.089164   .05356983  .05158058 .014737307
509  3.673885  .037016474  .02295157 .007650524
510  2.954308  .009499382  .04058112  .01623245
511  3.130244  .035497613 .032226875 .032226875
512 4.3860188   .04300018  .03000525  .05250919
513 4.1565065   .06166092  .05547081 .027735405
514  3.697484   .03830063  .02202805  .09545488
515  4.328755    .3329811  .04454012  .02227006
516  4.968297   .27088976  .04305087  .02870058
517  5.824158   .28083614  .04606172  .02303086
518   9.00474    .1912364  .05243839  .05243839
519  6.587385   .08744317  .04202564  .04902991
520  4.000343  .022842785   .0851547   .0283849
521  3.549583   .03917196 .065592885 .021864295
522 3.4111476   .03067579  .00762079  .01524158
523 3.3566434   .03108003  .05148194 .036772817
524  2.818476    .0353045  .00744879  .00744879
525  2.644918   .02965759  .04014721 .033456005
526 2.8708134   .01805543  .05817759 .029088793
527  3.160041    .0127421  .14273909  .02253775
528  2.693432  .023862077  .05884949 .007356186
529  2.816728   .02465407 .029502876 .014751438
530  3.165661   .08007092   .0508167  .04355717
531  3.546311  .017880557   .1105298  .12526712
532  2.880948   .04987385  .06508063 .014462362
533 3.3609934   .05969793   .0227704 .007590133
534    3.1758  .015079772  .03765627 .022593765
535  3.450163    .1033529 .014394702 .035986755
536 3.5085096  .023928454  .00727855   .0291142
537  2.939661   .00858713 .013637913 .013637913
538  2.974993  .036024135 .030355923  .12901267
539 2.5848656   .02222857   .0754774   .1886935
end
format %tm month

In the above dataset, COUNT_USA represents the percent of articles for USA and COUNT_VEN represents percent of articles with Veneuzuela. Based on the above statements can we convert these country figures into normalized values to have an average value of 100.

Tags: None

Nick Cox

Join Date: Mar 2014

Posts: 36054
#2

20 Jul 2022, 03:21

The recipe is surely:

1. Calculate the mean of a variable over whatever block of observations it refers to. egen can be helpful here.

2. Scale by 100 / mean to get mean 100.
Comment
lal mohan kumar

Join Date: May 2019

Posts: 265
#3

20 Jul 2022, 03:43

Thanks Nick Cox. Can I ask you for some more help? I need to find the mean by the country during the period 2000m1-2002m12.
I tried
egen v1= mean ( COUNT_USA ) if inrange( month , 2000m1, 2002m12) but it is not working.
So how to get the average values within my data range.

Also, can you help me to get a single code that can combine your 1 and 2

Edits: Somewhere I saw ready made egen functions like below, can this be used in my context??
egen new_var =std( COUNT_TWN ), mean(100)

Last edited by lal mohan kumar; 20 Jul 2022, 04:17.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36054
#4

20 Jul 2022, 04:20

As you have found, that is not going to work, but this might get you closer:

Code:

su COUNT_USA if inrange(month, ym(2000, 1), ym(2002, 12)), meanonly gen wanted_US = 100 * COUNT_USA / r(mean)

Or check out the function

Code:

help tin()

An egen solutioncould be

Code:

egen wanted_US = mean(cond(tin(month, 2000m1, 2002m12), COUNT_USA, .)) replace wanted_US = 100 * COUNT_USA / wanted_US

The code in #1 presumably failed because inrange() doesn't do what you want, but it would produce missing values outside 2000-2002, which are awkward.

"not working" is a poor problem report. See our FAQ Advice. I once wrote down about 20 versions of what that might mean, which certainly start with

* my code is not even legal

* my code worked but the results seem wrong or not what I want.
Comment
lal mohan kumar

Join Date: May 2019

Posts: 265
#5

20 Jul 2022, 04:47

Thanks a lot Nick Cox for the support and instructions (I am sorry for not following the protocols, the code is not working implied, I gave a garbage instruction to Stata and Stata couldnt understand)
Regarding the commands with respect to egen, this is what Stata throws back

Code:

egen wanted_US = mean(cond(tin(month, 2000m1, 2002m12), COUNT_USA, .)) time-series operators not allowed r(101);

.
.I tried myself with some changes in code but in vain.

.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36054
#6

20 Jul 2022, 04:57

You need first to

Code:

tsset month

This is explained in the help for tin(). which I recommended you to read.

There may be other problems; I didn't try to implement the code.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#7

20 Jul 2022, 09:14

The code in #4, copied in #5 confuses the syntax of -inrange()- and -tin()-. Even after -tsset month-, -egen wanted_US = mean(cond(tin(month, 2000m1, 2002m12), COUNT_USA, .))- will throw a syntax error. The -tin()- function does not take three arguments: it takes only the last two shown. For -tin-, what corresponds to the first argument of -inrange()- is not explicitly specified, as Stata understands it to be the time variable given in the -tsset- command. The correct usage of -tin()- would be:

Code:

tsset month egen wanted_US = mean(cond(tin(2000m1, 2002m12), COUNT_USA, .))
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36054
#8

20 Jul 2022, 09:45

Clyde Schechter is right. Sorry.
Comment
lal mohan kumar

Join Date: May 2019

Posts: 265
#9

20 Jul 2022, 11:00

Thanks Nick Cox, for the excellent help and support. I didn't check with function -tin- as you instructed, hence I asked you again why the command throws an error. Later I figured it out by reading about tin. The code works really well. Thanks Clyde Schechter for pointing this out by looking at this thread in detail.

I have one more doubt in this regard, as I have here 4 countries in the example (COUNT_USA COUNT_VEN COUNT_TWN COUNT_UKR), is there a way to do it in a quick way rather than doing it one by one as follows

Code:

*1) For USA su COUNT_USA if inrange(month, ym(2000, 1), ym(2002, 12)), meanonly gen wanted_US = 100 * COUNT_USA / r(mean) *2) For Venuezela su COUNT_VEN if inrange(month, ym(2000, 1), ym(2002, 12)), meanonly gen wanted_VEN = 100 * COUNT_VEN/ r(mean)

I have some 15 countries and I am afraid that if I do it manually, I may do it erroneously
Comment

Nick Cox

Join Date: Mar 2014
Posts: 36054

#10

20 Jul 2022, 11:10

Code:

foreach v of var COUNT_* {
      
su `v' if inrange(month, ym(2000, 1), ym(2002, 12)), meanonly  

local new = subinstr("`v'", "COUNT", "wanted", 1)    

gen `new' = `v' * 100 / r(mean)  

}

Last edited by Nick Cox; 20 Jul 2022, 11:12.

Comment

lal mohan kumar

Join Date: May 2019

Posts: 265
#11

20 Jul 2022, 13:42

Thanks Nick Cox for the extended help and excellent support. I am delighted to seethe magical command (#9) for replacing repeated commands.
Comment

Announcement

Normalizing variables to create an index

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment