Convert values and sum product to one within Index Range

Ian Bell

Join Date: Feb 2015

Posts: 27
#1

Convert values and sum product to one within Index Range

03 Sep 2018, 08:33

I have around 78 columns and 160,000 rows with values that I would like to convert to probabilities within each data set.
On average each data set would be around 9 with a maximum of 36 and a minimum of 2. Therefore the approx number
of data sets would be 160,000 / 9 = 17,777.
Below is a simple example using a two and a three data set sample.:
Index Val prob sumprob result sumcheck
1 20 .05 0.4 .125
1 30 .33 0.4 .825
1 50 .02 0.4 .05 1 (index 1)
2 90 .11 0.25 .44
2 70 .14 . 0.25 .56 1 (index 2)
I am code illiterate so any help much appreciated.
Data is currently stored in Excel. I bought Stata Version 12 quite a while ago and still a virgin
Thanks
Ian
Tags: None
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

03 Sep 2018, 09:46

Here is code that reproduces what you seem to want.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(Index Val prob) 1 20 .05 1 30 .33 1 50 .02 2 90 .11 2 70 .14 end sort Index, stable by Index: egen sumprob = total(prob) generate result = prob/sumprob by Index: egen sumcheck = total(result) // confirm results, allow for some rounding error assert abs(sumcheck-1)<.001 list, noobs sepby(Index)

Code:

. list, noobs sepby(Index) +--------------------------------------------------+ | Index Val prob sumprob result sumcheck | |--------------------------------------------------| | 1 20 .05 .4 .125 1 | | 1 30 .33 .4 .825 1 | | 1 50 .02 .4 .05 1 | |--------------------------------------------------| | 2 90 .11 .25 .44 1 | | 2 70 .14 .25 .56 1 | +--------------------------------------------------+

Let me mention that your use of terminology is non-standard, and this makes your question slightly confusing. You have one dataset with 78 variables and 160,000 rows. Within that, the variable Index defines groups of observations. Each group has around 9 observations, with a maximum of 36 and a minimum of 2, so there are approximately 17,777 groups with distinct values of Index.

I'm sympathetic to you as a new user of Stata - it's a lot to absorb.

When I began using Stata in a serious way, I started, as have others here, by reading my way through the Getting Started with Stata manual relevant to my setup. Chapter 18 then gives suggested further reading, much of which is in the Stata User's Guide, and I worked my way through much of that reading as well. There are a lot of examples to copy and paste into Stata's do-file editor to run yourself, and better yet, to experiment with changing the options to see how the results change.

All of these manuals are included as PDFs in the Stata installation (since version 11) and are accessible from within Stata - for example, through the PDF Documentation section of Stata's Help menu. The objective in doing the reading was not so much to master Stata as to be sure I'd become familiar with a wide variety of important basic techniques, so that when the time came that I needed them, I might recall their existence, if not the full syntax, and know how to find out more about them in the help files and PDF manuals.

Stata supplies exceptionally good documentation that amply repays the time spent studying it - there's just a lot of it. The path I followed surfaces the things you need to know to get started in a hurry and to work effectively. And it will familiarize you with the terminology and facilitate your communication with other Stata users.
2 likes
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#3

03 Sep 2018, 10:13

William gives excellent advice as always. I'd also check out the pc() function of egen (which can produce proportions too).
1 like
Comment
Ian Bell

Join Date: Feb 2015

Posts: 27
#4

03 Sep 2018, 14:29

Thanks William for your detailed reply and thanks to Nick for the follow up comment.
With reference to William's fine advice on RTFM I agree whole heartedly and should have done that some 25 years ago when I first heard about Stata. I was 47 back then and more responsive to learning new tricks. At 72 my learning skills are somewhat diminished but hope springs eternal. Nowadays I want to focus in on a project that I have worked on for several years and need Stata for specific tasks. By the way I picked up that 'terminology' from an Excel forum and now realise each software uses their own terms.
I will be working on it on Tuesday and post feedback.
Thanks again.
Ian
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

03 Sep 2018, 14:39

Let me add that I'm not far from you in the responsiveness to learning, but from Nick I picked up a formerly-overlooked egen function. I was focused on reproducing the track you followed in your example, but here's a more direct approach.

Code:

sort Index, stable
by Index: egen result = pc(prob)
replace result = result/100
list, noobs sepby(Index)

Code:

. list, noobs sepby(Index)

  +-----------------------------+
  | Index   Val   prob   result |
  |-----------------------------|
  |     1    20    .05     .125 |
  |     1    30    .33     .825 |
  |     1    50    .02      .05 |
  |-----------------------------|
  |     2    90    .11      .44 |
  |     2    70    .14      .56 |
  +-----------------------------+

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#6

03 Sep 2018, 15:17

By the way I picked up that 'terminology' from an Excel forum and now realise each software uses their own terms.

Well, yes, but. Depending on your philosophy, my response would be either "terminology matters" or "it's not just terminology." While there is a certain correspondence between rows in Excel and observations in Stata, and columns in Excel and variables in Stata, that correspondence breaks down badly once we get beyond referring to the layout of the data on the screen in certain circumstances. For example, the various cells in a column of Excel can be completely unrelated to each other and have all different types of information. But the values of a Stata variable must either be all numeric, or all strings. Similarly, in many Excel data sets, the information that appears in a cell bears no necessary relationship to the information to its left or right, or above or below it. But those things are not true in Stata: the information in horizontally adjacent positions necessarily refer to the same entity. When we start working on calculations, things diverge even farther.

So it is best to use Excel terminology when discussing Excel, and Stata terminology when discussing Stata. If you confuse the two in your mind, based on the very superficial correspondence between them when looking at a data grid on the screen, you are likely to also start to try working with the data in Stata as if it were in Excel. And that leads down a rabbit hole that, I assure you , you don't want to go to. So it is best to keep the terminology distinct and consistently use the appropriate one so that you do not trick yourself into trying to do things in one that only make sense (or can only be done) in the other, or that have a misleadingly similar appearance in both but do very different things. The differences are very substantive, and the different terminologies are there to help us keep them straight in our heads.

Good luck, and welcome to the world of Stata.
2 likes
Comment
Ian Bell

Join Date: Feb 2015

Posts: 27
#7

03 Sep 2018, 15:35

Thank you Clyde, in fact I should easily adapt to the management of data in Stata as my spreadsheets mostly use the same format being numeric with some text and headings can contain one to around 5 rows maximum. I certainly don't want to go down that 'rabbit hole' again ...I've been down it faster than a rabbit on a promise
I need to search for some basic text material on Stata terminology. I'm sure there's plenty on the web.
Thanks for your kind welcoming message
Ian
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

03 Sep 2018, 16:06

I need to search for some basic text material on Stata terminology. I'm sure there's plenty on the web.

Of wildly varying quality, with little help sorting the wheat from the chaff.

By contrast, Page 2 of the Getting Started with Stata manual I recommended in post #3 covers "variables" and "observations".

Also, my apologies, in post #3 I echoed "160,000 rows" where I meant "160,000 observations".
1 like
Comment
Ian Bell

Join Date: Feb 2015

Posts: 27
#9

03 Sep 2018, 16:13

I've nearly converted you Mine is V10 I'll check out post "3 Thanks
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#10

03 Sep 2018, 17:36

To push #3 all the way:

Code:

clear input float(Index Val prob) 1 20 .05 1 30 .33 1 50 .02 2 90 .11 2 70 .14 end sort Index, stable by Index: egen pr = pc(prob), prop list, sepby(Index) +---------------------------+ | Index Val prob pr | |---------------------------| 1. | 1 20 .05 .125 | 2. | 1 30 .33 .825 | 3. | 1 50 .02 .05 | |---------------------------| 4. | 2 90 .11 .44 | 5. | 2 70 .14 .56 | +---------------------------+

Back in 1999 prop() was written up as a separate egen function.

STB-50 dm70 . . . . . . . . . . . . . . . . Extensions to generate, extended
(help egenodd if installed) . . . . . . . . . . . . . . . . N. J. Cox
7/99 pp.9--17; STB Reprints Vol 9, pp.34--45
24 additional egen functions presented; includes various string,
data management, and statistical functions;
many of the egen functions added to Stata 7

StataCorp folded prop() into pc() as an option when adopting into Stata 7.
1 like
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#11

03 Sep 2018, 17:54

Stata Release 10. Wow.

Well, the Statalist FAQ linked to from the top of the page (you should review it to refresh your memory) asks that you be sure to mention the version of Stata you are using, if it is not the latest. In your case it's going to be very important, because it's been 9 years since Stata 10 was superseded by Stata 11. Many of us were not using Stata that long ago, so you're likely to get advice that your system doesn't support; telling us what version you use will reduce the number of explanations that you should be using factor variable notation, to pick a popular example.

Do you still have the shelf full of printed documentation that came with Stata Version 10? Look amongst them for a "getting started" and "user's guide" manuals.

If you can't find that, below are links to PDFs of the Release 13 versions of the Getting Started and User's Guide, which is about as far back as I can find online.

When you get advice from Statalist, you should be sure to use Stata's built-in help to help you with any necessary "translation" from Stata "now" to Stata "then".

Added in edit: Crossed with Nick's excellent post, displaying his encyclopedic knowledge of the history of Stata. And displaying the fact that I didn't pay enough attention to the egen pc() documentation when I read it, I was so eager to give it a try.

Also, I note that in post #1 you spelled "Version 10" with a "2". Wishful thinking?

Last edited by William Lisowski; 03 Sep 2018, 18:00.
2 likes
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#12

04 Sep 2018, 03:51

Links promised in previous post.

https://www.stata.com/manuals13/gsw.pdf
https://www.stata.com/manuals13/u.pdf
1 like
Comment
Ian Bell

Join Date: Feb 2015

Posts: 27
#13

04 Sep 2018, 13:59

Thank you William for those links. I will have some time this week to study them. Thanks also to Nick for his post #10 I will also research info on egen function.
I ran into some snags when using the above code. I'm not suggesting there's anything wrong with the code but more my total lack of understanding.
I've attached a small sample of 6 variables and 108 observations and wondered if you would convert to probs that sum to one as per var "pr" in
post #10 by Nick.
If either of you have the time and it's not too onerous on your part could you please return the sheet with the 'pr' values together with the complete code.
This would really help me understand the entire process.
Thank You
Ian
Attached Files

STATA VARIABLES.csv (16.1 KB, 1 view)
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#14

04 Sep 2018, 19:12

I'm traveling with only occasional access to Stata at the moment.

I can see your CSV and wonder if, when you imported it into Stata (by whatever means were available in Stata 10) did you have Stata ignore the 11 lines of junk that the CSV begins with, and use the 12th row for the variable names?

Since you're new to Stata, that you may not have grasped the significance of Clyde's warning from post #6

For example, the various cells in a column of Excel can be completely unrelated to each other and have all different types of information. But the values of a Stata variable must either be all numeric, or all strings.

but that is exactly what I'm suggesting may be the problem - an attempt to stuff numeric and character data into the same variable, which inevitably results in the variable becoming a string (character) variable not ameanable to numeric calculation. And that would certainly cause the code to fail.

It's possible (in Stata 15) to tell Stata where to start and where the variable names are. But the easiest approach would be to edit the CSV and delete the first 11 lines; I cannot see where they will ever be of use to you.

Implicit in the above: In Excel, your column names occupy the first row of the column, with number below. In Stata, variable names are kept separate from the data they describe.

Beyond that, I see in your dataset the variable Index (it is indeed a variable, although you didn't count it as one), and six other variables, none of them named prob as in your example in post #1. I also see that your CSV has 720 rows, of which the first 11 are junk and the next contains the variable names. So you have 708 observations of 7 variables, with 59 "groups" (my terminology from post #2) defined by distinct values of Index. I'm not at all sure where you got 108 observations (per post #13) and wonder if perhaps you uploaded a different file than you intended.
2 likes
Comment
Ian Bell

Join Date: Feb 2015

Posts: 27
#15

05 Sep 2018, 01:01

Yes I now see there are 7 variables and 708 observations after deleting the first 11 rows.
I've added the calc variables 'prob' 'pr' as shown in post #10 by Nick.
The only output I require is the end result 'pr' for each of the other 6 variables
with the appropriate Index, then I would export it back to Excel.
I've attached the amended sheet.
Thanks again William for your continued patience. I am learning albeit a steep curve.

Attached Files

STATA VARIABLES.csv (24.4 KB, 1 view)
Comment

Announcement

Convert values and sum product to one within Index Range

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment