Miscalculation for Some but Not All Data

Fred Shappell

Join Date: Jan 2023

Posts: 4
#1

16 Apr 2023, 08:09

Hello, I am importing a very large .csv dataset by splitting it into chunks of 1,000,000 observations and reading one chunk at a time. For each chunk, I am assigning IDs to observations based on their row number in the original dataset.

My code reads:

Code:

forvalues i = 11000001(1000000)19000001 {
    display `i' // make sure starting row number of chunk is correct
    local endrow = `i' + 999999 // get the row number of last observation to be included in the chunk
    display `endrow'
    clear
    import delimited using "large_data.csv", rowrange(`i':`endrow') // import chunk
    gen id = _n + `i' - 1
}

So the first chunk includes rows 11,000,001 to 12,000,000, the second chunk 12,000,001 to 13,000,000 and so on. There are nine chunks in total, with the last chunk covering rows 19,000,001 to 20,000,000. The IDs are supposed to correspond to these row numbers.

What I am noticing is that although all chunks do have 1,000,000 observations each, the first five chunks have correctly assigned IDs while the last four do not. For example, Chunk 5 has correctly assigned IDs:

Row    ID
1      15,000,001
2      15,000,002
3      15,000,003

Chunk 7, on the other hand, shows the following:

Row    ID
1      17,000,000
2      17,000,002
3      17,000,004
4      17,000,004
5      17,000,004
6      17,000,006
7      17,000,008
8      17,000,008

The same erroneous pattern of IDs can be found for Chunks 8 and 9. Interestingly, I can fix the problem after it has occurred, by manually assigning the value of i and re-assigning the IDs as before. For instance, I can fix Chunk 7's IDs by running the following code:

Code:

local i = 17000001
replace id = _n + `i' - 1

Does anyone understand what's going on here? Thanks very much!
Rich Goldstein

Join Date: Mar 2014

Posts: 4493
#2

16 Apr 2023, 08:31

see:

Code:

help precision

The issue is that, by default, Stata creates new variables as floats, and you need more precision than that; try inserting "double" between "gen" and "id".
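The rounding can be reproduced without the large file. What follows is only a minimal sketch, not the original poster's code: the variable names id_float, id_double and id_long are made up for illustration, and the offset 17000001 is taken from Chunk 7 in #1. A float holds integers exactly only up to 16,777,216 (2^24); above that, consecutive integers are rounded to the nearest representable value.

```stata
* Minimal illustration of float rounding for large integer IDs.
* Variable names are hypothetical; 17000001 is the Chunk 7 start row from #1.
clear
set obs 8

* Default storage type is float: values above 16,777,216 get rounded.
generate id_float = _n + 17000001 - 1

* Explicit storage types that hold these integers exactly.
generate double id_double = _n + 17000001 - 1
generate long id_long = _n + 17000001 - 1

format id_float id_double id_long %12.0fc
list, noobs

* float() rounds a value to float precision, which shows the same effect directly.
display %12.0fc float(17000001)
```

The id_float column should show the same 17,000,000, 17,000,002, 17,000,004, ... pattern reported for Chunk 7, while id_double and id_long should run from 17,000,001 to 17,000,008. Either double or long stores these IDs exactly; long takes 4 bytes per observation against 8 for double, which may matter with 1,000,000 observations per chunk.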
Amna Ghaffar

Join Date: Apr 2023

Posts: 4
#3

16 Apr 2023, 12:17

1. In Stata I have a dataset of 500 observations named "delivery kit". The data come from three forms: the "enrollment form", the "follow-up form", and the "delivery form". Each enrollment should have three follow-ups: the first on the 7th day after enrollment, the second on the 14th day, and the third on the 28th day. I want to find out how many follow-ups each enrollment has and on which day after enrollment each one took place. What command should I use in Stata?

2. In these 500 observations, one variable is "date of last menstrual period" and another is "date of outcome or date of delivery". I want to generate a new variable containing the difference between the date of delivery and the date of the last menstrual period. What command should I use in Stata?
Dirk Enzmann

Join Date: Apr 2014

Posts: 586
#4

16 Apr 2023, 14:48

Reply to Amna Ghaffar (I can't find the @-link for you): How does your question relate to Fred Shappell's discussion (#1 to #2)? Note that you should start a new topic if your question does not fit under the topic "Miscalculation for Some but Not All Data".
Fred Shappell

Join Date: Jan 2023

Posts: 4
#5

16 Apr 2023, 17:14

Thank you so much, Rich Goldstein! It was indeed the precision issue that was causing the problem. I appreciate your help!