  • Miscalculation for Some but Not All Data

    Hello, I am importing a very large .csv dataset by splitting it into chunks of 1,000,000 observations and reading one chunk at a time. For each chunk, I am assigning IDs to observations based on their row number in the original dataset.

    My code reads:

    Code:
    forvalues i= 11000001(1000000)19000001 {
    
    display `i' //make sure starting row number of chunk is correct
    
    local endrow= `i' + 999999 //get the row number of last observation to be included in the chunk
    display `endrow'
    
    clear
    import delimited using "large_data.csv", rowrange(`i': `endrow') //import chunk
    
    gen id= _n + `i' - 1
    
    }
    So the first chunk includes rows 11,000,001 to 12,000,000, the second chunk 12,000,001 to 13,000,000 and so on. There are nine chunks in total, with the last chunk covering rows 19,000,001 to 20,000,000. The IDs are supposed to correspond to these row numbers.

    What I am noticing is that although all chunks do have 1,000,000 observations each, the first five chunks have correctly assigned IDs while the last four do not. For example, Chunk 5 has correctly assigned IDs:

    Row ID
    1 15,000,001
    2 15,000,002
    3 15,000,003

    Chunk 7, on the other hand, shows the following:

    Row ID
    1 17,000,000
    2 17,000,002
    3 17,000,004
    4 17,000,004
    5 17,000,004
    6 17,000,006
    7 17,000,008
    8 17,000,008

    The same erroneous pattern of IDs can be found for Chunks 8 and 9. Interestingly, I can fix the problem after it has occurred, by manually assigning the value of i and re-assigning the IDs as before. For instance, I can fix Chunk 7's IDs by running the following code:

    Code:
    local i=17000001
    replace id= _n + `i' - 1
    Does anyone understand what's going on here? Thanks very much!

  • #2
    see:
    Code:
    help precision
    The issue is that, by default, Stata creates new variables as floats, and you need more precision than that. Try inserting "double" between "gen" and "id".
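
    As a rough illustration of why only the later chunks are affected: a float stores every integer exactly only up to 2^24 = 16,777,216, and above that it can hold only every second integer, so IDs past that point are rounded to a neighbouring even value. The sketch below first shows the rounding with the float() function and then repeats the loop from #1 with "double" added. The format and assert lines are additions for checking the result, not part of the original code.

    ```stata
    // below 2^24 a float holds the integer exactly; above it, odd integers are rounded
    display %12.0f float(15000001)
    display %12.0f float(17000001)
    display %12.0f float(17000003)

    // same loop as in #1, with id stored as a double
    forvalues i = 11000001(1000000)19000001 {
        local endrow = `i' + 999999                  // last row of this chunk
        clear
        import delimited using "large_data.csv", rowrange(`i':`endrow')
        gen double id = _n + `i' - 1                 // double holds these integers exactly
        format id %12.0f                             // show all digits when listing
        assert id == _n + `i' - 1                    // stops with an error if any ID was rounded
    }
    ```

    Storing id as a long should work as well, since the IDs are whole numbers well below the upper limit of that type.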


    • #3
      1. In Stata I have a dataset of 500 observations named "delivery kit". The data come from three forms: the "enrollment form", the "follow-up form", and the "delivery form". Each enrollment should have three follow-ups: the first on the 7th day after enrollment, the second on the 14th day, and the third on the 28th day. I want to find, for each enrollment, how many follow-ups were done and on which day after enrollment each one took place. What command should I use in Stata?

      2. In the same 500 observations, one variable is "date of last menstrual period" and a second variable is "date of outcome or date of delivery". I want to compute the difference between the date of delivery and the date of the last menstrual period and store it in a new variable. What command should I use in Stata?


      • #4
        Reply to Anna Ghaffar (I can't find the @-link for you): How does your question relate to Fred Shappell's discussion (#1 to #2)? Note that you should start a new topic if your question does not fit under the topic "Miscalculation for Some but Not All Data".


        • #5
          Thank you so much Rich Goldstein! It was indeed the precision issue that was causing the problem. I appreciate your help!
