Hello, I am importing a very large .csv dataset by splitting it into chunks of 1,000,000 observations and reading one chunk at a time. For each chunk, I am assigning IDs to observations based on their row number in the original dataset.
The first chunk covers rows 11,000,001 to 12,000,000, the second chunk rows 12,000,001 to 13,000,000, and so on. There are nine chunks in total, with the last chunk covering rows 19,000,001 to 20,000,000. The IDs are supposed to correspond to these row numbers.
Does anyone understand what is going on in the output shown below? Thanks very much!
My code reads:
Code:
forvalues i = 11000001(1000000)19000001 {
    display `i'    // make sure starting row number of chunk is correct
    local endrow = `i' + 999999    // get the row number of last observation to be included in the chunk
    display `endrow'
    clear
    import delimited using "large_data.csv", rowrange(`i':`endrow')    // import chunk
    gen id = _n + `i' - 1
}
What I am noticing is that although all chunks do have 1,000,000 observations each, the first five chunks have correctly assigned IDs while the last four do not. For example, Chunk 5 has correctly assigned IDs:
Row | ID |
1 | 15,000,001 |
2 | 15,000,002 |
3 | 15,000,003 |
Chunk 7, on the other hand, shows the following:
Row | ID |
1 | 17,000,000 |
2 | 17,000,002 |
3 | 17,000,004 |
4 | 17,000,004 |
5 | 17,000,004 |
6 | 17,000,006 |
7 | 17,000,008 |
8 | 17,000,008 |
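A minimal sketch for trying to reproduce this pattern without the CSV file, assuming a fresh session in which the default storage type has not been changed with `set type`. It builds eight observations and assigns IDs with the same expression as in the loop. The `describe` and `assert` lines are additions for illustration only: the first shows which storage type `id` was given, the second reports how many IDs differ from the intended row numbers.

```stata
* Reproduction sketch: no CSV needed, eight observations stand in for a chunk.
clear
set obs 8
local i = 17000001                  // starting row number of Chunk 7
gen id = _n + `i' - 1               // same expression as in the loop
describe id                         // shows the storage type id was given
format id %12.0fc                   // display every digit with commas
list id
* Report mismatches without stopping execution
capture noisily assert id == _n + `i' - 1
```

If the listed values match the Chunk 7 table, the import step can probably be ruled out as the source of the problem.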
The same erroneous pattern of IDs can be found for Chunks 8 and 9. Interestingly, I can fix the problem after it has occurred, by manually assigning the value of i and re-assigning the IDs as before. For instance, I can fix Chunk 7's IDs by running the following code:
Code:
local i = 17000001
replace id = _n + `i' - 1
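One thing that may be worth trying instead of repairing the IDs afterwards is to state the storage type explicitly when the variable is created. This is a sketch under the assumption that the storage type of `id` is what matters here, which has not been confirmed. A `long` holds integers of this size exactly, whereas a `float` is exact only up to 16,777,216. The eight-observation dataset is a stand-in for a real chunk.

```stata
* Sketch: create the ID with an explicit integer storage type.
clear
set obs 8                           // stand-in for an imported chunk
local i = 17000001                  // starting row number of Chunk 7
generate long id = _n + `i' - 1     // long instead of the default type
format id %12.0fc
list id
assert id == _n + `i' - 1           // stops with an error if any ID is off
```

Inside the loop, the equivalent change would be to replace `gen id` with `gen long id`.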