
    Hello
    I have a question.
    My merged file is large (163,718 observations). There was an event: two survey waves took place before it (waves 1 and 2) and two after it (waves 4 and 5). The data look like this:
    Mergeid Wave
    AT-01 1
    AT-01 4
    AT-02 1
    AT-02 2
    AT-02 5
    BF-05 1
    BF-05 2
    CZ-44 2
    CZ-44 4
    CZ-45 5
    My condition: I need only those identifiers that participated at least once in wave 1 or 2 (before the event) and at least once in wave 4 or 5 (after the event).
    E.g. AT-01 is needed, because it appears once in wave 1 and once in wave 4.
    AT-02 is needed, because it appears twice before the event and once after the event.
    BF-05 is not needed, because it appears twice before the event and never after the event.
    CZ-44 is needed, because it appears once before the event and once after the event.
    CZ-45 is not needed, because it never appears before the event and appears once after the event.

    My task: drop the observations that are not needed.

    Does anybody have an idea how to do that?


  • #2
    Somebody yesterday had about 160 million observations...

    So, you want a positive count for wave 1 or 2 and also for wave 4 or 5. Your data example leaves it ambiguous whether your identifier is a string variable or a numeric variable with value labels. But the code could be the same either way. Please note our longstanding request to use dataex.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str5 mergeid byte wave
    "AT-01" 1
    "AT-01" 4
    "AT-02" 1
    "AT-02" 2
    "AT-02" 5
    "BF-05" 1
    "BF-05" 2
    "CZ-44" 2
    "CZ-44" 4
    "CZ-45" 5
    end
    
    . bysort mergeid : egen cond1 = total(wave == 1 | wave == 2)
    
    . by mergeid : egen cond2 = total(wave == 4 | wave == 5)
    
    . gen wanted = cond1 & cond2
    
    
    . l, sepby(mergeid)
    
         +-----------------------------------------+
         | mergeid   wave   cond1   cond2   wanted |
         |-----------------------------------------|
      1. |   AT-01      1       1       1        1 |
      2. |   AT-01      4       1       1        1 |
         |-----------------------------------------|
      3. |   AT-02      1       2       1        1 |
      4. |   AT-02      2       2       1        1 |
      5. |   AT-02      5       2       1        1 |
         |-----------------------------------------|
      6. |   BF-05      1       2       0        0 |
      7. |   BF-05      2       2       0        0 |
         |-----------------------------------------|
      8. |   CZ-44      2       1       1        1 |
      9. |   CZ-44      4       1       1        1 |
         |-----------------------------------------|
     10. |   CZ-45      5       0       1        0 |
         +-----------------------------------------+
    
    keep if wanted
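
As a variation on the code above, here is a minimal sketch that reaches the same result without egen, by sorting the waves within each identifier. It assumes that wave is never missing and only takes values from 1 to 5, as in the example; the variable name wanted2 is made up for illustration.

```stata
clear
input str5 mergeid byte wave
"AT-01" 1
"AT-01" 4
"AT-02" 1
"AT-02" 2
"AT-02" 5
"BF-05" 1
"BF-05" 2
"CZ-44" 2
"CZ-44" 4
"CZ-45" 5
end

* Sort waves within each identifier, so that wave[1] is the earliest
* and wave[_N] the latest wave observed for that identifier.
* Assumption: wave is never missing (missing values would sort last).
bysort mergeid (wave) : gen byte wanted2 = inlist(wave[1], 1, 2) & inlist(wave[_N], 4, 5)

keep if wanted2
```

With the example data this should keep the observations for AT-01, AT-02 and CZ-44, matching the listing above.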



    • #3
      Somebody yesterday had about 160 million observations...
      It's funny how discipline changes things. I don't know what discipline the OP works in, but I was in an interview for a research analyst position this past Wednesday, and they asked me if I was comfortable working with big datasets. I asked them to define "big", since this is relative to your field. I usually work with 100,000+ observations in some form; to a psychologist this would (presumably) be gigantic, and to people in some other disciplines it would be puny.



      • #4
        To be fair, I looked again at the post I had in mind, and the person really said 60 million.

        When I started teaching statistics, it was a rule of thumb around here that a reasonable size of dataset was 20 to 30 observations, as being fair for hand calculations with at most a small electronic calculator or (a bit later) as what each student might fairly be expected to type in to whatever software was being used.



        • #5
          "Type in"? I don't mean to sound like this, but did CSV files exist in those days?

          I couldn't imagine manually typing in data-points.



          • #6
            Typing at a keyboard was easy -- compared with punching your own cards (routine when I started with **the** University computer) or paper tape (routine for many contemporaries).

            I still type in small datasets from books or papers.

            Now consider early text file editors in which you couldn't see the file in question while you were editing it (and printing it out wasn't always trivial either).



            • #7
              My econometrics instructor when I was pursuing my PhD (now retired) explained truncation in the following way: the secretary is inputting data from a sheet of paper and accidentally spills ink on the bottom half of the sheet, completely losing the bottom half of the data (the data are ordered in some way, e.g., by individuals' heights). That is right truncation. I am not that old (barely in my 30s), but I get it, since we used fountain pens and ink when I was younger and spilling fountain pen ink was a very common thing. I also remember typewriters, as computers became mainstream (in my view) only in the late 1980s or early 1990s. Very soon, I suspect, the truncation story will draw blank stares from a new generation of students.



              • #8
                Originally posted by Nick Cox
                Somebody yesterday had about 160 million observations...

                So, you want a positive count for wave 1 or 2 and also for wave 4 or 5. Your data example leaves ambiguous whether your identifier is a string variable or a numeric variable with value labels. But code could be the same either way. Please note our longstanding request to use dataex.

                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input str5 mergeid byte wave
                "AT-01" 1
                "AT-01" 4
                "AT-02" 1
                "AT-02" 2
                "AT-02" 5
                "BF-05" 1
                "BF-05" 2
                "CZ-44" 2
                "CZ-44" 4
                "CZ-45" 5
                end
                
                . bysort mergeid : egen cond1 = total(wave == 1 | wave == 2)
                
                . by mergeid : egen cond2 = total(wave == 4 | wave == 5)
                
                . gen wanted = cond1 & cond2
                
                
                . l, sepby(mergeid)
                
                     +-----------------------------------------+
                     | mergeid   wave   cond1   cond2   wanted |
                     |-----------------------------------------|
                  1. |   AT-01      1       1       1        1 |
                  2. |   AT-01      4       1       1        1 |
                     |-----------------------------------------|
                  3. |   AT-02      1       2       1        1 |
                  4. |   AT-02      2       2       1        1 |
                  5. |   AT-02      5       2       1        1 |
                     |-----------------------------------------|
                  6. |   BF-05      1       2       0        0 |
                  7. |   BF-05      2       2       0        0 |
                     |-----------------------------------------|
                  8. |   CZ-44      2       1       1        1 |
                  9. |   CZ-44      4       1       1        1 |
                     |-----------------------------------------|
                 10. |   CZ-45      5       0       1        0 |
                     +-----------------------------------------+
                
                keep if wanted
                Hello
                It works, but only if I type in the observations manually.
                I already have a merged file with 163,718 observations.
                How can I use my merged file?

                I tried this, but it did not work:
                input str5 mergeid byte wave
                use "dataset9.dta", clear
                end

                Thank you
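
A note on the attempt above: input is for typing data into a do-file, and it reads every line up to end as data, so a use command placed inside it is not executed. A minimal sketch of the alternative, assuming the saved file contains variables named mergeid and wave as in the example, is to load the file with use and then run the same commands. To keep the sketch self-contained, a temporary file stands in for dataset9.dta here; the names panel, n_before and n_after are made up for illustration.

```stata
* Build a stand-in for the file on disk (illustration only).
clear
input str5 mergeid byte wave
"AT-01" 1
"AT-01" 4
"AT-02" 1
"AT-02" 2
"AT-02" 5
"BF-05" 1
"BF-05" 2
"CZ-44" 2
"CZ-44" 4
"CZ-45" 5
end
tempfile panel
save `panel'

* With the real data, this line would instead be something like:
*     use "dataset9.dta", clear
use `panel', clear

* Count pre-event and post-event waves per identifier.
bysort mergeid : egen n_before = total(inlist(wave, 1, 2))
by mergeid : egen n_after = total(inlist(wave, 4, 5))

* Keep identifiers observed at least once before and once after.
keep if n_before > 0 & n_after > 0
drop n_before n_after
```

No input or end lines are needed once the data are loaded from a file; those belong only to typed-in examples such as the one produced by dataex.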
