  • Use if else loop to implement multiple conditional statements

    Hi Statalist community,

    Gig workers are individuals who have multiple employers and they can work for Uber, Lyft, or Doordash at the same time. I am trying to determine a gig worker's primary employer based on several rules.
    1. The primary employer is the one where the employee worked the most hours.
    2. If the most hours worked by an employee are equal across two or more employers, the primary employer is the one where the employee earned the most wages.
    3. If the most hours worked by an employee are equal across two or more employers, and most wages earned by an employee are equal across two or more employers, the primary employer is the one where the employee received the most benefits.
    4. If the most hours worked by an employee are equal across two or more employers, and most wages earned by an employee are equal across two or more employers, and the most benefits received by an employee are equal across two or more employers, the primary employer is based on the earliest quarter that an employee was employed at.
    Below is a sample of my unbalanced panel dataset. I have the following variables:
    • worker_id - this is an employee id
    • employer_id - this is the employer's id
    • quarter - time is represented in quarters from 1-4
    • hours_worked - this is the number of hours worked in a quarter
    • wages - this is the wages earned in a quarter
    • benefits - this is the benefits received in a quarter
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(worker_id employer_id quarter hours_worked) int(wages benefits) byte primary_employer
    1 10 1 20 1000 200 10
    1 20 2 19 1200 100 10
    2 40 1  8  500 200 40
    2 20 1  8  400 350 40
    2 30 2  5  300 200 40
    3 10 2 10  200 100 20
    3 20 2 10  200 400 20
    3 30 2 10  200 300 20
    3 40 2 10  200 100 20
    4 40 2  2   50  25 30
    4 10 3  5   90  50 30
    4 20 4  5  100  50 30
    4 30 1  5  100  50 30
    end

    I wrote the following loop but I get an error that says that I cannot combine if with by.


    Code:
    gen primary_employer=.
    
    by worker_id: if max(hours_worked) {
            replace primary_employer=employer_id
        }
        else if max(wages) {
            replace primary_employer=employer_id
        }
        else if max(benefits) {
            replace primary_employer=employer_id
        }
        else if min(quarter) {
            replace primary_employer=employer_id
        }

    In the dummy dataset, I manually created a variable called primary_employer which is what I am trying to accomplish. Does anyone know how to solve my problem? Thank you.


    Last edited by James Lee; 09 Nov 2022, 18:52.

  • #2
    Code:
    gsort worker_id -hours_worked -wages -benefits quarter
    by worker_id: egen wanted = max(employer_id * (_n == 1))
    Results:

    Code:
         +---------------------------------------------------------------------------------+
         | worker~d   employ~d   quarter   hours_~d   wages   benefits   primar~r   wanted |
         |---------------------------------------------------------------------------------|
      1. |        1         10         1         20    1000        200         10       10 |
      2. |        1         20         2         19    1200        100         10       10 |
         |---------------------------------------------------------------------------------|
      3. |        2         40         1          8     500        200         40       40 |
      4. |        2         20         1          8     400        350         40       40 |
      5. |        2         30         2          5     300        200         40       40 |
         |---------------------------------------------------------------------------------|
      6. |        3         20         2         10     200        400         20       20 |
      7. |        3         30         2         10     200        300         20       20 |
      8. |        3         40         2         10     200        100         20       20 |
      9. |        3         10         2         10     200        100         20       20 |
         |---------------------------------------------------------------------------------|
     10. |        4         30         1          5     100         50         30       30 |
     11. |        4         20         4          5     100         50         30       30 |
     12. |        4         10         3          5      90         50         30       30 |
     13. |        4         40         2          2      50         25         30       30 |
         +---------------------------------------------------------------------------------+
    Last edited by Ken Chui; 09 Nov 2022, 19:02.



    • #3
      Ken Chui's response in #2 shows you how to get the result you want. It's the best solution I know of. But #2 doesn't say anything about what is wrong with what was tried in #1.

      Stata has two different kinds of -if-, and they do very different things. The most commonly used is the -if- clause attached to a single command. For example: -gen y = 1 if x > 2-. The function of an -if- clause is to define the subset of the observations to which the command will apply. In this example, the command applies to all and only those observations where the value of x is greater than 2.
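
      A minimal sketch of the -if- clause on a toy data set (the variables x and y and their values are made up for illustration):

      Code:
      clear
      input x
      1
      2
      3
      4
      end
      * -if- clause: the command runs once, but only touches observations with x > 2
      gen y = 1 if x > 2
      list

      Here y should be 1 where x is 3 or 4 and missing where x is 1 or 2.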

      What was used in #1 is the other kind of -if-, the -if- command. The -if- of an -if- command differs in several ways from the -if- clause:
      1. It precedes, not follows, what it applies to.
      2. It can be associated with -else if- and -else- commands.
      3. It can apply to more than one command by enclosing those commands within curly braces.
      4. But, most important, it does not define a subset of the observations to which the command (or commands) it guards will apply. Instead, it tests some condition of the state of the Stata environment to determine whether the guarded command(s) will be executed. In doing this, if the condition being tested refers to the value of a variable, that is understood to be the value of that variable in the first observation of the active data set in memory.
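
      As a rough illustration of point 4, here is a sketch with a made-up variable x. The -if- command looks only at x[1], regardless of what the other observations contain:

      Code:
      clear
      input x
      1
      5
      end
      * -if- command: x is read as x[1], so the condition tested is 1 > 2
      if x > 2 {
          display "the first observation has x > 2"
      }
      else {
          display "the first observation does not have x > 2"
      }

      The second observation has x = 5, but the -else- branch is the one that runs, because only the first observation is consulted.
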
      So, apart from the fact that you tried to prefix the -if- command with -by- (which Stata told you is not legal), your code would not have done what you were seeking. For example, max(hours_worked) would have been interpreted as max(hours_worked[1]). But that's a syntax error because the -max()- function requires a minimum of 2 arguments. (Don't confuse the -max()- function with the -egen, max()- function. The latter gives the maximum value of a single expression evaluated over the entire data set. But that function can only be accessed through the -egen- command.) And even if the -max()- function worked the way you wanted it to, that is, like the -egen, max()- function, you still would not have gotten what you seek, because that maximum value, assuming it is other than 0, would have then led Stata to -replace primary_employer=employer_id- in every observation of the data set. Or, if prefixing -if- commands with -by- were legal, it would have done so in every observation of the by-group.
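
      To make the distinction between the two -max()- functions concrete, here is a small sketch. The four rows are cut down from the example data in #1, and the new variable names at_least_10 and max_hours are made up for illustration:

      Code:
      clear
      input byte(worker_id hours_worked)
      1 20
      1 19
      2 8
      2 5
      end
      * max() function: compares its arguments within each observation
      gen at_least_10 = max(hours_worked, 10)
      * egen's max(): maximum of one expression across observations, here within each worker
      bysort worker_id: egen max_hours = max(hours_worked)
      list, sepby(worker_id)

      In this sketch at_least_10 is never below 10 in any row, while max_hours holds 20 for every row of worker 1 and 8 for every row of worker 2.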

      Learning when and how to use -if- commands and -if- clauses is one of the hurdles that people new to Stata have to overcome. It is probably harder for people who have prior experience with structured programming languages, because the syntactic similarity to if {...} else if {...} else {...} structures in general purpose programming languages tempts them to use them in situations where -if- clauses are needed instead. Most general purpose programming languages do not have anything that is exactly analogous to Stata's -if- clause.

      So, summarizing so far, both -if- and -max()- are overloaded terms in Stata whose different meanings appear in different syntactic contexts, and the code in #1 gets it wrong in both cases.

      Finally, as you have seen, -by- can only be applied to a single command, and not all commands allow it.
      Last edited by Clyde Schechter; 09 Nov 2022, 20:08.



      • #4
        Thank you so much for the responses.

        @Ken Chui . Your solution is so logical and I see the thought process. I follow your code fairly well. Can I ask for a quick clarification about the line of code pasted below? A * symbol is used. What does that do? Is the * sign multiplying employer_id with (_n==1)? Is it serving as a wildcard? I have never seen it used in this manner.

        Code:
         
         by worker_id: egen wanted = max(employer_id * (_n == 1))

        @Clyde Schechter Thank you for the explanation of the difference between the -if- clause and the -if- command. Below is some additional commentary on the differences, but your explanation is so clear and understandable. I am still trying to get the hang of Stata and this forum is very helpful. Thank you also for explaining how to use the max() function. I was mixing different programming languages together.



        https://www.statalist.org/forums/for...s-if-condition


        https://www.stata.com/support/faqs/p...-if-qualifier/



        • #5
          Originally posted by James Lee
          Thank you so much for the responses.
          @Ken Chui . Your solution is so logical and I see the thought process. I follow your code fairly well. Can I ask for a quick clarification about the line of code pasted below? A * symbol is used. What does that do? Is the * sign multiplying employer_id with (_n==1)? Is it serving as a wildcard? I have never seen it used in this manner.
          No, it's used as multiplication here. The more verbose version of the code is:

          Code:
          gsort worker_id -hours_worked -wages -benefits quarter
          by worker_id: gen firstrow = _n == 1
          generate chosen_num = employer_id * firstrow
          by worker_id: egen wanted = max(chosen_num)
          First, it's sorted based on the order of criteria.

          Then, we create an indicator variable called "firstrow", which equals 1 in the first row within each worker_id and 0 in every other row. The row with firstrow == 1 is the one we need to extract the employer information from.

          Then, we extract that one number by multiplying employer_id by firstrow, which leaves the employer ID in the first row and 0 in all other rows.

          Using egen max, that number is then copied to every row within each unique worker_id.

          These all can be boiled down to:

          Code:
          gsort worker_id -hours_worked -wages -benefits quarter
          by worker_id: egen wanted = max(employer_id * (_n == 1))
          Here, instead of creating a separate indicator variable, it just uses _n (the row number) directly. If the row number is equal to 1, then the condition (_n == 1) is true, which Stata codes as 1. It is then multiplied by the employer_id of that row. All other rows evaluate to false, which is coded as 0, so their product is 0. Among those numbers, the maximum is picked by egen max().
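
          A small sketch of how the logical expression turns into 1 and 0. The four rows are cut down from the example data, and the names is_first and product are made up for illustration:

          Code:
          clear
          input byte(worker_id employer_id)
          1 10
          1 20
          2 40
          2 20
          end
          * keep the input order within each worker so "first row" is predictable
          sort worker_id, stable
          * (_n == 1) evaluates to 1 in the first row of each worker and 0 elsewhere
          by worker_id: gen byte is_first = (_n == 1)
          * the product keeps employer_id in the first row and is 0 in the others
          gen product = employer_id * is_first
          list, sepby(worker_id)

          With this row order, product should be 10 and 0 for worker 1, and 40 and 0 for worker 2.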

          And because of this max() mechanism, this code is not suitable if you have negative employer IDs, which I'd consider a highly rare practice.
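
          To see that limitation, here is a sketch with deliberately artificial negative IDs (the values -10 and -20 are hypothetical, not from the example data):

          Code:
          clear
          input byte worker_id int employer_id byte hours_worked
          1 -10 20
          1 -20 19
          end
          gsort worker_id -hours_worked
          * first row gives -10 * 1 = -10, second row gives -20 * 0 = 0, and max(-10, 0) is 0
          by worker_id: egen wanted = max(employer_id * (_n == 1))
          list

          In this case wanted comes out as 0 rather than -10, because the zeros from the non-first rows beat the negative ID.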
          Last edited by Ken Chui; 11 Nov 2022, 12:06.



          • #6
            I think a more transparent approach to this, and one that will not fail in the presence of anomalies like negative employer_ids would be:
            Code:
            gsort worker_id -hours_worked -wages -benefits quarter
            by worker_id: egen wanted = employer_id[1]
            And it would also work if employer_id is a string variable, which is quite common in practice.



            • #7
              In #6 egen is a typo for gen.
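
              With that correction applied, the approach from #6 would read as below. This is a sketch run on workers 2 and 4 from the example data in #1, with a check against the hand-coded primary_employer:

              Code:
              clear
              input byte(worker_id employer_id quarter hours_worked) int(wages benefits) byte primary_employer
              2 40 1  8  500 200 40
              2 20 1  8  400 350 40
              2 30 2  5  300 200 40
              4 40 2  2   50  25 30
              4 10 3  5   90  50 30
              4 20 4  5  100  50 30
              4 30 1  5  100  50 30
              end
              * sort so that the row satisfying the tie-breaking rules comes first within each worker
              gsort worker_id -hours_worked -wages -benefits quarter
              * gen, not egen: copy the first row's employer_id to every row of the worker
              by worker_id: gen wanted = employer_id[1]
              * check against the manually created answer
              assert wanted == primary_employer
              list, sepby(worker_id)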



              • #8
                @Ken Chui
                Thank you for graciously breaking down your code. Your explanation is very clear. This thread has been a great lesson and helped me build up my Stata skills. Thank you.

                @Clyde Schechter
                Thanks for offering an alternative approach. I am more familiar with the square brackets to index for row position. Thank you.

                @Nick Cox
                You are super sharp and thank you for pointing out this detail. Thank you.
