
  • Simple Python Graphing Question

    Hey everyone. I'm working on a project with my friend and mentor, who is a Python expert, so I want to use this opportunity to learn more Python. More precisely, I'm trying to replicate graphs that I have already made in Stata. My question is this: how do I get Python to plot the reference line at the year 1989, rather than treating 1989 as a position on the index? Consider the following code:
    Code:
    cls
    clear *
    python:
    
    import pandas as pd
    
    import matplotlib.pyplot as plt
    
    hfont = {'fontname':'Times New Roman'}
    
    df = pd.read_csv('https://raw.githubusercontent.com/synth-inference/synthdid/master/data/california_prop99.csv', sep=';', parse_dates=['Year'], index_col='Year')
    #Imports our data
    
    
    df = df.sort_values(by=['State'])
    
    #For some reason it wasn't sorted- changing that
    
    
    df_Cali = df[df['State'] == 'California']
    #For now... we only want California.
    
    df_Cali.plot(y='PacksPerCapita', color=[(.17, .27, .57)], legend=None)
    # Our basic plot
    
    
    plt.title('Tobacco Trends', fontsize=14, **hfont)
    
    
    plt.xlabel('Year', fontsize=14, **hfont)
    plt.xticks(fontsize=14, rotation=45, **hfont)
    
    
    plt.ylabel('Cigarette Sales Per Capita', fontsize=14, **hfont)
    
    
    plt.vlines(x=1989, ymin=40, ymax=140, color='red', label="Proposition 99")
    # !! The problem of interest.
    
    plt.grid()
    
    plt.show()
    
    end
    The graph may not be constructed in the most "Pythonic" way, but it mostly does what I want. The problem is that the intervention happened in 1989, not in the 1970s, which is where the line is drawn. Presumably this is because Python interprets 1989 as a position on the index (which lands at 1970-something) instead of as a value of the "Year" variable. How might I get the reference line to the correct position, at the year 1989? Perhaps Leonardo Guizzetti or Daniel Schaefer might have thoughts?

    Oh, and if you have any edits you'd suggest to the code itself, like how to make it cleaner or more efficient, I'd appreciate it! I look forward to the day that I'll be fluently bilingual in Stata and Python.

  • #2
    Hey Jered,

    Here is my solution. I googled around a bit (I wasn't going to dive too deep into the documentation) and couldn't find an obvious way to place the vertical line based on the date label. I know the pandas basics, but not its finer details, so there may be a cleaner solution that I am unaware of. I also have no idea why you got a vertical line within your plot at all, since it seems like plt.vlines is expecting an index for x, and 1989 is clearly outside the bounds of the domain. When I just sort your data by date, I get a vertical line that I believe is out of bounds of the plot and therefore not rendered.

    Below, I make a few changes to your code. First, I sort the data by year as well as state, so that the index corresponding to the year is more meaningful.

    Code:
    df = df.sort_values(by=['State', 'Year'])
    Next, I want to find the row index corresponding to the year 1989. So I take the date index object from the dataframe (df_Cali.index); convert the index object to a list of datetime objects (list(...)); loop through each datetime object, extract the year, and put it in a list (the bracketed expression); then find the position in that list corresponding to 1989 (.index(1989)). This kind of syntax is referred to as a list comprehension, and it can be useful for writing idiomatic and syntactically minimal for loops. Technically, we could save a little bit of processor time by writing this in a less compact way, but almost certainly not enough to matter from a practical standpoint.

    Code:
    year = [datetime.year for datetime in list(df_Cali.index)].index(1989)
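To make the steps above concrete, here is a small self-contained sketch. It uses made-up dates (one per year, 1970 through 2000) as a stand-in for df_Cali.index rather than the real CSV, and the names dates, years, position, and position_early_exit are illustrative only. The second lookup shows one possible reading of the "less compact" variant: it stops scanning at the first match instead of building the whole list of years first.

```python
from datetime import datetime

# Made-up stand-in for df_Cali.index: one date per year, 1970 through 2000
dates = [datetime(year, 1, 1) for year in range(1970, 2001)]

# List comprehension: build the full list of years, then look up 1989
years = [d.year for d in dates]
position = years.index(1989)

# Alternative: stop scanning as soon as the first 1989 date is found
position_early_exit = next(i for i, d in enumerate(dates) if d.year == 1989)

print(position, position_early_exit)
```

Both lookups give the same position. Note that the loop variable is named d here so that it does not shadow the datetime class imported at the top.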
    Finally, I just plug the year index in for the x parameter to the vlines method.

    Code:
    plt.vlines(x=year, ymin=40, ymax=140, color='red', label="Proposition 99")
    The full script is below.

    Code:
    import pandas as pd
    import matplotlib.pyplot as plt
    
    # Constants
    hfont = {'fontname':'Times New Roman'}
    
    #Imports our data
    df = pd.read_csv('https://raw.githubusercontent.com/synth-inference/synthdid/master/data/california_prop99.csv',
                     sep=';', parse_dates=['Year'], index_col='Year')
    df = df.sort_values(by=['State', 'Year']) # For some reason it wasn't sorted- changing that
    df_Cali = df[df['State'] == 'California'] # For now... we only want California.
    # Our basic plot
    df_Cali.plot(y='PacksPerCapita', color=[(.17, .27, .57)], legend=None)
    plt.title('Tobacco Trends', fontsize=14, **hfont)
    plt.xlabel('Year', fontsize=14, **hfont)
    plt.xticks(fontsize=14, rotation=45, **hfont)
    plt.ylabel('Cigarette Sales Per Capita', fontsize=14, **hfont)
    year = [datetime.year for datetime in list(df_Cali.index)].index(1989)
    plt.vlines(x=year, ymin=40, ymax=140, color='red', label="Proposition 99") # !! The problem of interest.
    plt.grid()
    plt.show()
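An alternative that avoids index positions entirely is to keep the year as a plain integer on the x axis, so that x=1989 is already in the axis's own units. The sketch below is only an illustration under assumptions: it uses a made-up series (df_demo, with invented values) instead of the real data, and it draws the line with ax.axvline rather than plt.vlines. With the real CSV, this would presumably mean reading the file without parse_dates and index_col so that Year stays an integer column.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to display the figure
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the California series (values are invented)
years = list(range(1970, 2001))
df_demo = pd.DataFrame({
    "Year": years,
    "PacksPerCapita": [120 - 2 * i for i in range(len(years))],
})

fig, ax = plt.subplots()
# Plot against the integer year, so the x axis is measured in calendar years
ax.plot(df_demo["Year"], df_demo["PacksPerCapita"])

# x=1989 is now in the same units as the axis, so no index lookup is needed
line = ax.axvline(x=1989, color="red", label="Proposition 99")

ax.set_xlabel("Year")
ax.set_ylabel("Cigarette Sales Per Capita")
ax.grid(True)
```

Because axvline spans the full height of the axes by default, the ymin and ymax data values used with vlines are not needed here.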



    • #3
      Yep, we essentially got the same solution! I did it like this:
      Code:
      cls
      clear *
      python:
      
      import pandas as pd
      
      import matplotlib.pyplot as plt
      
      hfont = {'fontname':'Times New Roman'}
      
      df = pd.read_csv('https://raw.githubusercontent.com/synth-inference/synthdid/master/data/california_prop99.csv', 
      sep=';', parse_dates=['Year'], index_col='Year')
      
      
      df = df.sort_values(by=['State'])
      
      
      df_Cali = df[df['State'] == 'California']
      
      df_Cali = df_Cali.sort_values(by=['Year'])
      
      
      df_Cali.plot(y='PacksPerCapita', color=[(.17, .27, .57)], legend=None)
      
      
      plt.title('Tobacco Trends', fontsize=14, **hfont)
      
      
      plt.xlabel('Year', fontsize=14, **hfont)
      plt.xticks(fontsize=14, rotation=45, **hfont)
      
      
      plt.ylabel('Cigarette Sales Per Capita', fontsize=14, **hfont)
      
      x_position = df_Cali.index.searchsorted('1989-01-01')
      
      plt.vlines(x=x_position, ymin=40, ymax=140, color='red', label="Proposition 99")
      
      plt.grid(True)
      
      plt.show()
      
      end
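For readers unfamiliar with searchsorted, here is a minimal sketch of what it returns, using a made-up sorted DatetimeIndex (idx) rather than the real data. searchsorted gives the position at which a value would be inserted to keep the index sorted, so it does not check that the date is actually present. The comparison with get_loc, which raises a KeyError for a missing label, is included as an assumption about what a stricter lookup might look like; it is not part of the solution above.

```python
import pandas as pd

# Made-up sorted index: January 1 of each year, 1970 through 2000
idx = pd.to_datetime([f"{year}-01-01" for year in range(1970, 2001)])

# Position of an exact match
pos = int(idx.searchsorted(pd.Timestamp("1989-01-01")))

# A date that is NOT in the index still returns a position (its insertion point)
pos_missing = int(idx.searchsorted(pd.Timestamp("1989-06-15")))

# get_loc returns the position only if the label exists
pos_strict = idx.get_loc(pd.Timestamp("1989-01-01"))

print(pos, pos_missing, pos_strict)
```

The index must be sorted for searchsorted to give a meaningful answer, which is why the sort by Year comes before the lookup.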



      • #4
        Nice! I actually prefer your solution. It's better to use the built-in methods like that. I was just feeling a little lazy and didn't want to read the docs!



        • #5
          I have nothing to add here, as I learned a little Python some years ago but now have no professional use for it, so it’s forgotten. Seems like you have found your solution though.
