Visualizing Temperature Changes

Today we will create a SQL database, store data frames in it, then perform selective querying and create interactive visualizations using plotly.

Key Imports

import pandas as pd # data frame manipulation
import numpy as np # numeric manipulation
import sqlite3 # connection to SQL database
from plotly import express as px # plotly visualization
from sklearn.linear_model import LinearRegression # Regression
from plotly.io import write_html # for exporting plotly html
import datetime as dt # data/time manipulation

Getting Data and Creating a database

We will read in the three relevant data frames: temperature, countries, stations.

# Read in the temperature df
temps = pd.read_csv("temps_stacked.csv")
# Read in the countries df
countries = pd.read_csv('countries.csv')
# Read in the stations df directly from url
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/noaa-ghcn/station-metadata.csv"
stations = pd.read_csv(url)

We first remove white space from data frame column names to avoid issues with SQL.

# Substitute white space for underscore in df column names
countries = countries.rename(columns= {"FIPS 10-4": "FIPS_10_4"})
countries = countries.rename(columns= {"ISO 3166": "ISO_3166"})

Create a database and store the three data frames.

# Create and connect to a temps.db database
conn = sqlite3.connect("temps.db")

# Add the three tables to temps.db
temps.to_sql("temperatures", conn,
             if_exists="replace", index=False)
stations.to_sql("stations", conn,
                if_exists="replace", index=False)
countries.to_sql("countries", conn,
                 if_exists="replace", index=False)

# Close the connection after database construction
conn.close()

Query Function

The query function connects to the database, joins the three tables by relevant country IDs, and returns one data frame for user-specified a) year duration b) month, and c) country.

def query_climate_database(country, year_begin, year_end, month):
    '''
    Function designed to query the pre-established database and
    return data for the user-specified country, in the specified
    date range, in the specified month of the year. 
    
    Inputs:
    countries - a list of countries
    year_begin - numeric int for starting year (inclusive)
    year_end - numeric int for ending year (inclusive)
    
    Output:
    pandas data frame
    '''
    # Connect to the database
    conn = sqlite3.connect("temps.db")
    # SQL command for joining the tables and returning the desired columns
    cmd = '''
    SELECT
        s.NAME, s.LATITUDE, s.LONGITUDE, c.Name Country, t.Year, t.Month, t.Temp
    FROM
        temperatures t
    LEFT JOIN stations s ON t.ID = s.ID
    LEFT JOIN countries c ON SUBSTR(t.ID, 1, 2) = c.FIPS_10_4
    '''
    # Execute the SQL command and store the queried data into df
    df = pd.read_sql_query(cmd, conn)
    conn.close()
    # Return entries that follow user specifications
    return df.loc[(df.Country == country) & (df.Month == month) &
                  (df.Year >= year_begin) & (df.Year <= year_end)].reset_index().drop('index', axis = 1)

Now we query January data from India between 1980 and 2020.

df = query_climate_database(country = "India", 
                       year_begin = 1980, 
                       year_end = 2020,
                       month = 1)

Geographic Scatter Function for Yearly Temperature Increases

First we create a function that can take in a data frame like we created before and calculate best-fit lines’ slopes, giving us an estimated average temperature change per station.

def coef(data_group):
    '''
    Function defined to take in a data frame with Year and Temp columns
    Regresses on the data points and returns the best fit line's slope
    
    Input:
    data_group - a pandas df with Year and Temp columns
    
    Output:
    slope - a single numeric value of the best fit line's slope
    '''
    X = data_group[["Year"]] # expect data frame, not series
    y = data_group["Temp"]
    LR = LinearRegression()
    LR.fit(X, y)
    slope = LR.coef_[0]
    return slope

Then we create a function that takes in criterias (country, year_begin, year_end, month) as before, as well as min_obs, the minimal number of years of data that must be present for a station to be included in our visualization. *kwargs is also included to allow unspecified number of additional parameters for the plotly function from the user.

def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
    '''
    Function combining query_climate_database and coef functions before
    will query specific data from the data base subject to user demand
    keep only months with a user-set minimum number of observations
    and creates an interactive geographic plotly visualization, with **kwargs
    enabled to take in undetermined number of plotting parameters
    
    Output: a plotly interactive geographic visualization
    '''
    raw = query_climate_database(country = country, 
                                 year_begin = year_begin, 
                                 year_end = year_end,
                                 month = month)
    freq = pd.DataFrame(raw.groupby(['NAME', 'Month'])['Temp'].transform(len)).rename(columns={'Temp':'freq'})
    raw = raw.join(freq)
    filtered = raw[raw.freq >= min_obs]
    temp_change = filtered.groupby(["NAME", "Month"]).apply(coef).reset_index()
    temp_change = temp_change.rename(columns = {0:"slope"}).round(decimals = 4)
    df = temp_change.merge(filtered[["NAME", "LATITUDE", "LONGITUDE"]], left_on = "NAME", right_on = "NAME")
    fig = px.scatter_mapbox(df, # data for the points you want to plot
                            lat = "LATITUDE", # column name for latitude informataion
                            lon = "LONGITUDE", # column name for longitude information
                            hover_name = "NAME", # what's the bold text that appears when you hover over
                            color="slope", # represent temp using color
                            labels = {"slope":"Estimated Yearly Increase"},
                            **kwargs) # remaining user-specified arguments
    fig.update_layout(margin={"r":0,"t":50,"l":20,"b":0})
    return fig

Scatter Map Plot for India:

Here is the interactive plotly visualizations of stations in India and their respective Estimated Yearly Temperature Increase between 1980 to 2020.

color_map = px.colors.diverging.RdGy_r # choose a colormap

fig1 = temperature_coefficient_plot("India", 1980, 2020, 1, min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map,
                                   title = "Estimates of yearly change in Temperature in January<br>"\
                                   "for stations in India, years 1980-2020",
                                   color_continuous_midpoint = 0)
fig1.show()

Scatter Map Plot for China:

Here, as per the given instructions, we make a similar plot as above but this time querying and plotting station/temperature data from China instead of India. The time period is also more recent from 1990 to 2020. The mapbox_style has been changed to “carto-darkmatter” to highlight the stations with minimal temperature increase (thus having very faint colors or pure white colors).

color_map = px.colors.diverging.RdGy_r

fig2 = temperature_coefficient_plot("China", 1990, 2020, 1, min_obs = 5,
                                   zoom = 2.2,
                                   mapbox_style="carto-darkmatter",
                                   color_continuous_scale=color_map,
                                   title = "Estimates of yearly change in Temperature in January<br>"\
                                    "for stations in China, years 1990-2020",
                                    center = {"lat":35, "lon":101},
                                    color_continuous_midpoint = 0)
fig2.show()

Analysis on STNELEV (Station Elevation)

def query_climate_database2(countries, year_begin, year_end):
    '''
    Function designed to query the pre-established database and
    return data from a LIST of desired countries between user-defined
    start/end years
    
    Inputs:
    countries - a list of countries
    year_begin - numeric int for starting year (inclusive)
    year_end - numeric int for ending year (inclusive)
    
    Output:
    pandas data frame
    '''
    # Connect to the database
    conn = sqlite3.connect("temps.db")
    # SQL command for joining the tables and returning the desired columns
    cmd = '''
    SELECT
        c.Name Country, s.NAME station, t.Year, t.Month, t.Temp, s.STNELEV
    FROM
        temperatures t
    LEFT JOIN stations s ON t.ID = s.ID
    LEFT JOIN countries c ON SUBSTR(t.ID, 1, 2) = c.FIPS_10_4
    '''
    # Execute the SQL command and store the queried data into df
    df = pd.read_sql_query(cmd, conn)
    conn.close()
    # Return entries that follow user specifications
    return df.loc[(df.Country.isin(countries))& (df.Year >= year_begin) & (df.Year <= year_end)].reset_index().drop('index', axis = 1)

We will query and analyze the station data from the three countries that produce the most pollution, China, the United States, and India. The goal is study the relationship between station elevation and temperature. with facets of Year and Month, we may also observe shift in overall temperature throughout the years.

big_three = query_climate_database2(['China', 'United States', 'India'], 2016, 2020)

We first check that for each unique station in each country, its elevation remains constant. We take out any stations that violate this standard.

elev_std = big_three.groupby(["Country", "station"]).STNELEV.std().reset_index()
# Find stations whose STNELEV has 0 standard deviation <-> constant
valid_stations = pd.DataFrame(elev_std[elev_std.STNELEV == 0].station)
# Only keep the stations found above by merging back
station_df = valid_stations.merge(big_three, left_on = "station",
                                  right_on = "station")

Since we are interested in the effect of elevation, we take the mean of temperature after having grouped by Year, Month, Country, and STNELEV.

fig_data = pd.DataFrame(station_df.groupby(["Year", "Month", "Country", "STNELEV"])[["Temp"]].mean()).reset_index()

Finally, we plot the data frame using a custom function:

def line_plotter(df, facet_row, facet_col, x, y, **kwargs):
    '''
    Function designed to create a multi-facet line plot
    
    Inputs:
    df - a pandas data frame containing all plotting info
    facet_row - the category to group data into on horizontal level
    facet_col - the category to group data into on vertical level
    x - the variable to plot on x axis of each plot
    y - the variable to plot on y axis of each plot
    **kwargs - allowing non-determined number of additional plotting param
    '''
    out = px.line(data_frame = df,
                  facet_row = facet_row,
                  facet_col = facet_col,
                  x = x,
                  y = y,
                  **kwargs
                 )
    # Ensure sufficient margin for labels
    out.update_layout(margin = {"r":0,"t":100,"l":0,"b":0})
    write_html(new_fig, "facet_line.html")
    out.show()

line_plotter(df = fig_data,
             facet_row = 'Month',
             facet_col = 'Year',
             x = 'STNELEV',
             y = 'Temp',
             color = 'Country',
             range_x = [-100, 5000],
             range_y = [-35, 45],
             width = 1000,
             height = 1500,
             title = 'Average Temp by Elevation in Big Three Pollution Countries<br>'\
             'with facets of Years and Months')

Conclusion: We make several observations:

There is a slight negative relationship between STNELEV and average temperature, that is station temperature on average decreases with higher elevation, which follows our intuition.
The variability of average temperature also decreases with higher elevation
From these plots themselves, it is difficult to discern a significant and clear increase in temperature from 2016 to 2020. Recall that the data points are average temperature across stations with the same STNELEV in a country, so there may have been offsets.
The temperature of the three countries are more aligned in the summer months, and more divergent in the winter months (as seen by three colors all showing up, where in the summer the green United States trend cover the other two up)
Lastly, we acknowledge that due to the geographic elevation makeup of t.hese three countries, they are not exactly comparable. For example, the highest STNELV for stations in India is a little over 2300, while for the U.S. the highest is 3700 and in China, 4700.

Visualizing Monthly Temperature Distribution

Now we are interested in visualizing the distribution of temperatures in the US for the past three years across months, with the goal of discovering any potential shift (for example, for global warming, we expect the distribution to shift towards more frequent high temperature)

To do so, we first query the station data in the US for the past three years. Notably, we choose the “group” for parameter barmode to prevent stacking, and “percent” for histnorm so that the y axis is percentage, instead of raw count (the number of station observations per year is different).

US_three_yr = query_climate_database2(["United States"],
                                      2018, 2020)
# Specify increasing Year and Month for plotting order
US_three_yr.sort_values(["Year", "Month", "Temp"],
                        inplace = True)

Then we plot histograms, broken down by months and color-coded by yea using our custom function.

def facet_hist(df, x, **kwargs):
    '''
    Function designed to create facet histograms
    With large degree of freedom for user-specified param
    
    Inputs:
    df - pandas data frame containing all plotting info
    x - the variable for x axis (for which to count prevalency)
    **kwargs - for undetermined number of additional parameters
    '''
    out = px.histogram(df, x = x, **kwargs)
    out.update_layout(margin={"r":0,"t":20,"l":0,"b":0})
    write_html(out, "multi_facet_hist.html")
    out.show()

facet_hist(US_three_yr, 
           x = "Temp", 
           color = "Year",
           opacity = 0.8, 
           nbins = 20,
           barmode='group',
           histnorm = 'percent',
           width = 900,
           height = 1500,
           facet_col = 'Month',
           facet_col_wrap = 3)

Conclusion: Here are several observations:

There is no consistent showing of shift towards higher temperature across the months in the past three years in the US.
We can state, however, that the monthly temperature patterns have become more unpredictable. For example: in Jan-March, we observe a shift towards higher temperature; in Oct-Dec, there is a narrowing of the distribution, meaning that the monthly temperature have settled on a smaller range of values.
We also observe, in general, that certain months tend to have a wider distribution (larger range of temperature fluctuations) than others.

Written on April 15, 2022