Visualizing Temperature Changes
Today we will create a SQL database, store data frames in it, then perform selective querying and create interactive visualizations using plotly.
Key Imports
import pandas as pd # data frame manipulation
import numpy as np # numeric manipulation
import sqlite3 # connection to SQL database
from plotly import express as px # plotly visualization
from sklearn.linear_model import LinearRegression # Regression
from plotly.io import write_html # for exporting plotly html
import datetime as dt # data/time manipulation
Getting Data and Creating a database
We will read in the three relevant data frames: temperature, countries, stations.
# Read in the temperature df
temps = pd.read_csv("temps_stacked.csv")
# Read in the countries df
countries = pd.read_csv('countries.csv')
# Read in the stations df directly from url
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/noaa-ghcn/station-metadata.csv"
stations = pd.read_csv(url)
We first remove white space from data frame column names to avoid issues with SQL.
# Substitute white space for underscore in df column names
countries = countries.rename(columns= {"FIPS 10-4": "FIPS_10_4"})
countries = countries.rename(columns= {"ISO 3166": "ISO_3166"})
Create a database and store the three data frames.
# Create and connect to a temps.db database
conn = sqlite3.connect("temps.db")
# Add the three tables to temps.db
temps.to_sql("temperatures", conn,
if_exists="replace", index=False)
stations.to_sql("stations", conn,
if_exists="replace", index=False)
countries.to_sql("countries", conn,
if_exists="replace", index=False)
# Close the connection after database construction
conn.close()
Query Function
The query function connects to the database, joins the three tables by relevant country IDs, and returns one data frame for user-specified a) year duration b) month, and c) country.
def query_climate_database(country, year_begin, year_end, month):
'''
Function designed to query the pre-established database and
return data for the user-specified country, in the specified
date range, in the specified month of the year.
Inputs:
countries - a list of countries
year_begin - numeric int for starting year (inclusive)
year_end - numeric int for ending year (inclusive)
Output:
pandas data frame
'''
# Connect to the database
conn = sqlite3.connect("temps.db")
# SQL command for joining the tables and returning the desired columns
cmd = '''
SELECT
s.NAME, s.LATITUDE, s.LONGITUDE, c.Name Country, t.Year, t.Month, t.Temp
FROM
temperatures t
LEFT JOIN stations s ON t.ID = s.ID
LEFT JOIN countries c ON SUBSTR(t.ID, 1, 2) = c.FIPS_10_4
'''
# Execute the SQL command and store the queried data into df
df = pd.read_sql_query(cmd, conn)
conn.close()
# Return entries that follow user specifications
return df.loc[(df.Country == country) & (df.Month == month) &
(df.Year >= year_begin) & (df.Year <= year_end)].reset_index().drop('index', axis = 1)
Now we query January data from India between 1980 and 2020.
df = query_climate_database(country = "India",
year_begin = 1980,
year_end = 2020,
month = 1)
Geographic Scatter Function for Yearly Temperature Increases
First we create a function that can take in a data frame like we created before and calculate best-fit lines’ slopes, giving us an estimated average temperature change per station.
def coef(data_group):
'''
Function defined to take in a data frame with Year and Temp columns
Regresses on the data points and returns the best fit line's slope
Input:
data_group - a pandas df with Year and Temp columns
Output:
slope - a single numeric value of the best fit line's slope
'''
X = data_group[["Year"]] # expect data frame, not series
y = data_group["Temp"]
LR = LinearRegression()
LR.fit(X, y)
slope = LR.coef_[0]
return slope
Then we create a function that takes in criterias (country, year_begin, year_end, month) as before, as well as min_obs
, the minimal number of years of data that must be present for a station to be included in our visualization. *kwargs
is also included to allow unspecified number of additional parameters for the plotly function from the user.
def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
'''
Function combining query_climate_database and coef functions before
will query specific data from the data base subject to user demand
keep only months with a user-set minimum number of observations
and creates an interactive geographic plotly visualization, with **kwargs
enabled to take in undetermined number of plotting parameters
Output: a plotly interactive geographic visualization
'''
raw = query_climate_database(country = country,
year_begin = year_begin,
year_end = year_end,
month = month)
freq = pd.DataFrame(raw.groupby(['NAME', 'Month'])['Temp'].transform(len)).rename(columns={'Temp':'freq'})
raw = raw.join(freq)
filtered = raw[raw.freq >= min_obs]
temp_change = filtered.groupby(["NAME", "Month"]).apply(coef).reset_index()
temp_change = temp_change.rename(columns = {0:"slope"}).round(decimals = 4)
df = temp_change.merge(filtered[["NAME", "LATITUDE", "LONGITUDE"]], left_on = "NAME", right_on = "NAME")
fig = px.scatter_mapbox(df, # data for the points you want to plot
lat = "LATITUDE", # column name for latitude informataion
lon = "LONGITUDE", # column name for longitude information
hover_name = "NAME", # what's the bold text that appears when you hover over
color="slope", # represent temp using color
labels = {"slope":"Estimated Yearly Increase"},
**kwargs) # remaining user-specified arguments
fig.update_layout(margin={"r":0,"t":50,"l":20,"b":0})
return fig
Scatter Map Plot for India:
Here is the interactive plotly visualizations of stations in India and their respective Estimated Yearly Temperature Increase between 1980 to 2020.
color_map = px.colors.diverging.RdGy_r # choose a colormap
fig1 = temperature_coefficient_plot("India", 1980, 2020, 1, min_obs = 10,
zoom = 2,
mapbox_style="carto-positron",
color_continuous_scale=color_map,
title = "Estimates of yearly change in Temperature in January<br>"\
"for stations in India, years 1980-2020",
color_continuous_midpoint = 0)
fig1.show()
Scatter Map Plot for China:
Here, as per the given instructions, we make a similar plot as above but this time querying and plotting station/temperature data from China instead of India. The time period is also more recent from 1990 to 2020. The mapbox_style has been changed to “carto-darkmatter” to highlight the stations with minimal temperature increase (thus having very faint colors or pure white colors).
color_map = px.colors.diverging.RdGy_r
fig2 = temperature_coefficient_plot("China", 1990, 2020, 1, min_obs = 5,
zoom = 2.2,
mapbox_style="carto-darkmatter",
color_continuous_scale=color_map,
title = "Estimates of yearly change in Temperature in January<br>"\
"for stations in China, years 1990-2020",
center = {"lat":35, "lon":101},
color_continuous_midpoint = 0)
fig2.show()
Analysis on STNELEV (Station Elevation)
def query_climate_database2(countries, year_begin, year_end):
'''
Function designed to query the pre-established database and
return data from a LIST of desired countries between user-defined
start/end years
Inputs:
countries - a list of countries
year_begin - numeric int for starting year (inclusive)
year_end - numeric int for ending year (inclusive)
Output:
pandas data frame
'''
# Connect to the database
conn = sqlite3.connect("temps.db")
# SQL command for joining the tables and returning the desired columns
cmd = '''
SELECT
c.Name Country, s.NAME station, t.Year, t.Month, t.Temp, s.STNELEV
FROM
temperatures t
LEFT JOIN stations s ON t.ID = s.ID
LEFT JOIN countries c ON SUBSTR(t.ID, 1, 2) = c.FIPS_10_4
'''
# Execute the SQL command and store the queried data into df
df = pd.read_sql_query(cmd, conn)
conn.close()
# Return entries that follow user specifications
return df.loc[(df.Country.isin(countries))& (df.Year >= year_begin) & (df.Year <= year_end)].reset_index().drop('index', axis = 1)
We will query and analyze the station data from the three countries that produce the most pollution, China, the United States, and India. The goal is study the relationship between station elevation and temperature. with facets of Year and Month, we may also observe shift in overall temperature throughout the years.
big_three = query_climate_database2(['China', 'United States', 'India'], 2016, 2020)
We first check that for each unique station in each country, its elevation remains constant. We take out any stations that violate this standard.
elev_std = big_three.groupby(["Country", "station"]).STNELEV.std().reset_index()
# Find stations whose STNELEV has 0 standard deviation <-> constant
valid_stations = pd.DataFrame(elev_std[elev_std.STNELEV == 0].station)
# Only keep the stations found above by merging back
station_df = valid_stations.merge(big_three, left_on = "station",
right_on = "station")
Since we are interested in the effect of elevation, we take the mean of temperature after having grouped by Year, Month, Country, and STNELEV.
fig_data = pd.DataFrame(station_df.groupby(["Year", "Month", "Country", "STNELEV"])[["Temp"]].mean()).reset_index()
Finally, we plot the data frame using a custom function:
def line_plotter(df, facet_row, facet_col, x, y, **kwargs):
'''
Function designed to create a multi-facet line plot
Inputs:
df - a pandas data frame containing all plotting info
facet_row - the category to group data into on horizontal level
facet_col - the category to group data into on vertical level
x - the variable to plot on x axis of each plot
y - the variable to plot on y axis of each plot
**kwargs - allowing non-determined number of additional plotting param
'''
out = px.line(data_frame = df,
facet_row = facet_row,
facet_col = facet_col,
x = x,
y = y,
**kwargs
)
# Ensure sufficient margin for labels
out.update_layout(margin = {"r":0,"t":100,"l":0,"b":0})
write_html(new_fig, "facet_line.html")
out.show()
line_plotter(df = fig_data,
facet_row = 'Month',
facet_col = 'Year',
x = 'STNELEV',
y = 'Temp',
color = 'Country',
range_x = [-100, 5000],
range_y = [-35, 45],
width = 1000,
height = 1500,
title = 'Average Temp by Elevation in Big Three Pollution Countries<br>'\
'with facets of Years and Months')
Conclusion: We make several observations:
- There is a slight negative relationship between STNELEV and average temperature, that is station temperature on average decreases with higher elevation, which follows our intuition.
- The variability of average temperature also decreases with higher elevation
- From these plots themselves, it is difficult to discern a significant and clear increase in temperature from 2016 to 2020. Recall that the data points are average temperature across stations with the same STNELEV in a country, so there may have been offsets.
- The temperature of the three countries are more aligned in the summer months, and more divergent in the winter months (as seen by three colors all showing up, where in the summer the green United States trend cover the other two up)
- Lastly, we acknowledge that due to the geographic elevation makeup of t.hese three countries, they are not exactly comparable. For example, the highest STNELV for stations in India is a little over 2300, while for the U.S. the highest is 3700 and in China, 4700.
Visualizing Monthly Temperature Distribution
Now we are interested in visualizing the distribution of temperatures in the US for the past three years across months, with the goal of discovering any potential shift (for example, for global warming, we expect the distribution to shift towards more frequent high temperature)
To do so, we first query the station data in the US for the past three years. Notably, we choose the “group” for parameter barmode
to prevent stacking, and “percent” for histnorm
so that the y axis is percentage, instead of raw count (the number of station observations per year is different).
US_three_yr = query_climate_database2(["United States"],
2018, 2020)
# Specify increasing Year and Month for plotting order
US_three_yr.sort_values(["Year", "Month", "Temp"],
inplace = True)
Then we plot histograms, broken down by months and color-coded by yea using our custom function.
def facet_hist(df, x, **kwargs):
'''
Function designed to create facet histograms
With large degree of freedom for user-specified param
Inputs:
df - pandas data frame containing all plotting info
x - the variable for x axis (for which to count prevalency)
**kwargs - for undetermined number of additional parameters
'''
out = px.histogram(df, x = x, **kwargs)
out.update_layout(margin={"r":0,"t":20,"l":0,"b":0})
write_html(out, "multi_facet_hist.html")
out.show()
facet_hist(US_three_yr,
x = "Temp",
color = "Year",
opacity = 0.8,
nbins = 20,
barmode='group',
histnorm = 'percent',
width = 900,
height = 1500,
facet_col = 'Month',
facet_col_wrap = 3)
Conclusion: Here are several observations:
- There is no consistent showing of shift towards higher temperature across the months in the past three years in the US.
- We can state, however, that the monthly temperature patterns have become more unpredictable. For example: in Jan-March, we observe a shift towards higher temperature; in Oct-Dec, there is a narrowing of the distribution, meaning that the monthly temperature have settled on a smaller range of values.
- We also observe, in general, that certain months tend to have a wider distribution (larger range of temperature fluctuations) than others.