Introduction to Data Science in Python 第 4 周 Assignment
Assignment 4 - Hypothesis Testing
本次作业相比以往的作业需要投入更多的自主学习时间。建议您访问pandas文档以发现尚未使用的功能或方法,并在Stack Overflow寻求帮助时请记得标记问题为与pandas和Python相关。讨论版块随时欢迎您的参与与交流。
Definitions:
- Quarter refers to a specific three-month period. In particular:
- Q1 corresponds to January through March
- Q2 corresponds to April through June
- Q3 corresponds to July through September
- Q4 corresponds to October through December
- A recession refers to an economic condition characterized by two consecutive quarters of GDP decline followed by two consecutive quarters of GDP growth.
- The recession bottom refers to the quarter within a recession that exhibited the lowest level of GDP.
- A university town refers to a city where the percentage of university students exceeds that of the total urban population.
Hypothesis: Campus towns are not significantly influenced by economic recessions. Conduct a statistical test to analyze and compare the ratio of house prices in university towns during preceding quarters against those at the trough of a recession. The formula for calculating this ratio is defined as price_ratio = (quarter_before_recession / recession_bottom).
The following data files are available for this assignment:
Accessed from Zillow's research data site at http://www.zillow.com/research/data/, this resource provides housing statistics for the United States. Specifically, it includes a comprehensive data file (City_Zhvi_AllHomes.csv) that contains median home sale prices at an granular level.
Found on Wikipedia's dedicated page about college towns (https://en.wikipedia.org/wiki/List_of_college_towns#College_towns_in_the_United_States), this list has been copied and pasted into university_towns.txt.
The U.S. Department of Commerce's Bureau of Economic Analysis provides economic indicators including annual GDP trends (using chained values based on 2009 dollars) across quarters. The relevant data for this assignment spans from Q1 2000 onward.
Every function in this assignment is counted at 10%, except for run_ttest(), which is allocated 50%.
Q1:
def get_list_of_university_towns():
'''Returns a DataFrame of towns and the states they are in from the
university_towns.txt list. The format of the DataFrame should be:
DataFrame( [ ["Michigan", "Ann Arbor"], ["Michigan", "Yipsilanti"] ],
columns=["State", "RegionName"] )
The following cleaning needs to be done:
1. For "State", removing characters from "[" to the end.
2. For "RegionName", when applicable, removing every character from " (" to the end.
3. Depending on how you read the data, you may need to remove newline character '\n'. '''
df = pd.read_table('university_towns.txt', header=None, names=['data'])
bool_df = df['data'].str.contains('[edit]', regex=False)
university_array = []
for index, value in bool_df.items():
if value:
state = df.loc[index].values[0]
else:
region = df.loc[index].values[0]
university_array.append([state, region])
university_df = pd.DataFrame(university_array, columns=['State', 'RegionName'])
university_df['State'] = university_df['State'].str.replace('\[edit.*', '')
university_df['RegionName'] = university_df['RegionName'].str.replace(' \(.*', '')
# 坑。。又说让转换成缩写,提交答案却还是要全名的
# 后记:是 Q5 用来做转换的。。
# states_swap = {value:key for key, value in states.items()}
# university_df['State'].replace(states_swap, inplace=True)
return university_df
get_list_of_university_towns()

# A recession is defined as starting with two consecutive quarters of GDP decline, and ending with two consecutive quarters of GDP growth
gdp_df = pd.read_excel('gdplev.xls', header=5, usecols=[3,4,5], names=['Quarter', 'GDP current', 'GDP chained 2009']).set_index('Quarter').dropna()
gdp_df
# A recession is defined as starting with two consecutive quarters of GDP decline, and ending with two consecutive quarters of GDP growth.
# A recession bottom is the quarter within a recession which had the lowest GDP.
def cal_recession():
result = {
'start': [],
'end': [],
'bottom': []
}
# 下降计数
down_counter = 2
down_start_index = ''
# 上涨计数
up_counter = 0
bottom = 99999999
bottom_index = ''
flag = False # 进入recession的征兆
last_value = gdp_df.iloc[0]['GDP chained 2009'] # 坑啊,要认真审题! 题干中说明了要用 chained 2009 字段来计算
print('初始值:{}'.format(last_value))
for index, item in gdp_df.iterrows():
value = item['GDP chained 2009']
print('{} {}'.format(index, value))
# 环比下降
if value < last_value:
print('-------- 开始下降 !!!')
if down_counter == 2: # 记录可能进入recession的起始时间
down_start_index = index
down_counter -= 1
if value < bottom:
bottom = value
bottom_index = index
else:
down_counter = 2 # 当前值不小于上季度值,则重置下降标记
# 连续两季度下降,进入recession的征兆
if down_counter == 0:
print('-------- 进入 recession !!!')
bottom = value
bottom_index = index
flag = True
up_counter = 0 # 上涨标记重置
result['start'].append(down_start_index) # 开始时间是刚开始下降的时间,不是真正进入recession(即连续两次下降)的时间
# 环比上涨
if flag:
if value > last_value:
print('++++++++++ 开始上涨 !!!')
up_counter += 1
else:
up_counter = 0 # 当前值不大于上季度值,则重置上涨标记
# 连续两季度上涨并且有recession征兆
if up_counter == 2:
print('+++++++++++ 结束 recession !!!')
down_counter = 0 # 下降标记重置
flag = False
result['end'].append(index)
result['bottom'].append(bottom_index)
last_value = value
return result
result = cal_recession()
result
初始值:1934.5
1947q1 1934.5
1947q2 1932.3
-------- 开始下降 !!!
1947q3 1930.3
-------- 开始下降 !!!
-------- 进入 recession !!!
1947q4 1960.7
++++++++++ 开始上涨 !!!
1948q1 1989.5
++++++++++ 开始上涨 !!!
+++++++++++ 结束 recession !!!
1948q2 2021.9
1948q3 2033.2
1948q4 2035.3
1949q1 2007.5
-------- 开始下降 !!!
1949q2 2000.8
-------- 开始下降 !!!
-------- 进入 recession !!!
1949q3 2022.8
++++++++++ 开始上涨 !!!
1949q4 2004.7
-------- 开始下降 !!!
1950q1 2084.6
++++++++++ 开始上涨 !!!
1950q2 2147.6
++++++++++ 开始上涨 !!!
+++++++++++ 结束 recession !!!
1950q3 2230.4
1950q4 2273.4
1951q1 2304.5
1951q2 2344.5
1951q3 2392.8
1951q4 2398.1
1952q1 2423.5
1952q2 2428.5
1952q3 2446.1
1952q4 2526.4
1953q1 2573.4
1953q2 2593.5
1953q3 2578.9
-------- 开始下降 !!!
1953q4 2539.8
-------- 开始下降 !!!
-------- 进入 recession !!!
1954q1 2528.0
-------- 开始下降 !!!
1954q2 2530.7
++++++++++ 开始上涨 !!!
1954q3 2559.4
++++++++++ 开始上涨 !!!
+++++++++++ 结束 recession !!!
1954q4 2609.3
1955q1 2683.8
1955q2 2727.5
1955q3 2764.1
1955q4 2780.8
1956q1 2770.0
-------- 开始下降 !!!
1956q2 2792.9
1956q3 2790.6
-------- 开始下降 !!!
1956q4 2836.2
1957q1 2854.5
1957q2 2848.2
-------- 开始下降 !!!
1957q3 2875.9
1957q4 2846.4
-------- 开始下降 !!!
1958q1 2772.7
-------- 开始下降 !!!
-------- 进入 recession !!!
1958q2 2790.9
++++++++++ 开始上涨 !!!
1958q3 2855.5
++++++++++ 开始上涨 !!!
+++++++++++ 结束 recession !!!
1958q4 2922.3
1959q1 2976.6
1959q2 3049.0
1959q3 3043.1
-------- 开始下降 !!!
1959q4 3055.1
1960q1 3123.2
1960q2 3111.3
-------- 开始下降 !!!
1960q3 3119.1
1960q4 3081.3
-------- 开始下降 !!!
1961q1 3102.3
1961q2 3159.9
1961q3 3212.6
1961q4 3277.7
1962q1 3336.8
1962q2 3372.7
1962q3 3404.8
1962q4 3418.0
1963q1 3456.1
1963q2 3501.1
1963q3 3569.5
1963q4 3595.0
1964q1 3672.7
1964q2 3716.4
1964q3 3766.9
1964q4 3780.2
1965q1 3873.5
1965q2 3926.4
1965q3 4006.2
1965q4 4100.6
1966q1 4201.9
1966q2 4219.1
1966q3 4249.2
1966q4 4285.6
1967q1 4324.9
1967q2 4328.7
1967q3 4366.1
1967q4 4401.2
1968q1 4490.6
1968q2 4566.4
1968q3 4599.3
1968q4 4619.8
1969q1 4691.6
1969q2 4706.7
1969q3 4736.1
1969q4 4715.5
-------- 开始下降 !!!
1970q1 4707.1
-------- 开始下降 !!!
-------- 进入 recession !!!
1970q2 4715.4
++++++++++ 开始上涨 !!!
1970q3 4757.2
++++++++++ 开始上涨 !!!
+++++++++++ 结束 recession !!!
1970q4 4708.3
-------- 开始下降 !!!
1971q1 4834.3
1971q2 4861.9
1971q3 4900.0
1971q4 4914.3
1972q1 5002.4
1972q2 5118.3
1972q3 5165.4
1972q4 5251.2
1973q1 5380.5
1973q2 5441.5
1973q3 5411.9
-------- 开始下降 !!!
1973q4 5462.4
1974q1 5417.0
-------- 开始下降 !!!
1974q2 5431.3
1974q3 5378.7
-------- 开始下降 !!!
1974q4 5357.2
-------- 开始下降 !!!
-------- 进入 recession !!!
1975q1 5292.4
-------- 开始下降 !!!
1975q2 5333.2
++++++++++ 开始上涨 !!!
1975q3 5421.4
++++++++++ 开始上涨 !!!
+++++++++++ 结束 recession !!!
1975q4 5494.4
1976q1 5618.5
1976q2 5661.0
1976q3 5689.8
1976q4 5732.5
1977q1 5799.2
1977q2 5913.0
1977q3 6017.6
1977q4 6018.2
1978q1 6039.2
1978q2 6274.0
1978q3 6335.3
1978q4 6420.3
1979q1 6433.0
1979q2 6440.8
1979q3 6487.1
1979q4 6503.9
1980q1 6524.9
1980q2 6392.6
-------- 开始下降 !!!
1980q3 6382.9
-------- 开始下降 !!!
-------- 进入 recession !!!
1980q4 6501.2
++++++++++ 开始上涨 !!!
1981q1 6635.7
++++++++++ 开始上涨 !!!
+++++++++++ 结束 recession !!!
1981q2 6587.3
-------- 开始下降 !!!
1981q3 6662.9
1981q4 6585.1
-------- 开始下降 !!!
1982q1 6475.0
-------- 开始下降 !!!
-------- 进入 recession !!!
1982q2 6510.2
++++++++++ 开始上涨 !!!
1982q3 6486.8
-------- 开始下降 !!!
1982q4 6493.1
++++++++++ 开始上涨 !!!
1983q1 6578.2
++++++++++ 开始上涨 !!!
+++++++++++ 结束 recession !!!
1983q2 6728.3
1983q3 6860.0
1983q4 7001.5
1984q1 7140.6
1984q2 7266.0
1984q3 7337.5
1984q4 7396.0
1985q1 7469.5
1985q2 7537.9
1985q3 7655.2
1985q4 7712.6
1986q1 7784.1
1986q2 7819.8
1986q3 7898.6
1986q4 7939.5
1987q1 7995.0
1987q2 8084.7
1987q3 8158.0
1987q4 8292.7
1988q1 8339.3
1988q2 8449.5
1988q3 8498.3
1988q4 8610.9
1989q1 8697.7
1989q2 8766.1
1989q3 8831.5
1989q4 8850.2
1990q1 8947.1
1990q2 8981.7
1990q3 8983.9
1990q4 8907.4
-------- 开始下降 !!!
1991q1 8865.6
-------- 开始下降 !!!
-------- 进入 recession !!!
1991q2 8934.4
++++++++++ 开始上涨 !!!
1991q3 8977.3
++++++++++ 开始上涨 !!!
+++++++++++ 结束 recession !!!
1991q4 9016.4
1992q1 9123.0
1992q2 9223.5
1992q3 9313.2
1992q4 9406.5
1993q1 9424.1
1993q2 9480.1
1993q3 9526.3
1993q4 9653.5
1994q1 9748.2
1994q2 9881.4
1994q3 9939.7
1994q4 10052.5
1995q1 10086.9
1995q2 10122.1
1995q3 10208.8
1995q4 10281.2
1996q1 10348.7
1996q2 10529.4
1996q3 10626.8
1996q4 10739.1
1997q1 10820.9
1997q2 10984.2
1997q3 11124.0
1997q4 11210.3
1998q1 11321.2
1998q2 11431.0
1998q3 11580.6
1998q4 11770.7
1999q1 11864.7
1999q2 11962.5
1999q3 12113.1
1999q4 12323.3
2000q1 12359.1
2000q2 12592.5
2000q3 12607.7
2000q4 12679.3
2001q1 12643.3
-------- 开始下降 !!!
2001q2 12710.3
2001q3 12670.1
-------- 开始下降 !!!
2001q4 12705.3
2002q1 12822.3
2002q2 12893.0
2002q3 12955.8
2002q4 12964.0
2003q1 13031.2
2003q2 13152.1
2003q3 13372.4
2003q4 13528.7
2004q1 13606.5
2004q2 13706.2
2004q3 13830.8
2004q4 13950.4
2005q1 14099.1
2005q2 14172.7
2005q3 14291.8
2005q4 14373.4
2006q1 14546.1
2006q2 14589.6
2006q3 14602.6
2006q4 14716.9
2007q1 14726.0
2007q2 14838.7
2007q3 14938.5
2007q4 14991.8
2008q1 14889.5
-------- 开始下降 !!!
2008q2 14963.4
2008q3 14891.6
-------- 开始下降 !!!
2008q4 14577.0
-------- 开始下降 !!!
-------- 进入 recession !!!
2009q1 14375.0
-------- 开始下降 !!!
2009q2 14355.6
-------- 开始下降 !!!
2009q3 14402.5
++++++++++ 开始上涨 !!!
2009q4 14541.9
++++++++++ 开始上涨 !!!
+++++++++++ 结束 recession !!!
2010q1 14604.8
2010q2 14745.9
2010q3 14845.5
2010q4 14939.0
2011q1 14881.3
-------- 开始下降 !!!
2011q2 14989.6
2011q3 15021.1
2011q4 15190.3
2012q1 15291.0
2012q2 15362.4
2012q3 15380.8
2012q4 15384.3
2013q1 15491.9
2013q2 15521.6
2013q3 15641.3
2013q4 15793.9
2014q1 15747.0
-------- 开始下降 !!!
2014q2 15900.8
2014q3 16094.5
2014q4 16186.7
2015q1 16269.0
2015q2 16374.2
2015q3 16454.9
2015q4 16490.7
2016q1 16525.0
2016q2 16583.1
{'bottom': ['1947q3',
'1949q2',
'1954q1',
'1958q1',
'1970q1',
'1975q1',
'1980q3',
'1982q1',
'1991q1',
'2009q2'],
'end': ['1948q1',
'1950q2',
'1954q3',
'1958q3',
'1970q3',
'1975q3',
'1981q1',
'1983q1',
'1991q3',
'2009q4'],
'start': ['1947q2',
'1949q1',
'1953q3',
'1957q4',
'1969q4',
'1974q3',
'1980q2',
'1981q4',
'1990q4',
'2008q3']}
Q2:
# 重新审题,发现有年份要求:
# For this assignment, only look at GDP data from the first quarter of 2000 onward
def get_recession_start():
'''Returns the year and quarter of the recession start time as a
string value in a format such as 2005q3'''
start = result['start']
return start[-1] # 答案只要最近的
get_recession_start()
'2008q3'
Q3:
def get_recession_end():
'''Returns the year and quarter of the recession end time as a
string value in a format such as 2005q3'''
end = result['end']
return end[-1]
get_recession_end()
'2009q4'
Q4:
def get_recession_bottom():
'''Returns the year and quarter of the recession bottom time as a
string value in a format such as 2005q3'''
bottom = result['bottom']
return bottom[-1]
get_recession_bottom()
'2009q2'
Q5:
# 参考文献: (假设已知列的分组关系,并希望根据分组计算列的总计)
#
############
# 要生成映射关系:
# {'1996-04': '1996q2', '1996-05': '1996q2', ..., '2000-01': '2000q1', '2000-02': '2000q1', ...}
#
###########
quarters = {
'01': 'q1',
'02': 'q1',
'03': 'q1',
'04': 'q2',
'05': 'q2',
'06': 'q2',
'07': 'q3',
'08': 'q3',
'09': 'q3',
'10': 'q4',
'11': 'q4',
'12': 'q4'
}
def convert_to_quarters(name_list):
mapping = {}
quarter_list = [] # 保存那些要转换的字段
non_quarter_list = [] # 保存那些不需要转换的其他字段
for item in name_list:
name = item.split('-')
if len(name) > 1:
year = name[0]
month = name[1]
# 题目要求是 2000 年后的
if year >= '2000':
mapping[item] = year + quarters[month]
quarter_list.append(item)
else:
non_quarter_list.append(item)
return mapping, quarter_list, non_quarter_list
def convert_housing_data_to_quarters():
'''Converts the housing data to quarters and returns it as mean
values in a dataframe. This dataframe should be a dataframe with
columns for 2000q1 through 2016q3, and should have a multi-index
in the shape of ["State","RegionName"].
Note: Quarters are defined in the assignment description, they are
not arbitrary three month periods.
The resulting dataframe should have 67 columns, and 10,730 rows.
'''
housing_data_df = pd.read_csv('City_Zhvi_AllHomes.csv')
mapping, quarter_list, non_quarter_list = convert_to_quarters(housing_data_df.columns)
# 将 housing_data_df 分割成两个 df (一个是具体内容的,一个是基础信息的)
quarter_housing_data_df = housing_data_df[quarter_list]
non_quarter_housing_data_df = housing_data_df[non_quarter_list]
# 对具体内容按quarter进行聚类
quarter_housing_data_since_2000 = quarter_housing_data_df.groupby(mapping, axis=1)
mean_quarter_housing_data = quarter_housing_data_since_2000.mean()
# should have a multi-index in the shape of ["State","RegionName"]
city_df = housing_data_df[['State', 'RegionName']]
# 将 State 的缩写映射回全名
city_df['State'].replace(states, inplace=True)
result_df = city_df.merge(mean_quarter_housing_data, left_index=True, right_index=True)
result_df.set_index(['State', 'RegionName'], inplace=True)
return result_df
convert_housing_data_to_quarters()


还有最后一题,待填坑
