
[Coursera | Introduction to Data Science in Python] Assignment 4 - Hypothesis Testing


As a beginner who audited the course and, frustratingly, could not submit the assignment, all I can do is post my write-up here as a keepsake.


Assignment 4 - Hypothesis Testing

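This assignment requires more independent learning than the previous ones. You are encouraged to explore the pandas documentation (pandas.pydata.org/pandas-docs/stable/) to find functions or methods you have not used yet, or to ask questions on Stack Overflow and tag them as pandas and python related. The discussion forums are also open for interaction with your peers and the course staff.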

Definitions:

  • A quarter is a specific three-month period: Q1 is January through March, Q2 is April through June, Q3 is July through September, and Q4 is October through December.
  • A recession is defined as starting with two consecutive quarters of GDP decline and ending with two consecutive quarters of GDP growth.
  • A recession bottom (trough) is the quarter within a recession that had the lowest GDP.
  • A university town is a city where university students make up a high percentage of the total population.
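The recession definitions above can be sketched on a short, made-up GDP series. The function name find_recession and the numbers are illustrative assumptions, not part of the assignment. The start is taken here as the first of the two declining quarters, and the end as the second of the two growing quarters.

```python
def find_recession(gdp):
    """Return (start, end, trough) positions of the first recession, or None."""
    start = None
    for i in range(2, len(gdp)):
        fell_twice = gdp[i - 1] < gdp[i - 2] and gdp[i] < gdp[i - 1]
        rose_twice = gdp[i - 1] > gdp[i - 2] and gdp[i] > gdp[i - 1]
        if start is None and fell_twice:
            start = i - 1  # first of the two declining quarters
        elif start is not None and rose_twice:
            end = i  # second of the two growing quarters
            # Trough: the quarter with the lowest GDP between start and end.
            trough = min(range(start, end + 1), key=lambda k: gdp[k])
            return start, end, trough
    return None


# Made-up quarterly GDP values, for illustration only.
quarters = ['2000q1', '2000q2', '2000q3', '2000q4', '2001q1',
            '2001q2', '2001q3', '2001q4', '2002q1']
gdp = [100, 102, 101, 98, 97, 96.5, 97.5, 99, 101]

start, end, trough = find_recession(gdp)
print(quarters[start], quarters[end], quarters[trough])
```

On this toy series the decline begins in 2000q3, growth resumes for a second consecutive quarter in 2001q4, and the lowest value falls in 2001q2.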

Hypothesis: The mean housing prices of university towns are less affected by recessions. Run a t-test that compares university towns with non-university towns on the ratio of housing prices in the quarter before the recession starts to housing prices at the recession bottom (price_ratio = preceding_quarter_of_recession_start / recession_trough).

The following data files are available for this assignment:

  • From Zillow's research data website: housing data for the United States. The file City_Zhvi_AllHomes.csv contains median home sale prices at a fine-grained, per-city level.
  • From the Wikipedia page on college towns: a list of university towns in the United States, saved as university_towns.txt.
  • From the U.S. Bureau of Economic Analysis: a quarterly GDP time series, stored in gdplev.xls. For this assignment, only look at GDP data from Q1 2000 onward.

Each function in this assignment is worth 10%, except run_ttest(), which is worth 50%.

Question 1

Returns a DataFrame of towns and the states they are in, taken from the university_towns.txt list. The DataFrame should have the format: DataFrame([["Michigan", "Ann Arbor"], ["Michigan", "Yipsilanti"]], columns=["State", "RegionName"]). The following cleaning needs to be done:

1. For State, remove every character from "[" to the end of the line.
2. For RegionName, where applicable, remove every character from " (" to the end of the line.
3. Depending on how you read the data, you may need to remove the newline character "\n".
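As a rough illustration of these three cleaning steps, here is a vectorized pandas sketch run on a few hand-written sample lines instead of the real file. The sample lines and variable names are assumptions made for the example.

```python
import pandas as pd

# Hand-written sample in the same shape as university_towns.txt (illustrative only).
lines = pd.Series([
    "Michigan[edit]\n",
    "Ann Arbor (University of Michigan)[1]\n",
    "Ypsilanti (Eastern Michigan University)\n",
    "Ohio[edit]\n",
    "Athens (Ohio University)\n",
]).str.strip()  # step 3: drop the newline characters

# Step 1: state headers are the lines tagged with "[edit]".
is_state = lines.str.contains("[edit]", regex=False)
state = lines.where(is_state).str.split("[").str[0].ffill()

# Step 2: for town lines, keep only the text before "(".
towns = pd.DataFrame({
    "State": state[~is_state],
    "RegionName": lines[~is_state].str.split("(").str[0].str.strip(),
}).reset_index(drop=True)

print(towns)
```

The forward fill (ffill) carries each state name down to the town lines that follow it, so no explicit loop over row positions is needed.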

    import pandas as pd

    def get_list_of_university_towns():
        """Return a DataFrame of university towns with columns ['State', 'RegionName']."""
        rows = []
        state = None
        with open('university_towns.txt', encoding='utf-8') as f:
            for line in f:
                line = line.strip()  # also drops the trailing "\n"
                if not line:
                    continue
                if '[edit]' in line:
                    # State header, e.g. "Michigan[edit]": keep the text before "[".
                    state = line.split('[')[0].strip()
                else:
                    # Town line: keep the text before " (".
                    rows.append([state, line.split('(')[0].strip()])
        return pd.DataFrame(rows, columns=['State', 'RegionName'])

    get_list_of_university_towns()

Question 2

Returns the year and quarter of the recession start time as a string value in a format such as 2005q3.

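    import numpy as np
    import pandas as pd

    # Shared preparation for the recession start, end, and bottom.
    gdplev = pd.read_excel('gdplev.xls', skiprows=218, usecols=[4, 6], names=['Quarter', 'GDP'])

    # growth[i] is 1 if GDP rose from quarter i-1 to quarter i, else 0.
    growth = [np.nan]
    for i in range(1, len(gdplev)):
        growth.append(1 if gdplev['GDP'][i] > gdplev['GDP'][i - 1] else 0)

    for i in range(3, len(growth)):
        # Two declining quarters followed by two growing quarters:
        # quarter i is the end of the recession.
        if growth[i - 3] == 0 and growth[i - 2] == 0 and growth[i - 1] == 1 and growth[i] == 1:
            end = gdplev['Quarter'][i]
            end_index = i
            # Walk back to the last growing quarter; the recession starts right after it.
            for j in range(4, i + 1):
                if growth[i - j] == 1:
                    start = gdplev['Quarter'][i - j + 1]
                    start_index = i - j + 1
                    break

    def get_recession_start():
        return start

    get_recession_start()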

Question 3

Returns the year and quarter of the recession end time as a string value in a format such as 2005q3.

    def get_recession_end():
        return end

    get_recession_end()

Question 4

Returns the year and quarter of the recession bottom time as a string value in a format such as 2005q3.

    # Quarter with the lowest GDP between the recession start and end.
    bottom_index = gdplev.iloc[start_index:end_index]['GDP'].idxmin()
    bottom = gdplev['Quarter'][bottom_index]

    def get_recession_bottom():
        return bottom

    get_recession_bottom()

Question 5

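Converts the housing data to quarters and returns it as mean values in a DataFrame. The DataFrame should have columns for 2000q1 through 2016q3 and a multi-index of the form ['State', 'RegionName']. The resulting DataFrame should have 67 columns and 10,730 rows. Note: quarters are as defined in the assignment description above; they are not arbitrary three-month periods.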
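To make the monthly-to-quarterly step concrete, here is a small sketch on a made-up frame with two towns and five months. The prices are invented, and the second quarter deliberately has only two months, in the same way that the last quarter of the real data may be incomplete.

```python
import pandas as pd

# Invented monthly prices for two towns (illustrative only).
monthly = pd.DataFrame(
    {
        "2000-01": [100.0, 200.0],
        "2000-02": [110.0, 210.0],
        "2000-03": [120.0, 220.0],
        "2000-04": [130.0, 230.0],
        "2000-05": [140.0, None],
    },
    index=pd.MultiIndex.from_tuples(
        [("Michigan", "Ann Arbor"), ("Ohio", "Athens")],
        names=["State", "RegionName"],
    ),
)

# Map each month label to its quarter, e.g. "2000-04" -> "2000q2".
quarter_labels = pd.PeriodIndex(monthly.columns, freq="Q").astype(str).str.lower()

# Relabel the columns, then average all months that share a quarter label.
quarterly = monthly.set_axis(quarter_labels, axis=1).T.groupby(level=0).mean().T

print(quarterly)
```

Because mean() skips missing values, the town with a missing month still gets a quarterly figure computed from the months that are present.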

    def convert_housing_data_to_quarters():
        housing = pd.read_csv('City_Zhvi_AllHomes.csv')
        # `states` is the abbreviation-to-name dict from the assignment notebook
        # (not shown in this post); it makes State match university_towns.txt.
        housing['State'] = housing['State'].replace(states)
        housing = housing.set_index(['State', 'RegionName'])
        # Keep the monthly price columns from 2000-01 onwards.
        monthly = housing.loc[:, '2000-01':]
        # Relabel each month with its quarter, e.g. '2000-01' -> '2000q1'.
        quarters = pd.PeriodIndex(monthly.columns, freq='Q').astype(str).str.lower()
        monthly = monthly.set_axis(quarters, axis=1)
        # Average the months within each quarter; mean() skips missing values.
        return monthly.T.groupby(level=0).mean().T

    convert_housing_data_to_quarters()

Question 6

First creates new data showing the decline or growth of housing prices between the recession start and the recession bottom. Then runs a t-test comparing the university town values with the non-university town values, and reports whether the null hypothesis (that the two groups are the same) can be rejected, together with the p-value.

Return the tuple (different, p, better), where different is True if the t-test is significant at p < 0.01 (we reject the null hypothesis) and False otherwise (we cannot reject it). The variable p should be the exact p-value returned by scipy.stats.ttest_ind(). The value of better should be either "university town" or "non-university town", depending on which group has the lower mean price ratio (which is equivalent to a reduced market loss).
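The decision rule described above can be tried out on made-up price ratios before touching the real data. This is only a sketch: the helper name compare and the numbers are assumptions for illustration, and scipy.stats.ttest_ind is called with its default equal-variance setting.

```python
import numpy as np
from scipy.stats import ttest_ind

# Invented price ratios (quarter before recession start / recession bottom).
ratio_university = [1.02, 1.03, 1.01, 1.04, 1.02, 1.03]
ratio_other = [1.10, 1.12, 1.09, 1.11, 1.13, 1.10]


def compare(ratio_u, ratio_nu, alpha=0.01):
    """Return (different, p, better) for two samples of price ratios."""
    statistic, p = ttest_ind(ratio_u, ratio_nu, nan_policy="omit")
    different = bool(p < alpha)
    # A lower mean ratio means prices fell less between the two quarters.
    if np.nanmean(ratio_u) < np.nanmean(ratio_nu):
        better = "university town"
    else:
        better = "non-university town"
    return different, float(p), better


print(compare(ratio_university, ratio_other))
```

The two toy samples are clearly separated, so the test rejects the null hypothesis at the 0.01 level and the university sample comes out as the group with the lower mean ratio.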

    from scipy.stats import ttest_ind

    def run_ttest():
        housing = convert_housing_data_to_quarters()
        # The hypothesis uses the quarter before the recession start, not the start itself.
        before_start = housing.columns[housing.columns.get_loc(get_recession_start()) - 1]
        ratio = housing[before_start] / housing[get_recession_bottom()]
        ratio = ratio.rename('price_ratio').reset_index()

        # Flag university towns with a left merge on (State, RegionName).
        towns = get_list_of_university_towns().drop_duplicates()
        merged = ratio.merge(towns, how='left', on=['State', 'RegionName'], indicator=True)
        is_utown = merged['_merge'] == 'both'

        ratio_u = merged.loc[is_utown, 'price_ratio']
        ratio_nu = merged.loc[~is_utown, 'price_ratio']
        statistic, pvalue = ttest_ind(ratio_u, ratio_nu, nan_policy='omit')

        different = bool(pvalue < 0.01)
        # A lower mean price ratio means a smaller loss during the recession.
        if ratio_u.mean() < ratio_nu.mean():
            better = 'university town'
        else:
            better = 'non-university town'
        return (different, pvalue, better)

    run_ttest()
