[Coursera | Introduction to Data Science in Python] Assignment 4 - Hypothesis Testing

As a near-beginner who only audited this course and therefore could not submit the assignments, I decided to post my solutions here as a keepsake.

Assignment 4 - Hypothesis Testing

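This assignment requires more individual learning than the previous ones. You are encouraged to browse the pandas documentation at pandas.pydata.org/pandas-docs/stable/ for functions or methods you have not used yet, or to ask questions on Stack Overflow and tag them as pandas and python related. The discussion forums are also open for interaction with your peers and the course staff.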

Definitions:

A quarter is a specific three-month period: Q1 is January through March, Q2 is April through June, Q3 is July through September, and Q4 is October through December.
A recession is defined as starting with two consecutive quarters of GDP decline and ending with two consecutive quarters of GDP growth.
A recession bottom (or trough) is the quarter within a recession that had the lowest GDP.
A university town is a city with a high percentage of university students relative to its total population.
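The recession definitions above can be sketched in plain Python on a short, made-up GDP series. The function name find_recession and the sample values are hypothetical and only illustrate the rules; this is not the grader's reference implementation.

```python
def find_recession(quarters, gdp):
    """Return (start, end, bottom) quarter labels, or None if no recession is found.

    start:  first of two consecutive quarters of GDP decline
    end:    second of two consecutive quarters of GDP growth after the start
    bottom: quarter with the lowest GDP between start and end
    """
    n = len(gdp)
    for s in range(1, n - 1):
        # Two consecutive declines mark the start of a recession.
        if gdp[s] < gdp[s - 1] and gdp[s + 1] < gdp[s]:
            for e in range(s + 2, n):
                # Two consecutive rises mark the end of the recession.
                if gdp[e] > gdp[e - 1] and gdp[e - 1] > gdp[e - 2]:
                    bottom = min(range(s, e + 1), key=lambda i: gdp[i])
                    return quarters[s], quarters[e], quarters[bottom]
            return None  # the recession does not end within the series
    return None


# Made-up GDP values, for illustration only.
quarters = ["2000q1", "2000q2", "2000q3", "2000q4",
            "2001q1", "2001q2", "2001q3", "2001q4"]
gdp = [100.0, 101.0, 99.0, 97.0, 96.0, 98.0, 99.5, 100.5]

print(find_recession(quarters, gdp))  # ('2000q3', '2001q3', '2001q1')
```

Here GDP falls in 2000q3 and 2000q4 (the start), reaches its minimum in 2001q1 (the bottom), and rises in 2001q2 and 2001q3 (the end).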

Hypothesis: University towns have their mean housing prices less affected by recessions. Run a t-test comparing the price ratio of university towns with that of non-university towns, where the ratio is the housing price in the quarter before the recession starts divided by the price at the recession trough (price_ratio = preceding_quarter_of_recession_start / recession_trough).

The following data files are available for this assignment:

From Zillow's research data website, housing data for the United States. In particular, the file City_Zhvi_AllHomes.csv contains median home sale prices at a fine-grained, per-city level.
The Wikipedia page on college towns provides a compiled list of university towns in the United States, which has been saved as university_towns.txt.
The U.S. Bureau of Economic Analysis' website offers GDP time series through its BEA database. These are presented in quarterly intervals and are stored in gdplev.xls. For this assignment, focus on GDP figures from Q1 2000 onwards.

Every function in this assignment carries a weight of 10%, except for run_ttest(), which has a value of 50%.

Question 1

Returns a DataFrame of towns and the states they are in, taken from the university_towns.txt list. The DataFrame should have the format: DataFrame([["Michigan", "Ann Arbor"], ["Michigan", "Ypsilanti"]], columns=["State", "RegionName"]). The following cleaning needs to be done:

1. For "State", remove every character from "[" to the end.
2. For "RegionName", where applicable, remove every character from " (" to the end.
3. Depending on how you read the data, you may need to remove the newline character "\n".
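The three cleaning rules above can be sketched on a few in-memory lines without touching the real file. The helper name parse_university_towns and the sample lines (including the footnote marker) are assumptions made for illustration; the actual university_towns.txt may contain cases this sketch does not cover.

```python
import re


def parse_university_towns(lines):
    """Turn raw text lines into (state, town) pairs using the three cleaning rules."""
    rows = []
    state = None
    for raw in lines:
        line = raw.rstrip("\n")               # rule 3: drop the newline character
        if not line:
            continue                          # ignore empty lines
        if "[edit]" in line:
            state = line.split("[")[0]        # rule 1: cut from "[" to the end
        else:
            town = re.sub(r" \(.*", "", line)  # rule 2: cut from " (" to the end
            rows.append((state, town))
    return rows


# Illustrative sample in the same shape as the file.
sample = [
    "Michigan[edit]\n",
    "Ann Arbor (University of Michigan)\n",
    "Ypsilanti (Eastern Michigan University)\n",
    "Ohio[edit]\n",
    "Athens (Ohio University)[10]\n",
]

for row in parse_university_towns(sample):
    print(row)
# ('Michigan', 'Ann Arbor')
# ('Michigan', 'Ypsilanti')
# ('Ohio', 'Athens')
```

The resulting pairs could be passed to pd.DataFrame(rows, columns=["State", "RegionName"]) to obtain the required shape.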

    import pandas as pd

    def get_list_of_university_towns():
        # Each line is either a state header ("Michigan[edit]") or a town entry.
        unitown = pd.read_table('university_towns.txt', header=None)
        # Row positions of the state header lines.
        stateindex = [i for i in range(len(unitown)) if '[edit]' in unitown[0][i]]
        stateindex.append(len(unitown))  # sentinel so the loop below can look one header ahead
        # Label every row with the state whose header precedes it.
        statename = []
        for j in range(len(stateindex) - 1):
            state = unitown[0][stateindex[j]].replace('[edit]', '')
            statename.extend([state] * (stateindex[j + 1] - stateindex[j]))
        unitown['State'] = statename
        unitown = unitown.drop(stateindex[:-1])  # drop the header rows themselves
        # Strip everything from " (" onward, e.g. the university name in parentheses.
        unitown['RegionName'] = unitown[0].str.replace(r'\s\(.+', '', regex=True)
        return unitown.drop(0, axis=1)

    get_list_of_university_towns()

Question 2

Returns the year and quarter of the recession start as a string value in a format such as 2005q3.

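    import numpy as np
    import pandas as pd

    # Preparation for computing the recession start, end, and bottom below.
    # skiprows/usecols depend on the layout of gdplev.xls: columns 4 and 6 of the
    # sheet are read as the quarter label and the GDP value.
    gdplev = pd.read_excel('gdplev.xls', skiprows=218, usecols=[4, 6], names=['Quarter', 'GDP'])

    # growth[i] is 1 if GDP rose relative to the previous quarter, else 0.
    growth = [np.nan]
    for i in range(1, len(gdplev)):
        if gdplev['GDP'][i] > gdplev['GDP'][i-1]:
            growth.append(1)
        else:
            growth.append(0)

    for i in range(3, len(growth)):
        # Two declines followed by two rises: quarter i ends the recession.
        if growth[i-1] == 1 and growth[i] == 1:
            if growth[i-2] == 0 and growth[i-3] == 0:
                end = gdplev['Quarter'][i]
                end_index = i
                # Walk back to the last growth quarter; the recession starts right after it.
                for j in range(4, i+1):
                    if growth[i-j] == 1:
                        start = gdplev['Quarter'][i-j+1]
                        start_index = i-j+1
                        break

    def get_recession_start():
        return start

    get_recession_start()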

Question 3

Returns the year and quarter of the recession end as a string value in a format such as 2005q3.

    def get_recession_end():
        return end

    get_recession_end()

Question 4

Calculates the year and quarter of the recession bottom time and returns it as a string, for example, 2005q3.

    # Quarter with the lowest GDP between the recession start and end.
    bottom_index = gdplev.iloc[start_index:end_index]['GDP'].idxmin()
    bottom = gdplev['Quarter'][bottom_index]

    def get_recession_bottom():
        return bottom

    get_recession_bottom()

Question 5

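Converts the housing data to quarters and returns it as mean values in a DataFrame. The DataFrame should have one column per quarter from 2000q1 through 2016q3 and a multi-index of the form ['State', 'RegionName']. The resulting DataFrame should have 67 columns and 10,730 rows. Note: quarters are defined in the assignment description above; they are not arbitrary three-month periods.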

    def convert_housing_data_to_quarters():
        housing = pd.read_csv('City_Zhvi_AllHomes.csv')
        # Monthly columns are labelled "YYYY-MM"; keep 2000-01 onwards.
        months = housing.loc[:, '2000-01':]
        # Map each month label to its quarter label, e.g. "2000-02" -> "2000q1".
        quarters = pd.PeriodIndex(months.columns, freq='Q').astype(str).str.lower()
        # Average the monthly columns that fall into the same quarter.
        h = months.T.groupby(quarters).mean().T
        # `states` is the abbreviation-to-name dictionary supplied with the assignment.
        state = housing['State'].replace(states)
        return h.set_index([state, housing['RegionName']])

    convert_housing_data_to_quarters()
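The month-to-quarter grouping can be illustrated without pandas on a single town's prices. The helper names month_to_quarter and quarterly_means and the prices are hypothetical, and the "YYYY-MM" label format is an assumption about how the monthly columns are named.

```python
from collections import defaultdict
from statistics import mean


def month_to_quarter(label):
    """Map a "YYYY-MM" label to a quarter label such as "2000q1"."""
    year, month = label.split("-")
    return f"{year}q{(int(month) - 1) // 3 + 1}"


def quarterly_means(monthly):
    """Average one town's monthly prices per quarter, skipping missing values (None)."""
    buckets = defaultdict(list)
    for label, price in monthly.items():
        if price is not None:
            buckets[month_to_quarter(label)].append(price)
    # Quarters with no data at all are simply omitted in this sketch.
    return {quarter: mean(prices) for quarter, prices in sorted(buckets.items())}


# Made-up monthly prices; 2000-05 is missing.
monthly = {
    "2000-01": 100.0, "2000-02": 110.0, "2000-03": 120.0,
    "2000-04": 130.0, "2000-05": None, "2000-06": 150.0,
}

print(quarterly_means(monthly))  # {'2000q1': 110.0, '2000q2': 140.0}
```

Because the quarter is derived from the calendar month, January through March always land in q1, which is what the note about non-arbitrary three-month periods requires.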

Question 6

First creates new data showing the decline or growth of housing prices between the recession start and the recession bottom. Then runs a t-test comparing the university-town values with the non-university-town values, and returns whether the two groups differ significantly, together with the p-value.

Return the tuple (different, p, better), where different is True if the t-test is significant at p < 0.01 (we reject the null hypothesis) and False otherwise (we cannot reject it). The variable p should equal the exact p-value returned by scipy.stats.ttest_ind(). The value of better should be either "university town" or "non-university town", depending on which group has the lower mean price ratio (which is equivalent to a smaller market loss).

    from scipy.stats import ttest_ind

    def run_ttest():
        housing = convert_housing_data_to_quarters()
        # Change in price from the recession start to the bottom (bottom - start).
        price = (housing[get_recession_bottom()] - housing[get_recession_start()]).to_frame('delta_price')
        university_towns = get_list_of_university_towns()

        # Method 1: flag university towns via a combined State+RegionName key.
        utown_keys = set(university_towns['State'] + university_towns['RegionName'])
        price = price.reset_index()
        price['Is_utowns'] = (price['State'] + price['RegionName']).isin(utown_keys)

        # Method 2: left-merge on the (State, RegionName) index instead.
        # university_towns['Is_utowns'] = True
        # university_towns.set_index(['State', 'RegionName'], inplace=True)
        # price = pd.merge(price, university_towns, how='left', left_index=True, right_index=True).fillna(False)

        price_u = price[price['Is_utowns']]['delta_price']
        price_nu = price[~price['Is_utowns']]['delta_price']
        statistic, pvalue = ttest_ind(price_u, price_nu, nan_policy='omit')

        different = bool(pvalue < 0.01)
        # delta_price is bottom minus start, so a larger mean means a smaller loss.
        if price_u.mean() > price_nu.mean():
            better = 'university town'
        else:
            better = 'non-university town'
        return (different, pvalue, better)

    run_ttest()
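The hypothesis is stated in terms of price_ratio = preceding_quarter_of_recession_start / recession_trough. A plain-Python sketch of that ratio on made-up prices is given here, together with the equal-variance two-sample t statistic, which is the statistic scipy.stats.ttest_ind computes by default (equal_var=True). The helper names and all numbers are hypothetical, and the p-value is not computed; in practice it would come from ttest_ind.

```python
from math import sqrt
from statistics import mean, variance


def price_ratio(before_start, bottom):
    """Price in the quarter before the recession start divided by the price at the bottom."""
    return before_start / bottom


def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled (equal) variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))


# Made-up (price before recession start, price at bottom) pairs per town.
university = [(200.0, 190.0), (150.0, 145.0), (300.0, 280.0)]
other = [(200.0, 160.0), (150.0, 125.0), (300.0, 230.0)]

ratios_u = [price_ratio(before, bottom) for before, bottom in university]
ratios_o = [price_ratio(before, bottom) for before, bottom in other]

t = pooled_t_statistic(ratios_u, ratios_o)
# A lower mean ratio means prices fell less, i.e. a smaller market loss.
better = "university town" if mean(ratios_u) < mean(ratios_o) else "non-university town"

print(better)  # university town
```

Note that with a ratio the "better" group is the one with the lower mean, whereas with a bottom-minus-start difference it is the one with the higher mean.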
