Introduction to Data Science in Python 第 3 周 Assignment

阅读量：

Assignment 3 - More Pandas

Question 1 (20%)

Import energy information from the specified file path: Energy%20Indicators.xls, which includes indicators related to energy supply and renewable electricity production. This dataset originates from the United Nations' official website (http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy Indicators.xls) for the year 2013. The data should then be assigned to a DataFrame with variable name energy.

Please note that this is an Excel file format, not a comma-separated values (CSV) file. You must ensure that the header and footer information are excluded from the dataset. The first two columns are unnecessary; therefore, they should be removed. Additionally, you should rename these columns to reflect their intended structure.

['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']

Express the Energy Supply variable in terms of gigajoules (note that one petawatt-hour equals one terawatt-hour). For all countries where there’s a lack of data (such as cases where the data includes “…”) ensure that this is represented as np.NaN values.

Rename the following list of countries (for use in later questions):

复制代码

    "Republic of Korea": "South Korea",
    "United States of America": "United States",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "China, Hong Kong Special Administrative Region": "Hong Kong"

Additionally, there are multiple countries with numerical identifiers or parenthetical references in their names. Please ensure that these entries are removed.

e.g.

'Bolivia (Plurinational State of)' should be 'Bolivia',

'Switzerland17' should be 'Switzerland'.

Next, obtain the GDP dataset from the CSV file named world_bank.csv, which contains country-level GDP figures spanning from 1960 to 2015 sourced directly from the World Bank. Assign this dataset to a variable named GDP.

Make sure to skip the header, and rename the following list of countries:

复制代码

    "Korea, Rep.": "South Korea", 
    "Iran, Islamic Rep.": "Iran",
    "Hong Kong SAR, China": "Hong Kong"

Finally, load the Sciamgo Journal and Country Rank data for Energy Engineering and Power Technology (available at http://www.scimagojr.com/countryrank.php?category=2102) from the file scimagojr-3.xlsx, which is designed to rank countries based upon their journal contributions in that specific field. Assign this dataset to a DataFrame named ScimEn.

Aggregate three data sources: GDP, Energy, and ScimEn, into a unified dataset by leveraging common country identifiers. Only restrict to GDP statistics from the most recent decade (2006–2015), while selecting top nations based on Scimagojr ranking for inclusion.

The index of this DataFrame is designated as the country names, while the columns encapsulate a comprehensive set of metrics including ['排名', '文档数量', '可引用文档数量', '引用次数', '自引次数',
'平均每篇文档引用次数', 'H指数', '能源供给',
'人均能源供给', '可再生能源占比', '2006', '2007', '2008',
'2009', '2010', '2011', '2012', '2013', '2014', '2015'].

This function should return a DataFrame with 20 columns and 15 entries.

复制代码

    import pandas as pd
    import numpy as np

复制代码

    def get_energy():
        
    ###########################
    # Part One
    # Enery Indicators.xls
    # 
    energy_df = pd.read_excel('Energy Indicators.xls', skiprows=16)
    del energy_df['Unnamed: 0']
    del energy_df['Unnamed: 1']
    
    # Drop NA data in excel
    energy_df.dropna(inplace=True)
    
    # change the column labels ['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']
    energy_df.rename(columns={"Unnamed: 2":"Country", "Renewable Electricity Production": "% Renewable", "Energy Supply per capita": "Energy Supply per Capita"}, inplace=True)
    
    # Convert Energy Supply to gigajoules (there are 1,000,000 gigajoules in a petajoule).
    energy_df.replace('...', np.NaN, inplace=True)
    
    # For all countries which have missing data (e.g. data with "...") make sure this is reflected as np.NaN values
    energy_df['Energy Supply'] = energy_df['Energy Supply'] 
    
    # There are also several countries with numbers and/or parenthesis in their name
    energy_df['Country'] = energy_df['Country'].str.replace(' \(.*\)', '')
    energy_df['Country'] = energy_df['Country'].str.replace('[0-9]*', '')
    
    # Rename the following list of countries
    x = {
    "Republic of Korea": "South Korea",
    "United States of America": "United States",
    "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
    "China, Hong Kong Special Administrative Region": "Hong Kong"
    }
    
    for item in x:
        energy_df['Country'].replace(item, x[item], inplace=True)
        
    return energy_df
    
    
    def get_gdp():
    
    # Part TWO
    # world_bank.csv
    gdp_df = pd.read_csv('world_bank.csv', skiprows=4)
    
    # rename the following list of countries
    y = {
    "Korea, Rep.": "South Korea", 
    "Iran, Islamic Rep.": "Iran",
    "Hong Kong SAR, China": "Hong Kong"
    }
    
    for item in y:
        gdp_df['Country Name'].replace(item, y[item], inplace=True)
    
    gdp_df.rename(columns={'Country Name': 'Country'}, inplace=True)
        
    return gdp_df
    
    
    def get_rank():
    ###########################
    # Part THREE
    # scimagojr-3.xlsx
    ScimEn = pd.read_excel('scimagojr-3.xlsx')
    return ScimEn
    
    
    def answer_one():
    energy_df = get_energy()
    gdp_df = get_gdp()
    ScimEn = get_rank()
    
    ###########################
    # 
    # Use only the last 10 years (2006-2015) of GDP data and only the top 15 countries by Scimagojr 'Rank' (Rank 1 through 15).
    
    for i in range(1960, 2006):
        gdp_df.drop([str(i)], axis=1, inplace=True)
    
    ScimEn = ScimEn[0:15]
    
    # Join the three datasets: GDP, Energy, and ScimEn into a new dataset (using the intersection of country names).
    temp1 = pd.merge(energy_df, gdp_df, how="inner", left_on="Country", right_on="Country")
    temp2 = pd.merge(temp1, ScimEn, how="inner")
    
    # The index of this DataFrame should be the name of the country 
    temp2.set_index('Country', inplace=True)
     
    # the columns should be ['Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index', 'Energy Supply', 'Energy Supply per Capita', '% Renewable', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015'].
    list_to_be_drop = ['Country Code', 'Indicator Name', 'Indicator Code']
    for item in list_to_be_drop:
        temp2.drop(item, axis=1, inplace=True)
    
    # Check
    #     real = temp2.columns.tolist()
    #     shouldbe = ['Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index', 'Energy Supply', 'Energy Supply per Capita', '% Renewable', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015']
    
    #     print(real.sort() == shouldbe.sort())
    #     for item in real:
    #         if item in shouldbe:
    #             print('YES: {} in'.format(item))
    #         else:
    #             print('NO: {} not in'.format(item))
    
    return temp2    
    
    answer_one()

Question 2 (6.6%)

The previous question combined three datasets and then condensed these to just the top 15 entries. When joining the datasets before reducing them to the top 15 items, how many entries were lost?

This function should return a single number.

复制代码

    # 三个集合的并集： A + B + C - AB - AC - BC + ABC
    # 本题要求的是“lose after merging”，也就是 三者并集 减去 三者交集，即
    # A + B + C - AB - AC - BC
    
    def answer_two():
    energy_df = get_energy()
    gdp_df = get_gdp()
    rank_df = get_rank()
    
    energy_len = len(energy_df)
    #     print(energy_len)
    gdp_len = len(gdp_df)
    #     print(gdp_len)
    rank_len = len(rank_df)
    #     print(rank_len)
    
    tempAB = pd.merge(energy_df, gdp_df, how="inner")
    merge_AB_len = len(tempAB)
    #     print(merge_AB_len)
    
    tempAC = pd.merge(energy_df, rank_df, how="inner")
    merge_AC_len = len(tempAC)
    #     print(merge_AC_len)
    
    tempBC = pd.merge(rank_df, gdp_df, how="inner")
    merge_BC_len = len(tempBC)
    #     print(merge_BC_len)
    
    result = energy_len + gdp_len + rank_len - merge_AB_len - merge_AC_len - merge_BC_len
    
    #     print(result)
    
    return result
    
    answer_two()

Question 3 (6.6%)

What does the mean GDP over a decade typically reveal across all nations, omitting any incomplete data?

This function should generate a collection or dataset named avgGDP that contains data for 15 countries, whose average GDP is arranged from highest to lowest.

复制代码

    def answer_three():
    Top15 = answer_one()
    gdp_list = Top15[[str(i) for i in range(2006, 2016)]]
    # 默认是按列进行统计，这里要标记axis=1（按行统计）
    avgGDP = gdp_list.mean(axis=1).sort_values(ascending=False)
    return avgGDP
    
    answer_three()

Question 4 (6.6%)

What percentage change in GDP did that country experience over a 10-year period among those with an average GDP ranking of sixth?

This function should return a single number.

复制代码

    def answer_four():
    Top15 = answer_one()
    _6th = answer_three().index[5]
    gdp_list = Top15[[str(i) for i in range(2006, 2016)]]
    result = gdp_list.loc[_6th]
    return result[-1] - result[0]
    
    answer_four()

Question 5 (6.6%)

What is the mean Energy Supply per Capita?

This function should return a single number.

复制代码

    def answer_five():
    Top15 = answer_one()
    return Top15['Energy Supply per Capita'].mean(axis=0)
    
    answer_five()

Question 6 (6.6%)

What country has the maximum % Renewable and what is the percentage?

The function should be returned as a data structure containing the country name and the percentage.

复制代码

    def answer_six():
    Top15 = answer_one()
    max = Top15['% Renewable'].max()
    result = Top15[Top15['% Renewable'] == max]
    return (result.index[0], max)
    
    answer_six()

Question 7 (6.6%)

Construct a new column expressing the ratio of Self-Citations to Total Citations. What is the highest value of this new column, and which country exhibits the highest ratio?

This function is expected to output a tuple containing both the country's name and its ratio.

复制代码

    def answer_seven():
    Top15 = answer_one()
    total_citations = Top15['Self-citations'].sum()  # 万恶啊，不是算总数。。
    Top15['Ratio'] = Top15['Self-citations'] / Top15['Citations']
    max = Top15['Ratio'].max()
    result = Top15[Top15['Ratio'] == max]
    return (result.index[0], max)
    
    answer_seven()

Question 8 (6.6%)

Generate a column to estimate the population by incorporating both Energy Supply and Energy Supply per capita. According to this estimation, which country holds the third position in terms of population?

This function should return a single string value.

复制代码

    def answer_eight():
    Top15 = answer_one()
    Top15['Population'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
    Top15.sort_values(by='Population', ascending=False, inplace=True)
    return Top15.iloc[2].name
    
    answer_eight()

Question 9 (6.6%)

Construct a column to calculate the number of citable documents on a per-person basis. What is the relationship between citable documents per capita and energy supply per capita? Use the .corr() method, which employs Pearson’s correlation.

This function should return a single number.

(Optional: Use the built-in functionplot9() to visualize the relationship between Energy Supply per Capita vs. Citable docs per Capita)

复制代码

    def answer_nine():
    Top15 = answer_one()
    Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
    Top15['Citable docs per Capita'] = Top15['Citable documents'] / Top15['PopEst']
    result = Top15['Citable docs per Capita'].corr(Top15['Energy Supply per Capita'])
    return result
    
    answer_nine()

Question 10 (6.6%)

Generate a new column variable that assigns a value of 1 to countries whose percentage of renewable energy is equal to or higher than the median value among all countries in the top 15, and assigns a value of 0 to countries whose percentage of renewable energy is below this median.

This function should be returned as a series named HighRenew, with its index consisting of country names sorted by rank in ascending order.

复制代码

    def answer_ten():
    Top15 = answer_one()
    median = Top15['% Renewable'].median()
    Top15['HighRenew'] = (Top15['% Renewable'] >= median) 
    return Top15.sort_values('Rank')['HighRenew']
    
    answer_ten()

Question 11 (6.6%)

Refer to the provided dictionary to classify countries based on their Continent. Create a dataframe that displays for each country its estimated population, along with statistical measures such as sum, mean, and standard deviation.

复制代码

    ContinentDict  = {'China':'Asia', 
                  'United States':'North America', 
                  'Japan':'Asia', 
                  'United Kingdom':'Europe', 
                  'Russian Federation':'Europe', 
                  'Canada':'North America', 
                  'Germany':'Europe', 
                  'India':'Asia',
                  'France':'Europe', 
                  'South Korea':'Asia', 
                  'Italy':'Europe', 
                  'Spain':'Europe', 
                  'Iran':'Asia',
                  'Australia':'Australia', 
                  'Brazil':'South America'}

This function should produce a DataFrame with the index named Continent['Asia','Australia','Europe','North America','South America'] and whose columns are ['size','sum','mean','std'].

复制代码

    ContinentDict  = {'China':'Asia', 
                  'United States':'North America', 
                  'Japan':'Asia', 
                  'United Kingdom':'Europe', 
                  'Russian Federation':'Europe', 
                  'Canada':'North America', 
                  'Germany':'Europe', 
                  'India':'Asia',
                  'France':'Europe', 
                  'South Korea':'Asia', 
                  'Italy':'Europe', 
                  'Spain':'Europe', 
                  'Iran':'Asia',
                  'Australia':'Australia', 
                  'Brazil':'South America'}
    
    def answer_eleven():
    Top15 = answer_one()
    Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
    Top15_population = Top15['PopEst']
    group = Top15_population.groupby(ContinentDict)
    #     size = pd.DataFrame(group.size(), columns=['size'])
    #     print(size)
    #     sum = pd.DataFrame(group.sum()).rename(columns={'PopEst': 'sum'})
    #     print(sum)
    #     mean = pd.DataFrame(group.mean()).rename(columns={'PopEst': 'mean'})
    #     print(mean)
    #     std = pd.DataFrame(group.std()).rename(columns={'PopEst': 'std'})
    #     print(std)
    #     result = pd.merge(size, sum, left_index=True, right_index=True).merge(mean, left_index=True, right_index=True).merge(std, left_index=True, right_index=True)
    
    size = group.size().astype('float64')
    sum = group.sum()
    mean = group.mean()
    std = group.std()
    result = pd.DataFrame({
        'size': size,
        'sum': sum,
        'mean': mean,
        'std': std
    })
    return result
    
    answer_eleven()

Question 12 (6.6%)

Divide the % Renewable data into five equal parts. Categorize the Top15 countries by continent and also include the new % Renewable groups. It is important to determine how many nations fall into each of these categories.

This function should generate aSeries with a MultiIndex organized by Continent, followed by the bins for % Renewable. Issues involving groups without any member nations should be excluded.

复制代码

    ContinentDict  = {'China':'Asia', 
                  'United States':'North America', 
                  'Japan':'Asia', 
                  'United Kingdom':'Europe', 
                  'Russian Federation':'Europe', 
                  'Canada':'North America', 
                  'Germany':'Europe', 
                  'India':'Asia',
                  'France':'Europe', 
                  'South Korea':'Asia', 
                  'Italy':'Europe', 
                  'Spain':'Europe', 
                  'Iran':'Asia',
                  'Australia':'Australia', 
                  'Brazil':'South America'}
    
    def answer_twelve():
    Top15 = answer_one()
    cut = pd.cut(Top15['% Renewable'], bins=5)
    Top15['bin'] = cut
    group = Top15.groupby(by=[ContinentDict, cut])
    #     for name, group in result:
    #         print(name)
    #         print(group)
    result = group.size()
    return result
    
    answer_twelve()

Question 13 (6.6%)

Transform the Population Estimate series into a string format that incorporates a thousand separator symbol, specifically using comma separators. Without rounding the figures, ensure precision in the output.

e.g. 317615384.61538464 -> 317,615,384.61538464

This function should return a Series named PopEst, with an index consisting of country names and their values being population estimate strings.

复制代码

    def answer_thirteen():
    Top15 = answer_one()
    Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
    Top15['PopEst'] = Top15['PopEst'].map(lambda x: format(x, ',')) # 来源：
    result = Top15['PopEst']
    return result
    
    answer_thirteen()

全部评论 (0)

还没有任何评论哟~

Introduction to Data Science in Python 第 3 周 Assignment

Assignment3MorePandas Question120% LoadtheenergydatafromthefileEnergyIndicators.xls,whichisalistofin...

Introduction to Data Science in Python 第 4 周 Assignment

Assignment4HypothesisTesting Thisassignmentrequiresmoreindividuallearningthanpreviousassignmentsyoua...

Introduction to Data Science in Python 第 2 周 Assignment

IntroductiontoDataScienceinPython 第2周Assignment 记录下问题和自己的答案 Part1 Thefollowingcodeloadstheolympicsda...

[Coursera | Introduction to Data Science in Python] Assignment 3 - More Pandas

作为一个近乎小白的新玩家，旁听的这个课又无法提交，想来想去还是发出来留个纪念嘻嘻。 Assignment3MorePandas Thisassignmentrequiresmoreindividual...

[Coursera | Introduction to Data Science in Python] Assignment 4 - Hypothesis Testing

作为一个近乎小白的新玩家，旁听的这个课又无法提交，想来想去还是发出来留个纪念嘻嘻。 Assignment4HypothesisTesting Thisassignmentrequiresmoreind...

Coursera-Applied Data Science with Python-Introduction to Data Science in Python-assignment2第二周作业

\最近在上密歇根大学开的python应用数据科学AppliedDataSciencewithPython \这个课比较适合有编程基础的人 \第二周的作业，总共有8道题提交系统时不时会抽风导致明明只错...

Coursera-Applied Data Science with Python-Introduction to Data Science in Python-Week2

一、TheSeriesDataStructure： Series是pandas的核心数据结构之一，在pandas中，Series是一个一维的类似的数组对象，可以看成一个介于list和dictionar...

Coursera-Applied Data Science with Python-Introduction to Data Science in Python-Week3

一、MergingDataframes：首先，我们先复习一下上周所学的内容，上周主要是学习了Pandas中两个核心的数据结构：一维结构的Series和二维结构的DataFrame。

Cousera-Introduction to Data Science in Python Assignment1-4答案

CouseraIntroductiontoDataScienceinPythonAssignment14答案全是密歇根大学的python课程，作业难度对于初学者来说着实不太友好啊hhhh，为了方便各...

Coursera | Introduction to Data Science in Python（University of Michigan）| Quiz答案

最后还是把assignment代码放出来了，这门课的assignment还是有点难度的，特别是assigment4（哀怨），放给大家参考啦有时间（需求）就把所有代码放到github上（好担心被河蟹啊...

是否确定退出登录?

Introduction to Data Science in Python 第 3 周 Assignment

Assignment 3 - More Pandas

Question 1 (20%)

Question 2 (6.6%)

Question 3 (6.6%)

Question 4 (6.6%)

Question 5 (6.6%)

Question 6 (6.6%)

Question 7 (6.6%)

Question 8 (6.6%)

Question 9 (6.6%)

Question 10 (6.6%)

Question 11 (6.6%)

Question 12 (6.6%)

Question 13 (6.6%)

全部评论 (0)

相关文章推荐

Introduction to Data Science in Python 第 3 周 Assignment

Introduction to Data Science in Python 第 4 周 Assignment

Introduction to Data Science in Python 第 2 周 Assignment

[Coursera | Introduction to Data Science in Python] Assignment 3 - More Pandas

[Coursera | Introduction to Data Science in Python] Assignment 4 - Hypothesis Testing

Coursera-Applied Data Science with Python-Introduction to Data Science in Python-assignment2第二周作业

Coursera-Applied Data Science with Python-Introduction to Data Science in Python-Week2

Coursera-Applied Data Science with Python-Introduction to Data Science in Python-Week3

Cousera-Introduction to Data Science in Python Assignment1-4答案

Coursera | Introduction to Data Science in Python（University of Michigan）| Quiz答案