[Coursera | Introduction to Data Science in Python] Assignment 3 - More Pandas

阅读量：

作为一个刚接触的新手玩家，在旁听相关课程后由于无法上交作业而感到无奈,实在忍不住了就想着放到网上做个纪念

Assignment 3 - More Pandas

This assignment calls for a greater emphasis on individual learning compared to the last assignment. We encourage you to explore resources such as the pandas documentation (available at http://pandas.pydata.org/pandas-docs/stable/) to discover functions and methods that may be new to you. Additionally, the discussion forums provide an open platform for interaction with both peers and course staff.

Question 1 (20%)

Import energy data from the file path Energy Indicators.xls provided by the United Nations, which serves as a list of indicators for energy supply and renewable electricity production. This dataset is available through their Excel file tables for 2013, and it should be loaded into a DataFrame named energy.

Note: This is an Excel file, not a comma-separated values (CSV) file.

['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']

Ensure that Energy Supply is transformed into gigajoules (note that there are 1,000,000 gigajoules in each petajoule). Make certain that any country exhibiting missing data—such as those indicated by an ellipsis—is appropriately represented using the np.NaN value.

Rename the following list of countries (for use in later questions):

"South Korean Republic": "South Korea",
"North American Nation": "United States of America",
"UK Associated Nations": "United Kingdom of Great Britain and Northern Ireland",
"China, Hong Kong Special Administrative Region": "China, Hong Kong Special Administrative Region (简称：香港特别行政区)"

There are also a number of countries' names that include numeric characters and/or parentheses. Make sure to delete these entries.

e.g.

'Bolivia (Plurinational State of)' should be 'Bolivia',

'Switzerland17' should be 'Switzerland'.

Next, import the GDP dataset from world_bank.csv, which includes country-level GDP data spanning from 1960 to 2015, sourced from World Bank. Assign this DataFrame to a variable named GDP.

Make sure to skip the header, and rename the following list of countries:

"Korea, Rep.": "South Korea", "Iran, Islamic Rep.": "Iran", "Hong Kong SAR, China": "Hong Kong"

Upon finalizing, we will import the Sciamgo Journal and Country Rank dataset for Energy Engineering and Power Technology from the file scimagojr-3.xlsx into our working environment. This dataset, which is specifically curated to assess country rankings based on their journal contributions within the aforementioned field, will be referred to as ScimEn.

Aggregate three datasets - GDP, Energy, and ScimEn - into a consolidated dataset by considering only common country names. Utilize solely the most recent 10-year span (specifically from 2006 to 2015) for GDP data and include those with the highest Scimagojr Rank from Rank 1 to Rank 15.

The index of this DataFrame should consist of the country names, and its columns should include a comprehensive list: ['Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index', 'Energy Supply', 'Energy Supply per Capita', '% Renewable', and years from 2006 to 2015 inclusive.]

This function should return a DataFrame with 20 columns and 15 entries.

复制代码

 import pandas as pd

    
 energy=pd.read_excel('Energy Indicators.xls',
    
                  skiprows=17, skipfooter=38,
    
                  na_values='...',
    
                  usecols=[2,3,4,5],
    
                  names=['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable'])
    
 energy['Energy Supply']*=1000000
    
 energy['Country']=energy['Country'].str.replace(r'\d| \(.+\)','')
    
 energy.replace(to_replace={'Republic of Korea': 'South Korea',
    
                        'United States of America': 'United States',
    
                        'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
    
                        'China, Hong Kong Special Administrative Region': 'Hong Kong'},
    
            inplace=True)
    
 GDP=pd.read_csv('world_bank.csv', skiprows=4)
    
 GDP.replace(to_replace={'Korea, Rep.': 'South Korea',
    
                     'Iran, Islamic Rep.': 'Iran',
    
                     'Hong Kong SAR, China': 'Hong Kong'},
    
         inplace=True)
    
 ScimEn=pd.read_excel('scimagojr-3.xlsx')
    
 def answer_one():
    
     answer=pd.merge(ScimEn.set_index('Country')[:15],
    
             energy.set_index('Country'),
    
             left_index=True,
    
             right_index=True)
    
     G=GDP.T[-10:].T
    
     G=G.set_index(GDP['Country Name'])
    
     answer=pd.merge(answer,G, left_index=True, right_index=True)
    
     return answer
    
 answer_one()

Question 2 (6.6%)

The prior inquiry merged three data sets initially and then streamlined them down to merely the top 15 entries. When integrating these datasets before streamlining them down to merely the top 15 items, how many records were lost in this process?

This function should return a single number.

复制代码

 def answer_two():

    
     return len(GDP['Country Name'].append(energy['Country'], ignore_index=True).append(ScimEn['Country'], ignore_index=True).unique())-15
    
 answer_two()
    
  
    
 #my answer is 303. I also have seen other answers, I'm not sure which is right.

回答以下问题基于全球排名前15的国家（即answer_one()返回的数据框）。

Question 3 (6.6%)

What was the average GDP per nation in the past ten years? (omitting any missing data in this computation.)

This function should produce a named Series called avgGDP, which contains data for 15 countries and their average GDP values, sorted in descending order.

复制代码

 def answer_three():

    
     Top15 = answer_one()
    
     avgGDP=Top15.T[-10:].mean(skipna=True)
    
     avgGDP.sort(ascending=False, inplace=True)
    
     return avgGDP
    
 answer_three()

Question 4 (6.6%)

What is the percentage change in GDP for a nation with an average GDP ranking of sixth over a 10-year period?

This function should return a single number.

复制代码

 def answer_four():

    
     Top15 = answer_one()
    
     avgGDP= answer_three()
    
     change=Top15.loc[avgGDP.index[5]]['2015']-Top15.loc[avgGDP.index[5]]['2006']
    
     return change
    
 answer_four()

Question 5 (6.6%)

What is the mean Energy Supply per Capita?

This function should return a single number.

复制代码

 def answer_five():

    
     Top15 = answer_one()
    
     return Top15['Energy Supply per Capita'].mean()
    
 answer_five()

Question 6 (6.6%)

What country has the maximum % Renewable and what is the percentage?

This function will return a tuple containing both the country's name and its percentage.

复制代码

 def answer_six():

    
     Top15 = answer_one()
    
     maxindex=Top15['% Renewable'].idxmax()
    
     return (maxindex,str(Top15['% Renewable'][maxindex])+'%')
    
 answer_six()

Problem 7 (approximately 6.6%)¶

Generate a new column which represents the ratio of Self-Citations to Total Citations. What is the maximum value of this new column, and which country has the highest ratio?

This function should be returns a tuple containing both the country's name and its ratio.

复制代码

 def answer_seven():

    
     Top15 = answer_one()
    
     Top15['ratio_Self-Total']=Top15['Self-citations']/Top15['Citations']
    
     maxindex=Top15['ratio_Self-Total'].idxmax()
    
     return (maxindex, Top15['ratio_Self-Total'][maxindex])
    
 answer_seven()

Question 8 (6.6%)

Generate a new column that estimates the population based on Energy Supply and Energy Supply per capita. What country has the third highest population based on this estimation?

This function should return a single string value.

复制代码

 def answer_eight():

    
     Top15 = answer_one()
    
     Top15['Pop_estimate']=Top15['Energy Supply']/Top15['Energy Supply per Capita']
    
     POP=Top15['Pop_estimate'].copy()
    
     POP.sort(inplace=True)
    
     return POP.index[-3]
    
 answer_eight()

Question 9 (6.6%)

Generate a column named "人均可引用文献量" to estimate the number of citable documents per person. To explore the relationship between "人均可引用文献量" and "人均能源供应量", employ Pearson's correlation coefficient by utilizing the .corr() method.

This function should return a single number.

Invoke the built-in function plot9() to display the relationship between Energy Supply per Capita and Citable documents per capita.

复制代码

 def answer_nine():

    
     Top15 = answer_one()
    
     Top15['Citable docs per Capita']=Top15['Citable documents']/(Top15['Energy Supply']/Top15['Energy Supply per Capita'])
    
     return Top15['Citable docs per Capita'].corr(Top15['Energy Supply per Capita'], method='pearson')
    
 answer_nine()

Question 10 (6.6%)

Generate a new column by assigning a value of 1 to countries whose % Renewable figure is at or above the median among all countries ranked in the top 15, and assign a value of 0 to those with a % Renewable figure below that median.

This function should be returned as a series named HighRenew, with its index consisting of country names sorted by their rank in ascending order.

复制代码

 def answer_ten():

    
     Top15 = answer_one()
    
     Top15.sort(columns='Rank', inplace=True)
    
     HighRenew=Top15['% Renewable']>=Top15['% Renewable'].median()
    
     return HighRenew.astype('int')
    
 answer_ten()
    
  
    
 #这一题我真的觉得自己很机智嘻嘻

Question 11 (6.6%)

According to the provided dictionary, group countries by continent. Then create a dataframe that shows the number of countries in each continent group. This dataframe should display sum, mean, and standard deviation of each country’s estimated population.

复制代码

 ContinentDict  = {'China':'Asia',

    
               'United States':'North America', 
    
               'Japan':'Asia', 
    
               'United Kingdom':'Europe', 
    
               'Russian Federation':'Europe', 
    
               'Canada':'North America', 
    
               'Germany':'Europe', 
    
               'India':'Asia',
    
               'France':'Europe', 
    
               'South Korea':'Asia', 
    
               'Italy':'Europe', 
    
               'Spain':'Europe', 
    
               'Iran':'Asia',
    
               'Australia':'Australia', 
    
               'Brazil':'South America'}

This function should be returned to have an index named Continent['Asia','Australia','Europe','North America','South America'], consisting of the columns ['size','sum','mean','std'].

复制代码

 def answer_eleven():

    
     Top15 = answer_one()
    
     ContinentDict  = {'China':'Asia', 
    
                   'United States':'North America', 
    
                   'Japan':'Asia', 
    
                   'United Kingdom':'Europe', 
    
                   'Russian Federation':'Europe', 
    
                   'Canada':'North America', 
    
                   'Germany':'Europe', 
    
                   'India':'Asia',
    
                   'France':'Europe', 
    
                   'South Korea':'Asia', 
    
                   'Italy':'Europe', 
    
                   'Spain':'Europe', 
    
                   'Iran':'Asia',
    
                   'Australia':'Australia', 
    
                   'Brazil':'South America'}
    
     Continent=pd.Series(data=ContinentDict)
    
     Top15['Continent']=Continent
    
     Top15['Pop_estimate']=Top15['Energy Supply']/Top15['Energy Supply per Capita']
    
     return Top15.groupby(by='Continent')['Pop_estimate'].agg(['size','sum', 'mean','std'])
    
 answer_eleven()

Question 12 (6.6%)

Divide the % Renewable variable into 5 categories. Group the top 15 countries by continent and also include these new % Renewable categories. How many countries are in each of these groups?

This function should return a pandas Series object with a MultiIndex composed of Continent and then group the data by the % Renewable bins. To exclude groups that do not contain any member countries.

复制代码

 def answer_twelve():

    
     Top15 = answer_one()
    
     ContinentDict  = {'China':'Asia', 
    
                   'United States':'North America', 
    
                   'Japan':'Asia', 
    
                   'United Kingdom':'Europe', 
    
                   'Russian Federation':'Europe', 
    
                   'Canada':'North America', 
    
                   'Germany':'Europe', 
    
                   'India':'Asia',
    
                   'France':'Europe', 
    
                   'South Korea':'Asia', 
    
                   'Italy':'Europe', 
    
                   'Spain':'Europe', 
    
                   'Iran':'Asia',
    
                   'Australia':'Australia', 
    
                   'Brazil':'South America'}
    
     Continent=pd.Series(data=ContinentDict)
    
     Top15['Continent']=Continent
    
     Top15['% Renewable']=pd.cut(Top15['% Renewable'],5)
    
     return Top15.groupby(by=['Continent','% Renewable']).size()
    
 answer_twelve()

Question 13 (6.6%)

Transform the Population Estimate series into a string, incorporating thousands separators using commas. Preclude rounding of the results.

e.g. 317615384.61538464 -> 317,615,384.61538464

This function will generate a Series named PopEst that includes population estimates for each country. The indexed values within this series correspond to country names, while the associated data points represent population estimates.

复制代码

 def answer_thirteen():

    
     Top15 = answer_one()
    
     PopEst=Top15['Energy Supply']/Top15['Energy Supply per Capita']
    
     return PopEst.apply('{:,}'.format)
    
 answer_thirteen()

;

Use the built in function plot_optional() to see an example visualization.

在当前项目中提供的源文件（如下）存在一些问题，在执行过程中会导致系统提示错误信息：例如，在处理浮点数时发现 $float$ 类型的变量不包含 $sqrt$ 属性。

复制代码

 def plot_optional():

    
     import matplotlib as plt
    
     import numpy as np
    
     %matplotlib inline
    
     Top15 = answer_one()
    
     ax = Top15.plot(x='Rank', y='% Renewable', kind='scatter', 
    
                 c=['#e41a1c','#377eb8','#e41a1c','#4daf4a','#4daf4a','#377eb8','#4daf4a','#e41a1c',
    
                    '#4daf4a','#e41a1c','#4daf4a','#4daf4a','#e41a1c','#dede00','#ff7f00'], 
    
                 xticks=range(1,16), s=6*Top15['2014']/10**10, alpha=.75, figsize=[16,6]);
    
  
    
     for i, txt in enumerate(Top15.index):
    
     ax.annotate(txt, [Top15['Rank'][i], Top15['% Renewable'][i]], ha='center')
    
  
    
     print("This is an example of a visualization that can be created to help understand the data. \
    
 This is a bubble chart showing % Renewable vs. Rank. The size of the bubble corresponds to the countries' \
    
 2014 GDP, and the color corresponds to the continent.")

因此只需将Top15['2014']中的数据格式类型转换为整型数值即可；通过执行Top15['2014'].astype(int)，其效果如何？

全部评论 (0)

还没有任何评论哟~

[Coursera | Introduction to Data Science in Python] Assignment 3 - More Pandas

作为一个近乎小白的新玩家，旁听的这个课又无法提交，想来想去还是发出来留个纪念嘻嘻。 Assignment3MorePandas Thisassignmentrequiresmoreindividual...

[Coursera | Introduction to Data Science in Python] Assignment 4 - Hypothesis Testing

作为一个近乎小白的新玩家，旁听的这个课又无法提交，想来想去还是发出来留个纪念嘻嘻。 Assignment4HypothesisTesting Thisassignmentrequiresmoreind...

Introduction to Data Science in Python 第 3 周 Assignment

Assignment3MorePandas Question120% LoadtheenergydatafromthefileEnergyIndicators.xls,whichisalistofin...

Coursera-Applied Data Science with Python-Introduction to Data Science in Python-Week2

一、TheSeriesDataStructure： Series是pandas的核心数据结构之一，在pandas中，Series是一个一维的类似的数组对象，可以看成一个介于list和dictionar...

Coursera-Applied Data Science with Python-Introduction to Data Science in Python-Week3

一、MergingDataframes：首先，我们先复习一下上周所学的内容，上周主要是学习了Pandas中两个核心的数据结构：一维结构的Series和二维结构的DataFrame。

Introduction to Data Science in Python 第 4 周 Assignment

Assignment4HypothesisTesting Thisassignmentrequiresmoreindividuallearningthanpreviousassignmentsyoua...

Introduction to Data Science in Python 第 2 周 Assignment

IntroductiontoDataScienceinPython 第2周Assignment 记录下问题和自己的答案 Part1 Thefollowingcodeloadstheolympicsda...

Coursera-Applied Data Science with Python-Introduction to Data Science in Python-assignment2第二周作业

\最近在上密歇根大学开的python应用数据科学AppliedDataSciencewithPython \这个课比较适合有编程基础的人 \第二周的作业，总共有8道题提交系统时不时会抽风导致明明只错...

Coursera | Introduction to Data Science in Python（University of Michigan）| Quiz答案

最后还是把assignment代码放出来了，这门课的assignment还是有点难度的，特别是assigment4（哀怨），放给大家参考啦有时间（需求）就把所有代码放到github上（好担心被河蟹啊...

Coursera | Introduction to Data Science in Python（University of Michigan）| quiz和assignment1-4答案

最后还是把assignment代码放出来了，u1s1,这门课的assignment还是有点难度的，特别是assigment4（哀怨），放给大家参考啦有时间（需求）就把所有代码放到github上（好担...

是否确定退出登录?

[Coursera | Introduction to Data Science in Python] Assignment 3 - More Pandas

Assignment 3 - More Pandas

Question 1 (20%)

Question 2 (6.6%)

Question 3 (6.6%)

Question 4 (6.6%)

Question 5 (6.6%)

Question 6 (6.6%)

Problem 7 (approximately 6.6%)¶

Question 8 (6.6%)

Question 9 (6.6%)

Question 10 (6.6%)

Question 11 (6.6%)

Question 12 (6.6%)

Question 13 (6.6%)

全部评论 (0)

相关文章推荐

[Coursera | Introduction to Data Science in Python] Assignment 3 - More Pandas

[Coursera | Introduction to Data Science in Python] Assignment 4 - Hypothesis Testing

Introduction to Data Science in Python 第 3 周 Assignment

Coursera-Applied Data Science with Python-Introduction to Data Science in Python-Week2

Coursera-Applied Data Science with Python-Introduction to Data Science in Python-Week3

Introduction to Data Science in Python 第 4 周 Assignment

Introduction to Data Science in Python 第 2 周 Assignment

Coursera-Applied Data Science with Python-Introduction to Data Science in Python-assignment2第二周作业

Coursera | Introduction to Data Science in Python（University of Michigan）| Quiz答案

Coursera | Introduction to Data Science in Python（University of Michigan）| quiz和assignment1-4答案