[Coursera | Introduction to Data Science in Python] Assignment 3 - More Pandas
作为一个刚接触的新手玩家,在旁听相关课程后由于无法上交作业而感到无奈,实在忍不住了就想着放到网上做个纪念
Assignment 3 - More Pandas
This assignment calls for a greater emphasis on individual learning compared to the last assignment. We encourage you to explore resources such as the pandas documentation (available at http://pandas.pydata.org/pandas-docs/stable/) to discover functions and methods that may be new to you. Additionally, the discussion forums provide an open platform for interaction with both peers and course staff.
Question 1 (20%)
Import energy data from the file path Energy Indicators.xls provided by the United Nations, which serves as a list of indicators for energy supply and renewable electricity production. This dataset is available through their Excel file tables for 2013, and it should be loaded into a DataFrame named energy.
Note: This is an Excel file, not a comma-separated values (CSV) file.
['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']
Ensure that Energy Supply is transformed into gigajoules (note that there are 1,000,000 gigajoules in each petajoule). Make certain that any country exhibiting missing data—such as those indicated by an ellipsis—is appropriately represented using the np.NaN value.
Rename the following list of countries (for use in later questions):
"South Korean Republic": "South Korea",
"North American Nation": "United States of America",
"UK Associated Nations": "United Kingdom of Great Britain and Northern Ireland",
"China, Hong Kong Special Administrative Region": "China, Hong Kong Special Administrative Region (简称:香港特别行政区)"
There are also a number of countries' names that include numeric characters and/or parentheses. Make sure to delete these entries.
e.g.
'Bolivia (Plurinational State of)' should be 'Bolivia',
'Switzerland17' should be 'Switzerland'.
Next, import the GDP dataset from world_bank.csv, which includes country-level GDP data spanning from 1960 to 2015, sourced from World Bank. Assign this DataFrame to a variable named GDP.
Make sure to skip the header, and rename the following list of countries:
"Korea, Rep.": "South Korea", "Iran, Islamic Rep.": "Iran", "Hong Kong SAR, China": "Hong Kong"
Upon finalizing, we will import the Sciamgo Journal and Country Rank dataset for Energy Engineering and Power Technology from the file scimagojr-3.xlsx into our working environment. This dataset, which is specifically curated to assess country rankings based on their journal contributions within the aforementioned field, will be referred to as ScimEn.
Aggregate three datasets - GDP, Energy, and ScimEn - into a consolidated dataset by considering only common country names. Utilize solely the most recent 10-year span (specifically from 2006 to 2015) for GDP data and include those with the highest Scimagojr Rank from Rank 1 to Rank 15.
The index of this DataFrame should consist of the country names, and its columns should include a comprehensive list: ['Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations', 'Citations per document', 'H index', 'Energy Supply', 'Energy Supply per Capita', '% Renewable', and years from 2006 to 2015 inclusive.]
This function should return a DataFrame with 20 columns and 15 entries.
import pandas as pd
energy=pd.read_excel('Energy Indicators.xls',
skiprows=17, skipfooter=38,
na_values='...',
usecols=[2,3,4,5],
names=['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable'])
energy['Energy Supply']*=1000000
energy['Country']=energy['Country'].str.replace(r'\d| \(.+\)','')
energy.replace(to_replace={'Republic of Korea': 'South Korea',
'United States of America': 'United States',
'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
'China, Hong Kong Special Administrative Region': 'Hong Kong'},
inplace=True)
GDP=pd.read_csv('world_bank.csv', skiprows=4)
GDP.replace(to_replace={'Korea, Rep.': 'South Korea',
'Iran, Islamic Rep.': 'Iran',
'Hong Kong SAR, China': 'Hong Kong'},
inplace=True)
ScimEn=pd.read_excel('scimagojr-3.xlsx')
def answer_one():
answer=pd.merge(ScimEn.set_index('Country')[:15],
energy.set_index('Country'),
left_index=True,
right_index=True)
G=GDP.T[-10:].T
G=G.set_index(GDP['Country Name'])
answer=pd.merge(answer,G, left_index=True, right_index=True)
return answer
answer_one()

Question 2 (6.6%)
The prior inquiry merged three data sets initially and then streamlined them down to merely the top 15 entries. When integrating these datasets before streamlining them down to merely the top 15 items, how many records were lost in this process?
This function should return a single number.
def answer_two():
return len(GDP['Country Name'].append(energy['Country'], ignore_index=True).append(ScimEn['Country'], ignore_index=True).unique())-15
answer_two()
#my answer is 303. I also have seen other answers, I'm not sure which is right.
回答以下问题基于全球排名前15的国家(即answer_one()返回的数据框)。
Question 3 (6.6%)
What was the average GDP per nation in the past ten years? (omitting any missing data in this computation.)
This function should produce a named Series called avgGDP, which contains data for 15 countries and their average GDP values, sorted in descending order.
def answer_three():
Top15 = answer_one()
avgGDP=Top15.T[-10:].mean(skipna=True)
avgGDP.sort(ascending=False, inplace=True)
return avgGDP
answer_three()

Question 4 (6.6%)
What is the percentage change in GDP for a nation with an average GDP ranking of sixth over a 10-year period?
This function should return a single number.
def answer_four():
Top15 = answer_one()
avgGDP= answer_three()
change=Top15.loc[avgGDP.index[5]]['2015']-Top15.loc[avgGDP.index[5]]['2006']
return change
answer_four()

Question 5 (6.6%)
What is the mean Energy Supply per Capita?
This function should return a single number.
def answer_five():
Top15 = answer_one()
return Top15['Energy Supply per Capita'].mean()
answer_five()

Question 6 (6.6%)
What country has the maximum % Renewable and what is the percentage?
This function will return a tuple containing both the country's name and its percentage.
def answer_six():
Top15 = answer_one()
maxindex=Top15['% Renewable'].idxmax()
return (maxindex,str(Top15['% Renewable'][maxindex])+'%')
answer_six()

Problem 7 (approximately 6.6%)¶
Generate a new column which represents the ratio of Self-Citations to Total Citations. What is the maximum value of this new column, and which country has the highest ratio?
This function should be returns a tuple containing both the country's name and its ratio.
def answer_seven():
Top15 = answer_one()
Top15['ratio_Self-Total']=Top15['Self-citations']/Top15['Citations']
maxindex=Top15['ratio_Self-Total'].idxmax()
return (maxindex, Top15['ratio_Self-Total'][maxindex])
answer_seven()

Question 8 (6.6%)
Generate a new column that estimates the population based on Energy Supply and Energy Supply per capita. What country has the third highest population based on this estimation?
This function should return a single string value.
def answer_eight():
Top15 = answer_one()
Top15['Pop_estimate']=Top15['Energy Supply']/Top15['Energy Supply per Capita']
POP=Top15['Pop_estimate'].copy()
POP.sort(inplace=True)
return POP.index[-3]
answer_eight()

Question 9 (6.6%)
Generate a column named "人均可引用文献量" to estimate the number of citable documents per person. To explore the relationship between "人均可引用文献量" and "人均能源供应量", employ Pearson's correlation coefficient by utilizing the .corr() method.
This function should return a single number.
Invoke the built-in function plot9() to display the relationship between Energy Supply per Capita and Citable documents per capita.
def answer_nine():
Top15 = answer_one()
Top15['Citable docs per Capita']=Top15['Citable documents']/(Top15['Energy Supply']/Top15['Energy Supply per Capita'])
return Top15['Citable docs per Capita'].corr(Top15['Energy Supply per Capita'], method='pearson')
answer_nine()


Question 10 (6.6%)
Generate a new column by assigning a value of 1 to countries whose % Renewable figure is at or above the median among all countries ranked in the top 15, and assign a value of 0 to those with a % Renewable figure below that median.
This function should be returned as a series named HighRenew, with its index consisting of country names sorted by their rank in ascending order.
def answer_ten():
Top15 = answer_one()
Top15.sort(columns='Rank', inplace=True)
HighRenew=Top15['% Renewable']>=Top15['% Renewable'].median()
return HighRenew.astype('int')
answer_ten()
#这一题我真的觉得自己很机智嘻嘻

Question 11 (6.6%)
According to the provided dictionary, group countries by continent. Then create a dataframe that shows the number of countries in each continent group. This dataframe should display sum, mean, and standard deviation of each country’s estimated population.
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
This function should be returned to have an index named Continent['Asia','Australia','Europe','North America','South America'], consisting of the columns ['size','sum','mean','std'].
def answer_eleven():
Top15 = answer_one()
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
Continent=pd.Series(data=ContinentDict)
Top15['Continent']=Continent
Top15['Pop_estimate']=Top15['Energy Supply']/Top15['Energy Supply per Capita']
return Top15.groupby(by='Continent')['Pop_estimate'].agg(['size','sum', 'mean','std'])
answer_eleven()

Question 12 (6.6%)
Divide the % Renewable variable into 5 categories. Group the top 15 countries by continent and also include these new % Renewable categories. How many countries are in each of these groups?
This function should return a pandas Series object with a MultiIndex composed of Continent and then group the data by the % Renewable bins. To exclude groups that do not contain any member countries.
def answer_twelve():
Top15 = answer_one()
ContinentDict = {'China':'Asia',
'United States':'North America',
'Japan':'Asia',
'United Kingdom':'Europe',
'Russian Federation':'Europe',
'Canada':'North America',
'Germany':'Europe',
'India':'Asia',
'France':'Europe',
'South Korea':'Asia',
'Italy':'Europe',
'Spain':'Europe',
'Iran':'Asia',
'Australia':'Australia',
'Brazil':'South America'}
Continent=pd.Series(data=ContinentDict)
Top15['Continent']=Continent
Top15['% Renewable']=pd.cut(Top15['% Renewable'],5)
return Top15.groupby(by=['Continent','% Renewable']).size()
answer_twelve()

Question 13 (6.6%)
Transform the Population Estimate series into a string, incorporating thousands separators using commas. Preclude rounding of the results.
e.g. 317615384.61538464 -> 317,615,384.61538464
This function will generate a Series named PopEst that includes population estimates for each country. The indexed values within this series correspond to country names, while the associated data points represent population estimates.
def answer_thirteen():
Top15 = answer_one()
PopEst=Top15['Energy Supply']/Top15['Energy Supply per Capita']
return PopEst.apply('{:,}'.format)
answer_thirteen()

Use the built in function plot_optional() to see an example visualization.
在当前项目中提供的源文件(如下)存在一些问题,在执行过程中会导致系统提示错误信息:例如,在处理浮点数时发现float类型的变量不包含sqrt属性。
def plot_optional():
import matplotlib as plt
import numpy as np
%matplotlib inline
Top15 = answer_one()
ax = Top15.plot(x='Rank', y='% Renewable', kind='scatter',
c=['#e41a1c','#377eb8','#e41a1c','#4daf4a','#4daf4a','#377eb8','#4daf4a','#e41a1c',
'#4daf4a','#e41a1c','#4daf4a','#4daf4a','#e41a1c','#dede00','#ff7f00'],
xticks=range(1,16), s=6*Top15['2014']/10**10, alpha=.75, figsize=[16,6]);
for i, txt in enumerate(Top15.index):
ax.annotate(txt, [Top15['Rank'][i], Top15['% Renewable'][i]], ha='center')
print("This is an example of a visualization that can be created to help understand the data. \
This is a bubble chart showing % Renewable vs. Rank. The size of the bubble corresponds to the countries' \
2014 GDP, and the color corresponds to the continent.")
因此只需将Top15['2014']中的数据格式类型转换为整型数值即可;通过执行Top15['2014'].astype(int),其效果如何?

