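Euro 2012 Ranking Functions: Python Data Analysis, First Look at the Data (Part 2)
Goal of this article: understand data filtering and sorting, get familiar with the Pandas data-analysis library, and apply it to process data more efficiently.
The future is bright.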
import pandas as pd
import numpy as np
# Load the dataset into a DataFrame and display it
euro = pd.read_csv(r'C:\Users\Administrator\Desktop\exercise_data\Euro2012_stats.csv')
euro
------------------------------
Result:
             Team  Goals  Shots on target  Shots off target  Shooting Accuracy  % Goals-to-shots  Total shots (inc. Blocked)  Hit Woodwork  Penalty goals  Penalties not scored  ...  Saves made  Saves-to-shots ratio  Fouls Won  Fouls Conceded  Offsides  Yellow Cards  Red Cards  Subs on  Subs off  Players Used
0         Croatia      4               13                12              51.9%             16.0%                          32             0              0                     0  ...          13                 81.3%         41              62         2             9          0        9         9            16
1  Czech Republic      4               13                18              41.9%             12.9%                          39             0              0                     0  ...           9                 60.1%         53              73         8             7          0       11        11            19
2         Denmark      4               10                10              50.0%             20.0%                          27             1              0                     0  ...          10                 66.7%         25              38         8             4          0        7         7            15
3         England      5               11                18              50.0%             17.2%                          40             0              0                     0  ...          22                 88.1%         43              45         6             5          0       11        11            16
# Select only the Goals column
euro['Goals']
# or: euro.Goals
--------------------------
Result:
0      4
1      4
2      4
3      5
4      3
5     10
6      5
7      6
8      2
9      2
10     6
11     1
12     5
13    12
14     5
15     2
Name: Goals, dtype: int64
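# Count how many teams took part
euro['Team'].nunique()
# or: euro.shape[0], which gives the same number here because each team occupies exactly one row
-------------------------------
Result:
16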
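The two counts above agree only because no team is repeated. A minimal sketch of the difference, using a hypothetical three-row frame (the names `teams`, `n_rows`, and `n_unique` are made up for illustration, and 'Spain' is duplicated on purpose):

```python
import pandas as pd

# Hypothetical toy frame, not the real dataset: 'Spain' appears twice on purpose.
teams = pd.DataFrame({"Team": ["Spain", "Italy", "Spain"], "Goals": [12, 6, 12]})

n_rows = teams.shape[0]              # number of rows, duplicates included
n_unique = teams["Team"].nunique()   # number of distinct team names

print(n_rows, n_unique)
```

When the question is "how many different teams", `nunique()` is the safer choice; `shape[0]` answers "how many rows".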
# Count the columns in the dataset
euro.shape[1]
---------------------------
Result:
35
# Keep only the Team, Yellow Cards and Red Cards columns in a new DataFrame
ddyl = euro[['Team', 'Yellow Cards', 'Red Cards']]
ddyl
--------------------------
Result:
                   Team  Yellow Cards  Red Cards
 0              Croatia             9          0
 1       Czech Republic             7          0
 2              Denmark             4          0
 3              England             5          0
 4               France             6          0
 5              Germany             4          0
 6               Greece             9          1
 7                Italy            16          0
 8          Netherlands             5          0
 9               Poland             7          1
10             Portugal            12          0
11  Republic of Ireland             6          1
12               Russia             6          0
13                Spain            11          0
14               Sweden             7          0
15              Ukraine             5          0
# Sort by red cards first, then by yellow cards, both in descending order
ddyl.sort_values(['Red Cards', 'Yellow Cards'], ascending=False)
-----------------------------
Result:
                   Team  Yellow Cards  Red Cards
 6               Greece             9          1
 9               Poland             7          1
11  Republic of Ireland             6          1
 7                Italy            16          0
10             Portugal            12          0
13                Spain            11          0
 0              Croatia             9          0
 1       Czech Republic             7          0
14               Sweden             7          0
 4               France             6          0
12               Russia             6          0
 3              England             5          0
 8          Netherlands             5          0
15              Ukraine             5          0
 2              Denmark             4          0
 5              Germany             4          0
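`sort_values` also accepts one direction per sort key. The sketch below is an illustration rather than part of the original exercise: it rebuilds five of the rows shown above by hand (the names `cards` and `ranked` are assumptions) and sorts red cards in descending order while breaking ties by yellow cards in ascending order.

```python
import pandas as pd

# Five of the sixteen teams from the card table above, typed in by hand.
cards = pd.DataFrame({
    "Team": ["Greece", "Poland", "Italy", "Denmark", "Spain"],
    "Yellow Cards": [9, 7, 16, 4, 11],
    "Red Cards": [1, 1, 0, 0, 0],
})

# One boolean per key: red cards descending, then yellow cards ascending.
ranked = cards.sort_values(
    ["Red Cards", "Yellow Cards"],
    ascending=[False, True],
)

print(ranked)
```

The list passed to `ascending` must have the same length as the list of sort keys.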
# Mean number of yellow cards per team, rounded to a whole number
round(ddyl['Yellow Cards'].mean(), 0)
-------------------------
Result:
7.0
# The comparison returns a boolean Series: True where a team scored more than 6 goals
euro.Goals > 6
--------------------------------
Result:
0     False
1     False
2     False
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13     True
14    False
15    False
Name: Goals, dtype: bool
# Use the boolean Series as a mask to keep only the teams with more than 6 goals
euro[euro.Goals > 6]
------------------------
Result:
       Team  Goals  Shots on target  Shots off target  Shooting Accuracy  % Goals-to-shots  Total shots (inc. Blocked)  Hit Woodwork  Penalty goals  Penalties not scored  ...  Saves made  Saves-to-shots ratio  Fouls Won  Fouls Conceded  Offsides  Yellow Cards  Red Cards  Subs on  Subs off  Players Used
 5  Germany     10               32                32              47.8%             15.6%                          80             2              1                     0  ...          10                 62.6%         63              49        12             4          0       15        15            17
13    Spain     12               42                33              55.9%             16.0%                         100             0              1                     0  ...          15                 93.8%        102              83        19            11          0       17        17            18
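Boolean masks can be combined before they are used for filtering. The following is a minimal sketch on four hand-typed rows taken from the tables above; the second condition (fewer than 10 yellow cards) and the names `goals`, `mask`, and `selected` are assumptions chosen only to illustrate the syntax.

```python
import pandas as pd

# Four teams with values copied from the tables above.
goals = pd.DataFrame({
    "Team": ["Germany", "Greece", "Italy", "Spain"],
    "Goals": [10, 5, 6, 12],
    "Yellow Cards": [4, 9, 16, 11],
})

# Each comparison needs its own parentheses, because & binds more tightly than > and <.
mask = (goals["Goals"] > 6) & (goals["Yellow Cards"] < 10)
selected = goals[mask]

print(selected)
```

Use `&` for "and", `|` for "or", and `~` for "not"; the Python keywords `and`, `or`, and `not` do not work element-wise on a Series.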
# Select the teams whose name starts with 'G'
euro[euro['Team'].str.startswith('G')]
# or: euro[euro.Team.str.startswith('G')]
----------------------------
Result:
      Team  Goals  Shots on target  Shots off target  Shooting Accuracy  % Goals-to-shots  Total shots (inc. Blocked)  Hit Woodwork  Penalty goals  Penalties not scored  ...  Saves made  Saves-to-shots ratio  Fouls Won  Fouls Conceded  Offsides  Yellow Cards  Red Cards  Subs on  Subs off  Players Used
5  Germany     10               32                32              47.8%             15.6%                          80             2              1                     0  ...          10                 62.6%         63              49        12             4          0       15        15            17
6   Greece      5                8                18              30.7%             19.2%                          32             1              1                     1  ...          13                 65.1%         67              48        12             9          1       12        12            20
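Note: the startswith() function
Purpose: checks whether a string begins with a given character or substring.
Syntax: string.startswith(str, beg=0, end=len(string)), or equivalently string[beg:end].startswith(str)
Parameters:
string: the string being checked
str: the character or substring to look for (a tuple may be passed, in which case each element is tried in turn)
beg: position at which the check starts (optional)
end: position at which the check stops (optional). If beg and end are given, only that range is checked; otherwise the whole string is checked.
Return value: True if the string starts with the given prefix, otherwise False. An empty prefix always returns True.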
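The behaviour described in the note can be checked on a plain Python string. This is a small sketch, not part of the original exercise; the variable names are made up. Note that in Python's built-in `str.startswith` the start and end positions are positional arguments, so the `beg`/`end` names used in the note are descriptive only and cannot be passed as keywords.

```python
team = "Republic of Ireland"

# Prefix check against the whole string.
whole = team.startswith("Rep")

# A tuple of prefixes: True if any one of them matches.
any_of = team.startswith(("G", "R"))

# Start the check at position 9, where the word "of" begins.
from_pos = team.startswith("of", 9)

# Equivalent form using a slice instead of the start/end arguments.
sliced = team[9:11].startswith("of")

# An empty prefix matches every string.
empty = team.startswith("")

print(whole, any_of, from_pos, sliced, empty)
```

The pandas accessor `Series.str.startswith` used earlier applies the same prefix test to every element of a column.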
# Select all rows and the first six columns
euro.iloc[:, :6]
# A related slice, euro.iloc[:, :-3], selects every column except the last three
-----------------------------
Result:
                   Team  Goals  Shots on target  Shots off target  Shooting Accuracy  % Goals-to-shots
 0              Croatia      4               13                12              51.9%             16.0%
 1       Czech Republic      4               13                18              41.9%             12.9%
 2              Denmark      4               10                10              50.0%             20.0%
 3              England      5               11                18              50.0%             17.2%
 4               France      3               22                24              37.9%              6.5%
 5              Germany     10               32                32              47.8%             15.6%
 6               Greece      5                8                18              30.7%             19.2%
 7                Italy      6               34                45              43.0%              7.5%
 8          Netherlands      2               12                36              25.0%              4.1%
 9               Poland      2               15                23              39.4%              5.2%
10             Portugal      6               22                42              34.3%              9.3%
11  Republic of Ireland      1                7                12              36.8%              5.2%
12               Russia      5                9                31              22.5%             12.5%
13                Spain     12               42                33              55.9%             16.0%
14               Sweden      5               17                19              47.2%             13.8%
15              Ukraine      2                7                26              21.2%              6.0%
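Note:
loc stands for "location"; the i in iloc stands for "integer". The difference between the two:
loc works on labels in the index. iloc works on the positions in the index (so it only takes integer positions). In other words, loc selects by index label and column name.
The table above has an index, so loc looks up rows and columns by the labels in that index.
iloc selects by row and column number, that is, by position.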
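The difference is easiest to see when the index holds labels rather than the default 0, 1, 2. A minimal sketch, assuming a small frame indexed by team name with values copied from the tables above (the names `shots`, `by_label`, `by_position`, `loc_slice`, and `iloc_slice` are illustrative):

```python
import pandas as pd

# Three teams, indexed by name instead of by the default integer index.
shots = pd.DataFrame(
    {"Goals": [5, 6, 5], "Shots on target": [11, 34, 9]},
    index=["England", "Italy", "Russia"],
)

by_label = shots.loc["Italy", "Goals"]   # row label and column name
by_position = shots.iloc[1, 0]           # second row, first column

# Slices behave differently: loc includes the end label, iloc excludes the end position.
loc_slice = shots.loc["England":"Italy"]   # two rows
iloc_slice = shots.iloc[0:1]               # one row

print(by_label, by_position, len(loc_slice), len(iloc_slice))
```

With the default integer index used by `euro`, labels and positions coincide, which is why the two accessors can look interchangeable in this dataset.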
To select rows by the values in a column, use the isin() function:
# Find the shooting accuracy (Shooting Accuracy) of England, Italy and Russia
euro.loc[euro['Team'].isin(['England', 'Italy', 'Russia']), ['Team', 'Shooting Accuracy']]
-----------------------------
Result:
       Team  Shooting Accuracy
 3  England              50.0%
 7    Italy              43.0%
12   Russia              22.5%
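The mask produced by `isin` can also be inverted with `~` to select everything outside the list. The sketch below is an illustration on four hand-typed rows from the tables above; the names `accuracy`, `wanted`, `kept`, and `others` are assumptions.

```python
import pandas as pd

# Four teams with their shooting accuracy, copied from the tables above.
accuracy = pd.DataFrame({
    "Team": ["England", "Italy", "Russia", "Spain"],
    "Shooting Accuracy": ["50.0%", "43.0%", "22.5%", "55.9%"],
})

wanted = ["England", "Italy", "Russia"]
mask = accuracy["Team"].isin(wanted)   # True where Team is in the list

kept = accuracy.loc[mask, ["Team", "Shooting Accuracy"]]
others = accuracy.loc[~mask, "Team"]   # ~ flips the mask: teams not in the list

print(kept)
print(list(others))
```

Passing the mask as the row selector and a list of names as the column selector keeps the filter and the column choice in a single `loc` call.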
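Summary:
This article covered slicing data with loc / iloc, sorting with sort_values, and filtering with str.startswith and isin.
Together they make it easy to select exactly the data you need with little code.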
Reference (GitHub): https://github.com/guipsamora/pandas_exercises
Dataset: leave a comment if you need it.
Parting words: build up steadily, and the results will follow.