pandas Beginner Exercises


Created: 2018-02-18 11:20


pandas is a data analysis package built on top of NumPy that provides higher-level data structures and tools.

pandas import conventions

from pandas import Series, DataFrame
import pandas as pd

pandas data structures

Series

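A Series is a one-dimensional array-like object made up of a sequence of data (of any NumPy dtype) and an associated sequence of data labels, called its index. A simple Series can be created from the data alone.

Creating a Series from a one-dimensional array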

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

arr = np.array([1, 2, 3, 4])
series = Series(arr)
# 0    1
# 1    2
# 2    3
# 3    4
# dtype: int64

series2 = Series([4, 3, 2, 1])
# 0    4
# 1    3
# 2    2
# 3    1
# dtype: int64

# As shown above, the first column is the index and the last line is the dtype of the elements

series.index
# RangeIndex(start=0, stop=4, step=1)
series.values
# array([1, 2, 3, 4])
series.dtype
# dtype('int64')

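# When a Series is created from an array without an explicit index, pandas creates an integer index from 0 to N-1 (N being the length of the data); the default index can be replaced by assignment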

series3 = Series([45, 66, 88, 99], index=['Chinese', 'Math', 'English', 'PE'])
# Chinese    45
# Math       66
# English    88
# PE         99
# dtype: int64

series3.index
# Index(['Chinese', 'Math', 'English', 'PE'], dtype='object')

series3.values
# array([45, 66, 88, 99])
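The comment above mentions that the default index can be replaced by assignment. A minimal sketch of that step, using a hypothetical `grades` Series and made-up labels that are not part of the original examples:

```python
import pandas as pd

# Start with the default integer index 0..N-1
grades = pd.Series([45, 66, 88, 99])

# Assigning a list of the same length replaces the labels in place
grades.index = ['a', 'b', 'c', 'd']

# Values are now looked up by the new labels
first = grades['a']
print(grades.index.tolist())
```

The assigned list must have the same length as the Series; otherwise pandas raises a ValueError.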

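Creating a Series from a dictionary

A Series can be thought of as a fixed-length, ordered dictionary that maps index labels to data values, so it can be created directly from a dict.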

d = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
series4 = Series(d)
series4.index
# Index(['A', 'B', 'C', 'D'], dtype='object')
series4.values
# array([1, 2, 3, 4])
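The dictionary analogy also holds for lookups and membership tests. The following sketch uses a hypothetical `stock` Series (the name and the assigned value 40 are illustrative only) to show a few dict-like operations:

```python
import pandas as pd

stock = pd.Series({'A': 1, 'B': 2, 'C': 3, 'D': 4})

# Membership tests check the index labels, like dict keys
has_b = 'B' in stock
has_z = 'Z' in stock

# Look up a value by its label
value_c = stock['C']

# Assign a new value to an existing label
stock['D'] = 40

# Convert back to a plain dict
as_dict = stock.to_dict()
print(as_dict)
```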

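Applying NumPy array operations to a Series

NumPy array operations (boolean filtering, scalar arithmetic, universal functions) also work on a Series, and the mapping between index labels and values is preserved through them.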

d = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
series = Series(d)

series[series > 2]
# C    3
# D    4
# dtype: int64

series / 2
# A    0.5
# B    1.0
# C    1.5
# D    2.0
# dtype: float64

np.power(series, 2)
# A     1
# B     4
# C     9
# D    16
# dtype: int64

Detecting missing values in a Series

pandas uses NaN to represent a missing value.
The isnull and notnull functions in pandas detect missing values in a Series.
Both isnull and notnull return a boolean Series.

scores = Series({"kobe": 92, "wade": 90, "ai": 90, "jordan": 100})
new_index = ['kobe', 'wade', "james", "ai", 'jordan']
scores = Series(scores, index=new_index)
# kobe       92.0
# wade       90.0
# james       NaN
# ai         90.0
# jordan    100.0
# dtype: float64

pd.isnull(scores)
# kobe      False
# wade      False
# james      True
# ai        False
# jordan    False
# dtype: bool

pd.notnull(scores)
# kobe       True
# wade       True
# james     False
# ai         True
# jordan     True
# dtype: bool

# Select the entries that are missing
scores[pd.isnull(scores)]
# james   NaN
# dtype: float64

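Automatic alignment of Series

Arithmetic between different Series automatically aligns the data by index label; a label present in only one operand produces NaN in the result.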

product_num = Series([23, 45, 67, 89], index=['p1', 'p3', 'p4', 'p2'])
product_price_table = Series([1, 2, 3, 4, 5], index=['p2', 'p5', 'p3', 'p1', 'p4'])
product_sum = product_num * product_price_table
# p1     92.0
# p2     89.0
# p3    135.0
# p4    335.0
# p5      NaN
# dtype: float64
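When a NaN for unmatched labels is not wanted, the method form of the operator accepts a `fill_value`. The sketch below assumes that a product missing from the quantity table should count as zero units, which is an illustrative choice rather than something the original states:

```python
import pandas as pd

product_num = pd.Series([23, 45, 67, 89], index=['p1', 'p3', 'p4', 'p2'])
price = pd.Series([1, 2, 3, 4, 5], index=['p2', 'p5', 'p3', 'p1', 'p4'])

# mul aligns on labels just like `*`; fill_value=0 stands in for a label
# that is missing on one side (here p5 has a price but no quantity)
totals = product_num.mul(price, fill_value=0)

# Alternatively, keep the NaN-producing `*` and drop the unmatched labels
matched_only = (product_num * price).dropna()
print(totals)
```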

The name attribute of a Series and its index

Both the Series object and its index have a name attribute, which can be assigned.

product_num = Series([23, 45, 67, 89], index=['p1', 'p3', 'p4', 'p2'])
product_num.name = 'ProductNums'
product_num.index.name = 'ProductType'
# ProductType
# p1    23
# p3    45
# p4    67
# p2    89
# Name: ProductNums, dtype: int64

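DataFrame

A DataFrame is a tabular data structure with an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and can be thought of as a dictionary of Series.

Creating a DataFrame from a two-dimensional array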

df1 = DataFrame([['kobe', 'ai', 'wade'], [81, 58, 55]])
#       0   1     2    <- column index
# 0  kobe  ai  wade
# 1    81  58    55
# ^ row index

df2 = DataFrame([['kobe', 68], ['ai', 58], ['wade', 55]])
#       0   1
# 0  kobe  68
# 1    ai  58
# 2  wade  55

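# Custom row index (index) and column index (columns)
# Note: a NumPy array has a single dtype, so the numbers here are stored as strings
arr = np.array([
    ['kobe', 68],
    ['ai', 58],
    ['wade', 55]
])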
df3 = DataFrame(arr, index=['one', 'two', 'three'], columns=['name', 'point'])
#        name point
# one    kobe    68
# two      ai    58
# three  wade    55

Creating a DataFrame from a dictionary

data = {
    "apart": ['101', '102', '103', '104'],
    'profits': [567, 789, 456, 678],
    'year':[2016, 2017, 2018, 2017]
}
df = DataFrame(data)
#   apart  profits  year
# 0   101      567  2016
# 1   102      789  2017
# 2   103      456  2018
# 3   104      678  2017

df.index
# RangeIndex(start=0, stop=4, step=1)

df.columns
# Index(['apart', 'profits', 'year'], dtype='object')

df.values
# array([['101', 567, 2016],
#        ['102', 789, 2017],
#        ['103', 456, 2018],
#        ['104', 678, 2017]], dtype=object)

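Index objects

Both Series and DataFrame objects have index objects.
An index object holds the axis labels and other metadata (such as the axis name).
Through the index you can read values from a Series or DataFrame, or assign a new value at a given position.
The automatic alignment of Series and DataFrame objects is driven by the index.

Selecting values from a DataFrame by index

A column can be selected directly by its column label.
Selecting a row by its row label requires the loc indexer.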

data = {
    "apart": ['101', '102'],
    'profits': [567, 789],
    'year':[2016, 2017]
}
df = DataFrame(data)
#   apart  profits  year
# 0   101      567  2016
# 1   102      789  2017

df['year']
# 0    2016
# 1    2017
# Name: year, dtype: int64

df.loc[0]
# apart       101
# profits     567
# year       2016
# Name: 0, dtype: object
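The index can also be used to assign a value at a given position, and rows can be addressed by integer position as well as by label. A brief sketch on the same data; the replacement value 600 is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'apart': ['101', '102'],
    'profits': [567, 789],
    'year': [2016, 2017],
})

# Set a single cell by row label and column label
df.loc[0, 'profits'] = 600

# Select a row by integer position rather than by label
row = df.iloc[1]

# A list of column labels selects several columns as a DataFrame
cols = df[['apart', 'year']]
print(df)
```

With the default integer index, `loc` and `iloc` happen to address the same rows; they differ as soon as the row labels are not 0..N-1.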

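pandas basic functionality

Common mathematical and statistical methods

Method	Description
count	Number of non-NA values
describe	Summary statistics for a Series or for each DataFrame column
min/max	Minimum and maximum values
argmin/argmax	Integer positions of the minimum and maximum values (Series only)
idxmin/idxmax	Index labels of the minimum and maximum values
quantile	Sample quantile (from 0 to 1)
sum	Sum of the values
mean	Mean of the values
median	Median of the values (50% quantile)
mad	Mean absolute deviation from the mean (removed in pandas 2.0)
var	Sample variance
std	Sample standard deviation
cumsum	Cumulative sum
cummin/cummax	Cumulative minimum and maximum
cumprod	Cumulative product
pct_change	Percentage change

For a DataFrame, these statistical methods operate on each column by default; pass axis=1 to apply them across each row instead.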

df = DataFrame([
    [0, 1, 2, 3],
    [4, 5, 6, 7]
])
df.describe()
#               0         1         2         3
# count  2.000000  2.000000  2.000000  2.000000
# mean   2.000000  3.000000  4.000000  5.000000
# std    2.828427  2.828427  2.828427  2.828427
# min    0.000000  1.000000  2.000000  3.000000
# 25%    1.000000  2.000000  3.000000  4.000000
# 50%    2.000000  3.000000  4.000000  5.000000
# 75%    3.000000  4.000000  5.000000  6.000000
# max    4.000000  5.000000  6.000000  7.000000

df.count()
# 0    2
# 1    2
# 2    2
# 3    2
# dtype: int64

df.count(axis=1)
# 0    4
# 1    4
# dtype: int64
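The effect of the axis argument is easiest to see on a few other methods from the table. A short sketch on the same 2x4 frame:

```python
import pandas as pd

df = pd.DataFrame([
    [0, 1, 2, 3],
    [4, 5, 6, 7],
])

col_sums = df.sum()            # one value per column: 4, 6, 8, 10
row_sums = df.sum(axis=1)      # one value per row: 6, 22
row_means = df.mean(axis=1)    # 1.5, 5.5
running = df.cumsum()          # cumulative sum down each column
largest_row = df[0].idxmax()   # row label holding the maximum of column 0
print(row_sums.tolist(), row_means.tolist())
```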

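The unique method returns an array of the unique values in a Series.
The value_counts method counts how many times each value occurs in a Series.
The isin method tests vectorized set membership and can be used to select a subset of a Series or of a DataFrame column.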

series = Series(["a", "b", "c", "a", "a", "b", "c"])
series.unique()
# array(['a', 'b', 'c'], dtype=object)

df = DataFrame({
    'id': [1, 2, 3, 4],
    'point': [55, 66, 77, 88],
    'team_id': [1, 2, 1, 1]
})
df['team_id'].unique()
# array([1, 2])


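# By default the result is sorted by count in descending order (the order of tied values can differ between pandas versions)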
series.value_counts()
# a    3
# c    2
# b    2
# dtype: int64

series.value_counts(ascending=True)
# b    2
# c    2
# a    3
# dtype: int64


mask = series.isin(['b', 'c'])
# 0    False
# 1     True
# 2     True
# 3    False
# 4    False
# 5     True
# 6     True
# dtype: bool

# Select the items whose value is 'b' or 'c'
series[mask]
# 1    b
# 2    c
# 5    b
# 6    c
# dtype: object
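The text also mentions using isin to select a subset of a DataFrame based on one of its columns. A minimal sketch using the id/point/team_id frame from earlier in this section:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'point': [55, 66, 77, 88],
    'team_id': [1, 2, 1, 1],
})

# The boolean mask comes from one column and filters whole rows
team1 = df[df['team_id'].isin([1])]

# ~ inverts the mask to select everything else
others = df[~df['team_id'].isin([1])]
print(team1)
```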

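Handling missing data

Method	Description
dropna	Filters (drops) axis labels depending on whether their values contain missing data; a threshold controls how much missing data is tolerated
fillna	Fills missing data with a given value or with a fill method (such as ffill or bfill)
isnull	Returns an object of boolean values indicating which values are missing (NA)
notnull	The negation of isnull

## Detecting missing values
df = DataFrame({
    'id': [1, 2, np.nan],
    'point': [55, np.nan, 88],
    'team_id': [1, 2, np.nan]
})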
df.isnull()
#       id  point  team_id
# 0  False  False    False
# 1  False   True    False
# 2   True  False     True


## Filtering out missing data
series = Series([1, 2, np.nan, 4])
series.dropna()
# 0    1.0
# 1    2.0
# 3    4.0
# dtype: float64

df = DataFrame([
    [1, 2],
    [np.nan, np.nan],
    [4, np.nan]
])
# By default, drop every row that contains a missing value
df.dropna()
#      0    1
# 0  1.0  2.0
# Drop only the rows in which every value is missing
df.dropna(how='all')
#      0    1
# 0  1.0  2.0
# 2  4.0  NaN


## Filling missing data
df.fillna(0)
#      0    1
# 0  1.0  2.0
# 1  0.0  0.0
# 2  4.0  0.0
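The table above mentions a threshold for dropna and fill methods such as ffill, neither of which appears in the examples. The following sketch shows one way each might be used on the same frame; the choice of `thresh=1` and of filling with column means is illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1, 2],
    [np.nan, np.nan],
    [4, np.nan],
])

# thresh=1 keeps rows that have at least one non-missing value
kept = df.dropna(thresh=1)

# Forward fill: each missing value takes the last valid value above it
filled = df.ffill()

# Fill each column with its own mean (NaN is skipped when computing the mean)
mean_filled = df.fillna(df.mean())
print(filled)
```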

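Hierarchical indexing

A hierarchical index has multiple (two or more) index levels on one axis.
With hierarchical indexing, pandas can handle higher-dimensional data in a lower-dimensional form.
With hierarchical indexing, data can be aggregated by level.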

series = Series([40, 20, 24, 32], index=[['West', 'West', 'East', 'East'], ['Westbrook', 'kuzma', 'Brown', 'James']])
# West  Westbrook    40
#       kuzma        20
# East  Brown        24
#       James        32
# dtype: int64

series.index.names = ['Region', 'Name']
series
# Region  Name
# West    Westbrook    40
#         kuzma        20
# East    Brown        24
#         James        32
# dtype: int64

df = DataFrame({
    'Date': ['03', '03', '04', '04'],
    'Team': ['Lakers', 'Thunder', 'Lakers', 'Thunder'],
    'PTS': [115, 123, 92, 112],
    'REB': [42, 36, 52, 32]
})
# Columns keep the insertion order of the dict keys
#   Date     Team  PTS  REB
# 0   03   Lakers  115   42
# 1   03  Thunder  123   36
# 2   04   Lakers   92   52
# 3   04  Thunder  112   32

df2 = df.set_index(['Date', 'Team'])
#               PTS  REB
# Date Team             
# 03   Lakers   115   42
#      Thunder  123   36
# 04   Lakers    92   52
#      Thunder  112   32

df2.index
# MultiIndex([('03',  'Lakers'),
#             ('03', 'Thunder'),
#             ('04',  'Lakers'),
#             ('04', 'Thunder')],
#            names=['Date', 'Team'])

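# Aggregate by level (the level argument of sum was removed in pandas 2.0; group by the index level instead)
df2.groupby(level='Date').sum()
#       PTS  REB
# Date          
# 03    238   78
# 04    204   84

df2.groupby(level='Team').sum()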
#          PTS  REB
# Team             
# Lakers   207   94
# Thunder  235   68
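A hierarchical index also supports selecting by the outer level and pivoting an inner level into columns, which is one way to see two-dimensional data stored in a one-dimensional Series. A minimal sketch on the same game data:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['03', '03', '04', '04'],
    'Team': ['Lakers', 'Thunder', 'Lakers', 'Thunder'],
    'PTS': [115, 123, 92, 112],
    'REB': [42, 36, 52, 32],
})
df2 = df.set_index(['Date', 'Team'])

# Selecting an outer-level label returns the rows under it, indexed by Team
day3 = df2.loc['03']

# unstack moves the inner level (Team) into the columns:
# rows are dates, columns are teams, values are points
wide = df2['PTS'].unstack()
print(wide)
```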