Pandas Basics (Series, DataFrame)

Sharon Peng

20 min read · May 21, 2021

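I have been wanting to read some books on NLP lately, but when I looked at the code I found a lot that I could not follow, so I spent a few days cramming on pandas.

Pandas has two main data structures: Series and DataFrame.

Series

A Series is an array-like, one-dimensional structure.

Outline:

1. Creating a Series
2. Inspecting the data in a Series
3. Selecting elements
4. Assigning values to the elements
5. Filtering values
6. Evaluating values
7. Working with NaN (Not a Number)
8. Converting dictionaries to Series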

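1. Creating a Series

All examples assume the usual imports:

import pandas as pd
import numpy as np

ser = pd.Series([1, 2, 3])

The default index is 0, 1, 2, …, but you can define your own:

ser = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(ser)

Output:
        a    1
        b    2
        c    3
        dtype: int64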

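2. Inspecting the data in a Series

View the values: .values

print(ser.values)  # continuing from the example above

Output:  [1 2 3]

View the index: .index

print(ser.index)  # continuing from the example above

Output:  Index(['a', 'b', 'c'], dtype='object')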

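3. Selecting Elements

Elements are selected much as in an ordinary list.

(1) By position: ser.iloc[i]

print(ser.iloc[2])  # continuing from the example above

Output:  3

Older pandas versions also accept ser[2] as a positional lookup on a Series with string labels, but recent versions warn about that form, so .iloc is the safer spelling.

(2) By index label: ser['a']

print(ser['a'])  # continuing from the example above

Output:  1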
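To make the difference between position-based and label-based selection explicit, here is a minimal sketch using the same three-element Series. The variable names (by_position, by_label, several) are only illustrative.

```python
import pandas as pd

ser = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

by_position = ser.iloc[2]        # third element, regardless of the labels
by_label = ser.loc['a']          # element whose label is 'a'
several = ser.loc[['a', 'c']]    # a list of labels returns a smaller Series

print(by_position, by_label)
print(several)
```

Using .iloc for positions and .loc for labels keeps the intent unambiguous even when the index itself contains integers.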

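4. Assigning Values to the Elements

Assign by label with ser[label] = value, or by position with ser.iloc[i] = value.

ser = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
ser.iloc[0] = 999  # by position
print(ser)

Output:
        a    999
        b      2
        c      3
        dtype: int64

Either approach works; the label-based form gives the same result:

ser = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
ser['a'] = 999  # by label
print(ser)

Output:
        a    999
        b      2
        c      3
        dtype: int64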

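A Series can also be built from a NumPy array:

arr = np.array([1, 2, 3, 4])
s = pd.Series(arr)
print(s)

Output:
        0    1
        1    2
        2    3
        3    4
        dtype: int32

(The dtype follows the NumPy array; on other platforms it may be int64.)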

5. Filtering Values

s[s > 2] returns only the elements of s that are greater than 2.

arr = np.array([1, 2, 3, 4])
s = pd.Series(arr, index=['a', 'b', 'c', 'd'])
print(s[s > 2])

Output:
        c    3
        d    4
        dtype: int32
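The filter works because the comparison itself produces a boolean Series that is then used as a mask. The sketch below shows the mask on its own and one way to combine two conditions; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([1, 2, 3, 4]), index=['a', 'b', 'c', 'd'])

mask = s > 2                       # boolean Series, one True/False per element
print(mask)

# Combine conditions with & (and) or | (or); each comparison needs its own parentheses.
between = s[(s > 1) & (s < 4)]
print(between)
```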

6. Evaluating Values

(1). unique()

If some values are repeated, ser.unique() returns each distinct value once.

ser = pd.Series([1, 0, 2, 1, 2, 3], index=['white', 'white', 'blue', 'green', 'green', 'yellow'])
print(ser)
print(ser.unique())

Original Series:
        white     1
        white     0
        blue      2
        green     1
        green     2
        yellow    3
        dtype: int64

Output:  [1 0 2 3]

(2). value_counts()

Like unique(), but it also counts how many times each value occurs.

ser = pd.Series([1, 0, 2, 1, 2, 3], index=['white', 'white', 'blue', 'green', 'green', 'yellow'])
print(ser.value_counts())

Output:
        1    2    # 1 occurs twice
        2    2    # 2 occurs twice
        0    1
        3    1
        dtype: int64

(3). isin()

Returns booleans indicating whether each value is contained in the given list.

ser = pd.Series([1, 0, 2, 1, 2, 3], index=['white', 'white', 'blue', 'green', 'green', 'yellow'])
print(ser.isin([0, 3]))

Output:
        white     False
        white      True
        blue      False
        green     False
        green     False
        yellow     True
        dtype: bool

It can also be used as a filter:

print(ser[ser.isin([0, 3])])

Output:
        white     0
        yellow    3
        dtype: int64

7. Working with NaN (Not a Number)

(1). isnull()

Checks whether each value is NaN.

ser = pd.Series([5, -3, np.nan, 14])
print(ser.isnull())

Output:
        0    False
        1    False
        2     True
        3    False
        dtype: bool

(2). notnull()

Checks whether each value is not NaN.

ser = pd.Series([5, -3, np.nan, 14])
print(ser.notnull())

Output:
        0     True
        1     True
        2    False
        3     True
        dtype: bool

# Print the non-NaN values.
print(ser[ser.notnull()])  # used as a filter condition, keeps only the values that are not NaN
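As a small follow-up to the two checks above, the sketch below counts the missing entries and then keeps only the present ones. The names missing_count and present are illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series([5, -3, np.nan, 14])

# True counts as 1 when summed, so this counts the NaN entries.
missing_count = ser.isnull().sum()

# notnull() used as a mask keeps only the non-NaN values (with their original labels).
present = ser[ser.notnull()]

print(missing_count)
print(present)
```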

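8. Converting Dictionaries to Series

Another way to think of a Series is as a dictionary-like object.

The main difference is that a Series built from a list gets an automatic index (0, 1, 2, 3, …), whereas a Series built from a dict takes its index directly from the dict's keys.

mydict = {'red': 2000, 'blue': 1000, 'yellow': 500, 'orange': 1000}
myseries = pd.Series(mydict)
print(myseries)

Output: (the dict keys become the index)
        red       2000
        blue      1000
        yellow     500
        orange    1000
        dtype: int64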
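A dict can also be combined with an explicit index. In the sketch below, which is an illustration rather than part of the original example, 'green' is a hypothetical label that does not exist in the dict, so pandas fills it with NaN.

```python
import pandas as pd

mydict = {'red': 2000, 'blue': 1000, 'yellow': 500, 'orange': 1000}

# 'green' is a hypothetical extra label that is not a key of mydict.
colors = ['red', 'yellow', 'orange', 'blue', 'green']
myseries = pd.Series(mydict, index=colors)

print(myseries)
```

Because NaN is a floating-point value, the resulting Series has a float dtype even though the dict held integers.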

DataFrame

A DataFrame extends the Series to more dimensions, since a Series is only one-dimensional.

Outline:

1. Defining a DataFrame
2. Selecting elements
3. Assigning values
4. Deleting a column
5. Filtering
6. Nested dicts
7. Transposing a DataFrame
8. Indexing
9. Duplicate data
10. Reindexing
11. Dropping
12. Sorting
13. Ranking

1. Defining a DataFrame

The simplest way to define a DataFrame is to convert a dict object directly.

Note that a DataFrame is organized as a collection of columns.
Unlike a Series, a DataFrame has two index arrays. The first labels the rows. The second holds the labels shown at the top of each column.

data = {'color': ['blue', 'green', 'yellow', 'red', 'white'],
        'object': ['ball', 'pen', 'pencil', 'paper', 'mug'],
        'price': [1.2, 1.0, 0.6, 0.9, 1.7]}  # each key of the dict becomes one column
frame = pd.DataFrame(data)
print(frame)

Output:
            color  object  price  <- columns
        0    blue    ball    1.2
        1   green     pen    1.0
        2  yellow  pencil    0.6
        3     red   paper    0.9
        4   white     mug    1.7

The index and the columns can also be changed freely, as long as their number stays the same.

(1) Keep only the object and price columns

frame = pd.DataFrame(data, columns=['object', 'price'])
print(frame)

Output:
           object  price
        0    ball    1.2
        1     pen    1.0
        2  pencil    0.6
        3   paper    0.9
        4     mug    1.7

(2) Change the index

frame = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])

(3) Build a DataFrame quickly with np.arange

frame3 = pd.DataFrame(np.arange(16).reshape((4, 4)),
         index=['red', 'blue', 'yellow', 'white'],
         columns=['ball', 'pen', 'pencil', 'paper'])
print(frame3)

Output:
                 ball  pen  pencil  paper
        red        0    1       2      3
        blue       4    5       6      7
        yellow     8    9      10     11
        white     12   13      14     15

2. Selecting Elements

Show all columns

print(frame3.columns)

Output:  Index(['ball', 'pen', 'pencil', 'paper'], dtype='object')

Select a column

print(frame3['ball'])
print(frame3.ball)  # attribute access also works

Output:
        red        0
        blue       4
        yellow     8
        white     12
        Name: ball, dtype: int32

Show the index

print(frame3.index)

Output:   Index(['red', 'blue', 'yellow', 'white'], dtype='object')

Show all values

print(frame3.values)

Output:
        [[ 0  1  2  3]
         [ 4  5  6  7]
         [ 8  9 10 11]
         [12 13 14 15]]

Select specific rows

Use loc to pick the rows you want by label.

print(frame3.loc['red'])
print(frame3.loc[['red', 'blue']])  # the rows labelled red and blue

Output:
        ball      0
        pen       1
        pencil    2
        paper     3
        Name: red, dtype: int32

Output:
               ball  pen  pencil  paper
        red      0    1       2      3
        blue     4    5       6      7

Select rows by their implicit integer positions

frame3[0:3]  # rows 0, 1, 2 (3 is excluded)

Output:
                ball  pen  pencil  paper
       red        0    1       2      3
       blue       4    5       6      7
       yellow     8    9      10     11

Select a single value, format: frame[column][row label]

print(frame3['ball']['yellow'])

Output: 8
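For single values and sub-tables, .loc and .iloc take the row and the column in one step. The sketch below rebuilds frame3 so it runs on its own; the names one_value, same_value and block are illustrative.

```python
import numpy as np
import pandas as pd

frame3 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])

one_value = frame3.loc['yellow', 'ball']   # row label first, then column label
same_value = frame3.iloc[2, 0]             # row position first, then column position

# Lists of labels select a sub-table.
block = frame3.loc[['red', 'blue'], ['pen', 'paper']]

print(one_value, same_value)
print(block)
```

Note the order: frame3['ball']['yellow'] goes column first, while .loc and .iloc go row first.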

3. Assigning Values

Name the index and the columns:

frame3.index.name = 'row'
frame3.columns.name = 'column'
print(frame3)

Output:
       column  ball  pen  pencil  paper
       row                             
       red        0    1       2      3
       blue       4    5       6      7
       yellow     8    9      10     11
       white     12   13      14     15

Change the values in a column:

frame3['ball'] = 999  # set every row to the same value
print(frame3)
frame3['ball'] = [9, 8, 7, 6]  # set each row individually
print(frame3)

Output (index and column names omitted):
               ball  pen  pencil  paper
       red      999    1       2      3
       blue     999    5       6      7
       yellow   999    9      10     11
       white    999   13      14     15

               ball  pen  pencil  paper
       red        9    1       2      3
       blue       8    5       6      7
       yellow     7    9      10     11
       white      6   13      14     15

Quickly add a new column with values (starting again from the original frame3):

# Add a column named 'new' whose values 0 1 2 3 come from np.arange
frame3['new'] = np.arange(4)
print(frame3)

Output:
               ball  pen  pencil  paper  new
      red        0    1       2      3    0
      blue       4    5       6      7    1
      yellow     8    9      10     11    2
      white     12   13      14     15    3

4. Deleting a Column

del frame3['new']

5. Filtering

frame3[frame3 < 5]  # keep only the values less than 5; everything else becomes NaN

Output:
               ball  pen  pencil  paper
       red      0.0  1.0     2.0    3.0
       blue     4.0  NaN     NaN    NaN
       yellow   NaN  NaN     NaN    NaN
       white    NaN  NaN     NaN    NaN
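Filtering a whole DataFrame masks individual cells, whereas a condition on one column selects entire rows. The sketch below contrasts the two on the same 4×4 frame; the names masked and rows are illustrative.

```python
import numpy as np
import pandas as pd

frame3 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])

# Cell-wise mask: the shape stays 4x4, cells failing the test become NaN.
masked = frame3[frame3 < 5]

# Row filter: keep only the rows whose 'ball' value is greater than 4.
rows = frame3[frame3['ball'] > 4]

print(masked)
print(rows)
```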

6. Nested Dicts

The outer keys become the columns and the inner keys are aligned as the index. Where a column has no value for a given index, NaN is shown.

nestdict = {'red': {2012: 22, 2013: 33},
            'white': {2011: 13, 2012: 22, 2013: 16},
            'blue': {2011: 17, 2012: 27, 2013: 18}}
frame2 = pd.DataFrame(nestdict)
print(frame2)

Output:
              red   white  blue
       2012  22.0     22    27
       2013  33.0     16    18
       2011   NaN     13    17

7. Transposing a DataFrame

Very simple: just add .T.

print(frame2.T)

Output:
              2012  2013  2011
       red    22.0  33.0   NaN
       white  22.0  16.0  13.0
       blue   27.0  18.0  17.0

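8. Indexing

The functions idxmin() and idxmax() return the index label of the smallest and the largest value respectively, not the value itself. They are meant for numeric values and may not work when the values are strings.

ser = pd.Series([5, 0, 3, 8, 4], index=['red', 'blue', 'yellow', 'white', 'green'])
print(ser.idxmin())  # -> blue
print(ser.idxmax())  # -> white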
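To see that idxmin and idxmax give back labels rather than values, the sketch below looks up the value behind each returned label and compares it with min() and max(). The variable names are illustrative.

```python
import pandas as pd

ser = pd.Series([5, 0, 3, 8, 4], index=['red', 'blue', 'yellow', 'white', 'green'])

smallest_label = ser.idxmin()      # label of the smallest value
largest_label = ser.idxmax()       # label of the largest value

# Looking the labels up again yields the same numbers as min() and max().
print(smallest_label, ser[smallest_label], ser.min())
print(largest_label, ser[largest_label], ser.max())
```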

9. Duplicate Data

With a small dataset duplicates do not matter much, but once the data grows they become hard to spot.

Check whether the index labels are unique: ser.index.is_unique

ser = pd.Series(range(6), index=['white', 'white', 'blue', 'green',
'green', 'yellow'])
print(ser.index.is_unique)

Output: False

If some index labels are duplicated, ser['label'] returns every entry that shares that label.

ser = pd.Series(range(6), index=['white', 'white', 'blue', 'green',
'green', 'yellow'])
print(ser)
print(ser['white'])

Output:
       white     0
       white     1
       blue      2
       green     3
       green     4
       yellow    5
       dtype: int64

Output:
       white    0
       white    1
       dtype: int64
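Once duplicates are detected, one possible next step is to locate them. The sketch below uses Index.duplicated(), which marks the second and later occurrences of each label; keeping only the first occurrence is just one of several reasonable policies and is an assumption of this example.

```python
import pandas as pd

ser = pd.Series(range(6), index=['white', 'white', 'blue', 'green', 'green', 'yellow'])

# True for every label that has already appeared earlier in the index.
dup_mask = ser.index.duplicated()
print(dup_mask)

# Keep only the first entry for each label.
first_only = ser[~dup_mask]
print(first_only)
print(first_only.index.is_unique)
```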

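10. Reindexing

ser.reindex(['three', 'four', 'five', 'one'])

ser = pd.Series([2, 5, 7, 4], index=['one', 'two', 'three', 'four'])
print(ser)

Output:
       one      2
       two      5
       three    7
       four     4
       dtype: int64

reindex rearranges the entries to follow the new labels. A label that did not exist in the original gets NaN.

print(ser.reindex(['three', 'four', 'five', 'one']))

Output:
       three    7.0
       four     4.0
       five     NaN
       one      2.0
       dtype: float64

reindex(..., method='ffill')

Forward fill: each missing label takes the value of the preceding existing label. Because the new index is range(6), the result only covers the labels 0 to 5.

ser3 = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])
print(ser3)
print(ser3.reindex(range(6), method='ffill'))

Output:
       0    1
       3    5
       5    6
       6    3
       dtype: int64

Output:
       0    1
       1    1    # filled forward from label 0 (value 1)
       2    1    # filled forward from label 0 (value 1)
       3    5
       4    5    # filled forward from label 3 (value 5)
       5    6
       dtype: int64

reindex(..., method='bfill')

Backward fill: each missing label takes the value of the next existing label.

print(ser3.reindex(range(6), method='bfill'))

Output:
       0    1
       1    5    # filled backward from label 3 (value 5)
       2    5    # filled backward from label 3 (value 5)
       3    5
       4    6    # filled backward from label 5 (value 6)
       5    6
       dtype: int64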
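The three reindexing behaviours can be compared side by side. The sketch below reuses ser3 from the text; the names plain, forward and backward are illustrative.

```python
import pandas as pd

ser3 = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])

plain = ser3.reindex(range(6))                      # missing labels 1, 2, 4 become NaN
forward = ser3.reindex(range(6), method='ffill')    # take the value of the previous existing label
backward = ser3.reindex(range(6), method='bfill')   # take the value of the next existing label

# Put the three results next to each other for comparison.
print(pd.DataFrame({'plain': plain, 'ffill': forward, 'bfill': backward}))
```

Label 6 from the original Series is dropped in all three results because it is not part of the new index.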

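Why is this feature useful?
Ans: It makes it easy to add columns or index labels. Where no value exists, pandas automatically fills in NaN.

Below, a column 'new' is added, and its values are filled according to method='ffill'.

data = {'color': ['blue', 'green', 'yellow', 'red', 'white'],
        'object': ['ball', 'pen', 'pencil', 'paper', 'mug'],
        'price': [1.2, 1.0, 0.6, 0.9, 1.7]}
frame = pd.DataFrame(data)
var = frame.reindex(range(5), method='ffill', columns=['colors', 'price', 'new', 'object'])
print(var)

Output:
       colors  price     new  object
    0    blue    1.2    blue    ball
    1   green    1.0   green     pen
    2  yellow    0.6  yellow  pencil
    3     red    0.9     red   paper
    4   white    1.7   white     mug

Note: 'new' comes right after 'price' in the requested column list, so by the ffill rule one might expect it to copy price. Why does it copy color instead?
Ans: The official documentation notes that method only applies to a "monotonically increasing/decreasing index".
It took me a long time to understand this sentence. It means the existing labels must be in increasing or decreasing order, and for strings that means alphabetical order.

a b c d e f g h i j k l m n o p q r s t u v w x y z

The existing columns sort as color, object, price. 'new' falls between 'color' and 'object', so under 'ffill' the preceding existing label is 'color', and 'new' takes its values from color. The same applies to 'colors', which is not an existing column and sorts right after 'color'.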
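The column behaviour described above can be checked with a small sketch. It reindexes only the columns and assumes, as in the text, that method='ffill' is applied along the column labels; 'new' is the column that does not exist yet.

```python
import pandas as pd

data = {'color': ['blue', 'green', 'yellow', 'red', 'white'],
        'object': ['ball', 'pen', 'pencil', 'paper', 'mug'],
        'price': [1.2, 1.0, 0.6, 0.9, 1.7]}
frame = pd.DataFrame(data)

# The existing column labels are in alphabetical (increasing) order,
# which is what method='ffill' requires.
print(frame.columns.is_monotonic_increasing)

# 'new' sorts between 'color' and 'object', so the nearest earlier label is 'color'.
var = frame.reindex(columns=['color', 'new', 'object', 'price'], method='ffill')
print(var)
```

The position of 'new' inside the requested list does not matter; only its alphabetical position relative to the existing labels does.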

11. Dropping

Another way to delete data.

Drop a single index label: ser.drop('label')

ser = pd.Series(np.arange(4.), index=['red', 'blue', 'yellow', 'white'])
print(ser.drop('yellow'))

Drop several index labels: ser.drop(['red', 'yellow'])

print(ser.drop(['yellow', 'blue']))

Drop columns: frame.drop(['pen', 'pencil'], axis=1)

axis=1 refers to columns,
axis=0 refers to rows.

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['red', 'blue', 'yellow', 'white'],
columns=['ball', 'pen', 'pencil', 'paper'])
print(frame)

Output:
                ball  pen  pencil  paper
        red        0    1       2      3
        blue       4    5       6      7
        yellow     8    9      10     11
        white     12   13      14     15

# Drop the columns pen and pencil
print(frame.drop(['pen', 'pencil'], axis=1))

Output:
                ball  paper
        red        0      3
        blue       4      7
        yellow     8     11
        white     12     15
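One detail worth seeing in code is that drop returns a new object and leaves the original untouched. The sketch below also shows the columns= keyword as an alternative to axis=1; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

trimmed = frame.drop(['pen', 'pencil'], axis=1)    # drop columns via axis=1
same = frame.drop(columns=['pen', 'pencil'])       # equivalent keyword form
fewer_rows = frame.drop(['red'], axis=0)           # drop a row by its label

print(trimmed)
print(fewer_rows)
print(list(frame.columns))   # the original frame still has all four columns
```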

12. Sorting

Sorting a Series: the default order is ascending.

ser.sort_index()
ser.sort_index(ascending=False)
ser.sort_values()

ser = pd.Series([5, 0, 3, 8, 4], index=['red', 'blue', 'yellow', 'white', 'green'])
print(ser.sort_index())
print(ser.sort_index(ascending=False))

Output:
         (ascending by index label)
        blue      0
        green     4
        red       5
        white     8
        yellow    3
        dtype: int64

         (descending by index label)
        yellow    3
        white     8
        red       5
        green     4
        blue      0
        dtype: int64

Sorting a DataFrame (by index or by column):

Sort by index: frame.sort_index()
Sort by column labels: frame.sort_index(axis=1)
Sort by values: frame.sort_values(by='label'), which sorts the rows according to the column 'label'.

** Note: for a DataFrame, sort_values needs the label to sort by inside the parentheses.

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['red', 'blue', 'yellow', 'white'],
columns=['ball', 'pen', 'pencil', 'paper'])
print(frame.sort_index())
print(frame.sort_index(axis=1))

Output (rows sorted by index label):
                ball  pen  pencil  paper
       blue       4    5       6      7
       red        0    1       2      3
       white     12   13      14     15
       yellow     8    9      10     11

Output (columns sorted by label):
                ball  paper  pen  pencil
       red        0      3    1       2
       blue       4      7    5       6
       yellow     8     11    9      10
       white     12     15   13      14

Original frame, for comparison:
                 ball  pen  pencil  paper
        red        0    1       2      3
        blue       4    5       6      7
        yellow     8    9      10     11
        white     12   13      14     15
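The value-based sorts mentioned above have no example yet, so here is a minimal sketch. The choice of the 'pen' column and of descending order is arbitrary and only for illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series([5, 0, 3, 8, 4], index=['red', 'blue', 'yellow', 'white', 'green'])
by_value = ser.sort_values()          # ascending by value; labels travel with their values
print(by_value)

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# For a DataFrame, 'by' names the column whose values decide the row order.
by_pen_desc = frame.sort_values(by='pen', ascending=False)
print(by_pen_desc)
```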

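13. Ranking

The difference between ranking and sorting is that ranking assigns each element a number.

An example is the quickest way to see it:

ser.rank()

The numbers on the right are the ranks pandas assigns, based on the values in ascending order: the smallest value gets rank 1.

ser = pd.Series([5, 0, 3, 8, 4],
index=['red', 'blue', 'yellow', 'white', 'green'])

print(ser.rank())

Output:
        red       4.0
        blue      1.0  # rank 1 (smallest value)
        yellow    2.0
        white     5.0
        green     3.0
        dtype: float64

ser.rank(ascending=False)

The reverse: the largest value gets rank 1 and the smallest value gets the last rank.

print(ser.rank(ascending=False))

Output:
       red       2.0
       blue      5.0  # rank 5 (smallest value)
       yellow    4.0
       white     1.0  # rank 1 (largest value)
       green     3.0
       dtype: float64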
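The ranks come out as floats because tied values share an averaged rank by default. The sketch below repeats the example and adds a small hypothetical Series with a tie (tied is not from the original post) to show this.

```python
import pandas as pd

ser = pd.Series([5, 0, 3, 8, 4], index=['red', 'blue', 'yellow', 'white', 'green'])
ranks = ser.rank()                 # rank by value, smallest value gets 1.0
print(ranks)

# Hypothetical data with a tie: the two 20s would occupy ranks 2 and 3,
# so with the default method='average' each of them gets 2.5.
tied = pd.Series([10, 20, 20, 30])
tied_ranks = tied.rank()
print(tied_ranks)
```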

Reference: https://www.amazon.com/Python-Data-Analytics-Pandas-Matplotlib-ebook/dp/B07FT6FB6Y
