Pandas (1)

모플로 2021. 7. 25. 14:08

Pandas

A tool for working with data
The default index starts at 0 and increases by one per row

import pandas as pd
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
# Load the data as CSV; the separator is whitespace (the regex \s+) and the file has no header row
df_data = pd.read_csv(data_url, sep=r'\s+', header=None)

df_data.head()

->
    0    1    2    3    4    5    6    7    8    9    10    11    12    13
0    0.00632    18.0    2.31    0    0.538    6.575    65.2    4.0900    1    296.0    15.3    396.90    4.98    24.0
1    0.02731    0.0    7.07    0    0.469    6.421    78.9    4.9671    2    242.0    17.8    396.90    9.14    21.6
2    0.02729    0.0    7.07    0    0.469    7.185    61.1    4.9671    2    242.0    17.8    392.83    4.03    34.7
3    0.03237    0.0    2.18    0    0.458    6.998    45.8    6.0622    3    222.0    18.7    394.63    2.94    33.4
4    0.06905    0.0    2.18    0    0.458    7.147    54.2    6.0622    3    222.0    18.7    396.90    5.33    36.2
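To see what sep and header=None do without downloading the file, here is a small sketch that parses two whitespace-separated lines from an in-memory string. The sample text and the names sample_text and df_sample are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample: two rows separated by runs of spaces, no header line
sample_text = "0.00632  18.0   2.31\n0.02731   0.0   7.07\n"

# r'\s+' is a raw-string regex meaning "one or more whitespace characters";
# header=None tells pandas the first line is data, so columns are numbered 0, 1, 2
df_sample = pd.read_csv(io.StringIO(sample_text), sep=r'\s+', header=None)

print(df_sample.shape)          # (2, 3)
print(list(df_sample.columns))  # [0, 1, 2]
```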

Renaming the columns

df_data.columns = ["A","B","C","D","E","G","H","I","J","K","L","M","N","O"]

df_data.head()

->
    A    B    C    D    E    G    H    I    J    K    L    M    N    O
0    0.00632    18.0    2.31    0    0.538    6.575    65.2    4.0900    1    296.0    15.3    396.90    4.98    24.0
1    0.02731    0.0    7.07    0    0.469    6.421    78.9    4.9671    2    242.0    17.8    396.90    9.14    21.6
2    0.02729    0.0    7.07    0    0.469    7.185    61.1    4.9671    2    242.0    17.8    392.83    4.03    34.7
3    0.03237    0.0    2.18    0    0.458    6.998    45.8    6.0622    3    222.0    18.7    394.63    2.94    33.4
4    0.06905    0.0    2.18    0    0.458    7.147    54.2    6.0622    3    222.0    18.7    396.90    5.33    36.2

Selecting only specific columns, or adding a column that did not exist before

pd.DataFrame(df_data, columns= ["A","H"])

->
    A    H
0    0.00632    65.2
1    0.02731    78.9
2    0.02729    61.1
3    0.03237    45.8
4    0.06905    54.2

pd.DataFrame(df_data, columns = ["B","Z"])

->
    B    Z
0    18.0    NaN
1    0.0    NaN
2    0.0    NaN
3    0.0    NaN
4    0.0    NaN
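The same two results can probably also be reached with plain column selection and reindex. This is only a sketch on a made-up two-row frame (df_small is a hypothetical name), not the approach used above.

```python
import pandas as pd

# Hypothetical small frame standing in for df_data
df_small = pd.DataFrame({"A": [1.0, 2.0], "B": [18.0, 0.0]})

# Selecting with a list of names returns only those existing columns
only_b = df_small[["B"]]

# reindex keeps "B" and creates "Z", which did not exist, filled with NaN
with_z = df_small.reindex(columns=["B", "Z"])

print(with_z)
```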

Accessing a specific column

df_data.A    or      df_data["A"]

-> 
0      0.00632
1      0.02731
2      0.02729
3      0.03237
4      0.06905

iloc, loc
- loc: selects rows by index label
- iloc: selects by integer position (0-based), regardless of the index labels

raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
        'age': [42, 52, 36, 24, 73],
        'city': ['San Francisco', 'Baltimore', 'Miami', 'Douglas', 'Boston']}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'city'])

-> 
    first_name    last_name    age    city
0    Jason    Miller    42    San Francisco
1    Molly    Jacobson    52    Baltimore
2    Tina    Ali    36    Miami
3    Jake    Milner    24    Douglas
4    Amy    Cooze    73    Boston

df.loc[0].iloc[1]
-> Miller
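With the default 0, 1, 2... index, loc and iloc look identical, so the difference is easier to see with string labels. A minimal sketch, assuming a made-up frame named people with labels "x", "y", "z":

```python
import pandas as pd

people = pd.DataFrame(
    {"first_name": ["Jason", "Molly", "Tina"], "age": [42, 52, 36]},
    index=["x", "y", "z"],  # string labels instead of the default 0, 1, 2
)

by_label = people.loc["y"]      # row whose index label is "y"
by_position = people.iloc[1]    # second row, whatever its label is

print(by_label["first_name"])   # Molly
print(people.iloc[1, 0])        # Molly (row position 1, column position 0)
```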

Adding a new column from a condition on existing data

df['is_old'] = df['age']>40

-> 
    first_name    last_name    age    city    is_old
0    Jason    Miller    42    San Francisco    True
1    Molly    Jacobson    52    Baltimore    True
2    Tina    Ali    36    Miami    False
3    Jake    Milner    24    Douglas    False
4    Amy    Cooze    73    Boston    True

transpose

df.T

->
    0    1    2    3    4
first_name    Jason    Molly    Tina    Jake    Amy
last_name    Miller    Jacobson    Ali    Milner    Cooze
age    42    52    36    24    73
city    San Francisco    Baltimore    Miami    Douglas    Boston
is_old    True    True    False    False    True

Getting the values (returned as an array)

df.values

->
array([['Jason', 'Miller', 42, 'San Francisco', True],
       ['Molly', 'Jacobson', 52, 'Baltimore', True],
       ['Tina', 'Ali', 36, 'Miami', False],
       ['Jake', 'Milner', 24, 'Douglas', False],
       ['Amy', 'Cooze', 73, 'Boston', True]], dtype=object)

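Removing a column
- del, the same statement used to delete Python objects, removes the column from df itself

# del df['is_old']

# drop returns a new DataFrame; df itself changes only with inplace=True
# df.drop("is_old", axis=1, inplace=True)

# df is left unchanged here, so is_old still appears in the outputs below
df.drop("is_old", axis=1)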
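The difference between drop and del can be checked on a throwaway frame. This is a small sketch; frame and dropped are hypothetical names used only here.

```python
import pandas as pd

frame = pd.DataFrame({"age": [42, 24], "is_old": [True, False]})

dropped = frame.drop("is_old", axis=1)   # new DataFrame without the column
print("is_old" in frame.columns)         # True: frame itself is unchanged

del frame["is_old"]                      # removes the column from frame itself
print("is_old" in frame.columns)         # False
```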

data selection

# Show only the first 3 rows
df['first_name'].head(3)
->
0    Jason
1    Molly
2     Tina

df[['first_name','last_name','age']].head(3)

->
    first_name    last_name    age
0    Jason    Miller    42
1    Molly    Jacobson    52
2    Tina    Ali    36

# A slice selects rows by position (here the first 3 rows)
df[:3]

# All information for people younger than 40
df[df['age'] < 40]

->
    first_name    last_name    age    city    is_old
2    Tina    Ali    36    Miami    False
3    Jake    Milner    24    Douglas    False
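The expression inside the brackets is itself a boolean Series, and several conditions can be combined. A minimal sketch on a reduced copy of the data (people, mask and between are names made up for this example):

```python
import pandas as pd

people = pd.DataFrame({
    "first_name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
    "age": [42, 52, 36, 24, 73],
})

mask = people["age"] < 40    # boolean Series, one True/False per row
print(mask.tolist())         # [False, False, True, True, False]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
between = people[(people["age"] >= 30) & (people["age"] < 50)]
print(between["first_name"].tolist())   # ['Jason', 'Tina']
```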

Changing the index

df.index = df['city']
del df['city']
df

->


        first_name    last_name    age    is_old
city                
San Francisco    Jason    Miller    42    True
Baltimore    Molly    Jacobson    52    True
Miami    Tina    Ali    36    False
Douglas    Jake    Milner    24    False
Boston    Amy    Cooze    73    True
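Assigning the column to the index and then deleting it can likely be done in one step with set_index. A sketch on a made-up two-row frame, not a replacement for the code above:

```python
import pandas as pd

people = pd.DataFrame({
    "first_name": ["Jason", "Molly"],
    "city": ["San Francisco", "Baltimore"],
})

# set_index moves the column into the index and returns a new DataFrame
by_city = people.set_index("city")

print(by_city.loc["Baltimore", "first_name"])  # Molly
print(list(by_city.columns))                   # ['first_name']
```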

Selection with iloc

df.iloc[:3,:2]

->
      first_name    last_name
city        
San Francisco    Jason    Miller
Baltimore    Molly    Jacobson
Miami    Tina    Ali

Removing rows
- Rows are removed by index label

df.drop(['Miami','San Francisco'])
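As with columns, dropping rows returns a new DataFrame and leaves the original alone. The sketch below uses a made-up frame indexed by city; the label "Tokyo" is a hypothetical label that does not exist, included to show the errors parameter.

```python
import pandas as pd

people = pd.DataFrame(
    {"age": [42, 52, 36]},
    index=["San Francisco", "Baltimore", "Miami"],
)

remaining = people.drop(["Miami", "San Francisco"])   # drop by index label
print(list(remaining.index))   # ['Baltimore']
print(len(people))             # 3: the original still has every row

# A label that is not in the index raises KeyError unless errors="ignore" is passed
same = people.drop(["Tokyo"], errors="ignore")
print(len(same))               # 3
```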

Series

The object that corresponds to a single column of data
Operations between a Series and a DataFrame are broadcast
Addition
- Values are matched by index; an index label that is missing on either side produces NaN

s1 = pd.Series(range(1,6), index=list("abced"))

s2 = pd.Series(range(5,11), index=list("bcedef"))

# These two operations are equivalent
s1.add(s2) 
s1 + s2

->
a     NaN
b     7.0
c     9.0
d    13.0
e    11.0
e    13.0
f     NaN

# With fill_value=0, a value missing on one side is treated as 0 before the operation
s1.add(s2, fill_value=0)
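To make the effect of fill_value and of Series/DataFrame broadcasting concrete, here is a sketch with small made-up values and unique index labels. All variable names are chosen for this example only.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list("abc"))
s2 = pd.Series([10, 20, 30], index=list("bcd"))

plain = s1 + s2                     # "a" and "d" exist on one side only -> NaN
filled = s1.add(s2, fill_value=0)   # the missing side is treated as 0

print(plain.tolist())    # [nan, 12.0, 23.0, nan]
print(filled.tolist())   # [1.0, 12.0, 23.0, 30.0]

# Broadcasting: the Series index is matched against the DataFrame columns,
# and the Series is added to every row
table = pd.DataFrame({"x": [1, 2], "y": [10, 20]})
row = pd.Series({"x": 100, "y": 1000})
shifted = table + row
print(shifted.values.tolist())   # [[101, 1010], [102, 1020]]
```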

Pandas Lambda, Map

lambda

import numpy as np
s1 = pd.Series(np.arange(10))
s1.map(lambda x:x**2).head()

->
0     0
1     1
2     4
3     9
4    16

map

d = {1:"A", 2:"B", 3:"C"}
s1.map(d).head(5)

->
0    NaN
1      A
2      B
3      C
4    NaN

replace

d = {1:"A", 2:"B", 3:"C"}
s1.replace(d).head(5)

->
0      0
1      A
2      B
3      C
4      4
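The difference between map and replace with the same dictionary shows up in the values that have no matching key. A side-by-side sketch on a five-element Series (s, mapped and replaced are names made up here):

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4])
d = {1: "A", 2: "B", 3: "C"}

mapped = s.map(d)        # values with no key in d become NaN
replaced = s.replace(d)  # values with no key in d keep their original value

print(mapped.tolist())    # [nan, 'A', 'B', 'C', nan]
print(replaced.tolist())  # [0, 'A', 'B', 'C', 4]
```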


raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
        'age': [42, 52, 36, 24, 73],
        'city': ['San Francisco', 'Baltimore', 'Miami', 'Douglas', 'Boston']}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'city'])

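# Assign the result back; calling replace with inplace=True on df.city may not update df in newer pandas versions
df["city"] = df["city"].replace({"Boston": "BostonA"})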

-> 
    first_name    last_name    age    city
0    Jason    Miller    42    San Francisco
1    Molly    Jacobson    52    Baltimore
2    Tina    Ali    36    Miami
3    Jake    Milner    24    Douglas
4    Amy    Cooze    73    BostonA

apply

Unlike map, apply passes a whole Series (column) to the function
The function receives its input as a Series, so it can use Series methods on it
Useful for computing per-column statistics

raw_data = {'earn': [795,963,487,804,820],
        'height': [73,66,63,63,63],
        'age': [42, 52, 36, 24, 73]
        }
df = pd.DataFrame(raw_data, columns = ['earn', 'height', 'age'])

f = lambda x: x.max() - x.min()
df.apply(f)

->
earn      476
height     10
age        49
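Because the function receives a whole column, it can also return several statistics at once. A sketch, assuming a helper named min_max that is not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "earn": [795, 963, 487, 804, 820],
    "height": [73, 66, 63, 63, 63],
    "age": [42, 52, 36, 24, 73],
})

def min_max(col):
    # col is a whole column (a Series), so Series methods are available
    return pd.Series({"min": col.min(), "max": col.max()})

# One result column per input column, one row per statistic
stats = df.apply(min_max)
print(stats)
```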

applymap
- Applies the function per element rather than per Series
- Same effect as calling apply on a single Series
- Deprecated since pandas 2.1 in favor of DataFrame.map

df.applymap(lambda x:-x).head(3)
->
    earn    height    age
0    -795    -73    -42
1    -963    -66    -52
2    -487    -63    -36
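The element-wise behaviour can also be reproduced by handing each column to Series.map inside apply. This is only a sketch of that equivalence on a two-row frame; negate is a name made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"earn": [795, 963], "height": [73, 66]})

def negate(x):
    # Called once per element, not once per column
    return -x

one_col = df["earn"].map(negate)                   # element-wise on one column
all_cols = df.apply(lambda col: col.map(negate))   # element-wise on every column

print(one_col.tolist())           # [-795, -963]
print(all_cols.values.tolist())   # [[-795, -73], [-963, -66]]
```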

describe
- Prints summary statistics for the numeric columns

df.describe()
->

    earn    height    age
count    5.000000    5.000000    5.000000
mean    773.800000    65.600000    45.400000
std    174.317813    4.335897    18.460769
min    487.000000    63.000000    24.000000
25%    795.000000    63.000000    36.000000
50%    804.000000    63.000000    42.000000
75%    820.000000    66.000000    52.000000
max    963.000000    73.000000    73.000000

unique
- Returns the distinct values
- e.g. use it when you do not know how many categories a category column contains

raw_data = {'category': ['clothes', 'clothes', 'ring', 'shoes', 'ring'],
        'price': [42, 52, 36, 24, 73]}
df = pd.DataFrame(raw_data, columns = ['category', 'price'])

df.category.unique()
->
array(['clothes', 'ring', 'shoes'], dtype=object)
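To answer the "how many categories" question directly, the length of the unique result or nunique gives the count, and value_counts gives the rows per category. A short sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["clothes", "clothes", "ring", "shoes", "ring"],
    "price": [42, 52, 36, 24, 73],
})

cats = df["category"].unique()     # distinct values in order of first appearance
n_cats = df["category"].nunique()  # number of distinct values
counts = df["category"].value_counts()  # number of rows per category

print(len(cats))        # 3
print(n_cats)           # 3
print(counts["ring"])   # 2
```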