Data Cleaning: String Data Processing


String Data Processing

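  • pandas provides string functions, but they can only be used on string-typed columns
  • They are reached through the .str accessor
  • Once there, the familiar string methods can be used for data processing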
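To illustrate the first point, here is a minimal sketch on a made-up numeric Series (not part of the motorcycle data): the .str accessor raises an AttributeError on non-string data, and casting to str first is one way around it.

```python
import pandas as pd

# Hypothetical numeric data, used only to show the restriction
nums = pd.Series([1.5, 2.0])

try:
    nums.str.upper()          # .str is not available on a float64 Series
except AttributeError as exc:
    message = str(exc)
    print("AttributeError:", message)

# Casting to str first makes the string methods usable
as_text = nums.astype(str).str.replace(".", ",", regex=False)
print(as_text.tolist())
```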

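| Function | Description |
| --- | --- |
| contains() | Returns a boolean for each string indicating whether it contains the given pattern |
| replace() | Replaces occurrences of a substring or pattern |
| lower() | Returns a copy of the string with all letters converted to lowercase |
| upper() | Returns a copy of the string with all letters converted to uppercase |
| split() | Splits the string on a delimiter and returns the pieces as a list |
| strip() | Removes leading and trailing whitespace |
| join() | Returns a string that is the concatenation of all strings in the given sequence |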
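The functions in the table can be tried on a small Series. This is only a sketch: the sample values are hypothetical, and na=False and regex=False are passed explicitly so the behaviour does not depend on version-specific defaults.

```python
import pandas as pd

# Hypothetical sample values, not taken from MotorcycleData.csv
s = pd.Series(["  Harley-Davidson ", "BMW", None])

stripped = s.str.strip()                                   # drop surrounding whitespace
has_bmw = s.str.contains("BMW", regex=False, na=False)     # boolean mask, missing -> False
lowered = stripped.str.lower()
uppered = stripped.str.upper()
parts = stripped.str.split("-")                            # each entry becomes a list
rejoined = parts.str.join(" ")                             # join each list with a space
replaced = stripped.str.replace("-", " ", regex=False)     # literal replacement

print(pd.DataFrame({"stripped": stripped, "lower": lowered, "rejoined": rejoined}))
```

Missing values pass through the string methods as missing, which is why the last row stays empty in every derived column.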

import pandas as pd
import numpy as np
import os

os.getcwd()

'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据清洗之数据转换'

os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')

df = pd.read_csv('MotorcycleData.csv', encoding='gbk')

df.head(5)



| | Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Used | mint!!! very low miles | $11,412 | McHenry, Illinois, United States | 2013.0 | 16,000 | Black | Harley-Davidson | Unspecified | Touring | ... |
| 1 | Used | Perfect condition | $17,200 | Fort Recovery, Ohio, United States | 2016.0 | 60 | Black | Harley-Davidson | Vehicle has an existing warranty | Touring | ... |
| 2 | Used | NaN | $3,872 | Chicago, Illinois, United States | 1970.0 | 25,763 | Silver/Blue | BMW | Vehicle does NOT have an existing warranty | R-Series | ... |
| 3 | Used | CLEAN TITLE READY TO RIDE HOME | $6,575 | Green Bay, Wisconsin, United States | 2009.0 | 33,142 | Red | Harley-Davidson | NaN | Touring | ... |
| 4 | Used | NaN | $10,000 | West Bend, Wisconsin, United States | 2012.0 | 17,800 | Blue | Harley-Davidson | NO WARRANTY | Touring | ... |

| | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ... | NaN | FALSE | 8.1 | NaN | 2427 | Private Seller | Clear | True | FALSE | 28.0 |
| 1 | ... | NaN | FALSE | 100 | 17 | 657 | Private Seller | Clear | True | TRUE | 0.0 |
| 2 | ... | NaN | FALSE | 100 | NaN | 136 | NaN | Clear | True | FALSE | 26.0 |
| 3 | ... | NaN | FALSE | 100 | NaN | 2920 | Dealer | Clear | True | FALSE | 11.0 |
| 4 | ... | NaN | FALSE | 100 | 13 | 271 | OWNER | Clear | True | TRUE | 0.0 |

5 rows × 22 columns

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7493 entries, 0 to 7492
Data columns (total 22 columns):
Condition         7493 non-null object
Condition_Desc    1656 non-null object
Price             7493 non-null object
Location          7491 non-null object
Model_Year        7489 non-null float64
Mileage           7468 non-null object
Exterior_Color    6778 non-null object
Make              7489 non-null object
Warranty          5109 non-null object
Model             7370 non-null object
Sub_Model         2426 non-null object
Type              6011 non-null object
Vehicle_Title     268 non-null object
OBO               7427 non-null object
Feedback_Perc     6611 non-null object
Watch_Count       3517 non-null object
N_Reviews         7487 non-null object
Seller_Status     6868 non-null object
Vehicle_Tile      7439 non-null object
Auction           7476 non-null object
Buy_Now           7256 non-null object
Bid_Count         2190 non-null float64
dtypes: float64(2), object(20)
memory usage: 1.3+ MB

# Price holds text such as '$' and ',', so it cannot be cast to float directly
# df['Price'].astype(float)
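To see why the cast is commented out, here is a small sketch using two values in the same format as the Price column. The direct cast raises a ValueError. pd.to_numeric with errors="coerce" is shown only as a diagnostic alternative (it is not used in the original walkthrough): it turns every unparseable entry into NaN instead of raising.

```python
import pandas as pd

# Two values in the same format as the Price column
prices = pd.Series(["$11,412", "$17,200"])

try:
    prices.astype(float)
except ValueError as exc:
    error_text = str(exc)
    print("conversion failed:", error_text)

# errors="coerce" replaces unparseable entries with NaN instead of raising
coerced = pd.to_numeric(prices, errors="coerce")
print(coerced.isna().sum(), "values could not be parsed")
```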

# The .str accessor also supports slicing to extract characters
df['Price'].str[1:3].head(5)

0    11
1    17
2    3,
3    6,
4    10
Name: Price, dtype: object

# The strings have to be cleaned first: strip the '$' (the new column 价格 means "price")
df['价格'] = df['Price'].str.strip('$')

df['价格'].head(5)

0    11,412
1    17,200
2     3,872
3     6,575
4    10,000
Name: 价格, dtype: object

df['价格'] = df['价格'].str.replace(',', '')

df['价格'].head(5)

0    11412
1    17200
2     3872
3     6575
4    10000
Name: 价格, dtype: object

df['价格'] = df['价格'].astype(float)

df['价格'].head(5)

0    11412.0
1    17200.0
2     3872.0
3     6575.0
4    10000.0
Name: 价格, dtype: float64
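The three steps above (strip, replace, astype) can also be chained. The helper below is a sketch, and clean_price is a hypothetical name. Passing regex=False makes the comma a literal match, which matters on older pandas versions where str.replace treated the pattern as a regular expression by default.

```python
import pandas as pd

def clean_price(series: pd.Series) -> pd.Series:
    """Strip '$', remove thousands separators, then cast to float."""
    return (
        series.str.strip("$")
        .str.replace(",", "", regex=False)   # literal comma, not a regex
        .astype(float)
    )

# Values in the same format as the Price column
sample = pd.Series(["$11,412", "$3,872", "$10,000"])
cleaned = clean_price(sample)
print(cleaned.tolist())
```

With the DataFrame from this post, the equivalent call would be df['价格'] = clean_price(df['Price']).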

df.dtypes

Condition          object
Condition_Desc     object
Price              object
Location           object
Model_Year        float64
Mileage            object
Exterior_Color     object
Make               object
Warranty           object
Model              object
Sub_Model          object
Type               object
Vehicle_Title      object
OBO                object
Feedback_Perc      object
Watch_Count        object
N_Reviews          object
Seller_Status      object
Vehicle_Tile       object
Auction            object
Buy_Now            object
Bid_Count         float64
价格                float64
dtype: object

# Split the string on ',' and keep the first piece (the city)
df['Location'].str.split(',').str[0].head(5)

0          McHenry
1    Fort Recovery
2          Chicago
3        Green Bay
4        West Bend
Name: Location, dtype: object
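If all three parts of Location are needed rather than just the first, split accepts expand=True and returns one column per piece. A sketch on two sample values follows. The column names City, State and Country are assumptions for illustration, and the extra strip removes the space left after each comma.

```python
import pandas as pd

# Two values in the same format as the Location column
locations = pd.Series([
    "McHenry, Illinois, United States",
    "Chicago, Illinois, United States",
])

# expand=True returns a DataFrame with one column per piece
parts = locations.str.split(",", expand=True)
parts = parts.apply(lambda col: col.str.strip())   # remove the space after each comma
parts.columns = ["City", "State", "Country"]       # hypothetical column names
print(parts)
```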

# Compute the length of each string
df['Location'].str.len().head(5)

0    32.0
1    34.0
2    32.0
3    35.0
4    35.0
Name: Location, dtype: float64
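The lengths above are floats most likely because Location has missing values (7491 non-null out of 7493), and NaN cannot be stored in a plain integer column. As a sketch on a two-row sample, the nullable "Int64" dtype is one way to get integer lengths while keeping the missing entry:

```python
import pandas as pd

# One real-looking value plus a missing one, mirroring the gaps in Location
loc = pd.Series(["McHenry, Illinois, United States", None])

lengths = loc.str.len()                 # the missing entry has no length
int_lengths = lengths.astype("Int64")   # nullable integer dtype keeps it as <NA>
print(int_lengths)
```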

