
[Data Mining] Heartbeat Signal Classification Prediction: My_Task3 Feature Engineering


Table of Contents


  • 3.1 Learning objectives
  • 3.2 Content overview


3.1 Learning objectives

  • Learn feature preprocessing methods for time series data
  • Learn to use Tsfresh (TimeSeries Fresh), a feature processing tool for time series

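3.2 Content overview

Data preprocessing

  • Formatting the time series data
  • Adding the time-step feature time

Feature engineering

  • Constructing time series features
  • Feature selection
  • Using tsfresh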

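3.3 Code examples

3.3.1 Importing packages and reading the data

Tsfresh is a feature engineering tool for time series held in relational (table-like) form. It can automatically extract more than 100 features from a time series.
The package contains many feature extraction methods and a robust feature selection algorithm, as well as methods for evaluating the explanatory power and importance of these features for regression or classification tasks.
https://zhuanlan.zhihu.com/p/93310900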

# Import packages
import pandas as pd
import numpy as np
import tsfresh as tsf
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute

# Read the data
data_train = pd.read_csv("train.csv")
data_test_A = pd.read_csv("testA.csv")

print(data_train.shape)
print(data_test_A.shape)

(100000, 3)
(20000, 2)
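The shapes above suggest that each row stores a whole signal in a single column. As a minimal sketch, assuming the heartbeat_signals column is a quoted, comma-separated string (which is consistent with the later str.split call), a toy file with made-up values can be read the same way:

```python
import io

import pandas as pd

# Made-up miniature of train.csv: the whole signal sits in one quoted field.
csv_text = (
    "id,heartbeat_signals,label\n"
    '0,"0.9,0.5,0.1",0.0\n'
    '1,"0.8,0.4,0.0",2.0\n'
)

toy_train = pd.read_csv(io.StringIO(csv_text))

print(toy_train.shape)   # three columns: id, heartbeat_signals, label
print(toy_train.dtypes)  # the signal column is still text at this point
```

Because the signal is still text here, it has to be split and converted to float before any numeric feature can be computed.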

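3.3.2 Data preprocessing

  • Reshape the ECG signal column from one row per sample into one row per reading, and add a time-step feature, time, to each signal
  • Use of reset_index() and set_index()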

train_heartbeat_df = data_train["heartbeat_signals"].str.split(",",expand=True).stack()

train_heartbeat_df

0      0      0.9912297987616655
       1      0.9435330436439665
       2      0.7646772997256593
       3      0.6185708990212999
       4      0.3796321642826237
                     ...
99999  200                   0.0
       201                   0.0
       202                   0.0
       203                   0.0
       204                   0.0
Length: 20500000, dtype: object
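The split-and-stack step is easier to follow on a tiny input. The sketch below uses a made-up two-sample frame (the values and the names toy, wide and long are illustrative only) to show how str.split(expand=True) creates one column per reading and stack() turns those columns into the inner level of a MultiIndex:

```python
import pandas as pd

# Toy stand-in for data_train: two samples with three readings each.
toy = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.9,0.5,0.1", "0.8,0.4,0.0"],
})

# expand=True gives one column per reading, labelled 0, 1, 2.
wide = toy["heartbeat_signals"].str.split(",", expand=True)

# stack() moves the column labels into a second index level, so each
# row is now addressed by (sample row, time step).
long = wide.stack()

print(wide.shape)
print(long)
```

The values are still strings after this step, which is why the original output reports dtype: object.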

  • Reset the index; the result is now a DataFrame

train_heartbeat_df = train_heartbeat_df.reset_index()

train_heartbeat_df



          level_0  level_1                   0
0               0        0  0.9912297987616655
1               0        1  0.9435330436439665
2               0        2  0.7646772997256593
3               0        3  0.6185708990212999
4               0        4  0.3796321642826237
...           ...      ...                 ...
20499995    99999      200                 0.0
20499996    99999      201                 0.0
20499997    99999      202                 0.0
20499998    99999      203                 0.0
20499999    99999      204                 0.0

20500000 rows × 3 columns

  • Set level_0 as the index

train_heartbeat_df =  train_heartbeat_df.set_index("level_0")

train_heartbeat_df



         level_1                   0
level_0
0              0  0.9912297987616655
0              1  0.9435330436439665
0              2  0.7646772997256593
0              3  0.6185708990212999
0              4  0.3796321642826237
...          ...                 ...
99999        200                 0.0
99999        201                 0.0
99999        202                 0.0
99999        203                 0.0
99999        204                 0.0

20500000 rows × 2 columns

  • Set the index name to None, which removes the level_0 label from the displayed index

train_heartbeat_df.index.name = None

train_heartbeat_df



       level_1                   0
0            0  0.9912297987616655
0            1  0.9435330436439665
0            2  0.7646772997256593
0            3  0.6185708990212999
0            4  0.3796321642826237
...        ...                 ...
99999      200                 0.0
99999      201                 0.0
99999      202                 0.0
99999      203                 0.0
99999      204                 0.0

20500000 rows × 2 columns

  • Use rename() to change the column names; inplace=True modifies the DataFrame directly instead of returning a new one

train_heartbeat_df.rename(columns={"level_1":"time",0:"heartbeat_signals"},inplace=True)

train_heartbeat_df



       time   heartbeat_signals
0         0  0.9912297987616655
0         1  0.9435330436439665
0         2  0.7646772997256593
0         3  0.6185708990212999
0         4  0.3796321642826237
...     ...                 ...
99999   200                 0.0
99999   201                 0.0
99999   202                 0.0
99999   203                 0.0
99999   204                 0.0

20500000 rows × 2 columns

train_heartbeat_df["heartbeat_signals"] = train_heartbeat_df["heartbeat_signals"].astype(float)

train_heartbeat_df



       time  heartbeat_signals
0         0           0.991230
0         1           0.943533
0         2           0.764677
0         3           0.618571
0         4           0.379632
...     ...                ...
99999   200           0.000000
99999   201           0.000000
99999   202           0.000000
99999   203           0.000000
99999   204           0.000000

20500000 rows × 2 columns
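The individual preprocessing steps above can also be chained into one helper. This is only a sketch on toy data: to_long_format is a hypothetical name, not something defined in the original notebook, and the input values are made up.

```python
import pandas as pd


def to_long_format(signals: pd.Series) -> pd.DataFrame:
    """Turn comma-separated signal strings into one row per time step."""
    long_df = (
        signals.str.split(",", expand=True)
        .stack()
        .reset_index()         # MultiIndex becomes columns level_0, level_1, 0
        .set_index("level_0")  # the sample row number is the index again
        .rename(columns={"level_1": "time", 0: "heartbeat_signals"})
    )
    long_df.index.name = None  # drop the "level_0" label from the index
    long_df["heartbeat_signals"] = long_df["heartbeat_signals"].astype(float)
    return long_df


toy_signals = pd.Series(["0.9,0.5,0.1", "0.8,0.4,0.0"])
toy_long = to_long_format(toy_signals)
print(toy_long)
```

Using rename() without inplace=True keeps the whole chain as one expression; the result is the same as the step-by-step version.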

  • Add the processed ECG features to the training data, and store the label column of the training data separately

data_train_label = data_train["label"]

data_train_label

0        0.0
1        0.0
2        2.0
3        0.0
4        2.0
        ...
99995    0.0
99996    2.0
99997    3.0
99998    2.0
99999    0.0
Name: label, Length: 100000, dtype: float64

  • Drop the label column from data_train

data_train = data_train.drop('label',axis=1)

data_train



          id                                  heartbeat_signals
0          0  0.9912297987616655,0.9435330436439665,0.764677...
1          1  0.9714822034884503,0.9289687459588268,0.572932...
2          2  1.0,0.9591487564065292,0.7013782792997189,0.23...
3          3  0.9757952826275774,0.9340884687738161,0.659636...
4          4  0.0,0.055816398940721094,0.26129357194994196,0...
...      ...                                                ...
99995  99995  1.0,0.677705342021188,0.22239242747868546,0.25...
99996  99996  0.9268571578157265,0.9063471198026871,0.636993...
99997  99997  0.9258351628306013,0.5873839035878395,0.633226...
99998  99998  1.0,0.9947621698382489,0.8297017704865509,0.45...
99999  99999  0.9259994004527861,0.916476635326053,0.4042900...

100000 rows × 2 columns

data_train = data_train.drop("heartbeat_signals", axis=1)

data_train



          id
0          0
1          1
2          2
3          3
4          4
...      ...
99995  99995
99996  99996
99997  99997
99998  99998
99999  99999

100000 rows × 1 columns

data_train = data_train.join(train_heartbeat_df)

data_train



          id  time  heartbeat_signals
0          0     0           0.991230
0          0     1           0.943533
0          0     2           0.764677
0          0     3           0.618571
0          0     4           0.379632
...      ...   ...                ...
99999  99999   200           0.000000
99999  99999   201           0.000000
99999  99999   202           0.000000
99999  99999   203           0.000000
99999  99999   204           0.000000

20500000 rows × 3 columns

data_train[data_train["id"]==1]



    id  time  heartbeat_signals
1    1     0           0.971482
1    1     1           0.928969
1    1     2           0.572933
1    1     3           0.178457
1    1     4           0.122962
..  ..   ...                ...
1    1   200           0.000000
1    1   201           0.000000
1    1   202           0.000000
1    1   203           0.000000
1    1   204           0.000000

205 rows × 3 columns

As shown, each sample's ECG feature consists of signal values at 205 time steps.
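The join works because both frames share the same row index: each id row is repeated once for every time step carrying that index value. The sketch below reproduces this on made-up data (three steps per sample instead of 205; the names ids, long_df and steps_per_sample are illustrative) and checks the per-sample length with a groupby:

```python
import pandas as pd

# One row per sample, indexed 0 and 1.
ids = pd.DataFrame({"id": [10, 11]})

# One row per (sample, time step); the index repeats the sample row number.
long_df = pd.DataFrame(
    {
        "time": [0, 1, 2, 0, 1, 2],
        "heartbeat_signals": [0.9, 0.5, 0.1, 0.8, 0.4, 0.0],
    },
    index=[0, 0, 0, 1, 1, 1],
)

# join() aligns on the index, so every id is broadcast to its time steps.
joined = ids.join(long_df)

# Number of time steps per sample.
steps_per_sample = joined.groupby("id").size()
print(joined)
print(steps_per_sample)
```

On the real data the same groupby would be expected to report 205 for every id, matching the filtered view of id 1 above.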

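3.3.3 Processing time series features with tsfresh

1. Feature extraction
Tsfresh (TimeSeries Fresh) is a third-party Python package that automatically computes a large number of features from time series data. It also includes methods for assessing feature importance and for selecting features, so it is a good choice for feature extraction in both classification and regression problems on time series data. Official documentation: the "Introduction" page of the tsfresh 0.17.1.dev24+g860c4e1 documentation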

# # Feature extraction (commented out because it is slow; precomputed features are loaded instead)
# train_features = extract_features(data_train,column_id = 'id',column_sort='time')
# train_features
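To make the extracted column names less opaque, a few of the simplest features can be recomputed by hand. The sketch below is not how tsfresh is implemented internally; it only follows the documented definitions of five features (sum, sum of squares, mean absolute difference, mean difference, median) on made-up signals, and basic_features is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Made-up long-format data: two samples, four time steps each.
long_df = pd.DataFrame({
    "id": [0, 0, 0, 0, 1, 1, 1, 1],
    "time": [0, 1, 2, 3, 0, 1, 2, 3],
    "heartbeat_signals": [1.0, 0.5, 0.25, 0.0, 0.0, 1.0, 0.0, 1.0],
})


def basic_features(values: np.ndarray) -> dict:
    """Hand-rolled versions of five simple per-series features."""
    diffs = np.diff(values)  # differences between consecutive time steps
    return {
        "heartbeat_signals__sum_values": values.sum(),
        "heartbeat_signals__abs_energy": np.sum(values ** 2),
        "heartbeat_signals__mean_abs_change": np.abs(diffs).mean(),
        "heartbeat_signals__mean_change": diffs.mean(),
        "heartbeat_signals__median": np.median(values),
    }


# One feature row per id, with readings ordered by time (column_sort).
rows = {
    sample_id: basic_features(group["heartbeat_signals"].to_numpy())
    for sample_id, group in long_df.sort_values(["id", "time"]).groupby("id")
}
features = pd.DataFrame.from_dict(rows, orient="index")
print(features.round(4))
```

The result has one row per id and one column per feature, which is the same layout as the train_features table, only with far fewer columns.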

  • Load the already-computed features (stored in pkl format) and use them directly, which avoids the time-consuming regeneration on every run

import pickle

# Open the file in a with block so the handle is closed after loading
with open("./HeartbeatClassification/train_features_file.pkl", "rb") as feature_file:
    train_features = pickle.load(feature_file)
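The post does not show how the pkl file was written. One possible pattern, sketched here with hypothetical names (load_or_compute, expensive_features) and a temporary directory instead of the real path, is to compute once, dump the result, and read it back on later runs:

```python
import os
import pickle
import tempfile

import pandas as pd


def load_or_compute(cache_path, compute):
    """Return the cached object if the file exists, else compute and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive_features():
    # Stand-in for the slow extract_features call; values are made up.
    calls.append(1)
    return pd.DataFrame({"heartbeat_signals__sum_values": [38.9, 19.4]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_features_file.pkl")
    first = load_or_compute(path, expensive_features)   # computes and writes
    second = load_or_compute(path, expensive_features)  # reads the cache

print(len(calls))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.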

train_features



(displayed transposed: one row per feature column, one column per sample index)

                                                                    0          1          2          3          4        ...      99995      99996      99997      99998      99999
heartbeat_signals__variance_larger_than_standard_deviation        0.0        0.0        0.0        0.0        0.0        ...        0.0        0.0        0.0        0.0        0.0
heartbeat_signals__has_duplicate_max                              0.0        0.0        0.0        0.0        0.0        ...        0.0        0.0        0.0        0.0        0.0
heartbeat_signals__has_duplicate_min                              1.0        1.0        1.0        1.0        1.0        ...        1.0        1.0        1.0        1.0        1.0
heartbeat_signals__has_duplicate                                  1.0        1.0        1.0        1.0        1.0        ...        1.0        1.0        1.0        1.0        1.0
heartbeat_signals__sum_values                               38.927945  19.445634  21.192974  42.113066  69.756786        ...  63.323449  69.657534  40.897057  42.333303  53.290117
heartbeat_signals__abs_energy                               18.216197   7.705092   9.140423  15.757623  51.229616        ...  28.742238  31.866323  16.412857  14.281281  21.637471
heartbeat_signals__mean_abs_change                           0.019894   0.019952   0.009863   0.018743   0.014514        ...   0.023588   0.017373   0.019470   0.017032   0.021870
heartbeat_signals__mean_change                              -0.004859  -0.004762  -0.004902  -0.004783   0.000000        ...  -0.004902  -0.004543  -0.004538  -0.004902  -0.004539
heartbeat_signals__mean_second_derivative_central            0.000117   0.000105   0.000101   0.000103  -0.000137        ...   0.000794   0.000051   0.000834   0.000013   0.000023
heartbeat_signals__median                                    0.125531   0.030481   0.000000   0.241397   0.000000        ...   0.388402   0.421138   0.213306   0.264974   0.320124
...                                                               ...        ...        ...        ...        ...        ...        ...        ...        ...        ...        ...
heartbeat_signals__fourier_entropy__bins_2                   0.095763   0.248333   0.054659   0.054659   0.054659        ...   0.054659   0.095763   0.164224   0.095763   0.095763
heartbeat_signals__fourier_entropy__bins_3                   0.109222   0.409767   0.054659   0.109222   0.109222        ...   0.054659   0.095763   0.186062   0.109222   0.150231
heartbeat_signals__fourier_entropy__bins_5                   0.109222   0.567944   0.150231   0.186062   0.109222        ...   0.109222   0.109222   0.299588   0.163690   0.204601
heartbeat_signals__fourier_entropy__bins_10                  0.356175   0.913016   0.204601   0.258874   0.163690        ...   0.109222   0.163690   0.353661   0.218060   0.463604
heartbeat_signals__fourier_entropy__bins_100                 0.940492   1.791964   0.542013   1.426345   0.517722        ...   1.405361   0.749555   0.995174   1.321241   1.768224
heartbeat_signals__permutation_entropy__dimension_3__tau_1   1.180828   1.360828   0.712221   1.389686   1.045339        ...   1.326208   1.408284   1.305626   1.460980   1.344607
heartbeat_signals__permutation_entropy__dimension_4__tau_1   1.734917   2.118249   1.031064   2.206088   1.543338        ...   2.137411   2.244166   2.005282   2.387132   2.186286
heartbeat_signals__permutation_entropy__dimension_5__tau_1   2.184420   2.710933   1.263370   2.986728   1.914511        ...   2.873602   3.085504   2.601062   3.236950   2.949266
heartbeat_signals__permutation_entropy__dimension_6__tau_1   2.500658   3.065802   1.406001   3.534354   2.165627        ...   3.391830   3.728881   2.996962   3.793512   3.462549
heartbeat_signals__permutation_entropy__dimension_7__tau_1   2.722686   3.224835   1.509478   3.854177   2.323993        ...   3.679969   4.095457   3.293562   4.018302   3.688612

100000 rows × 779 columns

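2. Feature selection
train_features contains 779 common time series features computed from heartbeat_signals (the official documentation explains each of them). Some of these features may be NaN, which happens when the data does not support computing that feature. The following call replaces the NaN values:

# Replace the NaN values in the extracted features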
impute(train_features)



(The displayed rows and columns are identical to the train_features output shown before the imputation step.)

100000 rows × 779 columns
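According to the tsfresh documentation, impute works column by column: NaN becomes the column median, +inf the column maximum and -inf the column minimum, and a column without any finite value is filled with zeros. The function below is a hand-written sketch of that rule for illustration only (impute_sketch is a hypothetical name and it returns a new frame, whereas tsfresh's impute modifies its argument in place):

```python
import numpy as np
import pandas as pd


def impute_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Replace NaN/inf per column using median/max/min of the finite values."""
    out = df.astype(float)
    for col in out.columns:
        values = out[col]
        finite = values[np.isfinite(values)]
        if finite.empty:
            out[col] = 0.0  # no finite value to borrow from
            continue
        values = values.mask(values == np.inf, finite.max())
        values = values.mask(values == -np.inf, finite.min())
        out[col] = values.fillna(finite.median())
    return out


# Made-up feature table with the three kinds of non-finite values.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.inf],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": [-np.inf, 2.0, 4.0, 6.0],
})
clean = impute_sketch(toy)
print(clean)
```

After this step every entry is finite, which is what select_features needs as input.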

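Next, features are selected according to their relevance to the response variable. This process has two steps:

  • First, the relevance of each feature to the response variable is evaluated individually, which yields one p-value per feature
  • Then the Benjamini-Yekutieli procedure [1] is applied to these p-values to decide which features are kept.
    Some common feature selection methods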
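The second step can be sketched independently of tsfresh. The function below is a plain NumPy illustration of the Benjamini-Yekutieli step-up rule applied to a list of p-values; it is not tsfresh's own code, the p-values are made up, and the name benjamini_yekutieli and the default level of 0.05 are assumptions of this sketch.

```python
import numpy as np


def benjamini_yekutieli(p_values, fdr_level=0.05):
    """Return a boolean mask of the hypotheses rejected (features kept)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                    # indices of p-values, ascending
    c_m = np.sum(1.0 / np.arange(1, m + 1))  # harmonic correction term
    thresholds = np.arange(1, m + 1) * fdr_level / (m * c_m)
    below = p[order] <= thresholds
    keep = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])     # largest rank passing its threshold
        keep[order[: k + 1]] = True          # keep everything up to that rank
    return keep


# Four made-up p-values, one per feature.
mask = benjamini_yekutieli([0.5, 0.001, 0.03, 0.011])
print(mask)

# Step-up behaviour: the smallest p-value misses its own threshold,
# but is still kept because the second one passes.
step_up_mask = benjamini_yekutieli([0.007, 0.011, 0.5, 0.6])
print(step_up_mask)
```

With four p-values the thresholds are 0.006, 0.012, 0.018 and 0.024, so in the first call only the features with p-values 0.001 and 0.011 are kept.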

# Select features according to their relevance to the data label
train_features_filtered = select_features(train_features,data_train_label)

train_features_filtered



(displayed transposed: one row per feature column, one column per sample index)

                                                                   0          1          2          3          4        ...      99995      99996      99997      99998      99999
heartbeat_signals__sum_values                              38.927945  19.445634  21.192974  42.113066  69.756786        ...  63.323449  69.657534  40.897057  42.333303  53.290117
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_35    1.168685   1.460752   1.787166   2.071539   0.653924        ...   0.417221   1.611333   1.190514   1.237608   0.154759
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_34    0.982133   1.924501   2.146987   1.000340   0.231422        ...   2.036034   1.793044   0.674603   1.325212   2.921164
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_33    1.223496   1.925485   1.686190   2.728281   1.080003        ...   1.659054   1.092325   1.632769   2.785515   2.183932
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_32    1.236300   1.715938   1.540137   1.391727   0.711244        ...   0.500584   0.507138   0.229008   1.918571   1.485150
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_31    1.104172   2.079957   2.291031   2.017176   1.357904        ...   1.693545   1.763940   2.027802   0.814167   2.685922
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_30    1.497129   1.818636   2.403422   2.610492   1.237998        ...   0.859932   2.677643   0.302457   2.613950   0.583443
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_29    1.358095   2.490450   1.765422   0.747448   1.346404        ...   1.963009   2.640827   2.016243   2.083409   3.101826
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_28    1.704225   1.673244   1.993213   2.900299   1.645870        ...   1.524831   1.128049   0.352602   1.330934   1.264842
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_27    1.745158   2.821067   2.756081   1.294779   0.941866        ...   1.344715   0.856280   1.836034   2.801509   2.877000
...                                                              ...        ...        ...        ...        ...        ...        ...        ...        ...        ...        ...
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_84    0.531883   0.563590   0.712487   0.601499   0.015292        ...   0.779955   0.539489   0.282597   0.594252   0.463697
heartbeat_signals__fft_coefficient__attr_"imag"__coeff_97  -0.047438  -0.109579  -0.074042  -0.184248   0.070505        ...   0.005525   0.114670  -0.474629  -0.162106   0.289364
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_90    0.554370   0.697446   0.321703   0.564669   0.065835        ...   0.486013   0.579498   0.460647   0.694276   0.285321
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_94    0.307586   0.398073   0.390386   0.623353   0.051780        ...   0.273372   0.417226   0.478341   0.681025   0.422103
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_92    0.564596   0.640969   0.716929   0.466980   0.092940        ...   0.705386   0.270110   0.527891   0.357196   0.692009
heartbeat_signals__fft_coefficient__attr_"real"__coeff_97   0.562960   0.270192   0.316524   0.651774   0.103773        ...   0.602898   0.556596   0.904111   0.498088   0.276236
heartbeat_signals__fft_coefficient__attr_"abs"__coeff_75    0.591859   0.224925   0.422077   0.308915   0.179405        ...   0.447929   0.703258   0.728529   0.433297   0.245780
heartbeat_signals__fft_coefficient__attr_"real"__coeff_88   0.504124   0.645082   0.722742   0.550097  -0.089611        ...   0.474844   0.462312   0.178410   0.406154   0.269519
heartbeat_signals__fft_coefficient__attr_"real"__coeff_92   0.528450   0.635135   0.680590   0.466904   0.091841        ...   0.564266   0.269719   0.500813   0.324771   0.681719
heartbeat_signals__fft_coefficient__attr_"real"__coeff_83   0.473568   0.297325   0.383754   0.494024   0.056867        ...   0.133969   0.539236   0.773985   0.340727  -0.053993

100000 rows × 700 columns
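Selection keeps 700 of the 779 columns. If test-set features are extracted later, one common approach, sketched here with made-up frames and hypothetical column names, is to reuse the selected column list from the training set rather than running selection again, so that both sets end up with the same columns in the same order:

```python
import pandas as pd

# Stand-in for train_features_filtered: only the selected columns remain.
train_filtered = pd.DataFrame({"f_sum": [1.0, 2.0], "f_fft": [0.1, 0.2]})

# Stand-in for unfiltered test features: extra columns, different order.
test_all = pd.DataFrame({"f_fft": [0.3], "f_noise": [9.0], "f_sum": [3.0]})

# Index the test frame with the training columns to match the layout.
test_filtered = test_all[train_filtered.columns]
print(test_filtered)
```

This step is an assumption about the later workflow; the post itself stops after selecting the training features.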

Feature engineering summary:

[Figure: feature engineering summary]

  • Xmind diagram of the data mining workflow

References

GitHub link

