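Python Pandas Overview and Use Cases: A Guide to Creating Series and DataFrame Objects

1. Introduction to Pandas

Pandas is a core library for data analysis in Python. It provides efficient, flexible data structures (Series and DataFrame) along with data analysis tools, and it is particularly well suited to tabular data, time series, and other structured datasets.

Key features:

Handling of missing data
Powerful data alignment
Flexible reshaping and pivoting of datasets
Label-based slicing and indexing
Merging and joining of datasets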
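
As a rough illustration of the alignment and missing-data features listed above, the sketch below adds two Series whose labels only partly overlap. The fruit names and quantities are made-up illustrative data, not part of the original article.

```python
import pandas as pd

# Two Series whose labels only partly overlap (illustrative data)
q1 = pd.Series({'apples': 10, 'pears': 5})
q2 = pd.Series({'pears': 7, 'plums': 3})

# Addition aligns on labels; a label present in only one Series yields NaN
total = q1 + q2
print(total)

# Series.add with fill_value treats a missing label as 0 instead
total_filled = q1.add(q2, fill_value=0)
print(total_filled)
```

Only 'pears' exists in both inputs, so it is the only label with a value in the plain sum; fill_value is one way to keep the other labels.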

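2. Creating and Using Series

A Series is the basic one-dimensional data structure in Pandas, similar to an array with labels.

2.1 Creating a Series from a List

import pandas as pd

# Create a Series from a list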
data = [10, 20, 30, 40, 50]
s = pd.Series(data)
print(s)
"""
0    10
1    20
2    30
3    40
4    50
dtype: int64
"""

# Custom index
s_with_index = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
print(s_with_index)
"""
a    10
b    20
c    30
d    40
e    50
dtype: int64
"""

Use case: storing one-dimensional data such as time series or experimental measurements.
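
Once a Series has a custom index, its values can be read by label or by position. The following is a minimal sketch that rebuilds the Series from the example above; note that label slices include both endpoints, unlike positional slices.

```python
import pandas as pd

s_with_index = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

by_label = s_with_index['b']            # access by label
by_position = s_with_index.iloc[1]      # access by integer position
label_slice = s_with_index['b':'d']     # label slice: includes both 'b' and 'd'
position_slice = s_with_index.iloc[1:3] # positional slice: excludes the end

print(by_label, by_position)
print(label_slice)
print(position_slice)
```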

2.2 Creating a Series from a Dictionary

# Create a Series from a dictionary
data_dict = {'a': 1, 'b': 2, 'c': 3}
s_dict = pd.Series(data_dict)
print(s_dict)
"""
a    1
b    2
c    3
dtype: int64
"""

# Specify the index order ('d' is not a key in the dictionary)
s_dict_ordered = pd.Series(data_dict, index=['b', 'a', 'c', 'd'])
print(s_dict_ordered)
"""
b    2.0
a    1.0
c    3.0
d    NaN
dtype: float64
"""

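Note: the dictionary keys automatically become the Series index. Index labels with no matching key get NaN, which is why the dtype changes from int64 to float64.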
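
The NaN produced by an unmatched label usually needs handling before further analysis. The sketch below shows three common options on the same data: detecting, filling, and dropping. Filling with 0 is only an illustrative choice; the right replacement depends on the data.

```python
import pandas as pd

data_dict = {'a': 1, 'b': 2, 'c': 3}
s = pd.Series(data_dict, index=['b', 'a', 'c', 'd'])

missing_mask = s.isna()   # True only where the label had no matching key
filled = s.fillna(0)      # replace NaN with 0 (dtype stays float64)
dropped = s.dropna()      # remove the unmatched label entirely

print(missing_mask)
print(filled)
print(dropped)
```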

2.3 Creating a Series from a Scalar Value

# Create a Series from a scalar value
s_scalar = pd.Series(5, index=['a', 'b', 'c', 'd'])
print(s_scalar)
"""
a    5
b    5
c    5
d    5
dtype: int64
"""

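Use case: creating a Series in which every entry holds the same value; the scalar is repeated for each index label.

3. Creating and Using DataFrames

A DataFrame is the most important two-dimensional tabular data structure in Pandas. It can be thought of as a dictionary of Series that share the same index.

3.1 Creating a DataFrame from a Dictionary

# Create a DataFrame from a dictionary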
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London']
}
df = pd.DataFrame(data)
print(df)
"""
      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London
"""

# Specify the column order
df_ordered = pd.DataFrame(data, columns=['City', 'Name', 'Age'])
print(df_ordered)
"""
       City     Name  Age
0  New York    Alice   25
1     Paris      Bob   30
2    London  Charlie   35
"""

# Specify the index
df_indexed = pd.DataFrame(data, index=['id1', 'id2', 'id3'])
print(df_indexed)
"""
          Name  Age      City
id1    Alice   25  New York
id2      Bob   30     Paris
id3  Charlie   35    London
"""

Use case: this is the most common way to create a DataFrame and suits structured data.
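
The "dictionary of Series" view of a DataFrame can be checked directly: selecting one column returns a Series that carries the DataFrame's index. The sketch below uses a shortened version of the example data; the AgeNextYear column is a hypothetical name introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

age = df['Age']                      # selecting one column returns a Series
print(type(age).__name__)
print(age.index.equals(df.index))    # the column shares the DataFrame's index

# Assigning a Series under a new key adds a column, much like a dict
df['AgeNextYear'] = df['Age'] + 1
print(df)
```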

3.2 Creating a DataFrame from a List of Lists

# Create a DataFrame from a list of lists
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Paris'],
    ['Charlie', 35, 'London']
]
df_list = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df_list)
"""
      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London
"""

Note: column names must be specified explicitly; otherwise the default integer column names (0, 1, 2, ...) are used.
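
To make the default naming concrete, the sketch below builds a DataFrame without the columns argument and then assigns names afterwards. The two-row dataset is a shortened version of the example above.

```python
import pandas as pd

rows = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Paris'],
]

# Without columns=..., Pandas falls back to integer column names
df_default = pd.DataFrame(rows)
default_names = list(df_default.columns)
print(default_names)

# Names can also be assigned after construction
df_default.columns = ['Name', 'Age', 'City']
print(df_default)
```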

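3.3 Creating a DataFrame from a NumPy Array

import numpy as np

# Create a DataFrame from a NumPy array
np.random.seed(42)  # fix the seed so the output below is reproducible
arr = np.random.rand(3, 4)  # random array with 3 rows and 4 columns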
df_np = pd.DataFrame(arr, columns=['A', 'B', 'C', 'D'])
print(df_np)
"""
          A         B         C         D
0  0.374540  0.950714  0.731994  0.598658
1  0.156019  0.155995  0.058084  0.866176
2  0.601115  0.708073  0.020584  0.969910
"""

Use case: converting between NumPy arrays and Pandas DataFrames in scientific computing.
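
The conversion also works in the other direction. The sketch below uses a deterministic array instead of random values so the round trip is easy to verify, and shows that a DataFrame with mixed column types converts to an object array.

```python
import numpy as np
import pandas as pd

arr = np.arange(12, dtype=float).reshape(3, 4)
df_np = pd.DataFrame(arr, columns=['A', 'B', 'C', 'D'])

# DataFrame -> NumPy array
back = df_np.to_numpy()
print(np.array_equal(arr, back))

# Mixed column types cannot share a numeric dtype, so the result is object
df_mixed = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
mixed_arr = df_mixed.to_numpy()
print(mixed_arr.dtype)
```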

3.4 Creating a DataFrame from a Dictionary of Series

# Create a DataFrame from a dictionary of Series
s1 = pd.Series(['Alice', 'Bob', 'Charlie'])
s2 = pd.Series([25, 30, 35])
s3 = pd.Series(['New York', 'Paris', 'London'])

df_series = pd.DataFrame({'Name': s1, 'Age': s2, 'City': s3})
print(df_series)
"""
      Name  Age      City
0    Alice   25  New York
1      Bob   30     Paris
2  Charlie   35    London
"""

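Note: the Series are aligned on their index labels. When the Series differ in length, or their indexes differ, the missing positions are filled with NaN.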
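
The alignment behaviour can be seen with two Series of unequal length. In this minimal sketch the shorter Series has no entry for index 2, so that cell becomes NaN and the column is stored as float64.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', 'Charlie'])  # index 0, 1, 2
ages = pd.Series([25, 30])                      # index 0, 1 only

df_ragged = pd.DataFrame({'Name': names, 'Age': ages})
print(df_ragged)
print(df_ragged['Age'].dtype)
```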

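3.5 Creating a DataFrame from a File

# Create a DataFrame from a CSV file (example)
# df_csv = pd.read_csv('data.csv')

# Create a DataFrame from an Excel file (example; requires an Excel engine such as openpyxl)
# df_excel = pd.read_excel('data.xlsx')

# Create a DataFrame from a JSON file (example)
# df_json = pd.read_json('data.json')

Use case: the most common way to import data in day-to-day work.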
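
Because the file names above are placeholders, the following sketch reads CSV text from an in-memory buffer so that it runs without any file on disk. The CSV content is made up for illustration; pd.read_csv accepts a file path in the same position.

```python
import io
import pandas as pd

# Illustrative CSV content standing in for a real file
csv_text = "Name,Age,City\nAlice,25,New York\nBob,30,Paris\n"

# read_csv accepts a path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)

# usecols restricts which columns are loaded
df_subset = pd.read_csv(io.StringIO(csv_text), usecols=['Name', 'Age'])
print(df_subset)
```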

4. Special Ways to Create a DataFrame

4.1 Creating an Empty DataFrame

# Create an empty DataFrame
empty_df = pd.DataFrame(columns=['Name', 'Age', 'City'])
print(empty_df)
"""
Empty DataFrame
Columns: [Name, Age, City]
Index: []
"""

# Add a row
empty_df.loc[0] = ['Alice', 25, 'New York']
print(empty_df)
"""
    Name Age      City
0  Alice  25  New York
"""

Use case: building a DataFrame dynamically.
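
Row-by-row assignment works, but it can be slow for many rows and may leave columns with the object dtype. A common alternative, sketched here as one possible approach rather than the only one, is to collect rows in a plain Python list and build the DataFrame once at the end. The source tuples are illustrative.

```python
import pandas as pd

records = []
for name, age, city in [('Alice', 25, 'New York'), ('Bob', 30, 'Paris')]:
    # Each row is collected as a dict keyed by column name
    records.append({'Name': name, 'Age': age, 'City': city})

# Build the DataFrame in a single step so dtypes are inferred per column
df_built = pd.DataFrame(records, columns=['Name', 'Age', 'City'])
print(df_built)
print(df_built['Age'].dtype)
```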

4.2 Creating a DataFrame with a Date-Range Index

# Create a DataFrame indexed by a date range (the random values differ on each run)
dates = pd.date_range('20230101', periods=6)
df_date = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=['A', 'B', 'C', 'D'])
print(df_date)
"""
                   A         B         C         D
2023-01-01 -0.302665 -0.634707  0.215565 -0.308545
2023-01-02 -0.815322 -0.119054  0.609594  0.541054
2023-01-03  0.158121 -0.612093  1.377373 -0.674487
2023-01-04 -0.568650 -0.572925 -0.290846 -0.303812
2023-01-05 -0.279248  0.837348  0.331791  0.839871
2023-01-06  0.311553 -0.722283 -0.346608  0.692739
"""

Use case: time series analysis, financial data analysis, and similar work.
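
A date index makes date-based selection and aggregation straightforward. The sketch below uses deterministic values instead of random ones so the results can be checked by hand; the two-day window is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20230101', periods=6)
df_date = pd.DataFrame(
    np.arange(24).reshape(6, 4), index=dates, columns=['A', 'B', 'C', 'D']
)

# Slice by date strings; both endpoints are included
window = df_date.loc['2023-01-02':'2023-01-04']
print(window)

# Aggregate into two-day bins and take the mean of each bin
two_day_mean = df_date.resample('2D').mean()
print(two_day_mean)
```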

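5. Inspecting Data and Basic Information

Once a DataFrame has been created, we can inspect its basic information (df below is the DataFrame from section 3.1):

# View the first few rows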
print(df.head(2))
"""
    Name  Age      City
0  Alice   25  New York
1    Bob   30     Paris
"""

# View basic information (info() prints its report directly and returns None, so print() is not needed)
df.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   City    3 non-null      object
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes
"""

# View summary statistics (numeric columns only by default)
print(df.describe())
"""
             Age
count   3.000000
mean   30.000000
std     5.000000
min    25.000000
25%    27.500000
50%    30.000000
75%    32.500000
max    35.000000
"""
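
A few attributes complement head(), info(), and describe() for a quick look at a DataFrame. The sketch below rebuilds the DataFrame from section 3.1 so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Paris', 'London'],
})

shape = df.shape                  # (rows, columns)
column_names = df.columns.tolist()
mean_age = df['Age'].mean()       # statistic for a single column

print(shape)
print(column_names)
print(df.dtypes)                  # dtype of each column
print(mean_age)
```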

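6. Summary

A Series is a labeled one-dimensional array, suited to storing a single column of data together with its labels
- It can be created from a list, a dictionary, or a scalar value
- Automatic index alignment is one of the most powerful features of Pandas
A DataFrame is a two-dimensional tabular data structure and the core of data analysis
- It can be created from dictionaries, lists, NumPy arrays, dictionaries of Series, and more
- It supports custom indexes and column names
- Data can be imported from a variety of file formats
When choosing a creation method, consider:
- The data source (an in-memory data structure or an external file)
- Whether a custom index is needed
- The data dimensionality (Series for one dimension, DataFrame for two)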