
[Algorithm Competition Study] Digital China Innovation Contest, Smart Ocean Construction: Task 2 Data Analysis

Published: 2023/12/15


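Smart Ocean Construction - Task 2: Data Analysis

This part is the data analysis module for the Smart Ocean Construction competition. Analyzing the data is how you get familiar with it and prepare for the feature engineering that follows. Discussion and feedback are welcome.

Competition task: Smart Ocean Construction (智慧海洋建设)

Purposes of data analysis:

The main value of EDA is getting to know the basic state of the whole dataset (missing values, outliers) and confirming that it can be used for the machine learning or deep learning that follows.
Understand the correlations among features, their distributions, and the relationship between the features and the prediction target.
Provide a basis for feature engineering.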

Project: https://github.com/datawhalechina/team-learning-data-mining/tree/master/wisdomOcean

Competition: https://tianchi.aliyun.com/competition/entrance/231768/introduction?spm=5176.12281957.1004.8.4ac63eafE1rwsY

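2.1 Learning objectives

Learn how to analyze the overall state of a dataset, including its basic condition (missing values, outliers).

Learn how variables relate to one another and how they relate to the prediction target.

Complete the corresponding check-in task.

2.2 Contents

Overall understanding of the data

Read the dataset and check its size and the number of raw features;
Use info to check the data types;
Take a rough look at the basic statistics of each feature.

Missing values and unique values

Check for missing values
Check for single-valued columns

Data characteristics and feature distributions

Trajectory visualization for the three classes of fishing vessel
Coordinate sequence visualization
Speed and direction sequence visualization for the three classes
Speed and direction distributions for the three classes

Summary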

2.3 Code examples

2.3.1 Load the data science and visualization libraries

All of the libraries below are installed with pip install; any special case is noted separately. For example: pip install pandas -i https://pypi.tuna.tsinghua.edu.cn/simple

# coding: utf-8
# Import the warnings package and use a filter to suppress warning messages.
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

df = pd.read_csv(r'D:\WORK_DATA_F\jupyter_notebook\2020数字中国创新大赛-智慧海洋建设\代码练习\数据\DF.csv')

2.3.2 Load the other required packages

from tqdm import tqdm
import multiprocessing as mp
import os
import pickle
import random

# The functions that read the data live in a separate Python file because
# multiprocessing workers cannot run functions defined inside a Jupyter notebook.
import read_all_data
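The read_all_data module itself is not shown in this article. The following is only a minimal sketch of what such a module might contain, assuming each file is an ordinary UTF-8 CSV holding one vessel's records. The function names match the calls used later (read_train_file, read_test_file), but the bodies, the data_dir parameter, and the directory constants are assumptions for illustration.

```python
# read_all_data.py -- hypothetical sketch; the author's real module is not shown.
import os

import pandas as pd

# Assumption: default directories, taken from the paths used in section 2.3.4.
TRAIN_DIR = "D:/code_sea/data/train/hy_round1_train_20200102/"
TEST_DIR = "D:/code_sea/data/test/hy_round1_testA_20200102/"


def read_train_file(filename, data_dir=TRAIN_DIR):
    """Read one training CSV (one vessel) into a DataFrame."""
    return pd.read_csv(os.path.join(data_dir, filename), encoding="utf-8")


def read_test_file(filename, data_dir=TEST_DIR):
    """Read one test CSV (one vessel, no label column) into a DataFrame."""
    return pd.read_csv(os.path.join(data_dir, filename), encoding="utf-8")
```

Keeping these functions at module level means a worker process can import them by name, which is what multiprocessing needs when the parent process is a notebook.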

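Note:

This exploration, and the visualization part in particular, uses only a few selected variables as examples. It demonstrates a method and is not a complete data analysis solution for the competition.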

2.3.3 Define functions to load and store data

class Load_Save_Data():
    def __init__(self, file_name=None):
        self.filename = file_name

    def load_data(self, Path=None):
        if Path is None:
            assert self.filename is not None, "Invalid Path...."
        else:
            self.filename = Path
        # Open in binary read mode ("rb"); opening with "wb" would truncate the file
        with open(self.filename, "rb") as f:
            data = pickle.load(f)
        return data

    def save_data(self, data, path):
        if path is None:
            assert self.filename is not None, "Invalid path...."
        else:
            self.filename = path
        with open(self.filename, "wb") as f:
            pickle.dump(data, f)
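A pickle file has to be written in binary write mode and read back in binary read mode. The short round trip below illustrates this with two stand-alone helpers; save_pickle, load_pickle, and the temporary file name are hypothetical names used only for this sketch.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Serialize obj to path; "wb" creates or overwrites the file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Deserialize the object stored at path; "rb" leaves the file untouched."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip through a temporary directory
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.pkl")
save_pickle({"ids": [0, 1, 2]}, demo_path)
restored = load_pickle(demo_path)
```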

2.3.4 Read the data

# Define the function that reads the data
def read_data(Path, Kind=""):
    """
    :param Path: directory that holds the data to read
    :param Kind: 'train' or 'test'
    """
    # Replace with the path where your data is stored
    filenames = os.listdir(Path)
    print("\n@Read Data From" + Path + ".........................")
    reader = read_all_data.read_train_file if Kind == "train" else read_all_data.read_test_file
    with mp.Pool(processes=mp.cpu_count()) as pool:
        # pool.map returns only after every file has been read,
        # so the tqdm bar below completes almost instantly
        data_total = list(tqdm(pool.map(reader, filenames), total=len(filenames)))
    print("\n@End Read total Data............................")
    load_save = Load_Save_Data()
    if Kind == "train":
        # The ./data_tmp/ directory must already exist
        load_save.save_data(data_total, "./data_tmp/total_data.pkl")
    return data_total

# Read the training data
# Absolute path where the data is stored
train_path = "D:/code_sea/data/train/hy_round1_train_20200102/"
data_train = read_data(train_path, Kind="train")
data_train = pd.concat(data_train)

# Read the test data
# Absolute path where the data is stored
test_path = "D:/code_sea/data/test/hy_round1_testA_20200102/"
data_test = read_data(test_path, Kind="test")
data_test = pd.concat(data_test)

Output:

@Read Data FromD:/code_sea/data/train/hy_round1_train_20200102/.........................
100%|██████████| 7000/7000 [00:00<00:00, 3527165.79it/s]
@End Read total Data............................
@Read Data FromD:/code_sea/data/test/hy_round1_testA_20200102/.........................
100%|██████████| 2000/2000 [00:00<?, ?it/s]
@End Read total Data............................

2.3.5 Overall understanding of the data

Check the number of samples and the number of raw features

data_test.shape
(782378, 6)

data_train.shape
(2699638, 7)

data_train.columns
Index(['渔船ID', 'x', 'y', '速度', '方向', 'time', 'type'], dtype='object')

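These are the column names. The task description already explains what each feature means; the meanings are repeated here for convenience.

渔船ID (vessel ID): unique identifier of the fishing vessel; the result file is keyed by this ID
x: x coordinate of the vessel in the planar coordinate system
y: y coordinate of the vessel in the planar coordinate system
速度 (speed): speed of the vessel at the current moment, in knots
方向 (direction): heading of the vessel at the current moment, in degrees
time: time at which the record was reported, given as month-day hour:minute
type: vessel label, the type of fishing operation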
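Since the time field is stored as text, it usually has to be parsed before any time-based analysis. The sketch below parses strings shaped like the samples shown in this article (for example "1110 11:58:19"). The field carries no year, so the year 2019 is an assumption, and parse_report_time is a hypothetical helper name.

```python
import pandas as pd

ASSUMED_YEAR = "2019"  # assumption: the time strings themselves contain no year


def parse_report_time(time_series, year=ASSUMED_YEAR):
    """Turn 'MMDD HH:MM:SS' strings into pandas timestamps by prefixing a year."""
    return pd.to_datetime(year + time_series, format="%Y%m%d %H:%M:%S")


sample = pd.Series(["1110 11:58:19", "1110 11:48:19", "1031 12:07:59"])
parsed = parse_report_time(sample)

# Interval between the first two reports, in seconds
gap_seconds = (parsed[0] - parsed[1]).total_seconds()
```

With real timestamps, reporting intervals and time-of-day features can be derived directly from the parsed column.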

pd.options.display.max_info_rows = 2699639
data_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2699638 entries, 0 to 365
Data columns (total 7 columns):
 #   Column  Non-Null Count    Dtype
---  ------  --------------    -----
 0   渔船ID    2699638 non-null  int64
 1   x       2699638 non-null  float64
 2   y       2699638 non-null  float64
 3   速度      2699638 non-null  float64
 4   方向      2699638 non-null  int64
 5   time    2699638 non-null  object
 6   type    2699638 non-null  object
dtypes: float64(3), int64(2), object(2)
memory usage: 164.8+ MB

data_train.describe([0.01,0.025,0.05,0.5,0.75,0.9,0.99])

	渔船ID	x	y	速度	方向
count	2.699638e+06	2.699638e+06	2.699638e+06	2.699638e+06	2.699638e+06
mean	3.496035e+03	6.277243e+06	5.271190e+06	1.784449e+00	1.151533e+02
std	2.020781e+03	2.698065e+05	2.544160e+05	2.478862e+00	1.168515e+02
min	0.000000e+00	5.000250e+06	3.345433e+06	0.000000e+00	0.000000e+00
1%	6.900000e+01	5.258862e+06	4.618927e+06	0.000000e+00	0.000000e+00
2.5%	1.710000e+02	5.817836e+06	4.920685e+06	0.000000e+00	0.000000e+00
5%	3.470000e+02	6.024286e+06	4.985102e+06	0.000000e+00	0.000000e+00
50%	3.502000e+03	6.246522e+06	5.229463e+06	3.200000e-01	8.100000e+01
75%	5.243000e+03	6.365916e+06	5.379267e+06	3.290000e+00	2.170000e+02
90%	6.290000e+03	6.592496e+06	5.602273e+06	4.910000e+00	2.930000e+02
99%	6.928000e+03	7.056209e+06	6.111650e+06	1.009000e+01	3.560000e+02
max	6.999000e+03	7.133785e+06	7.667581e+06	1.001600e+02	3.600000e+02
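In the statistics above, the 99th percentile of speed is about 10.09 knots while the maximum is about 100.16 knots, which hints at a small number of extreme values. One simple way to look at such rows is to flag everything above a chosen quantile. The sketch below does this on toy data; flag_above_quantile and the quantile level are assumptions, not part of the original analysis.

```python
import pandas as pd


def flag_above_quantile(df, column, q=0.99):
    """Return a boolean mask of rows whose value exceeds the q-th quantile."""
    threshold = df[column].quantile(q)
    return df[column] > threshold


# Toy speeds in knots; the last value imitates an extreme reading
toy = pd.DataFrame({"速度": [0.0, 0.3, 2.6, 2.7, 3.3, 4.9, 100.16]})
mask = flag_above_quantile(toy, "速度", q=0.9)
suspicious = toy[mask]
```

Whether flagged rows are errors or genuine readings still has to be judged from the data itself.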

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([data_train.head(3), data_train.tail(3)])

	渔船ID	x	y	速度	方向	time	type
0	0	6.152038e+06	5.124873e+06	2.59	102	1110 11:58:19	拖网
1	0	6.151230e+06	5.125218e+06	2.70	113	1110 11:48:19	拖网
2	0	6.150421e+06	5.125563e+06	2.70	116	1110 11:38:19	拖网
363	999	6.138413e+06	5.162715e+06	0.32	0	1031 12:28:01	拖网
364	999	6.138413e+06	5.162715e+06	0.32	0	1031 12:18:00	拖网
365	999	6.138413e+06	5.162715e+06	0.11	294	1031 12:07:59	拖网

2.3.6 Check missing values and unique values

Check for missing values

print(f'There are {data_train.isnull().any().sum()} columns in train dataset with missing values.')
There are 0 columns in train dataset with missing values.

Find features that take only a single value in the training set and the test set

one_value_fea = [col for col in data_train.columns if data_train[col].nunique() <= 1]
one_value_fea_test = [col for col in data_test.columns if data_test[col].nunique() <= 1]

one_value_fea
[]

one_value_fea_test
[]

print(f'There are {len(one_value_fea)} columns in train dataset with one unique value.')
print(f'There are {len(one_value_fea_test)} columns in test dataset with one unique value.')
There are 0 columns in train dataset with one unique value.
There are 0 columns in test dataset with one unique value.

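Summary:

None of the 6 feature columns has missing data, and no column holds only a single value. The features include both continuous and categorical ones.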
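The missing-value and unique-value checks can also be collected into one per-column table. This is only a sketch on a toy frame; summarize_columns is a hypothetical helper, not something from the original notebook.

```python
import pandas as pd


def summarize_columns(df):
    """Return one row per column with its dtype, missing count, and distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })


toy = pd.DataFrame({
    "x": [1.0, 2.0, None],
    "type": ["拖网", "拖网", "拖网"],
})
summary = summarize_columns(toy)

# Columns that would be flagged by each check
cols_with_missing = summary.index[summary["n_missing"] > 0].tolist()
single_value_cols = summary.index[summary["n_unique"] <= 1].tolist()
```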

2.4 Data characteristics and feature distributions

2.4.1 Trajectory visualization for the three classes of fishing vessel

# Split all training data by class and store each class in its own file
def get_diff_data():
    Path = "./data_tmp/total_data.pkl"
    with open(Path, "rb") as f:
        total_data = pickle.load(f)
    load_save = Load_Save_Data()
    kind_data = ["刺网", "围网", "拖网"]  # gillnet, purse seine, trawl
    file_names = ["ciwang_data.pkl", "weiwang_data.pkl", "tuowang_data.pkl"]
    for i, datax in enumerate(kind_data):
        data_type = [data for data in total_data if data["type"].unique()[0] == datax]
        load_save.save_data(data_type, "./data_tmp/" + file_names[i])

get_diff_data()

# Randomly read one vessel's data from the file of a given class
def get_random_one_traj(type=None):
    """
    :param type: "ciwang_data", "weiwang_data" or "tuowang_data"
    """
    np.random.seed(10)
    path = "./data_tmp/"
    with open(path + type + ".pkl", "rb") as f1:
        data = pickle.load(f1)
    length = len(data)
    index = np.random.choice(length)
    return data[index]

# Randomly read three vessels' data from the file of a given class
def get_random_three_traj(type=None):
    """
    :param type: "ciwang_data", "weiwang_data" or "tuowang_data"
    """
    random.seed(10)
    path = "./data_tmp/"
    with open(path + type + ".pkl", "rb") as f:
        data = pickle.load(f)
    data_arrange = np.arange(len(data)).tolist()
    index = random.sample(data_arrange, 3)
    return data[index[0]], data[index[1]], data[index[2]]

# Visualize the trajectories of three random vessels from each class
def visualize_three_traj():
    fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(20, 15))
    plt.subplots_adjust(wspace=0.2, hspace=0.2)
    # For each class, randomly pick three trajectories and plot them
    labels = ["ciwang", "weiwang", "tuowang"]
    for i, file_type in tqdm(enumerate(["ciwang_data", "weiwang_data", "tuowang_data"])):
        data1, data2, data3 = get_random_three_traj(type=file_type)
        for j, datax in enumerate([data1, data2, data3]):
            # On the default 0..n-1 index, .loc[-1:] keeps every row
            x_data = datax["x"].loc[-1:].values
            y_data = datax["y"].loc[-1:].values
            # Use axes[i][j]; the original axes[i][j - 1] shifted the panels by one column
            axes[i][j].scatter(x_data[0], y_data[0], label="start", c="red", s=10, marker="8")
            axes[i][j].plot(x_data, y_data, label=labels[i])
            axes[i][j].scatter(x_data[-1], y_data[-1], label="end", c="green", s=10, marker="v")
            # alpha must lie in [0, 1]; the original value 2 is rejected by recent matplotlib
            axes[i][j].grid(alpha=0.5)
            axes[i][j].legend(loc="best")
    plt.show()

visualize_three_traj()

3it [00:02, 1.03it/s]

[Figure Task2_files/Task2_50_1.png: trajectories of three random vessels per class (image not available)]

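Summary:

The trajectories vary from class to class:

Gillnet (刺网) trajectories tend to form regular polygons.
Some purse seine (围网) trajectories tend to close into a loop.
Trawl (拖网) trajectories tend to run from point to point without corners.
Overall, though, the trajectories of different classes are still not strongly distinguishable.

Trying different random seeds also reveals anomalous trajectories that contain only a few points.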
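Trajectories with only a few points can be located by counting the records of each vessel. The sketch below does this on toy data; the threshold of 10 points and the helper name find_short_trajectories are assumptions chosen for illustration.

```python
import pandas as pd

MIN_POINTS = 10  # assumption: what counts as "only a few points" is a judgment call


def find_short_trajectories(df, id_col="渔船ID", min_points=MIN_POINTS):
    """Return the record count of every vessel with fewer than min_points rows."""
    counts = df.groupby(id_col).size()
    return counts[counts < min_points]


# Toy data: vessel 0 has 12 records, vessel 1 has only 3
toy = pd.DataFrame({
    "渔船ID": [0] * 12 + [1] * 3,
    "x": range(15),
})
short = find_short_trajectories(toy)
```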

2.4.2 Coordinate sequence visualization

# Randomly pick one trajectory and look at how its x and y coordinate sequences change
def visualize_one_traj_x_y():
    fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10, 8))
    plt.subplots_adjust(wspace=0.5, hspace=0.5)
    data1 = get_random_one_traj(type="weiwang_data")
    # On the default 0..n-1 index, .loc[-1:] keeps every row
    x = data1["x"].loc[-1:]
    x = x / 10000
    y = data1["y"].loc[-1:]
    y = y / 10000
    arr1 = np.arange(len(x))
    arr2 = np.arange(len(y))
    axes[0].plot(arr1, x, label="x")
    axes[1].plot(arr2, y, label="y")
    # alpha must lie in [0, 1]; the original value 3 is rejected by recent matplotlib
    axes[0].grid(alpha=0.5)
    axes[0].legend(loc="best")
    axes[1].grid(alpha=0.5)
    axes[1].legend(loc="best")
    plt.show()

visualize_one_traj_x_y()

[Figure Task2_files/Task2_56_0.png: x and y coordinate sequences of one purse seine trajectory (image not available)]

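Summary:

Plotting the x and y coordinate sequences shows stretches where both stay constant at the same time, which means the speed stays close to 0 throughout that stretch. From this we can infer that POI (point of interest) locations exist.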
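The observation that x and y stay constant while the speed is near zero can be turned into a simple row-level flag. This is a rough sketch only: mark_stationary is a hypothetical helper, and the 0.5 knot tolerance is an assumption rather than a value from the original analysis.

```python
import pandas as pd


def mark_stationary(traj, speed_col="速度", speed_eps=0.5):
    """Flag rows where x and y equal the previous row and speed is at most speed_eps."""
    same_x = traj["x"].diff().eq(0)  # first row has no predecessor, so it is False
    same_y = traj["y"].diff().eq(0)
    slow = traj[speed_col] <= speed_eps
    return same_x & same_y & slow


toy = pd.DataFrame({
    "x": [1.0, 1.0, 1.0, 2.0],
    "y": [5.0, 5.0, 5.0, 6.0],
    "速度": [0.3, 0.3, 0.1, 2.7],
})
stationary = mark_stationary(toy)
```

Runs of consecutive True values are candidate stops, which could then be examined as possible POI locations.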

2.4.3 Speed and direction visualization for the three classes

# For each class, randomly pick one vessel and plot its speed and direction sequences
def visualize_three_traj_speed_direction():
    fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(20, 15))
    plt.subplots_adjust(wspace=0.3, hspace=0.3)
    # Randomly pick one trajectory per class and plot it
    file_types = ["ciwang_data", "weiwang_data", "tuowang_data"]
    speed_types = ["ciwang_speed", "weiwang_speed", "tuowang_speed"]
    directions = ["ciwang_direction", "weiwang_direction", "tuowang_direction"]
    colors = ['pink', 'lightblue', 'lightgreen']
    for i, file_name in tqdm(enumerate(file_types)):
        datax = get_random_one_traj(type=file_name)
        # On the default 0..n-1 index, .loc[-1:] keeps every row
        x_data = datax["速度"].loc[-1:].values
        y_data = datax["方向"].loc[-1:].values
        axes[i][0].plot(range(len(x_data)), x_data, label=speed_types[i], color=colors[i])
        # alpha must lie in [0, 1]; the original value 2 is rejected by recent matplotlib
        axes[i][0].grid(alpha=0.5)
        axes[i][0].legend(loc="best")
        axes[i][1].plot(range(len(y_data)), y_data, label=directions[i], color=colors[i])
        axes[i][1].grid(alpha=0.5)
        axes[i][1].legend(loc="best")
    plt.show()

visualize_three_traj_speed_direction()

3it [00:02, 1.03it/s]

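Summary:

The speed sequences of all classes contain runs of consecutive low values, which strengthens the inference that POI locations exist.

The direction changes rapidly in every class, which can be attributed to vessels drifting at sea. This feature is therefore not strongly discriminative for the class.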
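When measuring how quickly the heading changes, note that direction is an angle, so a step from 350 degrees to 10 degrees is a 20 degree turn rather than 340. The sketch below computes step-to-step heading changes with wraparound handled; heading_change is a hypothetical helper added for illustration.

```python
import numpy as np


def heading_change(direction_deg):
    """Absolute change between consecutive headings, in degrees within [0, 180]."""
    d = np.asarray(direction_deg, dtype=float)
    raw = np.diff(d)
    # Map each raw difference into [-180, 180) before taking the magnitude
    return np.abs((raw + 180.0) % 360.0 - 180.0)


changes = heading_change([350, 10, 100, 280])
```

Summaries of these changes (mean, share of large turns) are less distorted by the 0/360 boundary than a plain difference of the raw column.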

2.4.4 Speed and direction distributions for the three classes

# Count how often each value of one feature occurs
def get_data_cummulation(type=None, path=None, kind=None, columns=None):
    """
    :param type: "ciwang_data", "weiwang_data" or "tuowang_data"
    :param path: directory where the data is stored
    :param kind: '速度' or '方向'
    :param columns: matches kind, 'speed' or 'direction'
    """
    data_dict = dict()
    with open(path + type + ".pkl", "rb") as file:
        data_list = pickle.load(file)
    for datax in tqdm(data_list):
        data = datax[kind].values
        for speed in data:
            data_dict.setdefault(speed, 0)
            data_dict[speed] += 1
    data_dict = dict(sorted(data_dict.items(), key=lambda x: x[0], reverse=False))
    data_df = pd.DataFrame.from_dict(data_dict, columns=[columns], orient="index")
    return data_df

# Get the distribution data for speed and for direction
def get_speed_and_direction_distribution_data(type=None):
    """
    :param type: "ciwang_data", "weiwang_data" or "tuowang_data"
    """
    path = "./data_tmp/"
    data_speed_df = get_data_cummulation(type=type, path=path, kind="速度", columns="speed")
    data_direction_df = get_data_cummulation(type=type, path=path, kind="方向", columns="direction")
    return data_speed_df, data_direction_df

# Visualize the distributions of speed and direction
df_speeds = []
df_directions = []

def plot_speed_direction1_distribution():
    fig, (ax1, ax3) = plt.subplots(nrows=1, ncols=2, figsize=(15, 6))
    plt.subplots_adjust(wspace=0.3, hspace=0.5)
    file_types = ["ciwang_data", "weiwang_data", "tuowang_data"]
    labels = ["target==cw", "target==ww", "target==tw"]
    colors = ["red", "green", "blue"]
    for i, file_type in enumerate(file_types):
        df11, df21 = get_speed_and_direction_distribution_data(file_type)
        # Note: each column holds the occurrence count of every distinct value,
        # so these curves are KDEs of the counts (in millions), not of the raw values.
        # fill=True replaces the deprecated shade=True.
        sns.kdeplot(df11["speed"].values / 1000000, color=colors[i], fill=True, ax=ax1)
        sns.kdeplot(df21["direction"].values / 1000000, color=colors[i], fill=True, ax=ax3)
        df_speeds.append(df11)
        df_directions.append(df21)
    ax1.legend(labels)
    ax1.set_xlabel("Speed")
    ax3.set_xlabel("Direction")
    ax3.legend(labels)
    plt.show()

plot_speed_direction1_distribution()

100%|██████████| 1018/1018 [00:00<00:00, 6898.08it/s]
100%|██████████| 1018/1018 [00:00<00:00, 7912.51it/s]
100%|██████████| 1621/1621 [00:00<00:00, 6744.18it/s]
100%|██████████| 1621/1621 [00:00<00:00, 7929.84it/s]
100%|██████████| 4361/4361 [00:00<00:00, 6107.10it/s]
100%|██████████| 4361/4361 [00:00<00:00, 7133.23it/s]

# Use box plots to visualize the distributions of speed and direction
def plot_speed_direction2_distribution():
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 6))
    plt.subplots_adjust(wspace=0.3, hspace=0.5)
    colors_box = ['pink', 'lightblue', 'lightgreen']
    class_labels = ["cw", "ww", "tw"]
    bplot1 = axes[0].boxplot([df_speeds[k]["speed"].values for k in range(3)], patch_artist=True)
    bplot2 = axes[1].boxplot([df_directions[k]["direction"].values for k in range(3)], patch_artist=True)
    for ax in axes:
        # Set tick labels explicitly instead of passing labels= to boxplot
        ax.set_xticklabels(class_labels)
    for bplot in (bplot1, bplot2):
        for patch, color in zip(bplot["boxes"], colors_box):
            patch.set_facecolor(color)
    axes[0].set_title("speed")
    axes[1].set_title("direction")
    plt.show()

plot_speed_direction2_distribution()

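Summary

Both the density plots and the box plots show that the speed distributions differ considerably across trajectory types.

Both the density plots and the box plots show that the direction distributions do not differ much across trajectory types.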
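A numeric complement to the plots is a table of speed quantiles per class, computed on the raw per-record values. The sketch below uses toy data; speed_quantiles_by_type and the chosen quantile levels are assumptions and not part of the original notebook.

```python
import pandas as pd


def speed_quantiles_by_type(df, type_col="type", speed_col="速度", qs=(0.25, 0.5, 0.75)):
    """Return a table with one row per class and one column per quantile level."""
    return df.groupby(type_col)[speed_col].quantile(list(qs)).unstack()


# Toy records for two classes
toy = pd.DataFrame({
    "type": ["刺网"] * 3 + ["拖网"] * 3,
    "速度": [0.0, 1.0, 2.0, 4.0, 5.0, 6.0],
})
table = speed_quantiles_by_type(toy)
```

Applied to the concatenated training data, such a table lets the class differences seen in the box plots be read off as numbers.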

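Overall summary

The analysis above leads to the following conclusions:

No feature has missing values, and no feature holds only a single value.

Anomalous trajectories exist that contain only a few points.

Trajectories of different classes tend to vary in different ways, but overall they are not strongly distinguishable.

The analysis of the coordinate sequences shows that POI locations exist.

Visualizing the speed distributions of the different classes shows that speed is strongly discriminative.

Visualizing the direction distributions of the different classes shows that direction is not particularly discriminative.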

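Homework:

1. Use the anomaly handling code from Task1 to remove the anomalous data, then redraw the distribution plots and the box plots for speed and direction.

2. We have already drawn the distribution plots for speed and direction. From the kepler.gl visualization in Task1 we know that geographic location correlates strongly with vessel type. Try concatenating the trajectories of vessels of the same type and plotting the overall distribution of longitude and latitude. The original post showed a distribution plot drawn earlier by the team liu123的航空母舰 at this point (figure not available).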

References

https://tianchi.aliyun.com/forum/postDetail?postId=110932
