PySpark Machine Learning: Vectors and Their Common Operations

Published: 2024/1/1

Spark version: V3.2.1

This article covers the vector operations provided by pyspark.ml.linalg.

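1. DenseVector (dense vector)

1.1 Creation

A dense vector is much like an ordinary array. It can be created as follows:

from pyspark.ml import linalg
import numpy as np

# Pass the elements as a single list, or as separate arguments
dvect1 = linalg.Vectors.dense([1, 2, 3, 4, 5])
dvect2 = linalg.Vectors.dense(1.2, 3, 3, 4, 5)
print(dvect1)
print(dvect2)

Note that the elements are stored as floats: the integer 1 becomes 1.0, and so on.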

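1.2 Common operations

Two vectors of the same length support element-wise addition, subtraction, multiplication, and division:

res1 = dvect1 + dvect2   # element-wise sum
res2 = dvect1 - dvect2   # element-wise difference
res3 = dvect1 * dvect2   # element-wise product
res4 = dvect1 / dvect2   # element-wise quotient
print(res1)
print(res2)
print(res3)
print(res4)

Each operation is applied element by element and returns a new DenseVector.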

Some attributes of numpy.ndarray are available through the vector's array attribute:

dvec1_shape = dvect1.array.shape
dvec1_size = dvect1.array.size
print(dvec1_shape)  # (5,)
print(dvec1_size)   # 5

dot: dot product

res_1 = dvect1.dot([1, 2, 3, 4, 5])
res_2 = dvect1.dot([0, 1, 0, 0, 0])
res_3 = dvect1.dot(dvect2)
print(res_1)  # 55.0
print(res_2)  # 2.0
print(res_3)  # approximately 57.2

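Computing vector norms

dvect1 = linalg.Vectors.dense([1, 2, 3, 4, 5])
norm_0 = dvect1.norm(0)
norm_1 = dvect1.norm(1)
norm_2 = dvect1.norm(2)  # the original called dvect2.norm(2) here by mistake
print('L0 norm of dvect1: {}'.format(norm_0))
print('L1 norm of dvect1: {}'.format(norm_1))
print('L2 norm of dvect1: {:.3f}'.format(norm_2))

For dvect1 = [1, 2, 3, 4, 5], the L0 norm (the number of non-zero elements) is 5.0, the L1 norm is 15.0, and the L2 norm is about 7.416.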
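To make the three definitions concrete, here is a pure-Python sketch of what each norm computes. It illustrates the mathematics only and is not how Spark implements norm; the helper names l0_norm, l1_norm, and l2_norm are made up for this example.

```python
import math

def l0_norm(xs):
    # Number of non-zero elements
    return float(sum(1 for x in xs if x != 0))

def l1_norm(xs):
    # Sum of absolute values
    return float(sum(abs(x) for x in xs))

def l2_norm(xs):
    # Square root of the sum of squares (Euclidean length)
    return math.sqrt(sum(x * x for x in xs))

values = [1, 2, 3, 4, 5]
n0, n1, n2 = l0_norm(values), l1_norm(values), l2_norm(values)
print(n0, n1, round(n2, 3))
```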

numNonzeros(): counts the non-zero elements

dvect1 = linalg.Vectors.dense([1, 0, 3, 0, 5])
num_nonzero = dvect1.numNonzeros()
print(num_nonzero)  # 3

squared_distance(): squared Euclidean distance between two vectors of the same dimension

dvect1 = linalg.Vectors.dense([1, 0, 3])
dvect2 = linalg.Vectors.dense([1, 1, 1])
dist = dvect1.squared_distance(dvect2)
print(dist)  # 5.0
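The value 5 follows from summing the squared differences of matching elements: (1-1)^2 + (0-1)^2 + (3-1)^2 = 0 + 1 + 4. A pure-Python sketch of that arithmetic is given below; the function here is a stand-in written for illustration, not Spark's implementation.

```python
import math

def squared_distance(xs, ys):
    # Sum of squared element-wise differences
    if len(xs) != len(ys):
        raise ValueError("vectors must have the same length")
    return float(sum((x - y) ** 2 for x, y in zip(xs, ys)))

dist = squared_distance([1, 0, 3], [1, 1, 1])
euclidean = math.sqrt(dist)  # the ordinary Euclidean distance
print(dist, euclidean)
```

Taking the square root of the result gives the usual Euclidean distance between the two vectors.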

Extracting the vector's values

dvect1 = linalg.Vectors.dense([1, 0, 3])
print(dvect1.toArray())  # the elements as a NumPy array
print(dvect1.values)     # the underlying array of values

2. SparseVector (sparse vector)

2.1 Creation

A sparse vector can be created in the following ways:

Vectors.sparse(size, index array, array of values matching the index array), where indices start at 0 (here and below);
Vectors.sparse(size, {index: value, index: value, ...})
Vectors.sparse(size, [(index, value), (index, value), ...])

For example:

svect1 = linalg.Vectors.sparse(3, [0, 1], [3.4, 4.5])
svect2 = linalg.Vectors.sparse(3, {0: 3.4, 2: 4.5})
svect3 = linalg.Vectors.sparse(4, [(2, 3), (3, 2.3)])

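2.2 Common operations

Some sparse-vector operations behave the same as their dense-vector counterparts and are not repeated here. Only the following two are covered:

toArray(): shows all values of a sparse vector, including the zeros

svect1 = linalg.Vectors.sparse(3, [0, 1], [3.4, 4.5])
svect2 = linalg.Vectors.sparse(3, {0: 3.4, 2: 4.5})
svect3 = linalg.Vectors.sparse(4, [(2, 3), (3, 2.3)])
print(svect1.toArray())
print(svect2.toArray())
print(svect3.toArray())

The three vectors expand to [3.4, 4.5, 0.0], [3.4, 0.0, 4.5], and [0.0, 0.0, 3.0, 2.3] respectively.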
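The expansion performed by toArray can be pictured as filling a zero array of the given size at the stored indices. The following pure-Python sketch shows the idea; sparse_to_dense is a hypothetical helper written for this illustration and is not part of pyspark.

```python
def sparse_to_dense(size, indices, values):
    # Start from all zeros, then write each stored value at its index
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense

expanded1 = sparse_to_dense(3, [0, 1], [3.4, 4.5])
expanded2 = sparse_to_dense(3, [0, 2], [3.4, 4.5])
expanded3 = sparse_to_dense(4, [2, 3], [3, 2.3])
print(expanded1, expanded2, expanded3)
```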

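indices: the indices of the stored (non-zero) elements of a sparse vector. Note that indices is an attribute, not a method.

svect1 = linalg.Vectors.sparse(3, [0, 1], [3.4, 4.5])
svect2 = linalg.Vectors.sparse(3, {0: 3.4, 2: 4.5})
svect3 = linalg.Vectors.sparse(4, [(2, 3), (3, 2.3)])
print(svect1.indices)  # [0 1] (a NumPy array, as are the ones below)
print(svect2.indices)  # [0 2]
print(svect3.indices)  # [2 3]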
