Spark Machine Learning MLlib Series 1 (for Python): Data Types, Vectors, Distributed Matrices, API

Published: 2025/4/16

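Keywords: Local vector, Labeled point, Local matrix, Distributed matrix, RowMatrix, IndexedRowMatrix, CoordinateMatrix, BlockMatrix.
Preface: MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. A training example used in supervised learning is called a "labeled point" in MLlib.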

1. Local vector

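A local vector has integer-typed, 0-based indices and double-typed values, and is stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array holding all of its entry values, while a sparse vector is backed by two parallel arrays: indices and values. For example, the vector (1.0, 0.0, 3.0) can be represented in dense format as [1.0, 0.0, 3.0] or in sparse format as (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector. (Author's note: 3 is the length, that is, the number of elements; [0, 2] are the indices of the non-zero entries; [1.0, 3.0] are their values.)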

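1.1 MLlib recognizes the following types as dense vectors:
~NumPy's array
~Python's list

1.2 MLlib recognizes the following types as sparse vectors:
~MLlib's SparseVector
~SciPy's csc_matrix with a single column

For efficiency, we recommend using NumPy arrays rather than lists, and using the factory methods implemented in Vectors to create sparse vectors.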

import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Vectors

# Use a NumPy array as a dense vector.
dv1 = np.array([1.0, 0.0, 3.0])
# Use a Python list as a dense vector.
dv2 = [1.0, 0.0, 3.0]
# Create a SparseVector.
sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
# Use a single-column SciPy csc_matrix as a sparse vector.
sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1))
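To make the (size, indices, values) layout concrete, here is a small pure-Python sketch that converts between the two representations. The helper names sparse_to_dense and dense_to_sparse are made up for this illustration and are not part of MLlib; the sketch only mirrors the storage idea described above.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a dense list."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size          # every entry not listed is zero
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


def dense_to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


print(sparse_to_dense(3, [0, 2], [1.0, 3.0]))  # [1.0, 0.0, 3.0]
print(dense_to_sparse([1.0, 0.0, 3.0]))        # (3, [0, 2], [1.0, 3.0])
```

The sparse form only pays off when most entries are zero, since it stores two numbers per non-zero entry.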

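2. Labeled point

A labeled point is a local vector, either dense or sparse, associated with a label. In MLlib, labeled points are used in supervised learning algorithms. The label is stored as a double, so labeled points can be used for both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ....

A labeled point is represented by LabeledPoint.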

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))

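Sparse data
Sparse training data is very common in practice. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector, as follows: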

label index1:value1 index2:value2 ...

The indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based.

MLUtils.loadLibSVMFile reads training examples stored in LIBSVM format.

from pyspark.mllib.util import MLUtils

# sc is an existing SparkContext.
examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
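The index conversion is easy to get wrong, so here is a rough pure-Python sketch of what reading one LIBSVM line involves. The function parse_libsvm_line is a hypothetical helper written for this post; it does not show how MLUtils works internally, and it skips validation such as checking that the indices are ascending.

```python
def parse_libsvm_line(line):
    """Parse 'label index1:value1 index2:value2 ...' into (label, indices, values).

    Indices in the file are one-based; the returned indices are zero-based.
    """
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        index, value = item.split(":")
        indices.append(int(index) - 1)   # one-based -> zero-based
        values.append(float(value))
    return label, indices, values


print(parse_libsvm_line("1 1:1.0 3:3.0"))  # (1.0, [0, 2], [1.0, 3.0])
```

The result has the same shape as the sparse vector from section 1: a label plus parallel index and value arrays.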

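3. Local matrix
A local matrix has integer-typed row and column indices and double-typed values, and is stored on a single machine. MLlib supports dense matrices, whose entry values are stored in a single array in column-major order, as well as sparse matrices. For example, take the following dense matrix:

1.0  2.0
3.0  4.0
5.0  6.0

This matrix is stored in the one-dimensional array [1.0, 3.0, 5.0, 2.0, 4.0, 6.0] together with the matrix size (3, 2).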

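The base class of local matrices is Matrix, and two implementations are provided: DenseMatrix and SparseMatrix. We recommend using the factory methods implemented in Matrices to create local matrices. Remember that local matrices in MLlib are stored in column-major order.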

from pyspark.mllib.linalg import Matrix, Matrices

# Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)).
# The values are listed column by column (column-major order).
dm2 = Matrices.dense(3, 2, [1, 3, 5, 2, 4, 6])
# Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0)).
# Arguments: numRows, numCols, colPtrs, rowIndices, values.
sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
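Column-major storage and the three arrays passed to Matrices.sparse are easier to follow with a tiny pure-Python sketch. The helpers dense_entry and csc_to_rows are invented for this illustration and are not MLlib functions. The sketch assumes the sparse arguments are column pointers, row indices, and values, which is how the (3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8]) example reads.

```python
def dense_entry(num_rows, values, i, j):
    """Entry (i, j) of a dense matrix stored in column-major order."""
    return values[j * num_rows + i]


def csc_to_rows(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand column pointers, row indices and values into a list of rows."""
    rows = [[0.0] * num_cols for _ in range(num_rows)]
    for j in range(num_cols):
        # Entries of column j live at positions col_ptrs[j] .. col_ptrs[j + 1] - 1.
        for k in range(col_ptrs[j], col_ptrs[j + 1]):
            rows[row_indices[k]][j] = float(values[k])
    return rows


dense_values = [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]
print([[dense_entry(3, dense_values, i, j) for j in range(2)] for i in range(3)])
# [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(csc_to_rows(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8]))
# [[9.0, 0.0], [0.0, 8.0], [0.0, 6.0]]
```

Note that the column pointer array has one more element than the number of columns; its last element equals the number of stored values.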

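4. Distributed matrix
A distributed matrix has long-typed row and column indices and double-typed values, and is stored in a distributed fashion in one or more RDDs. Choosing the right format to store large distributed matrices is very important. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. So far, four types of distributed matrices have been implemented.
The basic type is RowMatrix. A RowMatrix is a row-oriented distributed matrix without meaningful row indices, for example a collection of feature vectors. It is backed by an RDD of its rows, where each row is a local vector. For a RowMatrix, we assume the number of columns is not huge, so that a single local vector can reasonably be communicated to the driver and can also be stored and operated on by a single node.
An IndexedRowMatrix is similar to a RowMatrix, but it has row indices, which can be used to identify rows and to execute joins. A CoordinateMatrix is a distributed matrix stored in coordinate list (COO) format, backed by an RDD of its entries.
A BlockMatrix is a distributed matrix backed by an RDD of MatrixBlock, where a MatrixBlock is a tuple of ((Int, Int), Matrix).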

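NOTE
The underlying RDDs of a distributed matrix must be deterministic, because the matrix size is cached. In general, using non-deterministic RDDs can lead to errors.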

RowMatrix
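A RowMatrix is a row-oriented distributed matrix without meaningful row indices, backed by an RDD of its rows, where each row is a local vector. Since each row is represented by a local vector, the number of columns is limited by the integer range, but it should be much smaller in practice.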

from pyspark.mllib.linalg.distributed import RowMatrix

# Create an RDD of vectors (sc is an existing SparkContext).
rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
# Create a RowMatrix from an RDD of vectors.
mat = RowMatrix(rows)
# Get its size.
m = mat.numRows()  # 4
n = mat.numCols()  # 3
# Get the rows as an RDD of vectors again.
rowsRDD = mat.rows

IndexedRowMatrix

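An IndexedRowMatrix is similar to a RowMatrix, but with meaningful row indices. It is backed by an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local vector.
An IndexedRowMatrix can be created from an RDD of IndexedRows, and it can be converted to a RowMatrix by dropping its row indices.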

from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# Create an RDD of indexed rows.
# - This can be done explicitly with the IndexedRow class:
indexedRows = sc.parallelize([IndexedRow(0, [1, 2, 3]),
                              IndexedRow(1, [4, 5, 6]),
                              IndexedRow(2, [7, 8, 9]),
                              IndexedRow(3, [10, 11, 12])])
# - or by using (long, vector) tuples:
indexedRows = sc.parallelize([(0, [1, 2, 3]), (1, [4, 5, 6]),
                              (2, [7, 8, 9]), (3, [10, 11, 12])])
# Create an IndexedRowMatrix from an RDD of IndexedRows.
mat = IndexedRowMatrix(indexedRows)
# Get its size.
m = mat.numRows()  # 4
n = mat.numCols()  # 3
# Get the rows as an RDD of IndexedRows.
rowsRDD = mat.rows
# Convert to a RowMatrix by dropping the row indices.
rowMat = mat.toRowMatrix()
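What "meaningful row indices" buys you can be shown with plain Python lists standing in for the RDD. This is only a sketch: num_rows_indexed and drop_indices are hypothetical helpers, and the rule that the row count follows from the largest index is my understanding of the behavior, so treat it as an assumption rather than a statement about the implementation.

```python
def num_rows_indexed(indexed_rows):
    """Row count implied by the largest row index (assumed rule: max index + 1)."""
    return max(index for index, _ in indexed_rows) + 1


def drop_indices(indexed_rows):
    """Keep only the vectors, which is what dropping the row indices amounts to."""
    return [vector for _, vector in indexed_rows]


indexed_rows = [(0, [1, 2, 3]), (1, [4, 5, 6]), (2, [7, 8, 9]), (3, [10, 11, 12])]
print(num_rows_indexed(indexed_rows))   # 4
print(len(drop_indices(indexed_rows)))  # 4

# With a gap in the indices, the two views disagree about the number of rows.
gappy_rows = [(0, [1, 2, 3]), (5, [4, 5, 6])]
print(num_rows_indexed(gappy_rows))     # 6
print(len(drop_indices(gappy_rows)))    # 2
```

In other words, once the indices are dropped, the position of each row in the original matrix is no longer recoverable.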

CoordinateMatrix

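A CoordinateMatrix is a distributed matrix backed by an RDD of its entries. Each entry is a tuple (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.

A CoordinateMatrix can be created from an RDD of MatrixEntry entries. It can be converted to a RowMatrix by calling toRowMatrix, or to an IndexedRowMatrix with sparse rows by calling toIndexedRowMatrix.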

from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

# Create an RDD of coordinate entries.
# - This can be done explicitly with the MatrixEntry class:
entries = sc.parallelize([MatrixEntry(0, 0, 1.2), MatrixEntry(1, 0, 2.1), MatrixEntry(2, 1, 3.7)])
# - or using (long, long, float) tuples:
entries = sc.parallelize([(0, 0, 1.2), (1, 0, 2.1), (2, 1, 3.7)])
# Create a CoordinateMatrix from an RDD of MatrixEntries.
mat = CoordinateMatrix(entries)
# Get its size.
m = mat.numRows()  # 3
n = mat.numCols()  # 2
# Get the entries as an RDD of MatrixEntries.
entriesRDD = mat.entries
# Convert to a RowMatrix.
rowMat = mat.toRowMatrix()
# Convert to an IndexedRowMatrix.
indexedRowMat = mat.toIndexedRowMatrix()
# Convert to a BlockMatrix.
blockMat = mat.toBlockMatrix()
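As a rough picture of the coordinate (COO) format, the sketch below expands a list of (i, j, value) tuples into a small dense matrix in plain Python. The helpers coo_shape and coo_to_rows are made up for this post, and the shape rule (largest index plus one in each dimension) is stated here as an assumption. Doing this on a real CoordinateMatrix would defeat the purpose, since the format exists for matrices far too large and sparse to densify.

```python
def coo_shape(entries):
    """Shape implied by the largest row and column index among the entries."""
    num_rows = max(i for i, _, _ in entries) + 1
    num_cols = max(j for _, j, _ in entries) + 1
    return num_rows, num_cols


def coo_to_rows(entries):
    """Expand (i, j, value) entries into a dense list of rows."""
    num_rows, num_cols = coo_shape(entries)
    rows = [[0.0] * num_cols for _ in range(num_rows)]
    for i, j, value in entries:
        rows[i][j] = value
    return rows


entries = [(0, 0, 1.2), (1, 0, 2.1), (2, 1, 3.7)]
print(coo_shape(entries))    # (3, 2)
print(coo_to_rows(entries))  # [[1.2, 0.0], [2.1, 0.0], [0.0, 3.7]]
```

Under the same rule, a single entry at row index 6 would imply a matrix with 7 rows, even if rows 2 to 5 hold no entries at all.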

BlockMatrix

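A BlockMatrix is a distributed matrix backed by an RDD of MatrixBlocks. A MatrixBlock is a tuple ((Int, Int), Matrix), where (Int, Int) is the index of the block and Matrix is the sub-matrix at that index, with size rowsPerBlock x colsPerBlock.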

from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import BlockMatrix

# Create an RDD of sub-matrix blocks.
blocks = sc.parallelize([((0, 0), Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])),
                         ((1, 0), Matrices.dense(3, 2, [7, 8, 9, 10, 11, 12]))])
# Create a BlockMatrix from an RDD of sub-matrix blocks.
mat = BlockMatrix(blocks, 3, 2)
# Get its size.
m = mat.numRows()  # 6
n = mat.numCols()  # 2
# Get the blocks as an RDD of sub-matrix blocks.
blocksRDD = mat.blocks
# Convert to a LocalMatrix.
localMat = mat.toLocalMatrix()
# Convert to an IndexedRowMatrix.
indexedRowMat = mat.toIndexedRowMatrix()
# Convert to a CoordinateMatrix.
coordinateMat = mat.toCoordinateMatrix()
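To see how block indices map to positions in the full matrix, here is a pure-Python sketch that stitches the two 3 x 2 blocks from the example into one 6 x 2 matrix. assemble_blocks is a hypothetical helper, not an MLlib function. It assumes every block is exactly rowsPerBlock x colsPerBlock and holds its values in column-major order, as in section 3.

```python
def assemble_blocks(blocks, rows_per_block, cols_per_block):
    """Place each column-major block at the offset given by its block index."""
    num_rows = (max(bi for (bi, _), _ in blocks) + 1) * rows_per_block
    num_cols = (max(bj for (_, bj), _ in blocks) + 1) * cols_per_block
    full = [[0.0] * num_cols for _ in range(num_rows)]
    for (bi, bj), values in blocks:
        for j in range(cols_per_block):
            for i in range(rows_per_block):
                row = bi * rows_per_block + i      # global row index
                col = bj * cols_per_block + j      # global column index
                full[row][col] = float(values[j * rows_per_block + i])
    return full


blocks = [((0, 0), [1, 2, 3, 4, 5, 6]), ((1, 0), [7, 8, 9, 10, 11, 12])]
full = assemble_blocks(blocks, 3, 2)
print(len(full), len(full[0]))  # 6 2
print(full)
# [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0], [7.0, 10.0], [8.0, 11.0], [9.0, 12.0]]
```

Block (1, 0) starts at global row 3 because each block contributes rowsPerBlock = 3 rows, which is why the assembled matrix reports 6 rows and 2 columns.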

Original article: http://blog.csdn.net/qq_30115765/article/details/52594421

Reposted from: https://www.cnblogs.com/honey01/p/8043201.html
