Extracting, transforming and selecting features

Published: 2023/11/28

This section covers algorithms for working with features, roughly divided into these groups:
• Extraction: Extracting features from “raw” data
• Transformation: Scaling, converting, or modifying features
• Selection: Selecting a subset from a larger set of features
• Locality Sensitive Hashing (LSH): This class of algorithms combines aspects of feature transformation with other algorithms.
Table of Contents
• Feature Extractors
  o TF-IDF
  o Word2Vec
  o CountVectorizer
  o FeatureHasher
• Feature Transformers
  o Tokenizer
  o StopWordsRemover
  o n-gram
  o Binarizer
  o PCA
  o PolynomialExpansion
  o Discrete Cosine Transform (DCT)
  o StringIndexer
  o IndexToString
  o OneHotEncoder
  o VectorIndexer
  o Interaction
  o Normalizer
  o StandardScaler
  o RobustScaler
  o MinMaxScaler
  o MaxAbsScaler
  o Bucketizer
  o ElementwiseProduct
  o SQLTransformer
  o VectorAssembler
  o VectorSizeHint
  o QuantileDiscretizer
  o Imputer
• Feature Selectors
  o VectorSlicer
  o RFormula
  o ChiSqSelector
  o UnivariateFeatureSelector
  o VarianceThresholdSelector
• Locality Sensitive Hashing
  o LSH Operations
    ▪ Feature Transformation
    ▪ Approximate Similarity Join
    ▪ Approximate Nearest Neighbor Search
  o LSH Algorithms
    ▪ Bucketed Random Projection for Euclidean Distance
    ▪ MinHash for Jaccard Distance
Feature Extractors
TF-IDF
Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. Denote a term by t, a document by d, and the corpus by D. Term frequency TF(t,d) is the number of times that term t appears in document d, while document frequency DF(t,D) is the number of documents that contain term t. If we only use term frequency to measure the importance, it is very easy to over-emphasize terms that appear very often but carry little information about the document, e.g. “a”, “the”, and “of”. If a term appears very often across the corpus, it means it doesn’t carry special information about a particular document. Inverse document frequency is a numerical measure of how much information a term provides:

\[ IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1} \]

where |D| is the total number of documents in the corpus. Since logarithm is used, if a term appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to avoid dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF:

\[ TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D) \]

There are several variants on the definition of term frequency and document frequency. In MLlib, we separate TF and IDF to make them flexible.
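To make the two definitions concrete, here is a minimal plain-Python sketch (it does not use Spark). It assumes the smoothed form IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)) and the product TF(t, d) · IDF(t, D); the helper names `idf` and `tf_idf` and the toy corpus are illustrative only.

```python
import math
from collections import Counter

def idf(term, corpus):
    """Smoothed inverse document frequency: log((|D| + 1) / (DF(t, D) + 1))."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log((len(corpus) + 1) / (doc_freq + 1))

def tf_idf(doc, corpus):
    """Map each term of doc to TF(t, d) * IDF(t, D)."""
    term_freq = Counter(doc)
    return {term: count * idf(term, corpus) for term, count in term_freq.items()}

# Toy corpus: three tokenized documents (illustrative data).
corpus = [
    ["hi", "i", "heard", "about", "spark"],
    ["i", "wish", "java", "could", "use", "case", "classes"],
    ["logistic", "regression", "models", "are", "neat"],
]
weights = tf_idf(corpus[0], corpus)
```

A term that occurs in every document gets weight 0, while a term such as "spark", which occurs in only one of the three documents, gets a larger weight than "i", which occurs in two.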
TF: Both HashingTF and CountVectorizer can be used to generate the term frequency vectors.
HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a “set of terms” might be a bag of words. HashingTF utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. The hash function used here is MurmurHash 3. Then term frequencies are calculated based on the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing. To reduce the chance of collision, we can increase the target feature dimension, i.e. the number of buckets of the hash table. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the feature dimension, otherwise the features will not be mapped evenly to the vector indices. The default feature dimension is 2^18 = 262,144. An optional binary toggle parameter controls term frequency counts. When set to true, all nonzero frequency counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.
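The hashing trick can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Spark's implementation: CRC32 from the standard library stands in for MurmurHash 3, so the indices will not match those produced by HashingTF, and the function name `hashing_tf` is hypothetical.

```python
import zlib

def hashing_tf(terms, num_features=1 << 18, binary=False):
    """Return a sparse term-frequency vector as {index: count}."""
    vector = {}
    for term in terms:
        # Stand-in hash (assumption): CRC32 instead of MurmurHash 3.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        if binary:
            vector[index] = 1.0  # every nonzero count collapses to 1
        else:
            vector[index] = vector.get(index, 0.0) + 1.0
    return vector

counts = hashing_tf(["a", "b", "a", "c"], num_features=16)
flags = hashing_tf(["a", "b", "a", "c"], num_features=16, binary=True)
```

No vocabulary is built at any point: the index depends only on the term and the number of buckets, which is why colliding terms simply share a bucket.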
CountVectorizer converts text documents to vectors of term counts. Refer to CountVectorizer for more details.
IDF: IDF is an Estimator which is fit on a dataset and produces an IDFModel. The IDFModel takes feature vectors (generally created from HashingTF or CountVectorizer) and scales each feature. Intuitively, it down-weights features which appear frequently in a corpus.
Note: spark.ml doesn’t provide tools for text segmentation. We refer users to the Stanford NLP Group and scalanlp/chalk.
Examples
In the following code segment, we start with a set of sentences. We split each sentence into words using Tokenizer. For each sentence (bag of words), we use HashingTF to hash the sentence into a feature vector. We use IDF to rescale the feature vectors; this generally improves performance when using text as features. Our feature vectors could then be passed to a learning algorithm.
Refer to the HashingTF Scala docs and the IDF Scala docs for more details on the API.

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

// split each sentence into words
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// hash each bag of words into a term frequency vector with 20 buckets
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)

val featurizedData = hashingTF.transform(wordsData)
// alternatively, CountVectorizer can also be used to get term frequency vectors

// rescale the term frequencies by inverse document frequency
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)

val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/TfIdfExample.scala" in the Spark repo.
Word2Vec
Word2Vec is an Estimator which takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a vector using the average of all words in the document; this vector can then be used as features for prediction, document similarity calculations, etc. Please refer to the MLlib user guide on Word2Vec for more details.
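The averaging step can be illustrated without training a model. The sketch below assumes a hypothetical table of already-learned word vectors (`toy_vectors`); skipping words that are missing from the table is an assumption of this sketch, not a statement about Word2VecModel.

```python
def document_vector(words, word_vectors):
    """Average the vectors of the words that have an entry in word_vectors."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return []
    size = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(size)]

# Hypothetical 3-dimensional word vectors, for illustration only.
toy_vectors = {"hi": [1.0, 0.0, 2.0], "spark": [3.0, 2.0, 0.0]}
doc = document_vector(["hi", "spark"], toy_vectors)  # [2.0, 1.0, 1.0]
```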
Examples
In the following code segment, we start with a set of documents, each of which is represented as a sequence of words. For each document, we transform it into a feature vector. This feature vector could then be passed to a learning algorithm.
Refer to the Word2Vec Scala docs for more details on the API.

import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Input data: Each row is a bag of words from a sentence or document.
val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)

val result = model.transform(documentDF)
result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
  println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n") }

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala" in the Spark repo.
CountVectorizer
CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary, and generates a CountVectorizerModel. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA.
During the fitting process, CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. Another optional binary toggle parameter controls the output vector. If set to true all nonzero counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.
Examples
Assume that we have the following DataFrame with columns id and texts:

 id | texts
----|-------------------------------
 0  | Array("a", "b", "c")
 1  | Array("a", "b", "b", "c", "a")

Each row in texts is a document of type Array[String]. Invoking fit of CountVectorizer produces a CountVectorizerModel with vocabulary (a, b, c). Then the output column “vector” after transformation contains:

 id | texts                           | vector
----|---------------------------------|---------------------------
 0  | Array("a", "b", "c")            | (3,[0,1,2],[1.0,1.0,1.0])
 1  | Array("a", "b", "b", "c", "a")  | (3,[0,1,2],[2.0,2.0,1.0])

Each vector represents the token counts of the document over the vocabulary.
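The fit and transform steps described above can be sketched in plain Python. The names `fit_vocabulary` and `count_vector` are hypothetical, the output is a dense list rather than a sparse vector, and breaking frequency ties alphabetically is an assumption made so the result is deterministic.

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size, min_df=1.0):
    """Pick the top vocab_size terms ordered by corpus-wide term frequency."""
    term_freq = Counter(t for doc in docs for t in doc)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    # min_df below 1.0 is read as a fraction of documents, otherwise as a count.
    threshold = min_df * len(docs) if min_df < 1.0 else min_df
    kept = [t for t in term_freq if doc_freq[t] >= threshold]
    # Assumption: ties in frequency are broken alphabetically.
    kept.sort(key=lambda t: (-term_freq[t], t))
    return kept[:vocab_size]

def count_vector(doc, vocabulary, binary=False):
    """Count each vocabulary term in doc; binary=True clips counts to 1."""
    counts = Counter(doc)
    vec = [float(counts[t]) for t in vocabulary]
    return [1.0 if v > 0 else 0.0 for v in vec] if binary else vec

docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]
vocab = fit_vocabulary(docs, vocab_size=3, min_df=2)
vectors = [count_vector(d, vocab) for d in docs]
```

On the two example documents this yields the vocabulary (a, b, c) and the counts (1, 1, 1) and (2, 2, 1), matching the table above.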
Refer to the CountVectorizer Scala docs and the CountVectorizerModel Scala docs for more details on the API.

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)

// alternatively, define CountVectorizerModel with a-priori vocabulary
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")

cvModel.transform(df).show(false)

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/CountVectorizerExample.scala" in the Spark repo.
FeatureHasher
Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick to map features to indices in the feature vector.
The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:
• Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns using the categoricalCols parameter.
• String columns: For categorical features, the hash value of the string “column_name=value” is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are “one-hot” encoded (similarly to using OneHotEncoder with dropLast=false).
• Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as “column_name=true” or “column_name=false”, with an indicator value of 1.0.
Null (missing) values are ignored (implicitly zero in the resulting feature vector).
The hash function used here is also the MurmurHash 3 used in HashingTF. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the vector indices.
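The per-type rules above can be summarized in a short plain-Python sketch. It is illustrative only: CRC32 stands in for MurmurHash 3 (so indices will differ from Spark's), a row is modeled as a dict, and the names `feature_hasher` and `categorical_cols` are assumptions of this sketch.

```python
import zlib

def feature_hasher(row, num_features=1 << 18, categorical_cols=()):
    """Hash one row (dict of column -> value) into a sparse {index: value} vector."""
    vector = {}
    for column, value in row.items():
        if value is None:
            continue  # null values are ignored (implicitly zero)
        is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_numeric and column not in categorical_cols:
            # Numeric column: hash the column name, store the value itself.
            key, weight = column, float(value)
        else:
            # String, boolean, or categorical column: hash "column=value", store 1.0.
            text = str(value).lower() if isinstance(value, bool) else str(value)
            key, weight = f"{column}={text}", 1.0
        # Stand-in hash (assumption): CRC32 instead of MurmurHash 3.
        index = zlib.crc32(key.encode("utf-8")) % num_features
        vector[index] = vector.get(index, 0.0) + weight
    return vector

row = {"real": 2.2, "bool": True, "stringNum": "1", "string": "foo", "missing": None}
features = feature_hasher(row)
```

Passing `categorical_cols=("real",)` would instead hash the key "real=2.2" with an indicator value of 1.0.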
Examples
Assume that we have a DataFrame with 4 input columns real, bool, stringNum, and string. These different data types as input will illustrate the behavior of the transform to produce a column of feature vectors.

real | bool  | stringNum | string
-----|-------|-----------|-------
 2.2 | true  | 1         | foo
 3.3 | false | 2         | bar
 4.4 | false | 3         | baz
 5.5 | false | 4         | foo

Then the output of FeatureHasher.transform on this DataFrame is:

real | bool  | stringNum | string | features
-----|-------|-----------|--------|--------------------------------------------------------
 2.2 | true  | 1         | foo    | (262144,[51871,63643,174475,253195],[1.0,1.0,2.2,1.0])
 3.3 | false | 2         | bar    | (262144,[6031,80619,140467,174475],[1.0,1.0,1.0,3.3])
 4.4 | false | 3         | baz    | (262144,[24279,140467,174475,196810],[1.0,1.0,4.4,1.0])
 5.5 | false | 4         | foo    | (262144,[63643,140467,168512,174475],[1.0,1.0,1.0,5.5])

The resulting feature vectors could then be passed to a learning algorithm.
Refer to the FeatureHasher Scala docs for more details on the API.

import org.apache.spark.ml.feature.FeatureHasher

val dataset = spark.createDataFrame(Seq(
  (2.2, true, "1", "foo"),
  (3.3, false, "2", "bar"),
  (4.4, false, "3", "baz"),
  (5.5, false, "4", "foo")
)).toDF("real", "bool", "stringNum", "string")

val hasher = new FeatureHasher()
  .setInputCols("real", "bool", "stringNum", "string")
  .setOutputCol("features")

val featurized = hasher.transform(dataset)
featurized.show(false)

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/FeatureHasherExample.scala" in the Spark repo.
Feature Transformers
Tokenizer
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality. The example below shows how to split sentences into sequences of words.
RegexTokenizer allows more advanced tokenization based on regular expression (regex) matching. By default, the parameter “pattern” (regex, default: “\s+”) is used as delimiters to split the input text. Alternatively, users can set parameter “gaps” to false indicating the regex “pattern” denotes “tokens” rather than splitting gaps, and find all matching occurrences as the tokenization result.
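The difference between the two modes of the “gaps” parameter can be shown with Python's `re` module. This is a sketch of the semantics only; the function name `regex_tokenize` is hypothetical, and dropping empty tokens is an assumption of the sketch.

```python
import re

def regex_tokenize(text, pattern=r"\s+", gaps=True):
    """gaps=True: pattern matches delimiters; gaps=False: pattern matches tokens."""
    if gaps:
        tokens = re.split(pattern, text)
    else:
        tokens = re.findall(pattern, text)
    # Assumption: empty strings produced by splitting are discarded.
    return [t for t in tokens if t]

sentence = "Logistic,regression,models,are,neat"
by_gaps = regex_tokenize(sentence, r"\W")                  # split on non-word chars
by_tokens = regex_tokenize(sentence, r"\w+", gaps=False)   # match runs of word chars
```

Both calls return the same five tokens, which is why the two pattern settings are interchangeable for this input.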
Examples
Refer to the Tokenizer Scala docs and the RegexTokenizer Scala docs for more details on the API.

import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val sentenceDataFrame = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (1, "I wish Java could use case classes"),
  (2, "Logistic,regression,models,are,neat")
)).toDF("id", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val regexTokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")
  .setPattern("\\W") // alternatively .setPattern("\\w+").setGaps(false)

// UDF that counts the tokens in each row
val countTokens = udf { (words: Seq[String]) => words.length }

val tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("sentence", "words")
  .withColumn("tokens", countTokens(col("words"))).show(false)

val regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words")
  .withColumn("tokens", countTokens(col("words"))).show(false)

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/TokenizerExample.scala" in the Spark repo.
StopWordsRemover
Stop words are words which should be excluded from the input, typically because the words appear frequently and don’t carry as much meaning.
StopWordsRemover takes as input a sequence of strings (e.g. the output of a Tokenizer) and drops all the stop words from the input sequences. The list of stopwords is specified by the stopWords parameter. Default stop words for some languages are accessible by calling StopWordsRemover.loadDefaultStopWords(language), for which available options are “danish”, “dutch”, “english”, “finnish”, “french”, “german”, “hungarian”, “italian”, “norwegian”, “portuguese”, “russian”, “spanish”, “swedish” and “turkish”. A boolean parameter caseSensitive indicates if the matches should be case sensitive (false by default).
Examples
Assume that we have the following DataFrame with columns id and raw:

 id | raw
----|------------------------------
 0  | [I, saw, the, red, balloon]
 1  | [Mary, had, a, little, lamb]

Applying StopWordsRemover with raw as the input column and filtered as the output column, we should get the following:

 id | raw                          | filtered
----|------------------------------|----------------------
 0  | [I, saw, the, red, balloon]  | [saw, red, balloon]
 1  | [Mary, had, a, little, lamb] | [Mary, little, lamb]

In filtered, the stop words “I”, “the”, “had”, and “a” have been filtered out.
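The filtering rule, including the effect of caseSensitive, can be sketched in plain Python. The stop-word list below is a hypothetical four-word list chosen for this example, not the default English list, and `remove_stop_words` is an illustrative name.

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Drop every token that matches a stop word."""
    if case_sensitive:
        blocked = set(stop_words)
        return [t for t in tokens if t not in blocked]
    blocked = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in blocked]

STOP_WORDS = ["i", "the", "had", "a"]  # hypothetical list for illustration

row0 = remove_stop_words(["I", "saw", "the", "red", "balloon"], STOP_WORDS)
row1 = remove_stop_words(["Mary", "had", "a", "little", "lamb"], STOP_WORDS)
```

With `case_sensitive=True`, the capitalized "I" would no longer match the lowercase entry "i" and would be kept.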
Refer to the StopWordsRemover Scala docs for more details on the API.

import org.apache.spark.ml.feature.StopWordsRemover

val remover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")

val dataSet = spark.createDataFrame(Seq(
  (0, Seq("I", "saw", "the", "red", "balloon")),
  (1, Seq("Mary", "had", "a", "little", "lamb"))
)).toDF("id", "raw")

remover.transform(dataSet).show(false)

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/StopWordsRemoverExample.scala" in the Spark repo.
n-gram
An n-gram is a sequence of n tokens (typically words) for some integer n. The NGram class can be used to transform input features into n-grams.
NGram takes as input a sequence of strings (e.g. the output of a Tokenizer). The parameter n is used to determine the number of terms in each n-gram. The output will consist of a sequence of n-grams where each n-gram is represented by a space-delimited string of n consecutive words. If the input sequence contains fewer than n strings, no output is produced.
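As a quick illustration of the output format, the following plain-Python sketch builds space-delimited n-grams with a sliding window; the function name `ngrams` is hypothetical.

```python
def ngrams(tokens, n=2):
    """Return each run of n consecutive tokens as one space-delimited string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["Hi", "I", "heard", "about", "Spark"], n=2)
too_short = ngrams(["Hi"], n=2)  # fewer than n tokens: empty result
```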
Examples
Refer to the NGram Scala docs for more details on the API.

import org.apache.spark.ml.feature.NGram

val wordDataFrame = spark.createDataFrame(Seq(
  (0, Array("Hi", "I", "heard", "about", "Spark")),
  (1, Array("I", "wish", "Java", "could", "use", "case", "classes")),
  (2, Array("Logistic", "regression", "models", "are", "neat"))
)).toDF("id", "words")

val ngram = new NGram().setN(2).setInputCol("words").setOutputCol("ngrams")

val ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(false)

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/NGramExample.scala" in the Spark repo.
Binarizer
Binarization is the process of thresholding numerical features to binary (0/1) features.
Binarizer takes the common parameters inputCol and outputCol, as well as the threshold for binarization. Feature values greater than the threshold are binarized to 1.0; values equal to or less than the threshold are binarized to 0.0. Both Vector and Double types are supported for inputCol.
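The thresholding rule is small enough to state directly in code. This plain-Python sketch (the name `binarize` is hypothetical) shows that a value exactly equal to the threshold maps to 0.0:

```python
def binarize(values, threshold=0.0):
    """Values strictly greater than threshold become 1.0, all others 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]

result = binarize([0.1, 0.8, 0.2], threshold=0.5)
on_threshold = binarize([0.5], threshold=0.5)
```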
Examples
Refer to the Binarizer Scala docs for more details on the API.

import org.apache.spark.ml.feature.Binarizer

val data = Array((0, 0.1), (1, 0.8), (2, 0.2))
val dataFrame = spark.createDataFrame(data).toDF("id", "feature")

val binarizer: Binarizer = new Binarizer()
  .setInputCol("feature")
  .setOutputCol("binarized_feature")
  .setThreshold(0.5)

val binarizedDataFrame = binarizer.transform(dataFrame)

println(s"Binarizer output with Threshold = ${binarizer.getThreshold}")
binarizedDataFrame.show()

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/BinarizerExample.scala" in the Spark repo.
PCA
PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A PCA class trains a model to project vectors to a low-dimensional space using PCA. The example below shows how to project 5-dimensional feature vectors into 3-dimensional principal components.
Examples
Refer to the PCA Scala docs for more details on the API.

import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors

// three 5-dimensional feature vectors (one sparse, two dense)
val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

// fit a model that projects onto the top 3 principal components
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)

val result = pca.transform(df).select("pcaFeatures")
result.show(false)

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/PCAExample.scala" in the Spark repo.
PolynomialExpansion
Polynomial expansion is the process of expanding your features into a polynomial space, which is formulated by an n-degree combination of original dimensions. A PolynomialExpansion class provides this functionality. The example below shows how to expand your features into a 3-degree polynomial space.
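To show what an n-degree combination of the original dimensions means, the plain-Python sketch below enumerates every monomial of degree 1 up to the requested degree. The function name is hypothetical, and the ordering of the output terms is an assumption of this sketch; it is not claimed to match the ordering PolynomialExpansion produces.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_expansion(features, degree):
    """Return all monomials of the features with total degree 1..degree."""
    expanded = []
    for d in range(1, degree + 1):
        # Each multiset of d feature indices corresponds to one monomial.
        for combo in combinations_with_replacement(range(len(features)), d):
            expanded.append(prod(features[i] for i in combo))
    return expanded

# For (x, y) = (2.0, 1.0) and degree 2: x, y, x*x, x*y, y*y
quadratic = polynomial_expansion([2.0, 1.0], degree=2)
cubic = polynomial_expansion([2.0, 1.0], degree=3)
```

Two input features expand to 5 terms at degree 2 and 9 terms at degree 3, so the output dimension grows quickly with the degree.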
Examples
Refer to the PolynomialExpansion Scala docs for more details on the API.

import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.ml.linalg.Vectors

val data = Array(
  Vectors.dense(2.0, 1.0),
  Vectors.dense(0.0, 0.0),
  Vectors.dense(3.0, -1.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val polyExpansion = new PolynomialExpansion()
  .setInputCol("features")
  .setOutputCol("polyFeatures")
  .setDegree(3)

val polyDF = polyExpansion.transform(df)
polyDF.show(false)

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/PolynomialExpansionExample.scala" in the Spark repo.
Discrete Cosine Transform (DCT)
The Discrete Cosine Transform transforms a length N real-valued sequence in the time domain into another length N real-valued sequence in the frequency domain. A DCT class provides this functionality, implementing the DCT-II and scaling the result by 1/√2 such that the representing matrix for the transform is unitary. No shift is applied to the transformed sequence (e.g. the 0th element of the transformed sequence is the 0th DCT coefficient and not the N/2th).
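A direct O(N²) evaluation of an orthonormal DCT-II can be written in a few lines of plain Python. The scaling used here (√(1/N) for the 0th coefficient and √(2/N) for the others) is the standard choice that makes the transform matrix unitary; treating it as equivalent to the scaling described above is an assumption of this sketch, and the function name is hypothetical.

```python
import math

def dct2_orthonormal(x):
    """Orthonormal DCT-II of a real sequence, computed directly."""
    n = len(x)
    out = []
    for k in range(n):
        total = sum(x[j] * math.cos(math.pi * (j + 0.5) * k / n) for j in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out

constant = dct2_orthonormal([1.0, 1.0, 1.0, 1.0])   # all energy in coefficient 0
coeffs = dct2_orthonormal([0.0, 1.0, -2.0, 3.0])
```

Because the transform is unitary, the sum of squares of the output equals the sum of squares of the input, which is a convenient sanity check.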
Examples
Refer to the DCT Scala docs for more details on the API.

import org.apache.spark.ml.feature.DCT
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  Vectors.dense(0.0, 1.0, -2.0, 3.0),
  Vectors.dense(-1.0, 2.0, 4.0, -7.0),
  Vectors.dense(14.0, -2.0, -5.0, 1.0))

val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

// forward transform; setInverse(true) would apply the inverse DCT instead
val dct = new DCT()
  .setInputCol("features")
  .setOutputCol("featuresDCT")
  .setInverse(false)

val dctDf = dct.transform(df)
dctDf.select("featuresDCT").show(false)

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/DCTExample.scala" in the Spark repo.
StringIndexer
StringIndexer encodes a string column of labels to a column of label indices. StringIndexer can encode multiple columns. The indices are in [0, numLabels), and four ordering options are supported (default = “frequencyDesc”):
• “frequencyDesc”: descending order by label frequency (most frequent label assigned 0)
• “frequencyAsc”: ascending order by label frequency (least frequent label assigned 0)
• “alphabetDesc”: descending alphabetical order
• “alphabetAsc”: ascending alphabetical order
Note that in case of equal frequency under “frequencyDesc”/“frequencyAsc”, the strings are further sorted alphabetically.
The unseen labels will be put at index numLabels if the user chooses to keep them. If the input column is numeric, we cast it to string and index the string values. When downstream pipeline components such as Estimator or Transformer make use of this string-indexed label, you must set the input column of the component to this string-indexed column name. In many cases, you can set the input column with setInputCol.
Examples
Assume that we have the following DataFrame with columns id and category:

 id | category
----|----------
 0  | a
 1  | b
 2  | c
 3  | a
 4  | a
 5  | c

category is a string column with three labels: “a”, “b”, and “c”. Applying StringIndexer with category as the input column and categoryIndex as the output column, we should get the following:

 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | c        | 1.0
 3  | a        | 0.0
 4  | a        | 0.0
 5  | c        | 1.0

“a” gets index 0 because it is the most frequent, followed by “c” with index 1 and “b” with index 2.
Additionally, there are three strategies regarding how StringIndexer will handle unseen labels when you have fit a StringIndexer on one dataset and then use it to transform another:
• throw an exception (which is the default)
• skip the row containing the unseen label entirely
• put unseen labels in a special additional bucket, at index numLabels
Examples
Let’s go back to our previous example but this time reuse our previously defined StringIndexer on the following dataset:

 id | category
----|----------
 0  | a
 1  | b
 2  | c
 3  | d
 4  | e

If you’ve not set how StringIndexer handles unseen labels or set it to “error”, an exception will be thrown. However, if you had called setHandleInvalid("skip"), the following dataset will be generated:

 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | c        | 1.0

Notice that the rows containing “d” or “e” do not appear.
If you call setHandleInvalid("keep"), the following dataset will be generated:

 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | c        | 1.0
 3  | d        | 3.0
 4  | e        | 3.0

Notice that the rows containing “d” or “e” are mapped to index “3.0”.
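The indexing and the three unseen-label strategies can be reproduced with a small plain-Python sketch. It models only the “frequencyDesc” ordering with alphabetical tie-breaking; the function names and the `handle_invalid` argument are illustrative stand-ins, not the Spark API.

```python
from collections import Counter

def fit_string_indexer(labels):
    """Assign indices by descending frequency, ties broken alphabetically."""
    freq = Counter(labels)
    ordered = sorted(freq, key=lambda label: (-freq[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_labels(labels, mapping, handle_invalid="error"):
    """Apply a fitted mapping; unseen labels follow the chosen strategy."""
    out = []
    for label in labels:
        if label in mapping:
            out.append((label, mapping[label]))
        elif handle_invalid == "keep":
            out.append((label, float(len(mapping))))  # extra bucket at numLabels
        elif handle_invalid == "skip":
            continue  # drop the row
        else:
            raise ValueError(f"Unseen label: {label}")
    return out

mapping = fit_string_indexer(["a", "b", "c", "a", "a", "c"])
kept = transform_labels(["a", "b", "c", "d", "e"], mapping, handle_invalid="keep")
skipped = transform_labels(["a", "b", "c", "d", "e"], mapping, handle_invalid="skip")
```

Here `mapping` assigns 0.0 to “a”, 1.0 to “c”, and 2.0 to “b”, and with the “keep” strategy both “d” and “e” land in the extra bucket 3.0, as in the tables above.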
Refer to the StringIndexer Scala docs for more details on the API.

import org.apache.spark.ml.feature.StringIndexer

val df = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

val indexed = indexer.fit(df).transform(df)
indexed.show()

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/StringIndexerExample.scala" in the Spark repo.
IndexToString
Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings. A common use case is to produce indices from labels with StringIndexer, train a model with those indices and retrieve the original labels from the column of predicted indices with IndexToString. However, you are free to supply your own labels.
Examples
Building on the StringIndexer example, let’s assume we have the following DataFrame with columns id and categoryIndex:

 id | categoryIndex
----|---------------
 0  | 0.0
 1  | 2.0
 2  | 1.0
 3  | 0.0
 4  | 0.0
 5  | 1.0

Applying IndexToString with categoryIndex as the input column, originalCategory as the output column, we are able to retrieve our original labels (they will be inferred from the columns’ metadata):

 id | categoryIndex | originalCategory
----|---------------|------------------
 0  | 0.0           | a
 1  | 2.0           | b
 2  | 1.0           | c
 3  | 0.0           | a
 4  | 0.0           | a
 5  | 1.0           | c

? Scala
? Java
? Python
Refer to the IndexToString Scala docs for more details on the API.
import org.apache.spark.ml.attribute.Attribute
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

val df = spark.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c")
)).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)

println(s"Transformed string column '${indexer.getInputCol}' " +
  s"to indexed column '${indexer.getOutputCol}'")
indexed.show()

val inputColSchema = indexed.schema(indexer.getOutputCol)
println(s"StringIndexer will store labels in output column metadata: " +
  s"${Attribute.fromStructField(inputColSchema).toString}\n")

val converter = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory")

val converted = converter.transform(indexed)

println(s"Transformed indexed column '${converter.getInputCol}' back to original string " +
  s"column '${converter.getOutputCol}' using labels in metadata")
converted.select("id", "categoryIndex", "originalCategory").show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/IndexToStringExample.scala” in the Spark repo.
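The index-to-label mapping described above can be illustrated without Spark. The following is a minimal plain-Python sketch, not the library's implementation; the `labels` list and the helper name `index_to_string` are assumptions made for illustration, with the label order chosen to match the example table (index 0 is a, 1 is c, 2 is b).

```python
# Labels in index order, as a StringIndexer-style fit would record them.
# Assumed order for this sketch: index 0 -> "a", 1 -> "c", 2 -> "b".
labels = ["a", "c", "b"]


def index_to_string(indices, labels):
    """Map each numeric label index back to its original string label."""
    return [labels[int(i)] for i in indices]


category_index = [0.0, 2.0, 1.0, 0.0, 0.0, 1.0]
original_category = index_to_string(category_index, labels)
print(original_category)  # ['a', 'b', 'c', 'a', 'a', 'c']
```

Supplying a different `labels` list corresponds to the option of providing your own labels instead of relying on column metadata.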
OneHotEncoder
One-hot encoding maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using StringIndexer first.
OneHotEncoder can transform multiple columns, returning a one-hot-encoded output vector column for each input column. It is common to merge these vectors into a single feature vector using VectorAssembler.
OneHotEncoder supports the handleInvalid parameter to choose how to handle invalid input during transforming data. Available options include ‘keep’ (any invalid inputs are assigned to an extra categorical index) and ‘error’ (throw an error).
Examples
Refer to the OneHotEncoder Scala docs for more details on the API.
import org.apache.spark.ml.feature.OneHotEncoder

val df = spark.createDataFrame(Seq(
  (0.0, 1.0),
  (1.0, 0.0),
  (2.0, 1.0),
  (0.0, 2.0),
  (0.0, 1.0),
  (2.0, 0.0)
)).toDF("categoryIndex1", "categoryIndex2")

val encoder = new OneHotEncoder()
  .setInputCols(Array("categoryIndex1", "categoryIndex2"))
  .setOutputCols(Array("categoryVec1", "categoryVec2"))
val model = encoder.fit(df)

val encoded = model.transform(df)
encoded.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/OneHotEncoderExample.scala” in the Spark repo.
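To make the phrase "at most a single one-value" concrete, here is a minimal plain-Python sketch of one-hot encoding a label index. It is an illustration under stated assumptions, not Spark's code: the function name `one_hot` and the `drop_last` flag are hypothetical, and the flag models the case where the last category is represented by an all-zeros vector.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a label index as a dense binary vector.

    With drop_last=True the last category maps to an all-zeros vector,
    so the output has num_categories - 1 entries and at most one 1.0.
    """
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    i = int(index)
    if i < size:
        vec[i] = 1.0
    return vec


encoded = [one_hot(i, num_categories=3) for i in (0.0, 1.0, 2.0)]
print(encoded)  # [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
```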
VectorIndexer
VectorIndexer helps index categorical features in datasets of Vectors. It can both automatically decide which features are categorical and convert original values to category indices. Specifically, it does the following:

  1. Take an input column of type Vector and a parameter maxCategories.
  2. Decide which features should be categorical based on the number of distinct values, where features with at most maxCategories are declared categorical.
  3. Compute 0-based category indices for each categorical feature.
  4. Index categorical features and transform original feature values to indices.
Indexing categorical features allows algorithms such as Decision Trees and Tree Ensembles to treat categorical features appropriately, improving performance.
Examples
In the example below, we read in a dataset of labeled points and then use VectorIndexer to decide which features should be treated as categorical. We transform the categorical feature values to their indices. This transformed data could then be passed to algorithms such as DecisionTreeRegressor that handle categorical features.
Refer to the VectorIndexer Scala docs for more details on the API.
import org.apache.spark.ml.feature.VectorIndexer

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexed")
  .setMaxCategories(10)

val indexerModel = indexer.fit(data)

val categoricalFeatures: Set[Int] = indexerModel.categoryMaps.keys.toSet
println(s"Chose ${categoricalFeatures.size} " +
  s"categorical features: ${categoricalFeatures.mkString(", ")}")

// Create new column "indexed" with categorical values transformed to indices
val indexedData = indexerModel.transform(data)
indexedData.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/VectorIndexerExample.scala” in the Spark repo.
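The four steps listed for VectorIndexer can be sketched in plain Python. This is an illustration only: the helper names `fit_vector_indexer` and `transform` are hypothetical, and assigning indices in ascending order of the original values is an assumption of the sketch rather than a statement about Spark's ordering.

```python
def fit_vector_indexer(rows, max_categories):
    """Return {feature index: {original value: category index}} for every
    feature that has at most max_categories distinct values."""
    category_maps = {}
    for j in range(len(rows[0])):
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            # Assumption: indices follow ascending order of the values.
            category_maps[j] = {v: float(k) for k, v in enumerate(distinct)}
    return category_maps


def transform(rows, category_maps):
    """Replace categorical feature values by their 0-based indices;
    continuous features pass through unchanged."""
    return [[category_maps[j][v] if j in category_maps else v
             for j, v in enumerate(row)] for row in rows]


rows = [[-1.0, 0.5], [0.0, 1.7], [3.0, 2.9], [-1.0, 4.1]]
maps = fit_vector_indexer(rows, max_categories=3)
indexed = transform(rows, maps)
print(sorted(maps))  # [0]  (only feature 0 is treated as categorical)
print(indexed)       # [[0.0, 0.5], [1.0, 1.7], [2.0, 2.9], [0.0, 4.1]]
```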
Interaction
Interaction is a Transformer which takes vector or double-valued columns, and generates a single vector column that contains the product of all combinations of one value from each input column.
For example, if you have 2 vector type columns each of which has 3 dimensions as input columns, then you’ll get a 9-dimensional vector as the output column.
Examples
Assume that we have the following DataFrame with the columns “id1”, “vec1”, and “vec2”:

 id1 | vec1           | vec2
-----|----------------|----------------
 1   | [1.0,2.0,3.0]  | [8.0,4.0,5.0]
 2   | [4.0,3.0,8.0]  | [7.0,9.0,8.0]
 3   | [6.0,1.0,9.0]  | [2.0,3.0,6.0]
 4   | [10.0,8.0,6.0] | [9.0,4.0,5.0]
 5   | [9.0,2.0,7.0]  | [10.0,7.0,3.0]
 6   | [1.0,1.0,4.0]  | [2.0,8.0,4.0]

Applying Interaction to those input columns produces the output column interactedCol:

 id1 | vec1           | vec2           | interactedCol
-----|----------------|----------------|------------------------------------------------------
 1   | [1.0,2.0,3.0]  | [8.0,4.0,5.0]  | [8.0,4.0,5.0,16.0,8.0,10.0,24.0,12.0,15.0]
 2   | [4.0,3.0,8.0]  | [7.0,9.0,8.0]  | [56.0,72.0,64.0,42.0,54.0,48.0,112.0,144.0,128.0]
 3   | [6.0,1.0,9.0]  | [2.0,3.0,6.0]  | [36.0,54.0,108.0,6.0,9.0,18.0,54.0,81.0,162.0]
 4   | [10.0,8.0,6.0] | [9.0,4.0,5.0]  | [360.0,160.0,200.0,288.0,128.0,160.0,216.0,96.0,120.0]
 5   | [9.0,2.0,7.0]  | [10.0,7.0,3.0] | [450.0,315.0,135.0,100.0,70.0,30.0,350.0,245.0,105.0]
 6   | [1.0,1.0,4.0]  | [2.0,8.0,4.0]  | [12.0,48.0,24.0,12.0,48.0,24.0,48.0,192.0,96.0]

Refer to the Interaction Scala docs for more details on the API.
import org.apache.spark.ml.feature.Interaction
import org.apache.spark.ml.feature.VectorAssembler

val df = spark.createDataFrame(Seq(
  (1, 1, 2, 3, 8, 4, 5),
  (2, 4, 3, 8, 7, 9, 8),
  (3, 6, 1, 9, 2, 3, 6),
  (4, 10, 8, 6, 9, 4, 5),
  (5, 9, 2, 7, 10, 7, 3),
  (6, 1, 1, 4, 2, 8, 4)
)).toDF("id1", "id2", "id3", "id4", "id5", "id6", "id7")

val assembler1 = new VectorAssembler().
  setInputCols(Array("id2", "id3", "id4")).
  setOutputCol("vec1")

val assembled1 = assembler1.transform(df)

val assembler2 = new VectorAssembler().
  setInputCols(Array("id5", "id6", "id7")).
  setOutputCol("vec2")

val assembled2 = assembler2.transform(assembled1).select("id1", "vec1", "vec2")

val interaction = new Interaction()
  .setInputCols(Array("id1", "vec1", "vec2"))
  .setOutputCol("interactedCol")

val interacted = interaction.transform(assembled2)

interacted.show(truncate = false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/InteractionExample.scala” in the Spark repo.
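The "product of all combinations of one value from each input column" can be checked by hand with a short plain-Python sketch. The helper name `interact` is hypothetical; scalars are treated as one-element vectors, and the output order (last input varying fastest) is chosen to reproduce the interactedCol values in the table above.

```python
from itertools import product
from math import prod


def interact(*columns):
    """Multiply one value from each input, over all combinations.

    Scalars are treated as one-element vectors; the last input varies fastest.
    """
    vectors = [c if isinstance(c, (list, tuple)) else [c] for c in columns]
    return [float(prod(combo)) for combo in product(*vectors)]


# First row of the example: id1 = 1, vec1 = [1, 2, 3], vec2 = [8, 4, 5].
interacted = interact(1, [1.0, 2.0, 3.0], [8.0, 4.0, 5.0])
print(interacted)  # [8.0, 4.0, 5.0, 16.0, 8.0, 10.0, 24.0, 12.0, 15.0]
```

Two 3-dimensional vectors on their own yield 3 x 3 = 9 products, which is the 9-dimensional output mentioned above.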
Normalizer
Normalizer is a Transformer which transforms a dataset of Vector rows, normalizing each Vector to have unit norm. It takes parameter p, which specifies the p-norm used for normalization. ($p = 2$ by default.) This normalization can help standardize your input data and improve the behavior of learning algorithms.
Examples
The following example builds a small DataFrame of dense vectors and then normalizes each row to have unit $L^1$ norm and unit $L^\infty$ norm.
Refer to the Normalizer Scala docs for more details on the API.
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.linalg.Vectors

val dataFrame = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.5, -1.0)),
  (1, Vectors.dense(2.0, 1.0, 1.0)),
  (2, Vectors.dense(4.0, 10.0, 2.0))
)).toDF("id", "features")

// Normalize each Vector using $L^1$ norm.
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(1.0)

val l1NormData = normalizer.transform(dataFrame)
println("Normalized using L^1 norm")
l1NormData.show()

// Normalize each Vector using $L^\infty$ norm.
val lInfNormData = normalizer.transform(dataFrame, normalizer.p -> Double.PositiveInfinity)
println("Normalized using L^inf norm")
lInfNormData.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/NormalizerExample.scala” in the Spark repo.
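As a rough illustration of what row-wise p-norm normalization computes, here is a plain-Python sketch applied to two of the vectors from the example. The function name `normalize` is hypothetical, and returning a zero vector unchanged is an assumption of this sketch.

```python
import math


def normalize(vec, p=2.0):
    """Scale a vector so that its p-norm equals 1."""
    if math.isinf(p):
        norm = max(abs(x) for x in vec)
    else:
        norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    # Assumption of this sketch: a zero vector is returned unchanged.
    return [x / norm for x in vec] if norm != 0 else list(vec)


l1 = normalize([1.0, 0.5, -1.0], p=1.0)          # L^1 norm is 2.5
linf = normalize([4.0, 10.0, 2.0], p=math.inf)   # L^inf norm is 10.0
print(l1, linf)
```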
StandardScaler
StandardScaler transforms a dataset of Vector rows, normalizing each feature to have unit standard deviation and/or zero mean. It takes parameters:
• withStd: True by default. Scales the data to unit standard deviation.
• withMean: False by default. Centers the data with mean before scaling. It will build a dense output, so take care when applying to sparse input.
StandardScaler is an Estimator which can be fit on a dataset to produce a StandardScalerModel; this amounts to computing summary statistics. The model can then transform a Vector column in a dataset to have unit standard deviation and/or zero mean features.
Note that if the standard deviation of a feature is zero, it will return default 0.0 value in the Vector for that feature.
Examples
The following example demonstrates how to load a dataset in libsvm format and then normalize each feature to have unit standard deviation.
Refer to the StandardScaler Scala docs for more details on the API.
import org.apache.spark.ml.feature.StandardScaler

val dataFrame = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)
  .setWithMean(false)

// Compute summary statistics by fitting the StandardScaler.
val scalerModel = scaler.fit(dataFrame)

// Normalize each feature to have unit standard deviation.
val scaledData = scalerModel.transform(dataFrame)
scaledData.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/StandardScalerExample.scala” in the Spark repo.
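The effect of withStd and withMean on a single feature can be sketched in plain Python. This is an illustration, not the library's code: the function name `standard_scale` is hypothetical, and using the sample standard deviation (dividing by n - 1) is an assumption of the sketch. The zero-deviation case returns 0.0, as the note above describes.

```python
import statistics


def standard_scale(column, with_mean=False, with_std=True):
    """Scale one feature column to unit standard deviation and/or zero mean."""
    mean = statistics.fmean(column)
    std = statistics.stdev(column)  # sample standard deviation (assumption)
    scaled = []
    for x in column:
        if with_mean:
            x = x - mean
        if with_std:
            # A zero standard deviation yields 0.0 for that feature.
            x = x / std if std != 0 else 0.0
        scaled.append(x)
    return scaled


unit_std = standard_scale([2.0, 4.0, 6.0])                  # std is 2.0
centered = standard_scale([2.0, 4.0, 6.0], with_mean=True)  # mean is 4.0
constant = standard_scale([5.0, 5.0, 5.0])
print(unit_std, centered, constant)
```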
RobustScaler
RobustScaler transforms a dataset of Vector rows, removing the median and scaling the data according to a specific quantile range (by default the IQR: Interquartile Range, the quantile range between the 1st quartile and the 3rd quartile). Its behavior is quite similar to StandardScaler; however, the median and the quantile range are used instead of the mean and standard deviation, which makes it robust to outliers. It takes parameters:
• lower: 0.25 by default. Lower quantile to calculate quantile range, shared by all features.
• upper: 0.75 by default. Upper quantile to calculate quantile range, shared by all features.
• withScaling: True by default. Scales the data to quantile range.
• withCentering: False by default. Centers the data with median before scaling. It will build a dense output, so take care when applying to sparse input.
RobustScaler is an Estimator which can be fit on a dataset to produce a RobustScalerModel; this amounts to computing quantile statistics. The model can then transform a Vector column in a dataset to have unit quantile range and/or zero median features.
Note that if the quantile range of a feature is zero, it will return default 0.0 value in the Vector for that feature.
Examples
The following example demonstrates how to load a dataset in libsvm format and then normalize each feature to have unit quantile range.
Refer to the RobustScaler Scala docs for more details on the API.
import org.apache.spark.ml.feature.RobustScaler

val dataFrame = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val scaler = new RobustScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithScaling(true)
  .setWithCentering(false)
  .setLower(0.25)
  .setUpper(0.75)

// Compute summary statistics by fitting the RobustScaler.
val scalerModel = scaler.fit(dataFrame)

// Transform each feature to have unit quantile range.
val scaledData = scalerModel.transform(dataFrame)
scaledData.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/RobustScalerExample.scala” in the Spark repo.
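To show why median and quantile range are less sensitive to outliers, here is a plain-Python sketch for a single feature. It is only an illustration: the helper names are hypothetical, and the nearest-rank quantile rule used here is an assumption of the sketch, so its numbers may differ from what Spark computes.

```python
import math


def quantile(sorted_vals, q):
    """Nearest-rank quantile (an assumption of this sketch)."""
    rank = max(1, math.ceil(q * len(sorted_vals)))
    return sorted_vals[rank - 1]


def robust_scale(column, lower=0.25, upper=0.75,
                 with_centering=False, with_scaling=True):
    """Center by the median and/or scale by the quantile range."""
    s = sorted(column)
    median = quantile(s, 0.5)
    q_range = quantile(s, upper) - quantile(s, lower)
    scaled = []
    for x in column:
        if with_centering:
            x = x - median
        if with_scaling:
            # A zero quantile range yields 0.0 for that feature.
            x = x / q_range if q_range != 0 else 0.0
        scaled.append(x)
    return scaled


column = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is an outlier
scaled = robust_scale(column)
centered = robust_scale(column, with_centering=True)
print(scaled)    # [0.5, 1.0, 1.5, 2.0, 50.0]
print(centered)  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier changes neither the median (3.0) nor the quantile range (2.0) in this sketch, so the other four values keep the same scale they would have without it.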
MinMaxScaler
MinMaxScaler transforms a dataset of Vector rows, rescaling each feature to a specific range (often [0, 1]). It takes parameters:
• min: 0.0 by default. Lower bound after transformation, shared by all features.
• max: 1.0 by default. Upper bound after transformation, shared by all features.
MinMaxScaler computes summary statistics on a data set and produces a MinMaxScalerModel. The model can then transform each feature individually such that it is in the given range.
The rescaled value for a feature E is calculated as,
\begin{equation}
Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min
\end{equation}
For the case $E_{max} == E_{min}$, $Rescaled(e_i) = 0.5 * (max + min)$
Note that since zero values will probably be transformed to non-zero values, output of the transformer will be DenseVector even for sparse input.
Examples
The following example builds a small DataFrame of dense vectors and then rescales each feature to [0, 1].
Refer to the MinMaxScaler Scala docs and the MinMaxScalerModel Scala docs for more details on the API.
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors

val dataFrame = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.1, -1.0)),
  (1, Vectors.dense(2.0, 1.1, 1.0)),
  (2, Vectors.dense(3.0, 10.1, 3.0))
)).toDF("id", "features")

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

// Compute summary statistics and generate MinMaxScalerModel
val scalerModel = scaler.fit(dataFrame)

// rescale each feature to range [min, max].
val scaledData = scalerModel.transform(dataFrame)
println(s"Features scaled to range: [${scaler.getMin}, ${scaler.getMax}]")
scaledData.select("features", "scaledFeatures").show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/MinMaxScalerExample.scala” in the Spark repo.
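The rescaling rule for MinMaxScaler, including the special case where a feature's maximum equals its minimum, can be written out directly. The following plain-Python sketch applies it to one feature column; the function name `min_max_scale` and its parameter names are hypothetical.

```python
def min_max_scale(column, new_min=0.0, new_max=1.0):
    """Rescale one feature column linearly into [new_min, new_max]."""
    e_min, e_max = min(column), max(column)
    if e_max == e_min:
        # Constant feature: every value maps to the midpoint of the range.
        return [0.5 * (new_max + new_min)] * len(column)
    return [(e - e_min) / (e_max - e_min) * (new_max - new_min) + new_min
            for e in column]


first_feature = min_max_scale([1.0, 2.0, 3.0])
third_feature = min_max_scale([-1.0, 1.0, 3.0])
constant = min_max_scale([7.0, 7.0])
print(first_feature, third_feature, constant)
# [0.0, 0.5, 1.0] [0.0, 0.5, 1.0] [0.5, 0.5]
```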
MaxAbsScaler
MaxAbsScaler transforms a dataset of Vector rows, rescaling each feature to range [-1, 1] by dividing through the maximum absolute value in each feature. It does not shift/center the data, and thus does not destroy any sparsity.
MaxAbsScaler computes summary statistics on a data set and produces a MaxAbsScalerModel. The model can then transform each feature individually to range [-1, 1].
Examples
The following example builds a small DataFrame of dense vectors and then rescales each feature to [-1, 1].
Refer to the MaxAbsScaler Scala docs and the MaxAbsScalerModel Scala docs for more details on the API.
import org.apache.spark.ml.feature.MaxAbsScaler
import org.apache.spark.ml.linalg.Vectors

val dataFrame = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.1, -8.0)),
  (1, Vectors.dense(2.0, 1.0, -4.0)),
  (2, Vectors.dense(4.0, 10.0, 8.0))
)).toDF("id", "features")

val scaler = new MaxAbsScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

// Compute summary statistics and generate MaxAbsScalerModel
val scalerModel = scaler.fit(dataFrame)

// rescale each feature to range [-1, 1]
val scaledData = scalerModel.transform(dataFrame)
scaledData.select("features", "scaledFeatures").show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/MaxAbsScalerExample.scala” in the Spark repo.
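Dividing by the maximum absolute value keeps zeros at zero, which is why sparsity survives. A minimal plain-Python sketch for one feature column follows; the function name `max_abs_scale` is hypothetical, and leaving an all-zero column unchanged is an assumption of the sketch.

```python
def max_abs_scale(column):
    """Divide every value by the largest absolute value in the column."""
    max_abs = max(abs(x) for x in column)
    if max_abs == 0:
        return list(column)  # assumption: an all-zero feature is left as-is
    return [x / max_abs for x in column]


third_feature = max_abs_scale([-8.0, -4.0, 8.0])
with_zero = max_abs_scale([0.0, 5.0, -10.0])
print(third_feature)  # [-1.0, -0.5, 1.0]
print(with_zero)      # [0.0, 0.5, -1.0]
```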
Bucketizer
Bucketizer transforms a column of continuous features to a column of feature buckets, where the buckets are specified by users. It takes a parameter:
• splits: Parameter for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. Splits should be strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; otherwise, values outside the splits specified will be treated as errors. Two examples of splits are Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity) and Array(0.0, 1.0, 2.0).
Note that if you have no idea of the upper and lower bounds of the targeted column, you should add Double.NegativeInfinity and Double.PositiveInfinity as the bounds of your splits to prevent a potential out of Bucketizer bounds exception.
Note also that the splits that you provided have to be in strictly increasing order, i.e. s0 < s1 < s2 < … < sn.
More details can be found in the API docs for Bucketizer.
Examples
The following example demonstrates how to bucketize a column of Doubles into a column of bucket indices.
Refer to the Bucketizer Scala docs for more details on the API.
import org.apache.spark.ml.feature.Bucketizer

val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)

val data = Array(-999.9, -0.5, -0.3, 0.0, 0.2, 999.9)
val dataFrame = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val bucketizer = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(splits)

// Transform original data into its bucket index.
val bucketedData = bucketizer.transform(dataFrame)

println(s"Bucketizer output with ${bucketizer.getSplits.length-1} buckets")
bucketedData.show()

val splitsArray = Array(
  Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity),
  Array(Double.NegativeInfinity, -0.3, 0.0, 0.3, Double.PositiveInfinity))

val data2 = Array(
  (-999.9, -999.9),
  (-0.5, -0.2),
  (-0.3, -0.1),
  (0.0, 0.0),
  (0.2, 0.4),
  (999.9, 999.9))
val dataFrame2 = spark.createDataFrame(data2).toDF("features1", "features2")

val bucketizer2 = new Bucketizer()
  .setInputCols(Array("features1", "features2"))
  .setOutputCols(Array("bucketedFeatures1", "bucketedFeatures2"))
  .setSplitsArray(splitsArray)

// Transform original data into its bucket index.
val bucketedData2 = bucketizer2.transform(dataFrame2)

println(s"Bucketizer output with [" +
  s"${bucketizer2.getSplitsArray(0).length-1}, " +
  s"${bucketizer2.getSplitsArray(1).length-1}] buckets for each input column")
bucketedData2.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/BucketizerExample.scala” in the Spark repo.
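The bucket rule described for splits (half-open ranges [x, y), with the last bucket also including its upper bound, and out-of-range values treated as errors) can be sketched with a binary search. The function name `bucketize` is hypothetical and the sketch ignores NaN handling.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the index of the bucket [splits[i], splits[i+1]) holding value.

    The last bucket also includes its upper bound; values outside the
    splits raise an error.
    """
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the splits {splits}")
    if value == splits[-1]:
        return float(len(splits) - 2)
    return float(bisect_right(splits, value) - 1)


inf = float("inf")
splits = [-inf, -0.5, 0.0, 0.5, inf]
data = [-999.9, -0.5, -0.3, 0.0, 0.2, 999.9]
buckets = [bucketize(x, splits) for x in data]
print(buckets)  # [0.0, 1.0, 1.0, 2.0, 2.0, 3.0]
```

With the bounded splits [0.0, 1.0, 2.0], the value 2.0 falls in the last bucket (index 1), while 3.0 raises an error because no infinite bound covers it.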
ElementwiseProduct
ElementwiseProduct multiplies each input vector by a provided “weight” vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This represents the Hadamard product between the input vector, v and transforming vector, w, to yield a result vector.
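\begin{equation}
\begin{pmatrix} v_1 \\ \vdots \\ v_N \end{pmatrix} \circ \begin{pmatrix} w_1 \\ \vdots \\ w_N \end{pmatrix} = \begin{pmatrix} v_1 w_1 \\ \vdots \\ v_N w_N \end{pmatrix}
\end{equation}
Examples
The example below demonstrates how to transform vectors using a transforming vector value.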
Refer to the ElementwiseProduct Scala docs for more details on the API.
import org.apache.spark.ml.feature.ElementwiseProduct
import org.apache.spark.ml.linalg.Vectors

// Create some vector data; also works for sparse vectors
val dataFrame = spark.createDataFrame(Seq(
  ("a", Vectors.dense(1.0, 2.0, 3.0)),
  ("b", Vectors.dense(4.0, 5.0, 6.0)))).toDF("id", "vector")

val transformingVector = Vectors.dense(0.0, 1.0, 2.0)
val transformer = new ElementwiseProduct()
  .setScalingVec(transformingVector)
  .setInputCol("vector")
  .setOutputCol("transformedVector")

// Batch transform the vectors to create new column:
transformer.transform(dataFrame).show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/ElementwiseProductExample.scala” in the Spark repo.
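For readers who want to verify the Hadamard product by hand, this short plain-Python sketch multiplies the example vectors by the weight vector (0.0, 1.0, 2.0). The function name `elementwise_product` is hypothetical.

```python
def elementwise_product(vec, scaling_vec):
    """Multiply two equal-length vectors component by component."""
    if len(vec) != len(scaling_vec):
        raise ValueError("vectors must have the same length")
    return [v * w for v, w in zip(vec, scaling_vec)]


weights = [0.0, 1.0, 2.0]
row_a = elementwise_product([1.0, 2.0, 3.0], weights)
row_b = elementwise_product([4.0, 5.0, 6.0], weights)
print(row_a, row_b)  # [0.0, 2.0, 6.0] [0.0, 5.0, 12.0]
```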
SQLTransformer
SQLTransformer implements the transformations which are defined by a SQL statement. Currently, we only support SQL syntax like “SELECT ... FROM __THIS__ ...” where “__THIS__” represents the underlying table of the input dataset. The select clause specifies the fields, constants, and expressions to display in the output, and can be any select clause that Spark SQL supports. Users can also use Spark SQL built-in functions and UDFs to operate on these selected columns. For example, SQLTransformer supports statements like:
• SELECT a, a + b AS a_b FROM __THIS__
• SELECT a, SQRT(b) AS b_sqrt FROM __THIS__ where a > 5
• SELECT a, b, SUM(c) AS c_sum FROM __THIS__ GROUP BY a, b
Examples
Assume that we have the following DataFrame with columns id, v1 and v2:

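 id | v1  | v2
----|-----|-----
 0  | 1.0 | 3.0
 2  | 2.0 | 5.0

This is the output of the SQLTransformer with statement “SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__”:

 id | v1  | v2  | v3  | v4
----|-----|-----|-----|------
 0  | 1.0 | 3.0 | 4.0 | 3.0
 2  | 2.0 | 5.0 | 7.0 | 10.0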

Refer to the SQLTransformer Scala docs for more details on the API.
import org.apache.spark.ml.feature.SQLTransformer

val df = spark.createDataFrame(
  Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")

// __THIS__ is the placeholder for the input dataset's underlying table.
val sqlTrans = new SQLTransformer().setStatement(
  "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")

sqlTrans.transform(df).show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/SQLTransformerExample.scala” in the Spark repo.
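The placeholder mechanism can be imitated without Spark. The sketch below is an analogy, not Spark's implementation: it uses Python's built-in sqlite3 module, loads the rows into a temporary table, and substitutes the `__THIS__` placeholder (the name Spark's SQLTransformer uses for the input table) with that table's name. The function name `sql_transform` and the table name `input_table` are hypothetical.

```python
import sqlite3


def sql_transform(rows, columns, statement, table_name="input_table"):
    """Run a SELECT statement in which __THIS__ stands for the input rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table_name} ({', '.join(columns)})")
    marks = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO {table_name} VALUES ({marks})", rows)
    cursor = conn.execute(statement.replace("__THIS__", table_name))
    names = [d[0] for d in cursor.description]
    result = cursor.fetchall()
    conn.close()
    return names, result


names, result = sql_transform(
    [(0, 1.0, 3.0), (2, 2.0, 5.0)],
    ["id", "v1", "v2"],
    "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__",
)
print(names)   # ['id', 'v1', 'v2', 'v3', 'v4']
print(result)  # [(0, 1.0, 3.0, 4.0, 3.0), (2, 2.0, 5.0, 7.0, 10.0)]
```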
VectorAssembler
VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees. VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. In each row, the values of the input columns will be concatenated into a vector in the specified order.
Examples
Assume that we have a DataFrame with the columns id, hour, mobile, userFeatures, and clicked:

 id | hour | mobile | userFeatures     | clicked
----|------|--------|------------------|---------
 0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0

userFeatures is a vector column that contains three user features. We want to combine hour, mobile, and userFeatures into a single feature vector called features and use it to predict clicked or not. If we set VectorAssembler’s input columns to hour, mobile, and userFeatures and output column to features, after transformation we should get the following DataFrame:

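 id | hour | mobile | userFeatures     | clicked | features
----|------|--------|------------------|---------|-----------------------------
 0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0     | [18.0, 1.0, 0.0, 10.0, 0.5]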

Refer to the VectorAssembler Scala docs for more details on the API.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

val dataset = spark.createDataFrame(
  Seq((0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0))
).toDF("id", "hour", "mobile", "userFeatures", "clicked")

val assembler = new VectorAssembler()
  .setInputCols(Array("hour", "mobile", "userFeatures"))
  .setOutputCol("features")

val output = assembler.transform(dataset)
println("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
output.select("features", "clicked").show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/VectorAssemblerExample.scala” in the Spark repo.
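The concatenation that VectorAssembler performs on each row can be pictured with a few lines of plain Python. This is only a sketch: the function name `assemble` is hypothetical, and Python lists stand in for vector columns.

```python
def assemble(*values):
    """Concatenate numbers, booleans and vectors into one flat feature list,
    in the order given."""
    features = []
    for value in values:
        if isinstance(value, (list, tuple)):
            features.extend(float(x) for x in value)
        else:
            features.append(float(value))  # booleans become 1.0 / 0.0
    return features


# hour, mobile, userFeatures from the example row
features = assemble(18, 1.0, [0.0, 10.0, 0.5])
print(features)  # [18.0, 1.0, 0.0, 10.0, 0.5]
```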
VectorSizeHint
It can sometimes be useful to explicitly specify the size of the vectors for a column of VectorType. For example, VectorAssembler uses size information from its input columns to produce size information and metadata for its output column. While in some cases this information can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are not available until the stream is started. VectorSizeHint allows a user to explicitly specify the vector size for a column so that VectorAssembler, or other transformers that might need to know vector size, can use that column as an input.
To use VectorSizeHint a user must set the inputCol and size parameters. Applying this transformer to a dataframe produces a new dataframe with updated metadata for inputCol specifying the vector size. Downstream operations on the resulting dataframe can get this size using the metadata.
VectorSizeHint can also take an optional handleInvalid parameter which controls its behaviour when the vector column contains nulls or vectors of the wrong size. By default handleInvalid is set to “error”, indicating an exception should be thrown. This parameter can also be set to “skip”, indicating that rows containing invalid values should be filtered out from the resulting dataframe, or “optimistic”, indicating that the column should not be checked for invalid values and all rows should be kept. Note that the use of “optimistic” can cause the resulting dataframe to be in an inconsistent state, meaning the metadata for the column VectorSizeHint was applied to does not match the contents of that column. Users should take care to avoid this kind of inconsistent state.
Refer to the VectorSizeHint Scala docs for more details on the API.
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
import org.apache.spark.ml.linalg.Vectors

val dataset = spark.createDataFrame(
  Seq(
    (0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0),
    (0, 18, 1.0, Vectors.dense(0.0, 10.0), 0.0))
).toDF("id", "hour", "mobile", "userFeatures", "clicked")

val sizeHint = new VectorSizeHint()
  .setInputCol("userFeatures")
  .setHandleInvalid("skip")
  .setSize(3)

val datasetWithSize = sizeHint.transform(dataset)
println("Rows where 'userFeatures' is not the right size are filtered out")
datasetWithSize.show(false)

val assembler = new VectorAssembler()
  .setInputCols(Array("hour", "mobile", "userFeatures"))
  .setOutputCol("features")

// This dataframe can be used by downstream transformers as before
val output = assembler.transform(datasetWithSize)
println("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
output.select("features", "clicked").show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/VectorSizeHintExample.scala” in the Spark repo.
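The three handleInvalid modes described above differ only in what happens to null or wrongly sized vectors. The following plain-Python sketch mimics that decision on a list of vectors; it is an illustration with a hypothetical function name, `vector_size_hint`, and does not model the column metadata that the real transformer updates.

```python
def vector_size_hint(vectors, size, handle_invalid="error"):
    """Check a column of vectors against an expected size.

    "error" raises on the first invalid vector, "skip" drops invalid rows,
    "optimistic" keeps every row without checking.
    """
    if handle_invalid not in ("error", "skip", "optimistic"):
        raise ValueError(f"unknown handle_invalid: {handle_invalid}")
    if handle_invalid == "optimistic":
        return list(vectors)
    kept = []
    for vec in vectors:
        if vec is not None and len(vec) == size:
            kept.append(vec)
        elif handle_invalid == "error":
            raise ValueError(f"expected a vector of size {size}, got {vec}")
    return kept


user_features = [[0.0, 10.0, 0.5], [0.0, 10.0]]
skipped = vector_size_hint(user_features, size=3, handle_invalid="skip")
unchecked = vector_size_hint(user_features, size=3, handle_invalid="optimistic")
print(skipped)    # [[0.0, 10.0, 0.5]]
print(unchecked)  # [[0.0, 10.0, 0.5], [0.0, 10.0]]
```

The "optimistic" result shows the inconsistent state the text warns about: the declared size is 3, yet a 2-element vector remains in the data.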
QuantileDiscretizer
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins is set by the numBuckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles.
NaN values: NaN values will be removed from the column during QuantileDiscretizer fitting. This will produce a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handleInvalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket, for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for approxQuantile for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. When set to zero, exact quantiles are calculated (Note: Computing exact quantiles is an expensive operation). The lower and upper bin bounds will be -Infinity and +Infinity covering all real values.
Examples
Assume that we have a DataFrame with the columns id, hour:

 id | hour
----|------
 0  | 18.0
 1  | 19.0
 2  | 8.0
 3  | 5.0
 4  | 2.2

hour is a continuous feature with Double type. We want to turn the continuous feature into a categorical one. Given numBuckets = 3, we should get the following DataFrame:

 id | hour | result
----|------|--------
 0  | 18.0 | 2.0
 1  | 19.0 | 2.0
 2  | 8.0  | 1.0
 3  | 5.0  | 1.0
 4  | 2.2  | 0.0

Refer to the QuantileDiscretizer Scala docs for more details on the API.
import org.apache.spark.ml.feature.QuantileDiscretizer

val data = Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
val df = spark.createDataFrame(data).toDF("id", "hour")

val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("result")
  .setNumBuckets(3)

val result = discretizer.fit(df).transform(df)
result.show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala” in the Spark repo.
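To see how quantile-based splits lead to the result column in the table above, here is a plain-Python sketch. It is an illustration under stated assumptions: it computes exact nearest-rank quantiles, whereas the text says Spark uses an approximate algorithm, and the helper names `fit_splits` and `discretize` are hypothetical. Duplicate split points are merged, which is one way fewer buckets than numBuckets can arise.

```python
import math
from bisect import bisect_right


def fit_splits(values, num_buckets):
    """Choose split points at the k/num_buckets quantiles (nearest rank),
    with -inf and +inf as the outer bounds."""
    s = sorted(values)
    inner = set()
    for k in range(1, num_buckets):
        rank = max(1, math.ceil(k / num_buckets * len(s)))
        inner.add(s[rank - 1])
    return [-math.inf] + sorted(inner) + [math.inf]


def discretize(value, splits):
    """Index of the bucket [splits[i], splits[i+1]) that holds a finite value."""
    return float(bisect_right(splits, value) - 1)


hours = [18.0, 19.0, 8.0, 5.0, 2.2]
splits = fit_splits(hours, num_buckets=3)
result = [discretize(h, splits) for h in hours]
print(splits)  # [-inf, 5.0, 18.0, inf]
print(result)  # [2.0, 2.0, 1.0, 1.0, 0.0]
```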
Imputer
The Imputer estimator completes missing values in a dataset, using either the mean or the median of the columns in which the missing values are located. The input columns should be of numeric type. Currently Imputer does not support categorical features and possibly creates incorrect values for columns containing categorical features. Imputer can treat a custom value other than ‘NaN’ as missing via .setMissingValue(custom_value). For example, .setMissingValue(0) will impute all occurrences of 0.
Note all null values in the input columns are treated as missing, and so are also imputed.
Examples
Suppose that we have a DataFrame with the columns a and b:
a | b
------------|-----------
1.0 | Double.NaN
2.0 | Double.NaN
Double.NaN | 3.0
4.0 | 4.0
5.0 | 5.0
In this example, Imputer will replace all occurrences of Double.NaN (the default for the missing value) with the mean (the default imputation strategy) computed from the other values in the corresponding columns. In this example, the surrogate values for columns a and b are 3.0 and 4.0 respectively. After transformation, the missing values in the output columns will be replaced by the surrogate value for the relevant column.
a | b | out_a | out_b
------------|------------|-------|-------
1.0 | Double.NaN | 1.0 | 4.0
2.0 | Double.NaN | 2.0 | 4.0
Double.NaN | 3.0 | 3.0 | 3.0
4.0 | 4.0 | 4.0 | 4.0
5.0 | 5.0 | 5.0 | 5.0
Refer to the Imputer Scala docs for more details on the API.
import org.apache.spark.ml.feature.Imputer

val df = spark.createDataFrame(Seq(
  (1.0, Double.NaN),
  (2.0, Double.NaN),
  (Double.NaN, 3.0),
  (4.0, 4.0),
  (5.0, 5.0)
)).toDF("a", "b")

val imputer = new Imputer()
  .setInputCols(Array("a", "b"))
  .setOutputCols(Array("out_a", "out_b"))

val model = imputer.fit(df)
model.transform(df).show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/ImputerExample.scala” in the Spark repo.
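The surrogate values 3.0 and 4.0 quoted above can be reproduced with a small plain-Python sketch. It is illustrative only: the function name `impute` is hypothetical, and it handles just the default case in which NaN (and None, standing in for null) marks a missing value.

```python
import math
import statistics


def is_missing(x):
    """Treat None and NaN as missing (the default case in this sketch)."""
    return x is None or (isinstance(x, float) and math.isnan(x))


def impute(column, strategy="mean"):
    """Replace missing entries by the mean or median of the present ones."""
    present = [x for x in column if not is_missing(x)]
    if strategy == "mean":
        surrogate = statistics.fmean(present)
    else:
        surrogate = statistics.median(present)
    return [surrogate if is_missing(x) else x for x in column]


a = [1.0, 2.0, math.nan, 4.0, 5.0]
b = [math.nan, math.nan, 3.0, 4.0, 5.0]
out_a = impute(a)
out_b = impute(b)
print(out_a)  # [1.0, 2.0, 3.0, 4.0, 5.0]
print(out_b)  # [4.0, 4.0, 3.0, 4.0, 5.0]
```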
Feature Selectors
VectorSlicer
VectorSlicer is a transformer that takes a feature vector and outputs a new feature vector with a sub-array of the original features. It is useful for extracting features from a vector column.
VectorSlicer accepts a vector column with specified indices, then outputs a new vector column whose values are selected via those indices. There are two types of indices,

  1. Integer indices that represent the indices into the vector, setIndices().
  2. String indices that represent the names of features into the vector, setNames(). This requires the vector column to have an AttributeGroup since the implementation matches on the name field of an Attribute.
Specification by integer and string are both acceptable. Moreover, you can use integer index and string name simultaneously. At least one feature must be selected. Duplicate features are not allowed, so there can be no overlap between selected indices and names. Note that if names of features are selected, an exception will be thrown if empty input attributes are encountered.
The output vector will order features with the selected indices first (in the order given), followed by the selected names (in the order given).
Examples
Suppose that we have a DataFrame with the column userFeatures:

 userFeatures
------------------
 [0.0, 10.0, 0.5]

userFeatures is a vector column that contains three user features. Assume that the first element of userFeatures is always zero, so we want to remove it and select only the last two elements. The VectorSlicer selects the last two elements with setIndices(1, 2) then produces a new vector column named features:

 userFeatures     | features
------------------|-------------
 [0.0, 10.0, 0.5] | [10.0, 0.5]

Suppose also that we have potential input attributes for the userFeatures, i.e. [“f1”, “f2”, “f3”], then we can use setNames(“f2”, “f3”) to select them.

 userFeatures       | features
--------------------|--------------
 [0.0, 10.0, 0.5]   | [10.0, 0.5]
 [“f1”, “f2”, “f3”] | [“f2”, “f3”]

Refer to the VectorSlicer Scala docs for more details on the API.
import java.util.Arrays

import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.StructType

val data = Arrays.asList(
  Row(Vectors.sparse(3, Seq((0, -2.0), (1, 2.3)))),
  Row(Vectors.dense(-2.0, 2.3, 0.0))
)

val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])

val dataset = spark.createDataFrame(data, StructType(Array(attrGroup.toStructField())))

val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")

slicer.setIndices(Array(1)).setNames(Array("f3"))
// or slicer.setIndices(Array(1, 2)), or slicer.setNames(Array("f2", "f3"))

val output = slicer.transform(dataset)
output.show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/VectorSlicerExample.scala” in the Spark repo.
RFormula
RFormula selects columns specified by an R model formula. Currently we support a limited subset of the R operators, including ‘~’, ‘.’, ‘:’, ‘+’, and ‘-’. The basic operators are:
• ~ separates target and terms
• + concatenates terms; “+ 0” means removing the intercept
• - removes a term; “- 1” means removing the intercept
• : interaction (multiplication for numeric values, or binarized categorical values)
• . all columns except the target
Suppose a and b are double columns. The following simple examples illustrate the effect of RFormula:
• y ~ a + b means model y ~ w0 + w1 * a + w2 * b, where w0 is the intercept and w1, w2 are coefficients.
• y ~ a + b + a:b - 1 means model y ~ w1 * a + w2 * b + w3 * a * b, where w1, w2, w3 are coefficients.
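
To make the two formulas concrete, the following plain-Python sketch evaluates each model for given coefficients. The function names and the coefficient values are made up for illustration; RFormula itself only builds the feature and label columns, and a downstream estimator learns the coefficients.

```python
def model_a_plus_b(a, b, w0, w1, w2):
    # y ~ a + b: intercept w0 plus one coefficient per term
    return w0 + w1 * a + w2 * b


def model_with_interaction_no_intercept(a, b, w1, w2, w3):
    # y ~ a + b + a:b - 1: no intercept, and a:b contributes the product a * b
    return w1 * a + w2 * b + w3 * (a * b)


print(model_a_plus_b(2.0, 3.0, w0=1.0, w1=0.5, w2=0.25))
print(model_with_interaction_no_intercept(2.0, 3.0, w1=0.5, w2=0.25, w3=0.1))
```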
RFormula produces a vector column of features and a double or string column of label. As when formulas are used in R for linear regression, numeric columns are cast to doubles. String input columns are first transformed with StringIndexer, using the ordering determined by stringOrderType, and the last category after ordering is dropped; the resulting indices are then one-hot encoded.
Suppose a string feature column contains the values {'b', 'a', 'b', 'a', 'c', 'b'}. We set stringOrderType to control the encoding:

stringOrderType | Category mapped to 0 by StringIndexer | Category dropped by RFormula
----------------|---------------------------------------|----------------------------------
'frequencyDesc' | most frequent category ('b')          | least frequent category ('c')
'frequencyAsc'  | least frequent category ('c')         | most frequent category ('b')
'alphabetDesc'  | last alphabetical category ('c')      | first alphabetical category ('a')
'alphabetAsc'   | first alphabetical category ('a')     | last alphabetical category ('c')
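
The table can be reproduced with a small plain-Python sketch of the indexing step followed by one-hot encoding with the last category dropped. It is an illustration only, not Spark code: the function names are hypothetical, and breaking frequency ties alphabetically is an assumption (the sample data has no ties).

```python
from collections import Counter


def index_categories(values, string_order_type):
    """Map each category to an index, following the given ordering."""
    counts = Counter(values)
    if string_order_type == "frequencyDesc":
        ordered = sorted(counts, key=lambda c: (-counts[c], c))
    elif string_order_type == "frequencyAsc":
        ordered = sorted(counts, key=lambda c: (counts[c], c))
    elif string_order_type == "alphabetDesc":
        ordered = sorted(counts, reverse=True)
    elif string_order_type == "alphabetAsc":
        ordered = sorted(counts)
    else:
        raise ValueError("unknown ordering: " + string_order_type)
    return {category: i for i, category in enumerate(ordered)}


def one_hot_drop_last(value, mapping):
    """One-hot encode a value; the category with the highest index maps to all zeros."""
    size = len(mapping) - 1
    vector = [0.0] * size
    if mapping[value] < size:
        vector[mapping[value]] = 1.0
    return vector


values = ["b", "a", "b", "a", "c", "b"]
for order in ["frequencyDesc", "frequencyAsc", "alphabetDesc", "alphabetAsc"]:
    mapping = index_categories(values, order)
    first = min(mapping, key=mapping.get)
    dropped = max(mapping, key=mapping.get)
    print(order, "maps", first, "to 0 and drops", dropped)
```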

If the label column is of type string, it will be first transformed to double with StringIndexer using frequencyDesc ordering. If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.
Note: The ordering option stringOrderType is NOT used for the label column. When the label column is indexed, it uses the default descending frequency ordering in StringIndexer.
Examples
Assume that we have a DataFrame with the columns id, country, hour, and clicked:

id | country | hour | clicked
---|---------|------|--------
7  | "US"    | 18   | 1.0
8  | "CA"    | 12   | 0.0
9  | "NZ"    | 15   | 0.0

If we use RFormula with a formula string of clicked ~ country + hour, which indicates that we want to predict clicked based on country and hour, after transformation we should get the following DataFrame:

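id | country | hour | clicked | features         | label
---|---------|------|---------|------------------|------
7  | "US"    | 18   | 1.0     | [0.0, 0.0, 18.0] | 1.0
8  | "CA"    | 12   | 0.0     | [0.0, 1.0, 12.0] | 0.0
9  | "NZ"    | 15   | 0.0     | [1.0, 0.0, 15.0] | 0.0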

Refer to the RFormula Scala docs for more details on the API.
import org.apache.spark.ml.feature.RFormula

val dataset = spark.createDataFrame(Seq(
  (7, "US", 18, 1.0),
  (8, "CA", 12, 0.0),
  (9, "NZ", 15, 0.0)
)).toDF("id", "country", "hour", "clicked")

// Predict clicked from country (string, one-hot encoded) and hour (numeric).
val formula = new RFormula()
  .setFormula("clicked ~ country + hour")
  .setFeaturesCol("features")
  .setLabelCol("label")

val output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/RFormulaExample.scala” in the Spark repo.
ChiSqSelector
ChiSqSelector stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the Chi-Squared test of independence to decide which features to choose. It supports five selection methods: numTopFeatures, percentile, fpr, fdr, fwe:
• numTopFeatures chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
• percentile is similar to numTopFeatures but chooses a fraction of all features instead of a fixed number.
• fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
• fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold.
• fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
By default, the selection method is numTopFeatures, with the default number of top features set to 50. The user can choose a selection method using setSelectorType.
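
The three threshold-based methods differ only in how the per-feature p-values are compared with the threshold. The plain-Python sketch below illustrates the standard procedures on made-up p-values; the function names are hypothetical and this is not the Spark implementation.

```python
def select_fpr(p_values, alpha):
    # Keep every feature whose p-value is below the threshold.
    return [i for i, p in enumerate(p_values) if p < alpha]


def select_fwe(p_values, alpha):
    # Same test, but the threshold is scaled by 1 / numFeatures.
    return [i for i, p in enumerate(p_values) if p < alpha / len(p_values)]


def select_fdr(p_values, alpha):
    # Benjamini-Hochberg: find the largest rank k with p_(k) <= alpha * k / m
    # and keep the k features with the smallest p-values.
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


p_values = [0.001, 0.02, 0.045, 0.2]  # made-up p-values for four features
print(select_fpr(p_values, 0.05))
print(select_fdr(p_values, 0.05))
print(select_fwe(p_values, 0.05))
```

On these values fpr keeps three features, fdr two and fwe one, which shows that the methods become stricter in that order for the same threshold.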
Examples
Assume that we have a DataFrame with the columns id, features, and clicked, which is used as our target to be predicted:

id | features              | clicked
---|-----------------------|--------
7  | [0.0, 0.0, 18.0, 1.0] | 1.0
8  | [0.0, 1.0, 12.0, 0.0] | 0.0
9  | [1.0, 0.0, 15.0, 0.1] | 0.0

If we use ChiSqSelector with numTopFeatures = 1, then according to our label clicked the last column in our features is chosen as the most useful feature:

id | features              | clicked | selectedFeatures
---|-----------------------|---------|-----------------
7  | [0.0, 0.0, 18.0, 1.0] | 1.0     | [1.0]
8  | [0.0, 1.0, 12.0, 0.0] | 0.0     | [0.0]
9  | [1.0, 0.0, 15.0, 0.1] | 0.0     | [0.1]

Refer to the ChiSqSelector Scala docs for more details on the API.
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors

// createDataset needs the implicit encoders of the active SparkSession `spark`.
import spark.implicits._

val data = Seq(
  (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)

val df = spark.createDataset(data).toDF("id", "features", "clicked")

// Keep the single feature that scores best in the chi-squared test against clicked.
val selector = new ChiSqSelector()
  .setNumTopFeatures(1)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)

println(s"ChiSqSelector output with top ${selector.getNumTopFeatures} features selected")
result.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/ChiSqSelectorExample.scala” in the Spark repo.
UnivariateFeatureSelector
UnivariateFeatureSelector operates on categorical/continuous labels with categorical/continuous features. Users can set featureType and labelType, and Spark will pick the score function to use based on the specified featureType and labelType.

featureType | labelType   | score function
------------|-------------|----------------------
categorical | categorical | chi-squared (chi2)
continuous  | categorical | ANOVATest (f_classif)
continuous  | continuous  | F-value (f_regression)

It supports five selection modes: numTopFeatures, percentile, fpr, fdr, fwe:
• numTopFeatures chooses a fixed number of top features.
• percentile is similar to numTopFeatures but chooses a fraction of all features instead of a fixed number.
• fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
• fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold.
• fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
By default, the selection mode is numTopFeatures, with the default selectionThreshold set to 50.
Examples
Assume that we have a DataFrame with the columns id, features, and label, which is used as our target to be predicted:

id | features                       | label
---|--------------------------------|------
1  | [1.7, 4.4, 7.6, 5.8, 9.6, 2.3] | 3.0
2  | [8.8, 7.3, 5.7, 7.3, 2.2, 4.1] | 2.0
3  | [1.2, 9.5, 2.5, 3.1, 8.7, 2.5] | 3.0
4  | [3.7, 9.2, 6.1, 4.1, 7.5, 3.8] | 2.0
5  | [8.9, 5.2, 7.8, 8.3, 5.2, 3.0] | 4.0
6  | [7.9, 8.5, 9.2, 4.0, 9.4, 2.1] | 4.0

If we set featureType to continuous and labelType to categorical with numTopFeatures = 1, the last column in our features is chosen as the most useful feature:

id | features                       | label | selectedFeatures
---|--------------------------------|-------|-----------------
1  | [1.7, 4.4, 7.6, 5.8, 9.6, 2.3] | 3.0   | [2.3]
2  | [8.8, 7.3, 5.7, 7.3, 2.2, 4.1] | 2.0   | [4.1]
3  | [1.2, 9.5, 2.5, 3.1, 8.7, 2.5] | 3.0   | [2.5]
4  | [3.7, 9.2, 6.1, 4.1, 7.5, 3.8] | 2.0   | [3.8]
5  | [8.9, 5.2, 7.8, 8.3, 5.2, 3.0] | 4.0   | [3.0]
6  | [7.9, 8.5, 9.2, 4.0, 9.4, 2.1] | 4.0   | [2.1]
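
To see why the last column wins, the one-way ANOVA F-value can be computed per feature by hand. The plain-Python sketch below uses the textbook F statistic (between-group mean square divided by within-group mean square); the function name anova_f is hypothetical and this is not Spark's ANOVATest code. Because every feature here has the same degrees of freedom, the largest F-value corresponds to the smallest p-value.

```python
from collections import defaultdict


def anova_f(values, labels):
    """One-way ANOVA F statistic of a continuous feature against a categorical label."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    ss_between = sum(len(g) * (means[label] - grand_mean) ** 2 for label, g in groups.items())
    ss_within = sum((v - means[label]) ** 2 for label, g in groups.items() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


rows = [
    [1.7, 4.4, 7.6, 5.8, 9.6, 2.3],
    [8.8, 7.3, 5.7, 7.3, 2.2, 4.1],
    [1.2, 9.5, 2.5, 3.1, 8.7, 2.5],
    [3.7, 9.2, 6.1, 4.1, 7.5, 3.8],
    [8.9, 5.2, 7.8, 8.3, 5.2, 3.0],
    [7.9, 8.5, 9.2, 4.0, 9.4, 2.1],
]
labels = [3.0, 2.0, 3.0, 2.0, 4.0, 4.0]

f_values = [anova_f([row[j] for row in rows], labels) for j in range(6)]
best = max(range(6), key=lambda j: f_values[j])
print([round(f, 2) for f in f_values])
print("best feature index:", best)
```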

Refer to the UnivariateFeatureSelector Scala docs for more details on the API.
import org.apache.spark.ml.feature.UnivariateFeatureSelector
import org.apache.spark.ml.linalg.Vectors

// createDataset needs the implicit encoders of the active SparkSession `spark`.
import spark.implicits._

val data = Seq(
  (1, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3), 3.0),
  (2, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1), 2.0),
  (3, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5), 3.0),
  (4, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8), 2.0),
  (5, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0), 4.0),
  (6, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1), 4.0)
)

val df = spark.createDataset(data).toDF("id", "features", "label")

// Continuous features with a categorical label select the ANOVA test (f_classif).
val selector = new UnivariateFeatureSelector()
  .setFeatureType("continuous")
  .setLabelType("categorical")
  .setSelectionMode("numTopFeatures")
  .setSelectionThreshold(1)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)

println(s"UnivariateFeatureSelector output with top ${selector.getSelectionThreshold}" +
  s" features selected using f_classif")
result.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/UnivariateFeatureSelectorExample.scala” in the Spark repo.
VarianceThresholdSelector
VarianceThresholdSelector is a selector that removes low-variance features. Features with a variance not greater than the varianceThreshold will be removed. If not set, varianceThreshold defaults to 0, which means only features with variance 0 (i.e. features that have the same value in all samples) will be removed.
Examples
Assume that we have a DataFrame with the columns id and features (no label column is needed, because the selection depends only on the variance of each feature):

id | features
---|-------------------------------
1  | [6.0, 7.0, 0.0, 7.0, 6.0, 0.0]
2  | [0.0, 9.0, 6.0, 0.0, 5.0, 9.0]
3  | [0.0, 9.0, 3.0, 0.0, 5.0, 5.0]
4  | [0.0, 9.0, 8.0, 5.0, 6.0, 4.0]
5  | [8.0, 9.0, 6.0, 5.0, 4.0, 4.0]
6  | [8.0, 9.0, 6.0, 0.0, 0.0, 0.0]

The sample variances of the 6 features are 16.67, 0.67, 8.17, 10.17, 5.07, and 11.47 respectively. If we use VarianceThresholdSelector with varianceThreshold = 8.0, then the features with variance <= 8.0 (the second and fifth) are removed:

id | features                       | selectedFeatures
---|--------------------------------|------------------
1  | [6.0, 7.0, 0.0, 7.0, 6.0, 0.0] | [6.0,0.0,7.0,0.0]
2  | [0.0, 9.0, 6.0, 0.0, 5.0, 9.0] | [0.0,6.0,0.0,9.0]
3  | [0.0, 9.0, 3.0, 0.0, 5.0, 5.0] | [0.0,3.0,0.0,5.0]
4  | [0.0, 9.0, 8.0, 5.0, 6.0, 4.0] | [0.0,8.0,5.0,4.0]
5  | [8.0, 9.0, 6.0, 5.0, 4.0, 4.0] | [8.0,6.0,5.0,4.0]
6  | [8.0, 9.0, 6.0, 0.0, 0.0, 0.0] | [8.0,6.0,0.0,0.0]
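
The variances quoted above are sample variances (dividing by n - 1), which can be checked with Python's standard library. This plain-Python sketch reproduces the selection without Spark; the helper name select_by_variance is hypothetical.

```python
from statistics import variance  # sample variance, divides by n - 1

rows = [
    [6.0, 7.0, 0.0, 7.0, 6.0, 0.0],
    [0.0, 9.0, 6.0, 0.0, 5.0, 9.0],
    [0.0, 9.0, 3.0, 0.0, 5.0, 5.0],
    [0.0, 9.0, 8.0, 5.0, 6.0, 4.0],
    [8.0, 9.0, 6.0, 5.0, 4.0, 4.0],
    [8.0, 9.0, 6.0, 0.0, 0.0, 0.0],
]


def select_by_variance(rows, threshold):
    """Keep only the columns whose sample variance is greater than the threshold."""
    columns = list(zip(*rows))
    variances = [variance(column) for column in columns]
    kept = [j for j, v in enumerate(variances) if v > threshold]
    selected = [[row[j] for j in kept] for row in rows]
    return variances, kept, selected


variances, kept, selected = select_by_variance(rows, 8.0)
print([round(v, 2) for v in variances])
print(kept)
print(selected[0])
```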

Refer to the VarianceThresholdSelector Scala docs for more details on the API.
import org.apache.spark.ml.feature.VarianceThresholdSelector
import org.apache.spark.ml.linalg.Vectors

// createDataset needs the implicit encoders of the active SparkSession `spark`.
import spark.implicits._

val data = Seq(
  (1, Vectors.dense(6.0, 7.0, 0.0, 7.0, 6.0, 0.0)),
  (2, Vectors.dense(0.0, 9.0, 6.0, 0.0, 5.0, 9.0)),
  (3, Vectors.dense(0.0, 9.0, 3.0, 0.0, 5.0, 5.0)),
  (4, Vectors.dense(0.0, 9.0, 8.0, 5.0, 6.0, 4.0)),
  (5, Vectors.dense(8.0, 9.0, 6.0, 5.0, 4.0, 4.0)),
  (6, Vectors.dense(8.0, 9.0, 6.0, 0.0, 0.0, 0.0))
)

val df = spark.createDataset(data).toDF("id", "features")

val selector = new VarianceThresholdSelector()
  .setVarianceThreshold(8.0)
  .setFeaturesCol("features")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)

println(s"Output: Features with variance not greater than" +
  s" ${selector.getVarianceThreshold} are removed.")
result.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/VarianceThresholdSelectorExample.scala” in the Spark repo.
Locality Sensitive Hashing
Locality Sensitive Hashing (LSH) is an important class of hashing techniques, which is commonly used in clustering, approximate nearest neighbor search and outlier detection with large datasets.
The general idea of LSH is to use a family of functions (“LSH families”) to hash data points into buckets, so that the data points which are close to each other are in the same buckets with high probability, while data points that are far away from each other are very likely in different buckets. An LSH family is formally defined as follows.
In a metric space (M, d), where M is a set and d is a distance function on M, an LSH family is a family of functions h that satisfy the following properties:
\[
\forall p, q \in M, \\
d(p, q) \leq r_1 \Rightarrow \Pr(h(p) = h(q)) \geq p_1 \\
d(p, q) \geq r_2 \Rightarrow \Pr(h(p) = h(q)) \leq p_2
\]
This LSH family is called (r1, r2, p1, p2)-sensitive.
In Spark, different LSH families are implemented in separate classes (e.g., MinHash), and APIs for feature transformation, approximate similarity join and approximate nearest neighbor are provided in each class.
In LSH, we define a false positive as a pair of distant input features (with \(d(p, q) \geq r_2\)) which are hashed into the same bucket, and we define a false negative as a pair of nearby features (with \(d(p, q) \leq r_1\)) which are hashed into different buckets.
LSH Operations
We describe the major types of operations which LSH can be used for. A fitted LSH model has methods for each of these operations.
Feature Transformation
Feature transformation is the basic functionality to add hashed values as a new column. This can be useful for dimensionality reduction. Users can specify input and output column names by setting inputCol and outputCol.
LSH also supports multiple LSH hash tables. Users can specify the number of hash tables by setting numHashTables. This is also used for OR-amplification in approximate similarity join and approximate nearest neighbor. Increasing the number of hash tables will increase the accuracy but will also increase communication cost and running time.
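
The effect of OR-amplification can be estimated with a one-line formula. Assuming, for illustration, that the hash tables are independent and that a pair collides in any single table with probability p, the pair collides in at least one of numHashTables tables with probability 1 - (1 - p)^numHashTables. The function name below is hypothetical.

```python
def or_amplified_collision_probability(p, num_hash_tables):
    """Probability that a pair shares a bucket in at least one table (independence assumed)."""
    return 1.0 - (1.0 - p) ** num_hash_tables


# A nearby pair (p = 0.5) and a distant pair (p = 0.1), with 1, 3 and 5 tables.
for tables in (1, 3, 5):
    near = or_amplified_collision_probability(0.5, tables)
    far = or_amplified_collision_probability(0.1, tables)
    print(tables, round(near, 3), round(far, 3))
```

More tables raise the collision probability for nearby pairs (fewer false negatives), but also for distant pairs, which adds candidates that must be filtered and so increases running time.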
The type of outputCol is Seq[Vector] where the dimension of the array equals numHashTables, and the dimensions of the vectors are currently set to 1. In future releases, we will implement AND-amplification so that users can specify the dimensions of these vectors.
Approximate Similarity Join
Approximate similarity join takes two datasets and approximately returns pairs of rows in the datasets whose distance is smaller than a user-defined threshold. Approximate similarity join supports both joining two different datasets and self-joining. Self-joining will produce some duplicate pairs.
Approximate similarity join accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.
In the joined dataset, the origin datasets can be queried in datasetA and datasetB. A distance column will be added to the output dataset to show the true distance between each pair of rows returned.
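
Conceptually, the join first uses the hash values to find candidate pairs and then keeps only the candidates whose true distance is below the threshold. The plain-Python sketch below illustrates that idea on already-hashed rows. It is a conceptual illustration rather than Spark's distributed implementation, and the function name, the row layout and the hash values are all made up.

```python
import math
from itertools import product


def approx_similarity_join(rows_a, rows_b, threshold, distance):
    """Rows are (id, features, hashes), with one hash value per hash table."""
    joined = []
    for (id_a, x, hashes_a), (id_b, y, hashes_b) in product(rows_a, rows_b):
        # Candidate pairs share a bucket in at least one hash table.
        if any(u == v for u, v in zip(hashes_a, hashes_b)):
            d = distance(x, y)  # the true distance is computed only for candidates
            if d < threshold:
                joined.append((id_a, id_b, d))
    return joined


# Hypothetical, hand-written hash values for two hash tables.
rows_a = [(0, (1.0, 1.0), (0, 1)), (2, (-1.0, -1.0), (-1, -1))]
rows_b = [(4, (1.0, 0.0), (0, 0)), (5, (-1.0, 0.0), (-1, 0))]

print(approx_similarity_join(rows_a, rows_b, 1.5, math.dist))
```

Pairs that never share a bucket are not examined at all, which is what makes the result approximate: a nearby pair can be missed (a false negative), while every returned pair is guaranteed to satisfy the threshold.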
Approximate Nearest Neighbor Search
Approximate nearest neighbor search takes a dataset (of feature vectors) and a key (a single feature vector), and it approximately returns a specified number of rows in the dataset that are closest to the vector.
Approximate nearest neighbor search accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.
A distance column will be added to the output dataset to show the true distance between each output row and the searched key.
Note: Approximate nearest neighbor search will return fewer than k rows when there are not enough candidates in the hash bucket.
LSH Algorithms
Bucketed Random Projection for Euclidean Distance
Bucketed Random Projection is an LSH family for Euclidean distance. The Euclidean distance is defined as follows:
\[
d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}
\]
Its LSH family projects feature vectors x onto a random unit vector v and portions the projected results into hash buckets:
\[
h(x) = \left\lfloor \frac{x \cdot v}{r} \right\rfloor
\]
where r is a user-defined bucket length. The bucket length can be used to control the average size of hash buckets (and thus the number of buckets). A larger bucket length (i.e., fewer buckets) increases the probability of features being hashed to the same bucket (increasing the numbers of true and false positives).
Bucketed Random Projection accepts arbitrary vectors as input features, and supports both sparse and dense vectors.
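
The hash function above is short enough to evaluate by hand. The plain-Python sketch below uses a fixed unit vector v = (0.6, 0.8) in place of a random one so that the result is reproducible; that vector and the function name brp_hash are assumptions made for illustration.

```python
import math


def brp_hash(x, v, bucket_length):
    """h(x) = floor((x . v) / r) for a unit vector v and bucket length r."""
    dot = sum(xi * vi for xi, vi in zip(x, v))
    return math.floor(dot / bucket_length)


v = (0.6, 0.8)  # fixed unit vector standing in for a random projection direction

# With a short bucket length the two points land in different buckets ...
print(brp_hash((1.0, 1.0), v, 2.0), brp_hash((3.0, 3.0), v, 2.0))
# ... and with a longer one they share a bucket.
print(brp_hash((1.0, 1.0), v, 8.0), brp_hash((3.0, 3.0), v, 8.0))
```

This matches the trade-off described above: a larger bucket length merges more points into the same bucket, for nearby and distant pairs alike.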
Refer to the BucketedRandomProjectionLSH Scala docs for more details on the API.
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val dfA = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 1.0)),
  (1, Vectors.dense(1.0, -1.0)),
  (2, Vectors.dense(-1.0, -1.0)),
  (3, Vectors.dense(-1.0, 1.0))
)).toDF("id", "features")

val dfB = spark.createDataFrame(Seq(
  (4, Vectors.dense(1.0, 0.0)),
  (5, Vectors.dense(-1.0, 0.0)),
  (6, Vectors.dense(0.0, 1.0)),
  (7, Vectors.dense(0.0, -1.0))
)).toDF("id", "features")

val key = Vectors.dense(1.0, 0.0)

val brp = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .setInputCol("features")
  .setOutputCol("hashes")

val model = brp.fit(dfA)

// Feature Transformation
println("The hashed dataset where hashed values are stored in the column 'hashes':")
model.transform(dfA).show()

// Compute the locality sensitive hashes for the input rows, then perform approximate
// similarity join.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// model.approxSimilarityJoin(transformedA, transformedB, 1.5)
println("Approximately joining dfA and dfB on Euclidean distance smaller than 1.5:")
model.approxSimilarityJoin(dfA, dfB, 1.5, "EuclideanDistance")
  .select(col("datasetA.id").alias("idA"),
    col("datasetB.id").alias("idB"),
    col("EuclideanDistance")).show()

// Compute the locality sensitive hashes for the input rows, then perform approximate nearest
// neighbor search.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// model.approxNearestNeighbors(transformedA, key, 2)
println("Approximately searching dfA for 2 nearest neighbors of the key:")
model.approxNearestNeighbors(dfA, key, 2).show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala” in the Spark repo.
MinHash for Jaccard Distance
MinHash is an LSH family for Jaccard distance where input features are sets of natural numbers. Jaccard distance of two sets is defined by the cardinality of their intersection and union:
\[
d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}
\]
MinHash applies a random hash function g to each element in the set and takes the minimum of all hashed values:
\[
h(A) = \min_{a \in A} g(a)
\]
The input sets for MinHash are represented as binary vectors, where the vector indices represent the elements themselves and the non-zero values in the vector represent the presence of that element in the set. While both dense and sparse vectors are supported, typically sparse vectors are recommended for efficiency. For example, Vectors.sparse(10, Seq((2, 1.0), (3, 1.0), (5, 1.0))) means there are 10 elements in the space. This set contains element 2, element 3 and element 5. All non-zero values are treated as binary “1” values.
Note: Empty sets cannot be transformed by MinHash, which means any input vector must have at least 1 non-zero entry.
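
Both formulas can be tried out on small sets in plain Python. In the sketch below the hash function g(a) = (3a + 1) mod 7 is a fixed, made-up stand-in for a randomly chosen one, and the function names are hypothetical; Spark picks its own random coefficients.

```python
def jaccard_distance(a, b):
    """d(A, B) = 1 - |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)


def min_hash(elements, coeff_a=3, coeff_b=1, prime=7):
    """h(A) = min over a in A of g(a), with g(a) = (coeff_a * a + coeff_b) mod prime."""
    if not elements:
        raise ValueError("MinHash cannot hash an empty set")
    return min((coeff_a * e + coeff_b) % prime for e in elements)


# The set {2, 3, 5} corresponds to a sparse vector with non-zero entries at 2, 3 and 5.
print(jaccard_distance({0, 1, 2}, {1, 2, 4}))
print(min_hash({2, 3, 5}))
print(min_hash({3, 5}))
```

The empty-set check mirrors the note above: the minimum over an empty set is undefined, so an input vector needs at least one non-zero entry.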
Refer to the MinHashLSH Scala docs for more details on the API.
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Each sparse vector encodes a set: the non-zero indices are the elements of the set.
val dfA = spark.createDataFrame(Seq(
  (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
  (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
  (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
)).toDF("id", "features")

val dfB = spark.createDataFrame(Seq(
  (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0)))),
  (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0)))),
  (5, Vectors.sparse(6, Seq((1, 1.0), (2, 1.0), (4, 1.0))))
)).toDF("id", "features")

val key = Vectors.sparse(6, Seq((1, 1.0), (3, 1.0)))

val mh = new MinHashLSH()
  .setNumHashTables(5)
  .setInputCol("features")
  .setOutputCol("hashes")

val model = mh.fit(dfA)

// Feature Transformation
println("The hashed dataset where hashed values are stored in the column 'hashes':")
model.transform(dfA).show()

// Compute the locality sensitive hashes for the input rows, then perform approximate
// similarity join.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// model.approxSimilarityJoin(transformedA, transformedB, 0.6)
println("Approximately joining dfA and dfB on Jaccard distance smaller than 0.6:")
model.approxSimilarityJoin(dfA, dfB, 0.6, "JaccardDistance")
  .select(col("datasetA.id").alias("idA"),
    col("datasetB.id").alias("idB"),
    col("JaccardDistance")).show()

// Compute the locality sensitive hashes for the input rows, then perform approximate nearest
// neighbor search.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// model.approxNearestNeighbors(transformedA, key, 2)
// It may return fewer than 2 rows when not enough approximate near-neighbor candidates are
// found.
println("Approximately searching dfA for 2 nearest neighbors of the key:")
model.approxNearestNeighbors(dfA, key, 2).show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala” in the Spark repo.
