Extracting, transforming and selecting features
This section covers algorithms for working with features, roughly divided into these groups:
- Extraction: Extracting features from “raw” data
- Transformation: Scaling, converting, or modifying features
- Selection: Selecting a subset from a larger set of features
- Locality Sensitive Hashing (LSH): This class of algorithms combines aspects of feature transformation with other algorithms.
Table of Contents
- Feature Extractors
  - TF-IDF
  - Word2Vec
  - CountVectorizer
  - FeatureHasher
- Feature Transformers
  - Tokenizer
  - StopWordsRemover
  - n-gram
  - Binarizer
  - PCA
  - PolynomialExpansion
  - Discrete Cosine Transform (DCT)
  - StringIndexer
  - IndexToString
  - OneHotEncoder
  - VectorIndexer
  - Interaction
  - Normalizer
  - StandardScaler
  - RobustScaler
  - MinMaxScaler
  - MaxAbsScaler
  - Bucketizer
  - ElementwiseProduct
  - SQLTransformer
  - VectorAssembler
  - VectorSizeHint
  - QuantileDiscretizer
  - Imputer
- Feature Selectors
  - VectorSlicer
  - RFormula
  - ChiSqSelector
  - UnivariateFeatureSelector
  - VarianceThresholdSelector
- Locality Sensitive Hashing
  - LSH Operations
    - Feature Transformation
    - Approximate Similarity Join
    - Approximate Nearest Neighbor Search
  - LSH Algorithms
    - Bucketed Random Projection for Euclidean Distance
    - MinHash for Jaccard Distance
Feature Extractors
TF-IDF
Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. Denote a term by t, a document by d, and the corpus by D. Term frequency TF(t,d) is the number of times that term t appears in document d, while document frequency DF(t,D) is the number of documents that contains term t. If we only use term frequency to measure the importance, it is very easy to over-emphasize terms that appear very often but carry little information about the document, e.g. “a”, “the”, and “of”. If a term appears very often across the corpus, it means it doesn’t carry special information about a particular document. Inverse document frequency is a numerical measure of how much information a term provides:
$$IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1},$$
where |D| is the total number of documents in the corpus. Since logarithm is used, if a term appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to avoid dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF:
$$TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D).$$
There are several variants on the definition of term frequency and document frequency. In MLlib, we separate TF and IDF to make them flexible.
TF: Both HashingTF and CountVectorizer can be used to generate the term frequency vectors.
HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a “set of terms” might be a bag of words. HashingTF utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. The hash function used here is MurmurHash 3. Then term frequencies are calculated based on the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing. To reduce the chance of collision, we can increase the target feature dimension, i.e. the number of buckets of the hash table. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the feature dimension, otherwise the features will not be mapped evenly to the vector indices. The default feature dimension is $2^{18} = 262{,}144$. An optional binary toggle parameter controls term frequency counts. When set to true all nonzero frequency counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.
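For illustration only, a minimal sketch (assuming a DataFrame named wordsData with a tokenized words column, as produced in the example further below) of setting a power-of-two feature dimension and the binary toggle on HashingTF:

import org.apache.spark.ml.feature.HashingTF

// Sketch: 2^18 buckets (the default size) and binary term counts.
val binaryHashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(1 << 18) // power of two, as advised above
  .setBinary(true)         // all nonzero counts become 1

val hashed = binaryHashingTF.transform(wordsData)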
CountVectorizer converts text documents to vectors of term counts. Refer to CountVectorizer for more details.
IDF: IDF is an Estimator which is fit on a dataset and produces an IDFModel. The IDFModel takes feature vectors (generally created from HashingTF or CountVectorizer) and scales each feature. Intuitively, it down-weights features which appear frequently in a corpus.
Note: spark.ml doesn’t provide tools for text segmentation. We refer users to the Stanford NLP Group and scalanlp/chalk.
Examples
In the following code segment, we start with a set of sentences. We split each sentence into words using Tokenizer. For each sentence (bag of words), we use HashingTF to hash the sentence into a feature vector. We use IDF to rescale the feature vectors; this generally improves performance when using text as features. Our feature vectors could then be passed to a learning algorithm.

Refer to the HashingTF Scala docs and the IDF Scala docs for more details on the API.
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)

val featurizedData = hashingTF.transform(wordsData)
// alternatively, CountVectorizer can also be used to get term frequency vectors

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)

val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/TfIdfExample.scala” in the Spark repo.
Word2Vec
Word2Vec is an Estimator which takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a vector using the average of all words in the document; this vector can then be used as features for prediction, document similarity calculations, etc. Please refer to the MLlib user guide on Word2Vec for more details.
Examples
In the following code segment, we start with a set of documents, each of which is represented as a sequence of words. For each document, we transform it into a feature vector. This feature vector could then be passed to a learning algorithm.
Refer to the Word2Vec Scala docs for more details on the API.
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Input data: Each row is a bag of words from a sentence or document.
val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)

val result = model.transform(documentDF)
result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
  println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n") }
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala” in the Spark repo.
CountVectorizer
CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary, and generates a CountVectorizerModel. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA.
During the fitting process, CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. Another optional binary toggle parameter controls the output vector. If set to true all nonzero counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.
Examples
Assume that we have the following DataFrame with columns id and texts:

id | texts
---|------
0  | Array("a", "b", "c")
1  | Array("a", "b", "b", "c", "a")

each row in texts is a document of type Array[String]. Invoking fit of CountVectorizer produces a CountVectorizerModel with vocabulary (a, b, c). Then the output column “vector” after transformation contains:

id | texts                          | vector
---|--------------------------------|--------------------------
0  | Array("a", "b", "c")           | (3,[0,1,2],[1.0,1.0,1.0])
1  | Array("a", "b", "b", "c", "a") | (3,[0,1,2],[2.0,2.0,1.0])

Each vector represents the token counts of the document over the vocabulary.
Refer to the CountVectorizer Scala docs and the CountVectorizerModel Scala docs for more details on the API.
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)

// alternatively, define CountVectorizerModel with a-priori vocabulary
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")

cvModel.transform(df).show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/CountVectorizerExample.scala” in the Spark repo.
FeatureHasher
Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick to map features to indices in the feature vector.
The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:
- Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns using the categoricalCols parameter (see the sketch below).
- String columns: For categorical features, the hash value of the string “column_name=value” is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are “one-hot” encoded (similarly to using OneHotEncoder with dropLast=false).
- Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as “column_name=true” or “column_name=false”, with an indicator value of 1.0.
Null (missing) values are ignored (implicitly zero in the resulting feature vector).
The hash function used here is also the MurmurHash 3 used in HashingTF. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the vector indices.
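As a brief illustrative sketch (not part of the original example; the column names follow the table below), a numeric column can be forced into the categorical treatment via categoricalCols:

import org.apache.spark.ml.feature.FeatureHasher

// Sketch: hash "real" as a categorical feature (e.g. "real=2.2") instead of a numeric one.
val categoricalHasher = new FeatureHasher()
  .setInputCols("real", "bool", "stringNum", "string")
  .setCategoricalCols(Array("real"))
  .setOutputCol("features")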

Examples
Assume that we have a DataFrame with 4 input columns real, bool, stringNum, and string. These different data types as input will illustrate the behavior of the transform to produce a column of feature vectors.

real | bool  | stringNum | string
-----|-------|-----------|-------
2.2  | true  | 1         | foo
3.3  | false | 2         | bar
4.4  | false | 3         | baz
5.5  | false | 4         | foo

Then the output of FeatureHasher.transform on this DataFrame is:

real | bool  | stringNum | string | features
-----|-------|-----------|--------|---------
2.2  | true  | 1         | foo    | (262144,[51871,63643,174475,253195],[1.0,1.0,2.2,1.0])
3.3  | false | 2         | bar    | (262144,[6031,80619,140467,174475],[1.0,1.0,1.0,3.3])
4.4  | false | 3         | baz    | (262144,[24279,140467,174475,196810],[1.0,1.0,4.4,1.0])
5.5  | false | 4         | foo    | (262144,[63643,140467,168512,174475],[1.0,1.0,1.0,5.5])

The resulting feature vectors could then be passed to a learning algorithm.
Refer to the FeatureHasher Scala docs for more details on the API.
import org.apache.spark.ml.feature.FeatureHasher

val dataset = spark.createDataFrame(Seq(
  (2.2, true, "1", "foo"),
  (3.3, false, "2", "bar"),
  (4.4, false, "3", "baz"),
  (5.5, false, "4", "foo")
)).toDF("real", "bool", "stringNum", "string")

val hasher = new FeatureHasher()
  .setInputCols("real", "bool", "stringNum", "string")
  .setOutputCol("features")

val featurized = hasher.transform(dataset)
featurized.show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/FeatureHasherExample.scala” in the Spark repo.
Feature Transformers
Tokenizer
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality. The example below shows how to split sentences into sequences of words.
RegexTokenizer allows more advanced tokenization based on regular expression (regex) matching. By default, the parameter “pattern” (regex, default: “\s+”) is used as delimiters to split the input text. Alternatively, users can set parameter “gaps” to false indicating the regex “pattern” denotes “tokens” rather than splitting gaps, and find all matching occurrences as the tokenization result.
Examples
Refer to the Tokenizer Scala docs and the RegexTokenizer Scala docs for more details on the API.
import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val sentenceDataFrame = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (1, "I wish Java could use case classes"),
  (2, "Logistic,regression,models,are,neat")
)).toDF("id", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val regexTokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")
  .setPattern("\\W") // alternatively .setPattern("\\w+").setGaps(false)

val countTokens = udf { (words: Seq[String]) => words.length }

val tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("sentence", "words")
  .withColumn("tokens", countTokens(col("words"))).show(false)

val regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words")
  .withColumn("tokens", countTokens(col("words"))).show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/TokenizerExample.scala” in the Spark repo.
StopWordsRemover
Stop words are words which should be excluded from the input, typically because the words appear frequently and don’t carry as much meaning.
StopWordsRemover takes as input a sequence of strings (e.g. the output of a Tokenizer) and drops all the stop words from the input sequences. The list of stopwords is specified by the stopWords parameter. Default stop words for some languages are accessible by calling StopWordsRemover.loadDefaultStopWords(language), for which available options are “danish”, “dutch”, “english”, “finnish”, “french”, “german”, “hungarian”, “italian”, “norwegian”, “portuguese”, “russian”, “spanish”, “swedish” and “turkish”. A boolean parameter caseSensitive indicates if the matches should be case sensitive (false by default).
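For illustration only (not from the original guide), a minimal sketch of loading one of the built-in non-English stop-word lists and enabling case-sensitive matching:

import org.apache.spark.ml.feature.StopWordsRemover

// Sketch: use the built-in Spanish stop words and match case-sensitively.
val spanishRemover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")
  .setStopWords(StopWordsRemover.loadDefaultStopWords("spanish"))
  .setCaseSensitive(true)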
Examples
Assume that we have the following DataFrame with columns id and raw:

id | raw
---|----
0  | [I, saw, the, red, balloon]
1  | [Mary, had, a, little, lamb]

Applying StopWordsRemover with raw as the input column and filtered as the output column, we should get the following:

id | raw                          | filtered
---|------------------------------|---------
0  | [I, saw, the, red, balloon]  | [saw, red, balloon]
1  | [Mary, had, a, little, lamb] | [Mary, little, lamb]

In filtered, the stop words “I”, “the”, “had”, and “a” have been filtered out.
Refer to the StopWordsRemover Scala docs for more details on the API.
import org.apache.spark.ml.feature.StopWordsRemover

val remover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")

val dataSet = spark.createDataFrame(Seq(
  (0, Seq("I", "saw", "the", "red", "balloon")),
  (1, Seq("Mary", "had", "a", "little", "lamb"))
)).toDF("id", "raw")

remover.transform(dataSet).show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/StopWordsRemoverExample.scala” in the Spark repo.
n-gram
An n-gram is a sequence of n tokens (typically words) for some integer n. The NGram class can be used to transform input features into n-grams.
NGram takes as input a sequence of strings (e.g. the output of a Tokenizer). The parameter n is used to determine the number of terms in each n-gram. The output will consist of a sequence of n-grams where each n-gram is represented by a space-delimited string of n consecutive words. If the input sequence contains fewer than n strings, no output is produced.
Examples
Refer to the NGram Scala docs for more details on the API.
import org.apache.spark.ml.feature.NGram

val wordDataFrame = spark.createDataFrame(Seq(
  (0, Array("Hi", "I", "heard", "about", "Spark")),
  (1, Array("I", "wish", "Java", "could", "use", "case", "classes")),
  (2, Array("Logistic", "regression", "models", "are", "neat"))
)).toDF("id", "words")

val ngram = new NGram().setN(2).setInputCol("words").setOutputCol("ngrams")

val ngramDataFrame = ngram.transform(wordDataFrame)
ngramDataFrame.select("ngrams").show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/NGramExample.scala” in the Spark repo.
Binarizer
Binarization is the process of thresholding numerical features to binary (0/1) features.
Binarizer takes the common parameters inputCol and outputCol, as well as the threshold for binarization. Feature values greater than the threshold are binarized to 1.0; values equal to or less than the threshold are binarized to 0.0. Both Vector and Double types are supported for inputCol.
Examples
Refer to the Binarizer Scala docs for more details on the API.
import org.apache.spark.ml.feature.Binarizer

val data = Array((0, 0.1), (1, 0.8), (2, 0.2))
val dataFrame = spark.createDataFrame(data).toDF("id", "feature")

val binarizer: Binarizer = new Binarizer()
  .setInputCol("feature")
  .setOutputCol("binarized_feature")
  .setThreshold(0.5)

val binarizedDataFrame = binarizer.transform(dataFrame)

println(s"Binarizer output with Threshold = ${binarizer.getThreshold}")
binarizedDataFrame.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/BinarizerExample.scala” in the Spark repo.
PCA
PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A PCA class trains a model to project vectors to a low-dimensional space using PCA. The example below shows how to project 5-dimensional feature vectors into 3-dimensional principal components.
Examples
Refer to the PCA Scala docs for more details on the API.
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors

val data = Array(
Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)

val result = pca.transform(df).select("pcaFeatures")
result.show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/PCAExample.scala” in the Spark repo.
PolynomialExpansion
Polynomial expansion is the process of expanding your features into a polynomial space, which is formulated by an n-degree combination of original dimensions. A PolynomialExpansion class provides this functionality. The example below shows how to expand your features into a 3-degree polynomial space.
Examples
Refer to the PolynomialExpansion Scala docs for more details on the API.
import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.ml.linalg.Vectors

val data = Array(
Vectors.dense(2.0, 1.0),
Vectors.dense(0.0, 0.0),
Vectors.dense(3.0, -1.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val polyExpansion = new PolynomialExpansion()
.setInputCol("features")
.setOutputCol("polyFeatures")
.setDegree(3)

val polyDF = polyExpansion.transform(df)
polyDF.show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/PolynomialExpansionExample.scala” in the Spark repo.
Discrete Cosine Transform (DCT)
The Discrete Cosine Transform transforms a length N real-valued sequence in the time domain into another length N real-valued sequence in the frequency domain. A DCT class provides this functionality, implementing the DCT-II and scaling the result by $1/\sqrt{2}$ such that the representing matrix for the transform is unitary. No shift is applied to the transformed sequence (e.g. the 0th element of the transformed sequence is the 0th DCT coefficient and not the N/2th).
Examples
Refer to the DCT Scala docs for more details on the API.
import org.apache.spark.ml.feature.DCT
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
Vectors.dense(0.0, 1.0, -2.0, 3.0),
Vectors.dense(-1.0, 2.0, 4.0, -7.0),
Vectors.dense(14.0, -2.0, -5.0, 1.0))

val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val dct = new DCT()
  .setInputCol("features")
  .setOutputCol("featuresDCT")
  .setInverse(false)

val dctDf = dct.transform(df)
dctDf.select("featuresDCT").show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/DCTExample.scala” in the Spark repo.
StringIndexer
StringIndexer encodes a string column of labels to a column of label indices. StringIndexer can encode multiple columns. The indices are in [0, numLabels), and four ordering options are supported: “frequencyDesc”: descending order by label frequency (most frequent label assigned 0), “frequencyAsc”: ascending order by label frequency (least frequent label assigned 0), “alphabetDesc”: descending alphabetical order, and “alphabetAsc”: ascending alphabetical order (default = “frequencyDesc”). Note that in case of equal frequency when under “frequencyDesc”/”frequencyAsc”, the strings are further sorted by alphabet.
The unseen labels will be put at index numLabels if user chooses to keep them. If the input column is numeric, we cast it to string and index the string values. When downstream pipeline components such as Estimator or Transformer make use of this string-indexed label, you must set the input column of the component to this string-indexed column name. In many cases, you can set the input column with setInputCol.
Examples
Assume that we have the following DataFrame with columns id and category:

id | category
---|---------
0  | a
1  | b
2  | c
3  | a
4  | a
5  | c

category is a string column with three labels: “a”, “b”, and “c”. Applying StringIndexer with category as the input column and categoryIndex as the output column, we should get the following:

id | category | categoryIndex
---|----------|--------------
0  | a        | 0.0
1  | b        | 2.0
2  | c        | 1.0
3  | a        | 0.0
4  | a        | 0.0
5  | c        | 1.0

“a” gets index 0 because it is the most frequent, followed by “c” with index 1 and “b” with index 2.
Additionally, there are three strategies regarding how StringIndexer will handle unseen labels when you have fit a StringIndexer on one dataset and then use it to transform another:
- throw an exception (which is the default)
- skip the row containing the unseen label entirely
- put unseen labels in a special additional bucket, at index numLabels
Examples
Let’s go back to our previous example but this time reuse our previously defined StringIndexer on the following dataset:

id | category
---|---------
0  | a
1  | b
2  | c
3  | d
4  | e

If you’ve not set how StringIndexer handles unseen labels or set it to “error”, an exception will be thrown. However, if you had called setHandleInvalid(“skip”), the following dataset will be generated:

id | category | categoryIndex
---|----------|--------------
0  | a        | 0.0
1  | b        | 2.0
2  | c        | 1.0

Notice that the rows containing “d” or “e” do not appear.
If you call setHandleInvalid(“keep”), the following dataset will be generated:

id | category | categoryIndex
---|----------|--------------
0  | a        | 0.0
1  | b        | 2.0
2  | c        | 1.0
3  | d        | 3.0
4  | e        | 3.0

Notice that the rows containing “d” or “e” are mapped to index “3.0”
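As a hedged sketch (assuming two DataFrames named trainDF and testDF shaped like the tables above, which are not defined in the original example), this behavior is selected with setHandleInvalid:

import org.apache.spark.ml.feature.StringIndexer

// Sketch: keep unseen labels by assigning them to the extra bucket at index numLabels.
val keepIndexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("keep") // or "skip" / "error"

val keepModel = keepIndexer.fit(trainDF)       // fit on the first dataset
val indexedTest = keepModel.transform(testDF)  // "d" and "e" map to index 3.0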
Refer to the StringIndexer Scala docs for more details on the API.
import org.apache.spark.ml.feature.StringIndexer

val df = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

val indexed = indexer.fit(df).transform(df)
indexed.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/StringIndexerExample.scala” in the Spark repo.
IndexToString
Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings. A common use case is to produce indices from labels with StringIndexer, train a model with those indices and retrieve the original labels from the column of predicted indices with IndexToString. However, you are free to supply your own labels.
Examples
Building on the StringIndexer example, let’s assume we have the following DataFrame with columns id and categoryIndex:

id | categoryIndex
---|--------------
0  | 0.0
1  | 2.0
2  | 1.0
3  | 0.0
4  | 0.0
5  | 1.0

Applying IndexToString with categoryIndex as the input column, originalCategory as the output column, we are able to retrieve our original labels (they will be inferred from the columns’ metadata):

id | categoryIndex | originalCategory
---|---------------|-----------------
0  | 0.0           | a
1  | 2.0           | b
2  | 1.0           | c
3  | 0.0           | a
4  | 0.0           | a
5  | 1.0           | c

Refer to the IndexToString Scala docs for more details on the API.
import org.apache.spark.ml.attribute.Attribute
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

val df = spark.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c")
)).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)

println(s"Transformed string column '${indexer.getInputCol}' " +
  s"to indexed column '${indexer.getOutputCol}'")
indexed.show()

val inputColSchema = indexed.schema(indexer.getOutputCol)
println(s"StringIndexer will store labels in output column metadata: " +
  s"${Attribute.fromStructField(inputColSchema).toString}\n")

val converter = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory")

val converted = converter.transform(indexed)

println(s"Transformed indexed column '${converter.getInputCol}' back to original string " +
  s"column '${converter.getOutputCol}' using labels in metadata")
converted.select("id", "categoryIndex", "originalCategory").show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/IndexToStringExample.scala” in the Spark repo.
OneHotEncoder
One-hot encoding maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using StringIndexer first.
OneHotEncoder can transform multiple columns, returning a one-hot-encoded output vector column for each input column. It is common to merge these vectors into a single feature vector using VectorAssembler.
OneHotEncoder supports the handleInvalid parameter to choose how to handle invalid input during transforming data. Available options include ‘keep’ (any invalid inputs are assigned to an extra categorical index) and ‘error’ (throw an error).
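As a small sketch (illustrative parameter choices, not from the original example), these options are set with setHandleInvalid and, if desired, setDropLast:

import org.apache.spark.ml.feature.OneHotEncoder

// Sketch: send invalid category indices to an extra vector slot and keep the last category.
val tolerantEncoder = new OneHotEncoder()
  .setInputCols(Array("categoryIndex1", "categoryIndex2"))
  .setOutputCols(Array("categoryVec1", "categoryVec2"))
  .setHandleInvalid("keep")
  .setDropLast(false)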
Examples
Refer to the OneHotEncoder Scala docs for more details on the API.
import org.apache.spark.ml.feature.OneHotEncoder

val df = spark.createDataFrame(Seq(
(0.0, 1.0),
(1.0, 0.0),
(2.0, 1.0),
(0.0, 2.0),
(0.0, 1.0),
(2.0, 0.0)
)).toDF("categoryIndex1", "categoryIndex2")

val encoder = new OneHotEncoder()
  .setInputCols(Array("categoryIndex1", "categoryIndex2"))
  .setOutputCols(Array("categoryVec1", "categoryVec2"))
val model = encoder.fit(df)

val encoded = model.transform(df)
encoded.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/OneHotEncoderExample.scala” in the Spark repo.
VectorIndexer
VectorIndexer helps index categorical features in datasets of Vectors. It can both automatically decide which features are categorical and convert original values to category indices. Specifically, it does the following:

1. Take an input column of type Vector and a parameter maxCategories.
2. Decide which features should be categorical based on the number of distinct values, where features with at most maxCategories are declared categorical.
3. Compute 0-based category indices for each categorical feature.
4. Index categorical features and transform original feature values to indices.
Indexing categorical features allows algorithms such as Decision Trees and Tree Ensembles to treat categorical features appropriately, improving performance.
Examples
In the example below, we read in a dataset of labeled points and then use VectorIndexer to decide which features should be treated as categorical. We transform the categorical feature values to their indices. This transformed data could then be passed to algorithms such as DecisionTreeRegressor that handle categorical features.
Refer to the VectorIndexer Scala docs for more details on the API.
import org.apache.spark.ml.feature.VectorIndexer

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexed")
  .setMaxCategories(10)

val indexerModel = indexer.fit(data)

val categoricalFeatures: Set[Int] = indexerModel.categoryMaps.keys.toSet
println(s"Chose ${categoricalFeatures.size} " +
  s"categorical features: ${categoricalFeatures.mkString(", ")}")

// Create new column “indexed” with categorical values transformed to indices
val indexedData = indexerModel.transform(data)
indexedData.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/VectorIndexerExample.scala” in the Spark repo.
Interaction
Interaction is a Transformer which takes vector or double-valued columns, and generates a single vector column that contains the product of all combinations of one value from each input column.
For example, if you have 2 vector type columns each of which has 3 dimensions as input columns, then you’ll get a 9-dimensional vector as the output column.
Examples
Assume that we have the following DataFrame with the columns “id1”, “vec1”, and “vec2”:

id1 | vec1           | vec2
----|----------------|---------------
1   | [1.0,2.0,3.0]  | [8.0,4.0,5.0]
2   | [4.0,3.0,8.0]  | [7.0,9.0,8.0]
3   | [6.0,1.0,9.0]  | [2.0,3.0,6.0]
4   | [10.0,8.0,6.0] | [9.0,4.0,5.0]
5   | [9.0,2.0,7.0]  | [10.0,7.0,3.0]
6   | [1.0,1.0,4.0]  | [2.0,8.0,4.0]

Applying Interaction with those input columns, then interactedCol as the output column contains:

id1 | vec1           | vec2           | interactedCol
----|----------------|----------------|------------------------------------------------------
1   | [1.0,2.0,3.0]  | [8.0,4.0,5.0]  | [8.0,4.0,5.0,16.0,8.0,10.0,24.0,12.0,15.0]
2   | [4.0,3.0,8.0]  | [7.0,9.0,8.0]  | [56.0,72.0,64.0,42.0,54.0,48.0,112.0,144.0,128.0]
3   | [6.0,1.0,9.0]  | [2.0,3.0,6.0]  | [36.0,54.0,108.0,6.0,9.0,18.0,54.0,81.0,162.0]
4   | [10.0,8.0,6.0] | [9.0,4.0,5.0]  | [360.0,160.0,200.0,288.0,128.0,160.0,216.0,96.0,120.0]
5   | [9.0,2.0,7.0]  | [10.0,7.0,3.0] | [450.0,315.0,135.0,100.0,70.0,30.0,350.0,245.0,105.0]
6   | [1.0,1.0,4.0]  | [2.0,8.0,4.0]  | [12.0,48.0,24.0,12.0,48.0,24.0,48.0,192.0,96.0]

Refer to the Interaction Scala docs for more details on the API.
import org.apache.spark.ml.feature.Interaction
import org.apache.spark.ml.feature.VectorAssembler

val df = spark.createDataFrame(Seq(
  (1, 1, 2, 3, 8, 4, 5),
  (2, 4, 3, 8, 7, 9, 8),
  (3, 6, 1, 9, 2, 3, 6),
  (4, 10, 8, 6, 9, 4, 5),
  (5, 9, 2, 7, 10, 7, 3),
  (6, 1, 1, 4, 2, 8, 4)
)).toDF("id1", "id2", "id3", "id4", "id5", "id6", "id7")

val assembler1 = new VectorAssembler().
  setInputCols(Array("id2", "id3", "id4")).
  setOutputCol("vec1")

val assembled1 = assembler1.transform(df)

val assembler2 = new VectorAssembler().
  setInputCols(Array("id5", "id6", "id7")).
  setOutputCol("vec2")

val assembled2 = assembler2.transform(assembled1).select("id1", "vec1", "vec2")

val interaction = new Interaction()
  .setInputCols(Array("id1", "vec1", "vec2"))
  .setOutputCol("interactedCol")

val interacted = interaction.transform(assembled2)

interacted.show(truncate = false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/InteractionExample.scala” in the Spark repo.
Normalizer
Normalizer is a Transformer which transforms a dataset of Vector rows, normalizing each Vector to have unit norm. It takes parameter p, which specifies the p-norm used for normalization. (p = 2 by default.) This normalization can help standardize your input data and improve the behavior of learning algorithms.
Examples
The following example demonstrates how to load a dataset in libsvm format and then normalize each row to have unit $L^1$ norm and unit $L^\infty$ norm.
Refer to the Normalizer Scala docs for more details on the API.
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.linalg.Vectors

val dataFrame = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.5, -1.0)),
  (1, Vectors.dense(2.0, 1.0, 1.0)),
  (2, Vectors.dense(4.0, 10.0, 2.0))
)).toDF("id", "features")

// Normalize each Vector using L^1 norm.
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(1.0)

val l1NormData = normalizer.transform(dataFrame)
println("Normalized using L^1 norm")
l1NormData.show()

// Normalize each Vector using L^inf norm.
val lInfNormData = normalizer.transform(dataFrame, normalizer.p -> Double.PositiveInfinity)
println("Normalized using L^inf norm")
lInfNormData.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/NormalizerExample.scala” in the Spark repo.
StandardScaler
StandardScaler transforms a dataset of Vector rows, normalizing each feature to have unit standard deviation and/or zero mean. It takes parameters:
- withStd: True by default. Scales the data to unit standard deviation.
- withMean: False by default. Centers the data with mean before scaling. It will build a dense output, so take care when applying to sparse input.
StandardScaler is an Estimator which can be fit on a dataset to produce a StandardScalerModel; this amounts to computing summary statistics. The model can then transform a Vector column in a dataset to have unit standard deviation and/or zero mean features.
Note that if the standard deviation of a feature is zero, it will return default 0.0 value in the Vector for that feature.
Examples
The following example demonstrates how to load a dataset in libsvm format and then normalize each feature to have unit standard deviation.
Refer to the StandardScaler Scala docs for more details on the API.
import org.apache.spark.ml.feature.StandardScaler

val dataFrame = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val scaler = new StandardScaler()
.setInputCol("features")
.setOutputCol("scaledFeatures")
.setWithStd(true)
.setWithMean(false)

// Compute summary statistics by fitting the StandardScaler.
val scalerModel = scaler.fit(dataFrame)

// Normalize each feature to have unit standard deviation.
val scaledData = scalerModel.transform(dataFrame)
scaledData.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/StandardScalerExample.scala” in the Spark repo.
RobustScaler
RobustScaler transforms a dataset of Vector rows, removing the median and scaling the data according to a specific quantile range (by default the IQR: Interquartile Range, quantile range between the 1st quartile and the 3rd quartile). Its behavior is quite similar to StandardScaler, however the median and the quantile range are used instead of mean and standard deviation, which make it robust to outliers. It takes parameters:
- lower: 0.25 by default. Lower quantile to calculate quantile range, shared by all features.
- upper: 0.75 by default. Upper quantile to calculate quantile range, shared by all features.
- withScaling: True by default. Scales the data to quantile range.
- withCentering: False by default. Centers the data with median before scaling. It will build a dense output, so take care when applying to sparse input.
RobustScaler is an Estimator which can be fit on a dataset to produce a RobustScalerModel; this amounts to computing quantile statistics. The model can then transform a Vector column in a dataset to have unit quantile range and/or zero median features.
Note that if the quantile range of a feature is zero, it will return default 0.0 value in the Vector for that feature.
Examples
The following example demonstrates how to load a dataset in libsvm format and then normalize each feature to have unit quantile range.
Refer to the RobustScaler Scala docs for more details on the API.
import org.apache.spark.ml.feature.RobustScaler

val dataFrame = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val scaler = new RobustScaler()
.setInputCol("features")
.setOutputCol("scaledFeatures")
.setWithScaling(true)
.setWithCentering(false)
.setLower(0.25)
.setUpper(0.75)

// Compute summary statistics by fitting the RobustScaler.
val scalerModel = scaler.fit(dataFrame)

// Transform each feature to have unit quantile range.
val scaledData = scalerModel.transform(dataFrame)
scaledData.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/RobustScalerExample.scala” in the Spark repo.
MinMaxScaler
MinMaxScaler transforms a dataset of Vector rows, rescaling each feature to a specific range (often [0, 1]). It takes parameters:
- min: 0.0 by default. Lower bound after transformation, shared by all features.
- max: 1.0 by default. Upper bound after transformation, shared by all features.
MinMaxScaler computes summary statistics on a data set and produces a MinMaxScalerModel. The model can then transform each feature individually such that it is in the given range.
The rescaled value for a feature E is calculated as,
$$\text{Rescaled}(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} \cdot (max - min) + min$$
For the case $E_{max} == E_{min}$, $\text{Rescaled}(e_i) = 0.5 \cdot (max + min)$.
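As a quick worked example of the formula (values chosen purely for illustration): with $E_{min} = 1.0$, $E_{max} = 3.0$ and the default range $[min, max] = [0, 1]$, a feature value $e_i = 2.0$ is rescaled to
$$\frac{2.0 - 1.0}{3.0 - 1.0} \cdot (1 - 0) + 0 = 0.5.$$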
Note that since zero values will probably be transformed to non-zero values, output of the transformer will be DenseVector even for sparse input.
Examples
The following example demonstrates how to load a dataset in libsvm format and then rescale each feature to [0, 1].
Refer to the MinMaxScaler Scala docs and the MinMaxScalerModel Scala docs for more details on the API.
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors

val dataFrame = spark.createDataFrame(Seq(
(0, Vectors.dense(1.0, 0.1, -1.0)),
(1, Vectors.dense(2.0, 1.1, 1.0)),
(2, Vectors.dense(3.0, 10.1, 3.0))
)).toDF("id", "features")

val scaler = new MinMaxScaler()
.setInputCol("features")
.setOutputCol("scaledFeatures")

// Compute summary statistics and generate MinMaxScalerModel
val scalerModel = scaler.fit(dataFrame)

// rescale each feature to range [min, max].
val scaledData = scalerModel.transform(dataFrame)
println(s"Features scaled to range: [${scaler.getMin}, ${scaler.getMax}]")
scaledData.select("features", "scaledFeatures").show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/MinMaxScalerExample.scala” in the Spark repo.
MaxAbsScaler
MaxAbsScaler transforms a dataset of Vector rows, rescaling each feature to range [-1, 1] by dividing through the maximum absolute value in each feature. It does not shift/center the data, and thus does not destroy any sparsity.
MaxAbsScaler computes summary statistics on a data set and produces a MaxAbsScalerModel. The model can then transform each feature individually to range [-1, 1].
Examples
The following example demonstrates how to load a dataset in libsvm format and then rescale each feature to [-1, 1].
Refer to the MaxAbsScaler Scala docs and the MaxAbsScalerModel Scala docs for more details on the API.
import org.apache.spark.ml.feature.MaxAbsScaler
import org.apache.spark.ml.linalg.Vectors

val dataFrame = spark.createDataFrame(Seq(
(0, Vectors.dense(1.0, 0.1, -8.0)),
(1, Vectors.dense(2.0, 1.0, -4.0)),
(2, Vectors.dense(4.0, 10.0, 8.0))
)).toDF("id", "features")

val scaler = new MaxAbsScaler()
.setInputCol("features")
.setOutputCol("scaledFeatures")

// Compute summary statistics and generate MaxAbsScalerModel
val scalerModel = scaler.fit(dataFrame)

// rescale each feature to range [-1, 1]
val scaledData = scalerModel.transform(dataFrame)
scaledData.select("features", "scaledFeatures").show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/MaxAbsScalerExample.scala” in the Spark repo.
Bucketizer
Bucketizer transforms a column of continuous features to a column of feature buckets, where the buckets are specified by users. It takes a parameter:
- splits: Parameter for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. Splits should be strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; otherwise, values outside the splits specified will be treated as errors. Two examples of splits are Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity) and Array(0.0, 1.0, 2.0).
Note that if you have no idea of the upper and lower bounds of the targeted column, you should add Double.NegativeInfinity and Double.PositiveInfinity as the bounds of your splits to prevent a potential out of Bucketizer bounds exception.
Note also that the splits that you provided have to be in strictly increasing order, i.e. s0 < s1 < s2 < … < sn.
More details can be found in the API docs for Bucketizer.
Examples
The following example demonstrates how to bucketize a column of Doubles into another index-wised column.
Refer to the Bucketizer Scala docs for more details on the API.
import org.apache.spark.ml.feature.Bucketizer

val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)

val data = Array(-999.9, -0.5, -0.3, 0.0, 0.2, 999.9)
val dataFrame = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val bucketizer = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(splits)

// Transform original data into its bucket index.
val bucketedData = bucketizer.transform(dataFrame)

println(s"Bucketizer output with ${bucketizer.getSplits.length-1} buckets")
bucketedData.show()

val splitsArray = Array(
Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity),
Array(Double.NegativeInfinity, -0.3, 0.0, 0.3, Double.PositiveInfinity))

val data2 = Array(
(-999.9, -999.9),
(-0.5, -0.2),
(-0.3, -0.1),
(0.0, 0.0),
(0.2, 0.4),
(999.9, 999.9))
val dataFrame2 = spark.createDataFrame(data2).toDF("features1", "features2")

val bucketizer2 = new Bucketizer()
.setInputCols(Array("features1", "features2"))
.setOutputCols(Array("bucketedFeatures1", "bucketedFeatures2"))
.setSplitsArray(splitsArray)

// Transform original data into its bucket index.
val bucketedData2 = bucketizer2.transform(dataFrame2)

println(s"Bucketizer output with [" +
  s"${bucketizer2.getSplitsArray(0).length-1}, " +
  s"${bucketizer2.getSplitsArray(1).length-1}] buckets for each input column")
bucketedData2.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/BucketizerExample.scala” in the Spark repo.
ElementwiseProduct
ElementwiseProduct multiplies each input vector by a provided “weight” vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This represents the Hadamard product between the input vector, v and transforming vector, w, to yield a result vector.
$$\begin{pmatrix} v_1 \\ \vdots \\ v_N \end{pmatrix} \circ \begin{pmatrix} w_1 \\ \vdots \\ w_N \end{pmatrix} = \begin{pmatrix} v_1 w_1 \\ \vdots \\ v_N w_N \end{pmatrix}$$
Examples
The example below demonstrates how to transform vectors using a transforming vector value.
Refer to the ElementwiseProduct Scala docs for more details on the API.
import org.apache.spark.ml.feature.ElementwiseProduct
import org.apache.spark.ml.linalg.Vectors

// Create some vector data; also works for sparse vectors
val dataFrame = spark.createDataFrame(Seq(
  ("a", Vectors.dense(1.0, 2.0, 3.0)),
  ("b", Vectors.dense(4.0, 5.0, 6.0)))).toDF("id", "vector")

val transformingVector = Vectors.dense(0.0, 1.0, 2.0)
val transformer = new ElementwiseProduct()
  .setScalingVec(transformingVector)
  .setInputCol("vector")
  .setOutputCol("transformedVector")

// Batch transform the vectors to create new column:
transformer.transform(dataFrame).show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/ElementwiseProductExample.scala” in the Spark repo.
SQLTransformer
SQLTransformer implements the transformations which are defined by SQL statement. Currently, we only support SQL syntax like "SELECT … FROM __THIS__ …" where "__THIS__" represents the underlying table of the input dataset. The select clause specifies the fields, constants, and expressions to display in the output, and can be any select clause that Spark SQL supports. Users can also use Spark SQL built-in functions and UDFs to operate on these selected columns. For example, SQLTransformer supports statements like:
- SELECT a, a + b AS a_b FROM __THIS__
- SELECT a, SQRT(b) AS b_sqrt FROM __THIS__ where a > 5
- SELECT a, b, SUM(c) AS c_sum FROM __THIS__ GROUP BY a, b
Examples
Assume that we have the following DataFrame with columns id, v1 and v2:

id | v1  | v2
---|-----|----
0  | 1.0 | 3.0
2  | 2.0 | 5.0

This is the output of the SQLTransformer with statement "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__":

id | v1  | v2  | v3  | v4
---|-----|-----|-----|-----
0  | 1.0 | 3.0 | 4.0 | 3.0
2  | 2.0 | 5.0 | 7.0 | 10.0

Refer to the SQLTransformer Scala docs for more details on the API.
import org.apache.spark.ml.feature.SQLTransformer

val df = spark.createDataFrame(
  Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")

val sqlTrans = new SQLTransformer().setStatement(
  "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")

sqlTrans.transform(df).show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/SQLTransformerExample.scala” in the Spark repo.
VectorAssembler
VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees. VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. In each row, the values of the input columns will be concatenated into a vector in the specified order.
Examples
Assume that we have a DataFrame with the columns id, hour, mobile, userFeatures, and clicked:

id | hour | mobile | userFeatures     | clicked
---|------|--------|------------------|--------
0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0

userFeatures is a vector column that contains three user features. We want to combine hour, mobile, and userFeatures into a single feature vector called features and use it to predict clicked or not. If we set VectorAssembler’s input columns to hour, mobile, and userFeatures and output column to features, after transformation we should get the following DataFrame:

id | hour | mobile | userFeatures     | clicked | features
---|------|--------|------------------|---------|----------------------------
0  | 18   | 1.0    | [0.0, 10.0, 0.5] | 1.0     | [18.0, 1.0, 0.0, 10.0, 0.5]

Refer to the VectorAssembler Scala docs for more details on the API.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

val dataset = spark.createDataFrame(
  Seq((0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0))
).toDF("id", "hour", "mobile", "userFeatures", "clicked")

val assembler = new VectorAssembler()
  .setInputCols(Array("hour", "mobile", "userFeatures"))
  .setOutputCol("features")

val output = assembler.transform(dataset)
println("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
output.select("features", "clicked").show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/VectorAssemblerExample.scala” in the Spark repo.
VectorSizeHint
It can sometimes be useful to explicitly specify the size of the vectors for a column of VectorType. For example, VectorAssembler uses size information from its input columns to produce size information and metadata for its output column. While in some cases this information can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are not available until the stream is started. VectorSizeHint allows a user to explicitly specify the vector size for a column so that VectorAssembler, or other transformers that might need to know vector size, can use that column as an input.
To use VectorSizeHint a user must set the inputCol and size parameters. Applying this transformer to a dataframe produces a new dataframe with updated metadata for inputCol specifying the vector size. Downstream operations on the resulting dataframe can get this size using the metadata.
VectorSizeHint can also take an optional handleInvalid parameter which controls its behaviour when the vector column contains nulls or vectors of the wrong size. By default handleInvalid is set to “error”, indicating an exception should be thrown. This parameter can also be set to “skip”, indicating that rows containing invalid values should be filtered out from the resulting dataframe, or “optimistic”, indicating that the column should not be checked for invalid values and all rows should be kept. Note that the use of “optimistic” can cause the resulting dataframe to be in an inconsistent state, meaning the metadata for the column VectorSizeHint was applied to does not match the contents of that column. Users should take care to avoid this kind of inconsistent state.
Refer to the VectorSizeHint Scala docs for more details on the API.
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
import org.apache.spark.ml.linalg.Vectors

val dataset = spark.createDataFrame(
  Seq(
    (0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0),
    (0, 18, 1.0, Vectors.dense(0.0, 10.0), 0.0))
).toDF("id", "hour", "mobile", "userFeatures", "clicked")

val sizeHint = new VectorSizeHint()
  .setInputCol("userFeatures")
  .setHandleInvalid("skip")
  .setSize(3)

val datasetWithSize = sizeHint.transform(dataset)
println("Rows where 'userFeatures' is not the right size are filtered out")
datasetWithSize.show(false)

val assembler = new VectorAssembler()
  .setInputCols(Array("hour", "mobile", "userFeatures"))
  .setOutputCol("features")

// This dataframe can be used by downstream transformers as before
val output = assembler.transform(datasetWithSize)
println("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
output.select("features", "clicked").show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/VectorSizeHintExample.scala” in the Spark repo.
QuantileDiscretizer
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins is set by the numBuckets parameter. It is possible that the number of buckets used will be smaller than this value, for example, if there are too few distinct values of the input to create enough distinct quantiles.
NaN values: NaN values will be removed from the column during QuantileDiscretizer fitting. This will produce a Bucketizer model for making predictions. During the transformation, Bucketizer will raise an error when it finds NaN values in the dataset, but the user can also choose to either keep or remove NaN values within the dataset by setting handleInvalid. If the user chooses to keep NaN values, they will be handled specially and placed into their own bucket, for example, if 4 buckets are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
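A minimal sketch of the "keep" behaviour described above, using hypothetical data in which the last hour value is NaN:
import org.apache.spark.ml.feature.QuantileDiscretizer

// Hypothetical data: one NaN in the "hour" column.
val dfWithNaN = spark.createDataFrame(
  Seq((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, Double.NaN))
).toDF("id", "hour")

val keepDiscretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("result")
  .setNumBuckets(3)
  .setHandleInvalid("keep") // NaN rows are kept and placed in their own extra bucket

keepDiscretizer.fit(dfWithNaN).transform(dfWithNaN).show()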
Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for approxQuantile for a detailed description). The precision of the approximation can be controlled with the relativeError parameter. When set to zero, exact quantiles are calculated (Note: Computing exact quantiles is an expensive operation). The lower and upper bin bounds will be -Infinity and +Infinity covering all real values.
Examples
Assume that we have a DataFrame with the columns id, hour:

id | hour
---|-----
0  | 18.0
1  | 19.0
2  | 8.0
3  | 5.0
4  | 2.2

hour is a continuous feature with Double type. We want to turn the continuous feature into a categorical one. Given numBuckets = 3, we should get the following DataFrame:

id | hour | result
---|------|-------
0  | 18.0 | 2.0
1  | 19.0 | 2.0
2  | 8.0  | 1.0
3  | 5.0  | 1.0
4  | 2.2  | 0.0

? Scala
? Java
? Python
Refer to the QuantileDiscretizer Scala docs for more details on the API.
import org.apache.spark.ml.feature.QuantileDiscretizer

val data = Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
val df = spark.createDataFrame(data).toDF("id", "hour")

val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("result")
  .setNumBuckets(3)

val result = discretizer.fit(df).transform(df)
result.show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala” in the Spark repo.
Imputer
The Imputer estimator completes missing values in a dataset, using either the mean or the median of the columns in which the missing values are located. The input columns should be of numeric type. Currently Imputer does not support categorical features and may create incorrect values for columns containing categorical features. Imputer can impute custom values other than 'NaN' via .setMissingValue(custom_value). For example, .setMissingValue(0) will impute all occurrences of 0.
Note all null values in the input columns are treated as missing, and so are also imputed.
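A minimal sketch of the median strategy combined with a custom missing-value marker, using hypothetical data in which -1.0 denotes a missing measurement:
import org.apache.spark.ml.feature.Imputer

// Hypothetical data: -1.0 marks a missing value instead of NaN.
val raw = spark.createDataFrame(Seq(
  (1.0, -1.0),
  (2.0, 8.0),
  (-1.0, 6.0)
)).toDF("x", "y")

val medianImputer = new Imputer()
  .setInputCols(Array("x", "y"))
  .setOutputCols(Array("x_out", "y_out"))
  .setStrategy("median")   // use the median instead of the default mean
  .setMissingValue(-1.0)   // treat -1.0 (rather than NaN) as missing

medianImputer.fit(raw).transform(raw).show()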
Examples
Suppose that we have a DataFrame with the columns a and b:
a | b
------------|-----------
1.0 | Double.NaN
2.0 | Double.NaN
Double.NaN | 3.0
4.0 | 4.0
5.0 | 5.0
In this example, Imputer will replace all occurrences of Double.NaN (the default for the missing value) with the mean (the default imputation strategy) computed from the other values in the corresponding columns. In this example, the surrogate values for columns a and b are 3.0 and 4.0 respectively. After transformation, the missing values in the output columns will be replaced by the surrogate value for the relevant column.
a | b | out_a | out_b
------------|------------|-------|-------
1.0 | Double.NaN | 1.0 | 4.0
2.0 | Double.NaN | 2.0 | 4.0
Double.NaN | 3.0 | 3.0 | 3.0
4.0 | 4.0 | 4.0 | 4.0
5.0 | 5.0 | 5.0 | 5.0
? Scala
? Java
? Python
Refer to the Imputer Scala docs for more details on the API.
import org.apache.spark.ml.feature.Imputer

val df = spark.createDataFrame(Seq(
  (1.0, Double.NaN),
  (2.0, Double.NaN),
  (Double.NaN, 3.0),
  (4.0, 4.0),
  (5.0, 5.0)
)).toDF("a", "b")

val imputer = new Imputer()
  .setInputCols(Array("a", "b"))
  .setOutputCols(Array("out_a", "out_b"))

val model = imputer.fit(df)
model.transform(df).show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/ImputerExample.scala” in the Spark repo.
Feature Selectors
VectorSlicer
VectorSlicer is a transformer that takes a feature vector and outputs a new feature vector with a sub-array of the original features. It is useful for extracting features from a vector column.
VectorSlicer accepts a vector column with specified indices, then outputs a new vector column whose values are selected via those indices. There are two types of indices,

1. Integer indices that represent the indices into the vector, setIndices().
2. String indices that represent the names of features in the vector, setNames(). This requires the vector column to have an AttributeGroup since the implementation matches on the name field of an Attribute.
Specification by integer and string are both acceptable. Moreover, you can use integer indices and string names simultaneously. At least one feature must be selected. Duplicate features are not allowed, so there can be no overlap between selected indices and names. Note that if names of features are selected, an exception will be thrown if empty input attributes are encountered.
The output vector will order features with the selected indices first (in the order given), followed by the selected names (in the order given).
Examples
Suppose that we have a DataFrame with the column userFeatures:

userFeatures
[0.0, 10.0, 0.5]
userFeatures is a vector column that contains three user features. Assume that the first column of userFeatures are all zeros, so we want to remove it and select only the last two columns. The VectorSlicer selects the last two elements with setIndices(1, 2) then produces a new vector column named features:

userFeatures     | features
-----------------|------------
[0.0, 10.0, 0.5] | [10.0, 0.5]

Suppose also that we have potential input attributes for the userFeatures, i.e. ["f1", "f2", "f3"], then we can use setNames("f2", "f3") to select them.

userFeatures       | features
-------------------|-------------
[0.0, 10.0, 0.5]   | [10.0, 0.5]
["f1", "f2", "f3"] | ["f2", "f3"]

? Scala
? Java
? Python
Refer to the VectorSlicer Scala docs for more details on the API.
import java.util.Arrays

import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.StructType

val data = Arrays.asList(
  Row(Vectors.sparse(3, Seq((0, -2.0), (1, 2.3)))),
  Row(Vectors.dense(-2.0, 2.3, 0.0))
)

val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])

val dataset = spark.createDataFrame(data, StructType(Array(attrGroup.toStructField())))

val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")

slicer.setIndices(Array(1)).setNames(Array("f3"))
// or slicer.setIndices(Array(1, 2)), or slicer.setNames(Array("f2", "f3"))

val output = slicer.transform(dataset)
output.show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/VectorSlicerExample.scala” in the Spark repo.
RFormula
RFormula selects columns specified by an R model formula. Currently we support a limited subset of the R operators, including ‘~’, ‘.’, ‘:’, ‘+’, and ‘-‘. The basic operators are:
? ~ separate target and terms
? + concat terms, “+ 0” means removing intercept
? - remove a term, “- 1” means removing intercept
? : interaction (multiplication for numeric values, or binarized categorical values)
? . all columns except target
Suppose a and b are double columns, we use the following simple examples to illustrate the effect of RFormula:
? y ~ a + b means model y ~ w0 + w1 * a + w2 * b where w0 is the intercept and w1, w2 are coefficients.
? y ~ a + b + a:b - 1 means model y ~ w1 * a + w2 * b + w3 * a * b where w1, w2, w3 are coefficients.
RFormula produces a vector column of features and a double or string column of label. Like when formulas are used in R for linear regression, numeric columns will be cast to doubles. As to string input columns, they will first be transformed with StringIndexer using ordering determined by stringOrderType, and the last category after ordering is dropped, then the doubles will be one-hot encoded.
Suppose a string feature column contains the values {'b', 'a', 'b', 'a', 'c', 'b'}; we set stringOrderType to control the encoding:

stringOrderType | Category mapped to 0 by StringIndexer | Category dropped by RFormula
----------------|----------------------------------------|------------------------------
'frequencyDesc' | most frequent category ('b')           | least frequent category ('c')
'frequencyAsc'  | least frequent category ('c')          | most frequent category ('b')
'alphabetDesc'  | last alphabetical category ('c')       | first alphabetical category ('a')
'alphabetAsc'   | first alphabetical category ('a')      | last alphabetical category ('c')

If the label column is of type string, it will be first transformed to double with StringIndexer using frequencyDesc ordering. If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.
Note: The ordering option stringOrderType is NOT used for the label column. When the label column is indexed, it uses the default descending frequency ordering in StringIndexer.
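For feature columns, the ordering described above is set through RFormula's stringIndexerOrderType parameter. A minimal sketch, hypothetically reusing the column names from the example below:
import org.apache.spark.ml.feature.RFormula

// Hypothetical configuration: encode categories in descending alphabetical order
// instead of the default "frequencyDesc".
val alphaFormula = new RFormula()
  .setFormula("clicked ~ country + hour")
  .setStringIndexerOrderType("alphabetDesc")
  .setFeaturesCol("features")
  .setLabelCol("label")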
Examples
Assume that we have a DataFrame with the columns id, country, hour, and clicked:

id | country | hour | clicked
---|---------|------|--------
7  | "US"    | 18   | 1.0
8  | "CA"    | 12   | 0.0
9  | "NZ"    | 15   | 0.0

If we use RFormula with a formula string of clicked ~ country + hour, which indicates that we want to predict clicked based on country and hour, after transformation we should get the following DataFrame:

id | country | hour | clicked | features         | label
---|---------|------|---------|------------------|------
7  | "US"    | 18   | 1.0     | [0.0, 0.0, 18.0] | 1.0
8  | "CA"    | 12   | 0.0     | [0.0, 1.0, 12.0] | 0.0
9  | "NZ"    | 15   | 0.0     | [1.0, 0.0, 15.0] | 0.0

? Scala
? Java
? Python
Refer to the RFormula Scala docs for more details on the API.
import org.apache.spark.ml.feature.RFormula

val dataset = spark.createDataFrame(Seq(
  (7, "US", 18, 1.0),
  (8, "CA", 12, 0.0),
  (9, "NZ", 15, 0.0)
)).toDF("id", "country", "hour", "clicked")

val formula = new RFormula()
  .setFormula("clicked ~ country + hour")
  .setFeaturesCol("features")
  .setLabelCol("label")

val output = formula.fit(dataset).transform(dataset)
output.select("features", "label").show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/RFormulaExample.scala” in the Spark repo.
ChiSqSelector
ChiSqSelector stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the Chi-Squared test of independence to decide which features to choose. It supports five selection methods: numTopFeatures, percentile, fpr, fdr, fwe:
? numTopFeatures chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
? percentile is similar to numTopFeatures but chooses a fraction of all features instead of a fixed number.
? fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
? fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold.
? fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
By default, the selection method is numTopFeatures, with the default number of top features set to 50. The user can choose a selection method using setSelectorType.
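As a minimal sketch (reusing the hypothetical column names from the example below), switching to the fpr method might look like this:
import org.apache.spark.ml.feature.ChiSqSelector

// Hypothetical configuration: keep every feature whose chi-squared p-value is below 0.05.
val fprSelector = new ChiSqSelector()
  .setSelectorType("fpr")
  .setFpr(0.05)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")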
Examples
Assume that we have a DataFrame with the columns id, features, and clicked, which is used as our target to be predicted:

id | features              | clicked
---|-----------------------|--------
7  | [0.0, 0.0, 18.0, 1.0] | 1.0
8  | [0.0, 1.0, 12.0, 0.0] | 0.0
9  | [1.0, 0.0, 15.0, 0.1] | 0.0

If we use ChiSqSelector with numTopFeatures = 1, then according to our label clicked the last column in our features is chosen as the most useful feature:

id | features              | clicked | selectedFeatures
---|-----------------------|---------|-----------------
7  | [0.0, 0.0, 18.0, 1.0] | 1.0     | [1.0]
8  | [0.0, 1.0, 12.0, 0.0] | 0.0     | [0.0]
9  | [1.0, 0.0, 15.0, 0.1] | 0.0     | [0.1]

? Scala
? Java
? Python
Refer to the ChiSqSelector Scala docs for more details on the API.
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  (7, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
  (8, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
  (9, Vectors.dense(1.0, 0.0, 15.0, 0.1), 0.0)
)

val df = spark.createDataset(data).toDF("id", "features", "clicked")

val selector = new ChiSqSelector()
  .setNumTopFeatures(1)
  .setFeaturesCol("features")
  .setLabelCol("clicked")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)

println(s"ChiSqSelector output with top ${selector.getNumTopFeatures} features selected")
result.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/ChiSqSelectorExample.scala” in the Spark repo.
UnivariateFeatureSelector
UnivariateFeatureSelector operates on categorical/continuous labels with categorical/continuous features. Users can set featureType and labelType, and Spark will pick the score function to use based on the specified featureType and labelType.

featureType | labelType   | score function
------------|-------------|-------------------------
categorical | categorical | chi-squared (chi2)
continuous  | categorical | ANOVATest (f_classif)
continuous  | continuous  | F-value (f_regression)

It supports five selection modes: numTopFeatures, percentile, fpr, fdr, fwe:
? numTopFeatures chooses a fixed number of top features.
? percentile is similar to numTopFeatures but chooses a fraction of all features instead of a fixed number.
? fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
? fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold.
? fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
By default, the selection mode is numTopFeatures, with the default selectionThreshold set to 50.
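A minimal sketch of the percentile mode (hypothetical column names; the numTopFeatures mode is shown in the example below):
import org.apache.spark.ml.feature.UnivariateFeatureSelector

// Hypothetical configuration: keep the top 50% of features ranked by the chosen score function.
val percentileSelector = new UnivariateFeatureSelector()
  .setFeatureType("continuous")
  .setLabelType("categorical")
  .setSelectionMode("percentile")
  .setSelectionThreshold(0.5)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")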
Examples
Assume that we have a DataFrame with the columns id, features, and label, which is used as our target to be predicted:

id | features                       | label
---|--------------------------------|------
1  | [1.7, 4.4, 7.6, 5.8, 9.6, 2.3] | 3.0
2  | [8.8, 7.3, 5.7, 7.3, 2.2, 4.1] | 2.0
3  | [1.2, 9.5, 2.5, 3.1, 8.7, 2.5] | 3.0
4  | [3.7, 9.2, 6.1, 4.1, 7.5, 3.8] | 2.0
5  | [8.9, 5.2, 7.8, 8.3, 5.2, 3.0] | 4.0
6  | [7.9, 8.5, 9.2, 4.0, 9.4, 2.1] | 4.0

If we set featureType to continuous and labelType to categorical with numTopFeatures = 1, the last column in our features is chosen as the most useful feature:

id | features                       | label | selectedFeatures
---|--------------------------------|-------|-----------------
1  | [1.7, 4.4, 7.6, 5.8, 9.6, 2.3] | 3.0   | [2.3]
2  | [8.8, 7.3, 5.7, 7.3, 2.2, 4.1] | 2.0   | [4.1]
3  | [1.2, 9.5, 2.5, 3.1, 8.7, 2.5] | 3.0   | [2.5]
4  | [3.7, 9.2, 6.1, 4.1, 7.5, 3.8] | 2.0   | [3.8]
5  | [8.9, 5.2, 7.8, 8.3, 5.2, 3.0] | 4.0   | [3.0]
6  | [7.9, 8.5, 9.2, 4.0, 9.4, 2.1] | 4.0   | [2.1]

? Scala
? Java
? Python
Refer to the UnivariateFeatureSelector Scala docs for more details on the API.
import org.apache.spark.ml.feature.UnivariateFeatureSelector
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  (1, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3), 3.0),
  (2, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1), 2.0),
  (3, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5), 3.0),
  (4, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8), 2.0),
  (5, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0), 4.0),
  (6, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1), 4.0)
)

val df = spark.createDataset(data).toDF("id", "features", "label")

val selector = new UnivariateFeatureSelector()
  .setFeatureType("continuous")
  .setLabelType("categorical")
  .setSelectionMode("numTopFeatures")
  .setSelectionThreshold(1)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)

println(s"UnivariateFeatureSelector output with top ${selector.getSelectionThreshold}" +
s" features selected using f_classif")
result.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/UnivariateFeatureSelectorExample.scala” in the Spark repo.
VarianceThresholdSelector
VarianceThresholdSelector is a selector that removes low-variance features. Features with a variance not greater than the varianceThreshold will be removed. If not set, varianceThreshold defaults to 0, which means only features with variance 0 (i.e. features that have the same value in all samples) will be removed.
Examples
Assume that we have a DataFrame with the columns id and features:

id | features
---|-------------------------------
1  | [6.0, 7.0, 0.0, 7.0, 6.0, 0.0]
2  | [0.0, 9.0, 6.0, 0.0, 5.0, 9.0]
3  | [0.0, 9.0, 3.0, 0.0, 5.0, 5.0]
4  | [0.0, 9.0, 8.0, 5.0, 6.0, 4.0]
5  | [8.0, 9.0, 6.0, 5.0, 4.0, 4.0]
6  | [8.0, 9.0, 6.0, 0.0, 0.0, 0.0]

The variance for the 6 features are 16.67, 0.67, 8.17, 10.17, 5.07, and 11.47 respectively. If we use VarianceThresholdSelector with varianceThreshold = 8.0, then the features with variance <= 8.0 are removed:

id | features                       | selectedFeatures
---|--------------------------------|---------------------
1  | [6.0, 7.0, 0.0, 7.0, 6.0, 0.0] | [6.0, 0.0, 7.0, 0.0]
2  | [0.0, 9.0, 6.0, 0.0, 5.0, 9.0] | [0.0, 6.0, 0.0, 9.0]
3  | [0.0, 9.0, 3.0, 0.0, 5.0, 5.0] | [0.0, 3.0, 0.0, 5.0]
4  | [0.0, 9.0, 8.0, 5.0, 6.0, 4.0] | [0.0, 8.0, 5.0, 4.0]
5  | [8.0, 9.0, 6.0, 5.0, 4.0, 4.0] | [8.0, 6.0, 5.0, 4.0]
6  | [8.0, 9.0, 6.0, 0.0, 0.0, 0.0] | [8.0, 6.0, 0.0, 0.0]
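As a check on these numbers (assuming the unbiased sample variance, which reproduces the values above): the first feature has values 6.0, 0.0, 0.0, 0.0, 8.0, 8.0 with mean 22/6 ≈ 3.67, so
s² = ∑_i (x_i − x̄)² / (n − 1) = ((6 − 3.67)² + 3·(0 − 3.67)² + 2·(8 − 3.67)²) / 5 ≈ 16.67,
which matches the reported variance, and that feature is therefore kept under the 8.0 threshold.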

? Scala
? Java
? Python
Refer to the VarianceThresholdSelector Scala docs for more details on the API.
import org.apache.spark.ml.feature.VarianceThresholdSelector
import org.apache.spark.ml.linalg.Vectors

val data = Seq(
  (1, Vectors.dense(6.0, 7.0, 0.0, 7.0, 6.0, 0.0)),
  (2, Vectors.dense(0.0, 9.0, 6.0, 0.0, 5.0, 9.0)),
  (3, Vectors.dense(0.0, 9.0, 3.0, 0.0, 5.0, 5.0)),
  (4, Vectors.dense(0.0, 9.0, 8.0, 5.0, 6.0, 4.0)),
  (5, Vectors.dense(8.0, 9.0, 6.0, 5.0, 4.0, 4.0)),
  (6, Vectors.dense(8.0, 9.0, 6.0, 0.0, 0.0, 0.0))
)

val df = spark.createDataset(data).toDF("id", "features")

val selector = new VarianceThresholdSelector()
  .setVarianceThreshold(8.0)
  .setFeaturesCol("features")
  .setOutputCol("selectedFeatures")

val result = selector.fit(df).transform(df)

println(s"Output: Features with variance lower than" +
s" ${selector.getVarianceThreshold} are removed.")
result.show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/VarianceThresholdSelectorExample.scala” in the Spark repo.
Locality Sensitive Hashing
Locality Sensitive Hashing (LSH) is an important class of hashing techniques, which is commonly used in clustering, approximate nearest neighbor search and outlier detection with large datasets.
The general idea of LSH is to use a family of functions (“LSH families”) to hash data points into buckets, so that the data points which are close to each other are in the same buckets with high probability, while data points that are far away from each other are very likely in different buckets. An LSH family is formally defined as follows.
In a metric space (M, d), where M is a set and d is a distance function on M, an LSH family is a family of functions h that satisfy the following properties:
∀ p, q ∈ M:
  d(p,q) ≤ r1 ⇒ Pr(h(p) = h(q)) ≥ p1
  d(p,q) ≥ r2 ⇒ Pr(h(p) = h(q)) ≤ p2
This LSH family is called (r1, r2, p1, p2)-sensitive.
In Spark, different LSH families are implemented in separate classes (e.g., MinHash), and APIs for feature transformation, approximate similarity join and approximate nearest neighbor are provided in each class.
In LSH, we define a false positive as a pair of distant input features (with d(p,q) ≥ r2) which are hashed into the same bucket, and we define a false negative as a pair of nearby features (with d(p,q) ≤ r1) which are hashed into different buckets.
LSH Operations
We describe the major types of operations which LSH can be used for. A fitted LSH model has methods for each of these operations.
Feature Transformation
Feature transformation is the basic functionality to add hashed values as a new column. This can be useful for dimensionality reduction. Users can specify input and output column names by setting inputCol and outputCol.
LSH also supports multiple LSH hash tables. Users can specify the number of hash tables by setting numHashTables. This is also used for OR-amplification in approximate similarity join and approximate nearest neighbor. Increasing the number of hash tables will increase the accuracy but will also increase communication cost and running time.
The type of outputCol is Seq[Vector] where the dimension of the array equals numHashTables, and the dimensions of the vectors are currently set to 1. In future releases, we will implement AND-amplification so that users can specify the dimensions of these vectors.
Approximate Similarity Join
Approximate similarity join takes two datasets and approximately returns pairs of rows in the datasets whose distance is smaller than a user-defined threshold. Approximate similarity join supports both joining two different datasets and self-joining. Self-joining will produce some duplicate pairs.
Approximate similarity join accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.
In the joined dataset, the origin datasets can be queried in datasetA and datasetB. A distance column will be added to the output dataset to show the true distance between each pair of rows returned.
Approximate Nearest Neighbor Search
Approximate nearest neighbor search takes a dataset (of feature vectors) and a key (a single feature vector), and it approximately returns a specified number of rows in the dataset that are closest to the vector.
Approximate nearest neighbor search accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.
A distance column will be added to the output dataset to show the true distance between each output row and the searched key.
Note: Approximate nearest neighbor search will return fewer than k rows when there are not enough candidates in the hash bucket.
LSH Algorithms
Bucketed Random Projection for Euclidean Distance
Bucketed Random Projection is an LSH family for Euclidean distance. The Euclidean distance is defined as follows:
d(x, y) = √( ∑_i (x_i − y_i)² )
Its LSH family projects feature vectors x onto a random unit vector v and portions the projected results into hash buckets:
h(x) = ⌊ (x · v) / r ⌋
where r is a user-defined bucket length. The bucket length can be used to control the average size of hash buckets (and thus the number of buckets). A larger bucket length (i.e., fewer buckets) increases the probability of features being hashed to the same bucket (increasing the numbers of true and false positives).
Bucketed Random Projection accepts arbitrary vectors as input features, and supports both sparse and dense vectors.
? Scala
? Java
? Python
Refer to the BucketedRandomProjectionLSH Scala docs for more details on the API.
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val dfA = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 1.0)),
  (1, Vectors.dense(1.0, -1.0)),
  (2, Vectors.dense(-1.0, -1.0)),
  (3, Vectors.dense(-1.0, 1.0))
)).toDF("id", "features")

val dfB = spark.createDataFrame(Seq(
  (4, Vectors.dense(1.0, 0.0)),
  (5, Vectors.dense(-1.0, 0.0)),
  (6, Vectors.dense(0.0, 1.0)),
  (7, Vectors.dense(0.0, -1.0))
)).toDF("id", "features")

val key = Vectors.dense(1.0, 0.0)

val brp = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .setInputCol("features")
  .setOutputCol("hashes")

val model = brp.fit(dfA)

// Feature Transformation
println("The hashed dataset where hashed values are stored in the column 'hashes':")
model.transform(dfA).show()

// Compute the locality sensitive hashes for the input rows, then perform approximate
// similarity join.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// model.approxSimilarityJoin(transformedA, transformedB, 1.5)
println("Approximately joining dfA and dfB on Euclidean distance smaller than 1.5:")
model.approxSimilarityJoin(dfA, dfB, 1.5, "EuclideanDistance")
  .select(col("datasetA.id").alias("idA"),
    col("datasetB.id").alias("idB"),
    col("EuclideanDistance")).show()

// Compute the locality sensitive hashes for the input rows, then perform approximate nearest
// neighbor search.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// model.approxNearestNeighbors(transformedA, key, 2)
println("Approximately searching dfA for 2 nearest neighbors of the key:")
model.approxNearestNeighbors(dfA, key, 2).show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala” in the Spark repo.
MinHash for Jaccard Distance
MinHash is an LSH family for Jaccard distance where input features are sets of natural numbers. Jaccard distance of two sets is defined by the cardinality of their intersection and union:
d(A, B) = 1 − |A ∩ B| / |A ∪ B|
MinHash applies a random hash function g to each element in the set and takes the minimum of all hashed values:
h(A) = min_{a ∈ A} g(a)
The input sets for MinHash are represented as binary vectors, where the vector indices represent the elements themselves and the non-zero values in the vector represent the presence of that element in the set. While both dense and sparse vectors are supported, typically sparse vectors are recommended for efficiency. For example, Vectors.sparse(10, Array((2, 1.0), (3, 1.0), (5, 1.0))) means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5. All non-zero values are treated as binary "1" values.
Note: Empty sets cannot be transformed by MinHash, which means any input vector must have at least 1 non-zero entry.
? Scala
? Java
? Python
Refer to the MinHashLSH Scala docs for more details on the API.
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val dfA = spark.createDataFrame(Seq(
  (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
  (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
  (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
)).toDF("id", "features")

val dfB = spark.createDataFrame(Seq(
  (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0)))),
  (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0)))),
  (5, Vectors.sparse(6, Seq((1, 1.0), (2, 1.0), (4, 1.0))))
)).toDF("id", "features")

val key = Vectors.sparse(6, Seq((1, 1.0), (3, 1.0)))

val mh = new MinHashLSH()
  .setNumHashTables(5)
  .setInputCol("features")
  .setOutputCol("hashes")

val model = mh.fit(dfA)

// Feature Transformation
println("The hashed dataset where hashed values are stored in the column 'hashes':")
model.transform(dfA).show()

// Compute the locality sensitive hashes for the input rows, then perform approximate
// similarity join.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// model.approxSimilarityJoin(transformedA, transformedB, 0.6)
println("Approximately joining dfA and dfB on Jaccard distance smaller than 0.6:")
model.approxSimilarityJoin(dfA, dfB, 0.6, "JaccardDistance")
  .select(col("datasetA.id").alias("idA"),
    col("datasetB.id").alias("idB"),
    col("JaccardDistance")).show()

// Compute the locality sensitive hashes for the input rows, then perform approximate nearest
// neighbor search.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// model.approxNearestNeighbors(transformedA, key, 2)
// It may return less than 2 rows when not enough approximate near-neighbor candidates are
// found.
println("Approximately searching dfA for 2 nearest neighbors of the key:")
model.approxNearestNeighbors(dfA, key, 2).show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala” in the Spark repo.
