A Complete List of Spark SQL Functions
org.apache.spark.sql.functions is an object that provides roughly two hundred functions. Most of them behave much like their Hive counterparts. Apart from the UDF helpers, all of them can be used directly in spark-sql. After import org.apache.spark.sql.functions._ they can also be used with DataFrame and Dataset.
Version: 2.3.0
Most functions that take a Column also accept a String column name, and almost all of them return a Column. There are a lot of functions; they are all listed below.
聚合函數(shù)
approx_count_distinct: approximate count of distinct values
avg: average
collect_list: collects the values of a column into a list
collect_set: collects the values of a column into a set
corr: Pearson correlation coefficient of two columns
count: count
countDistinct: distinct count; in SQL: select count(distinct class)
covar_pop: population covariance
covar_samp: sample covariance
first: first element of the group
last: last element of the group
grouping: indicates whether a column is aggregated in a cube/rollup
grouping_id: returns the level of grouping
kurtosis: computes the kurtosis
skewness: computes the skewness
max: maximum
min: minimum
mean: average
stddev: alias for stddev_samp
stddev_samp: sample standard deviation
stddev_pop: population standard deviation
sum: sum
sumDistinct: sum over distinct values; in SQL: select sum(distinct class)
var_pop: population variance
var_samp: sample (unbiased) variance
variance: alias for var_samp

Collection functions
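A quick sketch of this group before the list (assumes a SparkSession with spark.implicits._ and org.apache.spark.sql.functions._ imported; data is made up):

```scala
// JSON helpers:
val js = Seq("""{"a":1,"b":2}""").toDF("json")
js.select(get_json_object($"json", "$.a"),
          json_tuple($"json", "a", "b")).show()

// Array helpers:
val arr = Seq((1, Seq("y", "x"))).toDF("id", "tags")
arr.select($"id", explode($"tags")).show()   // one row per element
arr.select(array_contains($"tags", "x"), size($"tags"), sort_array($"tags")).show()
```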
array_contains(column, value): checks whether an array column contains the given element
explode: expands an array or map into multiple rows
explode_outer: like explode, but produces a null row when the array or map is null or empty
posexplode: like explode, with a position index
posexplode_outer: like explode_outer, with a position index
from_json: parses a JSON string into a StructType or ArrayType; several overloads, see the docs
to_json: converts to a JSON string; supports StructType, ArrayType of StructTypes, MapType, and ArrayType of MapTypes
get_json_object(column, path): extracts the JSON object at the given JSON path as a string, e.g. select get_json_object('{"a":1,"b":2}', '$.a'); see this [introduction to JSON Path](http://blog.csdn.net/koflance/article/details/63262484)
json_tuple(column, fields): extracts the given fields from a JSON string, e.g. select json_tuple('{"a":1,"b":2}', 'a', 'b')
map_keys: returns the keys of a map as an array
map_values: returns the values of a map as an array
size: length of an array or map
sort_array(e: Column, asc: Boolean): sorts the elements of an array (natural ordering), ascending by default

Date and time functions
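A quick taste of this group before the list (a minimal sketch; assumes a SparkSession with spark.implicits._ and org.apache.spark.sql.functions._ imported):

```scala
val d = Seq("2018-01-01").toDF("ds").select(to_date($"ds").as("d"))

d.select(date_add($"d", 3), add_months($"d", 1), last_day($"d"),
         datediff(current_date(), $"d"),
         date_format($"d", "yyyy/MM/dd")).show()

// unix_timestamp / from_unixtime round-trip:
d.select(from_unixtime(unix_timestamp($"d"), "yyyy-MM-dd")).show()
```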
add_months(startDate: Column, numMonths: Int): the date numMonths after startDate
date_add(start: Column, days: Int): the date days after start, e.g. select date_add('2018-01-01', 3)
date_sub(start: Column, days: Int): the date days before start
datediff(end: Column, start: Column): number of days between two dates
current_date(): current date
current_timestamp(): current timestamp, of TimestampType
date_format(dateExpr: Column, format: String): formats a date
dayofmonth(e: Column): day of the month; supports date/timestamp/string
dayofyear(e: Column): day of the year; supports date/timestamp/string
weekofyear(e: Column): week of the year; supports date/timestamp/string
from_unixtime(ut: Column, f: String): converts a Unix timestamp to a formatted string
from_utc_timestamp(ts: Column, tz: String): converts a UTC timestamp to a timestamp in the given time zone
to_utc_timestamp(ts: Column, tz: String): converts a timestamp in the given time zone to a UTC timestamp
hour(e: Column): extracts the hour
minute(e: Column): extracts the minute
month(e: Column): extracts the month
quarter(e: Column): extracts the quarter
second(e: Column): extracts the second
year(e: Column): extracts the year
last_day(e: Column): last day of the month of the given date
months_between(date1: Column, date2: Column): number of months between two dates
next_day(date: Column, dayOfWeek: String): the first date later than the given date that falls on the given day of the week; dayOfWeek is case-insensitive and accepts "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"
to_date(e: Column): casts the column to DateType
trunc(date: Column, format: String): truncates a date to the given unit
unix_timestamp(s: Column, p: String): converts a time string with the given pattern to a Unix timestamp
unix_timestamp(s: Column): same as above, with the default pattern yyyy-MM-dd HH:mm:ss
unix_timestamp(): current Unix timestamp in seconds, implemented as unix_timestamp(current_timestamp(), yyyy-MM-dd HH:mm:ss)
window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): time-window function; buckets a TimestampType column into windows

Math functions
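A quick sketch of this group before the list, contrasting the two rounding modes (assumes a SparkSession with spark.implicits._ and org.apache.spark.sql.functions._ imported):

```scala
val n = Seq(2.5, 3.5, -1.2).toDF("x")

// round is HALF_UP (2.5 -> 3), bround is HALF_EVEN (2.5 -> 2, 3.5 -> 4)
n.select(round($"x"), bround($"x"), ceil($"x"), floor($"x")).show()

n.select(pow($"x", 2.0), sqrt(abs($"x")), log(2.0, lit(8.0))).show()
```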
cos, sin, tan: cosine, sine, and tangent of an angle
sinh, cosh, tanh: hyperbolic sine, cosine, and tangent
acos, asin, atan, atan2: the inverse trigonometric functions
bin: converts a long to its binary string representation; for example, bin("12") returns "1100"
bround: rounds using Decimal's HALF_EVEN mode: fractions above .5 round up, fractions below .5 round down, and exactly .5 rounds to the nearest even number
round(e: Column, scale: Int): rounds to scale decimal places using HALF_UP mode: .5 and above rounds up, below .5 rounds down, i.e. ordinary rounding
ceil: rounds up
floor: rounds down
cbrt: computes the cube root of the given value
conv(num: Column, fromBase: Int, toBase: Int): converts a number (as a string) between bases
log(base: Double, a: Column): $log_{base}(a)$
log(a: Column): $log_e(a)$
log10(a: Column): $log_{10}(a)$
log2(a: Column): $log_{2}(a)$
log1p(a: Column): $log_{e}(a+1)$
pmod(dividend: Column, divisor: Column): returns the positive value of dividend mod divisor
pow(l: Double, r: Column): $l^r$; note r is a column
pow(l: Column, r: Double): $l^r$; note l is a column
pow(l: Column, r: Column): $l^r$; note both l and r are columns
radians(e: Column): converts degrees to radians
rint(e: Column): returns the double value that is closest to the argument and equal to a mathematical integer
shiftLeft(e: Column, numBits: Int): bitwise shift left
shiftRight(e: Column, numBits: Int): bitwise shift right
shiftRightUnsigned(e: Column, numBits: Int): bitwise unsigned shift right
signum(e: Column): returns the sign of the value
sqrt(e: Column): square root
hex(column: Column): converts to hexadecimal
unhex(column: Column): inverse of hex

Miscellaneous functions
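A one-line sketch of this group before the list (assumes a SparkSession with spark.implicits._ and org.apache.spark.sql.functions._ imported; the id value is made up):

```scala
val ids = Seq("user-1").toDF("id")

// Checksums and digests over a string column:
ids.select(crc32($"id"), hash($"id"), md5($"id"), sha1($"id"), sha2($"id", 256)).show()
```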
crc32(e: Column): computes the CRC32 checksum, returns a bigint
hash(cols: Column*): computes a hash code, returns an int
md5(e: Column): computes the MD5 digest, returns a 32-character hex string
sha1(e: Column): computes the SHA-1 digest, returns a 40-character hex string
sha2(e: Column, numBits: Int): computes the SHA-2 digest, returns a numBits-bit hex string; numBits must be one of 224, 256, 384, or 512

Other non-aggregate functions
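A quick taste of this group before the list (a minimal sketch; assumes a SparkSession with spark.implicits._ and org.apache.spark.sql.functions._ imported, with made-up data):

```scala
val people = Seq(("male", Some(30)), ("female", None)).toDF("gender", "age")

people.select(
  // Conditional expression; null if no branch matches and otherwise is absent:
  when($"gender" === "male", 0).when($"gender" === "female", 1).otherwise(2),
  coalesce($"age", lit(-1)),      // first non-null value
  lit("constant"),                // literal column
  monotonically_increasing_id()   // unique but not consecutive across partitions
).show()
```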
abs(e: Column): absolute value
array(cols: Column*): combines multiple columns into an array; all columns must have the same type
map(cols: Column*): combines multiple columns into a map; the input must be (key, value) pairs, with all keys of one type and all values of one type
bitwiseNOT(e: Column): computes bitwise NOT
broadcast[T](df: Dataset[T]): Dataset[T]: marks df for broadcasting, used to implement broadcast joins, e.g. left.join(broadcast(right), "joinKey")
coalesce(e: Column*): returns the first non-null value
col(colName: String): returns the Column for colName
column(colName: String): alias for col
expr(expr: String): parses the expression string and returns the result as a Column
greatest(exprs: Column*): returns the greatest value across the columns, skipping nulls
least(exprs: Column*): returns the least value across the columns, skipping nulls
input_file_name(): returns the name of the file being read by the current task
isnan(e: Column): checks whether the value is NaN (not a number)
isnull(e: Column): checks whether the value is null
lit(literal: Any): creates a Column from a literal value
typedLit[T](literal: T)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[T]): creates a Column from a literal; also supports Scala types such as List, Seq, and Map
monotonically_increasing_id(): returns a monotonically increasing unique 64-bit ID; IDs are not consecutive across partitions
nanvl(col1: Column, col2: Column): returns col2 if col1 is NaN
negate(e: Column): negation, same as df.select(-df("amount"))
not(e: Column): logical not, same as df.filter(!df("isActive"))
rand(): random number in [0.0, 1.0]
rand(seed: Long): random number in [0.0, 1.0] with the given seed
randn(): random number drawn from the standard normal distribution
randn(seed: Long): same as above, with the given seed
spark_partition_id(): returns the partition ID
struct(cols: Column*): combines multiple columns into a new struct column
when(condition: Column, value: Any): returns value when condition is true, e.g. people.select(when(people("gender") === "male", 0).when(people("gender") === "female", 1).otherwise(2)); if there is no otherwise and no condition matches, null is returned

Sorting functions
asc(columnName: String): ascending order
asc_nulls_first(columnName: String): ascending order, nulls first
asc_nulls_last(columnName: String): ascending order, nulls last
The corresponding descending functions are desc, desc_nulls_first, and desc_nulls_last. E.g. df.sort(asc("dept"), desc("age"))

String functions
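A quick taste of this group before the list (a minimal sketch; assumes a SparkSession with spark.implicits._ and org.apache.spark.sql.functions._ imported):

```scala
val s = Seq("hello spark sql").toDF("s")

s.select(upper($"s"), initcap($"s"), length($"s"),
         substring($"s", 1, 5),                 // positions are 1-based
         regexp_replace($"s", "s\\w+", "***"),
         split($"s", " "),
         concat_ws("-", $"s", lit("2018"))).show()
```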
ascii(e: Column): ASCII code of the first character
base64(e: Column): base64-encodes
unbase64(e: Column): base64-decodes
concat(exprs: Column*): concatenates multiple string columns
concat_ws(sep: String, exprs: Column*): concatenates multiple string columns with sep as the separator
decode(value: Column, charset: String): decodes using the given charset
encode(value: Column, charset: String): encodes using the given charset; supported charsets are 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'
format_number(x: Column, d: Int): formats as a '#,###,###.##'-style string
format_string(format: String, arguments: Column*): formats arguments using a printf-style format string
initcap(e: Column): capitalizes the first letter of each word
lower(e: Column): converts to lowercase
upper(e: Column): converts to uppercase
instr(str: Column, substring: String): position of the first occurrence of substring in str
length(e: Column): string length
levenshtein(l: Column, r: Column): computes the Levenshtein (edit) distance between two strings
locate(substr: String, str: Column): position of the first occurrence of substr in str; positions start at 1, and 0 means not found
locate(substr: String, str: Column, pos: Int): same as above, starting the search at position pos
lpad(str: Column, len: Int, pad: String): left-pads str with pad to length len; the corresponding rpad pads on the right
ltrim(e: Column): trims whitespace from the left; the corresponding rtrim trims from the right
ltrim(e: Column, trimString: String): trims the given characters from the left; the corresponding rtrim does the same on the right
trim(e: Column, trimString: String): trims the given characters from both sides
trim(e: Column): trims whitespace from both sides
regexp_extract(e: Column, exp: String, groupIdx: Int): extracts the matched regex group
regexp_replace(e: Column, pattern: Column, replacement: Column): replaces regex matches; here the pattern and replacement are columns
regexp_replace(e: Column, pattern: String, replacement: String): replaces regex matches
repeat(str: Column, n: Int): repeats str n times
reverse(str: Column): reverses str
soundex(e: Column): computes the soundex code. This is used to index names by their English pronunciation; words that sound alike but are spelled differently map to the same code
split(str: Column, pattern: String): splits str around pattern
substring(str: Column, pos: Int, len: Int): substring of str starting at position pos with length len
substring_index(str: Column, delim: String, count: Int): returns the substring of str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned; if count is negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim
translate(src: Column, matchingString: String, replaceString: String): translates each character in src that appears in matchingString into the corresponding character of replaceString

UDF functions
UDF stands for user-defined function.
udf: defines a UDF
callUDF(udfName: String, cols: Column*): calls a registered UDF by name, e.g.:

import org.apache.spark.sql._
val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
val spark = df.sparkSession
spark.udf.register("simpleUDF", (v: Int) => v * v)
df.select($"id", callUDF("simpleUDF", $"value"))

Window functions
cume_dist(): cumulative distribution of values within a window partition
currentRow(): returns the special frame boundary that represents the current row in the window partition
rank(): rank of each row within its partition, leaving gaps after ties: 1, 2, 2, 4
dense_rank(): rank of each row within its partition, without gaps after ties: 1, 2, 2, 3
row_number(): sequential row number within the partition: 1, 2, 3, 4
percent_rank(): returns the relative rank (i.e. percentile) of rows within a window partition
lag(e: Column, offset: Int, defaultValue: Any): the value offset rows before the current row
lead(e: Column, offset: Int, defaultValue: Any): returns the value that is offset rows after the current row
ntile(n: Int): returns the ntile group id (from 1 to n inclusive) in an ordered window partition
unboundedFollowing(): returns the special frame boundary that represents the last row in the window partition
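These functions are applied over a Window specification. A minimal sketch (assumes a SparkSession with spark.implicits._ and org.apache.spark.sql.functions._ imported; data is made up):

```scala
import org.apache.spark.sql.expressions.Window

val scores = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("grp", "score")

// Rank and offset functions over a partitioned, ordered window:
val w = Window.partitionBy($"grp").orderBy($"score".desc)
scores.select($"grp", $"score",
              rank().over(w), dense_rank().over(w), row_number().over(w),
              lag($"score", 1, 0).over(w)).show()
```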
Reposted from: https://www.cnblogs.com/itboys/p/9818836.html
總結(jié)
以上是生活随笔為你收集整理的Spark SQL 函数全集的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問(wèn)題。
- 上一篇: Spring MVC能响应HTTP请求的
- 下一篇: Oracle数据库分组排序