Task01c: Implementing Random Sampling and the Chi-Square Test in SQL

Published: 2023/12/29 | Database | 30 | 豆豆

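Required tools and background

  • Tool: MySQL (this is a first pass through the book; the code still needs to be checked for accuracy in a second round of editing, with MySQL as the intended target)
  • Book: 《數據分析技術 使用SQL和EXCEL工具 第二版》 (Data Analysis Using SQL and Excel, 2nd edition)
  • Dataset: the book's companion data resources
  • Task01a: Review of SQL basics
  • Task01b: Basic statistical concepts and their SQL implementation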

Main topics

  • Sampling
  • Hypothesis testing

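Comparisons based on the mean

Z-score: measures the distance from a sample value to the expected value, expressed as a number of standard deviations.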
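
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not code from the book: the function names and the numbers (an observed value of 110 against an expected 100 with a standard deviation of 5) are made up for the example, and the two-sided probability assumes the statistic follows a normal distribution.

```python
import math


def z_score(value, expected, std_dev):
    """Distance from value to expected, in units of standard deviations."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (value - expected) / std_dev


def two_sided_p(z):
    """Probability that a standard normal value lies at least |z| from zero."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical numbers, chosen only for illustration.
z = z_score(value=110.0, expected=100.0, std_dev=5.0)
print("z-score:", z)
print("two-sided probability:", round(two_sided_p(z), 4))
```

A z-score near 0 means the value is close to what was expected; the further it is from 0, the less plausible it is that the difference is due to chance.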

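Random sampling and stratified sampling

Because a random sample is drawn at random, a statistic computed from it varies from sample to sample and is distributed around the average.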

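-- Random sample of roughly 10% of the rows
SELECT t.*
FROM <tab> t
WHERE RAND() < 0.1;

-- Repeatable random sample: a constant seed makes RAND() produce the same
-- sequence on each run (the rows must also be read in the same order)
SELECT t.*
FROM <tab> t
WHERE RAND(1) < 0.1;

-- Pseudo-random number generator built from row numbers
WITH t AS (
    SELECT t.*,
           ROW_NUMBER() OVER (ORDER BY col) AS seqnum
    FROM <tab> t
)
SELECT t.*
FROM t
WHERE (seqnum * 17 + 57) % 101 <= 10;

-- Stratified balanced sample: data for a scatter plot of 200 orders of at most $200,
-- with the total price split into separate AE and non-AE columns
SELECT OrderDate,
       (CASE WHEN PaymentType = 'AE' THEN TotalPrice END) AS AE,
       (CASE WHEN PaymentType = 'AE' THEN NULL ELSE TotalPrice END) AS NotAE
FROM Orders
WHERE TotalPrice <= 200
ORDER BY RAND()
LIMIT 200;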
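
To see why the row-number trick selects a stable fraction of rows, the same arithmetic can be tried outside the database. The Python sketch below is only an illustration under assumptions: `in_sample` and `seeded_sample` are hypothetical helper names, and the 1,010 row numbers stand in for a real table. Because 17 and 101 share no common factor, every run of 101 consecutive row numbers hits each remainder exactly once, so 11 of every 101 rows (about 10.9%) pass the `<= 10` filter.

```python
import random


def in_sample(seqnum, multiplier=17, offset=57, modulus=101, cutoff=10):
    """Same test as the SQL filter (seqnum * 17 + 57) % 101 <= 10."""
    return (seqnum * multiplier + offset) % modulus <= cutoff


def seeded_sample(rows, fraction, seed):
    """Repeatable sample: the same seed always selects the same rows."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]


# Stand-in for ROW_NUMBER() values 1..1010 (ten full cycles of 101).
rows = range(1, 1011)
sample = [n for n in rows if in_sample(n)]
print("rows selected:", len(sample), "of", len(rows))

# Two runs with the same seed return identical samples.
print(seeded_sample(rows, 0.1, seed=1) == seeded_sample(rows, 0.1, seed=1))
```

The modular filter needs no random number generator at all, which is why it returns the same rows on every run as long as the row numbering is stable.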

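Null hypothesis and confidence

-- Starts, stops, and the stop rate for customers who started on 2005-12-28
-- (customers still active = numstarts - numstops)
SELECT COUNT(*) AS numstarts,
       SUM(CASE WHEN Stoptype IS NOT NULL THEN 1 ELSE 0 END) AS numstops,
       AVG(CASE WHEN Stoptype IS NOT NULL THEN 1.0 ELSE 0 END) AS stoprate
FROM Subscribers
WHERE Startdate = '2005-12-28';

-- 1. Given the counts, what is the probability of stopping?
-- 2. Given the probability, how many stoppers are there?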

Probability, standard error, and confidence intervals

-- Standard error, plus the lower and upper bounds of the 95% confidence interval
SELECT stoprate - 1.96 * stderr AS conflower,
       stoprate + 1.96 * stderr AS confupper,
       stoprate,
       stderr,
       numstarts,
       numstops
FROM (
    SELECT SQRT(stoprate * (1 - stoprate) / numstarts) AS stderr,
           stoprate,
           numstarts,
           numstops
    FROM (
        SELECT COUNT(*) AS numstarts,
               SUM(CASE WHEN Stoptype IS NOT NULL THEN 1 ELSE 0 END) AS numstops,
               AVG(CASE WHEN Stoptype IS NOT NULL THEN 1.0 ELSE 0 END) AS stoprate
        FROM Subscribers
        WHERE startdate = '2005-12-28'
    ) s
) s;
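
The arithmetic in that query can be checked by hand. The Python sketch below mirrors it under stated assumptions: the function name is hypothetical, and the counts (1,000 starts, 400 stops) are invented for illustration rather than taken from the Subscribers table. It uses the same normal approximation as the SQL, where the standard error of a proportion p over n observations is sqrt(p * (1 - p) / n).

```python
import math


def proportion_confidence_interval(num_starts, num_stops, z=1.96):
    """Return (lower, rate, upper, stderr) for a stop rate.

    z=1.96 corresponds to a 95% interval under the normal approximation.
    """
    if num_starts <= 0:
        raise ValueError("num_starts must be positive")
    rate = num_stops / num_starts
    stderr = math.sqrt(rate * (1 - rate) / num_starts)
    return rate - z * stderr, rate, rate + z * stderr, stderr


# Hypothetical counts, not taken from the dataset.
lower, rate, upper, stderr = proportion_confidence_interval(1000, 400)
print(f"stop rate {rate:.3f}, stderr {stderr:.4f}, "
      f"95% interval [{lower:.4f}, {upper:.4f}]")
```

The interval narrows as the number of starts grows, because the standard error shrinks with the square root of the sample size.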

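Chi-square test

The chi-square test compares groups across several dimensions at once. Stated precisely, it asks: how likely is it that the observed deviations are due to chance alone?

If that likelihood is low, we can be confident that the markets really do differ.

The basic idea of the chi-square test is to judge whether a theory holds by looking at how far the observed values deviate from the values the theory predicts.
In practice, the null hypothesis is that the distribution of some feature (such as gender or age) is independent of the distribution of the target (whether the customer churned). Put plainly: "whether a user churns has nothing to do with gender."
Given a set of sample data, we compute the chi-square value and use it to judge whether the null hypothesis holds. The chi-square value is converted to a p-value: if p < 0.05, we reject the null hypothesis, which supports the view that gender and churn are not independent; otherwise we fail to reject the null hypothesis, meaning the data show no evidence of an association between gender and churn.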
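
The lookup from chi-square value to p-value can be sketched for the simplest case. The Python below is an illustration under a stated restriction: it only handles one degree of freedom (a 2x2 table such as gender by churned/not churned), where the p-value has a closed form through the complementary error function. The function names and the 0.05 threshold default are choices made for this example; tables with more degrees of freedom need a chi-square table or a statistics library.

```python
import math


def chi2_p_value_df1(chi_square):
    """p-value of a chi-square statistic with exactly 1 degree of freedom."""
    if chi_square < 0:
        raise ValueError("chi_square must be non-negative")
    return math.erfc(math.sqrt(chi_square / 2))


def decide(chi_square, alpha=0.05):
    """Apply the decision rule: reject the null hypothesis when p < alpha."""
    p = chi2_p_value_df1(chi_square)
    return "reject null hypothesis" if p < alpha else "fail to reject null hypothesis"


# Hypothetical chi-square values for a 2x2 table.
for value in (1.0, 5.0):
    print(value, round(chi2_p_value_df1(value), 4), decide(value))
```

With one degree of freedom, the 0.05 threshold corresponds to a chi-square value of about 3.84; larger values lead to rejecting independence.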

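  • Chi-square: for each cell, the square of the deviation (observed value minus expected value) divided by the expected value. The chi-square value of the whole table is the sum of the chi-square values of all cells.
  • Chi-square distribution: the degrees of freedom of a table are (number of rows - 1) * (number of columns - 1).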
-- Chi-square test in SQL: do the markets differ from one another?
-- (POWER(..., 2) replaces SQUARE(), which MySQL does not provide)
SELECT market, isstopped, val, x,
       POWER(val - x, 2) / x AS chisquare
FROM (
    SELECT cells.market,
           cells.isstopped,
           (1.0 * r.cnt * c.cnt /
            (SELECT COUNT(*) FROM Subscribers WHERE startdate IN ('2005-12-26'))) AS x,
           cells.cnt AS val
    FROM (
        SELECT Market,
               (CASE WHEN Stoptype IS NOT NULL THEN 1 ELSE 0 END) AS isstopped,
               COUNT(*) AS cnt
        FROM Subscribers
        WHERE Startdate IN ('2005-12-26')
        GROUP BY Market, (CASE WHEN Stoptype IS NOT NULL THEN 1 ELSE 0 END)
    ) cells
    LEFT JOIN (
        SELECT Market, COUNT(*) AS cnt
        FROM Subscribers
        WHERE startdate IN ('2005-12-26')
        GROUP BY Market
    ) r ON cells.market = r.market
    LEFT JOIN (
        SELECT (CASE WHEN stoptype IS NOT NULL THEN 1 ELSE 0 END) AS isstopped,
               COUNT(*) AS cnt
        FROM Subscribers
        WHERE startdate IN ('2005-12-26')
        GROUP BY (CASE WHEN stoptype IS NOT NULL THEN 1 ELSE 0 END)
    ) c ON cells.isstopped = c.isstopped
) a
ORDER BY market, isstopped;

-- Is product preference related to geography? Product group by state
SELECT state, GroupName, val, expx,
       POWER(val - expx, 2) / expx AS chisquare
FROM (
    SELECT cells.state,
           cells.GroupName,
           1.0 * r.cnt * c.cnt /
               (SELECT COUNT(DISTINCT CustomerId) FROM Orders) AS expx,
           cells.cnt AS val
    FROM (
        SELECT o.State, p.GroupName,
               COUNT(DISTINCT o.CustomerId) AS cnt
        FROM Orders o
        LEFT JOIN OrderLines ol ON o.OrderId = ol.OrderId
        LEFT JOIN Products p ON ol.ProductId = p.ProductId
        GROUP BY o.state, p.GroupName
    ) cells
    LEFT JOIN (
        SELECT o.state, COUNT(DISTINCT o.CustomerID) AS cnt
        FROM Orders o
        GROUP BY o.state
    ) r ON cells.State = r.State
    LEFT JOIN (
        SELECT p.GroupName, COUNT(DISTINCT o.CustomerId) AS cnt
        FROM Orders o
        LEFT JOIN OrderLines ol ON o.OrderId = ol.OrderId
        LEFT JOIN Products p ON ol.ProductId = p.ProductId
        GROUP BY p.GroupName
    ) c ON cells.GroupName = c.GroupName
) a
ORDER BY chisquare DESC;
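
The cell-by-cell calculation that these queries perform can be followed on a small table. The Python sketch below is illustrative only: `chi_square_table` is a hypothetical helper, and the 2x2 counts are invented (think of two markets by stopped/not stopped). The expected value of a cell is its row total times its column total divided by the grand total, matching the `r.cnt * c.cnt / total` expression in the SQL; the sketch assumes every expected value is greater than zero.

```python
def chi_square_table(observed):
    """Return (chi_square, degrees_of_freedom, cells) for a table of counts.

    Each entry of cells is (row, col, observed, expected, cell_chi_square).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi_square = 0.0
    cells = []
    for i, row in enumerate(observed):
        for j, val in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            cell_chi = (val - expected) ** 2 / expected
            cells.append((i, j, val, expected, cell_chi))
            chi_square += cell_chi

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi_square, dof, cells


# Hypothetical counts: rows are two markets, columns are stopped / not stopped.
observed = [[30, 70],
            [50, 50]]
chi_square, dof, cells = chi_square_table(observed)
for cell in cells:
    print(cell)
print("chi-square:", round(chi_square, 4), "degrees of freedom:", dof)
```

Sorting the cells by their individual chi-square values, as the second query does with `ORDER BY chisquare DESC`, shows which combinations deviate most from what independence would predict.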

Multidimensional chi-square

Which combinations of month, payment type, and product group stand out?

WITH pmg AS (
    -- Cell values: counts aggregated by payment type, month, and group name
    SELECT o.PaymentType,
           MONTH(o.OrderDate) AS mon,
           p.GroupName,
           COUNT(*) AS cnt
    FROM Orders o
    JOIN OrderLines ol ON o.OrderId = ol.OrderId
    JOIN Products p ON ol.ProductId = p.ProductId
    GROUP BY o.PaymentType, MONTH(o.OrderDate), p.GroupName
),
pmgmarg AS (
    -- Totals along each dimension
    SELECT pmg.*,
           SUM(cnt) OVER (PARTITION BY PaymentType) AS cnt_pt,
           SUM(cnt) OVER (PARTITION BY mon) AS cnt_mon,
           SUM(cnt) OVER (PARTITION BY GroupName) AS cnt_gn,
           SUM(cnt) OVER () AS cnt_all
    FROM pmg
),
pmgexp AS (
    -- Expected values
    SELECT pmgmarg.*,
           (cnt_pt * cnt_mon * cnt_gn) / POWER(cnt_all, 2) AS ExpectedValue
    FROM pmgmarg
)
-- Chi-square value of each cell (POWER(..., 2) replaces SQUARE(), which MySQL does not provide)
SELECT pmgexp.*,
       POWER(cnt - ExpectedValue, 2) / ExpectedValue AS chi2
FROM pmgexp
ORDER BY chi2 DESC;
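
The expected-value formula in the query extends the two-dimensional case: with three independent dimensions, the expected count of a cell is the product of its three marginal totals divided by the grand total squared. The Python sketch below reproduces that step on invented data; `expected_counts_3d` is a hypothetical helper, and the payment types, months, and group names in the example are placeholders rather than values confirmed from the dataset.

```python
from collections import defaultdict


def expected_counts_3d(cells):
    """cells maps (a, b, c) -> observed count.

    Returns a dict mapping each key to (expected, cell_chi_square), where
    expected = total_a * total_b * total_c / grand_total ** 2.
    """
    total = sum(cells.values())
    marg_a, marg_b, marg_c = defaultdict(int), defaultdict(int), defaultdict(int)
    for (a, b, c), cnt in cells.items():
        marg_a[a] += cnt
        marg_b[b] += cnt
        marg_c[c] += cnt

    result = {}
    for (a, b, c), cnt in cells.items():
        expected = marg_a[a] * marg_b[b] * marg_c[c] / total ** 2
        result[(a, b, c)] = (expected, (cnt - expected) ** 2 / expected)
    return result


# Hypothetical, perfectly uniform counts: every cell holds 10 orders.
cells = {(pt, mon, gn): 10
         for pt in ("AE", "VI")
         for mon in (1, 2)
         for gn in ("BOOK", "GAME")}
for key, (expected, chi2) in expected_counts_3d(cells).items():
    print(key, expected, chi2)
```

In this uniform example every expected count equals the observed count, so each cell's chi-square value is zero; real data would show the largest values for the combinations that depart most from independence.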
