4 SQL Tips for Data Scientists and Data Engineers
SQL has become a common skill requirement across industries and job profiles over the last decade.
Companies like Amazon and Google often require that their data analysts, data scientists and product managers be at least familiar with SQL. This is because SQL remains the language of data. To be data-driven, people need to know how to access and analyze data.
With so many people looking at, slicing, manipulating, and analyzing data, we wanted to provide some tips to help improve your SQL.
We picked up these tips and tricks along the way while writing SQL. Some of them are do’s and don’ts; others are just best practices. Overall, we hope that they will help bring your SQL to the next level.
Some of the tips are things you shouldn’t do, even when you might be tempted, and others are best practices that help ensure you can trust your data. All of them are meant to be informative and to reduce possible future headaches.
Don’t Use Avg() on an Average (It’s Not the Same)
A common mistake we see in people’s queries is averaging averages. Some people may think it’s obvious not to average averages. However, across the web, there are discussions and whole articles explaining why it is bad to do so.
Why is it bad to average averages, both in SQL and in general? Because every group average gets the same weight, so the result can be skewed by averages that were based on only a few of whatever you are averaging.
For example, consider a table that lists the average cost per claim and the number of claims for two counties.
Here we’ve already averaged the cost per claim at the county level. We can also see that one county’s average is based on 100 claims and the other is based on 2 claims. In a real-life situation, this table would not include the total number of claims; we’re using it to illustrate how easily you can skew an average.
What if we wanted to find the average across all the counties? If you were to try averaging the averages, you would get $525. That doesn’t seem right.
If 100 claims averaged $50 and only 2 averaged $1,000, then the average of all those values should be closer to $50 than to $500. In fact, the average of these claims is about $68. But if you average the averages, you get a number nearly eight times greater.
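The arithmetic behind those two numbers can be checked with a short Python sketch. The claim counts and average costs come from the example in the text; the county labels are placeholders, since the county names are not given.

```python
# Per-county averages and claim counts from the example in the text.
# "County A" and "County B" are placeholder labels, not real names.
counties = [
    {"county": "County A", "avg_cost": 50.0, "claims": 100},
    {"county": "County B", "avg_cost": 1000.0, "claims": 2},
]

# Naive approach: average the two averages, giving each county equal weight.
avg_of_avgs = sum(c["avg_cost"] for c in counties) / len(counties)

# Correct approach: rebuild the total cost, then divide by the total claim count.
total_cost = sum(c["avg_cost"] * c["claims"] for c in counties)
total_claims = sum(c["claims"] for c in counties)
true_avg = total_cost / total_claims

print(f"average of averages: ${avg_of_avgs:.2f}")
print(f"true average:        ${true_avg:.2f}")
```

The second calculation weights each county by its number of claims, which is why it lands close to $50.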
So why do people even ask if it’s OK to average the average? Well, sometimes averaging the average can feel close to the expected output.
Let’s look at a SQL example:
In this case, we’ll be using a table that has the average cost per patient and average visits per patient by county and age. However, we would like to find the average cost per patient and visits per patient at the county level.
If we average the averages from the table using the query above it will give us the following output:
But we could also correctly write a query that recalculates the average at the county granularity, as shown below:
Let’s compare this query’s output to the previous output:
You will notice a few differences in the King County output. If we compare the average visits, they seem quite similar: 2.4 versus 2.6. This is probably why some people fall for the average of averages. It can sometimes be close to the correct value, so it can be tempting to use this method.
However, when we look at the average cost per patient, we’ll notice a difference of nearly $58, between about $560 and $620, or almost 10%. When you’re talking about cost savings, that’s a huge difference.
So although the difference between 2.4 and 2.6 may seem small, the same shortcut can lead to some massive differences in other metrics.
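The contrast between the two approaches can be sketched as follows. This is a minimal illustration, not the original queries: the table name `patient_stats`, the `patients` count column and all of the numbers are hypothetical, and the SQL is run through Python’s built-in `sqlite3` module only so that the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical pre-aggregated table: one row per county and age group.
conn.execute("""
    CREATE TABLE patient_stats (
        county TEXT, age_group TEXT, patients INTEGER,
        avg_cost REAL, avg_visits REAL)
""")
conn.executemany(
    "INSERT INTO patient_stats VALUES (?, ?, ?, ?, ?)",
    [
        ("King", "0-17", 10, 200.0, 1.0),
        ("King", "18-64", 50, 500.0, 2.0),
        ("King", "65+", 40, 900.0, 4.0),
    ],
)

# Averaging the averages: every age group counts equally, whatever its size.
naive = conn.execute("""
    SELECT county, AVG(avg_cost), AVG(avg_visits)
    FROM patient_stats
    GROUP BY county
""").fetchone()

# Recalculating at county level: weight each average by its patient count.
weighted = conn.execute("""
    SELECT county,
           SUM(avg_cost * patients) / SUM(patients),
           SUM(avg_visits * patients) / SUM(patients)
    FROM patient_stats
    GROUP BY county
""").fetchone()

print("average of averages:", naive)
print("recalculated:       ", weighted)
```

The recalculation is only possible when the underlying counts (or the raw rows) are available. If the table holds nothing but averages, the correct county-level figure cannot be recovered from it.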
You Can Use a Case Statement Inside Sum
Another great tip when writing SQL is learning how to use a case statement inside a sum. This can be very useful when you are trying to write metrics that are ratios, where the numerator counts only the rows that meet a condition.
For example, take a look at the query below. You will see that we need to hit the claims table twice: once to get the count of the values we are trying to filter, and once to get the total number of rows. However, we could reduce this.
We can write a case statement to count the total values where the condition is true and then divide by the total count, as in the query below.
You’ll notice that we don’t need to hit the table twice to get both numbers. This is also simpler to read. From my experience, this trick is usually picked up by most SQL developers somewhere in their first year or two of using SQL.
It’s extremely helpful for writing code that calculates the percentage of nulls in a column, or for calculating metrics for dashboards. This is why many analysts and data engineers become familiar with the trick, as long as they write a decent amount of SQL and don’t just use drag-and-drop solutions.
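A minimal sketch of the pattern described above, run through Python’s built-in `sqlite3` module so it is self-contained. The `claims` table name comes from the text; the columns, the rows and the 1000 cost threshold are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical claims data; one claim has a missing (NULL) cost.
conn.execute("CREATE TABLE claims (claim_id INTEGER, claim_cost REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?)",
    [(1, 120.0), (2, 1500.0), (3, 80.0), (4, 2200.0), (5, None)],
)

# Two passes: one subquery for the filtered count, one for the total count.
two_pass = conn.execute("""
    SELECT (SELECT COUNT(*) FROM claims WHERE claim_cost > 1000) * 1.0
         / (SELECT COUNT(*) FROM claims)
""").fetchone()[0]

# One pass: the CASE expression turns each row into 1 or 0 before summing.
one_pass = conn.execute("""
    SELECT SUM(CASE WHEN claim_cost > 1000 THEN 1 ELSE 0 END) * 1.0 / COUNT(*),
           SUM(CASE WHEN claim_cost IS NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*)
    FROM claims
""").fetchone()

print("share of high-cost claims (two passes):", two_pass)
print("share of high-cost claims, share of null costs (one pass):", one_pass)
```

Multiplying by 1.0 forces floating-point division, since dividing two integers in SQLite truncates. The second column shows the null-percentage check mentioned above, computed in the same scan.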
Understanding Arrays and How to Manipulate Them
Arrays and maps inside your database tables aren’t too common. However, I’ve noticed more and more teams relying on unstructured data, which is often stored in data structures like arrays and maps and queried with array functions.
This is because databases like Postgres and SQL engines like Presto allow you to handle arrays in your query.
Although arrays and maps are not new concepts, they are somewhat new to analysts and data scientists who perhaps aren’t as familiar with programming. This means you may occasionally need to learn a few array and map functions to extract data.
Let’s start by learning how to unnest a map in Presto. A map is a data structure that provides a key:value relationship. Each value is stored under a unique key that describes it, for example “first_name”:“George”. A map can also contain multiple key-value pairs.
In this case, we have two keys, dob and friend_ids that we would like to access:
So how do we access that data? Let’s check out the query below.
As you can see, you can define a row for both the key and value. So when we pull out the data you can get the specific data types.
The output will look like this:
You can also check the length of arrays, find specific keys, and much more (the Presto documentation covers array functions in detail). I recommend that you don’t use maps and arrays as replacements for good data modeling. However, they can come in handy when you’re working with data that you might not want to define a specific schema for.
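To make the idea of unnesting concrete, here is a rough Python analogue rather than Presto SQL. It shows what unnesting a map column produces: one output row per key-value pair. The keys `dob` and `friend_ids` come from the text; the `user_id` and `info` column names and all of the values are hypothetical.

```python
# Hypothetical rows: each user has a map column holding "dob" and "friend_ids".
rows = [
    {"user_id": 1, "info": {"dob": "1990-04-12", "friend_ids": [2, 3]}},
    {"user_id": 2, "info": {"dob": "1985-11-30", "friend_ids": [1]}},
]


def unnest_map(rows, map_column):
    """Yield one (user_id, key, value) tuple per key-value pair in the map."""
    for row in rows:
        for key, value in row[map_column].items():
            yield row["user_id"], key, value


unnested = list(unnest_map(rows, "info"))
for user_id, key, value in unnested:
    print(user_id, key, value)

# Analogue of an array-length function: how many friends each user has.
friend_counts = {r["user_id"]: len(r["info"]["friend_ids"]) for r in rows}
print(friend_counts)
```

In an engine such as Presto the same reshaping happens inside the query, and the keys and values come back as typed columns instead of Python objects.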
Lead and Lag to Avoid Self Joins
Finally, let’s talk about using Lead and Lag window functions to avoid self joins.
When you’re doing analytics, you will often need to compare the output of two events or calculate the amount of time between two events.
One way you can do this is to self-join the table and connect the two rows. However, the Lag and Lead functions are a nifty alternative.
These allow a user to reference a specified lagging or leading value. You can also specify the desired level of granularity of the lagging and leading values.
For example, in the query below we are partitioning the lagging and leading values by patient_id. This means we are only looking at lagging and leading claim_dates and claim_costs at the patient level:
The output of this query will look like this:
You will notice that for the first date of every patient, the lagging claim_date and claim_cost are null. This is because there’s no prior cost or claim date.
Overall, the lag and lead functions can make an SQL developer's life much simpler.
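A minimal sketch of the Lag and Lead pattern described in this section, run through Python’s built-in `sqlite3` module. The column names patient_id, claim_date and claim_cost come from the text; the `claims` table layout and the rows are hypothetical. Window functions need SQLite 3.25 or newer, which recent Python builds typically bundle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical claims: three for patient 1, two for patient 2.
conn.execute(
    "CREATE TABLE claims (patient_id INTEGER, claim_date TEXT, claim_cost REAL)"
)
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [
        (1, "2020-01-05", 100.0),
        (1, "2020-02-10", 250.0),
        (1, "2020-03-01", 75.0),
        (2, "2020-01-20", 400.0),
        (2, "2020-04-15", 50.0),
    ],
)

# LAG looks one row back and LEAD one row ahead, within each patient's claims
# ordered by date. No self join is needed.
rows = conn.execute("""
    SELECT patient_id,
           claim_date,
           claim_cost,
           LAG(claim_date)  OVER (PARTITION BY patient_id ORDER BY claim_date),
           LAG(claim_cost)  OVER (PARTITION BY patient_id ORDER BY claim_date),
           LEAD(claim_date) OVER (PARTITION BY patient_id ORDER BY claim_date)
    FROM claims
    ORDER BY patient_id, claim_date
""").fetchall()

for row in rows:
    print(row)
```

The first claim of each patient has no previous row in its partition, so both lagging columns come back as `None` (SQL NULL). The last claim of each patient likewise has no leading claim date.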
The Details Matter With SQL
SQL remains the language of data. Learning these tips and tricks can help ensure that your next dashboard or analysis is that much better. Whether you avoid averaging averages or write data quality checks, these small improvements make a huge difference. Some of these mistakes have caused major problems and long discussions at companies, so we hope this helps bring many of you up to speed.
In addition, if you follow these SQL tips, your data analysis will be more accurate and you can be more confident in the numbers you provide.
Thanks for reading.
Translated from: https://medium.com/better-programming/4-sql-tips-for-data-scientist-and-data-engineers-56c41487752f