17 Strategies for Dealing with Data, Big Data, and Even Bigger Data

Dealing with big data can be tricky. No one likes out-of-memory errors. No one likes waiting for code to run. No one likes leaving Python. 🐍

Don’t despair! In this article I’ll provide tips and introduce up-and-coming libraries to help you efficiently deal with big data. I’ll also point you toward solutions for code that won’t fit into memory. And all while staying in Python. 👍

Python is the most popular language for scientific and numerical computing. Pandas is the most popular for cleaning code and exploratory data analysis.

Python是用于科學(xué)和數(shù)值計(jì)算的最流行的語(yǔ)言。 熊貓是最受歡迎的清潔代碼和探索性數(shù)據(jù)分析工具。

Using pandas with Python allows you to handle much more data than you could with Microsoft Excel or Google Sheets.

SQL databases are very popular for storing data, but the Python ecosystem has many advantages over SQL when it comes to expressiveness, testing, reproducibility, and the ability to quickly perform data analysis, statistics, and machine learning.

Unfortunately, if you are working locally, the amount of data that pandas can handle is limited by the amount of memory on your machine. And if you’re working in the cloud, more memory costs more money.

Regardless of where your code is running, you want operations to happen quickly so you can GSD (Get Stuff Done)! 😀

Things to always do

If you’ve ever heard or seen advice on speeding up code, you’ve seen the warning: don’t prematurely optimize!

This is good advice. But it’s also smart to know techniques so you can write clean, fast code the first time. 🚀

The following are three good coding practices for any size dataset.

  • Avoid nested loops whenever possible. Here’s a brief primer on Big-O notation and algorithm analysis. One for loop nested inside another for loop generally leads to polynomial time calculations. If you have more than a few items to search through, you’ll be waiting for a while. See a nice chart and explanation here.

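A small sketch of the point above (the function names and toy data are mine, not from the article): the nested-loop search is O(n·m), while building a set first makes it O(n + m).

```python
def common_nested(a, b):
    # O(n*m): for each item in a, scan all of b
    result = []
    for x in a:
        for y in b:
            if x == y:
                result.append(x)
    return result


def common_set(a, b):
    # O(n + m): build a set once, then do O(1) membership checks
    b_set = set(b)
    return [x for x in a if x in b_set]


a = list(range(1000))
b = list(range(500, 1500))
assert common_nested(a, b) == common_set(a, b)
```

With a few thousand items the difference is already noticeable; with millions, the nested version becomes unusable.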
  • Use list comprehensions (and dict comprehensions) whenever possible in Python. Creating a list on demand is faster than loading the append attribute of the list and repeatedly calling it as a function (hat tip to the Stack Overflow answer here). However, in general, don’t sacrifice clarity for speed, so be careful with nesting list comprehensions.

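To illustrate (a toy benchmark, with made-up function names), the two styles produce identical lists, but the comprehension usually times faster:

```python
import timeit


def with_append(n):
    result = []
    for i in range(n):
        result.append(i * 2)  # attribute lookup + call on every iteration
    return result


def with_comprehension(n):
    return [i * 2 for i in range(n)]  # list built in one construct


assert with_append(1_000) == with_comprehension(1_000)

# Time both; on most machines the comprehension wins, but results vary:
loop_time = timeit.timeit(lambda: with_append(10_000), number=100)
comp_time = timeit.timeit(lambda: with_comprehension(10_000), number=100)
print(f"loop: {loop_time:.3f}s  comprehension: {comp_time:.3f}s")
```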
  • In pandas, use built-in vectorized functions. The principle is really the same as the reason for dict comprehensions. Applying a function to a whole data structure at once is much faster than repeatedly calling a function.

  • If you find yourself reaching for apply, think about whether you really need to. It loops over rows or columns. Vectorized methods are usually faster and require less code, so they’re a win-win. 🚀

    Avoid the other pandas Series and DataFrame methods that loop over your data: applymap, iterrows, itertuples. Use the replace method on a DataFrame instead of any of those other options to save lots of time.

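To make the contrast concrete, here's a minimal sketch (toy data; the column names are mine) comparing apply against a vectorized operation, plus the built-in replace:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5), "b": np.arange(5, 10)})

# Slow: apply calls the Python function once per row
row_sums_apply = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Fast: the vectorized operation runs in optimized compiled code
row_sums_vec = df["a"] + df["b"]

assert (row_sums_apply == row_sums_vec).all()

# Likewise, prefer the built-in replace over element-wise loops:
s = pd.Series(["cat", "dog", "cat"])
replaced = s.replace({"cat": "feline"})
assert list(replaced) == ["feline", "dog", "feline"]
```

On five rows the difference is invisible; on millions of rows the vectorized version is often orders of magnitude faster.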
    Notice that these suggestions might not hold for very small amounts of data, but in those cases the stakes are low, so who cares. 😉

This brings us to our most important rule

If you can, stay in pandas. 🐼

    It’s a happy place. 😀

    Don’t worry about these issues if you aren’t having problems and you don’t expect your data to balloon. But at some point, you’ll encounter a big dataset and then you’ll want to know what to do. Let’s see some tips.

Things to do with pretty big data (roughly millions of rows)

  • Use a subset of your data to explore, clean, and make a baseline model if you’re doing machine learning. Solve 90% of your problems fast and save time and resources. This technique can save you so much time!

  • Load only the columns that you need with the usecols argument when reading in your DataFrame. Less data in = win!

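For example (using an in-memory stand-in for a wide CSV on disk; the column names are invented):

```python
import io

import pandas as pd

# A stand-in for a wide CSV file on disk
csv_data = io.StringIO(
    "id,name,score,notes\n"
    "1,ann,90,long free text\n"
    "2,bob,85,more free text\n"
)

# Load only the two columns we actually need
df = pd.read_csv(csv_data, usecols=["id", "score"])
assert list(df.columns) == ["id", "score"]
```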
  • Use dtypes efficiently. Downcast numeric columns to the smallest dtypes that make sense with pandas.to_numeric(). Convert columns with low cardinality (just a few values) to a categorical dtype. Here’s a pandas guide on efficient dtypes.

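A minimal sketch of both conversions (toy data, column names mine):

```python
import pandas as pd

df = pd.DataFrame({
    "count": [1, 2, 3],          # defaults to int64 (8 bytes per value)
    "city": ["NY", "LA", "NY"],  # low-cardinality strings
})

# Downcast numerics to the smallest integer type that fits
df["count"] = pd.to_numeric(df["count"], downcast="integer")
assert df["count"].dtype == "int8"

# Convert low-cardinality strings to a categorical dtype
df["city"] = df["city"].astype("category")

# Compare df.memory_usage(deep=True) before and after to see the savings.
```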
  • Parallelize model training in scikit-learn to use more processing cores whenever possible. By default, scikit-learn uses just one of your machine’s cores. Many computers have 4 or more cores. You can use them all for parallelizable tasks by passing the argument n_jobs=-1 when doing cross validation with GridSearchCV and many other classes.

    在scikit-learn中并行進(jìn)行模型訓(xùn)練,以盡可能使用更多處理核心。 默認(rèn)情況下,scikit-learn僅使用計(jì)算機(jī)的核心之一。 許多計(jì)算機(jī)具有4個(gè)或更多核心。 在使用GridSearchCV和許多其他類進(jìn)行交叉驗(yàn)證時(shí),可以通過(guò)傳遞參數(shù)n_jobs=-1來(lái)將它們?nèi)坑糜诳刹⑿谢娜蝿?wù)。

  • Save pandas DataFrames in feather or pickle formats for faster reading and writing. Hat tip to Martin Skarzynski, who links to evidence and code here.

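The pickle round-trip looks like this (toy data; note that feather works the same way via df.to_feather() / pd.read_feather() but requires the pyarrow package):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": range(3), "b": list("xyz")})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df.pkl")
    df.to_pickle(path)             # typically much faster than to_csv
    restored = pd.read_pickle(path)

assert restored.equals(df)
```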
  • Use pd.eval to speed up pandas operations. Pass your usual code to the function as a string. It does the operation much faster. Here’s a chart from tests with a 100-column DataFrame.

    (Image from a good article on the topic by Tirthajyoti Sarkar.)

    df.query is basically the same as pd.eval, but as a DataFrame method instead of a top-level pandas function.

    See the docs because there are some gotchas. ??

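Here's a small sketch of both (random toy data; column names mine):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 3)), columns=["a", "b", "c"])

# pd.eval evaluates the string expression, using numexpr when available,
# which avoids allocating intermediate temporary arrays
result = pd.eval("df.a + df.b * df.c")
assert np.allclose(result, df.a + df.b * df.c)

# df.query is the DataFrame-method counterpart, handy for filtering
filtered = df.query("a > 0.5 and b < 0.5")
assert filtered.equals(df[(df.a > 0.5) & (df.b < 0.5)])
```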
    Pandas is using numexpr under the hood. Numexpr also works with NumPy. Hat tip to Chris Conlan in his book Fast Python for pointing me to numexpr. Chris’s book is an excellent read for learning how to speed up your Python code. 👍

Things to do with really big data (roughly tens of millions of rows and up)

  • Use numba. Numba gives you a big speed boost if you’re doing mathematical calcs. Install numba and import it. Then use the @numba.jit decorator function when you need to loop over NumPy arrays and can’t use vectorized methods. It works only with NumPy arrays. Use .to_numpy() on a pandas DataFrame to convert it to a NumPy array.

  • Use SciPy sparse matrices when it makes sense. Scikit-learn outputs sparse arrays automatically with some transformers, such as CountVectorizer. When your data is mostly 0s or missing values, you can convert columns to sparse dtypes in pandas. Read more here.

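A minimal sketch of the pandas side of this (toy data; sizes chosen just to show the effect):

```python
import numpy as np
import pandas as pd

# A column that is mostly zeros
dense = pd.Series(np.zeros(10_000))
dense.iloc[::1000] = 1.0  # only 10 nonzero values

# Store only the entries that differ from the fill value
sparse = dense.astype(pd.SparseDtype("float", fill_value=0.0))

# Same values, far less memory
assert (sparse.to_numpy() == dense.to_numpy()).all()
assert sparse.memory_usage(deep=True) < dense.memory_usage(deep=True)
```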
  • Use Dask to parallelize the reading of datasets into pandas in chunks. Dask can also parallelize data operations across multiple machines. It mimics a subset of the pandas and NumPy APIs. Dask-ML is a sister package to parallelize machine learning algorithms across multiple machines. It mimics the scikit-learn API. Dask plays nicely with other popular machine learning libraries such as XGBoost, LightGBM, PyTorch, and TensorFlow.

  • Use PyTorch with or without a GPU. You can get really big speedups by using PyTorch on a GPU, as I found in this article on sorting.

Things to keep an eye on/experiment with for dealing with big data in the future

    The following three packages are bleeding edge as of mid-2020. Expect configuration issues and early stage APIs. If you are working locally on a CPU, these are unlikely to fit your needs. But they all look very promising and are worth keeping an eye on. 🔭

  • Do you have access to lots of CPU cores? Does your data have more than 32 columns (necessary as of mid-2020)? Then consider Modin. It mimics a subset of the pandas library to speed up operations on large datasets. It uses Apache Arrow (via Ray) or Dask under the hood. The Dask backend is experimental. Some things weren’t fast in my tests; for example, reading in data from NumPy arrays was slow, and memory management was an issue.

  • You can use jax in place of NumPy. Jax is a bleeding-edge open source Google product. It speeds up operations by using five things under the hood: autograd, XLA, JIT, a vectorizer, and a parallelizer. It works on a CPU, GPU, or TPU and might be simpler than using PyTorch or TensorFlow to get speed boosts. Jax is good for deep learning, too. It has a NumPy version but no pandas version yet. However, you could convert a DataFrame to TensorFlow or NumPy and then use jax. Read more here.

  • Rapids cuDF uses Apache Arrow on GPUs with a pandas-like API. It’s an open source Python package from NVIDIA. Rapids plays nicely with Dask so you could get multiple GPUs processing data in parallel. For the biggest workloads, it should provide a nice boost.

Other stuff to know about code speed and big data

Timing operations

    If you want to time an operation in a Jupyter notebook, you can use %time or %%timeit magic commands. They both work on a single line or an entire code cell.

    %time runs once and %%timeit runs the code multiple times (the default is seven). Do check out the docs to see some subtleties.

    If you are in a script or notebook you can import the time module, check the time before and after running some code, and find the difference.

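A minimal version of that pattern (the work being timed is an arbitrary placeholder):

```python
import time

start = time.perf_counter()  # perf_counter is preferred over time.time() for timing
total = sum(i * i for i in range(100_000))  # the code you want to measure
elapsed = time.perf_counter() - start

print(f"took {elapsed:.4f} seconds")
```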
    When testing for time, note that different machines and software versions can cause variation. Caching will sometimes mislead if you are doing repeated tests. As with all experimentation, hold everything you can constant. 👍

Storing big data

    GitHub’s maximum file size is 100MB. You can use Git Large File Storage extension if you want to version large files with GitHub.

    Make sure you aren’t auto-uploading files to Dropbox, iCloud, or some other auto-backup service, unless you want to be.

    除非您愿意,否則請(qǐng)確保沒(méi)有將文件自動(dòng)上傳到Dropbox,iCloud或其他自動(dòng)備份服務(wù)。

Want to learn more?

    The pandas docs have sections on enhancing performance and scaling to large datasets. Some of these ideas are adapted from those sections.

    Have other tips? I’d love to hear them over on Twitter. 🎉

Wrap

    You’ve seen how to write faster code. You’ve also seen how to deal with big data and really big data. Finally, you saw some new libraries that will likely continue to become more popular for processing big data.

    I hope you’ve found this guide to be helpful. If you did, please share it on your favorite social media so other folks can find it, too. 😀

    I write about Python, SQL, Docker, and other tech topics. If any of that’s of interest to you, sign up for my mailing list of awesome data science resources and read more to help you grow your skills here. 👍

    Happy big data-ing! 😀

Translated from: https://towardsdatascience.com/17-strategies-for-dealing-with-data-big-data-and-even-bigger-data-283426c7d260