使用Java解决您的数据科学问题
Java offers versatility, interoperability, and the chance to zip around Europe in a red vespa. Data scientists are expected to have good coding chops in Python (or R). I argue that Java should be added as a useful tool in the data science skillset.
Java提供了多功能性,互操作性,并有機會在紅色的vespa中環游歐洲。 數據科學家有望在Python(或R)中擁有良好的編碼能力。 我認為應該將Java添加為數據科學技能集中的有用工具。
Since 1995, Java has been sipping espresso and writing killer apps, occasionally sending a wistful glance out the cafe window. It’s probable there’s already a Java app that could help you with your work as a data scientist.
自1995年以來,Java一直在喝咖啡并編寫殺手級應用程序,偶爾會在咖啡館的窗戶外望去。 可能已經有一個Java應用程序可以幫助您作為數據科學家工作。
This article will help you get a sense of the essentials, understand how Java could be used to support machine learning, and check out some resources for further learning. This article does not intend to teach you the actual syntax of this versatile language. This article will not offer to buy your espresso, but Java might.
本文將幫助您了解基本知識,了解如何使用Java支持機器學習,并查看一些資源以進一步學習。 本文無意教您這種通用語言的實際語法。 本文不會提供購買您的意式濃縮咖啡的信息,但Java可能會提供。
In this guide:
在本指南中:
What is Java?
什么是Java?
Why learn Java?
為什么要學習Java?
What are the core differences between Python and Java?
Python和Java之間的核心區別是什么?
Summary
摘要
Resources
資源資源
什么是Java? (What is Java?)
Java is a general-purpose programming language that is class-based, object-oriented, and designed for interoperability. You could say Java has a devil-may care attitude when it comes to the deployment environment — an app written in Java can run on any operating system that can run the Java Virtual Machine (JVM). This flexibility is why Java is a favorite of app developers seeking to create a tool that is operating system agnostic and will transfer well to mobile.
Java是一種通用的編程語言,它是基于類,面向對象的并且旨在實現互操作性 。 您可能會說Java在部署環境方面表現出惡魔般的關心態度-用Java編寫的應用程序可以在可以運行Java虛擬機(JVM)的任何操作系統上運行。 這種靈活性就是為什么Java成為尋求創建與操作系統無關的工具并將能很好地遷移到移動設備的應用程序開發人員的最愛的原因。
A java program’s source code (the .java file) is written in curly-brace heavy syntax. The program is compiled down to bytecode (a .class file) that can run on the JVM.
Java程序的源代碼(.java文件)以大括號語法編寫。 程序被編譯為可以在JVM上運行的字節碼(.class文件)。
A program is compiled from source code (e.g., pdfParsingApp.java) to bytecode (e.g., pdfParsingApp.class) by typing javac pdfParsingApp.java
通過鍵入javac pdfParsingApp.java ,將程序從源代碼(例如pdfParsingApp.java)編譯為字節碼(例如pdfParsingApp.class)。
If you’re familiar with the concept of machine code from computer science fundamentals, java bytecode is analogous — but instead of being written for the specific architecture of the host computer, the bytecode is written for the virtual machine.
如果您從計算機科學的基礎上熟悉機器代碼的概念,那么Java字節碼是類似的-但不是為主機的特定體系結構編寫字節碼,而是為虛擬機編寫字節碼。
My Tutorial Blog我的教程博客The JVM translates the Java bytecode into the platform’s machine language — this is what gives Java its interoperability. The JVM is capable of translating from languages such as Clojure, Scala, and others that compile down to Java bytecode.
JVM將Java字節碼轉換為平臺的機器語言-這就是Java的互操作性的原因。 JVM能夠從Clojure,Scala等語言編譯成Java字節碼的語言進行翻譯。
為什么要學習Java? (Why learn Java?)
While Python may be the lingua franca of the data science world (with R playing a role to a lesser extent), in terms of the broader developer community, Java is used by an estimated 40% programmers. Because of its widespread adoption, having at least a cursory understanding of how Java packages are structured, written, and implemented can help a data scientist communicate more effectively with their software engineering colleagues.
盡管Python可能是數據科學領域的通用語言 (R在其中起的作用較小),但就更廣泛的開發人員社區而言, 估計有40%的程序員使用Java 。 由于Java包的廣泛采用,至少對Java包的結構,編寫和實現方式有一個粗略的了解可以幫助數據科學家與他們的軟件工程同事更有效地溝通。
Moreover, understanding Java is helpful for the data acquisition and deployment phases of the machine learning pipeline.
此外,了解Java有助于機器學習管道的數據獲取和部署階段。
At the data acquisition stage, Java is helpful because production code bases are often written in Java. As a data scientist, the ability to go upstream to fix bad data before it enters the machine learning pipeline is invaluable. Behind SQL (used by 55% of programmers), Java is probably the most helpful language for addressing backend data issues and for cleaning up data.
在數據采集階段,Java很有用,因為生產代碼庫通常是用Java編寫的。 作為數據科學家,能夠在數據進入機器學習管道之前上游修復不良數據的能力非常寶貴。 在SQL( 55%的程序員使用過的 SQL)之后,Java可能是解決后端數據問題和清理數據最有用的語言。
Backsliding from model development to data cleaning — this is what we’re trying to avoid through improved data quality via data engineering. Photo by Wilbur Wong on Unsplash.從模型開發到數據清理的后退—這是我們試圖通過數據工程提高數據質量來避免的事情。 Wilbur Wong在Unsplash上的照片 。For example, in natural language processing (NLP), we deal with a lot of unstructured text data that’s designed to be readable to humans but not to computers. Apache PDFBox is a Java library designed to address the challenge of wrangling text from a PDF into a usable format. Not only does this tool support content extraction, it can also be used to modify existing documents and create new ones. That would be a really cool way to report the output of your machine learning model built on text analytics.
例如,在自然語言處理(NLP)中,我們處理了許多非結構化的文本數據,這些數據被設計為人類可以讀取的,但計算機卻不可讀。 Apache PDFBox是一個Java庫,旨在解決將文本從PDF轉換為可用格式的挑戰。 該工具不僅支持內容提取,還可以用于修改現有文檔并創建新文檔。 這將是報告基于文本分析的機器學習模型的輸出的一種非常酷的方法。
Speaking of deployment, Java affords a simple introduction into Scala. This super-fast language runs on the JVM and is often used by DevOps engineers to put machine learning models into production.
說到部署, Java對Scala進行了簡單介紹 。 這種超快速的語言在JVM上運行,DevOps工程師經常使用該語言將機器學習模型投入生產。
The most powerful argument for learning Java is that it provides a strong foundation for improving your object-oriented programming. Computer science 101 classes are often taught with Java as the base language to provide instruction on data structures & algorithms. Because Java doesn’t offer as much abstraction as Python, it will improve your overall knowledge of computer science fundamentals.
學習Java的最有力依據是,它為改進面向對象的編程提供了堅實的基礎。 通常以Java為基礎語言教授計算機科學101課,以提供有關數據結構和算法的說明 。 因為Java提供的抽象不如Python,所以它將提高您對計算機科學基礎知識的整體了解。
Bottom line: there is a net benefit for everyone — data scientists, software engineers, and the organization in which they operate — when data science is positioned squarely within the practice of computer science.
最重要的是:只要數據科學在計算機科學領域內處于立足之地,對每個人(數據科學家,軟件工程師以及他們經營所在的組織)都有凈收益。
Python和Java之間的核心區別是什么? (What are the core differences between Python and Java?)
Python is a dynamically typed, interpreted language. When writing Python code, the programmer doesn’t have to declare the type of each variable — the type is dynamic and changeable. At runtime, the interpret executes a Python program by translating each statement into a sequence of subroutines, then into machine code.
Python是一種動態類型的解釋語言。 在編寫Python代碼時,程序員不必聲明每個變量的類型-類型是動態的且可變的。 在運行時,解釋程序通過將每個語句轉換為一系列子例程,然后轉換為機器代碼來執行Python程序。
Java is a statically typed, compiled language. When creating a new variable in Java, the programmer is responsible for declaring the type. That type is then static — it cannot be changed. Static typing also means errors are caught in compilation, not at runtime. We’ve already talked about the role of the JVM in translating from intermediate Java bytecode (i.e., the .class file created by compilation) into machine code that is specific to the underlying computer hardware.
Java是一種靜態類型的編譯語言 。 在Java中創建新變量時,程序員負責聲明類型。 然后,該類型是靜態的-無法更改。 靜態類型化還意味著錯誤會在編譯中捕獲,而不是在運行時捕獲。 我們已經討論了JVM在將中間Java字節碼(即,通過編譯創建的.class文件)轉換為特定于基礎計算機硬件的機器代碼中的作用。
Etsy.Etsy 。There are many more rules to working with Java than working with Python — both in terms of syntax (e.g. static typing) as well as package structure. For example, a Java source file can only contain one public class. The source file must be named for the public class it contains.
使用Java的規則要比使用Python的規則多得多-在語法(例如靜態類型)以及包結構方面。 例如,一個Java源文件只能包含一個公共類。 必須為源文件包含的公共類命名。
For example, the program pdfParsingApp.java will look something like:
例如,程序pdfParsingApp.java將類似于:
The public static void main line must appear in every Java program (with the exception of programs that are part of a package library). This is the start of the main method that runs at execution and essentially tells the Java program what to do. Other classes within the program may provide information that is used by the main method.
public static void main 必須出現在每個Java程序中 (作為程序包庫一部分的程序除外)。 這是在執行時運行的main方法的開始,實際上告訴了Java程序該怎么做。 程序中的其他類可能會提供主要方法使用的信息。
摘要 (Summary)
Learning Java is a good step toward getting comfortable with the idea that an enterprise machine learning pipeline is often composed of multiple languages. You might use technology related to Java and the JVM in order to gather data, clean data, and productionize your model. Plus, adding Java to your stack will improve your ability to write class-based, object-oriented programs.
學習Java是使企業機器學習管道通常由多種語言組成的想法的良好一步。 您可能會使用與Java和JVM相關的技術,以便收集數據,清理數據并生產模型。 另外,將Java添加到堆棧中將提高您編寫基于類,面向對象的程序的能力。
Hopefully, this article gave you a bit of an intuition for how Java could be useful to your DS work. Using your familiarity with Python (or R), you can quickly build toward better understanding of Java — and better apps and model deployments.
希望本文能使您對Java如何對DS工作有用的直覺有所了解。 利用對Python(或R)的熟悉,您可以快速構建對Java的更好理解,以及更好的應用程序和模型部署。
資源資源 (Resources)
6 Hours of Java Programming Tutorials by Caleb Curry
Caleb Curry撰寫的 6個小時的Java編程教程
Codecademy Java for building muscle memory
Codecademy Java用于建立肌肉記憶
JetBrains Academy project-based approach to learning Java
JetBrains Academy基于項目的 Java學習方法
If you enjoyed reading this article, follow me on Medium, LinkedIn, and Twitter for more ideas to advance your data science skills.
如果您喜歡閱讀本文 ,請在Medium , LinkedIn和Twitter上關注我,以獲取更多提高您的數據科學技能的想法。
翻譯自: https://towardsdatascience.com/java-for-data-science-f64631fdda12
總結
以上是生活随笔為你收集整理的使用Java解决您的数据科学问题的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 3D看片神器?华为公开全新立体投影专利:
- 下一篇: Java NIO原理和使用