
Defining a methodology

Let's start by clarifying the roles of the data scientist, the software engineer, and the domain expert.

A domain or subject-matter expert is a person with authoritative or credited expertise in a particular area or topic. A chemist is an expert in the domain of chemistry and possibly related fields.

A data scientist solves problems related to data in a variety of fields such as biological sciences, health care, marketing, or finance. Data and text mining, signal processing, statistical analysis, and modeling using machine learning algorithms are some of the activities performed by a data scientist.

A software engineer, or developer, performs all the tasks related to creating software applications, including analysis, design, coding, testing, and deployment.

A data scientist has many options in selecting and implementing a classification or clustering algorithm.

Firstly, a mathematical or statistical model has to be selected to extract knowledge from the raw input data or from the output of an upstream data transformation. The selection of the model is constrained by the following parameters:

  • Business requirements, such as accuracy of results or computation time
  • Availability of training data, algorithms, and libraries
  • Access to a domain or subject-matter expert, if needed

Secondly, the software engineer has to select a computational and deployment framework suitable for the amount of data to be processed. The computational context is defined by the following parameters:

  • Available resources, such as machines, CPU, memory, or I/O bandwidth
  • Implementation strategy, such as iterative versus recursive computation or caching
  • Requirement for responsiveness of the overall process, such as duration of computation or display of intermediate results

Thirdly, a domain expert has to tag or label the observations in order to generate an accurate classifier.

Finally, the model has to be validated against a reliable test dataset.
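
As a minimal sketch of this validation step, the following Scala snippet computes the fraction of labeled test observations that a classifier predicts correctly. The names `Labeled`, `accuracy`, and `classify` are hypothetical and chosen for illustration only, and plain accuracy is just one of several possible validation metrics.

```scala
// Illustrative sketch: all names here are assumptions, not part of the original text.
object Validation {
  // A test observation: a feature vector and the label assigned by the domain expert.
  final case class Labeled(features: Vector[Double], label: Int)

  // Fraction of test observations whose predicted label matches the expected label.
  def accuracy(classify: Vector[Double] => Int, testSet: Seq[Labeled]): Double = {
    require(testSet.nonEmpty, "test set must not be empty")
    val correct = testSet.count(obs => classify(obs.features) == obs.label)
    correct.toDouble / testSet.size
  }
}
```

For example, a hypothetical threshold classifier such as `(x: Vector[Double]) => if (x.head > 0.5) 1 else 0` can be passed as the `classify` argument together with a sequence of `Labeled` test observations.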

The following diagram illustrates the selection process to create a workflow:

Figure: Statistical and computational modeling for machine learning applications

The parameters of a data transformation may need to be reconfigured according to the output of the upstream data transformation. Scala's higher-order functions are particularly suitable for implementing configurable data transformations.
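
One way to picture this is the sketch below, in which a downstream transformation is configured from the output of the upstream one. The names `Transform`, `scale`, `center`, `chain`, and `normalize` are hypothetical and introduced only for illustration. It is a simplified example under those assumptions, not a definitive design.

```scala
// Illustrative sketch: all names here are assumptions, not part of the original text.
object ConfigurableTransform {
  // A data transformation is modeled as a plain function from input to output.
  type Transform[A, B] = A => B

  // Higher-order function: takes a configuration parameter and returns a transformation.
  def scale(factor: Double): Transform[Vector[Double], Vector[Double]] =
    xs => xs.map(_ * factor)

  // Upstream transformation: centers the observations around their mean.
  val center: Transform[Vector[Double], Vector[Double]] = xs => {
    val mean = if (xs.isEmpty) 0.0 else xs.sum / xs.size
    xs.map(_ - mean)
  }

  // Chains an upstream step with a downstream step whose parameters
  // are derived from the upstream output.
  def chain[A, B, C](upstream: Transform[A, B])(configure: B => Transform[B, C]): Transform[A, C] =
    a => {
      val b = upstream(a)
      configure(b)(b)
    }

  // The scaling factor is reconfigured from the largest absolute centered value.
  val normalize: Transform[Vector[Double], Vector[Double]] =
    chain(center) { centered =>
      val maxAbs = centered.map(x => math.abs(x)).foldLeft(0.0)((a, b) => math.max(a, b))
      scale(if (maxAbs == 0.0) 1.0 else 1.0 / maxAbs)
    }
}
```

Because `chain` receives the configuration logic as a function, the downstream step can be swapped or re-parameterized without touching the upstream step.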
