
Modeling

Data is the lifeblood of any scientist, and the selection of data providers is critical when developing or evaluating any statistical inference or machine learning algorithm.

What is a model?

We briefly introduced the concept of a model in the Model categorization section of Chapter 1, Getting Started.

What constitutes a model? Wikipedia provides a reasonably good definition of a model as understood by scientists [2:1]:

A scientific model seeks to represent empirical objects, phenomena, and physical processes in a logical and objective way.

Models that are rendered in software allow scientists to leverage computational power to simulate, visualize, manipulate and gain intuition about the entity, phenomenon or process being represented.

In statistics and probability theory, a model describes data that one might observe from a system, expressing any form of uncertainty and noise. A model allows us to infer rules, make predictions, and learn from data.

A model is composed of features, also known as attributes or variables, and a set of relations between those features. For instance, the model represented by the function f(x, y) = x.sin(2y) has two features, x and y, and a relation f. Those two features are assumed to be independent. If the model is subject to a constraint, such as f(x, y) < 20, for example, then the conditional independence is no longer valid.
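This model can be sketched directly in Scala; the object and method names below are illustrative, not taken from any library:

```scala
// A minimal sketch of the model f(x, y) = x.sin(2y) with the
// constraint f(x, y) < 20. Names are illustrative only.
object ModelSketch {
  // The relation f between the two features x and y
  val f: (Double, Double) => Double = (x, y) => x * math.sin(2.0 * y)

  // The constraint couples x and y: once imposed, the two features
  // can no longer be treated as independent.
  def isValid(x: Double, y: Double): Boolean = f(x, y) < 20.0
}
```

For example, `ModelSketch.f(2.0, math.Pi / 4)` evaluates 2·sin(π/2) = 2.0, while `ModelSketch.isValid(100.0, math.Pi / 4)` is false because 100·sin(π/2) = 100 violates the constraint.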

An astute Scala programmer would associate a model with a monoid whose underlying set is the set of observations and whose operator is the function implementing the model.
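One way to make this analogy concrete is a minimal monoid trait; the trait and instance below are a sketch for illustration, not a definition from the book:

```scala
// A loose illustration of the model-as-monoid analogy: the carrier
// set holds observations and the associative operator (with an
// identity element) folds them through the model.
object MonoidSketch {
  trait Monoid[T] {
    def zero: T            // identity element
    def op(a: T, b: T): T  // associative binary operator
  }

  // Observations combined by summation: associative, identity 0.0
  val sumMonoid: Monoid[Double] = new Monoid[Double] {
    val zero = 0.0
    def op(a: Double, b: Double): Double = a + b
  }

  // Reduce a set of observations using the monoid's operator
  def reduceObs(obs: Seq[Double], m: Monoid[Double]): Double =
    obs.foldLeft(m.zero)(m.op)
}
```

Here `reduceObs(Seq(1.0, 2.0, 3.0), sumMonoid)` folds the observations into 6.0; any associative operator with an identity could be substituted for summation.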

Models come in a variety of shapes and forms:

  • Parametric: This consists of functions and equations (for example, y = sin(2t+w))
  • Differential: This consists of ordinary and partial differential equations (for example, dy = 2x.dx)
  • Probabilistic: This consists of probability distributions (for example, the Poisson distribution p(x|λ) = exp(x.log λ – λ)/x!)
  • Graphical: This consists of graphs that abstract out the conditional independence between variables (for example, p(x,y|c) = p(x|c).p(y|c))
  • Directed graphs: This consists of temporal and spatial relationships (for example, a scheduler)
  • Numerical method: This consists of computational methods such as finite differences, finite elements, or Newton-Raphson
  • Chemistry: This consists of formulas and components (for example, H2O, 2Fe + 3Cl2 = 2FeCl3)
  • Taxonomy: This consists of a semantic definition and a relationship of concepts (for example, APG/Eudicots/Rosids/Huaceae/Malvales)
  • Grammar and lexicon: This consists of a syntactic representation of documents (for example, Scala programming language)
  • Inference logic: This consists of rules (for example, IF (stock vol > 1.5 * average) AND (rsi > 80) THEN …)
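The inference-logic rule above can be sketched as a simple predicate; the field names (vol, rsi) and the averageVol parameter are hypothetical, chosen only to mirror the rule:

```scala
// A hedged sketch of the inference-logic rule:
// IF (stock vol > 1.5 * average) AND (rsi > 80) THEN signal
object RuleSketch {
  // Hypothetical quote record: trading volume and relative strength index
  case class Quote(vol: Double, rsi: Double)

  // The rule fires only when both conditions hold
  def signal(q: Quote, averageVol: Double): Boolean =
    q.vol > 1.5 * averageVol && q.rsi > 80.0
}
```

For instance, a quote with volume 200 against an average of 100 and an RSI of 85 triggers the rule; dropping the volume to 120 does not.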

Model versus design

The confusion between model and design is quite common in computer science because these terms mean different things to different people, depending on the subject. The following metaphors should help with your understanding of these two concepts:

  • Modeling: This is describing something you know. A model makes assumptions, which become assertions if proven correct (for example, the US population p increases by 1.2% a year, so dp/dt = 0.012p).
  • Designing: This is manipulating representations of things you don't know. Designing can be regarded as the exploratory phase of modeling (for example, which features contribute to the growth of the US population? Birth rate? Immigration? Economic conditions? Social policies?).

Selecting features

The selection of a model's features is the process of discovering and documenting the minimum set of variables required to build the model. Scientists assume that data contains many redundant or irrelevant features. Redundant features do not provide information beyond that already given by the selected features, and irrelevant features provide no useful information at all.

Feature selection consists of two consecutive steps:

  1. Search for new feature subsets.
  2. Evaluate these feature subsets using a scoring mechanism.

The process of evaluating each possible subset of features to find the one that maximizes the objective function or minimizes the error rate is computationally intractable for large datasets. A model with n features requires 2^n – 1 evaluations!
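The combinatorial explosion is easy to verify by enumerating the feature subsets themselves; this sketch uses only the standard library:

```scala
// Sketch: enumerating every non-empty subset of n features shows why
// exhaustive feature selection needs 2^n - 1 scoring passes.
object SubsetSketch {
  def featureSubsets[T](features: List[T]): List[List[T]] =
    features
      .foldLeft(List(List.empty[T])) { (acc, f) =>
        // each existing subset spawns a copy that also contains f,
        // doubling the count at every step
        acc ++ acc.map(f :: _)
      }
      .filter(_.nonEmpty) // drop the empty subset: 2^n - 1 remain
}
```

With four candidate features, `featureSubsets` produces 2^4 – 1 = 15 subsets, each of which would have to be scored in an exhaustive search.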

Extracting features

An observation is a set of indirect measurements of hidden, also known as latent, variables, which may be noisy or contain a high degree of correlation and redundancies. Using raw observations in a classification task would very likely produce inaccurate results. Using all features in each observation also incurs a high computation cost.

The purpose of feature extraction is to reduce the number of variables or dimensions of the model by eliminating redundant or irrelevant features. The features are extracted by transforming the original set of observations into a smaller set at the risk of losing some vital information embedded in the original set.
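A crude illustration of this transformation is dropping near-constant (low-variance) features, which carry little information; the variance threshold here is an illustrative assumption, not a method prescribed by the book:

```scala
// Simplified feature extraction: remove features whose variance falls
// below a threshold. Near-constant columns are uninformative, but the
// reduced set may still lose information embedded in the original.
object ExtractSketch {
  def variance(xs: Seq[Double]): Double = {
    val m = xs.sum / xs.size
    xs.map(x => (x - m) * (x - m)).sum / xs.size
  }

  // observations: rows are samples, columns are features
  def extract(obs: Seq[Seq[Double]], minVar: Double): Seq[Seq[Double]] = {
    val keep = obs.transpose.zipWithIndex.collect {
      case (col, i) if variance(col) >= minVar => i
    }
    obs.map(row => keep.map(row)) // project each row onto kept columns
  }
}
```

Given three observations whose second feature is the constant 5.0, `extract` removes that column and returns two-feature observations.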
