
Dealing with big data

Big data existed long before the phrase was invented. For instance, banks and stock exchanges have been processing billions of transactions daily for years, and airline companies run worldwide real-time infrastructures for the operational management of passenger bookings. So, what is big data really? Doug Laney (2001) suggested that big data is defined by three Vs: volume, velocity, and variety. Therefore, to answer the question of whether your data is big, we can translate this into the following three sub-questions:

  • Volume: Can you store your data in memory?
  • Velocity: Can you process new incoming data with a single machine?
  • Variety: Is your data from a single source?

If you answered all of these questions with yes, then your data is probably not big, and you have just simplified your application architecture.

If your answer to all of these questions was no, then your data is big! However, if you have mixed answers, then it's complicated. Some may argue that one V is the most important; others may say that the other Vs matter more. From a machine learning point of view, there is a fundamental difference between implementing an algorithm that processes data in memory and one that processes data from distributed storage. Therefore, a rule of thumb is: if you cannot store your data in memory, then you should look into a big data machine learning library.
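The three sub-questions and the rule of thumb can be written down as a small checklist. The following is only a minimal sketch: the class name `BigDataCheck`, its method names, and the `usableFraction` headroom parameter are hypothetical and chosen for illustration, not taken from any library.

```java
// Hypothetical helper that encodes the three-V checklist; all names are illustrative.
final class BigDataCheck {

    enum Verdict { NOT_BIG, BIG, MIXED }

    // Each flag is the yes/no answer to one of the three sub-questions.
    static Verdict classify(boolean fitsInMemory,
                            boolean singleMachineKeepsUp,
                            boolean singleSource) {
        if (fitsInMemory && singleMachineKeepsUp && singleSource) {
            return Verdict.NOT_BIG;   // all answers are yes
        }
        if (!fitsInMemory && !singleMachineKeepsUp && !singleSource) {
            return Verdict.BIG;       // all answers are no
        }
        return Verdict.MIXED;         // "it's complicated"
    }

    // Volume question: does the estimated data size fit into the heap,
    // leaving some headroom for the algorithm itself?
    // usableFraction is an assumed safety margin, for example 0.5.
    static boolean fitsInMemory(long estimatedDataBytes,
                                long maxHeapBytes,
                                double usableFraction) {
        return estimatedDataBytes <= (long) (maxHeapBytes * usableFraction);
    }

    // Rule of thumb from the text: data that does not fit in memory
    // calls for a big data machine learning library.
    static boolean needsBigDataLibrary(boolean fitsInMemory) {
        return !fitsInMemory;
    }
}
```

In practice, `maxHeapBytes` could be supplied from `Runtime.getRuntime().maxMemory()`, which reports the maximum amount of memory the JVM will attempt to use. How much of that is really usable depends on the algorithm, so the headroom fraction is a guess to be tuned.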

The exact answer depends on the problem that you are trying to solve. If you're starting a new project, I suggest that you start off with a single-machine library and prototype your algorithm, possibly with a subset of your data if the entire dataset does not fit into memory. Once you've established good initial results, consider moving to something more heavy-duty, such as Mahout or Spark.
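One way to obtain such a subset without loading everything is reservoir sampling, which keeps a fixed-size uniform random sample while reading the data once. This technique is not prescribed by the text; it is offered here as one possible approach, and the class name `ReservoirSample` and its `sample` method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper: draws up to k items uniformly at random from a stream
// in a single pass, holding at most k items in memory (reservoir sampling).
final class ReservoirSample {

    static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be non-negative");
        }
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;  // number of items read so far
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k items.
                reservoir.add(item);
            } else {
                // Pick a slot in [0, seen); the new item replaces an old one
                // only if the slot falls inside the reservoir (probability k/seen).
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, item);
                }
            }
        }
        return reservoir;
    }
}
```

The iterator could wrap a file reader or a database cursor, so only the sample itself has to fit into memory. If the stream holds fewer than `k` items, the method simply returns all of them.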
