Discretization of continuous features

In statistics and machine learning, discretization refers to the process of converting or partitioning continuous attributes, features or variables to discretized or nominal attributes/features/variables/intervals. This can be useful when creating probability mass functions – formally, in density estimation. It is a form of discretization in general and also of binning, as in making a histogram. Whenever continuous data is discretized, there is always some amount of discretization error. The goal is to reduce the amount to a level considered negligible for the modeling purposes at hand.

Typically data is discretized into partitions of K equal lengths/width (equal intervals) or K% of the total data (equal frequencies).[1]

Mechanisms for discretizing continuous data include Fayyad & Irani's MDL method,[2] which uses mutual information to recursively define the best bins, CAIM, CACC, Ameva, and many others[3]

Many machine learning algorithms are known to produce better models by discretizing continuous attributes.[4]

Software

This is a partial list of software that implement MDL algorithm.

discretize4crf tool designed to work with popular CRF implementations (C++)
mdlp in the R package discretization
Discretize in the R package RWeka

gollark: I have other pictures of the automelon machines and stuff.

gollark: Er, I think a month or two at this point.

gollark: It's me, <@160279332454006795>, and some other people.

gollark: The autocrafting system of the Unicode Consortium on there can actually do roughly the same things, because it contains similar AE2 hardware and has a lot of machines hooked up, but this has the potatOS installation machine and is faster through having dedicated machines for things.

gollark: It makes computers *with no external inputs*, and also makes OC computers, you see.

References

Clarke, E. J.; Barton, B. A. (2000). "Entropy and MDL discretization of continuous variables for Bayesian belief networks" (PDF). International Journal of Intelligent Systems. 15: 61–92. doi:10.1002/(SICI)1098-111X(200001)15:1<61::AID-INT4>3.0.CO;2-O. Retrieved 2008-07-10.
Fayyad, Usama M.; Irani, Keki B. (1993) "Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning" (PDF). hdl:2014/35171., Proc. 13th Int. Joint Conf. on Artificial Intelligence (Q334 .I571 1993), pp. 1022-1027
Dougherty, J.; Kohavi, R. ; Sahami, M. (1995). "Supervised and Unsupervised Discretization of Continuous Features". In A. Prieditis & S. J. Russell, eds. Work. Morgan Kaufmann, pp. 194-202
Kotsiantis, S.; Kanellopoulos, D (2006). "Discretization Techniques: A recent survey". GESTS International Transactions on Computer Science and Engineering. 32 (1): 47–58. CiteSeerX 10.1.1.109.3084.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.

[clarke-1] Clarke, E. J.; Barton, B. A. (2000). "Entropy and MDL discretization of continuous variables for Bayesian belief networks" (PDF). International Journal of Intelligent Systems. 15: 61–92. doi:10.1002/(SICI)1098-111X(200001)15:1<61::AID-INT4>3.0.CO;2-O. Retrieved 2008-07-10.

[2] Fayyad, Usama M.; Irani, Keki B. (1993) "Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning" (PDF). hdl:2014/35171., Proc. 13th Int. Joint Conf. on Artificial Intelligence (Q334 .I571 1993), pp. 1022-1027

[3] Dougherty, J.; Kohavi, R. ; Sahami, M. (1995). "Supervised and Unsupervised Discretization of Continuous Features". In A. Prieditis & S. J. Russell, eds. Work. Morgan Kaufmann, pp. 194-202

[4] Kotsiantis, S.; Kanellopoulos, D (2006). "Discretization Techniques: A recent survey". GESTS International Transactions on Computer Science and Engineering. 32 (1): 47–58. CiteSeerX 10.1.1.109.3084.

Discretization of continuous features

Software

See also

References