In general terms, Data Mining comprises techniques and algorithms for determining interesting patterns from large datasets. There are currently hundreds of algorithms that perform tasks such as frequent pattern mining, clustering, and classification, among others. Understanding how these algorithms work and how to use them effectively is a continuous challenge faced by data mining analysts, researchers, and practitioners, in particular because the algorithm behavior and patterns it provides may change significantly as a function of its parameters. In practice, most of the data mining literature is too abstract regarding the actual use of the algorithms and parameter tuning is usually a frustrating task. On the other hand, there is a large number of implementations available, such as those in the R project, but their documentation focus mainly on implementation details without providing a good discussion about parameter-related trade-offs associated with each of them.
This Wikibook aims to fill this gap by integrating three pieces of information for each technique: description and rationale, implementation details, and use cases. The description and rationale of each technique provide the necessary background for understanding the implementation and applying it to real scenarios. The implementation details not only expose the algorithm design, but also explain its parameters in the light of the rationale provided previously. Finally, the use cases provide an experience of the algorithms use on synthetic and real datasets.
The choice of the R project as the computational platform associated with this Wikibook stems from its popularity (and thus critical mass), ease of programming, good performance, and an increasing use in several fields, such as bioinformatics and finance.
If you want to learn how to program in the R language, read the book R Programming.