4.8: Data Mining

Ly-Huong T. Pham, Tejal Desai-Naik, Laurie Hammond, & Wael Abdeljabbar
ASCCC Open Educational Resources Initiative (OERI)

Data mining is the process of sorting through big data (often measured in terabytes) to find useful information. In the past, there was a lack of data to analyze. Today the challenge is an overabundance of data that must be reviewed, which is called data overload. This becomes an issue because the user needs to evaluate which information is useful and which is not. Many businesses mine data to gain detailed insight into their customers and products and to optimize business decisions. The analysis is carried out with sophisticated programs that can combine multiple databases. The results are so large and complex that companies must find a way to store them, so data warehouses are needed. The data warehouse is where the information produced by data mining is stored and processed. The price of even a simple warehouse could start at $10 million.

Companies like Google, Netflix, Amazon, and Facebook are big users of data mining. They seek to find out who their customers are and how best to keep them and sell them more products. They also use it to review their own products. They do this by reviewing data and finding trends, patterns, and associations that inform their decisions. Generally, data mining is carried out by automated means against extensive data sets, such as those held in a data warehouse.

Examples of data mining include:

An analysis of sales from a large grocery chain might determine that milk is purchased more frequently the day after it rains in cities with a population of less than 50,000.
A bank may find that loan applicants whose bank accounts show particular deposit and withdrawal patterns are not good credit risks.
A baseball team may find that collegiate baseball players with specific statistics in hitting, pitching, and fielding make for more successful major league players.
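
The grocery example can be made concrete with a small sketch. The records, the field layout, and the function name `average_milk_sales` below are hypothetical and invented purely for illustration; a real analysis would run against millions of rows in a data warehouse. The sketch groups daily milk sales by whether the city is small (population under 50,000) and whether it rained the day before, then averages each group so the groups can be compared.

```python
from collections import defaultdict

# Hypothetical daily records: (city_population, rained_yesterday, milk_units_sold)
sales = [
    (30_000, True, 120),
    (30_000, False, 80),
    (45_000, True, 150),
    (45_000, False, 100),
    (200_000, True, 400),
    (200_000, False, 410),
]


def average_milk_sales(records, small_city_limit=50_000):
    """Average milk units per day, keyed by (is_small_city, rained_yesterday)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of units, number of days]
    for population, rained_yesterday, units in records:
        key = (population < small_city_limit, rained_yesterday)
        totals[key][0] += units
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


averages = average_milk_sales(sales)
for (is_small_city, rained_yesterday), average in sorted(averages.items()):
    print(f"small city={is_small_city}, rain yesterday={rained_yesterday}: {average:.1f}")
```

With this made-up data, small cities average more milk sold on days after rain than on other days, while the large city shows no such increase. That is the kind of pattern the grocery chain in the example would be looking for.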

In some cases, a data-mining project begins with a hypothesis in mind. For example, a grocery chain may already have some idea that buying patterns change after it rains and want a deeper understanding of exactly what is happening. In other cases, there are no presuppositions, and a data-mining program is run against large data sets to find whatever patterns and associations they contain.
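
A search with no presuppositions can be sketched as follows. This is only a minimal illustration, assuming hypothetical shopping baskets and a hypothetical helper named `frequent_pairs`; production tools use far more scalable algorithms. The code counts every pair of items that appears together in a basket and keeps the pairs whose support (the share of baskets containing both items) reaches a chosen threshold, without being told in advance which items might be related.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"milk", "bread", "butter"},
]


def frequent_pairs(transactions, min_support=0.4):
    """Return item pairs whose share of transactions is at least min_support."""
    pair_counts = Counter()
    for basket in transactions:
        # Sort the items so each pair is always counted under the same key
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1
    total = len(transactions)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


pairs = frequent_pairs(baskets)
for pair, support in sorted(pairs.items(), key=lambda item: -item[1]):
    print(pair, support)
```

Here the program surfaces bread and milk as the most frequent pairing on its own. An analyst would then decide whether such an association is meaningful enough to act on.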