Extracting knowledge from huge amounts of data is known as data mining.
Data mining is also known as “Knowledge Discovery from Data” (KDD).
Data mining is useful in many fields such as e-commerce, business, agriculture and health care. Huge
amounts of data come from different sources, and mining that data yields insights. Through those
insights a business can be improved, better-quality products can be produced, and people's quality of life can be improved.
If we have the sales information of an outlet, then by applying mining we can analyse the sales and
predict future sales. If we have weather data, we can analyse it and mine important patterns. We can predict
the upcoming weather conditions, which will be useful for farmers.
APPLICATIONS OF DATA MINING
Market Analysis
Market Basket Analysis
Market Segmentation
Target Market Analysis
Risk Analysis and Risk Prediction
Health Care Industry
DATA MINING ACTIVITIES
Data mining activities are divided into 2 categories:
Descriptive Data Mining
Predictive Data Mining
DESCRIPTIVE DATA MINING:
In descriptive data mining the data is analysed, classified and clustered, and association mining is
carried out. The data is described as it is, and the results are presented through visualisation.
PREDICTIVE DATA MINING
In predictive data mining, forecasting, anticipation or prediction is done after the data has been analysed.
If we have the sales data of a store, then we can predict the sales for the coming days.
ACTIVITIES OF DATA MINING
Classification
Clustering
Association Rule mining
Finding Frequent Patterns
Prediction
DATA MINING PROCESS (ALSO KNOWN AS KDD)
The data mining process includes 7 steps. Those are:
Data Cleaning:
If the data that we want to mine has missing, noisy or uncertain values,
then the data must be cleaned. Here cleaning means filling in the missing data, removing the noise, etc.
If the data is not cleaned and we proceed with the data mining process, then the data mining results may
not be accurate or correct. So in order to get accurate results we have to do data cleaning.
Data cleaning is one part of what is commonly called data preprocessing.
Data Integration
Data integration is combining data that is collected from different sources. It means that if the data is
collected from different sources, then combining it into one unified form is called data integration.
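As a rough illustration of integration, the sketch below merges records from two sources that share a key. The source names, the customer ids and the attribute names are all hypothetical and chosen only for this example; real integration also has to resolve naming and unit conflicts, which is not shown here.

```python
# Hypothetical records from two sources, both keyed by a customer id.
store_sales = {
    101: {"name": "Asha", "store_total": 250.0},
    102: {"name": "Ravi", "store_total": 90.0},
}
online_sales = {
    102: {"online_total": 40.0},
    103: {"online_total": 15.5},
}

def integrate(source_a, source_b):
    """Merge two keyed record sets into one unified set of records."""
    merged = {}
    for key in sorted(set(source_a) | set(source_b)):
        record = {}
        record.update(source_a.get(key, {}))  # attributes from the first source
        record.update(source_b.get(key, {}))  # attributes from the second source
        merged[key] = record
    return merged

combined = integrate(store_sales, online_sales)
```

A key that appears in only one source keeps only the attributes that source provides, so the integrated data can contain missing values that the cleaning methods then have to handle.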
Data Selection
After combining the data we have to extract the data relevant to the analysis from the whole collected data. This is called data selection.
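A minimal sketch of selection, assuming the data is held as a list of records: it keeps only the tuples that satisfy a condition and only the attributes of interest. The function name `select`, the records and the attribute names are hypothetical.

```python
def select(records, attributes, condition=None):
    """Keep the tuples that satisfy `condition`, restricted to `attributes`."""
    return [
        {name: record[name] for name in attributes}
        for record in records
        if condition is None or condition(record)
    ]

# Hypothetical sales records.
records = [
    {"id": 1, "city": "Pune", "item": "rice", "amount": 120},
    {"id": 2, "city": "Delhi", "item": "milk", "amount": 45},
    {"id": 3, "city": "Pune", "item": "milk", "amount": 60},
]

# Select the item and amount of every sale made in Pune.
pune_sales = select(records, ["item", "amount"], lambda r: r["city"] == "Pune")
```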
Data Transformation
Converting data from one form to another is called data transformation.
For example, if the data set contains true/false or Yes/No values,
they are converted to a 1/0 format in this stage.
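The Yes/No to 1/0 conversion can be sketched as follows. The function name `to_binary` and the accepted spellings are assumptions made for this example.

```python
def to_binary(value):
    """Map Yes/No or true/false values to 1/0."""
    mapping = {"yes": 1, "true": 1, "no": 0, "false": 0}
    key = str(value).strip().lower()
    if key not in mapping:
        raise ValueError(f"cannot convert {value!r} to 1/0")
    return mapping[key]

# Hypothetical attribute values before and after transformation.
raw = ["Yes", "No", True, "false"]
encoded = [to_binary(v) for v in raw]
```

Raising an error for unknown values, instead of silently mapping them to 0, keeps unexpected entries visible so they can be dealt with during cleaning.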
Data Mining
Extraction of knowledge or patterns from huge data is called data mining. In this step the actual data mining is done.
Pattern Evaluation
Pattern evaluation is one important step in the process of data mining. In this step the patterns are
evaluated to determine whether the data mining results are trustworthy or not.
Visualization
Visualization is the last step in the process of data mining. In this phase the data is visualized with
different graphs. This gives a clear view of the data.
There are many performance evaluation metrics for evaluating the data mining process. Those are:
Accuracy
Running time
Confidence
Specificity
Sensitivity
Precision
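Four of the metrics listed above (accuracy, precision, sensitivity and specificity) can be computed from the counts of correct and incorrect predictions of a two-class classifier. The sketch below uses their standard definitions; the labels are hypothetical, and it assumes both classes occur in the actual and predicted labels so that no denominator is zero.

```python
def evaluate(actual, predicted):
    """Compute common metrics for labels where 1 = positive and 0 = negative."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical actual and predicted class labels.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = evaluate(actual, predicted)
```

Running time and confidence are not covered by this sketch: running time is measured on the algorithm itself, and confidence is normally defined for association rules.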
KINDS OF DATA
There are four kinds of data on which we can do data mining:
Database Data
Data Warehouse Data
Transactional Data
Miscellaneous Data
ISSUES IN DATA MINING
Mining Methodology
Mining various and new kinds of knowledge.
Mining knowledge in multidimensional space => exploratory multidimensional data mining
Data mining, an interdisciplinary effort: bug mining, NLP.
Boosting the power of discovery in a networked environment
Handling uncertainty, noise or incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Ad hoc data mining and data mining query Language
Presentation and visualization of data mining results
Efficiency & Scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed and incremental mining algorithms.
Diversity of Database Types
Handling complex types of data: temporal, biological sequences, sensor data.
Mining dynamic, networked, and global data repositories
Data mining & Society:
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining
DATA CLEANING
Data cleaning is performed on the data before doing data mining. There are different methods
to be followed in order to clean the data. In data cleaning we handle missing values and noisy data.
HANDLING MISSING VALUES:
IGNORE THE TUPLE:
If only a few values are missing, then we can ignore the entire tuple. This is an easy and lazy approach,
but it is not treated as good practice.
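A minimal sketch of ignoring tuples, assuming a missing value is represented by `None`. The function name and the student tuples are hypothetical.

```python
def drop_incomplete(tuples):
    """Keep only the tuples that have no missing (None) value."""
    return [t for t in tuples if all(value is not None for value in t)]

# Hypothetical (name, age, marks) tuples.
students = [
    ("Anu", 20, 81),
    ("Bala", None, 67),
    ("Chitra", 21, None),
    ("Dev", 22, 74),
]
complete = drop_incomplete(students)
```

Here half of the tuples are thrown away even though each has only one missing value, which shows why this approach loses useful data.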
FILL MISSING VALUES MANUALLY:
The missing values are filled in manually. This is a time-consuming process: if the data is
huge in size, it takes more time.
FILL MISSING VALUE WITH A GLOBAL CONSTANT:
The missing values are filled with an appropriate global constant.
FILL THE MISSING VALUES WITH A MEASURE OF THE ATTRIBUTE:
This is a good method.
In this the missing values are filled with the mean or median of the attribute values. It means that
if we want to fill the marks attribute of a student relation, then the mean or median of the marks
is calculated and that is used to fill the missing values.
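Filling with the mean or median can be sketched with Python's standard `statistics` module. The function name `fill_missing` and the marks are hypothetical, and missing values are assumed to be stored as `None`.

```python
from statistics import mean, median

def fill_missing(values, measure=mean):
    """Replace None entries with a measure (mean or median) of the known values."""
    known = [v for v in values if v is not None]
    fill_value = measure(known)
    return [fill_value if v is None else v for v in values]

# Hypothetical marks attribute with two missing entries.
marks = [60, None, 70, 95, None]
filled_with_mean = fill_missing(marks, mean)      # the known values average to 75
filled_with_median = fill_missing(marks, median)  # the middle known value is 70
```

The median is less affected by extreme values than the mean, so it is often the safer choice when the attribute is skewed.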
FILL THE MISSING VALUES WITH A MEASURE OF A SAMPLE OF THE ATTRIBUTE:
In this, instead of taking all of the attribute values, we consider a sample of the attribute
values to calculate the mean or median, and that value is used to fill the missing values.
FILL WITH THE MOST PROBABLE VALUE:
This is done by finding the most probable value of the attribute, which is then used to fill the missing values.
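One simple way to approximate the most probable value is to take the value that occurs most often among the known values. This is an assumption made for the sketch; the most probable value can also be estimated with a model such as regression or a decision tree, which is not shown here. The grades are hypothetical.

```python
from collections import Counter

def fill_with_most_frequent(values):
    """Replace None entries with the most frequent known value."""
    known = [v for v in values if v is not None]
    most_frequent, _count = Counter(known).most_common(1)[0]
    return [most_frequent if v is None else v for v in values]

# Hypothetical grade attribute with two missing entries.
grades = ["A", "B", None, "A", None]
filled = fill_with_most_frequent(grades)
```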
HANDLING NOISY DATA
An abrupt change in data is treated as noise. The noise may occur due to external signals or unexpected errors.
Noisy data must be handled properly in order to get efficient results. The noise can be handled by using the
following methods.
Binning
Regression
Outlier Analysis
BINNING
In this process the data is divided into buckets or “bins”. The data is kept in the bins.
Then the values are smoothed by any one of the following 3 methods.
Smoothing by bin mean
Smoothing by bin median
Smoothing by bin boundaries
In the binning process the data is first sorted, and the sorted data is then divided into bins.
Ex: data: 6,7,6,2,3,8,6,12,12
Sorted data : 2,3,6,6,6,7,8,12,12
Then the values are kept in the bins as follows
bin 1 : 2,3,6
bin 2: 6,6,7
bin 3 : 8,12,12
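The sort-then-divide step can be sketched as below. The function name is hypothetical, and the data list is chosen so that it produces exactly the three bins shown above, with three values per bin.

```python
def equal_frequency_bins(data, bin_size):
    """Sort the data and cut it into consecutive bins of `bin_size` values."""
    ordered = sorted(data)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

# Values that sort into the three bins listed above.
data = [6, 7, 6, 2, 3, 8, 6, 12, 12]
bins = equal_frequency_bins(data, 3)
# bins is [[2, 3, 6], [6, 6, 7], [8, 12, 12]]
```

If the number of values is not a multiple of `bin_size`, the last bin in this sketch is simply shorter than the others.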
SMOOTHING BY BIN MEAN:
Bin 1 : 2,3,6 average = (2+3+6)/3 = 3.67
Bin 2 : 6,6,7 average = (6+6+7)/3 = 6.33
Bin 3 : 8,12,12 average = (8+12+12)/3 = 10.67
Then the bin values are replaced with the bin mean.
The new values in the bins are as follows.
Bin 1 : 3.67,3.67,3.67
Bin 2 : 6.33,6.33,6.33
Bin 3 : 10.67,10.67,10.67
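Smoothing by bin mean can be checked with a few lines of code. This sketch takes the three bins 2,3,6 / 6,6,7 / 8,12,12 and rounds each mean to two decimal places; the function name and the rounding are choices made for this example.

```python
def smooth_by_mean(bins):
    """Replace every value in a bin with that bin's mean, rounded to 2 places."""
    return [[round(sum(b) / len(b), 2)] * len(b) for b in bins]

bins = [[2, 3, 6], [6, 6, 7], [8, 12, 12]]
smoothed = smooth_by_mean(bins)
# The bin means are 11/3, 19/3 and 32/3, i.e. about 3.67, 6.33 and 10.67.
```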
SMOOTHING BY BIN MEDIAN
In this the bin median is calculated, and that value is used to smooth the bin values.
Ex:
Bin 1 : 2,3,6
Bin 2 : 6,6,7
Bin 3 : 8,12,12
After applying the bin median, the bin values are as follows
Bin 1 : 3,3,3
Bin 2 : 6,6,6
Bin 3 : 12,12,12
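The same result can be reproduced with the `median` function from Python's standard `statistics` module. The function name `smooth_by_median` is hypothetical.

```python
from statistics import median

def smooth_by_median(bins):
    """Replace every value in a bin with that bin's median."""
    return [[median(b)] * len(b) for b in bins]

bins = [[2, 3, 6], [6, 6, 7], [8, 12, 12]]
smoothed = smooth_by_median(bins)
# smoothed is [[3, 3, 3], [6, 6, 6], [12, 12, 12]]
```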
SMOOTHING BY BIN BOUNDARIES
In this method the minimum and maximum values of each bin are taken as the bin boundaries,
and each value in the bin is replaced with the nearest boundary value.
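A minimal sketch of boundary smoothing on the bins 2,3,6 / 6,6,7 / 8,12,12, taking the smallest and largest value of each bin as its boundaries. Sending a value that is equally close to both boundaries to the lower one is an assumption made for this example.

```python
def smooth_by_boundaries(bins):
    """Replace each value with the closer of its bin's minimum and maximum."""
    result = []
    for b in bins:
        low, high = min(b), max(b)
        # Ties go to the lower boundary (an assumption of this sketch).
        result.append([low if v - low <= high - v else high for v in b])
    return result

bins = [[2, 3, 6], [6, 6, 7], [8, 12, 12]]
smoothed = smooth_by_boundaries(bins)
# smoothed is [[2, 2, 6], [6, 6, 7], [8, 12, 12]]
```

In Bin 1 the value 3 is closer to 2 than to 6, so it becomes 2. The other values already sit on a boundary and stay unchanged.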
REGRESSION
In regression the data is fitted to a formula (for example the equation of a straight line), so that it can be
plotted as a best-fit line and the noisy values can be smoothed towards that line.
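As an illustration, the sketch below fits a straight line y = slope * x + intercept by ordinary least squares and replaces each noisy value with the value on the line. Simple linear regression is only one possible choice, and the x and y values are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy measurements.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
slope, intercept = fit_line(xs, ys)

# Smoothed values: the points on the fitted line.
smoothed = [slope * x + intercept for x in xs]
```

The sketch assumes the x values are not all identical, since otherwise the slope is undefined.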