Data Mining (Unit 1)

WHAT IS DATA MINING

Extracting knowledge from huge amounts of data is known as data mining.
Data mining is also known as “Knowledge Discovery from Data” (KDD). Data mining is useful in many fields like e-commerce, business, agriculture and health care. Huge amounts of data come from different sources; mining that data gives insights, and through those insights businesses can be improved, better-quality products can be produced, and human life can be made easier. For example, if we have the sales information of an outlet, then by applying mining we can analyse the sales and also predict future sales. If we have weather data, we can analyse it and mine important patterns, and predict the future (next) weather conditions, which will be useful for farmers.

APPLICATION OF DATA MINING

  • Market Analysis
  • Market Basket Analysis
  • Market Segmentation
  • Target Market Analysis
  • Risk Analysis and Risk Prediction
  • Health Care Industry

DATA MINING ACTIVITIES

Data mining activities are divided into two categories:
  • Descriptive Data Mining
  • Predictive Data Mining
DESCRIPTIVE DATA MINING:
In descriptive mining the data is analysed, classified and clustered, association mining is carried out, and the results are projected as visualisations. It describes what is present in the data.
PREDICTIVE DATA MINING
In predictive data mining, after analysis of the data, forecasting, anticipation or prediction is done. For example, if we have the sales data of a store, then we can predict the sales of the coming days.
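As a minimal sketch of the sales-prediction idea above, the next day's sales can be forecast as the average of the last few days. The sales figures here are invented for illustration, and a simple moving average is only one of many possible prediction methods:

```python
# Sketch of predictive mining: forecast the next value of a series
# as the mean of the most recent `window` values.
# The daily_sales numbers are invented illustration data.

def moving_average_forecast(sales, window=3):
    """Predict the next value as the mean of the last `window` values."""
    recent = sales[-window:]
    return sum(recent) / len(recent)

daily_sales = [120, 135, 128, 140, 150, 145]
prediction = moving_average_forecast(daily_sales)
print(prediction)  # mean of 140, 150, 145 -> 145.0
```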

ACTIVITIES OF DATA MINING

  • Classification
  • Clustering
  • Association Rule mining
  • Finding Frequent Patterns
  • Prediction

DATA MINING PROCESS (ALSO KNOWN AS KDD)

The data mining process includes seven steps:
  1. Data Cleaning:
    If the data which we want to mine has missing values, noisy data or uncertain data, then the data must be cleaned. Here cleaning means filling in the missing data, removing the noise, etc. If the data is not cleaned and we proceed with the data mining process, then the results may not be accurate or correct. So, in order to get good accuracy, we have to do data cleaning. Data cleaning is also known as data preprocessing.
  2. Data Integration
    Data integration is combining data which is collected from different sources. That is, if the data is collected from different sources, then combining that data into one form is called data integration.
  3. Data Selection
    After combining the data, we have to extract the relevant portion from the whole collected data. This is called data selection.
  4. Data Transformation
    Converting data from one form to another form is called data transformation. For example, if the data set contains true/false or Yes/No values, then in this stage they are converted into 1/0 format.
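The Yes/No to 1/0 conversion described above can be sketched as follows; the record contents are invented for illustration:

```python
# Sketch of the data transformation step: map Yes/No (or True/False)
# values to 1/0, leaving other values unchanged.

def transform_record(record):
    mapping = {"yes": 1, "no": 0, "true": 1, "false": 0}
    # Normalise each value to lowercase text before looking it up,
    # so "Yes", "YES" and True all match.
    return [mapping.get(str(v).lower(), v) for v in record]

row = ["Yes", "No", True, 42]
print(transform_record(row))  # [1, 0, 1, 42]
```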
  5. Data Mining
    Extraction of knowledge or patterns from the huge data is called data mining. In this step the actual mining is done.
  6. Pattern Evaluation
    Pattern evaluation is an important step in the data mining process. In this step the discovered patterns are evaluated to determine whether the data mining results are trustworthy.
  7. Visualization
    Visualization is the last step in the data mining process. In this phase the results are visualized with different graphs, which give a clear view of the data.
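The first five steps above can be sketched as a toy Python pipeline. Every function here is only a placeholder standing in for the real step, and the data values are invented for illustration:

```python
# Toy sketch of the KDD pipeline (steps 1-5). Each function is a
# stand-in: real systems use databases, ETL tools and mining
# algorithms in place of these one-liners.

def integrate(*sources):
    """Step 2: combine records collected from different sources."""
    merged = []
    for src in sources:
        merged.extend(src)
    return merged

def clean(data):
    """Step 1: drop missing (None) values."""
    return [x for x in data if x is not None]

def select(data, predicate):
    """Step 3: keep only the records relevant to the task."""
    return [x for x in data if predicate(x)]

def transform(data):
    """Step 4: rescale values into the 0-1 range (min-max)."""
    lo, hi = min(data), max(data)
    return [(x - lo) / (hi - lo) for x in data]

def mine(data):
    """Step 5: a stand-in 'mining' step - here just the mean."""
    return sum(data) / len(data)

source_a = [10, None, 30]
source_b = [20, 40, None]
combined = clean(integrate(source_a, source_b))  # [10, 30, 20, 40]
selected = select(combined, lambda x: x >= 20)   # [30, 20, 40]
pattern = mine(transform(selected))              # mean of [0.5, 0.0, 1.0]
print(pattern)  # 0.5
```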


There are many performance evaluation metrics available to evaluate the data mining process:
  • Accuracy
  • Running time
  • Confidence
  • Specificity
  • Sensitivity
  • Precision
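As an illustration of how several of these metrics are computed from a confusion matrix, assuming hypothetical counts of true/false positives and negatives:

```python
# Metrics from a confusion matrix. The counts below are invented
# illustration values, not results from any real experiment.
tp, fp, tn, fn = 40, 10, 45, 5  # true/false positives and negatives

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # overall correctness
precision   = tp / (tp + fp)                   # correctness of positive calls
sensitivity = tp / (tp + fn)                   # also called recall
specificity = tn / (tn + fp)                   # correctness on negatives

print(accuracy, precision, sensitivity, specificity)
```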

KINDS OF DATA

There are four kinds of data on which we can do data mining:
  1. Database data
  2. Data warehouse data
  3. Transactional data
  4. Miscellaneous data

Issues in data mining

Mining Methodology
  • Mining various and new kinds of knowledge
  • Mining knowledge in multidimensional space => exploratory or multidimensional data mining
  • Data mining as an interdisciplinary effort: e.g. bug mining, NLP
  • Boosting the power of discovery in a networked environment
  • Handling uncertainty, noise or incompleteness of data
  • Pattern evaluation and pattern- or constraint-guided mining

User Interaction

  • Interactive mining
  • Incorporation of background knowledge
  • Ad hoc data mining and data mining query languages
  • Presentation and visualization of data mining results

Efficiency & Scalability

  • Efficiency and scalability of data mining algorithms
  • Parallel, distributed and incremental mining algorithms

Diversity of Data base types

  • Handling complex types of data: temporal, biological sequences, sensor data.
  • Mining dynamic, networked , and global data repositories

Data mining & Society :

  • Social impacts of data mining
  • Privacy preserving data mining
  • Invisible data mining

DATA CLEANING

Data cleaning is performed on the data before doing data mining. There are different methods to be followed in order to clean the data. In data cleaning we mainly handle missing values and noisy data.
HANDLING MISSING VALUES:
  1. Ignore the tuple
    If only a few values are missing, then we can ignore the entire tuple. This is an easy and lazy approach, but it is not treated as good practice.
  2. FILL MISSING VALUES MANUALLUY:
    The missing values are filled in manually. This is a time-consuming process: if the data is huge in size, it takes a lot of time.

  3. FILL MISSING VALUES WITH A GLOBAL CONSTANT: The missing values are filled with an appropriate global constant.
  4. FILL THE MISSING VALUES WITH A MEASURE OF THE ATTRIBUTE.
    This is a good method. Here the missing values are filled with the mean or median of the attribute values. For example, if we want to fill the marks attribute of a student relation, then the mean or median of the marks is calculated and used to fill the missing values.
  5. FILL THE MISSING VALUES WITH A MEASURE OF A SAMPLE OF THE ATTRIBUTE. Here, instead of taking all the attribute values, we consider a sample of the attribute values to calculate the mean or median, and that value is used to fill the missing values.
  6. FILL WITH THE MOST PROBABLE VALUE. This can be done by finding the most probable value of the attribute and using it to fill the missing values.
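Methods 4 and 6 above can be sketched with the standard Python statistics module; the marks and grades lists are invented for illustration:

```python
# Sketch of filling missing values (represented as None) with the
# mean of the present values, or with the most probable (most
# frequent) value. Data is invented illustration data.
from statistics import mean, mode

def fill_with_mean(values):
    present = [v for v in values if v is not None]
    m = mean(present)
    return [m if v is None else v for v in values]

def fill_with_mode(values):
    present = [v for v in values if v is not None]
    m = mode(present)  # most frequent value
    return [m if v is None else v for v in values]

marks = [70, None, 80, 90, None]
print(fill_with_mean(marks))   # None replaced by mean of 70, 80, 90
grades = ["A", "B", None, "A"]
print(fill_with_mode(grades))  # None replaced by most frequent grade
```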
HANDLING NOISY DATA:
    An abrupt change in data is treated as noise. Noise may occur due to external signals or unexpected errors. Noisy data must be handled properly in order to get efficient results. Noise can be handled by using the following methods.
    1. Binning
    2. Regression
    3. Outlier Analysis

    BINNING

    In this process the data is divided into buckets or “bins”. The data is kept in the bins, and then the values are smoothed by one of the following three methods.
    1. Smoothing by bin Mean
    2. Smoothing by bin median
    3. Smoothing by bin boundaries
    In the binning process, first the data is sorted and then divided into bins. Ex: data: 6,2,12,3,6,7,12,8,6. Sorted data: 2,3,6,6,6,7,8,12,12. Then the values are kept in the bins as follows:
    • bin 1 : 2,3,6
    • bin 2 : 6,6,7
    • bin 3 : 8,12,12
    SMOOTHING BY BIN MEAN:
    • Bin 1 : 2,3,6  mean = (2+3+6)/3 ≈ 3.67
    • Bin 2 : 6,6,7  mean = (6+6+7)/3 ≈ 6.33
    • Bin 3 : 8,12,12  mean = (8+12+12)/3 ≈ 10.67
    Then every value in a bin is replaced with the bin mean. The new values in the bins are as follows:
    • Bin 1 : 3.67,3.67,3.67
    • Bin 2 : 6.33,6.33,6.33
    • Bin 3 : 10.67,10.67,10.67
    SMOOTHING BY BIN MEDIAN
    In this method the median of each bin is calculated and used to smooth the bin values. Ex:
    • Bin 1 : 2,3,6
    • Bin 2 : 6,6,7
    • Bin 3 : 8,12,12
    The bin medians (3, 6 and 12) are then used to replace the values; after applying bin-median smoothing the bins are as follows:
    • Bin 1 : 3,3,3
    • Bin 2 : 6,6,6
    • Bin 3 : 12,12,12
    SMOOTHING BY BIN BOUNDARIES
    In this method the minimum and maximum values of each bin are taken as its boundaries, and each value is replaced with the nearest boundary value. Ex: Bin 1 (2,3,6) becomes 2,2,6; Bin 2 (6,6,7) stays 6,6,7; Bin 3 (8,12,12) stays 8,12,12.
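The three smoothing methods can be sketched in Python, using the same bins as above (2,3,6 / 6,6,7 / 8,12,12):

```python
# Sketch of the three binning smoothers. Each function returns the
# bin with all its values replaced by the smoothed value.

def smooth_by_mean(bin_values):
    m = sum(bin_values) / len(bin_values)
    return [round(m, 2)] * len(bin_values)

def smooth_by_median(bin_values):
    s = sorted(bin_values)
    return [s[len(s) // 2]] * len(bin_values)  # middle value (odd-sized bins)

def smooth_by_boundaries(bin_values):
    lo, hi = min(bin_values), max(bin_values)
    # Replace each value with whichever boundary is closer.
    return [lo if v - lo <= hi - v else hi for v in bin_values]

bins = [[2, 3, 6], [6, 6, 7], [8, 12, 12]]
print([smooth_by_mean(b) for b in bins])
print([smooth_by_median(b) for b in bins])
print([smooth_by_boundaries(b) for b in bins])
```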
    REGRESSION
    In regression the data is fitted to a function (for example a straight line), and each value is replaced with the value predicted by the fitted function, so that the data can be plotted as a smooth line.
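Smoothing by regression might be sketched as follows; the straight-line fit uses ordinary least squares, and the (x, y) points are invented for illustration:

```python
# Sketch of smoothing by (linear) regression: fit y = a*x + b by
# least squares, then replace each y with the fitted value.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Least-squares slope and intercept.
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # noisy observations
a, b = fit_line(xs, ys)
smoothed = [round(a * x + b, 2) for x in xs]
print(smoothed)  # noisy ys replaced by points on the fitted line
```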