Data cleaning can be done with the help of tools such as data scrubbing tools, data analysis tools, and Extraction, Transformation, Loading (ETL) tools. In data cleaning, the first step is finding data discrepancies. Data discrepancies are caused by human errors, optional values, or data decay.
The data discrepancies are identified with the help of metadata.
DATA INTEGRATION
Combining data is known as Data Integration or Data Merging. Data merging is done when the data is generated from different sources.
FREQUENT PATTERN
Frequent patterns are patterns that appear frequently in a data set. Here, a pattern means an itemset, a subsequence, or a substructure.
For example, in a supermarket transactional data set, some items are purchased frequently. Those frequently purchased items are treated as frequent patterns.
After finding the frequent patterns, we can find the associations among the products.
MINING FREQUENT PATTERNS, ASSOCIATIONS AND CORRELATIONS
Basic Concepts:
The relationships in the given data can be searched using Frequent Pattern Mining; the associations and correlations are found with the help of frequent pattern mining.
MARKET BASKET ANALYSIS
Finding the relations or correlations among business transactions can help in many business decision-making processes such as catalog design, customer purchasing behaviour analysis, discounts, and promotions. Market Basket Analysis mainly focuses on customer purchasing habits. With this analysis, retailers get insight into their customers; using Market Basket Analysis we can identify the items which are purchased together by the customers.
TERMINOLOGY:
Itemset:- A set of items is called an itemset.
Ex: A set of transactions is called a transactional itemset. An itemset that contains k items is called a k-itemset.
The occurrence frequency of an itemset is called its frequency, count, or support count.
The support count is also called the absolute support.
RELATIVE SUPPORT
The fraction of transactions that contain an itemset is called its relative support.
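As a minimal Python sketch of these two definitions (the transactions list and the support_count / relative_support helper names are illustrative, not from the text):

transactions = [
    {"milk", "bread", "eggs"},
    {"bread", "sugar"},
    {"milk", "bread", "butter"},
]

def support_count(itemset, transactions):
    # Number of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

def relative_support(itemset, transactions):
    # Fraction of transactions that contain the itemset.
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"milk", "bread"}, transactions))     # 2 (absolute support)
print(relative_support({"milk", "bread"}, transactions))  # 0.666... (relative support)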
PATTERN MINING APPLICATIONS
1) Supermarket Data:
Using pattern mining on supermarket data we can get the purchasing behaviour of customers. We can
find the items which are frequently purchased together.
2) Text Mining:
We can apply pattern mining on huge amounts of text to get the frequently appearing words, or the bags of words
which appear together.
3) Web Log Analysis: We can analyse the web logs of users.
Apriori Algorithm to Find Frequent Itemsets
T is the database containing a set of n transactions denoted by T1, T2, ..., Tn.
An itemset is a set of items. A k-itemset is an itemset that contains exactly k items.
TRANSACTION DATA SET

TID    Items
T1     {milk, bread, eggs}
T2     {bread, sugar}
T4     {milk, bread, butter}
T5     {milk, bread, butter}
T6     {milk, bread, butter}
T7     {milk, sugar}
T8     {milk, sugar}
T9     {sugar, butter, bread, eggs}
T10    {milk, sugar, butter}
T11    {milk, bread, butter, eggs}
Now we calculate the support of each item in the data set. This table is called the candidate table
and is denoted C1, as follows.
Item      Support
milk      8
bread     7
sugar     5
butter    7
eggs      4
Consider the minimum support count to be 3. Any item whose count is less than 3 should be removed from the
above candidate table.
In the above table all the items satisfy the minimum support count, hence the level-1
itemset table L1 is as follows.
Item      Support
milk      8
bread     7
sugar     5
butter    7
eggs      4
Now we apply the join step of the Apriori algorithm. In this step each item
is joined with every other item of the level-1 itemset, the support count of each resulting 2-itemset is noted, and it is written along with the itemset.
The table looks as follows.
Itemset          Support
milk, bread      5
milk, sugar      4
milk, butter     4
milk, eggs       2
bread, sugar     2
bread, butter    5
bread, eggs      4
sugar, butter    2
sugar, eggs      1
butter, eggs     3
This table is denoted C2. We derive the 2-itemset table L2 by removing the
itemsets which do not satisfy the minimum support count, which is 3 in this example.
After removing those itemsets, the L2 table looks as follows.
Itemset          Support
milk, bread      5
milk, sugar      4
milk, butter     4
bread, butter    5
bread, eggs      4
butter, eggs     3
Now we join each itemset with all other itemsets of the above table. The resultant table C3
contains the 3-itemsets and their support counts, and looks as follows.
Itemset                  Support
milk, bread, sugar       1
milk, bread, butter      3
milk, bread, eggs        2
milk, butter, sugar      0
milk, butter, eggs       1
bread, butter, eggs      3
From the above table we remove the itemsets which do not satisfy the minimum support count.
The resultant table is known as the L3 table and looks as follows.
Itemset                  Support
milk, bread, butter      3
bread, butter, eggs      3
Further, we can combine the itemsets of L3 and find the support counts. The resulting table is C4, and
each of its itemsets contains 4 items. The table is as follows.
Itemset                       Support
milk, bread, butter, eggs     1
The itemsets of this table do not satisfy the minimum support count, hence the previous table L3 is treated
as final, and the itemsets present in L3 are considered the frequent patterns.
The frequent patterns are therefore:
milk, bread, butter
bread, butter, eggs
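The walkthrough above can be reproduced with a short, simplified Apriori sketch in Python. The join and prune steps below follow the level-wise procedure just described; since TID T3 is absent from the transaction table, a few counts this sketch prints may differ slightly from the hand-computed tables.

transactions = [
    {"milk", "bread", "eggs"},             # T1
    {"bread", "sugar"},                    # T2
    {"milk", "bread", "butter"},           # T4
    {"milk", "bread", "butter"},           # T5
    {"milk", "bread", "butter"},           # T6
    {"milk", "sugar"},                     # T7
    {"milk", "sugar"},                     # T8
    {"sugar", "butter", "bread", "eggs"},  # T9
    {"milk", "sugar", "butter"},           # T10
    {"milk", "bread", "butter", "eggs"},   # T11
]
MIN_SUPPORT = 3

def support(itemset):
    # Count the transactions that contain the itemset.
    return sum(1 for t in transactions if itemset <= t)

def apriori():
    # C1 -> L1: single items meeting the minimum support count.
    items = {item for t in transactions for item in t}
    level = [frozenset([i]) for i in items
             if support(frozenset([i])) >= MIN_SUPPORT]
    frequent = list(level)
    k = 2
    while level:
        # Join step: union pairs of (k-1)-itemsets into k-itemset candidates (Ck).
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep only candidates meeting the minimum support (Lk).
        level = [c for c in candidates if support(c) >= MIN_SUPPORT]
        frequent.extend(level)
        k += 1
    return frequent

for itemset in apriori():
    print(sorted(itemset), support(itemset))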
CALCULATING CONFIDENCE
Confidence:
Confidence gives the percentage of transactions containing product A that also contain product B. The confidence of the rule A -> B is

Confidence(A -> B) = (support count of A ∪ B) / (support count of A)
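For example, from the tables above, the support count of {milk, bread} is 5 and that of {milk, bread, butter} is 3, so Confidence({milk, bread} -> {butter}) = 3/5 = 60%. A minimal sketch of this calculation (the confidence helper name is illustrative; transactions is a list of item sets as in the earlier sketch):

def confidence(A, B, transactions):
    # Confidence(A -> B) = support_count(A ∪ B) / support_count(A).
    count_AB = sum(1 for t in transactions if (A | B) <= t)
    count_A = sum(1 for t in transactions if A <= t)
    return count_AB / count_A

# Using the worked example: support({milk, bread}) = 5 and
# support({milk, bread, butter}) = 3, so the confidence is 3/5 = 0.6.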
Frequent Pattern Growth Algorithm
The two primary drawbacks of the Apriori Algorithm are:
1. At each step, candidate sets have to be built.
2. To build the candidate sets, the algorithm has to repeatedly scan the database.
These two properties inevitably make the algorithm slower. To overcome these redundant steps, a new
association-rule mining algorithm was developed, named the Frequent Pattern Growth Algorithm. It overcomes
the disadvantages of the Apriori algorithm by storing all the transactions in a trie (prefix-tree) data structure.
Consider a hypothetical dataset of five transactions, with each letter representing an item.
The frequency of each individual item is computed.
Let the minimum support be 3. A Frequent Pattern set is built which will contain all the elements
whose frequency is greater than or equal to the minimum support. These elements are stored in descending
order of their respective frequencies. After inserting the relevant items, the set L is as follows:
L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
Now, for each transaction, the respective Ordered-Item set is built. This is done by iterating over
the Frequent Pattern set and checking whether the current item is contained in the transaction in question.
If it is, the item is inserted into the Ordered-Item set for the current transaction.
The following table is built for all the transactions:

Transaction    Ordered-Item Set
T1             {K, E, M, O, Y}
T2             {K, E, O, Y}
T3             {K, E, M}
T4             {K, M, Y}
T5             {K, E, O}
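A small sketch of this step (the ordered_item_set helper and L_ORDER list are illustrative names):

L_ORDER = ["K", "E", "M", "O", "Y"]  # descending-frequency order from L

def ordered_item_set(transaction):
    # Keep only the frequent items, in the fixed descending-frequency order.
    return [item for item in L_ORDER if item in transaction]

print(ordered_item_set({"Y", "M", "K"}))  # ['K', 'M', 'Y'], as for T4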
Now, all the Ordered-Item sets are inserted into a Trie Data Structure.
Inserting the set {K, E, M, O, Y}:
Here, all the items are simply linked one after the other in the order of occurrence in the set, and
the support count for each item is initialized as 1.
Inserting the set {K, E, O, Y}:
Up to the insertion of the elements K and E, the support counts are simply increased by 1. On
inserting O we see that there is no direct link between E and O, so a new node for the item O
is initialized with support count 1 and the item E is linked to this new node. On inserting Y, we likewise
initialize a new node for the item Y with support count 1 and link the new node of O to the new node of Y.
Inserting the set {K, E, M}:
Here simply the support count of each element is increased by 1.
Inserting the set {K, M, Y}:
As in the insertion of {K, E, O, Y}, first the support count of K is increased, then new nodes for M and Y
are initialized and linked accordingly.
Inserting the set {K, E, O}:
Here simply the support counts of the respective elements are increased. Note that the
support count of the new node of item O is increased.
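The insertion procedure just described can be sketched as follows; the FPNode class and function names are illustrative, and the header-table links used later for mining are omitted for brevity:

class FPNode:
    # One node of the frequent-pattern (prefix) tree.
    def __init__(self, item, parent):
        self.item = item
        self.count = 1
        self.parent = parent
        self.children = {}  # item -> FPNode

def insert(node, ordered_items):
    # Insert one Ordered-Item set, sharing any existing prefix.
    for item in ordered_items:
        if item in node.children:
            # Prefix already present: just increase the support count.
            node.children[item].count += 1
        else:
            # No direct link: create a new node with support count 1.
            node.children[item] = FPNode(item, node)
        node = node.children[item]

def dump(node, depth=0):
    # Print the tree, one "item:count" per line, indented by depth.
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        dump(child, depth + 1)

root = FPNode(None, None)
root.count = 0
for s in [["K", "E", "M", "O", "Y"], ["K", "E", "O", "Y"],
          ["K", "E", "M"], ["K", "M", "Y"], ["K", "E", "O"]]:
    insert(root, s)
dump(root)  # K:5, E:4, M:2, O:1, Y:1, O:2, Y:1, M:1, Y:1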
Now, for each item, the Conditional Pattern Base is computed: the path labels of all
the paths which lead to any node of the given item in the frequent-pattern tree. Note that
the items in the table below are arranged in the ascending order of their frequencies.

Item    Conditional Pattern Base
Y       {K, E, M, O : 1}, {K, E, O : 1}, {K, M : 1}
O       {K, E, M : 1}, {K, E : 2}
M       {K, E : 2}, {K : 1}
E       {K : 4}
Now, for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the
set of elements that is common to all the paths in the Conditional Pattern Base of that item,
and calculating its support count by summing the support counts of all the paths in the Conditional
Pattern Base.

Item    Conditional Frequent Pattern Tree
Y       {K : 3}
O       {K, E : 3}
M       {K : 3}
E       {K : 4}
From the Conditional Frequent Pattern Tree, the frequent pattern rules are generated by pairing the
items of the Conditional Frequent Pattern Tree set with the corresponding item, as given in the
table below.

Item    Frequent Patterns Generated
Y       {K, Y : 3}
O       {K, O : 3}, {E, O : 3}, {K, E, O : 3}
M       {K, M : 3}
E       {K, E : 4}
For each row, two types of association rules can be inferred. For example, for the first row, which
contains the pattern {K, Y : 3}, the rules K -> Y and Y -> K can be inferred. To determine the valid rule, the
confidence of both rules is calculated, and the one with confidence greater than or equal to the
minimum confidence value is retained.
IMPROVING THE EFFICIENCY OF APRIORI-BASED MINING
Several variations of the Apriori algorithm have been proposed to improve the efficiency
of the original algorithm. They are as follows −
The hash-based technique (hashing itemsets into corresponding buckets) − A hash-based technique can be used to
reduce the size of the candidate k-itemsets, Ck, for k > 1. For example, when scanning each transaction in the
database to generate the frequent 1-itemsets, L1, from the candidate 1-itemsets in C1, we can generate all of the 2-itemsets for
each transaction, hash (i.e., map) them into the different buckets of a hash table structure, and increase the corresponding
bucket counts. A 2-itemset whose bucket count is below the minimum support count cannot be frequent, so it can be removed from the candidate set C2.
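A hedged sketch of the bucket-counting idea (the hash function, table size, and helper name are arbitrary illustrative choices):

from itertools import combinations

NUM_BUCKETS = 7  # arbitrary table size, chosen for illustration

def bucket_counts(transactions):
    # While scanning transactions for 1-itemset counts, also hash
    # every 2-itemset of each transaction into a bucket and count it.
    counts = [0] * NUM_BUCKETS
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            counts[hash(pair) % NUM_BUCKETS] += 1
    return counts

# Any 2-itemset whose bucket count is below the minimum support count
# cannot be frequent, so it can be dropped from C2 before the next scan.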
Transaction reduction − A transaction that does not contain any frequent k-itemsets cannot contain any frequent
(k + 1)-itemsets. Therefore, such a transaction can be marked or removed from further consideration, because subsequent scans
of the database for j-itemsets, where j > k, will not need it.
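A minimal sketch of this pruning rule (the helper name is illustrative; itemsets are assumed to be sets, as in the earlier sketches):

def reduce_transactions(transactions, frequent_k_itemsets):
    # Keep only transactions containing at least one frequent k-itemset;
    # the rest cannot contribute any frequent (k+1)-itemset.
    return [t for t in transactions
            if any(s <= t for s in frequent_k_itemsets)]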
Partitioning −
A partitioning technique can be used that requires just two database scans to mine the frequent itemsets. It consists of two phases.
In Phase I, the algorithm subdivides the transactions of D into n non-overlapping partitions. If the minimum
support threshold for transactions in D is min_sup, then the minimum support count for a partition
is min_sup × the number of transactions in that partition.
For each partition, all frequent itemsets within the partition are found. These are referred to as local
frequent itemsets. The procedure employs a special data structure that, for each itemset, records the TIDs
of the transactions containing the items in the itemset. This allows it to find all of the local frequent
k-itemsets, for k = 1, 2, ..., in just one scan of the database.
A local frequent itemset may or may not be frequent with respect to the entire database, D. However, any itemset that is
potentially frequent with respect to D must occur as a frequent itemset in at least one of the partitions. Therefore, all
local frequent itemsets are candidate itemsets with respect to D. The collection of frequent itemsets from all partitions
forms the global candidate itemsets for D. In Phase II, a second scan of D is conducted in which the
actual support of each candidate is assessed to determine the global frequent itemsets.
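A sketch of this two-phase structure, assuming a relative threshold min_sup and a mine_local(partition, min_count) helper that returns the itemsets (as frozensets) frequent within a partition; any in-memory miner, such as the Apriori sketch above, could play that role:

def partition_mine(transactions, n_partitions, min_sup, mine_local):
    # Split D into n non-overlapping partitions.
    size = max(1, len(transactions) // n_partitions)
    parts = [transactions[i:i + size]
             for i in range(0, len(transactions), size)]
    # Phase I: local frequent itemsets form the global candidate set,
    # each partition mined with a proportionally scaled support count.
    candidates = set()
    for part in parts:
        candidates |= set(mine_local(part, min_sup * len(part)))
    # Phase II: one more full scan counts the actual global support
    # of every candidate to find the globally frequent itemsets.
    min_count = min_sup * len(transactions)
    return [c for c in candidates
            if sum(1 for t in transactions if c <= t) >= min_count]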
Sampling − The fundamental idea of the sampling approach is to select a random sample S of the given data D,
and then search for frequent itemsets in S rather than in D. In this way, we trade off some degree of accuracy
against efficiency. The sample size of S is chosen such that the search for frequent itemsets in S can be completed in main
memory, and therefore only one scan of the transactions in S is needed overall.
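A minimal sketch of the sampling idea (the helper names are illustrative; mine_local is the same assumed in-memory miner as above):

import random

def sample_mine(transactions, sample_frac, min_sup, mine_local):
    # Mine a random sample S in main memory instead of the whole
    # database D, trading some accuracy for a single scan of S.
    sample = random.sample(transactions, int(len(transactions) * sample_frac))
    # A slightly lowered threshold on S is often used to reduce the
    # chance of missing itemsets that are frequent in D but not in S.
    return mine_local(sample, min_sup * len(sample))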