(i) Data mining
Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
Data mining is primarily used today by companies with a strong consumer focus - retail, financial, communication, and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. It also enables them to determine the impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detailed transactional data.
How data mining works
Generally, any of four types of
relationships are sought:
· Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by offering daily specials.
· Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
· Associations: Data can be mined to identify associations among items. The classic beer-and-diapers example is an instance of association mining.
· Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
Data mining consists of five major
elements:
· Extract, transform, and load transaction data onto the data warehouse system.
· Store and manage the data in a multidimensional database system.
· Provide data access to business analysts and information technology professionals.
· Analyze the data with application software.
· Present the data in a useful format, such as a graph or table.
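As a rough illustration, the sketch below walks these elements through a small Python example: a hypothetical point-of-sale export is extracted, transformed into a daily summary, loaded into a SQLite-based warehouse table, and queried back for presentation. The file, table, and column names are illustrative assumptions, not part of any particular product.

import sqlite3
import pandas as pd

# Extract: read raw transaction records exported from an operational system.
transactions = pd.read_csv("pos_transactions.csv")   # hypothetical export

# Transform: normalize the date and derive a daily summary for analysis.
transactions["sale_day"] = pd.to_datetime(transactions["sale_date"]).dt.strftime("%Y-%m-%d")
daily_sales = transactions.groupby(["store_id", "sale_day"], as_index=False)["amount"].sum()
daily_sales = daily_sales.rename(columns={"amount": "total_amount"})

# Load: store the result in the warehouse so analysts and IT staff can access it.
with sqlite3.connect("warehouse.db") as conn:
    daily_sales.to_sql("daily_sales", conn, if_exists="replace", index=False)
    # Present: a simple tabular view of the loaded data.
    print(pd.read_sql("SELECT * FROM daily_sales LIMIT 5", conn))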
(ii) Data warehousing
Data
warehousing is defined as a process of centralized data management and
retrieval. Data warehousing, like data mining, is a relatively new term
although the concept itself has been around for years. Data warehousing
represents an ideal vision of maintaining a central repository of all
organizational data. Centralization of data is needed to maximize user access
and analysis. Dramatic technological advances are making this vision a reality
for many companies. And, equally dramatic advances in data analysis software
are allowing users to access this data freely. The data analysis software is
what supports data mining.
(iii) Data marts
A data
mart is the access layer of the data warehouse environment that is used to get data out to the
users. The data mart is a subset of the data warehouse that is usually oriented
to a specific business line or team. Data marts are small slices of the data
warehouse.
It is a
simple form of a data warehouse that is focused on a single subject (or
functional area), such as Sales, Finance, or Marketing. Data marts are often
built and controlled by a single department within an organization. Given their
single-subject focus, data marts usually draw data from only a few sources. The
sources could be internal operational systems, a central data warehouse, or
external data.
Simply
stated, the major steps in implementing a data mart are to design the schema,
construct the physical storage, populate the data mart with data from source
systems, access it to make informed decisions, and manage it over time.
This gives you the usual advantages of centralization.
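A minimal sketch of these steps, assuming the SQLite warehouse table from the earlier example (the table and column names remain hypothetical): the schema is designed, the mart is populated from the warehouse source, and the mart is queried to support a decision.

import sqlite3

with sqlite3.connect("warehouse.db") as conn:
    cur = conn.cursor()

    # Design the schema: a small, single-subject (Sales) table.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS sales_mart (
            store_id     INTEGER,
            sale_day     TEXT,
            total_amount REAL
        )
    """)

    # Populate the mart from the warehouse source (only the Sales slice).
    cur.execute("""
        INSERT INTO sales_mart (store_id, sale_day, total_amount)
        SELECT store_id, sale_day, total_amount FROM daily_sales
    """)
    conn.commit()

    # Access it to support decisions, e.g. the top-selling stores.
    for row in cur.execute("""
        SELECT store_id, SUM(total_amount) AS revenue
        FROM sales_mart GROUP BY store_id ORDER BY revenue DESC LIMIT 5
    """):
        print(row)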
b) DATA MINING TECHNIQUES AND THEIR APPLICATION
The most
commonly used techniques include artificial neural networks, decision trees,
and the nearest-neighbor method. Each of these techniques analyzes data in
different ways:
- Artificial neural networks.
These are non-linear, predictive models that learn through training. Although they are powerful predictive modeling techniques, some of that power comes at the expense of ease of use and deployment. One area where auditors can easily use them is when reviewing records to identify fraud and fraud-like actions.
Advantages:
Because of their complexity, they are better
employed in situations where they can be used and reused, such as reviewing
credit card transactions every month to check for anomalies.
Disadvantages:
Neural networks, which are difficult to implement, require all input and resultant output to be expressed numerically, thus needing some sort of interpretation depending on the nature of the data-mining exercise.
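As a rough illustration of these points, the following Python sketch trains a small neural network (scikit-learn's MLPClassifier) on synthetic, numerically encoded transaction data and reuses it to flag a new batch for review; the feature choices and figures are invented for the example.

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Features per transaction (all numeric): amount, hour of day, transactions in last 24h.
normal = np.column_stack([rng.normal(50, 15, 500), rng.integers(8, 22, 500), rng.integers(1, 5, 500)])
fraud  = np.column_stack([rng.normal(900, 200, 25), rng.integers(0, 6, 25), rng.integers(8, 20, 25)])
X = np.vstack([normal, fraud])
y = np.array([0] * 500 + [1] * 25)   # 1 = previously confirmed fraud

scaler = StandardScaler().fit(X)
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(scaler.transform(X), y)

# Reuse the trained model on this month's batch and list cases to review.
new_batch = np.array([[45.0, 14, 2], [1200.0, 3, 12]])
print(model.predict(scaler.transform(new_batch)))   # e.g. [0 1]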
- Decision trees.
These are tree-shaped structures that represent sets of decisions. These decisions generate rules, which are then used to classify data. Decision trees are the favored technique for building understandable models. Auditors can use them to assess, for example, whether the organization is using an appropriate, cost-effective marketing strategy that is based on the assigned value of the customer, such as profit.
Decision trees have several advantages:
· They are simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
· They have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes.
· Possible scenarios can be added.
· Worst, best, and expected values can be determined for different scenarios.
Disadvantages of decision trees:
· For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of attributes with more levels.
· Calculations can get very complex, particularly if many values are uncertain and/or if many outcomes are linked.
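To make the "decisions generate rules" point concrete, here is a small scikit-learn sketch on invented customer-value data; the column names are hypothetical, and the fitted tree is printed as if/then rules.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "annual_spend":  [120, 80, 950, 40, 700, 60, 880, 30],
    "visits_per_mo": [1,   1,  6,   0,  4,   1,  5,   0],
    "high_value":    [0,   0,  1,   0,  1,   0,  1,   0],   # assigned customer value
})
X = data[["annual_spend", "visits_per_mo"]]
y = data["high_value"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The fitted tree can be read as a set of classification rules.
print(export_text(tree, feature_names=["annual_spend", "visits_per_mo"]))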
- The nearest-neighbor method.
This classifies dataset records based on similar data in a historical dataset. Auditors can use this approach to define a document that is interesting to them and ask the system to search for similar items.
Advantages:
The nearest-neighbor method relies more on linking similar items and, therefore, works better for extrapolation than for predictive enquiries.
Disadvantages:
They
(like the neural networks) do not simplify the distribution of objects in
parameter space to a comprehensible set of parameters. Instead, the training
set is retained in its entirety as a description of the object distribution.
The
method is also rather slow if the training set has many examples.
The most
serious shortcoming of nearest neighbor methods is that they are very sensitive
to the presence of irrelevant parameters.
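A minimal sketch of the idea, using scikit-learn's NearestNeighbors on synthetic invoice-like records: one record of interest is supplied and the most similar historical records are returned.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Historical records: [amount, days_to_payment, line_items]
history = np.array([
    [100,  30, 2],
    [105,  28, 2],
    [980,   5, 1],
    [110,  31, 3],
    [995,   4, 1],
])

nn = NearestNeighbors(n_neighbors=2).fit(history)

# A record the auditor flags as interesting; ask for the most similar cases.
query = np.array([[990, 6, 1]])
distances, indices = nn.kneighbors(query)
print(indices[0])   # positions of the closest historical records: [4 2]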
· Classification.
This is the most commonly used technique for predicting a specific outcome such as response / no-response, high / medium / low-value customer, or likely to buy / not buy.
A classification task begins with a data set in which the class assignments
are known. For instance, a classification model that predicts credit risk could
be developed based on observed data for many loan applicants over a period of
time. In addition to the historical credit rating, the data might track
employment history, home ownership or rental, years of residence, number and
type of investments, and so on. Credit rating would be the target, the other
attributes would be the predictors, and the data for each customer would
constitute a case.
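As a sketch of that setup (synthetic data, hypothetical column names), the case/target/predictor roles might look like this with a simple logistic regression classifier:

import pandas as pd
from sklearn.linear_model import LogisticRegression

applicants = pd.DataFrame({
    "years_employed":   [10, 1, 7, 0, 15, 2],
    "owns_home":        [1,  0, 1, 0, 1,  0],
    "years_at_address": [8,  1, 5, 1, 12, 2],
    "credit_rating":    [1,  0, 1, 0, 1,  0],   # target: 1 = good credit risk
})

predictors = applicants[["years_employed", "owns_home", "years_at_address"]]
target = applicants["credit_rating"]

model = LogisticRegression().fit(predictors, target)

# Score a new applicant (a new case) with known predictors but an unknown target.
new_case = pd.DataFrame([[3, 1, 2]], columns=predictors.columns)
print(model.predict(new_case))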
Attribute importance ranks attributes according to the strength of their relationship with the target attribute. Use cases include finding the factors most associated with customers who respond to an offer, or the factors most associated with healthy patients.
· Anomaly detection.
The goal of anomaly detection is to identify cases that are unusual within data that is seemingly homogeneous. Anomaly detection is an important tool for detecting fraud, network intrusion, and other rare events that may have great significance but are hard to find.
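One common way to do this is an isolation forest; the sketch below uses scikit-learn's IsolationForest on synthetic data in which a few records deviate strongly from the rest.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Mostly ordinary records, plus a couple that deviate strongly.
normal = rng.normal(loc=[50, 10], scale=[5, 2], size=(300, 2))
rare = np.array([[50, 80], [500, 10]])
X = np.vstack([normal, rare])

detector = IsolationForest(contamination=0.01, random_state=0).fit(X)

labels = detector.predict(X)        # -1 marks cases flagged as anomalous
print(np.where(labels == -1)[0])    # indices of the unusual cases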
· Clustering.
Clustering analysis finds clusters of data objects that are similar in some sense to one another. The members of a cluster are more like each other than they are like members of other clusters. The goal of clustering analysis is to find high-quality clusters such that the inter-cluster similarity is low and the intra-cluster similarity is high.
Clustering,
like classification, is used to segment the data. Unlike classification,
clustering models segment data into groups that were not previously defined.
Classification models segment data by assigning it to
previously-defined classes, which are specified in a target. Clustering models
do not use a target.
Strength:
Clustering
is useful for exploring data. If there are many cases and
no obvious groupings, clustering algorithms can be used to find natural
groupings. Clustering can also serve as a useful data-preprocessing
step to identify homogeneous groups on which to build supervised models.
It is
easier to cluster objects based on their similarities.
Weakness:
In the event of many similarities, clustering may be tedious.
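A minimal clustering sketch with scikit-learn's KMeans on synthetic customer data: no target is supplied, and the algorithm proposes the segments itself.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two loose, unlabeled groups of customers: [monthly_spend, visits_per_month]
X = np.vstack([
    rng.normal([30, 2], [5, 1], size=(100, 2)),
    rng.normal([200, 12], [20, 3], size=(100, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_[:5])          # cluster assignment for the first few cases
print(kmeans.cluster_centers_)     # the "profile" of each discovered segment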
· Association.
Association is a data mining function that discovers the probability of the co-occurrence of items in a collection. The relationships between co-occurring items are expressed as association rules.
Association
rules are often used to analyze sales transactions.
Unlike
other data mining functions, association is transaction-based. In transaction processing, a case consists of a transaction
such as a market basket or Web session. The collection of items in the transaction
is an attribute of the transaction. Other attributes might be the date, time,
location, or user ID associated with the transaction.
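A hand-rolled sketch of the idea on a few synthetic market baskets: the support and confidence of a single candidate rule (diapers implies beer) are computed directly from the transactions.

baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"bread", "milk"},
    {"beer", "chips"},
    {"diapers", "milk", "beer"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule = ({"diapers"}, {"beer"})
print("support:", support(rule[0] | rule[1]))   # 3/5 = 0.6
print("confidence:", confidence(*rule))         # 3/3 = 1.0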
· Attribute importance.
Attribute importance is a supervised function that ranks attributes according to their significance in predicting a target.
Finding
the most significant predictors is the goal of some data mining projects. For
example, a model might seek to find the principal characteristics of clients
who pose a high credit risk.
Attribute
importance is also useful as a preprocessing step in classification modeling,
especially for models that use Naive Bayes or Support Vector Machine. The
Decision Tree algorithm includes components that rank attributes as part of the
model build.
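As a sketch of attribute importance used as a preprocessing step, the example below ranks synthetic, hypothetically named predictors by estimated mutual information with the target (one of several reasonable ranking measures):

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

clients = pd.DataFrame({
    "years_employed": [10, 1, 7, 0, 15, 2, 9, 1],
    "owns_home":      [1,  0, 1, 0, 1,  0, 1, 0],
    "favorite_color": [3,  1, 2, 3, 1,  2, 3, 1],   # likely irrelevant
    "high_risk":      [0,  1, 0, 1, 0,  1, 0, 1],   # target
})
X = clients.drop(columns="high_risk")
y = clients["high_risk"]

scores = mutual_info_classif(X, y, random_state=0)
ranking = pd.Series(scores, index=X.columns).sort_values(ascending=False)
print(ranking)   # higher scores suggest stronger predictors of the target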
· Regression.
Regression is a data mining function that predicts a
number. Profit, sales, mortgage rates, house values, square footage,
temperature, or distance could all be predicted using regression techniques.
For example, a regression model could be used to predict the value of a house
based on location, number of rooms, lot size, and other factors.
A regression task
begins with a data set in which the target values are known. For example, a
regression model that predicts house values could be developed based on
observed data for many houses over a period of time. In addition to the value,
the data might track the age of the house, square footage, number of rooms,
taxes, school district, proximity to shopping centers, and so on. House value
would be the target, the other attributes would be the predictors, and the data
for each house would constitute a case.
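A minimal regression sketch on synthetic house data: the value is the numeric target, the other columns are predictors, and each row is a case.

import pandas as pd
from sklearn.linear_model import LinearRegression

houses = pd.DataFrame({
    "rooms":       [3,    4,    2,   5,    4,    3],
    "square_feet": [1400, 1800, 900, 2400, 2000, 1500],
    "age_years":   [30,   10,   50,  5,    15,   25],
    "value":       [210000, 310000, 140000, 420000, 350000, 240000],  # target
})

predictors = houses[["rooms", "square_feet", "age_years"]]
target = houses["value"]

model = LinearRegression().fit(predictors, target)

# Predict the value of a new house (a new case) from its attributes.
new_house = pd.DataFrame([[4, 1900, 12]], columns=predictors.columns)
print(round(model.predict(new_house)[0]))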