Skip to main content

Data Mining


(i)                 Data mining

Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cuts costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Data mining is primarily used today by companies with a strong consumer focus - retail, financial, communication, and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. And, it enables them to determine the impact on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detail transactional data.


How data mining works

Generally, any of four types of relationships are sought:

ü  Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.

ü  Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.

ü  Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.

ü  Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

Data mining consists of five major elements:

ü  Extract, transform, and load transaction data onto the data warehouse system.

ü  Store and manage the data in a multidimensional database system.

ü  Provide data access to business analysts and information technology professionals.

ü  Analyze the data by application software.

ü  Present the data in a useful format, such as a graph or table.


 (ii)  Data warehousing

Data warehousing is defined as a process of centralized data management and retrieval. Data warehousing, like data mining, is a relatively new term although the concept itself has been around for years. Data warehousing represents an ideal vision of maintaining a central repository of all organizational data. Centralization of data is needed to maximize user access and analysis. Dramatic technological advances are making this vision a reality for many companies. And, equally dramatic advances in data analysis software are allowing users to access this data freely. The data analysis software is what supports data mining.

(iii)Data marts

A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team. Data marts are small slices of the data warehouse.

It is a simple form of a data warehouse that is focused on a single subject (or functional area), such as Sales, Finance, or Marketing. Data marts are often built and controlled by a single department within an organization. Given their single-subject focus, data marts usually draw data from only a few sources. The sources could be internal operational systems, a central data warehouse, or external data.

Simply stated, the major steps in implementing a data mart are to design the schema, construct the physical storage, populate the data mart with data from source systems, access it to make informed decisions, and manage it over time.

This gives you the usual advantages of centralization



b)DATA MINING TECHNIQUES AND THEIR APPLICATION

The most commonly used techniques include artificial neural networks, decision trees, and the nearest-neighbor method. Each of these techniques analyzes data in different ways:

  • Artificial neural network.s

Are non-linear, predictive models that learn through training. Although they are powerful predictive modeling techniques, some of the power comes at the expense of ease of use and deployment. One area where auditors can easily use them is when reviewing records to identify fraud and fraud-like actions.

Advantages:

 Because of their complexity, they are better employed in situations where they can be used and reused, such as reviewing credit card transactions every month to check for anomalies.

                Disadvantages:

Neural networks, which are difficult to implement, require all input and resultant output to be expressed numerically, thus needing some sort of interpretation depending on the nature of the data-mining exercise


  • Decision trees.

 Are tree-shaped structures that represent decision sets. These decisions generate rules, which then are used to classify data. Decision trees are the favored technique for building understandable models. Auditors can use them to assess, for example, whether the organization is using an appropriate cost-effective marketing strategy that is based on the assigned value of the customer, such as profit.


Decision trees have several advantages

ü  Are simple to understand and interpret. People are able to understand decision tree models after a brief explanation.

ü  Have value even with little hard data. Important insights can be generated based on experts describing a situation (its alternatives, probabilities, and costs) and their preferences for outcomes.

ü  Possible scenarios can be added

ü  Worst, best and expected values can be determined for different scenarios

ü  Use a white box model. If a given result is provided by a model.

Disadvantages of decision trees:

ü  For data including categorical variables with different number of levels, information gain in decision trees are biased in favor of those attributes with more levels.

ü  Calculations can get very complex particularly if many values are uncertain and/or if many outcomes are linked.



  • The nearest-neighbor method

Classifies dataset records based on similar data in a historical dataset. Auditors can use this approach to define a document that is interesting to them and ask the system to search for similar items.

Advantages


The nearest-neighbor method relies more on linking similar items and, therefore, works better for extrapolation rather than predictive enquiries.


Disadvantages

They (like the neural networks) do not simplify the distribution of objects in parameter space to a comprehensible set of parameters. Instead, the training set is retained in its entirety as a description of the object distribution.

The method is also rather slow if the training set has many examples.

The most serious shortcoming of nearest neighbor methods is that they are very sensitive to the presence of irrelevant parameters.

·         Classification


Most commonly used technique for predicting a specific outcome such as response / no-response, high / medium / low-value customer, likely to buy / not buy.

A classification task begins with a data set in which the class assignments are known. For instance, a classification model that predicts credit risk could be developed based on observed data for many loan applicants over a period of time. In addition to the historical credit rating, the data might track employment history, home ownership or rental, years of residence, number and type of investments, and so on. Credit rating would be the target, the other attributes would be the predictors, and the data for each customer would constitute a case.





·         Attribute Importance


Ranks attributes according to strength of relationship with target attribute. Use cases include finding factors most associated with customers who respond to an offer, factors most associated with healthy patients.

·         Anomaly Detection


The goal of anomaly detection is to identify cases that are unusual within data that is seemingly homogeneous. Anomaly detection is an important tool for detecting fraud, network intrusion, and other rare events that may have great significance but are hard to find.


·         Clustering.

Clustering analysis finds clusters of data objects that are similar in some sense to one another. The members of a cluster are more like each other than they are like members of other clusters. The goal of clustering analysis is to find high-quality clusters such that the inter-cluster similarity is low and the intra-cluster similarity is high.

Clustering, like classification, is used to segment the data. Unlike classification, clustering models segment data into groups that were not previously defined. Classification models segment data by assigning it to previously-defined classes, which are specified in a target. Clustering models do not use a target.

Strength:

Clustering is useful for exploring data. If there are many cases and no obvious groupings, clustering algorithms can be used to find natural groupings. Clustering can also serve as a useful data-preprocessing step to identify homogeneous groups on which to build supervised models.

It is easier to cluster objects based on their similarities.

Weakness: in the event of many similarities clustering may be tedious

·         Association

Association is a data mining function that discovers the probability of the co-occurrence of items in a collection. The relationships between co-occurring items are expressed as association rules.

Association rules are often used to analyze sales transactions.

Unlike other data mining functions, association is transaction-based. In transaction processing, a case consists of a transaction such as a market basket or Web session. The collection of items in the transaction is an attribute of the transaction. Other attributes might be the date, time, location, or user ID associated with the transaction.

·         Feature Selection and Extraction

Attribute importance is a supervised function that ranks attributes according to their significance in predicting a target.

Finding the most significant predictors is the goal of some data mining projects. For example, a model might seek to find the principal characteristics of clients who pose a high credit risk.

Attribute importance is also useful as a preprocessing step in classification modeling, especially for models that use Naive Bayes or Support Vector Machine. The Decision Tree algorithm includes components that rank attributes as part of the model build.

·         Regression.


Regression is a data mining function that predicts a number. Profit, sales, mortgage rates, house values, square footage, temperature, or distance could all be predicted using regression techniques. For example, a regression model could be used to predict the value of a house based on location, number of rooms, lot size, and other factors.


A regression task begins with a data set in which the target values are known. For example, a regression model that predicts house values could be developed based on observed data for many houses over a period of time. In addition to the value, the data might track the age of the house, square footage, number of rooms, taxes, school district, proximity to shopping centers, and so on. House value would be the target, the other attributes would be the predictors, and the data for each house would constitute a case.

Comments

Popular posts from this blog

Ten output devices, advantages and disadvantages, inter-site back-up mechanism, Expert systems for medical diagnosis,information systems security

(i)Printer Printer enables us to produce information output on paper. It is one of the most popular computer output devices we often use to get information on paper - called hard copy. Advantage They produce high quality paper output for presentation at a speedy rate. It is also possible to share the printer among different users in a network. Disadvantage The cost for maintenance of the printer equipment as well printing ink is cumulatively high. (ii) Plotters These devices are used to produce graphical outputs on paper. They have automated pens that make line drawings on paper Advantage They can produce neat drawings on a piece of paper based on user commands. In Computer Aided Design (CAD) they are used to produce paper prototypes that aid in design of the final system. Disadvantage They are more expensive than printers. Further, the command based interface is difficult to use. (iii)Monitor It is...

simple basic object oriented java code for employee salary calculation

import java.io.*; import java.util.Scanner; public class Employees {     Scanner scan=new Scanner(System.in);     String Fname;   int EmpID;  int DOB; double Allowance;   double Salary;     public void getDetails(){   System.out.println("Enter the first name");   Fname=scan.next();   System.out.println("Enter the ID number");   EmpID=scan.nextInt();   System.out.println("Enter the date of birth");   DOB=scan.nextInt();   System.out.println("Enter the salary");   Salary=scan.nextDouble();   Allowance=0.6*Salary;     }    public void printReport(){    System.out.println(Fname+"\t"+EmpID+"\t"+calGross()+"\t"+calPayee()+"\t"+calNetIncome());    }    public double calGross(){        return Salary + Allowance;    }    public double calPayee(){     ...

Start Wamp server on windows automatically permanently

For those that have completely refused to use linux platforms for development, you might find this useful. As with all (aspiring) web developers, it’s always important to test your projects locally before putting it out there for the entire web community to see. One must-have developer tool for this purpose is WAMPServer. We’ve all wished it’s automatically up and running when we need it. These easy steps will help you automate WAMPServer to run on system start-up. For those unfamiliar with WAMPServer, it is a development package that lets you run web development projects locally. WAMP stands for Windows, Apache, MySQL, PHP/Perl/Python. It’s basically four programs packaged to work as one. WAMP basically turns any Windows PC into a localized web server. The Linux counterpart is called LAMP, obviously. Once WAMPServer is installed in your PC, you’ll be able to test your web projects before putting it into the live environment. But I always found it a hassle to manually s...