WEB MINING
In a distributed information environment, documents or objects are usually linked together to facilitate interactive access. Examples of such information-providing environments include the World Wide Web (WWW) and on-line services such as America Online, where users seeking information of interest travel from one object to another via facilities such as hyperlinks and URL addresses. The Web is a hypertext body of more than 800 million pages that continues to grow. It exceeds six terabytes of data on about three million servers. Almost a million pages are added daily; pages typically change every few months, so several hundred gigabytes are changed every month. As the information offered on the Web grows daily, obtaining that information becomes more and more tedious. Even the largest search engines, such as Alta Vista and HotBot, indexed less than 18% of the accessible Web pages as of February 1999, down from 35% in late 1997.
The main difficulty lies in the semistructured or unstructured Web content, which is not easy to regulate and on which it is difficult to enforce a structure or standards. A set of Web pages lacks a unifying structure and shows far more variation in authoring style and content than is seen in traditional print document collections. This level of complexity makes an "off-the-shelf" database-management and information-retrieval solution very complex and almost impossible to use. New methods and tools are necessary.
Web mining may be defined as the use of data-mining techniques to automatically discover and extract information from Web documents and services. It refers to the overall process of discovery, not just to the application of standard data-mining tools. Some authors suggest decomposing the Web-mining task into four subtasks:
Resource finding - This is the process of retrieving data, either online or offline, from the multimedia sources on the Web, such as electronic newsletters, electronic newswire, newsgroups, and the text content of HTML documents obtained by removing the HTML tags.
Information selection and preprocessing - This is the process by which the different kinds of original data retrieved in the previous subtask are transformed. These transformations could be either a kind of preprocessing, such as removing stop words and stemming, or a preprocessing aimed at obtaining the desired representation, such as finding phrases in the training corpus or representing the text in first-order logic form.
Generalization - This is the process of automatically discovering general patterns within individual Web sites as well as across multiple sites. Different general-purpose machine-learning techniques, data-mining techniques, and specific Web-oriented methods are used.
Analysis - This is the phase in which validation and/or interpretation of the mined patterns is performed.
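The first two subtasks can be illustrated with a minimal sketch: text is obtained from an HTML document by removing the tags, then stop words are removed and the remaining words are stemmed. Everything named here is an assumption made for illustration. The stop-word list is a tiny hypothetical one, and crude_stem is a naive suffix stripper standing in for a real stemming algorithm; neither comes from the original text.

```python
import re
from html import unescape

# Hypothetical, deliberately tiny stop-word list; real systems use far larger ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "on", "for"}


def strip_tags(html: str) -> str:
    """Drop script/style blocks and all remaining tags, then decode entities."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return unescape(text)


def crude_stem(word: str) -> str:
    """Naive suffix stripping; only a stand-in for a proper stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(html: str) -> list[str]:
    """Tags removed -> lowercase word tokens -> stop words dropped -> stemmed."""
    tokens = re.findall(r"[a-z]+", strip_tags(html).lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


page = (
    "<html><body><h1>Web usage</h1>"
    "<p>The users visited the linked pages.</p></body></html>"
)
terms = preprocess(page)
print(terms)
```

The resulting list of terms is the kind of representation that the generalization subtask would then take as input.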
There are three factors affecting the way a user perceives and evaluates Web sites through the data-mining process: a) Web-page content, b) Web-page design, and c) overall site design, including its structure. The first factor concerns the goods, services, or data offered by the site. The other factors concern the way in which the site makes content accessible and understandable to its users. We distinguish between the design of individual pages and the overall site design because a site is not simply a collection of pages: it is a network of related pages. Users will not engage in exploring it unless they find its structure simple and intuitive. Clearly, understanding user-access patterns in such an environment will not only help improve the system design (e.g., providing efficient access between highly correlated objects, better authoring design for WWW pages, etc.) but also lead to better marketing decisions. Commercial results will be improved by putting advertisements in proper places, by better customer/user classification, and by understanding user requirements better through behavioral analysis.
Companies are no longer interested in Web sites that simply direct traffic and process orders. Now they want to maximize their profits. They want to understand customer preferences and customize sales pitches to individual users. By evaluating a user's purchasing and browsing patterns, e-vendors want to serve up, in real time, customized menus of attractive offers that e-buyers cannot resist. Gathering and aggregating customer information into e-business intelligence is an important task for any company with Web-based activities. E-businesses expect big profits from improved decision-making, and therefore e-vendors line up for data-mining solutions.
Borrowing from marketing theory, we measure the efficiency of a Web page by its contribution to the success of the site. For an on-line shop, it is the ratio of the number of visitors who purchased a product after visiting the page to the total number of visitors who accessed the page. For a promotional site, the efficiency of a page can be measured as the ratio of the number of visitors who clicked on an advertisement after visiting the page to the total number of visitors who accessed it. Pages with low efficiency should be redesigned to better serve the purposes of the site. Navigation-pattern discovery should help in restructuring a site by inserting links and redesigning pages, ultimately accommodating user needs and expectations.
One possible categorization of Web mining is based on which part of the Web is mined, and it consists of three areas:
Web-content mining - describes the discovery of useful information from Web documents. Basically, Web content consists of several types of data, such as text, image, audio, video, and metadata, as well as hyperlinks. Research in mining multiple types of data is now termed multimedia-data mining, and we could consider it an instance of Web-content mining. Web content data consist of unstructured data such as free text, semi-structured data such as HTML documents, and more structured data such as tables and database-generated HTML pages. The goal of Web-content mining is mainly to assist or improve information finding and information filtering. By building a new model of the data on the Web, queries more sophisticated than keyword-based search could be asked.
Web-structure mining - tries to discover the model underlying the link structure on the Web. The model is based on the topology of the hyperlinks with or without a description of the links. The model can be used to categorize Web pages and is useful for generating information such as the similarity relationship between Web sites.
Web-usage mining - tries to make sense of the data generated by Web surfers' sessions or behaviors. While Web-content mining and Web-structure mining utilize real or primary data on the Web, Web-usage mining mines the secondary data derived from the behavior of users while interacting with the Web. This includes data from Web server-access logs, proxy-server logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, bookmark data, and any other data that are derived from a person's interaction with the Web.
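As a small illustration of Web-usage mining, the page-efficiency ratio described earlier can be computed from user sessions. The sketch below rests on stated assumptions: each session is an ordered list of page paths from one visitor, a purchase is signalled by reaching a goal page, and the paths and the function name page_efficiency are hypothetical. It is one possible reading of the measure, not a definitive implementation.

```python
from collections import Counter


def page_efficiency(sessions, goal_page):
    """For each page, return the share of sessions that visited the page
    and reached goal_page afterwards (assumes one session per visitor)."""
    visits = Counter()
    conversions = Counter()
    for pages in sessions:
        # Position of the first visit to every page in this session.
        first_seen = {}
        for i, page in enumerate(pages):
            first_seen.setdefault(page, i)
        # Position of the last visit to the goal page, or -1 if never reached.
        last_goal = max(
            (i for i, page in enumerate(pages) if page == goal_page),
            default=-1,
        )
        for page, i in first_seen.items():
            if page == goal_page:
                continue  # the goal page itself is not scored
            visits[page] += 1
            if last_goal > i:  # goal reached after visiting this page
                conversions[page] += 1
    return {page: conversions[page] / visits[page] for page in visits}


# Hypothetical sessions for an on-line shop; "/checkout" marks a purchase.
sessions = [
    ["/home", "/shoes", "/checkout"],
    ["/home", "/about"],
    ["/shoes", "/home", "/checkout"],
    ["/home", "/shoes"],
]
efficiency = page_efficiency(sessions, "/checkout")
for page, value in sorted(efficiency.items()):
    print(f"{page}: {value:.2f}")
```

In this toy data the page "/about" never precedes a purchase, so it would be a candidate for redesign under the efficiency criterion.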
To deal with problems of Web-page quality, Web-site structure, and their use, two families of Web tools have emerged. The first includes tools that accompany users in their navigation, learn from their behavior, make suggestions as they browse, and, occasionally, customize the user profile. These tools are usually connected to or built into parts of different search engines. The second family of tools analyzes the activities of users off-line. Their goal is to provide insight into the semantics of a Web site's structure by discovering how this structure is actually utilized. In other words, knowledge of the navigational behavior of users is used to predict future trends. New data-mining techniques are behind these tools, in which Web log files are analyzed and information is uncovered. In the next two sections, we will illustrate Web mining with three techniques that are representative of a large spectrum of Web-mining methodologies developed recently.
Tuesday, December 16, 2008