Extracting Products from E-Commerce Web Sites Using Entropy Estimation
Mr. Vijay R. Sonawane, Prof. P. P. Halkarnikar
Department of Computer Science and Technology, Shivaji University, Kolhapur, Maharashtra, India
Email: [email protected]
Department of Computer Science and Engineering, Dr. D. Y. Patil College of Engineering, Kolhapur, Maharashtra, India
Email: [email protected]

Abstract - The explosive growth of the web has made information search and extraction increasingly difficult. Mining product descriptions from e-commerce web sites is useful for integrating product information from multiple sources, so that users can easily locate a desired object among a huge number of web sites. In this paper, we propose a simple technique to extract product descriptions from e-commerce web sites. The technique takes advantage of the hierarchical structure of HTML. It first discovers the set of product descriptions based on an entropy measure computed at each node of the HTML tag tree; a set of association rules based on heuristic features is then employed to improve the accuracy of product extraction.

Keywords - entropy, representative value, association rule, filter, product description.
1. INTRODUCTION
A great amount of effort is often required for a user to manually locate and extract product descriptions from e-commerce web sites. The task faces several challenges: how to effectively recognize the data region in a web page, how to exactly split the data region into separate data descriptions, and how to get rid of noisy objects. The following categories of tools are available to extract data from web sites: wrapper induction tools, NLP-based tools, ontology-based tools, and HTML-aware tools. Wrapper induction tools such as WIEN [9], SoftMealy [8], and STALKER [11] develop wrappers from domain-specific sample pages to obtain extraction rules, which are then used to extract data objects from future similar pages. However, these tools are time consuming and labor-intensive, owing to the difficulty of writing and maintaining wrappers and of annotating sample pages. NLP-based tools such as RAPIER [3], SRV [7], and WHISK [13] use traditional NLP techniques such as lexical analysis and part-of-speech tagging to learn rules for extracting relevant data from highly grammatical documents; they are not appropriate for less grammatical pages. Ontology-based tools [6] accomplish data extraction by relying directly on the data objects, but they encounter problems in writing and maintaining domain ontologies. HTML-aware tools such as W4F [12], XWRAP [10], and RoadRunner [5] rely on the inherent structural features of HTML documents to accomplish data extraction.
Although these tools achieve a higher level of automation compared to wrapper induction tools, they still need user intervention, either to build extraction rules or to label sample pages. MDR [1], OMINI [2], and IEPAD [4] are also automatic, HTML-aware tools. OMINI uses a set of extraction algorithms to locate the smallest subtree that contains all objects of interest, and then employs an object extraction algorithm to find the correct object separator tags. IEPAD finds patterns in the HTML tag strings using a PAT tree and then uses those patterns to extract objects. However, both OMINI and IEPAD return many noisy objects. MDR employs the edit distance string matching algorithm to recognize data regions containing a set of generalized nodes, and then determines the data records in each generalized node using coarse-grained heuristic observations (such as the dollar sign). The first drawback of MDR is the computational complexity of the edit distance algorithm; the second is that its coarse-grained heuristics, such as the dollar sign, are not enough to identify true data records, so it too returns many noisy objects.

In this paper, we propose a new, simple, but efficient technique to automatically extract product descriptions from web collections with various representation styles. Our method is based on the observation that product descriptions usually have a similar display format and are contained in similar HTML subtrees. To extract these product descriptions, the source HTML page is first parsed into a DOM-like HTML tag tree. Then entropy estimation is performed, using a proposed entropy measure, at each internal node of the tree to measure the similarity among its subtrees. Nodes with a high entropy value, i.e. high similarity among their subtrees, should contain potential product descriptions in their descendant nodes. Finally, the entropy information is combined with a set of association rules based on heuristic features to identify the exact product descriptions.

The remainder of the paper is organized as follows. Section 2 presents an overview of the proposed system, Section 3 demonstrates the experimental results, and Section 4 concludes the paper.
Figure 1: Architectural Diagram

2. SYSTEM OVERVIEW
As shown in Figure 1, the proposed approach includes the following main steps: crawling all the links of the web site, building the HTML tag tree, entropy measurement, and product description extraction.
2.1. Crawler Module
To target the whole web site given as input, the crawler is designed to crawl each link of the web site. Web crawlers work by the following process:
1. Download the web page.
2. Parse through the downloaded page and retrieve all the links.
3. For each link retrieved, repeat the process.
In the first step, a web crawler takes a URL and downloads the page from the Internet at the given URL; oftentimes the downloaded page is saved to a file on disk. In the second step, the crawler parses through the downloaded page and retrieves the links to other pages; each link in the page is defined with an HTML anchor. After the crawler has retrieved the links from the page, each link is added to a list of links to be crawled. The third step repeats the process. All crawlers work in a recursive or loop fashion, but there are two different ways to handle it: links can be crawled in a depth-first or breadth-first manner.
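As a concrete illustration, the crawl loop above can be sketched in Java. This is a minimal breadth-first sketch, not the system's actual implementation; it assumes the open-source jsoup library for downloading and parsing pages, and the class and method names are illustrative.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Breadth-first crawler: download a page (step 1), collect its anchor links
// (step 2), and repeat for every collected link (step 3) up to a page limit.
public class SiteCrawler {
    public static Set<String> crawl(String seedUrl, int maxPages) {
        Set<String> visited = new HashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        frontier.add(seedUrl);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;               // already crawled
            try {
                Document doc = Jsoup.connect(url).get();   // step 1: download
                for (Element a : doc.select("a[href]")) {  // step 2: retrieve links
                    String next = a.absUrl("href");        // resolve relative URLs
                    if (next.startsWith(seedUrl)) {        // stay within the site
                        frontier.add(next);                // step 3: crawl it later
                    }
                }
            } catch (Exception e) {
                // unreachable or non-HTML page: skip and continue
            }
        }
        return visited;
    }
}

A depth-first variant simply swaps the queue for a stack; the choice does not affect the later extraction stages.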
2.2. HTML Tag Tree
This approach takes advantage of the hierarchical, nested tag structure of HTML documents. However, not all web pages tightly conform to the W3C HTML specification. Thus, input web pages are first preprocessed to fix potential mistakes and bring them into conformance with the specification. In the HTML tag tree, each node corresponds to a tag, and each tag node together with its descendants constitutes a subtree. Additionally, each node in our HTML tag tree carries the following information.

tt   tag type, i.e. single or paired tag
tw   tag weight
tl   tag level
dp   depth of this subtree
rv   representative value of the subtree
er   entropy ratio of the node
sr   score of the tree node
st   a list of pointers to its subtrees
sp   no. of adjacent subtrees
Table 1: Information associated with a tag node
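To make the node structure concrete, the following Java sketch mirrors Table 1 in a small TagNode class and builds the tag tree from a parsed page. It is illustrative only: the TagNode name, the uniform tag weight, and the use of jsoup (whose parser also repairs common HTML mistakes, standing in for the preprocessing step) are assumptions, not the paper's implementation.

import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// One node of the HTML tag tree, carrying the fields of Table 1.
public class TagNode {
    String tag;                              // tag name
    boolean tt;                              // tag type: true if paired, false if single
    double tw = 1.0;                         // tag weight (a real system assigns per-tag weights)
    int tl;                                  // tag level: distance from the root
    int dp;                                  // depth of this subtree
    double rv;                               // representative value of the subtree
    double er;                               // entropy ratio of the node
    double sr;                               // score of the tree node
    int co;                                  // child order among siblings
    int sp;                                  // no. of adjacent subtrees (filled in later stages)
    List<TagNode> st = new ArrayList<>();    // pointers to subtrees

    // Recursively mirror a parsed jsoup element tree. jsoup repairs common
    // HTML mistakes while parsing, which stands in for the preprocessing step.
    static TagNode build(Element e, int level, int childOrder) {
        TagNode n = new TagNode();
        n.tag = e.tagName();
        n.tt = !e.tag().isSelfClosing();
        n.tl = level;
        n.co = childOrder;
        int order = 1, maxChildDepth = 0;
        for (Element child : e.children()) {
            TagNode c = build(child, level + 1, order++);
            n.st.add(c);
            maxChildDepth = Math.max(maxChildDepth, c.dp);
        }
        n.dp = maxChildDepth + 1;            // a leaf subtree has depth 1
        return n;
    }

    static TagNode parse(String html) {
        return build(Jsoup.parse(html).body(), 1, 1);
    }
}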
2.3. Entropy Measurement
Entropy estimation is used to identify product regions containing potential product descriptions. The idea originates from the observation that similar product descriptions are usually contained in similar subtrees of tags. Similar, in this sense, means that the subtrees have analogous structures in both subtree skeleton and tag position. The essential problem is to measure the similarity among the subtrees. The edit distance algorithm could be used to compare the similarity between two tag strings, but its computational complexity is high, especially when the HTML file is large. To measure the similarity of the subtrees at each tree node, we perform the following two steps. First, the structure of each subtree is mapped to a representative value that should satisfy three properties: (1) two subtrees have close representative values if they are similar in structure, (2) two subtrees have divergent representative values if they are dissimilar in structure, and (3) the computational complexity of the mapping is as small as possible. Second, the entropy of the representative values of the subtrees is estimated. The higher the entropy, the more similar the subtrees.

2.3.1. Representative Value Calculation
The representative value of a tag tree T (T is also the root node), denoted T.rv, is calculated by the following formula:

T.rv = T.tw + ∑ (Ni.tl × Ni.co × Ni.tw), Ni ∈ N ………… (1)

where N is the set of all descendant nodes of T; Ni.tl is the tag level, i.e. the distance from tag node Ni to the root node T; Ni.tw is the tag weight of tree node Ni, whose main purpose is to help distinguish among different HTML tags; and Ni.co is the child order of node Ni among its siblings. Figure 2 [14] shows an example of representative value calculation. In this figure, each oval contains the tag name and tag weight of a tree node, the number associated with each tree edge is the child order, and the tag level of the root node T starts from 1.
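A direct Java rendering of formula (1), assuming the TagNode class sketched in Section 2.2, might look as follows (a sketch, not the authors' code):

// Formula (1): T.rv = T.tw + sum over all descendants Ni of (Ni.tl * Ni.co * Ni.tw),
// with the tag level of the subtree root T taken as 1, so its children sit at level 2.
class RepresentativeValue {
    static double of(TagNode t) {
        return t.tw + sumDescendants(t, 2);
    }

    private static double sumDescendants(TagNode t, int childLevel) {
        double sum = 0.0;
        for (TagNode ni : t.st) {
            sum += childLevel * ni.co * ni.tw;         // Ni.tl * Ni.co * Ni.tw
            sum += sumDescendants(ni, childLevel + 1); // descend one level deeper
        }
        return sum;
    }
}

Note that the tag level here is taken relative to the subtree root T, as the formula requires, rather than relative to the document root.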
Figure 2: Representative Value Calculation

2.3.2. Entropy Calculation
Suppose a tree node T has a set of subtrees S = {T1, T2, ..., Tn} with the corresponding set of representative values R = {T1.rv, T2.rv, ..., Tn.rv}. Let P = {Ti | Ti ∈ S and Ti.dp ≥ DTh}; product descriptions usually reside in subtrees whose depth is greater than or equal to the minimum depth threshold DTh. The Shannon entropy of T with respect to its children's representative values, denoted E(T), is calculated as

E(T) = −∑ (pi × ln pi), Ti ∈ P, where pi = Ti.rv / ∑ Tj.rv, Tj ∈ P ………… (2)

E(T) in formula (2) tends to be high as the cardinality of P grows, and when all representative values are equal the Shannon entropy reaches its maximum value ln|P|. We therefore normalize E(T) to a value between 0 and 1, called the entropy ratio and denoted T.er:

T.er = E(T) / ln|P| ………… (3)
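A Java sketch of the entropy ratio of formulas (2) and (3), again over the assumed TagNode class, might read as follows; representative values are taken as already computed, and the guard for fewer than two qualifying subtrees is an illustrative choice:

import java.util.ArrayList;
import java.util.List;

// Entropy ratio of a node: Shannon entropy of the normalized representative
// values of subtrees with dp >= DTh, divided by ln|P| so that er lies in
// [0, 1] and equals 1 when all representative values match.
class EntropyRatio {
    static double of(TagNode t, int dTh) {
        List<Double> rvs = new ArrayList<>();
        double total = 0.0;
        for (TagNode ti : t.st) {
            if (ti.dp >= dTh) {                  // P = {Ti in S : Ti.dp >= DTh}
                rvs.add(ti.rv);
                total += ti.rv;
            }
        }
        if (rvs.size() < 2 || total <= 0.0) {
            return 0.0;                          // illustrative guard for trivial P
        }
        double entropy = 0.0;
        for (double rv : rvs) {
            double p = rv / total;               // normalized representative value
            if (p > 0.0) entropy -= p * Math.log(p);   // formula (2)
        }
        return entropy / Math.log(rvs.size());   // formula (3): E(T) / ln|P|
    }
}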
Algorithm 1 (entropy ratio calculation) computes the entropy ratio of formula (3) for all nodes in the HTML tag tree. It takes the tree node T and the minimum depth threshold DTh as inputs; after it runs, the entropy ratio at each tree node has been updated.

Algorithm 1 - ERC: Entropy ratio calculation
Input: node T, DTh
Output: entropy ratio of all nodes
1: entropy = 0; Eri = 0; totalCount = 0; totalRV = 0;
2: for Ti ∈ T do
3:   for Tj ∈ T do
4:     if Tj.dp < Ti.dp + 1 then
5:       if Ti.co …
…
14:        entropy += −portion × log(portion)
15:      end if
16:    end if
17:  end for
18:  if totalCount ≥ 1 then
19:    Ti.er = entropy / log(totalCount)
20:  end if
21: end for

2.4. Product Description Extraction
Nodes with large entropy ratios have a high probability of containing regularly formatted objects. However, not all of these objects are actual product descriptions; a large portion are noisy objects, i.e. advertisements, navigation links, etc., and we need to remove them. Information about the entropy ratio alone is clearly not enough to do this. Thus, a scoring technique based on a set of association rules is combined with the entropy ratio to extract and filter the output. Extracted objects whose score is greater than or equal to the minimum score threshold (STh) are considered product descriptions; other objects are noise and are ignored. The proposed approach uses a set of association rules [15] to give a score to each tree node. These association rules are based on the 19 features, or items, listed in Table 2 [14]. The first 18 items, chosen based on heuristics, are useful for classification. For example, nPrices is the number of occurrences of strings like "List Price", "Our Price", "Sale Price", or "Outlet Price" that a node contains. The last item (isProduct) is the target attribute.

Feature      Description
er           entropy ratio of the tree node
dp           depth of the subtree
rv           representative value of the tree node
nLinks       no. of <a> tags
nImages      no. of <img> tags
nFonts       no. of <font> tags
nBolds       no. of <b> tags
nItalic      no. of <i> tags
nBLs         no. of <br> tags
nLists       no. of <li> tags
nDollars     no. of currency signs (e.g. $, £)
nPercents    no. of % signs
nDigits      no. of digit characters
nPrices      no. of "List Price", "Our Price"
nSeeMores    no. of "See More", "Read More"
nRatings     no. of "Rate", "Rating"
nLengths     length of the non-tagged text
nSaves       no. of "Saving", "You Save"
isProduct    True if the node is a product description, False otherwise
Table 2: Heuristic features for filtering output

The antecedent of each rule is a combination, or subset, of the first 18 items in Table 2. The consequent of each rule is the target attribute, i.e. isProduct. Some examples of association rules, accompanied by their support and confidence, are given in Table 3.

Table 3: Association Rules for Scoring
We use this set of association rules to give a score to each tree node. If the consequent of a matching rule is "isProduct = False" (a negative rule), we decrease the score of the node by x points; if the consequent is "isProduct = True" (a positive rule), we increase the score by y points. Both x and y depend on the confidence and support of the rule; in our system, we set x = y = support × confidence / 100. The higher the score of a tag node, the larger the possibility that the node is a product description. We use a minimum score threshold, STh, to decide whether a tag tree node is a product description.

Algorithm 2 - PEWeb: Product description extraction from the Web
Require: T, ERTh, STh, DTh
Ensure: product descriptions in the tree T
1: if T.dp < DTh then
2:   return
3: end if
4: if T.er ≥ ERTh then
5:   for Ti ∈ T.st do
6:     if Ti.sr ≥ STh then
7:       extract Ti
8:     else
9:       PEWeb(Ti, ERTh, STh, DTh)
10:    end if
11:  end for
12: else
13:  for Ti ∈ T.st do
14:    PEWeb(Ti, ERTh, STh, DTh)
15:  end for
16: end if
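A compact Java sketch of the scoring step and of Algorithm 2's recursion, under the same TagNode assumption, might read as follows; the Rule and Antecedent types are hypothetical stand-ins for the mined association rules:

import java.util.ArrayList;
import java.util.List;

// Hypothetical rule type: if a node matches the antecedent, the consequent
// (isProduct = true/false) raises or lowers its score by support*confidence/100.
interface Antecedent { boolean matches(TagNode n); }

class Rule {
    Antecedent antecedent;
    boolean isProduct;                 // consequent of the rule
    double support, confidence;
    Rule(Antecedent a, boolean p, double s, double c) {
        antecedent = a; isProduct = p; support = s; confidence = c;
    }
}

class ProductExtractor {
    // Score a node with every matching rule (x = y = support * confidence / 100).
    static void score(TagNode n, List<Rule> rules) {
        n.sr = 0.0;
        for (Rule r : rules) {
            if (r.antecedent.matches(n)) {
                double delta = r.support * r.confidence / 100.0;
                n.sr += r.isProduct ? delta : -delta;
            }
        }
    }

    // Algorithm 2 (PEWeb) as a recursive walk over the scored tag tree.
    static void peWeb(TagNode t, double erTh, double sTh, int dTh, List<TagNode> out) {
        if (t.dp < dTh) return;                    // too shallow to hold products
        if (t.er >= erTh) {
            for (TagNode ti : t.st) {
                if (ti.sr >= sTh) out.add(ti);     // extract Ti as a product description
                else peWeb(ti, erTh, sTh, dTh, out);
            }
        } else {
            for (TagNode ti : t.st) {
                peWeb(ti, erTh, sTh, dTh, out);
            }
        }
    }
}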
Algorithm 2 [14] takes as inputs T (a tag tree node), ERTh (the minimum entropy ratio threshold), STh (the minimum score threshold), and DTh (the minimum depth threshold). The algorithm visits all nodes under T recursively. At each node, it compares the node's entropy ratio (T.er) with ERTh. If the entropy ratio is greater than or equal to the threshold, it considers all the node's children: at each child Ti, if Ti.sr is greater than or equal to the minimum score threshold STh, Ti is a product description; otherwise the algorithm is called recursively on Ti to search for potential product descriptions below it. If T.er is smaller than ERTh, the algorithm is likewise called recursively to search for potential product descriptions under T. A whole web site can be targeted by recursively applying this strategy to every crawled page. Finally, our system lets the user retrieve information about a desired product by filtering the extracted output with an extraction pattern (e.g. product name, price, colour).

3. EXPERIMENTAL RESULTS
In this section we evaluate the experimental results. The proposed system is implemented in Java. We used the product-based web site www.compusa.com as the data set for the evaluation. The crawler module takes a search string and the number of links of the web site to be crawled as input, and crawls all the links that match the search string.
Figure 3

All the crawled links are fetched and preprocessed to fix potential mistakes. Taking the balanced tag tree as input, the entropy ratio is calculated for each node of the tree. Table 4 shows sample output of the entropy ratio calculation algorithm.

isProduct   depth   entropy            line
0           975     0.53800805824690   6
0           974     0.63553739799325   7
1           844     0.63548066716732   229
1           924     0.51398288605239   339
1           834     1.0                445
Table 4: Entropy ratio

These are intermediate results; combining the association rules with the entropy ratio then yields the exact product descriptions. Afterwards, using filters based on an extraction pattern, the user can narrow the extracted output to information about the desired product.
4. CONCLUSION
In this paper, we proposed a novel and efficient method to automatically extract product descriptions from the Web. Our technique takes advantage of the hierarchical structure of HTML and uses entropy estimation together with association rules to locate the exact product regions. Because it classifies outputs using association rules over important heuristic features, it effectively eliminates noisy objects. Finally, a filter based on extraction patterns helps the user sort the output.
REFERENCES
[1] B. Liu, R. Grossman, and Y. Zhai. Mining data records in web pages. Proceedings of ACM SIGKDD, 2003.
[2] D. Buttler, L. Liu, and C. Pu. A fully automated object extraction system for the World Wide Web. Proceedings of IEEE ICDCS-21, 2001.
[3] M. E. Califf and R. J. Mooney. Relational learning of pattern-match rules for information extraction. Proceedings of the Sixteenth National Conference on Artificial Intelligence, pages 328–334, 1999.
[4] C.-H. Chang and S.-C. Lui. IEPAD: Information extraction based on pattern discovery. Proceedings of WWW-10, 2001.
[5] V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards automatic data extraction from large web sites. Proceedings of the 27th VLDB, pages 109–118, 2001.
[6] D. W. Embley, D. M. Campbell, Y. S. Jiang, S. W. Liddle, Y.-K. Ng, D. Quass, and R. D. Smith. Conceptual-model-based data extraction from multiple-record web pages. Data and Knowledge Engineering, 31(3):227–251, November 1999.
[7] D. Freitag. Machine learning for information extraction in informal domains. Machine Learning, 39(2-3):169–202, 2000.
[8] C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8):521–538, 1998.
[9] N. Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1-2):15–68, 2000.
[10] L. Liu, C. Pu, and W. Han. XWRAP: An XML-enabled wrapper construction system for web information sources. Proceedings of the 16th IEEE International Conference on Data Engineering, pages 611–621, 2000.
[11] I. Muslea, S. Minton, and C. A. Knoblock. Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 2001.
[12] A. Sahuguet and F. Azavant. Building intelligent web applications using lightweight wrappers. Data and Knowledge Engineering, 36(3):283–316, 2001.
[13] S. Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233–272, 1999.
[14] H. X. Phan, S. Horiguchi, and B. T. Ho. PEWeb: Product extraction from the Web based on entropy estimation.
[15] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD, pages 207–216, 1993.