2009 IEEE International Conference on Web Services
Collaborative Web Data Record Extraction

Gengxin Miao, Firat Kart, L. E. Moser, P. M. Melliar-Smith
Department of Electrical and Computer Engineering
University of California, Santa Barbara
Santa Barbara, CA, 93106
{miao, fkart, moser, pmms}@ece.ucsb.edu

Abstract—This paper describes a Web Service that automatically parses and extracts data records from Web pages containing structured data. The Web Service allows multiple users to share and manage a Web data record extraction task to increase its utility. A recommendation system, based on the Probabilistic Latent Semantic Indexing algorithm, enables a user to find potentially interesting content or other users who share the user's interests. A distributed computing platform improves the scalability of the Web Service in supporting multiple users by employing multiple server computers. A Web Service interface allows users to access the Web Service, and allows programmers to develop their own applications and, thus, extend the functionality of the Web Service.

Index Terms—collaborative information extraction, data mining, Web Service.
I. INTRODUCTION

On the Web, the amount of structured data is 500 times greater than the amount of unstructured data. As one of the major sources of structured data, the deep Web has been estimated to contain more than 450,000 databases [5] in which structured data are stored. The structured data in the deep Web are continually evolving, and might be updated as often as once every second. Deep Web pages can be dynamically generated from the data in the deep Web. A single deep Web page typically contains a large number of Web records [12], [13], i.e., HTML regions, each of which corresponds to an individual data object. When browsing these deep Web pages, a user is usually interested in only a small number of data objects. The diverse data and the evolving characteristics of the deep Web make it difficult for users to locate the data objects of interest conveniently and in a timely manner.

There have been extensive studies of fully automatic methods to extract data objects from the Web [1], [7]. A typical process to extract data objects from a Web page consists of three steps. The first step is to identify the Web records that represent individual data objects (e.g., products). The second step is to extract data object attributes (e.g., product names, prices, and images) from the Web records. Corresponding attributes in different Web records are aligned, resulting in spreadsheet-like data [17], [18]. The final step is the optional task (which is very difficult in general) of interpreting the aligned attributes and assigning appropriate labels [16], [19].

In this paper we describe a Web Service for extracting Web data records from deep Web pages. Attribute alignment is not necessary. Approaches such as EXALG [1] and RoadRunner [7] are applicable only when there are multiple deep Web
pages that use the same template, which is not guaranteed in our case. SRR [18] is intended for extracting data records from Web pages returned by search engines. Because the user-specified data source can be in any domain, we employ a domain-independent approach for extracting data records from deep Web pages [13].

The Internet has brought collaboration among individuals to a whole new level. Not only can colleagues or friends collaborate with each other, but individuals can also collaborate without even knowing each other. Wikipedia provides a collaborative authoring platform that aggregates individual intelligence by allowing any authorized user to modify an article on which he/she has knowledge. Facebook allows different users to collaborate in developing applications and, hence, provides a useful social network service. Support for collaborative and social interactions in an information seeking system [6], [9] improves the utility of the system by allowing users to share information and tasks.

In this paper we present a Web Service that supports collaboration among multiple users who are interested in the same information, e.g., the price of a certain product. Different users are aware of different data sources that contain relevant information, e.g., different e-commerce Web sites carrying the product. Through collaboration, the users can obtain more complete and relevant information. Our Web Service allows authorized users to share the Web data records extracted from the deep Web and to manage the Web data record extraction task.

Collaborative filtering [10] produces recommendations by computing the similarities between one user's preferences and the preferences of other users. Algorithms for collaborative filtering fall into two categories: rank-based algorithms, e.g., RankBoost [8], and probabilistic model-based algorithms, e.g., Latent Dirichlet Allocation (LDA) [3] and Probabilistic Latent Semantic Indexing (PLSI) [11]. RankBoost combines multiple partial preferences into a unified ranking. PLSI and LDA analyze the co-occurrences of two different types of data, e.g., documents and words. With the help of a latent semantic layer, PLSI and LDA estimate the joint probability of any given pair of a document and a word. We employ PLSI to produce personal recommendations for the users. The extracted Web data records are ranked based on their probabilities of co-occurrence with a particular user, and the top-ranked objects are recommended to the user because they align with the user's preferences.
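For reference, the PLSI model used for these recommendations expresses the probability of a user u co-occurring with a task or data record t through a latent topic variable z; this is the standard PLSI decomposition of [11], restated here in our user/task setting:

P(u, t) = \sum_{z} P(z)\, P(u \mid z)\, P(t \mid z)

The factors P(z), P(u|z), and P(t|z) are typically estimated with the EM algorithm; for each user, candidate tasks are then ranked by their estimated co-occurrence probability P(u, t).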
II. MOTIVATION

To motivate our Web Service for collaborative Web data record extraction, we present several example scenarios, in which a filtered summary of the Web data records that match the user's interests is produced from a list on a Web page.

Example 1. Bob is traveling by train across California. Bob wants to read some interesting stories from his favorite Web site using his cell phone to fill the time (10 hours) spent on the train. However, the Web site is not friendly to small hand-held devices. Both the bandwidth and the screen size are limited. It takes minutes to download a large Web page onto Bob's cell phone, and he needs to scroll both horizontally and vertically to locate the interesting content on the Web page. Bob completely loses interest after this frustrating browsing experience. If there were a Web Service that parsed and extracted the interesting information from the Web page and then sent Bob only that content, his experience would be much more satisfying.

Example 2. Mike is a soccer fan and likes the soccer player David Beckham very much. However, he happens to be busy on the day when Beckham plays an important game. Mike cannot watch the entire game, but he does not want to miss the moment when Beckham scores a goal. He knows a Web site that broadcasts the video of the game online and another Web site that provides live news on the game in text. Mike wishes that there were a Web Service that periodically extracted and parsed the live news and notified him when Beckham scores, so that he could catch the replay of the goal.

Example 3. Alice decides to go to Long Beach for Spring break with her friends. They start looking for a vacation rental on the local forums on the Web one month before their vacation. Alice feels exhausted after reading through the vacation rental advertisements posted on the various forums, most of which turn out to be irrelevant. It takes Alice and her friends a lot of precious time to read the listings and circulate the information among themselves. The process would be greatly facilitated if a Web Service were available that automatically scanned the forums of interest and extracted only the relevant rental information for Alice and her friends to share.

When a user browses a deep Web page, the user typically has a particular information need in mind. Only a fraction of the Web data records found on a deep Web page match the user's information need. Browsing the entire Web page is tedious, and can be expensive in time or bandwidth. Our Web Service for collaborative Web data record extraction returns only the Web data records of interest to the user and, thus, increases the efficiency of the deep Web browsing process. In addition, it allows multiple users to share and manage the Web data record extraction task to enhance the benefits further.

III. WEB DATA RECORD EXTRACTION

Our Web Service for Web data record extraction focuses on the deep Web for the following reasons:
• The amount of data in the deep Web is much greater than that on the surface Web.
Fig. 1. An example Web page containing the live news of a soccer game.

Fig. 2. HTML code template used to render the soccer game live news.
• The deep Web is a good source of structured data, which are more suitable for automatic processing than the unstructured data on the surface Web.
• The dynamic content found in deep Web pages is generally of greater interest than the static content found in static Web pages on the surface Web.

Our Web Service employs a Web data record extraction technique, based on HTML tag path clustering [13], that we developed. In an automatically generated deep Web page, the Web records, e.g., live news about a soccer game, are rendered in visually repeating patterns. First, the Web record extraction algorithm identifies the visually repeating part in a Web page. Then, a Web page segmentation algorithm looks for the exact boundaries of each Web data record.
A. Finding Visually Repeating Patterns

In a Web page (HTML document), the visual information is conveyed by HTML tag paths. Visually repeating patterns in a Web page correspond to the repeated occurrence of HTML tag paths in the Web page. A unique HTML tag path might have multiple occurrences in a Web page. A set of HTML tag paths that repeatedly occur in the Web page in a similar way corresponds to a set of Web records. The occurrence positions are indicated using a binary vector, referred to as a visual signal vector. Each unique HTML tag path corresponds to a visual signal vector. By evaluating the similarity between the visual signal vectors, we can discover whether two unique HTML tag paths have similar repeated occurrence patterns. We construct a pairwise similarity matrix, in which each element is the
similarity measurement of a pair of visual signal vectors. We then apply a spectral clustering algorithm [15] to the similarity matrix to discover a set of unique HTML tag paths, i.e., visually repeating patterns on the Web page, which correspond to the Web records.

For example, Figure 1 is an automatically generated Web page that contains live news on a soccer game between Germany and Spain. The soccer game Web page is updated whenever live news is uploaded. Each live news record is rendered using the HTML code template shown in Figure 2. The template corresponds to the unique HTML tag paths 17 through 20 in Figure 3, which is one of the clusters generated by the spectral clustering algorithm. The grouped unique HTML tag paths are then passed to the Web page segmentation algorithm to find the exact boundaries of each live news record.

Fig. 3. Unique HTML tag paths extracted from the soccer game live news Web page.

B. Web Page Segmentation

The Web page segmentation algorithm takes as input a set of HTML tag paths and examines their occurrences to determine the exact boundaries of the Web records. Each occurrence of a unique HTML tag path corresponds to a node in the DOM tree, a tree structure obtained by parsing the HTML document. If a unique HTML tag path A is a prefix of a unique HTML tag path B, then the occurrences of A are ancestors of the occurrences of B in the DOM tree and, hence, correspond to larger pieces of HTML text. In this case, A is an ancestor visual signal of B. A set of unique HTML tag paths corresponds to an HTML template. An occurrence of a tag path maps to a part of a Web record or an entire Web record. A larger piece of HTML text is more likely to cover an entire Web record than a smaller piece of HTML text.

Fig. 4. Ancestor / descendant relationships within a set of unique HTML tag paths for the soccer game live news Web page.

Figure 4 shows the ancestor and descendant relationships within a set of unique tag paths in the Web page for the soccer game live news. An occurrence of an upper-level tag path corresponds to a larger piece of HTML text than a lower-level tag path. In this example, the dl node corresponds to the entire news record. All occurrences of the dl nodes following the tag path are extracted; each one is a news record. The results for the soccer game live news Web page are shown in Figure 5.
Fig. 5. Web data record extraction results for the soccer game live news Web page.
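To make the notions of visual signal vectors and pairwise similarity concrete, the following minimal Java sketch is provided. It is only an illustration, not the algorithm of [13]: it assumes the tag paths have already been extracted in document order, summarizes each tag path's occurrence positions into a few coarse buckets, and uses cosine similarity purely as a stand-in for the paper's similarity measure; the actual system computes similarity over the full occurrence patterns and then applies spectral clustering [15] to the resulting matrix.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VisualSignalSketch {

    /**
     * For each unique tag path, records how often it occurs in each of
     * `buckets` equal slices of the document-order tag path sequence.
     * This is a simplified stand-in for the binary visual signal vectors.
     */
    public static Map<String, double[]> occurrenceProfiles(List<String> pathsInDocOrder, int buckets) {
        int n = pathsInDocOrder.size();
        Map<String, double[]> profiles = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            int bucket = (int) ((long) i * buckets / n);
            profiles.computeIfAbsent(pathsInDocOrder.get(i), p -> new double[buckets])[bucket]++;
        }
        return profiles;
    }

    /** Cosine similarity between two occurrence profiles (illustrative measure only). */
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        // Hypothetical tag paths in document order: a heading followed by
        // three repeated news records rendered with the same dl/dt/dd template.
        List<String> paths = List.of(
                "/html/body/h1",
                "/html/body/dl", "/html/body/dl/dt", "/html/body/dl/dd",
                "/html/body/dl", "/html/body/dl/dt", "/html/body/dl/dd",
                "/html/body/dl", "/html/body/dl/dt", "/html/body/dl/dd");
        Map<String, double[]> profiles = occurrenceProfiles(paths, 3);
        List<String> unique = new ArrayList<>(profiles.keySet());
        for (int i = 0; i < unique.size(); i++) {
            for (int j = i + 1; j < unique.size(); j++) {
                System.out.printf("sim(%s, %s) = %.2f%n", unique.get(i), unique.get(j),
                        cosine(profiles.get(unique.get(i)), profiles.get(unique.get(j))));
            }
        }
    }
}

In this toy example, the dl, dt, and dd paths of the repeated record template obtain similarity 1.0, while the non-repeating heading path scores lower; this is exactly the kind of signal that the clustering step exploits.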
The Web data record extraction algorithm has linear time complexity in the length of the Web page. We use this algorithm in our Web Service to extract information from the Web pages.

IV. SYSTEM ARCHITECTURE

Our system enables users to submit Web data record extraction tasks using the Web Service. Different kinds of applications on different kinds of devices can access the Web Service, as shown in Figure 6. Having received tasks from
multiple client applications, the Web Service executes the Web data record extraction tasks in parallel. Depending on the Web Service call, the results are returned to the client application or are uploaded to an Atom server to be published as a syndication feed for subscribed consumers.
Fig. 6. Use of the Web data record extraction Web Service.

Fig. 7. Collaborative data extraction.
Our Web Service for collaborative Web data record extraction allows multiple users to share results and manage tasks, as shown in Figure 7. A client is identified by a unique clientID and is authorized by password verification. Once a client submits a new task to the Web Service, it becomes the administrative client for that task. The client can authorize other clients to access the results returned by the task or to manage the task. Authorized clients access the Web Service to list the tasks that they have permission to manage and the URLs of the results that they can access. They can then use the Web Service to manage a task or the Atom server to retrieve its results. Our Web Service also provides a recommendation facility that allows users to find Web data record extraction tasks of interest to them.

Tasks submitted to the Web Service are executed in parallel for scalability, as shown in Figure 8. When a new Web data record extraction task is submitted, the master computer divides the task into multiple sub-tasks and puts them into a task queue. Thus, the master computer maintains the list of tasks to be executed and distributes them among a set of worker computers for load balancing. The workers retrieve the data resources from the deep Web using the user-specified URLs, and the extracted Web records are filtered according to the user-defined filtering rules. The final results are gathered at the master computer, and the workers are then ready to take on new tasks. The master computer determines whether to pass the result set, containing the list of Web records, back directly to the calling client or to save it at the Atom server for the client to consume later.
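The following Java sketch is a simplified, single-machine analogue of this flow, not the distributed platform itself: the record extraction step is stubbed out, the URLs and filter keywords are hypothetical, and a thread pool stands in for the master and worker computers. It illustrates only how a task can be split into one sub-task per data source, filtered at the workers, and gathered at the master.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelExtractionSketch {

    /** Stub: in the real system this runs the tag path clustering extractor on the page at `url`. */
    static List<String> extractRecords(String url) {
        return List.of("record from " + url);
    }

    /** Stub filter: keeps records that contain all keywords of a simple "k1 AND k2" rule. */
    static boolean matches(String record, List<String> keywords) {
        return keywords.stream().allMatch(record::contains);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical task: several data source URLs plus a keyword filter.
        List<String> urls = List.of("http://example.org/pageA", "http://example.org/pageB");
        List<String> keywords = List.of("record");

        ExecutorService workers = Executors.newFixedThreadPool(4);
        List<Future<List<String>>> futures = new ArrayList<>();
        for (String url : urls) {
            // One sub-task per user-specified URL, analogous to an entry in the master's task queue.
            Callable<List<String>> subTask = () -> {
                List<String> filtered = new ArrayList<>();
                for (String record : extractRecords(url)) {
                    if (matches(record, keywords)) {
                        filtered.add(record);
                    }
                }
                return filtered;
            };
            futures.add(workers.submit(subTask));
        }

        // The "master" gathers the filtered records from all sub-tasks.
        List<String> results = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            results.addAll(f.get());
        }
        workers.shutdown();
        System.out.println(results);
    }
}

In the actual system the sub-tasks are placed in the master's task queue and pulled by separate worker computers rather than threads, as described in Section V.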
V. IMPLEMENTATION

Our system provides a Web Service interface that allows clients to access the Web data record extraction service, a distributed computing platform that performs the Web record extraction computations in parallel, and a backend database that stores information for the multiple collaborative users. The data from the deep Web sites are aggregated in a database at the Atom server.

A. Distributed Computing Platform

The distributed computing platform is similar to the CILK system [4] for multithreaded parallel programming, except that CILK is based on C whereas our system is implemented in Java. A job is divided up into a number of sub-jobs. A job is finished when all of its sub-jobs have been executed. In CILK, there might be dependence relationships between the sub-jobs. For example, if sub-job A takes the output of sub-job B as input, then sub-job A can be executed only when sub-job B has finished. In our system, there are no dependence relationships between sub-jobs, i.e., a Web record extraction task does not depend on any other tasks. Thus, our problem is slightly easier than the general problem addressed by CILK.

We employ the concept of work stealing used by CILK to avoid multiple workers requesting work from the master at the same time and, hence, avoid a network communication bottleneck at the master. The master in our system maintains the list of jobs that are ready to be executed. Each worker maintains its own job queue. When its job queue length is less than a threshold, MinJobs, the worker either requests a new job from the master with probability p, or "steals" a job from a randomly picked worker with probability 1 − p. Using this work-stealing strategy, the system balances the network bandwidth usage among all of the workers and avoids a burst of requests at the master.

The master keeps track of the jobs that have been assigned to the workers until all of them have been executed successfully. If a worker fails to respond to the master within a certain amount of time, the master marks the worker as "dead" and puts all of the tasks assigned to that worker back into the job queue for execution. In this manner, the system is protected against failures of the workers. The master can also be protected from failures by means of a backup server.
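A minimal sketch of the refill decision described above is given below, assuming a single-process simulation in which the master's queue and the other workers' queues are directly accessible; in the real platform these interactions take place over the network, and the values of MinJobs and p shown here are arbitrary placeholders.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Random;

public class WorkStealingSketch {

    // Illustrative values; the paper does not specify MinJobs or p.
    static final int MIN_JOBS = 2;
    static final double P_ASK_MASTER = 0.5;

    static final Random random = new Random();

    /** A worker's local job queue; in the real platform this is a remote worker process. */
    static class Worker {
        final Deque<String> jobQueue = new ArrayDeque<>();

        /** Called whenever the local queue runs low. */
        void refill(Deque<String> masterQueue, List<Worker> others) {
            while (jobQueue.size() < MIN_JOBS) {
                if (random.nextDouble() < P_ASK_MASTER) {
                    // Request a new job from the master with probability p.
                    String job = masterQueue.poll();
                    if (job != null) jobQueue.add(job);
                    else break;                            // master has nothing left
                } else {
                    // Otherwise "steal" a job from a randomly picked worker.
                    Worker victim = others.get(random.nextInt(others.size()));
                    String stolen = victim.jobQueue.pollLast();
                    if (stolen != null) jobQueue.add(stolen);
                    else if (masterQueue.isEmpty()) break; // nothing to steal and master is idle
                }
            }
        }
    }

    public static void main(String[] args) {
        Deque<String> masterQueue = new ArrayDeque<>(List.of("job1", "job2", "job3", "job4"));
        Worker w1 = new Worker();
        Worker w2 = new Worker();
        w1.refill(masterQueue, List.of(w2));
        w2.refill(masterQueue, List.of(w1));
        System.out.println("w1: " + w1.jobQueue + ", w2: " + w2.jobQueue);
    }
}

Because a low-running worker goes to the master only with probability p and otherwise to a random peer, refill traffic is spread across the cluster rather than converging on the master.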
Fig. 8. Distributed computing platform for the Web data record extraction Web Service.

B. Backend Database

The backend database stores the user information, the Web data record extraction task information, and the corresponding results. The structure of the database is shown in Figure 9.
Fig. 9. Backend database.
The UserAccount table stores the user account information. Each record corresponds to a user account created by a client. The attributes include two mandatory fields, Username and Password, and several optional fields for the user's profile, such as Age, Interest, and Occupation.

Once a new Web data record extraction task is submitted to the server, the system creates a new entry in the Task table. TaskID is a unique identifier for the task; Name is a user-defined attribute that helps to identify the task; and Description is an attribute that briefly describes the task. FilterRule is a logic expression that is used to filter the extracted records. For example, "Keyword1 AND Keyword2" means that the user wants the set of data records that contain both Keyword1 and Keyword2. The FilterRule attribute is optional. If the FilterRule field is missing, the system returns all of the extracted Web records.

The TaskSchedule table stores information about which task is to be executed and when. A Web data record extraction task might need to be executed repeatedly. For example, in the soccer game live news application, the client wants the results to be updated every few seconds because news can be posted at any time. According to the user-specified starting time, ending time and refresh frequency, the system creates multiple task schedule entries for the same Web data record extraction task, which is executed repeatedly.

The User-taskAuthorization table describes the user-to-task authorization relationships. There are three authorization levels: 0 means the user can access the results of the task; 1 means the user can manage the task; and 2 means the user is the administrator of the task. The Username "Public" is system-reserved. If the Username attribute of a User-taskAuthorization record is "Public," it indicates that the corresponding task is publicly available.
The DataResource table stores the user-specified URLs for the Web pages that contain the target information. A Web data extraction task can be associated with multiple URLs and, hence, multiple DataResource records. The Result table stores the location of the Web records that are extracted.
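The paper specifies FilterRule only by example ("Keyword1 AND Keyword2"). The following sketch therefore assumes a deliberately tiny rule language, keywords joined by a single AND or OR, and shows how such a rule might be applied to the text of an extracted record; it is an illustration, not the system's actual filter implementation.

import java.util.Arrays;

public class FilterRuleSketch {

    /**
     * Evaluates a very small subset of filter rules against a record's text:
     * either "k1 AND k2 AND ..." or "k1 OR k2 OR ...". Parentheses and mixed
     * operators are not handled; this is only meant to illustrate the idea.
     */
    public static boolean matches(String recordText, String filterRule) {
        if (filterRule == null || filterRule.isBlank()) {
            return true;                         // no rule: keep every record
        }
        if (filterRule.contains(" AND ")) {
            return Arrays.stream(filterRule.split(" AND "))
                         .allMatch(k -> recordText.contains(k.trim()));
        }
        if (filterRule.contains(" OR ")) {
            return Arrays.stream(filterRule.split(" OR "))
                         .anyMatch(k -> recordText.contains(k.trim()));
        }
        return recordText.contains(filterRule.trim());   // single keyword
    }

    public static void main(String[] args) {
        String record = "12' Goal by Beckham, assisted by Donovan";
        System.out.println(matches(record, "Beckham AND Goal"));   // true
        System.out.println(matches(record, "Ronaldo OR Beckham")); // true
        System.out.println(matches(record, "Ronaldo AND Goal"));   // false
    }
}

A production implementation would need a real boolean-expression parser, but the worker-side usage pattern is the same: each extracted record is kept only if the rule evaluates to true against its text.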
C. Web Service Interface

Our Web Service allows client access in a variety of ways. The Web Service interface is described using WSDL. A client accesses the Web Service by sending a SOAP request message to the Web Service, and the Web Service returns the results to the client in a SOAP response message. The key operations provided by the Web Service are the following.

• CreateUserAccount: A new client uses this operation to create a new account by providing a Username, Password, Interests, etc. in a SOAP request message. The Web Service creates a new record in the UserAccount table. The SOAP response message indicates whether or not the user account is successfully created.

• ExtractDataRecords: Using this operation, a client submits a Web data record extraction task to the Web Service. In the SOAP request message, the client provides a Username, Password, URL for the target Web page that contains the list of Web records, and a filtering rule. The Web Service creates records in the User-taskAuthorization table, Task table, TaskResource table and TaskResult table. The Web records extracted from the Web page are filtered using the filtering rule and stored at the Atom server for the user to retrieve at a later time. Meanwhile, a SOAP response message containing the extracted Web records is returned to the client.

• PeriodicallyExtractDataRecords: The Web Service supports periodic updating of the Web data record extraction results. In the SOAP request message, the client indicates the data source, starting time, ending time, and requested update frequency. The server creates a new record in the Task table and multiple records in the TaskSchedule table. The scheduled tasks are executed at pre-defined times, and the results are stored at the Atom server as Atom feeds. The client consumes the data by subscribing to the data at the Atom server. The Atom server performs an identification check before authorizing a user to access the Atom feeds. The SOAP response message contains only the TaskID and the URL to the Atom feeds.

• AddConsumer / RemoveConsumer: The administrative user of a task can use these operations to authorize / deauthorize other users' access to the results stored at the Atom server. If the user adds / removes the special username "Public," the Atom feeds containing the task results are made publicly available / unavailable. To use this operation, the user must specify Username, Password and TaskID. The SOAP response message indicates whether or not the job is successfully executed.
• AddManager / RemoveManager: The administrative user of a task can use these operations to authorize / deauthorize another user as a manager of the task.

• AddResource / RemoveResource: The administrative user, and all authorized users that can manage the task, use these operations to add / remove URLs of target Web pages that contain Web records. Similar operations include UpdateName, UpdateDescription, UpdateFrequency, UpdateEndingtime, etc. The server updates the corresponding records in the Task table, TaskSchedule table and TaskResource table.

• ListTasks: This operation lists all of the tasks for which the user is authorized to view the results. The SOAP request message indicates the Username and Password. The SOAP response message contains a list of tasks with descriptions and the URLs for accessing the resulting Atom feeds at the Atom server.

• RecommendTask / RecommendUser: These operations help the user to locate publicly available tasks that might be of interest to the user, or other users who have the same interests. We employ PLSI [11] to generate recommendations. The co-occurrence probability of each user / task pair is estimated. For each user, we obtain a ranked list of tasks that have a high probability of co-occurrence with that user, and the top-ranked publicly accessible tasks are recommended to the user. To recommend friends to a user, we compare the edit distances between users' interest fields, as sketched below; the users with interests closest to those of a particular user are recommended to the user as potential friends.
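Since the paper does not specify which edit distance variant is used or how the Interest field is tokenized, the following sketch simply applies the standard character-level Levenshtein distance to the raw Interest strings as a stand-in for the comparison described above.

public class InterestDistanceSketch {

    /** Standard character-level Levenshtein distance via dynamic programming. */
    public static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,     // deletion
                                            d[i][j - 1] + 1),    // insertion
                                   d[i - 1][j - 1] + cost);      // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Hypothetical Interest fields from the UserAccount table.
        String alice = "soccer, travel";
        String bob = "soccer, trains";
        String carol = "knitting";
        System.out.println(editDistance(alice, bob));   // small distance: similar interests
        System.out.println(editDistance(alice, carol)); // larger distance: dissimilar interests
    }
}

Smaller distances indicate more similar interest strings, so the users with the smallest distances to a given user would be the ones returned by RecommendUser.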
VI. RELATED WORK

Traditional information search and retrieval services for the Web, such as those provided by the search engines of Google and Yahoo!, consider a Web page as an atomic-level object. A user is led to a Web page even though he / she is interested in only a small part of the content of the Web page. On the other hand, if the information in which the user is interested is located on multiple Web pages, the search engines do not aggregate this information, and the user has to access all of the related Web pages manually to obtain a broad view of the available information. The Web should be considered as a repository of information, rather than as a repository of Web pages.

The wide use of information on the Web has driven general-purpose search engines to perform vertical Web search. In a particular domain, Web data records in Web pages are extracted and aggregated together to satisfy users' information needs. Google and Microsoft provide vertical search engines for online shopping, publications, recruiting advertisements, etc. However, the Web contains information on a vast number of topics, and information from different disciplines is often intertwined. It is therefore difficult to divide the information on the Web into a reasonable number of non-overlapping domains, and even harder to build a vertical search engine for each such domain. A domain-independent
Web object retrieval service is proposed in [14]. However, identifying whether or not a Web page contains a set of data objects is a non-trivial problem.

We address the information search and retrieval problem in the case where users know the data source locations, but it is inconvenient for them to locate the relevant information by browsing the Web pages themselves, because of the users' limited time, constraints on network bandwidth, the small display area of mobile devices, etc. Our Web Service provides better results for users' queries by extracting, filtering and / or aggregating data records from the Web pages in a user-defined manner. Unlike existing vertical search engines, our Web Service for extraction of Web data records supports collaboration among users. Data extraction results can be shared among multiple users, and multiple users can manage the same Web data record extraction task to provide more complete and relevant data.

VII. CONCLUSION AND FUTURE WORK

This paper has described a Web Service that automatically parses and extracts data records from Web pages containing structured data. The Web Service allows multiple users to share and manage a Web data record extraction task. A recommendation system, based on the Probabilistic Latent Semantic Indexing algorithm, enables a user to find potentially interesting content or other users who have similar interests. A distributed computing platform improves the scalability of the Web Service. A Web Service interface allows users to access the Web Service, and allows programmers to develop their own applications and, thus, extend the functionality of the Web Service.

In future work, we plan to conduct an extensive performance evaluation of the Web Service to determine the query rate that the system supports, with a single server and with multiple servers using our distributed computing platform, and also with multiple databases at multiple Web sites. We also plan to investigate the use of the Web Service in various example applications, such as those mentioned earlier in the paper.

REFERENCES

[1] A. Arasu and H. Garcia-Molina. Extracting structured data from Web pages. In Proceedings of the 2003 ACM International Conference on Management of Data, San Diego, CA, June 2003, pp. 337-348.
[2] M. K. Bergman. The deep Web: Surfacing hidden value. Technical report, BrightPlanet LLC, December 2000.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, vol. 3, 2003, pp. 993-1022.
[4] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 1995, pp. 207-216.
[5] K. C. Chang, B. He, C. Li, M. Patel, and Z. Zhang. Structured databases on the Web: Observations and implications. ACM SIGMOD Record, vol. 33, no. 3, 2004, pp. 61-70.
[6] E. H. Chi. Information seeking can be social. Computer, vol. 42, no. 3, March 2009, pp. 42-46.
[7] V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards automatic data extraction from large Web sites. In Proceedings of the 27th International Conference on Very Large Data Bases, Rome, Italy, September 2001, pp. 109-118.
[8] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, vol. 4, 2003, pp. 933-969.
[9] G. Golovchinsky and P. Qvarfordt. Collaborative information seeking. Computer, vol. 42, no. 3, March 2009, pp. 47-51.
[10] J. L. Herlocker, J. A. Konstan, and J. Riedl. Explaining collaborative filtering recommendations. In Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work, Philadelphia, PA, December 2000, pp. 241-250.
[11] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM Conference on Research and Development in Information Retrieval, Berkeley, CA, August 1999, pp. 50-57.
[12] B. Liu. Mining data records in Web pages. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, Washington, D.C., August 2003, pp. 601-606.
[13] G. Miao, J. Tatemura, A. Sawires, W. P. Hsiung, and L. E. Moser. Extracting data records from the Web using tag path clustering. In Proceedings of the 18th International World Wide Web Conference, Madrid, Spain, 2009, pp. 981-990.
[14] Z. Nie, Y. Ma, S. Shi, J.-R. Wen, and W.-Y. Ma. Web object retrieval. In Proceedings of the 16th International Conference on the World Wide Web, Banff, Alberta, Canada, May 2007, pp. 81-90.
[15] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, 2000, pp. 888-905.
[16] J. Wang and F. H. Lochovsky. Data extraction and label assignment for Web databases. In Proceedings of the 12th International Conference on the World Wide Web, Budapest, Hungary, May 2003, pp. 187-196.
[17] Y. Zhai and B. Liu. Web data extraction based on partial tree alignment. In Proceedings of the 14th International Conference on the World Wide Web, Chiba, Japan, May 2005, pp. 76-85.
[18] H. Zhao, W. Meng, and C. Yu. Mining templates from search result records of search engines. In Proceedings of the 13th ACM International Conference on Knowledge Discovery and Data Mining, San Jose, CA, August 2007, pp. 884-893.
[19] J. Zhu, Z. Nie, J. Wen, B. Zhang, and W. Ma. Simultaneous record detection and attribute labeling in Web data extraction. In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, August 2006, pp. 494-503.