This chapter discusses the main challenges addressed in the fields of Web information extraction and Web page understanding and reviews the Web page representations used in them. Based on this analysis, it presents WPPS, a configurable Java-based framework for implementing effective Web Page Processing (WPP) methods. WPPS leverages a Unified Ontological Model (UOM) of Web pages that describes their different aspects, such as layout, visual features, interface, DOM tree, and logical structure, in one consistent model. The UOM formalizes certain layers of the Web page conceptualization defined in the chapter. The WPPS API provided for developing WPP methods makes it possible to combine the declarative approach, represented by a set of inference rules and SPARQL queries, with the object-oriented approach. The framework is illustrated with one example scenario: the identification of a Web page's navigation menu.