Abstract:

Embodiments of the present invention provide a method for processing a
World Wide Web (WWW) page, which includes: determining at least one
website to be optimized; generating a corresponding page template for
each of WWW pages with different types in each website, and storing the
page template; grabbing WWW pages from each website, matching each
grabbed WWW page with a page template, filtering redundant HTML data from
the WWW page according to a matching result, and storing the filtered WWW
page; after receiving a request sent by a terminal for accessing a WWW
page, determining whether there is a stored filtered WWW page
corresponding to the WWW page requested by the terminal, and if so,
returning the filtered WWW page to the terminal. Embodiments of the
present invention also provide an apparatus for processing a WWW page.
With the scheme of the present invention, redundant information may be
efficiently eliminated.

Claims:

1. A method for processing a World Wide Web, WWW, page, the method
comprising: determining at least one website to be optimized; generating a
corresponding page template for each of WWW pages with different types in
each website, and storing the corresponding page template; constantly
grabbing WWW pages from each website, matching each grabbed WWW page with
a page template corresponding to the grabbed WWW page, filtering
redundant Hyper Text Mark-up Language, HTML, data from the WWW page
according to a matching result, and storing the filtered WWW page without
the redundant HTML data; after receiving a request, sent by a terminal,
for accessing a WWW page, determining whether there is a stored filtered
WWW page without the redundant HTML data corresponding to the WWW page
requested by the terminal; and when there is a stored filtered WWW page
without the redundant HTML data corresponding to the WWW page requested
by the terminal, returning the filtered WWW page without the redundant
HTML data to the terminal.

2. The method according to claim 1, wherein the method further comprises: when
there is no stored filtered WWW page without the redundant HTML data
corresponding to the WWW page requested by the terminal, implementing a
real-time transformation process for the WWW page requested by the
terminal.

3. The method according to claim 1, wherein generating a corresponding
page template for each of WWW pages with different types in each website,
and storing the corresponding page template comprises performing the
following operations for each respective website X: obtaining a WWW page
from each of the various types of WWW pages in the website X according to
a received instruction of a background administrator; respectively
analyzing each obtained WWW page to construct a Document Object Model,
DOM, tree; deleting each DOM node unnecessary to be reserved from each
DOM tree according to a received instruction of a background
administrator; respectively transforming each DOM tree in which each DOM
node unnecessary to be reserved is deleted into a WWW page; and storing
the WWW page as a page template.

4. The method according to claim 1, wherein matching each grabbed WWW
page with a page template corresponding to the grabbed WWW page,
filtering redundant HTML data from the WWW page according to a matching
result comprises: for each grabbed WWW page Y, performing the following
processes: analyzing the WWW page Y to construct a DOM tree, and
obtaining DOM tree 1; analyzing page template Y corresponding to the WWW
page Y to construct a DOM tree, and obtaining DOM tree 2; for each DOM
node in DOM tree 1, determining whether there is a matched DOM node in
DOM tree 2; when there is a matched DOM node in DOM tree 2, performing no
operations on the DOM node in DOM tree 1, otherwise, deleting the DOM
node from DOM tree 1; transforming DOM tree 1 in which each DOM node
unnecessary to be reserved is deleted into a WWW page; and transforming
DOM tree 2 into page template Y.

5. The method according to claim 1, wherein the method further comprises:
storing a Uniform Resource Locator, URL, of each page template; and before
matching each grabbed WWW page with a page template corresponding to the
grabbed WWW page, further comprising: determining the page template
corresponding to the grabbed WWW page according to the URL of the grabbed
WWW page.

6. The method according to claim 1, wherein the terminal is a mobile
terminal.

7. An apparatus for processing a World Wide Web, WWW, page, the apparatus
comprising: a first processing unit configured to determine at least one
website to be optimized, generate a corresponding page template for each
of WWW pages with different types in each website, store the
corresponding page template; constantly grab WWW pages from each website,
match each grabbed WWW page with a page template corresponding to the
grabbed WWW page, filter redundant Hyper Text Mark-up Language (HTML)
data from the WWW page according to a matching result, and store the
filtered WWW page without the redundant HTML data; and a second
processing unit configured to, after receiving a request sent by a
terminal for accessing a WWW page, determine whether there is a filtered
WWW page without the redundant HTML data corresponding to the WWW page
requested by the terminal stored in the first processing unit, when there
is a stored filtered WWW page without the redundant HTML data
corresponding to the WWW page requested by the terminal, obtain the
filtered WWW page without the redundant HTML data from the first
processing unit, and return the filtered WWW page without the redundant
HTML data to the terminal.

8. The apparatus according to claim 7, wherein the second processing unit
is further configured to, when there is no filtered WWW page without the
redundant HTML data corresponding to the WWW page requested by the
terminal stored in the first processing unit, implement a real-time
transformation process for the WWW page requested by the terminal.

9. The apparatus according to claim 7, wherein the first processing unit
comprises: a first processing sub-unit, configured to receive at least
one website to be optimized inputted by a background administrator; a
second processing sub-unit, configured to perform the following
operations for each website X: according to a received instruction of the
background administrator, obtain one WWW page from each of the various
types of WWW pages in the website X, respectively analyze each obtained
WWW page to construct a DOM tree, according to a received instruction of
the background administrator, delete each DOM node unnecessary to be
reserved from each DOM tree, transform each DOM tree in which each DOM
node unnecessary to be reserved is deleted into a WWW page respectively,
and store the WWW page as a page template; and a third processing
sub-unit, configured to constantly grab WWW pages from each website and,
for each grabbed WWW page Y, perform the following processes respectively:
analyze the WWW page Y to construct a DOM tree, and obtain DOM tree 1,
analyze the page template Y corresponding to the WWW page Y to construct
a DOM tree, and obtain DOM tree 2, for each DOM node in DOM tree 1,
determine whether there is a matched DOM node in DOM tree 2, when there
is a matched DOM node in DOM tree 2, perform no operations on the DOM
node in DOM tree 1, otherwise, delete the DOM node in DOM tree 1,
transform DOM tree 1 in which each DOM node unnecessary to be reserved is
deleted into a WWW page, and transform DOM tree 2 into page template Y.

10. The apparatus according to claim 7, wherein the terminal is a mobile
terminal.

Description:

FIELD OF THE INVENTION

[0001] The present invention relates to Internet technology, and more
particularly, to a method and apparatus for processing a World Wide Web
(WWW) page.

BACKGROUND OF THE INVENTION

[0002] With the popularity of broadband Internet, the content displayed on
WWW pages is becoming increasingly rich. However, redundant information,
such as advertising information, is also constantly increasing. When a
user browses a WWW page on a terminal of limited size, such as a mobile
terminal, the redundant information makes browsing considerably less
convenient.

SUMMARY OF THE INVENTION

[0003] In view of the above, embodiments of the present invention provide a
method for processing a WWW page and an apparatus for processing a WWW
page, so as to effectively eliminate the redundant information.

[0004] The method for processing a WWW page provided by embodiments of the
present invention includes:

[0005] determining at least one website to be optimized; generating a
corresponding page template for each of WWW pages with different types in
each website, and storing the corresponding page template;

[0006] constantly grabbing WWW pages from each website, matching each
grabbed WWW page with a page template corresponding to the grabbed WWW
page, filtering redundant Hyper Text Mark-up Language (HTML) data from the
WWW page according to a matching result, and storing the filtered WWW page
without the redundant HTML data;

[0007] after receiving a request, sent by a terminal, for accessing a WWW
page, determining whether there is a stored filtered WWW page without the
redundant HTML data corresponding to the WWW page requested by the
terminal; and when there is a stored filtered WWW page without the
redundant HTML data corresponding to the WWW page requested by the
terminal, returning the filtered WWW page without the redundant HTML data
to the terminal.

[0008] The apparatus for processing a WWW page provided by embodiments of
the present invention includes:

[0009] a first processing unit configured to determine at least one
website to be optimized; generate a corresponding page template for each
of WWW pages with different types in each website, and store the
corresponding page template; constantly grab WWW pages from each website,
match each grabbed WWW page with a page template corresponding to the
grabbed WWW page, filter redundant Hyper Text Mark-up Language (HTML) data
from the WWW page according to a matching result, and store the filtered
WWW page without the redundant HTML data;

[0010] a second processing unit configured to, after receiving a request
sent by a terminal for accessing a WWW page, determine whether there is a
filtered WWW page without the redundant HTML data corresponding to the WWW
page requested by the terminal stored in the first processing unit, when
there is a stored filtered WWW page without the redundant HTML data
corresponding to the WWW page requested by the terminal, obtain the
filtered WWW page without the redundant HTML data from the first
processing unit, and return the filtered WWW page without the redundant
HTML data to the terminal.

[0011] As can be seen, by adopting the technical solutions of the present
invention, redundant HTML information, such as advertising information,
may be filtered from a grabbed WWW page according to a page template.
That is, redundant information is efficiently eliminated, which makes
browsing easier for the user. In addition, the technical solutions of the
present invention can be implemented simply and conveniently.

BRIEF DESCRIPTION OF DRAWINGS

[0012] Exemplary embodiments of the present invention will be described
in detail hereinafter with reference to the accompanying drawings, so as
to make the above-mentioned and other features and advantages of the
present invention clearer to one skilled in the art. In the accompanying
drawings:

[0013] FIG. 1 is a flowchart illustrating a method for processing a WWW
page according to an embodiment of the present invention;

[0014] FIG. 2 is a schematic diagram illustrating a structure of an
apparatus for processing a WWW page according to an embodiment of the
present invention.

DETAILED DESCRIPTION OF THE INVENTION

[0015] In view of the above problem in the prior art, the present
invention provides a new scheme for processing a WWW page.

[0016] In order to make the objectives, technical solutions and advantages
of the present invention clearer, the present invention will be described
in detail hereinafter with reference to the accompanying drawings.

[0017] FIG. 1 is a flowchart illustrating a method for processing a WWW
page according to an embodiment of the present invention. As shown in
FIG. 1, the method includes the following processes.

[0018] In block 11, a website to be optimized is determined.

[0019] In practical application, a background administrator may determine
the websites to be optimized (whose WWW pages will be filtered by the
method described below), compose a website link list from the determined
websites, and input the website link list to the background processing
system.

[0020] Theoretically, the more websites the website link list includes,
the better. However, taking into account factors such as maintenance
costs, the website link list may include only some commonly used
websites.

[0021] In block 12, for each type of WWW page in each website, a
corresponding page template is generated and stored.

[0022] In block 12, for each website X in the website link list, the
following processes are performed. According to a received instruction of
the background administrator, one WWW page is obtained for each of the
various types of WWW pages in website X. Each obtained WWW page is
analyzed to construct a Document Object Model (DOM) tree. According to a
received instruction of the background administrator, each DOM node that
need not be retained is deleted from each DOM tree. Each DOM tree from
which such nodes have been deleted is then transformed back into a WWW
page, and that WWW page is stored as a page template. Analyzing a WWW
page to construct a DOM tree and transforming a DOM tree into a WWW page
may be implemented with existing technologies.
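
The template-generation steps of block 12 (parse the page into a tree,
delete the nodes the administrator marked, serialize the result) can be
sketched as follows. This is a minimal illustration, not the
implementation described in the embodiments: it assumes the page is
well-formed XHTML so that Python's standard xml.etree.ElementTree can
parse it, and it assumes the administrator's instruction arrives as a
set of element id values. The names build_template and unwanted_ids are
hypothetical.

```python
import xml.etree.ElementTree as ET


def build_template(page_source, unwanted_ids):
    """Parse a page, delete administrator-marked nodes, serialize the rest.

    Assumption: page_source is well-formed XHTML, and the nodes to delete
    are identified by their id attribute.
    """
    root = ET.fromstring(page_source)  # construct the DOM tree
    # ElementTree nodes have no parent pointer, so visit each parent and
    # remove any child whose id was marked for deletion.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("id") in unwanted_ids:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")  # DOM tree -> page text


sample_page = (
    '<html><body><div id="ad">Buy now</div>'
    '<div id="content"><p>News text</p></div></body></html>'
)
template = build_template(sample_page, {"ad"})
```

Here the element with id "ad" is dropped and the "content" subtree is
kept, so the stored template records only the structure worth retaining.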

[0023] In practical application, a relevant plug-in may be developed and
installed in a browser, such as a Firefox browser, of the background
processing system. Subsequently, the background administrator may access
different types of WWW pages in different websites via the Firefox
browser with the plug-in. Specifically, for each type of WWW page in each
website, such as the news type or the BBS type, the administrator may
randomly select one WWW page to access and use a mouse to select the
contents to be retained and the contents to be deleted in the accessed
WWW page. According to the operations of the background administrator,
the plug-in implements the corresponding functions, including analyzing
the WWW page to construct a DOM tree, deleting a DOM node, and
transforming a DOM tree into a WWW page.

[0024] After the process described in block 12, a series of page templates
is obtained. For example, suppose the website link list includes three
websites (this is only an example; in practice, the number of websites in
the website link list may far exceed three), in which the first website
includes five different types of WWW pages, the second website includes
six, and the third website includes four. A total of 5+6+4=15 page
templates is then obtained.

[0025] In block 13, WWW pages are constantly grabbed from each website.
Each grabbed WWW page is respectively matched with a page template
corresponding to the grabbed WWW page. According to a matching result,
redundant Hyper Text Mark-up Language (HTML) data is filtered from the
grabbed WWW page, and the grabbed WWW page without the redundant HTML
data is stored.

[0026] The background processing system may constantly grab WWW pages from
each website in the website link list. The grabbing operation may be
performed in real time or once per set interval. The objects grabbed by
the background processing system include all WWW pages in each website.
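
Interval-based grabbing can be sketched as a loop that repeats one
grabbing pass and then waits. This is only an illustration under stated
assumptions: the page URLs are already known, the download step is
injected as a fetch callable so that no particular HTTP library is
implied, and the names grab_pass and grab_periodically are hypothetical.

```python
import time


def grab_pass(page_urls, fetch, store):
    """Fetch every listed page once and store its source keyed by URL."""
    for url in page_urls:
        store[url] = fetch(url)


def grab_periodically(page_urls, fetch, store, interval_seconds, passes):
    """Run a fixed number of grabbing passes, sleeping between passes."""
    for i in range(passes):
        grab_pass(page_urls, fetch, store)
        if i < passes - 1:
            time.sleep(interval_seconds)


# Usage with an in-memory stand-in for the network (hypothetical data).
fake_site = {"http://example.com/news/1.html": "<html><body /></html>"}
grabbed = {}
grab_periodically(list(fake_site), fake_site.get, grabbed, 0, 2)
```

A long-running system would loop indefinitely rather than for a fixed
number of passes; the bound is used here only to keep the sketch finite.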

[0027] For each grabbed WWW page Y, the following processes are performed.
WWW page Y is analyzed to construct a DOM tree, namely DOM tree 1. Page
template Y corresponding to WWW page Y is analyzed to construct a DOM
tree, namely DOM tree 2. For each DOM node in DOM tree 1, it is determined
whether there is a matching DOM node in DOM tree 2. If there is a matching
DOM node in DOM tree 2, no operation is performed on the DOM node in DOM
tree 1; otherwise, the DOM node is deleted from DOM tree 1. DOM tree 1,
from which each DOM node that need not be retained has been deleted, is
transformed into a WWW page, and DOM tree 2 is transformed into page
template Y. Whether a DOM node has a matching node may be determined with
existing technologies. With this method, it is possible to filter out
redundant HTML data, such as advertising information, from a WWW page.
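
The matching-and-filtering step can be sketched as below. The text
leaves the matching criterion to existing technologies, so the criterion
used here is an assumption made only for illustration: two nodes match
when the sequence of (tag, id, class) values along the path from the
root is the same in both trees. The sketch also assumes well-formed
XHTML input and treats the root element as always matched. The names
signature, collect_paths and filter_page are hypothetical.

```python
import xml.etree.ElementTree as ET


def signature(elem):
    """Assumed matching key for a single node: tag, id and class."""
    return (elem.tag, elem.get("id"), elem.get("class"))


def collect_paths(root):
    """Return the set of root-to-node signature paths found in a tree."""
    paths = set()

    def walk(elem, prefix):
        path = prefix + (signature(elem),)
        paths.add(path)
        for child in elem:
            walk(child, path)

    walk(root, ())
    return paths


def filter_page(page_source, template_source):
    """Delete every node of DOM tree 1 that has no match in DOM tree 2."""
    tree1 = ET.fromstring(page_source)      # DOM tree 1 (grabbed page)
    tree2 = ET.fromstring(template_source)  # DOM tree 2 (page template)
    allowed = collect_paths(tree2)

    def prune(elem, prefix):
        path = prefix + (signature(elem),)
        for child in list(elem):
            if path + (signature(child),) in allowed:
                prune(child, path)   # matched: keep it and check below it
            else:
                elem.remove(child)   # unmatched: delete the whole subtree

    prune(tree1, ())
    return ET.tostring(tree1, encoding="unicode")


template_y = (
    '<html><body><div id="content"><p>Sample story</p></div></body></html>'
)
page_y = (
    '<html><body><div id="banner">Ad</div>'
    '<div id="content"><p>Story one</p><p>Story two</p></div></body></html>'
)
filtered_y = filter_page(page_y, template_y)
```

Because matching ignores text, the two paragraphs of the grabbed page are
kept even though their text differs from the template, while the banner
element, which has no counterpart in the template, is removed.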

[0028] The above-mentioned page template Y corresponding to WWW page Y
refers to a page template that is of the same type as WWW page Y and
belongs to the same website as WWW page Y. In practical application, when
each page template is stored, the Uniform Resource Locator (URL) of the
page template may be stored with it. The URL may reflect information such
as the website to which the page template belongs and the type of the page
template. Thus, before each grabbed WWW page is matched with its
corresponding page template, the corresponding page template can be
determined according to the URL of the grabbed WWW page.
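
Determining the page template from a URL can be sketched as a dictionary
lookup. How a URL encodes the website and the page type is not specified
in the text, so this sketch assumes, purely for illustration, that the
host name identifies the website and the first path segment identifies
the page type. The names template_key and find_template, and the example
URLs, are hypothetical.

```python
from urllib.parse import urlsplit


def template_key(url):
    """Derive a (website, page type) key from a URL.

    Assumption: the host is the website and the first path segment is
    the page type, e.g. /news/... or /bbs/...
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    page_type = segments[0] if segments else ""
    return (parts.netloc.lower(), page_type)


def find_template(page_url, templates):
    """Return the stored template for a grabbed page, or None."""
    return templates.get(template_key(page_url))


# Templates are stored under the key derived from their own URL.
templates = {
    template_key("http://www.example.com/news/2010/sample.html"): "NEWS",
}
match = find_template("http://www.example.com/news/2011/other.html", templates)
miss = find_template("http://www.example.com/bbs/thread/1.html", templates)
```

A page with no stored template yields None, which a caller could treat
as "leave the page unfiltered".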

[0029] In block 14, when a request sent by a terminal for accessing a WWW
page is received, it is determined whether there is a stored WWW page
with the redundant HTML data filtered out corresponding to the WWW page
requested by the terminal. If there is a stored and filtered WWW page
without the redundant HTML data corresponding to the WWW page requested
by the terminal, the filtered WWW page without the redundant HTML data is
returned to the terminal.

[0030] In block 14, when receiving a request for accessing a WWW page sent
by a terminal, the background processing system first determines whether
a WWW page with the redundant HTML data filtered out, corresponding to
the WWW page requested by the terminal, is stored locally, namely,
whether the WWW page requested by the terminal has been grabbed and
optimized. If so, the corresponding filtered WWW page without the
redundant HTML data is returned to the terminal; otherwise, the real-time
transformation process of the WWW page is implemented according to
existing technologies.
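
The decision made in block 14 can be sketched as a lookup with a
fallback. This is an illustration under stated assumptions: the filtered
pages are held in a dictionary keyed by URL, and the real-time
transformation is passed in as a callable because the text does not
describe it. The names handle_request, filtered_store and
transform_in_real_time are hypothetical.

```python
def handle_request(url, filtered_store, transform_in_real_time):
    """Return the stored filtered page if present, else transform live."""
    page = filtered_store.get(url)
    if page is not None:
        return page                       # already grabbed and optimized
    return transform_in_real_time(url)    # fallback for unknown pages


filtered_store = {"http://www.example.com/news/1.html": "<html>filtered</html>"}


def fake_transform(url):
    # Stand-in for the real-time transformation, which is not specified.
    return "<html>transformed " + url + "</html>"


hit = handle_request("http://www.example.com/news/1.html",
                     filtered_store, fake_transform)
fallback = handle_request("http://www.example.com/other.html",
                          filtered_store, fake_transform)
```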

[0031] Based on the above description, FIG. 2 is a schematic diagram
illustrating a structure of an apparatus for processing a WWW page
according to an embodiment of the present invention. As shown in FIG. 2,
the apparatus includes a first processing unit 21 and a second processing
unit 22.

[0032] The first processing unit 21 is configured to determine at least
one website to be optimized. For each type of WWW page in each website,
the first processing unit 21 generates and stores a corresponding page
template. It also constantly grabs WWW pages from each website, compares
each grabbed WWW page with its corresponding page template, filters
redundant HTML data from the grabbed WWW page according to the comparison
result, and stores the filtered WWW page without the redundant HTML data.

[0033] The second processing unit 22 is configured to, when receiving a
request for accessing a WWW page sent by a terminal, determine whether
there is a filtered WWW page corresponding to the WWW page requested by
the terminal stored in the first processing unit 21. When there is a
filtered WWW page corresponding to the WWW page requested by the terminal
stored in the first processing unit 21, the second processing unit 22
obtains the filtered WWW page from the first processing unit 21, and
returns the filtered WWW page to the terminal.

[0034] The second processing unit 22 is further configured to, when there
is no filtered WWW page corresponding to the WWW page requested by the
terminal stored in the first processing unit 21, implement the real-time
transformation process for the WWW page.

[0035] The first processing unit 21 may further include (to simplify the
drawings, the detailed structure of the first processing unit is not
illustrated) a first processing sub-unit, a second processing sub-unit
and a third processing sub-unit.

[0036] The first processing sub-unit is configured to receive at least one
website to be optimized inputted by a background administrator.

[0037] The second processing sub-unit is configured to perform the
following operations for each website X: according to a received
instruction of the background administrator, obtain a WWW page from each
type of the various types of WWW pages in website X; respectively analyze
each obtained WWW page to construct a Document Object Model (DOM) tree;
according to a received instruction of the background administrator,
delete each DOM node unnecessary to be reserved from each DOM tree;
transform each DOM tree from which DOM nodes have been deleted into a WWW
page respectively, and store the WWW page as a page template.

[0038] The third processing sub-unit is configured to constantly grab WWW
pages from each website, and for each grabbed WWW page Y, to perform the
following operations respectively: analyze WWW page Y to construct a DOM
tree, and obtain DOM tree 1; analyze page template Y corresponding to WWW
page Y to construct a DOM tree, and obtain DOM tree 2; for each DOM node
in DOM tree 1, determine whether there is a matching DOM node in DOM tree
2; when there is a matching DOM node in DOM tree 2, perform no operations
on the DOM node in DOM tree 1; otherwise, delete the DOM node in DOM tree
1; transform DOM tree 1 in which each DOM node unnecessary to be reserved
has been deleted into a WWW page; and transform DOM tree 2 into page
template Y.

[0039] For the specific process of the apparatus embodiment shown in FIG.
2, reference may be made to the corresponding description of the method
embodiment shown in FIG. 1; no further description is provided here. In
addition, the terminal mentioned in the embodiments shown in FIG. 1 and
FIG. 2 is generally a mobile terminal.

[0040] The foregoing describes only preferred embodiments of the present
invention and is not intended to limit the protection scope thereof. Any
modification, equivalent substitution, or improvement made
without departing from the principle of the present invention is within
the protection scope of the present invention.