Using XPATH and HTML Cleaner to parse HTML/XML

Jun 24, 2016, 11:51 AM

The Beautiful Life of Sun Vulcan

This article is published under the Creative Commons "Attribution-NonCommercial-ShareAlike" license.

Please keep this sentence when reprinting: The Beautiful Life of Sun Vulcan - this blog focuses on agile development and on research into mobile and IoT devices: iOS, Android, HTML5, Arduino, pcDuino. Otherwise, articles from this blog may not be reproduced or reprinted. Thank you for your cooperation.

JANUARY 5, 2010

tags: android, examples, HTML, parse, scraping, XML, XPATH

Hey everyone,

So something that I've found to be extremely useful (especially in web related applications) is the ability to retrieve HTML from websites and parse their HTML for data or whatever you may be looking for (in my case it is almost always data).

I actually use this technique to do the real time stock/option imports for my Black-Scholes/Implied Volatility applications, so if you're looking for an example on how to retrieve and parse HTML and run “queries” over it using, say, XPATH, then this post is for you.

Now, before we begin: in order to do this you will have to reference an external JAR in your project's build path. The JAR that I use comes from HtmlCleaner, whose site also has its own usage example, but in addition to that I'll show you an example of how I use it.

import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

import javax.xml.parsers.ParserConfigurationException;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;
import org.xml.sax.SAXException;

public class OptionScraper {

    // Example XPath queries in the form of strings - will be used later
    private static final String NAME_XPATH = "//div[@class='yfi_quote']/div[@class='hd']/h2";
    private static final String TIME_XPATH = "//table[@id='time_table']/tbody/tr/td[@class='yfnc_tabledata1']";
    private static final String PRICE_XPATH = "//table[@id='price_table']//tr//span";

    // TagNode object, its use will come in later
    private static TagNode node;

    // A method that helps me retrieve the stock option's data based off the name
    // (i.e. GOUAA is one of Google's stock options)
    public static Option getOptionFromName(String name)
            throws XPatherException, ParserConfigurationException, SAXException, IOException {

        // Option is my own class (not shown); a no-arg constructor is assumed here
        Option o = new Option();

        // The URL whose HTML I want to retrieve and parse
        String option_url = "http://finance.yahoo.com/q?s=" + name.toUpperCase();

        // This is where the HtmlCleaner comes in, I initialize it here
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setAllowHtmlInsideAttributes(true);
        props.setAllowMultiWordAttributes(true);
        props.setRecognizeUnicodeChars(true);
        props.setOmitComments(true);

        // Open a connection to the desired URL
        URL url = new URL(option_url);
        URLConnection conn = url.openConnection();

        // Use the cleaner to "clean" the HTML and return it as a TagNode object
        node = cleaner.clean(new InputStreamReader(conn.getInputStream()));

        // Once the HTML is cleaned, you can run your XPath expressions on the node,
        // which returns an array of matches (returned as Objects, cast to TagNode below)
        Object[] info_nodes = node.evaluateXPath(NAME_XPATH);
        Object[] time_nodes = node.evaluateXPath(TIME_XPATH);
        Object[] price_nodes = node.evaluateXPath(PRICE_XPATH);

        // A simple check to make sure that my XPath was correct and that at least one node was returned
        if (info_nodes.length > 0) {
            // Cast to a TagNode
            TagNode info_node = (TagNode) info_nodes[0];
            // How to retrieve the contents as a String
            String info = info_node.getChildren().iterator().next().toString().trim();
            // Some method (not shown) that processes the string of information
            // (in my case, this was the stock quote, etc.)
            processInfoNode(o, info);
        }

        if (time_nodes.length > 0) {
            TagNode time_node = (TagNode) time_nodes[0];
            String date = time_node.getChildren().iterator().next().toString().trim();
            // Date is returned in 15-JAN-10 format, so this is some method I wrote (not shown)
            // to parse that string into the format that I use
            processDateNode(o, date);
        }

        if (price_nodes.length > 0) {
            TagNode price_node = (TagNode) price_nodes[0];
            double price = Double.parseDouble(price_node.getChildren().iterator().next().toString().trim());
            o.setPremium(price);
        }

        return o;
    }
}
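The date helper mentioned in the comments (processDateNode) is not shown in this post. As a rough sketch of what the parsing part of such a helper could look like, assuming the page really does return dates such as 15-JAN-10, something along these lines might work with java.time. The class and method names are hypothetical and are not the author's code.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.Locale;

// Hypothetical helper: not part of the original post and not part of HtmlCleaner.
final class OptionDateParser {

    // "15-JAN-10": day, English month abbreviation in upper case, two-digit year.
    // parseCaseInsensitive() is needed because the formatter's own month text is "Jan", not "JAN".
    private static final DateTimeFormatter SCRAPED_DATE =
            new DateTimeFormatterBuilder()
                    .parseCaseInsensitive()
                    .appendPattern("d-MMM-yy")
                    .toFormatter(Locale.ENGLISH);

    // Parses the scraped text into a LocalDate; throws DateTimeParseException on bad input.
    static LocalDate parseScrapedDate(String text) {
        return LocalDate.parse(text.trim(), SCRAPED_DATE);
    }

    public static void main(String[] args) {
        System.out.println(parseScrapedDate(" 15-JAN-10 ")); // prints 2010-01-15
    }
}
```

Two-digit years are resolved against a base of 2000 by this pattern, which suits recent quotes but would need revisiting for older data.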

So that's it! Once you include the JAR in your build path, everything else is pretty easy. It's a great tool to use. It does require some knowledge of XPATH, but XPATH isn't too hard to pick up and is useful to know, so if you don't know it yet, it's worth a look.
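If XPATH is new to you, one low-friction way to experiment is the XPath engine that ships with the JDK (javax.xml.xpath). Note that this is a different engine from HtmlCleaner's, and it only accepts well-formed markup, which is exactly why a cleaner is needed for real pages. The sample markup and the class name below are invented for illustration; the queries are in the same style as the name and price lookups above.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical playground class using the JDK's XPath engine, not HtmlCleaner.
final class XPathPlayground {

    // Invented, well-formed markup that mimics the structure the queries expect.
    static final String SAMPLE =
            "<html><body>"
          + "<div class='yfi_quote'><div class='hd'><h2>Example Option</h2></div></div>"
          + "<table id='price_table'><tr><td><span>12.34</span></td></tr></table>"
          + "</body></html>";

    // Returns the trimmed text of the first node matching the expression, or null if nothing matches.
    static String firstText(String markup, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList matches = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(markup)), XPathConstants.NODESET);
            return matches.getLength() > 0 ? matches.item(0).getTextContent().trim() : null;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or markup: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstText(SAMPLE, "//div[@class='yfi_quote']/div[@class='hd']/h2"));
        System.out.println(firstText(SAMPLE, "//table[@id='price_table']//tr//span"));
    }
}
```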

Now, a warning to everyone. It's documented that the set of XPATH expressions recognized by HtmlCleaner is not complete, in the sense that only "basic" XPATH is recognized. What's excluded? For instance, you can't use any of the "axes" operators (e.g. parent, ancestor, following, following-sibling, etc.), but in my experience everything else is fair game. Yes, it sucks, and many times it can make your life a little bit harder, but usually it just requires you to be a tad more clever with your XPATH expressions before you can pull the desired information.
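One way to work around a missing axis is to let a basic XPATH expression find a nearby node and then do the axis step in plain Java. The sketch below illustrates the idea with the JDK's DOM and XPath classes so that it runs without extra JARs; with HtmlCleaner the corresponding step would be made on the TagNode itself (for example with TagNode.getParent()). The markup, class name and method name are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch: replaces an "ancestor" axis step with a loop in Java.
final class AxisWorkaround {

    // Invented sample markup.
    static final String SAMPLE =
            "<table>"
          + "<tr id='row1'><td>Bid</td><td><span class='price'>1.50</span></td></tr>"
          + "<tr id='row2'><td>Ask</td><td><span class='price'>1.75</span></td></tr>"
          + "</table>";

    // Finds the first node matching an axis-free XPath, then walks up the tree
    // until it reaches an element named ancestorTag (the match itself counts too).
    // Returns null if the XPath matches nothing or no such element encloses the match.
    static Element enclosing(String markup, String basicXPath, String ancestorTag) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node node = (Node) xpath.evaluate(
                    basicXPath, new InputSource(new StringReader(markup)), XPathConstants.NODE);
            while (node != null
                    && !(node instanceof Element && ancestorTag.equals(node.getNodeName()))) {
                node = node.getParentNode();
            }
            return (Element) node;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or markup: " + basicXPath, e);
        }
    }

    public static void main(String[] args) {
        Element row = enclosing(SAMPLE, "//span[@class='price']", "tr");
        System.out.println(row.getAttribute("id")); // prints row1
    }
}
```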

And of course, this technique works for XML documents as well!

Hope this was helpful to everyone. Let me know if you’re confused anywhere.

- jwei
