Using XPATH and HTML Cleaner to parse HTML/XML
Jun 24, 2016 11:51 AM
The Beautiful Life of Sun Vulcan
This article is licensed under the Creative Commons "Attribution-NonCommercial-ShareAlike" license.
Please keep this sentence when reprinting: The Beautiful Life of Sun Vulcan - this blog focuses on agile development and on research into mobile and IoT devices: iOS, Android, Html5, Arduino, pcDuino. Otherwise, articles from this blog may not be reproduced or reprinted. Thank you for your cooperation.
JANUARY 5, 2010
tags: android, examples, HTML, parse, scraping, XML, XPATH
Hey everyone,
Something that I've found to be extremely useful (especially in web-related applications) is the ability to retrieve HTML from a website and parse it for data or whatever else you may be looking for (in my case it is almost always data).
I actually use this technique to do the real-time stock/option imports for my Black-Scholes/Implied Volatility applications, so if you're looking for an example of how to retrieve and parse HTML and run "queries" over it using, say, XPATH, then this post is for you.
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

import javax.xml.parsers.ParserConfigurationException;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;
import org.xml.sax.SAXException;

public class OptionScraper {

    // EXAMPLE XPATH QUERIES IN THE FORM OF STRINGS - WILL BE USED LATER
    private static final String NAME_XPATH = "//div[@class='yfi_quote']/div[@class='hd']/h2";
    private static final String TIME_XPATH = "//table[@id='time_table']/tbody/tr/td[@class='yfnc_tabledata1']";
    private static final String PRICE_XPATH = "//table[@id='price_table']//tr//span";

    // TAGNODE OBJECT, ITS USE WILL COME IN LATER
    private static TagNode node;

    // A METHOD THAT HELPS ME RETRIEVE THE STOCK OPTION'S DATA BASED OFF THE NAME (I.E. GOUAA IS ONE OF GOOGLE'S STOCK OPTIONS)
    public static Option getOptionFromName(String name) throws XPatherException, ParserConfigurationException, SAXException, IOException {

        // THE URL WHOSE HTML I WANT TO RETRIEVE AND PARSE
        String option_url = "http://finance.yahoo.com/q?s=" + name.toUpperCase();

        // THIS IS WHERE THE HTMLCLEANER COMES IN, I INITIALIZE IT HERE
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setAllowHtmlInsideAttributes(true);
        props.setAllowMultiWordAttributes(true);
        props.setRecognizeUnicodeChars(true);
        props.setOmitComments(true);

        // OPEN A CONNECTION TO THE DESIRED URL
        URL url = new URL(option_url);
        URLConnection conn = url.openConnection();

        // USE THE CLEANER TO "CLEAN" THE HTML AND RETURN IT AS A TAGNODE OBJECT
        node = cleaner.clean(new InputStreamReader(conn.getInputStream()));

        // ONCE THE HTML IS CLEANED, THEN YOU CAN RUN YOUR XPATH EXPRESSIONS ON THE NODE, WHICH WILL THEN RETURN AN ARRAY OF TAGNODE OBJECTS (THESE ARE RETURNED AS OBJECTS BUT GET CAST BELOW)
        Object[] info_nodes = node.evaluateXPath(NAME_XPATH);
        Object[] time_nodes = node.evaluateXPath(TIME_XPATH);
        Object[] price_nodes = node.evaluateXPath(PRICE_XPATH);

        // THE OPTION TO FILL IN AND RETURN (THE ORIGINAL LISTING NEVER DECLARES o; A NO-ARG Option CONSTRUCTOR IS ASSUMED HERE)
        Option o = new Option();

        // HERE I JUST DO A SIMPLE CHECK TO MAKE SURE THAT MY XPATH WAS CORRECT AND THAT AN ACTUAL NODE(S) WAS RETURNED
        if (info_nodes.length > 0) {
            // CAST TO A TAGNODE
            TagNode info_node = (TagNode) info_nodes[0];
            // HOW TO RETRIEVE THE CONTENTS AS A STRING
            String info = info_node.getChildren().iterator().next().toString().trim();

            // SOME METHOD THAT PROCESSES THE STRING OF INFORMATION (IN MY CASE, THIS WAS THE STOCK QUOTE, ETC)
            processInfoNode(o, info);
        }

        if (time_nodes.length > 0) {
            TagNode time_node = (TagNode) time_nodes[0];
            String date = time_node.getChildren().iterator().next().toString().trim();

            // DATE RETURNED IN 15-JAN-10 FORMAT, SO THIS IS SOME METHOD I WROTE TO JUST PARSE THAT STRING INTO THE FORMAT THAT I USE
            processDateNode(o, date);
        }

        if (price_nodes.length > 0) {
            TagNode price_node = (TagNode) price_nodes[0];
            double price = Double.parseDouble(price_node.getChildren().iterator().next().toString().trim());
            o.setPremium(price);
        }

        return o;
    }
}
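The listing mentions a helper that turns a date such as 15-JAN-10 into the format I use, but doesn't show it. As a rough sketch of what that parsing step might look like, assuming the string is always day, three-letter English month and two-digit year, java.time can do the work. The names QuoteDateParser and parseQuoteDate are hypothetical and not taken from the code above, which may well handle this differently.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.Locale;

// Hypothetical helper, not part of the original OptionScraper.
class QuoteDateParser {

    // Case-insensitive so that "JAN", "Jan" and "jan" are all accepted.
    private static final DateTimeFormatter QUOTE_DATE =
            new DateTimeFormatterBuilder()
                    .parseCaseInsensitive()
                    .appendPattern("d-MMM-yy")
                    .toFormatter(Locale.ENGLISH);

    // Assumes input shaped like "15-JAN-10"; two-digit years are read as 2000-2099.
    // Throws DateTimeParseException (unchecked) if the text has another shape.
    static LocalDate parseQuoteDate(String raw) {
        return LocalDate.parse(raw.trim(), QUOTE_DATE);
    }

    public static void main(String[] args) {
        System.out.println(parseQuoteDate("15-JAN-10")); // prints 2010-01-15
    }
}
```

Returning a LocalDate rather than a reformatted string keeps the scraping code separate from however the date is displayed later.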
So that’s it! Once you include the HtmlCleaner JAR in your build path, everything else is pretty easy. It’s a great tool to use. It does require some knowledge of XPATH, but XPATH isn’t too hard to pick up and is useful to know, so if you don’t know it yet, it’s worth taking the time to learn.
Now, a warning to everyone. It’s documented that the set of XPATH expressions recognized by HtmlCleaner is not complete, in the sense that only “basic” XPATH is recognized. What’s excluded? For instance, you can’t use any of the “axes” operators (e.g. parent, ancestor, following, following-sibling, etc.), but in my experience everything else is fair game. Yes, it sucks, and many times it can make your life a little bit harder, but usually it just requires you to be a tad more clever with your XPATH expressions before you can pull the desired information.
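One way to be "a tad more clever" when an axis such as following-sibling is unavailable is to select the nearest node you can reach with a basic expression and take the last step in Java. The sketch below illustrates the idea with the JDK's own javax.xml.xpath and DOM classes rather than HtmlCleaner, purely so that it runs without extra JARs. The markup, the SiblingWalk class and the valueFor helper are all made up for illustration. With HtmlCleaner the equivalent walk would presumably go through TagNode methods such as getParent(), which is not shown here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical example; uses JDK XPath/DOM, not HtmlCleaner.
class SiblingWalk {

    // Made-up markup standing in for a cleaned quote table.
    static final String SAMPLE =
            "<table><tr><td class='label'>Last Trade:</td><td>4.20</td></tr>"
          + "<tr><td class='label'>Expire Date:</td><td>15-JAN-10</td></tr></table>";

    // Finds the label cell with an axis-free XPath, then steps to the next
    // element sibling in code instead of writing following-sibling::td.
    // The label is pasted into the query as-is, so it must not contain quotes.
    static String valueFor(String label) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SAMPLE)));
            String query = "//td[@class='label'][text()='" + label + "']";
            Node cell = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(query, doc, XPathConstants.NODE);
            if (cell == null) {
                return null; // no such label in the document
            }
            // Skip whitespace text nodes until the next element is reached.
            Node next = cell.getNextSibling();
            while (next != null && !(next instanceof Element)) {
                next = next.getNextSibling();
            }
            return next == null ? null : next.getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(valueFor("Last Trade:")); // prints 4.20
    }
}
```

The XPATH stays within the basic subset (a path plus attribute and text predicates), and the part that would have needed an axis becomes a short loop.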
And of course, this technique works for XML documents as well!
Hope this was helpful to everyone. Let me know if you’re confused anywhere.
- jwei
