

Using XPATH and HTML Cleaner to parse HTML / XML

Jun 24, 2016 11:51 AM


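太陽火神的美麗人生

This article is published under the Creative Commons Attribution-NonCommercial-ShareAlike license.

When reposting, please keep this line: 太陽火神的美麗人生 - this blog focuses on agile development and on research into mobile and IoT devices: iOS, Android, Html5, Arduino, pcDuino. Otherwise, reposting or re-reposting of articles from this blog is not permitted. Thank you for your cooperation.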




JANUARY 5, 2010

tags: android, examples, HTML, parse, scraping, XML, XPATH

Hey everyone,

So something that I’ve found to be extremely useful (especially in web related applications) is the ability to retrieve HTML from websites and parse their HTML for data or whatever you may be looking for (in my case it is almost always data).


I actually use this technique to do the real time stock/option imports for my Black-Scholes/Implied Volatility applications, so if you’re looking for an example on how to retrieve and parse HTML and run “queries” over it using, say, XPATH, then this post is for you.

Now, before we begin, in order to do this you will have to reference an external JAR in your project’s build path. The JAR that I use comes from HtmlCleaner, which even gives you an example of how they use it (see HtmlCleaner Example), but in addition to that I’ll show you an example of how I use it.


import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

import javax.xml.parsers.ParserConfigurationException;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;
import org.xml.sax.SAXException;

public class OptionScraper {

    // EXAMPLE XPATH QUERIES IN THE FORM OF STRINGS - WILL BE USED LATER
    private static final String NAME_XPATH = "//div[@class='yfi_quote']/div[@class='hd']/h2";
    private static final String TIME_XPATH = "//table[@id='time_table']/tbody/tr/td[@class='yfnc_tabledata1']";
    private static final String PRICE_XPATH = "//table[@id='price_table']//tr//span";

    // TAGNODE OBJECT, ITS USE WILL COME IN LATER
    private static TagNode node;

    // A METHOD THAT HELPS ME RETRIEVE THE STOCK OPTION'S DATA BASED OFF THE NAME (E.G. GOUAA IS ONE OF GOOGLE'S STOCK OPTIONS)
    public static Option getOptionFromName(String name) throws XPatherException, ParserConfigurationException, SAXException, IOException {

        // THE OPTION TO FILL IN AND RETURN
        // (ASSUMPTION: Option HAS A NO-ARG CONSTRUCTOR; THE CLASS AND THE process* HELPERS BELOW ARE NOT SHOWN IN THIS POST)
        Option o = new Option();

        // THE URL WHOSE HTML I WANT TO RETRIEVE AND PARSE
        String option_url = "http://finance.yahoo.com/q?s=" + name.toUpperCase();

        // THIS IS WHERE THE HTMLCLEANER COMES IN, I INITIALIZE IT HERE
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setAllowHtmlInsideAttributes(true);
        props.setAllowMultiWordAttributes(true);
        props.setRecognizeUnicodeChars(true);
        props.setOmitComments(true);

        // OPEN A CONNECTION TO THE DESIRED URL
        URL url = new URL(option_url);
        URLConnection conn = url.openConnection();

        // USE THE CLEANER TO "CLEAN" THE HTML AND RETURN IT AS A TAGNODE OBJECT
        node = cleaner.clean(new InputStreamReader(conn.getInputStream()));

        // ONCE THE HTML IS CLEANED, THEN YOU CAN RUN YOUR XPATH EXPRESSIONS ON THE NODE, WHICH WILL THEN RETURN AN ARRAY OF TAGNODE OBJECTS (THESE ARE RETURNED AS OBJECTS BUT GET CAST BELOW)
        Object[] info_nodes = node.evaluateXPath(NAME_XPATH);
        Object[] time_nodes = node.evaluateXPath(TIME_XPATH);
        Object[] price_nodes = node.evaluateXPath(PRICE_XPATH);

        // HERE I JUST DO A SIMPLE CHECK TO MAKE SURE THAT MY XPATH WAS CORRECT AND THAT AN ACTUAL NODE(S) WAS RETURNED
        if (info_nodes.length > 0) {
            // CAST TO A TAGNODE
            TagNode info_node = (TagNode) info_nodes[0];
            // HOW TO RETRIEVE THE CONTENTS AS A STRING
            String info = info_node.getChildren().iterator().next().toString().trim();

            // SOME METHOD THAT PROCESSES THE STRING OF INFORMATION (IN MY CASE, THIS WAS THE STOCK QUOTE, ETC)
            processInfoNode(o, info);
        }

        if (time_nodes.length > 0) {
            TagNode time_node = (TagNode) time_nodes[0];
            String date = time_node.getChildren().iterator().next().toString().trim();

            // DATE RETURNED IN 15-JAN-10 FORMAT, SO THIS IS SOME METHOD I WROTE TO JUST PARSE THAT STRING INTO THE FORMAT THAT I USE
            processDateNode(o, date);
        }

        if (price_nodes.length > 0) {
            TagNode price_node = (TagNode) price_nodes[0];
            double price = Double.parseDouble(price_node.getChildren().iterator().next().toString().trim());
            o.setPremium(price);
        }

        return o;
    }
}
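The date-handling helper mentioned in the comments above is not shown in the post, so here is a minimal sketch of what parsing a string in the 15-JAN-10 format might look like. The class name `QuoteDateParser` and the method `parseQuoteDate` are hypothetical, and the sketch uses the JDK's `java.time` API (Java 8+), which postdates the original post; it is one possible approach, not the author's actual method.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.Locale;

public class QuoteDateParser {

    // Case-insensitive day-month-year pattern so that both "15-JAN-10" and "15-Jan-10" parse.
    // A two-digit year is interpreted as 2000-2099 by the "yy" pattern.
    private static final DateTimeFormatter QUOTE_DATE = new DateTimeFormatterBuilder()
            .parseCaseInsensitive()
            .appendPattern("d-MMM-yy")
            .toFormatter(Locale.ENGLISH);

    // Hypothetical helper: turns the scraped date string into a LocalDate.
    public static LocalDate parseQuoteDate(String raw) {
        return LocalDate.parse(raw.trim(), QUOTE_DATE);
    }

    public static void main(String[] args) {
        System.out.println(parseQuoteDate("15-JAN-10"));
    }
}
```

A malformed string makes `LocalDate.parse` throw a `DateTimeParseException`, so a scraper would probably want to catch that and skip the field rather than fail the whole import.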

So that’s it! Once you include the JAR in your build path, everything else is pretty easy! It’s a great tool to use. However, it does require knowledge of XPATH, but XPATH isn’t too hard to pick up and is useful to know, so if you don’t know it then take a look at the link.

Now, a warning to everyone. It’s documented that the set of XPATH expressions recognized by HtmlCleaner is not complete, in the sense that only “basic” XPATH is recognized. What’s excluded? For instance, you can’t use any of the “axes” operators (e.g. parent, ancestor, following, following-sibling, etc), but in my experience everything else is fair game. Yes, it sucks, and many times it can make your life a little bit harder, but usually it just requires you to be a tad more clever with your XPATH expressions before you can pull the desired information.
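To make the "be a tad more clever" idea concrete, here is a rough sketch of one way to live without `following-sibling`: select the cells with a plain descendant path and do the sibling step in Java. To keep the example self-contained it runs on the JDK's built-in `javax.xml.xpath` over a small well-formed snippet rather than on HtmlCleaner; the markup, the class name `AxisFreeLookup` and the method `valueAfterLabel` are all made up for illustration.

```java
import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AxisFreeLookup {

    // Hypothetical markup: each row holds a label cell followed by a value cell.
    static final String SAMPLE =
          "<table id='price_table'>"
        + "<tr><td>Last Trade:</td><td>12.50</td></tr>"
        + "<tr><td>Trade Time:</td><td>15-Jan-10</td></tr>"
        + "</table>";

    // Returns the text of the cell right after the cell whose text equals label
    // (within the same row), or null if there is no such cell.
    static String valueAfterLabel(String xml, String label) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList cells;
        try {
            // Only descendant steps and an attribute predicate: no axes needed.
            cells = (NodeList) xpath.evaluate(
                    "//table[@id='price_table']//td",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
        // The "following sibling" step happens here, in Java, instead of in the expression.
        for (int i = 0; i + 1 < cells.getLength(); i++) {
            Node current = cells.item(i);
            Node next = cells.item(i + 1);
            boolean sameRow = current.getParentNode().isSameNode(next.getParentNode());
            if (sameRow && label.equals(current.getTextContent().trim())) {
                return next.getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(valueAfterLabel(SAMPLE, "Last Trade:"));
    }
}
```

With HtmlCleaner the same loop would presumably run over the `Object[]` returned by `evaluateXPath`, casting each entry to `TagNode` as in the scraper above; that variant is untested here.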

And of course, this technique works for XML documents as well!

Hope this was helpful to everyone. Let me know if you’re confused anywhere.

- jwei


