Java HTML Parser

While looking for an open-source Java HTML parser that supports CSS selectors, I came across jsoup, so here is a short introduction.

Site : http://jsoup.org/

It implements the WHATWG HTML5 specification and provides the following features.

scrape and parse HTML from a URL, file, or string
find and extract data, using DOM traversal or CSS selectors
manipulate the HTML elements, attributes, and text
clean user-submitted content against a safe white-list, to prevent XSS attacks
output tidy HTML

Scraping or parsing HTML obtained from a URL, file, or string

Extracting data by traversing the DOM or by using CSS selectors

Manipulating HTML elements, attributes, and text

Checking user-submitted content to prevent cross-site scripting (XSS) attacks

Producing tidy (cleanly formatted) HTML output
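As a rough illustration of the parsing and manipulation features listed above, here is a minimal sketch. The class name JsoupBasics, the method renameTitle, and the sample HTML are made up for this example; only Jsoup.parse, select, first, text, attr, and outerHtml come from jsoup itself. It parses an inline string rather than a URL so that it runs without network access.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupBasics {
    // Hypothetical helper: parses an HTML string, then rewrites the text
    // and id of the first <h1>. Returns the (possibly modified) document.
    static Document renameTitle(String html, String newTitle) {
        Document doc = Jsoup.parse(html);            // parse HTML from a string
        Element heading = doc.select("h1").first();  // first <h1>, or null if there is none
        if (heading != null) {
            heading.text(newTitle);                  // replace the text content
            heading.attr("id", "title");             // set (or overwrite) an attribute
        }
        return doc;
    }

    public static void main(String[] args) {
        Document doc = renameTitle("<h1>Old</h1><p>body</p>", "New");
        // Print the modified heading as HTML
        System.out.println(doc.select("h1").first().outerHtml());
    }
}
```

Fetching from a URL would presumably use Jsoup.connect(url).get() in place of Jsoup.parse, as in the example from the jsoup site further down.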

Anyway, the part I expect to use later is the selector support.

Example

Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the "In the news" section into a list of Elements:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

As shown above, a selector is enough to find the elements you want, so this looks useful for simple parsing jobs.
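To show what can be done with the selected Elements afterwards, here is a small sketch that walks over them and collects each link's text and absolute URL. The class HeadlineExtractor, the method extractLinks, and the sample markup are assumptions made for this example (the markup only imitates the structure that the "#mp-itn b a" selector expects, not the real Wikipedia page). It parses a string with a base URI instead of connecting to the site, so it runs offline.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HeadlineExtractor {
    // Hypothetical helper: returns "text -> absolute URL" for every
    // element matched by the CSS query.
    static List<String> extractLinks(String html, String baseUri, String cssQuery) {
        // The base URI lets jsoup resolve relative href values
        Document doc = Jsoup.parse(html, baseUri);
        Elements links = doc.select(cssQuery);
        List<String> result = new ArrayList<String>();
        for (Element link : links) {
            result.add(link.text() + " -> " + link.absUrl("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up markup standing in for the fetched page
        String html = "<div id=\"mp-itn\"><ul>"
                + "<li><b><a href=\"/wiki/First\">First headline</a></b></li>"
                + "<li><b><a href=\"/wiki/Second\">Second headline</a></b></li>"
                + "</ul></div>";
        for (String line : extractLinks(html, "http://en.wikipedia.org/", "#mp-itn b a")) {
            System.out.println(line);
        }
    }
}
```

Because Elements is a list of Element objects, a plain for-each loop is enough; absUrl("href") returns an empty string when a link cannot be resolved to an absolute URL.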

With Maven it is easy to try out.

The Maven dependency is configured as follows.

Maven

If you use Maven to manage the dependencies in your Java project (and you should!), you do not need to download; just place the following into your POM's <dependencies> section:

<dependency>
  <!-- jsoup HTML parser library @ http://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.7.2</version>
</dependency>

License: Attribution, NonCommercial, ShareAlike
