Innoplexia DevTools to Crawl Webpages

Technology

d0x
  • 1. DevTools to crawl webpages
  • 2. DevTools (09.05.12, @chrschneider)
  • 3. Apache HttpComponents: "… toolset of low level Java components focused on HTTP and associated protocols."
    ● HttpComponents Core … is a set of low level HTTP transport components
    ● HttpComponents Client … provides reusable components for client-side ... HTTP connection management.
    ● HttpComponents AsyncClient (DEV) … ability to handle a great number of concurrent connections ... more ... performance in terms of a raw data throughput.
    ● Commons HttpClient (Legacy) … All users of Commons HttpClient 3.x are strongly encouraged to upgrade to HttpClient 4.1.
  • 4. HttpComponents Client: example components
    ● Get, Post, Delete, … request objects
    ● Cookie manager
    ● SSL
    ● Content-encoding aware
    ● HTTP authentication (Basic, Digest, ...)
  • 5. HttpComponents Client example
        public final static void main(final String[] args) throws Exception {
            final HttpClient httpclient = new DefaultHttpClient();
            try {
                final HttpGet httpget = new HttpGet("http://www.google.com/");
                System.out.println("executing request " + httpget.getURI());

                // Create a response handler
                final ResponseHandler<String> responseHandler = new BasicResponseHandler();
                final String responseBody = httpclient.execute(httpget, responseHandler);
                System.out.println("----------------------------------------");
                System.out.println(responseBody);
                System.out.println("----------------------------------------");
            } finally {
                httpclient.getConnectionManager().shutdown();
            }
        }
    Source: http://hc.apache.org/httpcomponents-client-ga/examples.html
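For comparison, the same round trip (GET a URL, fail on an error status, return the body as a string) can be sketched with the JDK's built-in java.net.http.HttpClient (Java 11 and later). This is a different library from the HttpComponents client on the slide, not a drop-in replacement. The class name JdkHttpGetSketch, the helper fetchFromLocalServer, and the throwaway local server are assumptions made only so the sketch runs without internet access.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical class name; uses only JDK APIs, not HttpComponents.
final class JdkHttpGetSketch {

    /** Sends a GET request and returns the response body as text. */
    static String fetchBody(final String url) {
        try {
            final HttpClient client = HttpClient.newHttpClient();
            final HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            final HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // Treating every status >= 300 as a failure is a choice made for this sketch.
            if (response.statusCode() >= 300) {
                throw new IllegalStateException("Unexpected status " + response.statusCode());
            }
            return response.body();
        } catch (IOException e) {
            throw new IllegalStateException("Request failed: " + url, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted: " + url, e);
        }
    }

    /** Serves the given body from a temporary local server and fetches it back. */
    static String fetchFromLocalServer(final String body) {
        final HttpServer server;
        try {
            // Port 0 lets the operating system pick a free port.
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        } catch (IOException e) {
            throw new IllegalStateException("Could not start local server", e);
        }
        server.createContext("/", exchange -> {
            final byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, bytes.length);
            exchange.getResponseBody().write(bytes);
            exchange.close();
        });
        server.start();
        try {
            return fetchBody("http://127.0.0.1:" + server.getAddress().getPort() + "/");
        } finally {
            server.stop(0);
        }
    }

    public static void main(final String[] args) {
        System.out.println(fetchFromLocalServer("<html><title>hello</title></html>"));
    }
}
```

To fetch a real page, call fetchBody with its URL directly; the local server exists only to keep the sketch self-contained.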
  • 6. HttpComponents Client demo
  • 7. Netty … is an asynchronous event-driven network application framework for rapid development of maintainable high performance protocol servers & clients.
    See: http://netty.io/
  • 8. HtmlUnit … is a "GUI-Less browser for Java programs"
    Features (excerpt):
    ● Support for the HTTP and HTTPS protocols
    ● Support for cookies
    ● Ability to specify whether failing responses from the server should throw exceptions or should be returned as pages of the appropriate type (based on content type)
    ● Ability to customize the request headers being sent to the server
    ● Support for HTML responses
    ● Support for submitting forms
    ● Support for clicking links
    ● Support for walking the DOM model of the HTML document
    ● JavaScript support
  • 9. HtmlUnit … is a "GUI-Less browser for Java programs"
        @Test
        public void homePage() throws Exception {
            final WebClient webClient = new WebClient();
            final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
            System.out.println(page.getTitleText());
            assertEquals("Welcome to HtmlUnit", page.getTitleText());

            final String pageAsXml = page.asXml();
            assertTrue(pageAsXml.contains("<body class=\"composite\">"));

            final String pageAsText = page.asText();
            assertTrue(pageAsText.contains("Support for the HTTP and HTTPS protocols"));

            webClient.closeAllWindows();
        }
    Source: http://htmlunit.sourceforge.net/gettingStarted.html
  • 10. HtmlUnit … is a "GUI-Less browser for Java programs"
        @Test
        public void getElements() throws Exception {
            final WebClient webClient = new WebClient();
            final HtmlPage page = webClient.getPage("http://some_url");
            final HtmlDivision div = page.getHtmlElementById("some_div_id");
            final HtmlAnchor anchor = page.getAnchorByName("anchor_name");
            webClient.closeAllWindows();
        }
    Luxury :)
    Note: HTML tables are supported too. There are easy wrapper classes to walk through them. Handy!
    http://htmlunit.sourceforge.net/table-howto.html
    http://htmlunit.sourceforge.net/gettingStarted.html
  • 11. Selenium … automates browsers. That's it.
    Selenium-WebDriver supports the following browsers along with the operating systems these browsers are compatible with.
    ● Google Chrome 12.0.712.0+
    ● Internet Explorer 6, 7, 8, 9 - 32 and 64-bit where applicable
    ● Firefox 3.0, 3.5, 3.6, 4.0, 5.0, 6, 7
    ● Opera 11.5+
    ● HtmlUnit 2.9
    ● Android – 2.3+ for phones and tablets (devices & emulators)
    ● iOS 3+ for phones (devices & emulators) and 3.2+ for tablets (devices & emulators)
  • 12. Selenium … automates browsers. That's it.
    The Selenium Family:
    ● Selenium IDE
    ● Selenium WebDriver
    ● Selenium Grid
    Diagram notes: "Also C#, Python, Ruby, ..." and "Also on Windows and Mac"
  • 13. Selenium … automates browsers. That's it.
    The Selenium Family:
    ● Selenium IDE … create quick bug reproduction scripts; … create scripts to aid in automation-aided exploratory testing
    ● Selenium WebDriver … create robust, browser-based regression automation
    ● Selenium Grid … scale and distribute scripts across many environments
    http://seleniumhq.org/
  • 14. Requirements for Selenium WebDriver with Firefox (and HtmlUnit)
    Browser binaries
    Dependencies:
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-java</artifactId>
            <version>2.21.0</version>
        </dependency>
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-htmlunit-driver</artifactId>
            <version>2.21.0</version>
        </dependency>
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-firefox-driver</artifactId>
            <version>2.21.0</version>
        </dependency>
  • 15. Basic Selenium example
        @Test
        public void testSeleniumWithFirefox() throws InterruptedException {
            final WebDriver webDriver = new FirefoxDriver();
            webDriver.get("http://www.majug.de");

            final WebElement veranstaltungenLink = webDriver.findElement(By.linkText("Veranstaltungen"));
            veranstaltungenLink.click();

            // Close the browser
            Thread.sleep(5000);
            webDriver.quit();
        }
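The fixed Thread.sleep(5000) above only pauses before the browser closes. When a script has to wait for content to appear, polling for a condition with a timeout is usually more robust than a fixed sleep. The following is a minimal, library-independent sketch of that idea. PollingWait and waitFor are hypothetical names and not part of Selenium, which ships its own implicit and explicit wait support.

```java
import java.util.function.Supplier;

// Hypothetical helper for illustration; not a Selenium class.
final class PollingWait {

    /**
     * Calls lookup repeatedly until it returns a non-null value.
     * Throws IllegalStateException once timeoutMillis has elapsed without a result.
     */
    static <T> T waitFor(final Supplier<T> lookup, final long timeoutMillis, final long pollMillis) {
        final long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            final T result = lookup.get();
            if (result != null) {
                return result;
            }
            if (System.currentTimeMillis() >= deadline) {
                throw new IllegalStateException("Timed out after " + timeoutMillis + " ms");
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }

    public static void main(final String[] args) {
        final long start = System.currentTimeMillis();
        // Simulated lookup: the "element" only becomes available after about 200 ms.
        final String element = waitFor(
                () -> System.currentTimeMillis() - start >= 200 ? "link" : null,
                2000, 20);
        System.out.println("Found: " + element);
    }
}
```

With WebDriver, the lookup would typically wrap a findElements(...) call and return null while the resulting list is still empty.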
  • 16. Selenium WebDriver Locator Strategies
    It's also possible to call findElements(...) to get a List<> of WebElements:
        List<WebElement> hits = webDriver.findElements(By.tagName("a"));
  • 17. Selenium WebDriver Interactions
    Once you have a WebElement, you can:
    ● webElement.click() it
    ● webElement.sendKeys(...) to it
    ● webElement.submit() it
    It is also possible to perform actions like DoubleClick, DragAndDrop, ClickAndHold, … with the "Actions" class.
  • 18. Selenium WebDriver demo
  • 19. Selenium WebDriver Pitfalls
    Newbie pitfalls:
    ● Selenium doesn't wait until the whole page is loaded (keyword: implicit wait)
    ● webElement.findElement(By.xpath("//...")) starts from the root of the DOM, not from the element (use ".//..." instead)
    ● Googling brings up "Selenium RC" solutions. That is the old Selenium project.
    ● A reference to a WebElement becomes invalid when the driver "moves" to another page.
    ● Firefox doesn't run on our CI because it is a headless system (try Xvfb)
    ● New XPath 2.0 functions (like ends-with(...)) fail. This is because Selenium uses the driver's native XPath engine. For Firefox this means XPath 1.0 today.
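The XPath scoping and XPath 1.0 pitfalls can be reproduced without a browser. The sketch below uses the JDK's javax.xml.xpath engine (which also implements XPath 1.0) instead of Selenium. The class XPathScopeDemo, the sample markup, and the count helper are made up for illustration. The last expression shows a common XPath 1.0 substitute for ends-with(...), built from substring and string-length.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK XPath 1.0 engine, not Selenium.
final class XPathScopeDemo {

    // Sample page: one link in the menu, two links in the content area.
    static final String PAGE =
            "<html><body>"
            + "<div id='menu'><a href='/home'>Home</a></div>"
            + "<div id='content'><a href='/a.pdf'>A</a><a href='/b.html'>B</a></div>"
            + "</body></html>";

    // XPath 1.0 replacement for ends-with(@href, '.pdf').
    static final String ENDS_WITH_PDF =
            ".//a[substring(@href, string-length(@href) - string-length('.pdf') + 1) = '.pdf']";

    /** Evaluates expression with the div of the given id as context node; returns the hit count. */
    static int count(final String contextDivId, final String expression) {
        try {
            final Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(PAGE)));
            final XPath xpath = XPathFactory.newInstance().newXPath();
            final Node context = (Node) xpath.evaluate(
                    "//div[@id='" + contextDivId + "']", doc, XPathConstants.NODE);
            final NodeList hits = (NodeList) xpath.evaluate(
                    expression, context, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(final String[] args) {
        // "//a" ignores the context node and searches the whole document: 3 links.
        System.out.println(count("content", "//a"));
        // ".//a" searches only below the context node: 2 links.
        System.out.println(count("content", ".//a"));
        // Only the link whose href ends with ".pdf": 1 link.
        System.out.println(count("content", ENDS_WITH_PDF));
    }
}
```

The same expressions can be passed to By.xpath(...) when searching from a WebElement.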
  • 20. Any questions? Thank you for your attention!