Innoplexia DevTools to Crawl Webpages


  • 1. DevTools to crawl Webpages.
  • 2. DevTools (09.05.12, @chrschneider)
  • 3. DevTools: "… Apache … toolset of low level Java components focused on HTTP and associated protocols."
      ● HttpComponents Core: "… is a set of low level HTTP transport components"
      ● HttpComponents Client: "… provides reusable components for client-side ... HTTP connection management."
      ● HttpComponents AsyncClient (DEV): "… ability to handle a great number of concurrent connections ... more ... performance in terms of a raw data throughput."
      ● Commons HttpClient (Legacy): "… All users of Commons HttpClient 3.x are strongly encouraged to upgrade to HttpClient"
  • 4. DevTools: HttpComponents Client, Example Components
      ● Get, Post, Delete, … request objects
      ● Cookie manager
      ● SSL
      ● Content-encoding aware
      ● HTTP authentication (Basic, Digest, ...)
  • 5. DevTools: HttpComponents Client Example
        import org.apache.http.client.HttpClient;
        import org.apache.http.client.ResponseHandler;
        import org.apache.http.client.methods.HttpGet;
        import org.apache.http.impl.client.BasicResponseHandler;
        import org.apache.http.impl.client.DefaultHttpClient;

        public static void main(final String[] args) throws Exception {
            final HttpClient httpclient = new DefaultHttpClient();
            try {
                // Target URL (left empty here)
                final HttpGet httpget = new HttpGet("");
                System.out.println("executing request " + httpget.getURI());
                // Create a response handler that returns the body as a String
                final ResponseHandler<String> responseHandler = new BasicResponseHandler();
                final String responseBody = httpclient.execute(httpget, responseHandler);
                System.out.println("----------------------------------------");
                System.out.println(responseBody);
                System.out.println("----------------------------------------");
            } finally {
                // Release all connections held by the client
                httpclient.getConnectionManager().shutdown();
            }
        }
  • 6. DevTools: HttpComponents Client Demo
  • 7. DevTools: "… is an asynchronous event-driven network application framework for rapid development of maintainable high performance protocol servers & clients." See:
  • 8. DevTools: … is a "GUI-Less browser for Java programs"
      Features (excerpt):
      ● Support for the HTTP and HTTPS protocols
      ● Support for cookies
      ● Ability to specify whether failing responses from the server should throw exceptions or should be returned as pages of the appropriate type (based on content type)
      ● Ability to customize the request headers being sent to the server
      ● Support for HTML responses
      ● Support for submitting forms
      ● Support for clicking links
      ● Support for walking the DOM model of the HTML document
      ● JavaScript support
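  • 9. DevTools: … is a "GUI-Less browser for Java programs"
        import static org.junit.Assert.assertEquals;
        import static org.junit.Assert.assertTrue;
        import org.junit.Test;
        import com.gargoylesoftware.htmlunit.WebClient;
        import com.gargoylesoftware.htmlunit.html.HtmlPage;

        @Test
        public void homePage() throws Exception {
            final WebClient webClient = new WebClient();
            // Page URL (left empty here)
            final HtmlPage page = webClient.getPage("");
            System.out.println(page.getTitleText());
            assertEquals("Welcome to HtmlUnit", page.getTitleText());
            // Check the page both as serialized XML and as rendered text
            final String pageAsXml = page.asXml();
            assertTrue(pageAsXml.contains(""));
            final String pageAsText = page.asText();
            assertTrue(pageAsText.contains("Support for the HTTP and HTTPS protocols"));
            webClient.closeAllWindows();
        }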
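  • 10. DevTools: … is a "GUI-Less browser for Java programs"
        import org.junit.Test;
        import com.gargoylesoftware.htmlunit.WebClient;
        import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
        import com.gargoylesoftware.htmlunit.html.HtmlDivision;
        import com.gargoylesoftware.htmlunit.html.HtmlPage;

        @Test
        public void getElements() throws Exception {
            final WebClient webClient = new WebClient();
            final HtmlPage page = webClient.getPage("http://some_url");
            // Look up typed elements directly by id or by anchor name
            final HtmlDivision div = page.getHtmlElementById("some_div_id");
            final HtmlAnchor anchor = page.getAnchorByName("anchor_name");
            webClient.closeAllWindows();
        }
      Luxury :)
      Note: HTML tables are supported as well. There are simple wrapper classes for walking through them. Handy!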
  • 11. DevTools: … automates browsers. That's it.
      Selenium-WebDriver supports the following browsers along with the operating systems these browsers are compatible with:
      ● Google Chrome 12.0.712.0+
      ● Internet Explorer 6, 7, 8, 9 (32- and 64-bit where applicable)
      ● Firefox 3.0, 3.5, 3.6, 4.0, 5.0, 6, 7
      ● Opera 11.5+
      ● HtmlUnit 2.9
      ● Android 2.3+ for phones and tablets (devices & emulators)
      ● iOS 3+ for phones (devices & emulators) and 3.2+ for tablets (devices & emulators)
  • 12. DevTools: … automates browsers. That's it.
      The Selenium Family:
      ● Selenium IDE
      ● Selenium WebDriver
      ● Selenium Grid
      Notes on the slide: also C#, Python, Ruby, ...; also on Windows and Mac.
  • 13. DevTools: … automates browsers. That's it.
      The Selenium Family:
      ● Selenium IDE: … create quick bug reproduction scripts; … create scripts to aid in automation-aided exploratory testing
      ● Selenium WebDriver: … create robust, browser-based regression automation
      ● Selenium Grid: … scale and distribute scripts across many environments
  • 14. DevTools: Requirements for Selenium WebDriver with Firefox (and HtmlUnit)
      Dependencies (group ID, artifact ID, version):
      ● org.seleniumhq.selenium, selenium-java, 2.21.0
      ● org.seleniumhq.selenium, selenium-htmlunit-driver, 2.21.0
      ● org.seleniumhq.selenium, selenium-firefox-driver, 2.21.0
      Browser binaries
      That's it.
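In a Maven build, the three dependencies from slide 14 might be declared roughly as follows. This is only a sketch of a `pom.xml` fragment: the use of Maven is inferred from the coordinate format, and the Firefox driver's artifact ID is assumed to be `selenium-firefox-driver`. The versions are the ones named on the slide.

```xml
<dependencies>
  <dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>2.21.0</version>
  </dependency>
  <dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-htmlunit-driver</artifactId>
    <version>2.21.0</version>
  </dependency>
  <dependency>
    <!-- Assumed artifact ID for the Firefox driver -->
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-firefox-driver</artifactId>
    <version>2.21.0</version>
  </dependency>
</dependencies>
```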
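  • 15. DevTools: Basic Selenium example
        import org.junit.Test;
        import org.openqa.selenium.By;
        import org.openqa.selenium.WebDriver;
        import org.openqa.selenium.WebElement;
        import org.openqa.selenium.firefox.FirefoxDriver;

        @Test
        public void testSeleniumWithFirefox() throws InterruptedException {
            final WebDriver webDriver = new FirefoxDriver();
            // Page URL (left empty here)
            webDriver.get("");
            // Locate a link by its visible text
            final WebElement veranstaltungenLink = webDriver.findElement(By.linkText("Veranstaltungen"));
            // Keep the browser open for five seconds, then close it
            Thread.sleep(5000);
            webDriver.quit();
        }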
  • 16. DevTools: Selenium WebDriver Locator Strategies
      It's also possible to call findElements(...) to get a List of WebElements:
        List<WebElement> hits = webDriver.findElements(By.tagName("a"));
  • 17. DevTools: Selenium WebDriver Interactions
      Once you have a webElement, you can:
      ● … it
      ● webElement.sendKeys(...) to it
      ● webElement.submit() on it
      It is also possible to perform "Actions" like DoubleClick, DragAndDrop, ClickAndHold, … with the "Actions" class.
  • 18. DevTools: Selenium WebDriver Demo
  • 19. DevTools: Selenium WebDriver Pitfalls
      Newbie pitfalls:
      ● Selenium doesn't wait until the whole page has loaded (keyword: implicit wait).
      ● An XPath of the form "//..." passed to webElement.findElement(By.xpath(...)) starts from the root of the DOM, not from the element (use ".//..." instead).
      ● Google brings up "Selenium RC" solutions. That is the old Selenium project.
      ● A reference to a WebElement becomes invalid once the driver "moves" to another page.
      ● Firefox doesn't run on our CI because it is a headless system (try Xvfb).
      ● New XPath 2.0 functions (like ends-with(...)) fail because Selenium uses the driver's native XPath engine. For Firefox that currently means XPath 1.0.
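The two XPath pitfalls on slide 19 can be reproduced without a browser. The sketch below does not use Selenium at all: it swaps in the JDK's built-in `javax.xml.xpath` engine, which also implements XPath 1.0, and runs it against a small hypothetical page. The class name `XPathPitfalls`, the sample markup, and the helper methods are illustrative assumptions. It shows that `//a` ignores the context node while `.//a` stays below it, and that `ends-with` can be emulated with `substring` and `string-length`.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathPitfalls {

    // Hypothetical sample page; well-formed so the JDK XML parser accepts it.
    static final String PAGE =
          "<html><body>"
        + "<div id='nav'><a href='home.html'>Home</a><a href='news.pdf'>News</a></div>"
        + "<div id='main'><a href='report.pdf'>Report</a></div>"
        + "</body></html>";

    // XPath 1.0 replacement for the XPath 2.0 call ends-with(@href, '.pdf').
    static final String ENDS_WITH_PDF =
        "//a[substring(@href, string-length(@href) - string-length('.pdf') + 1) = '.pdf']";

    static Document parse() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(PAGE)));
        } catch (Exception e) {
            throw new IllegalStateException("sample page could not be parsed", e);
        }
    }

    static NodeList select(String expression, Node context) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (NodeList) xpath.evaluate(expression, context, XPathConstants.NODESET);
    }

    // Number of matching nodes, or -1 if the engine rejects the expression.
    static int count(String expression, Node context) {
        try {
            return select(expression, context).getLength();
        } catch (XPathExpressionException e) {
            return -1;
        }
    }

    // First matching node, or null if there is none or the expression is rejected.
    static Node first(String expression, Node context) {
        try {
            return select(expression, context).item(0);
        } catch (XPathExpressionException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Document doc = parse();
        Node main = first("//div[@id='main']", doc);

        // "//a" starts at the document root even though the context is the main div.
        System.out.println("//a from main div:  " + count("//a", main));
        // ".//a" only looks at descendants of the context node.
        System.out.println(".//a from main div: " + count(".//a", main));
        // XPath 1.0 workaround for ends-with.
        System.out.println("links ending in .pdf: " + count(ENDS_WITH_PDF, doc));
        // The XPath 2.0 function itself; -1 means the engine rejected it.
        System.out.println("ends-with(...): " + count("//a[ends-with(@href, '.pdf')]", doc));
    }
}
```

With Selenium, the same expressions would be passed to `By.xpath(...)`. Whether `ends-with` is accepted there depends on the browser's own XPath engine, as the slide notes.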
  • 20. Any questions? Thank you for your attention!