commit | b83e819e09323ef6310fe8351cfa7a7247fc29be | |
---|---|---|
author | Wei-Yin Chen (陳威尹) <wychen@chromium.org> | Thu Jun 18 22:18:10 2015 |
committer | Wei-Yin Chen (陳威尹) <wychen@chromium.org> | Thu Jun 18 22:18:10 2015 |
tree | 0cb01fdc23b5cb576fd53cb2c9ad8bd3174a4bdf | |
parent | bd686e8f04120786e948e100035a21df3729232b |
Refactor attribute stripping

stripIds(), stripFontColorAttributes(), stripTableBackgroundColorAttributes(), and stripStyleAttributes() are refactored to stripAttributeFromTags().

R=mdjones@chromium.org

Review URL: https://codereview.chromium.org/1185453010

DOM Distiller aims to provide a better reading experience by distilling the content of the page. This distilled content can then be used in a variety of ways.
The current efforts that will be powered by DOM Distiller:
In a folder where you want the code (outside of the chromium checkout):
git clone https://github.com/chromium/dom-distiller.git
A `dom-distiller` folder will be created in the folder where you run that command.
Before you build for the first time, you need to install the build dependencies.
For all platforms, it is required to download and install the Google Chrome browser.
ChromeDriver requires Google Chrome to be installed at a specific location (the default location for the platform). See ChromeDriver documentation for details.
Install the dependencies by entering the `dom-distiller` folder and running:
sudo ./install-build-deps.sh
Ubuntu 14.04 64-bit is recommended.
- Install `ant` and `python` using Homebrew: `brew install ant python`
- Install the `protobuf` package with the `--with-python` command line parameter: `brew install protobuf --with-python`
- Create a `buildtools` folder inside your DOM Distiller checkout.
- Download `chromedriver_mac32.zip` and ensure the binary ends up in your `buildtools` folder.
- Install `pip` by running: `sudo easy_install pip`
- Install `selenium` using `pip`: `pip install --user selenium`

For the rest of this guide, there are sometimes references to a tool called `xvfb`, specifically when running shell commands using `xvfb-run`. When you develop on Mac OS X, you can remove that part of the command. For example, `xvfb-run echo` would just become `echo`.
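As a convenience, the `xvfb-run` prefix can be made optional with a small wrapper. This is only a sketch: `with_xvfb` is a hypothetical helper name, not something the project ships, and it assumes that a missing `xvfb-run` binary means the prefix should simply be dropped (as on Mac OS X).

```shell
# Run a command under xvfb-run when it is installed, otherwise run it directly.
# with_xvfb is a hypothetical helper, not part of DOM Distiller.
with_xvfb() {
  if command -v xvfb-run >/dev/null 2>&1; then
    xvfb-run "$@"
  else
    "$@"
  fi
}
```

With this in place, `with_xvfb echo hello` behaves like `xvfb-run echo hello` on Linux and like `echo hello` on a system without `xvfb-run`.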
This option could be useful if you want to develop on an unsupported system like Windows or Red Hat Linux. Even if you are on a supported system but would rather not touch the system too much, Vagrant is a viable alternative.
The Vagrant VM is based on Ubuntu 14.04.
vagrant up
vagrant ssh
The DOM Distiller project uses the Chromium tools for collaboration. Code reviews use the Chromium Rietveld code review tool, and the set of tools found in `depot_tools` is also required.
To get `depot_tools`, follow the guide at Chrome infrastructure documentation for depot_tools.
The TL;DR of that is to run this from a folder where you install developer tools, for example in your `$HOME` folder:
git clone https://chromium.googlesource.com/chromium/tools/depot_tools
export PATH="/path/to/depot_tools:$PATH"
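If the `export` line is placed in a shell startup file, it runs again in every nested shell and `PATH` accumulates duplicate entries. A minimal sketch of one way to avoid that follows; `path_prepend` is a hypothetical helper name and `/path/to/depot_tools` is the same placeholder used above.

```shell
# Prepend a directory to PATH unless it is already listed.
# path_prepend is a hypothetical helper for a shell startup file.
path_prepend() {
  case ":$PATH:" in
    *":$1:"*) ;;                # already present, nothing to do
    *) PATH="$1:$PATH" ;;
  esac
}

path_prepend "/path/to/depot_tools"   # placeholder path, adjust to your clone
export PATH
```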
You must also set up your local checkout to point to the Chromium Rietveld server. This is a one-time setup for your checkout, so from your `dom-distiller` checkout folder, run:
git cl config
`Rietveld server`: https://codereview.chromium.org
`ant` is the tool we use to build. The available targets can be listed using `ant -p`, but the typical targets you might use when you work on this project are:
- `ant test`: Runs all tests.
- `ant test -Dtest.filter=$FILTER_PATTERN`: Runs the tests selected by `$FILTER_PATTERN`, which is a [gtest_filter pattern](https://code.google.com/p/googletest/wiki/AdvancedGuide#Running_a_Subset_of_the_Tests). For example, `*.FilterTest.*:*Foo*-*Bar*` would run all tests containing `.FilterTest.` and `Foo`, but not those with `Bar`.
- `ant gwtc`: Compiles .class+.java files to JavaScript. Standalone JavaScript is available at `war/domdistiller/domdistiller.nocache.js`.
- `ant gwtc.jstests`: Creates a standalone JavaScript for the tests.
- `ant package`: Copies the main build artifacts into the `out/package` folder, typically the extracted JS and protocol buffer files.

You can use regular `git` commands when developing in this project and use `git cl` for collaboration.
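To make the `-Dtest.filter` example above concrete, here is a rough sketch of how a gtest-style filter selects test names. It follows the rules documented for googletest (positive patterns separated by `:`, then an optional `-` followed by negative patterns) and assumes the DOM Distiller test runner behaves the same way. `matches_any` and `matches_filter` are hypothetical helpers written for illustration, and the test names in the usage note are made up.

```shell
# Return 0 if name ($1) matches any of the ':'-separated glob patterns in $2.
matches_any() {
  local name rest pat
  name=$1
  rest=$2
  while [ -n "$rest" ]; do
    pat=${rest%%:*}                     # first pattern in the list
    case $rest in
      *:*) rest=${rest#*:} ;;           # drop it and keep the remainder
      *)   rest= ;;
    esac
    if [ -n "$pat" ]; then
      # $pat is left unquoted on purpose so '*' and '?' act as wildcards.
      case $name in
        $pat) return 0 ;;
      esac
    fi
  done
  return 1
}

# Return 0 if name ($1) is selected by the gtest-style filter ($2).
matches_filter() {
  local name filter positive negative
  name=$1
  filter=$2
  case $filter in
    *-*) positive=${filter%%-*}; negative=${filter#*-} ;;
    *)   positive=$filter; negative= ;;
  esac
  [ -n "$positive" ] || positive='*'    # an empty positive part selects everything
  if matches_any "$name" "$negative"; then
    return 1                            # excluded by a negative pattern
  fi
  matches_any "$name" "$positive"
}
```

Under these assumptions, `matches_filter Suite.FilterTest.testA '*.FilterTest.*:*Foo*-*Bar*'` succeeds, while the same filter rejects `Other.testFooBar` because the negative pattern `*Bar*` matches it.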
On your branch, run `git cl upload`. The first time you do this, you will have to provide a username and password.
`machine code.google.com login` line to your `~/.netrc` file.
git branch -u origin/master
git cl land
git branch -u master
Before uploading a CL it is recommended to run `git cl format`. However, this requires adding symbolic links to your Chromium checkout.
Inside the `buildtools` folder of your DOM Distiller checkout, add the following symbolic links:
- `clang_format` → `/path/to/chromium/src/buildtools/clang_format/`
- `linux64` → `/path/to/chromium/src/buildtools/linux64/` (only for Linux 64-bit platform)
- `mac` → `/path/to/chromium/src/buildtools/mac/` (only for Mac platform)

Doing this enables you to run the command `git cl format` to fix the formatting of your code.
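The symbolic links described above could be created with a small script along these lines. This is a sketch under assumptions: `link_clang_format` is a hypothetical helper, it treats any Linux host as 64-bit, and it assumes the platform tools live under `buildtools/linux64` or `buildtools/mac` inside the Chromium `src` folder.

```shell
# Create the buildtools symlinks that git cl format relies on.
# Usage: link_clang_format /path/to/chromium/src /path/to/dom-distiller
link_clang_format() {
  local chromium_src dom_distiller platform_dir
  chromium_src=$1
  dom_distiller=$2
  case "$(uname -s)" in
    Linux)  platform_dir=linux64 ;;     # assumes a 64-bit Linux host
    Darwin) platform_dir=mac ;;
    *) echo "unsupported platform: $(uname -s)" >&2; return 1 ;;
  esac
  mkdir -p "$dom_distiller/buildtools" || return 1
  # -n replaces an existing link instead of descending into its target.
  ln -sfn "$chromium_src/buildtools/clang_format" \
    "$dom_distiller/buildtools/clang_format" &&
  ln -sfn "$chromium_src/buildtools/$platform_dir" \
    "$dom_distiller/buildtools/$platform_dir"
}
```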
In this section, the following shell variables are assumed to be correctly set:
export CHROME_SRC=/path/to/chromium/src
export DOM_DISTILLER_DIR=/path/to/dom-distiller
# Build DOM Distiller and copy the packaged output into the Chromium checkout.
roll-distiller () {
  (
    (cd "$DOM_DISTILLER_DIR" && ant package) && \
    rm -rf "$CHROME_SRC"/third_party/dom_distiller_js/dist/* && \
    cp -rf "$DOM_DISTILLER_DIR"/out/package/* "$CHROME_SRC"/third_party/dom_distiller_js/dist/ && \
    touch "$CHROME_SRC"/components/resources/dom_distiller_resources.grdp
  )
}
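Because `roll-distiller` runs `rm -rf` on a path built from `$CHROME_SRC`, it may be worth checking the two variables before calling it. The following is a minimal sketch; `check_distiller_env` is a hypothetical helper, not part of the project.

```shell
# Verify that CHROME_SRC and DOM_DISTILLER_DIR are set and point at directories.
# check_distiller_env is a hypothetical guard, not part of DOM Distiller.
check_distiller_env() {
  local name value status
  status=0
  for name in CHROME_SRC DOM_DISTILLER_DIR; do
    eval "value=\${$name:-}"            # read the variable named by $name
    if [ -z "$value" ]; then
      echo "$name is not set" >&2
      status=1
    elif [ ! -d "$value" ]; then
      echo "$name=$value is not a directory" >&2
      status=1
    fi
  done
  return "$status"
}
```

A possible use is `check_distiller_env && roll-distiller`, so the roll only starts when both paths look sane.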
- In `$CHROME_SRC`, run GYP to set up the ninja build files using `build/gyp_chromium`.
- Build the `chrome` target: `ninja -C out/Debug chrome`
- Run Chrome with DOM Distiller enabled: `out/Debug/chrome --enable-dom-distiller`
- This adds a `Distill page` menu item that you can use to distill web pages.
- Go to `chrome://dom-distiller` to access the debug page.
- To use a fresh profile directory, add `--user-data-dir=/tmp/$(mktemp -d)` as a command line parameter. On Mac OS X, you can instead write `--user-data-dir=$(mktemp -d 2>/dev/null || mktemp -d -t 'chromeprofile')`.

- Build the `components_browsertests` target: `ninja -C out/Debug components_browsertests`
- Run the `components_browsertests` binary to execute the tests. You can prefix the command with `xvfb-run` to avoid pop-up windows: `xvfb-run out/Debug/components_browsertests`
- To run only the DOM Distiller tests, add a filter: `xvfb-run out/Debug/components_browsertests --gtest_filter=\*Distiller\*`
- To run the tests isolated, build `components_browsertests_run` and execute them using the swarming tool: `ninja -C out/Debug components_browsertests_run` followed by `python tools/swarming_client/isolate.py run -s out/Debug/components_browsertests.isolated`

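The Chrome launch flags mentioned above (`--enable-dom-distiller` and a throwaway `--user-data-dir`) can be combined in one shell function. This is a sketch: `run_distiller_chrome` and the `CHROME_BIN` override are hypothetical names introduced here, and the default binary path assumes you run it from `$CHROME_SRC`.

```shell
# Start Chrome with DOM Distiller enabled and a throwaway profile directory.
# CHROME_BIN is a hypothetical override; the default assumes the current
# directory is $CHROME_SRC and that the Debug build exists.
run_distiller_chrome() {
  local profile
  # Same mktemp fallback as in the text, so it also works on Mac OS X.
  profile=$(mktemp -d 2>/dev/null || mktemp -d -t 'chromeprofile') || return 1
  "${CHROME_BIN:-out/Debug/chrome}" \
    --enable-dom-distiller \
    --user-data-dir="$profile" \
    "$@"
}
```

Any extra arguments, such as a URL, are passed straight through to Chrome.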
To extract the content from a web page directly, you can run:
xvfb-run out/Debug/components_browsertests \
  --gtest_filter='*MANUAL_ExtractUrl' \
  --run-manual \
  --test-tiny-timeout=600000 \
  --output-file=./extract.out \
  --url=http://www.example.com \
  > ./extract.log 2>&1
`extract.out` has the extracted HTML, and `extract.log` has the console logging.
If you need more logging, you can add the following arguments to the command:
--vmodule=*distiller*=2
--debug-level=99
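One way to toggle the extra logging arguments without retyping the whole command is sketched below. `extract_url` and the `DRY_RUN` switch are hypothetical names introduced for this example; the flags themselves are the ones listed above. Note that the `--vmodule` pattern is quoted so the shell does not try to expand the asterisks.

```shell
# Run the manual extraction test for a URL, optionally with verbose logging.
# Usage: extract_url URL [verbose]
# Set DRY_RUN=1 (hypothetical switch) to print the arguments instead of running.
extract_url() {
  local url verbose
  url=$1
  verbose=${2:-}
  set -- --gtest_filter='*MANUAL_ExtractUrl' --run-manual \
    --test-tiny-timeout=600000 --output-file=./extract.out "--url=$url"
  if [ "$verbose" = "verbose" ]; then
    set -- "$@" '--vmodule=*distiller*=2' --debug-level=99
  fi
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$@"                  # one argument per line
  else
    xvfb-run out/Debug/components_browsertests "$@" > ./extract.log 2>&1
  fi
}
```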
If this is something you often do, you can put the following function in a bash file you include (for example `~/.bashrc`) and use it for iterative development:
# Roll the latest DOM Distiller build into Chrome, rebuild the tests, and extract the URL in $1.
distill() {
  (
    roll-distiller && \
    ninja -C out/Debug components_browsertests && \
    xvfb-run out/Debug/components_browsertests \
      --gtest_filter='*MANUAL_ExtractUrl' \
      --run-manual \
      --test-tiny-timeout=600000 \
      --output-file=./extract.out \
      --url="$1" \
      > ./extract.log 2>&1
  )
}
Usage when running from `$CHROME_SRC`:
distill http://example.com/article.html
You can use the Chrome Developer Tools to debug DOM Distiller:
- Build the test JavaScript by running `ant gwtc.jstests` or `ant test`.
- Open `war/test.html` in Chrome desktop.
- Open the `Console` panel in Developer Tools (Ctrl-Shift-J). On Mac OS X you can use ⌥-⌘-I (uppercase `I`) as the shortcut.
- Run all tests by calling `org.chromium.distiller.JsTestEntry.run()`.
- Run a subset of tests by calling, for example, `org.chromium.distiller.JsTestEntry.runWithFilter('MyTestClass.testSomething')`.

The `Sources` panel contains both the extracted JavaScript and all the Java source files, as long as you haven't disabled JavaScript source maps in Developer Tools. You can set breakpoints in the Java source files and inspect variables and the call stack when a breakpoint is hit.
When a test fails, you will see several stack traces. One of these contains clickable links to the corresponding Java source files for the stack frames.
After running `ant package`, the `out/extension` folder contains an unpacked Chrome extension. This can be added to Chrome and used for development.
- Go to `chrome://extensions`.
- Load the `out/extension` folder as an unpacked extension.

The extension currently supports profiling the extraction code.
It also adds a panel to the Developer Tools which you can use to trigger extraction on the inspected page. This can be used to trigger and profile extraction on a mobile device which you are currently inspecting using `chrome://inspect`.
To add logging, use the Java function `LogUtil.logToConsole()`. Destination of logs:
- `ant test`: Terminal. To get more verbose output, use `ant test -Dtest.debug_level=99`.
- Chrome: `$CHROME_LOG_FILE`. A release mode build of Chrome will log all JavaScript `INFO` there if you start Chrome with `--enable-logging`. You can add `--enable-logging=stderr` to have the log go to stderr instead of a file.
- Content extraction: `extract.log` above.

For an example, see `$DOM_DISTILLER_DIR/java/org/chromium/distiller/PagingLinksFinder.java`.
Use `ant package '-Dgwt.custom.args=-style PRETTY'` for easier JavaScript debugging.
1. In the tab with the URL you want to distill, open device emulation in Developer Tools.
2. Select the desired `Device` and reload the page. Verify that you get what you expect. For example, a Nexus 4 might get a mobile site, whereas a Nexus 7 might get the desktop site.
3. Find the `UA` field. This field does not require a reload after changing device, but it is good practice to verify that you get what you expect. Copy its value to the clipboard.
4. (Re)start Chrome with `--user-agent="$USER_AGENT_FROM_CLIPBOARD"`. Remember to also add `--enable-dom-distiller`.
5. Distill the same URL, either by using `Distill page` or by going to `chrome://dom-distiller` and using the input field there.

If you want, you can copy some of these User-Agent strings into normal bash aliases for easy access later. For example, Nexus 4 would be:
--user-agent="Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19"
Steps 1-3 in the guide above can typically be done in a stable version of Chrome, whereas the rest of the steps are typically done in your own build of Chrome (hence the “(Re)” in step 4). Besides speed, this also facilitates side-by-side comparison.
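The Nexus 4 user agent shown above could be kept in a shell function for later reuse. A shell function is used here rather than an alias so that extra arguments, such as a URL, can be appended. `chrome_nexus4`, `NEXUS4_UA`, and the `CHROME_BIN` override are hypothetical names introduced for this sketch.

```shell
# User agent string for Nexus 4, copied from the example above.
NEXUS4_UA='Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19'

# Start Chrome with DOM Distiller enabled and the Nexus 4 user agent.
# CHROME_BIN is a hypothetical override; the default assumes the current
# directory is $CHROME_SRC.
chrome_nexus4() {
  "${CHROME_BIN:-out/Debug/chrome}" \
    --enable-dom-distiller \
    --user-agent="$NEXUS4_UA" \
    "$@"
}
```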