Tokenize HTML Document
Summary

This tool splits an HTML (.html) document, either located at a user-specified web address or uploaded from the user’s files, at specified points (tokens). Tokens may be words, lines, sentences or paragraphs, as well as specific characters, patterns or HTML tags. The results can be listed with or without the token, and with the split made before or after the token.

Please click the ? buttons at the bottom right of each set of options for more information on that set.

For further information on this tool, please see the Tokenize entry on the TADA Wiki. A glossary of terms is also available there.

Walkthrough

To process the text found at http://www.w3.org/ by extracting the text between the <body> and </body> tags, splitting it into sentences, keeping each sentence separator with the preceding token, and displaying the results as HTML:
  1. Source text
    1. Enter ‘http://www.w3.org/’ into the ‘URL’ field.
  2. Subtext limited to
    1. Enter ‘body’ in the ‘Elements’ field.
  3. Token types
    1. Click the radio button next to 'Sentences'.
  4. Results
    1. Set the ‘Display options’ drop-down menu to ‘keep with previous token’.
    2. Set the ‘Display as’ drop-down menu to ‘HTML’.
    3. Click the ‘Submit’ button to process the text.
» Source text *
  Example: http://taporware.ualberta.ca/einstein-bio.html

Summary

This section determines the source of the document you wish the tool to process. HTML can be obtained either from a web address or by uploading a file.

Fields

Source URL
To use content from a web page, enter a full web address (URL) in the field provided. Copy and paste from your browser’s address bar for best results.

Local file
To upload an HTML (.html) file from your computer, choose ‘Local file,’ click ‘Browse,’ and select the file you wish to use from your directory.
» Subtext limited to *
(separate multiple elements with commas)
Summary

This section determines which HTML tags to extract text from.

Fields

Elements
Use this field to specify which HTML tag(s) to extract text from. Multiple tags must be separated by commas (ex: 'p, h1, h2'). This field defaults to 'body'.
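
The behaviour of the 'Elements' field can be illustrated with a short Python sketch. This is not the tool's own code: the names ElementTextExtractor, parse_elements_field and extract_subtext are hypothetical, and the sketch assumes well-formed HTML in which every listed element has a closing tag. It parses a comma-separated list of tag names, falls back to 'body' when the field is empty, and collects the text found inside those elements.

```python
from html.parser import HTMLParser


class ElementTextExtractor(HTMLParser):
    """Collect text found inside any of the requested elements."""

    def __init__(self, elements):
        super().__init__()
        self.wanted = set(elements)
        self.depth = 0      # number of requested elements currently open
        self.chunks = []    # text pieces, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while at least one requested element is open.
        if self.depth > 0:
            self.chunks.append(data)


def parse_elements_field(field):
    """Turn 'p, h1, h2' into ['p', 'h1', 'h2']; an empty field means 'body'."""
    names = [name.strip().lower() for name in field.split(",") if name.strip()]
    return names or ["body"]


def extract_subtext(html, field=""):
    """Return the text inside the listed elements, one text run per line."""
    parser = ElementTextExtractor(parse_elements_field(field))
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

For example, calling extract_subtext(page, "p, h1") on a page string keeps only paragraph and top-level heading text, while extract_subtext(page) keeps everything inside the body element.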
» Token types *
(separate characters with spaces; to split on whitespace, use ^s)
Summary

This section determines what to consider a token within the source document. Tokens are fragments of a text that can be defined as words, lines, sentences, paragraphs, characters or specific patterns.

Fields

Words
Splits text on individual words.

Lines
Splits text line by line.

Sentences
Splits text sentence by sentence.

Paragraphs
Splits text paragraph by paragraph.

Characters
Splits text at one or more user-specified characters (separate multiple characters with spaces). To split on whitespace, use ^s.

Pattern
Splits text along a user-specified pattern, either in Unix format or as a regular expression.

Unix Format
Splits the text at tokens matched by a Unix-format regular expression.

Regular Exp.
Splits the text at tokens matched by a regular expression. Do not use shorthand classes such as \d, \D, \w or \W; use [0-9], [^0-9], [a-zA-Z] or [^a-zA-Z] instead. Using \n, \r or \t is fine.
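
The 'Characters' and 'Regular Exp.' rules above can be sketched in Python. This is only an illustration under stated assumptions, not the tool's implementation: character_splitter and expand_shorthand are hypothetical helper names, and the rewriting is deliberately naive (it does not treat shorthand inside a bracketed class specially). The replacements follow the help text literally, even though [a-zA-Z] is narrower than what \w usually matches.

```python
import re

# Explicit classes recommended by the help text in place of shorthand escapes.
SHORTHAND = {
    r"\d": "[0-9]",
    r"\D": "[^0-9]",
    r"\w": "[a-zA-Z]",
    r"\W": "[^a-zA-Z]",
}


def character_splitter(field):
    """Build a pattern from a space-separated character list; '^s' is whitespace."""
    parts = []
    for item in field.split():
        if item == "^s":
            parts.append(r"[ \t\r\n]")
        else:
            parts.append(re.escape(item))
    if not parts:
        raise ValueError("at least one character is required")
    return re.compile("|".join(parts))


def expand_shorthand(pattern):
    """Rewrite shorthand escapes into explicit classes; leave other escapes alone."""
    def replace(match):
        return SHORTHAND.get(match.group(0), match.group(0))

    # Matching every backslash pair keeps an escaped backslash followed by 'd'
    # from being mistaken for a digit shorthand.
    return re.sub(r"\\.", replace, pattern, flags=re.DOTALL)
```

With these helpers, character_splitter(", ; ^s") yields a pattern that splits on commas, semicolons and whitespace, and expand_shorthand can be applied to a pattern before pasting it into the 'Regular Exp.' field.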
» Results
Summary

This section allows users to choose how the results will be formatted, and whether to display results in a new browser window.

Fields

Display options
This drop-down menu lets users treat the token (the separator) in one of four ways in the final results: strip out the separator, keep the separator as a token, keep the separator with the previous token, or keep the separator with the following token.

Display as
This drop-down list lets users choose from several output formats: HTML, XML text in HTML, XML tree, and tab-delimited text.

Open results in new window
Check this box to display the results in a new window or browser tab. This option is selected by default. Some pop-up blockers may prevent a new window from being opened; if so, un-check the box to open the results in the same window instead.
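
The four choices described under 'Display options' can be made concrete with a small Python sketch. The function name tokenize and the option labels 'strip', 'token', 'previous' and 'following' are assumptions made for this illustration; they are not the tool's own identifiers, and the tool may treat edge cases such as empty tokens differently.

```python
import re


def tokenize(text, separator, display="strip"):
    """Split text on a separator pattern and place the separator as requested."""
    segments, seps = [], []
    pos = 0
    for match in re.finditer(separator, text):
        if match.start() == match.end():
            continue  # ignore empty matches
        segments.append(text[pos:match.start()])
        seps.append(match.group(0))
        pos = match.end()
    segments.append(text[pos:])  # always one more segment than separators

    if display == "strip":        # strip out the separator
        tokens = segments
    elif display == "token":      # keep the separator as its own token
        tokens = []
        for segment, sep in zip(segments, seps):
            tokens += [segment, sep]
        tokens.append(segments[-1])
    elif display == "previous":   # keep the separator with the previous token
        tokens = [seg + sep for seg, sep in zip(segments, seps)] + [segments[-1]]
    elif display == "following":  # keep the separator with the following token
        tokens = [segments[0]] + [sep + seg for sep, seg in zip(seps, segments[1:])]
    else:
        raise ValueError(f"unknown display option: {display}")
    return [token for token in tokens if token]  # drop empty tokens
```

Splitting "One.Two!Three" on the pattern [.!?] gives One, Two, Three when the separator is stripped; One., Two!, Three when it is kept with the previous token; and One, .Two, !Three when it is kept with the following token.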
* indicates a required field.

TAPoRware Project, McMaster University,