Extract Text from HTML Tag/Attribute
Summary

This tool extracts text from an HTML document, either located at a specified web address or uploaded from the user’s files. It also removes all text associated with HTML formatting, such as links, div containers, or navigation bars.

Please click the ? buttons at the bottom right of each set of options for more information on that set.

For further information on this tool, please see the HTML Text Extractor entry on the TADA Wiki. A glossary of terms is also available.

Walkthrough

To extract all text from http://www.globalautonomy.ca/global1/article.jsp?index=RA_Brook_Violence.xml between the <div id="content"> and </div> tags, and display the results in HTML format:
  1. Source text
    1. Enter ‘http://www.globalautonomy.ca/global1/article.jsp?index=RA_Brook_Violence.xml’ in the ‘Source URL’ field.
  2. Subtext limited to
    1. Enter ‘div’ in the ‘Element’ field.
    2. Enter ‘id’ in the ‘Attribute name’ field.
    3. Enter ‘content’ in the ‘Attribute value’ field.
  3. Results
    1. Select ‘HTML’ in the ‘Display as’ drop-down menu.
  4. Click the ‘Submit’ button to process the text.
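
As a rough illustration of what the walkthrough asks for, the following Python sketch pulls the text out of <div id="content"> elements in an HTML string, using only the standard library. It is not the tool's own code: the names `SubtextExtractor` and `extract_text` are invented for this example, and the sketch collects text from every matching element, which may differ from how the tool treats multiple matches.

```python
from html.parser import HTMLParser


class SubtextExtractor(HTMLParser):
    """Collect text inside elements matching a tag and an optional attribute."""

    def __init__(self, element, attr_name=None, attr_value=None):
        super().__init__(convert_charrefs=True)
        self.element = element.lower()
        self.attr_name = attr_name.lower() if attr_name else None
        self.attr_value = attr_value
        self.depth = 0        # nesting depth inside a match (0 means outside)
        self.in_code = False  # True while inside <script> or <style>
        self.chunks = []

    def _matches(self, tag, attrs):
        if tag != self.element:
            return False
        if self.attr_name is None:
            return True  # no attribute given: every element of this name matches
        return dict(attrs).get(self.attr_name) == self.attr_value

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_code = True
        if self.depth:
            if tag == self.element:
                self.depth += 1  # nested element with the same name
        elif self._matches(tag, attrs):
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.in_code = False
        if self.depth and tag == self.element:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and not self.in_code:
            self.chunks.append(data)


def extract_text(html, element, attr_name=None, attr_value=None):
    """Return the text found inside the matching elements of an HTML string."""
    parser = SubtextExtractor(element, attr_name, attr_value)
    parser.feed(html)
    parser.close()
    # Join pieces with a space and collapse whitespace. This is a compromise:
    # it keeps block elements apart but can split a word that contains inline tags.
    return " ".join(" ".join(parser.chunks).split())
```

Calling `extract_text(page, "div", "id", "content")`, where `page` holds HTML that has already been downloaded, mirrors the three ‘Subtext limited to’ entries above. Leaving out the last two arguments corresponds to filling in the ‘Element’ field only.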
» Source text *
  Example: http://taporware.ualberta.ca/einstein-bio.html

Summary

This section determines the source of the document you wish the tool to process. HTML can be obtained either from a web address or by uploading a file.

Fields

Source URL
To use content from a web page, enter a full web address (URL) in the field provided. Copy and paste from your browser’s address bar for best results.

Local file
To upload an HTML (.html) file from your computer, choose ‘Local file,’ click ‘Browse,’ and select the file you wish to use from your directory.
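
For readers who want to reproduce the two source options outside the tool, here is a minimal Python sketch that reads HTML either from a web address or from a local file. The function name `load_html` and its parameters are assumptions made for this example; they are not part of the tool.

```python
from pathlib import Path
from urllib.request import urlopen


def load_html(source_url=None, local_file=None, encoding="utf-8"):
    """Return HTML text from exactly one of a web address or a local file."""
    if (source_url is None) == (local_file is None):
        raise ValueError("Give either source_url or local_file, but not both.")
    if source_url is not None:
        with urlopen(source_url, timeout=30) as response:
            # Prefer the character set declared by the server, if there is one.
            charset = response.headers.get_content_charset() or encoding
            return response.read().decode(charset, errors="replace")
    return Path(local_file).read_text(encoding=encoding, errors="replace")
```

For example, `load_html(source_url="http://taporware.ualberta.ca/einstein-bio.html")` would fetch the example page named above, provided that address is still reachable.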
» Subtext limited to
        
Summary

This section allows the user to limit the extracted text to particular elements, attribute names and attribute values.

If no attribute information is specified, the tool will extract the text from all matching tags. If an attribute name/value pair is entered, the tool will extract text from the first matching tags.

Fields

Element
Enter the desired HTML element here, such as 'body'. Note: This field is required.

Attribute name
If desired, add the name of an attribute modifying the element specified above, such as 'id' or 'class'. Note: This field is optional, but must be used in combination with an element.

Attribute value
If desired, enter a specific value for the attribute named above, such as 'content'. Note: This field is optional, but must be used in combination with an attribute name.
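
The way the three fields depend on one another can be summed up in a short Python sketch. The function `validate_subtext_options` and its messages are hypothetical, written only to illustrate the notes above; the tool's own checks and wording may differ.

```python
def validate_subtext_options(element, attr_name="", attr_value=""):
    """Return a list of problems with the three 'Subtext limited to' fields."""
    problems = []
    if not element.strip():
        # Element is required, and an attribute name only makes sense with one.
        problems.append("Element is required.")
    if attr_value.strip() and not attr_name.strip():
        problems.append("Attribute value needs an attribute name.")
    return problems
```

An empty list means the combination is acceptable, as with `validate_subtext_options("div", "id", "content")`.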
» Results
Summary

This section allows the user to choose how the tool's final results will be displayed.

Fields

Display as
Use this drop-down list to choose between the ‘HTML’ and ‘HTML tags in HTML’ display formats. Note: HTML is the only option available for the Acronym Finder tool.

Open results in new window
Check this box to display the results in a new window or browser tab. This option is selected by default. Some pop-up blockers may prevent a new window from being opened; if so, un-check the box to open the results in the same window instead.
‘*’ indicates a required field

 

 

TAPoRware Project, McMaster University