Swap:Data Access: Difference between revisions

From Adapt

Latest revision as of 17:08, 19 May 2010

HTTP Access

Data can be uploaded and downloaded over HTTP using a REST-like mechanism. To construct a URL, you need to know three things:

  • The address of any swap server
  • The base path of the file group containing the data you want to pull
  • The path within the file group to your file

Once you know these items, the URL is constructed as follows:

http://[server[:port]]/[function]/[group_path]/[file_path]?[function_options]

Suppose a file lives on server1.university.edu, which listens on port 8080 (the default). The file belongs to a file group with the prefix processdata/webcrawls/2004crawl, and its path within that group is /oct2004/35/crawlfile.arc.gz

The URL would be http://server1.university.edu:8080/get/processdata/webcrawls/2004crawl/oct2004/35/crawlfile.arc.gz
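
The URL pattern above can be assembled programmatically. The following is a minimal Python sketch, not part of Swap itself: the helper name build_swap_url and its parameters are hypothetical, and it only joins the pieces described on this page (server, port, function, group path, file path, optional function options).

```python
from urllib.parse import quote, urlencode


def build_swap_url(server, function, group_path, file_path, port=8080, options=None):
    """Assemble http://server:port/function/group_path/file_path?options.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    # Strip stray slashes so the parts join with exactly one "/" between them.
    parts = [function.strip("/"), group_path.strip("/"), file_path.strip("/")]
    # quote() leaves "/" intact and percent-encodes characters unsafe in a path.
    path = "/".join(quote(part) for part in parts)
    url = f"http://{server}:{port}/{path}"
    if options:
        # Function options become the query string.
        url += "?" + urlencode(options)
    return url


url = build_swap_url(
    "server1.university.edu",
    "get",
    "processdata/webcrawls/2004crawl",
    "/oct2004/35/crawlfile.arc.gz",
)
print(url)
```

With the example values from this page, the helper reproduces the URL shown above. The port defaults to 8080 because that is the default stated here.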

Downloading Files

Whole File

Partial File

Download Arc Files

This function assumes that any file you pull contains concatenated arc entries, where each arc entry has been gzip'd individually.

  • offset - offset to start reading within the compressed file
  • contentonly - (optional) set to true to strip out arc http header information (default: false)

http://server1.university.edu:8080/arc/processdata/webcrawls/2004crawl/oct2004/35/crawlfile.arc.gz?offset=6789&contentonly=true
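
To make the offset parameter concrete, here is a small Python sketch under stated assumptions. The helper names arc_url and read_member are hypothetical. The first builds a request URL like the one above. The second is a purely local illustration of what "an offset within the compressed file" means for a file of concatenated gzip members; it does not describe how the Swap server is implemented.

```python
import gzip
import zlib
from urllib.parse import urlencode


def arc_url(base, group_path, file_path, offset, contentonly=False):
    """Build an /arc/ request URL (hypothetical helper, for illustration)."""
    query = {"offset": offset}
    if contentonly:
        # contentonly is optional and defaults to false on the server side.
        query["contentonly"] = "true"
    path = f"{group_path.strip('/')}/{file_path.strip('/')}"
    return f"{base}/arc/{path}?{urlencode(query)}"


def read_member(blob, offset):
    """Decompress the single gzip member that starts at `offset` in `blob`."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header; the decompressor
    # stops at the end of that member and ignores the members after it.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    return decompressor.decompress(blob[offset:])


request = arc_url(
    "http://server1.university.edu:8080",
    "processdata/webcrawls/2004crawl",
    "/oct2004/35/crawlfile.arc.gz",
    offset=6789,
    contentonly=True,
)
print(request)

# Local stand-in for an arc file: two entries, each gzip'd separately.
first = gzip.compress(b"entry one\n")
second = gzip.compress(b"entry two\n")
blob = first + second

# The second entry begins where the first compressed member ends.
print(read_member(blob, len(first)))
```

The offset counts bytes in the compressed file, so it must point at the start of a gzip member; an offset that lands in the middle of a member cannot be decompressed on its own.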