Downloading web resources with http.stream - basics
From MorphOS Library
Latest revision as of 21:02, 17 February 2019
Grzegorz Kraszewski
Introduction
The http.stream class is one of the Reggae stream classes, in other words a data source. In a chain of Reggae objects, an http.stream instance is always the first object and has only one port, an output. An http.stream object may also be used standalone, not connected to anything, just to retrieve any data resource reachable via the HTTP protocol, in particular through its GET and POST requests. From this point of view, http.stream is simply an embeddable HTTP/1.1 client with a simple yet powerful API. A brief list of its features is given below:
- Socket API encapsulation. http.stream completely isolates the application (and its programmer) from bsdsocket.library and the TCP/IP stack. Only a very basic knowledge of TCP/IP is needed to use http.stream successfully.
- Unlike bsdsocket.library base instances, http.stream objects may be shared between processes (with the only exception that an object must be disposed by the process which created it).
- The class has a built-in parser of HTTP response headers.
- The class also has an easy-to-use HTTP request header builder, so custom fields may be added to the header.
- HTTP proxies are supported.
- The class supports chunked transfer and media streaming over HTTP.
- Optional user agent spoofing is possible.
- When connecting, HTTP redirections may be followed automatically.
- The class is able to handle streams longer than 4 GB.
- Easy protocol debugging via MediaLogger.
The class has some disadvantages, however. Some of them may be removed in future versions:
- No support for HTTPS (HTTP Secure) below v52.
- No support for persistent connections.
- No support for HTTP compression.
- Making the connection, sending the request and receiving the response header are all done in the constructor, so they are synchronous to the application. Any network delay in the constructor blocks the application until a timeout or another error occurs. This can be worked around by moving all network operations to a subprocess.
Minimal example
If we skip all error handling, the whole process of downloading data via the HTTP protocol reduces to three lines of code:
#define DATA_LENGTH 7465 /* just example value */
UBYTE buffer[DATA_LENGTH]; /* place for data */
Object *http;
http = NewObject(NULL, "http.stream", MMA_StreamName, "www.morphzone.org", TAG_END);
DoMethod(http, MMM_Pull, 0, buffer, DATA_LENGTH);
DisposeObject(http);
We assume here that the http.stream class has been loaded previously with OpenLibrary() (see Opening and closing individual classes). The code downloads the first 7465 bytes of the MorphZone main page (HTML code), assuming there is no error. This assumption is rather risky, because a network operation can fail for numerous reasons. If object creation fails, the code calls a method on a NULL pointer and later disposes NULL, which can even crash the application. For this reason http.stream offers a few ways of handling errors. They will be discussed later; for now, minimal error handling means checking the NewObject() result against NULL. This is done in a simple example (http://krashan.ppa.pl/reggae/library/http_simple.c) which downloads the first 1000 bytes of a resource specified on the command line and dumps them to the console. Note that using this program for binary resources (like images) may result in rather weird output... I recommend running this example along with MediaLogger, to learn the protocol debugging features of http.stream.
Length of data
The usefulness of the above example is limited. It downloads only a predefined amount of data (or less, if the resource turns out to be shorter). Usually we want to download all the data, which means its length has to be obtained somehow. A few scenarios are possible:
The length of data is known before downloading
This is the easiest but also the rarest case. It can be handled exactly as in the example from the previous section: a statically sized buffer and a single MMM_Pull() call.
The server sends a static file
In this case the server knows the size and passes it in the response header (Content-Length field). The http.stream object extracts it automatically, and the data length can then be obtained by reading the MMA_StreamLength attribute. The length is therefore known before downloading starts, so a buffer may be allocated dynamically. The attribute is 64-bit, so it should be read as follows:
QUAD length;
length = MediaGetPort64(http, 0, MMA_StreamLength);
This example (http://krashan.ppa.pl/reggae/library/http_static.c) shows MMA_StreamLength usage. It creates the object, asks for the data length, allocates a buffer, downloads the data into the buffer and finally stores the buffer in a file.
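Since the reported length is a 64-bit value and may also be 0 (unknown), it is worth validating it before allocating a buffer. Below is a minimal sketch of such a check in portable C. It is not taken from the linked example: int64_t stands in for QUAD, and both the function name buffer_for_length() and the MAX_IN_MEMORY limit are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical upper bound for a single in-memory download (64 MiB).
   It is deliberately far below SIZE_MAX on 32-bit and 64-bit systems. */
#define MAX_IN_MEMORY ((int64_t)64 * 1024 * 1024)

/*
 * Turns a reported stream length into a buffer.
 * 'length' models the 64-bit value read from MMA_StreamLength.
 * Returns a buffer of exactly 'length' bytes and stores its size in *size.
 * Returns NULL (and stores 0) when the length is unknown (0), negative,
 * above MAX_IN_MEMORY, or when malloc() fails. In all of these cases the
 * caller should fall back to downloading in blocks.
 */
void *buffer_for_length(int64_t length, size_t *size)
{
    void *buffer;

    assert(size != NULL);
    *size = 0;

    if (length <= 0) return NULL;             /* unknown or invalid length */
    if (length > MAX_IN_MEMORY) return NULL;  /* too big for one buffer */

    buffer = malloc((size_t)length);
    if (buffer != NULL) *size = (size_t)length;
    return buffer;
}
```

With Reggae, the value passed as 'length' would be the result of the MediaGetPort64() call shown above. A NULL result does not have to be treated as a fatal error, it only means that the single-buffer approach is not applicable to this resource.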
The server sends dynamically generated data
Such data are usually generated by a server-side script, written in PHP or another language. In this case the server does not know the length a priori, so it switches to HTTP chunked transfer mode. The http.stream object handles this automatically and reports 0 as MMA_StreamLength, which means that the length is unknown. The only way to process such data is to download it in blocks in a loop, until the object reports the MMERR_END_OF_DATA error code. The loop code may look like this:
LONG chunk, error = 0;

while (!error)
{
    /* Request the next block; 'chunk' is the number of bytes received. */
    chunk = DoMethod(http, MMM_Pull, 0, buffer, BUFFER_SIZE);

    /* Do something with 'chunk' bytes of data in 'buffer'. */

    /* A short block means either the end of data or a failure. */
    if (chunk < BUFFER_SIZE)
    {
        if (MediaGetPort(http, 0, MMA_ErrorCode) == MMERR_END_OF_DATA)
        {
            break;     /* downloading finished */
        }
        else
        {
            error = 1; /* downloading failed */
        }
    }
}
The same loop, enhanced with progress and error reporting, is used in the complete example (http://krashan.ppa.pl/reggae/library/http_dynamic.c), which is a very simple, universal, Reggae-based HTTP downloader application. Interestingly, it deals properly with data longer than 4 GB, provided that the filesystem holding the destination file is 64-bit.
It is important to note that RFC 2616 (http://www.ietf.org/rfc/rfc2616.txt), the HTTP specification, does not require static files to be served without chunked transfer. Conversely, the server is not forced to use chunked transfer for dynamically generated content. The assumption that a server will not use chunks for a file just because the file is static may therefore fail. The safe way is to always use the download loop and treat MMA_StreamLength as a hint only.
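The "length as a hint" approach can be sketched in portable C as follows. This is only a model under stated assumptions, not Reggae code: the Source structure is a hypothetical stand-in for an http.stream object, where pull() plays the role of MMM_Pull and at_end() plays the role of comparing MMA_ErrorCode with MMERR_END_OF_DATA. MemoryState with its two functions is a test source that serves bytes from memory. The hint only pre-sizes the buffer, and the loop always runs until the source reports the end of data.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE   4096
#define MAX_PREALLOC ((int64_t)16 * 1024 * 1024)  /* hypothetical hint limit */

/* Hypothetical stand-in for an http.stream object. */
typedef struct Source {
    size_t (*pull)(struct Source *src, void *dest, size_t max); /* like MMM_Pull */
    int (*at_end)(struct Source *src);  /* nonzero: short read was end of data */
    void *state;
} Source;

/* Test source serving bytes from memory; 'broken' simulates a network error. */
typedef struct {
    const unsigned char *bytes;
    size_t length, position;
    int broken;
} MemoryState;

size_t memory_pull(Source *src, void *dest, size_t max)
{
    MemoryState *m = src->state;
    size_t left = m->length - m->position;
    size_t n = (left < max) ? left : max;

    memcpy(dest, m->bytes + m->position, n);
    m->position += n;
    return n;
}

int memory_at_end(Source *src)
{
    MemoryState *m = src->state;
    return !m->broken && m->position == m->length;
}

/* Downloads everything from 'src'. 'hint' models MMA_StreamLength and is used
   only to pre-size the buffer. Returns a malloc()ed buffer and stores the
   number of bytes in *total; returns NULL on error or lack of memory. */
void *download_all(Source *src, int64_t hint, size_t *total)
{
    size_t capacity = BLOCK_SIZE, used = 0;
    unsigned char *data;

    assert(src != NULL && total != NULL);
    *total = 0;
    if (hint > 0 && hint <= MAX_PREALLOC) capacity = (size_t)hint + BLOCK_SIZE;
    if ((data = malloc(capacity)) == NULL) return NULL;

    for (;;)
    {
        size_t got;

        if (capacity - used < BLOCK_SIZE)   /* hint too small or absent: grow */
        {
            size_t bigger = capacity * 2 + BLOCK_SIZE;
            unsigned char *grown = realloc(data, bigger);

            if (grown == NULL) { free(data); return NULL; }
            data = grown;
            capacity = bigger;
        }

        got = src->pull(src, data + used, BLOCK_SIZE);
        used += got;

        if (got < BLOCK_SIZE)               /* short block: end or failure */
        {
            if (src->at_end(src)) break;
            free(data);
            return NULL;
        }
    }

    *total = used;
    return data;
}
```

A wrong or missing hint costs only some extra reallocations, while the result stays correct, which is the reason for not trusting the reported length. For resources larger than available memory, the same loop would write each block to a file instead of accumulating it.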