Wednesday, March 25, 2009

PHP Curl and ASP.NET

The main reason I created this blog is to document my efforts in using PHP in unique ways. PHP's Curl module has been covered by several blogs and articles, but information on its use against ASP.NET sites was a little hard to come by. I hope you pick up something here that helps you out.

Our company is actually quite automated. We have many databases: Oracle 10g, SQL Server 2005, SSAS 2008, Teradata, and so on. There are several ways to extract data from them. Some can be accessed through a custom desktop application, others through VBA code embedded in an Excel file (such as OLAP). Most, however, are accessed through a web-based GUI. Unfortunately for me as a PHP noob, most of these sites use ASP.NET, which makes automating data extraction (also called scraping) from them rather difficult.

You may ask why I bother automating the website interaction when I could go directly to the database tables behind it. Well... because I don't have direct access to the database. Our IT department deliberately allows data gathering only through (in this case) the website, to prevent people from bogging down the database with too many queries. So I really had no choice. Besides, even if I did have access, I probably would not be able to understand the structure of the database, since that information is controlled as well.

Now, if the site used the GET method of form submission, I would probably be able to get by with file_get_contents. But the site uses the POST method, so I went with Curl. One tool I am truly grateful for is Wireshark, a network protocol analyzer for Unix and Windows that lets me see all the nitty-gritty details of the HTTP transactions my web browser creates as I interact with the website. By logging the whole session, I can dissect the POST transactions to get at the exact POST variables being sent. These POST variables are then supplied to Curl to simulate the exact operation performed by the web browser. In this case, what is being simulated is a PostBack operation triggered by selecting an option in a drop-down list box.

The PostBack JavaScript code looks like this:
function __doPostBack(eventTarget, eventArgument) {
    var theform;
    if (window.navigator.appName.toLowerCase().indexOf("microsoft") > -1) {
        theform = document.Form1;
    }
    else {
        theform = document.forms["Form1"];
    }
    theform.__EVENTTARGET.value = eventTarget.split("$").join(":");
    theform.__EVENTARGUMENT.value = eventArgument;
    theform.submit();
}
For those unfamiliar, what this does is generate a form POST transaction just like what would happen if you clicked on a form Submit button.
This function is called every time you select an option in a drop-down list box:
HTML Code:
<select name="ddl_name" onchange="__doPostBack('ddl_name','')" language="javascript" id="ddl_name" >
<option value="- Choose Name -">- Choose Name -</option>
<option value="Adam">Adam</option>
<option value="Nathan">Nathan</option>
<option value="George">George</option>
</select>
In a normal POST request, the POST would contain, at the minimum, the Select form element name (for this example it would be ddl_name) and corresponding Option value you selected (for example Adam). But ASP.NET automatically generates some hidden form variables that need to be included in the POST:
<input name="__EVENTTARGET" value="" type="hidden">
<input name="__EVENTARGUMENT" value="" type="hidden">
<input name="__VIEWSTATE" value="dDwyMDE1Nz..." type="hidden">

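The __EVENTTARGET and __EVENTARGUMENT values are pretty straightforward: you just need to read the __doPostBack function. __EVENTTARGET is the name of the control that triggered the PostBack (with any "$" replaced by ":"), and __EVENTARGUMENT is the second argument, which is an empty string here. The last hidden variable, __VIEWSTATE, is more interesting. ViewState is a feature built into ASP.NET and is defined by Scott Mitchell as: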
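To make the __EVENTTARGET rule concrete, here is a minimal PHP sketch of the same "$" to ":" substitution that the __doPostBack function performs. The helper name postBackEventTarget and the nested control name are made up for illustration; only ddl_name comes from the page above.

```php
<?php
// Hypothetical helper: mirrors the eventTarget.split("$").join(":") line
// of the page's __doPostBack function.
function postBackEventTarget(string $controlName): string
{
    return str_replace('$', ':', $controlName);
}

// A plain control name passes through unchanged.
echo postBackEventTarget('ddl_name'), "\n";
// A nested control name (invented for this example) has "$" swapped for ":".
echo postBackEventTarget('panel1$ddl_name'), "\n";
```

Single-quoted strings are used on purpose so that PHP does not try to interpolate the "$" as a variable.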
"Microsoft® ASP.NET view state, in a nutshell, is the technique used by an ASP.NET Web page to persist changes to the state of a Web Form across postbacks."
Basically, it is a serialized representation of the form controls and their state at the time the page was sent out by the Web server. You need to send this exact value back in the POST request or the Web server will reject it.

So to complete 1 full cycle you will need to undertake a 3-step process:
  1. Retrieve the initial page
  2. Parse the page for the ViewState value
  3. Send a POST request with all the needed form variables including the ViewState
A fourth step would be to read back the response to the POST request for further processing (data collection), but this can be viewed as part of the 3rd step.
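The steps above can be sketched as a single function. This is only an outline under some assumptions: runPostBackCycle and the $fetch callable are hypothetical names, the example URL is a placeholder, and the demo uses a stub instead of a real HTTP request so it runs without a network. In real use, $fetch would wrap Curl.

```php
<?php
/**
 * Sketch of one PostBack cycle. $fetch is a hypothetical callable of the
 * form fn(string $url, ?string $postBody): string that returns the
 * response body (a GET when $postBody is null, a POST otherwise).
 */
function runPostBackCycle(callable $fetch, string $url, string $target, array $fields): string
{
    // Step 1: retrieve the initial page.
    $page = $fetch($url, null);

    // Step 2: parse the page for the ViewState value.
    if (!preg_match('/name="__VIEWSTATE"[^>]*value="([^"]*)"/', $page, $m)) {
        throw new RuntimeException('No __VIEWSTATE field found in the page');
    }

    // Step 3: POST the hidden fields together with the form's own fields.
    // http_build_query URL-encodes every value, including the ViewState.
    $body = http_build_query(array_merge([
        '__EVENTTARGET'   => $target,
        '__EVENTARGUMENT' => '',
        '__VIEWSTATE'     => $m[1],
    ], $fields), '', '&');

    // The response to the POST is what gets parsed for data afterwards.
    return $fetch($url, $body);
}

// Demo with a stub fetcher: it serves a canned page on GET and echoes the
// POST body back, so the request that would be sent can be inspected.
$stub = function (string $url, ?string $postBody): string {
    return $postBody === null
        ? '<input name="__VIEWSTATE" value="dDwx+A==" type="hidden">'
        : $postBody;
};
echo runPostBackCycle($stub, 'http://example.invalid/page.aspx', 'ddl_name', ['ddl_name' => 'Adam']), "\n";
```

Passing the fetcher in as a callable keeps the cycle logic separate from the transport, which also makes it easy to try out against saved pages.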

First, to retrieve the initial page you could probably get by with a simple file_get_contents. Try it, but if it doesn't work you will probably need the Curl extension (make sure it is enabled in your php.ini), which has a few tricks up its sleeve. The site I was interacting with required cookies, so I enabled cookies in my Curl call. Another feature of Curl is the ability to perform user agent spoofing. For more secure sites, you might also need to specify a username and password.
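As a rough illustration of those Curl features, the sketch below assembles an options array with cookies, a custom user agent, and optional credentials. The helper name initialRequestOptions, the cookie file path, and the user agent string are placeholders, and the credentials part assumes the site uses HTTP authentication rather than a login form.

```php
<?php
/**
 * Hypothetical helper that assembles Curl options for the initial GET.
 * All argument values used below are placeholders.
 */
function initialRequestOptions(string $cookieFile, string $userAgent, ?string $userPwd = null): array
{
    $options = [
        CURLOPT_RETURNTRANSFER => true,        // return the page as a string
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects
        CURLOPT_COOKIEJAR      => $cookieFile, // save received cookies to this file
        CURLOPT_COOKIEFILE     => $cookieFile, // send cookies stored in this file
        CURLOPT_USERAGENT      => $userAgent,  // user agent string sent to the server
    ];
    if ($userPwd !== null) {
        $options[CURLOPT_HTTPAUTH] = CURLAUTH_ANY; // let Curl pick an auth scheme the server offers
        $options[CURLOPT_USERPWD]  = $userPwd;     // "username:password"
    }
    return $options;
}

$options = initialRequestOptions(
    sys_get_temp_dir() . '/cookies.txt',
    'Mozilla/5.0 (compatible; example)'
);
// The array is then applied to a handle in the usual way:
// $ch = curl_init($url);
// curl_setopt_array($ch, $options);
// $result = curl_exec($ch);
```

Using the same file for CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE lets the cookies set by the initial GET be sent back with the later POST.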

For the 2nd part, you can probably use the DOM extension or go with the simpler RegEx approach. Below is an example of how I did it:
$result = curl_exec($ch);
preg_match_all("/name=\"__VIEWSTATE\" value=\"(.*?)\"/", $result, $arr_viewstate);
$viewstate = urlencode($arr_viewstate[1][0]);
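For comparison, here is a sketch of the DOM extension route mentioned above. It collects every hidden input into an array, so it does not depend on the order of the name and value attributes the way the RegEx does. The helper name extractHiddenFields is hypothetical, and the sample markup reuses the truncated ViewState placeholder from earlier in this post.

```php
<?php
/**
 * Hypothetical helper: collect every hidden input on a page into a
 * name => value array using the DOM extension.
 */
function extractHiddenFields(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is often malformed; keep parser warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

$sample = '<form name="Form1">'
    . '<input name="__EVENTTARGET" value="" type="hidden">'
    . '<input name="__VIEWSTATE" value="dDwyMDE1Nz..." type="hidden">'
    . '</form>';
$fields = extractHiddenFields($sample);
echo $fields['__VIEWSTATE'], "\n";
```

The values come back raw, so they still need to be URL-encoded before being placed in the POST body.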
Lastly, you plug in the values for the POST variables in the Curl options array like so:
$options = array(
    CURLOPT_RETURNTRANSFER => true, // return web page
    CURLOPT_HEADER => false, // don't return headers
    CURLOPT_FOLLOWLOCATION => true, // follow redirects
    CURLOPT_ENCODING => "", // handle all encodings
    CURLOPT_USERAGENT => "spider", // who am i
    CURLOPT_AUTOREFERER => true, // set referer on redirect
    CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
    CURLOPT_TIMEOUT => 120, // timeout on response
    CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => '__EVENTTARGET='.urlencode('ddl_name')
        .'&__EVENTARGUMENT='.urlencode('')
        .'&__VIEWSTATE='.$viewstate
        .'&ddl_name=Adam'
);
$ch = curl_init($url);
curl_setopt_array($ch, $options);
$result = curl_exec($ch);
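As an alternative to concatenating the POST string by hand, PHP's http_build_query can assemble and URL-encode it in one go. This is just a sketch: the ViewState below is a made-up value chosen to contain the characters ("+", "/", "=") that must be escaped, and it has to be the raw value from the page, not one that has already been through urlencode, or it would be encoded twice.

```php
<?php
// Made-up raw ViewState; in practice this is the value parsed from the page.
$rawViewstate = 'AB+C/D==';

$postFields = http_build_query([
    '__EVENTTARGET'   => 'ddl_name',
    '__EVENTARGUMENT' => '',
    '__VIEWSTATE'     => $rawViewstate,
    'ddl_name'        => 'Adam',
], '', '&');

// The resulting string can be used as the CURLOPT_POSTFIELDS value.
echo $postFields, "\n";
```

Passing '&' explicitly as the separator avoids surprises if the arg_separator.output setting in php.ini has been changed.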
That's it! You can now parse the data you need from the $result variable. I used the DOM extension for this since it is more tolerant of malformed HTML markup than RegEx.
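Here is a minimal sketch of that last parsing step with the DOM extension. The real page layout is not described in this post, so the table structure, the helper name extractTableRows, and the sample response are all assumptions for illustration. The sample deliberately leaves its td and tr tags unclosed to show the parser coping with sloppy markup.

```php
<?php
/**
 * Hypothetical helper: return the text of every table cell, row by row.
 */
function extractTableRows(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep warnings about bad markup quiet
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//tr') as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        if ($cells !== []) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

// Made-up response with unclosed <td> and <tr> tags.
$result = '<table><tr><td>Adam<td>42<tr><td>Nathan<td>17</table>';
print_r(extractTableRows($result));
```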