Wednesday, March 25, 2009

PHP Curl and ASP.NET

The main reason I created this blog is to document my efforts in using PHP in unique ways. I think PHP's Curl module has been touched on by several blogs and articles. But info on its use against ASP.NET sites was a little hard to come by. So I hope you pick up something here that could help you out.

Our company is actually quite automated. We have many databases: Oracle 10g, SQL Server 2005, SSAS 2008, Teradata, etc. We have several ways to extract data from them. Some can be accessed through a custom desktop application. Others go through VBA code embedded in an Excel file (such as OLAP). Lastly, most use a web-based GUI. Unfortunately for me as a PHP noob, most of these sites use ASP.NET. This makes automating data extraction from them (also called scraping) rather difficult, if not impossible.

You may ask why I bother automating the website interaction to extract data when I could go directly to the tables of the database behind it. Well... because I don't have direct access to the database. Our IT department deliberately only allows data gathering through (in this case) the website. They do not allow direct access, to prevent people from bogging down the database with too many queries. So I really had no choice. Secondly, even if I did have access I probably would not be able to understand the structure of the database, since that information is controlled as well.

Now, if the site used the GET method of form submission then I would probably be able to get by using file_get_contents. But the site uses the POST method and needs cookies, so Curl is the more practical way to go. One tool I am truly grateful for is Wireshark. It is a network protocol analyzer for Unix and Windows that lets me see all the nitty-gritty details of the HTTP transactions my web browser creates as I interact with the website. By logging the whole session, I am able to dissect the POST transactions to get at the exact POST variables being sent. These POST variables are then supplied to Curl to simulate the exact operation done by the web browser. In this case what is being simulated is a PostBack operation triggered by selecting an option in a drop-down list box.
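As a small illustration of dissecting a captured POST, here is a minimal sketch that splits a request body into its variables. The body string is a made-up stand-in for what you would copy out of a Wireshark capture, and the ViewState in it is only a short placeholder.

```php
<?php
// Hypothetical POST body as it might be copied from a Wireshark capture.
// The values are invented for illustration; a real ViewState is much longer.
$captured = '__EVENTTARGET=ddl_name&__EVENTARGUMENT=&__VIEWSTATE=AbC%2B%2Fw%3D%3D&ddl_name=Adam';

// parse_str() splits the body into name/value pairs and URL-decodes the values.
parse_str($captured, $post_vars);

foreach ($post_vars as $name => $value) {
    echo $name, ' => ', $value, PHP_EOL;
}
```

Seeing the decoded names and values side by side makes it easier to tell which ones are fixed and which ones (like the ViewState) have to be read from the page each time.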

The PostBack JavaScript code looks like this:
function __doPostBack(eventTarget, eventArgument) {
    var theform;
    if (window.navigator.appName.toLowerCase().indexOf("microsoft") > -1) {
        theform = document.Form1;
    }
    else {
        theform = document.forms["Form1"];
    }
    theform.__EVENTTARGET.value = eventTarget.split("$").join(":");
    theform.__EVENTARGUMENT.value = eventArgument;
    theform.submit();
}
For those unfamiliar, what this does is generate a form POST transaction just like what would happen if you clicked on a form Submit button.
This function is called every time you select an option in a drop-down list box:
HTML Code:
<select name="ddl_name" onchange="__doPostBack('ddl_name','')" language="javascript" id="ddl_name" >
<option value="- Choose Name -">- Choose Name -</option>
<option value="Adam">Adam</option>
<option value="Nathan">Nathan</option>
<option value="George">George</option>
</select>
In a normal POST request, the POST would contain, at a minimum, the name of the Select form element (for this example, ddl_name) and the value of the Option you selected (for example, Adam). But ASP.NET automatically generates some hidden form variables that also need to be included in the POST:
<input name="__EVENTTARGET" value="" type="hidden">
<input name="__EVENTARGUMENT" value="" type="hidden">
<input name="__VIEWSTATE" value="dDwyMDE1Nz..." type="hidden">
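The first two hidden fields are filled in by __doPostBack just before the form is submitted. A rough PHP mirror of that logic might look like the sketch below; postback_fields is a hypothetical helper name, and the nested control name in the second call is invented.

```php
<?php
// Hypothetical helper: mirrors what __doPostBack() writes into the two
// hidden fields before it submits the form.
function postback_fields(string $eventTarget, string $eventArgument = ''): array
{
    return [
        // Same effect as eventTarget.split("$").join(":") in the JavaScript.
        '__EVENTTARGET'   => str_replace('$', ':', $eventTarget),
        '__EVENTARGUMENT' => $eventArgument,
    ];
}

print_r(postback_fields('ddl_name'));          // the drop-down from the example
print_r(postback_fields('grid$ctl01$button')); // made-up nested control name
```

For the drop-down in this post the target contains no "$", so it passes through unchanged.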

The __EVENTTARGET and __EVENTARGUMENT values are pretty straightforward. You just need to read the __doPostBack function: they are the two arguments passed in by the control's onchange handler (here 'ddl_name' and an empty string). The last hidden variable, __VIEWSTATE, is more interesting. It carries ViewState, a feature built into ASP.NET that Scott Mitchell defines as:
"Microsoft® ASP.NET view state, in a nutshell, is the technique used by an ASP.NET Web page to persist changes to the state of a Web Form across postbacks."
Basically, it is a serialized representation of all the form variables and their states when the page was sent out from the Web server. You have to send this exact value back in the POST request or the Web server will reject it.

So to complete 1 full cycle you will need to undertake a 3-step process:
  1. Retrieve the initial page
  2. Parse the page for the ViewState value
  3. Send a POST request with all the needed form variables including the ViewState
A fourth step would be to read back the response of the POST request for further processing (data collection). But this can be viewed as part of the 3rd step.
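The cycle above can be sketched as one function. This is only an outline under a few assumptions: run_postback_cycle and curl_fetch are hypothetical names, the fetcher is passed in as a callable so the flow can be tried without a live server, and cookie handling is left out.

```php
<?php
// Hypothetical sketch of the three steps. $fetch is any callable taking
// (url, body-or-null) and returning the response HTML.
function run_postback_cycle(callable $fetch, string $url, array $fields): string
{
    // Step 1: retrieve the initial page (no POST body).
    $page = $fetch($url, null);

    // Step 2: parse the page for the ViewState value.
    if (!preg_match('/name="__VIEWSTATE"[^>]*value="([^"]*)"/', $page, $m)) {
        throw new RuntimeException('No __VIEWSTATE field found in the page');
    }

    // Step 3: POST all form variables, including the ViewState.
    $fields['__VIEWSTATE'] = $m[1];
    return $fetch($url, http_build_query($fields, '', '&'));
}

// One possible fetcher built on Curl (assumes the Curl extension is enabled).
function curl_fetch(string $url, ?string $body): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    if ($body !== null) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
    }
    $result = curl_exec($ch);
    if ($result === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    return $result;
}
```

A call would then look something like run_postback_cycle('curl_fetch', $url, array('__EVENTTARGET' => 'ddl_name', '__EVENTARGUMENT' => '', 'ddl_name' => 'Adam')), and the return value is the response you read back for data collection.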

First, to retrieve the initial page you could probably get by with a simple file_get_contents. Try it, but if it doesn't work then you will probably need the Curl extension (make sure that it is enabled in your php.ini), which has a few tricks up its sleeve. The site I was interacting with required cookies, so I went ahead and enabled cookies in my Curl options. Another feature of Curl is the ability to perform user agent spoofing. For more secure sites, you might need to specify a username and password.
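Here is a minimal sketch of what those Curl features look like as options. The helper name session_options, the cookie file path, and the user agent string are placeholders of my choosing, and the username/password part assumes the site uses plain HTTP authentication rather than a login form.

```php
<?php
// Hypothetical helper: builds Curl options for a site that needs cookies,
// a browser-like user agent and, optionally, HTTP authentication.
function session_options(string $cookieFile, ?string $user = null, ?string $pass = null): array
{
    $options = [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are saved to this file
        CURLOPT_COOKIEFILE     => $cookieFile, // and read back on later requests
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; example)', // placeholder
    ];
    if ($user !== null) {
        // Credentials in "user:password" form.
        $options[CURLOPT_USERPWD] = $user . ':' . $pass;
    }
    return $options;
}
```

The result can be handed to curl_setopt_array() for both the initial GET and the later POST, so that both requests share the same cookie file.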

For the 2nd part, you can probably use the DOM extension or go with the simpler RegEx approach. Below is an example of how I did it:
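$result = curl_exec($ch); // HTML of the initial page
// Capture the value attribute of the hidden __VIEWSTATE input
preg_match_all("/name=\"__VIEWSTATE\" value=\"(.*?)\"/", $result, $arr_viewstate);
// URL-encode it, since the Base64 text can contain +, / and = characters
$viewstate = urlencode($arr_viewstate[1][0]);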
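For comparison, the DOM extension route might look roughly like this. hidden_field is a hypothetical helper, and the HTML string is a shortened stand-in for the real page.

```php
<?php
// Hypothetical helper: returns the value of an input by name,
// or null when the page has no such field.
function hidden_field(string $html, string $name): ?string
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//input[@name="' . $name . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->getAttribute('value');
}

// Shortened stand-in for the page; the ViewState value is a placeholder.
$html = '<form><input name="__VIEWSTATE" value="AbC+/w==" type="hidden"></form>';
echo hidden_field($html, '__VIEWSTATE'), PHP_EOL;
```

Unlike the RegEx, this does not depend on the order of the attributes inside the input tag. The returned value is raw, so it still needs urlencode() before being put into a hand-built POST string.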
Lastly, you plug in the values for the POST variables in the Curl options array like so:
$options = array(
    CURLOPT_RETURNTRANSFER => true,     // return web page
    CURLOPT_HEADER         => false,    // don't return headers
    CURLOPT_FOLLOWLOCATION => true,     // follow redirects
    CURLOPT_ENCODING       => "",       // handle all encodings
    CURLOPT_USERAGENT      => "spider", // who am i
    CURLOPT_AUTOREFERER    => true,     // set referer on redirect
    CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
    CURLOPT_TIMEOUT        => 120,      // timeout on response
    CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => '__EVENTTARGET='.urlencode('ddl_name').'&__EVENTARGUMENT='.urlencode('').'&__VIEWSTATE='.$viewstate.'&ddl_name=Adam'
);
$ch = curl_init( $url );
curl_setopt_array( $ch, $options );
$result = curl_exec( $ch );
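The long CURLOPT_POSTFIELDS string can also be assembled with http_build_query, which does the URL-encoding for you. This is just an alternative sketch; the ViewState below is a made-up placeholder for the raw (not yet urlencoded) value taken from the page.

```php
<?php
// Raw field values. The ViewState is a placeholder standing in for the
// value parsed from the initial page, before any urlencode() call.
$fields = [
    '__EVENTTARGET'   => 'ddl_name',
    '__EVENTARGUMENT' => '',
    '__VIEWSTATE'     => 'AbC+/w==',
    'ddl_name'        => 'Adam',
];

// http_build_query() URL-encodes every value and joins the pairs with "&".
$post_body = http_build_query($fields, '', '&');
echo $post_body, PHP_EOL;
// prints: __EVENTTARGET=ddl_name&__EVENTARGUMENT=&__VIEWSTATE=AbC%2B%2Fw%3D%3D&ddl_name=Adam
```

If you go this way, pass the raw ViewState, not one that has already been through urlencode(), or it will be encoded twice.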
That's it! You can now parse the data you need out of the $result variable. I used the DOM extension for this since it is more tolerant of malformed HTML markup than RegEx.
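As a rough idea of what that parsing can look like, the sketch below pulls the cell text out of every table row. The function name table_rows and the sample table are invented for illustration; the real page layout will differ.

```php
<?php
// Hypothetical helper: returns the text of every <td> cell, grouped by row.
// Rows without <td> cells (such as header rows) are skipped.
function table_rows(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//tr') as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        if (count($cells) > 0) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

// Made-up response fragment standing in for $result.
$sample = '<table><tr><th>Name</th><th>Total</th></tr>'
        . '<tr><td>Adam</td><td> 42 </td></tr>'
        . '<tr><td>Nathan</td><td>17</td></tr></table>';
print_r(table_rows($sample));
```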

Thursday, March 19, 2009

PHP in the corporate world

It's been a LONG time since my last post so let's get on with it...

A new group was formed in our division tasked to create and implement new methodologies to improve employee efficiency and productivity. Although it was not my supervisor's original plan for me, I quickly proved myself as a very promising Automation Engineer and have since solidified my position as one. I guess I fell in love with programming all over again given the opportunity and having the unique background of knowing the operations and maintenance side of the group gave me an edge.

When I began thinking of creating an automation tool I was faced with the question: What usage model should I aim for? I could just make a series of VBA macros for Excel, but this would limit the scope to the particular workbook the code was embedded in. I could create a desktop application using VB/C#.NET. This would remove the dependency on Excel, but ensuring that the code ran on each and every laptop/desktop configuration is not something I would like to waste time on or am equipped to do. Being a high-tech company, we all have desktops/laptops connected to the company intranet, so I investigated the web application platform.

There are several advantages to creating web applications over the traditional desktop application. First, once you set it up on a web server, the application is available to all the thousands of employees worldwide 24x7. No need to install per computer because all you need is a web browser and you're up and running. Maintenance would be simpler, as I would only have to program on and maintain a single system. Then there is flexibility: I would have the freedom to choose and even switch what programming language to use on the server, and the end users could not care less since all they care about is what they see in the browser. Great!

Now, I had zero knowledge of programming for the Web. Even HTML was new to me. But Google is my best friend and he did not disappoint. I did some reading up and, aside from learning HTML, CSS and JavaScript, I determined that I had to choose between the two most popular web programming paths, PHP and ASP.NET. I read further and looked at the pros and cons. In the end, I picked PHP.

Why PHP, you may ask? Surprisingly, cost was not really a factor in my decision, even though it is a major selling point for PHP. We already have a Windows server available serving up a simple static group website, plus we have VERY close ties with MS so we have access to VS.NET for free (or you can use the SDK). What ASP.NET has going for it is its tight integration with all things Windows, which makes it the ideal choice in big companies that have invested heavily in MS due to the extensive use of MS Office.

This is where my personal agenda came in. I was not an IT person and there are probably close to a hundred highly experienced ASP.NET programmers already employed by our IT division. So how effective would I be as an ASP.NET newbie, when an experienced coder would probably take just a couple of hours to do what would take me a week? Might as well just let the pros do it. But if I invested my time in PHP I would have a unique competency and would be a more valuable asset (I hope this is true). Also, I would be able to capitalize on the differences between ASP.NET and PHP, where what would probably be more difficult or impossible (maybe not) to do in ASP.NET is easier or possible to do in PHP. Luckily for me, and as I would soon find out, there are great capabilities built into PHP that did set it apart. So I would be filling in the gaps that no one else could. Cool!

Then there is the long-term view to take into account. What would I like to do with my career once I leave this company? ASP.NET programmers would arguably be paid more than PHP ones in general. But PHP, being an open-source and low-cost development platform, really won me over. Small web sites (blogs) and big social networking sites like Facebook all run on PHP. So I would be catering to a larger customer base. I would also have the option of being unemployed. What!?! you may ask... Hold your horses. What I meant to say was more of a self-employed, consultant, or work-from-home kind of setup. I likey!

So, there you have it. The reason I chose to use PHP in a MS dominated corporate world. I believe it was the right choice and presents an interesting challenge. The challenge of finding unique uses of PHP to bring together various data sources and bind them together in a company's internal network.