Dec 11 2011

Swim Event Times W7P Development -- BeginGetRequest

Category: Administrator @ 07:22

The heart of this application is its ability to scrape a URL.  To do that we make asynchronous calls (BeginGetRequestStream, then BeginGetResponse) out to a URL in the hope that we eventually get a page's worth of data back.  We also need to bound each call with a timeout.

So let's talk about the functional setup and what the code has to handle.

To complicate things, the website that I'm hitting is an ASP.NET application that does not use cookies stored on the client.  It uses session state to track your position within the list of pages you have requested. The site also puts nothing in the query string, which makes things stickier since you cannot hit a desired page directly: you always have to start with the Search page and pump through the rest of the pages.  All of these little things added up to an annoying set of problems.

Purely from a user-experience standpoint, the site design is poor since it makes you re-enter the same information every time you visit (no cookies).  Maybe this is by design, but it does not lend itself to a good customer experience. There's also no mobile support, which means that using your phone to hit the site is a real boondoggle.

Now for the code.

 

To kick off any request to a URL you'll need to do it on a separate thread:

public void SendPost()
        {           
            // Create a background thread to run the web request
            Thread t = new Thread(new ThreadStart(SendPostThreadFunc));
            t.Name = "URLRequest_For_" + "TODO";
            t.IsBackground = true;
            t.Start();
        }

 

Next we need to keep the primed Request Stream since we need it on subsequent calls to the site.  So in this case we use BeginGetRequestStream:

void SendPostThreadFunc()
        {

            //test the network first
            if (online == false)
            {
                this.Dispatcher.BeginInvoke(() =>
                {
                    progressBar1.IsLoading = false;
                    MessageBox.Show("Network Disconnected.  Please try again when you have a good Network signal.");
                });
                return;
            }


            // Create the web request object
            try
            {
                HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(CookieColUri);

                //Trying to use the ThreadPool for timeout  -- waiting on code
                ThreadPool.QueueUserWorkItem(state =>
                                               {

                                                   webRequest.Method = "POST";
                                                   webRequest.ContentType = "application/x-www-form-urlencoded";
                                                   webRequest.CookieContainer = cookieJar;

                                                   // RequestState is a custom class to pass info
                                                   RequestState reqstate = new RequestState();
                                                   reqstate.Request = webRequest;
                                                   reqstate.Data = "passed data";

                                                   webRequest.BeginGetRequestStream(GetReqeustStreamCallback, reqstate);

                                               }
                                               );


            }


            catch (Exception ex)
            {
                //Debug.WriteLine(" --> BGRS3 Exception: " + ex.Message + ", Thread: " + Thread.CurrentThread.ManagedThreadId); 

                // notify your app of a problem here
                this.Dispatcher.BeginInvoke(() =>
                {
                    progressBar1.IsLoading = false;
                    MessageBox.Show("BGRS3: " + ex.Message);
                });

            }


        }
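The RequestState class is not shown in this post. Here is a minimal sketch, assuming it needs nothing beyond the three members the code in this post reads and writes: Request, Data, and AllDone (used further down for the timeout). The property types are inferred from how they are used, so treat this as a hypothetical reconstruction rather than the app's real class.

```csharp
using System.Net;
using System.Threading;

// Hypothetical reconstruction: only the members the code in this post touches.
public class RequestState
{
    // The in-flight request, so a callback can call EndGetRequestStream / EndGetResponse on it
    public HttpWebRequest Request { get; set; }

    // Free-form payload handed along to the callback ("passed data" in the post)
    public string Data { get; set; }

    // Signalled by the response callback; the waiting thread uses it to enforce the timeout
    public AutoResetEvent AllDone { get; set; }
}
```

Because the state object travels through IAsyncResult.AsyncState, each callback can get back to its own request without any shared fields.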

 Now we handle the callback from the BeginGetRequestStream (lots of error handling):

 

void GetReqeustStreamCallback(IAsyncResult asynchronousResult)
        {

            if (!asynchronousResult.IsCompleted)
                return;

            RequestState reqstate = null;

            try
            {
                // grab the custom state object
                reqstate = (RequestState)asynchronousResult.AsyncState;

                //Thread.Sleep(15000);  // uncomment this line to test the timeout condition

                HttpWebRequest webRequest = (HttpWebRequest)reqstate.Request;                

                // End the stream request operation
                Stream postStream = webRequest.EndGetRequestStream(asynchronousResult);

                // Create the post data
                string postData = "";
                for (int i = 0; i < paramNames.Count; i++)
                {

                    if (paramNames[i] == "POSTDATA")
                    {
                        postData = paramValues[i];
                        break;

                    }
                    else
                    {
                        // Parameter separator
                        if (i > 0)
                        {
                            postData += "&";
                        }

                        // Parameter data
                        postData += paramNames[i] + "=" + paramValues[i];
                    }
                }
                byte[] byteArray = Encoding.UTF8.GetBytes(postData);

                // Add the post data to the web request
                // Use the byte count, not the character count: they differ for non-ASCII text
                postStream.Write(byteArray, 0, byteArray.Length);
                postStream.Close();


                ThreadPool.QueueUserWorkItem(new WaitCallback(target =>
              {
                  try
                  { // you must have this try-catch here to handle exceptions from the callback                      

                      // RequestState is a custom class to pass info
                      RequestState reqstate2 = new RequestState();
                      reqstate2.Request = webRequest;
                      reqstate2.Data = "passed data";
                      reqstate2.AllDone = new AutoResetEvent(false);
                      
                      IAsyncResult result = (IAsyncResult)webRequest.BeginGetResponse(new AsyncCallback(GetResponseCallback), reqstate2);

                      bool waitOneResult = true;

                      if (!reqstate2.AllDone.WaitOne(DefaultTimeout))
                      {
                          waitOneResult = false;

                          if (webRequest != null)

                              webRequest.Abort();
                      }
                      
                  }
                  catch (WebException webExcp)
                  {                   

                      WebExceptionStatus status = webExcp.Status;
                      if (status == WebExceptionStatus.ProtocolError)
                      {
                          // Get HttpWebResponse so that you can check the HTTP status code.
                          HttpWebResponse httpResponse = (HttpWebResponse)webExcp.Response;                   

                          this.Dispatcher.BeginInvoke(() =>
                              {
                                  progressBar1.IsLoading = false;
                                  MessageBox.Show("Unable to reach site. Please try later! " + (int)httpResponse.StatusCode + " - "
                                 + httpResponse.StatusCode + ".");
                              });
                      }
                  }

                  catch (Exception ex)
                  { // you must handle the exception or it will be unhandled and crash your app
                      
                      // notify your app of a problem here
                      this.Dispatcher.BeginInvoke(() =>
                      {
                          progressBar1.IsLoading = false;
                          MessageBox.Show("BGR1: " + ex.Message);
                      });
                  }


              }
                                  ));
            }


            catch (WebException webExcp)
            {               
                WebExceptionStatus status = webExcp.Status;
                if (status == WebExceptionStatus.ProtocolError)
                {
                    // Get HttpWebResponse so that you can check the HTTP status code.
                    HttpWebResponse httpResponse = (HttpWebResponse)webExcp.Response;                    

                    this.Dispatcher.BeginInvoke(() =>
                        {
                            progressBar1.IsLoading = false;
                            MessageBox.Show("Unable to reach site. " + (int)httpResponse.StatusCode + " - "
                           + httpResponse.StatusCode + ".");
                        });
                }

            }
            catch (Exception ex)
            {
                // notify your app of a problem here
                this.Dispatcher.BeginInvoke(() =>
                    {
                        progressBar1.IsLoading = false;
                        MessageBox.Show("BGR3: " + ex.Message);
                    });             
            }            

        }
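One thing the loop in the callback does not do is escape the names and values before joining them with "&" and "=". If a value can itself contain "&", "=" or a space, the body needs escaping. Here is a minimal sketch, assuming the same parallel name/value lists as in the post. FormEncoder and BuildPostData are made-up names, and the raw "POSTDATA" override from the post is deliberately left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the original app.
public static class FormEncoder
{
    // Builds an application/x-www-form-urlencoded body from parallel name/value lists.
    // Uri.EscapeDataString writes a space as %20 rather than '+'.
    public static string BuildPostData(IList<string> names, IList<string> values)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < names.Count; i++)
        {
            // Parameter separator
            if (i > 0) sb.Append('&');

            // Parameter data, escaped so '&' and '=' inside a value cannot split the pair
            sb.Append(Uri.EscapeDataString(names[i]));
            sb.Append('=');
            sb.Append(Uri.EscapeDataString(values[i] ?? ""));
        }
        return sb.ToString();
    }
}
```

For example, BuildPostData(new[] { "last", "first" }, new[] { "Smith & Jones", "Ann" }) gives last=Smith%20%26%20Jones&first=Ann.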

 Now get the response from the website:

void GetResponseCallback(IAsyncResult asynchronousResult)
        {

            if (!asynchronousResult.IsCompleted)
                return;

            // grab the custom state object
            RequestState reqstate = (RequestState)asynchronousResult.AsyncState;

            //Thread.Sleep(50000);  // uncomment this line to test the timeout condition 50 seconds (timeout 45)

            try
            {
                HttpWebRequest webRequest = (HttpWebRequest)reqstate.Request;                

                // End the get response operation
                HttpWebResponse response = (HttpWebResponse)webRequest.EndGetResponse(asynchronousResult);               

                Stream streamResponse = response.GetResponseStream();
                StreamReader streamReader = new StreamReader(streamResponse);
                Response = streamReader.ReadToEnd();
                streamResponse.Close();
                streamReader.Close();
                response.Close();

                // Call the response callback
                if (callback != null)
                {
                    callback();
                }

            }
            catch (WebException webExcp)
            {
                // If you reach this point, an exception has been caught.           
                WebExceptionStatus status = webExcp.Status;
                if (status == WebExceptionStatus.ProtocolError)
                {
                    // Get HttpWebResponse so that you can check the HTTP status code.
                    HttpWebResponse httpResponse = (HttpWebResponse)webExcp.Response;                 

                    this.Dispatcher.BeginInvoke(() =>
                        {
                            progressBar1.IsLoading = false;
                            MessageBox.Show("Unable to reach site. " + (int)httpResponse.StatusCode + " - "
                           + httpResponse.StatusCode + ". Launching browser directly at site to show error!");

                        });

                    //Launcher for main page in which we got the error.
                    WebBrowserTask webBrowserTask = new WebBrowserTask();
                    webBrowserTask.URL = CookieColUri.ToString();
                    webBrowserTask.Show();

                    return;
                }
                else
                {
                    if (status == WebExceptionStatus.RequestCanceled)
                    { // abort from time-out
                        this.Dispatcher.BeginInvoke(() =>
                            {
                                progressBar1.IsLoading = false;
                                MessageBox.Show("Network Connection lost.  Please try when you have a good Network signal.");
                            });
                        return;
                    }
                    else
                    {
                        this.Dispatcher.BeginInvoke(() =>
                        {
                            progressBar1.IsLoading = false;
                            MessageBox.Show("Request lost.  Please try when you have a good Network signal.");
                        });
                        return;

                    }

                }
            }
            catch (Exception excp)
            {
                this.Dispatcher.BeginInvoke(() =>
                {
                    progressBar1.IsLoading = false;
                    MessageBox.Show("Request for Swimmer lost.  Please try when you have a good Network signal.");
                });
                return;
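            }
            finally
            {
                // Always signal the waiting thread, even on the error paths above that return early;
                // otherwise it sits out the full timeout and then aborts a request that has already failed.
                reqstate.AllDone.Set();
            }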

        } 

 

Now go ahead and use the HTML Agility Pack on the returned results (in the callback) to strip any data you want from that page.

Note:  You must use the timeout on these calls; otherwise you'll have a zillion crashes in your app.
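Stripped of the web request, the timeout technique used in this post boils down to three steps: start the work on another thread, wait on an event with a deadline, and clean up if the deadline passes. Here is a minimal sketch of just that pattern. TimeoutDemo, RunWithTimeout and the onTimeout hook are illustrative names; in the app the "work" is the BeginGetResponse round trip and the clean-up is webRequest.Abort().

```csharp
using System;
using System.Threading;

// Hypothetical demo of the wait-with-deadline pattern, not the app's actual code.
public static class TimeoutDemo
{
    // Runs 'work' on a thread-pool thread and waits up to timeoutMs for it to finish.
    // Returns true if the work finished in time, false if the wait timed out.
    public static bool RunWithTimeout(Action work, int timeoutMs, Action onTimeout)
    {
        var allDone = new AutoResetEvent(false);

        ThreadPool.QueueUserWorkItem(state =>
        {
            try { work(); }
            catch (Exception) { /* a real app would report this on the UI thread */ }
            finally { allDone.Set(); }   // always signal, success or failure
        });

        if (allDone.WaitOne(timeoutMs))
            return true;

        // Timed out: this is the point where the post calls webRequest.Abort()
        if (onTimeout != null) onTimeout();
        return false;
    }
}
```

The try/catch inside the queued work item matters for the same reason it does in the post: an exception that escapes a thread-pool callback is unhandled and takes the app down.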

The timeout code was provided by Dan Colasanti.  www.improvisoft.com/blog (I owe him many beers!)

 


Oct 2 2011

SwimEventTimes W7P Development -- HTML Agility Pack

Category: Administrator @ 08:13

When this app started out, a long time ago, I happened upon the HTML Agility Pack, or HAP.  It was originally written for .NET and used XPath to find the elements you need in the source HTML.  When it comes time to SCRAPE data from a page, you need a tool that is robust and almost carefree about the HTML structure it is given.  It does so much for you that without it I would never have attempted what I had in mind:

http://htmlagilitypack.codeplex.com/

Kudos to the authors, especially DarthObiwan, of this fantastic piece of work.

Basically it lets you define which tags you want from a page, then search for them and pull out that piece of the text.  The twist is that for HAP to work on the phone it had to work without XPath, since XPath was not supported in the OS 7.0 release on the phone.

So what took the place of XPath is LINQ.  This added yet another dimension to my learning, since I had never really used LINQ and had only recently started looking into it when I thought about converting a project from XSLT.  It takes time to make the mental switch from XPath to LINQ, but there was no other way.  Also, at that time, none of the code releases for HAP worked on the phone, but by getting the source and following the authors' comments on how to re-engineer the code I was able to get it to compile, and now it works like a charm.

So now let's talk about how we use HAP:

Here we have the entire HTML loaded up in htmDoc.

//let's determine what came back on the response

HtmlAgilityPack.HtmlDocument htmDoc = new HtmlAgilityPack.HtmlDocument();
htmDoc.LoadHtml(responseData);

Next you can start to look for specific items:

string pattern = @".*txtSearchLastName$"; 
var SearchPagenode = htmDoc.DocumentNode
                      .Descendants("input")
                      .FirstOrDefault(x => Regex.IsMatch(x.Id, pattern));


So now I can look at the element and get the id (FirstOrDefault returns null when nothing matches, so check first):

if (SearchPagenode != null)
{
    CTLID = SearchPagenode.Id;
}
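The regex is there because ASP.NET Web Forms typically prefixes a control's rendered id with the ids of its naming containers, so the id on the page ends with the name you're after rather than equalling it. Here is a small sketch of that suffix match using Regex alone, without HAP. IdMatcher and EndsWithControlName are made-up names, and the rendered ids in the usage note are invented examples, not ids taken from the real site.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper illustrating the suffix match used in the LINQ query.
public static class IdMatcher
{
    // True when a rendered id ends with the given control name.
    public static bool EndsWithControlName(string renderedId, string controlName)
    {
        // Regex.Escape keeps characters such as '.' or '$' in the name from acting as regex syntax
        string pattern = ".*" + Regex.Escape(controlName) + "$";
        return Regex.IsMatch(renderedId ?? "", pattern);
    }
}
```

With this, "ctl00_ContentPlaceHolder1_txtSearchLastName" matches "txtSearchLastName", while "ctl00_txtSearchLastNameHint" does not.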

You can also do things like pull the <a> tags out of the last <tr> of a table:

pattern = @".*dgPersonSearchResults$"; 
var links = htmDoc.DocumentNode.Descendants("table")
           .First(n => Regex.IsMatch(n.Id, pattern))
          .Elements("tr").Last().Descendants("a")
          .Select(x => x.GetAttributeValue("href", "")).ToArray();


You can go crazy with HAP and as your LINQ gets better you can go further and refine these queries.

HAP provides the foundation and performs the grunt work required to interrogate/parse an HTML source.



Sep 2 2011

W7P SwimEventTimes development - Fiddler2

Category: Features, Administrator @ 11:19

Back to the beginning.  When you need content and the content exists on other websites, you'll have to devise a way to acquire the data.  And I'm not talking about OData or some other nice structure of data lying around on the internet.  What I'm talking about is, for lack of a better term, "SCRAPING" data from other URLs.  Some would call this Web Scraping, but the term really covers a whole realm of techniques, not just ones used on the web.

Some believe that this type of development is fraught with problems since you're tying your app to a URL and that site's web design.  Just know these pitfalls up front and try to mitigate as many problems as possible.

Try to contact the URL's owners and get them on board with your development.  That way you'll at least have knowledge of upcoming changes that could cause side effects in your W7P app.

Disclaimer:  I'm a developer and not a copyright lawyer but be warned that some sites believe that they own the data and have the sole rights to that data.  It's the old "Creative" vs "Collection" argument. Collection of data is not creative but do your own copyright research.

Now let's talk about overall design.  Some designs call for an intermediate server to make the calls to the target URL, but I opted to make the Windows 7 Phone software smart enough to exist on its own and pull the data directly from the target.  This minimizes the points of failure: it either works from the W7P or it doesn't. It also lets the phone operate anonymously from the target URL's perspective.

Now the main tool that I quickly made use of is Fiddler2.  Fiddler2 allows you to look at the payload when making calls across the web.  This includes everything that is sent from your Request and the subsequent Response.

The site from which the data for this app will be scraped is an ASP.NET site.  I can't be sure, but it appears to use a Content Management System. I believe the CMS is Open Source, so I could have gone further and pulled apart its code, but I decided to spend the time on whether and how I could get access to the data across the web from the phone.

Other sites and their software platforms and how they implement cookies/session data etc. will change the makeup of how you access the pages but you need to start with Fiddler2.

I know IE9/Firefox have similar developer tools but when this application started up, Fiddler2 was my choice.

Get to know Fiddler2 and exactly which pages you need to hit.  Take screenshots of the flow that you'll need to capture, and write test scenarios that cover the flow of pages.

You and Fiddler2 will spend many long nights together.
