Friday, September 28, 2007

12. WebSpider Review

The goal of reviewing a classmate's WebSpider assignment is to improve our ability to read, write, and test Java systems.

1. Installation Review
This process should be fairly smooth since we all used the same IDE and the same tools to write and test our code. I downloaded my classmate's software and extracted it into my Eclipse workspace. The package contains two source code files (WebSpiderExample and TestWebSpider), a lib directory with checkstyle, emma, findbugs, and pmd subdirectories, and various build and documentation files. The package had all the necessary files, so I was able to import his project into Eclipse without any effort.

The command ant -f verify.build.xml returned no errors and completed successfully. Because I was a little suspicious, I decided to execute each build file individually, but no errors were found. However, Emma block coverage is only 83% and line coverage is only 85%.

I was able to build a jar file and execute his code with java -jar webspider-ivanw.jar -totallinks http://www.hackystat.org 100. Judging from the console output, some websites refused the connection. However, this did not interrupt execution, and 1585 links were found. At this time I cannot be certain whether this number is correct, since my code returns a slightly lower number of links (read on).

2. Code format and conventions review
After reviewing the code, I recommend the following formatting corrections:

FILE | LINES | VIOLATION | COMMENTS
WebSpiderExample.java | 46 | EJS 6 | Spelling ("of" twice)
WebSpiderExample.java | 58, 161 | EJS 33 | Keep comments and code in sync
WebSpiderExample.java | 71, 174, 183, * | EJS 7 | Use space between code blocks
WebSpiderExample.java | 150 | EJS 49 | Omit the subject in summary descriptions of actions or services
TestWebSpiderExample.java | 3 | ICS-SE-Java-2 | Do not use wild card "*" in import statements
TestWebSpiderExample.java | 22 | EJS 33 | Keep comments and code in sync


3. Test case review

Black box

Ivan's TestWebSpiderExample.java tests all methods within WebSpiderExample by calling the main method. Since all methods are in the same class, it seems like the test cases cover all methods written. From a black box perspective this might be enough, but I would suggest covering all if-then-else statements as well.

In TestWebSpiderExample.java there are 4 calls to the main method, each testing a different number of parameters but the same values. In addition, there is one call to testMainMethod.

A nice extension to the tests he already has would be testing -totallinks and -mostpopular with 0 pages, then with a high number of pages, and later with an illegal value such as a string instead of a number. He could also test with an empty parameter: for example, the URL string, or any other required parameter, could be empty or null. In addition, a self-made website could be set up to check that the code returns the correct values.

Since only the single call to testMainMethod has an assert statement, I would suggest adding more assert statements, and perhaps calling testMainMethod in all the tests, since this method returns a value that can be compared for correctness.
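To make this concrete, here is a minimal sketch (JUnit 4) of the kind of boundary tests I have in mind. It assumes that WebSpiderExample.main(String[]) takes the option, the start URL, and the maximum number of pages in that order, as on the command line; the class name, the test names, and the expectation that bad input is reported rather than thrown are my own assumptions, not something taken from Ivan's code. Asserting the actual link counts would additionally require calling a method that returns the result, which is the point about testMainMethod above.

import static org.junit.Assert.fail;

import org.junit.Test;

/** Sketch of additional boundary tests for WebSpiderExample (hypothetical). */
public class TestWebSpiderBoundaries {

  /** My own test website, described in the "Break da buggah" section below. */
  private static final String TEST_SITE =
      "http://www2.hawaii.edu/~michalz/WebSpiderTestWebSite.html";

  /** Traversing 0 pages should be handled without throwing. */
  @Test
  public void testTotalLinksWithZeroPages() throws Exception {
    WebSpiderExample.main(new String[] {"-totallinks", TEST_SITE, "0"});
  }

  /** A non-numeric page count is illegal input and should be reported, not thrown. */
  @Test
  public void testIllegalPageCount() throws Exception {
    try {
      WebSpiderExample.main(new String[] {"-mostpopular", TEST_SITE, "abc"});
    }
    catch (NumberFormatException e) {
      fail("Illegal page count should be reported, not thrown: " + e.getMessage());
    }
  }

  /** An empty URL parameter should also be handled gracefully. */
  @Test
  public void testEmptyUrl() throws Exception {
    WebSpiderExample.main(new String[] {"-totallinks", "", "5"});
  }
}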

White box

The Emma coverage report returned 83% block and 85% line coverage, which is not satisfying from a white box perspective.

Analyzing Emma's coverage report, I would suggest adding tests for the following:

1 - The length of the linksVisited array is set to 65535. Since all test cases traverse only 5 pages, linksVisited will never come close to 65535 links. Therefore, either a test case should be set up to exercise that limit, or the length should not be preset.

2 - Various catch blocks are not covered. My suggestion is to set up a website that would trigger all the exceptions.

3 - Set up a test case where the user enters something other than the expected -totallinks or -mostpopular option, so that the corresponding else branch gets covered under Emma (a sketch of such a test follows below).
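For the third point, a sketch of such a test, with the same assumptions about the argument order as in the black box sketch above; the class name and the bogus option are invented:

import org.junit.Test;

/** Sketch: drive the branch that handles an unrecognized option. */
public class TestWebSpiderInvalidOption {

  /** Anything other than -totallinks or -mostpopular should hit the else branch. */
  @Test
  public void testUnknownOption() throws Exception {
    WebSpiderExample.main(
        new String[] {"-bogusoption", "http://www.hackystat.org", "5"});
  }
}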

Break da buggah

I have set up my own very basic test website. The expected result for -totallinks is 3, and -mostpopular should return http://www2.hawaii.edu/~michalz/WebSpiderTestWebSite.html with 2 pages pointing to it when the maximum number of pages to traverse is set to 3 or higher.

Running Ivan's code on it returned incorrect results.

command: java -jar webspider-ivanw.jar -totallinks http://www2.hawaii.edu/~michalz/WebSpiderTestWebSite.html 3

output: 5

Changing the number of pages to traverse from 3 to 5 returned 9.

command : java -jar webspider-ivanw.jar -mostpopular http://www2.hawaii.edu/~michalz/WebSpiderTestWebSite.html 3

output: The most popular page is: http://www2.hawaii.edu/~michalz/WebSpiderTestWebSite.html with 0 pages that point to it.

comment: Right website, wrong number of pages.

Setting the maximum number of pages to traverse to other values gives all kinds of results, one of them being the right one.

command: java -jar webspider-ivanw.jar -totallinks http://www.myspace.com/shakadownstreet 10

output: crash - Sys.Application.notifyScriptLoader(); 'failed: SyntaxError: illegal character

4. Summary and lessons learned
I have learned to think about and design better test cases for programs. Most people, including me, tried to get Emma coverage as high as possible without thinking about whether the tests we write actually help verify the program's execution. I have also learned that it is important to set up my own testing environment before checking the code for correct output. For example, if I had set up a test website for WebSpider right from the beginning, I could have been more certain that my code does what it is supposed to do. I have also learned that using someone else's test website isn't a good idea: I found one link on Randy's website that was down, and therefore my code returned a different answer than expected. If you know what your code does, then it is easy to set up a testing environment and write more effective test cases.

Wednesday, September 26, 2007

Extra Credit (revised)

Source Code: Available here

So, I did some research on the httpunit package and found a nice article named "Introduction to HttpUnit". From this article I learned that we must set the script error exception flag to false. However, this did not solve my problem crawling the myspace website. I then found other useful websites that helped me learn and understand the httpunit package, but most of the other articles did not help solve the problem. Since httpunit lets you set options, as in HttpUnitOptions.setExceptionsThrownOnScriptError(false);, I looked, with the help of the Javadoc in Eclipse, at setting other options. When I set HttpUnitOptions.setScriptingEnabled(false); the problem of crawling the myspace site was solved.
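For reference, the relevant setup now looks roughly like this; the class name and the surrounding fetch code are simplified for illustration, and only the two HttpUnitOptions calls are the actual fix:

import com.meterware.httpunit.GetMethodWebRequest;
import com.meterware.httpunit.HttpUnitOptions;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebLink;
import com.meterware.httpunit.WebResponse;

public class HttpUnitSetupSketch {

  public static void main(String[] args) throws Exception {
    // Do not throw exceptions on JavaScript errors, and skip script
    // processing entirely; disabling scripting is what finally made
    // the myspace pages crawlable.
    HttpUnitOptions.setExceptionsThrownOnScriptError(false);
    HttpUnitOptions.setScriptingEnabled(false);

    WebConversation conversation = new WebConversation();
    WebResponse response =
        conversation.getResponse(new GetMethodWebRequest("http://www.hackystat.org"));

    // With scripting disabled, extracting the links works even on
    // script-heavy pages.
    for (WebLink link : response.getLinks()) {
      System.out.println(link.getURLString());
    }
  }
}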

11. WebSpider (revised)

Source Code: Available here

First of all, it was a good experience having this assignment extended, because I was able to improve my logging feature, which is implemented with java.util.logging.Logger. Although my web spider was already working before, I decided to revise the code and make it more robust. Therefore, I surrounded the necessary code with try and catch blocks. The only problem with this shows up when running Emma: since the code won't always go through the catch blocks, Emma complains and the coverage drops below the expected 100%. But at least execution no longer stops when a link is down or a website is not accessible. In addition, I think that writing JUnit tests isn't always easy. Sometimes it seems like JUnit tests are not necessary for certain blocks of code, but then the Emma coverage will drop; to satisfy it I must write a test that just walks through a block containing only a print statement. I have also found a way of turning off the scripting errors in httpunit, and therefore my code works for nearly all web pages, including myspace.com.
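A rough sketch of the try/catch-and-log pattern I mean (the class and method names are invented for illustration; my real code is organized differently):

import java.io.IOException;
import java.util.logging.Logger;

import org.xml.sax.SAXException;

import com.meterware.httpunit.GetMethodWebRequest;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebResponse;

public class PageFetcherSketch {

  private static final Logger LOGGER =
      Logger.getLogger(PageFetcherSketch.class.getName());

  /**
   * Fetches a page, logging failures instead of letting them stop the crawl.
   * Returns null if the page could not be retrieved.
   */
  public WebResponse fetch(WebConversation conversation, String url) {
    try {
      return conversation.getResponse(new GetMethodWebRequest(url));
    }
    catch (IOException e) {
      // Emma flags this block as uncovered unless a test forces a dead link.
      LOGGER.warning("Could not reach " + url + ": " + e.getMessage());
      return null;
    }
    catch (SAXException e) {
      LOGGER.warning("Could not parse " + url + ": " + e.getMessage());
      return null;
    }
  }
}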

Monday, September 24, 2007

Extra Credit

I tried to accomplish the extra credit assignment by playing with the code until it would actually crawl the myspace site without any exceptions. However, it always complained that the ASP.NET AJAX Framework is missing. While trying to make it work I found that httpunit has a method, HttpUnitOptions.setExceptionsThrownOnScriptError(false);, that suppresses scripting errors, but this did not help either. The method takes true or false as the parameter, depending on whether you want scripting errors to be thrown or not. Then I went to Google to get some more information on how to solve the problem, but in the end I was not able to crawl myspace without errors. At least my web spider works for a lot of other sites.

11. WebSpider

Source Code: Please click here to download

I was able to accomplish all the tasks. However, the third task, which required us to implement logging, I completed only partially, since I chose to put in a bunch of System.out.println statements. I could have implemented better logging, but I did not have the time to finish a complete logging implementation before the due date. Therefore, I realized that it is important to start on harder assignments at least a week before the due date.

Writing the web crawler was the hardest part for me, since the httpunit package doesn't come with all the Javadoc comments, which are always very useful when writing Java code in Eclipse. I also missed helpful documentation or examples for httpunit on Google, or even on their own website. If I were more familiar with the package, I am sure I could write better and more robust code than what is available here. That, I think, is what made writing the actual web crawler more difficult.

A problem that I ran into at the very beginning:

When I issued the command ant run, it gave me an error saying that Junit.org does not exist. It took me a while to figure out that adding JUnit to the fileset in the build.xml file would solve my problem. After that it ran fine, but then I ran into a problem with the httpunit package.

Finally, I was able to get good coverage from Emma, near 100% across the board. I was not able to get the full 100% because PMD complained about ArrayList, which I could not resolve in the short time available.

In conclusion, I think this assignment was helpful in dealing with unfamiliar packages, and it also gave me a chance to brush up on recursion. I once heard from a senior developer that he has never used recursive methods in his career. I would agree that recursion is not strictly necessary, since it is heavier on memory, executes more slowly (as can be observed when running this code), and is more prone to errors.

By the way, be patient when executing this code, since it may take a while to recursively traverse the Web links until the specified number of pages has been visited.
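For what it's worth, here is a sketch of how the same traversal could be done without recursion, using an explicit work list instead of the call stack; the names are invented and this is not how my submitted code is structured:

import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

import com.meterware.httpunit.GetMethodWebRequest;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebLink;
import com.meterware.httpunit.WebResponse;

public class IterativeSpiderSketch {

  /**
   * Visits pages until maxPages have been fetched and returns the total
   * number of links seen. The explicit queue replaces the call stack
   * that a recursive version grows with every page.
   */
  public int totalLinks(String startUrl, int maxPages) throws Exception {
    WebConversation conversation = new WebConversation();
    Queue<String> toVisit = new LinkedList<String>();
    Set<String> visited = new HashSet<String>();
    toVisit.add(startUrl);
    int totalLinks = 0;

    while (!toVisit.isEmpty() && visited.size() < maxPages) {
      String url = toVisit.remove();
      if (!visited.add(url)) {
        continue; // already fetched this page
      }
      WebResponse response =
          conversation.getResponse(new GetMethodWebRequest(url));
      for (WebLink link : response.getLinks()) {
        totalLinks++;
        toVisit.add(link.getRequest().getURL().toExternalForm());
      }
    }
    return totalLinks;
  }
}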