At work, I do a lot of data mining. One tool that I recommend is Formulize, an incredibly powerful scientific data mining software package. I am not sure there is anything quite like it on the market.
For those who are not sure how data mining works, here is an example.
I decided to look at the Japanese-US conflict in the Pacific during WW2, using data from Wikipedia. Let me first say that I am aware many of the figures there are a bit dubious; for example, Japanese laborers are sometimes listed as soldiers. Still, I did my best.
The period I examined was the island-hopping campaign, fought mainly by the US against Japan in the Pacific from late 1943 to 1945. Some of these battles, but not all, were extremely one-sided, with many times more American troops than Japanese. The Americans also had both air and naval superiority, plus better and more equipment on the ground. All these advantages improved over time, however...
Here is a table I created of the US island-hopping campaign from 1943 to 1945:
Battle, Start, Day No, Days, US, US-dead, Japan, Japan-dead
Battle, Start, D3, E3, F3, G3, H3, I3
Battle of Tarawa, 20/11/1943, 713, 4, 35000, 1009, 4819, 4673
Battle of Kwajalein, 31/01/1944, 785, 4, 42000, 372, 8100, 7870
Battle of Saipan, 15/06/1944, 921, 25, 71000, 2949, 31000, 29000
Second Battle of Guam, 21/07/1944, 957, 21, 36000, 1747, 22000, 18040
Battle of Tinian, 24/07/1944, 960, 9, 30000, 328, 8810, 8010
Battle of Peleliu, 15/09/1944, 1013, 74, 10994, 1794, 11000, 10695
Battle of Angaur, 17/09/1944, 1015, 14, 15000, 260, 1400, 1338
Battle of Luzon, 9/01/1945, 1129, 219, 175000, 8310, 250000, 205535
Battle of Iwo Jima, 19/02/1945, 1170, 36, 70000, 6812, 22060, 21844
The Battle of Okinawa, 1/04/1945, 1211, 82, 183000, 12513, 117000, 100000
As I used Excel, I used Excel conventions for my variables.
D3 = number of days since Pearl Harbor
E3 = days the battle raged
F3 = US attacking force
G3 = US dead; it may include air force too. I am not sure what the wiki uses.
H3 = Japanese original defending force
I3 = Japanese dead
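The day number in D3 can be checked with a few lines of Python. This is only a sketch: it assumes day 0 is 7 December 1941 and that the dates in the table are written day/month/year. The name `day_number` is made up for illustration.

```python
from datetime import date

# Assumption: day 0 is the attack on Pearl Harbor, 7 December 1941.
PEARL_HARBOR = date(1941, 12, 7)


def day_number(start: date) -> int:
    """Days elapsed between Pearl Harbor and the start of a battle."""
    return (start - PEARL_HARBOR).days


print(day_number(date(1943, 11, 20)))  # Battle of Tarawa  -> 713
print(day_number(date(1945, 4, 1)))    # Battle of Okinawa -> 1211
```

Both values agree with the Day No column in the table.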
The days of battle (E3) on this table total 488.
Please bear with me, as I think it will be interesting.
I am trying to express E3 in terms of D3, F3 and H3. What Formulize came up with was
E3 = 32.76 + 0.001234*H3 + 4.372*TAN(D3) - 0.0007463*F3 - 10.57*SIN(5.776 + F3)
Plugging the values from the table into this equation, I get
Battle of Tarawa, 3
Battle of Kwajalein, 5
Battle of Saipan, 26
Second Battle of Guam, 22
Battle of Tinian, 8
Battle of Peleliu, 74
Battle of Angaur, 14
Battle of Luzon, 219
Battle of Iwo Jima, 36
The Battle of Okinawa, 82
The predicted days listed above add up to 489, which is almost spot on against the actual total of 488 days.
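For anyone who wants to repeat the plugging-in step outside Excel, here is a minimal Python sketch. It assumes TAN and SIN work in radians, as they do in Excel; the function name `predicted_days` and the tuple layout are just my choices for illustration. The numbers are copied from the table above.

```python
import math


def predicted_days(d3: float, f3: float, h3: float) -> float:
    """Evaluate the fitted formula for E3 (angles in radians, as in Excel)."""
    return (32.76 + 0.001234 * h3 + 4.372 * math.tan(d3)
            - 0.0007463 * f3 - 10.57 * math.sin(5.776 + f3))


# (battle, D3 day number, E3 actual days, F3 US force, H3 Japanese force)
BATTLES = [
    ("Tarawa", 713, 4, 35000, 4819),
    ("Kwajalein", 785, 4, 42000, 8100),
    ("Saipan", 921, 25, 71000, 31000),
    ("Guam", 957, 21, 36000, 22000),
    ("Tinian", 960, 9, 30000, 8810),
    ("Peleliu", 1013, 74, 10994, 11000),
    ("Angaur", 1015, 14, 15000, 1400),
    ("Luzon", 1129, 219, 175000, 250000),
    ("Iwo Jima", 1170, 36, 70000, 22060),
    ("Okinawa", 1211, 82, 183000, 117000),
]

# Print actual and predicted days side by side for each battle.
for name, d3, e3, f3, h3 in BATTLES:
    print(f"{name:10s} actual={e3:4d}  predicted={predicted_days(d3, f3, h3):7.1f}")
```

Rounding the printed predictions to whole days is how the list above was made.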
What I then decided to do was change the dates of the battles.
So I asked the computer: let us assume that all the battles took place on the earliest date, 20/11/1943. What would the result be? The total of E3 came to 403 days of battle. As you can see, the Japanese hold out for fewer days than they did in WW2.
I then asked the computer: let us assume that all the battles took place on the latest date, 1/04/1945, the day the Battle of Okinawa began, which was seen as a dress rehearsal for the invasion of Japan. This time the total number of days came out higher.
As you can see from this example, the data mining shows the Japanese improving relative to the Americans. The Japanese strategic plan during this period was to hold off the American advance, and this model suggests that they were getting better at it. This would lend weight to those who argue that the invasion of Japan was not going to be as quick as some suggest.
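The change-the-dates experiment can be sketched in Python too. This is an illustration under assumptions, not what Formulize or Excel does internally: it holds D3 at one day number for every battle, keeps F3 and H3 from the table, and adds up the predicted days. The name `total_days_if_all_on` is made up, and radians are assumed as in Excel.

```python
import math


def predicted_days(d3: float, f3: float, h3: float) -> float:
    """The fitted formula for E3 (angles in radians, as in Excel)."""
    return (32.76 + 0.001234 * h3 + 4.372 * math.tan(d3)
            - 0.0007463 * f3 - 10.57 * math.sin(5.776 + f3))


# (F3 US force, H3 Japanese force) for the ten battles, in table order
FORCES = [
    (35000, 4819), (42000, 8100), (71000, 31000), (36000, 22000),
    (30000, 8810), (10994, 11000), (15000, 1400), (175000, 250000),
    (70000, 22060), (183000, 117000),
]


def total_days_if_all_on(d3: float) -> float:
    """Total predicted days if every battle had started on day number d3."""
    return sum(predicted_days(d3, f3, h3) for f3, h3 in FORCES)


print(total_days_if_all_on(713))   # every battle moved to 20/11/1943
print(total_days_if_all_on(1211))  # every battle moved to 1/04/1945
```

Only the TAN(D3) term changes between the two runs, so the gap between the two totals comes entirely from the date.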
I hope this example gives you a feel for how it can be used. I have used it for calculating many things: economic data, business data from work such as hours worked by people, charity donations, WW2 battles and so on. It can be anything that has numeric values.
What you do is put in your raw data, for which I suggest an Excel spreadsheet. Then, once it is right there, transfer it to Formulize.
Pros: It has a terrific user base; many are extremely helpful, as is the writer of the software.
It does not need a powerful machine to run it, although I would suggest one.
Cons: It can take a long time to run. I generally need at least two hours, and often I run it overnight, sometimes coming back the next day to find that, for some reason, what I have done is useless and needs to be redone. It usually takes me many attempts to get a decent result. I confess that sometimes it is interesting to find out why it failed.
It takes a long time to get the information in correctly. I often use a key macro program to enter the information, because it is just too much mouse clicking. The writers of the program need some pointers on making it user friendly.
It is not easy to learn, partly because it is so powerful and because it is a complex field. Partly, too, because it is in beta and not well documented. Much of it you are just going to have to work out for yourself.
It has no Excel conversion, so the formulas cannot be transported quickly from Formulize back to Excel. I often have to pick simple options to keep to Excel conventions. Even then you can have problems; for example, 0^0 in Formulize is 1, while in Excel 0^0 = Error. Here Formulize is right, but what can we do about Microsoft?
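If you check a formula in a third tool before taking it back to Excel, the 0^0 difference is worth guarding against. Here is a small Python sketch; `excel_power` is a hypothetical helper I made up to copy Excel's behaviour and is not part of any library. Python itself follows the same 0^0 = 1 convention as Formulize.

```python
def excel_power(base: float, exponent: float) -> float:
    """Hypothetical helper: raise an error for 0^0, as Excel does."""
    if base == 0 and exponent == 0:
        raise ValueError("0^0 is an error in Excel")
    return base ** exponent


print(0 ** 0)             # Python gives 1, the same convention as Formulize
print(excel_power(2, 3))  # ordinary powers are unchanged
```

Running a fitted formula through a helper like this shows whether any data row would hit the Excel error before you paste the formula into a spreadsheet.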
For the time being it is in beta and free, so I suggest that you grab it while you can. It is available here. Latest version: Eureqa 0.98.2 (build 1071), June 26, 2013.
Also, if anyone has any questions on how to use it, I am always willing to help.