| Bob F ( @ 2009-01-04 23:53:00 |
Finished
So I've officially finished version .7 of the program I have been working on (no name yet).
Basically I'm very happy with where I'm at now.
This program grew out of a need for me to build prospecting lists for the business I'm in. I had been using Manta to browse companies, but I found that clicking on each one for information was too slow.
So I built an app in Perl to parse the search page and snag all the relevant information for each company presented. I could have done a lot of this using prebuilt code (not that I didn't use modules and whatnot), but I wanted to learn a fair bit about the language and really tackle a programming project, since I haven't had any serious forays into programming since sophomore year of college (not counting php/asp, which I don't count as real [at least not the way I was using them; I'm sure designing a CRM is different]).
So I settled on Perl, as I figured parsing would require a lot of regex (I was right). In the end I did decide to use HTML::Parser, but getting the information still required a lot of parsing.
It works fairly well but it is SLOW. It takes about a second a page, which doesn't seem like a lot... however 1000 companies takes around 15 minutes.
Anyway, as of right now I figure I'm at version .7.
Just for me, a todo list:
Version .75:
Add support for searching via category and state/city and for acquiring the cookie from logging into the website so that I can authenticate from the program.
Version .8:
Add Tk GUI
Version .9:
Add error checking
/*I'm terrible about doing this on the front end as I know I'm the only one entering input so it will inevitably be good, then when I'm about to release it I realize people are morons. I'm now resigned to designing this way so instead of coding in error scripts I just stick in comments about what to do in pseudo code and do it in the end.*/
Version 1:
Figure out some way to add multithreading and speed up the operations. I've determined via use Benchmark; that the holdup is the fact that I get one page after another and parse them sequentially. If I could fetch and parse many at a time, everything would go much quicker.
.75, .8 and .9 are simple matters of having the time to implement the various features (I've never worked with Tk or GUIs before, but after browsing some online tutorials it seems fairly easy to get familiar with, especially for basic GUIs).
However, I really have no idea where to begin on multithreading. I can state what I want to do in pseudo code:
Concurrently call several instances of get_page_html() on several URLs, then call parse_page() from inside each of those calls, and have them all running at the same time.
But I have no idea how to do this in practice. Any tips on where to start are much appreciated. Even experience with multithreading in other languages would be helpful, as this is only my second or third foray into Perl (I feel very comfortable learning new syntax quickly), so any examples in any language (within reason) would probably be helpful.
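To make the pseudo code a bit more concrete, here is a rough sketch of the pattern in Python rather than Perl, using a thread pool from the standard library. Everything in it is an assumption for illustration: get_page_html() and parse_page() are stand-in stubs (the fetch just sleeps to imitate network latency, and the parse only pulls out an <h1>), the example.com URLs are placeholders, and max_workers=8 is an arbitrary number. It is not the real program, only the shape of "fetch and parse several pages at once".

```python
from concurrent.futures import ThreadPoolExecutor
import time


def get_page_html(url):
    """Hypothetical stand-in for the real fetch: sleeps to imitate network latency."""
    time.sleep(0.1)
    return f"<html><body><h1>Company at {url}</h1></body></html>"


def parse_page(html):
    """Hypothetical stand-in for the real parser: returns the text inside <h1>."""
    start = html.index("<h1>") + len("<h1>")
    end = html.index("</h1>")
    return html[start:end]


def fetch_and_parse(url):
    # Each worker fetches one page and parses it right away,
    # so parse_page() is called from inside the concurrent call.
    return parse_page(get_page_html(url))


def scrape_all(urls, max_workers=8):
    # The pool runs up to max_workers calls of fetch_and_parse() at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() hands back results in the same order as the input URLs.
        return list(pool.map(fetch_and_parse, urls))


if __name__ == "__main__":
    # Placeholder URLs, not the real search pages.
    urls = [f"http://example.com/search?page={n}" for n in range(1, 17)]
    started = time.perf_counter()
    results = scrape_all(urls)
    elapsed = time.perf_counter() - started
    print(f"{len(results)} pages in {elapsed:.2f}s")
```

Since the time is mostly spent waiting on the network, threads that overlap the waiting should help even when the parsing itself stays on one core. The same worker-pool shape ought to carry over to Perl (the core threads module is one candidate), though I have not tried that.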
I am not initially planning on releasing this under the GPL until I find out if I can get some money out of a company for designing this (yes, I know with Perl the source will be readable anyway), but if you're interested in seeing the source I would be glad to share it as long as you agree not to release it.
Thanks
-Bob