Bob F ([info]mrdzone) wrote,
@ 2009-01-04 23:53:00
Finished
So I've officially finished version .7 of the program I have been working on (no name yet).

Basically I'm very happy with where I'm at now.

This program grew out of a need for me to build prospecting lists for the business I'm in. I had been using Manta to browse companies, but I found that clicking on each one for information was too slow.

So I built an app in Perl to parse the search page and snag all the relevant information for each company presented. I could have done a lot of this using prebuilt code (not that I didn't use modules and whatnot), but I wanted to learn a fair bit about the language and really tackle a programming project, since I haven't had any serious forays into programming since sophomore year of college (not counting PHP/ASP, which I don't count as real, at least not the way I was using them; I'm sure designing a CRM is different).

So I settled on Perl, as I figured parsing would require a lot of regex (I was right). In the end I did decide to use HTML::Parser, but getting the information still required a lot of parsing.

It works fairly well but it is SLOW. It takes about a second a page, which doesn't seem like a lot... however 1000 companies takes around 15 minutes.

Anyway as of right now I figure I'm at version .7

Just for me, a todo list:
Version .75:
Add support for searching via category and state/city and for acquiring the cookie from logging into the website so that I can authenticate from the program.

Version .8:
Add Tk GUI

Version .9:
Add error checking
/*I'm terrible about doing this on the front end as I know I'm the only one entering input so it will inevitably be good, then when I'm about to release it I realize people are morons. I'm now resigned to designing this way so instead of coding in error scripts I just stick in comments about what to do in pseudo code and do it in the end.*/

Version 1:
Figure out some way to add multithreading and speed up the operations. I've determined via use Benchmark; that the holdup is that I get one page after another and parse them sequentially. If I could parse many at a time, everything would go much quicker.

.75, .8 and .9 are simple matters of having the time to implement the various features (I've never worked with Tk or GUIs before, but after browsing some online tutorials it seems fairly easy to get familiar with, especially for basic GUIs).

However, I really have no idea where to begin on multithreading. I can state what I want to do in pseudo code:

Concurrently call several instances of get_page_html() on several URLs, then call parse_page() from inside each of those calls, and have them all running at the same time.

But I have no idea how to do this in practice. Any tips on where to start are much appreciated. Even experience in multithreading other languages would be helpful as this is only my second or third foray into Perl (I feel very comfortable learning new syntax quickly) so any examples in any language (within reason) would probably be helpful.
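
For what it's worth, here is roughly the shape of that pseudo code as a small Python sketch, since examples in any language are fair game. This is not the real program: get_page_html() and parse_page() here are hypothetical stand-ins (the fetch just sleeps to imitate network latency and returns fake HTML), and fetch_and_parse(), scrape_all(), the worker count and the example.com URLs are made up for illustration. It assumes the time really is spent waiting on page fetches, which is the case where a thread pool tends to help.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def get_page_html(url):
    """Hypothetical stand-in for the real fetch; sleeps to imitate a slow request."""
    time.sleep(0.1)
    return f"<html><title>Company at {url}</title></html>"


def parse_page(html):
    """Hypothetical stand-in for the real parser; pulls out the title text."""
    start = html.index("<title>") + len("<title>")
    end = html.index("</title>")
    return html[start:end]


def fetch_and_parse(url):
    # One unit of work: fetch a page, then parse it in the same worker.
    return parse_page(get_page_html(url))


def scrape_all(urls, workers=8):
    # Run up to `workers` fetch-and-parse jobs at once.
    # pool.map returns results in the same order as `urls`.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_and_parse, urls))


if __name__ == "__main__":
    urls = [f"http://example.com/search?page={n}" for n in range(1, 17)]
    started = time.time()
    results = scrape_all(urls)
    elapsed = round(time.time() - started, 2)
    print(len(results), "pages in", elapsed, "seconds")
```

The idea carries over to Perl in principle: the core threads module or a fork-based CPAN module such as Parallel::ForkManager would play the role of the pool here, though I haven't worked out what that version would look like.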

I am not initially planning on releasing this under the GPL until I find out if I can get some money out of my company for designing this (yes, I know with Perl it will be open source anyway), but if you're interested in seeing the source I would be glad to share it as long as you agree not to release it.

Thanks

-Bob





[info]jonthebrit
2009-01-06 02:05 am UTC
So I have hardly any idea what any of this shit means, but I will tell you that you should check to see the intellectual property and proprietary information agreement with your company. Generally while employed, anything you make to help you do your job is property of the company. Just because it's in Perl doesn't mean it's open source.

Which is no reason to stop, it'll get you a huge 'atta boy'



[info]mrdzone
2009-01-06 04:49 am UTC
Yea my company isn't really a tech company, they move freight... most of the people there don't even know shit like this is possible. However I figured I would double check, so poking around our intranet I checked out our IT Docs section and the only 'policy' there is about network use (which apparently I break several times a day ... meh) and they say nothing about IP.



[info]mrdzone
2009-01-06 04:52 am UTC
Also, you are using open source in a different context than I am. You are using open source to mean freely distributable/modifiable. That is what the GNU General Public License is for (which I am not using).

You can still restrict the rights of users through a license agreement even if they have the source code (all this means is it is easier for them to circumvent the licensing agreement).

Since Perl is an interpreted language, the source must be included in the distribution of the software. There are ways of obfuscating the code to make it less human readable, but in the end it can be gotten around.

So I am using the term open source with the meaning narrowly defined as "possessing the source code," not "being (legally) able to do whatever you want with the source code." There is a big difference.


