Tuesday, October 16, 2007

extract links from a webpage

this is useful if you decide to make a web crawler and don't want to bother with an html parser. You have to read the body of the page into a string, and then use this regex to extract all the links.

supported link types: <img src=... and <a href=...
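a minimal sketch of what such a regex could look like in java, assuming the attribute values are quoted with single or double quotes. the class name LinkExtractor and the method extractLinks are made up for illustration, and the pattern is only an approximation: it will miss unquoted attributes and can be fooled by unusual markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original post.
final class LinkExtractor {
    // Matches the href of an <a> tag or the src of an <img> tag.
    // Group 1 captures the quoted attribute value.
    private static final Pattern LINK = Pattern.compile(
        "<(?:a\\b[^>]*?\\bhref|img\\b[^>]*?\\bsrc)\\s*=\\s*[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns every link found in the page body, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```

calling LinkExtractor.extractLinks(body) on the page body gives back the raw attribute values, which can be absolute or relative urls.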


an absolute url is in the form of "http://something.com/blah"
a relative url is in the form of "/something/path.blee"
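one thing a crawler usually has to do is turn the relative urls into absolute ones. a small sketch using java.net.URI, assuming you know the url of the page the link came from. the class LinkResolver and its method names are hypothetical.

```java
import java.net.URI;

// Hypothetical helper, not from the original post.
final class LinkResolver {
    // True when the link carries its own scheme, e.g. "http://something.com/blah".
    // URI.create throws IllegalArgumentException on malformed input (e.g. spaces).
    static boolean isAbsolute(String link) {
        return URI.create(link).isAbsolute();
    }

    // Resolves a link against the url of the page it was found on.
    // An already-absolute link is returned unchanged.
    static String toAbsolute(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }
}
```

for example, resolving "/something/path.blee" against "http://something.com/dir/page.html" gives "http://something.com/something/path.blee".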

now you can figure out what to do with these...

Monday, October 15, 2007

SQL: change column datatype

in this note to self, i have a table where a column (COL2) is an NVARCHAR, and it needs to become an NCLOB:

CREATE table TABLENAME_TEMP as select * from TABLENAME;
ALTER table TABLENAME drop COLUMN COL2;
DELETE from TABLENAME;
ALTER table TABLENAME add (COL2 NCLOB NOT NULL);
-- at this point the columns no longer line up (COL2 is now last), so name them explicitly on both sides...
INSERT into TABLENAME (COL1, COL3, COL2) select COL1, COL3, COL2 from TABLENAME_TEMP;
COMMIT;
DROP table TABLENAME_TEMP;

now the table TABLENAME has a column which is an NCLOB, yey!

Monday, October 08, 2007

Remove HTML tags regex

when using an HTML parser is too much work, you may want to use a small regex to remove all the html tags

this is java:
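a minimal sketch of one way to do it, assuming a tag is anything between a "<" and the next ">". the class TagStripper and the method stripTags are made-up names. it does not decode entities like &amp;, it leaves the text inside script and style tags in place, and it breaks on a ">" inside an attribute value.

```java
// Hypothetical helper, not from the original post.
final class TagStripper {
    // Removes everything that looks like <...>, keeping the text in between.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }
}
```

so TagStripper.stripTags("<p>Hello <b>world</b></p>") leaves just the text "Hello world".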