nbk2000
February 22nd, 2005, 02:59 PM
While sorting through my offline copy of the FTP, specifically the Patents folder, I discovered a large number of duplicate patents had been uploaded, each with different names and in different formats.
This is not only wasteful of bandwidth and storage, but also makes searching through them in a useful manner nearly impossible.
The Problem
For instance:
Firearm.pdf
is totally useless for sorting, as we don't even know if it belongs in the Patents folder!
Patent.pdf
doesn't tell us shit-all about what it's about.
Firearm Patent
at least tells us it IS a patent, but the category of 'Firearm' is so all-encompassing as to make it moot for searches.
788,866 Firearm.pdf
is only slightly better, as we can now assume (perhaps incorrectly, as it might be missing a digit) that it's an OLD patent, but unless it's already in a patent folder, we don't know that it IS a patent, as the word 'Patent' isn't included. Unless retrosynthetic opens it and sees that it IS a patent, or someone informs him that it's not...who knows?
788,866 Firearm Patent.pdf
is getting better, but the numeric prefixing would result in a patent from the early 20th century being placed after patents from the 21st century, as 7 comes after 6, even if the 7 is part of a six digit patent number (early 20th century), and the 6 part of a seven digit number (early 21st century). :rolleyes:
Also, the , (comma) symbol makes it impossible to search for a unique patent number, as it's not a searchable character, so ANY patent containing 788 or 866 would bring up a hit, which if you have hundreds of patents, needlessly complicates searches. So REMOVE any spaces, commas, or other symbols from the numbers of patents.
Patent 0788866 Firearm.pdf
is getting better, as a search for 'patent' will pull it up, and it'll be sorted in proper order, and a search for the specific patent number will result in a unique hit, but the title still doesn't tell what KIND of firearm it is.
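For anyone who'd rather script this than rename by hand, here is a minimal Python sketch of the two rules so far: strip the commas and pad to seven digits. The function name normalize_patent_number is made up for illustration, and the three numbers are simply ones that appear in this post.

```python
import re

def normalize_patent_number(raw: str, width: int = 7) -> str:
    """Drop every non-digit (commas, spaces), then left-pad with zeros."""
    digits = re.sub(r"\D", "", raw)
    return digits.zfill(width)

# Unpadded numbers sort as text, so the six digit patent lands last.
unpadded = ["6502657", "788866", "1933754"]
print(sorted(unpadded))

# Padded numbers sort as text in the same order as their numeric value.
padded = [normalize_patent_number(n) for n in ["6,502,657", "788,866", "1,933,754"]]
print(sorted(padded))
```

The padding works because a plain alphabetical sort compares character by character, so equal-length numbers end up in numeric order.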
Patent 0788866 Pen Gun.pdf
is better still, as we now know that it is an old patent about a pen-gun. But is it an old American patent or an old British patent, or a recent WO patent?
Country of Origin Prefixes
A country of origin prefix prevents this confusion.
US Patent 0788866 Pen Gun.pdf
tells us that this is a United States or US patent.
If it had been a British Pen Gun patent, then the prefix GB (Great Britain) would be used, as in:
GB Patent 0788866 Pen Gun.pdf
German patents are prefixed with DE, as in:
DE Patent 0788866 Pen Gun.pdf
and a World Patent, WO, as in:
WO Patent 0788866 Pen Gun.pdf
You can, if you want, skip using the symbols between the country prefix and the word 'Patent', as in:
USPatent-0788866-Pen_Gun.pdf
GBPatent-0788866-Pen_Gun.pdf
WOPatent-0788866-Pen_Gun.pdf
DEPatent-0788866-Pen_Gun.pdf
but then a person would have to remember to use a wildcard prefix before 'Patent', such as:
*patent
when doing a search on their computer for patents, as 'USPatent' is NOT a 'whole' word as far as a search for 'patent' is concerned.
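A rough Python sketch of why the wildcard is needed. How a real search tool splits names into words varies, so the whole-word rule below (letters and digits on either side block a match) is an assumption, as are the helper names whole_word_hits and wildcard_hits. Note also that Python's fnmatch needs a trailing * as well, where a desktop search box may not.

```python
import re
from fnmatch import fnmatch

names = ["US Patent 0788866 Pen Gun.pdf", "USPatent-0788866-Pen_Gun.pdf"]

def whole_word_hits(names, word):
    """Match `word` only when it is not glued to another letter or digit."""
    pattern = re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(word) + r"(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    return [n for n in names if pattern.search(n)]

def wildcard_hits(names, glob):
    """Case-insensitive shell-style wildcard match against the whole name."""
    return [n for n in names if fnmatch(n.lower(), glob.lower())]

print(whole_word_hits(names, "patent"))   # only the spaced name is found
print(wildcard_hits(names, "*patent*"))   # both names are found
```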
Sometimes patents are downloaded from the ETO site with multiple zeros in the numeric prefix, as in:
WO00788866
Remove any zeros, from left to right, until you have a seven digit numeric prefix.
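The zero-trimming rule can be sketched in Python like this. The function name split_publication_id is hypothetical, and the sketch assumes the downloaded name is a two letter country code followed by nothing but digits, which may not hold for every server.

```python
import re
from typing import Tuple

def split_publication_id(raw: str, width: int = 7) -> Tuple[str, str]:
    """Split e.g. 'WO00788866' into ('WO', '0788866')."""
    match = re.fullmatch(r"([A-Za-z]{2})(\d+)", raw.strip())
    if match is None:
        raise ValueError(f"unexpected publication id: {raw!r}")
    country, digits = match.group(1).upper(), match.group(2)
    # Remove zeros from the left only while more than `width` digits remain.
    while len(digits) > width and digits[0] == "0":
        digits = digits[1:]
    # Shorter numbers still get padded up to `width`.
    return country, digits.zfill(width)

print(split_publication_id("WO00788866"))
```

Only leading zeros are ever removed, so a number that really has more than seven significant digits is left alone instead of being truncated.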
UPLOADING and DOWNLOADING from the FTP
But now, when it's uploaded to the FTP, it'll be saved as this:
US%20Patent%200788866%20Pen%20Gun.pdf
because the spaces are URL-encoded as %20 on FTPs.
Use of symbols to replace the spaces, as in:
US_Patent-0788866-Pen_Gun.pdf
prior to uploading it to the FTP, keeps it readable once downloaded, and a macro can easily replace the _ and - symbols with spaces once downloaded, if you so choose.
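As a minimal sketch of both halves of this, Python's urllib.parse.quote reproduces the %20 mangling, and a few lines stand in for the macro mentioned above. The function name to_readable is made up for illustration.

```python
import os
from urllib.parse import quote

def to_readable(ftp_name: str) -> str:
    """Turn the _ and - separators back into spaces, leaving the extension alone."""
    stem, ext = os.path.splitext(ftp_name)
    return stem.replace("_", " ").replace("-", " ") + ext

# What a name with spaces turns into once it is percent-encoded.
print(quote("US Patent 0788866 Pen Gun.pdf"))

# What the symbol-separated name looks like after the "macro" has run.
print(to_readable("US_Patent-0788866-Pen_Gun.pdf"))
```

One caveat with this approach: a title that contains a genuine hyphen loses it on the way back, so the conversion is for reading convenience only.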
Relevant Titles
There are more than a dozen 'Pen Gun' patents on the FTP, but only a couple were actually called what they were. They were called 'Gas Projector', 'Signal Device', and other verbose but inaccurate titles. The exact projectile which the device fires is irrelevant if it is being fired from a 'Pen Gun'.
If it's capable of shooting a bullet, and is in the shape of a pen, then it's a 'Pen Gun'. If it's shaped like a pen, but is only capable of launching signal flares, then it is NOT a 'Pen Gun', but a 'Pen Flare Launcher', as it's not a gun in the sense of firing a lethal projectile from a barrel.
Just as there were numerous pen gun patents under various names, so too were there numerous patents related to the launching of a projectile from the muzzle of a shotgun, projectiles that did not originate from inside of a shotgun shell.
These are NOT duck-decoy launchers, as the FTP is not related to hunting (well, maybe humans...;)), so the purpose of the patent is NOT as a duck-decoy launcher, but as a 'Shotgun Spigot Grenade Launcher' or a 'Muzzle Mounted Shotgun Cup Grenade Launcher', as the relevant titles are now.
Production of Pentaerythritolpentanitrate
Nitration of Pentaerythritol
Nitration Process (this one was real specific! :rolleyes: )
and all the rest equate to PETN. Not 'making PETN' or 'Preparation of PETN' or 'Production of PETN', but just simply PETN, as that is the end result, regardless of the steps leading up to it.
If the patent isn't specifically about the making of PETN, but some variant or purification, then the names should reflect this:
US_Patent-1933754-Purification_of_PETN.pdf
US_Patent-2204059-Crystallizing_PETN.pdf
US_Patent-3408383-PETN-Trinitrate_Salt.pdf
US_Patent-3520744-Free_Flowing_PETN.pdf
If there's a choice between using HMX or RDX, then use RDX, as that's the more common of the two acronyms.
Always use the most common acronym because, while calling a substance 1,3,5-trinitrohexahydro-s-triazine may be technically accurate, it doesn't help anyone looking for RDX, as well as being a pain in the ass to type out.
US6502657-Transformable Vehicle.pdf was a mystery till I opened it and recognized it as what it is...a 'Throwbot' developed by MIT for use as a recon robot to be used by soldiers in MOUT war.
Hence US6502657-Remote Control Robot Grenade-'Throwbot'.pdf
because it is remotely controlled (hence 'remote control'), it is a 'robot', it is shaped and thrown like a 'grenade', and it is known in the trade as a 'throwbot'.
If you don't feel up to creating a relevant title, then as a MINIMUM, use the title of the patent and some common-sense description of what it's about, and let someone more capable do the job for you.
HTML
Personally, this is my preferred way of saving a patent, as it's compact, easily searched, editable, and can be easily converted to other formats.
When saving the text of a patent as an .HTML file, please don't use the .MHT or similar 'all in one' format to save it. When saving an .HTML file from the www.uspto.gov site as an .MHT file, you not only save the useful text, but also all the useless navigation buttons that add nothing but bandwidth and storage overhead to the FTP.
Save the .HTML as an 'HTML Only' file.
Do NOT save it as a 'Text Only' file, as this results in a jumbled mess of no use to anybody.
Once saved, don't archive it, as the file size is minimal anyway, and most FTP clients and servers automatically compress such files prior to transmission.
An exception to the use of HTML is when there are data tables. Unfortunately, the vast majority of these get horribly mangled by the patent servers, rendering them useless, in which case a PDF or image would be more appropriate to preserve the table formatting.
Images
Patents earlier than 1971 are downloadable only as .TIF format graphic files from the www.uspto.gov site, so these should be properly named as previously described, with the addition of a single digit numeric page suffix, if there are FEWER than ten page images, as in:
US_Patent-0788866-Pen_Gun1.tif
...
US_Patent-0788866-Pen_Gun9.tif
If there are MORE than nine page images, then use a two digit suffix, as in:
US_Patent-0788866-Pen_Gun01.tif
...
US_Patent-0788866-Pen_Gun09.tif
US_Patent-0788866-Pen_Gun10.tif
US_Patent-0788866-Pen_Gun11.tif
If you don't, then the pages get sorted like this:
US_Patent-0788866-Pen_Gun1.tif
US_Patent-0788866-Pen_Gun10.tif
US_Patent-0788866-Pen_Gun11.tif
US_Patent-0788866-Pen_Gun2.tif
US_Patent-0788866-Pen_Gun3.tif
...
And this is NOT very readable. :p
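Here is a short Python sketch of the page suffix rule. The function name page_names is hypothetical, and picking the suffix width from the page count is a generalization of the one and two digit cases above (a 300 page set would get three digits), not something stated in the rule itself.

```python
def page_names(base: str, page_count: int, ext: str = ".tif") -> list:
    """Name pages 1..page_count with a suffix wide enough for the last page."""
    width = len(str(page_count))
    return [f"{base}{page:0{width}d}{ext}" for page in range(1, page_count + 1)]

base = "US_Patent-0788866-Pen_Gun"

padded = page_names(base, 12)
unpadded = [f"{base}{page}.tif" for page in range(1, 13)]

print(sorted(padded) == padded)   # True: alphabetical order is page order
print(sorted(unpadded)[:4])       # pages 1, 10, 11, 12 come before page 2
```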
Archiving
Once the page images are properly named, compress them into a single archive file, such as .ZIP or .RAR, with the archive file being the properly formatted name, with the suffix -Images appended so that we know that it contains images, and not just a compressed .PDF or .HTML file.
A .ZIP file containing page images of US_Patent-0788866-Pen_Gun01.tif through US_Patent-0788866-Pen_Gun12.tif would be called US_Patent-0788866-Pen_Gun-Images.zip
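A minimal sketch of that archiving step using Python's standard zipfile module. The function name archive_page_images and the folder layout are assumptions for illustration. It only handles .ZIP, as the standard library has no .RAR writer.

```python
import zipfile
from pathlib import Path

def archive_page_images(folder: Path, base: str) -> Path:
    """Zip every '<base>NN.tif' in `folder` into '<base>-Images.zip'."""
    pages = sorted(
        p for p in folder.iterdir()
        if p.name.startswith(base) and p.suffix.lower() == ".tif"
    )
    if not pages:
        raise FileNotFoundError(f"no page images found for {base!r}")
    archive = folder / f"{base}-Images.zip"
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for page in pages:
            # Store only the file name, not the full path on the local disk.
            zf.write(page, arcname=page.name)
    return archive
```

Calling archive_page_images(Path("some_folder"), "US_Patent-0788866-Pen_Gun") would produce US_Patent-0788866-Pen_Gun-Images.zip alongside the page images, where some_folder is whatever directory holds them.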
Spelling
As always, proper spelling is of vital importance, as transposition of just two letters can cause a search for RDX to miss the file named RXD, which may have had the very thing they were looking for.
OCR
PDF image files are not text searchable unless they are first OCR'd. You could do this yourself, and it would be appreciated, BUT unless you're going to do a perfect job of it (meaning proofreading literally EVERY word and correcting EVERY error), then please don't do it, as a sloppy OCR job encourages lazy errors in the reader, who'll just copy the incorrect text without verifying that it is correct, like they'd have to do if they hand copied it from an image.
Let the user OCR it themselves if they want a searchable copy.
Though all this can be obviated by using an HTML version of the patent if such a copy is available from the originating patent server, as is the case for the VAST majority of the patent .PDF files on the FTP, HTML being much more compact as well as easily searchable.
SUMMARY
So, in closing, name patents as follows:
[country of origin code: US, GB, DE, WO]_Patent-[Seven Digit patent number, with six digit patents being prefixed by 0]-[Title of Patent, or clarified version, with underscore _ between each word].[file extension: .HTM, .DJVU, .PDF, .ZIP]
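Put together, the naming rule might be sketched in Python as follows. The function name build_patent_filename is hypothetical, and the allowed country set contains only the four codes named in this post, so extend it if other codes are in use.

```python
import re

# Only the codes mentioned in this post; an assumption, not a complete list.
ALLOWED_COUNTRIES = {"US", "GB", "DE", "WO"}

def build_patent_filename(country: str, number: str, title: str, ext: str) -> str:
    """Assemble '<CC>_Patent-<7 digits>-<Title_Words>.<ext>'."""
    country = country.upper()
    if country not in ALLOWED_COUNTRIES:
        raise ValueError(f"unknown country code: {country!r}")
    # Strip commas and spaces, then pad six digit numbers with a leading 0.
    digits = re.sub(r"\D", "", number).zfill(7)
    # Join the title words with underscores.
    title_part = "_".join(title.split())
    if not ext.startswith("."):
        ext = "." + ext
    return f"{country}_Patent-{digits}-{title_part}{ext.lower()}"

print(build_patent_filename("us", "788,866", "Pen Gun", "pdf"))
```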
Peeves
The DjVu (.DJVU) format is useless for patent archiving, as there's no way to search a PICTURE for TEXT, and OCR is likely impossible too. So, instead of being able to do a keyword search through the files, you have to manually open EACH and EVERY one that MIGHT have what you are looking for which, even if the filenames are accurate, still makes extraction of the content for use difficult, as it has to be manually transcribed.
What an incredibly productive use of our time this is. :mad:
While on the subject of unreadable, who uploaded "High-Impact Terrorism - Priapo"?
Almost 300 pages in an archive, each page as an individual .PDF, and numbered in the poor 1, 10, 2, 20, etc. format.
While there may be some good information in there, and I'll eventually get around to getting it PROPERLY sorted...and compiled into a SINGLE .PDF...and OCR'D...it's unreadable 'til then. Thank you. :rolleyes: