Intuition says that the results of #1 vs #2 should be identical, or that #1 should be worse because you're doing two passes on source material. But I always get better results with case #1 (i.e., high-res scan and re-encoding afterward) regardless of the type or model of scanner, or whether the scanner does the JPG encoding on-board the device itself or through a Windows/Linux/Mac driver bundled with the scanner.

My theory is that scanner manufacturers are deliberately choosing the JPG encoding profile that gets them the fastest result. They want to brag about pages per minute which is an easily measured metric. Quality of JPG encoding and file size take effort to compare, but everyone understands pages per minute.

If anyone has contrary experience I'd like to hear it. I've been seeing this for years with different document scanners and flatbed scanners -- regardless of how I tweak the scanner's settings, I can always get good quality in a small file by re-encoding afterward.

If you're scanning at a lower resolution, the scanner has fewer samples to work with when trying to make a visual representation of your document. If you scan at a higher resolution, the algorithm could at the very least average together nearby samples. It could also detect sharp lines vs fuzzy borders and decide whether the low-res version has a sharp transition or an averaged color between areas.

Low-resolution scans are faster than high-resolution scans for this exact reason: the sampling is worse. There might be the same number of samples per line captured by the linear CCD (and then downsampled), but the distance between scanlines varies by DPI.

My guess is slightly different: manufacturers are deliberately choosing the JPG encoding profile that gets them good quality in the worst cases, which also happens to mean fast encoding and bigger files. Their motivation is simple: one case of negative user experience outweighs a thousand positive ones and hurts their reputation hard.

While the JPEG encoder in Photoshop is likely a lot better than what's in your scanner, I am, similarly to others here, fairly convinced that the majority of the difference is in the sample rate by the scanner.

I did some vanilla JS that simulates scaling from high DPI vs sampling directly at a specific DPI and the result does resemble the result of a low resolution scan.
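The comparison is easy to simulate. Here's a minimal NumPy sketch of the same idea (not the original JS; the function names and toy "document" are mine). Averaging blocks of a high-DPI scan preserves a thin line as a gray smudge, while sampling directly at low DPI can miss it entirely:

```python
import numpy as np

def box_downsample(img, factor):
    """Average factor x factor blocks -- like scanning at high DPI, then scaling down."""
    h, w = img.shape
    h2, w2 = h // factor, w // factor
    blocks = img[:h2 * factor, :w2 * factor].reshape(h2, factor, w2, factor)
    return blocks.mean(axis=(1, 3))

def point_sample(img, factor):
    """Keep every factor-th pixel -- a crude stand-in for sampling directly at low DPI."""
    return img[::factor, ::factor]

# A synthetic "document": light paper with one thin dark line.
page = np.full((64, 64), 0.9)
page[31, :] = 0.1  # a one-pixel-wide line

avg = box_downsample(page, 4)  # the line survives as a gray smudge in its block row
pts = point_sample(page, 4)    # the line vanishes entirely -- row 31 is never sampled
```

Depending on where a stroke falls relative to the sampling grid, point sampling either keeps it at full contrast or drops it; averaging degrades it gracefully instead.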

Here is why: if you go with (2), then (1) still happens anyway, done by the crap firmware in your scanner or its driver. It scans at high resolution and then downsamples in some way over which you have no control. It might not even be done with floating-point math.

Option (1) is somewhat like getting the raw image from a camera: a higher-quality source for your own processing.

On the top image, I see that the back side of the page has clearly leaked through. In my experience with scanning paper, I found a trick that essentially eliminates any visible backside content: using a flatbed scanner, I scan with the lid open and the room darkened.

The worst thing to do is to scan with the lid closed, with a lid that has a white background. This would increase the reflection from the backside of the page.

Nice to see a bit of k-means clustering. I was worried that this might attempt to be "smart" by converting to symbols, replicating the "Xerox changes numbers in copied documents" bug, but it's pure pixel image processing.

Very clean results. In some ways it's a smarter version of the "posterize" feature.
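The "smarter posterize" idea fits in a few lines. Here's a toy k-means color quantizer in NumPy (my own names and data, not the article's actual implementation): pick k representative colors, then snap every pixel to its nearest one.

```python
import numpy as np

def kmeans_palette(pixels, k, iters=10, seed=0):
    """Toy k-means on RGB pixels: find k representative colors, then snap every
    pixel to its nearest one -- posterize, but with data-driven levels."""
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)].astype(float)
    labels = np.zeros(len(pixels), dtype=int)
    for _ in range(iters):
        # Assign each pixel to the nearest center.
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned pixels.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean(axis=0)
    return centers[labels]

# Paper-white, pencil-gray and red-ink pixels should collapse to ~3 flat colors.
pixels = np.array([[250., 250., 245.], [60., 60., 60.], [200., 30., 30.],
                   [248., 252., 240.], [55., 65., 58.], [210., 25., 35.]])
flat = kmeans_palette(pixels, k=3)
```

Unlike plain posterize, which quantizes each channel to fixed levels, the output palette here adapts to whatever colors the page actually contains.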

Note how the grid is completely gone, the Sharpie strokes are fuller and the ghosting around the red ink is gone. (The word "Red" seems to have been written faintly, like with a non-working ball point pen, and then written over properly.)

The thing is, I took a completely different approach here. I won't give a complete step-by-step recipe, but the gist of it is this:

1. Create a copy layer of the image.

2. Optionally level the intensity with the divide trick; I didn't bother.

3. Convert this copy to grayscale.

4. Threshold it to black and white, such that the grid is eliminated, but the writing remains solid.

5. Blur the writing (radius 3-4).

6. Threshold again.

7. Now you have a black and white version that is a bit thicker than the original. TURN THIS INTO A LAYER MASK. An inverted one which passes through the writing, and renders everything else transparent.

8. Apply this mask to the original image. This requires transferring a layer mask between layers.

9. Slide a white background under the masked layer. Now you have the lettering clean on white.

10. Play with simple Color->Brightness-Contrast. I ended up with something like brightness -66, contrast +88.

In the final step, because of the layer mask that is in effect, these controls affect only the writing: the white coming from the unaffected layer below stays white no matter what you do with the contrast and brightness controls.
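For the curious, the gist of steps 1-9 can be approximated in NumPy (a rough sketch with my own thresholds; a box blur stands in for GIMP's Gaussian blur, and step 10's brightness/contrast pass is left out):

```python
import numpy as np

def box_blur(a, r):
    """Mean over a (2r+1)^2 window -- a cheap stand-in for GIMP's Gaussian blur."""
    p = np.pad(a.astype(float), r, mode="edge")
    out = np.zeros(a.shape, dtype=float)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out / (2 * r + 1) ** 2

def clean_page(rgb, ink_thresh=120, mask_thresh=0.1, blur_r=3):
    """Steps 1-9, roughly: threshold a grayscale copy so the faint grid drops
    out but the writing stays (4), blur and re-threshold to fatten the mask
    (5-6), then use it to composite the original writing over white (7-9)."""
    gray = rgb.mean(axis=2)                       # 3. grayscale copy
    ink = gray < ink_thresh                       # 4. dark pixels = writing
    fat = box_blur(ink, blur_r) > mask_thresh     # 5-6. thicken the mask a bit
    out = np.full_like(rgb, 255.0)                # 9. white background
    out[fat] = rgb[fat]                           # 7-8. mask passes only the writing
    return out

# Toy page: light paper, faint grid lines every 5 rows, one dark pen stroke.
img = np.full((20, 20, 3), 230.0)
img[::5, :, :] = 180.0    # grid: darker than paper, lighter than ink
img[8, 5:15, :] = 40.0    # pen stroke
clean = clean_page(img)   # grid gone, stroke kept with its original color
```

The fattened mask passes a small halo of paper through around each stroke, which is exactly what step 10's contrast adjustment then pushes back to white.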

Why the different approach: I first tried the original approach and the result was good, but I thought you wouldn't like it either, since it was similar to Matt's. I did a better job of eliminating the grid, but the writing was less vivid. (Likewise, I also preserved the yellow tint of the paper.) I wanted the grid completely gone, with vivid writing. Playing with the intensity transfer curves was not quite doing it; there was poor separation between vanishing the grid and preserving the ink.

I like the idea, but DjVu seems to be very proprietary / single vendor and not in widespread use. This has made me reluctant to use it for archival purposes (vs say PDF, which has its own issues, but feels slightly more future proof to me).

I think PDF can cover pretty much the same ground with JBIG2 and JPEG 2000. (And I believe archive.org is doing that.) But I don't know of any open source code to do the segmentation / encoding. (You have to split the bitmap from the background for JBIG2 / JPEG encoding.)

Really? Do you care to explain? What is the dividend and what is the divisor? Why can dividing an image by its low-pass-filtered version (or vice versa) be used to "clean up" the image, i.e. subtract the background, find main colors and cluster similar colors with k-means? What if the divisor has pixels near zero?

Areas of low contrast become whiter and areas of high contrast become more saturated.

It is also more robust than k-means. The author's algo will only work on scanned images. Photographed pages from a book will often have a slight shadow on half the page from the curvature. Blur-divide will clean this up. K-means will think you've used a lot of gray and not figure out that there are multiple background colors.
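A minimal blur-divide sketch in NumPy (my own names; a box blur stands in for a large-radius Gaussian, and clamping the divisor with eps handles the near-zero-pixels question raised above):

```python
import numpy as np

def box_blur(a, r):
    """Mean over a (2r+1)^2 window -- a cheap stand-in for a large Gaussian blur."""
    p = np.pad(a.astype(float), r, mode="edge")
    out = np.zeros(a.shape, dtype=float)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out / (2 * r + 1) ** 2

def blur_divide(gray, r=8, eps=1e-3):
    """Divide the page by its own heavily blurred copy. The blur estimates the
    local background (paper tint, the shadow near a book's spine); dividing
    pushes the background toward 1.0 everywhere while ink stays dark.
    Clamping the divisor with eps avoids blow-ups on near-black regions."""
    background = np.maximum(box_blur(gray, r), eps)
    return np.clip(gray / background, 0.0, 1.0)

# Toy photographed page: paper shaded 0.4 (left) to 0.9 (right), one dark stroke.
shade = np.linspace(0.4, 0.9, 32)
page = np.tile(shade, (32, 1))
page[16, 8:24] *= 0.15     # ink is dark relative to the *local* paper
flat = blur_divide(page)   # paper normalizes to near-white on both sides
```

Because each pixel is normalized by its own neighborhood, the shadowed half and the bright half of the page both come out white, which is exactly the case where global k-means clustering gets confused.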

Is this really standard writing paper? I assume it would be useful for calligraphy or learning how to write (as you can use the subdivision to draw letters to the correct height) but I find it weird for it to be standard issue paper.

It helps children learning how to write.
Lowercase letters start from the thicker line to the first thinner line above.
Uppercase letters and taller lowercase letters like "t" or "d" go to the second line.
And the tails of letters like "g" or "y" go to the first line below.

If you don't pin numpy==1.9.0 you'll get the 1.14.2 version, which is also broken.

The rest of the options allow pip to soft-override the macOS built-in numpy 1.8.0 which is immutable in the /System/ directory.

Anyway, after I did all that I was able to start playing with the app. I had previously been using a kludge workflow to get nice black-and-white output with ImageMagick: convert's -shave option to remove the scanned edges of images, then -depth 1 to force the depth down (which only works well on really clean scans), then -trim to clear the framing white pixels, and finally -gravity center -extent 5100x6600 to re-center the contents inside a 600dpi image.

Rough but it works, I was hassling with trying to isolate "spot colors" for another thing, but this might actually do the trick!!!

Great job with that. I've only just started taking notes by hand once again, after being keyboard-only for many years.

In your scenario, since you have assigned "scribes" taking the notes, you might be able to streamline the process with a "smart pen."

There are several on the market. The one I got as a hand-me-down from a family member lets you write dozens of pages of notes, then Bluetooth them to a smartphone app that exports to PDF, Box, Google Drive, etc... Or it can actually copy the notes to the app in real time. Combined with a projector, this might be useful for the other students during class.

It's supposed to be able to OCR the notes, too, but I haven't bothered to figure out how. But there's a cool little envelope icon in the corner of each notebook page that if you put a checkmark on, it will automatically e-mail the page to a pre-designated address.

Again, there are several models on the market. Mine retails for about $100. Notebooks come in about 15 different sizes and cost about the same as a regular quality notebook.

I have found that my Galaxy Note 2014 is pretty much hands-down the best note-taking tablet. It's better than the crap that Microsoft and Apple are trying to hawk. It doesn't have as many fancy apps, but for _strictly_ note taking, sharing notes via email, and book reading, it's pretty awesome.

I just wish its price would come down. It's still full price from four years ago :| and even getting more expensive because it's so old

You are referring to the 10.1 tablet? I know it's not the same, but I had a Note 4 and was pleased with its note-taking ability. I imagine that with the extra screen real estate the tablet was even better. It looks like they are $453 on Newegg! I agree that does seem crazy for an old device. If you can put up with a used device there are two on Swappa for around $200: https://swappa.com/buy/samsung-galaxy-note-101-2014-wifi
If it truly is the hands-down best note taking tablet you might as well buy them both and standardize to the platform.

However, I prefer the Wifi-only model. I neither need nor want cell service on my tablet. Not only that, but cellular providers end up installing complete garbage for their software.

I recently broke my first device. I bought a new one that was advertised as Wifi-only. The received device was branded for Verizon and was literally and completely unusable without a Verizon SIM card. I ended up getting it replaced with a T-Mobile one which, while not completely unusable, still has tons of crapware from T-Mobile that I cannot uninstall.

I have to say that I love the galaxy note 8.0 for note taking. So much so that I'm trying to buy up as many as I can find so that I will always have one.
There are 2 main reasons for this:

1st: I love the size. The 10s are a little too big to cart around in a jacket pocket, and the Note phones are too small to be a decent notebook. (I must admit though, I'm rethinking this point, as I used to be a huge Moleskine fan.)

2nd: The Samsung S Note app from that time is awesome! Honestly, every time Samsung comes out with a new pen-based tablet or phone, I always check it out. And I'm looking for 2 features: a: the ability to import a PDF file & b: goddamned smart shapes! I suck at drawing, so I love that on my 8.0 I can draw a shitty square or circle and the software will make it look all shipshape. But they seem to have removed that feature from newer versions of S Note, and I haven't found a 3rd-party app that can do just those 2 things.

Any pointers to an app that will allow me to import pdfs and draw on them using smart shapes would be awesome!

As far as Apple's offerings: no idea. The last Apple device I used was a 4th-generation iPod back when that was the new thing. Coming from a desktop-power-user background, I found iOS 4's UX extremely offensive. So is Android's, by the way, but at least Android's horrid UX is cheaper.

Technicalities:
1) the S Pen's tip registers much closer to the image. On a Microsoft Surface, in contrast, the stylus feels and appears about 1/4" above the image, which causes me to mis-write a lot, since I feel the stylus at a different location from where the drawing is being applied. I don't know about you, but I don't hold my pen/pencil/stylus straight up at 90 degrees from the surface being written on, which is the only way I could reliably use a Surface for writing.

2) the Galaxy Note 2014's resolution is higher than the Surfaces I've used even though it's on a smaller form factor. That means that the image simply looks sharper.

3) the S Pen uses a different tech so that the Galaxy Note is able to recognize the difference between my fingers (or my palm) resting on the tablet's surface vs the stylus writing whatever's being written.

4) Both of the Galaxy Note 2014s I've used have had a battery life of about a full week from a full charge. It does depend on how much you use it, of course, but typically it's about 4.5 days, dying right at the tail end of Friday. That's with writing notes for about 2 or 3 hours during every workday. Contrast with Surface 3 and Surface Pro ... one had lasted a day and a half on average while the other lasted about 10 hours.

5) S Note translates my handwriting with about 90% accuracy. But OneNote translates my handwriting to literal garbage text about 50% of the time. When it doesn't translate to garbage, there are of course the common confusions like I->l->1 and B->0->O->o->D->etc. (which S Note also has trouble with, but not as much).

Opinions:
I absolutely despise Microsoft's and Apple's OS offerings these days. I don't like Android, either, but it's the least hated of the three. I use Windows 7 at home for games but otherwise Linux the whole way. Linux allows me to actually customize the UI the way I want; when it doesn't, there's the good ole' command line.

I've found the S Note app to be fairly intuitive and straightforward. It's really easy to export notes to PDF or email, e.g. to share with coworkers. If you can get past OneNote's shitfest version of OCR, I'll admit that the rest of it is pretty intuitive. I really didn't care to learn (yet another) different way of doing things, though.

If I had _one_ complaint about the Galaxy Note, it's that I have yet to find a way to have it require Active Directory to unlock; so if that's a requirement then ~~your company sucks~~ you're SOL for using it at work. If someone knows of a viable way to login to an Active Directory account on Android, that'd be awesome.

I also have to say that I actively use Google Play Books. Good luck getting that to work on Surface or iOS outside of ~~Chrome~~ a browser; I never could anyway. If I have to log in to a site using a browser and use the browser to use the site instead of using native apps, then why not just use a Chromebook?

In terms of compression for scanned notes, I haven't found anything that comes close to what even an older version of Adobe Acrobat yields, due to the use of the JBIG2 codec. Has anybody found any way to compress PDF files with JBIG2 on Linux/Mac? It's pretty much the only reason I have to find a Windows machine with Acrobat installed a couple of times a year, to postprocess a batch of scanned PDFs.

I have made some progress on this as a home project using the same compression and scanning. I call it DFA (digital file analytics): data/images/scanned documents are sent remotely via Kafka to Hadoop, then OCR is run to extract text, followed by compression. If a document is more than 10MB it goes to HBase, otherwise HDFS. Near real-time streaming with Spark and Flink is done too. Visualization uses a Banana dashboard, which isn't so cool, as it only shows word counts, storage locations, images and tags. Next I'd like to do analytics on top of the extracted data using ML.

It seems to me that the inaccuracies/inefficiencies/errors/whatever you like in using RGB are basically truncated out of existence by the very, very harsh binning that is occurring. I wouldn't expect any visible differences to emerge from any alternate color space.