The Process
Choose a website to steal, and get started! For this tutorial, I will use this article from sive.rs: https://sive.rs/confab.
Download the HTML with curl
curl https://sive.rs/confab

Extract the text with html2text
Install html2text with your distro’s package manager.
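For example (the exact package names vary a bit between distros, so treat these as illustrative):

# Package names may differ slightly between distros
sudo apt install html2text        # Debian/Ubuntu
sudo dnf install html2text        # Fedora
sudo pacman -S python-html2text   # Arch (Python implementation)

Then pipe the page through it: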
curl https://sive.rs/confab | html2text

Extra: glow!
Let's bring in a tool from my favorite FOSS group, charmbracelet.
We can simply pipe our output to glow, and watch the magic!
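If glow is not installed yet, it is packaged for many systems; these two commands are just examples and names may differ on your distro:

# Example installs only; check your own package manager for glow
sudo pacman -S glow      # Arch
brew install glow        # macOS / Linuxbrew

With that in place, the pipeline becomes: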
curl https://sive.rs/confab | html2text | glow -

Projects
Pull Pictures from a Website
curl fetches the whole website's HTML, grep picks out every .jpg link, and wget downloads them. xargs is what lets us “execute commands from standard input”, turning each URL on stdin into an argument for wget.
curl https://wallhaven.cc | grep -o -e 'http[^"]*\.jpg' | xargs wget
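If xargs feels like magic, substituting echo for wget shows the command it actually builds. The URLs below are made up purely for illustration:

# Illustrative URLs; echo just prints the wget command xargs would run
printf 'https://example.com/a.jpg\nhttps://example.com/b.jpg\n' | xargs echo wget
# -> wget https://example.com/a.jpg https://example.com/b.jpg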
Turn it into a script

This essentially wraps that command up as a reusable script.
pics_dir="$(basename "$1")"
# remove $pics_dir if it already exists
[ "$pics_dir" ] && rm -rf "$pics_dir"
mkdir "$pics_dir"; cd "$pics_dir" || exit
curl "$1" | grep -o -e 'http[^"]*\.jpg' | xargs wgetPull All Articles from Website
Pull All Articles from Website

This builds on the tutorial above.
Get a List of Links
- Get the HTML of the blog's index page.
- Keep only the lines that have <li> in them.
- Replace everything up to and including href= with https://sive.rs.
- Replace everything after > with nothing.
- Redirect all of that into links.txt.
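Here is roughly what the two sed replacements do to one line. The input below is hypothetical, since the real markup on sive.rs may differ:

# Hypothetical <li> line; the real markup may differ
echo '<li><a href=/blog/post>Post title</a></li>' | sed "s/^.*href=/https:\/\/sive.rs/" | sed "s/>.*$//"
# -> https://sive.rs/blog/post

Putting it all together against the real blog index: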
curl https://sive.rs/blog | grep '<li>' | sed "s/^.*href=/https\:\/\/sive\.rs/" | sed "s/>.*$//" > links.txt

Get all Articles
Loop through all links and pull them.
for i in $(cat links.txt); do
  curl "$i" | html2text > "$(basename "$i")" &
done

The & makes them all execute in parallel. Much faster!
Viewer Script
Pick one of the saved articles with fzf and render it with glow:

ls . | fzf | xargs glow -
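fzf can also preview the highlighted file while you choose, via its --preview option. A small sketch of that variant (how much color shows up in the preview pane depends on glow's TTY detection):

# Preview the highlighted article with glow while picking, then open the selection
glow "$(ls . | fzf --preview 'glow {}')"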