July 1st, 2024

Show HN: AI assisted image editing with audio instructions

The GitHub repository hosts "AAIELA: AI Assisted Image Editing with Language and Audio," a project enabling image editing via audio commands and AI models. It integrates various technologies for object detection, language processing, and image inpainting. Future plans involve model enhancements and feature integrations.

The GitHub repository hosts the project "AAIELA: AI Assisted Image Editing with Language and Audio." This project aims to enable users to edit images through audio commands and AI models, incorporating computer vision, speech-to-text, language models, and text-to-image inpainting. The project structure encompasses components for object detection, audio transcription, language models, and inpainting models. The workflow involves image upload, segmentation, audio input, transcription, language understanding, image inpainting, and output generation. Future research directions include retraining the inpainting model, automatic mask generation, contextual reasoning, multi-object mask generation, and visual language model integration. The project's to-do list includes tasks such as integrating TensorRT for Stable Diffusion models, ControlNet integration, Mediapipe Face Mesh integration for facial feature modification, pose landmark detection, super-resolution model implementation, and interactive mask editing using Segment Anything.
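The workflow described above can be sketched as a simple pipeline. This is a minimal illustration only: every function name here is a hypothetical stub standing in for one stage, not the project's actual API.

```python
# Hypothetical sketch of the audio-driven editing pipeline: each function is an
# illustrative stub for one stage (segmentation, transcription, language
# understanding, inpainting), not AAIELA's real code.

def detect_objects(image):
    # Object-detection / segmentation stage: named masks for regions
    # the user might refer to.
    return {"sky": "sky_mask", "mountain": "mountain_mask"}

def transcribe(audio):
    # Speech-to-text stage.
    return "replace the sky with a deep blue sky"

def parse_instruction(text, masks):
    # Language-model stage: map the transcript to a target mask
    # and an inpainting prompt.
    target = next(name for name in masks if name in text)
    return masks[target], text

def inpaint(image, mask, prompt):
    # Text-to-image inpainting stage.
    return f"{image} edited at {mask}: '{prompt}'"

def edit(image, audio):
    masks = detect_objects(image)
    transcript = transcribe(audio)
    mask, prompt = parse_instruction(transcript, masks)
    return inpaint(image, mask, prompt)

result = edit("photo.jpg", "command.wav")
```

The point of the sketch is the data flow: each stage's output is the next stage's input, so any single model (detector, STT, LLM, inpainter) can be swapped out independently.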

Related

Optimizing AI Inference at Character.ai

Character.AI optimizes AI inference for LLMs, handling 20,000+ queries/sec globally. Innovations like Multi-Query Attention and int8 quantization reduced serving costs by 33x since late 2022, aiming to enhance AI capabilities worldwide.
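The int8 quantization mentioned above can be illustrated with a minimal symmetric-quantization sketch in plain Python. This is a generic textbook version, not Character.AI's implementation: each weight is stored as an 8-bit integer plus one shared float scale.

```python
# Minimal symmetric int8 quantization: store weights as int8 values plus a
# single float scale, and dequantize by multiplying back. Illustrative only.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.02, -1.27, 0.64, 0.003]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each restored value is within half a quantization step of the original.
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(weights, restored))
```

Replacing 32-bit floats with 8-bit integers cuts weight storage and memory bandwidth roughly 4x, which is the kind of saving that contributes to the serving-cost reductions described.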

Generating audio for video

Google DeepMind introduces V2A technology for video soundtracks, enhancing silent videos with synchronized audio. The system allows users to guide sound creation, aligning audio closely with visuals for realistic outputs. Ongoing research addresses challenges like maintaining audio quality and improving lip synchronization. DeepMind prioritizes responsible AI development, incorporating diverse perspectives and planning safety assessments before wider public access.

Show HN: Feedback on Sketch Colourisation

The GitHub repository contains SketchDeco, a project for colorizing black and white sketches without training. It includes setup instructions, usage guidelines, acknowledgments, and future plans. Users can seek support if needed.

Show HN: a Rust lib to trigger actions based on your screen activity (with LLMs)

The GitHub project "Screen Pipe" uses Large Language Models to turn screen content into actions. It is implemented in Rust + WASM, inspired by `adept.ai`, `rewind.ai`, and Apple Shortcuts, and is open source under the MIT license.

Mozilla.ai did what? When silliness goes dangerous

Mozilla.ai, a Mozilla Foundation project, drew criticism for using biased statistical models to summarize qualitative data, raising doubts about its scientific rigor and competence in AI. Critics deemed the approach ineffective and damaging to the project's credibility.

14 comments
By @throwaway4aday - 4 months
Forgot to share this link as well. Not sure if you're aware of it, but it's a great write-up on fine-tuning small local models for specific APIs, and it seems like a perfect fit for your project. https://bair.berkeley.edu/blog/2024/05/29/tiny-agent/
By @ShaShekhar - 4 months
Example instructions: 1. Replace the sky with a deep blue sky, then replace the mountain with a Himalayan mountain covered in snow. 2. Stylize the car with a cyberpunk aesthetic, then change the background to a neon-lit cityscape at night. 3. Replace the person with a sculpture complementing the architecture.

Check out the Research section for more complex instructions.

By @G1N - 4 months
We're so close to being able to create our own Tayne

(https://www.youtube.com/watch?v=a8K6QUPmv8Q)

By @throwaway4aday - 4 months
Love it! Voice interaction is a great modality for UI. A lot of people have a bad taste left over from early attempts, but I expect to see a lot of progress made now that STT and natural language understanding are so much better.

The biggest reason we should be adding conversational UI to everything is the harm done by RSI and sedentary keyboard and mouse interfaces. We're crippling entire generations of people by sticking to outdated hardware. The good news is we can break free of this now that we have huge improvements in LLMs and AR hardware. We'll be back to healthy levels of activity in 5 to 10 years. Sorry Keeb builders, it's time to join the stamp collectors and typewriter enthusiasts. We'll be working in the park today.

By @vunderba - 4 months
Nice job. I actually experimented with a chat-driven InstructPix2Pix-style interface that connected via API to a Stable Diffusion backend. The big problem is that it's difficult to know whether the inpainting job you've done is satisfactory to the user.

This is why, when doing this sort of traditional inpainting in automatic1111, you usually generate several iterations with varying mask blur, whole-picture vs. masked-only inpainting, and padding; and of course the optimal inpainting checkpoint model depends on whether the original image is photorealistic or illustrated, etc.
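That kind of parameter sweep can be sketched as building a batch of candidate request payloads and comparing the renders. The field names below follow automatic1111's `/sdapi/v1/img2img` API as commonly documented (`mask_blur`, `inpaint_full_res`, `inpaint_full_res_padding`); verify them against your install's `/docs` endpoint before relying on them, and note the payload values here are placeholders.

```python
# Sketch of the inpainting parameter sweep described above: enumerate candidate
# settings so the user can pick the best of several renders. Payloads are built
# but not sent; field names are assumed from automatic1111's img2img API.
from itertools import product

def build_payloads(prompt, image_b64, mask_b64):
    payloads = []
    for mask_blur, full_res, padding in product([4, 8, 16], [False, True], [32, 64]):
        payloads.append({
            "prompt": prompt,
            "init_images": [image_b64],
            "mask": mask_b64,
            "mask_blur": mask_blur,
            "inpaint_full_res": full_res,           # masked-only vs. whole picture
            "inpaint_full_res_padding": padding,
            "denoising_strength": 0.75,
        })
    return payloads

variants = build_payloads("deep blue sky", "<image base64>", "<mask base64>")
# 3 blur values x 2 modes x 2 paddings = 12 candidate renders to compare.
assert len(variants) == 12
```

Each payload would then be POSTed to the img2img endpoint; the combinatorial blow-up is exactly why judging "satisfactory to the user" automatically is hard.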

By @benzguo - 4 months
Super cool! We're building an API that makes it easy to build chained multi-model workflows like this that run with zero latency between tasks - https://www.substrate.run/
By @beautifulfreak - 4 months
It didn't just replace the sky and background, it replaced the trees. That wasn't part of the instructions.
By @leobg - 4 months
I love how in the demo video, even the audio instructions themselves are AI generated. No human in the loop, at all! :)
By @omerhac - 4 months
Very cool - which method do you use for editing the images? is it SDEdit or InstructPix2Pix? another one?
By @parentheses - 4 months
soon the movie trope of saying "enhance" repeatedly could be a real thing!
By @kveykva - 4 months
This pitches a lot but only seems to support a specific inpainting operation?
By @sgbeal - 4 months
Wow! We're now just a hair's-width away from finally being able to say, "Computer, enhance image!" without sounding like we're in a bad sci-fi show.
By @whatnotests2 - 4 months
Zoom. Enhance!